Project Elenchus

This project provides tools and libraries to train deep learning models, accessing data directly from a SQL table.

Getting-Started

In this Getting-Started section, we demonstrate how to fine-tune a model on the Question-answering Natural Language Inference (QNLI) task. However, the code can easily be adapted to fine-tune transformer models for other variants of text classification.

As described on Hugging Face, "QNLI is the task of determining if the answer to a certain question can be found in a given document. If the answer can be found the label is 'entailment'. If the answer cannot be found the label is 'not entailment'."

For example:

Question: What percentage of marine life died during the extinction?
Sentence: It is also known as the “Great Dying” because it is considered the largest mass extinction in the Earth’s history.
Label: 0 (not entailment)

Question: Who was the London Weekend Television’s Managing Director?
Sentence: The managing director of London Weekend Television (LWT), Greg Dyke, met with the representatives of the "big five" football clubs in England in 1990.
Label: 1 (entailment)
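
The two examples above can be represented in code roughly as follows. This is a minimal sketch: the QnliExample class and the LABEL_NAMES mapping are hypothetical names, not part of this repository, and the label ids simply follow the examples above (the ids in the upstream dataset may be assigned differently).

```python
from dataclasses import dataclass

# Label ids as shown in the examples above (an assumption, not checked
# against the dataset itself).
LABEL_NAMES = {0: "not entailment", 1: "entailment"}


@dataclass(frozen=True)
class QnliExample:
    """One question/sentence pair with its integer label (hypothetical type)."""
    question: str
    sentence: str
    label: int

    def label_name(self) -> str:
        return LABEL_NAMES[self.label]


examples = [
    QnliExample(
        question="What percentage of marine life died during the extinction?",
        sentence="It is also known as the “Great Dying” because it is considered "
                 "the largest mass extinction in the Earth’s history.",
        label=0,
    ),
    QnliExample(
        question="Who was the London Weekend Television’s Managing Director?",
        sentence='The managing director of London Weekend Television (LWT), '
                 'Greg Dyke, met with the representatives of the "big five" '
                 'football clubs in England in 1990.',
        label=1,
    ),
]

for ex in examples:
    print(f"{ex.label} ({ex.label_name()}): {ex.question}")
```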

Key steps:

  1. Clone this repository
  2. (Optional) Create Azure SQL Managed Instance
  3. (Optional) Install ODBC driver for SQL Server
  4. Generate Conda environment
  5. Create table and insert tutorial data
  6. Fine-tune model

We describe these steps in the following subsections. The training script is heavily influenced by the Hugging Face fine-tuning tutorial, which can be found here.

Create Azure SQL Managed Instance (MI)

Note: This Getting-Started guide is built around Azure SQL Server with ODBC and SQLAlchemy. If you are familiar with SQL, you should have no trouble modifying the code and configuration to fit your needs. This section and the next are for everyone else. We will add support for other SQL engines soon.

Go to the Azure Portal, and use the information below to create an Azure SQL Managed Instance on the Azure Marketplace:

  • Go to this link and press "Create" to go to "Basics" setting

  • Subscription: Select your Azure Subscription. If you do not have an Azure Subscription, you can create one on this link.

  • Resource group: If your subscription already has a resource group for this project, select it. Otherwise, press "Create new" and choose a name for the new resource group.

  • Managed Instance name: Choose a name for your Azure SQL Managed Instance.

  • Region: Select a region. Your database will reside in a datacenter in this region. Note that some regions may report an error because they have no availability for your subscription. If that happens, try another region.

  • Compute + storage: Click Configure Managed Instance. For the purposes of this tutorial, you can keep the default options, except that you may want to change the following Compute + storage settings:

    • Reduce the number of vCores to the minimum
    • Reduce the amount of storage to the minimum
    • Select Locally-redundant backup storage

Detailed information on Azure SQL Managed Instance can be found here.

  • Authentication: Select Use SQL authentication and enter a secure username and password for the database admin in the corresponding fields. Keep the username and password somewhere safe; you will need them in later steps.

  • Click Review + create and click Create.

Leave this tab open as deployment may take some time! You can do the next two subsections as you are waiting.

Install ODBC driver for SQL Server

You can find the driver here.

Please install version 18 of the driver for your operating system.

Generate Conda Environment

We provide a Conda environment definition (environment.yml) so that you can install all software dependencies with one command: conda env create -f environment.yml.

Prerequisite: If you don't have the conda package manager installed yet, we recommend installing miniconda from here.

Then activate the environment by running conda activate elenchus

Create Database

If the Azure SQL Managed Instance deployment is still in progress, wait for it to complete.

After the deployment has completed, click the Go to resource button. This takes you to the resource profile. The Managed instance admin name you selected and the Host URL are shown in the Essentials section of the page. Save this information for the next subsection.

At the top left, press + to add a New database. Keep all fields at their defaults and enter a name for your database in the Database name field. Press Review and Create. Then press Create.

Create table and insert data

Please fill in the provided template file config_template.json with the information about your SQL server listed below, and save it as config.json:

  • username: the admin username you selected in Step 2
  • password: the admin password you selected in Step 2
  • driver: the default value is "ODBC Driver 18 for SQL Server"; if you installed a different ODBC driver, modify this field accordingly.
  • server: In the Azure SQL resource profile, copy the database server URL shown under Server name and paste it here.
  • database: pick a name for your database
  • table_prefix: choose a prefix for the table names. For example, with the prefix "glue_", the table names would be 'glue_train', 'glue_validation' and 'glue_test'.
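
To illustrate how the fields above fit together, here is a minimal sketch that turns them into a SQLAlchemy connection URL for pyodbc and derives the table names from the prefix. It is not the code used by this repository: the flat config layout, the helper names build_sqlalchemy_url and table_names, and all values are placeholders chosen for illustration. The odbc_connect form is the pass-through that SQLAlchemy's mssql+pyodbc dialect accepts for a raw ODBC connection string.

```python
import json
from urllib.parse import quote_plus

# Hypothetical flat config with placeholder values; the real config.json
# may nest or name these fields differently.
config = json.loads("""
{
  "username": "sqladmin",
  "password": "example-password",
  "driver": "ODBC Driver 18 for SQL Server",
  "server": "example-mi.database.windows.net",
  "database": "elenchus_demo",
  "table_prefix": "glue_"
}
""")


def build_sqlalchemy_url(cfg: dict) -> str:
    """Build an mssql+pyodbc URL that wraps a raw ODBC connection string."""
    odbc = (
        f"DRIVER={{{cfg['driver']}}};"
        f"SERVER={cfg['server']};"
        f"DATABASE={cfg['database']};"
        f"UID={cfg['username']};"
        f"PWD={cfg['password']}"
    )
    # The ODBC string must be URL-encoded before it is embedded in the URL.
    return "mssql+pyodbc:///?odbc_connect=" + quote_plus(odbc)


def table_names(cfg: dict) -> list:
    """Prefix the three split names, e.g. 'glue_' + 'train'."""
    return [cfg["table_prefix"] + split for split in ("train", "validation", "test")]


print(build_sqlalchemy_url(config))
print(table_names(config))
```

With SQLAlchemy and pyodbc installed, the resulting string could be passed to sqlalchemy.create_engine; that step is left out here so the sketch runs without a database.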

Run python convert_dataset.py.

The script will download the Microsoft Research Paraphrase Corpus GLUE dataset and upload it to your database. You can find more info about the dataset at this link.
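
As a rough picture of what "create table and insert data" involves, the sketch below creates one table and inserts a few rows. It uses Python's built-in sqlite3 module as a stand-in for Azure SQL so that it runs without a server. The column names and the toy rows are assumptions for illustration and may not match what convert_dataset.py creates.

```python
import sqlite3

# sqlite3 stands in for Azure SQL here; the real script talks to SQL Server.
conn = sqlite3.connect(":memory:")

table = "glue_train"  # table_prefix "glue_" + split name "train"

# Hypothetical schema: one text pair plus an integer label per row.
conn.execute(
    f"CREATE TABLE {table} ("
    "idx INTEGER PRIMARY KEY, question TEXT, sentence TEXT, label INTEGER)"
)

# Toy rows, not taken from the real dataset.
rows = [
    (0, "toy question 0", "toy sentence 0", 0),
    (1, "toy question 1", "toy sentence 1", 1),
]

# Parameterized inserts keep the text values out of the SQL string.
conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?, ?)", rows)
conn.commit()

count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
print(f"{count} rows in {table}")
```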

Fine-Tune the model

In this step, you can fine-tune an example transformer on the data in your database by running: python train.py.

Note the performance improvement after each epoch.

For reference, when fine-tuning the model with the default settings on an Azure Standard NC6 (NVIDIA Tesla K80), this should take about 15 minutes.
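
The idea of training directly from a SQL table can be pictured as reading rows in batches from a query cursor. The sketch below is an illustration only, not the data loader used by train.py: iter_batches is a hypothetical helper, the schema is assumed, and sqlite3 again stands in for Azure SQL.

```python
import sqlite3
from typing import Iterator


def iter_batches(conn: sqlite3.Connection, table: str, batch_size: int) -> Iterator[list]:
    """Yield lists of (question, sentence, label) rows, batch_size rows at a time."""
    cursor = conn.execute(
        f"SELECT question, sentence, label FROM {table} ORDER BY idx"
    )
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:  # cursor exhausted
            break
        yield batch


# Build a small in-memory table with five toy rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE glue_train ("
    "idx INTEGER PRIMARY KEY, question TEXT, sentence TEXT, label INTEGER)"
)
conn.executemany(
    "INSERT INTO glue_train VALUES (?, ?, ?, ?)",
    [(i, f"question {i}", f"sentence {i}", i % 2) for i in range(5)],
)

batch_sizes = [len(batch) for batch in iter_batches(conn, "glue_train", batch_size=2)]
print(batch_sizes)  # [2, 2, 1]
```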

If you run into memory capacity issues such as RuntimeError: CUDA out of memory., you can decrease the batch size.
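
One way to find a workable batch size is to halve it each time an out-of-memory error occurs. The sketch below shows that pattern in isolation. train_with_backoff and fake_train are hypothetical names, and fake_train only simulates the error; how train.py itself reads the batch size is not shown here.

```python
from typing import Callable


def train_with_backoff(
    train_fn: Callable[[int], None], batch_size: int, min_batch_size: int = 1
) -> int:
    """Call train_fn(batch_size), halving batch_size after each out-of-memory
    RuntimeError. Returns the batch size that finally succeeded."""
    while True:
        try:
            train_fn(batch_size)
            return batch_size
        except RuntimeError as err:
            # Re-raise unrelated errors, or give up once the minimum is reached.
            if "out of memory" not in str(err) or batch_size <= min_batch_size:
                raise
            batch_size = max(min_batch_size, batch_size // 2)


def fake_train(batch_size: int) -> None:
    """Stand-in for a training run: pretends the GPU fits at most 16 samples."""
    if batch_size > 16:
        raise RuntimeError("CUDA out of memory.")


used = train_with_backoff(fake_train, batch_size=64)
print(used)  # 16
```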

Note: We rely on storing sentence embeddings in Azure SQL, to use them as input to a tiny text classification model. If you prefer to not use embeddings, simply change the configuration file setting use_embeddings.
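
To make the idea of storing sentence embeddings in a SQL table concrete, the sketch below packs a float vector into bytes, stores it in a binary column, and reads it back. This is an assumption about one possible storage format, not a description of how this project stores embeddings. The helper names and the table layout are hypothetical, and sqlite3 stands in for Azure SQL.

```python
import sqlite3
from array import array


def embedding_to_bytes(vector: list) -> bytes:
    """Pack a list of floats as 32-bit floats."""
    return array("f", vector).tobytes()


def bytes_to_embedding(blob: bytes) -> list:
    """Inverse of embedding_to_bytes."""
    values = array("f")
    values.frombytes(blob)
    return values.tolist()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE glue_train_embeddings (idx INTEGER PRIMARY KEY, embedding BLOB, label INTEGER)"
)

# A toy 3-dimensional "embedding"; real sentence embeddings are much longer.
vector = [0.25, -1.5, 3.0]
conn.execute(
    "INSERT INTO glue_train_embeddings VALUES (?, ?, ?)",
    (0, embedding_to_bytes(vector), 1),
)

blob, label = conn.execute(
    "SELECT embedding, label FROM glue_train_embeddings WHERE idx = 0"
).fetchone()
restored = bytes_to_embedding(blob)
print(restored, label)
```

The toy values were chosen to be exactly representable as 32-bit floats; arbitrary values would come back with small rounding differences.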

Delete and Cleanup

If you need to delete the tables, you can run python delete_dataset.py -tables

If you need to delete the whole database, you can run python delete_dataset.py -db
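
For orientation, deleting the tables amounts to dropping each prefixed table. The sketch below shows this against an in-memory sqlite3 database as a stand-in. It is an illustration of the idea only and does not reproduce what delete_dataset.py does internally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

prefix = "glue_"  # the table_prefix from config.json
tables = [prefix + split for split in ("train", "validation", "test")]

# Create empty placeholder tables so there is something to delete.
for name in tables:
    conn.execute(f"CREATE TABLE {name} (idx INTEGER PRIMARY KEY)")

# Drop each table; IF EXISTS makes the cleanup safe to repeat.
for name in tables:
    conn.execute(f"DROP TABLE IF EXISTS {name}")

remaining = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
print(remaining)  # []
```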

If you are done with the experiment, you can also go to the Azure Portal and delete the Azure SQL Managed Instance and/or the Resource Group.

Switching Classification Task Variant

You can switch to a different classification task (that is, a different GLUE subset) by editing the data section in the config.json configuration file. For example, you can set glue_subset to "mnli". Be careful to also update the number of labels, num_labels, as well as the names of the train and validation splits for that task variant.

Don't forget to also run python convert_dataset.py before you try to fine-tune the model.
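
As an example of such a switch, the sketch below updates a data section for MNLI, which has three labels and, in the Hugging Face GLUE dataset, a validation split named validation_matched. Only glue_subset and num_labels are setting names taken from the text above. The keys train_split and validation_split, the starting values, and the switch_task helper are hypothetical, so check config_template.json for the real names.

```python
# Hypothetical "data" section of config.json, starting from QNLI.
data_qnli = {
    "glue_subset": "qnli",
    "num_labels": 2,
    "train_split": "train",            # hypothetical key name
    "validation_split": "validation",  # hypothetical key name
}


def switch_task(data: dict, glue_subset: str, num_labels: int,
                train_split: str, validation_split: str) -> dict:
    """Return a copy of the data section with all task-specific fields updated."""
    updated = dict(data)
    updated.update(
        glue_subset=glue_subset,
        num_labels=num_labels,
        train_split=train_split,
        validation_split=validation_split,
    )
    return updated


# MNLI: three labels, and the matched validation split has its own name.
data_mnli = switch_task(data_qnli, "mnli", 3, "train", "validation_matched")
print(data_mnli)
```

Updating all four fields in one place makes it harder to change glue_subset while leaving a stale num_labels or split name behind.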

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.
