Predicting job salary using Neural Network model on Azure ML (NLP)

This project shows how to use Azure ML to train a model and deploy it as a web service. The data is mostly unstructured, so we'll use NLP to extract features from the text columns. The goal is to predict salary from the job description, title, etc.

High-level steps:

  • Start with a Jupyter notebook as a prototype.
  • Refactor notebook to scripts.
    • Clean non-essential code.
    • Use functions.
    • Add logging.
  • Create all necessary Azure resources (resource group, workspace, compute cluster, etc.).
  • Create and run the pipeline.
  • Register the model.
  • Create the endpoint.
  • Test the endpoint.
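To illustrate the refactoring step above, a notebook cell typically becomes a small, testable function with logging. The helper below is hypothetical, not code from this repo:

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


def clean_titles(titles):
    """Normalize job titles: strip whitespace, lower-case (hypothetical helper)."""
    cleaned = [t.strip().lower() for t in titles]
    logger.info("Cleaned %d titles", len(cleaned))
    return cleaned


cleaned = clean_titles(["  Data Scientist ", "ML Engineer"])
```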

Low-level steps:

Data science:

  • Split the data into train, validation, and test sets. Splitting first avoids data leakage.
  • Featurize the input columns.
    • One-hot encode the categorical columns.
    • Tokenize and vectorize the text columns.
  • Transform the data into tensors.
  • Create a simple neural network.
    • Combine the one-hot encoded categorical features with the bag-of-words text features.
  • Train the model.
  • Evaluate the model.
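A minimal sketch of the featurization and model steps, using scikit-learn's `CountVectorizer` for the bag-of-words text features, `OneHotEncoder` for the categorical column, and a small `MLPRegressor` standing in for the neural network. The column names and toy data are made up for illustration:

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import OneHotEncoder

# Toy data: free-text job descriptions plus one categorical column.
descriptions = ["senior python developer", "junior data analyst",
                "python data engineer", "senior analyst"]
categories = np.array([["IT"], ["Finance"], ["IT"], ["Finance"]])
salaries = np.array([120_000, 55_000, 100_000, 70_000])

# Bag-of-words features for the text column.
vectorizer = CountVectorizer()
X_text = vectorizer.fit_transform(descriptions)

# One-hot features for the categorical column.
encoder = OneHotEncoder()
X_cat = encoder.fit_transform(categories)

# Combine both feature blocks into a single matrix.
X = hstack([X_text, X_cat])

# A small feed-forward network as the regressor.
model = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
model.fit(X.toarray(), salaries)
preds = model.predict(X.toarray())
```

In the real pipeline the fitted vectorizer and encoder must be saved alongside the model so the endpoint can apply the same transforms at inference time.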

![Model architecture](https://github.com/yandexdataschool/nlp_course/raw/master/resources/w2_conv_arch.png)

Moving to Azure ML:

  • Create a resource group.
  • Create a workspace.
  • Create a compute cluster.
  • Load the data to Azure ML Datastore.
  • Create the components.
  • Create and run the pipeline.

Setup the environment

Install and activate the conda environment by executing the following commands:

conda env create -f environment.yml
conda activate azure_ml_sandbox

Training and deploying in the cloud


Upload the data to the Azure ML Datastore. In this case the data is a single file, so type: uri_file is used in data.yml.
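The data asset definition might look roughly like the following. This is a sketch: the asset name and path are assumptions, and note that the CLI v2 YAML schema spells the type `uri_file`:

```yaml
# cloud/data.yml (hypothetical contents)
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: salary-data
version: 1
type: uri_file
path: ./data/train.csv
description: Raw job postings with salary labels.
```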

Create compute cluster:

az ml compute create -f cloud/cluster-cpu.yml -g <resource-group-name> -w <workspace-name>

Create the dataset we'll use to train the model:

az ml data create -f cloud/data.yml -g <resource-group-name> -w <workspace-name>

NOTE: For files larger than 100 MB, compress the file first or upload it with azcopy.

Create the components:

az ml component create -f cloud/01_split.yml
az ml component create -f cloud/02_preprocess.yml
az ml component create -f cloud/03_train.yml
az ml component create -f cloud/04_test.yml

Create and run the pipeline:

run_id=$(az ml job create -f cloud/pipeline-job.yml --query name -o tsv)
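For reference, a pipeline job YAML wiring two of the components together might look like this. It is a sketch under assumed names (compute cluster, data asset, input/output ports); the actual cloud/pipeline-job.yml will differ:

```yaml
# cloud/pipeline-job.yml (hypothetical sketch)
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: salary-nlp-pipeline
settings:
  default_compute: azureml:cluster-cpu
jobs:
  split:
    component: file:01_split.yml
    inputs:
      raw_data: azureml:salary-data@latest
    outputs:
      train_dir:
      test_dir:
  train:
    component: file:03_train.yml
    inputs:
      train_dir: ${{parent.jobs.split.outputs.train_dir}}
    outputs:
      model_dir:
```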

Download the trained model:

az ml job download --name $run_id --output-name "model_dir"

Create the Azure ML model from the pipeline output:

az ml model create --name model-pipeline-cli --version 1 --path "azureml://jobs/$run_id/outputs/model_dir" --type mlflow_model

Create the endpoint:

az ml online-endpoint create -f cloud/endpoint.yml
az ml online-deployment create -f cloud/deployment.yml --all-traffic

Test the endpoint:

az ml online-endpoint invoke --name endpoint-pipeline-cli --request-file test_data/images_azureml.json
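The endpoint can also be called over plain HTTPS. The sketch below builds a request body and shows the corresponding `requests` call; the scoring URI, key, and input schema are all assumptions for illustration, so take the real values from the endpoint's Consume tab in the studio:

```python
import json

# Assumed values; replace with the real scoring URI and key.
SCORING_URI = "https://<endpoint-name>.<region>.inference.ml.azure.com/score"
API_KEY = "<endpoint-key>"


def build_request(title, description):
    """Build a JSON body for one job posting (hypothetical input schema)."""
    return json.dumps({"input_data": [{"title": title, "description": description}]})


body = build_request("Data Scientist", "Build NLP models on Azure ML.")

# Uncomment to actually call the deployed endpoint:
# import requests
# resp = requests.post(
#     SCORING_URI,
#     data=body,
#     headers={"Content-Type": "application/json",
#              "Authorization": f"Bearer {API_KEY}"},
# )
# print(resp.json())
```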

Clean up the endpoint when you're done, to avoid ongoing charges:

az ml online-endpoint delete --name endpoint-pipeline-cli -y

Useful commands:

az version

# Output:
{
  "azure-cli": "2.53.0",
  "azure-cli-core": "2.53.0",
  "azure-cli-telemetry": "1.1.0",
  "extensions": {}
}

If the ml extension is not installed, add it:

az extension add -n ml

# Output:

{
  "azure-cli": "2.53.0",
  "azure-cli-core": "2.53.0",
  "azure-cli-telemetry": "1.1.0",
  "extensions": 
  {
    "ml": "2.20.0"
  }
}

Run the help command to verify your installation and see available subcommands:

az ml -h
