
Hyperparameter optimization

The right hyperparameters can make all the difference for the performance of your ML model, but finding the optimal set of hyperparameters is often a slow process. AutoML solutions, such as Cloud AutoML, have been built to assist in this search. They give us an optimal model, but we do not know what the architecture and the corresponding optimal set of hyperparameters are.

Luckily, hyperparameter optimization (hpo) can also be done in Python. This repository demonstrates how to perform hpo in the:

  • Naive way, using grid search with Scikit-Learn's GridSearchCV
  • Smart way, using Bayesian optimization with Optuna

As use case, a binary sentiment classification task (positive vs. negative) is chosen. The ML model is built with Scikit-Learn and consists of a Scikit-Learn pipeline with an Anonymizer (custom transformer), a TFIDF vectorizer and a classifier; a sketch of such a pipeline is shown below. The performance of each model is tracked using MLflow. The hyperparameter search space in this situation has two dimensions:

  • Different vectorizer settings: ngram ranges
  • Different classifiers and classifier settings: Naive Bayes, Random Forest and Support Vector Machines.

After evaluating the performance of all models, the best model is selected. This model is then trained on the complete corpus.
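
A minimal sketch of what such a pipeline could look like is shown below. The Anonymizer class here is a hypothetical stand-in for the custom transformer in this repository, and MultinomialNB is just one of the candidate classifiers:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


class Anonymizer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for the repository's custom transformer."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # e.g. strip names, e-mail addresses, etc. from the raw documents
        return X


pipeline = Pipeline(
    [
        ("anonymizer", Anonymizer()),
        ("vectorizer", TfidfVectorizer(ngram_range=(1, 1))),
        ("classifier", MultinomialNB()),
    ]
)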

Dependencies

This project has the following dependencies:

Setup

  1. Download Git LFS here or via brew:
$ brew install git-lfs
  2. Install Git LFS on your machine:
$ sudo git lfs install --system --skip-repo
  3. Clone the repo. If you have already cloned the repo before installing Git LFS, run the following to get all the large files (else only the pointers to the large files will be present on your machine):
$ git lfs pull origin
  4. Create a virtual environment with at least Python 3.7 via the tool of your choice (conda, venv, etc.)
  5. Install the Python dependencies

Using poetry:

$ poetry install

Not using poetry:

$ pip install -r requirements.txt
  6. Create the directories database, artifacts and trained_models in the data directory:
$ cd data
$ mkdir database artifacts trained_models

Explore hyperparameter space

  1. Start the MLflow application by running the docker-compose file. This will run the MLflow server and a PostgreSQL database. The MLflow server is accessible at localhost:5000.

$ docker-compose up --build

With the current configuration the statistics are stored in the Postgres database, whereas the artifacts are stored on your disk.
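
As an illustration of how runs end up in this server (not the repository's exact code), logging parameters and metrics from Python could look roughly like this; the experiment name and values are made up:

import mlflow

# Point the client at the MLflow server started by docker-compose
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("hpo")  # hypothetical experiment name

with mlflow.start_run():
    mlflow.log_params({"ngram_range": "(1, 2)", "classifier": "MultinomialNB", "alpha": 0.1})
    mlflow.log_metric("accuracy", 0.87)  # dummy value for illustration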

  2. Define the hyperparameter space to be explored. The default hyperparameters to be searched are:
  • Vectorizer:
    • TFIDF vectorizer
      • ngram_range: (1, 1), (1, 2)
  • Classifier:
    • SVM
      • C: [0.1, 0.2]
    • Multinomial Naive Bayes
      • alpha: [1e-2, 1e-1]
    • RandomForestClassifier
      • max depth: [2, 4]
  3. Explore the hyperparameter space using either grid search or Bayesian optimization.

Using gridsearch:

$ python hpo_gridsearch.py

The following arguments can be provided:

  • --size or -s: sample size of the dataset; default: 25000
  • --workers or -w: the number of CPU cores that can be used; default: 2
  • --random or -r: if provided, a random search instead of a grid search will be performed, i.e. hyperparameter combinations are randomly sampled from the search space; default: not specified
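
For illustration, a minimal sketch of how a grid search over part of the default search space could be set up with GridSearchCV (this is not the repository's script; the toy data only serves to make the snippet self-contained):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data just to make the snippet runnable; in practice use the review corpus
X_train = ["great movie", "terrible film", "loved every minute", "waste of time"]
y_train = [1, 0, 1, 0]

pipeline = Pipeline(
    [
        ("vectorizer", TfidfVectorizer()),
        ("classifier", MultinomialNB()),
    ]
)

# Part of the default search space listed above; keys follow scikit-learn's
# "<step name>__<parameter>" convention for pipelines
param_grid = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "classifier__alpha": [1e-2, 1e-1],
}

search = GridSearchCV(pipeline, param_grid, cv=2, n_jobs=2, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)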

Using bayesian optimization:

$ python hpo_bayesian.py

The following arguments can be provided:

  • --size or -n: sample size of the dataset; default: 25000
  • --workers or -w: the number of CPU cores that can be used; default: 2
  • --trial or -t: the number of hyperparameter sets to explore; default: 20
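
For comparison, a rough sketch of how a conditional search space can be explored with Optuna is shown below. It covers only two of the classifiers and uses toy data to stay self-contained; the actual objective in hpo_bayesian.py will differ:

import optuna
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy data just to make the snippet runnable; in practice use the review corpus
X_train = ["great movie", "terrible film", "loved every minute", "waste of time"]
y_train = [1, 0, 1, 0]


def objective(trial):
    # Vectorizer settings: unigrams or unigrams + bigrams
    max_ngram = trial.suggest_int("max_ngram", 1, 2)

    # Classifier and classifier settings (conditional search space)
    classifier_name = trial.suggest_categorical("classifier", ["svm", "naive_bayes"])
    if classifier_name == "svm":
        classifier = SVC(C=trial.suggest_float("C", 0.1, 0.2))
    else:
        classifier = MultinomialNB(alpha=trial.suggest_float("alpha", 1e-2, 1e-1, log=True))

    pipeline = Pipeline(
        [
            ("vectorizer", TfidfVectorizer(ngram_range=(1, max_ngram))),
            ("classifier", classifier),
        ]
    )
    # Mean cross-validated accuracy is the value Optuna maximizes
    return cross_val_score(pipeline, X_train, y_train, cv=2, scoring="accuracy").mean()


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)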

  4. After the run is finished, the parameters and metrics (performance) of each model are visible in the corresponding experiment in the MLflow dashboard.

  5. Train the best model on the complete dataset and evaluate performance on the test dataset:
$ python train.py

  6. The best model is stored in the directory data/trained_models, in the subdirectory with the corresponding experiment name. The model.pkl is your trained ML model that can be used to make predictions!
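
For example, the stored model can be loaded with pickle and used directly on raw text (the path is a placeholder; fill in your experiment name):

import pickle

# Hypothetical path; replace <experiment_name> with the name of your experiment
with open("data/trained_models/<experiment_name>/model.pkl", "rb") as f:
    model = pickle.load(f)

# The pipeline takes raw text, so no manual preprocessing is needed here
print(model.predict(["What a wonderful movie, I loved every minute of it!"]))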
