
Hyperparameter optimization

The right hyperparameters can make all the difference for the performance of your ML model, but finding the optimal set of hyperparameters is often a slow process. AutoML solutions, such as Cloud AutoML, have been built to assist in this search. They give us an optimal model, but we do not know what the architecture and the corresponding optimal set of hyperparameters are.

Luckily, hyperparameter optimization (hpo) can also be done in Python. This repository demonstrates how to perform hpo in the:

  • Naive way, using grid search with Scikit-Learn's GridSearchCV
  • Smart way, using Bayesian optimization with Optuna

As use case, a binary sentiment classification task (positive vs. negative) is chosen. The ML model is built with Scikit-Learn and consists of a Scikit-Learn pipeline with an Anonymizer (custom transformer), a TFIDF vectorizer and a classifier; a sketch of such a pipeline is shown below. The performance of each model is tracked using MLflow. The hyperparameter search space in this situation has two dimensions:

  • Different vectorizer settings: ngram ranges
  • Different classifiers and classifier settings: Naive Bayes, Random Forest and Support Vector Machines.

After evaluating the performance of all models, the best model is selected. This model is then trained on the complete corpus.
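
A minimal sketch of what such a pipeline could look like is shown below. The Anonymizer class here is a hypothetical stand-in for the custom transformer in this repository, and MultinomialNB is just one of the candidate classifiers:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


class Anonymizer(BaseEstimator, TransformerMixin):
    """Hypothetical stand-in for the repository's custom transformer."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # e.g. strip names, e-mail addresses, etc. from the raw documents
        return X


pipeline = Pipeline(
    [
        ("anonymizer", Anonymizer()),
        ("vectorizer", TfidfVectorizer(ngram_range=(1, 1))),
        ("classifier", MultinomialNB()),
    ]
)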

Dependencies

This project has the following dependencies:

Setup

  1. Download Git LFS here or via brew:
$ brew install git-lfs
  2. Install Git LFS on your machine:
$ sudo git lfs install --system --skip-repo
  3. Clone the repo. If you have already cloned the repo before installing Git LFS, run the following to get all the large files (else only the pointers to the large files will be present on your machine):
$ git lfs pull origin
  4. Create a virtual environment with at least Python 3.7 via the tool of your choice (conda, venv, etc.)
  5. Install the Python dependencies

Using poetry:

$ poetry install

Not using poetry:

$ pip install -r requirements.txt
  6. Create the directories database, artifacts and trained_models in the data directory:
$ cd data
$ mkdir database artifacts trained_models

Explore hyperparameter space

  1. Start the MLflow application by running the docker-compose file. This will run the MLflow server and a PostgreSQL database. The MLflow server is accessible at localhost:5000.

$ docker-compose up --build

With the current configuration the statistics are stored in the Postgres database, whereas the artifacts are stored on your disk.
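
As an illustration of how runs end up in this server (not the repository's exact code), logging parameters and metrics from Python could look roughly like this; the experiment name and values are made up:

import mlflow

# Point the client at the MLflow server started by docker-compose
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("hpo")  # hypothetical experiment name

with mlflow.start_run():
    mlflow.log_params({"ngram_range": "(1, 2)", "classifier": "MultinomialNB", "alpha": 0.1})
    mlflow.log_metric("accuracy", 0.87)  # dummy value for illustration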

  2. Define the hyperparameter space to be explored. The default hyperparameters to be searched are:
  • Vectorizer:
    • TFIDF vectorizer
      • ngram_range: (1, 1), (1, 2)
  • Classifier:
    • SVM
      • C: [0.1, 0.2]
    • Multinomial Naive Bayes
      • alpha: [1e-2, 1e-1]
    • RandomForestClassifier
      • max depth: [2, 4]
  3. Explore the hyperparameter space using either grid search or Bayesian optimization.

Using gridsearch:

$ python hpo_gridsearch.py

The following arguments can be provided:

  • --size or -s: sample size of the dataset; default: 25000
  • --workers or -w: the number of CPU cores that can be used; default: 2
  • --random or -r: if provided, a random search instead of a grid search will be performed, i.e. hyperparameter combinations are randomly sampled from the search space; default: not specified
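
For illustration, a minimal sketch of how a grid search over part of the default search space could be set up with GridSearchCV (this is not the repository's script; the toy data only serves to make the snippet self-contained):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data just to make the snippet runnable; in practice use the review corpus
X_train = ["great movie", "terrible film", "loved every minute", "waste of time"]
y_train = [1, 0, 1, 0]

pipeline = Pipeline(
    [
        ("vectorizer", TfidfVectorizer()),
        ("classifier", MultinomialNB()),
    ]
)

# Part of the default search space listed above; keys follow scikit-learn's
# "<step name>__<parameter>" convention for pipelines
param_grid = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "classifier__alpha": [1e-2, 1e-1],
}

search = GridSearchCV(pipeline, param_grid, cv=2, n_jobs=2, scoring="accuracy")
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)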

Using bayesian optimization:

$ python hpo_bayesian.py

The following arguments can be provided:

  • --size or -n: sample size of the dataset; default: 25000
  • --workers or -w: the number of CPU cores that can be used; default: 2
  • --trial or -t: the number of hyperparameter sets to explore; default: 20
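
For comparison, a rough sketch of how a conditional search space can be explored with Optuna is shown below. It covers only two of the classifiers and uses toy data to stay self-contained; the actual objective in hpo_bayesian.py will differ:

import optuna
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy data just to make the snippet runnable; in practice use the review corpus
X_train = ["great movie", "terrible film", "loved every minute", "waste of time"]
y_train = [1, 0, 1, 0]


def objective(trial):
    # Vectorizer settings: unigrams or unigrams + bigrams
    max_ngram = trial.suggest_int("max_ngram", 1, 2)

    # Classifier and classifier settings (conditional search space)
    classifier_name = trial.suggest_categorical("classifier", ["svm", "naive_bayes"])
    if classifier_name == "svm":
        classifier = SVC(C=trial.suggest_float("C", 0.1, 0.2))
    else:
        classifier = MultinomialNB(alpha=trial.suggest_float("alpha", 1e-2, 1e-1, log=True))

    pipeline = Pipeline(
        [
            ("vectorizer", TfidfVectorizer(ngram_range=(1, max_ngram))),
            ("classifier", classifier),
        ]
    )
    # Mean cross-validated accuracy is the value Optuna maximizes
    return cross_val_score(pipeline, X_train, y_train, cv=2, scoring="accuracy").mean()


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)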

  4. After the run is finished, the parameters and metrics (performance) of each model are visible in the corresponding experiment in the MLflow dashboard.

  5. Train the best model on the complete dataset and evaluate performance on the test dataset:
$ python train.py

  6. The best model is stored in the directory data/trained_models, in the subdirectory with the corresponding experiment name. The model.pkl is your trained ML model that can be used to make predictions!
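
For example, the stored model can be loaded with pickle and used directly on raw text (the path is a placeholder; fill in your experiment name):

import pickle

# Hypothetical path; replace <experiment_name> with the name of your experiment
with open("data/trained_models/<experiment_name>/model.pkl", "rb") as f:
    model = pickle.load(f)

# The pipeline takes raw text, so no manual preprocessing is needed here
print(model.predict(["What a wonderful movie, I loved every minute of it!"]))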
