This project classifies news articles from Frontiers using the Hugging Face ecosystem. It is used for both model training and deployment.
The code is easily extensible and can be used as a template for other text classification tasks and beyond.
- With poetry: `poetry install`
- With docker: `docker-compose up -d`

To run a command with docker, precede the commands described below with `docker-compose exec news_classifier`.
The entry point for the project is the `main.py` file, which is used to execute tasks defined in `./news_classifier/tasks`.
To run a task:

```bash
poetry run python main.py <task_name> [<optional_config_path>] [<optional_output_dir>]
```
It is easy to add new tasks by inheriting from `news_classifier.tasks.BaseTask`.
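As an illustration, a new task could look like the sketch below; the `run` method name, the data path, and the way tasks get registered with `main.py` are assumptions, so check the actual `BaseTask` interface:

```python
# Hypothetical sketch of a custom task; the real BaseTask interface
# (constructor arguments, entry method name, task registration) may differ.
import json

from news_classifier.tasks import BaseTask


class CountArticlesTask(BaseTask):
    """Toy task that counts articles in a previously scraped data file."""

    def run(self):  # assumed entry-point method name
        with open("output/articles.json") as f:  # hypothetical file path
            articles = json.load(f)
        print(f"Found {len(articles)} articles")
```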
The following tasks are defined:

- `scrape_data`: scrape data from the Frontiers blog page.
- `clean_data`: remove articles that were not scraped correctly.
- `format_data`: apply transformations and split into train/valid/test.
- `train`: train and validate a text classifier (with Hugging Face).
  - The default model is a distilled BERT Transformer.
  - During training, the following metrics are monitored: loss, accuracy, macro-precision/recall/F1.
  - All the parameters of the model are logged with MLflow.
- `evaluate`: evaluate the classifier on a dataset (the test set by default). It computes (see the sketch after this list):
  - the confusion matrix
  - global metrics: accuracy, macro-precision/recall/F1
  - accuracy/precision/recall/F1 for each class.
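For reference, the metrics listed for `evaluate` correspond to standard computations like the scikit-learn sketch below (the project may implement them differently internally):

```python
# Standalone illustration of the reported metrics using scikit-learn;
# the evaluate task may compute them with a different library.
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    precision_recall_fscore_support,
)

# Toy labels for the two default macro-categories
y_true = ["HEALTH", "OTHER", "HEALTH", "OTHER", "OTHER"]
y_pred = ["HEALTH", "HEALTH", "HEALTH", "OTHER", "OTHER"]

print(confusion_matrix(y_true, y_pred, labels=["HEALTH", "OTHER"]))
print("accuracy:", accuracy_score(y_true, y_pred))
# Global metrics: macro-averaged precision/recall/F1
print(precision_recall_fscore_support(y_true, y_pred, average="macro"))
# Per-class precision/recall/F1
print(classification_report(y_true, y_pred))
```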
Each task can be configured via a YAML configuration file; check the code documentation to see how to configure it. A default one is provided in `config.yml`.
In the default configuration, categories are merged into 2 macro-categories (`HEALTH` and `OTHER`). On the test set the model achieves about 80% on all the metrics (accuracy, precision, recall).
To run the sequence of tasks with the default configuration and output folder, you can run:

```bash
poetry run python main.py scrape_data
poetry run python main.py clean_data
poetry run python main.py format_data
poetry run python main.py train
poetry run python main.py evaluate
```
Logging is managed with MLflow, and results are stored in a local SQLite database. To see the MLflow UI you can run:

```bash
poetry run mlflow ui --backend-store-uri sqlite:///mlruns.db
```
With docker-compose, an MLflow service is already started and available on port 5000. The port can easily be changed in the `docker-compose.yml` file.
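For context, parameters and metrics end up in that SQLite store through standard MLflow calls such as the generic sketch below (not the project's actual logging code; the experiment, parameter, and metric names are made up):

```python
# Generic MLflow logging sketch against the same local SQLite store;
# the experiment name, parameter, and metric below are illustrative only.
import mlflow

mlflow.set_tracking_uri("sqlite:///mlruns.db")
mlflow.set_experiment("news_classifier")  # hypothetical experiment name

with mlflow.start_run():
    mlflow.log_param("model_name", "distilbert-base-uncased")  # example parameter
    mlflow.log_metric("valid_macro_f1", 0.80)  # example metric
```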
To execute the tests, run:

```bash
poetry run pytest tests
```
There are two options to deploy the app for inference.
An API exposing a `predict` method, built with FastAPI.

Run the app with:

```bash
poetry run uvicorn api.app:app --port <PORT>
```

Once it is running, you can try the API at `localhost:<PORT>/docs`.
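The endpoint can also be called programmatically; in the sketch below the `/predict` route and the request/response fields are assumptions, so check the interactive docs for the actual schema:

```python
# Example client call; the route and payload fields are assumptions --
# see localhost:<PORT>/docs for the real request/response schema.
import requests

resp = requests.post(
    "http://localhost:8000/predict",  # replace 8000 with the chosen <PORT>
    json={"text": "New study links regular exercise to better heart health."},
)
resp.raise_for_status()
print(resp.json())
```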
When using docker, a service for the API is started and is available on port 5001. The port can easily be changed in the `docker-compose.yml` file.
A Lambda paired with an implicit API, built with AWS SAM. Code and model are enclosed in the container image.

To build the app, first copy `news_classifier`, `poetry.lock`, `pyproject.toml`, and the desired `model.p` into `sam-app/app`. This is needed to build the docker image for `sam`. Then run:
```bash
# move into sam-app folder
cd sam-app
# build the image
sam build
# test the app locally
sam local start-api
```
To deploy the app on AWS, run:

```bash
sam deploy
```