A machine learning pipeline that prioritizes the Chicago business licenses most likely to fail within two years.
The full project report can be found here
- cofigs: configuration files for different combinations of features, used to pass all required parameters into the pipeline.
- data: an sh script to download the cleaned full data set.
- data_collector: all code used to collect and clean the data.
- output: results of the pipeline, including the performance table, precision-recall curve plots, and AUC-ROC curve plots.
- pipeline: modules for imputation, evaluation, discretization, dummy encoding, and scaling (a preprocessing sketch follows this list).
- tests: tests for the pipeline.
- main.py: the main script to run the models and produce results.
- transformer.py: preprocesses the data set before modeling.
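As a rough illustration of the kind of preprocessing the pipeline modules and transformer.py perform, the sketch below uses scikit-learn to impute, dummy-encode, and scale features. The column names and the numeric/categorical split are hypothetical, not taken from this repository.

```python
# Minimal sketch of the preprocessing steps described above (imputation,
# dummy encoding, scaling). The column names and the numeric/categorical
# split are hypothetical, not taken from this repository.
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["median_income", "crime_count"]        # hypothetical
categorical_features = ["license_type", "community_area"]  # hypothetical

numeric_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill missing values
    ("scale", StandardScaler()),                   # standardize features
])

categorical_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("dummies", OneHotEncoder(handle_unknown="ignore")),  # "get dummies"
])

preprocess = ColumnTransformer([
    ("num", numeric_pipeline, numeric_features),
    ("cat", categorical_pipeline, categorical_features),
])
```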
Get the full dataset
cd data
sh get_fullfiles.sh
All package requirements are listed in environment.yml.
To create the environment, run the following:
conda env create -f environment.yml
To activate the environment, run the following:
conda activate myenv
To install the package, run:
python setup.py install
To run the test suite:
py.test
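A test under tests might look roughly like the following; the impute_median helper is a hypothetical stand-in defined inline, not the actual code in this repository.

```python
# Hypothetical pytest-style test for an imputation helper; impute_median
# is a stand-in defined inline, not the actual code in this repository.
import pandas as pd


def impute_median(df, column):
    """Fill missing values in `column` with the column median."""
    df = df.copy()
    df[column] = df[column].fillna(df[column].median())
    return df


def test_impute_median_fills_missing_values():
    df = pd.DataFrame({"x": [1.0, None, 3.0]})
    result = impute_median(df, "x")
    assert result["x"].isna().sum() == 0
    assert result["x"].iloc[1] == 2.0  # median of [1.0, 3.0]
```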
To run the pipeline with a chosen configuration, run, for example:
python main.py --config ./cofigs/acs_geo.yml
Each config file specifies a different combination of features drawn from ACS, reported 311 requests, reported crimes, and business license data; choose the one that matches the feature set you want to run.
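As an illustration of how main.py might consume such a config, the sketch below loads a YAML file with PyYAML; the keys shown (feature_groups, label) are assumptions, not the actual schema of acs_geo.yml.

```python
# Sketch of loading a feature-combination config with PyYAML. The keys
# shown here (feature_groups, label) are assumptions; the real YAML files
# in cofigs/ may use a different schema.
import yaml

with open("./cofigs/acs_geo.yml") as f:
    config = yaml.safe_load(f)

feature_groups = config.get("feature_groups", [])        # hypothetical key
label_column = config.get("label", "dead_in_2_years")    # hypothetical key
print("Using feature groups:", feature_groups)
print("Predicting label:", label_column)
```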
The results of the pipeline are saved in the output folder.
Under the performance folder, CSV files record the performance of all models.
Under the pr folder, precision-recall curves are saved; under the roc folder, ROC curves are saved (see the generic plotting sketch after the list below).
The output also includes:
- temporal validation table
- feature list
- feature importance of the best model
- final prioritized list produced by the best model
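The precision-recall and ROC outputs described above can be produced with standard scikit-learn and matplotlib calls; the snippet below is a generic sketch under that assumption, not the repository's evaluation code.

```python
# Generic sketch of saving precision-recall and ROC curves for one model;
# y_test and y_scores stand in for the pipeline's labels and predicted
# scores, and the output paths mirror the pr/ and roc/ folders above.
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, roc_auc_score, roc_curve


def save_curves(y_test, y_scores, model_name, out_dir="output"):
    # Precision-recall curve
    precision, recall, _ = precision_recall_curve(y_test, y_scores)
    plt.figure()
    plt.plot(recall, precision)
    plt.xlabel("Recall")
    plt.ylabel("Precision")
    plt.title(f"Precision-Recall: {model_name}")
    plt.savefig(f"{out_dir}/pr/{model_name}_pr.png")
    plt.close()

    # ROC curve with AUC in the legend
    fpr, tpr, _ = roc_curve(y_test, y_scores)
    auc = roc_auc_score(y_test, y_scores)
    plt.figure()
    plt.plot(fpr, tpr, label=f"AUC = {auc:.3f}")
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.title(f"ROC: {model_name}")
    plt.legend()
    plt.savefig(f"{out_dir}/roc/{model_name}_roc.png")
    plt.close()
```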
This project is licensed under the MIT License - see the LICENSE.md file for details
This project is the final project for the Machine Learning for Public Policy course at the University of Chicago.
- Supervised by Professor Rayid Ghani
- Inspired by Satej