https://www.kaggle.com/c/dato-native
- Expected directory structure:

  ```
  + <parent directory>
    |---+ data
    |     |--- <unpacked data from kaggle>
    |
    |---+ code (working directory)
          |--- <python files provided in code directory in this repository>
  ```
- Assuming that the data is unpacked in the `data` folder, run:

  ```shell
  mkdir ../data/json8
  python process_html.py ../data/ ../data/json8
  ```

  to prepare JSONs with the data.
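To give an idea of what this preprocessing step produces, here is a minimal, hypothetical sketch of turning a raw HTML page into a JSON record. It is not the actual `process_html.py`; the field names (`file`, `text`, `tags`) and the choice of extracted features are illustrative assumptions only.

```python
import json
from html.parser import HTMLParser


class TextAndTags(HTMLParser):
    """Collects visible text fragments and the tag names seen in a page."""

    def __init__(self):
        super().__init__()
        self.texts = []
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.texts.append(stripped)


def html_to_record(raw_html, file_name):
    """Parse one HTML document into a flat, JSON-serializable record."""
    parser = TextAndTags()
    parser.feed(raw_html)
    return {
        "file": file_name,   # hypothetical field names, for illustration
        "text": " ".join(parser.texts),
        "tags": parser.tags,
    }


if __name__ == "__main__":
    record = html_to_record("<html><body><p>Hello</p></body></html>", "page1.html")
    print(json.dumps(record))
```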
- The final model is an ensemble of boosted trees and logistic classifiers, but most of the predictive power comes from the boosted trees; the logistic classifiers provide only a very small improvement.
- Build five logistic classifiers, one for each chunk of data:

  ```shell
  mkdir models
  python chunk_classifier.py
  ```
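The per-chunk training can be pictured roughly as below. This is a hedged sketch only: `chunk_classifier.py` uses its own features and tooling, and everything here (plain gradient-descent logistic regression, random toy chunks) is a stand-in for illustration.

```python
import numpy as np


def train_logistic(X, y, lr=0.1, epochs=200):
    """Fit logistic regression with plain batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probabilities
        w -= lr * (X.T @ (p - y)) / len(y)      # gradient step on weights
        b -= lr * np.mean(p - y)                # gradient step on bias
    return w, b


def predict(w, b, X):
    """Return class-1 probabilities for the fitted model."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))


# One independent model per data chunk, mirroring the five-chunk setup.
chunks = [(np.random.rand(50, 3), np.random.randint(0, 2, 50)) for _ in range(5)]
models = [train_logistic(X, y) for X, y in chunks]
```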
- Build a logistic classifier model on all data:

  ```shell
  mkdir output
  python logistic_regression_bags.py
  ```
- Combine all logistic models:

  ```shell
  python combine_lr.py
  ```
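One plausible reading of "combining" the logistic models is a simple per-sample average of their predicted probabilities, sketched below. This is an assumption for illustration; `combine_lr.py` may weight or stack the models differently.

```python
def combine_predictions(prediction_lists):
    """Average per-sample probabilities across several models.

    prediction_lists: one list of probabilities per model, all the
    same length. Returns the element-wise mean.
    """
    n_models = len(prediction_lists)
    return [sum(preds) / n_models for preds in zip(*prediction_lists)]


# Three models' probabilities for two samples -> one averaged list.
combined = combine_predictions([[0.2, 0.9], [0.4, 0.7], [0.3, 0.8]])
```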
- Build the boosted trees model:

  ```shell
  python gbt2.py
  ```

  Building this model requires around 90 GB of memory, but it can also be performed efficiently on a machine with 16 GB of RAM if a big enough swap file is created. The swap file lets GraphLab Create allocate the memory it asks for, yet paging is not an issue, as the model effectively uses only about 10 GB.
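For reference, one common way to create a large swap file on Linux is shown below; this is a system-configuration sketch, not part of the pipeline, and the 100 GB size is an assumption chosen to cover the ~90 GB peak.

```shell
# Create and enable a 100 GB swap file (requires root privileges).
sudo fallocate -l 100G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Verify that the swap space is active.
swapon --show
```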
- Ensemble the boosted trees model and the combined logistic classifiers:

  ```shell
  python ensembler.py
  ```
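Since the text above notes that most of the predictive power comes from the boosted trees, the ensembling step can be pictured as a weighted blend of the two models' probabilities. The weight below is a made-up illustration of "mostly boosted trees"; `ensembler.py` may combine the models differently.

```python
def blend(gbt_probs, lr_probs, gbt_weight=0.9):
    """Weighted average of boosted-tree and logistic probabilities.

    gbt_weight is a hypothetical value reflecting that the boosted
    trees carry most of the predictive power.
    """
    return [gbt_weight * g + (1.0 - gbt_weight) * l
            for g, l in zip(gbt_probs, lr_probs)]


# Two samples: boosted-tree probabilities blended with logistic ones.
final = blend([0.80, 0.10], [0.60, 0.30])
```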
- The final submission should be in the `output/ensembler.csv` file.
- As version 1.5.2 of GraphLab doesn't support the `random_seed` parameter for the boosted trees classifier, the results can differ slightly between runs.