CHARLOTTESVILLE OPEN DATA CHALLENGE

Best Predictive Model Competition

Team Name: Love Thy K-Nearest Neighbor

Team members: Alex P. Miller (alexmill@wharton.upenn.edu)

Final submission location:

Predictions for the test period are in the folder Submission/my_predictions/, with one file per predicted variable:

  • Submission/my_predictions/
    • clients.csv
    • sessions.csv
    • usage.csv

Code Requirements:

My models are built in pure Python 3 and should work out of the box on the Continuum Anaconda Python 3 distribution. The only non-standard package dependencies are sklearn, pandas, and numpy. Any machine with Python 3 and those packages installed should be able to run my code.
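Before running anything, it can help to confirm that the three dependencies are visible to the interpreter. The following is a minimal sketch, not part of the submission; the function name `missing_packages` is hypothetical. It only checks that each package can be found and does not check versions.

```python
import importlib.util

# Import names of the three third-party packages listed above.
REQUIRED = ["sklearn", "pandas", "numpy"]


def missing_packages(names=REQUIRED):
    """Return the names from `names` that this interpreter cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are installed.")
```

Note that sklearn is the import name; the package itself is distributed as scikit-learn.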

Code Structure:

I built and trained my models using the code in NotebookWalkthrough.ipynb, together with several custom functions and classes contained in the file 'alexs_models/imports.py'. After the models were trained, I saved them in pickle format in the alexs_models folder.
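For readers unfamiliar with pickle, the save-and-restore round trip looks roughly like the sketch below. This is an illustration under assumptions: the file name `example_model.pkl` and the dictionary standing in for a trained model are hypothetical, and the actual file names and model classes in alexs_models are not shown here.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialize a trained model object to disk."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Restore a model object previously written by save_model."""
    with open(path, "rb") as f:
        return pickle.load(f)


# A plain dict stands in for a trained model (hypothetical placeholder).
stand_in_model = {"name": "example", "coefficients": [0.5, 1.5]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_model.pkl")  # hypothetical file name
    save_model(stand_in_model, path)
    restored = load_model(path)
```

When a pickled object is an instance of a custom class, that class must be importable at load time, which is presumably why the custom classes live in a separate module. Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.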

To run the predictive models:

Navigate to this directory in a bash shell. Run the following command, where VARIABLE is one of: clients, session, or usage (corresponding to the three models required in the data challenge).

$ python predict.py -v VARIABLE -x test

This prints the model predictions for the given variable. The -x test argument tells the model to make predictions for the test period (Dec 21-27). To make predictions on the training data instead, change this argument to a row range such as -x 10,20, which tells the model to predict for the training rows between row 10 and row 20.
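The two forms of the -x argument can be handled with a small argparse setup. The sketch below is an assumption about how such a command line could be parsed, not the actual contents of predict.py; the names `parse_rows` and `build_parser` are hypothetical, and whether the row range is inclusive is not specified here.

```python
import argparse


def parse_rows(value):
    """Convert the -x argument to None (test period) or a (start, end) row pair."""
    if value == "test":
        return None
    # "10,20" -> (10, 20); a malformed value raises ValueError,
    # which argparse reports as a usage error.
    start, end = (int(part) for part in value.split(","))
    if start > end:
        raise argparse.ArgumentTypeError("start row must not exceed end row")
    return start, end


def build_parser():
    parser = argparse.ArgumentParser(description="Print model predictions.")
    parser.add_argument("-v", dest="variable", required=True,
                        help="which model to run")
    parser.add_argument("-x", dest="rows", type=parse_rows, default="test",
                        help="'test' or a START,END row range in the training data")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.variable, args.rows)
```

With this sketch, `-x test` yields `rows=None` and `-x 10,20` yields `rows=(10, 20)`, leaving it to the caller to decide which data to load.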

To train the predictive models:

The easiest way to see how the data are cleaned and the models are trained is to work through the NotebookWalkthrough.ipynb Jupyter notebook included in this submission directory.

The notebook has clear headings describing what each block does. It roughly follows this structure:

  • import statements
  • Data Cleaning
    • Formatting dependent variables
    • Formatting predictors
  • Brief mathematical description of models
  • Num Clients Model
    • Model definition
    • Data formatting
    • Cross-validation
    • Final model training
  • Num Sessions Model
    • Model definition
    • Data formatting
    • Cross-validation
    • Final model training
  • Usage Model
    • Model definition
    • Data formatting
    • Final model training

Primary data sources:

All data has been stored offline for the purposes of this challenge. If this model were put into production, it would need changes to fetch data from official APIs. Because I accessed and saved all pages manually, I did not violate any terms of service to obtain the data.

Weather data, obtained from Weather Underground:

UVA Basketball Game Data, obtained from ESPN

Downtown Charlottesville Local Event Data:

About

The code I used to win the AMLC's Open Data Challenge
