# KDD Cup 98: Project Checklist

This notebook is based on [Hands-On Machine Learning with Scikit-Learn & TensorFlow by Aurélien Géron](http://shop.oreilly.com/product/0636920052289.do).

## Frame the Problem and Look at the Big Picture

### 1. Define objective in business terms

Maximize the net revenue generated from future renewal mailings to lapsed donors.

**Note:** The typical outcome of predictive modeling in database marketing is an estimate of the expected response/return per customer in the database. A marketer will mail to a customer so long as the expected return from an order exceeds the cost invested in generating the order, i.e., the cost of promotion.

For our purpose, the package cost (including the mail cost) is $0.68 per piece mailed.
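
The mailing rule described in the note can be sketched in a few lines. This is a minimal illustration, assuming a model already supplies an expected donation per donor; the names `should_mail` and `net_revenue` are hypothetical and only the cost of $0.68 comes from the text.

```python
MAIL_COST = 0.68  # package cost per piece mailed (from the text above)


def should_mail(expected_donation, cost=MAIL_COST):
    """Mail only if the expected return exceeds the cost of the mailing."""
    return expected_donation > cost


def net_revenue(expected_donations, actual_donations, cost=MAIL_COST):
    """Sum of (actual donation - cost) over all donors selected for mailing."""
    return sum(
        actual - cost
        for expected, actual in zip(expected_donations, actual_donations)
        if should_mail(expected, cost)
    )


# Hypothetical example: three donors with predicted and realized donations.
predicted = [1.00, 0.50, 2.00]
realized = [0.00, 10.00, 5.00]
total = net_revenue(predicted, realized)  # donors 1 and 3 are mailed
```

Note that the second donor is not mailed, so the realized donation of 10.00 does not count: the selection depends only on the prediction.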

### 2. How will solution be used?

The solution will be applied only to the validation dataset.

### 3. What are current solutions / workarounds (if any)?

None are known. There exist solutions by the original participants, but they are deliberately ignored for this thesis.

### 4. How to frame problem? (supervised/unsupervised, online/offline, etc.)

* Supervised learning
* Offline

### 5. Performance measure?

Root mean squared error (RMSE): the cost function measured on the set of examples using hypothesis $h$.

$\text{RMSE}(\mathbf{X},h)=\sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h(x^{(i)})-y^{(i)}\right)^2}$

with:

* $m$: number of observations
* $x^{(i)}$: Vector of all feature values of observation $i$
* $y^{(i)}$: Label for the observation $i$
* $\mathbf{X}$: Matrix containing all feature values (excluding labels) of the dataset. There is one row per observation, and the $i^{\text{th}}$ row is equal to the transpose of $x^{(i)}$
* $h$: The prediction function or *hypothesis*
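
The formula translates almost literally into code. The sketch below uses only the standard library and mirrors the symbols defined above; the toy hypothesis `h` and the data are made up for illustration.

```python
import math


def rmse(X, y, h):
    """Root mean squared error of hypothesis h on observations X with labels y."""
    m = len(X)  # number of observations
    squared_errors = ((h(x_i) - y_i) ** 2 for x_i, y_i in zip(X, y))
    return math.sqrt(sum(squared_errors) / m)


# Toy hypothesis (an assumption for this example): predict the sum of the features.
def h(x):
    return sum(x)


X = [[1.0, 2.0], [3.0, 4.0]]
perfect = rmse(X, [3.0, 7.0], h)     # predictions match the labels exactly
imperfect = rmse(X, [0.0, 11.0], h)  # errors of +3 and -4
```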

### 6. Performance measure aligned with business objective?

Yes, 

### 7. What would be the min. performance needed to reach business objective?

### 8. What are comparable problems?

### 9. Human expertise available?

### 10. How would you solve problem manually?

### 11. List assumptions made so far.

### 12. Verify assumptions if possible.

## Get the data

<div class="alert alert-success">
<b>Note:</b> Automate as much as possible!
</div>

### 1. List data needed and how much is needed

### 2. Find and document where you can get that data

### 3. Check space requirements

### 4. Check legal obligations

### 5. Get access authorizations

### 6. Create workspace

### 7. Get data

### 8. Convert data to a format that is easy to manipulate (without changing the data itself!)

### 9. Ensure sensitive info is deleted or protected

### 10. Check size and type of data (time series, sample, geographical etc)

### 11. Sample a test set, put it aside, and never look at it
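
One way to keep the test set stable across reruns and dataset updates is to decide membership from a checksum of each record's identifier. The sketch below is an assumption about how this could be done, not the procedure used in this project; `in_test_set` and `split_ids` are hypothetical helpers.

```python
import zlib


def in_test_set(identifier, test_ratio=0.2):
    """Assign a record to the test set based on a checksum of its identifier."""
    checksum = zlib.crc32(str(identifier).encode("utf-8")) & 0xFFFFFFFF
    return checksum < test_ratio * 2**32


def split_ids(identifiers, test_ratio=0.2):
    """Split identifiers into (train, test) lists, deterministically."""
    train, test = [], []
    for identifier in identifiers:
        (test if in_test_set(identifier, test_ratio) else train).append(identifier)
    return train, test


train_ids, test_ids = split_ids(range(1000))
```

Because the assignment depends only on the identifier, a record never moves between the two sets when new rows are added later.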

## Explore the data

### 1. Create copy of the data for exploration (sampling down to manageable size if necessary)

### 2. Create Jupyter notebook to keep record of exploration

See notebook [EDA](./eda.ipynb)

### 3. Study each attribute and its characteristics:

- Name
- Type (categorical, int/float, bounded/unbounded, text, structured, etc.)
- % of missing values
- Noisiness, type of noise (stochastic, outliers, rounding errors, etc.)
- Possibly useful?
- Type of distribution (gaussian, uniform, logarithmic, etc.)
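
A first pass over several of these characteristics (type, share of missing values, number of distinct values) can be automated with pandas. This is a rough sketch; `summarize_attributes` is a hypothetical helper and the small frame is made-up data.

```python
import pandas as pd


def summarize_attributes(df):
    """Return one row per attribute with its dtype, % missing and distinct count."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "pct_missing": df.isna().mean() * 100,
            "n_unique": df.nunique(),  # missing values are not counted
        }
    )


# Made-up example data with one missing value per column.
example = pd.DataFrame(
    {
        "a": [1.0, None, 3.0, 4.0],
        "b": ["x", "y", "y", None],
    }
)
summary = summarize_attributes(example)
```

Noisiness and distribution type still need plots and manual inspection; the table only narrows down where to look.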

### 4. For supervised learning tasks, identify the target attribute(s)

### 5. Visualize data

### 6. Study correlations between attributes

### 7. Study how to hypothetically solve problem manually

### 8. Identify promising transformations

### 9. Identify extra data that would be useful

### 10. Document what you have learned

## Prepare the data

<div class="alert alert-success">
    <b>Notes:</b>
    <ul>
        <li>Work on copies of the data</li>
        <li>Write functions for all data transformations applied, for these reasons:</li>
        <ul>
            <li>So data can easily be prepared next time on fresh dataset</li>
            <li>So transformations can be applied in future projects</li>
            <li>To clean / prepare test set</li>
            <li>To clean / prepare new data once solution is live</li>
            <li>To make it easy to treat preparation choices as hyperparameters</li>
        </ul>
    </ul>
</div>

### 1. Data cleaning

- Fix / remove outliers
- Fill in missing values (zero, mean, median, ...) or drop rows/columns
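
In line with the note about writing functions for every transformation, both cleaning steps could be wrapped as follows. This is a sketch under stated assumptions: clipping to quantiles and filling with the median are just two of the options listed above, and `clean_numeric` is a hypothetical name.

```python
import pandas as pd


def clean_numeric(df, lower_q=0.01, upper_q=0.99):
    """Return a cleaned copy: clip numeric outliers to quantiles, then fill gaps."""
    out = df.copy()  # work on a copy, never on the original data
    for col in out.select_dtypes(include="number").columns:
        low, high = out[col].quantile([lower_q, upper_q])
        out[col] = out[col].clip(lower=low, upper=high)
        out[col] = out[col].fillna(out[col].median())
    return out


# Made-up example: one missing value and one extreme outlier.
raw = pd.DataFrame({"x": [1.0, 2.0, None, 4.0, 1000.0]})
cleaned = clean_numeric(raw, lower_q=0.0, upper_q=0.75)
```

The quantile bounds are parameters, which makes it easy to treat them as hyperparameters later on.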

### 2. Feature selection (optional)

- Drop attributes that don't provide useful info for the task

### 3. Feature engineering, where appropriate:

- Discretize continuous features
- Decompose features (categorical, date/time, etc.)
- Add promising transformations (e.g. log(x), sqrt(x), x^2, etc.)
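
The three kinds of transformation listed above might look like this in pandas. The column names `amount`, `age` and `gift_date` are purely illustrative assumptions and do not refer to actual fields of the dataset.

```python
import numpy as np
import pandas as pd


def engineer_features(df):
    """Return a copy of df with a few illustrative engineered features."""
    out = df.copy()
    # Promising transformation: log(1 + x) tames skewed amounts and handles zeros.
    out["amount_log"] = np.log1p(out["amount"])
    # Discretization: bucket a continuous feature into right-closed bins.
    out["age_bin"] = pd.cut(
        out["age"],
        bins=[0, 30, 50, 70, 120],
        labels=["<=30", "31-50", "51-70", ">70"],
    )
    # Decomposition: split a date into year and month components.
    out["gift_year"] = out["gift_date"].dt.year
    out["gift_month"] = out["gift_date"].dt.month
    return out


sample = pd.DataFrame(
    {
        "amount": [0.0, 9.0],
        "age": [25, 65],
        "gift_date": pd.to_datetime(["1997-06-15", "1996-12-01"]),
    }
)
features = engineer_features(sample)
```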

### 4. Feature scaling: Standardize or normalize features
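
A minimal standardization sketch with scikit-learn's `StandardScaler` follows. The point it illustrates is that the scaling statistics are learned on the training data only and then reused for any other data; the arrays are made-up.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and SD on training data
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics
```

Fitting the scaler on the test set as well would leak information about it into the preparation step.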

## Short-List Promising Models

<div class="alert alert-success">
<b>Notes:</b>

<ul>
    <li>If data is huge, maybe sample smaller training sets to train many different models in a reasonable time</li>
    <li>Once again, try to automate these steps as much as possible</li>
</ul>
</div>

### 1. Train many quick and dirty models from different categories (linear, naive Bayes, SVM, Random Forests, NN, etc.) using *standard* parameters

### 2. Measure and compare performance.
* Use N-fold cross-validation for each model, compute mean and SD of the performance measure on the N folds
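
As a sketch of this comparison, the loop below cross-validates two models with default-like settings on synthetic data and records the mean and SD of the RMSE per model. The data from `make_regression` and the choice of models are assumptions for illustration only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the prepared training data.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

models = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=20, random_state=0),
}

results = {}
for name, model in models.items():
    # scikit-learn maximizes scores, so the RMSE is exposed as a negated value.
    scores = -cross_val_score(
        model, X, y, cv=5, scoring="neg_root_mean_squared_error"
    )
    results[name] = (scores.mean(), scores.std())
```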

### 3. Analyze the most significant variables for each algorithm

### 4. Analyze the types of errors the models make
* What data would a human have used to avoid these errors?

### 5. Have a quick round of feature selection and engineering

### 6. Short-list the top three to five most promising models, preferring models that make different types of errors

## Fine-Tune the System

<div class="alert alert-success">
<b>Notes:</b>

<ul>
    <li>Use as much data as possible for this step, especially towards the end of fine-tuning.</li>
    <li>As always: Automate as much as possible</li>
</ul>
</div>

### 1. Fine tune hyperparameters using cross-validation
* Treat data transformation choices as hyperparameters, especially when you are not sure about them (e.g. should missing values be replaced with zero or the median, or should the rows be dropped?)
* Unless there are very few hyperparameter values to explore, prefer random search over grid search. If training is very long, you may prefer a Bayesian optimization approach (e.g. using Gaussian process priors, see https://goo.gl/PEFfGr)
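
Both bullets can be combined in one randomized search: the imputation strategy sits in the same pipeline as the model, so it is sampled like any other hyperparameter. The sketch assumes synthetic data with values knocked out at random and a `Ridge` model as placeholder.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with roughly 5% of the values removed.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

pipeline = Pipeline(
    [
        ("impute", SimpleImputer()),
        ("scale", StandardScaler()),
        ("model", Ridge()),
    ]
)

param_distributions = {
    # Preparation choice as hyperparameter; "constant" fills numeric data with 0.
    "impute__strategy": ["mean", "median", "constant"],
    # Regularization strength sampled on a log scale.
    "model__alpha": loguniform(1e-3, 1e2),
}

search = RandomizedSearchCV(
    pipeline,
    param_distributions,
    n_iter=10,
    cv=3,
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
search.fit(X, y)
best_params = search.best_params_
```

Dropping rows cannot be expressed this way, because a scikit-learn transformer may not change the number of samples; that option would have to be compared outside the pipeline.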

### 2. Try Ensemble methods. Combining your best models will often perform better than running them individually.
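
A simple form of ensembling is to average the predictions of several fitted models. The sketch below uses scikit-learn's `VotingRegressor` on synthetic data; the two member models are placeholders, not the models short-listed in this project.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

ensemble = VotingRegressor(
    [
        ("ridge", Ridge(alpha=1.0)),
        ("forest", RandomForestRegressor(n_estimators=20, random_state=0)),
    ]
)
ensemble.fit(X, y)

# With no weights given, the ensemble prediction is the plain average
# of the predictions of the fitted member models.
ensemble_pred = ensemble.predict(X[:3])
member_preds = np.mean([est.predict(X[:3]) for est in ensemble.estimators_], axis=0)
```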

### 3. Once you are confident about your final model, measure its performance on the test set to estimate the generalization error. (Do *not* tweak the model after measuring the generalization error &rarr; you would start overfitting the test set!)

## Present Your Solution

### 1. Document what you have done.

### 2. Create a nice presentation
* Make sure you highlight big picture first

### 3. Explain why solution achieves business objective

### 4. Don't forget to present interesting points you noticed along the way
* Describe what worked and what did not
* List your assumptions and your system's limitations

### 5. Ensure key findings are communicated through beautiful visualizations or easy-to-remember statements (e.g. feature X is the number-one predictor for y)

## Launch!

### 1. Get solution ready for production (plug into production data inputs, unit tests, etc.)

### 2. Write monitoring code to check your system's live performance at regular intervals and trigger alerts when it drops.
* Beware of slow degradation too: models tend to "rot" as data evolves
* Measuring performance may require human input (e.g. via a crowdsourcing pipeline)
* Also monitor the quality of your inputs (e.g. a malfunctioning sensor or data sources drying up)
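
The alerting idea could be sketched as a rolling RMSE that is compared against a threshold once enough live observations have arrived. `RmseMonitor`, the window size and the threshold are hypothetical choices for illustration; a real setup would hook the alert into whatever notification system is in use.

```python
import math
from collections import deque


class RmseMonitor:
    """Track the RMSE over the most recent predictions and flag degradations."""

    def __init__(self, window, threshold):
        self.squared_errors = deque(maxlen=window)  # oldest entries drop out
        self.threshold = threshold

    def current_rmse(self):
        return math.sqrt(sum(self.squared_errors) / len(self.squared_errors))

    def update(self, prediction, actual):
        """Record one observation; return True if an alert should fire."""
        self.squared_errors.append((prediction - actual) ** 2)
        window_full = len(self.squared_errors) == self.squared_errors.maxlen
        return window_full and self.current_rmse() > self.threshold


monitor = RmseMonitor(window=3, threshold=1.0)
observations = [(1, 1), (1, 1), (1, 1), (5, 1), (5, 1)]
alerts = [monitor.update(pred, actual) for pred, actual in observations]
```

A long window with a tight threshold helps catch slow degradation, while a short window reacts faster to sudden breakage.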

### 3. Retrain your models on a regular basis on fresh data (automate as much as possible)