# Machine Learning
-----


## Table of Contents
- [Introduction](#introduction)
- [Glossary of Terms](#glossary-of-terms)
- [Setup](#setup)
- [The Machine Learning Process](#the-machine-learning-process)
- [Problem Formulation](#problem-formulation)
- [Feature Generation](#feature-generation)
- [Model Fitting](#model-fitting)
- [Model Evaluation](#model-evaluation)
- [Machine Learning Pipeline](#machine-learning-pipeline)
- [Deployment](#deployment)
- [Exercises](#exercises)
- [Resources](#resources)

## Introduction
In this tutorial, we'll discuss how to formulate a research question in the machine learning framework; how to transform raw data into something that can be fed into a model; how to build, evaluate, compare, and select models; and how to reasonably and accurately interpret model results. You'll also get hands-on experience using the `scikit-learn` package in Python to model the data you're familiar with from previous tutorials. 

As you'll see, you already know many of these machine learning concepts, but under a different name; you'll also learn some new concepts that will help you see and use the data you already work with in a new way.

This tutorial is based on chapter 6 of [Big Data and Social Science](https://github.com/BigDataSocialScience/).

## Glossary of Terms 
- **Learning**: In machine learning, you'll hear about "learning a model." This is what you probably know as 
*fitting* or *estimating* a function, or *training* or *building* a model. These terms are all synonyms and are 
used interchangeably in the machine learning literature.
- **Examples**: These are what you probably know as *data points* or *observations*. 
- **Features**: These are what you probably know as *independent variables*, *attributes*, *predictors*, 
or *explanatory variables.*
- **Underfitting**: This happens when a model is too simple and does not capture the structure of the data well 
enough.
- **Overfitting**: This happens when a model is too complex or too sensitive to the noise in the data; this can
result in poor generalization performance, or applicability of the model to new data. 
- **Regularization**: This is a general method to avoid overfitting by applying additional constraints to the model. 
For example, you can limit the number of features present in the final model, or require that the weight coefficients applied
to the (standardized) features be small.
- **Supervised learning** involves problems with one target or outcome variable (continuous or discrete) that we want
to predict, or classify data into. Classification, prediction, and regression fall into this category. We call the
set of explanatory variables $X$ **features**, and the outcome variable of interest $Y$ the **label**.
- **Unsupervised learning** involves problems that do not have a specific outcome variable of interest, but rather
we are looking to understand "natural" patterns or groupings in the data - looking to uncover some structure that 
we do not know about a priori. Clustering is the most common example of unsupervised learning. Another example is 
principal components analysis (PCA).
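
To make the regularization entry above concrete, here is a minimal sketch on synthetic data (the dataset and the three values of `C` are assumptions chosen only for illustration). In scikit-learn's `LogisticRegression`, `C` is the inverse of the regularization strength, so a smaller `C` should pull the fitted coefficients closer to zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): 200 examples, 20 features, only 5 informative.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)
X = StandardScaler().fit_transform(X)  # standardize so coefficients are comparable

# C is the INVERSE of regularization strength:
# smaller C means a stronger penalty on large coefficients (L2 penalty by default).
coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X, y)
    coef_norms[C] = np.linalg.norm(model.coef_)
    print("C = %6.2f  ->  coefficient norm = %.3f" % (C, coef_norms[C]))
```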


## Setup
---
*[Back to Table of Contents](#table-of-contents)*

Before we begin, run the code cell below to initialize the libraries we'll be using in this assignment. We're already familiar with `numpy`, `pandas`, and `psycopg2` from previous tutorials. Here we'll also be using [`scikit-learn`](http://scikit-learn.org) and [`statsmodels`](http://statsmodels.sourceforge.net/), which are packages we use to fit models.

In [None]:
import numpy
import pandas
import psycopg2
import statsmodels
import sklearn

Next, load the data. This dataset should be the output of the database management workbook.

In [None]:
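db_name = "your_db_name_here"
# pandas.read_sql() needs an open connection (not just the database name), so create one first
conn = psycopg2.connect( dbname = db_name )
data_frame = pandas.read_sql( 'SELECT * FROM schema.table;', conn )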

## The Machine Learning Process
*[Go back to Table of Contents](#table-of-contents)*

- [**Understand the problem and goal.**](#problem-formulation) *This sounds obvious but is often nontrivial.* Problems typically start as vague 
descriptions of a goal - improving health outcomes, increasing graduation rates, understanding the effect of a 
variable *X* on an outcome *Y*, etc. It is really important to work with people who understand the domain being
studied to dig deeper and define the problem more concretely. What is the analytical formulation of the metric 
that you are trying to optimize?
- [**Formulate it as a machine learning problem.**](#problem-formulation) Is it a classification problem or a regression problem? Is the 
goal to build a model that generates a ranked list prioritized by risk, or is it to detect anomalies as new data 
come in? Knowing what kinds of tasks machine learning can solve will allow you to map the problem you are working on
to one or more machine learning settings and give you access to a suite of methods.
- **Data exploration and preparation.** Next, you need to carefully explore the data you have. What additional data
do you need or have access to? What variable will you use to match records for integrating different data sources?
What variables exist in the data set? Are they continuous or categorical? What about missing values? Can you use the 
variables in their original form, or do you need to alter them in some way?
- [**Feature engineering.**](#feature-generation) In machine learning language, what you might know as independent variables or predictors 
or factors or covariates are called "features." Creating good features is probably the most important step in the 
machine learning process. This involves doing transformations, creating interaction terms, or aggregating over data
points or over time and space.
- **Method selection.** Having formulated the problem and created your features, you now have a suite of methods to
choose from. It would be great if there were a single method that always worked best for a specific type of problem, 
but that would make things too easy. Typically, in machine learning, you take a variety of methods and try them, empirically validating which one is the best approach to your problem.
- [**Evaluation.**](#model-evaluation) As you build a large number of possible models, you need a way to choose the best among them. We'll cover methodology to validate models on historical data and discuss a variety of evaluation metrics. The next step is to validate using a field trial or experiment.
- [**Deployment.**](#deployment) Once you have selected the best model and validated it using historical data as well as a field
trial, you are ready to put the model into practice. You still have to keep in mind that new data will be coming in,
and the model might change over time.

<img src="https://s3.amazonaws.com/demo-datasets/traintest.png" />


You're probably used to fitting models in physical or social science classes. In those cases, you probably had a hypothesis or theory about the underlying process that gave rise to your data, chose an appropriate model based on prior knowledge and fit it using least squares, and used the resulting parameter or coefficient estimates (or confidence intervals) for inference. This type of modeling is very useful for *interpretation*. Machine learning models do not generally optimize for obtaining a structural form of the model; they can take many different structural forms (ranging from linear models to sets of rules to more complex forms), and it may not always be possible to write them down in a compact form as an equation. This does not, however, necessarily mean that they are incomprehensible or uninterpretable.  

In machine learning, our primary concern is *generalization*. This means that:
- **We (mostly) don't care about the structure of the model - we just want whatever works the best.** This means that we'll try out a whole bunch of models at a time and choose the one that works best, rather than determining which model to use ahead of time.
- **We don't (necessarily) want the model that best fits the data we've *already seen*, but rather the model that will perform the best on *new data*.** This means that we won't gauge our model's performance using the same data that we used to fit the model (e.g. sum of squared errors or $R^2$), and that "best fit" or accuracy will most often *not* determine the best model.  
- **We can put whatever variables we want, and as many as we like, into a model.** This may sound like the complete opposite of what you've heard in the past, and it can be hard to swallow. But many of the concerns that apply in other types of modeling don't apply in the ML context, and many are addressed in the model fitting process by a more automatic variable selection process.

## Problem Formulation
*[Go back to Table of Contents](#table-of-contents)*

The first step is to turn a vague goal into a concrete objective function. What do you care about? Do you have data on that thing? What action can you take based on your findings? Do you risk introducing any bias based on the way you model something? 

### Four Main Types of ML Tasks for Policy Problems
- **Description**: [How can we identify and respond to the most urgent online government petitions?](https://dssg.uchicago.edu/project/improving-government-response-to-citizen-requests-online/)
- **Prediction**: [Which students will struggle academically by third grade?](https://dssg.uchicago.edu/project/predicting-students-that-will-struggle-academically-by-third-grade/)
- **Detection**: [Which police officers are likely to have an adverse interaction with the public?](https://dssg.uchicago.edu/project/expanding-our-early-intervention-system-for-adverse-police-interactions/)
- **Behavior Change**: [How can we prevent juveniles from interacting with the criminal justice system?](https://dssg.uchicago.edu/project/preventing-juvenile-interactions-with-the-criminal-justice-system/)
  

Next, we split our dataset into **features** (predictors, or independent variables, or $X$ variables) and **labels** (dependent variables, or $Y$ variables).  For ease of reference, in subsequent examples, names of variables that pertain to predictors will start with "`X_`", and names of variables that pertain to outcome variables will start with "`y_`".

- Making data model-ready: deal with nulls and missing values, generate features, and separate the data into training and test sets. *Each row should represent an individual at a point in time (a timestamp), and should bring in all data available about that person as of that time.*

In [None]:
# Let's split our data into predictors (X) and predicted (y).
# NOTE: cleaned_data_frame is assumed to be the cleaned version of the data loaded above.

# the outcome (dependent variable) column names; currently just "ORG_DEPT"
y_column_list = [ "ORG_DEPT" ]

# every column that is not an outcome column is a predictor (independent variable)
X_column_list = [ column_name for column_name in cleaned_data_frame.columns if column_name not in y_column_list ]

# split columns into two DataFrames, those we are to predict,
#    and those that are predictors.
X_data_frame = cleaned_data_frame[ X_column_list ]
y_data_frame = cleaned_data_frame[ y_column_list ]

The Python machine learning libraries (and mathematical models in general) only accept *numerical* quantities; they can't understand words or categorical variables. To feed our data into a model, we need to convert all categorical variables to **dummy variables.** This means that for every possible value of the categorical variable, we need to add a binary feature that takes on the value 1 if the observation belongs to that category, or 0 if the observation does *not* belong to that category. Luckily, `pandas` has built-in functionality to do just that: we can easily convert all categorical variables in `X_data_frame` into dummy variables using the [`pandas.get_dummies()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html) function.


In [None]:
# Test the output of get_dummies 
pandas.get_dummies(X_data_frame)

Does that look more like something a model would recognize? Note that in the example below we save the resulting "dummified" dataframe under the same name as the original, `X_data_frame`. This will overwrite whatever we had saved under the name `X_data_frame` before. If you do this, it's a good idea to test that the output matches what you expect (as we just did), so that you don't have to start from scratch.

In [None]:
# Save the data frame with dummy variables
X_data_frame = pandas.get_dummies( X_data_frame )
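
If you'd like to see what "dummifying" does on a dataset small enough to read at a glance, the following sketch uses a made-up three-row table (the column names and values are hypothetical, not taken from the tutorial data). The `dtype=int` argument is an optional choice that makes the new columns show up as 0/1 rather than `True`/`False`.

```python
import pandas

# Hypothetical toy data: one numeric column and one categorical column.
toy_frame = pandas.DataFrame({
    "age": [34, 51, 27],
    "city": ["Chicago", "Boston", "Chicago"],
})

# Each category becomes its own 0/1 column; numeric columns pass through unchanged.
toy_dummies = pandas.get_dummies(toy_frame, dtype=int)
print(toy_dummies)
```

The `city` column is replaced by one column per city (`city_Boston` and `city_Chicago`), while `age` is left as it was.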

## Feature Generation
*[Go back to Table of Contents](#table-of-contents)*


Good features make machine learning systems effective. You generate features by combining domain knowledge with 
exploration of which variables are most strongly correlated with the outcome. In general, it is better to have more complex features and a simpler model rather than vice versa. Keeping the model simple makes it faster to train and easier to understand. 
- "**Feature engineering** is the process of transforming raw data into features that better represent the underlying problem/data to the predictive models, resulting in improved model accuracy on unseen data." ( from [Discover Feature Engineering](http://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/) ).  In text, for example, this might involve deriving traits of the text like word counts, verb counts, or topics to feed into a model rather than simply giving it the raw text.

- **Transformations**, such as log, square, and square root.
- **Dummy (binary) variables**, also known as *indicator variables*, often done by taking categorical variables
(such as city) which do not have a numerical value, and adding them to models as a binary value.
- **Discretization**. Several methods require features to be discrete instead of continuous. This is often done 
by binning, for example into bins of equal width. 
- **Aggregation.** Aggregate features often constitute the majority of features for a given problem. These use 
different aggregation functions (*count, min, max, average, standard deviation, etc.*) which summarize several
values into one figure, aggregating over varying windows of time and space. For example, given urban data, 
we would want to calculate the *number* (and *min, max, mean, variance*, etc.) of crimes within an *m*-mile radius
of an address in the past *t* months for varying values of *m* and *t*, and then use all of them as features.
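
The feature types listed above can be sketched in a few lines of `pandas`. The table below is made up for illustration (the `neighborhood` and `damage_usd` columns are hypothetical), and it aggregates by neighborhood only; a real project would also aggregate over windows of time and distance as described above.

```python
import numpy
import pandas

# Hypothetical event-level data: one row per reported crime.
crimes = pandas.DataFrame({
    "neighborhood": ["A", "A", "A", "B", "B"],
    "damage_usd":   [100.0, 2500.0, 40.0, 900.0, 15000.0],
})

# Transformation: log(1 + x) compresses a skewed dollar amount.
crimes["log_damage"] = numpy.log1p(crimes["damage_usd"])

# Discretization: three equal-width bins over the log-transformed values.
crimes["damage_bin"] = pandas.cut(crimes["log_damage"], bins=3,
                                  labels=["low", "medium", "high"])

# Aggregation: summarize many events into one row of features per neighborhood.
neighborhood_features = crimes.groupby("neighborhood")["damage_usd"].agg(
    ["count", "min", "max", "mean"])
print(neighborhood_features)
```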

## Model Fitting
*[Go back to Table of Contents](#table-of-contents)*



In [None]:
# import the helper that splits data into training and test sets
from sklearn.model_selection import train_test_split

# use train_test_split() to split our X and Y variables into separate 75% and 25%
#    DataFrames of training (X_train and y_train) and testing (X_test and y_test) data.
X_train, X_test, y_train, y_test = train_test_split( X_data_frame, y_data_frame, test_size = 0.25, random_state = 0 )

# Before we fit the model, we also need to change the datatype of the y_train variable.
# y_train is currently a one-column pandas DataFrame, but scikit-learn expects a one-dimensional numpy array.
# So all we need to do is extract the raw values of the 'ORG_DEPT' column, and pass them on to scikit-learn
y_train_values = y_train[ 'ORG_DEPT' ].values

[`scikit-learn`](http://scikit-learn.org/stable/) is a commonly used, well-documented Python library for machine learning. This library can help you split your data into training and test sets, fit models and use them to predict results on new data, and evaluate your results.

We will start with the simplest [`LogisticRegression`](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) model and see how well that does.

You can use any number of metrics to judge your models (see [model evaluation](#model-evaluation)), but we'll use [`accuracy_score()`](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) (ratio of correct predictions to total number of predictions) as our measure.

In [None]:
# Let's fit the model
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit( X_train, y_train_values )
print(model)

When we print the model, we see its parameters, such as `C`, `class_weight`, and `penalty`. All of these parameters have default values that are automatically used when you just call `LogisticRegression()`, as above, and we can adjust them as we refine the model. The printed output looks something like this (the exact list of parameters and defaults depends on your version of scikit-learn): 

    LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0)

To adjust these parameters, one would alter the call that creates the `LogisticRegression()` model instance, passing it one or more of these parameters with a value other than the default.  So, to re-fit the model with `max_iter` of 1000, `intercept_scaling` of 2, and `solver` of "lbfgs" (pulled from thin air as an example), you'd create your model as follows:

    model = LogisticRegression( max_iter = 1000, intercept_scaling = 2, solver = "lbfgs" )

Look at the documentation for each type of model to find out what all of these parameters do. You'll choose some of the parameters based on your knowledge of the problem you're trying to solve and the data that you have; for instance, if only 1% of your observations have positive labels, you might try using `class_weight` to make sure you don't fit a model that labels all examples as negative and obtains 99% accuracy while mislabeling *all* of the positive examples. For some of the parameters, the interpretation will be less intuitive, and you'll choose their values the same way you choose a model: fit the model to your training data with a variety of parameters, and see which set of parameters performs the best on the test set. An obvious drawback is that you can also *overfit* to your test set; to guard against this, you can change your validation scheme, for example by tuning parameters with cross-validation on the training data and keeping the test set for a final check.
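
One possible way to organize this kind of parameter search is scikit-learn's `GridSearchCV`, sketched below on synthetic, imbalanced data. The data, the grid values, and the choice of the F1 score (a combination of the precision and recall metrics defined under [model evaluation](#model-evaluation)) are all assumptions made for illustration, not recommendations for your data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Hypothetical imbalanced data: roughly 10% positive labels.
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.9, 0.1],
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Candidate parameter values to compare (arbitrary choices for illustration).
param_grid = {
    "C": [0.01, 0.1, 1.0, 10.0],
    "class_weight": [None, "balanced"],
}

# 5-fold cross-validation on the TRAINING data picks the parameters,
# so the test set is never used for tuning.
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      scoring="f1", cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Held-out test score (F1):", search.score(X_test, y_test))
```

Because the parameters are chosen using only the training folds, the test score printed at the end is less likely to be inflated by the tuning itself.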

### More Resources for Model Selection

- [Kaggle video on choosing and tuning a model](http://blog.kaggle.com/2015/05/14/scikit-learn-video-5-choosing-a-machine-learning-model/) 
- Explore different sets of models (from "predicting a category" in the "labeled data" branch of the [scikit learn tutorial on choosing a model](http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)). *Remember to check that your data meet any assumptions required by any model you use.*:
   - [Other Linear Models](http://scikit-learn.org/stable/modules/linear_model.html)
   - [Decision Tree models](http://scikit-learn.org/stable/modules/tree.html)
   - [Ensemble classifiers](http://scikit-learn.org/stable/modules/ensemble.html)
   - [Nearest neighbors classifiers](http://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-classification)
   - [Stochastic Gradient Descent](http://scikit-learn.org/stable/modules/sgd.html#classification) 
   - [Kernel Approximation](http://scikit-learn.org/stable/modules/kernel_approximation.html)

## Model Evaluation 
*[Go back to Table of Contents](#table-of-contents)*

Now that we know about a handful of methods, we need to be able to evaluate the respective models and choose one that we like best. Now we'll focus on model evaluation methods, with three main goals:
- **Model Selection**: How do we choose which model we should deploy among the many models we train?
- **Performance Estimation**: How well will our model do once it is deployed and applied to new data?
- **Deeper Understanding**: Are there inaccuracies in the predictions the model makes? Does the model uncover
inconsistencies in the data?


### In-sample Evaluation 
As social scientists, you already evaluate methods on how well they perform in-sample (on the set the model was trained on). For example, you're probably used to looking at residuals or $R^2$ to determine "goodness of fit." 

It's not enough to just build the model; we're going to need a way to know whether or not it's working. Convincing others of the quality of results is often the *most challenging* part of an analysis.  Making repeatable, well-documented work with clear success metrics makes all the difference.

To convince ourselves - and others - that our modeling results will generalize, we need to hold some data back (not using it to train the model), then apply our model to that hold-out set and "blindly" predict, comparing the model's predictions to what we actually observed. This is called **cross-validation**, and it's the best way we have to estimate how a model will perform on *entirely* novel data. We call the data used to build the model the **training set**, and the rest the **test set**.

In general, we'd like our training set to be as large as possible, to give our model more information. However, you also want to be as confident as possible that your model will be applicable to new data, or else the model is useless. In practice, you'll have to balance these two objectives in a reasonable way.  

There are also many ways to split up your data into training and testing sets. Since you're trying to evaluate how your model will perform *in practice*, it's best to emulate the true use case of your model as closely as possible when you decide how to evaluate it. A good [tutorial on cross-validation](http://scikit-learn.org/stable/modules/cross_validation.html) can be found on the `scikit-learn` site.

One simple, commonly used method is ***k-fold* cross-validation**, which entails splitting up our dataset into *k* groups, holding out one group while training a model on the rest of the data, evaluating model performance on the held-out "fold," and repeating this process *k* times. Another method is **temporal cross-validation**, which involves building a model using all the data up until a given point in time, and then testing the model on observations that happened after that point. 
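
As a rough sketch of *k*-fold cross-validation, the example below uses scikit-learn's `KFold` and `cross_val_score` with *k* = 5. The synthetic data stands in for the tutorial's own features and labels, and the choice of five folds is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical synthetic data standing in for X_data_frame / y_data_frame.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# k = 5: each fold is held out once while the model trains on the other four.
folds = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=folds, scoring="accuracy")

print("Accuracy per fold:", scores)
print("Mean accuracy: %.3f (std %.3f)" % (scores.mean(), scores.std()))
```

The spread of the per-fold scores gives a sense of how much the performance estimate depends on which examples happened to be held out.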

For the case of this tutorial, we'll use a very basic cross-validation methodology, simply splitting the data into two parts: 75% training set and 25% test set, divided randomly. We created our training and test sets in the Model Fitting section using the `train_test_split()` function from scikit-learn's [`sklearn.model_selection`](http://scikit-learn.org/stable/modules/cross_validation.html) module (called `sklearn.cross_validation` in older releases).


Now let's use the model we just fit to make predictions on our test dataset, and see what our accuracy score is:

In [None]:
# import the accuracy metric
from sklearn.metrics import accuracy_score

# store the true values, or the y variable for the test set, in "expected"
expected = y_test[ 'ORG_DEPT' ].values

# use the predictors for our test set to come up with model predictions, AKA "predicted"
predicted = model.predict(X_test)

# generate an accuracy score by comparing expected to predicted.
accuracy = accuracy_score(expected, predicted)
print( "Accuracy = " + str( accuracy ) )

### Comparing to a Baseline

We get an accuracy score of 0.45340... (45%). This is not a great score; however, it is much better than random guessing among the 18 departments, which would succeed only 1 time in 18 (about 5.6%). Another simple way to guess is to always predict the mode, which in this case is MEDICINE with a frequency of 22497; that would give an accuracy score of 22497/49013 = 45.9%. So logistic regression is about as good as just always assigning the mode when department is missing. Let's see if other classifiers can do any better.
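
A "predict the mode" baseline like the one described above does not have to be computed by hand; scikit-learn's `DummyClassifier` can do it. The sketch below uses a tiny made-up label set (the `SURGERY` and `PEDIATRICS` labels and all of the counts are hypothetical) rather than the tutorial data.

```python
import numpy
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: "MEDICINE" is the mode (6 of 10 training examples).
y_train_toy = numpy.array(["MEDICINE"] * 6 + ["SURGERY"] * 3 + ["PEDIATRICS"])
y_test_toy = numpy.array(["MEDICINE"] * 5 + ["SURGERY"] * 4 + ["PEDIATRICS"])
# The baseline ignores the features, so placeholder columns of zeros are enough.
X_train_toy = numpy.zeros((len(y_train_toy), 1))
X_test_toy = numpy.zeros((len(y_test_toy), 1))

# strategy="most_frequent" always predicts the most common training label.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_train_toy, y_train_toy)
baseline_accuracy = accuracy_score(y_test_toy, baseline.predict(X_test_toy))
print("Baseline accuracy =", baseline_accuracy)
```

Any model worth keeping should beat this number on the same test set.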

### Metrics
Before we dive into metrics, it is important to highlight that classification models typically do not output 0/1 values directly; rather, they produce a score, usually between 0 and 1 (sometimes a probability), that is discretized into a 0 or 1 based on a user-specified threshold. A common default threshold is 0.5, which you can think of as classifying any example that is "most likely" (more than half) positive as positive, and any example that is "most likely" negative as negative. However, it's important to note that this threshold is arbitrary, and you should select your threshold based on the data, the model, and the problem you are solving. 

Once we've turned the real-valued predictions into 0/1 classification, we can create a *confusion matrix* from these predictions. Each data point has a *true value* which is either positive or negative, and the (thresholded) prediction of the classifier is either correct or incorrect. Therefore we have four buckets: true positives $(TP)$, false positives $(FP)$, true negatives $(TN)$, and false negatives $(FN)$; and the total number of positive examples is $TP + FN = P^\prime$, while the total number of negative examples is $TN + FP = N^\prime$. We also write $P = TP + FP$ for the number of examples *predicted* to be positive and $N = TN + FN$ for the number *predicted* to be negative.  

**Accuracy** is the ratio of correct predictions (both positive and negative) to all predictions:

$ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} = \frac{TP + TN}{P + N} = \frac{ TP + TN}{P^\prime + N^\prime}. $

Accuracy is the most commonly described evaluation metric for classification, but is (perhaps surprisingly) one of the least useful in practical situations. For example, a model with 85% accuracy sounds like it's performing pretty well, but if we are trying to classify a population with 95% positive and 5% negative examples, we could make a *really stupid* classifier, labeling *every example* as positive, and still beat our 85% accurate model in terms of accuracy.

Two additional commonly used metrics are **precision** and **recall**, defined as follows:

$ Precision = \frac{TP}{TP + FP} = \frac{TP}{P}$

$Recall = \frac{TP}{TP + FN} = \frac{TP}{P^\prime}$

Precision measures the accuracy of the classifier *when it predicts an example to be positive*: how many of the examples that were labeled positive are truly positive? Recall measures the ability of the classifier to *find positive examples*: how many of the positive examples were labeled positive?
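
To check the definitions above on numbers small enough to count by hand, here is a sketch with ten made-up labels and predictions (both lists are hypothetical). It computes the four confusion-matrix counts with scikit-learn's `confusion_matrix()` and then applies the formulas directly.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and thresholded predictions for ten examples.
expected_toy  = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted_toy = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary 0/1 labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(expected_toy, predicted_toy).ravel()

# Apply the formulas from the text.
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print("TP=%d FP=%d TN=%d FN=%d" % (tp, fp, tn, fn))
print("accuracy=%.3f precision=%.3f recall=%.3f" % (accuracy, precision, recall))
```

The hand-computed values should agree with scikit-learn's `precision_score()` and `recall_score()` applied to the same two lists.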

There is often a tradeoff between precision and recall. By selecting different classification thresholds, we can vary and tune the precision and recall of a given classifier. A highly conservative classifier that only predicts a 1 when it is very confident (say, threshold 0.999) will most often be correct when it predicts a 1 (high precision) but will miss most 1s (low recall). At the other extreme, a classifier that hands out 1s like candy (say, threshold 0.001) will have very high recall but very low precision. 
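
The tradeoff can be seen by thresholding the same scores at several cutoffs. The sketch below uses synthetic data and three arbitrary thresholds (both are assumptions for illustration); `predict_proba()` supplies the scores, and we do the thresholding ourselves instead of relying on the default used by `predict()`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Hypothetical synthetic binary data.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
# Column 1 of predict_proba() holds the score for the positive class (label 1).
scores = model.predict_proba(X_test)[:, 1]

recall_by_threshold = {}
for threshold in [0.1, 0.5, 0.9]:
    predicted = (scores >= threshold).astype(int)
    # zero_division=0 avoids a warning if no example is predicted positive.
    precision = precision_score(y_test, predicted, zero_division=0)
    recall_by_threshold[threshold] = recall_score(y_test, predicted)
    print("threshold=%.1f precision=%.3f recall=%.3f"
          % (threshold, precision, recall_by_threshold[threshold]))
```

Raising the threshold can only shrink the set of examples predicted positive, so recall never increases as the threshold goes up; precision usually, though not always, moves the other way.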

### Error Analysis

### Temporal Validation
Be wary of results that seem "too good to be true."

The cross-validation and holdout set approaches described above assume that the data have no time dependencies, and that the distributions of their values are stationary over time. This assumption is almost always violated in practice, and will affect performance estimates. In most practical problems, we want to use a validation strategy that emulates the way in which our models will be used and provides an accurate performance estimate. We'll call this *temporal validation*. For a given point in time $t$, we train models only on information that would have been available to us before $t$, to avoid training on data from the "future." Your test set would then be all the observations within a window of length $d$ after time $t$. How you set up these windows depends on several questions:
- How far out in the future do the predictions need to be made? For example, if the set of students who need to be targeted for interventions has to be finalized at the beginning of the school year for the entire year, then $d$ = 1 year.
- How often will the model be updated? If the model is being updated daily, then we can move the window a day at a time to reflect this scenario.
- How often will the system get new data? If we are getting new data frequently, we can make predictions more frequently.
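
A single temporal split can be sketched with plain `pandas` date filtering. Everything in the example is hypothetical: the `as_of_date` and `outcome` columns, the cutoff date $t$, and the six-month window $d$ are stand-ins that you would replace with values suited to your own problem.

```python
import pandas

# Hypothetical data: one row per month, with a 0/1 outcome.
records = pandas.DataFrame({
    "as_of_date": pandas.date_range("2015-01-01", periods=24, freq="MS"),
    "outcome": [0, 1] * 12,
})

# t is the prediction date; d is how far into the future we evaluate.
t = pandas.Timestamp("2016-01-01")
d = pandas.DateOffset(months=6)

# Train only on information available before t ...
train = records[records["as_of_date"] < t]
# ... and test on the window [t, t + d).
test = records[(records["as_of_date"] >= t) & (records["as_of_date"] < t + d)]

print("train:", train["as_of_date"].min().date(), "to", train["as_of_date"].max().date())
print("test: ", test["as_of_date"].min().date(), "to", test["as_of_date"].max().date())
```

Repeating this split while moving $t$ forward (for example, one month at a time) gives a series of performance estimates that mimics how the model would be retrained and used over time.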

## Machine Learning Pipeline
*[Go back to Table of Contents](#table-of-contents)*

When working on machine learning projects, it is a good idea to structure your code as a modular **pipeline**, which contains all of the steps of your analysis, from the original data source to the results that you report, along with documentation. This has many advantages:
- **Reproducibility**. It's important that your work be reproducible. This means that someone else should be able
to see what you did, follow the exact same process, and come up with the exact same results. It also means that
someone else can follow the steps you took and see what decisions you made, whether that person is a collaborator, 
a reviewer for a journal, or the agency you are working with. 
- **Ease of model evaluation and comparison**.
- **Ability to make changes.** If you receive new data and want to go through the process again, or if there are 
updates to the data you used, you can easily substitute new data and reproduce the process without starting from scratch.
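
For the modeling portion of such a pipeline, scikit-learn's `Pipeline` class is one option for chaining preprocessing and model fitting into a single object. Note that it covers only part of the full analysis pipeline described above (it does not handle data extraction or reporting). The toy data and column names below are hypothetical.

```python
import pandas
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one numeric and one categorical feature.
toy_X = pandas.DataFrame({
    "age": [25, 40, 31, 58, 46, 22],
    "city": ["Chicago", "Boston", "Chicago", "Boston", "Chicago", "Boston"],
})
toy_y = [0, 1, 0, 1, 1, 0]

# Step 1: preprocessing, declared once so it is applied identically every time.
preprocess = ColumnTransformer([
    ("numeric", StandardScaler(), ["age"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

# Step 2: chain preprocessing and the model into a single object.
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

pipeline.fit(toy_X, toy_y)
toy_predictions = pipeline.predict(toy_X)
print(toy_predictions)
```

Because the fitted pipeline carries its preprocessing with it, new data can be passed to `pipeline.predict()` in raw form, which makes it harder to apply different transformations at training and prediction time by accident.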

## Deployment
*[Go back to Table of Contents](#table-of-contents)*

- **Think about how your model results can and will be interpreted.** Consider common misunderstandings of probability and of causal relationships, whether the test sets match the use case, and the difference between prediction and interpretation.
- **How do you use this model in practice?** Given a new example or held-out data, generate predictions.
- **How does this model generalize?** In what situations would you feel comfortable deploying it? What conclusions can you draw based on your sample and methods?

## Exercises
*[Go back to Table of Contents](#table-of-contents)*

Now you've gone through the process of fitting the model. In practice, you'll need to fit - and evaluate - many models to decide on a "best" model. 

Change the outcome variable you're using. Decide what your evaluation metric will be. Does this change the unit of observation? Does this change what features it makes sense to create for each individual? Does it change how you will evaluate the model's performance? 

*Research questions: All cohorts - predict stable employment - full-quarter employment status?
Ex-offenders - Predict recidivism*




- Play around with different parameters for the models you try.
- Experiment with different sets of X variables.
- _Advanced_ - You can try starting again from the top with an SQL query that uses JOINs to pull in columns from other tables, to add more variables to your pool of available predictors.
- _Advanced_ - You could also try to derive additional features from the data present in your query and add those features to your predictors.




## Resources
*[Go back to Table of Contents](#table-of-contents)*

- Hastie et al.'s [The Elements of Statistical Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/) is a classic and is available online for free.
- James et al.'s [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/), also available online, includes less mathematics and is more approachable.
- Wu et al.'s [Top 10 Algorithms in Data Mining](http://www.cs.uvm.edu/~icdm/algorithms/10Algorithms-08.pdf).