# Machine Learning
-----

## Introduction

This tutorial is based on chapter 6 of [Big Data and Social Science](https://github.com/BigDataSocialScience/).

## Learning Objectives
- Understand the basic concepts of supervised and unsupervised machine learning, how they differ from modeling for interpretation (which you are most likely more familiar with), and how they can be used for policy applications.
- Use ML packages in Python to bring in individual-level data combined across multiple sources; determine and generate appropriate features, outcome variables, evaluation methods and training/test splits; identify a best model and conduct error analysis and provide interpretation within context.

## Table of Contents



## Glossary of Terms 
- **Learning**: In machine learning, you'll hear about "learning a model." This is what you probably know as 
*fitting* or *estimating* a function, or *training* or *building* a model. These terms are all synonyms and are 
used interchangeably in the machine learning literature.
- **Examples**: These are what you probably know as *data points* or *observations*. 
- **Features**: These are what you probably know as *independent variables*, *attributes*, *predictors*, 
or *explanatory variables.*
- **Underfitting**: This happens when a model is too simple and does not capture the structure of the data well 
enough.
- **Overfitting**: This happens when a model is too complex or too sensitive to the noise in the data; this can
result in poor generalization performance, or applicability of the model to new data. 
- **Regularization**: This is a general method to avoid overfitting by applying additional constraints to the model. 
For example, you can limit the number of features present in the final model, or require that the weight coefficients applied
to the (standardized) features be small.
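
To make the regularization entry concrete, here is a minimal sketch using scikit-learn. The synthetic data, the choice of ridge (L2) regression, and the penalty strength `alpha=10.0` are all assumptions made for illustration; they are not taken from the tutorial's data set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic data (an assumption for illustration): 30 examples, 10 features,
# drawn on a standard scale, where only the first feature drives the outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 10))
y = 2.0 * X[:, 0] + rng.normal(scale=1.0, size=30)

# Ordinary least squares is free to fit noise in the nine irrelevant features.
ols = LinearRegression().fit(X, y)

# Ridge regression adds an L2 penalty (strength set by alpha) that keeps
# the weight coefficients small.
ridge = Ridge(alpha=10.0).fit(X, y)

ols_norm = np.linalg.norm(ols.coef_)
ridge_norm = np.linalg.norm(ridge.coef_)
print("OLS coefficient norm:  ", round(ols_norm, 3))
print("Ridge coefficient norm:", round(ridge_norm, 3))
```

The ridge coefficients always have a smaller overall magnitude than the unregularized ones; whether that improves generalization depends on the data and on how `alpha` is chosen.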

In [3]:
import numpy
from sqlalchemy import create_engine
import pandas
import statsmodels
import sklearn

### The Machine Learning Process

- **Understand the problem and goal.** This sounds obvious but is often nontrivial. Problems typically start as vague 
descriptions of a goal - improving health outcomes, increasing graduation rates, understanding the effect of a 
variable *X* on an outcome *Y*, etc. It is really important to work with people who understand the domain being
studied to dig deeper and define the problem more concretely. What is the analytical formulation of the metric 
that you are trying to optimize?
- **Formulate it as a machine learning problem.** Is it a classification problem or a regression problem? Is the 
goal to build a model that generates a ranked list prioritized by risk, or is it to detect anomalies as new data 
come in? Knowing what kinds of tasks machine learning can solve will allow you to map the problem you are working on
to one or more machine learning settings and give you access to a suite of methods.
- **Data exploration and preparation.** Next, you need to carefully explore the data you have. What additional data
do you need or have access to? What variable will you use to match records for integrating different data sources?
What variables exist in the data set? Are they continuous or categorical? What about missing values? Can you use the 
variables in their original form, or do you need to alter them in some way?
- **Feature engineering.** In machine learning language, what you might know as independent variables or predictors 
or factors or covariates are called "features." Creating good features is probably the most important step in the 
machine learning process. This involves doing transformations, creating interaction terms, or aggregating over data
points or over time and space.
- **Method selection.** Having formulated the problem and created your features, you now have a suite of methods to
choose from. It would be great if there were a single method that always worked best for a specific type of problem, 
but that would make things too easy. Typically, in machine learning, you take a collection of methods and try them out to empirically validate which one works best for your problem.
- **Evaluation.** As you build a large number of possible models, you need a way to select the model that is the 
best. This part of the chapter will cover the validation methodology to first validate models on historical data
as well as discuss a variety of evaluation metrics. The next step is to validate using a field trial or experiment.
- **Deployment.** Once you have selected the best model and validated it using historical data as well as a field
trial, you are ready to put the model into practice. You still have to keep in mind that new data will be coming in,
and the model might change over time.
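
The method selection step above can be sketched as follows. This is only an illustration under stated assumptions: the data come from scikit-learn's `make_classification` rather than the class database, the two candidate methods are arbitrary choices, and accuracy on a held-out validation split stands in for whatever metric the problem actually calls for.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem (an assumption for illustration).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Hold out a validation split that the models never see during fitting.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# A small collection of candidate methods to try out empirically.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_valid, model.predict(X_valid))

best_name = max(scores, key=scores.get)
print(scores)
print("Best on the validation split:", best_name)
```

In a real project the winner would still need to be checked on data that played no part in the selection, and ideally in a field trial, as the evaluation and deployment steps describe.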


### Problem Formulation
- **Supervised learning.** These are problems with one target or outcome variable (continuous or discrete) that we want
to predict or, when it is discrete, classify data into. Classification, prediction, and regression fall into this category. We call the
set of explanatory variables $X$ **features**, and the outcome variable of interest the **label**.
- **Unsupervised learning** involves problems that do not have a specific outcome variable of interest, but rather
we are looking to understand "natural" patterns or groupings in the data - looking to uncover some structure that 
we do not know about a priori. Clustering is the most common example of unsupervised learning. Another example is 
principal components analysis (PCA).
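
The difference between the two settings shows up directly in code: a supervised method is fit on features and a label, while an unsupervised method sees only the features. The sketch below assumes synthetic data with two well-separated groups, and the specific estimators (logistic regression, k-means, PCA) are illustrative choices only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption): two well-separated groups of 50 examples,
# each example described by 3 features.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.5, size=(50, 3))
group_b = rng.normal(loc=5.0, scale=0.5, size=(50, 3))
X = np.vstack([group_a, group_b])
y = np.array([0] * 50 + [1] * 50)  # the label, known only in the supervised case

# Supervised learning: the label y is part of the fit.
classifier = LogisticRegression().fit(X, y)
predicted_labels = classifier.predict(X)

# Unsupervised learning: only X is used.
# Clustering looks for "natural" groupings in the data.
cluster_ids = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# PCA finds a lower-dimensional representation of the features.
X_2d = PCA(n_components=2).fit_transform(X)

print(predicted_labels[:5], cluster_ids[:5], X_2d.shape)
```

Note that the cluster IDs are arbitrary: cluster 0 need not correspond to label 0, because k-means never saw the labels.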


In this lesson, we'll use the [pandas package](http://pandas.pydata.org/) to read in and manipulate data. Rather than working with rows fetched directly from MySQL, pandas stores the data in a special table format called a "data frame" that allows for easy statistical analysis and can be used directly for machine learning. 
Pandas uses a database engine to connect to databases (via the SQLAlchemy Python package). In the code cell below, we will create a database engine connected to our class MySQL database server for pandas to use. Place your database username and password in the variables 'mysql_username' and 'mysql_password', then run the cell:
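
A minimal sketch of such a cell follows. Several details are assumptions, not facts about the class server: the host `localhost`, the `pymysql` driver, and the helper names `build_mysql_url` and `make_engine` are hypothetical, and the schema name `homework` is inferred from the query used later in this lesson. Only `create_engine` itself is the standard SQLAlchemy entry point.

```python
from urllib.parse import quote_plus


def build_mysql_url(username, password, host, database, driver="pymysql"):
    """Assemble a SQLAlchemy connection URL (hypothetical helper).

    quote_plus escapes characters such as '@' or '/' that would otherwise
    break the URL if they appear in the username or password.
    """
    return "mysql+{}://{}:{}@{}/{}".format(
        driver, quote_plus(username), quote_plus(password), host, database
    )


def make_engine(url):
    """Create the engine; imported lazily so the URL can be built without SQLAlchemy."""
    from sqlalchemy import create_engine
    return create_engine(url)


# Replace these placeholders with your own credentials.
mysql_username = "your_username"
mysql_password = "your_password"

# Host and schema are assumptions; use the values given for your class server.
connection_url = build_mysql_url(mysql_username, mysql_password, "localhost", "homework")

# Run this once the placeholders are filled in and a MySQL driver is installed:
# pandas_db = make_engine(connection_url)
```

The engine is stored in `pandas_db` because that is the name the `pandas.read_sql()` call in this lesson expects.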

Next, we will use this database connection to have pandas read in the data stored in the 'MachineLearning2' table. Pandas has a set of [Input/Output tools](http://pandas.pydata.org/pandas-docs/stable/io.html) that let it read from and write to a large variety of tabular data formats, including CSV and Excel files, databases via SQL, JSON files, and SAS and Stata data files. In the example below, we'll use the pandas.read_sql() function to read the results of an SQL query into a pandas data frame.

In [None]:
# run the query over the database engine and load the result into a data frame
data_frame = pandas.read_sql( 'SELECT * FROM homework.MachineLearning2;', pandas_db )
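
If you want to try `pandas.read_sql()` without access to the class server, the sketch below runs the same kind of call against an in-memory SQLite database from Python's standard library. The `demo` table and its contents are made up for illustration; only the `ORG_DEPT`-style column name echoes the lesson's data.

```python
import sqlite3

import pandas

# An in-memory SQLite database stands in for the MySQL server (an assumption
# for illustration; the lesson itself uses a SQLAlchemy engine).
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE demo (person_id INTEGER, org_dept TEXT)")
connection.executemany(
    "INSERT INTO demo VALUES (?, ?)",
    [(1, "A"), (2, "B"), (3, "A")],
)
connection.commit()

# read_sql takes the query text and a connection, and returns a data frame.
demo_frame = pandas.read_sql("SELECT * FROM demo;", connection)
connection.close()

print(demo_frame)
```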

Now, let's look at what the data looks like. The pandas.DataFrame method 'data_frame.head( number_of_rows )' outputs the first number_of_rows rows in a data frame. Let's look at the first five rows in our data.
In the code cell below, there are two ways to output this information. If you just call the method, you'll get an HTML table output directly into the ipython notebook. If you pass the results of the method to the "print()" function, you'll get text output that works outside of jupyter/ipython.

In [None]:
# to get a pretty tabular view, just call the method.
data_frame.head( 5 )

# to get a text-based view, print() the call to the method.
#print( data_frame.head( 5 ) )

## Understanding the Data 
In pandas, our data is represented by a DataFrame. You can think of a data frame as a giant spreadsheet that you can program, with the data for each column stored in its own list that pandas calls a Series (or vector of values), along with a set of methods (another name for functions that are tied to objects) that make managing data in pandas easy.

A Series is a list of values, each of which can also have a label. Pandas calls these labels an "index". When you retrieve a Series that represents a row, the index generally stores the names of the columns; when you retrieve a Series that represents a column of data in a table, it stores the IDs of the rows.

While DataFrames and Series are separate objects, they share methods where those methods make sense in both a table and a list context, such as head() and tail(), which are used in examples in this notebook.

More details on pandas data structures:
- [Data Structures Overview](http://pandas.pydata.org/pandas-docs/stable/dsintro.html)
- [Series specifics](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#series)
- [DataFrame specifics](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe)
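
The relationship between a DataFrame, its row and column Series, and their indexes can be seen in a small example. The table below is invented for illustration (the `SALARY` column and the row IDs are hypothetical); only the `ORG_DEPT` column name comes from the lesson's data.

```python
import pandas

# A tiny data frame with explicit row IDs as its index.
toy_frame = pandas.DataFrame(
    {"ORG_DEPT": ["A", "B", "A"], "SALARY": [100.0, 200.0, 150.0]},
    index=[101, 102, 103],
)

# A column is a Series whose index holds the row IDs.
column_series = toy_frame["ORG_DEPT"]
print(list(column_series.index))

# A row is a Series whose index holds the column names.
row_series = toy_frame.loc[102]
print(list(row_series.index))

# head() exists on both objects because it makes sense for tables and lists.
print(toy_frame.head(2))
print(column_series.head(2))
```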

In [None]:
# get vector of "ORG_DEPT" column values from data frame
org_dept_column_series = data_frame[ "ORG_DEPT" ]

# see the last 5 values in the vector.
print( org_dept_column_series.tail( 5 ) )

# It is also OK to chain together, but I did not above for clarity's sake, and in
#    general, be wary of doing too many things on one line.
# data_frame[ "ORG_DEPT" ].tail( 5 )

# empty org_dept_column_series variable and garbage collect, to conserve memory
import gc
org_dept_column_series = None
gc.collect()

The overall workflow for this exercise:

- Decide what you're modeling and what will determine its success: what are your X, your Y, and your evaluation strategy?
- Get the data (inputs are from database management) and make it model-ready: deal with nulls and missing values, generate features, and separate the data into training and test sets. Each row should be an individual coupled with a timestamp, bringing in all available data about this person at this time.
- Train models and choose the best one based on your evaluation strategy.
- Error analysis: categorize the errors, see whether there are any identifiable patterns in the errors you're making, and decide whether you are OK with making those errors.
- Prediction and interpretation: apply the model to new data and say what we can conclude from it and what policy recommendations this suggests.
- Now take the same individual-level information and change the outcome variable: recidivism, whether they have a job, etc. What else do you find? How does this change feature generation and evaluation?

Open questions about the exercise design:

- We need to decide what participants should actually do in this exercise in order to figure out the overall work.
- If they just play with scikit-learn, we will probably provide an already-flattened table that they can load and use to try out different models.
- Another option is to have them build the data from a person's records, but this would be a lot more work.

Research questions:

- All cohorts: predict stable employment (full-quarter employment status?)
- Ex-offenders: predict recidivism


In [None]:
data_frame.dtypes
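
Beyond listing column types, the same exploration usually covers missing values and the split between continuous and categorical variables. The sketch below uses an invented table (the columns `age`, `city`, and `employed` are hypothetical), and median imputation is shown only as one simple option, not as the recommended treatment for this data.

```python
import numpy as np
import pandas

# A made-up table with one missing value in each of two columns.
toy_frame = pandas.DataFrame({
    "age": [34.0, np.nan, 51.0, 29.0],
    "city": ["Chicago", "Boston", None, "Chicago"],
    "employed": [1, 0, 1, 1],
})

# Column types, as in the cell above.
print(toy_frame.dtypes)

# Number of missing values per column.
missing_counts = toy_frame.isnull().sum()
print(missing_counts)

# Separate numeric columns from everything else.
numeric_columns = toy_frame.select_dtypes(include="number").columns.tolist()
other_columns = toy_frame.select_dtypes(exclude="number").columns.tolist()

# One simple way to handle a missing numeric value: fill with the median.
toy_frame["age_filled"] = toy_frame["age"].fillna(toy_frame["age"].median())
```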

## Features
Good features make machine learning systems effective. You generate features by combining domain knowledge with what the data show, such as which variables are most correlated with the outcome. In general, it is better to have more complex features and a simpler model rather than vice versa. Keeping the model simple makes it faster to train and easier to understand. Common ways to create features include:

- **Transformations**, such as log, square, and square root.
- **Dummy (binary) variables**, also known as *indicator variables*, often done by taking categorical variables
(such as city) which do not have a numerical value, and adding them to models as a binary value.
- **Discretization**. Several methods require features to be discrete instead of continuous. This is often done 
by binning, which you can do by equal width. 
- **Aggregation.** Aggregate features often constitute the majority of features for a given problem. These use 
different aggregation functions (*count, min, max, average, standard deviation, etc.*) which summarize several
values into one figure, aggregating over varying windows of time and space. For example, given urban data, 
we would want to calculate the *number* (and *min, max, mean, variance*, etc.) of crimes within an *m*-mile radius
of an address in the past *t* months for varying values of *m* and *t*, and then use all of them as features.
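
The four kinds of features listed above can each be produced with a line or two of pandas. The records below are invented for illustration (`person_id`, `city`, and `wage` are hypothetical columns), and the number of bins and their labels are arbitrary choices.

```python
import numpy as np
import pandas

# Made-up records: several wage observations per person.
records = pandas.DataFrame({
    "person_id": [1, 1, 2, 2, 2],
    "city": ["Chicago", "Chicago", "Boston", "Boston", "Boston"],
    "wage": [1000.0, 3000.0, 500.0, 700.0, 900.0],
})

# Transformation: natural log of a skewed continuous variable.
records["log_wage"] = np.log(records["wage"])

# Dummy variables: one binary column per category.
city_dummies = pandas.get_dummies(records["city"], prefix="city")

# Discretization: three bins of equal width across the observed range.
records["wage_bin"] = pandas.cut(
    records["wage"], bins=3, labels=["low", "mid", "high"]
)

# Aggregation: summarize each person's observations into one row of features.
person_features = records.groupby("person_id")["wage"].agg(
    ["count", "min", "max", "mean", "std"]
)

print(city_dummies)
print(records)
print(person_features)
```

Aggregating over windows of time or space follows the same pattern: filter the records to the window first, then group and aggregate.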

## Evaluation
- **Model Selection**: How do we select a method to use? What parameters should we select for that method?
- **Performance Estimation**: How well will our model do once it is deployed and applied to new data?
- **Deeper Understanding**: Are there inaccuracies in the predictions the model makes? Does the model uncover
inconsistencies in the data?
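
One common way to serve all three goals is sketched below, under stated assumptions: synthetic data from `make_classification`, a decision tree as the method, 5-fold cross-validation for choosing `max_depth`, and accuracy as the metric. None of these choices come from the tutorial; they only illustrate the separation between selecting a model and estimating its performance.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration).
X, y = make_classification(n_samples=600, n_features=10, random_state=0)

# Set the test set aside before any model selection happens.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Model selection: 5-fold cross-validation on the training data only,
# comparing three candidate values of one parameter.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)
print("Selected parameters:", search.best_params_)

# Performance estimation: score the selected model once on the untouched test set.
test_accuracy = search.score(X_test, y_test)
print("Test accuracy:", round(test_accuracy, 3))

# Deeper understanding: the confusion matrix shows which kinds of errors occur.
error_table = confusion_matrix(y_test, search.predict(X_test))
print(error_table)
```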

## Machine Learning Pipeline
When working on machine learning projects, it is a good idea to structure your code as a modular pipeline. This has 
many advantages:
- **Reproducibility**.
- **Comparison**.
- **Ability to make changes.**
- **Ability to collaborate.**
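
As a rough illustration of what a modular pipeline can look like, the sketch below splits the work into small functions driven by a single configuration dictionary. The function names, the configuration keys, and the synthetic data are all hypothetical; scikit-learn's `Pipeline` is used only to bundle scaling and classification into one model step.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def load_data(seed):
    """Stand-in for reading from the database (synthetic data, an assumption)."""
    return make_classification(n_samples=300, n_features=6, random_state=seed)


def build_model(C):
    """Bundle preprocessing and the classifier so they are always applied together."""
    return Pipeline([
        ("scale", StandardScaler()),
        ("classify", LogisticRegression(C=C, max_iter=1000)),
    ])


def run_experiment(config):
    """Run one full pass: load, split, fit, and score on the held-out data."""
    X, y = load_data(config["seed"])
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=config["test_size"], random_state=config["seed"]
    )
    model = build_model(config["C"]).fit(X_train, y_train)
    return model.score(X_test, y_test)


# Every setting that affects the result lives in one place.
config = {"seed": 0, "test_size": 0.25, "C": 1.0}
first_run = run_experiment(config)
second_run = run_experiment(config)
print(first_run, second_run)
```

Because each step is a separate function and every setting is in `config`, rerunning gives the same result, a changed setting can be compared against the baseline, and a collaborator can replace one step without touching the others.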

## Resources
- Hastie et al.'s [The Elements of Statistical Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/) is a classic and is available online for free.
- James et al.'s [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/) includes less mathematics and is more approachable. It is also available online.
- Wu et al.'s [Top 10 Algorithms in Data Mining](http://www.cs.uvm.edu/~icdm/algorithms/10Algorithms-08.pdf).