# Machine Learning
-----
## Table of Contents


## Introduction
- Decide what you're modeling and what will determine its success: what are your X, your Y, and your evaluation strategy?
- Get the data (inputs are from database management) and make it model-ready: deal with nulls and missing values, generate features, and separate the data into training and test sets. Each row should be an individual coupled with a timestamp, and should bring in all available data about this person at this time.
- Train models and choose the best based on your evaluation strategy.
- Error analysis: categorize the errors, see whether there are any identifiable patterns in the errors you're making, and decide whether you are OK with making those errors.
- Prediction and interpretation: apply the model to new data, say what we can conclude from it, and say what policy recommendations this suggests.
- Now take the same individual-level information and change the outcome variable: recidivism, whether they have a job, etc. What else do you find? How does this change feature generation and evaluation?

We need to decide what we want participants to actually do in this exercise to figure out the overall work.
If they will just play with scikit-learn, then we will probably provide an already-flattened table that they can load and use to try out different models.
Another option is to have them build the data from a person's records, but this would be a lot more work.

Research questions:

- All cohorts: predict stable employment (full-quarter employment status?)
- Ex-offenders: predict recidivism

This tutorial is based on chapter 6 of [Big Data and Social Science](https://github.com/BigDataSocialScience/).


## Learning Objectives
- Understand the basic concepts of supervised and unsupervised machine learning, how this differs from modeling for interpretation (which they are most likely more familiar with), and how it can be used for policy applications.
- Use ML packages in Python to bring in individual-level data combined across multiple sources; determine and generate appropriate features, outcome variables, evaluation methods and training/test splits; identify a best model and conduct error analysis and provide interpretation within context.

## Glossary of Terms 
- **Learning**: In machine learning, you'll hear about "learning a model." This is what you probably know as 
*fitting* or *estimating* a function, or *training* or *building* a model. These terms are all synonyms and are 
used interchangeably in the machine learning literature.
- **Examples**: These are what you probably know as *data points* or *observations*. 
- **Features**: These are what you probably know as *independent variables*, *attributes*, *predictors*, 
or *explanatory variables.*
- **Underfitting**: This happens when a model is too simple and does not capture the structure of the data well 
enough.
- **Overfitting**: This happens when a model is too complex or too sensitive to the noise in the data; this can
result in poor generalization performance, that is, poor applicability of the model to new data. 
- **Regularization**: This is a general method to avoid overfitting by applying additional constraints to the model. 
For example, you can limit the number of features present in the final model, or require that the weight coefficients 
applied to the (standardized) features be small.
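
As a rough illustration of the regularization entry above, the sketch below fits the same logistic regression twice on synthetic data, once with a strong penalty and once with a weak one, and compares the size of the fitted coefficients. The data set and the two values of `C` are assumptions chosen only for illustration; in scikit-learn, `C` is the inverse of the regularization strength.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration): 200 examples, 20 features,
# only 5 of which carry information about the label.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           random_state=0)
X = StandardScaler().fit_transform(X)  # standardize so coefficients are comparable

# Small C = strong penalty on large coefficients; large C = weak penalty.
strong = LogisticRegression(C=0.01, max_iter=1000).fit(X, y)
weak = LogisticRegression(C=100.0, max_iter=1000).fit(X, y)

strong_norm = np.linalg.norm(strong.coef_)
weak_norm = np.linalg.norm(weak.coef_)
print("coefficient norm, strong regularization:", strong_norm)
print("coefficient norm, weak regularization:  ", weak_norm)
```

The strongly regularized model ends up with smaller coefficients, which is the "additional constraint" the glossary entry describes.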


## Setup
Before we begin, run the code cell below to initialize the libraries we'll be using in this assignment.

In [3]:
import numpy
from sqlalchemy import create_engine
import pandas
import statsmodels
import sklearn

## Data Basics
We'll be using criminal justice data. 
Data documentation
How to connect to the database


In [None]:
# set up database credentials
mysql_username = "<username>"
mysql_password = "<password>"
mysql_host = "cuspdev.local"
mysql_port = "3306"
mysql_database = "homework"
mysql_charset = "utf8"

# Create database connection for pandas.
pandas_db = create_engine( "mysql+pymysql://" + mysql_username + ":" + mysql_password + "@" + 
                          mysql_host + ":" + mysql_port + "/" + mysql_database + "?charset=" + mysql_charset )

### The Machine Learning Process

- **Understand the problem and goal.** *This sounds obvious but is often nontrivial.* Problems typically start as vague 
descriptions of a goal - improving health outcomes, increasing graduation rates, understanding the effect of a 
variable *X* on an outcome *Y*, etc. It is really important to work with people who understand the domain being
studied to dig deeper and define the problem more concretely. What is the analytical formulation of the metric 
that you are trying to optimize?
- **Formulate it as a machine learning problem.** Is it a classification problem or a regression problem? Is the 
goal to build a model that generates a ranked list prioritized by risk, or is it to detect anomalies as new data 
come in? Knowing what kinds of tasks machine learning can solve will allow you to map the problem you are working on
to one or more machine learning settings and give you access to a suite of methods.
- **Data exploration and preparation.** Next, you need to carefully explore the data you have. What additional data
do you need or have access to? What variable will you use to match records for integrating different data sources?
What variables exist in the data set? Are they continuous or categorical? What about missing values? Can you use the 
variables in their original form, or do you need to alter them in some way?
- **Feature engineering.** In machine learning language, what you might know as independent variables or predictors 
or factors or covariates are called "features." Creating good features is probably the most important step in the 
machine learning process. This involves doing transformations, creating interaction terms, or aggregating over data
points or over time and space.
- **Method selection.** Having formulated the problem and created your features, you now have a suite of methods to
choose from. It would be great if there were a single method that always worked best for a specific type of problem, 
but that would make things too easy. Typically, in machine learning, you take a collection of methods and try them out 
to see empirically which one works best for your problem.
- **Evaluation.** As you build a large number of possible models, you need a way to select the model that is the 
best. This part of the chapter will cover the validation methodology to first validate models on historical data
as well as discuss a variety of evaluation metrics. The next step is to validate using a field trial or experiment.
- **Deployment.** Once you have selected the best model and validated it using historical data as well as a field
trial, you are ready to put the model into practice. You still have to keep in mind that new data will be coming in,
and the model might change over time.


### Problem Formulation
- **Supervised learning.** These are problems with one target or outcome variable (continuous or discrete) that we want
to predict, or classify data into. Classification, prediction, and regression fall into this category. We call the
set of explanatory variables $X$ **features**, and the outcome variable of interest the **label**.
- **Unsupervised learning** involves problems that do not have a specific outcome variable of interest, but rather
we are looking to understand "natural" patterns or groupings in the data - looking to uncover some structure that 
we do not know about a priori. Clustering is the most common example of unsupervised learning. Another example is 
principal components analysis (PCA).
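
To make the distinction concrete, here is a minimal sketch on synthetic data. The choice of `LogisticRegression` and `KMeans`, and the toy data from `make_blobs`, are assumptions for illustration; the point is only that the supervised learner is given the label `y`, while the unsupervised one sees `X` alone.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Toy data: 150 points in 2 dimensions scattered around 3 centers.
X, y = make_blobs(n_samples=150, centers=3, random_state=0)

# Supervised learning: both the features X and the label y are given.
classifier = LogisticRegression(max_iter=1000).fit(X, y)
predicted_labels = classifier.predict(X)

# Unsupervised learning: only X is given; the algorithm proposes its own groupings.
clusterer = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
cluster_ids = clusterer.labels_

print(predicted_labels[:5], cluster_ids[:5])
```

Note that the cluster numbers assigned by `KMeans` are arbitrary and need not match the numbering of the original labels.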


In this lesson, we'll be using the [pandas package](http://pandas.pydata.org/) to read in and manipulate data. Pandas provides an alternative to reading data directly from MySQL: it stores the data in a special table format called a "data frame" that allows for easy statistical analysis and can be used directly for machine learning. 
Pandas uses a database engine to connect to databases (via the SQLAlchemy Python package). The code cell that sets up database credentials (under "Data Basics") creates a database engine connected to our class MySQL database server for pandas to use. In that cell, place your database username and password in the variables 'mysql_username' and 'mysql_password', then run the cell.

Next, we will use this database connection to have pandas read in the data stored in the 'MachineLearning2' table. Pandas has a set of [Input/Output tools](http://pandas.pydata.org/pandas-docs/stable/io.html) that let it read from and write to a large variety of tabular data formats, including CSV and Excel files, databases via SQL, JSON files, and SAS and Stata data files. In the example below, we'll use the pandas.read_sql() function to read the results of an SQL query into a pandas data frame.

In [None]:
data_frame = pandas.read_sql( 'SELECT * FROM homework.MachineLearning2;', pandas_db )
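
If you do not have access to the class MySQL server, the same `pandas.read_sql()` call can be tried against an in-memory SQLite database. This is only a sketch: the table `toy`, its columns, and its rows are made up for illustration and are not part of the class data.

```python
import sqlite3

import pandas

# In-memory SQLite database standing in for the class MySQL server (an assumption).
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE toy (person_id INTEGER, ORG_DEPT TEXT)")
connection.executemany(
    "INSERT INTO toy VALUES (?, ?)",
    [(1, "MEDICINE"), (2, "SURGERY"), (3, "MEDICINE")],
)
connection.commit()

# read_sql() runs the query and returns the result set as a DataFrame.
toy_frame = pandas.read_sql("SELECT * FROM toy;", connection)
print(toy_frame.shape)

connection.close()
```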

Now, let's look at what the data looks like. The pandas.DataFrame method 'data_frame.head( number_of_rows )' outputs the first number_of_rows rows in a data frame. Let's look at the first five rows in our data.
In the code cell below, there are two ways to output this information. If you just call the method, you'll get an HTML table output directly into the ipython notebook. If you pass the results of the method to the "print()" function, you'll get text output that works outside of jupyter/ipython.

In [None]:
# to get a pretty tabular view, just call the method.
data_frame.head( 5 )

# to get a text-based view, print() the call to the method.
#print( data_frame.head( 5 ) )

## Understanding the Data 
In pandas, our data is represented by a DataFrame. You can think of data frames as a giant spreadsheet which you can program, with the data for each column stored in its own list that pandas calls a Series (or vector of values), along with a set of methods (another name for functions that are tied to objects) that make managing data in pandas easy.

A Series is a list of values each of which can also have a label, which pandas calls an "index", and which generally is used to store names of columns when you retrieve a Series that represents a row, and IDs of rows when you retrieve a Series that represents a column of data in a table.

While DataFrames and Series are separate objects, they may share the same methods where those methods make sense in both a table and list context (head() and tail(), as used in examples in this notebook, for example).
More details on pandas data structures:
- [Data Structures Overview](http://pandas.pydata.org/pandas-docs/stable/dsintro.html)
- [Series specifics](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#series)
- [DataFrame specifics](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe)

In [None]:
# get vector of "ORG_DEPT" column values from data frame
org_dept_column_series = data_frame[ "ORG_DEPT" ]

# see the last 5 values in the vector.
print( org_dept_column_series.tail( 5 ) )

# It is also OK to chain together, but I did not above for clarity's sake, and in
#    general, be wary of doing too many things on one line.
# data_frame[ "ORG_DEPT" ].tail( 5 )

# empty org_dept_column_series variable and garbage collect, to conserve memory
import gc
org_dept_column_series = None
gc.collect()

In [None]:
data_frame.dtypes

## Descriptive Statistics

Pandas provides some [great functions for descriptive statistics](http://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics). Some examples:

- **`describe()`** - "computes a variety of summary statistics about a Series or the columns of a DataFrame (excluding NAs of course)" ( See [documentation](http://pandas.pydata.org/pandas-docs/stable/basics.html#summarizing-data-describe))

    - includes the count of values, mean, standard deviation, min, 25%, 50%, and 75% values, and the max.

- **`head()` and `tail()`**, shown above - "To view a small sample of a Series or DataFrame object, use the `head()` and `tail()` methods. The default number of elements to display is five, but you may pass a custom number." ( See [documentation](http://pandas.pydata.org/pandas-docs/stable/basics.html#head-and-tail) ).
- **`value_counts()`** - The `value_counts()` "series method and top-level function computes a histogram of a one-dimensional array of values." ( See [documentation](http://pandas.pydata.org/pandas-docs/stable/basics.html#value-counts-histogramming-mode) ).  This method returns a Series of the counts of the number of times each unique value in the column is present in the column (also known as frequencies), from largest count to least, with the value itself the label for each row.
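
The methods listed above can be tried on a small hand-made DataFrame. The column `amount` and the rows below are hypothetical and exist only to show the shape of the output.

```python
import pandas

# Hypothetical toy data; only the column name ORG_DEPT comes from the class table.
toy = pandas.DataFrame({
    "ORG_DEPT": ["MEDICINE", "SURGERY", "MEDICINE", "MEDICINE", "SURGERY"],
    "amount": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# describe() summarizes a numeric column: count, mean, std, min, quartiles, max.
summary = toy["amount"].describe()

# value_counts() returns frequencies, sorted from most to least common.
dept_counts = toy["ORG_DEPT"].value_counts()
top_dept = dept_counts.index[0]

print(summary)
print(dept_counts.head(10))
```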

For the first part of exercise 1, we will combine some of these methods for calculating descriptive statistics into a function that accepts a DataFrame of data from our `homework.MachineLearning2` table, then calculates and returns both the descriptives for the columns in the table and the top ten most frequently referenced departments.

We will be making a function that returns multiple values to help you understand how this works in Python, since some of the machine learning functions used below return multiple values.

In a Python function, if you want to return multiple values, you place each in the line with your return statement, separated by commas.  This is like a list of variables, but you don't need to put it in square brackets.  You just place the items after the `return` keyword.  So, if you wanted to write a function that returns the circumference and area of a circle when passed the radius of a circle (run the cell below):
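
A minimal sketch of such a function follows. The names `circle_measurements`, `circumference`, and `area` are illustrative assumptions.

```python
import math

def circle_measurements(radius):
    """Return the circumference and the area of a circle with the given radius."""
    circumference = 2 * math.pi * radius
    area = math.pi * radius ** 2
    # Two values after `return`, separated by a comma: Python packs them into a tuple.
    return circumference, area

# Unpack the two returned values into two separate variables.
circumference, area = circle_measurements(2.0)
print(circumference, area)
```

The caller can either unpack the values as shown, or keep the result as a single tuple and index into it.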

## Model Selection
Now that we have a clean dataset, we can move on to the fun parts! scikit-learn's algorithms generally expect numeric input, so we need to convert all categorical variables to dummies first (boolean variables that capture the presence or absence in a given row of each of the potential values for each categorical variable). Fortunately, pandas makes this easy.

But before we do that, let's split our data variables into predictors (features, or independent variables, or "X" variables) and variables to predict (dependent variables, or "y" variables).  For ease of reference, in subsequent examples, names of variables that pertain to predictors will start with "`X_`", and names of variables that pertain to variables we are to predict will start with "`y_`".

In [None]:
# Let's go ahead and split into predictors and predicted.

# make a list of the column names not in the dependent (y) column name list (currently just "ORG_DEPT")
# one line - X_column_list = [ column_name for column_name in list( cleaned_data_frame.columns.values ) if column_name not in y_column_list ]
X_column_list = []
y_column_list = [ "ORG_DEPT" ]

# loop over column names.
column_name_list = cleaned_data_frame.columns.values
for column_name in column_name_list:
    
    # if the name is not in y_column_list, add it to X_column_list
    if ( column_name not in y_column_list ):
        
        # add to X_column_list
        X_column_list.append( column_name )
        
    #-- END check to see if column is in predicted/DV/y list --#
    
#-- END loop over columns. --#

# split columns into two DataFrames, those we are to predict,
#    and those that are predictors.
X_data_frame = cleaned_data_frame[ X_column_list ]
y_data_frame = cleaned_data_frame[ y_column_list ]

Now, we can easily convert all categorical variables in `X_data_frame` into dummy/binary variables using the `pandas.get_dummies()` function ( See [documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html)).

In [None]:
# Python's scikit-learn algorithms don't work on categorical variables. Fortunately, pandas provides an easy way out!
X_data_frame = pandas.get_dummies( X_data_frame )
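
To see what `pandas.get_dummies()` does, here is a small sketch on a hypothetical two-column table (the `city` and `age` columns are made up for illustration). Numeric columns pass through unchanged, and each categorical column is replaced by one indicator column per distinct value.

```python
import pandas

# Hypothetical toy table with one categorical and one numeric column.
toy = pandas.DataFrame({
    "city": ["Chicago", "New York", "Chicago"],
    "age": [34, 51, 29],
})

toy_dummies = pandas.get_dummies(toy)
print(list(toy_dummies.columns))
print(toy_dummies)
```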

## Features
Good features make machine learning systems effective. You generate features using a combination of domain knowledge and 
an examination of which variables correlate most strongly with the outcome. In general, it is better to have more complex features and a simpler model rather than vice versa. Keeping the model simple makes it faster to train and easier to understand. 

Common ways to generate features include:

- **Transformations**, such as log, square, and square root.
- **Dummy (binary) variables**, also known as *indicator variables*, often done by taking categorical variables
(such as city) which do not have a numerical value, and adding them to models as a binary value.
- **Discretization**. Several methods require features to be discrete instead of continuous. This is often done 
by binning, which you can do by equal width. 
- **Aggregation.** Aggregate features often constitute the majority of features for a given problem. These use 
different aggregation functions (*count, min, max, average, standard deviation, etc.*) which summarize several
values into one figure, aggregating over varying windows of time and space. For example, given urban data, 
we would want to calculate the *number* (and *min, max, mean, variance*, etc.) of crimes within an *m*-mile radius
of an address in the past *t* months for varying values of *m* and *t*, and then use all of them as features.
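
The sketch below shows one way three of these feature types might be generated with pandas. The table `events`, its columns `address_id` and `loss_amount`, and the bin labels are all hypothetical and are used only to make the operations concrete.

```python
import numpy
import pandas

# Hypothetical event-level records: one row per incident at an address.
events = pandas.DataFrame({
    "address_id": [1, 1, 1, 2, 2],
    "loss_amount": [100.0, 1000.0, 10.0, 50.0, 500.0],
})

# Transformation: log of a skewed numeric variable.
events["log_loss"] = numpy.log(events["loss_amount"])

# Discretization: three equal-width bins over the observed range.
events["loss_bin"] = pandas.cut(events["loss_amount"], bins=3,
                                labels=["low", "mid", "high"])

# Aggregation: collapse the events to one row of features per address.
address_features = events.groupby("address_id")["loss_amount"].agg(
    ["count", "min", "max", "mean"]
)
print(address_features)
```

Aggregating over varying windows of time or space, as described above, would follow the same pattern after first filtering the events to each window.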

## Training a model

- Back to [Table of Contents](#Table-of-Contents)

If we're building a model, we're going to need a way to know whether or not it's working. Convincing others of the quality of results is often the most challenging part of an analysis.  In machine learning, making repeatable, well-documented work with clear success metrics makes all the difference.

For our classifier, we're going to use the following build methodology (Ghani, 2014):

<img src="https://s3.amazonaws.com/demo-datasets/traintest.png" />

In brief, this methodology involves:

- First [**splitting your data**](#1.-Split-data-into-training-and-testing-data-sets) into a training set (75% of your data) and a test set (25% of your data).
- "**Feature engineering** is the process of transforming raw data into features that better represent the underlying problem/data to the predictive models, resulting in improved model accuracy on unseen data." ( from [Discover Feature Engineering](http://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/) ).  In text, for example, this might involve deriving traits of the text like word counts, verb counts, or topics to feed into a model rather than simply giving it the raw text.
- In the [**Model Build**](#2.-Model-Build---Fit-model) phase, you decide on a model then train or fit your model using your training data.
- In the [**Evaluate Performance**](#3.-Evaluate-Performance---accuracy) phase, you run your fitted model on your set of testing predictors, then assess the quality of the model by comparing the predicted values to the actual values for each record in your testing data set. 

Since we have a limited number of relatively basic features, we won't be going into any Feature Engineering examples for this exercise.  However, feature engineering is an essential part of implementing quality machine learning - to learn more, start with the "Discover Feature Engineering" tutorial: [http://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/](http://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/)

## 1. Split data into training and testing data sets

Let us now split our dataset into test and training sets using the `train_test_split()` function from scikit-learn's `sklearn.model_selection` module (in older scikit-learn releases this function lived in `sklearn.cross_validation`, which has since been removed). See [http://scikit-learn.org/stable/modules/cross_validation.html](http://scikit-learn.org/stable/modules/cross_validation.html):

In [None]:
# train_test_split() lives in sklearn.model_selection in current scikit-learn releases.
from sklearn.model_selection import train_test_split

# use train_test_split() to split our X and y variables into separate 75% and 25%
#    DataFrames of training (X_train and y_train) and testing (X_test and y_test) data.
X_train, X_test, y_train, y_test = train_test_split( X_data_frame, y_data_frame, test_size = 0.25, random_state = 0 )

# Before we fit the model, we also need to change the datatype of the y_train variable.
# y_train is currently a one-column pandas DataFrame; scikit-learn expects a one-dimensional array of labels.
# So all we need to do is extract the raw values of the "ORG_DEPT" column, and pass them on to scikit-learn.
y_train_values = y_train[ 'ORG_DEPT' ].values

## 2. Model Build - Fit model

- Back to [Table of Contents](#Table-of-Contents)

Python's `scikit-learn` is a very well known machine learning library. It is also well documented and maintained. You can learn all about it here: [http://scikit-learn.org/stable/](http://scikit-learn.org/stable/). We will be using different classifiers from this library for our predictions in this workbook. 

We will start with the simplest `LogisticRegression` model ( [http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) ) and see how well that does.

You can use any number of metrics to judge your models, but we will be using scikit-learn's provided `accuracy_score()` (the ratio of correct predictions to total number of predictions, based on comparing a set of predicted values to a set of actual values) as our measure ( [http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) ).

In [None]:
# Let's fit the model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit( X_train, y_train_values )
print(model)

When we print the model, we see different parameters we can adjust as we refine the model based on running it against test data (values such as `intercept_scaling`, `max_iter`, `penalty`, and `solver`).  Example output:

    LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0)

To adjust these parameters, one would alter the call that creates the `LogisticRegression()` model instance, passing it one or more of these parameters with a value other than the default.  So, to re-fit the model with `max_iter` of 1000, `intercept_scaling` of 2, and `solver` of "lbfgs" (pulled from thin air as an example), you'd create your model as follows:

    model = LogisticRegression( max_iter = 1000, intercept_scaling = 2, solver = "lbfgs" )

More details on what each of these parameters means are on the `LogisticRegression` documentation page: [http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)

The basic way one would tune these parameters is to iterate over fitting your model to your training data with different parameters (hopefully chosen based on your knowledge of your data and the model you are fitting), then testing it against test data until the model's accuracy is as high as you can get it.  Unfortunately, this exposes one to the potential that the model is over-fitted to one's test data, and so won't perform as well when it is used to predict other sets of data.

Cross-validation is a good way to fine-tune the parameters with less risk of over-fitting.  It involves dividing your training data into 5 or so equal sets called folds. For each set of parameters you want to try, each fold in turn serves as the held-out data while the model is fit on the remaining folds, and the resulting scores are averaged. This sounds complicated, but scikit-learn has functions to help, and a good tutorial on cross-validation can be found on the scikit-learn site: [http://scikit-learn.org/stable/modules/cross_validation.html](http://scikit-learn.org/stable/modules/cross_validation.html).
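
As a sketch of how this might look in code, the example below uses scikit-learn's `cross_val_score()` to compare a few candidate values of the `C` parameter with 5-fold cross-validation. The synthetic data and the candidate values are assumptions for illustration; with the class data you would pass your own training set instead, and keep the test set untouched until the end.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a training set (an assumption for illustration).
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)

# Score each candidate value of C on 5 folds of the training data only.
mean_scores = {}
for c_value in [0.01, 1.0, 100.0]:
    candidate = LogisticRegression(C=c_value, max_iter=1000)
    fold_scores = cross_val_score(candidate, X_toy, y_toy, cv=5, scoring="accuracy")
    mean_scores[c_value] = fold_scores.mean()

# Pick the candidate with the highest mean cross-validated accuracy.
best_c = max(mean_scores, key=mean_scores.get)
print(mean_scores, best_c)
```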

## 3. Evaluate Performance - accuracy

- Back to [Table of Contents](#Table-of-Contents)

Now let's use the model we just fit to make predictions on our test dataset, and see what our accuracy score is:

In [None]:
# store our test "to predict" variables in "expected".
expected = y_test

# predict values from our "predictors" using the model.
predicted = model.predict(X_test)

# generate an accuracy score by comparing expected to predicted.
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(expected, predicted)
print( "Accuracy = " + str( accuracy ) )

We get an accuracy score of 0.45340... (45%). This is not a great score, but it is much better than random guessing, which would have had a chance of 1/18 (about 5.6%) of succeeding. The other way to guess would be to always predict the mode, which in this case is MEDICINE with a frequency of 22497, and which would give us an accuracy score of 22497/49013 = 45.9%. So logistic regression is no better than always assigning the mode when department is missing. Let's see if other classifiers can do any better.
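
The "always predict the mode" baseline can also be computed with scikit-learn's `DummyClassifier`, which makes it easy to score a baseline with the same code used for a real model. The sketch below uses a made-up set of ten labels rather than the class data, so the 60% it produces is only illustrative.

```python
import numpy
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels (hypothetical): "MEDICINE" is the most frequent class, 6 of 10.
y_toy = numpy.array(["MEDICINE"] * 6 + ["SURGERY"] * 3 + ["PEDIATRICS"])
X_toy = numpy.zeros((len(y_toy), 1))  # this baseline ignores the features

# strategy="most_frequent" always predicts the most common training label.
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_accuracy = accuracy_score(y_toy, baseline.predict(X_toy))
print(baseline_accuracy)
```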

## Evaluation
- **Model Selection**: How do we select a method to use? What parameters should we select for that method?
- **Performance Estimation**: How well will our model do once it is deployed and applied to new data?
- **Deeper Understanding**: Are there inaccuracies in the predictions the model makes? Does the model uncover
inconsistencies in the data?

# Exercise 3 - Train, fit and evaluate your model

- Back to [Table of Contents](#Table-of-Contents)

Complete the function below to train different classifiers from the scikit library.

The `classifier()` function that you will implement:

- Accepts X_train_IN, y_train_IN (should be type numpy.ndarray, not a Series or DataFrame - to convert, call "`.values`"), X_test_IN, and y_test_IN variables.
- creates a model, fits the model, tests the model, and calculates an accuracy score for the model (like we did above).
- returns the accuracy score as a percent (so a number between 0 and 100, not a decimal between 0 and 1 - ... so multiply by 100).

Your goal is to come up with a classifier that gives at least 70% accuracy on the test dataset.  To do this, you can:

- choose different models from those offered as part of scikit learn.

    - To start, here are some resources to help with choosing a model:

        - The scikit learn tutorial on choosing a model - [http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html](http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html)
        - This video on choosing and tuning a model - [http://blog.kaggle.com/2015/05/14/scikit-learn-video-5-choosing-a-machine-learning-model/](http://blog.kaggle.com/2015/05/14/scikit-learn-video-5-choosing-a-machine-learning-model/)
    
    - In particular, here are some sets of models you could explore from the "predicting a category" with "labeled data" branch of the [scikit learn tutorial on choosing a model](http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html) (just make sure your data doesn't violate the assumptions of the model and that the model you choose is appropriate for your data - predicting a categorical variable using a mix of numeric and dummied categorical data):

        - other Linear Models like the `LogisticRegression` - [http://scikit-learn.org/stable/modules/linear_model.html](http://scikit-learn.org/stable/modules/linear_model.html)
        - Decision Tree models - [http://scikit-learn.org/stable/modules/tree.html](http://scikit-learn.org/stable/modules/tree.html)
        - Ensemble classifiers - [http://scikit-learn.org/stable/modules/ensemble.html](http://scikit-learn.org/stable/modules/ensemble.html)
        - Nearest neighbors classifiers - [http://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-classification](http://scikit-learn.org/stable/modules/neighbors.html#nearest-neighbors-classification)
        - Stochastic Gradient Descent - [http://scikit-learn.org/stable/modules/sgd.html#classification](http://scikit-learn.org/stable/modules/sgd.html#classification)
        - Kernel Approximation + SGDClassifier - [http://scikit-learn.org/stable/modules/kernel_approximation.html](http://scikit-learn.org/stable/modules/kernel_approximation.html)

- play around with different parameters for the models you try.
- experiment with different sets of X variables.
- _Advanced_ - You can try starting again from the top with an SQL query that uses JOINs to pull in columns from other tables, to add more variables to your pool of available predictors.
- _Advanced_ - You could also try to derive additional features from the data present in your query and add those features to your predictors.

Again, in general, make sure that the model and parameters you choose are appropriate for both your X and Y variables.

In [None]:
def classifier(X_train_IN, y_train_IN, X_test_IN, y_test_IN):
    """
    Parameters
    ----------
    X_train_IN : A pandas DataFrame of features used for training the classifier
    y_train_IN : A numpy array of y values used for training the classifier
    X_test_IN, y_test_IN : Use these to test the accuracy of your classifier
    
    Returns
    -------
    accuracy score : a float giving the percent (0 to 100) of accurate predictions you made
    """
    
    ### BEGIN SOLUTION
    # declare variables
    y_train_type = None
    y_train_array = None
    my_accuracy_score = -1
   
    # check to see if y_train_IN is either a Series or a DataFrame
    y_train_type = type( y_train_IN )
    if isinstance( y_train_IN, ( pandas.Series, pandas.DataFrame ) ):
        
        # Series or DataFrame - convert to a one-dimensional numpy array
        y_train_array = y_train_IN.values.ravel()
        
    elif isinstance( y_train_IN, numpy.ndarray ):
        
        # this is what it should be.
        y_train_array = y_train_IN
        
    else:
        
        # not a Series, DataFrame, or numpy ndarray - just use it and see what happens...?
        print( "Unexpected y_train_IN type: " + str( y_train_type ) )
        y_train_array = y_train_IN
        
    #-- END check to see if y_train_IN is wrong type. --#
    
    # for fitting model, use y_train_array rather than y_train_IN.

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    model = RandomForestClassifier()
    model.fit(X_train_IN, y_train_array)
    expected = y_test_IN
    predicted = model.predict(X_test_IN)
    
    # get accuracy score
    my_accuracy_score = accuracy_score( expected, predicted )
    
    # * 100 to turn into percentage value
    my_accuracy_score = my_accuracy_score * 100
    
    return my_accuracy_score
    ### END SOLUTION

In [None]:
# Remember,  first extract the values of y_train, before calling the classifier function
y_train_values = y_train["ORG_DEPT"].values

# TEST to see if your accuracy is greater than 70%. This might take several minutes to run!!!
test_accuracy_score = classifier( X_train, y_train_values, X_test, y_test )
print( "Accuracy Percentage: " + str( test_accuracy_score ) )

# TEST - is it greater than 70%?
assert test_accuracy_score >= 70

## Machine Learning Pipeline
When working on machine learning projects, it is a good idea to structure your code as a modular pipeline. This has 
many advantages:
- **Reproducibility**.
- **Comparison**.
- **Ability to make changes.**
- **Ability to collaborate.**
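
One possible way to structure such a pipeline is as a set of small functions, one per step, tied together by a single driver function. The sketch below is an assumption about what that could look like, not a prescribed layout: the function names, the toy table, and the choice of `LogisticRegression` are all hypothetical.

```python
import pandas
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def make_features(raw_frame, label_column):
    """Split a raw table into dummy-encoded features X and labels y."""
    X = pandas.get_dummies(raw_frame.drop(columns=[label_column]), dtype=int)
    y = raw_frame[label_column].values
    return X, y

def train_model(X_train, y_train):
    """Fit one candidate model; swap in another estimator here to compare."""
    return LogisticRegression(max_iter=1000).fit(X_train, y_train)

def evaluate_model(model, X_test, y_test):
    """Score a fitted model on held-out data."""
    return accuracy_score(y_test, model.predict(X_test))

def run_pipeline(raw_frame, label_column, random_state=0):
    """Run every step in order with a fixed seed so results are repeatable."""
    X, y = make_features(raw_frame, label_column)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=random_state
    )
    model = train_model(X_train, y_train)
    return evaluate_model(model, X_test, y_test)

# Hypothetical toy table, only to show the pipeline running end to end.
toy = pandas.DataFrame({
    "hours": [1, 2, 3, 4, 5, 6, 7, 8] * 5,
    "group": ["a", "b"] * 20,
    "label": [0, 0, 0, 0, 1, 1, 1, 1] * 5,
})
score = run_pipeline(toy, "label")
print(score)
```

Because each step is its own function, a collaborator can change the feature step or try a different model without touching the rest, and the fixed `random_state` keeps runs comparable.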

## Resources
- Hastie et al.'s [The Elements of Statistical Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/) is a classic and is available online for free.
- James et al.'s [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/) includes less mathematics and is more approachable. It is also available online.
- Wu et al.'s [Top 10 Algorithms in Data Mining](http://www.cs.uvm.edu/~icdm/algorithms/10Algorithms-08.pdf).