# Case Study: School Budgeting with Machine Learning in Python

**DrivenData** runs online data science challenges for non-profits, NGOs, and social enterprises
* Goals of the case study:
    * **Use data to have a social impact**
    * **Build a machine learning algorithm that can automate the school budgeting process**
    
* Introduction to the challenge:
    * NLP
    * Feature Engineering
    * Efficiency Boosting hashing tricks
    * Supervised learning problem (labeled)
    * Over 100 target variables
    * Classification problem: we want to predict a category for each line item
    * Predictions will be probabilities for each label

* **Human-in-the-loop (HITL) machine learning** is a branch of artificial intelligence that leverages both human and machine intelligence to create machine learning models.

#### Exploring the Data:
* Encode labels as categories
* ML algorithms work on numbers, not strings
    * Need a numeric representation of these strings
* Strings can be slow compared to numbers
    * We never know ahead of time how long a string is, so our computer has to take more time to process strings than numbers (which have a precise number of bits)
* In pandas, `category` dtype encodes categorical data numerically
    * Can help speed up code
    * Use `.astype('category')`
    * To see the numerical representation of the categories, use:
        * `dummies = pd.get_dummies(sample_df[['label']], prefix_sep=' ')`
        * Dummy encoding is also called a **binary indicator** representation
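
As a small illustration of the two encodings described above, the sketch below uses a hypothetical four-row `sample_df` (not the budget data) and an underscore as `prefix_sep`; both choices are assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy data; the real budget data has many more rows and columns
sample_df = pd.DataFrame({"label": ["a", "b", "a", "c"]})

# Convert the string column to the category dtype
sample_df["label"] = sample_df["label"].astype("category")

# Integer codes that pandas stores for each row, one code per category
codes = sample_df["label"].cat.codes.tolist()

# Binary indicator (dummy) representation: one column per category
dummies = pd.get_dummies(sample_df[["label"]], prefix_sep="_")

print(codes)
print(dummies.columns.tolist())
```

Each row of `dummies` has exactly one active column, which is why this is called a binary indicator representation.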
        
#### Lambda functions
* Alternative to `def` syntax
* Easy way to make simple, one line functions
* `square = lambda x: x*x`
    * $\Uparrow$ This defines a lambda function that takes one parameter (the variable *x*); the function multiplies *x* by *x* and returns the result.
    * Call this function, just like any other function:
        * `square(2)`
        
* In the budget data, there are multiple columns that need to be made categorical
* To convert multiple columns into categories, we apply the function to each column separately (`axis=0` passes one column at a time):
    * `categorize_label = lambda x: x.astype('category')`
    * `sample_df.label = sample_df[['label']].apply(categorize_label, axis=0)`
* Check how many columns of each dtype there are with `df.dtypes.value_counts()`

```
# Define the lambda function: categorize_label
categorize_label = lambda x: x.astype('category')

# Convert df[LABELS] to a categorical type
df[LABELS] = df[LABELS].apply(categorize_label, axis=0)

# Print the converted dtypes
print(df[LABELS].dtypes)
```
* Count the unique values in each label column: `num_unique_labels = df[LABELS].apply(pd.Series.nunique)`

```
# Import matplotlib.pyplot
import matplotlib.pyplot as plt

# Calculate number of unique values for each label: num_unique_labels
num_unique_labels = df[LABELS].apply(pd.Series.nunique)

# Plot number of unique values for each label
num_unique_labels.plot(kind='bar')

# Label the axes
plt.xlabel('Labels')
plt.ylabel('Number of unique values')

# Display the plot
plt.show()
```

#### How do we measure success?
* Choosing how to evaluate your machine learning model is one of the most important decisions an analyst makes. 
* The decision balances the real world use of the algorithm, the mathematical properties of the evaluation function, and the interpretability of the measure
* **Accuracy** tells us what percentage of rows we got right
    * Accuracy can be misleading, especially when classes are imbalanced
    * Think of a spam classifier: if 99% of emails are legitimate, a model that never flags spam is 99% accurate and still useless
* A better metric for cases like the spam example: **log loss**
    * It is a loss function, i.e. a measure of error
    * In contrast to success measures like accuracy, we want measures of error to be as small as possible
    * Log loss penalizes being confident and wrong far more than being unsure
        * **Better to be less confident than wrong**
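
For reference, the binary log loss over $N$ rows, with true labels $y_i \in \{0, 1\}$ and predicted probabilities $p_i$, is usually written as shown here. This is the standard definition, and it is the same expression that the `compute_log_loss` function in the next section evaluates.

$$
\text{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]
$$

For a single row with $y = 1$, an unsure prediction of $p = 0.5$ costs $-\log(0.5) \approx 0.69$, while a confidently wrong prediction of $p = 0.1$ costs $-\log(0.1) \approx 2.30$.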
        
#### Computing log loss with Python:

```
import numpy as np

def compute_log_loss(predicted, actual, eps=1e-14):
    """ Computes the logarithmic loss between predicted and actual when these are 1D arrays.
        
        :param predicted: The predicted probabilities as floats between 0-1
        :param actual: The actual binary labels. Either 0 or 1.
        :param eps (optional): log(0) is -inf, so we need to offset our predicted values slightly by eps from 0 or 1.
    """
    # Keep predictions strictly inside (0, 1) so the logarithms stay finite
    predicted = np.clip(predicted, eps, 1 - eps)
    loss = -1 * np.mean(actual * np.log(predicted) + (1 - actual) * np.log(1 - predicted))
    return loss
```

* **`np.clip()`** sets a minimum and a maximum value for the elements of an array.
* Since the log of 0 is negative infinity, we offset our predictions ever so slightly from being exactly 1 or exactly 0 so that the score remains a finite real number.
* In the example above, `eps` achieves this.
* Computing log loss with NumPy:

* `compute_log_loss(predicted=0.9, actual=0)` (confident and wrong: about 2.30)
* `compute_log_loss(predicted=0.5, actual=1)` (unsure: about 0.69)

```
# Compute and print log loss for 1st case
correct_confident_loss = compute_log_loss(correct_confident, actual_labels)
print("Log loss, correct and confident: {}".format(correct_confident_loss)) 

# Compute log loss for 2nd case
correct_not_confident_loss = compute_log_loss(correct_not_confident, actual_labels)
print("Log loss, correct and not confident: {}".format(correct_not_confident_loss)) 

# Compute and print log loss for 3rd case
wrong_not_confident_loss = compute_log_loss(wrong_not_confident, actual_labels)
print("Log loss, wrong and not confident: {}".format(wrong_not_confident_loss)) 

# Compute and print log loss for 4th case
wrong_confident_loss = compute_log_loss(wrong_confident, actual_labels)
print("Log loss, wrong and confident: {}".format(wrong_confident_loss)) 

# Compute and print log loss for actual labels
actual_labels_loss = compute_log_loss(actual_labels, actual_labels)
print("Log loss, actual labels: {}".format(actual_labels_loss)) 
```
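
The arrays passed to `compute_log_loss` in the block above are not defined in these notes. The following self-contained sketch uses made-up values for `actual_labels` and the four prediction arrays (all of them assumptions chosen for illustration) to show how the loss grows as predictions move from confident and correct to confident and wrong.

```python
import numpy as np

def compute_log_loss(predicted, actual, eps=1e-14):
    """Binary log loss for 1D arrays, same formula as in the notes."""
    predicted = np.clip(predicted, eps, 1 - eps)
    return -1 * np.mean(actual * np.log(predicted)
                        + (1 - actual) * np.log(1 - predicted))

# Hypothetical data: five positive rows followed by five negative rows
actual_labels = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])

# Hypothetical predicted probabilities for the four cases
correct_confident = np.array([0.95] * 5 + [0.05] * 5)
correct_not_confident = np.array([0.65] * 5 + [0.35] * 5)
wrong_not_confident = np.array([0.35] * 5 + [0.65] * 5)
wrong_confident = np.array([0.05] * 5 + [0.95] * 5)

losses = {
    "correct and confident": compute_log_loss(correct_confident, actual_labels),
    "correct and not confident": compute_log_loss(correct_not_confident, actual_labels),
    "wrong and not confident": compute_log_loss(wrong_not_confident, actual_labels),
    "wrong and confident": compute_log_loss(wrong_confident, actual_labels),
}

for name, loss in losses.items():
    print("Log loss, {}: {:.3f}".format(name, loss))
```

With these values the loss increases from the first case to the last, and the jump from "wrong and not confident" to "wrong and confident" is the largest one.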

### Creating a simple first model
* It's always a good approach to start with a very simple model 
* Creating a simple model first helps to give us a sense of how challenging the problem is and also where our baseline is
* Many more things can go wrong in complex models 
* How much signal can we pull out using basic methods?

* Train a basic model on the numeric data only
    * We want to go from raw data to predictions as quickly as possible
* In this case we'll use multi-class logistic regression
    * Treats each label column as independent
    * Train a classifier on each label separately and use those to predict
    * Format predictions and save to csv
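* **`StratifiedShuffleSplit()`**
    * Only works if you have a single target variable
    
* **`multilabel_train_test_split()`** 
    * A helper function supplied with the course materials (it is not part of scikit-learn); it splits the data so that every label is represented in both the training and the test set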
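
To make the idea behind a multi-label split concrete, here is a simplified sketch. It is not the course's `multilabel_train_test_split()`; the function name `multilabel_split_sketch`, its parameters, and the toy data are all assumptions. It only guarantees that each label column has at least `min_count` positive rows in the test set and then fills the rest of the test set at random.

```python
import numpy as np
import pandas as pd

def multilabel_split_sketch(X, Y, size=0.2, min_count=1, seed=None):
    """Hypothetical simplified split for a binary indicator label matrix Y."""
    rng = np.random.RandomState(seed)
    y = np.asarray(Y)
    n_rows = y.shape[0]
    n_test = int(size * n_rows)

    # Reserve min_count positive rows per label column for the test set
    reserved = set()
    for col in range(y.shape[1]):
        positives = np.flatnonzero(y[:, col] > 0)
        if len(positives) < min_count:
            raise ValueError("Label column {} has too few positive rows".format(col))
        reserved.update(rng.choice(positives, size=min_count, replace=False).tolist())

    # Top up the test set with randomly chosen rows that are not reserved yet
    remaining = np.setdiff1d(np.arange(n_rows), list(reserved))
    n_extra = max(n_test - len(reserved), 0)
    extra = rng.choice(remaining, size=n_extra, replace=False)

    # Boolean mask marking the test rows
    test_mask = np.zeros(n_rows, dtype=bool)
    test_mask[list(reserved)] = True
    test_mask[extra] = True
    return X[~test_mask], X[test_mask], Y[~test_mask], Y[test_mask]

# Hypothetical toy data: 50 rows, one numeric feature, three label values
X = pd.DataFrame({"numeric": np.arange(50, dtype=float)})
Y = pd.get_dummies(pd.DataFrame({"label": ["a", "b", "c", "a", "b"] * 10}))

X_train, X_test, y_train, y_test = multilabel_split_sketch(X, Y, size=0.2, seed=123)
print(X_train.shape, X_test.shape)
print(y_test.sum(axis=0))
```

A plain random split gives no such guarantee, which matters when some labels are rare.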

```
data_to_train = df[NUMERIC_COLUMNS].fillna(-1000)
labels_to_use = pd.get_dummies(df[LABELS])
X_train, X_test, y_train, y_test = multilabel_train_test_split(data_to_train, labels_to_use, size =0.2, seed=123)

# Training the model 
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X_train, y_train)

```
* `OneVsRestClassifier` treats each column of y independently 
    * fits a separate classifier for each of the columns 
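
The following sketch shows this behaviour on synthetic data. The feature matrix and the three binary label columns are made up for the example and are not the budget data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Hypothetical data: 200 rows, 3 numeric features
rng = np.random.RandomState(0)
X = rng.randn(200, 3)

# Hypothetical targets: three independent binary label columns
y = (X > 0).astype(int)

clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X, y)

# One fitted LogisticRegression per column of y
print(len(clf.estimators_))

# One probability per row and per label column
proba = clf.predict_proba(X)
print(proba.shape)
```

Because the columns are treated independently, the probabilities in a row of `proba` do not have to sum to 1.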

#### Make predictions
* Predicting on holdout data

```
holdout = pd.read_csv('HoldoutData.csv', index_col =0)
holdout = holdout[NUMERIC_COLUMNS].fillna(-1000)
predictions = clf.predict_proba(holdout)

```

* In data science competitions, it's standard practice to write your predictions to a csv and then upload that csv to the competition platform
* Read competition documentation for submission format for a particular challenge
* All formatting can be done with the pandas `to_csv()` function

* Format and submit predictions:

```
prediction_df = pd.DataFrame(columns = pd.get_dummies(df[LABELS], 
                                prefix_sep='__').columns, index=holdout.index, 
                                data=predictions)
prediction_df.to_csv('predictions.csv')
score = score_submission(pred_path='predictions.csv')
```

```
# Instantiate the classifier: clf
clf = OneVsRestClassifier(LogisticRegression())

# Fit it to the training data
clf.fit(X_train, y_train)

# Load the holdout data: holdout
holdout = pd.read_csv('HoldoutData.csv')

# Generate predictions: predictions
predictions = clf.predict_proba(holdout[NUMERIC_COLUMNS].fillna(-1000))
```

```
# Generate predictions: predictions
predictions = clf.predict_proba(holdout[NUMERIC_COLUMNS].fillna(-1000))

# Format predictions in DataFrame: prediction_df
prediction_df = pd.DataFrame(columns=pd.get_dummies(df[LABELS]).columns,
                             index=holdout.index,
                             data=predictions)


# Save prediction_df to csv
prediction_df.to_csv('predictions.csv')

# Submit the predictions for scoring: score
score = score_submission(pred_path='predictions.csv')

# Print score
print('Your model, trained with numeric data only, yields logloss score: {}'.format(score))
```

#### A very brief introduction to NLP
* When we have data that is text we often want to process this text to create features for our algorithm
* Data for NLP can be:
    * Text, documents, speech, etc. ...
* First step is: Tokenization
* **Tokenization** is the process of splitting a string into segments, called "tokens"
    * Store segments as a list 
    
#### Tokens and token patterns
* Options for how to tokenize:
    * Tokenize on whitespace (i.e. split every time there is a space, tab, or return)
    * Tokenize on punctuation
    * Tokenize on commas
    * **Tokenize on whitespace *and* punctuation**
    * etc...
* For some datasets, we may want to split words on characters other than whitespace
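
A short sketch of two of these options, using Python's built-in string and regex tools. The example string is an assumption chosen to show the difference; it is not taken from these notes.

```python
import re

# Hypothetical line-item description
text = "PETRO-VEND FUEL AND FLUIDS"

# Tokenize on whitespace only: the hyphenated word stays together
whitespace_tokens = text.split()

# Tokenize on whitespace and punctuation: keep runs of letters and digits
alphanumeric_tokens = re.findall(r"[A-Za-z0-9]+", text)

print(whitespace_tokens)
print(alphanumeric_tokens)
```

The second pattern splits `PETRO-VEND` into two tokens, so the choice of tokenizer changes the vocabulary the model sees.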

#### Bag of words representation
* We want to use these tokens as part of our machine learning algorithm; often, the first way to do this is to count the number of times a particular token appears in a row
* **Bag of words:** 
    * Count the number of times a word was pulled out of the "bag" 
    * One of the simplest ways to represent text in ML
    * Discards information about grammar and word order
    * Computes frequency of occurrence
* A slightly more sophisticated approach is to create what are called **n-grams**, which retain some of the word order
* **1-gram, 2-gram, ..., n-gram:** in addition to a column for every token we see (which is called a 1-gram), we may have a column for every ordered pair of adjacent words (a 2-gram), every ordered triple (a 3-gram), and so on.

#### Representing text numerically
* **Bag of words:** 
    * Scikit-learn tool for bag of words: 
        * **`CountVectorizer()`** takes an array of strings and does three things:
            * 1) Tokenizes all the strings 
            * 2) Builds a "vocabulary": it makes a note of all of the tokens that appear
            * 3) Counts the occurrences of each token in the vocabulary
            
#### Using CountVectorizer() on column of main dataset
* Define a regex that splits the text on whitespace

```
from sklearn.feature_extraction.text import CountVectorizer
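# Token pattern: runs of non-whitespace characters that are followed by whitespace
TOKENS_BASIC = '\\S+(?=\\s+)'

# Replace missing descriptions with empty strings
df['Program_Description'] = df['Program_Description'].fillna('')

vec_basic = CountVectorizer(token_pattern=TOKENS_BASIC)
vec_basic.fit(df.Program_Description)

# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
print(len(vec_basic.get_feature_names_out()))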
```
* `fit` creates a vocabulary 
* `transform` will tokenize the text and then produce an array of counts

```
# Import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Create the token pattern: TOKENS_ALPHANUMERIC
TOKENS_ALPHANUMERIC = '[A-Za-z0-9]+(?=\\s+)'

# Fill missing values in df.Position_Extra
df['Position_Extra'] = df['Position_Extra'].fillna('')

# Instantiate the CountVectorizer: vec_alphanumeric
vec_alphanumeric = CountVectorizer(token_pattern=TOKENS_ALPHANUMERIC)

# Fit to the data
vec_alphanumeric.fit(df.Position_Extra)

# Print the number of tokens and first 15 tokens
# (get_feature_names_out() replaces get_feature_names(), removed in scikit-learn 1.2)
msg = "There are {} tokens in Position_Extra if we split on non-alpha numeric"
print(msg.format(len(vec_alphanumeric.get_feature_names_out())))
print(vec_alphanumeric.get_feature_names_out()[:15])
```

```
# NUMERIC_COLUMNS and LABELS must already be defined as lists of column names;
# otherwise the default argument below raises a NameError

# Define combine_text_columns()
def combine_text_columns(data_frame, to_drop=NUMERIC_COLUMNS + LABELS):
    """ Converts all text in each row of data_frame to a single string """
    
    # Drop non-text columns that are in the df
    to_drop = list(set(to_drop) & set(data_frame.columns.tolist()))
    text_data = data_frame.drop(to_drop, axis='columns')
    
    # Replace NaNs with blanks
    text_data.fillna('', inplace=True)
    
    # Join all text items in a row, separated by a space
    return text_data.apply(lambda x: " ".join(x), axis=1)
```
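
To show what `combine_text_columns()` returns, here is a self-contained sketch on a two-row toy frame. The contents of `NUMERIC_COLUMNS` and `LABELS` and the toy values are assumptions made for the example; in the real exercise they list the actual numeric and label columns of the budget data.

```python
import pandas as pd

# Hypothetical column lists for the toy frame below
NUMERIC_COLUMNS = ["FTE", "Total"]
LABELS = ["Function"]

def combine_text_columns(data_frame, to_drop=NUMERIC_COLUMNS + LABELS):
    """Join all remaining (text) columns of each row into one string."""
    # Only drop the columns that are actually present
    to_drop = list(set(to_drop) & set(data_frame.columns))
    text_data = data_frame.drop(columns=to_drop)

    # Replace missing values with empty strings so join() only sees strings
    text_data = text_data.fillna("")

    # Join the text items of each row, separated by a space
    return text_data.apply(lambda row: " ".join(row), axis=1)

toy_df = pd.DataFrame({
    "FTE": [1.0, 0.5],
    "Total": [100.0, 50.0],
    "Function": ["Teacher", "Custodian"],
    "Position_Extra": ["TEACHER", None],
    "Program_Description": ["GENERAL FUND", "MAINTENANCE"],
})

combined = combine_text_columns(toy_df)
print(combined.tolist())
```

The missing `Position_Extra` value in the second row becomes an empty string, so that row starts with a space.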


#### Pipelines, feature, and text preprocessing
* Now it's time to combine what we've learned about NLP with our model pipeline and incorporate the text data into our algorithm

#### The pipeline workflow
* Repeatable way to go from raw data to trained model 
* Pipeline object takes sequential list of steps
    * Output of one step is the input to the next step
* Each step is a tuple with two elements:
    * Name: string
    * Transform: an object implementing `.fit()` and `.transform()`
* Flexible: a step can itself be another pipeline
* The beauty of the pipeline is that it encapsulates every transformation from raw data to a trained model
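
As a sketch of the "a step can itself be another pipeline" point, the example below nests a preprocessing pipeline inside an outer pipeline. The synthetic data, the step names, and the use of `StandardScaler` are assumptions chosen for illustration; `SimpleImputer` is the current scikit-learn replacement for `Imputer`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: two numeric features, the second with some missing values
rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = (X[:, 0] > 0).astype(int)
X[::10, 1] = np.nan

# Inner pipeline: each step is a (name, transform) tuple
preprocess = Pipeline([
    ("imp", SimpleImputer()),
    ("scale", StandardScaler()),
])

# Outer pipeline: the first step is itself a pipeline, the last step is the model
pl = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression()),
])

pl.fit(X, y)
accuracy = pl.score(X, y)
print(list(pl.named_steps))
print(accuracy)
```

Calling `pl.fit` runs every step in order, so the raw array with missing values goes in and a trained model comes out.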

#### Instantiate simple pipeline with one step
* We start with a one-step pipeline
    * Obviously, we don't need a pipeline for a single step, this is just an exercise as a simple example
* We create a pipeline by passing it a series of named steps
* In the case below, the name is the string `clf`
* The step is the OneVsRest-Logistic Regression Classifier

```
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier


```

#### Train and test with sample numeric data

```
from sklearn.model_selection import train_test_split

# One-step pipeline: the classifier is the only step
pl = Pipeline([('clf', OneVsRestClassifier(LogisticRegression()))])
X_train, X_test, y_train, y_test = train_test_split(sample_df[['numeric']], 
                                   pd.get_dummies(sample_df['label']),
                                   random_state=2)
pl.fit(X_train, y_train)
accuracy = pl.score(X_test, y_test)
```
* **Using series/dfs with `NaN` values:**

```
# SimpleImputer replaces the Imputer class, which was removed in scikit-learn 0.22
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

pl = Pipeline([('imp', SimpleImputer()), ('clf', OneVsRestClassifier(LogisticRegression()))])
X_train, X_test, y_train, y_test = train_test_split(sample_df[['numeric', 'with_missing']], 
                                   pd.get_dummies(sample_df['label']),
                                   random_state=2)
pl.fit(X_train, y_train)
accuracy = pl.score(X_test, y_test)
```
* **Note:** `Imputer()` (called `SimpleImputer()` in current scikit-learn releases) is added to the pipeline to fill in `NaN` values
* The default imputation in sklearn fills missing values with the mean of the column in question 
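
A tiny sketch of the default mean imputation, using `SimpleImputer` from `sklearn.impute` on a made-up single-column array:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical column with one missing value
X = np.array([[1.0], [np.nan], [3.0]])

# The default strategy is "mean": the NaN is replaced by the column mean
filled = SimpleImputer().fit_transform(X)
print(filled.ravel().tolist())
```

The mean of the observed values 1.0 and 3.0 is 2.0, so that is the value filled in.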


### Text features and feature unions
#### Preprocessing text features

```
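from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Text has to be vectorized rather than imputed: CountVectorizer turns strings into token counts
pl = Pipeline([('vec', CountVectorizer()), ('clf', OneVsRestClassifier(LogisticRegression()))])
X_train, X_test, y_train, y_test = train_test_split(sample_df['text'], 
                                   pd.get_dummies(sample_df['label']),
                                   random_state=2)
pl.fit(X_train, y_train)
accuracy = pl.score(X_test, y_test)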
```
#### Preprocessing multiple dtypes
* We want to use **all** available features in one pipeline
* Problem:
    * Pipeline steps for numeric and text preprocessing can't follow each other 
    * e.g., output of `CountVectorizer()` can't be input to `Imputer()`
    * `CountVectorizer()` won't know what to do with numerical data and `Imputer()` won't know what to do with text data
    * In order to build our pipeline, we need to separately operate on the text columns and on the numeric columns 
* Solution:
    * **`FunctionTransformer()`** and **`FeatureUnion()`**
    
#### FunctionTransformer()
* **Turns a Python function into an object that a sklearn pipeline can understand**
* Need to write two functions for pipeline preprocessing
    * 1) Take entire dataframe, return numeric columns 
    * 2) Take entire dataframe, return text columns
* Can then preprocess numeric and text data in separate pipelines

```
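# Imports needed by this block
# (SimpleImputer replaces the Imputer class, which was removed in scikit-learn 0.22)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Keep both the numeric and the text columns in the feature set
X_train, X_test, y_train, y_test = train_test_split(sample_df[['numeric', 'with_missing', 'text']], 
                                   pd.get_dummies(sample_df['label']),
                                   random_state=2)

# Selectors: each takes the entire dataframe and returns only the columns it needs
get_text_data = FunctionTransformer(lambda x: x['text'], validate=False)
get_numeric_data = FunctionTransformer(lambda x: x[['numeric', 'with_missing']], validate=False)

# Separate preprocessing pipelines for numeric and text data
numeric_pipeline = Pipeline([
                        ('selector', get_numeric_data),
                        ('imputer', SimpleImputer())
                    ])
text_pipeline = Pipeline([
                        ('selector', get_text_data),
                        ('vectorizer', CountVectorizer())
                    ])

# FeatureUnion joins the outputs of both pipelines side by side before the classifier
pl = Pipeline([
        ('union', FeatureUnion([
            ('numeric', numeric_pipeline),
            ('text', text_pipeline)
         ])),
        ('clf', OneVsRestClassifier(LogisticRegression()))
     ])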
```
* **Note:** We pass `FunctionTransformer` the argument `validate=False`; this simply tells sklearn it doesn't need to check for `NaN`s or validate the dtypes of the input
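
Since `sample_df` is never defined in these notes, the sketch below builds a hypothetical stand-in with the same four column names and runs the combined pipeline end to end. The data values, the selector function names, and the use of `SimpleImputer` (the current replacement for `Imputer`) are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical stand-in for sample_df
n_rows = 60
rng = np.random.RandomState(0)
sample_df = pd.DataFrame({
    "numeric": rng.randn(n_rows),
    "with_missing": rng.randn(n_rows),
    "text": ["foo", "bar", "foo bar", ""] * (n_rows // 4),
    "label": ["a", "b", "c"] * (n_rows // 3),
})
sample_df.loc[::7, "with_missing"] = np.nan

def select_text(frame):
    """Take the entire dataframe, return the text column."""
    return frame["text"]

def select_numeric(frame):
    """Take the entire dataframe, return the numeric columns."""
    return frame[["numeric", "with_missing"]]

X_train, X_test, y_train, y_test = train_test_split(
    sample_df[["numeric", "with_missing", "text"]],
    pd.get_dummies(sample_df["label"], dtype=int),
    random_state=2,
)

numeric_pipeline = Pipeline([
    ("selector", FunctionTransformer(select_numeric, validate=False)),
    ("imputer", SimpleImputer()),
])
text_pipeline = Pipeline([
    ("selector", FunctionTransformer(select_text, validate=False)),
    ("vectorizer", CountVectorizer()),
])
pl = Pipeline([
    ("union", FeatureUnion([
        ("numeric", numeric_pipeline),
        ("text", text_pipeline),
    ])),
    ("clf", OneVsRestClassifier(LogisticRegression())),
])

pl.fit(X_train, y_train)
accuracy = pl.score(X_test, y_test)

# Two imputed numeric columns plus one count column per token in the vocabulary
n_features = pl.named_steps["union"].transform(X_test).shape[1]
print(n_features)
print(accuracy)
```

The labels here are unrelated to the features, so the accuracy itself is not meaningful; the point is that numeric and text columns flow through separate preprocessing branches and are joined before the classifier.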

