# Case Study: School Budgeting with Machine Learning in Python

**DrivenData** runs online data science challenges for non-profits, NGOs, and social enterprises.
* Goals of the case study:
    * **Use data to have a social impact**
    * **Build a machine learning algorithm that can automate the school budgeting process**
    
* Introduction to the challenge:
    * NLP
    * Feature Engineering
    * Efficiency-boosting hashing tricks
    * Supervised learning problem (labeled)
    * Over 100 target variables
    * Classification problem: we want to predict a category for each line item
    * Predictions will be probabilities for each label

* **Human-in-the-loop (HITL) machine learning** is a branch of artificial intelligence that leverages both human and machine intelligence to create machine learning models.

#### Exploring the Data:
* Encode labels as categories
* ML algorithms work on numbers, not strings
    * Need a numeric representation of these strings
* Strings can be slow compared to numbers
    * A string can be any length, so the computer takes longer to process strings than numbers, which occupy a fixed number of bits
* In pandas, `category` dtype encodes categorical data numerically
    * Can help speed up code
    * use `.astype('category')`
    * to get a numeric (binary indicator) representation of the categories, use:
        * `dummies = pd.get_dummies(sample_df[['label']], prefix_sep='_')`
        * Dummy encoding is also called a **binary indicator representation**
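
A minimal sketch of both steps, assuming a small made-up `sample_df` (the column names and values are hypothetical, chosen only for illustration):

```python
import pandas as pd

# Hypothetical toy data; the real budget data has many more rows and columns
sample_df = pd.DataFrame({
    'label': ['a', 'b', 'a', 'c'],
    'numeric': [1.0, 2.5, 0.3, 4.1],
})

# Convert the string column to the category dtype
sample_df['label'] = sample_df['label'].astype('category')

# Integer codes that pandas stores behind the scenes for each category
codes = sample_df['label'].cat.codes

# Binary indicator (dummy) representation: one column per category
dummies = pd.get_dummies(sample_df[['label']], prefix_sep='_')
print(dummies.columns.tolist())
```

Each row of `dummies` has exactly one "on" entry, marking the category that row belongs to.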
        
#### Lambda functions
* Alternative to `def` syntax
* Easy way to make simple, one-line functions
* `square = lambda x: x*x`
    * This lambda function takes one parameter, *x*, multiplies *x* by itself, and returns the result.
    * Call this function, just like any other function:
        * `square(2)`
        
* In the budget data, there are multiple columns that need to be made categorical
* To make multiple columns into categories, we need to apply the function to each column separately:
    * `categorize_label = lambda x: x.astype('category')`
    * `sample_df.label =sample_df[['label']].apply(categorize_label, axis = 0)`
* `df.dtypes.value_counts()`

```
# Define the lambda function: categorize_label
categorize_label = lambda x: x.astype('category')

# Convert df[LABELS] to a categorical type
df[LABELS] = df[LABELS].apply(categorize_label, axis=0)

# Print the converted dtypes
print(df[LABELS].dtypes)
```
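
To try the same pattern without the budget data, here is a small self-contained sketch. The DataFrame contents and the `LABELS` list below are hypothetical stand-ins, not the real dataset:

```python
import pandas as pd

# Hypothetical stand-ins for the budget DataFrame and its label columns
df = pd.DataFrame({
    'Function': ['Teacher', 'Admin', 'Teacher'],
    'Object_Type': ['Salary', 'Supplies', 'Salary'],
    'FTE': [1.0, 0.5, 1.0],
})
LABELS = ['Function', 'Object_Type']

# Apply the lambda to each label column (axis=0 means column by column)
categorize_label = lambda x: x.astype('category')
df[LABELS] = df[LABELS].apply(categorize_label, axis=0)

# Count how many columns have each dtype
print(df.dtypes.value_counts())
```
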
* To count the unique values in each label column: `num_unique_labels = df[LABELS].apply(pd.Series.nunique)`

```
# Import matplotlib.pyplot
import matplotlib.pyplot as plt

# Calculate number of unique values for each label: num_unique_labels
num_unique_labels = df[LABELS].apply(pd.Series.nunique)

# Plot number of unique values for each label
num_unique_labels.plot(kind='bar')

# Label the axes
plt.xlabel('Labels')
plt.ylabel('Number of unique values')

# Display the plot
plt.show()
```

#### How do we measure success?
* Choosing how to evaluate your machine learning model is one of the most important decisions an analyst makes. 
* The decision balances the real world use of the algorithm, the mathematical properties of the evaluation function, and the interpretability of the measure
* **Accuracy** tells us what percentage of rows we got right
    * Accuracy can be misleading, especially when classes are imbalanced
    * Think of the email spam example: if only a small fraction of emails are spam, a model that always predicts "not spam" scores a high accuracy while catching no spam at all
* Metric used for spam example: **log loss**
    * loss function/measure of error
    * In contrast to success measures, like accuracy, we want our measures of error to be as small as possible
    * Log loss penalizes confident wrong predictions much more heavily than unconfident ones
        * **Better to be less confident than wrong**  
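
A quick illustration of why accuracy misleads on imbalanced classes. The 1% spam rate and the always-negative "model" are assumptions made up for this sketch:

```python
import numpy as np

# Hypothetical imbalanced data: 1 = spam, 0 = not spam (only 1% spam)
actual = np.array([1] * 10 + [0] * 990)

# A useless "model" that always predicts "not spam"
always_negative = np.zeros_like(actual)

# Fraction of rows predicted correctly
accuracy = np.mean(always_negative == actual)

# Number of spam emails the model actually flagged
spam_caught = np.sum((always_negative == 1) & (actual == 1))

print(accuracy)     # 0.99
print(spam_caught)  # 0
```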
        
#### Computing log loss with Python:

```
import numpy as np

def compute_log_loss(predicted, actual, eps=1e-14):
    """Computes the logarithmic loss between predicted and actual when these are 1D arrays.

    :param predicted: The predicted probabilities as floats between 0 and 1.
    :param actual: The actual binary labels. Either 0 or 1.
    :param eps: (optional) log(0) is -inf, so we offset the predicted values slightly by eps from 0 or 1.
    """
    predicted = np.clip(predicted, eps, 1 - eps)
    loss = -1 * np.mean(actual * np.log(predicted) + (1 - actual) * np.log(1 - predicted))
    return loss
```

* **`np.clip()`** sets a minimum and a maximum value for the elements of an array.
* Since the log of 0 is negative infinity, we offset our predictions ever so slightly from being exactly 1 or exactly 0 so that the score remains a finite number.
* In the function above, `eps` achieves this.
* Computing log loss with NumPy:
    * `compute_log_loss(predicted=0.9, actual=0)`
    * `compute_log_loss(predicted=0.5, actual=1)`
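
For reference, `compute_log_loss` implements the standard binary log loss over $N$ rows, where $y_i$ is the actual label and $p_i$ is the (clipped) predicted probability:

$$
\text{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]
$$

Worked by hand for the two calls above (natural log, $N = 1$): the first gives $-\log(1 - 0.9) \approx 2.303$ and the second gives $-\log(0.5) \approx 0.693$, so the confident wrong prediction costs more than three times as much as the coin-flip guess.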

```
# Assumes actual_labels and the four prediction arrays are already defined
# Compute and print log loss for 1st case
correct_confident_loss = compute_log_loss(correct_confident, actual_labels)
print("Log loss, correct and confident: {}".format(correct_confident_loss)) 

# Compute log loss for 2nd case
correct_not_confident_loss = compute_log_loss(correct_not_confident, actual_labels)
print("Log loss, correct and not confident: {}".format(correct_not_confident_loss)) 

# Compute and print log loss for 3rd case
wrong_not_confident_loss = compute_log_loss(wrong_not_confident, actual_labels)
print("Log loss, wrong and not confident: {}".format(wrong_not_confident_loss)) 

# Compute and print log loss for 4th case
wrong_confident_loss = compute_log_loss(wrong_confident, actual_labels)
print("Log loss, wrong and confident: {}".format(wrong_confident_loss)) 

# Compute and print log loss for actual labels
actual_labels_loss = compute_log_loss(actual_labels, actual_labels)
print("Log loss, actual labels: {}".format(actual_labels_loss)) 
```
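
The snippet above relies on arrays that are defined elsewhere. As a self-contained sketch, the following uses made-up labels and predictions (all values are hypothetical) to show how the loss grows as predictions become more confidently wrong:

```python
import numpy as np

def compute_log_loss(predicted, actual, eps=1e-14):
    # Same logic as the function defined earlier in these notes
    predicted = np.clip(predicted, eps, 1 - eps)
    return -1 * np.mean(actual * np.log(predicted)
                        + (1 - actual) * np.log(1 - predicted))

# Hypothetical true labels and four sets of predicted probabilities
actual_labels = np.array([1.0, 1.0, 0.0, 0.0])
cases = {
    'correct and confident': np.array([0.95, 0.95, 0.05, 0.05]),
    'correct and not confident': np.array([0.65, 0.65, 0.35, 0.35]),
    'wrong and not confident': np.array([0.35, 0.35, 0.65, 0.65]),
    'wrong and confident': np.array([0.05, 0.05, 0.95, 0.95]),
}

# Compute the loss for each case, in the order listed above
losses = {name: compute_log_loss(pred, actual_labels) for name, pred in cases.items()}
for name, loss in losses.items():
    print("Log loss, {}: {:.4f}".format(name, loss))
```

The losses increase from the first case to the last, with the confident wrong predictions penalized most.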

### Creating a simple first model
* It's always a good approach to start with a very simple model 
* Creating a simple model first helps to give us a sense of how challenging the problem is and also where our baseline is
* Many more things can go wrong in complex models 
* It shows how much signal we can pull out using basic methods

* Train a basic model on the numeric data only
    * We want to go from raw data to predictions as quickly as possible
* In this case we'll use multi-class logistic regression
    * treats each label column as independent
    * train classifier on each label separately and use those to predict
    * Format predictions and save to csv
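* **`StratifiedShuffleSplit()`** (scikit-learn)
    * Only works if you have a single target variable
* **`multilabel_train_test_split()`**
    * Helper function supplied with the course materials (not part of scikit-learn) for splitting data that has many label columns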

```
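# multilabel_train_test_split is the helper from the course materials, not a scikit-learn function
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Use only the numeric columns and replace missing values with a sentinel
data_to_train = df[NUMERIC_COLUMNS].fillna(-1000)

# Binary indicator representation of the label columns
labels_to_use = pd.get_dummies(df[LABELS])

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = multilabel_train_test_split(
    data_to_train, labels_to_use, size=0.2, seed=123)

# Train one logistic regression per label column
clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X_train, y_train)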
```
* `OneVsRestClassifier` treats each column of y independently 
    * fits a separate classifier for each of the columns 
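
To see this behavior in isolation, here is a minimal sketch on synthetic data. The random features and the two indicator columns are assumptions invented for the example, not the budget data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)

# Hypothetical data: 200 rows, 3 numeric features
X = rng.normal(size=(200, 3))

# Two binary indicator columns, each driven by a different feature
y = np.column_stack([
    (X[:, 0] > 0).astype(int),
    (X[:, 1] > 0).astype(int),
])

clf = OneVsRestClassifier(LogisticRegression())
clf.fit(X, y)

# One fitted LogisticRegression per column of y
print(len(clf.estimators_))

# One predicted probability per row and per label column
probs = clf.predict_proba(X)
print(probs.shape)
```

Because each column gets its own classifier, the probabilities in a row of `probs` are independent estimates and do not need to sum to 1.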