#### GESIS Fall Seminar in Computational Social Science 2022
### Introduction to Computational Social Science with Python
# Day 4-3: Machine Learning with sklearn

## Overview

* Machine learning (a very brief intro)
* Scikit-learn
* Training data vs test data
* Random forests
* Feature importance
* Hyper-parameter tuning

## Machine learning (a very brief intro)
* The use of algorithms to "learn" patterns in large datasets and make predictions, without being explicitly programmed.
* Contrast it with traditional programming, where we manually code structures like if/else statements.
* Supervised vs unsupervised learning:
    * Supervised - supply computer with labelled data, computer learns by comparing output against labels.
    * Unsupervised - no labels provided, computer must identify emergent patterns itself.
* Can be used for regression, classification, clustering, and many other tasks.
* It's not magic, it's just statistics!
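
To make the supervised/unsupervised distinction concrete, here is a minimal sketch on synthetic data. The dataset (`make_blobs`) and the two models (`LogisticRegression`, `KMeans`) are assumptions chosen only for illustration; they are not used elsewhere in this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration): 200 points in 2 groups
X_demo, y_demo = make_blobs(n_samples=200, centers=2, random_state=0)

# Supervised: the model sees both the features and the labels
supervised = LogisticRegression().fit(X_demo, y_demo)
predicted_labels = supervised.predict(X_demo)

# Unsupervised: the model sees only the features and must find groups itself
unsupervised = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)
cluster_ids = unsupervised.labels_

print("Supervised accuracy on the data it was fitted to:", supervised.score(X_demo, y_demo))
print("Cluster sizes found without labels:", np.bincount(cluster_ids))
```

Note that the cluster ids from `KMeans` are arbitrary: cluster 0 need not correspond to label 0.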

![Supervised Learning](figs/supervised.png "Supervised_Learning")


## Scikit-learn
* Popular machine learning (ML) library for Python.
* Vast functionality for processing data, building unsupervised and supervised ML models, scoring models.
* Integrates well with pandas, NumPy, matplotlib.
* More cutting-edge ML research is done using libraries such as TensorFlow, PyTorch, and Keras, but sklearn still forms the basis of much ML work.

In [None]:
import sklearn
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Training data vs test data
* We must randomly split our full dataset into data the ML model trains on, and the test data we evaluate the model on.
* The model will learn the patterns in the training data that are associated with a particular label (supervised learning).
* Our data consists of observations with **features** (independent / explanatory variables, "X") and a **target** variable (dependent variable, label, "y").
* We can use sklearn's `train_test_split` to get our training features, testing features, training targets, and testing targets.
* NEVER TRAIN YOUR MODEL ON YOUR TEST DATA.
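
As a small sketch of what `train_test_split` returns, the example below splits a synthetic dataset (made-up arrays, not the Titanic data). The `random_state` and `stratify` arguments are optional extras shown here as one possible choice: the first makes the split reproducible, the second keeps the class proportions similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data for illustration: 100 observations, 3 features, 40% positive labels
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([0] * 60 + [1] * 40)

# test_size is the share of rows held out for testing (here 30%)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)

print("Training rows:", X_tr.shape[0], "Test rows:", X_te.shape[0])
print("Share of positives in train/test:", y_tr.mean(), y_te.mean())
```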

In [None]:
# Load our dataset about passengers aboard the Titanic.
# Based on data available here: https://www.kaggle.com/competitions/titanic/overview
# We want to predict whether someone died or survived based on features like age, sex, etc

data = pd.read_csv('data/titanic.csv')

# Some info about the columns:
# 'PassengerId': Unique passenger ID
# 'Name': Passenger name
# 'Pclass': Ticket class (1st=1, 2nd=2, 3rd=3)
# 'Sex': Male=1, Female=0
# 'Age': Age in years
# 'SibSp': Number of siblings/spouses also aboard the Titanic
# 'ParCh': Number of parents/children also aboard the Titanic
# 'Fare': Ticket price
# 'Embarked_C': Embarked in Cherbourg, Yes=1, No=0
# 'Embarked_Q': Embarked in Queenstown, Yes=1, No=0
# 'Embarked_S': Embarked in Southampton, Yes=1, No=0
# 'Survived': Yes=1, No=0 (The target variable)

# preview the data
data.head()

In [None]:
from sklearn.model_selection import train_test_split

# Split dataset into features and target
X = data[['Sex', 'Age', 'Embarked_C', 'Embarked_Q', 'Embarked_S']]  # Let's build a model based only on sex, age, and port of embarkation
y = data['Survived']        

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30) # 70% training and 30% test


## Random forests
* Supervised model, can be used for classification and regression.
* An "ensemble" method - it takes the average or aggregated result of many "decision trees".
* Advantages:
    - Versatile "out of the box" ML method.
    - Handles non-linear and non-scaled data well.
    - Tends to resist overfitting.
    - Feature importance metrics enable better interpretability than black-box models.
* Disadvantages:
    - Less interpretable than a single decision tree.
    - More sophisticated, better performing models often available (e.g. gradient boosted trees).
* Analogy: You want to know if your expensive new shoes are fashionable, so you ask a group of friends. Each friend is a "decision tree" that has their own rules about what they think is fashionable, based on their knowledge of shoe fashion and features like colour, material, style. Together, your friends form a "random forest" of decision trees, and you take the majority opinion from them on whether the shoes are fashionable or not.

An example of a decision tree:
![Decision_Tree](figs/Decision_Tree.jpg "Decision_Tree")

A full forest of decision trees is automatically generated by sampling random subsets of our training data, and creating a tree for each subset that best classifies its respective data.
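
The idea can be sketched by hand with individual decision trees. This is a simplified illustration on synthetic data, not how `RandomForestClassifier` is implemented internally (sklearn averages the trees' predicted probabilities rather than counting hard votes); all variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the Titanic dataset)
X_syn, y_syn = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.30, random_state=0)

rng = np.random.default_rng(0)
trees = []
for i in range(25):
    # Bootstrap sample: draw rows with replacement from the training data
    rows = rng.integers(0, len(X_tr), size=len(X_tr))
    # max_features="sqrt" makes each split consider a random subset of features
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    trees.append(tree.fit(X_tr[rows], y_tr[rows]))

# Each tree votes; the "forest" predicts the majority class (labels are 0/1 here)
votes = np.array([tree.predict(X_te) for tree in trees])  # shape: (25, n_test)
forest_pred = (votes.mean(axis=0) > 0.5).astype(int)

print("Hand-built forest test accuracy:", (forest_pred == y_te).mean())
```

An odd number of trees is used so that a majority vote can never tie.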

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Create a Random Forest Classifier with 5 trees
clf = RandomForestClassifier(n_estimators=5)

# Train the model using the training set (.fit(), .predict(), .score() are the main methods to know)
clf.fit(X_train, y_train)
training_score = clf.score(X_train, y_train)

# prediction on test set
y_pred = clf.predict(X_test)
testing_score = clf.score(X_test, y_test)

print(training_score, testing_score) # print accuracy (proportion correctly predicted)
pd.DataFrame({'Test':y_test, 'Predicted':y_pred})

## Feature importance
* Machine learning models frequently receive criticism for their uninterpretability / unexplainability.
    - What features contributed to the machine's decision? What would need to change in order to be classified as x rather than y?
    - No neat mathematical formula like a simple linear regression model.
* With random forests, it is possible to get a measure of what features contribute most to the classification.
* Sometimes the removal of unimportant features can improve model accuracy!
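
As a quick sanity check of what these importance scores mean, the sketch below builds a made-up dataset in which only one column (`signal`) determines the label and the other two are pure noise. The column names and data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up data: the label depends only on "signal"; the other columns are noise
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "signal": rng.normal(size=500),
    "noise_a": rng.normal(size=500),
    "noise_b": rng.normal(size=500),
})
target = (demo["signal"] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(demo, target)

# Importances are normalised to sum to 1; higher means the feature was more useful for splitting
importances = pd.Series(forest.feature_importances_, index=demo.columns)
print(importances.sort_values(ascending=False))
```

The informative column should receive most of the importance, while the noise columns receive little.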

In [None]:
feature_imp = pd.Series(clf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
feature_imp

## Hyper-parameter tuning
* No machine learning algorithm is a 'one size fits all' solution to every problem.
* The success of a machine learning method can depend on methodological choices such as the train/test split, the features used, and the algorithm's hyper-parameter values.
* Good to make informed choices and experiment with differently specified classifiers/regressors, but beware of overfitting!
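
To see which hyper-parameters a model exposes, one option is the `get_params()` method that sklearn estimators provide. The sketch below prints a handful of them for a random forest; the selection of names is just an example, and default values can differ between sklearn versions.

```python
from sklearn.ensemble import RandomForestClassifier

# clf_demo is a hypothetical name; random_state fixes the randomness for reproducible results
clf_demo = RandomForestClassifier(random_state=0)
params = clf_demo.get_params()

# Print a few hyper-parameters that are commonly tuned for random forests
for name in ["n_estimators", "max_depth", "max_features", "min_samples_leaf"]:
    print(name, "=", params[name])
```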

In [None]:
# Let's fit 2 models with a different number of trees

for n_trees in [5, 100]:
    clf = RandomForestClassifier(n_estimators=n_trees)
    clf.fit(X_train, y_train)
    train_score = clf.score(X_train, y_train)
    test_score = clf.score(X_test, y_test)
    print('Model with %d trees: training accuracy = %f. test accuracy = %f.'
          %(n_trees, train_score, test_score))

### Overfitting
* If we tune hyper-parameters incorrectly (such as making the trees too deep) we are in danger of overfitting our model to the training data.
* This means that we get very high training accuracy, but lower testing accuracy.
* Our model has learnt peculiarities of our training data that do not generalise to the testing data.
* We can help prevent this with *cross-validation*.

In [None]:
# Demonstration of overfitting
fit = []
for run in range(50): # average over 50 runs
    for n in range(1,20):
        clf = RandomForestClassifier(n_estimators=10, max_depth=n)
        clf.fit(X_train, y_train)
        train_score = clf.score(X_train, y_train)
        test_score = clf.score(X_test, y_test)
        fit.append([n, train_score, test_score])
        
fitdf = pd.DataFrame(fit, columns=['Depth', 'Train Score', 'Test Score'])
fitdf.groupby('Depth').mean().plot()
plt.show()

# More depth has improved the training score, but worsened the test score!

### K-fold cross validation
Cross validation helps us design a model that does not overfit the training data, and can predict labels on test data accurately.
1. Take the **training** data, divide it into K equal parts.
2. Train a model on K-1 parts of the training data, test it on the remaining part, repeat for each "train"/"test" combination.
3. Take the average of these "test" scores as the model's cross-validated training accuracy.

Optional:

4. Repeat the K-fold cross validation process for models with different hyper-parameters.
5. Select the model with the best (averaged) training accuracy, refit on full training data.
6. Test this model on the test data for the final test accuracy.
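
Steps 1 to 3 can be written out by hand as follows. This is a sketch on synthetic data using sklearn's `KFold` splitter; the helper `cross_val_score` automates the same loop (for classifiers it defaults to a stratified variant of the split, so its scores may differ slightly from this version).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold

# Synthetic stand-in for the *training* data
X_syn, y_syn = make_classification(n_samples=200, n_features=5, random_state=0)

# Step 1: divide the training data into K=5 parts
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for fit_rows, check_rows in kfold.split(X_syn):
    # Step 2: train on K-1 parts, score on the remaining part
    model = RandomForestClassifier(n_estimators=10, random_state=0)
    model.fit(X_syn[fit_rows], y_syn[fit_rows])
    fold_scores.append(model.score(X_syn[check_rows], y_syn[check_rows]))

# Step 3: average the K scores
print("Fold scores:", fold_scores)
print("Cross-validated accuracy:", np.mean(fold_scores))
```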

![crossval](figs/grid_search_cross_validation.png "crossval")

In [None]:
from sklearn.model_selection import cross_val_score

# Model selection, we look for the model with the highest cross-validated training score
for n in ["sqrt", "log2", None]:
    clf = RandomForestClassifier(n_estimators=10, max_features=n)
    # cross_val_score fits its own copies of clf, so no separate clf.fit() is needed here
    scores = cross_val_score(clf, X_train, y_train, cv=5) # 5-fold cross validation
    print(n, np.mean(scores))


In [None]:
# We select the best performing hyper-parameters, train again on the *full* training data, and test on the held-out test data

clf = RandomForestClassifier(n_estimators=10, max_features="log2") # use the best max_features value from your own cross validation run
clf.fit(X_train, y_train) # fit model
print(clf.score(X_train, y_train)) # print training/test scores
print(clf.score(X_test, y_test))


### Grid Search
We have evaluated model scores while changing a single hyper-parameter, but you may want to vary others, or see how combinations of hyper-parameters perform together. You can do this manually with a for loop (like above), with nested for loops, or with sklearn's [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) class. It performs cross validation on a model built from every combination of the hyper-parameters you specify. Beware that too many parameter combinations will take a long time to fit!
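
To get a feel for how quickly the cost grows, the sketch below counts the model fits for an example grid with five values of `n_estimators` and three of `max_features`. It uses sklearn's `ParameterGrid` helper to enumerate the combinations; the grid values are assumptions chosen for illustration, and the final `+ 1` assumes the default behaviour of refitting the best model once on the full training data.

```python
from sklearn.model_selection import ParameterGrid

# Example grid (illustrative values): 5 x 3 = 15 combinations
param_grid_demo = {
    "n_estimators": [2, 5, 10, 50, 100],
    "max_features": ["sqrt", "log2", None],
}

n_combinations = len(ParameterGrid(param_grid_demo))
n_folds = 5

# One fit per combination per fold, plus one final refit of the best model
n_fits = n_combinations * n_folds + 1

print("Combinations:", n_combinations)
print("Total model fits with 5-fold cross validation:", n_fits)
```

Adding a third hyper-parameter with four candidate values would multiply the number of combinations by four.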

In [None]:
from sklearn.model_selection import GridSearchCV

# We want to see performance with varying n_estimators and max_features
param_grid = {'n_estimators':[2, 5, 10, 50, 100], 'max_features':["sqrt", "log2", None]}

clf = RandomForestClassifier() # Create base classifier
grid_search = GridSearchCV(clf, param_grid=param_grid, cv=5) # specify the model, search 'grid', and folds
grid_search.fit(X_train, y_train) # Fit all the models in the grid search and identify the best

# Let's look at the results for the best models
pd.DataFrame(grid_search.cv_results_).sort_values('mean_test_score', ascending=False).head()

In [None]:
# The grid search automatically refits the model on the full training data after doing cross-validation
# We can get our best performing model like so, and score it on the test data

bestmodel = grid_search.best_estimator_
print(bestmodel.score(X_test, y_test))

# This will often (but not always) be better than our initial model accuracy.

## 🏋️‍♀️ PRACTICE

In [None]:
# Q1: Attempt to create a random forest classifier that improves on the accuracy of the example above.
# You may use different features, different hyper-parameters, but please keep the train/test split below (random state 42)
# More about the hyperparameters here:
# https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
# Show us the (cross-validated) training accuracy, testing accuracy, and feature importances.
