# Random Forests

In this practical we will look at using random forests for a classification task: predicting the outcome of crowdfunded projects on Kickstarter.

# Classification - Predicting Kickstarter Success

`kickstarter_sample.csv` contains a random sample of 5000 Kickstarter projects.

The features are as follows:

| Feature       	| Description                                             	|
|---------------	|---------------------------------------------------------	|
| main_category 	| Major category the project belongs to                   	|
|      category 	| More detailed category label                            	|
|          goal 	| How much money the project wants to raise               	|
|       backers 	| How many people had donated, when time was up           	|
|       country 	| Country the project came from                           	|
|       outcome 	| Whether the project was successful, failed or cancelled 	|

The code below loads in the data then splits it into features (`X`) and labels (`y`).

In [None]:
import pandas as pd

data = pd.read_csv('data/kickstarter_sample.csv')

X = data.drop(['outcome'], axis=1)
y = data['outcome']

## Exploring the data

Have a quick look at `data`.

How balanced are the three classes in `outcome`? (Tip: use `.value_counts()` and specify `normalize=True`)

For each outcome, what was the mean goal? (Tip: use `.groupby()` and then `.mean()` on the relevant column.)

What do the numbers suggest?
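
If the two tips are unfamiliar, here is a minimal sketch of how they behave. It uses a small made-up DataFrame (`toy` and its values are hypothetical, not taken from the Kickstarter file), so the numbers mean nothing beyond illustrating the calls.

```python
import pandas as pd

# Hypothetical toy data standing in for the Kickstarter sample
toy = pd.DataFrame({
    "goal": [1000, 5000, 20000, 1500, 50000, 3000],
    "outcome": ["successful", "failed", "failed",
                "successful", "canceled", "failed"],
})

# Proportion of rows in each class (the proportions sum to 1)
class_balance = toy["outcome"].value_counts(normalize=True)
print(class_balance)

# Mean goal for each outcome
mean_goal = toy.groupby("outcome")["goal"].mean()
print(mean_goal)
```

Selecting the column after `.groupby()` keeps the result to the one statistic we care about.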

In [None]:
# Your code here...



# Getting the data ready

One issue with the data is that it contains categorical variables - `country`, `category`, and `main_category`. We need to encode these so that they can be used in a machine learning model.

We will use `pandas`'s `get_dummies()` function to one-hot encode the data. Assign the output of `get_dummies()` to `X`.
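
As a rough illustration of what one-hot encoding does, the sketch below runs `get_dummies()` on a tiny made-up frame (the `toy` data is hypothetical). Numeric columns pass through unchanged, while each text column is replaced by one indicator column per distinct value.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and one categorical column
toy = pd.DataFrame({
    "goal": [1000, 5000, 20000],
    "country": ["GB", "US", "GB"],
})

# Text columns become one indicator column per category,
# named <original column>_<category>
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
print(encoded)
```

On the real data the number of columns will grow considerably, because `category` has many distinct values.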

In [None]:
# Your code here...


We will also need to split the data up into training/testing sets - evaluating on the same data we learned from is generally a bad idea.

`sklearn.model_selection` has a function `train_test_split()` which will split up data for us.

Set `random_state=5`, and call the new variables `X_train`, `X_test`, `y_train`, and `y_test`.
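
The sketch below shows the shape of a `train_test_split()` call on small made-up arrays (`X_toy` and `y_toy` are hypothetical). It assumes the default split, which holds out 25% of the rows for testing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, 2 features, alternating labels
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0, 1] * 10)

# Returns four objects, always in this order
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=5)

print(X_tr.shape, X_te.shape)
print(y_tr.shape, y_te.shape)
```

Fixing `random_state` means the same rows land in the same split every time the cell is run.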

In [None]:
# Your code here...


# Setting a baseline

Before we move on to using a random forest, it will be useful to know how well a single model performs. This will set a baseline for us to improve upon.

The code below creates a single Naive Bayes and a single Decision Tree model.

Use the `.fit()` and `.score()` methods of each model to train them using the training data and evaluate them on the test set.
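
Every `sklearn` classifier follows the same fit/score pattern. The sketch below shows it on synthetic data from `make_classification` (an assumption made so the example is self-contained; the scores say nothing about the Kickstarter task).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data, used here only as a stand-in
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    n_classes=3, random_state=5,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=5)

scores = {}
for name, model in [
    ("naive_bayes", GaussianNB()),
    ("decision_tree", DecisionTreeClassifier(random_state=5)),
]:
    model.fit(X_tr, y_tr)                     # learn from the training set
    scores[name] = model.score(X_te, y_te)    # accuracy on the test set

print(scores)
```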

In [None]:
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

In [None]:
nb_model = GaussianNB()

dtc_model = DecisionTreeClassifier(random_state=5)

# Your code here...


Accuracy gives only part of the picture. We have three classes here, so print a classification report for each of the baseline models.

`sklearn.metrics.classification_report` takes two arguments: the true y labels and a model's predictions.

You can get predictions for `X_test` by using the `.predict()` method of a trained model.

Do you observe any differences in the models besides their accuracy scores?
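
To see what a classification report adds over accuracy, here is a small sketch with hand-written labels (both lists are hypothetical). The predictions never include `canceled`, which the per-class rows make obvious. The `zero_division=0` argument is assumed to be available (scikit-learn 0.22 or later) and avoids a warning for the class that is never predicted.

```python
from sklearn.metrics import classification_report

# Hypothetical true labels and predictions
y_true = ["failed", "failed", "successful",
          "canceled", "successful", "failed"]
y_pred = ["failed", "successful", "successful",
          "failed", "successful", "failed"]

# First argument: true labels; second argument: predictions
report = classification_report(y_true, y_pred, zero_division=0)
print(report)
```

The report gives precision, recall and F1 for each class, plus the number of true examples of each class in the `support` column.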

In [None]:
from sklearn.metrics import classification_report

# Your code here...


# Training a random forest

Training a random forest uses the same steps as above.

Train the model below and get a classification report, as you did for the baseline models.

How does it compare to a single Decision Tree?

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=5)

# Your code below


# Optimising a random forest

So far we have just used the default hyperparameters of the random forest class in `sklearn`: 100 trees, and trees can go as deep as they like.
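
You can check the defaults yourself with `.get_params()`. The sketch below assumes scikit-learn 0.22 or later, where the default number of trees is 100.

```python
from sklearn.ensemble import RandomForestClassifier

# get_params() returns every hyperparameter and its current value
defaults = RandomForestClassifier().get_params()

print(defaults["n_estimators"])       # number of trees
print(defaults["max_depth"])          # None means no depth limit
print(defaults["min_samples_split"])
print(defaults["min_samples_leaf"])
```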

One convenient way to find the best hyperparameters in `sklearn` is through grid search. This will try all the combinations you request and return the results.

The `GridSearchCV` class takes a model and a dictionary mapping each hyperparameter name to a list of values to try. Then you just fit/train it as usual, using the training data from before.

We'll try the following:

1. Different depths of the trees - deeper trees can completely overfit to the training data, which can impact generalisability
2. `min_samples_split` and `min_samples_leaf` - hyperparameters for deciding how the decision trees should be built
3. `class_weight` - we can try assigning different weights to the classes, to help with class imbalance
4. `max_samples` - what proportion of the available data each tree should train on
5. `n_estimators` - the number of trees to train

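With three candidate values for each of the six hyperparameters, the full grid has 729 combinations, and with the default 5-fold cross-validation that means 3645 model fits! To speed up the process, we will restrict the search over `max_depth`, `min_samples_split` and `max_samples`, but feel free to run the full search on your local machine.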
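
The size of a grid can be counted before running anything, using `ParameterGrid`. The sketch below assumes the full grid is the one given by the commented-out values in the cell further down (three values per hyperparameter) and that `GridSearchCV` uses its default of 5 folds. The names `full_params` and `reduced_params` are just for this example.

```python
from sklearn.model_selection import ParameterGrid

weights = [
    {"canceled": 2, "failed": 1, "successful": 1},
    {"canceled": 1, "failed": 2, "successful": 1},
    {"canceled": 1, "failed": 1, "successful": 2},
]

# Assumed full grid: three values for each of six hyperparameters
full_params = dict(
    max_depth=[None, 1, 2],
    min_samples_split=[2, 3, 4],
    min_samples_leaf=[1, 2, 3],
    class_weight=weights,
    max_samples=[0.75, 0.25, 0.9],
    n_estimators=[100, 200, 300],
)

# Restricted grid: fewer values for three of the hyperparameters
reduced_params = dict(
    full_params,
    max_depth=[None],
    min_samples_split=[2],
    max_samples=[0.75, 0.25],
)

cv_folds = 5  # default number of folds in GridSearchCV
n_full = len(ParameterGrid(full_params))
n_reduced = len(ParameterGrid(reduced_params))

print(n_full, "combinations,", n_full * cv_folds, "fits")
print(n_reduced, "combinations,", n_reduced * cv_folds, "fits")
```

Each combination is fitted once per fold, so the number of fits is the number of combinations multiplied by the number of folds.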

Below, create a `GridSearchCV` in the same way you would a model: assign it to a variable named `gcv`, pass it `rf` as the base model and include the parameters we wish to investigate with `param_grid=params`. Fit it to the training data.

To speed things up, set `n_jobs=-1` to use all available CPU cores. Set `verbose=2` so you get updates as it proceeds - useful for making sure it is actually working!

This may take some time...
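
For orientation, here is a minimal sketch of the same pattern on synthetic data with a deliberately tiny grid, so it runs in a few seconds. The data, the `small_grid` values and `cv=3` are assumptions chosen for speed, and `n_jobs` and `verbose` are left out to keep the output short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic three-class data, used here only as a stand-in
X_syn, y_syn = make_classification(
    n_samples=200, n_features=8, n_informative=4,
    n_classes=3, random_state=5,
)

# Hypothetical tiny grid: 2 x 2 = 4 combinations
small_grid = {"n_estimators": [10, 20], "min_samples_leaf": [1, 3]}

search = GridSearchCV(
    RandomForestClassifier(random_state=5),
    param_grid=small_grid,
    cv=3,
)
search.fit(X_syn, y_syn)

# One entry per combination tried, and the number of folds used
print(len(search.cv_results_["params"]))
print(search.n_splits_)
```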

In [None]:
from sklearn.model_selection import GridSearchCV

params = dict(
    max_depth=[None], #, 1, 2],
    min_samples_split=[2], #, 3, 4],
    min_samples_leaf=[1, 2, 3],
    class_weight=[{'canceled': 2, 'failed': 1, 'successful':1},
                  {'canceled': 1, 'failed': 2, 'successful':1},
                  {'canceled': 1, 'failed': 1, 'successful':2}],
    max_samples=[0.75, 0.25], #, 0.9],
    n_estimators=[100, 200, 300]
)

rf = RandomForestClassifier(random_state=5, n_jobs=-1)

# Your code here...


# What was the best model?

Note that `GridSearchCV` scored each candidate model using cross-validated accuracy, the default metric for classifiers.

The best model is stored inside `gcv` as `best_estimator_`. Its score is in `gcv.best_score_` and the actual hyperparameters used are in `gcv.best_params_`.

Take a look at these and for the best model create a `classification_report` as you did before, using the test set. How does it compare to the non-optimised model?
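
The sketch below shows how these attributes fit together, again on synthetic data with a tiny hypothetical grid. One point worth noticing: `best_score_` comes from cross-validation on the training data, whereas the classification report is computed on the held-out test set, so the two numbers need not agree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic three-class data, used here only as a stand-in
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    n_classes=3, random_state=5,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, random_state=5)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=5),
    param_grid={"min_samples_leaf": [1, 3]},  # hypothetical tiny grid
    cv=3,
)
search.fit(X_tr, y_tr)

# The best model, already refit on all of X_tr with the best settings
best_rf = search.best_estimator_

print(search.best_params_)
print(search.best_score_)  # mean accuracy across the validation folds

# Evaluate the best model on data it has never seen
test_predictions = best_rf.predict(X_te)
print(classification_report(y_te, test_predictions))
```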

In [None]:
# Your code here...


# Most useful features

Now, using the best model, let's look at the `feature_importances_` attribute. This is an array of importance scores for each feature.

In this case, it will align with the columns of our training data, `X_train`.

Create a DataFrame from a dictionary that has two keys `feature` (with values from `X_train.columns`) and `score` (with values from `feature_importances_`).

Sort the DataFrame by the `score` column.

Which are the most useful features for the model?
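
Here is a minimal sketch of the DataFrame construction, using a forest trained on synthetic data. The feature names `feat_0` to `feat_4` are made up for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data wrapped in a DataFrame so the columns have names
X_arr, y_syn = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    n_classes=3, random_state=5,
)
X_syn = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(5)])

forest = RandomForestClassifier(n_estimators=50, random_state=5)
forest.fit(X_syn, y_syn)

# One importance score per column, in the same order as the columns
importances = pd.DataFrame({
    "feature": X_syn.columns,
    "score": forest.feature_importances_,
})

# Highest-scoring features first
importances = importances.sort_values("score", ascending=False)
print(importances)
```

The scores are normalised to sum to 1, so each one can be read as a share of the total importance.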

In [None]:
# Your code here...


# Conclusion

Ensemble methods generally perform better than single models. Although they are not as fully interpretable as some individual models (e.g. decision trees), it is still possible to gain some insight into what features are most useful for your task.

The improvements over the baselines were quite good. Fine-tuning the hyperparameters gave only very modest improvements, which probably says more about the features available than about the model. Think about what kind of features you would ideally have for modelling project funding success/failure. Overall, this illustrates how powerful random forests can be right out of the box.