#### Here we will develop and tune models for credit scoring and movie review sentiment prediction. (https://docs.google.com/forms/d/1MS3kW_bjZQAkwwlAjX9G8khj1owq1qc5NQtjzJUvKVo)


#### The [dataset](https://github.com/Yorko/mlcourse.ai/tree/master/data/credit_scoring_sample.csv) looks like this:

##### Target variable
* SeriousDlqin2yrs - the person had long delays in payments during 2 years; binary variable

##### Features
* age - age of the loan borrower (number of full years); type - integer
* NumberOfTime30-59DaysPastDueNotWorse - the number of times the person was 30-59 days (but no more) late in repaying other loans during the last two years; type - integer
* DebtRatio - monthly payments (loans, alimony, etc.) divided by aggregate monthly income, percentage; type - float
* MonthlyIncome - monthly income in dollars; type - float
* NumberOfTimes90DaysLate - the number of times the person was more than 90 days late in repaying other loans; type - integer
* NumberOfTime60-89DaysPastDueNotWorse - the number of times the person was 60-89 days (but no more) late in repaying other loans during the last two years; type - integer
* NumberOfDependents - number of people in the family of the borrower; type - integer

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Let us implement a function that replaces the NaN values in each column of the table with that column's median.

In [None]:
def impute_nan_with_median(table):
    # Note: the columns are overwritten, so the passed table is modified in place
    for col in table.columns:
        table[col] = table[col].fillna(table[col].median())
    return table
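
To make the imputation step concrete, here is a minimal sketch on a toy table. The column names and values are hypothetical and are not taken from the credit scoring data. It relies on the fact that `DataFrame.fillna` accepts a Series of per-column fill values, which gives the same result as the column-by-column loop above without modifying the original frame.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing entry per column
toy = pd.DataFrame({"income": [1000.0, np.nan, 3000.0, 5000.0],
                    "dependents": [0.0, 2.0, np.nan, 1.0]})

# toy.median() is a Series indexed by column name; fillna matches it
# to the columns by name and returns a new frame
toy_filled = toy.fillna(toy.median())
print(toy_filled)
```

The median is computed from the non-missing values only, so the missing income becomes 3000 and the missing number of dependents becomes 1.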

Reading the data:

In [None]:
data = pd.read_csv('../../data/credit_scoring_sample.csv', sep=";")
data.head()

View data types of the features:

In [None]:
data.dtypes

Look at the distribution of classes in the target:

In [None]:
ax = data['SeriousDlqin2yrs'].hist(orientation='horizontal', color='red')
ax.set_xlabel("number_of_observations")
ax.set_ylabel("unique_value")
ax.set_title("Target distribution")

print('Distribution of target:')
data['SeriousDlqin2yrs'].value_counts() / data.shape[0]
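
The class shares can also be obtained directly with the `normalize=True` argument of `value_counts`. The sketch below uses a tiny hypothetical target, not the real data, just to show that both ways agree.

```python
import pandas as pd

# Hypothetical target with three negatives and one positive
toy_target = pd.Series([0, 0, 0, 1], name="SeriousDlqin2yrs")

# Share of each class, computed in two equivalent ways
shares = toy_target.value_counts(normalize=True)
shares_manual = toy_target.value_counts() / toy_target.shape[0]
print(shares)
```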

We'll select all the features and drop the target:

In [None]:
# Iterating over a DataFrame yields its column names
independent_columns_names = [x for x in data if x != 'SeriousDlqin2yrs']
independent_columns_names

We apply the function that replaces all NaN values with the median of the corresponding column.

In [None]:
table = impute_nan_with_median(data)
table.head()

In [None]:
overdue = table[table['NumberOfTime30-59DaysPastDueNotWorse'] + table['NumberOfTimes90DaysLate']
                + table['NumberOfTime60-89DaysPastDueNotWorse'] > 0 ]
overdue = np.array(overdue['MonthlyIncome'])
overdue

In [None]:
on_time = table[table['NumberOfTime30-59DaysPastDueNotWorse'] + table['NumberOfTimes90DaysLate']
                + table['NumberOfTime60-89DaysPastDueNotWorse'] == 0 ]
on_time = np.array(on_time['MonthlyIncome'])
on_time

Separate the target from the features; this gives us the training sample.

In [None]:
X = table[independent_columns_names]
y = table['SeriousDlqin2yrs']

Using the bootstrap, we'll build 90% confidence intervals for the average income (MonthlyIncome) of customers who had overdue loan payments and of those who paid on time. We'll also find the difference between the lower limit of the interval for those who paid on time and the upper limit of the interval for those who were overdue.

We'll use the example from the [article](https://medium.com/open-machine-learning-course/open-machine-learning-course-topic-5-ensembles-of-algorithms-and-random-forest-8e05246cbba7). Set `np.random.seed(17)`.

In [None]:
def get_bootstrap_samples(data, n_samples, seed=0):
    # Function to generate subsamples with bootstrap
    np.random.seed(seed)
    indices = np.random.randint(0, len(data), (n_samples, len(data)))
    samples = data[indices]
    return samples

def stat_intervals(stat, alpha):
    # Function for interval estimates
    boundaries = np.percentile(stat, [100 * alpha / 2., 100 * (1 - alpha / 2.)])
    return boundaries

# Monthly income of customers with and without serious delinquency
# (the names `churn` / `not_churn` are kept from the article's example)
churn = data[data['SeriousDlqin2yrs'] == 1]['MonthlyIncome'].values
not_churn = data[data['SeriousDlqin2yrs'] == 0]['MonthlyIncome'].values

# Generate bootstrap samples and calculate the mean of each one
churn_mean_scores = [np.mean(sample) 
                     for sample in get_bootstrap_samples(churn, 1000, seed=17)]
not_churn_mean_scores = [np.mean(sample) 
                         for sample in get_bootstrap_samples(not_churn, 1000, seed=17)]

# Derive 90% interval estimates of the mean (alpha = 0.1)
churn_interval = stat_intervals(churn_mean_scores, 0.1)
not_churn_interval = stat_intervals(not_churn_mean_scores, 0.1)
print("Mean income interval, SeriousDlqin2yrs == 1:", churn_interval)
print("Mean income interval, SeriousDlqin2yrs == 0:", not_churn_interval)
print("Difference is", not_churn_interval[0] - churn_interval[1])
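
For readers who want to see the percentile bootstrap in isolation, here is a minimal self-contained sketch. It assumes a synthetic, skewed "income" sample drawn from an exponential distribution (purely hypothetical, not the credit data) and uses NumPy's `default_rng` generator instead of the global seed used above.

```python
import numpy as np

rng = np.random.default_rng(17)

# Hypothetical skewed sample standing in for monthly incomes
sample = rng.exponential(scale=5000.0, size=500)

# 1000 bootstrap resamples: each row holds indices drawn with replacement
idx = rng.integers(0, len(sample), size=(1000, len(sample)))
boot_means = sample[idx].mean(axis=1)

# 90% percentile interval: cut off alpha / 2 of the means on each side
alpha = 0.1
lower, upper = np.percentile(boot_means,
                             [100 * alpha / 2, 100 * (1 - alpha / 2)])
print("90% interval for the mean:", lower, upper)
```

The interval is read directly off the empirical distribution of the resampled means, so no normality assumption about the incomes themselves is needed.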

# Decision tree, hyperparameter tuning

One of the main performance metrics of a classification model is the area under the ROC curve. ROC AUC values lie between 0 and 1; the closer the value is to 1, the better the model ranks positive examples above negative ones.
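
One way to build intuition for this metric: ROC AUC equals the share of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The helper below, `roc_auc_by_pairs`, is a hypothetical name introduced only for this sketch; it is a brute-force illustration of that interpretation and not how `sklearn` computes the metric internally.

```python
import numpy as np

def roc_auc_by_pairs(y_true, scores):
    """ROC AUC as the share of correctly ordered (positive, negative) pairs."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Score difference for every (positive, negative) pair
    diff = pos[:, None] - neg[None, :]
    # A correctly ordered pair counts as 1, a tie as 0.5
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

auc_example = roc_auc_by_pairs([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc_example)
```

In this toy example three of the four pairs are ordered correctly, so the value is 0.75.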

We'll use `GridSearchCV` to find the `DecisionTreeClassifier` hyperparameter values that maximize the area under the ROC curve.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

We'll use the `DecisionTreeClassifier` class to create a decision tree. Because the classes in the target are imbalanced, we set the balancing parameter `class_weight='balanced'`. We also set `random_state=17` for reproducibility of the results.

In [None]:
dt = DecisionTreeClassifier(random_state=17, class_weight='balanced')
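
As a rough illustration of what `class_weight='balanced'` does: the scikit-learn documentation gives the weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula to a hypothetical toy target with an 80/20 class split.

```python
import numpy as np

# Hypothetical imbalanced target: 8 negatives, 2 positives
y_toy = np.array([0] * 8 + [1] * 2)

counts = np.bincount(y_toy)                    # samples per class
n_classes = len(counts)
balanced_weights = len(y_toy) / (n_classes * counts)
print(balanced_weights)

# After weighting, both classes carry the same total weight
total_weight_per_class = counts * balanced_weights
```

The rare class gets a proportionally larger weight, so errors on it cost as much in total as errors on the frequent class.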

We will search over the following hyperparameter values:

In [None]:
max_depth_values = [5, 6, 7, 8, 9]
max_features_values = [4, 5, 6, 7]
tree_params = {'max_depth': max_depth_values,
               'max_features': max_features_values}
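
To get a feel for the cost of this grid, the sketch below counts the candidate combinations with `ParameterGrid` from `sklearn.model_selection`. The number of folds (5) is an assumption matching the cross-validation setup used in this notebook, and the variable names are introduced here only for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above, repeated so that the snippet runs on its own
grid = {'max_depth': [5, 6, 7, 8, 9],
        'max_features': [4, 5, 6, 7]}

n_candidates = len(ParameterGrid(grid))   # 5 depths x 4 feature counts
n_folds = 5                               # assumed number of CV folds
n_fits = n_candidates * n_folds           # fits during the search itself
print(n_candidates, n_fits)
```

With the default `refit=True`, `GridSearchCV` additionally refits the best candidate once on the whole training set.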

Fix the cross-validation parameters: stratified, 5 folds with shuffling, and a fixed `random_state`.

In [None]:
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)
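
Stratification matters here because the target is imbalanced: each fold should contain roughly the same share of positives as the full sample. A minimal sketch on a hypothetical toy target with 20% positives (the `_toy` names are introduced only for this example):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical target: 40 negatives and 10 positives
y_toy = np.array([0] * 40 + [1] * 10)
X_toy = np.zeros((len(y_toy), 1))   # features are irrelevant for splitting

skf_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)

# Share of positives in the test part of every fold
fold_pos_rates = [y_toy[test_idx].mean()
                  for _, test_idx in skf_toy.split(X_toy, y_toy)]
print(fold_pos_rates)
```

Since both class sizes are divisible by the number of folds, every test fold keeps exactly the overall 20% share of positives.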

**We'll run `GridSearchCV` with the ROC AUC metric over the hyperparameters in the `tree_params` dictionary and find the maximum ROC AUC value. We call cross-validation stable if the standard deviation of the metric across the cross-validation folds is less than 1%.**

In [None]:
dt_grid_search = GridSearchCV(dt, tree_params, n_jobs=-1, scoring ='roc_auc', cv=skf)
dt_grid_search.fit(X, y)

In [None]:
round(float(dt_grid_search.best_score_), 2)

In [None]:
dt_grid_search.best_params_

In [None]:
dt_grid_search.cv_results_["std_test_score"][np.argmax(dt_grid_search.cv_results_["mean_test_score"])]
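
The stability check can be spelled out as follows. This sketch uses a small hypothetical dictionary in place of the real `cv_results_` (the numbers are made up) and assumes that "1%" means 0.01 on the ROC AUC scale.

```python
import numpy as np

# Hypothetical stand-in for GridSearchCV.cv_results_ with three candidates
cv_results = {"mean_test_score": np.array([0.80, 0.82, 0.81]),
              "std_test_score": np.array([0.020, 0.004, 0.010])}

# Index of the candidate with the highest mean score across folds
best_idx = int(np.argmax(cv_results["mean_test_score"]))

# Standard deviation of the fold scores for that candidate
best_std = cv_results["std_test_score"][best_idx]
is_stable = bool(best_std < 0.01)   # assumed threshold: 1% = 0.01
print(best_idx, best_std, is_stable)
```

On a fitted search object the same index is also available as the `best_index_` attribute.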

# Simple RandomForest implementation

**<font color='red'>Task 4.</font>**
We'll implement our own random forest using `DecisionTreeClassifier` with the best parameters from the previous task.

Brief specification:
 - In the `fit` method, in the loop (`i` from 0 to `n_estimators-1`), fix the seed equal to (`random_state + i`). The idea is that each iteration gets a new random seed to add more "randomness", but at the same time the results are reproducible
 - After fixing the seed, select `max_features` features **without replacement**, save the list of selected feature ids in `self.feat_ids_by_tree`
 - Also make a bootstrap sample (i.e. **sampling with replacement**) of training instances. For that, resort to `np.random.choice` and its argument `replace`
 - Train a decision tree with specified (in a constructor) arguments `max_depth`, `max_features` and `random_state` (do not specify `class_weight`) on a corresponding subset of training data. 
 - The `fit` method returns the current instance of the class `RandomForestClassifierCustom`, that is `self`
 - In the `predict_proba` method, we need to loop through all the trees. For each prediction, obviously, we need to take only those features which we used for training the corresponding tree. The method returns predicted probabilities (`predict_proba`), averaged for all trees
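
The two sampling modes in the specification can be sketched with `np.random.choice`. The sizes below are hypothetical (7 features, 10,000 rows) and serve only to contrast `replace=False` with `replace=True`.

```python
import numpy as np

np.random.seed(17)
n_features, n_rows = 7, 10000   # hypothetical sizes

# Feature subset: without replacement, so every id appears at most once
feat_ids = np.random.choice(n_features, 6, replace=False)

# Bootstrap sample of rows: with replacement, so ids repeat
row_ids = np.random.choice(n_rows, n_rows, replace=True)

# Share of distinct rows that made it into the bootstrap sample
unique_share = len(np.unique(row_ids)) / n_rows
print(feat_ids, unique_share)
```

For a large sample, a bootstrap draw of the same size contains about 63% of the distinct rows (roughly 1 - 1/e); the rest of the draw consists of repeats.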

We'll then run cross-validation and find the average ROC AUC across the folds.

In [None]:
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import cross_val_score

class RandomForestClassifierCustom(ClassifierMixin, BaseEstimator):
    def __init__(self, n_estimators=10, max_depth=10, max_features=10, 
                 random_state=17):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.max_features = max_features
        self.random_state = random_state
        
        self.trees = []
        self.feat_ids_by_tree = []
        
    def fit(self, X, y):
        # Start from scratch so that repeated calls to fit do not stack trees
        self.trees = []
        self.feat_ids_by_tree = []
        # Class labels; recent scikit-learn scorers (e.g. 'roc_auc') expect
        # a classifier to expose them
        self.classes_ = np.unique(y)
        
        for i in range(self.n_estimators):
            # A new but reproducible seed for every tree
            np.random.seed(i + self.random_state)
            
            # Feature subset, sampled without replacement
            feat_to_use_ids = np.random.choice(range(X.shape[1]), self.max_features, 
                                              replace=False)
            # Bootstrap sample of rows, sampled with replacement; duplicates
            # are kept (wrapping this in set() would silently drop them)
            examples_to_use = np.random.choice(range(X.shape[0]), X.shape[0],
                                              replace=True)
            
            self.feat_ids_by_tree.append(feat_to_use_ids)
            
            dt = DecisionTreeClassifier(
                                        max_depth=self.max_depth, 
                                        max_features=self.max_features, 
                                        random_state = self.random_state)

            dt.fit(X[examples_to_use, :][:, feat_to_use_ids], y[examples_to_use])
            self.trees.append(dt)
        return self
    
    def predict_proba(self, X):
        predictions = []
        for i in range(self.n_estimators):
            # Each tree only sees the features it was trained on
            feat_to_use_ids = self.feat_ids_by_tree[i]
            predictions.append(self.trees[i].predict_proba(X[:,feat_to_use_ids]))
        # Average the class probabilities over all trees
        return np.mean(predictions, axis=0)

In [None]:
rf = RandomForestClassifierCustom(max_depth=7, max_features=6).fit(X.values, y.values)

In [None]:
%%time
cv_aucs = cross_val_score(RandomForestClassifierCustom(max_depth=7, max_features=6), 
                          X.values, y.values, scoring="roc_auc", cv=skf)
print("Mean ROC AUC:", np.mean(cv_aucs))

Let us compare our own implementation of a random forest with the `sklearn` version. To do this, we'll use `RandomForestClassifier(class_weight='balanced', random_state=17)` and specify the same values for `max_depth` and `max_features` as before. Then we'll look at the average ROC AUC on cross-validation.

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
sklearn_rf = RandomForestClassifier(n_estimators=10, max_depth=7,
                                    max_features=6, random_state=17,
                                    n_jobs=-1, class_weight='balanced')
cv_aucs = cross_val_score(sklearn_rf, X.values, y.values,
                          scoring="roc_auc", cv=skf)
print("Mean ROC AUC for sklearn RF:", np.mean(cv_aucs))

# `sklearn` RandomForest, hyperparameter tuning

We extend the range of `max_depth` to 5-14 (`range(5, 15)` excludes its upper bound), because the trees in a forest need to be deeper (see this [article](https://medium.com/open-machine-learning-course/open-machine-learning-course-topic-5-ensembles-of-algorithms-and-random-forest-8e05246cbba7) for details). We'll find out which hyperparameter values are best.

In [None]:
max_depth_values = range(5, 15)
max_features_values = [4, 5, 6, 7]
forest_params = {'max_depth': max_depth_values,
                'max_features': max_features_values}

In [None]:
%%time
rf = RandomForestClassifier(random_state=17, n_jobs=-1, 
                            class_weight='balanced')
rf_grid_search = GridSearchCV(rf, forest_params, n_jobs=-1, 
                              scoring='roc_auc', cv=skf)
rf_grid_search.fit(X.values, y.values)

In [None]:
rf_grid_search.best_score_

In [None]:
rf_grid_search.best_params_

In [None]:
rf_grid_search.cv_results_["std_test_score"][np.argmax(rf_grid_search.cv_results_["mean_test_score"])]

# Logistic regression, hyperparameter tuning

Now let's compare our results with logistic regression (we set `class_weight='balanced'` and `random_state=17`). We'll do an exhaustive search over the parameter `C` across a wide range of values, `np.logspace(-8, 8, 17)`.
We'll build a pipeline: first apply scaling, then train the model.

We'll find the best average ROC AUC.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

scaler = StandardScaler()
logit = LogisticRegression(random_state=17, class_weight='balanced')

logit_pipe = Pipeline([('scaler', scaler), ('logit', logit)])
logit_pipe_params = {'logit__C': np.logspace(-8, 8, 17)}
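
Two details of this setup may be worth spelling out: `np.logspace(-8, 8, 17)` yields the 17 powers of ten from 1e-8 to 1e8, and a pipeline step's parameter is addressed as `<step name>__<parameter>`. The sketch below rebuilds a small pipeline under the hypothetical names `C_grid` and `pipe` so that it runs on its own.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# 17 values evenly spaced on a log scale: 1e-8, 1e-7, ..., 1e8
C_grid = np.logspace(-8, 8, 17)

pipe = Pipeline([("scaler", StandardScaler()),
                 ("logit", LogisticRegression())])

# 'logit__C' refers to parameter C of the step named 'logit'
pipe.set_params(logit__C=C_grid[0])
print(pipe.named_steps["logit"].C)
```

Because the scaler is part of the pipeline, grid search refits it on the training part of every fold, so no statistics leak from the validation part.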

In [None]:
%%time
logit_pipe_grid_search = GridSearchCV(logit_pipe, logit_pipe_params, n_jobs=-1, 
                           scoring ='roc_auc', cv=skf)
logit_pipe_grid_search.fit(X.values, y.values)

In [None]:
logit_pipe_grid_search.best_score_

# Logistic regression and RandomForest on sparse features

With a small number of features, random forest turned out to perform better than logistic regression. However, one of the main weaknesses of trees is how they handle sparse data, for example, text. Let's compare logistic regression and random forest on a new task.
Download the dataset with movie reviews [here](http://d.pr/f/W0HpZh).

In [None]:
# Load the data
df = pd.read_csv("../../data/movie_reviews_train.csv", nrows=50000)

# Separate the review texts from the labels
X_text = df["text"]
y_text = df["label"]

# Class counts
df.label.value_counts()

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

# Split on 3 folds
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=17)

# In the Pipeline we vectorize the text and then train logistic regression
classifier = Pipeline([
    ('vectorizer', CountVectorizer(max_features=100000, ngram_range=(1, 3))),
    ('clf', LogisticRegression(random_state=17))])
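
To see what `CountVectorizer` with `ngram_range=(1, 3)` produces, here is a minimal sketch on two hypothetical one-line "reviews". The result is a sparse document-term matrix whose columns are unigrams, bigrams and trigrams.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two hypothetical toy documents
docs = ["very good movie", "very bad movie"]

vec = CountVectorizer(ngram_range=(1, 3))
X_counts = vec.fit_transform(docs)    # SciPy sparse matrix of counts

# vocabulary_ maps every n-gram to its column index
vocab = sorted(vec.vocabulary_)
print(X_counts.shape, X_counts.nnz)
print(vocab)
```

Each document contributes 3 unigrams, 2 bigrams and 1 trigram; only a fraction of the matrix cells are non-zero, and on real text with up to 100,000 features that fraction becomes very small.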

For logistic regression, we'll iterate over the values of `C` from the list [0.1, 1, 10, 100] and find the best ROC AUC on cross-validation.

In [None]:
%%time
parameters = {'clf__C': (0.1, 1, 10, 100)}
grid_search = GridSearchCV(classifier, parameters, n_jobs=-1, scoring ='roc_auc', cv=skf)
grid_search = grid_search.fit(X_text, y_text)

In [None]:
grid_search.best_params_

In [None]:
grid_search.best_score_

Now we'll do the same with random forest: we'll search over all the hyperparameter combinations below and get the maximum ROC AUC.

In [None]:
classifier = Pipeline([
    ('vectorizer', CountVectorizer(max_features=100000, ngram_range=(1, 3))),
    ('clf', RandomForestClassifier(random_state=17, n_jobs=-1))])

min_samples_leaf = [1, 2, 3]
max_features = [0.3, 0.5, 0.7]
max_depth = [None]

In [None]:
%%time
parameters = {'clf__max_features': max_features,
              'clf__min_samples_leaf': min_samples_leaf,
              'clf__max_depth': max_depth}
grid_search = GridSearchCV(classifier, parameters, n_jobs=-1, scoring ='roc_auc', cv=skf)
grid_search = grid_search.fit(X_text, y_text)

In [None]:
grid_search.best_params_

In [None]:
grid_search.best_score_