## Question 2: Applied ML

We are going to build a news classifier that assigns each article directly to one of 20 news categories. Note that the pipeline you build in this exercise could be of great help during your project if you plan to work with text!

1. Load the 20newsgroup dataset. It is, again, a classic dataset that can be loaded directly using sklearn ([link](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html)).  
[TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf), short for term frequency–inverse document frequency, is of great help when it comes to computing textual features. Indeed, it gives more importance to terms that occur often in the considered article (TF) but reduces the importance of terms that are very frequent across the entire corpus (IDF). Compute TF-IDF features for every article using [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). Then, split your dataset into a training, a validation and a testing set (10% for validation and 10% for testing). Each observation should be paired with its corresponding label (the article category).


2. Train a random forest on your training set. Try to fine-tune the parameters of your predictor on your validation set using a simple grid search on the number of estimators `n_estimators` and the maximum depth of the trees `max_depth`. Then, display a confusion matrix of your classification pipeline. Lastly, once you have assessed your model, inspect the `feature_importances_` attribute of your random forest and discuss the obtained results.
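
To make the TF-IDF weighting described in step 1 concrete, here is a minimal sketch on a made-up three-sentence corpus (the sentences are an assumption for illustration, not part of the dataset). It uses `TfidfVectorizer` with its default settings and compares the weight of a word that appears in every document with the weight of a word that appears in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical): "the" occurs in every document,
# every other word occurs in exactly one document.
docs = ["the cat sat", "the dog barked", "the bird sang"]

vect = TfidfVectorizer()
tfidf = vect.fit_transform(docs)  # sparse matrix, one row per document

# Column indices of the two terms we want to compare.
col_the = vect.vocabulary_["the"]
col_cat = vect.vocabulary_["cat"]

# In the first document both terms occur once (same TF), but "the" is
# down-weighted because it appears in all documents (low IDF).
weight_the = tfidf[0, col_the]
weight_cat = tfidf[0, col_cat]
print(f"the: {weight_the:.3f}, cat: {weight_cat:.3f}")
```

The corpus-wide word should end up with the smaller weight, which is the behaviour we rely on when the vectorizer is applied to the news articles.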



In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix



In [2]:
#We fix a seed for the rest of the code
seed = 1


In [3]:
#We load the "20newsgroup" dataset
ng = fetch_20newsgroups(subset = 'all')

#Extract the raw article texts (the features are computed from them later)
x = ng.data
#Extract the targets (integer category indices)
y = ng.target
#Extract the names of the targets
names = ng.target_names



In [4]:
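#We split our dataset into a training set (90%) and a testing set (10%), stratified by category.
#Validation is handled later by cross-validation on the training set.
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.1, random_state=3, stratify = y)

#Creating the vectorizer and fitting it on the training set only, in order to compute the TF-IDF.
vect_tf = TfidfVectorizer().fit(x_train)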
#TF_IDF for the training set.
x_train = vect_tf.transform(x_train)
#TF_IDF for the testing set.
x_test =  vect_tf.transform(x_test)
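
The exercise asks for an explicit validation set, whereas the cell above keeps only a train/test split and relies on cross-validation later. As a sketch of the alternative, an 80/10/10 split can be obtained by calling `train_test_split` twice. The arrays below are hypothetical stand-in data, and the names `x_hold`, `x_val` and `x_te` are ours, not the notebook's.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1000 samples spread evenly over 20 classes.
y_all = np.repeat(np.arange(20), 50)
x_all = np.arange(len(y_all)).reshape(-1, 1)

# Step 1: hold out 20% of the data, stratified by class.
x_tr, x_hold, y_tr, y_hold = train_test_split(
    x_all, y_all, test_size=0.2, random_state=1, stratify=y_all)

# Step 2: cut the held-out part in half: 10% validation, 10% testing.
x_val, x_te, y_val, y_te = train_test_split(
    x_hold, y_hold, test_size=0.5, random_state=1, stratify=y_hold)

print(len(y_tr), len(y_val), len(y_te))
```

With real text, the vectorizer would still be fitted on the training part only and then applied to the validation and testing parts.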

#### Now that we have split our dataset, let us set up the model, train it on the training set, and check the accuracy of our predictions on the testing set.

In [5]:
#We set the model.
alg = RandomForestClassifier()
#Training.
alg.fit(x_train,y_train)
#Prediction.
pred_y = alg.predict(x_test)
#We calculate our accuracy (true labels first, then predictions).
score = accuracy_score(y_test,pred_y)
score

0.65888594164456238

#### Now we want to tune the parameters in order to make better predictions. For that we will do a simple grid search on the number of estimators "n_estimators" and the max depth of the trees "max_depth".

In [8]:
#We set the model.
alg = RandomForestClassifier()

#We give an (exclusive) upper bound to the parameters we want to tune.
n_estimators = 70
max_depth = 30

#We set the values each parameter will take: n_estimators in 1, 11, ..., 61 and max_depth in 1, 6, ..., 26.
n_estimators = range(1,n_estimators,10)
max_depth = range(1,max_depth,5)

#We create a dictionary of the parameters we want to tune.
tuned_parameters = {
    'n_estimators':n_estimators,
    'max_depth':max_depth
}


#We define how many folds we want to use for the cross-validation.
k = 10

#We set the cross-validation
cross_fold = StratifiedKFold(k,random_state=seed,shuffle=True)

#We set the number of cores that we want to dedicate to the grid search so we can parallelize the task
num_cores = 4

#We initiate a grid and use grid search to find the optimal parameters for our model and we refit the model at the end with the obtained parameters.
best_alg = GridSearchCV(alg,tuned_parameters,cv=cross_fold,scoring='accuracy',n_jobs=num_cores, refit=True,verbose=1)

best_alg.fit(x_train,y_train)

Fitting 10 folds for each of 42 candidates, totalling 420 fits


[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:   17.2s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:  2.3min
[Parallel(n_jobs=4)]: Done 420 out of 420 | elapsed: 10.3min finished


GridSearchCV(cv=StratifiedKFold(n_splits=10, random_state=1, shuffle=True),
       error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params=None, iid=True, n_jobs=4,
       param_grid={'max_depth': range(1, 30, 5), 'n_estimators': range(1, 70, 10)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='accuracy', verbose=1)
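
The grid above is scored by 10-fold cross-validation. The exercise statement instead mentions tuning on a held-out validation set, so here is a minimal sketch of that variant, written as an explicit loop over `ParameterGrid`. The data comes from `make_classification` and is purely a stand-in for the TF-IDF matrix; the variable names are ours.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import ParameterGrid, train_test_split

# Hypothetical stand-in data; in the notebook this would be the TF-IDF matrix.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)

# Same grid as above: 7 values of n_estimators times 6 values of max_depth.
grid = ParameterGrid({"n_estimators": list(range(1, 70, 10)),
                      "max_depth": list(range(1, 30, 5))})
print(len(grid))  # 42 candidates, as reported in the log above

# Fit one forest per candidate and keep the best validation accuracy.
best_score, best_params = -1.0, None
for params in grid:
    model = RandomForestClassifier(random_state=1, **params)
    model.fit(X_tr, y_tr)
    score = accuracy_score(y_val, model.predict(X_val))
    if score > best_score:
        best_score, best_params = score, params

print(best_params, best_score)
```

A single validation split is cheaper than 10 folds (42 fits instead of 420) but gives a noisier estimate, which is one reason to prefer the cross-validated search when the runtime is acceptable.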

In [10]:
#Now that we have obtained the optimal parameters thanks to our grid search, we use the refitted model to get our new predictions.
opti_alg  = best_alg.best_estimator_

#Prediction over the testing set.
pred_y = opti_alg.predict(x_test)

#We calculate our new accuracy (true labels first, then predictions).
score = accuracy_score(y_test,pred_y)


In [11]:
print('New accuracy with the model trained with optimal parameters:',score)

New accuracy with the model trained with optimal parameters: 0.761803713528


In [13]:
#We store all the results in a dataframe
df = pd.DataFrame(best_alg.cv_results_)
df.head()

Unnamed: 0,mean_fit_time,mean_score_time,mean_test_score,mean_train_score,param_max_depth,param_n_estimators,params,rank_test_score,split0_test_score,split0_train_score,...,split7_test_score,split7_train_score,split8_test_score,split8_train_score,split9_test_score,split9_train_score,std_fit_time,std_score_time,std_test_score,std_train_score
0,0.35379,0.007488,0.064265,0.064396,1,1,"{'max_depth': 1, 'n_estimators': 1}",42,0.078638,0.076948,...,0.053846,0.0535,0.060427,0.058207,0.070665,0.070629,0.039886,0.001624,0.00951,0.009817
1,0.471916,0.047363,0.131773,0.134098,1,11,"{'max_depth': 1, 'n_estimators': 11}",39,0.125,0.119486,...,0.137278,0.136009,0.114336,0.11923,0.142518,0.146364,0.043997,0.006494,0.008932,0.009991
2,0.795412,0.102072,0.177053,0.179876,1,21,"{'max_depth': 1, 'n_estimators': 21}",37,0.18838,0.2043,...,0.171006,0.180407,0.169431,0.165783,0.179929,0.189697,0.165067,0.022584,0.016151,0.017185
3,0.907049,0.171338,0.217794,0.221714,1,31,"{'max_depth': 1, 'n_estimators': 31}",34,0.225939,0.225732,...,0.206509,0.21662,0.216232,0.214496,0.204869,0.214898,0.112115,0.1078,0.017997,0.018187
4,1.032803,0.23896,0.252756,0.259945,1,41,"{'max_depth': 1, 'n_estimators': 41}",33,0.251174,0.247558,...,0.269822,0.262,0.260071,0.2674,0.252375,0.275512,0.139822,0.073155,0.024063,0.022804


In [15]:
#We extract the row(s) for which rank_test_score == 1, i.e. the best parameter combination
df_one = df[df['rank_test_score']==1]
df_one.head()

Unnamed: 0,mean_fit_time,mean_score_time,mean_test_score,mean_train_score,param_max_depth,param_n_estimators,params,rank_test_score,split0_test_score,split0_train_score,...,split7_test_score,split7_train_score,split8_test_score,split8_train_score,split9_test_score,split9_train_score,std_fit_time,std_score_time,std_test_score,std_train_score
41,17.485287,0.194268,0.771947,0.911372,26,61,"{'max_depth': 26, 'n_estimators': 61}",1,0.789906,0.918005,...,0.777515,0.91605,0.771919,0.912787,0.761283,0.905021,1.161235,0.032696,0.012539,0.004456
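
Filtering `cv_results_` on `rank_test_score` works, but a fitted `GridSearchCV` also exposes the winning configuration directly through `best_params_`, `best_score_` and `best_index_`. The sketch below shows on a small synthetic problem (stand-in data and a deliberately tiny grid, both assumptions) that the two routes agree.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical stand-in data and a tiny grid, just to keep the run short.
X, y = make_classification(n_samples=200, n_features=10, random_state=1)
search = GridSearchCV(RandomForestClassifier(random_state=1),
                      {"n_estimators": [5, 20], "max_depth": [2, 5]},
                      cv=3, scoring="accuracy")
search.fit(X, y)

# Shortcuts to the winning configuration and its mean cross-validated score.
print(search.best_params_, search.best_score_)

# The same information read from the results table, as in the cell above.
results = pd.DataFrame(search.cv_results_)
best_row = results.loc[search.best_index_]
print(best_row["params"], best_row["mean_test_score"])
```

In the notebook, the equivalent calls would be `best_alg.best_params_` and `best_alg.best_score_`.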


In [19]:
#We create our confusion matrix from the true and predicted labels of the testing set.
matrix = confusion_matrix(y_test,pred_y)
df_conf = pd.DataFrame(matrix, index = names, columns = names)
#We normalize each row so that it shows the fraction of each true category assigned to each predicted category.
df_conf = df_conf.div(df_conf.sum(axis=1),axis=0)
#We set the window size
plt.figure(figsize=(20,7))
#We create the heatmap using our dataframe
sns.heatmap(df_conf, square=True, linecolor='w',linewidths=2,annot=False)
plt.xlabel('Predicted category')
plt.ylabel('True category')
plt.title('Confusion Matrix')
plt.show()
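
The last part of the exercise asks us to inspect `feature_importances_`. As a sketch of how this could be done, the snippet below maps each importance back to its term and lists the most important ones. It runs on a made-up two-topic corpus (an assumption for illustration); in the notebook, the corresponding objects would be `opti_alg` and `vect_tf`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus with two topics (0 = space, 1 = hockey).
docs = ["the rocket reached orbit", "nasa launched the rocket",
        "the orbit of the satellite", "the team won the hockey game",
        "hockey players skate on ice", "the goalie saved the game"] * 5
labels = [0, 0, 0, 1, 1, 1] * 5

vect = TfidfVectorizer()
X = vect.fit_transform(docs)
forest = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, labels)

# Recover the term behind each column; vocabulary_ maps term -> column index.
terms = np.array(sorted(vect.vocabulary_, key=vect.vocabulary_.get))

# Sort the features by decreasing importance and show the top five.
importances = forest.feature_importances_
order = np.argsort(importances)[::-1]
for term, importance in zip(terms[order][:5], importances[order][:5]):
    print(f"{term:10s} {importance:.3f}")
```

The importances are non-negative and sum to 1, so each value can be read as the share of the forest's impurity reduction attributed to that term. On the real data we would expect topic-specific words to rank high, but that remains to be checked against the actual output.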