# Imports

In [25]:
import numpy as np
import pandas as pd

# Question 1

# Question 2

In [29]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, cross_val_predict, cross_validate
from sklearn.metrics import accuracy_score
from sklearn.decomposition import PCA, TruncatedSVD, IncrementalPCA

We are going to build a classifier that assigns news articles directly to one of 20 news categories. Note that the pipeline you will build in this exercise could be of great help during your project if you plan to work with text!

1. Load the 20newsgroup dataset. It is, again, a classic dataset that can directly be loaded using sklearn ([link](http://scikit-learn.org/stable/datasets/twenty_newsgroups.html)).  
[TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf), short for term frequency–inverse document frequency, is of great help when it comes to computing textual features. Indeed, it gives more importance to terms that are frequent in the considered article (TF) but reduces the importance of terms that are very frequent in the entire corpus (IDF). Compute TF-IDF features for every article using [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html). Then, split your dataset into a training set, a validation set and a testing set (10% for validation and 10% for testing). Each observation should be paired with its corresponding label (the article category).

2. Train a random forest on your training set. Try to fine-tune the parameters of your predictor on your validation set using a simple grid search on the number of estimators "n_estimators" and the maximum depth of the trees "max_depth". Then, display a confusion matrix of your classification pipeline. Lastly, once you have assessed your model, inspect the `feature_importances_` attribute of your random forest and discuss the obtained results.

## Solution

### Explanation and assumptions

#### Explanation

#### Assumptions

### Data retrieving

We load the full dataset (`subset='all'`) directly with sklearn.

In [5]:
newsgroups_train = fetch_20newsgroups(subset='all')

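We check that the data is well balanced by counting the articles in each of the 20 categories.

In [None]:
# The dataset has 20 categories, labelled 0 to 19
for i in range(20):
    print(np.count_nonzero(newsgroups_train.target == i))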
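
The same per-class count can be obtained in a single call with `np.bincount`. The snippet below is a minimal sketch on a hypothetical toy label vector (`toy_target` is not part of the notebook) standing in for `newsgroups_train.target`.

```python
import numpy as np

# Hypothetical label vector standing in for newsgroups_train.target
toy_target = np.array([0, 2, 1, 2, 2, 0])

# bincount returns one count per class index, from 0 to minlength - 1
counts = np.bincount(toy_target, minlength=3)
print(counts)  # [2 1 3]
```

On the real data, `np.bincount(newsgroups_train.target, minlength=20)` should give the 20 counts at once.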

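We then convert the text to TF-IDF vectors. The vectorizer is fitted once on the whole corpus, before splitting, so the train and test sets share the same features. Setting the *max_features* parameter would additionally cap the vocabulary size, but it is left at its default here.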

In [8]:
vectorizer = TfidfVectorizer()
newsgroups_train.vectors = vectorizer.fit_transform(newsgroups_train.data)
print('Set shape:', newsgroups_train.vectors.shape)

Set shape: (18846, 173762)
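
To make the effect of TF-IDF weighting concrete, here is a small sketch on a hypothetical three-sentence corpus (`toy_corpus` and the related names are assumptions for illustration, not part of the notebook). A word that appears in every document gets the lowest IDF, while a word that appears in only one document gets a higher one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only to illustrate the weighting
toy_corpus = [
    "the rocket launch was delayed",
    "the hockey game was delayed",
    "the rocket engine failed",
]

toy_vectorizer = TfidfVectorizer()
toy_vectors = toy_vectorizer.fit_transform(toy_corpus)

vocab = toy_vectorizer.vocabulary_  # maps each term to its column index
idf = toy_vectorizer.idf_           # one IDF weight per column

print(toy_vectors.shape)            # (n_documents, n_distinct_terms)
print("idf('the')    =", idf[vocab["the"]])     # term present in all documents
print("idf('hockey') =", idf[vocab["hockey"]])  # term present in one document
```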


### Train/test split

In [12]:
X_train, X_test, y_train, y_test = train_test_split(newsgroups_train.vectors, newsgroups_train.target, test_size=0.1)

In [15]:
X_train.shape

(16961, 173762)
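
The exercise asks for separate validation and testing sets of 10% each, whereas the cell above only holds out a 10% test set. One possible way to get an 80/10/10 split is to call `train_test_split` twice, as sketched below on hypothetical toy arrays (all names here are assumptions, and stratification is an optional choice, not something the notebook does).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 samples, 4 balanced classes
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 4

# Step 1: hold out 20% of the data
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Step 2: cut the held-out part in half: 10% validation, 10% test
X_val, X_te, y_val, y_te = train_test_split(
    X_hold, y_hold, test_size=0.5, stratify=y_hold, random_state=0
)

print(X_tr.shape[0], X_val.shape[0], X_te.shape[0])
```

The same two calls should work unchanged on the sparse TF-IDF matrix, since `train_test_split` accepts scipy sparse inputs.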

### Random forest

In [16]:
SCORING = ['accuracy', 'neg_mean_squared_error']

In [17]:
def runCV(clf, X_train, y_train, k):
    scores = cross_validate(clf, X_train, y_train, cv=k, scoring=SCORING, return_train_score=False)
    print_scores(scores)
    return scores
        
def print_scores(scores):
    print('Scores')
    print("Accuracy: %0.2f (+/- %0.2f)" % (scores['test_accuracy'].mean(), scores['test_accuracy'].std() * 2))
    # Convert each fold's negated MSE to an RMSE before taking mean and spread
    rmse = np.sqrt(-scores['test_neg_mean_squared_error'])
    print("RMSE: %0.2f (+/- %0.2f)" % (rmse.mean(), rmse.std() * 2))

Using **k-fold cross-validation** and a **grid search**, we found that good parameters are:

* n_estimators = 350
* max_depth = 80

Further fine-tuning is possible, but we judged these parameters good enough given the long running time of the search.

In [18]:
def fine_tuning(depths, estimators):
    best_depth = 0
    best_estimators = 0
    best_acc = 0

    for max_depth in depths:
        for n_estimators in estimators:
            clf = RandomForestClassifier(
                n_estimators=n_estimators,
                max_depth=max_depth,
                random_state=42,
                n_jobs=-1
            )
            # Use y_train so the labels match the rows of X_train
            scores = runCV(clf, X_train, y_train, 7)
            acc = scores['test_accuracy'].mean()

            if acc > best_acc:
                best_depth = max_depth
                best_estimators = n_estimators
                best_acc = acc
                
    print('Best parameters are (accuracy of', best_acc, '):')
    print('Depth:', best_depth, 'n_estimators:', best_estimators)
    
    return best_depth, best_estimators

In [None]:
fine_tuning([10], [50])
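
As an alternative to the hand-written loops, scikit-learn's `GridSearchCV` performs the same kind of cross-validated search over `n_estimators` and `max_depth`. The sketch below runs on a small synthetic dataset so that it finishes quickly. The data, the grid values and the variable names are assumptions for illustration, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Hypothetical synthetic data standing in for the TF-IDF features
X_toy, y_toy = make_classification(
    n_samples=200, n_features=20, n_informative=5, n_classes=3, random_state=0
)

# Illustrative grid; the real search would use larger values
param_grid = {"n_estimators": [10, 30], "max_depth": [3, 6]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,                # 3-fold cross-validation for each combination
    scoring="accuracy",
)
search.fit(X_toy, y_toy)

print("Best parameters:", search.best_params_)
print("Best mean CV accuracy:", search.best_score_)
```

After fitting, `search.best_estimator_` holds a forest refitted on all the data passed to `fit` with the best parameter combination.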

In [35]:
max_depth = 100
n_estimators = 400

In [None]:
clf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, n_jobs=-1)
clf.fit(X_train, y_train)

### Model assessment

In [22]:
acc = accuracy_score(y_test, clf.predict(X_test))
print('Accuracy:', acc * 100)

array([16,  7, 13, ..., 14, 10,  3])
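
The exercise also asks for a confusion matrix and a look at `feature_importances_`, neither of which appears above. The sketch below shows one way to do both on a hypothetical six-document corpus. All names and data are assumptions, and the model is evaluated on its own training data purely to keep the example short. In the notebook, the equivalent calls would use `y_test`, `clf.predict(X_test)` and `vectorizer`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix

# Hypothetical toy corpus with two classes (0 = space, 1 = hockey)
docs = [
    "rocket orbit launch", "orbit satellite rocket", "launch satellite orbit",
    "hockey goal puck", "puck ice hockey", "goal ice puck",
]
labels = np.array([0, 0, 0, 1, 1, 1])

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, labels)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(labels, forest.predict(X))
print(cm)

# Recover the term for each column, ordered by column index
feature_names = np.array(sorted(vec.vocabulary_, key=vec.vocabulary_.get))

# Indices of the 5 most important features, most important first
top = np.argsort(forest.feature_importances_)[::-1][:5]
for idx in top:
    print(feature_names[idx], round(float(forest.feature_importances_[idx]), 3))
```

On the real data, the highest-ranked terms would be expected to be words that are characteristic of a single newsgroup, which is worth checking when discussing the results.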