<!--NOTEBOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="./figures/LogoOpenclassrooms.png">
<font size="4">
<p>
This activity is part of the course **``Analysez vos données textuelles``** (Analyze your textual data), delivered as a MOOC by
**<font color='blus'>Openclassrooms</font>**.
</p>
...   
<p>
...   
</p>

**Instructions**: 

* Load the data
* Create several classifiers (at least 3)
* Run a cross-validation on the different classifiers
* Display the resulting performances
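
The cross-validation step listed above can be sketched as follows. This is only a minimal illustration on synthetic data, not the procedure used later in this notebook: `make_multilabel_classification` stands in for RCV1, and the 3-fold setting and the names `X_toy`, `Y_toy` and `cv_scores` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB

# Synthetic multilabel data standing in for RCV1 (assumption for illustration).
X_toy, Y_toy = make_multilabel_classification(
    n_samples=300, n_features=30, n_classes=4, random_state=0)

# One binary Naive Bayes classifier per label.
clf = OneVsRestClassifier(MultinomialNB())

# 3-fold cross-validation; with multilabel targets, 'accuracy' is the
# subset accuracy (all labels of a sample must match).
cv = KFold(n_splits=3, shuffle=True, random_state=0)
cv_scores = cross_val_score(clf, X_toy, Y_toy, cv=cv, scoring='accuracy')

print("Fold scores:", np.round(cv_scores, 3))
print("Mean score : {0:1.3f}".format(cv_scores.mean()))
```

The same call could be repeated for each classifier to compare their mean scores.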

<p>
The dataset is relatively heavy for local work, with 650 MB of compressed data. It is advisable to work on a sample first to make sure that everything works as expected, and then to process the whole dataset to obtain the final results.
</p>
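
Working on a sample first can be sketched as below. This is a minimal example under stated assumptions: `sample_rows` is a hypothetical helper that is not part of this notebook, and the toy sparse matrices only stand in for the RCV1 features and targets.

```python
import numpy as np
from scipy import sparse

def sample_rows(X, y, ratio, seed=0):
    """Return a random subset of rows of X and y (same rows in both)."""
    rng = np.random.default_rng(seed)
    n_rows = X.shape[0]
    n_keep = max(1, int(n_rows * ratio))
    # Sorted indices keep the original row order inside the sample.
    idx = np.sort(rng.choice(n_rows, size=n_keep, replace=False))
    return X[idx], y[idx]

# Toy sparse matrices standing in for the RCV1 features and targets.
X_demo = sparse.random(1000, 50, density=0.01, format='csr', random_state=0)
y_demo = sparse.random(1000, 5, density=0.2, format='csr', random_state=1)

X_small, y_small = sample_rows(X_demo, y_demo, ratio=0.1)
print(X_small.shape, y_small.shape)
```

Once the whole pipeline runs on the sample, the ratio can be raised to 1.0 for the final results.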

In [2]:
#-------------------------------------------------------------------------------
# Constants for this notebook
#-------------------------------------------------------------------------------
file_name_train = './data/rcv1_train.dump'
IS_LOCAL_DATASET=True
N_JOBS=-1


In [3]:
import numpy as np

from sklearn import preprocessing
from sklearn import metrics

#-------------------------------------------------------------------------------
# Split X and y into sequential train and test parts (no shuffling)
#-------------------------------------------------------------------------------
def split_train_test(X,y,train_ratio):
    train_range= int(X.shape[0]*train_ratio)
    X_train=X[:train_range]
    X_test=X[train_range:]
    
    y_train=y[:train_range]
    y_test=y[train_range:]
    return X_train, X_test, y_train, y_test
#-------------------------------------------------------------------------------


#-------------------------------------------------------------------------------
# Encode each multilabel target row as a single class label
#-------------------------------------------------------------------------------
def get_encoded_target(data_target):
    list_str_target=list()
    for row in range(0,data_target.shape[0]):
        list_str=[str(weight) for weight in np.array(data_target[row].todense())[0]]
        str_target = ''.join(list_str)
        list_str_target.append(str_target)    

    le = preprocessing.LabelEncoder()
    list_encoded_target=le.fit_transform(list_str_target)
    return list_encoded_target, le
#-------------------------------------------------------------------------------


#-------------------------------------------------------------------------------
# Global accuracy and per-target accuracy of predictions
#-------------------------------------------------------------------------------
def compute_accuracy_per_target(y_test, y_pred, list_target):
   """Computes global prediction accuracy and per-target prediction
      accuracy.
      
      Function used for accuracy is metrics.accuracy_score
      Input : 
         * y_test : vector of true targets
         * y_pred : vector of targets predicted by the model
         * list_target : list of target values for which accuracy is computed
      Output : 
         * score_global : accuracy over all samples
         * dict_score_target : accuracy per target, computed over the samples
         predicted as this target
   """

   #----------------------------------------------------------
   # Global accuracy is computed
   #----------------------------------------------------------
   score_global=metrics.accuracy_score(y_test, y_pred)


   dict_score_target=dict()
   for i_target in list_target :
       #----------------------------------------------------------
       # Get tuple of array indexes matching with target
       #----------------------------------------------------------
       index_tuple=np.where( y_pred==i_target )

       #----------------------------------------------------------
       # Extract values thanks to array of indexes 
       #----------------------------------------------------------
       y_test_target=y_test[index_tuple[0]]
       y_pred_target=y_pred[index_tuple[0]]
       
       nb_elt_target=len(y_test_target)
       
       #----------------------------------------------------------
       # Accuracy is computed and displayed
       #----------------------------------------------------------
       score_target=metrics.accuracy_score(y_test_target, y_pred_target)
       dict_score_target[i_target]=score_target
       #print("Segment "+str(i_segment)+" : "+str(nb_elt_segment)\
       #+" elts / Random forest / Précision: {0:1.2F}".format(score))
   return score_global,dict_score_target
#-------------------------------------------------------------------------------


# <font color='blus'>1. Data acquisition</font>

**From http://scikit-learn.org/stable/datasets/rcv1.html**

``data``: The feature matrix is a scipy CSR sparse matrix, with 804414 samples and 47236 features. 

Non-zero values contain cosine-normalized, log TF-IDF vectors. 

A nearly chronological split is proposed in [1]: 

* The first 23149 samples are the training set. 
* The last 781265 samples are the testing set. 

This follows the official LYRL2004 chronological split. 

The array has 0.16% of non-zero values.
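
The share of non-zero values quoted above can be checked on any sparse matrix. The sketch below uses a small toy matrix; `density_percent` is a hypothetical helper name, and the same function could be applied to `rcv1_train.data` once it is loaded.

```python
from scipy import sparse

def density_percent(X):
    """Percentage of explicitly stored (non-zero) entries in a sparse matrix."""
    n_rows, n_cols = X.shape
    return 100.0 * X.nnz / (n_rows * n_cols)

# Toy CSR matrix: 2 non-zero values out of 3 x 4 = 12 cells.
X_demo = sparse.csr_matrix([[0, 0, 1.5, 0],
                            [0, 0, 0,   0],
                            [2, 0, 0,   0]])
print("{0:1.2f} %".format(density_percent(X_demo)))
```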

## <font color='blus'>1.1. Loading train dataset </font>

In [4]:
# RCV1 : Reuters Corpus Volume I 
from sklearn.datasets import fetch_rcv1

import p5_util

if IS_LOCAL_DATASET is False:
    #------------------------------------------------------------------------
    # Loading train dataset
    #------------------------------------------------------------------------
    rcv1_train = fetch_rcv1(subset='train')
    file_name = file_name_train
    #------------------------------------------------------------------------
    # Dumping train dataset
    #------------------------------------------------------------------------
    p5_util.object_dump(rcv1_train, file_name)
else:
    print("\n*** Loading local train dataset ...")
    print(file_name_train+str("\n"))
    rcv1_train = p5_util.object_load(file_name_train)
    print("\n*** Local train dataset loaded!")

print("\n*** Train target : "+str(rcv1_train.target_names.shape))
print("\n*** Train data : "+str(rcv1_train.data.shape))

X_train = rcv1_train.data
y_train = rcv1_train.target


*** Loading local train dataset ...
./data/rcv1_train.dump

p5_util.object_load : fileName= ./data/rcv1_train.dump

*** Local train dataset loaded!

*** Train target : (103,)

*** Train data : (23149, 47236)


## <font color='blus'>1.2. Loading test dataset from corpus</font>

In [5]:
import p5_util

# RCV1 : Reuters Corpus Volume I 
from sklearn.datasets import fetch_rcv1

data_path = "./data"
core_name = "rcv1_test"

if IS_LOCAL_DATASET is False:
    #------------------------------------------------------------------------
    # Loading test dataset
    #------------------------------------------------------------------------
    rcv1_test = fetch_rcv1(subset='test')
    print(rcv1_test.keys(),rcv1_test.data.shape)
    
    #------------------------------------------------------------------------
    # Dumping test dataset
    #------------------------------------------------------------------------
    p5_util.bunch_dump(rcv1_test, 100000, data_path, core_name)

    # The Bunch returned by fetch_rcv1 is dict-like, so it can be used
    # under the same name as the dictionary loaded from local files.
    dict_rcv1_test = rcv1_test
else:
    print("\n*** Loading local test dataset ...")
    list_key = ['data', 'target', 'sample_id', 'target_names',]
    data_len=781265
    row_packet=100000
    dict_rcv1_test = p5_util.bunch_load(list_key, data_len, row_packet, data_path, core_name)
    print("\n*** Local test dataset loaded!")

print("\n*** Test dataset dictionary keys : "+str(dict_rcv1_test.keys()))
print("\n*** Test target sizing : "+str(dict_rcv1_test['target_names'].shape))
print("\n*** Test data sizing : "+str(dict_rcv1_test['data'].shape))

X_test = dict_rcv1_test['data']
y_test = dict_rcv1_test['target']


*** Loading local test dataset ...
p5_util.object_load : fileName= ./data/rcv1_test_data_0.dump
./data/rcv1_test_data_0.dump (100000, 47236)
p5_util.object_load : fileName= ./data/rcv1_test_data_1.dump
./data/rcv1_test_data_1.dump (200000, 47236)
p5_util.object_load : fileName= ./data/rcv1_test_data_2.dump
./data/rcv1_test_data_2.dump (300000, 47236)
p5_util.object_load : fileName= ./data/rcv1_test_data_3.dump
./data/rcv1_test_data_3.dump (400000, 47236)
p5_util.object_load : fileName= ./data/rcv1_test_data_4.dump
./data/rcv1_test_data_4.dump (500000, 47236)
p5_util.object_load : fileName= ./data/rcv1_test_data_5.dump
./data/rcv1_test_data_5.dump (600000, 47236)
p5_util.object_load : fileName= ./data/rcv1_test_data_6.dump
./data/rcv1_test_data_6.dump (700000, 47236)
p5_util.object_load : fileName= ./data/rcv1_test_data_7.dump
./data/rcv1_test_data_7.dump (781258, 47236)

 Loading splited file done!

p5_util.object_load : fileName= ./data/rcv1_test_target_0.dump
./data/rcv1_test_target

# <font color='blus'>2. Applying classifiers over dataset</font>

**Classifier scores are stored in a dictionary**

In [None]:
dict_classifer_score=dict()

## <font color='blus'>2.1. Multinomial Naive Bayes classifier with binary relevance</font>

Because each sample can carry several target classes, `OneVsRestClassifier` is applied on top of the Multinomial Naive Bayes classifier. 

One classifier per class is fitted.
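
A minimal sketch of this one-classifier-per-label behaviour, on toy data: the arrays `X_toy` and `Y_toy` are made up for illustration and are not taken from RCV1.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB

# Toy term-count features (non-negative, as MultinomialNB expects).
X_toy = np.array([[3, 0, 1],
                  [0, 2, 0],
                  [2, 0, 2],
                  [0, 3, 1],
                  [1, 1, 0],
                  [0, 0, 3]])

# Multilabel indicator matrix: one column per label, several 1s per row allowed.
Y_toy = np.array([[1, 0],
                  [0, 1],
                  [1, 0],
                  [0, 1],
                  [1, 1],
                  [0, 0]])

ovr = OneVsRestClassifier(MultinomialNB()).fit(X_toy, Y_toy)

# One fitted binary estimator per label column.
print(len(ovr.estimators_))
print(ovr.predict(X_toy).shape)
```

Each binary estimator is trained independently, which is why this transformation is also called binary relevance.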

**Note: change the `N_JOBS` value if required. It is set to -1 in this notebook, which uses all available CPU cores.** 

*See the constants defined in the first cells of this notebook.*

### <font color='blus'>2.1.1. Training Multinomial Naive Bayes classifier</font>

The One-versus-Rest transformation is applied to the Naive Bayes classifier.

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.multiclass import OneVsRestClassifier

# n_jobs=N_JOBS fits the per-label classifiers in parallel
classifier_mnb = OneVsRestClassifier(MultinomialNB(), n_jobs=N_JOBS).fit(X_train, y_train)

### <font color='blus'>2.1.2. Classifier evaluation </font>

In [None]:
from sklearn.metrics import accuracy_score
y_pred = classifier_mnb.predict(X_test)

classifier_mnb_score = accuracy_score(y_test,y_pred)

print("Accuracy score for MNB classifier composed with OvR: {0:1.2F} %".format(classifier_mnb_score*100))

In [None]:
dict_classifer_score['MNB'] = classifier_mnb_score
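
With multilabel targets, `accuracy_score` computes the subset accuracy: a sample counts as correct only when all of its labels are predicted correctly. The toy sketch below contrasts it with a per-label accuracy derived from `hamming_loss`; the arrays are made up for illustration and are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_hat = np.array([[1, 0, 1],    # all labels correct
                  [0, 1, 1],    # one label wrong
                  [1, 0, 0]])   # one label wrong

subset_acc = accuracy_score(y_true, y_hat)          # exact-match ratio
per_label_acc = 1.0 - hamming_loss(y_true, y_hat)   # fraction of correct cells

print("Subset accuracy   : {0:1.2f}".format(subset_acc))
print("Per-label accuracy: {0:1.2f}".format(per_label_acc))
```

Subset accuracy is therefore a strict metric when many labels are involved, which should be kept in mind when reading the scores in this notebook.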

## <font color='blus'>2.2. Chained Multinomial Naive Bayes classifier</font>

### <font color='blus'>2.2.1. Training Chained Naive Bayes classifier</font>

In [None]:
#from nltk.classify import NaiveBayesClassifier
#from sklearn.multioutput import ClassifierChain
from skmultilearn.problem_transform import ClassifierChain

from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression


In [None]:
chained_classifier_mnb = ClassifierChain(MultinomialNB())

In [None]:
chained_classifier_mnb.fit(X_train , y_train)

In [None]:
import p5_util
file_name='./data/chained_classifier_mnb.dump'
p5_util.object_dump(chained_classifier_mnb, file_name)

### <font color='blus'>2.2.2. Chained classifier evaluation </font>

In [None]:
import p5_util
file_name='./data/chained_classifier_mnb.dump'
chained_classifier_mnb = p5_util.object_load(file_name)

In [None]:
print(X_test.shape,y_test.shape)
test_size=int(X_test.shape[0]/50)
X_test_size = X_test[:test_size,:]
y_test_size = y_test[:test_size,:]
print(X_test_size.shape, y_test_size.shape)

In [None]:
chained_classifier_mnb_score = chained_classifier_mnb.score(X_test_size ,y_test_size)

print("Mean score for Chained Multinomial Naive Bayes classifier : {0:1.2F} %".format(chained_classifier_mnb_score*100))

In [None]:
from sklearn.metrics import accuracy_score
# Predict on the same test subset that is used for scoring.
y_pred = chained_classifier_mnb.predict(X_test_size)

chained_classifier_mnb_score = accuracy_score(y_test_size,y_pred)

print("Accuracy score for chained MNB classifier : {0:1.2F} %".format(chained_classifier_mnb_score*100))

This result is not better than the one obtained with the OvR (binary relevance) transformation.

This suggests that the labels may be largely independent of each other.

In [None]:
dict_classifer_score['Chained MNB'] = chained_classifier_mnb_score

## <font color='blus'>2.3. Evaluation of Bernoulli Naive Bayes classifier</font>

Do the data have binary values? If so, a Bernoulli Naive Bayes classifier is expected to provide good results.

Inspection shows that `y_train` is encoded as a binary array.

It is therefore possible to arrange the classifier as a chain of binary classifiers.

Chaining binary classifiers takes into account the correlations between the labels assigned to samples.
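
The chaining idea can be sketched by hand as follows. This is an illustrative sketch only, not the implementation of `skmultilearn`'s `ClassifierChain`: `fit_chain` and `predict_chain` are hypothetical helpers, and the toy arrays are made up so that the second label always equals the first.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

def fit_chain(X, Y):
    """Fit one binary classifier per label; label j also sees labels 0..j-1."""
    models = []
    for j in range(Y.shape[1]):
        X_aug = np.hstack([X, Y[:, :j]])   # features + true previous labels
        models.append(BernoulliNB().fit(X_aug, Y[:, j]))
    return models

def predict_chain(models, X):
    """Predict labels in chain order, feeding earlier predictions forward."""
    Y_hat = np.zeros((X.shape[0], len(models)), dtype=int)
    for j, model in enumerate(models):
        X_aug = np.hstack([X, Y_hat[:, :j]])
        Y_hat[:, j] = model.predict(X_aug)
    return Y_hat

# Toy binary features and labels; label 1 is always equal to label 0 here.
X_toy = np.array([[1, 0], [1, 1], [0, 1], [0, 0], [1, 0], [0, 1]])
Y_toy = np.array([[1, 1], [1, 1], [0, 0], [0, 0], [1, 1], [0, 0]])

chain = fit_chain(X_toy, Y_toy)
print(predict_chain(chain, X_toy))
```

Because the second classifier receives the first label as an extra feature, it can exploit the dependency between the two labels, which independent per-label classifiers cannot do.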


### <font color='blus'>2.3.1. Training classifier</font>

In [None]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.multiclass import OneVsRestClassifier

classifier_ber = OneVsRestClassifier(BernoulliNB(), n_jobs=N_JOBS).fit(X_train, y_train)

### <font color='blus'>2.3.2. Classifier evaluation </font>

In [None]:
from sklearn.metrics import accuracy_score
y_pred = classifier_ber.predict(X_test)

classifier_ber_score = accuracy_score(y_test,y_pred)

print("Accuracy score for Bernoulli NB classifier : {0:1.2F} %".format(classifier_ber_score*100))

In [None]:
dict_classifer_score['Bernoulli NB'] = classifier_ber_score

## <font color='blus'>2.4. Evaluation of SGD classifier</font>

Stochastic Gradient Descent Classifier

In [None]:
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier

classifier_sgd = OneVsRestClassifier(SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=5, tol=None)).fit(X_train, y_train)

In [None]:
classifier_sgd_score = classifier_sgd.score(X_test ,y_test)
print("Mean score for Stochastic Gradient Descent classifier : {0:1.2F} %".format(classifier_sgd_score*100))

In [None]:
dict_classifer_score['SGD'] = classifier_sgd_score

In [None]:
import p5_util
file_name='./data/dict_classifer_score.dump'
p5_util.object_dump(dict_classifer_score, file_name)

## <font color='blus'>2.5. Chained evaluation of SGD classifier</font>

Stochastic Gradient Descent Classifier

In [6]:
from sklearn.linear_model import SGDClassifier
from skmultilearn.problem_transform import ClassifierChain

#
chained_classifier_sgd = ClassifierChain(SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=5, tol=None))

In [7]:
X_train.shape

(23149, 47236)

In [8]:
chained_classifier_sgd.fit(X_train , y_train)

ValueError: The number of classes has to be greater than one; got 1 class
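
The message suggests that at least one binary sub-problem of the chain saw a single class, which can happen when a label column of the training targets is all zeros or all ones. Whether this is the actual cause here is an assumption. The sketch below shows how such columns could be detected and filtered out; `informative_label_columns` is a hypothetical helper and the toy targets are made up.

```python
import numpy as np
from scipy import sparse

def informative_label_columns(Y):
    """Indices of label columns that contain both 0s and 1s."""
    n_samples = Y.shape[0]
    positives = np.asarray(Y.sum(axis=0)).ravel()   # positives per label
    return np.flatnonzero((positives > 0) & (positives < n_samples))

# Toy sparse targets: column 1 is never assigned, column 3 is always assigned.
Y_demo = sparse.csr_matrix(np.array([[1, 0, 0, 1],
                                     [0, 0, 1, 1],
                                     [1, 0, 1, 1]]))

keep = informative_label_columns(Y_demo)
Y_filtered = Y_demo[:, keep]
print(keep, Y_filtered.shape)
```

If columns are dropped from the training targets, the same column selection has to be applied to the test targets before scoring.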

In [None]:
print(X_test.shape,y_test.shape)
test_size=int(X_test.shape[0]/50)
X_test_size = X_test[:test_size,:]
y_test_size = y_test[:test_size,:]
print(X_test_size.shape, y_test_size.shape)

In [None]:
chained_classifier_sgd_score = chained_classifier_sgd.score(X_test_size ,y_test_size)
print("Mean score for chained Stochastic Gradient Descent classifier : {0:1.2F} %".format(chained_classifier_sgd_score*100))

In [None]:
import p5_util
file_name='./data/dict_classifer_score.dump'
dict_classifer_score = p5_util.object_load(file_name)

In [None]:
dict_classifer_score['Chained SGD'] = chained_classifier_sgd_score

## <font color='blus'>2.6. SGD classifier transformed with Label powerset</font>

### <font color='blus'>2.6.1. Training classifier</font>

In [9]:
from sklearn.linear_model import SGDClassifier
from skmultilearn.problem_transform import LabelPowerset
power_classifier_sgd = LabelPowerset(SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=5, tol=None))

In [10]:
power_classifier_sgd.fit(X_train , y_train)

MemoryError: 
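
A label powerset turns every distinct combination of labels into one class, so the number of classes can grow quickly. One plausible contributor to the memory error, stated here as an assumption, is the size of the dense coefficient matrix that a linear model keeps for that many classes. The sketch below counts the combinations on toy targets; `count_label_combinations` is a hypothetical helper.

```python
import numpy as np
from scipy import sparse

def count_label_combinations(Y):
    """Number of distinct label sets, i.e. classes after a label powerset."""
    Y = sparse.csr_matrix(Y, copy=True)
    Y.eliminate_zeros()
    Y.sort_indices()
    combos = set()
    for start, end in zip(Y.indptr[:-1], Y.indptr[1:]):
        # Each row is identified by the tuple of its active label indices.
        combos.add(tuple(Y.indices[start:end]))
    return len(combos)

Y_demo = np.array([[1, 0, 1],
                   [1, 0, 1],
                   [0, 1, 0],
                   [0, 0, 0]])
n_classes = count_label_combinations(Y_demo)

# Rough size of a dense float64 coefficient matrix for a linear one-vs-all
# model with that many classes; 47236 is the feature count reported above.
n_features = 47236
coef_megabytes = n_classes * n_features * 8 / 1e6
print(n_classes, "{0:1.1f} MB".format(coef_megabytes))
```

Applying the same count to `y_train` would indicate how many classes the transformed problem contains before attempting the fit.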

In [None]:
print(X_test.shape,y_test.shape)
test_size=int(X_test.shape[0]/50)
X_test_size = X_test[:test_size,:]
y_test_size = y_test[:test_size,:]
print(X_test_size.shape, y_test_size.shape)

In [None]:
from sklearn.metrics import accuracy_score
y_pred = power_classifier_sgd.predict(X_test_size)

power_classifier_sgd_score = accuracy_score(y_test_size,y_pred)

print("Accuracy score for Power labelled SGD classifier : {0:1.2F} %".format(power_classifier_sgd_score*100))

In [None]:
dict_classifer_score['Power SGD'] = power_classifier_sgd_score

# <font color='blus'>3. Classifiers results</font>

In [None]:
import matplotlib.pyplot as plt

#-------------------------------------------------------------------------------
#
#-------------------------------------------------------------------------------
def ser_item_occurency_plot(ser_item_name, ser_item_count, item_count=None, title=None):
    """Plot values taken from 2 Series as follows: 
    First Series contains item names
    Second Series contains item occurrences (here, classifier scores).
    
    """
    df_item_dict={item:count for item, count \
    in zip(ser_item_name, ser_item_count)}

    list_item_sorted \
    = sorted(df_item_dict.items(), key=lambda x: x[1], reverse=False)

    dict_item_sorted = dict()
    for tuple_value in list_item_sorted :
        dict_item_sorted[tuple_value[0]] = tuple_value[1]


    X = list(dict_item_sorted.keys())
    y = list(dict_item_sorted.values())

    fig, ax = plt.subplots(figsize=(20,10))

    if item_count is not None:
        X_plot = X[:item_count]
        y_plot = y[:item_count]
    else:
        X_plot = X.copy()
        y_plot = y.copy()
    
    ax.plot(X_plot,y_plot)
    ax.set_xticklabels(X[:item_count], rotation=90)
    ax.set_xlabel('Classifiers')
    ax.set_ylabel('Accuracy')
    if title is not None : 
        ax.set_title(title)
    ax.grid(linestyle='-', linewidth='0.1', color='grey')
    fig.patch.set_facecolor('#E0E0E0')

    plt.show()
#-------------------------------------------------------------------------------


**The score dictionary is converted into a DataFrame; scores are sorted inside the plotting function.**

In [None]:
import pandas as pd
df_result = pd.DataFrame.from_dict( dict_classifer_score, orient='index')
df_result.reset_index(inplace=True)
df_result.rename(columns={'index':'Classifier',0:'Score'}, inplace=True)
df_result
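
If a sorted table is wanted as well, the DataFrame itself can be sorted. The sketch below uses made-up scores and hypothetical classifier names purely for illustration; they are not results from this notebook.

```python
import pandas as pd

# Made-up scores for illustration only; these are not results from this notebook.
dict_demo_score = {'Model A': 0.40, 'Model B': 0.55, 'Model C': 0.25}

df_demo = pd.DataFrame.from_dict(dict_demo_score, orient='index')
df_demo.reset_index(inplace=True)
df_demo.rename(columns={'index': 'Classifier', 0: 'Score'}, inplace=True)

# Sort rows by increasing score and renumber the index.
df_demo_sorted = df_demo.sort_values('Score').reset_index(drop=True)
print(df_demo_sorted)
```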

In [None]:
title = "Classifiers accuracy"
ser_item_occurency_plot(df_result.Classifier, df_result.Score, item_count=None, title=title)