<!--NOTEBOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="./figures/LogoOpenclassrooms.png">
<font size="4">
<p>
This activity is part of the course **``Analysez vos données textuelles``**, offered as a MOOC by
**<font color='blus'>Openclassrooms</font>**.
</p>
...   
<p>
...   
</p>

**Instructions**:

* Load the data
* Create several classifiers (at least 3)
* Run cross-validation on each classifier
* Display the resulting performance figures

<p>
The dataset is relatively heavy for local work, at 650MB compressed. It is advisable to work on a sample first to make sure everything behaves as expected, then to process the whole dataset to obtain the final results.
</p>
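The instructions above can be sketched on a small sample as follows. This is a minimal illustration, not the notebook's final pipeline: the matrix `X_sample` and the labels `y_sample` are synthetic stand-ins (assumptions), and the three classifiers are simply the ones that appear later in this notebook.

```python
import numpy as np
from scipy import sparse
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(0)

# Synthetic stand-in for a sample of the corpus (NOT the RCV1 data):
# a sparse, non-negative feature matrix and a single-label target.
X_sample = sparse.random(300, 50, density=0.1, format="csr", random_state=rng)
y_sample = rng.randint(0, 3, size=300)

classifiers = {
    "MultinomialNB": MultinomialNB(),
    "SGDClassifier": SGDClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=10, random_state=0),
}

# 3-fold cross-validation, scored with accuracy, for each classifier.
cv_scores = {}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X_sample, y_sample, cv=3, scoring="accuracy")
    cv_scores[name] = scores
    print("{}: mean accuracy = {:.3f} (std {:.3f})".format(
        name, scores.mean(), scores.std()))
```

Once the loop runs on the sample, the same code can be pointed at the full feature matrix and the encoded target.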

In [1]:
import numpy as np

from sklearn import preprocessing
from sklearn import metrics

#-------------------------------------------------------------------------------
# Constants for this notebook
#-------------------------------------------------------------------------------
file_name_train = './data/rcv1_train.dump'

#-------------------------------------------------------------------------------
# Sequential train/test split (rows are kept in order, not shuffled)
#-------------------------------------------------------------------------------
def split_train_test(X,y,train_ratio):
    train_range= int(X.shape[0]*train_ratio)
    X_train=X[:train_range]
    X_test=X[train_range:]
    
    y_train=y[:train_range]
    y_test=y[train_range:]
    return X_train, X_test, y_train, y_test
#-------------------------------------------------------------------------------


#-------------------------------------------------------------------------------
# Encode each multi-label target row as a single class id
#-------------------------------------------------------------------------------
def get_encoded_target(data_target):
    list_str_target=list()
    for row in range(0,data_target.shape[0]):
        list_str=[str(weight) for weight in np.array(data_target[row].todense())[0]]
        str_target = ''.join(list_str)
        list_str_target.append(str_target)    

    le = preprocessing.LabelEncoder()
    list_encoded_target=le.fit_transform(list_str_target)
    return list_encoded_target, le
#-------------------------------------------------------------------------------


#-------------------------------------------------------------------------------
# Global accuracy and per-target accuracy
#-------------------------------------------------------------------------------
def compute_accuracy_per_target(y_test, y_pred, list_target):
   """Compute the global accuracy and a per-target accuracy.

      Both scores use metrics.accuracy_score. For each target, the score is
      computed over the samples predicted as that target, which amounts to
      the per-class precision.
      Input :
         * y_test : array of true labels
         * y_pred : array of labels predicted by the model
         * list_target : list of target labels to score
      Output :
         * score_global : accuracy over all samples
         * dict_score_target : dict mapping each target to its score
   """

   #----------------------------------------------------------
   # Global accuracy is computed
   #----------------------------------------------------------
   score_global=metrics.accuracy_score(y_test, y_pred)


   dict_score_target=dict()
   for i_target in list_target :
       #----------------------------------------------------------
       # Get tuple of array indexes matching with target
       #----------------------------------------------------------
       index_tuple=np.where( y_pred==i_target )

       #----------------------------------------------------------
       # Extract values thanks to array of indexes 
       #----------------------------------------------------------
       y_test_target=y_test[index_tuple[0]]
       y_pred_target=y_pred[index_tuple[0]]
       
       nb_elt_target=len(y_test_target)
       
       #----------------------------------------------------------
       # Accuracy is computed and displayed
       #----------------------------------------------------------
       score_target=metrics.accuracy_score(y_test_target, y_pred_target)
       dict_score_target[i_target]=score_target
       #print("Segment "+str(i_segment)+" : "+str(nb_elt_segment)\
       #+" elts / Random forest / Précision: {0:1.2F}".format(score))
   return score_global,dict_score_target
#-------------------------------------------------------------------------------


# <font color='blus'>1. Data acquisition</font>

**From http://scikit-learn.org/stable/datasets/rcv1.html**

``data``: The feature matrix is a scipy CSR sparse matrix, with 804414 samples and 47236 features. Non-zero values contain cosine-normalized, log TF-IDF vectors.

A nearly chronological split is proposed in [1]: 

* The first 23149 samples are the training set. 
* The last 781265 samples are the testing set. 

This follows the official LYRL2004 chronological split. 

About 0.16% of the entries in this matrix are non-zero.
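A density figure like this one can be checked directly from the sparse matrix. The sketch below uses a tiny hand-built matrix `X_demo` (an assumption, standing in for `rcv1_train.data`) and a hypothetical helper `nonzero_ratio`.

```python
import numpy as np
from scipy import sparse

# Tiny hand-built CSR matrix standing in for `rcv1_train.data`.
X_demo = sparse.csr_matrix(np.array([[0.0, 0.5, 0.0, 0.0],
                                     [0.0, 0.0, 0.0, 0.0],
                                     [0.3, 0.0, 0.0, 0.7]]))


def nonzero_ratio(matrix):
    """Return the fraction of stored entries in a sparse matrix."""
    n_rows, n_cols = matrix.shape
    return matrix.nnz / (n_rows * n_cols)


ratio = nonzero_ratio(X_demo)
print("{:.2%} of the entries are non-zero".format(ratio))
```

The same call on `rcv1_train.data` would give the density of the real feature matrix without ever converting it to a dense array.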

**Get train dataset corpus**

In [None]:
# RCV1 : Reuters Corpus Volume I 
from sklearn.datasets import fetch_rcv1
rcv1_train = fetch_rcv1(subset='train')

**Dump train dataset corpus**

In [None]:
import p5_util
file_name = file_name_train
p5_util.object_dump(rcv1_train, file_name)

In [None]:
rcv1_train.target_names.shape

**Get test dataset corpus**

In [None]:
# RCV1 : Reuters Corpus Volume I 
from sklearn.datasets import fetch_rcv1
rcv1_test = fetch_rcv1(subset='test')

In [None]:
rcv1_test.keys(),rcv1_test.data.shape

**Dump test dataset corpus**

In [None]:
import p5_util
data_path = "./data"
core_name = "rcv1_test"

p5_util.bunch_dump(rcv1_test, 100000, data_path, core_name)

**Load dumped test dataset corpus**

In [None]:
import p5_util
data_path = "./data"
core_name = "rcv1_test"
list_key = ['data', 'target', 'sample_id', 'target_names',]
data_len=781265
row_packet=100000
dict_rcv1_test = p5_util.bunch_load(list_key, data_len, row_packet, data_path, core_name)

# <font color='blus'>2. Applying a Multinomial Naive Bayes model</font>

In [2]:
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()

In [3]:
import p5_util
print(file_name_train+str("\n"))
rcv1_train = p5_util.object_load(file_name_train)

print(rcv1_train.keys())
print(rcv1_train.data.shape)

./data/rcv1_train.dump

p5_util.object_load : fileName= ./data/rcv1_train.dump
dict_keys(['data', 'target', 'sample_id', 'target_names', 'DESCR'])
(23149, 47236)


In [4]:
target_names = rcv1_train.target_names
target_names

array(['C11', 'C12', 'C13', 'C14', 'C15', 'C151', 'C1511', 'C152', 'C16',
       'C17', 'C171', 'C172', 'C173', 'C174', 'C18', 'C181', 'C182',
       'C183', 'C21', 'C22', 'C23', 'C24', 'C31', 'C311', 'C312', 'C313',
       'C32', 'C33', 'C331', 'C34', 'C41', 'C411', 'C42', 'CCAT', 'E11',
       'E12', 'E121', 'E13', 'E131', 'E132', 'E14', 'E141', 'E142',
       'E143', 'E21', 'E211', 'E212', 'E31', 'E311', 'E312', 'E313',
       'E41', 'E411', 'E51', 'E511', 'E512', 'E513', 'E61', 'E71', 'ECAT',
       'G15', 'G151', 'G152', 'G153', 'G154', 'G155', 'G156', 'G157',
       'G158', 'G159', 'GCAT', 'GCRIM', 'GDEF', 'GDIP', 'GDIS', 'GENT',
       'GENV', 'GFAS', 'GHEA', 'GJOB', 'GMIL', 'GOBIT', 'GODD', 'GPOL',
       'GPRO', 'GREL', 'GSCI', 'GSPO', 'GTOUR', 'GVIO', 'GVOTE', 'GWEA',
       'GWELF', 'M11', 'M12', 'M13', 'M131', 'M132', 'M14', 'M141',
       'M142', 'M143', 'MCAT'], dtype=object)

In [5]:
print(rcv1_train.target.shape)

(23149, 103)


In [8]:
rcv1_train.target[20].A

array([[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=uint8)

In [21]:
import numpy as np
rcv1_train.target[:10,0].A.reshape(1,-1)

target = rcv1_train.target[:1000]  # assumed definition, inferred from the (1000, 103) shape in the output
target.shape

(1000, 103)

In [23]:
data1 =rcv1_train.data[:1000,:500]
target1= target

In [27]:
target1[10].A

array([[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], dtype=uint8)

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.multiclass import OneVsRestClassifier

model = OneVsRestClassifier(MultinomialNB())
#model = MultinomialNB()


In [32]:
model

OneVsRestClassifier(estimator=MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
          n_jobs=1)

In [None]:
rcv1_train.target.shape, rcv1_train.target_names.shape

# <font color='blus'>XX. Train dataset pre-processing</font>

**Load dumped train dataset corpus**

In [None]:
import p5_util
print(file_name_train+str("\n"))
rcv1_train = p5_util.object_load(file_name_train)

print(rcv1_train.keys())
print(rcv1_train.data.shape)

## <font color='blue'>2.1 Target processing</font>

The target of the loaded dataset is a matrix with more than 100 columns, one per category.

The goal of the process below is to reduce the target's dimensions by turning each row into a single value, so that the target becomes a single vector.
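The idea can be sketched on a toy example: each distinct combination of active labels becomes one class. The matrix `target_demo` and the three names in `target_names_demo` are illustrative assumptions, not the real RCV1 target.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import LabelEncoder

# Toy multi-label indicator matrix (4 samples, 3 labels), NOT the RCV1 target.
target_demo = sparse.csr_matrix(np.array([[1, 0, 1],
                                          [0, 1, 0],
                                          [1, 0, 1],
                                          [0, 0, 1]], dtype=np.uint8))
target_names_demo = np.array(["CCAT", "ECAT", "GCAT"])

# One string per row, built from the names of the labels flagged with 1.
combined = [" ".join(target_names_demo[row == 1])
            for row in target_demo.toarray()]

# Each distinct combination of labels is mapped to one integer class.
encoder = LabelEncoder()
encoded = encoder.fit_transform(combined)

print(combined)
print(encoded)
```

This turns a multi-label problem into a multi-class one. The trade-off is that the number of classes can grow quickly, and a combination never seen during training cannot be predicted.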

In [None]:
rcv1_train.target.shape

In [None]:
import numpy as np
list_new_feature_target_name = list()
print( rcv1_train.target.A.shape)
for ind, array_tag in enumerate(rcv1_train.target.A):
    #---------------------------------------------------------------------------------
    # For each row, an index filter is built from the columns whose flag is set to 1.
    #---------------------------------------------------------------------------------
    index_filter = np.where(array_tag==1)

    #---------------------------------------------------------------------------------
    # The index filter is applied to target_names, which gives the list of target
    # names for this row.
    #---------------------------------------------------------------------------------
    list_raw_target_name = np.array(rcv1_train.target_names)[index_filter]

    #---------------------------------------------------------------------------------
    # The new target feature is created for the current row.
    #---------------------------------------------------------------------------------
    feature_target_name = ' '.join(list_raw_target_name)

    #---------------------------------------------------------------------------------
    # This new feature is stored
    #---------------------------------------------------------------------------------
    list_new_feature_target_name.append(feature_target_name)

In [None]:
list_new_feature_target_name[0]

In [None]:
import numpy as np

le = preprocessing.LabelEncoder()
list_new_feature_target_encoded=le.fit_transform(list_new_feature_target_name)
len(np.unique(list_new_feature_target_encoded)),len(rcv1_train.target.A)

## <font color='blue'>2.2 Dataset processing</font>

A PCA is run on a random sample of 1000 rows of the training matrix.
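Running PCA requires converting the sample to a dense matrix. As a hedged alternative, scikit-learn's `TruncatedSVD` accepts sparse input directly. It is not strictly PCA, because the data are not centered, but it gives a comparable view of how much variance the first components capture. The matrix `X_sparse_demo` is synthetic and the choice of 10 components is arbitrary.

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Synthetic sparse matrix standing in for a sample of `rcv1_train.data`.
X_sparse_demo = sparse.random(200, 100, density=0.05, format="csr",
                              random_state=0)

# Truncated SVD works on the sparse matrix without calling .todense().
svd = TruncatedSVD(n_components=10, random_state=0)
X_reduced = svd.fit_transform(X_sparse_demo)

# Cumulative share of variance explained by the first components.
cumulative = svd.explained_variance_ratio_.cumsum()
print(X_reduced.shape)
print(cumulative)
```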

In [None]:
rcv1_train.data.shape[0]


In [None]:
import numpy as np

# replace=False draws 1000 distinct rows (the default samples with replacement)
index_filter=np.random.choice(np.arange(0,rcv1_train.data.shape[0]), size=1000, replace=False)
part_rcv1_train = rcv1_train.data[index_filter]

In [None]:
type(part_rcv1_train)

In [None]:
import p3_util_plot
z__ = p3_util_plot.pca_all_plot(part_rcv1_train.todense(), plot=True)

The training corpus is split into two parts: train and test.
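The helper `split_train_test` cuts the rows in order, which keeps the corpus's chronological ordering. Where a shuffled, stratified split is preferred instead, scikit-learn's `train_test_split` handles sparse matrices too. The sketch below uses synthetic data (`X_demo`, `y_demo` are assumptions).

```python
import numpy as np
from scipy import sparse
from sklearn.model_selection import train_test_split

# Synthetic sparse features and a balanced 4-class target.
X_demo = sparse.random(100, 20, density=0.1, format="csr", random_state=0)
y_demo = np.arange(100) % 4

# 70% train / 30% test, shuffled, preserving the class proportions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=0, stratify=y_demo)

print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```

Stratification only works when every class has at least two samples, which may not hold once label combinations are encoded as single classes.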

In [None]:
train_ratio=0.7
X_train, X_test, y_train, y_test= split_train_test(rcv1_train.data,rcv1_train.target, train_ratio)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

In [None]:
import numpy as np

In [None]:
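# Fit a single encoder on the full target so train and test share the same codes
y_all, le_target = get_encoded_target(rcv1_train.target)
n_train = y_train.shape[0]
y_train, y_test = y_all[:n_train], y_all[n_train:]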

**Random Forest classifier**

In [None]:
from sklearn.ensemble import RandomForestClassifier

nb_estimators = 10
rfc = RandomForestClassifier(n_estimators=nb_estimators)
rfc_model = rfc.fit(X_train, y_train)

In [None]:
y_predict = rfc_model.predict(X_test)

In [None]:
len(y_predict),X_test.shape

In [None]:
import numpy as np

type(y_train)
list_target = np.unique(y_train)
len(list_target), len(np.unique(y_test))

**Performance**:
* score_global: the mean accuracy over all test samples
* dict_score_target: the accuracy of the predictions for each target
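Because `compute_accuracy_per_target` filters on the predicted label, its per-target score corresponds to per-class precision. The sketch below, on hand-written toy labels (assumptions), shows how the global accuracy, per-class precision and per-class recall can be read from `sklearn.metrics`.

```python
import numpy as np
from sklearn import metrics

# Toy labels, NOT results from this notebook.
y_true_demo = np.array([0, 0, 1, 1, 2, 2])
y_pred_demo = np.array([0, 1, 1, 1, 2, 0])

labels = np.unique(y_true_demo)

# Share of all samples that are correctly classified.
global_accuracy = metrics.accuracy_score(y_true_demo, y_pred_demo)

# Among samples predicted as class k, share that truly belong to class k.
precision = metrics.precision_score(
    y_true_demo, y_pred_demo, labels=labels, average=None)

# Among samples truly in class k, share that are predicted as class k.
recall = metrics.recall_score(
    y_true_demo, y_pred_demo, labels=labels, average=None)

print(global_accuracy)
print(dict(zip(labels, precision)))
print(dict(zip(labels, recall)))
```

Reporting recall alongside precision shows which targets the model tends to miss, which the per-target score alone does not reveal.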

In [None]:
score_global, dict_score_target = compute_accuracy_per_target(y_test, y_predict,list_target)

In [None]:
import numpy as np

np.unique(y_predict)

In [None]:
print("Mean accuracy= {}".format(score_global))
dict_score_target

**SGD classifier**

Stochastic Gradient Descent Classifier

In [None]:
from sklearn.linear_model import SGDClassifier
import p5_util

sgdcl = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=5, tol=None)
sgdcl_model = sgdcl.fit(X_train, y_train)

y_predict = sgdcl_model.predict(X_test)
score_global, dict_score_target = p5_util.compute_precision_per_segment(y_test, y_predict,list_target)

In [None]:
print("Mean accuracy= {}".format(score_global))
dict_score_target

**Naive Bayes classifier**

**Loading test bunch**

In [None]:
import p5_util
bunch_data = p5_util.object_load_split(800000,100000,'./data','rcv1_test')