# LIVRIA : multilabel classification


Ce notebook comprend le code implementant notre modèle de régression linéaire pour la prédiction des thèmes susceptibles de plaire à l'utilisateur en fonction des critères d'entrée qu'il aura renseignés. Je précise que cela ne compose qu'une partie des prédictions? En effet, ce modèle sera complété par du filtrage collaboratif pour les thèmes mais aussi pour les livres du set de données de Goodbooks-10k.

### Import des librairies et des données 

In [1]:
# Librairies
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import math
%matplotlib inline

# Données
dataCrit = pd.read_csv('data/df_entree.csv', sep='\t')
dataTheme = pd.read_csv('data/df_sortie.csv', sep='\t')
# On supprime la colonne inutile :
del dataTheme['Unnamed: 0']
del dataCrit['Unnamed: 0']

In [2]:
dataCrit.head(10)

Unnamed: 0,Agite,Altruiste,Ambitieux,Amusant,Autoritaire,Aventurier,Calme,Connaissance,Consciencieux,Creatif,...,Reserve,Rien faire,Sexe,Sociable,Sport,Sportif,Style,Theatre,Tout,Voyage
0,0,0,0,0,0,0,1,0,1,1,...,0,0,1,0,0,0,0,0,1,1
1,0,1,0,0,0,1,1,0,1,0,...,0,1,1,0,1,0,0,0,0,1
2,0,1,0,0,0,0,0,0,0,0,...,0,0,0,1,1,1,0,0,0,1
3,0,0,0,1,0,0,0,0,0,0,...,0,0,1,0,1,1,0,0,0,1
4,0,0,0,1,0,0,0,1,0,0,...,0,0,1,1,1,1,0,0,0,0
5,0,0,1,0,0,0,1,0,0,0,...,1,0,0,0,1,0,0,0,0,1
6,1,0,0,1,0,0,0,1,1,1,...,0,0,0,1,1,0,0,0,0,1
7,0,0,1,1,0,1,1,0,0,1,...,0,0,1,0,0,0,0,0,0,0
8,0,1,1,0,0,0,0,0,0,1,...,0,0,0,1,0,0,1,0,0,1
9,0,1,0,0,0,0,0,1,1,0,...,0,1,0,1,1,0,1,1,0,0


In [3]:
dataTheme.head(10)

Unnamed: 0,ArtsCulture,BdComics,DocMedia,Erotisme,Esoterisme,HistGeo,Jeunesse,LittEtrangere,LoisirVie,Philosophie,RomanFiction,SHS,SanteBE,ScienceTechnique
0,0,1,0,0,0,0,0,0,0,0,1,0,0,0
1,0,0,0,0,0,0,0,1,0,0,1,0,0,0
2,0,0,0,0,0,0,1,1,0,0,1,0,0,0
3,0,0,0,0,0,0,0,0,0,0,1,0,0,0
4,0,1,1,0,0,1,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,1,1,0,0
6,1,0,1,1,0,0,0,1,0,1,1,0,0,0
7,0,0,0,0,1,0,0,0,0,1,0,1,0,0
8,0,0,0,0,0,0,0,0,0,0,1,0,0,0
9,0,1,0,0,0,1,0,1,0,0,1,1,0,0


On créer un set qui réunit toutes les données

In [20]:
df_entier = pd.concat([dataCrit,dataTheme], axis=1)
df_entier.head(10)

Unnamed: 0,Agite,Altruiste,Ambitieux,Amusant,Autoritaire,Aventurier,Calme,Connaissance,Consciencieux,Creatif,...,Esoterisme,HistGeo,Jeunesse,LittEtrangere,LoisirVie,Philosophie,RomanFiction,SHS,SanteBE,ScienceTechnique
0,0,0,0,0,0,0,1,0,1,1,...,0,0,0,0,0,0,1,0,0,0
1,0,1,0,0,0,1,1,0,1,0,...,0,0,0,1,0,0,1,0,0,0
2,0,1,0,0,0,0,0,0,0,0,...,0,0,1,1,0,0,1,0,0,0
3,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,0,0,0,1,0,0,0,1,0,0,...,0,1,0,0,0,0,0,0,0,0
5,0,0,1,0,0,0,1,0,0,0,...,0,0,0,0,0,0,1,1,0,0
6,1,0,0,1,0,0,0,1,1,1,...,0,0,0,1,0,1,1,0,0,0
7,0,0,1,1,0,1,1,0,0,1,...,1,0,0,0,0,1,0,1,0,0
8,0,1,1,0,0,0,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
9,0,1,0,0,0,0,0,1,1,0,...,0,1,0,1,0,0,1,1,0,0


In [23]:
# On vérifie qu'il n'y a pas de valeur manquante dans notre dataFrame
df_entier.isnull().sum()

Agite               0
Altruiste           0
Ambitieux           0
Amusant             0
Autoritaire         0
Aventurier          0
Calme               0
Connaissance        0
Consciencieux       0
Creatif             0
Cuisine             0
Curieux             0
Dessin              0
Esprit              0
FacileLire          0
Geek                0
Intellectuel        0
Introverti          0
Jaloux              0
Jeux videos         0
Meditation          0
Pantouflard         0
Personnage          0
Reflechir           0
Reserve             0
Rien faire          0
Sexe                0
Sociable            0
Sport               0
Sportif             0
Style               0
Theatre             0
Tout                0
Voyage              0
ArtsCulture         0
BdComics            0
DocMedia            0
Erotisme            0
Esoterisme          0
HistGeo             0
Jeunesse            0
LittEtrangere       0
LoisirVie           0
Philosophie         0
RomanFiction        0
SHS       

Pas de NaN values dans nos sets donc nous pouvons continuer.

## Entrainement du modèle

On créer les sets de test et d'entrainement

In [7]:
from sklearn.model_selection import train_test_split

In [25]:
train, test = train_test_split(df_entier, test_size=0.30, shuffle=True)

x_train = train.drop(['ArtsCulture','BdComics','DocMedia','Erotisme','HistGeo','HistGeo','Jeunesse','LittEtrangere','LoisirVie','Philosophie','RomanFiction','SHS','SanteBE','ScienceTechnique'],axis=1)
y_train = train.drop(['Agite','Altruiste','Ambitieux','Amusant','Autoritaire','Aventurier','Calme','Connaissance','Consciencieux','Creatif','Cuisine','Curieux','Dessin','Esprit','FacileLire','Geek','Intellectuel','Introverti','Jaloux','Jeux videos','Meditation','Pantouflard','Personnage','Reflechir','Reserve','Rien faire','Sexe','Sociable','Sport','Sportif','Style','Theatre','Tout','Voyage'], axis=1)
x_test = test.drop(['ArtsCulture','BdComics','DocMedia','Erotisme','HistGeo','HistGeo','Jeunesse','LittEtrangere','LoisirVie','Philosophie','RomanFiction','SHS','SanteBE','ScienceTechnique'],axis=1)
y_test = test.drop(['Agite','Altruiste','Ambitieux','Amusant','Autoritaire','Aventurier','Calme','Connaissance','Consciencieux','Creatif','Cuisine','Curieux','Dessin','Esprit','FacileLire','Geek','Intellectuel','Introverti','Jaloux','Jeux videos','Meditation','Pantouflard','Personnage','Reflechir','Reserve','Rien faire','Sexe','Sociable','Sport','Sportif','Style','Theatre','Tout','Voyage'], axis=1)

## Regression logistique

In [34]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier

In [35]:
# On utilise un pipeline pour utiliser la régression logistique sur ce problème de classification multitache
LogReg_pipeline = Pipeline([
                ('clf', OneVsRestClassifier(LogisticRegression(solver='sag'), n_jobs=-1)),
            ])
themes = list(dataTheme.columns.values)

for theme in themes :
    print('**Processing {} comments...**'.format(theme))
    
    # Training logistic regression model on train data
    LogReg_pipeline.fit(x_train, train[theme])
    
    # calculating test accuracy
    prediction = LogReg_pipeline.predict(x_test)
    print('Test accuracy is {}'.format(accuracy_score(y_test[theme], prediction)))
    print("\n")

**Processing ArtsCulture comments...**
Test accuracy is 0.8385416666666666


**Processing BdComics comments...**
Test accuracy is 0.7838541666666666


**Processing DocMedia comments...**
Test accuracy is 0.8541666666666666


**Processing Erotisme comments...**
Test accuracy is 0.8776041666666666


**Processing Esoterisme comments...**
Test accuracy is 1.0


**Processing HistGeo comments...**
Test accuracy is 0.7890625


**Processing Jeunesse comments...**
Test accuracy is 0.8046875


**Processing LittEtrangere comments...**
Test accuracy is 0.6510416666666666


**Processing LoisirVie comments...**
Test accuracy is 0.90625


**Processing Philosophie comments...**
Test accuracy is 0.8255208333333334


**Processing RomanFiction comments...**
Test accuracy is 0.8958333333333334


**Processing SHS comments...**
Test accuracy is 0.7552083333333334


**Processing SanteBE comments...**
Test accuracy is 0.8645833333333334


**Processing ScienceTechnique comments...**
Test accuracy is 0.95572916