# Project 5: Automatically categorize questions
## Context and objectives
The Stack Overflow site lets users ask questions about computer programming. To classify questions, users must supply tags so that questions can be found more easily. To help users, the goal of this project is to suggest tags based on the content of the question.  
After exploring the data and testing different models to segment it, code will be deployed to create an API usable by Stack Overflow.
## Segmentation test notebook
In this notebook, several segmentation tests are carried out, supervised or unsupervised, with different data pre-processing methods.

## Python modules

In [13]:
# Mount Google Drive only when running on Colab; skip the step elsewhere.
try:
    from google.colab import drive
    drive.mount('/content/drive')
except ImportError:
    pass

In [14]:
import numpy as np
import pandas as pd
import nltk
import re
import os
import pickle
import time
from collections import defaultdict
import matplotlib.pyplot as plt

from sklearn.decomposition import LatentDirichletAllocation, NMF, PCA
from sklearn.manifold import TSNE
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import sklearn.model_selection
from sklearn.preprocessing import MultiLabelBinarizer, LabelEncoder
from sklearn.multioutput import MultiOutputClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, log_loss,jaccard_score

try:
    import pyLDAvis.sklearn
except ImportError:
    !pip install pyLDAvis
    import pyLDAvis.sklearn

#!pip install gensim==4.1.2
#import gensim

import tensorflow as tf
import tensorflow.keras
from tensorflow.keras import backend as K
import tensorflow.keras.models
import tensorflow_hub as hub

import sys
sys.path.append("/content/drive/Othercomputers/Mon ordinateur portable/P5_stackoverflow/")
import tokenization
#tokenization.py
import logging
logging.disable(logging.WARNING)

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('omw-1.4')
pyLDAvis.enable_notebook()

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\erwan\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\erwan\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\erwan\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\erwan\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\erwan\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


## Utility functions

In [15]:
def tokenize_lemmat(txt):
    """Strip <b>/<p> tags, digits and underscores, then lemmatize each word."""
    # Map the first letter of a Penn Treebank tag to a WordNet POS (default: noun).
    tag_map = defaultdict(lambda: nltk.corpus.wordnet.NOUN)
    tag_map['J'] = nltk.corpus.wordnet.ADJ
    tag_map['V'] = nltk.corpus.wordnet.VERB
    tag_map['R'] = nltk.corpus.wordnet.ADV

    lemmatizer = nltk.stem.WordNetLemmatizer()
    # gaps=True: the pattern matches the separators (the tags), not the tokens.
    tag_tokenizer = nltk.RegexpTokenizer(r'</?(?:b|p)>', gaps=True)
    txt_tokenizer = nltk.RegexpTokenizer(r'\w+')

    txt = ''.join([i for i in txt if not i.isdigit()])
    txt = re.sub(r'_+', ' ', txt)
    words = txt_tokenizer.tokenize(' '.join(tag_tokenizer.tokenize(txt.lower())))
    out = [lemmatizer.lemmatize(token, tag_map[tag[0]]) for token, tag in nltk.pos_tag(words)]
    return ' '.join(out)
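
To make the cleaning steps that precede lemmatization easier to follow, here is a minimal standard-library sketch of them. `clean_tokens` is a hypothetical helper, not part of the notebook; it assumes that `re.split` and `re.findall` behave like the two `RegexpTokenizer` calls for this purpose. Only `<b>` and `<p>` tags are removed, so the names and attributes of other tags (for example `a`, `href`) survive as ordinary words.

```python
import re

def clean_tokens(txt):
    """Hypothetical helper: the cleaning steps of tokenize_lemmat, without lemmatization."""
    # Drop every digit character.
    txt = ''.join(ch for ch in txt if not ch.isdigit())
    # Underscores count as word characters for \w, so turn them into spaces first.
    txt = re.sub(r'_+', ' ', txt)
    # Split on <b>, </b>, <p>, </p>; the tags act as separators and are discarded.
    chunks = [c for c in re.split(r'</?(?:b|p)>', txt.lower()) if c]
    # Keep runs of word characters only.
    return re.findall(r'\w+', ' '.join(chunks))

tokens = clean_tokens('<p>Call <b>my_func2</b> via <a href="x">this link</a></p>')
print(tokens)
# ['call', 'my', 'func', 'via', 'a', 'href', 'x', 'this', 'link', 'a']
```

The leftover `a` and `href` tokens in this toy example are consistent with the HTML-related words that show up in the topic listings further down.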

In [16]:
def dummy(doc) :
    return doc

In [17]:
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic {}:".format(topic_idx))
        print(" ".join([feature_names[i] for i in topic.argsort()[:-no_top_words - 1:-1]]))

In [223]:
def do_lda(docs, max_df=1.0, min_df=1, max_features=1000, n_topics=5):
    # max_df=1.0 (float) keeps terms present in up to 100% of the documents;
    # min_df=1 (int) keeps terms present in at least one document.
    tf_vectorizer = CountVectorizer(max_df=max_df, min_df=min_df, max_features=max_features, stop_words='english')
    tf = tf_vectorizer.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics, max_iter=50, learning_method='online', learning_offset=50., random_state=42)
    lda.fit(tf)
    n_top_words = 20
    try:
        # scikit-learn >= 1.0
        display_topics(lda, tf_vectorizer.get_feature_names_out(), n_top_words)
    except AttributeError:
        # older scikit-learn releases
        display_topics(lda, tf_vectorizer.get_feature_names(), n_top_words)
    return lda, tf, tf_vectorizer

In [247]:
def do_nmf(docs, max_df=1.0, min_df=1, max_features=1000, n_topics=5):
    # max_df=1.0 (float) keeps terms present in up to 100% of the documents;
    # min_df=1 (int) keeps terms present in at least one document.
    tf_vectorizer = TfidfVectorizer(max_df=max_df, min_df=min_df, max_features=max_features, stop_words='english')
    tf = tf_vectorizer.fit_transform(docs)
    nmf = NMF(n_components=n_topics, random_state=42)
    nmf.fit(tf)
    n_top_words = 20
    try:
        # scikit-learn >= 1.0
        display_topics(nmf, tf_vectorizer.get_feature_names_out(), n_top_words)
    except AttributeError:
        # older scikit-learn releases
        display_topics(nmf, tf_vectorizer.get_feature_names(), n_top_words)
    return nmf, tf, tf_vectorizer

In [20]:
def mymetrics(y_true, y_pred):
    # Positions that are positive in the ground truth.
    ma = np.where(y_true != 0)
    # True positives: predicted positive and actually positive.
    ntrue = np.count_nonzero(y_pred[ma] == 1)
    # False negatives: predicted negative but actually positive.
    nfalseneg = np.count_nonzero(y_pred[ma] == 0)
    # False positives: predicted positive but actually negative.
    ma = np.where(y_true == 0)
    nfalsepos = np.count_nonzero(y_pred[ma] == 1)
    return ntrue, nfalseneg, nfalsepos
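
The three counts returned by `mymetrics` are enough to derive the usual scores. The sketch below is one assumption about how they might be combined, not something this notebook does; `scores_from_counts` is a hypothetical name. When the counts are pooled over all tags, the resulting values are micro-averaged.

```python
def scores_from_counts(ntrue, nfalseneg, nfalsepos):
    """Hypothetical helper: precision, recall and Jaccard index from confusion counts."""
    predicted_pos = ntrue + nfalsepos
    actual_pos = ntrue + nfalseneg
    union = ntrue + nfalseneg + nfalsepos
    # Guard against empty denominators (no predicted or no actual positives).
    precision = ntrue / predicted_pos if predicted_pos else 0.0
    recall = ntrue / actual_pos if actual_pos else 0.0
    jaccard = ntrue / union if union else 0.0
    return precision, recall, jaccard

precision, recall, jaccard = scores_from_counts(ntrue=2, nfalseneg=1, nfalsepos=1)
print(precision, recall, jaccard)
```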

## Loading the data
The data cleaned in the previous notebook is reloaded.  
Loading is done from the pickle file to avoid the processing needed to handle list-type columns.
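
A short illustration of why pickle is convenient here, using only the standard library and a made-up row: a list written to CSV comes back as a string and has to be parsed again, whereas pickle restores the list directly. The row content is an assumption for the example, not data from the project.

```python
import ast
import csv
import io
import pickle

row = {"Id": 1, "Tags_list": ["<python>", "<django>"]}  # made-up example row

# CSV round trip: the list is written as its string representation.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
buffer.seek(0)
from_csv = next(csv.DictReader(buffer))
print(type(from_csv["Tags_list"]))                     # a str, not a list
tags_parsed = ast.literal_eval(from_csv["Tags_list"])  # extra parsing step

# Pickle round trip: Python objects come back with their original types.
from_pickle = pickle.loads(pickle.dumps(row))
print(type(from_pickle["Tags_list"]))
```

Pickle files should only be loaded from a trusted source, since unpickling can execute arbitrary code.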

In [21]:
#DIR = "/content/drive/Othercomputers/Mon ordinateur portable/P5_stackoverflow/"
DIR = "./features"

In [22]:
with open(DIR+"/database_cleaned.pkl", 'rb') as ifile :
    DATA = pickle.load(ifile)
with open(DIR+"/database_20tags_cleaned.pkl", 'rb') as ifile :
    DATA_20tags = pickle.load(ifile)

In [221]:
DATA.head()

Unnamed: 0,Title,Body,Tags,Id,Score,ViewCount,FavoriteCount,AnswerCount,Tags_list,Body_words,Title_words,Body_nwords,Body_words_lemmat,Body_nwords_lemmat,Body_words_noSW,Body_words_lemmat_noSW,Body_sentence_lemmat
0,SQL Server 2008 Full Text Search (FTS) versus ...,<p>I know there have been questions in the pas...,<sql-server><sql-server-2008><full-text-search...,499247,40,18582,26,5,"[<sql-server-2008>, <full-text-search>, <lucen...","[i, know, there, have, been, questions, in, th...","[sql, server, 2008, full, text, search, fts, v...",42,"[i, know, there, have, be, question, in, the, ...",42,"[know, questions, past, sql, versus, lucene, n...","[know, question, past, sql, versus, lucene, ne...",i know there have be question in the past abou...
1,XML Serialization and Inherited Types,"<p>Following on from my <a href=""https://stack...",<c#><xml><inheritance><serialization><xml-seri...,20084,86,56816,42,7,"[<serialization>, <c#>, <xml>, <inheritance>]","[following, on, from, my, a, href, https, stac...","[xml, serialization, and, inherited, types]",279,"[follow, on, from, my, a, href, http, stackove...",279,"[following, href, https, stackoverflow, com, q...","[follow, href, http, stackoverflow, com, quest...",follow on from my a href http stackoverflow co...
2,MyISAM versus InnoDB,<p>I'm working on a projects which involves a ...,<mysql><database><performance><innodb><myisam>,20148,887,301985,390,25,"[<performance>, <database>, <mysql>]","[i, m, working, on, a, projects, which, involv...","[myisam, versus, innodb]",146,"[i, m, work, on, a, project, which, involve, a...",146,"[working, projects, involves, lot, database, w...","[work, project, involve, lot, database, write,...",i m work on a project which involve a lot of d...
3,Recommended SQL database design for tags or ta...,<p>I've heard of a few ways to implement taggi...,<sql><database-design><tags><data-modeling><ta...,20856,325,118552,307,6,"[<sql>, <database-design>, <data-modeling>, <t...","[i, ve, heard, of, a, few, ways, to, implement...","[recommended, sql, database, design, for, tags...",82,"[i, ve, heard, of, a, few, way, to, implement,...",82,"[heard, ways, implement, tagging, using, mappi...","[heard, way, implement, tag, use, mapping, tab...",i ve heard of a few way to implement tag use a...
4,Specifying a mySQL ENUM in a Django model,<p>How do I go about specifying and using an E...,<python><mysql><django><django-models><enums>,21454,99,61572,21,9,"[<django-models>, <python>, <enums>, <django>,...","[how, do, i, go, about, specifying, and, using...","[specifying, a, mysql, enum, in, a, django, mo...",14,"[how, do, i, go, about, specify, and, use, an,...",14,"[go, specifying, using, enum, django, model]","[go, specify, use, enum, django, model]",how do i go about specify and use an enum in a...


In [24]:
DATA.describe()

Unnamed: 0,Id,Score,ViewCount,FavoriteCount,AnswerCount,Body_nwords,Body_nwords_lemmat
count,27338.0,27338.0,27338.0,27338.0,27338.0,27338.0,27338.0
mean,16520780.0,113.008523,109089.1,42.724888,7.117236,205.265784,205.265784
std,15073700.0,346.850157,241333.3,146.18605,6.73504,230.235349,230.235349
min,4.0,6.0,261.0,11.0,1.0,4.0,4.0
25%,4144024.0,29.0,21266.25,14.0,3.0,79.0,79.0
50%,11803220.0,51.0,47806.0,19.0,5.0,139.0,139.0
75%,25597520.0,98.0,109441.5,35.0,9.0,246.0,246.0
max,70926800.0,26377.0,9893978.0,11586.0,126.0,4192.0,4192.0


## Unsupervised segmentation

### Data preparation

In [25]:
DATA['Body_sentence_lemmat'] = DATA["Body"].apply(tokenize_lemmat)

In [26]:
DATA_20tags['Body_sentence_lemmat'] = DATA_20tags["Body"].apply(tokenize_lemmat)

### LDA

#### Searching for the optimal parameters

##### Influence of max_df

In [224]:
max_df = 1.
print("="*50)
print(f"max_df = {max_df}")
print("="*50)
lda, tf, tf_vectorizer = do_lda(DATA["Body_sentence_lemmat"].values, max_df=max_df, min_df=1, max_features=1000, n_topics=5)
pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer)

max_df = 1.0
Topic 0:
code android pre image com self view stack src http imgur png img lib layout app alt python file height
Topic 1:
code pre quot java string public class new return id method object org use value data void private null int
Topic 2:
gt lt code pre int class amp div function type std id value html td foo text return const item
Topic 3:
code file use http user pre server error com app application request project run try work web strong build client
Topic 4:
code li strong use http href em rel noreferrer com question like ul time way data example need know work




In [225]:
max_df = 0.9
print("="*50)
print(f"max_df = {max_df}")
print("="*50)
lda, tf, tf_vectorizer = do_lda(DATA['Body_sentence_lemmat'].values, max_df=max_df, min_df=1, max_features=1000, n_topics=5)
pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer)

max_df = 0.9
Topic 0:
code android pre image com self view stack src http imgur png img lib layout app alt python file height
Topic 1:
code pre quot java string public class new return id method object org use value data void private null int
Topic 2:
gt lt code pre int class amp div function type std id value html td foo text return const item
Topic 3:
code file use http user pre server error com app application request project run try work web strong build client
Topic 4:
code li strong use http href em rel noreferrer com question like ul time way data example need know work




In [226]:
max_df = 0.5
print("="*50)
print(f"max_df = {max_df}")
print("="*50)
lda, tf, tf_vectorizer = do_lda(DATA['Body_sentence_lemmat'].values, max_df=max_df, min_df=1, max_features=1000, n_topics=5)
pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer)

max_df = 0.5
Topic 0:
file error run build project python server lib try module version test command app node user package path local work
Topic 1:
string public return class int new value data object function strong type method like void id amp list array time
Topic 2:
gt lt quot android java class org id div layout com spring td xml http lang item springframework width content
Topic 3:
user app request view work strong page http self image like try function new application want data url var server
Topic 4:
li http strong com href rel noreferrer em ul question stack imgur image like stackoverflow ol org img png alt




The segmentation barely changes down to max_df=0.9. This value produces distinct clusters.  
Lowering max_df further does not filter the words any better (verbs and adjectives are not reduced).

##### Influence of min_df

In [227]:
max_df=0.9
min_df = 1 
print("="*50)
print(f"min_df = {min_df}")
print("="*50)
lda, tf, tf_vectorizer = do_lda(DATA["Body_sentence_lemmat"].values, max_df=max_df, min_df=min_df, max_features=1000, n_topics=5)
pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer)

min_df = 1
Topic 0:
code android pre image com self view stack src http imgur png img lib layout app alt python file height
Topic 1:
code pre quot java string public class new return id method object org use value data void private null int
Topic 2:
gt lt code pre int class amp div function type std id value html td foo text return const item
Topic 3:
code file use http user pre server error com app application request project run try work web strong build client
Topic 4:
code li strong use http href em rel noreferrer com question like ul time way data example need know work




In [228]:
max_df=0.9
min_df = 5
print("="*50)
print(f"min_df = {min_df}")
print("="*50)
lda, tf, tf_vectorizer = do_lda(DATA["Body_sentence_lemmat"].values, max_df=max_df, min_df=min_df, max_features=1000, n_topics=5)
pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer)

min_df = 5
Topic 0:
code android pre image com self view stack src http imgur png img lib layout app alt python file height
Topic 1:
code pre quot java string public class new return id method object org use value data void private null int
Topic 2:
gt lt code pre int class amp div function type std id value html td foo text return const item
Topic 3:
code file use http user pre server error com app application request project run try work web strong build client
Topic 4:
code li strong use http href em rel noreferrer com question like ul time way data example need know work




In [229]:
max_df=0.9
min_df = 50
print("="*50)
print(f"min_df = {min_df}")
print("="*50)
lda, tf, tf_vectorizer = do_lda(DATA["Body_sentence_lemmat"].values, max_df=max_df, min_df=min_df, max_features=1000, n_topics=5)
pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer)

min_df = 50
Topic 0:
code file pre error run use python build app try server lib module user version project command path test node
Topic 1:
gt lt quot java pre class code org div id public html lang type td spring xml http value springframework
Topic 2:
android code image com view stack self pre imgur png http img src layout alt app width height text color
Topic 3:
li http strong href use com rel noreferrer em ul question like user html work application need way stackoverflow ol
Topic 4:
code pre use string return function data new int strong value class public like object method type id amp time




In [230]:
max_df=0.9
min_df = 100
print("="*50)
print(f"min_df = {min_df}")
print("="*50)
lda, tf, tf_vectorizer = do_lda(DATA["Body_sentence_lemmat"].values, max_df=max_df, min_df=min_df, max_features=1000, n_topics=5)
pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer)

min_df = 100
Topic 0:
code file pre error run use build python try app server lib project module version test command user path node
Topic 1:
android code image view com stack pre imgur png img src layout http self alt app width height id color
Topic 2:
gt lt quot java class pre org code div id spring html td type lang xml com http springframework value
Topic 3:
li http strong href use com rel noreferrer em ul question like user html work need request code stackoverflow application
Topic 4:
code pre use string return new public function data class int value strong object like id type method amp list




The clusters start to overlap from min_df = 50 onwards.  
This indicates that it is preferable to keep a fairly low value for this parameter.

##### Influence of max_features

In [231]:
max_df = 0.9
min_df = 2
max_features = 100000
print("="*50)
print(f"max_features = {max_features}")
print("="*50)
lda, tf, tf_vectorizer = do_lda(DATA["Body_sentence_lemmat"].values, max_df=max_df, min_df=min_df, max_features=max_features, n_topics=5)
pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer)

max_features = 100000
Topic 0:
li http strong href use com rel noreferrer em code question ul like work html ol need stackoverflow way know
Topic 1:
android code java file pre error com org run app build lib version module python layout package project http try
Topic 2:
code pre use user data file function like strong request try work id want return value new string error way
Topic 3:
code pre image self stack imgur png img view alt src com size http use height width color strong text
Topic 4:
gt lt code pre quot class public int string new return id type void div amp private value std static




In [232]:
max_df = 0.9
min_df = 2
max_features = 5000
print("="*50)
print(f"max_features = {max_features}")
print("="*50)
lda, tf, tf_vectorizer = do_lda(DATA["Body_sentence_lemmat"].values, max_df=max_df, min_df=min_df, max_features=max_features, n_topics=5)
pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer)

max_features = 5000
Topic 0:
li strong http use href com rel noreferrer em code question ul like way need work know data ol just
Topic 1:
code pre use function data return file self like value string python var try int strong work array want object
Topic 2:
code file user error pre app server use http run build request lib com try module version project application command
Topic 3:
quot android image com http src view stack imgur png img layout code alt text button app width height id
Topic 4:
gt lt code pre class public java id string org new int return type private void std value div method




In [233]:
max_df = 0.9
min_df = 2
max_features = 1000
print("="*50)
print(f"max_features = {max_features}")
print("="*50)
lda, tf, tf_vectorizer = do_lda(DATA["Body_sentence_lemmat"].values, max_df=max_df, min_df=min_df, max_features=max_features, n_topics=5)
pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer)

max_features = 1000
Topic 0:
code android pre image com self view stack src http imgur png img lib layout app alt python file height
Topic 1:
code pre quot java string public class new return id method object org use value data void private null int
Topic 2:
gt lt code pre int class amp div function type std id value html td foo text return const item
Topic 3:
code file use http user pre server error com app application request project run try work web strong build client
Topic 4:
code li strong use http href em rel noreferrer com question like ul time way data example need know work




In [234]:
max_df = 0.9
min_df = 2
max_features = 500
print("="*50)
print(f"max_features = {max_features}")
print("="*50)
lda, tf, tf_vectorizer = do_lda(DATA["Body_sentence_lemmat"].values, max_df=max_df, min_df=min_df, max_features=max_features, n_topics=5)
pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer)

max_features = 500
Topic 0:
code pre use function return strong string int new value file like class try data object method type work self
Topic 1:
li http href com strong use rel noreferrer file ul em code question run error project build library org lib
Topic 2:
gt lt java pre class public code org div string id td private type value lang xml html spring springframework
Topic 3:
use user data id request strong server model application like new app table page database create client need want json
Topic 4:
quot android image com http stack view src imgur png code img layout alt app width height text id color




In [235]:
max_df = 0.9
min_df = 2
max_features = 300
print("="*50)
print(f"max_features = {max_features}")
print("="*50)
lda, tf, tf_vectorizer = do_lda(DATA["Body_sentence_lemmat"].values, max_df=max_df, min_df=min_df, max_features=max_features, n_topics=5)
pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer)

max_features = 300
Topic 0:
http com href rel noreferrer code java org file error stack use src www python build imgur png run lib
Topic 1:
li strong use em user ul like server application work need file data app request question way want know ol
Topic 2:
quot android view image id text layout app width height item button color content parent bar background style event title
Topic 3:
gt lt class pre div std code id amp td html type value script form xml data body version text
Topic 4:
code pre use return string public new class function int data value strong like try object self method var type




In [236]:
max_df = 0.9
min_df = 2
max_features = 100
print("="*50)
print(f"max_features = {max_features}")
print("="*50)
lda, tf, tf_vectorizer = do_lda(DATA["Body_sentence_lemmat"].values, max_df=max_df, min_df=min_df, max_features=max_features, n_topics=5)
pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer)

max_features = 100
Topic 0:
http com href rel noreferrer org java html stack www question stackoverflow use strong src net api example blockquote web
Topic 1:
li strong em use ul user data like table need thread question time way want know work make create application
Topic 2:
file android app image use error run view build try test user project application server python import version line work
Topic 3:
gt lt id pre class code amp type html value int data text version public true content string table return
Topic 4:
code pre quot use return string new class public function data int value like self try var object method type




In [237]:
max_df = 0.9
min_df = 2
max_features = 50
print("="*50)
print(f"max_features = {max_features}")
print("="*50)
lda, tf, tf_vectorizer = do_lda(DATA["Body_sentence_lemmat"].values, max_df=max_df, min_df=min_df, max_features=max_features, n_topics=5)
pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer)

max_features = 50
Topic 0:
code pre use string return public new class function int value like try work method type set error want add
Topic 1:
file android java image error test com org run app http try pre id application new method use class create
Topic 2:
gt lt class pre id type value code int data org http string public function add return strong error work
Topic 3:
quot strong use user data app application like need create work time want way new try run make id set
Topic 4:
li http href com strong em rel noreferrer use ul question org like work code need make image way try




In [238]:
max_df = 0.9
min_df = 2
max_features = 30
print("="*50)
print(f"max_features = {max_features}")
print("="*50)
lda, tf, tf_vectorizer = do_lda(DATA["Body_sentence_lemmat"].values, max_df=max_df, min_df=min_df, max_features=max_features, n_topics=5)
pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer)

max_features = 30
Topic 0:
strong java em data use public new string class return like work try error id pre function code com http
Topic 1:
li http com href android rel noreferrer use strong like work try em new code id data function user class
Topic 2:
code pre use function like try return work error new class string id data strong http em com href user
Topic 3:
file quot user error id use data try work pre like new return string function class code http com strong
Topic 4:
gt lt id pre class code http data return function com error string work try public new use href like




Varying the number of features does not reveal an optimal value.  
This parameter should ideally be used to remove as many verbs and adjectives as possible.  
To achieve that, a low maximum number of features, around 50, appears to be necessary.
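
For reference, `max_features` keeps only the terms with the highest total count across the corpus. A plain-Python sketch of that selection follows; the `top_terms` helper and the toy corpus are illustrative assumptions, and tie-breaking between equally frequent terms is not modelled.

```python
from collections import Counter

def top_terms(docs, max_features):
    """Hypothetical helper: the max_features most frequent terms over the whole corpus."""
    counts = Counter(term for doc in docs for term in doc.split())
    return [term for term, _ in counts.most_common(max_features)]

toy_docs = [
    "code use code error",
    "use code android",
    "code use layout",
]
print(top_terms(toy_docs, max_features=2))  # ['code', 'use']
```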

#### Varying the number of topics

In [239]:
min_df=2
max_df=0.9
max_features=50
n_topics=2
print("="*50)
print(f"n_topics = {n_topics}")
print("="*50)
lda, tf, tf_vectorizer = do_lda(DATA["Body_sentence_lemmat"].values, max_df=max_df, min_df=min_df, max_features=max_features, n_topics=n_topics)
pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer)

n_topics = 2
Topic 0:
code pre li use strong http com href file like em rel noreferrer work user data try question error want
Topic 1:
gt lt code pre quot android java class public id string int new org return value type com data http




In [240]:
min_df=2
max_df=0.9
max_features=50
n_topics=3
print("="*50)
print(f"n_topics = {n_topics}")
print("="*50)
lda, tf, tf_vectorizer = do_lda(DATA["Body_sentence_lemmat"].values, max_df=max_df, min_df=min_df, max_features=max_features, n_topics=n_topics)
pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer)

n_topics = 3
Topic 0:
code pre use strong file data like user li try error work new function em want way return create need
Topic 1:
http li com href android strong rel noreferrer java use org image ul question app em like application work run
Topic 2:
gt lt quot pre code public id class int string return new type value data set function add error org




In [241]:
min_df=2
max_df=0.9
max_features=50
n_topics=5
print("="*50)
print(f"n_topics = {n_topics}")
print("="*50)
lda, tf, tf_vectorizer = do_lda(DATA["Body_sentence_lemmat"].values, max_df=max_df, min_df=min_df, max_features=max_features, n_topics=n_topics)
pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer)

n_topics = 5
Topic 0:
code pre use string return public new class function int value like try work method type set error want add
Topic 1:
file android java image error test com org run app http try pre id application new method use class create
Topic 2:
gt lt class pre id type value code int data org http string public function add return strong error work
Topic 3:
quot strong use user data app application like need create work time want way new try run make id set
Topic 4:
li http href com strong em rel noreferrer use ul question org like work code need make image way try




In [242]:
min_df=2
max_df=0.9
max_features=50
n_topics=10
print("="*50)
print(f"n_topics = {n_topics}")
print("="*50)
lda, tf, tf_vectorizer = do_lda(DATA["Body_sentence_lemmat"].values, max_df=max_df, min_df=min_df, max_features=max_features, n_topics=n_topics)
pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer)

n_topics = 10
Topic 0:
java string public class int new org return pre method code set add create run use try error application test
Topic 1:
gt lt class pre code type id org data add value work use test try want time application make create
Topic 2:
function user value id return type pre data code create new set class method like try want add use work
Topic 3:
http com href quot rel noreferrer org question use work like try make application need run new type way test
Topic 4:
li ul use question need make user test time like create run work application class set new em try type
Topic 5:
android image app com id http try application add use new want pre run work make string like create set
Topic 6:
file error test run try use work create application app add pre new set time make need want code way
Topic 7:
code pre use like try work want way set need make add method create new question class type time run
Topic 8:
strong data question type class use value method code return new like pre



In [243]:
min_df=2
max_df=0.9
max_features=50
n_topics=20
print("="*50)
print(f"n_topics = {n_topics}")
print("="*50)
lda, tf, tf_vectorizer = do_lda(DATA["Body_sentence_lemmat"].values, max_df=max_df, min_df=min_df, max_features=max_features, n_topics=n_topics)
pyLDAvis.sklearn.prepare(lda, tf, tf_vectorizer)

n_topics = 20
Topic 0:
test method return use code make new create pre run add try set work public class want java way need
Topic 1:
gt lt pre code id value org add work make try use want create set need like return data function
Topic 2:
code pre value use like try work want return way set need create question add method function id make new
Topic 3:
class create method want question pre set make use try code need new like add application work return id public
Topic 4:
li ul use question need like work value make create application want set way try code href method http strong
Topic 5:
image try want use need set make like create way work new code app pre http return com method add
Topic 6:
com http question href work use set make try like code need app want add method create way application run
Topic 7:
code use make pre want like way add need work question set try new create method function return id application
Topic 8:
string public new return pre code set int add method create us



Beyond 5 topics, the clusters are no longer fully distinct and become difficult to tell apart clearly.  
LDA can therefore identify 3, or even 5, distinct topics in an unsupervised way.   
The 3 topics that can be extracted are:
- topic 0: questions about databases
- topic 1: application development
- topic 2: questions about class and function definitions, general programming questions
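
One way to put a number on how much two topics overlap is the Jaccard similarity between their top-word lists. This is a sketch of a possible check, not the criterion used in this notebook, where the assessment is visual (pyLDAvis). The two word lists are the first ten words of Topic 0 and Topic 2 from the `n_topics = 5` run; `jaccard` is a hypothetical helper.

```python
def jaccard(words_a, words_b):
    """Jaccard similarity between two collections of words (0 = disjoint, 1 = identical)."""
    a, b = set(words_a), set(words_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# First ten top words of two topics from the n_topics = 5 run.
topic_0 = "code pre use string return public new class function int".split()
topic_2 = "gt lt class pre id type value code int data".split()
print(jaccard(topic_0, topic_2))  # 4 shared words out of 16 distinct ones -> 0.25
```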

#### Creating the LDA labels
The LDA labels are added to the initial dataset.

In [244]:
lda, tf, tf_vectorizer = do_lda(DATA["Body_sentence_lemmat"].values, max_df=0.9, min_df=2, max_features=50, n_topics=5)

Topic 0:
code pre use string return public new class function int value like try work method type set error want add
Topic 1:
file android java image error test com org run app http try pre id application new method use class create
Topic 2:
gt lt class pre id type value code int data org http string public function add return strong error work
Topic 3:
quot strong use user data app application like need create work time want way new try run make id set
Topic 4:
li http href com strong em rel noreferrer use ul question org like work code need make image way try


In [245]:
ld_val = lda.transform(tf_vectorizer.transform(DATA["Body_sentence_lemmat"].values))
lda_label = np.argmax(ld_val, axis=1)
DATA["LDA_labels"] = lda_label

In [246]:
ld_val = lda.transform(tf_vectorizer.transform(DATA_20tags["Body_sentence_lemmat"].values))
lda_label = np.argmax(ld_val, axis=1)
DATA_20tags["LDA_labels"] = lda_label
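
The labelling step keeps, for each document, the index of the topic with the highest weight in the document-topic matrix returned by `lda.transform`. A toy version with made-up weights, in plain Python:

```python
# Made-up document-topic weights: one row per document, one column per topic.
doc_topic = [
    [0.10, 0.70, 0.20],
    [0.55, 0.05, 0.40],
    [0.30, 0.30, 0.40],
]

# Same idea as np.argmax(doc_topic, axis=1): index of the largest weight per row.
labels = [max(range(len(row)), key=row.__getitem__) for row in doc_topic]
print(labels)  # [1, 0, 2]
```

This hard assignment discards the rest of the mixture: the third document is labelled 2 even though its weights are close to uniform.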

### tf-idf NMF

For NMF, the parameters are not optimized; we keep those found for LDA.
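
Unlike `do_lda`, which is fitted on raw counts, `do_nmf` factorizes a TF-IDF matrix. The sketch below computes TF-IDF weights by hand on a toy corpus, assuming the default `TfidfVectorizer` settings (smoothed IDF, L2 normalization). `tfidf` is a hypothetical helper for illustration only, and the corpus is made up.

```python
import math
from collections import Counter

def tfidf(docs):
    """Hypothetical helper: TF-IDF rows with smoothed IDF and L2 normalization."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        weights = {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        # L2 norm of the row; fall back to 1.0 for an empty document.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

rows = tfidf(["code python", "code java", "code"])
print(rows[0])
```

In the first document, `python` receives a larger weight than `code` because `code` occurs in every document of the toy corpus.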

#### Varying the number of topics

In [248]:
max_df = 0.9
min_df=2
max_features = 50
n_topics = 2
print("="*50)
print(f"n_topics = {n_topics}")
print("="*50)
nmf, tf, tf_vectorizer = do_nmf(DATA["Body_sentence_lemmat"].values, max_df=max_df, min_df=min_df, max_features=max_features, n_topics=n_topics)
#pyLDAvis.sklearn.prepare(nmf, tf, tf_vectorizer)

n_topics = 2




Topic 0:
code pre gt lt use file class string like function try return error new work value want data strong way
Topic 1:
li http strong com href use rel noreferrer ul question em file image like org app application user need work


In [249]:
max_df = 0.9
min_df=2
max_features = 50
n_topics = 3
print("="*50)
print(f"n_topics = {n_topics}")
print("="*50)
nmf, tf, tf_vectorizer = do_nmf(DATA["Body_sentence_lemmat"].values, max_df=max_df, min_df=min_df, max_features=max_features, n_topics=n_topics)

n_topics = 3




Topic 0:
code pre use file like try want function work error string way new return class value data strong run method
Topic 1:
li http strong com href use rel noreferrer ul question em file image like org app application user need work
Topic 2:
gt lt pre class public id int string type return new value data android function quot strong error user test


In [250]:
max_df = 0.9
min_df=2
max_features = 50
n_topics = 5
print("="*50)
print(f"n_topics = {n_topics}")
print("="*50)
nmf, tf, tf_vectorizer = do_nmf(DATA["Body_sentence_lemmat"].values, max_df=max_df, min_df=min_df, max_features=max_features, n_topics=n_topics)

n_topics = 5




Topic 0:
code pre function return string value error class try int method like work set type new id public want em
Topic 1:
http com href rel noreferrer org image strong question android em use like work try java quot app make error
Topic 2:
gt lt pre class public id int string type return value new function quot data android error test add strong
Topic 3:
li ul strong em question code user time need data application run make app test create use java quot class
Topic 4:
use file strong way like want user need data application app work create new try run make time pre error


In [251]:
max_df = 0.9
min_df=2
max_features = 50
n_topics = 10
print("="*50)
print(f"n_topics = {n_topics}")
print("="*50)
nmf, tf, tf_vectorizer = do_nmf(DATA["Body_sentence_lemmat"].values, max_df=max_df, min_df=min_df, max_features=max_features, n_topics=n_topics)

n_topics = 10




Topic 0:
code pre em want method set run type way work like time add question make need value gt java try
Topic 1:
http com href rel noreferrer org question work try error like quot em test java function add make set run
Topic 2:
gt lt class type id int public value quot return data code org user add function string test new http
Topic 3:
li ul em question user data time need run test application class create make app quot java value set method
Topic 4:
use way like want need user application data work make time create app run try set question new add method
Topic 5:
strong em question time value set class user make type data function add method work com new app return id
Topic 6:
file error run try create work need data add want application way app type test java new set quot make
Topic 7:
pre string return function error class new public value try int id data like test work method type set create
Topic 8:
android app java com application new id quot run public error try string work a

#### Creating the NMF labels

In [252]:
nmf, tf, tf_vectorizer = do_nmf(DATA["Body_sentence_lemmat"].values, max_df=0.9, min_df=2, max_features=50, n_topics=5)



Topic 0:
code pre function return string value error class try int method like work set type new id public want em
Topic 1:
http com href rel noreferrer org image strong question android em use like work try java quot app make error
Topic 2:
gt lt pre class public id int string type return value new function quot data android error test add strong
Topic 3:
li ul strong em question code user time need data application run make app test create use java quot class
Topic 4:
use file strong way like want user need data application app work create new try run make time pre error


In [253]:
nmf_val = nmf.transform(tf_vectorizer.transform(DATA["Body_sentence_lemmat"].values))
nmf_label = np.argmax(nmf_val, axis=1)
DATA["NMF_labels"] = nmf_label

In [254]:
nmf_val = nmf.transform(tf_vectorizer.transform(DATA_20tags["Body_sentence_lemmat"].values))
nmf_label = np.argmax(nmf_val, axis=1)
DATA_20tags["NMF_labels"] = nmf_label

The topics identified with this method are very close to those obtained with LDA.
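
One way to quantify how close the two labellings are would be a cross-tabulation of the two label columns. The sketch below uses made-up label values standing in for `DATA["LDA_labels"]` and `DATA["NMF_labels"]`; the `agreement` measure is an illustrative choice, not something computed in the notebook. Topic indices are arbitrary in both methods, so the comparison looks at the dominant NMF topic within each LDA topic rather than at equal indices.

```python
import pandas as pd

# Hypothetical label columns (illustrative values only).
labels = pd.DataFrame({
    "LDA_labels": [0, 0, 1, 1, 2, 2, 2, 0],
    "NMF_labels": [1, 1, 0, 0, 2, 2, 0, 1],
})

# Rows: LDA topic, columns: NMF topic, cells: number of documents.
table = pd.crosstab(labels["LDA_labels"], labels["NMF_labels"])
print(table)

# Share of documents that fall in the dominant NMF topic of their LDA topic.
agreement = table.max(axis=1).sum() / len(labels)
print(f"agreement = {agreement:.3f}")
```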

## Supervised segmentation of the 20 tags
In this part, the 20 most frequent tags are predicted. A single question can carry several tags, so this is a multi-label problem.  
To determine the best features and the best model, only a subset of the observations is used, in order to limit computation time.  
All the results are discussed at the end of this part.

### Functions 

In [1]:
def calc_metrics(y_true_train, y_pred_train, y_true_test, y_pred_test) :
    """Compute all the metrics on the train and test sets."""
    # on multi-label targets, accuracy_score is the subset accuracy (every label must match)
    ascore_train = accuracy_score(y_true_train, y_pred_train)
    # average='samples': the score is computed per question, then averaged over questions
    pscore_train = precision_score(y_true_train, y_pred_train, average='samples')
    jacscore_train = jaccard_score(y_true_train, y_pred_train, average="samples")
    # mymetrics is a project helper that is not defined in this section
    true_train, falseneg_train, falsepos_train = mymetrics(y_true_train, y_pred_train)
    ascore_test = accuracy_score(y_true_test, y_pred_test)
    pscore_test = precision_score(y_true_test, y_pred_test, average='samples')
    true_test, falseneg_test, falsepos_test = mymetrics(y_true_test, y_pred_test)
    jacscore_test = jaccard_score(y_true_test, y_pred_test, average="samples")
    return {'accuracy_train': ascore_train, 'precision_train':pscore_train,
            'accuracy_test': ascore_test, 'precision_test':pscore_test,
            'jaccard_train': jacscore_train, 'jaccard_test': jacscore_test,
            'true_train' : true_train, 'falseneg_train' : falseneg_train, 'falsepos_train' : falsepos_train,
            'true_test' : true_test, 'falseneg_test' : falseneg_test, 'falsepos_test' : falsepos_test }
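
To make the metrics concrete, here is a small worked example on a 3-question, 3-tag toy problem. The `count_tp_fn_fp` function is a hypothetical stand-in for `mymetrics`, whose code is not shown in this section; it assumes that the `true`, `falseneg` and `falsepos` values are label-level counts of true positives, false negatives and false positives.

```python
import numpy as np
from sklearn.metrics import accuracy_score, jaccard_score, precision_score

# Toy multi-label targets and predictions (illustrative values only).
y_true = np.array([[1, 1, 0],
                   [0, 1, 0],
                   [1, 0, 1]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 0]])

# Subset accuracy: a row counts only if every label matches.
subset_accuracy = accuracy_score(y_true, y_pred)
# Per-row precision, averaged over rows; rows with no predicted label count as 0.
samples_precision = precision_score(y_true, y_pred, average="samples", zero_division=0)
# Per-row Jaccard index (intersection over union of the label sets), averaged over rows.
samples_jaccard = jaccard_score(y_true, y_pred, average="samples")


def count_tp_fn_fp(y_true, y_pred):
    """Hypothetical stand-in for mymetrics: label-level counts."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    return tp, fn, fp


print(subset_accuracy, samples_precision, samples_jaccard)
print(count_tp_fn_fp(y_true, y_pred))
```

Without `zero_division=0`, scikit-learn emits a warning for rows that have no predicted label. The `_warn_prf` lines in the outputs of this notebook most likely come from that situation.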

In [121]:
def add_result(df, scores) :
    """Update the results DataFrame, replacing any existing row for the same features/classifier pair."""
    features = scores["features"]
    classifier = scores['classifier']
    mask = (df['features'] == features) & (df['classifier'] == classifier)
    if mask.any() :
        df = df.drop(index=df[mask].index)
    # DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
    df = pd.concat([df, pd.DataFrame([scores])], ignore_index=True)
    return df

In [122]:
def bert_encode(texts, tokenizer, max_len=512):
    """Encode texts into fixed-length BERT inputs: token ids, attention masks and segment ids."""
    all_tokens = []
    all_masks = []
    all_segments = []
    
    for text in texts:
        text = tokenizer.tokenize(text)
            
        # truncate so that [CLS] and [SEP] still fit within max_len
        text = text[:max_len-2]
        input_sequence = ["[CLS]"] + text + ["[SEP]"]
        pad_len = max_len - len(input_sequence)
        
        tokens = tokenizer.convert_tokens_to_ids(input_sequence)
        # pad with id 0; the mask marks real tokens with 1 and padding with 0
        tokens += [0] * pad_len
        pad_masks = [1] * len(input_sequence) + [0] * pad_len
        # single-sentence input: every position belongs to segment 0
        segment_ids = [0] * max_len
        
        all_tokens.append(tokens)
        all_masks.append(pad_masks)
        all_segments.append(segment_ids)
    
    return np.array(all_tokens), np.array(all_masks), np.array(all_segments)
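
The truncation and padding arithmetic in `bert_encode` can be followed on a single short example. This sketch skips the real tokenizer and uses a hand-written token list (an assumption made for illustration), with `max_len=8` instead of 512 so the result stays readable. Padding is shown as a `[PAD]` string here, whereas the function pads with the id 0.

```python
max_len = 8
tokens = ["how", "to", "sort", "a", "list"]  # hypothetical tokenizer output

# Keep at most max_len - 2 tokens so that [CLS] and [SEP] still fit.
tokens = tokens[:max_len - 2]
sequence = ["[CLS]"] + tokens + ["[SEP]"]
pad_len = max_len - len(sequence)

padded = sequence + ["[PAD]"] * pad_len
mask = [1] * len(sequence) + [0] * pad_len   # 1 = real token, 0 = padding
segments = [0] * max_len                     # single sentence: all zeros

print(padded)
print(mask)
print(segments)
```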

In [123]:
result_all = pd.DataFrame(columns=["features", "classifier", "accuracy_train", "precision_train","accuracy_test", "precision_test",
                                    'jaccard_train', 'jaccard_test',
                                    'true_train', 'falseneg_train', 'falsepos_train',
                                    'true_test', 'falseneg_test', 'falsepos_test'])

### Preprocessing

#### Tag list
A target variable of size n*20 is created with the MultiLabelBinarizer class.

In [124]:
mlb = MultiLabelBinarizer()
y_all = mlb.fit_transform(DATA_20tags['Tags_list'])
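
As an illustration of what `MultiLabelBinarizer` produces, here is a minimal example on three made-up tag lists (the tags are hypothetical and not taken from the dataset). Each row of the result is a question, each column a tag, and a 1 marks a tag attached to the question.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical tag lists, one list per question.
tags = [["python", "pandas"], ["java"], ["python"]]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(tags)

print(mlb.classes_)  # column order of the indicator matrix (sorted tags)
print(y)

# inverse_transform maps indicator rows back to tuples of tags.
print(mlb.inverse_transform(y))
```

Keeping the fitted `mlb` object around matters later: `inverse_transform` is what turns a predicted indicator row back into tag names.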

### Bag of words countvectorizer

In [125]:
features = "bow_cv"

#### Loading the features

In [126]:
with open(DIR+"/bow_cv_20tags.pkl", 'rb') as ifile :
    bow_cv_features_20tags = pickle.load(ifile)
# with open("gdrive/Othercomputers/Mon ordinateur portable/P5_stackoverflow/USE_features.pkl", 'rb') as ifile :
#     USE_features = pickle.load(ifile)
# with open("gdrive/Othercomputers/Mon ordinateur portable/P5_stackoverflow/BERT_features.pkl", 'rb') as ifile :
#     BERT_features = pickle.load(ifile)

#### Splitting the dataset into train and test sets

In [127]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(bow_cv_features_20tags, \
                                                                            y_all, \
                                                                            train_size=5000, 
                                                                            test_size=2000,
                                                                            random_state=42)


In [128]:
print(f"Number of observations in the train set : {X_train.shape[0]}")
print(f"Number of observations in the test set : {X_test.shape[0]}")

Number of observations in the train set : 5000
Number of observations in the test set : 2000
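
Passing integers to `train_size` and `test_size` selects absolute numbers of rows, and any remaining rows are left unused. Because the split depends only on the number of rows and on `random_state`, two feature matrices with the same number of rows receive the same train and test rows, which is what makes the comparisons between feature sets meaningful. A small sketch with made-up arrays (sizes and names are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 50
indices = np.arange(n_samples)
# Two hypothetical representations of the same 50 rows.
features_a = np.random.RandomState(0).normal(size=(n_samples, 3))
features_b = np.random.RandomState(1).normal(size=(n_samples, 8))

# Integer sizes are absolute row counts; 10 rows are left out of both sets here.
a_train, a_test, idx_a_train, idx_a_test = train_test_split(
    features_a, indices, train_size=30, test_size=10, random_state=42)
b_train, b_test, idx_b_train, idx_b_test = train_test_split(
    features_b, indices, train_size=30, test_size=10, random_state=42)

print(a_train.shape, a_test.shape)
# Same random_state and same number of rows: identical row selection.
print(np.array_equal(idx_a_train, idx_b_train), np.array_equal(idx_a_test, idx_b_test))
```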


#### SVC

In [133]:
classifier = "SVC"
model = MultiOutputClassifier(SVC(), n_jobs=4)
model.fit(X_train, y_train)

MultiOutputClassifier(estimator=SVC(), n_jobs=4)
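
`MultiOutputClassifier` fits one independent copy of the base estimator per label column, so the 20-tag problem becomes 20 binary problems. A minimal sketch on synthetic data (the data and the two labels are invented for illustration):

```python
import numpy as np
from sklearn.multioutput import MultiOutputClassifier
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.normal(size=(40, 5))
# Two hypothetical binary labels, each driven by a different feature.
Y = np.column_stack([(X[:, 0] > 0).astype(int), (X[:, 1] > 0).astype(int)])

model = MultiOutputClassifier(SVC()).fit(X, Y)

n_estimators = len(model.estimators_)   # one fitted SVC per label column
pred = model.predict(X[:3])             # one row per sample, one column per label
print(n_estimators, pred.shape)
```

Since the estimators are independent, correlations between tags are not exploited by this wrapper.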

In [134]:
scores = calc_metrics(y_train, model.predict(X_train), y_test, model.predict(X_test))
scores["features"] = features
scores['classifier'] = classifier
result_all = add_result(result_all, scores)
result_all[(result_all['features'] == features) & (result_all['classifier'] == classifier)]

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
1,bow_cv,SVC,0.3028,0.377033,0.138,0.19325,0.33914,0.165058,2132,4887,30,394,2430,23
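
The count columns allow a label-level reading of this row. The sketch below assumes that `true`, `falseneg` and `falsepos` are counts of true positives, false negatives and false positives over all (question, tag) pairs; the `mymetrics` helper is not shown, so this interpretation is an assumption.

```python
# Test-set counts reported for bow_cv + SVC in the row above.
true_pos, false_neg, false_pos = 394, 2430, 23

# Micro-averaged scores: computed over all (question, tag) pairs at once.
micro_precision = true_pos / (true_pos + false_pos)
micro_recall = true_pos / (true_pos + false_neg)

print(f"micro precision = {micro_precision:.3f}")
print(f"micro recall    = {micro_recall:.3f}")
```

Under that assumption, the model predicts few tags but is usually right when it does. The low sample-averaged precision would then mostly reflect questions for which no tag is predicted at all.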


#### Naive Bayes

GaussianNB requires a dense array, hence the `.toarray()` calls below; this does not work when the matrix is too large.
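
Before converting a sparse matrix with `.toarray()`, the size of the dense result can be estimated from its shape and dtype. The sketch below uses a randomly generated sparse matrix with a hypothetical shape of 5000 x 20000; the real vocabulary size is not given in this section.

```python
import numpy as np
from scipy import sparse

# Hypothetical sparse bag-of-words matrix: 5000 documents x 20000 terms.
X = sparse.random(5000, 20000, density=0.001, format="csr",
                  random_state=0, dtype=np.float64)

# Memory actually used by the CSR representation.
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
# Memory a dense copy would need: rows * columns * bytes per value.
dense_bytes = X.shape[0] * X.shape[1] * X.dtype.itemsize

print(f"sparse: {sparse_bytes / 1e6:.1f} MB")
print(f"dense : {dense_bytes / 1e6:.1f} MB")
```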

In [135]:
classifier = "NB"
model = MultiOutputClassifier(GaussianNB(), n_jobs=1)
model.fit(X_train.toarray(), y_train)

MultiOutputClassifier(estimator=GaussianNB(), n_jobs=1)

In [136]:
scores = calc_metrics(y_train, model.predict(X_train.toarray()), y_test, model.predict(X_test.toarray()))
scores["features"] = features
scores['classifier'] = classifier
result_all = add_result(result_all, scores)
result_all[(result_all['features'] == features) & (result_all['classifier'] == classifier)]

  _warn_prf(average, modifier, msg_start, len(result))


Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
1,bow_cv,NB,0.8408,0.902731,0.2135,0.353014,0.902731,0.305434,7019,0,2902,1068,1756,1964


#### KNN


In [137]:
classifier = "KNN"
model = MultiOutputClassifier(KNeighborsClassifier(), n_jobs=4)
model.fit(X_train, y_train)

MultiOutputClassifier(estimator=KNeighborsClassifier(), n_jobs=4)

In [138]:
scores = calc_metrics(y_train, model.predict(X_train), y_test, model.predict(X_test))
scores["features"] = features
scores['classifier'] = classifier
result_all = add_result(result_all, scores)
result_all[(result_all['features'] == features) & (result_all['classifier'] == classifier)]

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
2,bow_cv,KNN,0.2052,0.297467,0.1015,0.161167,0.253193,0.134892,1687,5332,364,363,2461,284


### Bag of words tf-idf

In [143]:
features = "bow_tfidf"

#### Loading the features

In [144]:
with open(DIR+"/bow_tdif_20tags.pkl", 'rb') as ifile :
    bow_tdif_features_20tags = pickle.load(ifile)
# with open("gdrive/Othercomputers/Mon ordinateur portable/P5_stackoverflow/USE_features.pkl", 'rb') as ifile :
#     USE_features = pickle.load(ifile)
# with open("gdrive/Othercomputers/Mon ordinateur portable/P5_stackoverflow/BERT_features.pkl", 'rb') as ifile :
#     BERT_features = pickle.load(ifile)

#### Splitting the dataset into train and test sets

In [145]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(bow_tdif_features_20tags, \
                                                                            y_all, \
                                                                            train_size=5000,
                                                                            test_size=2000,
                                                                            random_state=42)


In [146]:
print(f"Number of observations in the train set : {X_train.shape[0]}")
print(f"Number of observations in the test set : {X_test.shape[0]}")

Number of observations in the train set : 5000
Number of observations in the test set : 2000


#### SVC

In [147]:
classifier = "SVC"
model = MultiOutputClassifier(SVC(), n_jobs=1)
model.fit(X_train, y_train)

MultiOutputClassifier(estimator=SVC(), n_jobs=1)

In [148]:
scores = calc_metrics(y_train, model.predict(X_train), y_test, model.predict(X_test))
scores["features"] = features
scores['classifier'] = classifier
result_all = add_result(result_all, scores)
result_all[(result_all['features'] == features) & (result_all['classifier'] == classifier)]

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
4,bow_tfidf,SVC,0.7482,0.887267,0.243,0.353333,0.82122,0.299058,5569,1450,28,760,2064,88


#### Naive Bayes

In [149]:
classifier = "NB"
model = MultiOutputClassifier(GaussianNB(), n_jobs=1)
model.fit(X_train.toarray(), y_train)

MultiOutputClassifier(estimator=GaussianNB(), n_jobs=1)

In [150]:
scores = calc_metrics(y_train, model.predict(X_train.toarray()), y_test, model.predict(X_test.toarray()))
scores["features"] = features
scores['classifier'] = classifier
result_all = add_result(result_all, scores)
result_all[(result_all['features'] == features) & (result_all['classifier'] == classifier)]

  _warn_prf(average, modifier, msg_start, len(result))


Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
5,bow_tfidf,NB,0.8464,0.906326,0.209,0.348847,0.906326,0.301288,7019,0,2774,1056,1768,1924


#### KNN


In [151]:
classifier = "KNN"
model = MultiOutputClassifier(KNeighborsClassifier(), n_jobs=4)
model.fit(X_train, y_train)

MultiOutputClassifier(estimator=KNeighborsClassifier(), n_jobs=4)

In [152]:
scores = calc_metrics(y_train, model.predict(X_train.toarray()), y_test, model.predict(X_test.toarray()))
scores["features"] = features
scores['classifier'] = classifier
result_all = add_result(result_all, scores)
result_all[(result_all['features'] == features) & (result_all['classifier'] == classifier)]

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
5,bow_tfidf,KNN,0.3402,0.482,0.215,0.321833,0.415737,0.271225,2793,4226,434,732,2092,364


### Word2vec

In [153]:
features = "word2vec"

#### Loading the features

In [154]:
with open(DIR+"/word2vec_20tags_features.pkl", 'rb') as ifile :
    word2vec_features_20tags = pickle.load(ifile)
# with open("gdrive/Othercomputers/Mon ordinateur portable/P5_stackoverflow/USE_features.pkl", 'rb') as ifile :
#     USE_features = pickle.load(ifile)
# with open("gdrive/Othercomputers/Mon ordinateur portable/P5_stackoverflow/BERT_features.pkl", 'rb') as ifile :
#     BERT_features = pickle.load(ifile)

#### Splitting the dataset into train and test sets

In [155]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(word2vec_features_20tags, \
                                                                            y_all, \
                                                                            train_size = 5000,
                                                                            test_size=2000,
                                                                            random_state=42)

In [156]:
print(f"Number of observations in the train set : {len(X_train)}")
print(f"Number of observations in the test set : {len(X_test)}")

Number of observations in the train set : 5000
Number of observations in the test set : 2000


#### SVC

In [157]:
classifier = "SVC"
model = MultiOutputClassifier(SVC(), n_jobs=1)
model.fit(X_train, y_train)

MultiOutputClassifier(estimator=SVC(), n_jobs=1)

In [158]:
scores = calc_metrics(y_train, model.predict(X_train), y_test, model.predict(X_test))
scores["features"] = features
scores['classifier'] = classifier
result_all = add_result(result_all, scores)
result_all[(result_all['features'] == features) & (result_all['classifier'] == classifier)]

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
6,word2vec,SVC,0.14,0.2008,0.0935,0.13925,0.168673,0.113833,1030,5989,37,279,2545,35


#### Naive Bayes

In [159]:
classifier = "NB"
model = MultiOutputClassifier(GaussianNB(), n_jobs=1)
model.fit(X_train, y_train)

MultiOutputClassifier(estimator=GaussianNB(), n_jobs=1)

In [160]:
scores = calc_metrics(y_train, model.predict(X_train), y_test, model.predict(X_test))
scores["features"] = features
scores['classifier'] = classifier
result_all = add_result(result_all, scores)
result_all[(result_all['features'] == features) & (result_all['classifier'] == classifier)]

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
7,word2vec,NB,0.004,0.169555,0.005,0.160993,0.164594,0.155224,4500,2519,25710,1692,1132,10310


#### KNN


In [161]:
classifier = "KNN"
model = MultiOutputClassifier(KNeighborsClassifier(), n_jobs=4)
model.fit(X_train, y_train)

MultiOutputClassifier(estimator=KNeighborsClassifier(), n_jobs=4)

In [162]:
scores = calc_metrics(y_train, model.predict(X_train), y_test, model.predict(X_test))
scores["features"] = features
scores['classifier'] = classifier
result_all = add_result(result_all, scores)
result_all[(result_all['features'] == features) & (result_all['classifier'] == classifier)]

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
8,word2vec,KNN,0.1644,0.251117,0.0665,0.119,0.20967,0.09525,1449,5570,463,281,2543,315


### BERT uncased

In [163]:
features = "BERT_uncased"

#### Loading the features

In [164]:
with open(DIR+"/BERT_features_20tags.pkl", 'rb') as ifile :
    BERT_features_20tags = pickle.load(ifile)
# with open("gdrive/Othercomputers/Mon ordinateur portable/P5_stackoverflow/USE_features.pkl", 'rb') as ifile :
#     USE_features = pickle.load(ifile)
# with open("gdrive/Othercomputers/Mon ordinateur portable/P5_stackoverflow/BERT_features.pkl", 'rb') as ifile :
#     BERT_features = pickle.load(ifile)

#### Splitting the dataset into train and test sets

In [165]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(BERT_features_20tags, \
                                                                            y_all, \
                                                                            train_size=5000,
                                                                            test_size=2000,
                                                                            random_state=42)

In [166]:
print(f"Number of observations in the train set : {len(X_train)}")
print(f"Number of observations in the test set : {len(X_test)}")

Number of observations in the train set : 5000
Number of observations in the test set : 2000


#### SVC

In [167]:
classifier = "SVC"
model = MultiOutputClassifier(SVC(), n_jobs=1)
model.fit(X_train, y_train)

MultiOutputClassifier(estimator=SVC(), n_jobs=1)

In [168]:
scores = calc_metrics(y_train, model.predict(X_train), y_test, model.predict(X_test))
scores["features"] = features
scores['classifier'] = classifier
result_all = add_result(result_all, scores)
result_all[(result_all['features'] == features) & (result_all['classifier'] == classifier)]

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
9,BERT_uncased,SVC,0.0242,0.031,0.0105,0.013,0.027233,0.011667,155,6864,0,26,2798,0


#### Naive Bayes

In [169]:
classifier = "NB"
model = MultiOutputClassifier(GaussianNB(), n_jobs=1)
model.fit(X_train, y_train)

MultiOutputClassifier(estimator=GaussianNB(), n_jobs=1)

In [170]:
scores = calc_metrics(y_train, model.predict(X_train), y_test, model.predict(X_test))
scores["features"] = features
scores['classifier'] = classifier
result_all = add_result(result_all, scores)
result_all[(result_all['features'] == features) & (result_all['classifier'] == classifier)]

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
10,BERT_uncased,NB,0.003,0.155771,0.003,0.141745,0.151418,0.13739,4541,2478,25987,1697,1127,10649


#### KNN


In [171]:
classifier = "KNN"
model = MultiOutputClassifier(KNeighborsClassifier(), n_jobs=4)
model.fit(X_train, y_train)

MultiOutputClassifier(estimator=KNeighborsClassifier(), n_jobs=4)

In [172]:
scores = calc_metrics(y_train, model.predict(X_train), y_test, model.predict(X_test))
scores["features"] = features
scores['classifier'] = classifier
result_all = add_result(result_all, scores)
result_all[(result_all['features'] == features) & (result_all['classifier'] == classifier)]

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
11,BERT_uncased,KNN,0.15,0.2305,0.055,0.098417,0.189753,0.077417,1258,5761,382,218,2606,332


### BERT HF_notrain

In [173]:
features = "BERT_HF_notrain"

#### Loading the features

In [174]:
with open(DIR+"/BERT_HF_features_20tags_notrain.pkl", 'rb') as ifile :
    BERT_features_20tags = pickle.load(ifile)
# with open("gdrive/Othercomputers/Mon ordinateur portable/P5_stackoverflow/USE_features.pkl", 'rb') as ifile :
#     USE_features = pickle.load(ifile)
# with open("gdrive/Othercomputers/Mon ordinateur portable/P5_stackoverflow/BERT_features.pkl", 'rb') as ifile :
#     BERT_features = pickle.load(ifile)

#### Splitting the dataset into train and test sets

In [175]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(BERT_features_20tags, \
                                                                            y_all, \
                                                                            train_size=5000,
                                                                            test_size=2000,
                                                                            random_state=42)

In [176]:
print(f"Number of observations in the train set : {len(X_train)}")
print(f"Number of observations in the test set : {len(X_test)}")

Number of observations in the train set : 5000
Number of observations in the test set : 2000


#### SVC

In [177]:
classifier = "SVC"
model = MultiOutputClassifier(SVC(), n_jobs=1)
model.fit(X_train, y_train)

MultiOutputClassifier(estimator=SVC(), n_jobs=1)

In [178]:
scores = calc_metrics(y_train, model.predict(X_train), y_test, model.predict(X_test))
scores["features"] = features
scores['classifier'] = classifier
result_all = add_result(result_all, scores)
result_all[(result_all['features'] == features) & (result_all['classifier'] == classifier)]

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
12,BERT_HF_notrain,SVC,0.0,0.0,0.0,0.0,0.0,0.0,0,7019,0,0,2824,0


#### Naive Bayes

In [179]:
classifier = "NB"
model = MultiOutputClassifier(GaussianNB(), n_jobs=1)
model.fit(X_train, y_train)

MultiOutputClassifier(estimator=GaussianNB(), n_jobs=1)

In [180]:
scores = calc_metrics(y_train, model.predict(X_train), y_test, model.predict(X_test))
scores["features"] = features
scores['classifier'] = classifier
result_all = add_result(result_all, scores)
result_all[(result_all['features'] == features) & (result_all['classifier'] == classifier)]

Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
13,BERT_HF_notrain,NB,0.0,0.093274,0.0,0.09045,0.091835,0.08875,4235,2784,43514,1619,1205,17177


#### KNN


In [181]:
classifier = "KNN"
model = MultiOutputClassifier(KNeighborsClassifier(), n_jobs=4)
model.fit(X_train, y_train)

MultiOutputClassifier(estimator=KNeighborsClassifier(), n_jobs=4)

In [182]:
scores = calc_metrics(y_train, model.predict(X_train), y_test, model.predict(X_test))
scores["features"] = features
scores['classifier'] = classifier
result_all = add_result(result_all, scores)
result_all[(result_all['features'] == features) & (result_all['classifier'] == classifier)]

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
14,BERT_HF_notrain,KNN,0.0908,0.140967,0.026,0.0445,0.11482,0.034625,746,6273,294,94,2730,255


### BERT HF_train

In [183]:
features = "BERT_HF_train"

#### Loading the features

In [184]:
with open(DIR+"/BERT_HF_features_20tags_new.pkl", 'rb') as ifile :
    BERT_features_20tags = pickle.load(ifile)
# with open("gdrive/Othercomputers/Mon ordinateur portable/P5_stackoverflow/USE_features.pkl", 'rb') as ifile :
#     USE_features = pickle.load(ifile)
# with open("gdrive/Othercomputers/Mon ordinateur portable/P5_stackoverflow/BERT_features.pkl", 'rb') as ifile :
#     BERT_features = pickle.load(ifile)

#### Splitting the dataset into train and test sets

In [185]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(BERT_features_20tags, \
                                                                            y_all, \
                                                                            train_size=5000,
                                                                            test_size=2000,
                                                                            random_state=42)

In [186]:
print(f"Number of observations in the train set : {len(X_train)}")
print(f"Number of observations in the test set : {len(X_test)}")

Number of observations in the train set : 5000
Number of observations in the test set : 2000


#### SVC

In [187]:
classifier = "SVC"
model = MultiOutputClassifier(SVC(), n_jobs=1)
model.fit(X_train, y_train)

MultiOutputClassifier(estimator=SVC(), n_jobs=1)

In [188]:
scores = calc_metrics(y_train, model.predict(X_train), y_test, model.predict(X_test))
scores["features"] = features
scores['classifier'] = classifier
result_all = add_result(result_all, scores)
result_all[(result_all['features'] == features) & (result_all['classifier'] == classifier)]

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
15,BERT_HF_train,SVC,0.3428,0.5204,0.332,0.501167,0.435837,0.421225,2956,4063,506,1135,1689,234


#### Naive Bayes

In [189]:
classifier = "NB"
model = MultiOutputClassifier(GaussianNB(), n_jobs=1)
model.fit(X_train, y_train)

MultiOutputClassifier(estimator=GaussianNB(), n_jobs=1)

In [190]:
scores = calc_metrics(y_train, model.predict(X_train), y_test, model.predict(X_test))
scores["features"] = features
scores['classifier'] = classifier
result_all = add_result(result_all, scores)
result_all[(result_all['features'] == features) & (result_all['classifier'] == classifier)]

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
16,BERT_HF_train,NB,0.1292,0.396769,0.124,0.387992,0.385595,0.378452,5544,1475,11595,2217,607,4682


#### KNN


In [191]:
classifier = "KNN"
model = MultiOutputClassifier(KNeighborsClassifier(), n_jobs=4)
model.fit(X_train, y_train)

MultiOutputClassifier(estimator=KNeighborsClassifier(), n_jobs=4)

In [192]:
scores = calc_metrics(y_train, model.predict(X_train), y_test, model.predict(X_test))
scores["features"] = features
scores['classifier'] = classifier
result_all = add_result(result_all, scores)
result_all[(result_all['features'] == features) & (result_all['classifier'] == classifier)]

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
17,BERT_HF_train,KNN,0.4458,0.626417,0.3535,0.539083,0.548557,0.462383,3823,3196,868,1315,1509,581


### USE

In [193]:
features = "USE"

#### Loading the features

In [194]:
with open(DIR+"/USE_features_20tags.pkl", 'rb') as ifile :
    USE_features_20tags = pickle.load(ifile)
# with open("gdrive/Othercomputers/Mon ordinateur portable/P5_stackoverflow/USE_features.pkl", 'rb') as ifile :
#     USE_features = pickle.load(ifile)
# with open("gdrive/Othercomputers/Mon ordinateur portable/P5_stackoverflow/BERT_features.pkl", 'rb') as ifile :
#     BERT_features = pickle.load(ifile)

#### Splitting the dataset into train and test sets

In [195]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(USE_features_20tags, \
                                                                            y_all, \
                                                                            train_size=5000,
                                                                            test_size=2000,
                                                                            random_state=42)

In [196]:
print(f"Nombre d'observations dans le train set : {len(X_train)}")
print(f"Nombre d'observations dans le test set : {len(X_test)}")

Nombre d'observations dans le train set : 5000
Nombre d'observations dans le test set : 2000
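Every feature set in this notebook is split with the same `train_size`, `test_size` and `random_state`. Assuming all feature matrices have the same number of rows in the same order, the same questions land in the train and test sets each time, which keeps the comparisons between features fair. A small sketch with hypothetical stand-in arrays:

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 20
features_a = np.arange(n_samples).reshape(-1, 1)       # stand-in for one feature set
features_b = np.arange(n_samples).reshape(-1, 1) * 10  # stand-in for another one
labels = np.arange(n_samples)                          # row identifiers used as labels

# Integer sizes select a fixed number of rows; the remaining rows are left unused.
_, _, y_train_a, y_test_a = train_test_split(
    features_a, labels, train_size=5, test_size=2, random_state=42)
_, _, y_train_b, y_test_b = train_test_split(
    features_b, labels, train_size=5, test_size=2, random_state=42)

same_rows = (np.array_equal(y_train_a, y_train_b)
             and np.array_equal(y_test_a, y_test_b))
n_unused = n_samples - len(y_train_a) - len(y_test_a)
print(same_rows, n_unused)
```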


#### SVC

In [197]:
classifier = "SVC"
model = MultiOutputClassifier(SVC(), n_jobs=1)
model.fit(X_train, y_train)

MultiOutputClassifier(estimator=SVC(), n_jobs=1)

In [198]:
scores = calc_metrics(y_train, model.predict(X_train), y_test, model.predict(X_test))
scores["features"] = features
scores['classifier'] = classifier
result_all = add_result(result_all, scores)
result_all[(result_all['features'] == features) & (result_all['classifier'] == classifier)]



Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
18,USE,SVC,0.647,0.845233,0.447,0.668167,0.755967,0.567992,5115,1904,305,1529,1295,259


#### Naive Bayes

In [199]:
classifier = "NB"
model = MultiOutputClassifier(GaussianNB(), n_jobs=1)
model.fit(X_train, y_train)

MultiOutputClassifier(estimator=GaussianNB(), n_jobs=1)

In [200]:
scores = calc_metrics(y_train, model.predict(X_train), y_test, model.predict(X_test))
scores["features"] = features
scores['classifier'] = classifier
result_all = add_result(result_all, scores)
result_all[(result_all['features'] == features) & (result_all['classifier'] == classifier)]



Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
19,USE,NB,0.1724,0.488678,0.164,0.473783,0.476128,0.460551,6080,939,7852,2399,425,3212


#### KNN


In [201]:
classifier = "KNN"
model = MultiOutputClassifier(KNeighborsClassifier(), n_jobs=4)
model.fit(X_train, y_train)

MultiOutputClassifier(estimator=KNeighborsClassifier(), n_jobs=4)

In [202]:
scores = calc_metrics(y_train, model.predict(X_train), y_test, model.predict(X_test))
scores["features"] = features
scores['classifier'] = classifier
result_all = add_result(result_all, scores)
result_all[(result_all['features'] == features) & (result_all['classifier'] == classifier)]



Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
20,USE,KNN,0.51,0.70675,0.3945,0.596375,0.623597,0.511308,4310,2709,763,1444,1380,506


### BERT model without fine-tuning

#### Creating the inputs

In [204]:
module_url ='https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4'
bert_layer = hub.KerasLayer(module_url, trainable=False)
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)
inputs = bert_encode(DATA_20tags['Body'], tokenizer, max_len=64)

In [205]:
model = tensorflow.keras.models.load_model(DIR+"/model_BERT_20tags_len64_notrain.h5",
                                           custom_objects={'KerasLayer':hub.KerasLayer})

#### Prediction

In [206]:
y_pred = model.predict(inputs)

In [218]:
np.argmax(y_pred)

300305

In [207]:
# model.predict returns continuous scores, whereas the scikit-learn multilabel
# metrics expect a binary indicator matrix. The 0.5 threshold is an assumption.
y_pred_bin = (y_pred > 0.5).astype(int)
ascore_train = accuracy_score(y_all, y_pred_bin)
pscore_train = precision_score(y_all, y_pred_bin, average='samples')
jacscore_train = jaccard_score(y_all, y_pred_bin, average="samples")
true_train, falseneg_train, falsepos_train = mymetrics(y_all, y_pred_bin)
scores = {'accuracy_train': ascore_train, 'precision_train':pscore_train,
          'jaccard_train': jacscore_train,
          'true_train' : true_train, 'falseneg_train' : falseneg_train, 'falsepos_train' : falsepos_train}
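A Keras `predict` call returns continuous scores, while the scikit-learn multilabel metrics expect a binary indicator matrix. The sketch below shows the conversion on hypothetical toy scores. The 0.5 threshold and all the `*_toy` names are assumptions made for illustration, not values taken from the notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, jaccard_score, precision_score

# Hypothetical ground truth and model scores for 3 questions and 4 tags.
y_true_toy = np.array([[1, 0, 0, 1],
                       [0, 1, 0, 0],
                       [0, 0, 1, 0]])
y_score_toy = np.array([[0.9, 0.2, 0.1, 0.7],
                        [0.3, 0.8, 0.6, 0.1],
                        [0.2, 0.1, 0.4, 0.3]])

threshold = 0.5  # assumption: a tag is predicted when its score exceeds 0.5
y_pred_toy = (y_score_toy > threshold).astype(int)

# Subset accuracy: a row counts only if every tag matches.
acc = accuracy_score(y_true_toy, y_pred_toy)
# Sample-averaged precision; rows with no predicted tag count as 0.
prec = precision_score(y_true_toy, y_pred_toy, average="samples", zero_division=0)
# Sample-averaged Jaccard index (intersection over union per row).
jac = jaccard_score(y_true_toy, y_pred_toy, average="samples")
print(acc, prec, jac)
```

The threshold trades false positives against false negatives, so it is worth tuning on a validation set rather than fixing it at 0.5.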

In [None]:
features = "BERT_HF_notrain"
classifier = "BERT_model"
scores["features"] = features
scores['classifier'] = classifier
# Drop any previous row for this (features, classifier) pair before adding the new one
mask = (result_all['features'] == features) & (result_all['classifier'] == classifier)
result_all = result_all[~mask]
# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
result_all = pd.concat([result_all, pd.DataFrame([scores])], ignore_index=True)
result_all[(result_all['features'] == features) & (result_all['classifier'] == classifier)]

### BERT model with fine-tuning

#### Creating the inputs

In [None]:
module_url ='https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4'
bert_layer = hub.KerasLayer(module_url, trainable=False)
vocab_file = bert_layer.resolved_object.vocab_file.asset_path.numpy()
do_lower_case = bert_layer.resolved_object.do_lower_case.numpy()
tokenizer = tokenization.FullTokenizer(vocab_file, do_lower_case)
inputs = bert_encode(DATA_20tags['Body'], tokenizer, max_len=64)

In [None]:
model = tensorflow.keras.models.load_model(DIR+"/model_BERT_20tags_len64_new.h5",
                                           custom_objects={'KerasLayer':hub.KerasLayer})

#### Prediction

In [None]:
y_pred = model.predict(inputs)

In [None]:
# model.predict returns continuous scores, whereas the scikit-learn multilabel
# metrics expect a binary indicator matrix. The 0.5 threshold is an assumption.
y_pred_bin = (y_pred > 0.5).astype(int)
ascore_train = accuracy_score(y_all, y_pred_bin)
pscore_train = precision_score(y_all, y_pred_bin, average='samples')
jacscore_train = jaccard_score(y_all, y_pred_bin, average="samples")
true_train, falseneg_train, falsepos_train = mymetrics(y_all, y_pred_bin)
scores = {'accuracy_train': ascore_train, 'precision_train':pscore_train,
          'jaccard_train': jacscore_train,
          'true_train' : true_train, 'falseneg_train' : falseneg_train, 'falsepos_train' : falsepos_train}

In [None]:
features = "BERT_HF_train"
classifier = "BERT_model"
scores["features"] = features
scores['classifier'] = classifier
# Drop any previous row for this (features, classifier) pair before adding the new one
mask = (result_all['features'] == features) & (result_all['classifier'] == classifier)
result_all = result_all[~mask]
# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
result_all = pd.concat([result_all, pd.DataFrame([scores])], ignore_index=True)
result_all[(result_all['features'] == features) & (result_all['classifier'] == classifier)]

### Summary of results

In [219]:
result_all

Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
0,bow_cv,SVC,0.3028,0.377033,0.138,0.19325,0.33914,0.165058,2132,4887,30,394,2430,23
1,bow_cv,NB,0.8408,0.902731,0.2135,0.353014,0.902731,0.305434,7019,0,2902,1068,1756,1964
2,bow_cv,KNN,0.2052,0.297467,0.1015,0.161167,0.253193,0.134892,1687,5332,364,363,2461,284
3,bow_tfidf,SVC,0.7482,0.887267,0.243,0.353333,0.82122,0.299058,5569,1450,28,760,2064,88
4,bow_tfidf,NB,0.8464,0.906326,0.209,0.348847,0.906326,0.301288,7019,0,2774,1056,1768,1924
5,bow_tfidf,KNN,0.3402,0.482,0.215,0.321833,0.415737,0.271225,2793,4226,434,732,2092,364
6,word2vec,SVC,0.14,0.2008,0.0935,0.13925,0.168673,0.113833,1030,5989,37,279,2545,35
7,word2vec,NB,0.004,0.169555,0.005,0.160993,0.164594,0.155224,4500,2519,25710,1692,1132,10310
8,word2vec,KNN,0.1644,0.251117,0.0665,0.119,0.20967,0.09525,1449,5570,463,281,2543,315
9,BERT_uncased,SVC,0.0242,0.031,0.0105,0.013,0.027233,0.011667,155,6864,0,26,2798,0


In [220]:
result_all.to_csv(DIR+"/result_segmentation.csv")

The various tests highlight several points:
- The classifier used for the segmentation is not the main driver of performance.
- The different metrics give consistent results.
- The choice of preprocessing strongly influences the quality of the segmentation.
- TF-IDF features give a very good score on the training set but not on the test set (with SVC, Jaccard 0.82 on train vs 0.30 on test), which indicates overfitting.
- The BERT features depend strongly on the choice of model and on whether that model was fine-tuned.
- The BERT features obtained from the fine-tuned model give good results without overfitting (for example with KNN, Jaccard 0.55 on train vs 0.46 on test).
- The USE features also appear very effective, particularly with the SVC classifier (Jaccard 0.57 on the test set).
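The overfitting observations can be checked by comparing train and test scores directly. The sketch below does so for three rows copied from the multi-label results reported in this notebook. The `jaccard_gap` column and the variable names are illustrative additions.

```python
import pandas as pd

# Jaccard scores copied from the multi-label result tables of this notebook.
rows = [
    {"features": "bow_tfidf", "classifier": "SVC",
     "jaccard_train": 0.82122, "jaccard_test": 0.299058},
    {"features": "BERT_HF_train", "classifier": "KNN",
     "jaccard_train": 0.548557, "jaccard_test": 0.462383},
    {"features": "USE", "classifier": "SVC",
     "jaccard_train": 0.755967, "jaccard_test": 0.567992},
]
summary = pd.DataFrame(rows)

# A large train/test gap is a sign of overfitting.
summary["jaccard_gap"] = summary["jaccard_train"] - summary["jaccard_test"]
ranked = summary.sort_values("jaccard_gap", ascending=False).reset_index(drop=True)
print(ranked[["features", "classifier", "jaccard_gap"]])
```

On these rows the TF-IDF features show the largest gap and the fine-tuned BERT features the smallest, in line with the points listed above.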

## Supervised segmentation with a single tag
In this part, only one of the 20 most frequent tags is predicted per question, so each question has exactly one tag. When a question has several tags, the first one in its list is chosen arbitrarily. 
To determine the best features and the best model, only a subset of the observations is used in order to limit computation time.  
All the results are discussed at the end of this part.

In [41]:
def calc_metrics(y_true_train, y_pred_train, y_true_test, y_pred_test) :
    ascore_train = accuracy_score(y_true_train, y_pred_train)
    pscore_train = precision_score(y_true_train, y_pred_train, average='micro')
    jacscore_train = jaccard_score(y_true_train, y_pred_train, average="micro")
    true_train, falseneg_train, falsepos_train = mymetrics(y_true_train, y_pred_train)
    ascore_test = accuracy_score(y_true_test, y_pred_test)
    pscore_test = precision_score(y_true_test, y_pred_test, average='micro')
    true_test, falseneg_test, falsepos_test = mymetrics(y_true_test, y_pred_test)
    jacscore_test = jaccard_score(y_true_test, y_pred_test, average="micro")
    return {'accuracy_train': ascore_train, 'precision_train':pscore_train,
            'accuracy_test': ascore_test, 'precision_test':pscore_test,
            'jaccard_train': jacscore_train, 'jaccard_test': jacscore_test,
            'true_train' : true_train, 'falseneg_train' : falseneg_train, 'falsepos_train' : falsepos_train,
            'true_test' : true_test, 'falseneg_test' : falseneg_test, 'falsepos_test' : falsepos_test }
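With a single label per question, every wrong prediction is at once a false positive for the predicted class and a false negative for the true class, so micro-averaged precision equals accuracy. This is why the `accuracy_*` and `precision_*` columns are identical in the single-tag result tables. A minimal check on hypothetical toy labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score

# Hypothetical single-label targets and predictions over 3 classes.
y_true_toy = np.array([0, 1, 2, 2, 1, 0, 2])
y_pred_toy = np.array([0, 2, 2, 2, 1, 1, 0])

acc = accuracy_score(y_true_toy, y_pred_toy)
# Micro averaging pools true and false positives over all classes.
micro_prec = precision_score(y_true_toy, y_pred_toy, average="micro")
print(acc, micro_prec)
```

Macro or weighted averaging would give a precision that differs from accuracy and reflects per-class behaviour.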

In [42]:
result_one = pd.DataFrame(columns=["features", "classifier", "accuracy_train", "precision_train","accuracy_test", "precision_test",
                                    'jaccard_train', 'jaccard_test',
                                    'true_train', 'falseneg_train', 'falsepos_train',
                                    'true_test', 'falseneg_test', 'falsepos_test'])

### Preprocessing

#### List of tags

In [43]:
one_tag = DATA_20tags['Tags_list'].apply(lambda x : x[0])

In [44]:
enc = LabelEncoder()
y_one = enc.fit_transform(one_tag)
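`LabelEncoder` maps each tag name to an integer, and `inverse_transform` maps predictions back to tag names, which is needed to show suggestions to a user. A small sketch on hypothetical tag lists standing in for `DATA_20tags['Tags_list']` (the names `tags_list`, `toy_enc` and `y_toy` are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical tag lists, one list per question.
tags_list = pd.Series([["python", "pandas"], ["java"], ["c#", ".net"], ["python"]])

# Keep only the first tag of each question.
first_tag = tags_list.apply(lambda tags: tags[0])

toy_enc = LabelEncoder()
y_toy = toy_enc.fit_transform(first_tag)    # integer class ids, classes sorted alphabetically
decoded = toy_enc.inverse_transform(y_toy)  # back to the tag names
print(list(toy_enc.classes_), list(y_toy), list(decoded))
```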

### Bag of words countvectorizer

In [45]:
features = "bow_cv"

#### Loading the features

In [46]:
with open(DIR+"/bow_cv_20tags.pkl", 'rb') as ifile :
    bow_cv_features_20tags = pickle.load(ifile)
# with open("gdrive/Othercomputers/Mon ordinateur portable/P5_stackoverflow/USE_features.pkl", 'rb') as ifile :
#     USE_features = pickle.load(ifile)
# with open("gdrive/Othercomputers/Mon ordinateur portable/P5_stackoverflow/BERT_features.pkl", 'rb') as ifile :
#     BERT_features = pickle.load(ifile)

#### Splitting the dataset into train and test sets

In [47]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    bow_cv_features_20tags,
    y_one,
    train_size=5000,
    test_size=2000,
    random_state=42,
)

In [48]:
print(f"Number of observations in the train set: {X_train.shape[0]}")
print(f"Number of observations in the test set: {X_test.shape[0]}")

Number of observations in the train set: 5000
Number of observations in the test set: 2000


#### SVC

In [49]:
classifier = "SVC"
model = SVC()
model.fit(X_train, y_train)

SVC()

In [50]:
scores = calc_metrics(y_train, model.predict(X_train), y_test, model.predict(X_test))
scores["features"] = features
scores['classifier'] = classifier
result_one = add_result(result_one, scores)
result_one[(result_one['features']==features) & (result_one['classifier']==classifier)]

Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
0,bow_cv,SVC,0.5988,0.5988,0.4135,0.4135,0.427348,0.260637,215,0,0,51,0,0


#### Naive Bayes

Does not work when the matrix is too large: the sparse features are converted to a dense array with `.toarray()`.

In [51]:
classifier = "NB"
model = GaussianNB()
model.fit(X_train.toarray(), y_train)

GaussianNB()
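`GaussianNB` needs a dense array, which is what limits it on large bag-of-words matrices. One possible alternative, not used in this notebook, is `MultinomialNB`, which accepts scipy sparse count matrices directly. A hedged sketch on hypothetical toy documents:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus and single tags.
docs = ["python pandas dataframe", "java spring bean",
        "python numpy array", "java maven build"]
tags = ["python", "java", "python", "java"]

vectorizer = CountVectorizer()
X_sparse = vectorizer.fit_transform(docs)  # scipy sparse matrix, never densified

nb = MultinomialNB()
nb.fit(X_sparse, tags)                     # sparse input is accepted as is

pred = nb.predict(vectorizer.transform(["python dataframe"]))
print(pred[0])
```

This is a different model from `GaussianNB` (multinomial counts instead of Gaussian features), so its scores are not directly comparable with the ones reported here.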

In [53]:
scores = calc_metrics(y_train, model.predict(X_train.toarray()), y_test, model.predict(X_test.toarray()))
scores["features"] = features
scores['classifier'] = classifier
result_one = add_result(result_one, scores)
result_one[(result_one['features']==features) & (result_one['classifier']==classifier)]

Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
1,bow_cv,NB,0.9392,0.9392,0.4275,0.4275,0.88537,0.27186,326,23,0,103,21,0


#### KNN


In [54]:
classifier = "KNN"
model = KNeighborsClassifier()
model.fit(X_train, y_train)

KNeighborsClassifier()

In [55]:
scores = calc_metrics(y_train, model.predict(X_train), y_test, model.predict(X_test))
scores["features"] = features
scores['classifier'] = classifier
result_one = add_result(result_one, scores)
result_one[(result_one['features']==features) & (result_one['classifier']==classifier)]

Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
2,bow_cv,KNN,0.4998,0.4998,0.2585,0.2585,0.333156,0.148435,394,119,0,132,77,0


### Bag of words tf-idf

In [56]:
features = "bow_tfidf"

#### Loading the features

In [57]:
with open(DIR+"/bow_tdif_20tags.pkl", 'rb') as ifile :
    bow_tdif_features_20tags = pickle.load(ifile)
# with open("gdrive/Othercomputers/Mon ordinateur portable/P5_stackoverflow/USE_features.pkl", 'rb') as ifile :
#     USE_features = pickle.load(ifile)
# with open("gdrive/Othercomputers/Mon ordinateur portable/P5_stackoverflow/BERT_features.pkl", 'rb') as ifile :
#     BERT_features = pickle.load(ifile)

#### Splitting the dataset into train and test sets

In [58]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    bow_tdif_features_20tags,
    y_one,
    train_size=5000,
    test_size=2000,
    random_state=42,
)

In [59]:
print(f"Number of observations in the train set: {X_train.shape[0]}")
print(f"Number of observations in the test set: {X_test.shape[0]}")

Number of observations in the train set: 5000
Number of observations in the test set: 2000


#### SVC

In [60]:
classifier = "SVC"
model = SVC()
model.fit(X_train, y_train)

SVC()

In [61]:
scores = calc_metrics(y_train, model.predict(X_train), y_test, model.predict(X_test))
scores["features"] = features
scores['classifier'] = classifier
result_one = add_result(result_one, scores)
result_one[(result_one['features']==features) & (result_one['classifier']==classifier)]

Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
3,bow_tfidf,SVC,0.955,0.955,0.554,0.554,0.913876,0.383126,335,0,0,88,0,0


#### Naive Bayes

In [62]:
classifier = "NB"
model = GaussianNB()
model.fit(X_train.toarray(), y_train)

GaussianNB()

In [63]:
scores = calc_metrics(y_train, model.predict(X_train.toarray()), y_test, model.predict(X_test.toarray()))
scores["features"] = features
scores['classifier'] = classifier
result_one = add_result(result_one, scores)
result_one[(result_one['features']==features) & (result_one['classifier']==classifier)]

Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
4,bow_tfidf,NB,0.9516,0.9516,0.4205,0.4205,0.907669,0.266223,327,20,0,108,17,0


#### KNN


In [64]:
classifier = "KNN"
model = KNeighborsClassifier()
model.fit(X_train, y_train)

KNeighborsClassifier()

In [65]:
scores = calc_metrics(y_train, model.predict(X_train), y_test, model.predict(X_test))
scores["features"] = features
scores['classifier'] = classifier
result_one = add_result(result_one, scores)
result_one[(result_one['features']==features) & (result_one['classifier']==classifier)]

Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
5,bow_tfidf,KNN,0.6224,0.6224,0.392,0.392,0.4518,0.243781,415,59,2,151,45,0


### Word2vec

In [66]:
features = "word2vec"

#### Loading the features

In [67]:
with open(DIR+"/word2vec_20tags_features.pkl", 'rb') as ifile :
    word2vec_features_20tags = pickle.load(ifile)
# with open("gdrive/Othercomputers/Mon ordinateur portable/P5_stackoverflow/USE_features.pkl", 'rb') as ifile :
#     USE_features = pickle.load(ifile)
# with open("gdrive/Othercomputers/Mon ordinateur portable/P5_stackoverflow/BERT_features.pkl", 'rb') as ifile :
#     BERT_features = pickle.load(ifile)

In [68]:
word2vec_features_20tags.shape

(21335, 300)

#### Splitting the dataset into train and test sets

In [69]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    word2vec_features_20tags,
    y_one,
    train_size=5000,
    test_size=2000,
    random_state=42,
)

In [70]:
print(f"Number of observations in the train set: {len(X_train)}")
print(f"Number of observations in the test set: {len(X_test)}")

Number of observations in the train set: 5000
Number of observations in the test set: 2000


#### SVC

In [71]:
classifier = "SVC"
model = SVC()
model.fit(X_train, y_train)

SVC()

In [72]:
scores = calc_metrics(y_train, model.predict(X_train), y_test, model.predict(X_test))
scores["features"] = features
scores['classifier'] = classifier
result_one = add_result(result_one, scores)
result_one[(result_one['features']==features) & (result_one['classifier']==classifier)]

Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
6,word2vec,SVC,0.5192,0.5192,0.383,0.383,0.350621,0.236858,297,0,1,99,0,1


#### Naive Bayes

In [73]:
classifier = "NB"
model = GaussianNB()
model.fit(X_train, y_train)

GaussianNB()

In [74]:
scores = calc_metrics(y_train, model.predict(X_train), y_test, model.predict(X_test))
scores["features"] = features
scores['classifier'] = classifier
result_one = add_result(result_one, scores)
result_one[(result_one['features']==features) & (result_one['classifier']==classifier)]

Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
7,word2vec,NB,0.1878,0.1878,0.152,0.152,0.103631,0.082251,449,343,5,183,139,4


#### KNN


In [75]:
classifier = "KNN"
model = KNeighborsClassifier()
model.fit(X_train, y_train)

KNeighborsClassifier()

In [76]:
scores = calc_metrics(y_train, model.predict(X_train), y_test, model.predict(X_test))
scores["features"] = features
scores['classifier'] = classifier
result_one = add_result(result_one, scores)
result_one[(result_one['features']==features) & (result_one['classifier']==classifier)]

Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
8,word2vec,KNN,0.4644,0.4644,0.215,0.215,0.302423,0.120448,570,140,2,222,65,0


### BERT uncased

In [77]:
features = "BERT_uncased"

#### Loading the features

In [78]:
with open(DIR+"/BERT_features_20tags.pkl", 'rb') as ifile :
    BERT_features_20tags = pickle.load(ifile)
# with open("gdrive/Othercomputers/Mon ordinateur portable/P5_stackoverflow/USE_features.pkl", 'rb') as ifile :
#     USE_features = pickle.load(ifile)
# with open("gdrive/Othercomputers/Mon ordinateur portable/P5_stackoverflow/BERT_features.pkl", 'rb') as ifile :
#     BERT_features = pickle.load(ifile)

#### Splitting the dataset into train and test sets

In [79]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    BERT_features_20tags,
    y_one,
    train_size=5000,
    test_size=2000,
    random_state=42,
)

In [80]:
print(f"Number of observations in the train set: {len(X_train)}")
print(f"Number of observations in the test set: {len(X_test)}")

Number of observations in the train set: 5000
Number of observations in the test set: 2000


#### SVC

In [81]:
classifier = "SVC"
model = SVC()
model.fit(X_train, y_train)

SVC()

In [82]:
scores = calc_metrics(y_train, model.predict(X_train), y_test, model.predict(X_test))
scores["features"] = features
scores['classifier'] = classifier
result_one = add_result(result_one, scores)
result_one[(result_one['features']==features) & (result_one['classifier']==classifier)]

Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
9,BERT_uncased,SVC,0.405,0.405,0.288,0.288,0.253918,0.168224,403,0,2,175,0,1


#### Naive Bayes

In [83]:
classifier = "NB"
model = GaussianNB()
model.fit(X_train, y_train)

GaussianNB()

In [84]:
scores = calc_metrics(y_train, model.predict(X_train), y_test, model.predict(X_test))
scores["features"] = features
scores['classifier'] = classifier
result_one = add_result(result_one, scores)
result_one[(result_one['features']==features) & (result_one['classifier']==classifier)]

Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
10,BERT_uncased,NB,0.2086,0.2086,0.137,0.137,0.116445,0.073537,718,202,11,295,70,6


#### KNN


In [85]:
classifier = "KNN"
model = KNeighborsClassifier()
model.fit(X_train, y_train)

KNeighborsClassifier()

In [86]:
scores = calc_metrics(y_train, model.predict(X_train), y_test, model.predict(X_test))
scores["features"] = features
scores['classifier'] = classifier
result_one = add_result(result_one, scores)
result_one[(result_one['features']==features) & (result_one['classifier']==classifier)]

Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
11,BERT_uncased,KNN,0.4494,0.4494,0.1965,0.1965,0.289823,0.108955,631,104,0,270,54,3


### BERT HF_notrain

In [87]:
features = "BERT_HF_notrain"

#### Loading the features

In [88]:
with open(DIR+"/BERT_HF_features_20tags_notrain.pkl", 'rb') as ifile :
    BERT_features_20tags = pickle.load(ifile)
# with open("gdrive/Othercomputers/Mon ordinateur portable/P5_stackoverflow/USE_features.pkl", 'rb') as ifile :
#     USE_features = pickle.load(ifile)
# with open("gdrive/Othercomputers/Mon ordinateur portable/P5_stackoverflow/BERT_features.pkl", 'rb') as ifile :
#     BERT_features = pickle.load(ifile)

#### Splitting the dataset into train and test sets

In [89]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    BERT_features_20tags,
    y_one,
    train_size=5000,
    test_size=2000,
    random_state=42,
)

In [90]:
print(f"Number of observations in the train set: {len(X_train)}")
print(f"Number of observations in the test set: {len(X_test)}")

Number of observations in the train set: 5000
Number of observations in the test set: 2000


#### SVC

In [91]:
classifier = "SVC"
model = SVC()
model.fit(X_train, y_train)

SVC()

In [92]:
scores = calc_metrics(y_train, model.predict(X_train), y_test, model.predict(X_test))
scores["features"] = features
scores['classifier'] = classifier
result_one = add_result(result_one, scores)
result_one[(result_one['features']==features) & (result_one['classifier']==classifier)]

Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
12,BERT_HF_notrain,SVC,0.2282,0.2282,0.193,0.193,0.128796,0.106807,8,0,0,6,0,0


#### Naive Bayes

In [93]:
classifier = "NB"
model = GaussianNB()
model.fit(X_train, y_train)

GaussianNB()

In [94]:
scores = calc_metrics(y_train, model.predict(X_train), y_test, model.predict(X_test))
scores["features"] = features
scores['classifier'] = classifier
result_one = add_result(result_one, scores)
result_one[(result_one['features']==features) & (result_one['classifier']==classifier)]

Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
13,BERT_HF_notrain,NB,0.0474,0.0474,0.043,0.043,0.024275,0.021972,4,131,0,0,51,0


#### KNN


In [95]:
classifier = "KNN"
model = KNeighborsClassifier()
model.fit(X_train, y_train)

KNeighborsClassifier()

In [96]:
scores = calc_metrics(y_train, model.predict(X_train), y_test, model.predict(X_test))
scores["features"] = features
scores['classifier'] = classifier
result_one = add_result(result_one, scores)
result_one[(result_one['features']==features) & (result_one['classifier']==classifier)]

Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
14,BERT_HF_notrain,KNN,0.3986,0.3986,0.138,0.138,0.248907,0.074114,764,145,3,321,81,8


### BERT HF_train

In [97]:
features = "BERT_HF_train"

#### Loading the features

In [98]:
with open(DIR+"/BERT_HF_features_20tags_new.pkl", 'rb') as ifile :
    BERT_features_20tags = pickle.load(ifile)
# with open("gdrive/Othercomputers/Mon ordinateur portable/P5_stackoverflow/USE_features.pkl", 'rb') as ifile :
#     USE_features = pickle.load(ifile)
# with open("gdrive/Othercomputers/Mon ordinateur portable/P5_stackoverflow/BERT_features.pkl", 'rb') as ifile :
#     BERT_features = pickle.load(ifile)

#### Splitting the dataset into train and test sets

In [99]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    BERT_features_20tags,
    y_one,
    train_size=5000,
    test_size=2000,
    random_state=42,
)

In [100]:
print(f"Number of observations in the train set: {len(X_train)}")
print(f"Number of observations in the test set: {len(X_test)}")

Number of observations in the train set: 5000
Number of observations in the test set: 2000


#### SVC

In [101]:
classifier = "SVC"
model = SVC()
model.fit(X_train, y_train)

SVC()

In [102]:
scores = calc_metrics(y_train, model.predict(X_train), y_test, model.predict(X_test))
scores["features"] = features
scores['classifier'] = classifier
result_one = add_result(result_one, scores)
result_one[(result_one['features']==features) & (result_one['classifier']==classifier)]

Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
15,BERT_HF_train,SVC,0.5966,0.5966,0.5555,0.5555,0.42511,0.384562,415,0,1,179,0,2


#### Naive Bayes

In [103]:
classifier = "NB"
model = GaussianNB()
model.fit(X_train, y_train)

GaussianNB()

In [104]:
scores = calc_metrics(y_train, model.predict(X_train), y_test, model.predict(X_test))
scores["features"] = features
scores['classifier'] = classifier
result_one = add_result(result_one, scores)
result_one[(result_one['features']==features) & (result_one['classifier']==classifier)]

Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
16,BERT_HF_train,NB,0.495,0.495,0.4905,0.4905,0.328904,0.324942,346,240,1,142,105,2


#### KNN


In [105]:
classifier = "KNN"
model = KNeighborsClassifier()
model.fit(X_train, y_train)

KNeighborsClassifier()

In [106]:
scores = calc_metrics(y_train, model.predict(X_train), y_test, model.predict(X_test))
scores["features"] = features
scores['classifier'] = classifier
result_one = add_result(result_one, scores)
result_one[(result_one['features']==features) & (result_one['classifier']==classifier)]

Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
17,BERT_HF_train,KNN,0.6306,0.6306,0.497,0.497,0.460494,0.330672,465,65,3,186,35,2


### USE

In [107]:
features = "USE"

#### Loading the features

In [108]:
with open(DIR+"/USE_features_20tags.pkl", 'rb') as ifile :
    USE_features_20tags = pickle.load(ifile)
# with open("gdrive/Othercomputers/Mon ordinateur portable/P5_stackoverflow/USE_features.pkl", 'rb') as ifile :
#     USE_features = pickle.load(ifile)
# with open("gdrive/Othercomputers/Mon ordinateur portable/P5_stackoverflow/BERT_features.pkl", 'rb') as ifile :
#     BERT_features = pickle.load(ifile)

#### Splitting the dataset into train and test sets

In [109]:
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
    USE_features_20tags,
    y_one,
    train_size=5000,
    test_size=2000,
    random_state=42,
)

In [110]:
print(f"Number of observations in the train set: {len(X_train)}")
print(f"Number of observations in the test set: {len(X_test)}")

Number of observations in the train set: 5000
Number of observations in the test set: 2000


#### SVC

In [111]:
classifier = "SVC"
model = SVC()
model.fit(X_train, y_train)

SVC()

In [112]:
scores = calc_metrics(y_train, model.predict(X_train), y_test, model.predict(X_test))
scores["features"] = features
scores['classifier'] = classifier
result_one = add_result(result_one, scores)
result_one[(result_one['features']==features) & (result_one['classifier']==classifier)]

Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
18,USE,SVC,0.8568,0.8568,0.6755,0.6755,0.749475,0.510004,371,0,0,135,0,0
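
In every result row, the accuracy and precision columns are identical. This is what micro-averaged metrics produce for a single-label multiclass problem, and the Jaccard values fit the same pattern: for the row above, 0.6755 / (2 - 0.6755) ≈ 0.510004. The internals of `calc_metrics` are not shown here, so the use of `average="micro"` is an assumption. The sketch below reproduces the relationship on synthetic labels built to give 1351 correct predictions out of 2000.

```python
import numpy as np
from sklearn.metrics import accuracy_score, jaccard_score, precision_score

# Synthetic labels (illustration only): 20 classes, 2000 samples.
y_true = np.arange(2000) % 20
y_pred = y_true.copy()
# Make the first 649 predictions wrong, leaving 1351 correct ones.
y_pred[:649] = (y_true[:649] + 1) % 20

acc = accuracy_score(y_true, y_pred)
# Assumption: micro averaging, which pools all classes before dividing.
prec = precision_score(y_true, y_pred, average="micro")
jac = jaccard_score(y_true, y_pred, average="micro")

print(f"accuracy={acc:.4f} precision={prec:.4f} jaccard={jac:.6f}")
print(f"accuracy / (2 - accuracy) = {acc / (2 - acc):.6f}")
```

If that assumption holds, the precision columns carry no information beyond accuracy. A macro-averaged precision would be needed to see per-tag differences.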


#### Naive Bayes

In [113]:
classifier = "NB"
model = GaussianNB()
model.fit(X_train, y_train)

GaussianNB()

In [114]:
scores = calc_metrics(y_train, model.predict(X_train), y_test, model.predict(X_test))
scores["features"] = features
scores['classifier'] = classifier
result_one = add_result(result_one, scores)
result_one[(result_one['features']==features) & (result_one['classifier']==classifier)]

Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
19,USE,NB,0.6016,0.6016,0.5505,0.5505,0.430206,0.379786,339,141,0,110,64,0


#### KNN


In [115]:
classifier = "KNN"
model = KNeighborsClassifier()
model.fit(X_train, y_train)

KNeighborsClassifier()

In [116]:
scores = calc_metrics(y_train, model.predict(X_train), y_test, model.predict(X_test))
scores["features"] = features
scores['classifier'] = classifier
result_one = add_result(result_one, scores)
result_one[(result_one['features']==features) & (result_one['classifier']==classifier)]

Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
20,USE,KNN,0.7034,0.7034,0.559,0.559,0.542496,0.387925,431,27,0,160,20,0


In [117]:
result_one

Unnamed: 0,features,classifier,accuracy_train,precision_train,accuracy_test,precision_test,jaccard_train,jaccard_test,true_train,falseneg_train,falsepos_train,true_test,falseneg_test,falsepos_test
0,bow_cv,SVC,0.5988,0.5988,0.4135,0.4135,0.427348,0.260637,215,0,0,51,0,0
1,bow_cv,NB,0.9392,0.9392,0.4275,0.4275,0.88537,0.27186,326,23,0,103,21,0
2,bow_cv,KNN,0.4998,0.4998,0.2585,0.2585,0.333156,0.148435,394,119,0,132,77,0
3,bow_tfidf,SVC,0.955,0.955,0.554,0.554,0.913876,0.383126,335,0,0,88,0,0
4,bow_tfidf,NB,0.9516,0.9516,0.4205,0.4205,0.907669,0.266223,327,20,0,108,17,0
5,bow_tfidf,KNN,0.6224,0.6224,0.392,0.392,0.4518,0.243781,415,59,2,151,45,0
6,word2vec,SVC,0.5192,0.5192,0.383,0.383,0.350621,0.236858,297,0,1,99,0,1
7,word2vec,NB,0.1878,0.1878,0.152,0.152,0.103631,0.082251,449,343,5,183,139,4
8,word2vec,KNN,0.4644,0.4644,0.215,0.215,0.302423,0.120448,570,140,2,222,65,0
9,BERT_uncased,SVC,0.405,0.405,0.288,0.288,0.253918,0.168224,403,0,2,175,0,1


In [119]:
result_one.to_csv(DIR+"/result_segmentation_onetags.csv")

These results lead to several conclusions:
- the general observations discussed for the multilabel segmentation remain valid
- predicting a single tag gives a better segmentation
- TF-IDF preprocessing gives an excellent score on the train set (accuracy 0.955 with SVC) but overfits, since test accuracy drops to 0.554
- USE preprocessing still performs very well (test accuracy 0.6755 with SVC)
- BERT preprocessing looks promising when combined with complex models that receive additional training
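
The overfitting mentioned in this list can be quantified as the gap between train and test accuracy. The sketch below rebuilds a small table from a few of the rows reported above. The `results` and `gap` names are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Values copied from the single-tag results reported in this notebook.
results = pd.DataFrame(
    [
        ("bow_tfidf", "SVC", 0.9550, 0.5540),
        ("bow_tfidf", "NB", 0.9516, 0.4205),
        ("USE", "SVC", 0.8568, 0.6755),
        ("USE", "NB", 0.6016, 0.5505),
        ("USE", "KNN", 0.7034, 0.5590),
    ],
    columns=["features", "classifier", "accuracy_train", "accuracy_test"],
)

# Generalization gap: a large positive value signals overfitting.
results["gap"] = results["accuracy_train"] - results["accuracy_test"]

# Rank the combinations by test accuracy, best first.
ranked = results.sort_values("accuracy_test", ascending=False).reset_index(drop=True)
print(ranked)
```

On these rows, USE with SVC ranks first on the test set with a gap of about 0.18, while both TF-IDF models lose 0.40 or more between train and test.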

To conclude, USE preprocessing gives the best results, even though they are not perfect.  
TF-IDF preprocessing also seems effective, but its overfitting should be studied in more detail.  
In any case, the SVC model gives very good results, although other, possibly better-performing models could be considered.  
For deployment, USE preprocessing combined with the SVC model would be the natural choice, but saving the TensorFlow model inside a pipeline is complex. For the sake of simplicity, we will therefore use TF-IDF preprocessing with the SVC model.  
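
One reason the TF-IDF option is simpler to deploy is that the vectorizer and the classifier can sit in a single scikit-learn `Pipeline` and be serialized with `pickle` as one object. The sketch below illustrates the idea on a hypothetical toy corpus. The data, the step names and the default hyperparameters are assumptions, not the deployed configuration.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy data, for illustration only (not the Stack Overflow dataset).
questions = [
    "how to merge two pandas dataframes in python",
    "python list comprehension with condition",
    "python virtualenv pip install error",
    "javascript promise async await example",
    "javascript array map filter reduce",
    "javascript dom event listener click",
]
tags = ["python", "python", "python", "javascript", "javascript", "javascript"]

# Raw text goes in, a predicted tag comes out: no separate preprocessing step.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svc", SVC()),
])
pipeline.fit(questions, tags)

# Serialize and restore the whole pipeline as a single object.
payload = pickle.dumps(pipeline)
restored = pickle.loads(payload)

new_question = ["python pandas dataframe groupby"]
print(restored.predict(new_question))
```

The restored object accepts raw question text directly, so an API endpoint would only need to load one file. The pickle should be loaded with the same scikit-learn version that produced it.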