In this notebook we analyze the database of Python and R posts in order to
train supervised models.

We start with a KNN model, then a RandomForest and a Multi Layer Perceptron,
to classify posts as Python or R. Each post can carry several tags,
so we use the One-vs-All strategy.

Once the models are trained, we compare their F1 scores and choose the best one
for use in our API.
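
To make the One-vs-All idea concrete, here is a minimal sketch that turns a few tag strings into one binary target column per tag. Each column can then be treated as its own "this tag versus everything else" problem. The sample tags and the helper name `one_vs_all_targets` are illustrative assumptions, not part of the notebook's pipeline.

```python
def one_vs_all_targets(tag_strings):
    """Return (sorted tag list, one binary row per post)."""
    # Hypothetical helper, for illustration only.
    vocabulary = sorted({t for s in tag_strings for t in s.split()})
    rows = [[1 if tag in s.split() else 0 for tag in vocabulary]
            for s in tag_strings]
    return vocabulary, rows


tags = ["python pandas", "r ggplot2", "python r pandas"]  # made-up examples
vocabulary, targets = one_vs_all_targets(tags)

print(vocabulary)
for row in targets:
    print(row)
```

Each column of `targets` is a binary label vector, so a post can belong to several classes at once.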

# Importing libraries

In [19]:
import pandas as pd
import numpy as np
from IPython.display import display
import pickle # to export the trained models



pd.set_option('display.max_colwidth', None)

# Loading the data

In [20]:
dtypes_questions = {'Id':'int32', 'Score': 'int16', 'Title': 'str',
                    'Body': 'str', 'Title_raw': 'str', 'Text': 'str',
                    'Tags': 'str'}

nrows = 4000

df_questions = pd.read_csv('df_questions_fullclean.csv',
                           usecols=dtypes_questions.keys(),
                           encoding = "utf-8",
                           dtype=dtypes_questions,
                           nrows=nrows
                          )

print(len(df_questions))
display(df_questions.head(5))


4000


Unnamed: 0,Id,Title,Body,Tags,Score,Title_raw,Text
0,3604587,use string format show zero precision,try represent number lead trail width include point example want represent seem string format let one output get illustrating problem f f line exactly would expect line ignores fact want lead zero idea thanks,python string-formatting,35,How do I use string formatting to show BOTH leading zeros and precision of 3?,use string format show zero precision try represent number lead trail width include point example want represent seem string format let one output get illustrating problem f f line exactly would expect line ignores fact want lead zero idea thanks
1,1089662,python inflate deflate implementation,interfacing server require data send deflate algorithm huffman encode lz also send data need inflate know python include c library zlib support call inflate deflate apparently provide zlib module provide compress decompress make call follow resultdata zlibdecompress basedecodedcompressedstring follow error error decompress data header check gzip make resultdata gzipgzipfile fileobj basedecodedcompressedstring error ioerror gzipped file make sense data deflate file gzipped file know implementation pyflate know implementation seem option find exist implementation ideal inflate deflate python python extension zlib c include deflate call something else execute command line ruby script since inflatedeflate call zlib fully wrap ruby seek solution lack solution insight opinion idea information result deflate encode string purpose need result follow snippet c code parameter array bytes correspond data compress string deflateandencodebasebyte data null data datalength return string compressedbase memory stream wrap stream use memorystream m memorystream use deflatestream deflatestream compressionmodecompress byte buffer memorystream deflatestreamwritedata datalength memory stream base string byte compressedbytes bytemslength seekoriginbegin msreadcompressedbytes intmslength compressedbase converttobasestringcompressedbytes return run code string encode give result deflate encode run python base encode result ejxlsulssxjvujmsfizuvotlvyefafxhbk zlibcompress implementation algorithm deflate algorithm information first bytes deflate data bhy b decode correspond gzip data xfb bzip xa data zlib xc data first bytes python compress data ejxls b decode xc zlib header solve deflate inflate without checksum follow thing need deflatecompress strip first two byte four byte checksum argument window size value suppress header method currently include base encodingdecoding work properly import zlib import base def decodebaseandinflate bstring 
decodeddata basebdecode bstring return zlibdecompress decodeddata def deflateandbaseencode stringval zlibbedstr zlibcompress stringval compressedstring return basebencode compressedstring,c# python compression zlib,61,Python: Inflate and Deflate implementations,python inflate deflate implementation interfacing server require data send deflate algorithm huffman encode lz also send data need inflate know python include c library zlib support call inflate deflate apparently provide zlib module provide compress decompress make call follow resultdata zlibdecompress basedecodedcompressedstring follow error error decompress data header check gzip make resultdata gzipgzipfile fileobj basedecodedcompressedstring error ioerror gzipped file make sense data deflate file gzipped file know implementation pyflate know implementation seem option find exist implementation ideal inflate deflate python python extension zlib c include deflate call something else execute command line ruby script since inflatedeflate call zlib fully wrap ruby seek solution lack solution insight opinion idea information result deflate encode string purpose need result follow snippet c code parameter array bytes correspond data compress string deflateandencodebasebyte data null data datalength return string compressedbase memory stream wrap stream use memorystream m memorystream use deflatestream deflatestream compressionmodecompress byte buffer memorystream deflatestreamwritedata datalength memory stream base string byte compressedbytes bytemslength seekoriginbegin msreadcompressedbytes intmslength compressedbase converttobasestringcompressedbytes return run code string encode give result deflate encode run python base encode result ejxlsulssxjvujmsfizuvotlvyefafxhbk zlibcompress implementation algorithm deflate algorithm information first bytes deflate data bhy b decode correspond gzip data xfb bzip xa data zlib xc data first bytes python compress data ejxls b decode xc zlib header solve deflate inflate 
without checksum follow thing need deflatecompress strip first two byte four byte checksum argument window size value suppress header method currently include base encodingdecoding work properly import zlib import base def decodebaseandinflate bstring decodeddata basebdecode bstring return zlibdecompress decodeddata def deflateandbaseencode stringval zlibbedstr zlibcompress stringval compressedstring return basebencode compressedstring
2,1185634,solve mastermind guessing game,would create solve follow puzzle mastermind choose four colour set six blue orange purple must guess choose order guess opponent tell colour guess right colour place black right colour place white game end guess correctly black white example opponent choose orange guess yellow get one two white would get score guess orange purple algorithm would choose optionally code preferably python coded solution easily concise fast make number guess solve puzzle easily answer question algorithm case easily adapt type puzzle mastermind algorithm efficient provide poorly implement however algorithm implement inflexibly impenetrably use solution python post mean approach please post expect essay,python algorithm,38,"How to solve the ""Mastermind"" guessing game?",solve mastermind guessing game would create solve follow puzzle mastermind choose four colour set six blue orange purple must guess choose order guess opponent tell colour guess right colour place black right colour place white game end guess correctly black white example opponent choose orange guess yellow get one two white would get score guess orange purple algorithm would choose optionally code preferably python coded solution easily concise fast make number guess solve puzzle easily answer question algorithm case easily adapt type puzzle mastermind algorithm efficient provide poorly implement however algorithm implement inflexibly impenetrably use solution python post mean approach please post expect essay
3,15259547,sum pandas aggregate,recently make switch r python trouble get use data frame oppose use r problem would like take list string check value sum count string break user would like take data aid b c leave return aidgrouped overup r code use dt listaidgrouped sumdown overup sumup bylista however attempt python fail c npsumdtdtbup dtc thank advance seem like question however could find,python r pandas data.table,22,conditional sums for pandas aggregate,sum pandas aggregate recently make switch r python trouble get use data frame oppose use r problem would like take list string check value sum count string break user would like take data aid b c leave return aidgrouped overup r code use dt listaidgrouped sumdown overup sumup bylista however attempt python fail c npsumdtdtbup dtc thank advance seem like question however could find
4,2317849,use sock proxy urllib,use sock proxy download web page,python proxy urllib2 socks,48,How can I use a SOCKS 4/5 proxy with urllib2?,use sock proxy urllib use sock proxy download web page


In [21]:
# create the python and r labels
text, tag = df_questions.Text, df_questions.Tags
print("type of text_train: {}".format(type(text)))
print("length of text_train: {}".format(len(text)))
print("text_train[6]:\n{}".format(text[6]))

type of text_train: <class 'pandas.core.series.Series'>
length of text_train: 4000
text_train[6]:
match fuzzy match string two datasets work way join two datasets base string name company match two dirty list one list name information another list names address neither ids match assume clean already apply maybe insertion far agrep tool find might work use distance package measure number deletion insertion substitution two string agrep return string distance however trouble turn command value apply data frame crudely use repeat function get way see follow code cobayes asd baes bayspricec bdataframenamecace cobayes incasdfqtyc agrepanamei bname value max listdel ins ayi agrepanamei bname value max listdel in


In [22]:
type(text)

pandas.core.series.Series

In [23]:
type(tag)

pandas.core.series.Series

In [24]:
# add the files for testing

### Representing the text data as a bag-of-words

In [69]:
from sklearn.feature_extraction.text import CountVectorizer

vect_X = CountVectorizer().fit(text)
X = vect_X.transform(text)

# The posts were shuffled during preprocessing, so we use the first 1000 posts
# as the validation set and the next 1000 as the test set; the rest of the dataset
# is used to train the model

X_val = X[:1000]
X_test = X[1000:2000]
X_train = X[2000:]

pickle.dump(vect_X, open('API/models/vect_X.pickle', 'wb')) # save the fitted X vectorizer

print("X_train:\n{}".format(repr(X_train)))

X_train:
<2000x25593 sparse matrix of type '<class 'numpy.int64'>'
	with 66639 stored elements in Compressed Sparse Row format>
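
The vectorization and the positional split can be pictured with a hand-rolled version. This is a simplified sketch, assuming the scikit-learn defaults (lowercasing, tokens of at least two word characters, alphabetically sorted vocabulary). The documents and the `*_demo` names are made up for illustration.

```python
import re
from collections import Counter

TOKEN = re.compile(r"(?u)\b\w\w+\b")  # scikit-learn's default token_pattern


def bag_of_words(docs):
    """Return (sorted vocabulary, dense count matrix as a list of rows)."""
    tokenized = [TOKEN.findall(d.lower()) for d in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    matrix = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for tok, n in Counter(toks).items():
            row[index[tok]] = n
        matrix.append(row)
    return vocab, matrix


docs = ["use string format", "string string proxy",
        "sum pandas aggregate", "r data frame"]  # made-up posts
vocab, X_demo = bag_of_words(docs)

# Positional split, as in the notebook: first rows for validation,
# next rows for test, the remainder for training.
X_val_demo, X_test_demo, X_train_demo = X_demo[:1], X_demo[1:2], X_demo[2:]

print(vocab)
print(X_demo[1])
```

Note that the one-character token `r` does not appear in the vocabulary, which is why the tags need a different token pattern than the post text.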


In [70]:
feature_names_X = vect_X.get_feature_names()
print("Number of features: {}".format(len(feature_names_X)))
print("First 20 features:\n{}".format(feature_names_X[:20]))
print("Features 20010 to 20030:\n{}".format(feature_names_X[20010:20030]))
print("Every 2000th feature:\n{}".format(feature_names_X[::2000]))

Number of features: 25593
First 20 features:
['aa', 'aaa', 'aaaaa', 'aaaaaaaaaa', 'aaab', 'aaabbb', 'aaabbzzyy', 'aaardvark', 'aab', 'aababcbac', 'aabbcdefg', 'aabbzzyy', 'aabcd', 'aac', 'aadata', 'aaf', 'aafaf', 'aall', 'aamir', 'aandalucia']
Features 20010 to 20030:
['selfinitb', 'selfinitdefaultregister', 'selfinitialjobrecord', 'selfinitialoperators', 'selfinitmessage', 'selfinitsocketafinet', 'selfinittupleb', 'selfinstreamread', 'selfintersecting', 'selfinventorynames', 'selfitem', 'selfjointloglikelihoodx', 'selfkey', 'selfkids', 'selfkillreceived', 'selfknowledge', 'selfl', 'selflabelencodertransformselftrainlabels', 'selflabeltext', 'selflearnc']
Every 2000th feature:
['aa', 'boolean', 'cperform', 'docallload', 'frameroot', 'icudata', 'light', 'mydir', 'patchgetpatchtransformtransformvertices', 'read', 'selfidxnil', 'summarisenew', 'vba']


In [71]:
vect_Y = CountVectorizer(binary=True,
                         max_features=None,
                         token_pattern=rx).fit(tag)
# binary=True because there is no point in counting the same token more than once
# we will later limit the number of features, because the F1 score with all tags for the KNN model is 0.013
# with 10 features we reach an F1 score of 37.8%
# a non-default token_pattern (rx, defined outside the cells shown here) is used so that the one-character tag r is kept
Y = vect_Y.transform(tag).toarray()

Y_val = Y[:1000]
Y_test = Y[1000:2000]
Y_train = Y[2000:]

pickle.dump(vect_Y, open('API/models/vect_Y.pickle',
                         'wb'))  # save the fitted Y vectorizer

print("Y_train:\n{}".format(repr(Y_train)))

Y_train:
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
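
The effect of the token pattern on tags can be checked with the standard `re` module, since the vectorizer extracts tokens by applying the pattern with `findall`. The definition of `rx` is not shown in this notebook, so `assumed_rx` below is only a stand-in: a whitespace-delimited pattern, which is an assumption and not the notebook's actual value.

```python
import re

default_pattern = r"(?u)\b\w\w+\b"  # scikit-learn's default token_pattern
assumed_rx = r"\S+"                 # ASSUMPTION: stand-in for the notebook's `rx`

tags = "python r pandas data.table python-3.x"

default_tokens = re.findall(default_pattern, tags)
assumed_tokens = re.findall(assumed_rx, tags)

print(default_tokens)  # 'r' is dropped and dotted/hyphenated tags are split
print(assumed_tokens)  # every whitespace-delimited tag is kept whole
```

With the default pattern the tag `r` disappears entirely, which would make the Python versus R classification impossible to express in the labels.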


In [72]:
feature_names_Y = vect_Y.get_feature_names()
print("Number of features: {}".format(len(feature_names_Y)))
print("First 20 features:\n{}".format(feature_names_Y[:20]))
print("Features 800 to 820:\n{}".format(feature_names_Y[800:820]))
print("Every 200th feature:\n{}".format(feature_names_Y[::200]))

Number of features: 1802
First 20 features:
['32bit-64bit', '3d', '64-bit', 'a-star', 'aabb', 'abc', 'abort', 'absolute-path', 'abstract', 'abstract-class', 'abstract-syntax-tree', 'accessor', 'action', 'active-directory', 'aesthetics', 'aggregate', 'aggregate-functions', 'algorithm', 'alias', 'alignment']
Features 800 to 820:
['lattice', 'layout', 'lazy-evaluation', 'lazy-loading', 'lda', 'ldap', 'leaflet', 'left-join', 'legend', 'legend-properties', 'lemmatization', 'let', 'levels', 'libcurl', 'libraries', 'libsvm', 'libvlc', 'licensing', 'limit', 'line']
Every 200th feature:
['32bit-64bit', 'collation', 'django-urls', 'github-pages', 'lattice', 'nose', 'pyparsing', 'search', 'text', 'zoo']


# KNN model

## Function definitions

In [73]:
def text_prediction_labels(new_post, vect_X, vect_Y, model, df_questions):
    '''
        Predict the tags of a post with "model":
        - vectorize the text with vect_X
        - predict, and map the prediction back to tag names with vect_Y
        - print the predicted tags with the score given by the model
        - print the true labels as well
        Note: the post that is displayed is selected by the global variable id_sample.
    '''
    feature_names_Y = vect_Y.get_feature_names() # list of tags
    Y_train = vect_Y.transform(df_questions.Tags) # binary tag matrix, one row per post
    new_post_vect = vect_X.transform([new_post]) # vectorize the new post for the model
    y_predict = model.predict(new_post_vect) # prediction of the trained model

    # indices and values of the 10 highest-scoring tags, highest first
    tags = np.argsort(y_predict[0,:])[::-1][:10].tolist()
    scores = np.sort(y_predict[0,:])[::-1][:10]

    print(df_questions.Title_raw[id_sample],'\n')
    print(df_questions.Body[id_sample],'\n')
    print(df_questions.Text[id_sample])
    print('\n','Tags prediction : ', '\n')
    for tag,score in zip(tags,scores) :
        if score > 0  :
            print(feature_names_Y[tag],score)
    print('\n','Tags labels : ','\n')
    
    # same top-10 extraction, applied to the true labels of the post
    y_labels = Y_train[id_sample].toarray()
    tags = np.argsort(y_labels[0,:])[::-1][:10].tolist()
    scores = np.sort(y_labels[0,:])[::-1][:10]
    for tag,score in zip(tags,scores) :
        if score > 0  :
            print(feature_names_Y[tag],score)
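
The `argsort` / `sort` idiom used in this function can be seen in isolation in the sketch below. The tag vocabulary and the scores are hypothetical. Distinct scores are used on purpose: with the 0/1 outputs of the notebook's models, the order among equally scored tags is arbitrary.

```python
import numpy as np

feature_names = ["django", "numpy", "python", "r", "string"]  # hypothetical tags
y_predict = np.array([[0.1, 0.0, 0.9, 0.2, 0.7]])             # hypothetical scores for one post

top = np.argsort(y_predict[0, :])[::-1][:3]    # indices of the 3 highest scores
scores = np.sort(y_predict[0, :])[::-1][:3]    # the matching scores, highest first

# Keep only tags with a strictly positive score, as the function does.
predicted = [feature_names[i] for i, s in zip(top, scores) if s > 0]
print(predicted)
```

Reversing an ascending `argsort` and an ascending `sort` keeps indices and values aligned, because both are ordered by the same values.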

In [56]:
from sklearn.metrics import f1_score , precision_score, recall_score

def model_score(model,X_vect,tag_text,vect_Y,names=False,seuil=0.5):
    '''
        Print the F1, precision and recall scores of the chosen model.
        The seuil (threshold) option adjusts the precision/recall trade-off,
        and names controls whether the tag names are printed.
    '''
    Y = vect_Y.transform(tag_text).toarray()

    y_model_pred = model.predict(X_vect)
    y_model_pred_ones = (y_model_pred >seuil).astype(int)
    
    f1_score_model        = f1_score(Y, y_model_pred_ones, average="micro")
    precision_score_model = precision_score(Y, y_model_pred_ones, average="micro")
    recall_score_model    = recall_score(Y, y_model_pred_ones, average="micro")
    
    nb_tag = len(Y[0])
    
    print("Nombre de Tags pour l'entrainement: ", nb_tag)
    if names :
        print(vect_Y.get_feature_names(),'\n')
    print("f1-score: {:.2f}".format(f1_score_model))
    print("precision_score: {:.2f}".format(precision_score_model))
    print("recall_score: {:.2f}".format(recall_score_model))
    print("Model parameters: ", model.get_params,'\n')

## Training the model with all labels

In [74]:
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier(weights ='distance') # closer neighbors get a larger weight in the vote
knn_clf.fit(X_train, Y_train)

pickle.dump(knn_clf, open('API/models/knn_clf.pickle', 'wb'))
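
A common way to write a pickle and read it back (for instance on the API side) is with context managers, which close the file handle even if an error occurs. The sketch below uses a plain dictionary as a stand-in for a fitted estimator and a temporary directory instead of the `API/models/` folder. Both are assumptions made to keep the example self-contained.

```python
import os
import pickle
import tempfile

model_stub = {"name": "knn_clf", "weights": "distance"}  # stand-in for a fitted model

path = os.path.join(tempfile.mkdtemp(), "model_stub.pickle")  # hypothetical location

with open(path, "wb") as f:   # file is closed automatically on exit
    pickle.dump(model_stub, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored)
```

Pickles of real estimators should be loaded with the same library versions that produced them, and only from trusted sources.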

In [76]:
# example prediction: the validation set is rows 0 to 1000, the test set rows 1000
# to 2000, and the training set rows 2000 and beyond

id_sample = 1500
new_post = text[id_sample]
text_prediction_labels(new_post,vect_X,vect_Y,knn_clf,df_questions)


How to do multiple imports in Python? 

ruby instead repeat require import python word lot time lib lib libeach x require x iterate set libs require import one writing python script would like something like way need import straightforward traduction would something like follow code anyway since python import libs name string work requirement lib lib lib lib requirement import thanks advance 

import python ruby instead repeat require import python word lot time lib lib libeach x require x iterate set libs require import one writing python script would like something like way need import straightforward traduction would something like follow code anyway since python import libs name string work requirement lib lib lib lib requirement import thanks advance

 Tags prediction :  

python 1

 Tags labels :  

require 1
python 1
iterator 1
import 1


### Comparing predictions with labels on the training set and the test set

In [34]:
y_pred =knn_clf.predict(X_train)

In [35]:
y_pred_one = (y_pred>0.5).astype(int)
np.sort(np.sum(y_pred_one,axis=0))[-40:]

array([  14,   14,   15,   15,   15,   15,   15,   17,   17,   18,   18,
         18,   19,   19,   19,   20,   21,   22,   24,   24,   26,   27,
         27,   28,   28,   30,   31,   31,   37,   39,   42,   55,   55,
         60,   72,   84,   92,  104,  575, 1434])

In [36]:
np.sort(np.sum(Y_train,axis=0))[-40:]

array([  14,   14,   15,   15,   15,   15,   15,   17,   17,   18,   18,
         18,   19,   19,   19,   20,   21,   22,   24,   24,   26,   27,
         27,   28,   28,   30,   31,   31,   37,   39,   42,   55,   55,
         60,   72,   84,   92,  104,  575, 1434])

The predictions are identical to the labels because we chose distance weighting in our model: each training post is its own nearest neighbor, at distance zero, so its own labels decide the prediction.
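
The effect can be reproduced with a simplified, hand-written vote for a single binary label. This is a sketch of the idea rather than scikit-learn's implementation. The points, labels and function name are made up, and the rule that zero-distance neighbors take the whole vote follows the library's documented behavior as I understand it.

```python
import math


def knn_vote(query, points, labels, k=3, weighted=True):
    """k-NN vote for one binary label, with optional inverse-distance weights."""
    nearest = sorted((math.dist(query, p), y) for p, y in zip(points, labels))[:k]
    if not weighted:
        votes = [(1.0, y) for _, y in nearest]
    elif any(d == 0 for d, _ in nearest):
        # A neighbor at distance zero takes the whole vote.
        votes = [(1.0, y) for d, y in nearest if d == 0]
    else:
        votes = [(1.0 / d, y) for d, y in nearest]
    ones = sum(w for w, y in votes if y == 1)
    zeros = sum(w for w, y in votes if y == 0)
    return 1 if ones > zeros else 0


points = [(0, 0), (1, 0), (0, 1), (5, 5)]  # made-up training points
labels = [1, 0, 0, 0]

weighted_train_preds = [knn_vote(p, points, labels) for p in points]
uniform_train_preds = [knn_vote(p, points, labels, weighted=False) for p in points]

print(weighted_train_preds)  # reproduces the training labels exactly
print(uniform_train_preds)   # the isolated positive point is outvoted
```

This is why a perfect match on the training set says nothing about generalization: with distance weighting, the model simply looks each training post up.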

In [37]:
y_test_pred =knn_clf.predict(X_test)

In [38]:
y_test_pred_one = (y_test_pred>0.5).astype(int)
np.sort(np.sum(y_test_pred_one,axis=0))[-40:]

array([  0,   0,   0,   0,   0,   1,   1,   1,   1,   1,   1,   1,   1,
         1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   2,   2,
         2,   2,   3,   3,   3,   5,   6,   6,  10,  11,  22,  33,  90,
       919])

In [39]:
np.sort(np.sum(Y_test,axis=0))[-40:]

array([  8,   8,   8,   9,   9,   9,   9,  10,  10,  10,  10,  10,  10,
        11,  11,  11,  12,  12,  12,  12,  13,  14,  15,  16,  16,  17,
        17,  18,  19,  25,  25,  26,  29,  33,  33,  45,  46,  62, 282,
       727])

On the test set, the predictions contain far fewer tags than the labels.

## Metrics: F1 score, precision and recall

### Cross Validation

In [50]:
%%time
# 50 seconds to process 2000 rows
# 15min 57s for 20,000 rows

from sklearn.model_selection import cross_val_predict

y_train_knn_pred = cross_val_predict(knn_clf, X_train, Y_train, cv=3)


CPU times: user 1.11 s, sys: 0 ns, total: 1.11 s
Wall time: 1.11 s


In [53]:
from sklearn.metrics import f1_score,precision_score, recall_score

y_train_knn_pred_ones = (y_train_knn_pred >0.5).astype(int)
print('résultats suite à une cross validation sur le train set')
print('f1 : ',f1_score(Y_train, y_train_knn_pred_ones, average="micro"))
print('precision : ',precision_score(Y_train, y_train_knn_pred_ones, average="micro"))
print('recall : ',recall_score(Y_train, y_train_knn_pred_ones, average="micro"))


résultats suite à une cross validation sur le train set
f1 :  0.4118083003952569
precision :  0.7465293327362292
recall :  0.2843254306668941


### On the test set

In [58]:
model_score(knn_clf,X_test,tag[1000:2000],vect_Y,seuil=0.2) 
# the threshold has no influence on the KNN model, because predict() returns hard 0/1 labels

Nombre de Tags pour l'entrainement:  1802
f1-score: 0.42
precision_score: 0.75
recall_score: 0.29
Model parameters:  <bound method BaseEstimator.get_params of KNeighborsClassifier(weights='distance')> 
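
Why the threshold has no effect here can be shown in a few lines: thresholding values that are already 0 or 1 gives the same result for any threshold between 0 and 1. It only matters for continuous scores such as probabilities. The matrices and the helper name below are made up for illustration.

```python
def binarize(rows, threshold):
    """Turn scores into 0/1 labels: 1 when the score exceeds the threshold."""
    return [[int(v > threshold) for v in row] for row in rows]


hard_predictions = [[0, 1, 0], [1, 1, 0]]            # 0/1 labels, as returned by predict()
soft_scores = [[0.1, 0.9, 0.3], [0.6, 0.45, 0.05]]   # hypothetical probabilities

hard_low, hard_high = binarize(hard_predictions, 0.2), binarize(hard_predictions, 0.5)
soft_low, soft_high = binarize(soft_scores, 0.2), binarize(soft_scores, 0.5)

print(hard_low == hard_high)  # same labels whatever the threshold
print(soft_low, soft_high)    # a lower threshold lets more tags through
```

To actually trade precision for recall, the threshold would have to be applied to probability-like scores instead of hard labels.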



In [59]:
model_score(knn_clf,X_train,tag[2000:],vect_Y,seuil=0.2) 
# the model has overfit the training set

Nombre de Tags pour l'entrainement:  1802
f1-score: 1.00
precision_score: 1.00
recall_score: 1.00
Model parameters:  <bound method BaseEstimator.get_params of KNeighborsClassifier(weights='distance')> 



As seen above, we get perfect scores on the training set but much less encouraging results on the test set. We will try to improve precision by reducing the number of possible tags: this should reduce the number of false positives and therefore raise precision.
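
Restricting the label space to the most frequent tags can be sketched as follows. This is a hand-written illustration of what limiting the vocabulary size achieves, not the vectorizer's own code. The posts and the helper name are made up.

```python
from collections import Counter


def top_k_tags(tag_strings, k):
    """Return the k most frequent tags, each counted once per post."""
    counts = Counter(t for s in tag_strings for t in set(s.split()))
    return sorted(tag for tag, _ in counts.most_common(k))


posts = ["python pandas", "python django", "r ggplot2",
         "python r", "r dplyr python"]  # made-up tag strings

kept = top_k_tags(posts, 2)
# Drop every tag outside the kept vocabulary.
filtered = [" ".join(t for t in s.split() if t in kept) for s in posts]

print(kept)
print(filtered)
```

Rare tags have very few training examples, so removing them leaves only labels the model has a realistic chance of learning.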

## Building a pipeline to choose the best parameter for this model

In [79]:
%%time
# 6min 42s for 20,000 posts

from sklearn.model_selection import GridSearchCV
param_grid = {'n_neighbors' : [2,4,6, 8,32,]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='f1_micro')

best_score = 0
best_param = {}
best_feature_number = 0

for i in [2,4,6,8,16,32]: # test with a limited number of tags
    vect_Y = CountVectorizer(binary=True, token_pattern= rx, max_features=i).fit(tag[2000:])
    Y_train = vect_Y.transform(tag[2000:]).toarray()
    grid.fit(X_train, Y_train)
    best_score = grid.best_score_
    best_param = grid.best_params_
    best_feature_number = i
    print("Nombre de features: ", best_feature_number)
    print(vect_Y.get_feature_names())
    print("Best cross-validation f1-score: {:.2f}".format(best_score))
    print("Best parameters: ", best_param,'\n')
    
    

Nombre de features:  2
['python', 'r']
Best cross-validation f1-score: 0.76
Best parameters:  {'n_neighbors': 6} 

Nombre de features:  4
['django', 'ggplot2', 'python', 'r']
Best cross-validation f1-score: 0.73
Best parameters:  {'n_neighbors': 6} 

Nombre de features:  6
['django', 'ggplot2', 'list', 'numpy', 'python', 'r']
Best cross-validation f1-score: 0.71
Best parameters:  {'n_neighbors': 6} 

Nombre de features:  8
['dataframe', 'django', 'ggplot2', 'list', 'numpy', 'python', 'r', 'string']
Best cross-validation f1-score: 0.69
Best parameters:  {'n_neighbors': 6} 

Nombre de features:  16
['data.table', 'dataframe', 'datetime', 'dictionary', 'django', 'ggplot2', 'list', 'matplotlib', 'numpy', 'pandas', 'plot', 'python', 'python-3.x', 'r', 'regex', 'string']
Best cross-validation f1-score: 0.65
Best parameters:  {'n_neighbors': 6} 

Nombre de features:  32
['arrays', 'class', 'data.table', 'dataframe', 'datetime', 'dictionary', 'django', 'dplyr', 'file', 'flask', 'function', 'gg
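
The loop above overwrites `best_score` at every iteration, so after it finishes the variables (and the fitted `grid`) describe the last tag count tried, not the highest score. Below is a minimal sketch of keeping the best configuration instead, using the cross-validation scores printed above. The 32-tag entry is left out because its output is truncated.

```python
# Cross-validation F1 scores printed above, keyed by number of tags.
cv_f1_by_tag_count = {2: 0.76, 4: 0.73, 6: 0.71, 8: 0.69, 16: 0.65}

best_score = 0.0
best_feature_number = None

for n_tags, score in cv_f1_by_tag_count.items():
    if score > best_score:  # only update when the score improves
        best_score = score
        best_feature_number = n_tags

print(best_feature_number, best_score)
```

Scores obtained with different numbers of tags measure different tasks, so the highest value is not automatically the most useful model: predicting 2 tags is easier than predicting 32.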

## Best model

In [65]:
KNNmodel = grid.best_estimator_

id_sample = 10
new_post = text[id_sample]
text_prediction_labels(new_post,vect_X,vect_Y,KNNmodel,df_questions)

Expanding Environment variable in string using python 

string contain environment mypath homedirdir want parse string look replace string print home osenvironhome myexpandedpath parsestringmypath print path myexpandedpath see output home homeuser path homeuserdirdir way thanks conor 

expand environment string use python string contain environment mypath homedirdir want parse string look replace string print home osenvironhome myexpandedpath parsestringmypath print path myexpandedpath see output home homeuser path homeuserdirdir way thanks conor

 Tags prediction :  

string 1
python 1

 Tags labels :  

python 1


With 32 tags (and 2000 posts) we get

- F1 score of 61%
- precision of 78% (precision has changed little because the python and r posts are generally well predicted)
- recall of 49%

With all 1802 tags

- F1 score: 42%
- precision: 75%
- recall: 29%

In [67]:
model_score(KNNmodel,X_test,tag[1000:2000],vect_Y,names=True,seuil=0.2) 

Nombre de Tags pour l'entrainement:  32
['arrays', 'class', 'data.table', 'dataframe', 'datetime', 'dictionary', 'django', 'dplyr', 'file', 'flask', 'function', 'ggplot2', 'knitr', 'linux', 'list', 'matplotlib', 'numpy', 'pandas', 'performance', 'plot', 'python', 'python-2.7', 'python-3.x', 'r', 'r-faq', 'regex', 'scipy', 'sqlalchemy', 'string', 'unicode', 'unit-testing', 'windows'] 

f1-score: 0.61
precision_score: 0.78
recall_score: 0.49
Model parameters:  <bound method BaseEstimator.get_params of KNeighborsClassifier(n_neighbors=6)> 



# Random Forest

In [80]:
%%time
# 4h 32min 57s for 20,000 posts

from sklearn.ensemble import RandomForestClassifier

param_grid = {'n_estimators' : [100,500,1000],
              'max_depth' : [2,4,8]
              }
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring='f1_micro')

best_score = 0
best_param = {}
best_feature_number = 0

for i in [2,4,8,16,32]:
    vect_Y = CountVectorizer(binary=True, token_pattern= rx, max_features=i).fit(tag[2000:])
    Y_train = vect_Y.transform(tag[2000:]).toarray()
    grid.fit(X_train, Y_train)
    best_score = grid.best_score_
    best_param = grid.best_params_
    best_feature_number = i
    print("Nombre de features: ", best_feature_number)
    print(vect_Y.get_feature_names())
    print("Best cross-validation f1-score: {:.2f}".format(best_score))
    print("Best parameters: ", best_param,'\n')

Nombre de features:  2
['python', 'r']
Best cross-validation f1-score: 0.73
Best parameters:  {'max_depth': 8, 'n_estimators': 100} 

Nombre de features:  4
['django', 'ggplot2', 'python', 'r']
Best cross-validation f1-score: 0.69
Best parameters:  {'max_depth': 8, 'n_estimators': 100} 

Nombre de features:  8
['dataframe', 'django', 'ggplot2', 'list', 'numpy', 'python', 'r', 'string']
Best cross-validation f1-score: 0.65
Best parameters:  {'max_depth': 8, 'n_estimators': 100} 

Nombre de features:  16
['data.table', 'dataframe', 'datetime', 'dictionary', 'django', 'ggplot2', 'list', 'matplotlib', 'numpy', 'pandas', 'plot', 'python', 'python-3.x', 'r', 'regex', 'string']
Best cross-validation f1-score: 0.61
Best parameters:  {'max_depth': 8, 'n_estimators': 1000} 

Nombre de features:  32
['arrays', 'class', 'data.table', 'dataframe', 'datetime', 'dictionary', 'django', 'dplyr', 'file', 'flask', 'function', 'ggplot2', 'knitr', 'linux', 'list', 'matplotlib', 'numpy', 'pandas', 'performa
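
The long running time follows from the number of models that are trained. Here is a short count, assuming the default `refit=True` of `GridSearchCV`, which retrains the best candidate once on the full training set after each search:

```python
from itertools import product

param_grid = {"n_estimators": [100, 500, 1000],
              "max_depth": [2, 4, 8]}
cv = 5
tag_counts = [2, 4, 8, 16, 32]

n_candidates = len(list(product(*param_grid.values())))  # every combination of the grid
fits_per_search = n_candidates * cv + 1                  # +1 for the final refit
total_fits = fits_per_search * len(tag_counts)

print(n_candidates, fits_per_search, total_fits)
```

Each of these fits trains up to 1000 trees on a multi-label target, which explains why the search takes hours on 20,000 posts.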

In [81]:
RandomForestmodel = grid.best_estimator_

id_sample = 10
new_post = text[id_sample]
text_prediction_labels(new_post,vect_X,vect_Y,RandomForestmodel,df_questions)

Expanding Environment variable in string using python 

string contain environment mypath homedirdir want parse string look replace string print home osenvironhome myexpandedpath parsestringmypath print path myexpandedpath see output home homeuser path homeuserdirdir way thanks conor 

expand environment string use python string contain environment mypath homedirdir want parse string look replace string print home osenvironhome myexpandedpath parsestringmypath print path myexpandedpath see output home homeuser path homeuserdirdir way thanks conor

 Tags prediction :  

python 1

 Tags labels :  

python 1


In [82]:
model_score(RandomForestmodel,X_test,tag[1000:2000],vect_Y,names=True,seuil=0.2) 

Nombre de Tags pour l'entrainement:  32
['arrays', 'class', 'data.table', 'dataframe', 'datetime', 'dictionary', 'django', 'dplyr', 'file', 'flask', 'function', 'ggplot2', 'knitr', 'linux', 'list', 'matplotlib', 'numpy', 'pandas', 'performance', 'plot', 'python', 'python-2.7', 'python-3.x', 'r', 'r-faq', 'regex', 'scipy', 'sqlalchemy', 'string', 'unicode', 'unit-testing', 'windows'] 

f1-score: 0.57
precision_score: 0.73
recall_score: 0.47
Model parameters:  <bound method BaseEstimator.get_params of RandomForestClassifier(max_depth=8)> 



Random Forest with {'max_depth': 8, 'n_estimators': 100} and 32 tags (and 2000 posts) gives

- F1 score of 57%
- precision of 73%
- recall of 47%

Precision is lower and recall is comparable to our previous KNN model:
- F1 score of 61%
- precision of 78%
- recall of 49%

# Multilayer Perceptron classifier

In [83]:
%%time
# 2d 7h 36min 50s for 20,000 posts
# 3min 58s for 2000 posts and 32 tags

from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import RandomizedSearchCV

nb_tag = 32

MLPC = MLPClassifier(hidden_layer_sizes=(100,100),
                    max_iter=500,
                    alpha=0.1, # L2 penalty (regularization term) parameter.
                    learning_rate_init=0.0001) #The initial learning rate used. It controls the step-size in updating the weights. 

vect_Y = CountVectorizer(binary=True,
                         token_pattern=rx,
                         max_features=nb_tag).fit(tag[2000:])
Y_train = vect_Y.transform(tag[2000:]).toarray()

MLPC.fit(X_train, Y_train)

CPU times: user 13min 57s, sys: 35min 55s, total: 49min 52s
Wall time: 4min 31s




MLPClassifier(alpha=0.1, hidden_layer_sizes=(100, 100),
              learning_rate_init=0.0001, max_iter=500)

In [85]:
id_sample = 10
new_post = text[id_sample]
text_prediction_labels(new_post,vect_X,vect_Y,MLPC,df_questions)

Expanding Environment variable in string using python 

string contain environment mypath homedirdir want parse string look replace string print home osenvironhome myexpandedpath parsestringmypath print path myexpandedpath see output home homeuser path homeuserdirdir way thanks conor 

expand environment string use python string contain environment mypath homedirdir want parse string look replace string print home osenvironhome myexpandedpath parsestringmypath print path myexpandedpath see output home homeuser path homeuserdirdir way thanks conor

 Tags prediction :  

string 1
python 1

 Tags labels :  

python 1


In [84]:
model_score(MLPC,X_test,tag[1000:2000],vect_Y,names=True,seuil=0.2) 

Nombre de Tags pour l'entrainement:  32
['arrays', 'class', 'data.table', 'dataframe', 'datetime', 'dictionary', 'django', 'dplyr', 'file', 'flask', 'function', 'ggplot2', 'knitr', 'linux', 'list', 'matplotlib', 'numpy', 'pandas', 'performance', 'plot', 'python', 'python-2.7', 'python-3.x', 'r', 'r-faq', 'regex', 'scipy', 'sqlalchemy', 'string', 'unicode', 'unit-testing', 'windows'] 

f1-score: 0.77
precision_score: 0.88
recall_score: 0.69
Model parameters:  <bound method BaseEstimator.get_params of MLPClassifier(alpha=0.1, hidden_layer_sizes=(100, 100),
              learning_rate_init=0.0001, max_iter=500)> 



Multi Layer Perceptron with two dense layers of 100 units, L2 (ridge) regularization with alpha = 0.1, an initial learning rate of 0.0001, and 32 tags (and 2000 posts) gives

- F1 score of 77%
- precision of 88%
- recall of 69%

This is better than our KNN model:
- F1 score of 61%
- precision of 78%
- recall of 49%
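
The model selection announced in the introduction comes down to comparing the test-set F1 scores of the three models. A small sketch using the figures reported by `model_score` for 32 tags (the dictionary layout is just one convenient way to hold them):

```python
# Test-set scores printed by model_score for the 32-tag models.
test_scores = {
    "KNN":          {"f1": 0.61, "precision": 0.78, "recall": 0.49},
    "RandomForest": {"f1": 0.57, "precision": 0.73, "recall": 0.47},
    "MLP":          {"f1": 0.77, "precision": 0.88, "recall": 0.69},
}

# Pick the model with the highest F1 score.
best_model_name = max(test_scores, key=lambda name: test_scores[name]["f1"])
ranking = sorted(test_scores, key=lambda name: test_scores[name]["f1"], reverse=True)

print(best_model_name)
print(ranking)
```

On these figures the Multi Layer Perceptron leads on all three metrics, not only on F1.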

In [None]:
%%time
# 2d 7h 36min 50s for 20,000 posts

from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import RandomizedSearchCV

param_grid = {'alpha' : [0.001,0.01,0.1],
              'learning_rate_init' : [0.0001,0.001,0.01],
             'hidden_layer_sizes': [(30,),(100,),(30,30),(100,100)]}
grid = RandomizedSearchCV(MLPClassifier(max_iter=500), param_grid, cv=5, scoring='f1_micro')

best_score = 0
best_param = {}
best_feature_number = 0

for i in [2,32]:
    vect_Y = CountVectorizer(binary=True, token_pattern= rx, max_features=i).fit(tag[2000:])
    Y_train = vect_Y.transform(tag[2000:]).toarray()
    grid.fit(X_train, Y_train)
    #if grid.best_score_>best_score:
    best_score = grid.best_score_
    best_param = grid.best_params_
    best_feature_number = i
    print("Nombre de features: ", best_feature_number)
    print(vect_Y.get_feature_names())
    print("Best cross-validation f1-score: {:.2f}".format(best_score))
    print("Best parameters: ", best_param,'\n')
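
The search above is randomized, so it does not try every combination in `param_grid`. The sketch below counts the full grid with the standard library and compares it to `RandomizedSearchCV`'s default `n_iter=10`, which the cell does not override. The fit count is an estimate that ignores the final refit on the full training set.

```python
from itertools import product

# Same search space as the cell above.
param_grid = {
    'alpha': [0.001, 0.01, 0.1],
    'learning_rate_init': [0.0001, 0.001, 0.01],
    'hidden_layer_sizes': [(30,), (100,), (30, 30), (100, 100)],
}

# Every possible combination of the three hyperparameters.
all_combinations = list(product(*param_grid.values()))

n_iter = 10  # RandomizedSearchCV default, not overridden in the cell above
cv = 5       # number of folds used in the cell above

print("Combinations in the full grid:", len(all_combinations))
print("Combinations sampled per search:", n_iter)
print("Model fits per search (excluding refit):", n_iter * cv)
```

Because fewer than a third of the combinations are sampled, the reported best parameters can vary between runs unless `random_state` is set on the search.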

Results with 2,000 posts

Nombre de features:  2
['python', 'r']
Best cross-validation f1-score: 0.91
Best parameters:  {'learning_rate_init': 0.01, 'hidden_layer_sizes': (30,), 'alpha': 0.001} 

Nombre de features:  4
['django', 'list', 'python', 'r']
Best cross-validation f1-score: 0.89
Best parameters:  {'learning_rate_init': 0.001, 'hidden_layer_sizes': (100, 100), 'alpha': 0.1} 

Nombre de features:  8
['2', 'django', 'ggplot2', 'list', 'numpy', 'python', 'r', 'string']
Best cross-validation f1-score: 0.85
Best parameters:  {'learning_rate_init': 0.01, 'hidden_layer_sizes': (100, 100), 'alpha': 0.001} 

Nombre de features:  16
['2', '3', 'data', 'dataframe', 'dictionary', 'django', 'ggplot2', 'list', 'matplotlib', 'numpy', 'pandas', 'python', 'r', 'regex', 'string', 'x']
Best cross-validation f1-score: 0.79
Best parameters:  {'learning_rate_init': 0.01, 'hidden_layer_sizes': (100, 100), 'alpha': 0.1} 

Nombre de features:  32
['2', '3', '7', 'c', 'data', 'dataframe', 'datetime', 'dictionary', 'django', 'dplyr', 'exception', 'faq', 'file', 'ggplot2', 'import', 'list', 'markdown', 'matplotlib', 'matrix', 'numpy', 'pandas', 'plot', 'python', 'r', 'regex', 'scipy', 'sqlalchemy', 'string', 'table', 'testing', 'unit', 'x']
Best cross-validation f1-score: 0.74
Best parameters:  {'learning_rate_init': 0.001, 'hidden_layer_sizes': (100,), 'alpha': 0.1} 

CPU times: user 4h 27min 25s, sys: 11h 42min 22s, total: 16h 9min 47s
Wall time: 1h 24min 31s

In [None]:
# Context managers make sure each file is closed once the dump is done
with open('API/models/vect_Y_32.pickle', 'wb') as f:
    pickle.dump(vect_Y_32, f)
with open('API/models/MLP_clf.pickle', 'wb') as f:
    pickle.dump(MLP_clf, f)

In [None]:

import pickle

# Export the best estimator found by the randomized search
clf = grid.best_estimator_

with open('models/final_prediction.pickle', 'wb') as f:
    pickle.dump(clf, f)
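
To check that an exported file can be read back, for instance by the API, a round trip through `pickle.load` is enough. This is a minimal sketch: a plain dictionary stands in for the fitted classifier and a temporary directory stands in for `models/`, both assumptions made so that the example runs on its own.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model; the real object would be grid.best_estimator_.
fake_model = {'hidden_layer_sizes': (100, 100), 'alpha': 0.1}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'final_prediction.pickle')

    # Export: the context manager closes the file even if dump() fails.
    with open(path, 'wb') as f:
        pickle.dump(fake_model, f)

    # Import: this is what the consuming code would do at start-up.
    with open(path, 'rb') as f:
        restored = pickle.load(f)

print(restored == fake_model)
```

With real scikit-learn estimators, the environment that loads the file should use the same scikit-learn version as the one used for training.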

In [None]:
from sklearn.metrics import f1_score

id_sample = 1002
new_post = text_train[id_sample]
text_prediction_labels(new_post, vect_X, vect_Y, MLP_clf, df_questions)

# Predictions on the training set, so this score is optimistic
y_train_MLP_pred = MLP_clf.predict(X_train)

print('F1 score : ')
# Binarize the predictions before scoring
y_train_MLP_pred_ones = (y_train_MLP_pred > 0).astype(int)
f1_score(Y_train, y_train_MLP_pred_ones, average='weighted')
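
The search earlier in the notebook optimizes `f1_micro`, while the cell above reports a weighted F1, so the two numbers are not directly comparable. The sketch below computes both averages by hand on a small made-up indicator matrix (rows are posts, columns are tags). The data and helper names are illustrative only.

```python
def per_tag_counts(y_true, y_pred):
    """Return (tp, fp, fn) for each column of two 0/1 matrices."""
    n_tags = len(y_true[0])
    counts = []
    for j in range(n_tags):
        tp = sum(t[j] == 1 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fp = sum(t[j] == 0 and p[j] == 1 for t, p in zip(y_true, y_pred))
        fn = sum(t[j] == 1 and p[j] == 0 for t, p in zip(y_true, y_pred))
        counts.append((tp, fp, fn))
    return counts


def f1(tp, fp, fn):
    """F1 from raw counts; 0.0 when the tag is never true nor predicted."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy data: 4 posts, 3 tags (say 'python', 'r', 'pandas').
y_true = [[1, 0, 1], [1, 0, 0], [0, 1, 0], [1, 0, 0]]
y_pred = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0]]

counts = per_tag_counts(y_true, y_pred)

# Micro: pool the counts over all tags, then compute a single F1.
micro = f1(*[sum(c) for c in zip(*counts)])

# Weighted: per-tag F1, averaged with weights equal to each tag's support.
supports = [tp + fn for tp, _, fn in counts]
weighted = sum(f1(*c) * s for c, s in zip(counts, supports)) / sum(supports)

print(f"micro F1:    {micro:.2f}")
print(f"weighted F1: {weighted:.2f}")
```

Here the micro average is 0.80 and the weighted average is about 0.73: the rare tag that is never predicted pulls the weighted score down more than the pooled one.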

### For 20,000 posts
Cross-validation with 5 folds (cv=5)
Search settings: MLPClassifier(max_iter=500), param_grid, cv=5, scoring='f1_micro'

Best cross-validation score: 0.87
Best parameters:  {'learning_rate_init': 0.001, 'alpha': 0.1}
Best nombre de features:  2
['python', 'r']

Best cross-validation score: 0.88
Best parameters:  {'learning_rate_init': 0.0001, 'alpha': 0.1}
Best nombre de features:  4
['django', 'ggplot2', 'python', 'r']

Best cross-validation score: 0.92
Best parameters:  {'learning_rate_init': 0.0001, 'alpha': 0.1}
Best nombre de features:  8
['dataframe', 'django', 'ggplot2', 'list', 'numpy', 'python', 'r', 'string']

Best cross-validation score: 0.87
Best parameters:  {'learning_rate_init': 0.0001, 'alpha': 0.1}
Best nombre de features:  16
['2', '3', 'data', 'dataframe', 'dictionary', 'django', 'ggplot2', 'list', 'matplotlib', 'numpy', 'pandas', 'python', 'r', 'regex', 'string', 'x']

Best cross-validation score: 0.91
Best parameters:  {'learning_rate_init': 0.0001, 'alpha': 0.1}
Best nombre de features:  32
['2', '3', '7', 'class', 'data', 'dataframe', 'datetime', 'dictionary', 'django', 'dplyr', 'faq', 'file', 'flask', 'function', 'ggplot2', 'import', 'list', 'matplotlib', 'models', 'numpy', 'pandas', 'performance', 'plot', 'python', 'r', 'regex', 'scipy', 'sqlalchemy', 'string', 'table', 'testing', 'x']

CPU times: user 13d 12h 11min 43s, sys: 2h 41min 2s, total: 13d 14h 52min 45s
Wall time: 2d 7h 36min 50s