In this notebook we analyze the dataset of Python and R posts in order to train
supervised models.

We start with a KNN model, then a RandomForest and a Multi-Layer Perceptron, to classify
posts as Python or R. Each post can carry several tags, which we handle with the
One-vs-All strategy.

Once the models are trained, we compare their F1 scores and choose the best one for
use in our API.
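
To make the One-vs-All idea concrete, here is a minimal sketch in plain Python. The three tag strings are made up for illustration and are not taken from the dataset. Every distinct tag becomes its own binary target, so one yes/no problem is solved per tag.

```python
# Hypothetical tag strings, one per post (illustrative, not taken from the dataset)
posts_tags = ["python numpy", "r ggplot2", "python django"]

# Vocabulary of distinct tags, sorted so that the column order is deterministic
vocabulary = sorted({t for tags in posts_tags for t in tags.split()})

# Multi-label indicator matrix: Y[i][j] == 1 when post i carries tag j
Y = [[1 if t in tags.split() else 0 for t in vocabulary] for tags in posts_tags]

# One-vs-All view: one binary target vector per tag
one_vs_all = {t: [row[j] for row in Y] for j, t in enumerate(vocabulary)}

print(vocabulary)
print(one_vs_all["python"])
```

Each entry of `one_vs_all` is the target of a separate binary classifier; the multi-label prediction for a post is the set of tags whose classifier answers 1.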

# Importing libraries

In [1]:
import pandas as pd
import numpy as np
from IPython.display import display
import pickle # to export the trained models



pd.set_option('display.max_colwidth', None)

# Importing the data

In [2]:
dtypes_questions = {'Id':'int32', 'Score': 'int16', 'Title': 'str',
                    'Body': 'str', 'Title_raw': 'str', 'Text': 'str',
                    'Tags': 'str'}

nrows = 4000

df_questions = pd.read_csv('df_questions_fullclean.csv',
                           usecols=dtypes_questions.keys(),
                           encoding = "utf-8",
                           dtype=dtypes_questions,
                           nrows=nrows
                          )

print(len(df_questions))
display(df_questions.head(5))


4000


Unnamed: 0,Id,Title,Body,Tags,Score,Title_raw,Text
0,3366506,pvalue aov,look pvalue generate r run test aovasq asq yield df sum sq f value prf asq residual signif code observation delete look structure see usually work list get need time one google search also seem reveal simpler structure get note asq data frame list coefficient name num name chr asq residual name num name effect name num name chr intercept asq int fittedvalues name num name chr int list qr num dimnameslist chr chr intercept assign int qraux num pivot int tol num e rank int attr class chr qr int naaction class omit name int attr name chr xlevels list language aovformula asq term class term formula length asq asq attr variable language listasq factor int dimnameslist chr asq asq chr asq attr termlabels chr order int attr intercept int attr response int attr environmentenvironment rglobalenv attr predvars language listasq attr dataclasses name attr name chr model dataframe variable asq int asq int attr termsclasses term length asq asq attr variable language listasq factor int dimnameslist chr asq asq chr asq attr termlabels chr order int attr intercept int attr response int attr environmentenvironment rglobalenv attr predvars language listasq attr dataclasses name attr name chr asq attr naactionclass omit name int attr name chr class chr aov lm,r anova,65,Extract p-value from aov,pvalue aov look pvalue generate r run test aovasq asq yield df sum sq f value prf asq residual signif code observation delete look structure see usually work list get need time one google search also seem reveal simpler structure get note asq data frame list coefficient name num name chr asq residual name num name effect name num name chr intercept asq int fittedvalues name num name chr int list qr num dimnameslist chr chr intercept assign int qraux num pivot int tol num e rank int attr class chr qr int naaction class omit name int attr name chr xlevels list language aovformula asq term class term formula length asq asq attr variable language listasq factor int dimnameslist 
chr asq asq chr asq attr termlabels chr order int attr intercept int attr response int attr environmentenvironment rglobalenv attr predvars language listasq attr dataclasses name attr name chr model dataframe variable asq int asq int attr termsclasses term length asq asq attr variable language listasq factor int dimnameslist chr asq asq chr asq attr termlabels chr order int attr intercept int attr response int attr environmentenvironment rglobalenv attr predvars language listasq attr dataclasses name attr name chr asq attr naactionclass omit name int attr name chr class chr aov lm
1,3995546,define constant class self really need,define set constant class like class fooobject nonexistingvagueconfirmed initself selfstatus vague however get name vague define way define constant class without resort selfnonexisting etc,python constants visibility,29,"Defining constants in python class, is self really needed?",define constant class self really need define set constant class like class fooobject nonexistingvagueconfirmed initself selfstatus vague however get name vague define way define constant class without resort selfnonexisting etc
2,195534,production apache modwsgi nginx modwsgi,use medium python wsgi application apache modwsgi modwsgi combination need memory cpu time one faster know also think use cherrypys wsgi server hear highload application know note use python web framework write thing scratch note suggestion also welcome,python apache nginx mod-wsgi,68,"In production, Apache + mod_wsgi or Nginx + mod_wsgi?",production apache modwsgi nginx modwsgi use medium python wsgi application apache modwsgi modwsgi combination need memory cpu time one faster know also think use cherrypys wsgi server hear highload application know note use python web framework write thing scratch note suggestion also welcome
3,14237018,method consistency warn building r package roxygen,create roxygen file function use class roxygenize build check get warn check consistency warn functionwordlist overlap equalor see section function method write r extension spend time study httpcranrprojectorgdocmanualsrextshtmlgenericfunctionsandmethods figure do file function work expect warn occur make go away find word group find word group variable eg people param list name chacter vector param minimumexact amount overlap param equalor character vector codegreater codemore codeless dot liu user may input number character vector rdname return return dataframe word match criterion set codeoverlap codeequalor export example ca cat dog b ccorn chicken chouse feed chicken commona b overlap commona b overlap lista b commonr commonr functionwordlist return codenull rdname method list list functionwordlist overlap equalor ifoverlapall lengthwordlist else overlap lis df asdataframetableunlistlis stringsasfactors namesdf cword freq df dforderdffreq dfword df switchequalor dfdffreq dfdffreq dfdffreq rownamesdf nrowdf returndf return codenull rdname method default default commondefault function overlap equalor list returncommonlistlis overlap equalor,r r-s3,32,S3 method consistency warning when building R package with Roxygen,method consistency warn building r package roxygen create roxygen file function use class roxygenize build check get warn check consistency warn functionwordlist overlap equalor see section function method write r extension spend time study httpcranrprojectorgdocmanualsrextshtmlgenericfunctionsandmethods figure do file function work expect warn occur make go away find word group find word group variable eg people param list name chacter vector param minimumexact amount overlap param equalor character vector codegreater codemore codeless dot liu user may input number character vector rdname return return dataframe word match criterion set codeoverlap codeequalor export example ca cat dog b 
ccorn chicken chouse feed chicken commona b overlap commona b overlap lista b commonr commonr functionwordlist return codenull rdname method list list functionwordlist overlap equalor ifoverlapall lengthwordlist else overlap lis df asdataframetableunlistlis stringsasfactors namesdf cword freq df dforderdffreq dfword df switchequalor dfdffreq dfdffreq dfdffreq rownamesdf nrowdf returndf return codenull rdname method default default commondefault function overlap equalor list returncommonlistlis overlap equalor
4,11748384,format date x axis ggplot,time get look correct graphs data generate via dput structure label class factor avgvisits name cmonthavgvisits rownames cna class dataframe chart try graph ggplotdf month avgvisits geombar month visit per chart work want adjust format date believe add scalexdatelabels dateformatmy try make date label mmmyyyy month avgvisits geombar month visit per user scalexdatelabels dateformatmy plot continue get statbin binwidth default range use binwidth x adjust despite hour research format geombar fix anyone explain edit followup think use date factor use asdate date column,r ggplot2,61,Formatting dates on X axis in ggplot2,format date x axis ggplot time get look correct graphs data generate via dput structure label class factor avgvisits name cmonthavgvisits rownames cna class dataframe chart try graph ggplotdf month avgvisits geombar month visit per chart work want adjust format date believe add scalexdatelabels dateformatmy try make date label mmmyyyy month avgvisits geombar month visit per user scalexdatelabels dateformatmy plot continue get statbin binwidth default range use binwidth x adjust despite hour research format geombar fix anyone explain edit followup think use date factor use asdate date column


In [3]:
# text and tags used to build the Python and R labels
text, tag = df_questions.Text, df_questions.Tags
print("type of text_train: {}".format(type(text)))
print("length of text_train: {}".format(len(text)))
print("text_train[6]:\n{}".format(text[6]))

type of text_train: <class 'pandas.core.series.Series'>
length of text_train: 4000
text_train[6]:
generate permutation repetition know itertools seem permutation without repetition example would like generate dice roll dice need permutation size include repetition etc want scratch


In [4]:
type(text)

pandas.core.series.Series

In [5]:
type(tag)

pandas.core.series.Series

In [6]:
# add the files for testing

### Representing the text data as a bag-of-words

In [7]:
from sklearn.feature_extraction.text import CountVectorizer

vect_X = CountVectorizer().fit(text)
X = vect_X.transform(text)

# The posts were shuffled during preprocessing: we use the first 1000 posts
# as the validation set, the next 1000 as the test set, and the rest of the
# dataset to train the model

X_val = X[:1000]
X_test = X[1000:2000]
X_train = X[2000:]

# save the fitted X vectorizer
with open('API/models/vect_X.pickle', 'wb') as f:
    pickle.dump(vect_X, f)

print("X_train:\n{}".format(repr(X_train)))

X_train:
<2000x26268 sparse matrix of type '<class 'numpy.int64'>'
	with 67490 stored elements in Compressed Sparse Row format>
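
As an illustration of what the vectorizer computes, here is a minimal bag-of-words sketch in plain Python. It assumes the default scikit-learn tokenization (lowercasing, tokens of at least two word characters); the two documents and the helper name `bag_of_words` are hypothetical.

```python
import re
from collections import Counter

# Default CountVectorizer token pattern: tokens of two or more word characters
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def bag_of_words(docs):
    """Return (vocabulary, count matrix) for a list of documents."""
    tokenized = [re.findall(TOKEN_PATTERN, d.lower()) for d in docs]
    # Alphabetically sorted vocabulary, one column per distinct token
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    counts = [Counter(toks) for toks in tokenized]
    # Counter returns 0 for tokens missing from a document
    matrix = [[c[tok] for tok in vocabulary] for c in counts]
    return vocabulary, matrix

docs = ["roll window array numpy", "format date axis ggplot date"]  # illustrative
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)
print(matrix)
```

Most entries of such a matrix are zero, which is why `X_train` is stored in sparse format: 67490 stored elements out of 2000 x 26268 cells is about 0.13% of the matrix.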


In [8]:
# note: get_feature_names() was removed in scikit-learn 1.2 (use get_feature_names_out() there)
feature_names_X = vect_X.get_feature_names()
print("Number of features: {}".format(len(feature_names_X)))
print("First 20 features:\n{}".format(feature_names_X[:20]))
print("Features 20010 to 20030:\n{}".format(feature_names_X[20010:20030]))
print("Every 2000th feature:\n{}".format(feature_names_X[::2000]))

Number of features: 26268
First 20 features:
['aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaaa', 'aaaaaaaaa', 'aaaaaaaaaa', 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', 'aaaaaaaaaaaaaaaaaadwpwaaaaaaaabaaaaaaaaaceaaaaaaaaaqqaaaaaaaabraaaaaaaaageaaaaaaaaacqaaaaaaaacbaaaaaaaaaikaaaaaaaaakqaaaaaaaaczaaaaaaaaakeaaaaaaaaaqqaaaaaaaacxaaaaaaaaalkaaaaaaaaawqaaaaaaaadfaaaaaaaaamkaaaaaaaaazqaaaaaaaadraaaaaaaaanuaaaaaaaaaqaaaaaaaaddaaaaaaaaaoea', 'aaabbb', 'aaabcabccd', 'aaah', 'aab', 'aabbccccdd', 'aabsiddfdfdatatg', 'aac', 'aacute', 'aadjacencylist', 'aaf', 'aafaf']
Features 20010 to 20030:
['scaleyreverseexpandc', 'scan', 'scanfile', 'scaninfo', 'scanmyfileseptlisturlpopularitymintimemaxtime', 'scanstringmatchstring', 'scant', 'scatter', 'scatterplot', 'scatterplotdx', 'scatterplots', 'scatterxyczsdxmarkers', 'scd', 'scelerycamerr', 'scelerycamnohup', 'scenario', 'scene', 'sched', 'schedule', 'scheduler']
Every 2000th feature:
['aa', 'bone', 'cpnaturallanguagemyfiletxt', 'dogma', 'four', 'httpmyser

In [9]:
vect_Y = CountVectorizer(binary=True,
                         max_features=None,
                         token_pattern="(?u)\\b\\w+\\b").fit(tag)
# binary=True because counting the same token more than once is pointless
# we will limit the number of features, because with all tags the F1 score of the KNN model is 0.013
# with 10 features we reach an F1 score of 37.8%
# a custom token_pattern is used so that the single-character tag r is kept
Y = vect_Y.transform(tag).toarray()

Y_val = Y[:1000]
Y_test = Y[1000:2000]
Y_train = Y[2000:]

# save the fitted Y vectorizer
with open('API/models/vect_Y.pickle', 'wb') as f:
    pickle.dump(vect_Y, f)

print("Y_train:\n{}".format(repr(Y_train)))

Y_train:
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
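
The effect of the custom `token_pattern` can be checked with the `re` module alone. The sketch below compares the scikit-learn default pattern with the one passed to `vect_Y`; the tag string is illustrative.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"  # scikit-learn default: at least two word characters
CUSTOM_PATTERN = r"(?u)\b\w+\b"     # pattern used for vect_Y: one character is enough

tags = "r ggplot2 python-3.x"  # illustrative tag string

default_tokens = re.findall(DEFAULT_PATTERN, tags)
custom_tokens = re.findall(CUSTOM_PATTERN, tags)

print(default_tokens)  # the tag r is lost
print(custom_tokens)   # r is kept, but python-3.x is split into python, 3 and x
```

The custom pattern keeps `r`, at the cost of splitting tags on hyphens and dots, which would explain single-character features such as `3` and `x` in the tag vocabulary.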


In [10]:
feature_names_Y = vect_Y.get_feature_names()
print("Number of features: {}".format(len(feature_names_Y)))
print("First 20 features:\n{}".format(feature_names_Y[:20]))
print("Features 800 to 820:\n{}".format(feature_names_Y[800:820]))
print("Every 200th feature:\n{}".format(feature_names_Y[::200]))

Number of features: 1733
First 20 features:
['04', '11', '12', '14', '16', '2', '2003', '2008', '3', '32bit', '3d', '4', '403', '5', '6', '64', '64bit', '7', '8', 'abc']
Features 800 to 820:
['legend', 'lemmatization', 'length', 'leopard', 'less', 'levenshtein', 'lexical', 'libcurl', 'libjpeg', 'libraries', 'library', 'libusb', 'libxml2', 'licensing', 'like', 'limits', 'line', 'linear', 'linguistics', 'linker']
Every 200th feature:
['04', 'click', 'docstring', 'git', 'legend', 'octave', 'qt4', 'single', 'typed']


In [11]:
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, Y_train)

with open('API/models/knn_clf.pickle', 'wb') as f:
    pickle.dump(knn_clf, f)
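
For intuition about what a multi-label KNN prediction does, here is a simplified sketch in plain Python: Euclidean distance, uniform weights, and a per-tag majority vote among the k nearest training posts. It is not the scikit-learn implementation, and the toy vectors and the function name `knn_predict` are hypothetical.

```python
import math

def knn_predict(X_train, Y_train, x, k=3):
    """Predict a 0/1 vector of tags for x by majority vote of its k nearest neighbours."""
    # Sort training rows by distance to x (ties broken by row index)
    dists = sorted((math.dist(row, x), i) for i, row in enumerate(X_train))
    neighbours = [i for _, i in dists[:k]]
    n_labels = len(Y_train[0])
    # A tag is predicted only if more than half of the neighbours carry it
    return [
        1 if 2 * sum(Y_train[i][j] for i in neighbours) > k else 0
        for j in range(n_labels)
    ]

X_toy = [[2, 0], [1, 0], [0, 2], [0, 1], [1, 1]]  # toy word counts
Y_toy = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 1]]  # toy tag indicators
prediction = knn_predict(X_toy, Y_toy, [2, 1], k=3)
print(prediction)
```

Because a tag needs a majority among the neighbours, rare tags are seldom predicted, which is one plausible reason why scores drop as more tags are added.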

In [12]:
def text_prediction_labels(new_post, vect_X, vect_Y, model, df_questions):
    # note: id_sample is read from the notebook's global scope
    feature_names_Y = vect_Y.get_feature_names() # list of tags
    Y_train = vect_Y.transform(df_questions.Tags) # tag indicator matrix, one row per post
    new_post_vect = vect_X.transform([new_post]) # vectorize the new post for the model
    y_predict = model.predict(new_post_vect) # prediction of the trained model

    tags = np.argsort(y_predict[0,:])[::-1][:10].tolist()
    scores = np.sort(y_predict[0,:])[::-1][:10]

    print(df_questions.Title_raw[id_sample],'\n')
    print(df_questions.Body[id_sample],'\n')
    print(df_questions.Text[id_sample])
    print('\n','Tags prediction : ', '\n')
    for tag,score in zip(tags,scores) :
        if score > 0  :
            print(feature_names_Y[tag],score)
    print('\n','Tags labels : ','\n')
    y_labels = Y_train[id_sample].toarray()
    tags = np.argsort(y_labels[0,:])[::-1][:10].tolist()
    scores = np.sort(y_labels[0,:])[::-1][:10]
    for tag,score in zip(tags,scores) :
        if score > 0  :
            print(feature_names_Y[tag],score)
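
The tag selection in this function (sort by score, keep the ten best, drop zeros) can be written without NumPy. The following is a small sketch with made-up names and scores; the helper `top_tags` is hypothetical, and the order of tied scores may differ from the `np.argsort(...)[::-1]` version.

```python
def top_tags(scores, names, k=10):
    """Return up to k (tag, score) pairs with a positive score, best first."""
    # Indices sorted by decreasing score; Python's sort keeps ties in their original order
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
    return [(names[j], scores[j]) for j in order if scores[j] > 0]

names = ["django", "numpy", "python", "r"]  # illustrative tag vocabulary
prediction = [0, 1, 1, 0]                   # illustrative 0/1 model output
selected = top_tags(prediction, names)
print(selected)
```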

In [13]:
id_sample = 35
new_post = text[id_sample]
text_prediction_labels(new_post,vect_X,vect_Y,knn_clf,df_questions)


Rolling window for 1D arrays in Numpy? 

way efficiently roll window array example pure python code snippet calculate roll deviation list observation list value n standard deviation stdev data enumerateobservationsn observationsiin sumstrip stripn way completely within ie without python loop deviation numpystd roll part completely stump find blog post regard roll window numpy seem array 

roll window array numpy way efficiently roll window array example pure python code snippet calculate roll deviation list observation list value n standard deviation stdev data enumerateobservationsn observationsiin sumstrip stripn way completely within ie without python loop deviation numpystd roll part completely stump find blog post regard roll window numpy seem array

 Tags prediction :  

python 1

 Tags labels :  

python 1
3 1
numpy 1
window 1
x 1


In [14]:
%%time
# 50 seconds to process 2,000 rows
# 15min 57s for 20,000 rows

from sklearn.model_selection import cross_val_predict

y_train_knn_pred = cross_val_predict(knn_clf, X_train, Y_train, cv=3)


CPU times: user 1min 21s, sys: 79.1 ms, total: 1min 21s
Wall time: 1min 21s
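
With `cv=3`, every training post is predicted exactly once by a model that was fitted on the other two folds. The sketch below builds such folds in plain Python, assuming contiguous, unshuffled K-fold splitting; the helper name `kfold_indices` is hypothetical.

```python
def kfold_indices(n_samples, n_splits):
    """Return a list of (train_indices, test_indices), one pair per fold."""
    base, extra = divmod(n_samples, n_splits)
    folds, start = [], 0
    for i in range(n_splits):
        # The first `extra` folds get one additional sample
        size = base + (1 if i < extra else 0)
        test = list(range(start, start + size))
        train = [j for j in range(n_samples) if j < start or j >= start + size]
        folds.append((train, test))
        start += size
    return folds

folds = kfold_indices(10, 3)
for train, test in folds:
    print(len(train), test)
```

Under the same assumption, the 2000 training posts would be split into folds of 667, 667 and 666 posts.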


In [24]:
from sklearn.metrics import f1_score

y_train_knn_pred_ones = (y_train_knn_pred >0.5).astype(int)
f1_score(Y_train, y_train_knn_pred_ones, average="micro")


In [26]:
y_train_knn_pred_ones.shape

(2000, 1733)
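
The micro-averaged F1 used above pools true positives, false positives and false negatives over all posts and all tags before computing precision and recall. A minimal plain-Python sketch, with toy matrices and a hypothetical helper name `micro_f1`:

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for two 0/1 matrices of identical shape."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    if tp == 0:
        return 0.0  # no correct positive prediction at all
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]  # toy labels
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 1]]  # toy predictions
score = micro_f1(y_true, y_pred)
print(round(score, 3))
```

With 1733 mostly empty tag columns, the many missed rare tags count as false negatives, which pulls this score down.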

## Building a grid search to choose the best parameter for this model

In [16]:
%%time
# 6min 42s for 20,000 posts

from sklearn.model_selection import GridSearchCV
param_grid = {'n_neighbors' : [6, 8,32,64,128,256,512]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='f1_micro')

best_score = 0
best_param = {}
best_feature_number = 0

for i in [2,4,6,8,16,32]:
    vect_Y = CountVectorizer(binary=True, token_pattern= "(?u)\\b\\w+\\b", max_features=i).fit(tag[2000:])
    Y_train = vect_Y.transform(tag[2000:]).toarray()
    grid.fit(X_train, Y_train)
    best_score = grid.best_score_
    best_param = grid.best_params_
    best_feature_number = i
    print("Number of features: ", best_feature_number)
    print(vect_Y.get_feature_names())
    print("Best cross-validation f1-score: {:.2f}".format(best_score))
    print("Best parameters: ", best_param,'\n')

['python', 'r'] 

Best cross-validation score: 0.78
Best parameters:  {'n_neighbors': 6}
Best number of features:  2
['django', 'list', 'python', 'r'] 

Best cross-validation score: 0.75
Best parameters:  {'n_neighbors': 6}
Best number of features:  4
['django', 'ggplot2', 'list', 'numpy', 'python', 'r'] 

Best cross-validation score: 0.72
Best parameters:  {'n_neighbors': 6}
Best number of features:  6
['2', 'django', 'ggplot2', 'list', 'numpy', 'python', 'r', 'string'] 

Best cross-validation score: 0.71
Best parameters:  {'n_neighbors': 6}
Best number of features:  8
['2', '3', 'data', 'dataframe', 'dictionary', 'django', 'ggplot2', 'list', 'matplotlib', 'numpy', 'pandas', 'python', 'r', 'regex', 'string', 'x'] 

Best cross-validation score: 0.65
Best parameters:  {'n_neighbors': 6}
Best number of features:  16
['2', '3', '7', 'c', 'data', 'dataframe', 'datetime', 'dictionary', 'django', 'dplyr', 'exception', 'faq', 'file', 'ggplot2', 'import', 'list', 'markdown', 'matplotlib', 'mat

In [17]:
Y_train.shape

(2000, 32)
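
The loop above overwrites `best_score` on every iteration, so after it finishes the variables describe the last tag count, not the best one. If the goal is to keep the best configuration, a comparison is needed; here is a minimal sketch using the scores printed in the output above.

```python
# (number of tags, best cross-validation score) pairs copied from the output above
results = [(2, 0.78), (4, 0.75), (6, 0.72), (8, 0.71), (16, 0.65)]

best_score, best_feature_number = 0.0, None
for n_tags, score in results:
    # Only update when the new configuration is strictly better
    if score > best_score:
        best_score, best_feature_number = score, n_tags

print(best_feature_number, best_score)
```

Scores obtained with different numbers of tags measure different tasks, so this comparison favours small tag sets by construction; it is not a fair model comparison on its own.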

In [20]:
KNNmodel = grid.best_estimator_

id_sample = 105
new_post = text[id_sample]
text_prediction_labels(new_post,vect_X,vect_Y,KNNmodel,df_questions)

What you can do with a data.frame that you can't with a data.table? 

start use r come across find question ignore dataframe use confusion two package 

dataframe start use r come across find question ignore dataframe use confusion two package

 Tags prediction :  

python 1

 Tags labels :  

r 1
table 1
data 1
dataframe 1


### for 20,000 posts
Cross-validation, CV=5
GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='f1_micro')

Best cross-validation score: 0.78
Best parameters:  {'n_neighbors': 255}
Best number of features:  2
['python', 'r']

Best cross-validation score: 0.75
Best parameters:  {'n_neighbors': 255}
Best number of features:  4
['django', 'ggplot2', 'python', 'r']

Best cross-validation score: 0.71
Best parameters:  {'n_neighbors': 64}
Best number of features:  6
['django', 'ggplot2', 'list', 'numpy', 'python', 'r']

Best cross-validation score: 0.66
Best parameters:  {'n_neighbors': 32}
Best number of features:  8
['dataframe', 'django', 'ggplot2', 'list', 'numpy', 'python', 'r', 'string']

Best cross-validation score: 0.57
Best parameters:  {'n_neighbors': 32}
Best number of features:  16
['2', '3', 'data', 'dataframe', 'dictionary', 'django', 'ggplot2', 'list', 'matplotlib', 'numpy', 'pandas', 'python', 'r', 'regex', 'string', 'x']

Best cross-validation score: 0.50
Best parameters:  {'n_neighbors': 8}
Best number of features:  32
['2', '3', '7', 'class', 'data', 'dataframe', 'datetime', 'dictionary', 'django', 'dplyr', 'faq', 'file', 'flask', 'function', 'ggplot2', 'import', 'list', 'matplotlib', 'models', 'numpy', 'pandas', 'performance', 'plot', 'python', 'r', 'regex', 'scipy', 'sqlalchemy', 'string', 'table', 'testing', 'x']

CPU times: user 6min 17s, sys: 28.4 s, total: 6min 45s
Wall time: 6min 45s

## Random Forest

In [21]:
%%time
# 4h 32min 57s for 20,000 posts

from sklearn.ensemble import RandomForestClassifier

param_grid = {'n_estimators' : [100,500,1000],
              'max_depth' : [2,4,8]
              }
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring='f1_micro')

best_score = 0
best_param = {}
best_feature_number = 0

for i in [2,4,8,16,32]:
    vect_Y = CountVectorizer(binary=True, token_pattern= "(?u)\\b\\w+\\b", max_features=i).fit(tag[2000:])
    Y_train = vect_Y.transform(tag[2000:]).toarray()
    grid.fit(X_train, Y_train)
    best_score = grid.best_score_
    best_param = grid.best_params_
    best_feature_number = i
    print("Number of features: ", best_feature_number)
    print(vect_Y.get_feature_names())
    print("Best cross-validation f1-score: {:.2f}".format(best_score))
    print("Best parameters: ", best_param,'\n')

Number of features:  2
['python', 'r']
Best cross-validation f1-score: 0.74
Best parameters:  {'max_depth': 8, 'n_estimators': 100} 

Number of features:  4
['django', 'list', 'python', 'r']
Best cross-validation f1-score: 0.70
Best parameters:  {'max_depth': 8, 'n_estimators': 100} 

Number of features:  8
['2', 'django', 'ggplot2', 'list', 'numpy', 'python', 'r', 'string']
Best cross-validation f1-score: 0.66
Best parameters:  {'max_depth': 8, 'n_estimators': 100} 

Number of features:  16
['2', '3', 'data', 'dataframe', 'dictionary', 'django', 'ggplot2', 'list', 'matplotlib', 'numpy', 'pandas', 'python', 'r', 'regex', 'string', 'x']
Best cross-validation f1-score: 0.61
Best parameters:  {'max_depth': 8, 'n_estimators': 500} 

Number of features:  32
['2', '3', '7', 'c', 'data', 'dataframe', 'datetime', 'dictionary', 'django', 'dplyr', 'exception', 'faq', 'file', 'ggplot2', 'import', 'list', 'markdown', 'matplotlib', 'matrix', 'numpy', 'pandas', 'plot', 'python', 'r', 'regex', 'scipy

In [None]:
RandomForestmodel = grid.best_estimator_

id_sample = 10
new_post = text[id_sample]
text_prediction_labels(new_post,vect_X,vect_Y,RandomForestmodel,df_questions)
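
The long running time of the Random Forest search is easier to understand once the fits are counted. The sketch below does that count for the grid defined above, assuming the `GridSearchCV` default `refit=True` (one extra fit per search on the whole training set).

```python
from itertools import product

param_grid = {'n_estimators': [100, 500, 1000],
              'max_depth': [2, 4, 8]}
cv = 5
tag_counts = [2, 4, 8, 16, 32]  # values of max_features tried in the loop

# Every combination of hyperparameters is evaluated on every fold
combinations = list(product(*param_grid.values()))
fits_per_search = len(combinations) * cv + 1  # +1 for the final refit
total_fits = fits_per_search * len(tag_counts)

print(len(combinations), fits_per_search, total_fits)
```

A third of these fits use 1000 trees, so reducing `n_estimators` or the number of folds is the most direct way to shorten the search.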

### for 20,000 posts
Cross-validation, CV=5
GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring='f1_micro')

Best cross-validation score: 0.83
Best parameters:  {'n_estimators': 1000}
Best number of features:  2
['python', 'r']

Best cross-validation score: 0.82
Best parameters:  {'n_estimators': 500}
Best number of features:  4
['django', 'ggplot2', 'python', 'r']

Best cross-validation score: 0.75
Best parameters:  {'n_estimators': 1000}
Best number of features:  8
['dataframe', 'django', 'ggplot2', 'list', 'numpy', 'python', 'r', 'string']

Best cross-validation score: 0.67
Best parameters:  {'n_estimators': 100}
Best number of features:  16
['2', '3', 'data', 'dataframe', 'dictionary', 'django', 'ggplot2', 'list', 'matplotlib', 'numpy', 'pandas', 'python', 'r', 'regex', 'string', 'x']

Best cross-validation score: 0.58
Best parameters:  {'n_estimators': 100}
Best number of features:  32
['2', '3', '7', 'class', 'data', 'dataframe', 'datetime', 'dictionary', 'django', 'dplyr', 'faq', 'file', 'flask', 'function', 'ggplot2', 'import', 'list', 'matplotlib', 'models', 'numpy', 'pandas', 'performance', 'plot', 'python', 'r', 'regex', 'scipy', 'sqlalchemy', 'string', 'table', 'testing', 'x']

CPU times: user 4h 32min 40s, sys: 16 s, total: 4h 32min 56s
Wall time: 4h 32min 57s

## Multilayer Perceptron classifier

In [34]:
%%time
# 2d 7h 36min 50s for 20,000 posts
# 3min 58s for 2,000 posts and 32 tags

from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import RandomizedSearchCV

nb_tag = 32

MLPC = MLPClassifier(hidden_layer_sizes=(100,100),
                    max_iter=500,
                    alpha=0.1, # L2 penalty (regularization term) parameter.
                    learning_rate_init=0.0001) #The initial learning rate used. It controls the step-size in updating the weights. 

vect_Y = CountVectorizer(binary=True,
                         token_pattern="(?u)\\b\\w+\\b",
                         max_features=nb_tag).fit(tag[2000:])
Y_train = vect_Y.transform(tag[2000:]).toarray()

MLPC.fit(X_train, Y_train)

CPU times: user 12min 47s, sys: 33min 32s, total: 46min 19s
Wall time: 3min 58s




MLPClassifier(alpha=0.1, hidden_layer_sizes=(100, 100),
              learning_rate_init=0.0001, max_iter=500)
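
To get a feel for the size of this network, the number of trainable parameters can be counted from the layer sizes: 26268 bag-of-words features as input, two hidden layers of 100 units, and 32 tag outputs. The helper name `mlp_parameter_count` is hypothetical; the sketch assumes fully connected layers with one bias per unit.

```python
def mlp_parameter_count(n_inputs, hidden_layer_sizes, n_outputs):
    """Count weights and biases of a fully connected network."""
    sizes = [n_inputs, *hidden_layer_sizes, n_outputs]
    # Each layer has (inputs x outputs) weights plus one bias per output unit
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))

n_params = mlp_parameter_count(26268, (100, 100), 32)
print(n_params)
```

Almost all parameters sit in the first layer (26268 x 100 weights), while only 2000 training posts are available, which is consistent with the strong L2 penalty (`alpha=0.1`) used here.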

In [37]:
# note: without parentheses this stores the bound method, as the output below shows;
# call MLPC.get_params() to obtain the parameter dictionary
best_param = MLPC.get_params

Y_val = vect_Y.transform(tag[:1000]).toarray()
y_val_MLPC_pred=MLPC.predict(X_val)

from sklearn.metrics import f1_score , precision_score, recall_score

y_val_MLPC_pred_ones = (y_val_MLPC_pred >0.5).astype(int)
# results are stored under new names so the imported metric functions are not overwritten
f1        = f1_score(Y_val, y_val_MLPC_pred_ones, average="micro")
precision = precision_score(Y_val, y_val_MLPC_pred_ones, average="micro")
recall    = recall_score(Y_val, y_val_MLPC_pred_ones, average="micro")

print("Number of tags used for training: ", nb_tag)
print(vect_Y.get_feature_names(),'\n')
print("f1-score: {:.2f}".format(f1))
print("precision_score: {:.2f}".format(precision))
print("recall_score: {:.2f}".format(recall))
print("Model parameters: ", best_param,'\n')

Number of tags used for training:  32
['2', '3', '7', 'c', 'data', 'dataframe', 'datetime', 'dictionary', 'django', 'dplyr', 'exception', 'faq', 'file', 'ggplot2', 'import', 'list', 'markdown', 'matplotlib', 'matrix', 'numpy', 'pandas', 'plot', 'python', 'r', 'regex', 'scipy', 'sqlalchemy', 'string', 'table', 'testing', 'unit', 'x'] 

f1-score: 0.76
precision_score: 0.86
recall_score: 0.68
Model parameters:  <bound method BaseEstimator.get_params of MLPClassifier(alpha=0.1, hidden_layer_sizes=(100, 100),
              learning_rate_init=0.0001, max_iter=500)> 
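
As a sanity check on the three numbers printed above: with micro-averaging, F1 is the harmonic mean of the micro precision and the micro recall. The sketch below recomputes it from the rounded values in the output, so only agreement to two decimals can be expected.

```python
precision, recall = 0.86, 0.68  # rounded values from the output above

# Harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))
```

The gap between the two values shows a model that is usually right when it predicts a tag but misses about a third of the true tags.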



In [None]:
%%time
# 2d 7h 36min 50s for 20,000 posts

from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import RandomizedSearchCV

param_grid = {'alpha' : [0.001,0.01,0.1],
              'learning_rate_init' : [0.0001,0.001,0.01],
             'hidden_layer_sizes': [(30,),(100,),(30,30),(100,100)]}
grid = RandomizedSearchCV(MLPClassifier(max_iter=100), param_grid, cv=5, scoring='f1_micro')

best_score = 0
best_param = {}
best_feature_number = 0

for i in [2,4,8,16,32]:
    vect_Y = CountVectorizer(binary=True, token_pattern= "(?u)\\b\\w+\\b", max_features=i).fit(tag[2000:])
    Y_train = vect_Y.transform(tag[2000:]).toarray()
    grid.fit(X_train, Y_train)
    #if grid.best_score_>best_score:
    best_score = grid.best_score_
    best_param = grid.best_params_
    best_feature_number = i
    print("Number of features: ", best_feature_number)
    print(vect_Y.get_feature_names())
    print("Best cross-validation f1-score: {:.2f}".format(best_score))
    print("Best parameters: ", best_param,'\n')

Results with 2,000 posts

Number of features:  2
['python', 'r']
Best cross-validation f1-score: 0.91
Best parameters:  {'learning_rate_init': 0.01, 'hidden_layer_sizes': (30,), 'alpha': 0.001} 

Number of features:  4
['django', 'list', 'python', 'r']
Best cross-validation f1-score: 0.89
Best parameters:  {'learning_rate_init': 0.001, 'hidden_layer_sizes': (100, 100), 'alpha': 0.1} 

Number of features:  8
['2', 'django', 'ggplot2', 'list', 'numpy', 'python', 'r', 'string']
Best cross-validation f1-score: 0.85
Best parameters:  {'learning_rate_init': 0.01, 'hidden_layer_sizes': (100, 100), 'alpha': 0.001} 

Number of features:  16
['2', '3', 'data', 'dataframe', 'dictionary', 'django', 'ggplot2', 'list', 'matplotlib', 'numpy', 'pandas', 'python', 'r', 'regex', 'string', 'x']
Best cross-validation f1-score: 0.79
Best parameters:  {'learning_rate_init': 0.01, 'hidden_layer_sizes': (100, 100), 'alpha': 0.1} 
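
Unlike a grid search, a randomized search only evaluates a sample of the possible combinations. The sketch below mimics that idea with the standard library, assuming the `RandomizedSearchCV` default of `n_iter=10` and sampling without replacement; it uses Python's `random` module, so the combinations drawn are not the ones scikit-learn would draw.

```python
import random
from itertools import product

param_grid = {'alpha': [0.001, 0.01, 0.1],
              'learning_rate_init': [0.0001, 0.001, 0.01],
              'hidden_layer_sizes': [(30,), (100,), (30, 30), (100, 100)]}

# Full grid: every combination of the three hyperparameters
all_combinations = [dict(zip(param_grid, values))
                    for values in product(*param_grid.values())]

# Randomized search: evaluate only a subset of the grid
rng = random.Random(0)  # fixed seed so the sketch is reproducible
sampled = rng.sample(all_combinations, 10)

print(len(all_combinations), len(sampled))
```

Only 10 of the 36 combinations are tried per tag count, so the reported best parameters depend on the draw and can differ from one run to the next.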

In [None]:
%%time
from sklearn.neural_network import MLPClassifier
MLP_clf = MLPClassifier(max_iter=1500, alpha=0.1, learning_rate_init=0.0001)
# the model has not yet converged at 1000 iterations
vect_Y_32 = CountVectorizer(binary=True, token_pattern= "(?u)\\b\\w+\\b", max_features=32).fit(tag[2000:])
Y_train = vect_Y_32.transform(tag[2000:]).toarray()
MLP_clf.fit(X_train, Y_train)

In [None]:
with open('API/models/vect_Y_32.pickle', 'wb') as f:
    pickle.dump(vect_Y_32, f)
with open('API/models/MLP_clf.pickle', 'wb') as f:
    pickle.dump(MLP_clf, f)
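
On the API side, the saved objects have to be loaded back with `pickle.load`. Here is a minimal round-trip sketch; the dictionary stands in for a trained model and the file name is hypothetical, so nothing here depends on the notebook's actual files.

```python
import os
import pickle
import tempfile

# Stand-in object, not a real estimator
model = {"n_neighbors": 6, "tags": ["python", "r"]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pickle")  # hypothetical file name
    # Save: the context manager closes the file even if dump fails
    with open(path, "wb") as f:
        pickle.dump(model, f)
    # Load: this is what the API would do at start-up
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model)
```

Pickle files should only be loaded from trusted sources, and a pickled scikit-learn model is best loaded with the same library version that produced it.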

In [None]:
id_sample = 1002
new_post = text[id_sample]
text_prediction_labels(new_post,vect_X,vect_Y_32,MLP_clf,df_questions)

y_train_MLP_pred = MLP_clf.predict(X_train)
from sklearn.metrics import f1_score

print('F1 score : ')
y_train_MLP_pred_ones = (y_train_MLP_pred >0).astype(int)
f1_score(Y_train, y_train_MLP_pred_ones, average='weighted')

### for 20,000 posts
Cross-validation, CV=5
MLPClassifier(max_iter=500), param_grid, cv=5, scoring='f1_micro')

Best cross-validation score: 0.87
Best parameters:  {'learning_rate_init': 0.001, 'alpha': 0.1}
Best number of features:  2
['python', 'r']

Best cross-validation score: 0.88
Best parameters:  {'learning_rate_init': 0.0001, 'alpha': 0.1}
Best number of features:  4
['django', 'ggplot2', 'python', 'r']

Best cross-validation score: 0.92
Best parameters:  {'learning_rate_init': 0.0001, 'alpha': 0.1}
Best number of features:  8
['dataframe', 'django', 'ggplot2', 'list', 'numpy', 'python', 'r', 'string']

Best cross-validation score: 0.87
Best parameters:  {'learning_rate_init': 0.0001, 'alpha': 0.1}
Best number of features:  16
['2', '3', 'data', 'dataframe', 'dictionary', 'django', 'ggplot2', 'list', 'matplotlib', 'numpy', 'pandas', 'python', 'r', 'regex', 'string', 'x']

Best cross-validation score: 0.91
Best parameters:  {'learning_rate_init': 0.0001, 'alpha': 0.1}
Best number of features:  32
['2', '3', '7', 'class', 'data', 'dataframe', 'datetime', 'dictionary', 'django', 'dplyr', 'faq', 'file', 'flask', 'function', 'ggplot2', 'import', 'list', 'matplotlib', 'models', 'numpy', 'pandas', 'performance', 'plot', 'python', 'r', 'regex', 'scipy', 'sqlalchemy', 'string', 'table', 'testing', 'x']

CPU times: user 13d 12h 11min 43s, sys: 2h 41min 2s, total: 13d 14h 52min 45s
Wall time: 2d 7h 36min 50s

In [None]:

import pickle

clf=grid.best_estimator_

with open('models/final_prediction.pickle', 'wb') as f:
    pickle.dump(clf, f)

In [None]:
id_sample = 100
new_post = text[id_sample]
new_post_vect = vect_X.transform([new_post])
y_predict = clf.predict(new_post_vect)

# tag names must come from the vectorizer used to train clf, not from the full tag vocabulary
feature_names_Y = vect_Y.get_feature_names()

tags = np.argsort(y_predict[0,:])[::-1][:10].tolist()
scores = np.sort(y_predict[0,:])[::-1][:10]
print(df_questions.Title_raw[id_sample],'\n')
print(df_questions.Body[id_sample],'\n')
print(text[id_sample])
for tag_idx,score in zip(tags,scores) :  # tag_idx avoids overwriting the `tag` Series
    if score > 0  :
        print(feature_names_Y[tag_idx],score)

In [None]:
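# Y_train only covers posts 2000 onward, so the label row of post id_sample is rebuilt directly
y_labels = vect_Y.transform([df_questions.Tags[id_sample]]).toarray()[0]
tags = np.argsort(y_labels)[::-1][:10].tolist()
scores = np.sort(y_labels)[::-1][:10]
for tag_idx,score in zip(tags,scores) :
    if score > 0  :
        print(feature_names_Y[tag_idx],score)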

In [None]:
scores