In this notebook we analyze the database of Python and R posts in order to
train supervised models.

We start with a KNN model, then a RandomForest and a Multi-Layer Perceptron,
to classify posts as Python or R. Each post can carry several tags, which we
handle with a One-vs-All strategy.

Once the models are trained, we compare their F1 scores and choose the best one
for use in our API.

# Importing libraries

In [2]:
import pandas as pd
import numpy as np
from IPython.display import display  # public import path (IPython.core.display is deprecated for this)
import pickle # to export the trained models



pd.set_option('display.max_colwidth', None)

# Loading the data

In [3]:
dtypes_questions = {'Id':'int32', 'Score': 'int16', 'Title': 'str',
                    'Body': 'str', 'Title_raw': 'str', 'Text': 'str',
                    'Tags': 'str'}

nrows = 20000

df_questions = pd.read_csv('df_questions_fullclean.csv',
                           usecols=dtypes_questions.keys(),
                           encoding = "utf-8",
                           dtype=dtypes_questions,
                           nrows=nrows
                          )

print(len(df_questions))
display(df_questions.head(5))


13907


Unnamed: 0,Id,Title,Body,Tags,Score,Title_raw,Text
0,5649407,string array python,long string represent series value type convert hex string array shift value convert proper data type,python bytearray,182,hexadecimal string to byte array in python,string array python long string represent series value type convert hex string array shift value convert proper data type
1,7974849,make one python file run another,make one python file run another example two py file want one file run run py file,python,355,How can I make one python file run another?,make one python file run another make one python file run another example two py file want one file run run py file
2,29554796,mean band width ggplot geomsmooth lm,follow code libraryggplot ggplotmtcars aesxwt geompointaescolourfactorcyl geomsmoothmethodlm get question grey zone define mean play around parameter control width band,r ggplot2,32,Meaning of band width in ggplot geom_smooth lm,mean band width ggplot geomsmooth lm follow code libraryggplot ggplotmtcars aesxwt geompointaescolourfactorcyl geomsmoothmethodlm get question grey zone define mean play around parameter control width band
3,250151,lua generalpurpose script language,see thing ever read embed often anything world warcraft wow limit embed thing another application write script like python perl lua seem aspect like speed memoryusage script language afaik never see lua use scriptinglanguage automate task example rename file download file web webscraping lack library,python scripting lua,36,Lua as a general-purpose scripting language?,lua generalpurpose script language see thing ever read embed often anything world warcraft wow limit embed thing another application write script like python perl lua seem aspect like speed memoryusage script language afaik never see lua use scriptinglanguage automate task example rename file download file web webscraping lack library
4,1342000,make python interpreter correctly character string operation,string look like way string understand python simply say string call get sreplace course complain character xc file blablapy encode never quite could understand switch encoding code really file save notepad follow header cod code f soup beautifulsoupf soupfinddiv idmaincount make print go well show sreplace savemaincounts get sreplace,python unicode,104,How to make the python interpreter correctly handle non-ASCII characters in string operations?,make python interpreter correctly character string operation string look like way string understand python simply say string call get sreplace course complain character xc file blablapy encode never quite could understand switch encoding code really file save notepad follow header cod code f soup beautifulsoupf soupfinddiv idmaincount make print go well show sreplace savemaincounts get sreplace


In [4]:
# create the python and r labels
text, tag = df_questions.Text, df_questions.Tags
print("type of text_train: {}".format(type(text)))
print("length of text_train: {}".format(len(text)))
print("text_train[6]:\n{}".format(text[6]))

type of text_train: <class 'pandas.core.series.Series'>
length of text_train: 13907
text_train[6]:
python count element object match attribute try find way count number object match criterion eg class person def initself age gender selfname name selfage age selfgender gender list people peoplelist f personhenry personmarg f function count number object match argument base attribute return persongender f personage


In [5]:
type(text)

pandas.core.series.Series

In [6]:
type(tag)

pandas.core.series.Series

In [7]:
# add the files for testing

### Representing the text data as a bag-of-words

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

vect_X = CountVectorizer().fit(text)
X = vect_X.transform(text)

# The posts were shuffled during preprocessing; we use the first 1000 posts
# as the validation set and the next 1000 as the test set; the rest of the dataset
# is used to train the model

X_val = X[:1000]
X_test = X[1000:2000]
X_train = X[2000:]

# save the fitted X vectorizer (the context manager closes the file)
with open('API/models/vect_X.pickle', 'wb') as f:
    pickle.dump(vect_X, f)

print("X_train:\n{}".format(repr(X_train)))

X_train:
<11907x66996 sparse matrix of type '<class 'numpy.int64'>'
	with 399250 stored elements in Compressed Sparse Row format>
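
To make the bag-of-words representation concrete, here is a minimal sketch on three made-up documents (the names `docs`, `vect_demo` and `X_demo` are illustrative and not part of the notebook). Each column of the matrix is one vocabulary word and each cell counts how often that word occurs in the document.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three toy documents standing in for the cleaned posts (illustrative data)
docs = [
    "string array python",
    "python file run python file",
    "ggplot band width",
]

vect_demo = CountVectorizer().fit(docs)   # learn the vocabulary
X_demo = vect_demo.transform(docs)        # sparse document-term matrix

# vocabulary_ maps each word to its column index
vocab = sorted(vect_demo.vocabulary_, key=vect_demo.vocabulary_.get)
dense = X_demo.toarray()

print(vocab)
print(dense)
```

As in the notebook, the result is a sparse matrix with one row per post and one column per distinct word, which is why the real `X_train` has tens of thousands of columns but few stored elements.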


In [9]:
feature_names_X = vect_X.get_feature_names()
print("Number of features: {}".format(len(feature_names_X)))
print("First 20 features:\n{}".format(feature_names_X[:20]))
print("Features 20010 to 20030:\n{}".format(feature_names_X[20010:20030]))
print("Every 2000th feature:\n{}".format(feature_names_X[::2000]))

Number of features: 66996
First 20 features:
['aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaaa', 'aaaaaaaaa', 'aaaaaaaaaa', 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', 'aaaaaaaaaaaaaaaaaadwpwaaaaaaaabaaaaaaaaaceaaaaaaaaaqqaaaaaaaabraaaaaaaaageaaaaaaaaacqaaaaaaaacbaaaaaaaaaikaaaaaaaaakqaaaaaaaaczaaaaaaaaakeaaaaaaaaaqqaaaaaaaacxaaaaaaaaalkaaaaaaaaawqaaaaaaaadfaaaaaaaaamkaaaaaaaaazqaaaaaaaadraaaaaaaaanuaaaaaaaaaqaaaaaaaaddaaaaaaaaaoea', 'aaaab', 'aaaabbcccdddddd', 'aaaarghxxx', 'aaab', 'aaabbb', 'aaabbbcccdddeee', 'aaabbcc', 'aaabbzzyy', 'aaabcabccd', 'aaadcabdcaeafd', 'aaadcabdcaeafdaeeeaaaaabecaebeeddecacfffffdbebefcefdbccbbed']
Features 20010 to 20030:
['foobarcom', 'foobarfoo', 'foobarinfoneeded', 'foobarinfoneededhereddd', 'foobarmysettings', 'foobarnamedtuplef', 'foobarobject', 'foobarparam', 'foobarpy', 'foobarself', 'foobaseclass', 'foobaz', 'foobo', 'foobyidvalues', 'fooc', 'foochecklist', 'fooclass', 'foocleancsv', 'foocls', 'fooconf']
Every 2000th feature:
['aa', 'applicability', 

In [10]:
rx = r"['(-.)\w]+"  # raw string, so that \w is not treated as a string escape
vect_Y = CountVectorizer(binary=True,
                         max_features=None,
                         token_pattern=rx).fit(tag)
# binary=True because counting the same token more than once per post is pointless
# we will limit the number of features because the F1 score with all tags for the KNN model is 0.013
# with 10 features we reach an F1 score of 37.8%
# a custom token_pattern is used so that the single-character tag r is kept
Y = vect_Y.transform(tag).toarray()

Y_val = Y[:1000]
Y_test = Y[1000:2000]
Y_train = Y[2000:]

# save the fitted Y vectorizer (the context manager closes the file)
with open('API/models/vect_Y.pickle', 'wb') as f:
    pickle.dump(vect_Y, f)

print("Y_train:\n{}".format(repr(Y_train)))

Y_train:
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])


In [11]:
feature_names_Y = vect_Y.get_feature_names()
print("Number of features: {}".format(len(feature_names_Y)))
print("First 20 features:\n{}".format(feature_names_Y[:20]))
print("Features 800 to 820:\n{}".format(feature_names_Y[800:820]))
print("Every 200th feature:\n{}".format(feature_names_Y[::200]))

Number of features: 3362
First 20 features:
['.htaccess', '.net', '2-digit-year', '2d', '32bit-64bit', '3d', '64-bit', '7zip', 'a-star', 'aabb', 'abc', 'abort', 'absolute-path', 'absolute-value', 'abstract', 'abstract-class', 'abstract-data-type', 'abstract-methods', 'abstract-syntax-tree', 'access-modifiers']
Features 800 to 820:
['dpapi', 'dpi', 'dplyr', 'drag-and-drop', 'draggable', 'draw', 'drawing', 'dreamhost', 'drop-down-menu', 'dry', 'dsl', 'dst', 'dt', 'dtd', 'dto', 'dtype', 'duck-typing', 'dulwich', 'dummy-variable', 'dump']
Every 200th feature:
['.htaccess', 'between', 'code-documentation', 'datatable', 'dpapi', 'file-upload', 'gpl', 'integer-division', 'list', 'multicore', 'package-development', 'put', 'readlines', 'scrollbar', 'statet', 'time-measurement', 'version-control']
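
The custom `token_pattern` matters because the default pattern of `CountVectorizer` only keeps tokens of two or more word characters, which drops the tag `r` and splits tags such as `python-3.x`. The sketch below compares both patterns on a few made-up tag strings (`tags_demo` is illustrative data, not taken from the dataset).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative tag strings, space-separated as in the Tags column
tags_demo = ["python python-3.x", "r ggplot2", "r data.table", ".net python"]

# Default pattern: tokens of at least two word characters
default_vect = CountVectorizer(binary=True).fit(tags_demo)

# Pattern used in the notebook: word characters plus ' ( ) * + , - .
custom_vect = CountVectorizer(binary=True, token_pattern=r"['(-.)\w]+").fit(tags_demo)

default_vocab = sorted(default_vect.vocabulary_)
custom_vocab = sorted(custom_vect.vocabulary_)

print("default:", default_vocab)
print("custom :", custom_vocab)
```

With the custom pattern each whitespace-separated tag survives as a single token, so the columns of `Y` correspond to real tags.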


# KNN model

## Function definitions

In [12]:
def text_prediction_labels(new_post, vect_X, vect_Y, model, df_questions):
    '''
        Predict tags for a post with "model":
        - vectorize the text with vect_X
        - predict, then map the prediction back to tag names with vect_Y
        - print the predicted tags with the score given by the model
        - print the true labels as well
        Note: the post to display is selected through the global variable id_sample.
    '''
    feature_names_Y = vect_Y.get_feature_names() # list of tags
    Y_train = vect_Y.transform(df_questions.Tags) # binary tag matrix, one row per post
    new_post_vect = vect_X.transform([new_post]) # vectorize the new post for the model
    y_predict = model.predict(new_post_vect) # prediction of the trained model

    tags = np.argsort(y_predict[0,:])[::-1][:10].tolist()
    scores = np.sort(y_predict[0,:])[::-1][:10]

    print(df_questions.Title_raw[id_sample],'\n')
    print(df_questions.Body[id_sample],'\n')
    print(df_questions.Text[id_sample])
    print('\n','Tags prediction : ', '\n')
    for tag,score in zip(tags,scores) :
        if score > 0  :
            print(feature_names_Y[tag],score)
    print('\n','Tags labels : ','\n')
    
    y_labels = Y_train[id_sample].toarray()
    tags = np.argsort(y_labels[0,:])[::-1][:10].tolist()
    scores = np.sort(y_labels[0,:])[::-1][:10]
    for tag,score in zip(tags,scores) :
        if score > 0  :
            print(feature_names_Y[tag],score)

In [13]:
from sklearn.metrics import f1_score , precision_score, recall_score

def model_score(model,X_vect,tag_text,vect_Y,names=False,seuil=0.5):
    '''
        Print the F1, precision and recall scores of the chosen model.
        The seuil (threshold) option changes the precision/recall trade-off.
        The tag names can optionally be printed (names=True).
    '''
    Y = vect_Y.transform(tag_text).toarray()

    y_model_pred = model.predict(X_vect)
    y_model_pred_ones = (y_model_pred >seuil).astype(int)
    
    f1_score_model        = f1_score(Y, y_model_pred_ones, average="micro")
    precision_score_model = precision_score(Y, y_model_pred_ones, average="micro")
    recall_score_model    = recall_score(Y, y_model_pred_ones, average="micro")
    
    nb_tag = len(Y[0])
    
    print("Nombre de Tags pour l'entrainement: ", nb_tag)
    if names :
        print(vect_Y.get_feature_names(),'\n')
    print("f1-score: {:.2f}".format(f1_score_model))
    print("precision_score: {:.2f}".format(precision_score_model))
    print("recall_score: {:.2f}".format(recall_score_model))
    print("Model parameters: ", model.get_params,'\n')

## Training the model with all labels

In [14]:
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier(weights='distance')  # closer neighbors get more weight
knn_clf.fit(X_train, Y_train)

with open('API/models/knn_clf.pickle', 'wb') as f:
    pickle.dump(knn_clf, f)

In [15]:
# example prediction: validation set between 0 and 1000, test set between 1000
# and 2000, train set beyond 2000

id_sample = 1500
new_post = text[id_sample]
text_prediction_labels(new_post,vect_X,vect_Y,knn_clf,df_questions)


print() vs sys.stdout.write(): which and why? 

script create output sysstderrwrite recently post question aspect script people post answer seem prefer text filesysstdout concern output simply pip either file input another program someone please explain existing explanation difference print case use convention 

print v sysstdoutwrite script create output sysstderrwrite recently post question aspect script people post answer seem prefer text filesysstdout concern output simply pip either file input another program someone please explain existing explanation difference print case use convention

 Tags prediction :  

python 1

 Tags labels :  

python 1
python-3.x 1


### Comparing predictions with labels on the train set and the test set

In [16]:
y_pred =knn_clf.predict(X_train)

In [17]:
y_pred_one = (y_pred>0.5).astype(int)
np.sort(np.sum(y_pred_one,axis=0))[-40:]

array([  86,   87,   87,   87,   92,   92,   93,   97,  101,  104,  105,
        107,  111,  111,  116,  118,  119,  120,  124,  131,  134,  150,
        158,  163,  168,  172,  175,  206,  218,  248,  258,  309,  334,
        352,  449,  452,  552,  705, 3380, 8618])

In [18]:
np.sort(np.sum(Y_train,axis=0))[-40:]

array([  86,   87,   87,   87,   92,   92,   93,   97,  101,  104,  105,
        107,  111,  111,  116,  118,  119,  120,  124,  131,  134,  150,
        158,  163,  168,  172,  175,  206,  218,  248,  258,  309,  334,
        352,  449,  452,  552,  705, 3380, 8618])

The predictions are identical to the labels because we chose distance weighting in our model: each training post is its own nearest neighbor (at distance zero), so the prediction is simply the label of that neighbor.
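
The observation above can be reproduced on synthetic data. The following is a minimal sketch, assuming random features without duplicate rows (`X_demo`, `Y_demo` and `knn_demo` are illustrative names): with `weights='distance'`, predicting on the training points returns the training labels exactly.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Synthetic multi-label problem: 50 samples, 5 features, 3 binary labels
X_demo = rng.normal(size=(50, 5))
Y_demo = (rng.random(size=(50, 3)) > 0.5).astype(int)

knn_demo = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn_demo.fit(X_demo, Y_demo)

# Each training point is at distance 0 from itself, so it receives
# all of the weight and its own labels are returned.
train_pred = knn_demo.predict(X_demo)
reproduces_labels = bool((train_pred == Y_demo).all())
print("train predictions equal labels:", reproduces_labels)
```

This is why a perfect score on the train set says nothing about generalization for this model; only the cross-validation and test-set scores are informative.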

In [19]:
y_test_pred =knn_clf.predict(X_test)

In [20]:
y_test_pred_one = (y_test_pred>0.5).astype(int)
np.sort(np.sum(y_test_pred_one,axis=0))[-40:]

array([  1,   2,   2,   2,   2,   2,   2,   2,   2,   2,   2,   2,   2,
         2,   2,   2,   2,   3,   3,   3,   3,   3,   3,   4,   4,   4,
         4,   4,   5,   5,  10,  11,  13,  14,  16,  19,  19,  37, 143,
       861])

In [21]:
np.sort(np.sum(Y_test,axis=0))[-40:]

array([  8,   8,   8,   8,   8,   9,   9,   9,   9,   9,   9,  10,  10,
        10,  10,  10,  10,  11,  13,  13,  14,  14,  15,  16,  18,  19,
        19,  19,  20,  21,  21,  27,  32,  33,  38,  38,  41,  76, 280,
       724])

On the test set, the predictions contain far fewer tags than the labels.

## Metrics: F1 score, precision and recall

### Cross Validation

In [22]:
%%time
# 50 seconds to process 2000 rows
# 15min 57s for 20 000 rows

from sklearn.model_selection import cross_val_predict

y_train_knn_pred = cross_val_predict(knn_clf, X_train, Y_train, cv=3)


CPU times: user 10.6 s, sys: 680 ms, total: 11.2 s
Wall time: 11.2 s


In [23]:
from sklearn.metrics import f1_score,precision_score, recall_score

y_train_knn_pred_ones = (y_train_knn_pred >0.5).astype(int)
print('résultats suite à une cross validation sur le train set')
print('f1 : ',f1_score(Y_train, y_train_knn_pred_ones, average="micro"))
print('precision : ',precision_score(Y_train, y_train_knn_pred_ones, average="micro"))
print('recall : ',recall_score(Y_train, y_train_knn_pred_ones, average="micro"))


résultats suite à une cross validation sur le train set
f1 :  0.45198652111624527
precision :  0.7622677465459743
recall :  0.3212298514311937
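
As a check on the numbers above, the micro-averaged F1 score is the harmonic mean of the micro-averaged precision and recall. The sketch below recomputes it from the two reported values, then shows the same relation on a tiny made-up multi-label example (`Y_true` and `Y_hat` are illustrative arrays).

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Values reported by the cross-validation above
precision = 0.7622677465459743
recall = 0.3212298514311937
f1_from_pr = 2 * precision * recall / (precision + recall)
print("F1 from precision and recall:", f1_from_pr)

# Toy multi-label example: 3 posts, 3 tags
Y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
Y_hat = np.array([[1, 0, 0], [0, 1, 0], [0, 1, 1]])

# Micro averaging pools true positives, false positives and false
# negatives over all tags before computing the scores.
micro_p = precision_score(Y_true, Y_hat, average="micro")
micro_r = recall_score(Y_true, Y_hat, average="micro")
micro_f1 = f1_score(Y_true, Y_hat, average="micro")
print(micro_p, micro_r, micro_f1)
```

A high precision combined with a low recall, as here, means the model rarely predicts a wrong tag but misses most of the true ones.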


### On the test set

In [24]:
model_score(knn_clf,X_test,tag[1000:2000],vect_Y,seuil=0.2) 
# the threshold has no influence on the KNN model (predict returns 0/1 labels)

Nombre de Tags pour l'entrainement:  3362
f1-score: 0.45
precision_score: 0.78
recall_score: 0.32
Model parameters:  <bound method BaseEstimator.get_params of KNeighborsClassifier(weights='distance')> 
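
The threshold has no effect here because `predict` already returns hard 0/1 labels, so any threshold between 0 and 1 gives the same result. A threshold only changes the precision/recall trade-off when it is applied to probabilities. The sketch below illustrates this on synthetic data; it uses `OneVsRestClassifier` with `LogisticRegression` as a stand-in model, which is an assumption made for illustration and not a model used in this notebook.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Synthetic multi-label data (illustrative, not the posts dataset)
X_demo, Y_demo = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=3, random_state=0
)

# One binary classifier per label (One-vs-Rest)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_demo, Y_demo)

# Hard predictions are 0 or 1: thresholds 0.2 and 0.5 select the same cells
hard = ovr.predict(X_demo)
same_hard = bool(((hard > 0.2) == (hard > 0.5)).all())

# Probabilities: a lower threshold can only add predicted tags
proba = ovr.predict_proba(X_demo)
n_pos_low = int((proba > 0.2).sum())
n_pos_high = int((proba > 0.5).sum())

print("hard predictions unchanged:", same_hard)
print("predicted tags at 0.2 / 0.5:", n_pos_low, n_pos_high)
```

Lowering the threshold on probabilities trades precision for recall, which is the behaviour the `seuil` parameter is meant to expose.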



In [25]:
model_score(knn_clf,X_train,tag[2000:],vect_Y,seuil=0.2) 
# the model has overfitted the train set

Nombre de Tags pour l'entrainement:  3362
f1-score: 1.00
precision_score: 1.00
recall_score: 1.00
Model parameters:  <bound method BaseEstimator.get_params of KNeighborsClassifier(weights='distance')> 



As seen above, we get perfect scores on the train set but much less encouraging results on the test set. We will try to increase precision by reducing the number of possible tags: this should reduce the number of false positives and therefore improve precision.

## Creating a pipeline to choose the best parameters for this model

In [26]:
%%time
# 6min 42s for 20 000 posts

from sklearn.model_selection import GridSearchCV
param_grid = {'n_neighbors' : [2,4,6, 8,32,]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='f1_micro')

best_score = 0
best_param = {}
best_feature_number = 0

for i in [2,4,6,8,16,32]: # tested with a limited number of tags
    vect_Y = CountVectorizer(binary=True, token_pattern= rx, max_features=i).fit(tag[2000:])
    Y_train = vect_Y.transform(tag[2000:]).toarray()
    grid.fit(X_train, Y_train)
    best_score = grid.best_score_
    best_param = grid.best_params_
    best_feature_number = i
    print("Nombre de features: ", best_feature_number)
    print(vect_Y.get_feature_names())
    print("Best cross-validation f1-score: {:.2f}".format(best_score))
    print("Best parameters: ", best_param,'\n')
    
    

Nombre de features:  2
['python', 'r']
Best cross-validation f1-score: 0.80
Best parameters:  {'n_neighbors': 6} 

Nombre de features:  4
['django', 'ggplot2', 'python', 'r']
Best cross-validation f1-score: 0.77
Best parameters:  {'n_neighbors': 6} 

Nombre de features:  6
['django', 'ggplot2', 'list', 'numpy', 'python', 'r']
Best cross-validation f1-score: 0.75
Best parameters:  {'n_neighbors': 8} 

Nombre de features:  8
['dataframe', 'django', 'ggplot2', 'list', 'numpy', 'python', 'r', 'string']
Best cross-validation f1-score: 0.73
Best parameters:  {'n_neighbors': 8} 

Nombre de features:  16
['dataframe', 'dictionary', 'django', 'dplyr', 'ggplot2', 'list', 'matplotlib', 'numpy', 'pandas', 'plot', 'python', 'python-2.7', 'python-3.x', 'r', 'regex', 'string']
Best cross-validation f1-score: 0.69
Best parameters:  {'n_neighbors': 6} 

Nombre de features:  32
['arrays', 'class', 'data.table', 'dataframe', 'datetime', 'dictionary', 'django', 'django-models', 'dplyr', 'file', 'flask', '
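
For reference, here is a minimal, self-contained sketch of the same kind of search on synthetic multi-label data (`X_demo`, `Y_demo` and `grid_demo` are illustrative names). It shows how `GridSearchCV` with `scoring='f1_micro'` picks `n_neighbors` by cross-validation when the target is a binary tag matrix.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic multi-label data: 300 samples, 4 binary labels
X_demo, Y_demo = make_multilabel_classification(
    n_samples=300, n_features=20, n_classes=4, random_state=0
)

param_grid_demo = {"n_neighbors": [2, 4, 8]}
grid_demo = GridSearchCV(
    KNeighborsClassifier(), param_grid_demo, cv=3, scoring="f1_micro"
)
grid_demo.fit(X_demo, Y_demo)

# best_params_ and best_score_ refer to this single fit only
print("Best parameters:", grid_demo.best_params_)
print("Best cross-validation f1-score: {:.2f}".format(grid_demo.best_score_))
```

In the loop above, `grid` is refit at every iteration, so after the loop `grid.best_estimator_` corresponds to the last tag count tried (32), not to the tag count with the highest score.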

## Best model

In [27]:
KNNmodel = grid.best_estimator_

id_sample = 10
new_post = text[id_sample]
text_prediction_labels(new_post,vect_X,vect_Y,KNNmodel,df_questions)

Is it possible to change an instance's method implementation without changing all other instances of the same class? 

know python never use seem find anything maybe google question go change instance implementation method google find could change implementation instance class example def showyimpself print class foo def initself selfx selfy def showxself print def showyself print class bar def initself fooshowy selffoo foo def showself selffooshowx name b bar bshow foo fshowx fshowy work expect output follow x want x woohoo try change bar init def initself selffoo foo selffooshowy showyimp get follow error message showyimp take exactly give yeah tried use setattr seem like selffooshowy showyimp clue 

change instance method implementation without change instance class know python never use seem find anything maybe google question go change instance implementation method google find could change implementation instance class example def showyimpself print class foo def initself selfx s

With 32 tags (and 2000 posts) we get:

- F1 score of 61%
- precision of 78% (precision has not changed much because the python and r posts are generally well predicted)
- recall of 49%

With all tags, i.e. 1802:

- F1 score: 42%
- precision: 75%
- recall: 29%

In [28]:
model_score(KNNmodel,X_test,tag[1000:2000],vect_Y,names=True,seuil=0.2) 

Nombre de Tags pour l'entrainement:  32
['arrays', 'class', 'data.table', 'dataframe', 'datetime', 'dictionary', 'django', 'django-models', 'dplyr', 'file', 'flask', 'function', 'ggplot2', 'list', 'matplotlib', 'numpy', 'pandas', 'performance', 'plot', 'python', 'python-2.7', 'python-3.x', 'r', 'r-faq', 'regex', 'scipy', 'sorting', 'sqlalchemy', 'string', 'unicode', 'unit-testing', 'windows'] 

f1-score: 0.65
precision_score: 0.82
recall_score: 0.55
Model parameters:  <bound method BaseEstimator.get_params of KNeighborsClassifier(n_neighbors=8)> 



## Random Forest

In [29]:
%%time
# 4h 32min 57s for 20 000 posts

from sklearn.ensemble import RandomForestClassifier

param_grid = {'n_estimators' : [100,500,1000],
              'max_depth' : [2,4,8]
              }
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring='f1_micro')

best_score = 0
best_param = {}
best_feature_number = 0

for i in [2,32]:
    vect_Y = CountVectorizer(binary=True, token_pattern= rx, max_features=i).fit(tag[2000:])
    Y_train = vect_Y.transform(tag[2000:]).toarray()
    grid.fit(X_train, Y_train)
    best_score = grid.best_score_
    best_param = grid.best_params_
    best_feature_number = i
    print("Nombre de features: ", best_feature_number)
    print(vect_Y.get_feature_names())
    print("Best cross-validation f1-score: {:.2f}".format(best_score))
    print("Best parameters: ", best_param,'\n')

Nombre de features:  2
['python', 'r']
Best cross-validation f1-score: 0.72
Best parameters:  {'max_depth': 8, 'n_estimators': 100} 

Nombre de features:  32
['arrays', 'class', 'data.table', 'dataframe', 'datetime', 'dictionary', 'django', 'django-models', 'dplyr', 'file', 'flask', 'function', 'ggplot2', 'list', 'matplotlib', 'numpy', 'pandas', 'performance', 'plot', 'python', 'python-2.7', 'python-3.x', 'r', 'r-faq', 'regex', 'scipy', 'sorting', 'sqlalchemy', 'string', 'unicode', 'unit-testing', 'windows']
Best cross-validation f1-score: 0.57
Best parameters:  {'max_depth': 8, 'n_estimators': 100} 

CPU times: user 13min 27s, sys: 427 ms, total: 13min 27s
Wall time: 13min 27s


In [30]:
RandomForestmodel = grid.best_estimator_

id_sample = 10
new_post = text[id_sample]
text_prediction_labels(new_post,vect_X,vect_Y,RandomForestmodel,df_questions)

Is it possible to change an instance's method implementation without changing all other instances of the same class? 

know python never use seem find anything maybe google question go change instance implementation method google find could change implementation instance class example def showyimpself print class foo def initself selfx selfy def showxself print def showyself print class bar def initself fooshowy selffoo foo def showself selffooshowx name b bar bshow foo fshowx fshowy work expect output follow x want x woohoo try change bar init def initself selffoo foo selffooshowy showyimp get follow error message showyimp take exactly give yeah tried use setattr seem like selffooshowy showyimp clue 

change instance method implementation without change instance class know python never use seem find anything maybe google question go change instance implementation method google find could change implementation instance class example def showyimpself print class foo def initself selfx s

In [31]:
model_score(RandomForestmodel,X_test,tag[1000:2000],vect_Y,names=True,seuil=0.2) 

Nombre de Tags pour l'entrainement:  32
['arrays', 'class', 'data.table', 'dataframe', 'datetime', 'dictionary', 'django', 'django-models', 'dplyr', 'file', 'flask', 'function', 'ggplot2', 'list', 'matplotlib', 'numpy', 'pandas', 'performance', 'plot', 'python', 'python-2.7', 'python-3.x', 'r', 'r-faq', 'regex', 'scipy', 'sorting', 'sqlalchemy', 'string', 'unicode', 'unit-testing', 'windows'] 

f1-score: 0.56
precision_score: 0.73
recall_score: 0.46
Model parameters:  <bound method BaseEstimator.get_params of RandomForestClassifier(max_depth=8)> 



Random Forest with {'max_depth': 8, 'n_estimators': 100} and 32 tags (and 2000 posts) gives:

- F1 score of 57%
- precision of 73%
- recall of 49%

This is a lower precision, with an equivalent recall, than our previous
KNN model:
- F1 score of 61%
- precision of 78%
- recall of 49%

## Multilayer Perceptron classifier

In [32]:
%%time
# 2d 7h 36min 50s for 20 000 posts
# 3min 58s for 2000 posts and 32 tags

from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import RandomizedSearchCV

nb_tag = 32

MLPC = MLPClassifier(hidden_layer_sizes=(100,100),
                    max_iter=500,
                    alpha=0.1, # L2 penalty (regularization term) parameter.
                    learning_rate_init=0.0001) #The initial learning rate used. It controls the step-size in updating the weights. 

vect_Y = CountVectorizer(binary=True,
                         token_pattern=rx,
                         max_features=nb_tag).fit(tag[2000:])
Y_train = vect_Y.transform(tag[2000:]).toarray()

MLPC.fit(X_train, Y_train)

CPU times: user 2h 40min 29s, sys: 6h 46min 3s, total: 9h 26min 32s
Wall time: 1h 38min 47s


MLPClassifier(alpha=0.1, hidden_layer_sizes=(100, 100),
              learning_rate_init=0.0001, max_iter=500)

In [33]:
id_sample = 10
new_post = text[id_sample]
text_prediction_labels(new_post,vect_X,vect_Y,MLPC,df_questions)

Is it possible to change an instance's method implementation without changing all other instances of the same class? 

know python never use seem find anything maybe google question go change instance implementation method google find could change implementation instance class example def showyimpself print class foo def initself selfx selfy def showxself print def showyself print class bar def initself fooshowy selffoo foo def showself selffooshowx name b bar bshow foo fshowx fshowy work expect output follow x want x woohoo try change bar init def initself selffoo foo selffooshowy showyimp get follow error message showyimp take exactly give yeah tried use setattr seem like selffooshowy showyimp clue 

change instance method implementation without change instance class know python never use seem find anything maybe google question go change instance implementation method google find could change implementation instance class example def showyimpself print class foo def initself selfx s

In [34]:
model_score(MLPC,X_test,tag[1000:2000],vect_Y,names=True,seuil=0.2) 

Nombre de Tags pour l'entrainement:  32
['arrays', 'class', 'data.table', 'dataframe', 'datetime', 'dictionary', 'django', 'django-models', 'dplyr', 'file', 'flask', 'function', 'ggplot2', 'list', 'matplotlib', 'numpy', 'pandas', 'performance', 'plot', 'python', 'python-2.7', 'python-3.x', 'r', 'r-faq', 'regex', 'scipy', 'sorting', 'sqlalchemy', 'string', 'unicode', 'unit-testing', 'windows'] 

f1-score: 0.83
precision_score: 0.87
recall_score: 0.79
Model parameters:  <bound method BaseEstimator.get_params of MLPClassifier(alpha=0.1, hidden_layer_sizes=(100, 100),
              learning_rate_init=0.0001, max_iter=500)> 



The Multi-Layer Perceptron with two dense hidden layers of 100 units, L2 (ridge) regularization with alpha = 0.1, an initial learning rate of 0.0001 and 32 tags (and 2000 posts) gives:

- F1 score of 77%
- precision of 88%
- recall of 69%

This is a better performance than our KNN model:
- F1 score of 61%
- precision of 78%
- recall of 49%

In [35]:
%%time
# 2d 7h 36min 50s for 20 000 posts

from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import RandomizedSearchCV

param_grid = {'alpha' : [0.001,0.01,0.1],
              'learning_rate_init' : [0.0001,0.001,0.01],
             'hidden_layer_sizes': [(30,),(100,),(30,30),(100,100)]}
grid = RandomizedSearchCV(MLPClassifier(max_iter=500), param_grid, cv=5, scoring='f1_micro')

best_score = 0
best_param = {}
best_feature_number = 0

for i in [32]:
    vect_Y = CountVectorizer(binary=True, token_pattern= rx, max_features=i).fit(tag[2000:])
    Y_train = vect_Y.transform(tag[2000:]).toarray()
    grid.fit(X_train, Y_train)
    #if grid.best_score_>best_score:
    best_score = grid.best_score_
    best_param = grid.best_params_
    best_feature_number = i
    print("Nombre de features: ", best_feature_number)
    print(vect_Y.get_feature_names())
    print("Best cross-validation f1-score: {:.2f}".format(best_score))
    print("Best parameters: ", best_param,'\n')



Nombre de features:  32
['arrays', 'class', 'data.table', 'dataframe', 'datetime', 'dictionary', 'django', 'django-models', 'dplyr', 'file', 'flask', 'function', 'ggplot2', 'list', 'matplotlib', 'numpy', 'pandas', 'performance', 'plot', 'python', 'python-2.7', 'python-3.x', 'r', 'r-faq', 'regex', 'scipy', 'sorting', 'sqlalchemy', 'string', 'unicode', 'unit-testing', 'windows']
Best cross-validation f1-score: 0.82
Best parameters:  {'learning_rate_init': 0.001, 'hidden_layer_sizes': (100,), 'alpha': 0.1} 

CPU times: user 23h 28min 59s, sys: 2d 16h 9min 19s, total: 3d 15h 38min 19s
Wall time: 9h 27min 21s


Results with 2000 posts

Nombre de features:  32
['class', 'csv', 'data.table', 'dataframe', 'date', 'datetime', 'dictionary', 'django', 'dplyr', 'flask', 'ggplot2', 'json', 'list', 'matplotlib', 'numpy', 'pandas', 'performance', 'pip', 'plot', 'python', 'python-2.7', 'python-3.x', 'r', 'r-faq', 'regex', 'scipy', 'sorting', 'sqlalchemy', 'statistics', 'string', 'unicode', 'windows']
Best cross-validation f1-score: 0.77
Best parameters:  {'learning_rate_init': 0.01, 'hidden_layer_sizes': (100,), 'alpha': 0.1} 

CPU times: user 3h 19min 28s, sys: 8h 42min 14s, total: 12h 1min 42s
Wall time: 1h 3min 24s

In [36]:
import pickle

clf=grid.best_estimator_

with open('API/models/final_prediction.pickle', 'wb') as f:
    pickle.dump(clf, f)
with open('API/models/vect_Y_32.pickle', 'wb') as f:
    pickle.dump(vect_Y, f)
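
Since the pickled vectorizers and model are meant to be reloaded by the API, the sketch below shows one possible way to use them together. It is an illustration under stated assumptions: the data is made up, a 1-nearest-neighbor classifier stands in for the final model, the pickle round-trip is done in memory instead of through the `API/models/` files, and `predict_tags` is a hypothetical helper, not a function of the actual API.

```python
import pickle
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Toy posts and their tags (illustrative data)
texts = [
    "python list pandas dataframe",
    "ggplot2 r plot dplyr",
    "python django view",
    "r dplyr mutate column",
]
tags = ["python pandas", "r ggplot2", "python django", "r dplyr"]

vect_X_demo = CountVectorizer().fit(texts)
vect_Y_demo = CountVectorizer(binary=True, token_pattern=r"['(-.)\w]+").fit(tags)
model_demo = KNeighborsClassifier(n_neighbors=1).fit(
    vect_X_demo.transform(texts), vect_Y_demo.transform(tags).toarray()
)

# Round-trip through pickle, as the API would do when loading the files
blob = pickle.dumps((vect_X_demo, vect_Y_demo, model_demo))
vect_X_api, vect_Y_api, model_api = pickle.loads(blob)

def predict_tags(post):
    """Hypothetical helper: return the predicted tag names for one post."""
    y_pred = model_api.predict(vect_X_api.transform([post]))
    # inverse_transform maps the non-zero columns back to tag names
    return sorted(vect_Y_api.inverse_transform(y_pred)[0].tolist())

predicted = predict_tags("python django view")
print(predicted)
```

The key point is that the text vectorizer, the tag vectorizer and the model must be saved and loaded together, because the column order of both matrices is defined by the fitted vectorizers.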

In [37]:
id_sample = 1002
new_post = text[id_sample]
text_prediction_labels(new_post,vect_X,vect_Y,clf,df_questions)

y_train_MLP_pred = clf.predict(X_train)
from sklearn.metrics import f1_score

print('F1 score : ')
y_train_MLP_pred_ones = (y_train_MLP_pred >0).astype(int)
f1_score(Y_train, y_train_MLP_pred_ones, average='weighted')

Unable to import a module that is definitely installed 

instal mechanize seem import try instal pip easyinstall via python install repo httpsgithubcomabielrmechanize time enter python get python default aug gcc linux type help copyright credit information import mechanize traceback call file stdin line module importerror module name installation run previously report complete successfully expect import work could cause error 

import module definitely instal instal mechanize seem import try instal pip easyinstall via python install repo httpsgithubcomabielrmechanize time enter python get python default aug gcc linux type help copyright credit information import mechanize traceback call file stdin line module importerror module name installation run previously report complete successfully expect import work could cause error

 Tags prediction :  

python 1

 Tags labels :  

python 1
F1 score : 


0.9987488463115602

In [38]:
model_score(clf,X_test,tag[1000:2000],vect_Y,names=True,seuil=0.2) 

Nombre de Tags pour l'entrainement:  32
['arrays', 'class', 'data.table', 'dataframe', 'datetime', 'dictionary', 'django', 'django-models', 'dplyr', 'file', 'flask', 'function', 'ggplot2', 'list', 'matplotlib', 'numpy', 'pandas', 'performance', 'plot', 'python', 'python-2.7', 'python-3.x', 'r', 'r-faq', 'regex', 'scipy', 'sorting', 'sqlalchemy', 'string', 'unicode', 'unit-testing', 'windows'] 

f1-score: 0.83
precision_score: 0.88
recall_score: 0.79
Model parameters:  <bound method BaseEstimator.get_params of MLPClassifier(alpha=0.1, max_iter=500)> 

