Dans ce notebook nous allons analyser la base de donées des post Python et R pour un entrainement de modèles supervisés.

Nous allons commencer avec une regresion logistique pour une clasification entre les posts Python et R, nous allons par la suite ajouter les autres tags disponibles et faire une classifaction One vs All 

Une fois le modèle entrainé nous allons le comparer avec un modèle RNN

# Importation des bibliothèques

In [187]:
import pandas as pd
import numpy as np


pd.set_option('display.max_colwidth', None)

# Importation des données 

In [188]:
dtypes_questions = {'Id':'int32', 'Score': 'int16', 'Title': 'str',
                    'Body': 'str', 'Title_raw': 'str', 'Text': 'str',
                    'Tags': 'str'}

nrows = 20000

df_questions = pd.read_csv('df_questions_fullclean.csv',
                           usecols=dtypes_questions.keys(),
                           encoding = "utf-8",
                           dtype=dtypes_questions,
                           nrows=nrows
                          )

print(len(df_questions))
display(df_questions.head(5))


13913


Unnamed: 0,Id,Title,Body,Tags,Score,Title_raw,Text
0,7033844,write complicate business logic,recently get cool package like express jade question consistently knock door pick nodejs build website use serverside complicate think compare javascript java python serverside code library nodejs really mean miss something call java python nodejs,java python node.js,38,Nodejs: Where or How to write complicated business logic?,write complicate business logic recently get cool package like express jade question consistently knock door pick nodejs build website use serverside complicate think compare javascript java python serverside code library nodejs really mean miss something call java python nodejs
1,13960657,python evaluate condition lazily,example follow statement foo python check condition foo,python lazy-evaluation,76,Does Python evaluate if's conditions lazily?,python evaluate condition lazily example follow statement foo python check condition foo
2,16632568,remove column numpy,nparray array delete column nphstacknpdeletearr array way consider question,python numpy,24,remove a specific column in numpy,remove column numpy nparray array delete column nphstacknpdeletearr array way consider question
3,285061,programmatically set attribute,python object string set x x myattr magic go goal incidentally cache call xgetattr,python attributes object,231,How do you programmatically set an attribute?,programmatically set attribute python object string set x x myattr magic go goal incidentally cache call xgetattr
4,16729574,get value cell dataframe,construct condition extract exactly one row data frame dfitemitem dfwnwn dfwd would like take value column val dcolname result get data frame contain one row one column ie one cell need need one value one float number panda,python pandas dataframe,490,How to get a value from a cell of a dataframe?,get value cell dataframe construct condition extract exactly one row data frame dfitemitem dfwnwn dfwd would like take value column val dcolname result get data frame contain one row one column ie one cell need need one value one float number panda


In [189]:
# création des labels python et r
text_train, tag_train = df_questions.Text, df_questions.Tags
print("type of text_train: {}".format(type(text_train)))
print("length of text_train: {}".format(len(text_train)))
print("text_train[6]:\n{}".format(text_train[6]))

type of text_train: <class 'pandas.core.series.Series'>
length of text_train: 13913
text_train[6]:
collapse run number string range let u say follow vector number vec look function create string summarize number way would run number collapsed end value r


In [190]:
type(text_train)

pandas.core.series.Series

In [191]:
type(tag_train)

pandas.core.series.Series

In [192]:
# ajouter les fichiers pour test

### Répresenter les données textuelles comme un bag-of-words

In [193]:
from sklearn.feature_extraction.text import CountVectorizer

vect_X = CountVectorizer().fit(text_train)
X_train = vect_X.transform(text_train)
print("X_train:\n{}".format(repr(X_train)))

X_train:
<13913x66997 sparse matrix of type '<class 'numpy.int64'>'
	with 469464 stored elements in Compressed Sparse Row format>


In [194]:
feature_names_X = vect_X.get_feature_names()
print("Number of features: {}".format(len(feature_names_X)))
print("First 20 features:\n{}".format(feature_names_X[:20]))
print("Features 20010 to 20030:\n{}".format(feature_names_X[20010:20030]))
print("Every 2000th feature:\n{}".format(feature_names_X[::2000]))

Number of features: 66997
First 20 features:
['aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaaa', 'aaaaaaaaa', 'aaaaaaaaaa', 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa', 'aaaaaaaaaaaaaaaaaadwpwaaaaaaaabaaaaaaaaaceaaaaaaaaaqqaaaaaaaabraaaaaaaaageaaaaaaaaacqaaaaaaaacbaaaaaaaaaikaaaaaaaaakqaaaaaaaaczaaaaaaaaakeaaaaaaaaaqqaaaaaaaacxaaaaaaaaalkaaaaaaaaawqaaaaaaaadfaaaaaaaaamkaaaaaaaaazqaaaaaaaadraaaaaaaaanuaaaaaaaaaqaaaaaaaaddaaaaaaaaaoea', 'aaaab', 'aaaabbcccdddddd', 'aaaarghxxx', 'aaab', 'aaabbb', 'aaabbbcccdddeee', 'aaabbcc', 'aaabbzzyy', 'aaabcabccd', 'aaadcabdcaeafd', 'aaadcabdcaeafdaeeeaaaaabecaebeeddecacfffffdbebefcefdbccbbed']
Features 20010 to 20030:
['foobarclass', 'foobarcom', 'foobarfoo', 'foobarinfoneeded', 'foobarinfoneededhereddd', 'foobarmysettings', 'foobarnamedtuplef', 'foobarobject', 'foobarparam', 'foobarpy', 'foobarself', 'foobaseclass', 'foobaz', 'foobo', 'foobyidvalues', 'fooc', 'foochecklist', 'fooclass', 'foocleancsv', 'foocls']
Every 2000th feature:
['aa', 'applicabilit

In [195]:
vect_Y = CountVectorizer(binary=True, max_features=None, token_pattern= "(?u)\\b\\w+\\b").fit(tag_train)
# nous avons laissé l'option binary true car il est inutile d'avoir plus d'une fois le même token
# nous allons limiter le nombre de de features car le F1 Score avec tous les tags pour le modèle KNN est 0.013
# avec 10 features nous arrivons à un F1 Score de 37,8%
# On utilise un token_pattern different pour pouvoir récupérer le tag r
Y_train = vect_Y.transform(text_train).toarray()
print("Y_train:\n{}".format(repr(Y_train)))

Y_train:
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])


In [196]:
feature_names_Y = vect_Y.get_feature_names()
print("Number of features: {}".format(len(feature_names_Y)))
print("First 20 features:\n{}".format(feature_names_Y[:20]))
print("Features 210 to 230:\n{}".format(feature_names_Y[800:820]))
print("Every 2000th feature:\n{}".format(feature_names_Y[::200]))

Number of features: 2966
First 20 features:
['0', '04', '1', '10', '11', '12', '14', '16', '2', '2003', '2008', '2010', '2d', '2to3', '3', '32bit', '3d', '4', '403', '404']
Features 210 to 230:
['espeak', 'ess', 'euclidean', 'eulers', 'eval', 'evaluation', 'event', 'eventlet', 'events', 'excel', 'except', 'exception', 'exceptions', 'exchange', 'exclusion', 'exe', 'exec', 'execfile', 'execl', 'executable']
Every 2000th feature:
['0', 'beamer', 'coin', 'delegates', 'espeak', 'geosphere', 'independent', 'lion', 'multilinestring', 'paste', 'pyopengl', 'rolling', 'skype', 'tar', 'uuid']


In [197]:
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, Y_train)

KNeighborsClassifier()

In [198]:
id_sample = 55
new_post = text_train[id_sample]
new_post_vect = vect_X.transform([new_post])
y_predict = knn_clf.predict(new_post_vect)

tags = np.argsort(y_predict[0,:])[::-1][:10].tolist()
scores = np.sort(y_predict[0,:])[::-1][:10]
print(df_questions.Title_raw[id_sample],'\n')
print(df_questions.Body[id_sample],'\n')
print(text_train[id_sample])
for tag,score in zip(tags,scores) :
    if score > 0  :
        print(feature_names_Y[tag],score)

Consistently create same random numpy array 

wait another developer finish piece code return array shape value either meantime want randomly create array characteristic get start development test thing want randomly create array time test keep change value time rerun process array like way time wonder another way nprandomrandint size 

consistently array wait another developer finish piece code return array shape value either meantime want randomly create array characteristic get start development test thing want randomly create array time test keep change value time rerun process array like way time wonder another way nprandomrandint size
value 1
array 1


### Cross validation du modèle

In [137]:
%%time
# 50 secondes pour calculer 2000 lignes

from sklearn.model_selection import cross_val_predict

y_train_knn_pred = cross_val_predict(knn_clf, X_train, Y_train, cv=3)


CPU times: user 553 ms, sys: 3.92 ms, total: 557 ms
Wall time: 556 ms


In [199]:
from sklearn.metrics import f1_score

y_train_knn_pred_ones = (y_train_knn_pred >0).astype(int)
f1_score(Y_train, y_train_knn_pred_ones, average="macro")

ValueError: Found input variables with inconsistent numbers of samples: [13913, 2000]

## Création d'une pipeline pour choisir le meilleurs parametre pour ce modèle

In [None]:
%%time
from sklearn.model_selection import GridSearchCV
param_grid = {'n_neighbors' : [6, 8,32,64,128,255,500]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='f1_micro')

best_score = 0
best_param = {}
best_feature_number = 0

for i in [2,4,6,8,16,32]:
    vect_Y = CountVectorizer(binary=True, token_pattern= "(?u)\\b\\w+\\b", max_features=i).fit(tag_train)
    Y_train = vect_Y.transform(text_train).toarray()
    grid.fit(X_train, Y_train)
    #if grid.best_score_>best_score:
    best_score = grid.best_score_
    best_param = grid.best_params_
    best_feature_number = i
    print("Best cross-validation score: {:.2f}".format(best_score))
    print("Best parameters: ", best_param)
    print("Best nombre de features: ", best_feature_number)
    print(vect_Y.get_feature_names(),'\n')


Best cross-validation score: 0.78
Best parameters:  {'n_neighbors': 255}
Best nombre de features:  2
['python', 'r'] 

Best cross-validation score: 0.75
Best parameters:  {'n_neighbors': 255}
Best nombre de features:  4
['django', 'ggplot2', 'python', 'r'] 

Best cross-validation score: 0.71
Best parameters:  {'n_neighbors': 64}
Best nombre de features:  6
['django', 'ggplot2', 'list', 'numpy', 'python', 'r'] 



## Random Forest

In [None]:
%%time
from sklearn.ensemble import RandomForestClassifier

param_grid = {'n_estimators' : [100,500,1000],
              }
grid = GridSearchCV(RandomForestClassifier(n_jobs=-1), param_grid, cv=5, scoring='f1_micro')

best_score = 0
best_param = {}
best_feature_number = 0

for i in [2,4,8,16,32]:
    vect_Y = CountVectorizer(binary=True, token_pattern= "(?u)\\b\\w+\\b", max_features=i).fit(tag_train)
    Y_train = vect_Y.transform(text_train).toarray()
    grid.fit(X_train, Y_train)
    #if grid.best_score_>best_score:
    best_score = grid.best_score_
    best_param = grid.best_params_
    best_feature_number = i
    print("Best cross-validation score: {:.2f}".format(best_score))
    print("Best parameters: ", best_param)
    print("Best nombre de features: ", best_feature_number)
    print(vect_Y.get_feature_names(),'\n')

## Multilayer Perceptron classifier

In [None]:
%%time
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import RandomizedSearchCV

param_grid = {'alpha' : [0.001,0.01,0.1],
              'learning_rate_init' : [0.0001,0.001,0.01] }
grid = RandomizedSearchCV(MLPClassifier(max_iter=500), param_grid, cv=5, scoring='f1_micro')

best_score = 0
best_param = {}
best_feature_number = 0

for i in [2,4,8,16,32]:
    vect_Y = CountVectorizer(binary=True, token_pattern= "(?u)\\b\\w+\\b", max_features=i).fit(tag_train)
    Y_train = vect_Y.transform(text_train).toarray()
    grid.fit(X_train, Y_train)
    #if grid.best_score_>best_score:
    best_score = grid.best_score_
    best_param = grid.best_params_
    best_feature_number = i
    print("Best cross-validation score: {:.2f}".format(best_score))
    print("Best parameters: ", best_param)
    print("Best nombre de features: ", best_feature_number)
    print(vect_Y.get_feature_names(),'\n')