Dans ce notebook nous allons analyser la base de donées des post Python et R pour un entrainement de modèles supervisés.

Nous allons commencer avec une regresion logistique pour une clasification entre les posts Python et R, nous allons par la suite ajouter les autres tags disponibles et faire une classifaction One vs All 

Une fois le modèle entrainé nous allons le comparer avec un modèle RNN

# Importation des bibliothèques

In [80]:
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
import re
from wordcloud import WordCloud , STOPWORDS
import matplotlib.pyplot as plt
from transformers import (GPT2Config,GPT2LMHeadModel,GPT2Tokenizer)
from tqdm.notebook import tqdm
import torch

pd.set_option('display.max_colwidth', None)

# Importation des données 

In [81]:
dtypes_questions = {'Id':'int32', 'Score': 'int16', 'Title': 'str',
                    'Body': 'str', 'Title_raw': 'str', 'Text': 'str',
                    'Tags': 'str'}

nrows = 2000

df_questions = pd.read_csv('df_questions_fullclean.csv',
                           usecols=dtypes_questions.keys(),
                           encoding = "utf-8",
                           dtype=dtypes_questions,
                           nrows=nrows
                          )

print(len(df_questions))
display(df_questions.head(5))


2000


Unnamed: 0,Id,Title,Body,Tags,Score,Title_raw,Text
0,6590630,way subset v mysql,r frequently find function require subsetting datasets million row apply function number observation get time consume implement sometimes used package provide much speed subsetting use data frame recently start experiment package like rmysql push table mysql use package run query return result find performance improvement datasets million seem load data set key make subsetting datasets million appear send query move wonder anyone insight technique return subsetting aggregation query whether size data understand set key somewhat create index intuition beyond,mysql r rmysql data.table,20,Fastest way to subset - data.table vs. MySQL,way subset v mysql r frequently find function require subsetting datasets million row apply function number observation get time consume implement sometimes used package provide much speed subsetting use data frame recently start experiment package like rmysql push table mysql use package run query return result find performance improvement datasets million seem load data set key make subsetting datasets million appear send query move wonder anyone insight technique return subsetting aggregation query whether size data understand set key somewhat create index intuition beyond
1,9620155,csv file contain apostrophe r,difficulty get r read csv file contain apostrophe columns contain text attend customer need sheriff deputy file open correctly data appear cell row miss data ask r read file happens data readtabledatafilecsv sep error scanfile sep dec quote skip nlines nastrings line element line first line contains apostrophe go csv file manually remove apostrophe read file correctly however would rather keep apostrophe r would grateful help,r csv punctuation,36,How to read a .csv file containing apostrophes into R?,csv file contain apostrophe r difficulty get r read csv file contain apostrophe columns contain text attend customer need sheriff deputy file open correctly data appear cell row miss data ask r read file happens data readtabledatafilecsv sep error scanfile sep dec quote skip nlines nastrings line element line first line contains apostrophe go csv file manually remove apostrophe read file correctly however would rather keep apostrophe r would grateful help
2,204017,program python ossystem fail space path,script need program reason fail follow script import o ossystemctempa b cnotepadexe rawinput fail follow error ctempa recognize command program batch file escape program quote import ossystemctempa b cnotepadexe rawinput work however parameter stop work import o ossystemctempa b cnotepadexe ctesttxt rawinput way program wait need read output program job exit need wait also note move program nonspaced path option either work either os ossystemctempa b cnotepadexe rawinput note swap quote without parameter notepad fail message name volume label syntax incorrect,python shellexecute,289,How do I execute a program from Python? os.system fails due to spaces in path,program python ossystem fail space path script need program reason fail follow script import o ossystemctempa b cnotepadexe rawinput fail follow error ctempa recognize command program batch file escape program quote import ossystemctempa b cnotepadexe rawinput work however parameter stop work import o ossystemctempa b cnotepadexe ctesttxt rawinput way program wait need read output program job exit need wait also note move program nonspaced path option either work either os ossystemctempa b cnotepadexe rawinput note swap quote without parameter notepad fail message name volume label syntax incorrect
3,6424856,r function return factor,search foo fail try find function return factor integer package factorize function gmp confdesign however function return factor would like function return factor obviously search make since r construct call factor put lot noise search,r factorization,36,R Function for returning ALL factors,r function return factor search foo fail try find function return factor integer package factorize function gmp confdesign however function return factor would like function return factor obviously search make since r construct call factor put lot noise search
4,117800,get autofields start number,django app would like get number seem way idea,python django autofield,22,How to get Django AutoFields to start at a higher number,get autofields start number django app would like get number seem way idea


In [82]:
# création des labels python et r
text_train, tag_train = df_questions.Text, df_questions.Tags
print("type of text_train: {}".format(type(text_train)))
print("length of text_train: {}".format(len(text_train)))
print("text_train[6]:\n{}".format(text_train[6]))

type of text_train: <class 'pandas.core.series.Series'>
length of text_train: 2000
text_train[6]:
function r function write quite package share throw mine destring factor string ischaracterx else asnumericlevelsxx else x else convert pad functionxmxnullfill pad var string specify size lx mxcalc maxlxnarmtrue isnullmx mxmxcalc stopnumber maxchar else mxcalc px mxlx pastesapplypxfunctionx pasterepfillxcollapsexsep eval functionevaltextenvirsysframe evaluate string r code envirenvir spacetabs mareks version trimfunctions gsubspacespaces


In [83]:
type(text_train)

pandas.core.series.Series

In [84]:
type(tag_train)

pandas.core.series.Series

In [85]:
# ajouter les fichiers pour test

### Répresenter les données textuelles comme un bag-of-words

In [164]:
from sklearn.feature_extraction.text import CountVectorizer

vect_X = CountVectorizer().fit(text_train)
X_train = vect_X.transform(text_train)
print("X_train:\n{}".format(repr(X_train)))

X_train:
<2000x11756 sparse matrix of type '<class 'numpy.int64'>'
	with 61287 stored elements in Compressed Sparse Row format>


In [87]:
feature_names_X = vect_X.get_feature_names()
print("Number of features: {}".format(len(feature_names_X)))
print("First 20 features:\n{}".format(feature_names_X[:20]))
print("Features 20010 to 20030:\n{}".format(feature_names_X[20010:20030]))
print("Every 2000th feature:\n{}".format(feature_names_X[::2000]))

Number of features: 11756
First 20 features:
['aa', 'aaa', 'aaaaa', 'aaaarghxxx', 'aaardvark', 'aaaxxx', 'aabbcdefg', 'aabsiddfdfdatatg', 'aacute', 'aardvark', 'aavec', 'aazqmaso', 'ab', 'abbrach', 'abbreche', 'abbreviation', 'abc', 'abca', 'abcabcdefabcd', 'abcdc']
Features 20010 to 20030:
[]
Every 2000th feature:
['aa', 'counter', 'functionlm', 'lstlengthlst', 'promise', 'striph']


In [165]:
vect_Y = CountVectorizer(binary=True, max_features=None, token_pattern= "(?u)\\b\\w+\\b").fit(tag_train)
# nous avons laissé l'option binary true car il est inutile d'avoir plus d'une fois le même token
# nous allons limiter le nombre de de features car le F1 Score avec tous les tags pour le modèle KNN est 0.013
# avec 10 features nous arrivons à un F1 Score de 37,8%
# On utilise un token_pattern different pour pouvoir récupérer le tag r
Y_train = vect_Y.transform(text_train).toarray()
print("Y_train:\n{}".format(repr(Y_train)))

Y_train:
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])


In [172]:
feature_names_Y = vect_Y.get_feature_names()
print("Number of features: {}".format(len(feature_names_Y)))
print("First 20 features:\n{}".format(feature_names_Y[:20]))
print("Features 210 to 230:\n{}".format(feature_names_Y[800:820]))
print("Every 2000th feature:\n{}".format(feature_names_Y[::200]))

Number of features: 1151
First 20 features:
['2', '2008', '2to3', '3', '3d', '4', '404', '5', '6', '7', '8', 'access', 'accuracy', 'active', 'address', 'admin', 'administration', 'aes', 'agent', 'aggregate']
Features 210 to 230:
['pyrserve', 'python', 'pytz', 'pywin32', 'qt', 'quantmod', 'queryset', 'queue', 'quotes', 'r', 'random', 'range', 'rapydscript', 'raster', 'ratio', 'rawstring', 'rbind', 'rcpp', 'rdata', 'rdbms']
Every 2000th feature:
['2', 'css', 'global', 'migration', 'pyrserve', 'tab']


In [135]:
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, Y_train)

KNeighborsClassifier()

In [136]:
id_sample = 55
new_post = text_train[id_sample]
new_post_vect = vect_X.transform([new_post])
y_predict = knn_clf.predict(new_post_vect)

tags = np.argsort(y_predict[0,:])[::-1][:10].tolist()
scores = np.sort(y_predict[0,:])[::-1][:10]
print(df_questions.Title_raw[id_sample],'\n')
print(df_questions.Body[id_sample],'\n')
print(text_train[id_sample])
for tag,score in zip(tags,scores) :
    if score > 0  :
        print(feature_names_Y[tag],score)


dataframe k row columns contain billing data append data alltime rbindalltimeall unfortunately generate warn warn message factortmp ri value cna na na na na na na factor level nas generate patient whose name dataframe therefore would know level give similarly names refer doctor column solution 

factor level append record string value dataframe warn result na dataframe k row columns contain billing data append data alltime rbindalltimeall unfortunately generate warn warn message factortmp ri value cna na na na na na na factor level nas generate patient whose name dataframe therefore would know level give similarly names refer doctor column solution
dataframe 1


### Cross validation du modèle

In [137]:
%%time
# 50 secondes pour calculer 2000 lignes

from sklearn.model_selection import cross_val_predict

y_train_knn_pred = cross_val_predict(knn_clf, X_train, Y_train, cv=3)


CPU times: user 553 ms, sys: 3.92 ms, total: 557 ms
Wall time: 556 ms


In [138]:
from sklearn.metrics import f1_score

y_train_knn_pred_ones = (y_train_knn_pred >0).astype(int)
f1_score(Y_train, y_train_knn_pred_ones, average="macro")

  _warn_prf(


0.3780668019908112

## Création d'une pipeline pour choisir le meilleurs parametre pour ce modèle

In [174]:
from sklearn.model_selection import GridSearchCV
param_grid = {'n_neighbors' : [6, 8,32,64,128]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)

best_score = 0
best_param = {}
best_feature_number = 0

for i in [2,4,6,8,16,32]:
    vect_Y = CountVectorizer(binary=True, token_pattern= "(?u)\\b\\w+\\b", max_features=i).fit(tag_train)
    Y_train = vect_Y.transform(text_train).toarray()
    grid.fit(X_train, Y_train)
    #if grid.best_score_>best_score:
    best_score = grid.best_score_
    best_param = grid.best_params_
    best_feature_number = i
    print("Best cross-validation score: {:.2f}".format(best_score))
    print("Best parameters: ", best_param)
    print("Best nombre de features: ", best_feature_number)
    print(vect_Y.get_feature_names(),'\n')


Best cross-validation score: 0.67
Best parameters:  {'n_neighbors': 128}
Best nombre de features:  2
['python', 'r'] 

Best cross-validation score: 0.62
Best parameters:  {'n_neighbors': 128}
Best nombre de features:  4
['django', 'ggplot2', 'python', 'r'] 

Best cross-validation score: 0.59
Best parameters:  {'n_neighbors': 128}
Best nombre de features:  6
['dataframe', 'django', 'faq', 'ggplot2', 'python', 'r'] 

Best cross-validation score: 0.45
Best parameters:  {'n_neighbors': 8}
Best nombre de features:  8
['dataframe', 'django', 'faq', 'ggplot2', 'list', 'python', 'r', 'string'] 

Best cross-validation score: 0.28
Best parameters:  {'n_neighbors': 8}
Best nombre de features:  16
['c', 'data', 'dataframe', 'django', 'faq', 'file', 'ggplot2', 'list', 'matrix', 'plot', 'python', 'r', 'regex', 'statistics', 'string', 'windows'] 

Best cross-validation score: 0.16
Best parameters:  {'n_neighbors': 8}
Best nombre de features:  32
['c', 'class', 'data', 'dataframe', 'date', 'datetime',

In [161]:
tag_train

0        mysql r rmysql data.table
1                r csv punctuation
2              python shellexecute
3                  r factorization
4          python django autofield
                   ...            
1995    python while-loop do-while
1996        r installation package
1997     r graphics ggplot2 insets
1998            r string substring
1999              list r dataframe
Name: Tags, Length: 2000, dtype: object