In this notebook we analyze the database of Python and R posts in order to train
supervised models.

We start with a KNN model, then a RandomForest and a Multi Layer Perceptron, to classify
posts as Python or R. Each post can carry several tags, which we handle with the
One vs All strategy.

Once the models are trained, we compare their F1 scores and choose the best one to
use in our API.
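As a rough illustration of the One vs All idea with several tags per post, the sketch below fits one binary classifier per tag on a tiny made-up corpus. The posts, the tags and the choice of `LogisticRegression` are assumptions for the example only; the notebook itself passes the multilabel target directly to KNN, RandomForest and MLP.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Toy posts and tags (hypothetical, not taken from the dataset)
posts = [
    "pandas dataframe merge column",
    "django model form validation",
    "ggplot2 scatterplot legend colour",
    "dplyr group summarise dataframe",
]
tags = ["python pandas", "python django", "r ggplot2", "r dplyr"]

X = CountVectorizer().fit_transform(posts)

# One 0/1 column per tag: this is the multilabel target
tag_vect = CountVectorizer(binary=True, token_pattern=r"(?u)\b\w+\b")
Y = tag_vect.fit_transform(tags).toarray()

# One binary classifier is trained per column of Y
ovr = OneVsRestClassifier(LogisticRegression()).fit(X, Y)
pred = ovr.predict(X)
print(pred.shape)
```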

# Importing the libraries

In [15]:
import pandas as pd
import numpy as np
from IPython.display import display
import pickle # to export the trained models



pd.set_option('display.max_colwidth', None)

# Importing the data

In [9]:
dtypes_questions = {'Id':'int32', 'Score': 'int16', 'Title': 'str',
                    'Body': 'str', 'Title_raw': 'str', 'Text': 'str',
                    'Tags': 'str'}

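nrows = 2000 # the outputs shown below come from a 2,000-row run; the 20,000-post results are reported separately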

df_questions = pd.read_csv('df_questions_fullclean.csv',
                           usecols=dtypes_questions.keys(),
                           encoding = "utf-8",
                           dtype=dtypes_questions,
                           nrows=nrows
                          )

print(len(df_questions))
display(df_questions.head(5))


2000


Unnamed: 0,Id,Title,Body,Tags,Score,Title_raw,Text
0,10679131,change order array dimension,reorder dimension n array example three array sale data first dimension represent date dimension store dimension department transform array dimension store department date example hop solution,r multidimensional-array,38,How to change order of array dimensions,change order array dimension reorder dimension n array example three array sale data first dimension represent date dimension store dimension department transform array dimension store department date example hop solution
1,1169714,difference r program language,difference r,r programming-languages s,22,What are the major differences between the R and S programming languages?,difference r program language difference r
2,8096313,bind note r cmd check,check package obtain note bind use function like subset use name element argument example data frame dataframeactruefalsetrueb silly thing like subsetfooa transformfooab work expect code check r cmd however refer element complains binding variable work ok really notes package prefer pas check error warn note also really rework code way cod argument refer variable,r package,43,No visible binding for global variable Note in R CMD check,bind note r cmd check check package obtain note bind use function like subset use name element argument example data frame dataframeactruefalsetrueb silly thing like subsetfooa transformfooab work expect code check r cmd however refer element complains binding variable work ok really notes package prefer pas check error warn note also really rework code way cod argument refer variable
3,526457,form fail validation field,model define class articlemodelsmodel slug modelsslugfieldmaxlength title modelscharfieldmaxlength form class articleformmodelform class meta model article validation fail try exist row requestmethod post form articleformrequestpost poof formsave create entry fine however try field validation longer pass error property nothing drop within gut saw article none already exist look like fails value check want row form validation look like something figure run django croak basemodelformvalidateunique call form initialization,python django,20,Django form fails validation on a unique field,form fail validation field model define class articlemodelsmodel slug modelsslugfieldmaxlength title modelscharfieldmaxlength form class articleformmodelform class meta model article validation fail try exist row requestmethod post form articleformrequestpost poof formsave create entry fine however try field validation longer pass error property nothing drop within gut saw article none already exist look like fails value check want row form validation look like something figure run django croak basemodelformvalidateunique call form initialization
4,10437442,place border around point,would like place border around point scatterplot fill base data use ggplot also would like legend entry border since point basically look plot border around point df dataframeidrunif x yrunif ggplotdf aesxx size bonus would like entry border try df dataframeidrunif x yrunif ggplotdf aesxx colourblack size give understand give education ggplot understand seem map fill color anything help perhaps get fill map right use hack like one hte set figure turn legend,r ggplot2,57,Place a border around points,place border around point would like place border around point scatterplot fill base data use ggplot also would like legend entry border since point basically look plot border around point df dataframeidrunif x yrunif ggplotdf aesxx size bonus would like entry border try df dataframeidrunif x yrunif ggplotdf aesxx colourblack size give understand give education ggplot understand seem map fill color anything help perhaps get fill map right use hack like one hte set figure turn legend


In [10]:
# create the python and r labels
text_train, tag_train = df_questions.Text, df_questions.Tags
print("type of text_train: {}".format(type(text_train)))
print("length of text_train: {}".format(len(text_train)))
print("text_train[6]:\n{}".format(text_train[6]))

type of text_train: <class 'pandas.core.series.Series'>
length of text_train: 2000
text_train[6]:
way list name python module package way list name modules package without use example give package testpkg testpkginitpy testpkgmoduleapy testpkgmodulebpy wonder builtin way something like modulea moduleb approach would iterate module search path order find package directory one could list file filter uniquelynamed file strip extension return list seem like amount work something import mechanism already internally functionality expose anywhere


In [11]:
type(text_train)

pandas.core.series.Series

In [12]:
type(tag_train)

pandas.core.series.Series

In [13]:
# add the files for testing

### Representing the text data as a bag-of-words

In [17]:
from sklearn.feature_extraction.text import CountVectorizer

vect_X = CountVectorizer().fit(text_train)
X_train = vect_X.transform(text_train)

with open('API/models/vect_X.pickle', 'wb') as f: # save the fitted X transformer
    pickle.dump(vect_X, f)

print("X_train:\n{}".format(repr(X_train)))

X_train:
<2000x11756 sparse matrix of type '<class 'numpy.int64'>'
	with 61287 stored elements in Compressed Sparse Row format>
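To make the sparse matrix above more concrete, here is a minimal sketch of the same transformation on two made-up documents. The documents are hypothetical; only `CountVectorizer` and its `vocabulary_` attribute come from scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["data frame data", "list module package"]  # hypothetical toy posts

vect = CountVectorizer().fit(docs)
counts = vect.transform(docs).toarray()

# vocabulary_ maps each token to its column index; sort tokens by that index
vocab = sorted(vect.vocabulary_, key=vect.vocabulary_.get)
print(vocab)   # one column per distinct token
print(counts)  # one row per document, holding token counts
```

Each row counts how often every vocabulary token occurs in one post, which is why the real matrix has as many columns as distinct tokens.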


In [18]:
feature_names_X = vect_X.get_feature_names()
print("Number of features: {}".format(len(feature_names_X)))
print("First 20 features:\n{}".format(feature_names_X[:20]))
print("Features 20010 to 20030:\n{}".format(feature_names_X[20010:20030]))
print("Every 2000th feature:\n{}".format(feature_names_X[::2000]))

Number of features: 11756
First 20 features:
['aa', 'aaa', 'aaaaa', 'aaaarghxxx', 'aaardvark', 'aaaxxx', 'aabbcdefg', 'aabsiddfdfdatatg', 'aacute', 'aardvark', 'aavec', 'aazqmaso', 'ab', 'abbrach', 'abbreche', 'abbreviation', 'abc', 'abca', 'abcabcdefabcd', 'abcdc']
Features 20010 to 20030:
[]
Every 2000th feature:
['aa', 'counter', 'functionlm', 'lstlengthlst', 'promise', 'striph']


In [27]:
vect_Y = CountVectorizer(binary=True, max_features=None, token_pattern= "(?u)\\b\\w+\\b").fit(tag_train)
# binary=True is kept because counting the same token more than once is pointless
# we will limit the number of features because the KNN F1 score with all the tags is 0.013
# with 10 features we reach an F1 score of 37.8%
# a different token_pattern is used so that the single-character tag r is captured
Y_train = vect_Y.transform(tag_train).toarray()

with open('API/models/vect_Y.pickle', 'wb') as f: # save the fitted Y transformer
    pickle.dump(vect_Y, f)

print("Y_train:\n{}".format(repr(Y_train)))

Y_train:
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
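The effect of the custom `token_pattern` can be checked on a couple of made-up tag strings. In this sketch the tag strings are hypothetical, and the third vectorizer (whitespace-delimited tokens) is only a possible alternative, not what the notebook uses.

```python
from sklearn.feature_extraction.text import CountVectorizer

tags = ["r ggplot2", "python unit-testing"]  # hypothetical tag strings

# Default pattern requires at least two word characters, so "r" is dropped
default_vect = CountVectorizer(binary=True).fit(tags)
# Pattern used in the notebook: single-character tokens are kept
custom_vect = CountVectorizer(binary=True, token_pattern=r"(?u)\b\w+\b").fit(tags)
# Possible alternative (assumption): keep hyphenated tags whole
whole_vect = CountVectorizer(binary=True, token_pattern=r"(?u)\S+").fit(tags)

print(sorted(default_vect.vocabulary_))
print(sorted(custom_vect.vocabulary_))
print(sorted(whole_vect.vocabulary_))
```

Note that the notebook's pattern still splits hyphenated tags such as `unit-testing` into `unit` and `testing`, which is consistent with the label lists printed further down.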


In [30]:
feature_names_Y = vect_Y.get_feature_names()
print("Number of features: {}".format(len(feature_names_Y)))
print("First 20 features:\n{}".format(feature_names_Y[:20]))
print("Features 800 to 820:\n{}".format(feature_names_Y[800:820]))
print("Every 200th feature:\n{}".format(feature_names_Y[::200]))

Number of features: 1151
First 20 features:
['2', '2008', '2to3', '3', '3d', '4', '404', '5', '6', '7', '8', 'access', 'accuracy', 'active', 'address', 'admin', 'administration', 'aes', 'agent', 'aggregate']
Features 800 to 820:
['pyrserve', 'python', 'pytz', 'pywin32', 'qt', 'quantmod', 'queryset', 'queue', 'quotes', 'r', 'random', 'range', 'rapydscript', 'raster', 'ratio', 'rawstring', 'rbind', 'rcpp', 'rdata', 'rdbms']
Every 200th feature:
['2', 'css', 'global', 'migration', 'pyrserve', 'tab']


In [34]:
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, Y_train)

with open('API/models/knn_clf.pickle', 'wb') as f: # save the trained KNN model
    pickle.dump(knn_clf, f)

In [79]:
def text_preciction_labels(new_post,vect_X,vect_Y,model,df_questions):
    # note: the printing below relies on the global id_sample set in the calling cell
    feature_names_Y = vect_Y.get_feature_names() # list of tag names
    Y_train = vect_Y.transform(df_questions.Tags) # sparse matrix of tag indicators, one row per post
    new_post_vect = vect_X.transform([new_post]) # vectorize the new post for the model
    y_predict = model.predict(new_post_vect) # prediction of the trained model

    tags = np.argsort(y_predict[0,:])[::-1][:10].tolist()
    scores = np.sort(y_predict[0,:])[::-1][:10]

    print(df_questions.Title_raw[id_sample],'\n')
    print(df_questions.Body[id_sample],'\n')
    print(df_questions.Text[id_sample])
    print('\n','Tags prediction : ', '\n')
    for tag,score in zip(tags,scores) :
        if score > 0  :
            print(feature_names_Y[tag],score)
    print('\n','Tags labels : ','\n')
    y_labels = Y_train[id_sample].toarray()
    tags = np.argsort(y_labels[0,:])[::-1][:10].tolist()
    scores = np.sort(y_labels[0,:])[::-1][:10]
    for tag,score in zip(tags,scores) :
        if score > 0  :
            print(feature_names_Y[tag],score)
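The function above ranks the prediction row with `argsort` and then keeps the entries whose score is above zero. Because the predictions are 0/1 indicators, a shorter way to get the same set of tags is to read off the non-zero positions directly. This is only a sketch on a made-up prediction row and made-up tag names.

```python
import numpy as np

# Hypothetical 0/1 prediction row and the matching tag names
y_predict = np.array([[0, 1, 0, 1, 0]])
tag_names = ["django", "python", "r", "testing", "x"]

# flatnonzero returns the column indices whose value is not zero
predicted = [tag_names[i] for i in np.flatnonzero(y_predict[0])]
print(predicted)
```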

In [81]:
id_sample = 50
new_post = text_train[id_sample]
text_preciction_labels(new_post,vect_X,vect_Y,knn_clf,df_questions)


Code Coverage and Unit Testing of Python Code 

already visit python unittesting framework look unit test framework also coverage respect unit test far come across coveragepy better option interest option integrate cpython unit test code code coverage python code studio plugins something ironpython studio do achieve look suggestion 

code coverage unit test python code already visit python unittesting framework look unit test framework also coverage respect unit test far come across coveragepy better option interest option integrate cpython unit test code code coverage python code studio plugins something ironpython studio do achieve look suggestion

 Tags prediction :  

python 1

 Tags labels :  

code 1
unit 1
studio 1
visual 1
coverage 1
testing 1
python 1
2008 1


### Cross-validation of the model

In [82]:
%%time
# 50 seconds to compute 2,000 rows
# 15min 57s for 20,000 rows

from sklearn.model_selection import cross_val_predict

y_train_knn_pred = cross_val_predict(knn_clf, X_train, Y_train, cv=3)


CPU times: user 51.9 s, sys: 90.7 ms, total: 52 s
Wall time: 51.9 s


In [83]:
from sklearn.metrics import f1_score

y_train_knn_pred_ones = (y_train_knn_pred >0).astype(int)
f1_score(Y_train, y_train_knn_pred_ones, average="macro")


0.004908140130484893
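A macro-averaged F1 gives every tag the same weight, so many rare tags that are never predicted pull the score towards zero, whereas the micro average (used as `scoring='f1_micro'` in the searches that follow) pools all decisions together. The sketch below shows the difference on made-up indicator arrays; the data are hypothetical.

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical targets: column 0 is a frequent tag, columns 1 and 2 are rare tags
y_true = np.array([[1, 0, 0], [1, 0, 0], [1, 1, 0], [1, 0, 1]])
# The classifier only ever predicts the frequent tag
y_pred = np.array([[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0]])

# macro: mean of the per-tag F1 scores (1, 0 and 0 here)
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
# micro: F1 computed from the pooled counts (4 true positives, 2 false negatives)
micro = f1_score(y_true, y_pred, average="micro", zero_division=0)
print(macro, micro)
```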

## Building a pipeline to choose the best parameter for this model

In [84]:
%%time
# 6min 42s for 20,000 posts

from sklearn.model_selection import GridSearchCV
param_grid = {'n_neighbors' : [6, 8,32,64,128,255,500]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='f1_micro')

best_score = 0
best_param = {}
best_feature_number = 0

for i in [2,4,6,8,16,32]:
    vect_Y = CountVectorizer(binary=True, token_pattern= "(?u)\\b\\w+\\b", max_features=i).fit(tag_train)
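    # build the targets from the tags; transforming text_train would only mark which tag words occur in the post body
    Y_train = vect_Y.transform(tag_train).toarray()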
    grid.fit(X_train, Y_train)
    #if grid.best_score_>best_score:
    best_score = grid.best_score_
    best_param = grid.best_params_
    best_feature_number = i
    print("Best cross-validation score: {:.2f}".format(best_score))
    print("Best parameters: ", best_param)
    print("Best nombre de features: ", best_feature_number)
    print(vect_Y.get_feature_names(),'\n')

Best cross-validation score: 0.68
Best parameters:  {'n_neighbors': 128}
Best nombre de features:  2
['python', 'r'] 

Best cross-validation score: 0.65
Best parameters:  {'n_neighbors': 128}
Best nombre de features:  4
['django', 'ggplot2', 'python', 'r'] 

Best cross-validation score: 0.62
Best parameters:  {'n_neighbors': 128}
Best nombre de features:  6
['dataframe', 'django', 'faq', 'ggplot2', 'python', 'r'] 

Best cross-validation score: 0.55
Best parameters:  {'n_neighbors': 8}
Best nombre de features:  8
['dataframe', 'django', 'faq', 'ggplot2', 'list', 'python', 'r', 'string'] 

Best cross-validation score: 0.48
Best parameters:  {'n_neighbors': 8}
Best nombre de features:  16
['c', 'data', 'dataframe', 'django', 'faq', 'file', 'ggplot2', 'list', 'matrix', 'plot', 'python', 'r', 'regex', 'statistics', 'string', 'windows'] 

Best cross-validation score: 0.41
Best parameters:  {'n_neighbors': 8}
Best nombre de features:  32
['c', 'class', 'data', 'dataframe', 'date', 'datetime',

In [90]:
KNNmodel = grid.best_estimator_

id_sample = 1002
new_post = text_train[id_sample]
text_preciction_labels(new_post,vect_X,vect_Y,KNNmodel,df_questions)

R define dimensions of empty data frame 

try collect data subset data set need create data frame collect result problem know create data frame define number column without actually data put c would like create df w column id max min collectid s subsetdf dfid collectmax maxssvalue collectmin minssvalue feel ask question almost feel like ask find would greatly appreciate help 

r define dimension empty data frame try collect data subset data set need create data frame collect result problem know create data frame define number column without actually data put c would like create df w column id max min collectid s subsetdf dfid collectmax maxssvalue collectmin minssvalue feel ask question almost feel like ask find would greatly appreciate help

 Tags prediction :  

data 1

 Tags labels :  

r 1


### Results for 20,000 posts
Cross-validation with cv=5:
GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='f1_micro')

Best cross-validation score: 0.78
Best parameters:  {'n_neighbors': 255}
Best nombre de features:  2
['python', 'r']

Best cross-validation score: 0.75
Best parameters:  {'n_neighbors': 255}
Best nombre de features:  4
['django', 'ggplot2', 'python', 'r']

Best cross-validation score: 0.71
Best parameters:  {'n_neighbors': 64}
Best nombre de features:  6
['django', 'ggplot2', 'list', 'numpy', 'python', 'r']

Best cross-validation score: 0.66
Best parameters:  {'n_neighbors': 32}
Best nombre de features:  8
['dataframe', 'django', 'ggplot2', 'list', 'numpy', 'python', 'r', 'string']

Best cross-validation score: 0.57
Best parameters:  {'n_neighbors': 32}
Best nombre de features:  16
['2', '3', 'data', 'dataframe', 'dictionary', 'django', 'ggplot2', 'list', 'matplotlib', 'numpy', 'pandas', 'python', 'r', 'regex', 'string', 'x']

Best cross-validation score: 0.50
Best parameters:  {'n_neighbors': 8}
Best nombre de features:  32
['2', '3', '7', 'class', 'data', 'dataframe', 'datetime', 'dictionary', 'django', 'dplyr', 'faq', 'file', 'flask', 'function', 'ggplot2', 'import', 'list', 'matplotlib', 'models', 'numpy', 'pandas', 'performance', 'plot', 'python', 'r', 'regex', 'scipy', 'sqlalchemy', 'string', 'table', 'testing', 'x']

CPU times: user 6min 17s, sys: 28.4 s, total: 6min 45s
Wall time: 6min 45s
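The loop above leaves the `if grid.best_score_>best_score:` test commented out, so `best_score`, `best_param` and `best_feature_number` always hold the values of the last iteration. A minimal sketch of the gated version follows. The synthetic data from `make_multilabel_classification`, the small parameter grid and the column slicing (standing in for `max_features`) are assumptions chosen so the example runs quickly.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the vectorized posts and tags
X, Y = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=8, random_state=0
)

grid = GridSearchCV(
    KNeighborsClassifier(), {"n_neighbors": [3, 5, 9]}, cv=3, scoring="f1_micro"
)

best_score, best_params, best_n_labels = 0.0, {}, 0
for n_labels in [2, 4, 8]:
    # keeping the first n_labels columns stands in for max_features=n_labels
    grid.fit(X, Y[:, :n_labels])
    # only overwrite the stored result when this configuration scores higher
    if grid.best_score_ > best_score:
        best_score = grid.best_score_
        best_params = grid.best_params_
        best_n_labels = n_labels

print(best_n_labels, best_params, round(best_score, 2))
```

Scores obtained with different numbers of labels describe different tasks, so the comparison is indicative rather than strict.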

## Random Forest

In [91]:
%%time
# 4h 32min 57s for 20,000 posts

from sklearn.ensemble import RandomForestClassifier

param_grid = {'n_estimators' : [100,500,1000],
              }
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring='f1_micro')

best_score = 0
best_param = {}
best_feature_number = 0

for i in [2,4,8,16,32]:
    vect_Y = CountVectorizer(binary=True, token_pattern= "(?u)\\b\\w+\\b", max_features=i).fit(tag_train)
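    # build the targets from the tags; transforming text_train would only mark which tag words occur in the post body
    Y_train = vect_Y.transform(tag_train).toarray()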
    grid.fit(X_train, Y_train)
    #if grid.best_score_>best_score:
    best_score = grid.best_score_
    best_param = grid.best_params_
    best_feature_number = i
    print("Best cross-validation score: {:.2f}".format(best_score))
    print("Best parameters: ", best_param)
    print("Best nombre de features: ", best_feature_number)
    print(vect_Y.get_feature_names(),'\n')

Best cross-validation score: 0.76
Best parameters:  {'n_estimators': 500}
Best nombre de features:  2
['python', 'r'] 

Best cross-validation score: 0.75
Best parameters:  {'n_estimators': 1000}
Best nombre de features:  4
['django', 'ggplot2', 'python', 'r'] 

Best cross-validation score: 0.69
Best parameters:  {'n_estimators': 1000}
Best nombre de features:  8
['dataframe', 'django', 'faq', 'ggplot2', 'list', 'python', 'r', 'string'] 

Best cross-validation score: 0.65
Best parameters:  {'n_estimators': 1000}
Best nombre de features:  16
['c', 'data', 'dataframe', 'django', 'faq', 'file', 'ggplot2', 'list', 'matrix', 'plot', 'python', 'r', 'regex', 'statistics', 'string', 'windows'] 

Best cross-validation score: 0.55
Best parameters:  {'n_estimators': 1000}
Best nombre de features:  32
['c', 'class', 'data', 'dataframe', 'date', 'datetime', 'dictionary', 'django', 'exception', 'faq', 'file', 'function', 'ggplot2', 'list', 'matrix', 'memory', 'models', 'oop', 'performance', 'plot', '

In [92]:
RandomForestmodel = grid.best_estimator_

id_sample = 1002
new_post = text_train[id_sample]
text_preciction_labels(new_post,vect_X,vect_Y,RandomForestmodel,df_questions)

R define dimensions of empty data frame 

try collect data subset data set need create data frame collect result problem know create data frame define number column without actually data put c would like create df w column id max min collectid s subsetdf dfid collectmax maxssvalue collectmin minssvalue feel ask question almost feel like ask find would greatly appreciate help 

r define dimension empty data frame try collect data subset data set need create data frame collect result problem know create data frame define number column without actually data put c would like create df w column id max min collectid s subsetdf dfid collectmax maxssvalue collectmin minssvalue feel ask question almost feel like ask find would greatly appreciate help

 Tags prediction :  

c 1
data 1
r 1

 Tags labels :  

r 1


### Results for 20,000 posts
Cross-validation with cv=5:
GridSearchCV(RandomForestClassifier(), param_grid, cv=5, scoring='f1_micro')

Best cross-validation score: 0.83
Best parameters:  {'n_estimators': 1000}
Best nombre de features:  2
['python', 'r']

Best cross-validation score: 0.82
Best parameters:  {'n_estimators': 500}
Best nombre de features:  4
['django', 'ggplot2', 'python', 'r']

Best cross-validation score: 0.75
Best parameters:  {'n_estimators': 1000}
Best nombre de features:  8
['dataframe', 'django', 'ggplot2', 'list', 'numpy', 'python', 'r', 'string']

Best cross-validation score: 0.67
Best parameters:  {'n_estimators': 100}
Best nombre de features:  16
['2', '3', 'data', 'dataframe', 'dictionary', 'django', 'ggplot2', 'list', 'matplotlib', 'numpy', 'pandas', 'python', 'r', 'regex', 'string', 'x']

Best cross-validation score: 0.58
Best parameters:  {'n_estimators': 100}
Best nombre de features:  32
['2', '3', '7', 'class', 'data', 'dataframe', 'datetime', 'dictionary', 'django', 'dplyr', 'faq', 'file', 'flask', 'function', 'ggplot2', 'import', 'list', 'matplotlib', 'models', 'numpy', 'pandas', 'performance', 'plot', 'python', 'r', 'regex', 'scipy', 'sqlalchemy', 'string', 'table', 'testing', 'x']

CPU times: user 4h 32min 40s, sys: 16 s, total: 4h 32min 56s
Wall time: 4h 32min 57s

## Multilayer Perceptron classifier

In [None]:
%%time
# 2d 7h 36min 50s for 20,000 posts

from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import RandomizedSearchCV

param_grid = {'alpha' : [0.001,0.01,0.1],
              'learning_rate_init' : [0.0001,0.001,0.01] }
grid = RandomizedSearchCV(MLPClassifier(max_iter=500), param_grid, cv=5, scoring='f1_micro')

best_score = 0
best_param = {}
best_feature_number = 0

for i in [2,4,8,16,32]:
    vect_Y = CountVectorizer(binary=True, token_pattern= "(?u)\\b\\w+\\b", max_features=i).fit(tag_train)
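    # build the targets from the tags; transforming text_train would only mark which tag words occur in the post body
    Y_train = vect_Y.transform(tag_train).toarray()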
    grid.fit(X_train, Y_train)
    #if grid.best_score_>best_score:
    best_score = grid.best_score_
    best_param = grid.best_params_
    best_feature_number = i
    print("Best cross-validation score: {:.2f}".format(best_score))
    print("Best parameters: ", best_param)
    print("Best nombre de features: ", best_feature_number)
    print(vect_Y.get_feature_names(),'\n')

In [114]:
%%time
from sklearn.neural_network import MLPClassifier
MLP_clf = MLPClassifier(max_iter=1500, alpha=0.1, learning_rate_init=0.0001)
# the model has not yet converged at 1000 iterations
vect_Y_32 = CountVectorizer(binary=True, token_pattern= "(?u)\\b\\w+\\b", max_features=32).fit(tag_train)
Y_train = vect_Y_32.transform(tag_train).toarray()
MLP_clf.fit(X_train, Y_train)

CPU times: user 16min 38s, sys: 9.6 s, total: 16min 47s
Wall time: 2min 48s


MLPClassifier(alpha=0.1, learning_rate_init=0.0001, max_iter=1500)

In [115]:
with open('API/models/vect_Y_32.pickle', 'wb') as f: # save the 32-tag Y transformer
    pickle.dump(vect_Y_32, f)
with open('API/models/MLP_clf.pickle', 'wb') as f: # save the trained MLP model
    pickle.dump(MLP_clf, f)

In [116]:
id_sample = 1002
new_post = text_train[id_sample]
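text_preciction_labels(new_post,vect_X,vect_Y_32,MLP_clf,df_questions) # vect_Y_32 is the vectorizer MLP_clf was trained with

# the F1 score below is measured on the training data itself, so it shows the fit, not the generalization
y_train_MLP_pred = MLP_clf.predict(X_train)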
from sklearn.metrics import f1_score

print('F1 score : ')
y_train_MLP_pred_ones = (y_train_MLP_pred >0).astype(int)
f1_score(Y_train, y_train_MLP_pred_ones, average='weighted')

R define dimensions of empty data frame 

try collect data subset data set need create data frame collect result problem know create data frame define number column without actually data put c would like create df w column id max min collectid s subsetdf dfid collectmax maxssvalue collectmin minssvalue feel ask question almost feel like ask find would greatly appreciate help 

r define dimension empty data frame try collect data subset data set need create data frame collect result problem know create data frame define number column without actually data put c would like create df w column id max min collectid s subsetdf dfid collectmax maxssvalue collectmin minssvalue feel ask question almost feel like ask find would greatly appreciate help

 Tags prediction :  

r 1

 Tags labels :  

r 1
F1 score : 


1.0
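An F1 score of 1.0 computed on the rows the model was fitted on mostly indicates that the network has memorized them. A held-out split gives a more honest figure. The sketch below uses synthetic data from `make_multilabel_classification` and default MLP settings, both assumptions made to keep the example short; it is not the notebook's dataset.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the vectorized posts and tags
X, Y = make_multilabel_classification(
    n_samples=300, n_features=20, n_classes=4, random_state=0
)
# Keep 25% of the rows aside; the model never sees them during fitting
X_fit, X_held, Y_fit, Y_held = train_test_split(
    X, Y, test_size=0.25, random_state=0
)

mlp = MLPClassifier(max_iter=500, random_state=0).fit(X_fit, Y_fit)

f1_fit = f1_score(Y_fit, mlp.predict(X_fit), average="weighted", zero_division=0)
f1_held = f1_score(Y_held, mlp.predict(X_held), average="weighted", zero_division=0)
print("fitted rows:", f1_fit, "held-out rows:", f1_held)
```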

### Results for 20,000 posts
Cross-validation with cv=5:
RandomizedSearchCV(MLPClassifier(max_iter=500), param_grid, cv=5, scoring='f1_micro')

Best cross-validation score: 0.87
Best parameters:  {'learning_rate_init': 0.001, 'alpha': 0.1}
Best nombre de features:  2
['python', 'r']

Best cross-validation score: 0.88
Best parameters:  {'learning_rate_init': 0.0001, 'alpha': 0.1}
Best nombre de features:  4
['django', 'ggplot2', 'python', 'r']

Best cross-validation score: 0.92
Best parameters:  {'learning_rate_init': 0.0001, 'alpha': 0.1}
Best nombre de features:  8
['dataframe', 'django', 'ggplot2', 'list', 'numpy', 'python', 'r', 'string']

Best cross-validation score: 0.87
Best parameters:  {'learning_rate_init': 0.0001, 'alpha': 0.1}
Best nombre de features:  16
['2', '3', 'data', 'dataframe', 'dictionary', 'django', 'ggplot2', 'list', 'matplotlib', 'numpy', 'pandas', 'python', 'r', 'regex', 'string', 'x']

Best cross-validation score: 0.91
Best parameters:  {'learning_rate_init': 0.0001, 'alpha': 0.1}
Best nombre de features:  32
['2', '3', '7', 'class', 'data', 'dataframe', 'datetime', 'dictionary', 'django', 'dplyr', 'faq', 'file', 'flask', 'function', 'ggplot2', 'import', 'list', 'matplotlib', 'models', 'numpy', 'pandas', 'performance', 'plot', 'python', 'r', 'regex', 'scipy', 'sqlalchemy', 'string', 'table', 'testing', 'x']

CPU times: user 13d 12h 11min 43s, sys: 2h 41min 2s, total: 13d 14h 52min 45s
Wall time: 2d 7h 36min 50s

In [None]:

import pickle

clf=grid.best_estimator_

with open('models/final_prediction.pickle', 'wb') as f: # save the final model
    pickle.dump(clf, f)
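On the API side, the pickled vectorizers and model have to be loaded back and chained in the same order as here. The sketch below shows one possible round trip on a tiny made-up corpus, using a temporary directory instead of the notebook's paths. The helper `predict_tags` and all file names are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical miniature training set
posts = ["pandas dataframe column", "django form model",
         "ggplot2 plot legend", "dplyr filter rows"]
tags = ["python", "python", "r", "r"]

vect_x = CountVectorizer().fit(posts)
vect_y = CountVectorizer(binary=True, token_pattern=r"(?u)\b\w+\b").fit(tags)
model = KNeighborsClassifier(n_neighbors=1).fit(
    vect_x.transform(posts), vect_y.transform(tags).toarray()
)

with tempfile.TemporaryDirectory() as tmp:
    # Save the three fitted objects, closing each file properly
    for name, obj in [("vect_x", vect_x), ("vect_y", vect_y), ("model", model)]:
        with open(Path(tmp) / f"{name}.pickle", "wb") as fh:
            pickle.dump(obj, fh)
    # Load them back, as an API process would do at start-up
    loaded = {}
    for name in ("vect_x", "vect_y", "model"):
        with open(Path(tmp) / f"{name}.pickle", "rb") as fh:
            loaded[name] = pickle.load(fh)

def predict_tags(text):
    """Return the tag names predicted for one preprocessed post."""
    row = loaded["model"].predict(loaded["vect_x"].transform([text]))[0]
    vocabulary = loaded["vect_y"].vocabulary_
    names = sorted(vocabulary, key=vocabulary.get)  # tag names in column order
    return [name for name, flag in zip(names, row) if flag]

print(predict_tags("django form validation"))
```

Pickles should only be loaded from trusted files, and with the same scikit-learn version that produced them.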

In [None]:
id_sample = 100
new_post = text_train[id_sample]
new_post_vect = vect_X.transform([new_post])
y_predict = clf.predict(new_post_vect)

tags = np.argsort(y_predict[0,:])[::-1][:10].tolist()
scores = np.sort(y_predict[0,:])[::-1][:10]
print(df_questions.Title_raw[id_sample],'\n')
print(df_questions.Body[id_sample],'\n')
print(text_train[id_sample])
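feature_names_Y = vect_Y.get_feature_names() # refresh the tag names so they match the labels clf was trained on
for tag,score in zip(tags,scores) :
    if score > 0  :
        print(feature_names_Y[tag],score)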

In [None]:
y_labels = Y_train[id_sample]
tags = np.argsort(y_labels)[::-1][:10].tolist()
scores = np.sort(y_labels)[::-1][:10]
for tag,score in zip(tags,scores) :
    if score > 0  :
        print(feature_names_Y[tag],score)

In [None]:
scores