В данном задании вам предстоит классифицировать тексты вакансии по классам. Вам даны два файла:

train.csv - тексты и классы test.csv - тексты по которым нужно сделать предсказания

В обоих файлах данные по файлу хранятся в одной строке. В файле с текстами в начале строки находится идентификатор текста. Разделитель - ; Ваша задача создать модель, которая будет определять класс вакансии по тексту вакансии.
Метрика - ROC-AUC

In [241]:
import pandas as pd
import numpy as np
from tqdm import tqdm_notebook as tqdm
import pickle
import re

from langdetect import detect

from bs4 import BeautifulSoup
from gensim.models.word2vec import Word2Vec
from gensim.models import KeyedVectors

import pymystem3
from pymystem3 import Mystem

from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold,  train_test_split
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

import warnings

pd.options.display.max_rows = 200
warnings.filterwarnings('ignore')

##    Data loading

In [196]:
df_train  = pd.read_csv('train.csv', sep=';')
df_test  = pd.read_csv('test.csv', sep=';')

In [197]:
df_train['sample'] = 'train'
df_test['sample'] = 'test'
df = df_test.append(df_train).reset_index(drop=True)
print(' Размер тренировочной выборки:', df_train.shape, '\n',
    'Размер тестовой выборки:', df_test.shape, '\n',
    'Размер общей выборки:', df.shape, '\n')

 Размер тренировочной выборки: (31063, 4) 
 Размер тестовой выборки: (31064, 3) 
 Размер общей выборки: (62127, 4) 



In [27]:
df.head(5)

Unnamed: 0,id,text,sample,target
0,31063,<p><strong>В крупную компанию по организации и...,test,
1,31064,<p><strong>Обязанности:</strong></p> <ul> <li>...,test,
2,31065,<p> </p> <p><strong>Обязанности:</strong></p> ...,test,
3,31066,<p><strong>Обязанности:</strong></p> <ul> <li>...,test,
4,31067,<p><strong>Вакансия СРОЧНАЯ!</strong></p> <p><...,test,


## EDA (exploratory data analysis)

    В текстах вакансий присутствует html разметка, удалим её при помощи BeautifulSoup:

In [198]:
df['text'] = df['text'].map(lambda x: BeautifulSoup(x).text)

In [199]:
df.head(5)

Unnamed: 0,id,text,sample,target
0,31063,В крупную компанию по организации и приготовле...,test,
1,31064,Обязанности: Обеспечение необходимой функцион...,test,
2,31065,Обязанности: отгрузка и прием товара со скл...,test,
3,31066,Обязанности: приготовление холодных и горячих...,test,
4,31067,Вакансия СРОЧНАЯ! Внимание! Просьба подробно и...,test,


## Data preparation

    Данные будем готовить двумя методами - созданием словаря Word2Vec и перевод вакансий в это пространство: 
    если слово есть в вакансии - прибавим его к вектору "вакансии". Потом разделим полученный вектор на сумму слов.     
    Делаем подготовку данных для словаря Word2Vec с использованием морфологического анализа (библиотека pymystem3). Напишем функцию, которая будет разбивать описание вакансиий на слова, проводить их морфологический анализ 
    (определение части речи и начальной формы слова), если слово относится к глаголу, существительному или
    прилагательному, то вернем его начальную форму. Составим список слов вакансии и вернем этот список
    

In [142]:
m = Mystem()
word_dict = {}
def normalize(text):
    output = []
    sentence = re.sub(r'[^\w]', ' ', text).split()
    if detect(text) != 'ru':
        output.append(sentence)
    else:
        words = []
        for new_word in sentence:
            if len(new_word)>1:
                if word_dict.get(new_word):
                    words.append(word_dict[new_word])
                else:
                    token = m.analyze(new_word)
                    if token[0].get('analysis'):
                        word = token[0]['analysis'][0]['lex']
                        pos = token[0]['analysis'][0]['gr']
                        if pos[0] in ['A', 'S', 'V']:
                            words.append(word)
                            word_dict[new_word] = word
        output.append(list(set(words)))
    return output

In [143]:
all_sentences = []
for text in tqdm(df['text']):
    all_sentences.extend(normalize(text))

HBox(children=(FloatProgress(value=0.0, max=62127.0), HTML(value='')))




In [151]:
len(all_sentences)

62127

    Сохраним данные в pickle. 

In [153]:
with open("all_sentences.txt", "wb") as fp:
    pickle.dump(all_sentences, fp)

    Созданим собственный словарь в Word2Vec из слов в наших вакансиях (начальные формы существительных, глаголов и прилагательных)

In [154]:
%%time

# список параметров, которые можно менять по вашему желанию
num_features = 50  # итоговая размерность вектора каждого слова
min_word_count = 5  # минимальная частотность слова, чтобы оно попало в модель
num_workers = 5     # количество ядер вашего процессора, чтоб запустить обучение в несколько потоков
context = 5         # размер окна 
downsampling = 1e-3 # внутренняя метрика модели

my_model = Word2Vec(all_sentences, workers=num_workers, size=num_features,
                 min_count=min_word_count, window=context, sample=downsampling)

CPU times: user 1min 19s, sys: 581 ms, total: 1min 20s
Wall time: 31.5 s


    Оценим результат

In [155]:
print('Размер словаря(корпуса) -', my_model.corpus_total_words, 
      'Размерность вектора слов - ', my_model.vector_size)

Размер словаря(корпуса) - 5223562 Размерность вектора слов -  50


    "Заморозим" наш словарь векторов слов для дальнейшего использования

In [156]:
my_model.init_sims(replace=True)

    Время магии. Составляем общий ветор слов в нашей вакансии, приведённый к количеству слов в вакансии:

In [157]:
index2word_set = list(set(my_model.wv.index2word))

w2v_vectors = []
for sentense in tqdm(all_sentences):
    text_vec = np.zeros((my_model.vector_size), dtype="float32")
    n_words = 0
    for word in sentense:
        if word in index2word_set:
            n_words = n_words + 1
            text_vec = np.add(text_vec, my_model[word])
    if n_words != 0:
        text_vec /= n_words
    w2v_vectors.append(text_vec)


HBox(children=(FloatProgress(value=0.0, max=62127.0), HTML(value='')))




    Создадим из полученных векторов новый DataFrame, проведем его нормирование через 
    StandartScale() и добавим к нашим данным. 

In [158]:
df_w2v_vectors = pd.DataFrame(w2v_vectors)

In [160]:
df_new_w2v = pd.concat([df, df_w2v_vectors], axis=1)

In [161]:
df_new_w2v.head(5)

Unnamed: 0,id,text,sample,target,0,1,2,3,4,5,...,40,41,42,43,44,45,46,47,48,49
0,31063,В крупную компанию по организации и приготовле...,test,,-0.009692,-0.012096,0.011477,0.021942,0.034324,0.013835,...,-0.001376,-0.049844,0.007026,-0.063618,-0.015122,0.019125,0.013729,0.045449,-0.058340,-0.002166
1,31064,Обязанности: Обеспечение необходимой функцион...,test,,-0.027645,-0.006044,0.018141,0.000194,-0.013684,0.024177,...,-0.000145,0.011921,-0.015817,-0.000845,0.000322,0.002328,0.006252,-0.007417,-0.010404,-0.035867
2,31065,Обязанности: отгрузка и прием товара со скл...,test,,0.025459,-0.040618,-0.031368,-0.002854,0.016386,0.041660,...,-0.018855,-0.013222,0.055444,-0.060059,-0.007496,0.042994,-0.027436,0.024707,-0.038612,-0.030771
3,31066,Обязанности: приготовление холодных и горячих...,test,,-0.023435,-0.028566,-0.035077,0.068286,0.004282,-0.019079,...,0.004231,-0.039630,0.005180,-0.047777,-0.051409,0.056020,-0.016946,0.060479,-0.041780,0.006688
4,31067,Вакансия СРОЧНАЯ! Внимание! Просьба подробно и...,test,,-0.009003,0.015710,-0.006751,-0.007007,0.007899,0.010307,...,-0.035466,0.000730,0.006932,-0.001965,-0.006512,-0.003810,0.012426,0.016349,-0.024242,-0.032836
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
62122,31058,Обязанности: Создание атмосферы гостеприимств...,train,1.0,-0.041570,-0.081356,-0.066499,0.014493,0.004897,0.051060,...,-0.035452,0.028755,-0.010448,0.017975,0.000189,0.040819,0.047780,0.015575,-0.099956,-0.000112
62123,31059,"Обязанности: ведение бухгалтерского, налоговог...",train,1.0,0.028025,0.006941,0.024264,0.020101,-0.003867,-0.021826,...,0.034610,-0.071575,0.046674,-0.082472,-0.016335,0.062519,0.002847,-0.006985,-0.007888,-0.012490
62124,31060,"В жилой дом, расположенный у станции метро ""Зв...",train,0.0,-0.014316,-0.029265,0.015154,0.041256,0.001097,-0.022268,...,-0.032398,-0.047645,0.015745,-0.053868,-0.046522,0.040444,-0.023134,0.075914,-0.036307,-0.006032
62125,31061,В нашу дружную команду требуется архитектор-ди...,train,0.0,-0.032346,-0.038017,-0.001400,0.029337,0.011012,0.016577,...,-0.015414,-0.052418,-0.050502,-0.034396,-0.019782,-0.018623,-0.010110,-0.008327,-0.020328,-0.009507


In [163]:
with open("df_new_w2v.txt", "wb") as fp:
    pickle.dump(df_new_w2v, fp)

    Займемся подбором модели:
    1. Разделим обратно данные:

In [166]:
df_train = df_new_w2v[df_new_w2v['sample']=='train']
df_train = df_train.drop([ 'text', 'sample', 'id'], axis = 1)

target = df_train['target']
df_train = df_train.drop(['target'], axis = 1)

In [168]:
df_new_test = df_new_w2v[df_new_w2v['sample']=='test']
test_id = df_new_test['id']
df_new_test = df_new_test.drop(['text', 'target','sample', 'id'], axis = 1)

    2. Отделим валидационную выборку

In [169]:
X_train, X_test, y_train, y_test = train_test_split(df_train, target, 
                                                    test_size=0.2, random_state=123) 

    3.  По заданию у нас метрика ROC-AUC, а обучать сначала будем градиентный бустинг
    3.1 Созданим pipeline и начнем подбор параметров с помощью RandomizedSearchCV

In [170]:
mypipeline_gb = Pipeline([
    ('gb', GradientBoostingClassifier())
])
param_grid_gb = { 'gb__max_depth': range(17, 30),
                  'gb__n_estimators': range(150,200),
                  'gb__subsample':[0.3,0.5,0.7,0.9],
                  'gb__min_samples_leaf': range(1,10, 2),
                  'gb__min_samples_split': range(2, 21, 2),
                  'gb__min_impurity_decrease':[0, 0.000001],
                  'gb__learning_rate': [0.001, 0.01, 0.05, 0.1, 0.5, 0.9]
}



In [None]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)

hyper_search_gb = RandomizedSearchCV(mypipeline_gb, param_grid_gb, n_iter=5, scoring='roc_auc', 
                                  cv=cv, n_jobs=4, refit=True, random_state=47,
                                  verbose=3)
hyper_search_gb.fit(X_train, y_train)

    3.2 Посмотрим на лучшее значение ROC-AUC, а также на влияние гиперпараметров на его значение

In [None]:
print('GBC', hyper_search_gb.best_score_)

In [None]:
print(hyper_search_gb.best_params_)

In [None]:
df_search_gb = pd.DataFrame(hyper_search_gb.cv_results_)

In [None]:
df_search_gb.sort_values(by='mean_test_score',ascending=False).T

    3.3 Параметры модели выбраны, обучим её на всех данных:

In [None]:
gb = GradientBoostingClassifier(subsample= 0.7, n_estimators=177, min_samples_split=16, min_samples_leaf= 7, 
                                      min_impurity_decrease= 1e-06, max_depth=19, learning_rate=0.1)
gb.fit(X_train, y_train)

    3.4 Проверим данные на отложенной (валидационной) выборке:

In [None]:
pridict = gb.predict(X_test)
roc_auc_score(y_test, pridict)

    3.5 Ну и сделаем предсказание

In [None]:
predict = gb.predict_proba(df_test)

In [None]:
pd.DataFrame(zip(test_id, predict[1]), columns=['id','target']).to_csv('predict_26-02.csv', sep=',', index=False)

     Теперь попробуем создать матрицу 𝑇𝐹∗𝐼𝐷𝐹 для всех наших вакансий. Напишем функцию, 
     похожую на предыдущую, но мы будем получать список строк вакансий, а не список 
     списков вакансий

In [212]:
m = Mystem()
word_dict = {}

def normalize2(text):
    sentence = re.sub(r'[^\w]', ' ', text).split()
    words = []
    if detect(text) != 'ru':
        words.append(re.sub(r'[^\w]', ' ', text))
    else:
        k = ''
        for new_word in sentence:
            if len(new_word)>1:
                if word_dict.get(new_word):
                    words.append(word_dict[new_word])  
                    k = k+word+' '
                else:
                    token = m.analyze(new_word)
                    if token[0].get('analysis'):
                        word = token[0]['analysis'][0]['lex']
                        pos = token[0]['analysis'][0]['gr']
                        if pos[0] in ['A', 'S', 'V']:
                            k = k+word+' '
        words.append(k)
    return words

    Применим функцию и сокраним всё в Pikcle

In [213]:
new_sentences = []
for text in tqdm(df['text']):
    new_sentences.extend(normalize2(text))

HBox(children=(FloatProgress(value=0.0, max=62127.0), HTML(value='')))




In [214]:
len(new_sentences)

62127

In [215]:
with open("new_sentences.txt", "wb") as fp:
    pickle.dump(new_sentences, fp)

    Обучим TfidfVectorizer на всех наших данных:

In [244]:
tfidf = TfidfVectorizer(sublinear_tf = True)

In [245]:
tfidf.fit(new_sentences)

TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.float64'>, encoding='utf-8',
                input='content', lowercase=True, max_df=1.0, max_features=None,
                min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                smooth_idf=True, stop_words=None, strip_accents=None,
                sublinear_tf=True, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, use_idf=True, vocabulary=None)

    И сделаем 𝑇𝐹∗𝐼𝐷𝐹 матрицу наших предложений (отделно для тренировочной и тестовой выборок)

In [246]:
train_lemmatized = df[df['sample']=='train']
test_lemmatized = df[df['sample']=='test']

In [247]:
train_tfidf = tfidf.transform(train_lemmatized['text'])
test_tfidf = tfidf.transform(test_lemmatized['text'])

In [248]:
train_target = train_lemmatized['target']

In [249]:
X_train, X_test, y_train, y_test = train_test_split(train_tfidf, train_target, 
                                                    test_size=0.2, random_state=123) 

In [250]:
mypipeline_lr = Pipeline([
    ('lr', LogisticRegression())])

param_grid_lr = { 'lr__penalty': ['l1', 'l2', 'elasticnet', 'none'],
                  'lr__tol': [0.00001, 0.0001, 0.001],
                  'lr__solver' : ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
                  'lr__C' : [0.0001, 0.001, 0.01, 0.05, 0.1, 0.5, 0.0, 1.0]
}

In [None]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)

hyper_search_lr = RandomizedSearchCV(mypipeline_lr, param_grid_lr, n_iter=100, scoring='roc_auc', 
                                  cv=cv, n_jobs=4, refit=True, random_state=47,
                                  verbose=5)
hyper_search_lr.fit(X_train, y_train)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  10 tasks      | elapsed:    2.4s


In [226]:
print('LogRes', hyper_search_lr.best_score_)
print(hyper_search_lr.best_params_)

LogRes 0.9788307449330185
{'lr__tol': 0.001, 'lr__solver': 'saga', 'lr__penalty': 'none', 'lr__C': 0.0}


In [227]:
lr = LogisticRegression(tol=0.001, solver='saga', penalty = 'none', C =0)

In [235]:
lr.fit(train_tfidf, train_target)

LogisticRegression(C=0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='none',
                   random_state=None, solver='saga', tol=0.001, verbose=0,
                   warm_start=False)

In [236]:
predict_lr = lr.predict_proba(X_test)

In [20]:
df_train_tfidf = tfidf.transform(train_tfidf)
df_test_tfidf = tfidf.transform(test_tfidf )

In [44]:
df_new_test = df_new[df_new['sample']=='test']
test_id = df_new_test['id']
df_new_test = df_new_test.drop(['text', 'target','sample', 'id'], axis = 1)

    2. Отделим валидационную выборку

In [457]:
X_train, X_test, y_train, y_test = train_test_split(df_train, target, 
                                                    test_size=0.2, random_state=123) 

In [None]:
pd.DataFrame(zip(test_id, predict[1]), columns=['id','target']).to_csv('predict_26-02.csv', sep=',', index=False)

    3.4 Попробуем теперь модель логистической регрессии (все шаги аналогичны):

In [22]:
from sklearn.linear_model import LogisticRegression

In [25]:
df_train = df_new[df_new['sample']=='train']
df_train = df_train.drop([ 'lang', 'text', 'sample', 'id'], axis = 1)

target = df_train['target']
df_train = df_train.drop(['target'], axis = 1)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,40,41,42,43,44,45,46,47,48,49
0,0.020300,0.048534,0.047571,0.058687,-0.018591,0.046081,-0.010539,-0.029725,-0.036074,0.022760,...,-0.064167,0.068819,-0.093172,0.012321,-0.014153,-0.068234,-0.007546,-0.033890,-0.064652,0.016229
1,-0.036070,0.025835,0.032085,0.032439,0.045227,-0.013449,0.018120,-0.052522,-0.028263,-0.009837,...,-0.050580,0.064305,-0.057780,0.090739,-0.037933,-0.014616,-0.029467,-0.029742,0.054784,-0.060642
2,0.010065,0.058150,0.044352,0.046919,-0.052414,-0.029630,0.006786,-0.032574,0.006919,-0.036406,...,-0.034074,0.029135,-0.020393,0.018462,-0.016672,-0.059624,0.047566,0.005071,-0.058323,0.006377
3,-0.026657,0.013812,0.025367,0.013324,-0.062354,0.098799,-0.017324,0.012411,0.000881,-0.008061,...,-0.049376,0.051521,-0.136965,0.012129,0.028002,-0.080966,0.028812,0.003304,-0.066382,0.031387
4,0.022742,-0.010745,0.006650,0.005026,0.034022,-0.011833,0.019907,-0.010415,-0.007886,-0.011825,...,0.000592,0.046487,-0.023325,0.038159,-0.006442,0.001059,0.007444,-0.024818,0.021036,-0.035649
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
31059,0.010257,-0.053726,0.075761,0.043358,-0.086408,0.053989,0.020136,0.037576,0.049897,0.080496,...,-0.013137,0.040359,-0.072743,-0.021994,-0.015918,-0.087914,-0.026765,0.001961,-0.097647,0.009055
31060,0.076382,0.034959,0.007231,-0.018965,0.100007,0.079176,0.040875,0.014493,0.034222,-0.025086,...,0.062635,-0.004796,-0.031663,0.014627,-0.012913,0.024049,0.023255,-0.002590,-0.007980,-0.034236
31061,-0.002675,0.046696,0.032751,0.025726,-0.067436,-0.057088,-0.013116,0.036461,0.020919,0.063970,...,-0.029849,0.039908,-0.030530,-0.052518,0.090020,-0.057650,0.034707,0.017355,-0.093878,0.049013
31062,0.067936,0.022239,-0.006599,-0.029971,0.056214,0.014212,0.028402,0.007138,0.017489,-0.011437,...,0.012897,0.033814,-0.030455,0.044858,0.018468,0.025513,0.006351,-0.043535,0.044649,-0.066918


In [48]:
df_new_test = df_new[df_new['sample']=='test']
test_id = df_new_test['id']
df_new_test = df_new_test.drop(['text', 'lang', 'target','sample', 'id'], axis = 1)

In [11]:
X_train, X_test, y_train, y_test = train_test_split(df_train, target, 
                                                    test_size=0.2, random_state=123) 

In [29]:
mypipeline_lr = Pipeline([
    ('lr', LogisticRegression())])

param_grid_lr = { 'lr__penalty': ['l1', 'l2', 'elasticnet', 'none'],
                  'lr__tol': [0.00001, 0.0001, 0.001],
                  'lr__max_iter': [50, 100, 250, 500],
                  'lr__solver' : ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
                  'lr__C' : [0.0001, 0.001, 0.01, 0.05, 0.1, 0.5, 0.0, 1.0]
}


In [None]:
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=123)

hyper_search_lr = RandomizedSearchCV(mypipeline_lr, param_grid_lr, n_iter=100, scoring='roc_auc', 
                                  cv=cv, n_jobs=4, refit=True, random_state=47,
                                  verbose=5)
hyper_search_lr.fit(X_train, y_train)

In [29]:
print('LogRes', hyper_search_lr.best_score_)

LogRes 0.9806369022492056


In [30]:
print(hyper_search_lr.best_params_)

{'lr__tol': 0.001, 'lr__solver': 'saga', 'lr__penalty': 'none', 'lr__max_iter': 100, 'lr__C': 1.0}


In [56]:
lr = LogisticRegression(tol=0.001, solver='saga', penalty = 'none', max_iter = 100, C =1.0)

In [46]:
lr.fit(df_train, target)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='none',
                   random_state=None, solver='saga', tol=0.001, verbose=0,
                   warm_start=False)

In [50]:
predict_lr = lr.predict_proba(df_new_test)

In [35]:
roc_auc_score(y_test, predict_lr)

0.9321354806308355

In [51]:
score_3 = []
for i in range(len(predict_lr)):
    score_3.append(predict_lr[i][1])

In [66]:
df_test = df_new[df_new['sample']=='test']

In [67]:
pd.DataFrame(zip(
    df_test['id'],
    score_3
), columns=['id','target']).to_csv('predict_27-02.csv', sep=',', index=False)

In [31]:
lr = LogisticRegression(tol=0.001, solver='saga', penalty = 'none', max_iter = 100, C =1.0)

In [32]:
lr.fit(df_train_tfidf, target)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='none',
                   random_state=None, solver='saga', tol=0.001, verbose=0,
                   warm_start=False)

In [None]:
predict_lr_2 = lr.predict_proba(df_test_tfidf)

In [33]:
predict_lr_1 = lr.predict_proba(df_test_tfidf)

In [28]:
df_train_tfidf = tfidf.transform(train_tfidf)
df_test_tfidf = tfidf.transform(test_tfidf )

In [34]:
predict_lr_1

array([[1.22747069e-08, 9.99999988e-01],
       [9.99999948e-01, 5.19170883e-08],
       [9.90821909e-05, 9.99900918e-01],
       ...,
       [9.99903721e-01, 9.62793540e-05],
       [7.86437733e-03, 9.92135623e-01],
       [4.38271940e-04, 9.99561728e-01]])

In [35]:
score_4 = []
for i in range(len(predict_lr_1)):
    score_4.append(predict_lr_1[i][1])

In [39]:
pd.DataFrame(zip(
    test_id,
    score_4
), columns=['id','target']).to_csv('predict_space_matrix.csv', sep=',', index=False)

In [40]:
predict_2 = pd.read_csv('predict_27-02.csv', sep=',')

In [55]:
d = pd.DataFrame(zip(score_3, score_4)) 

In [57]:
d['sum'] = (d[0]+d[1])/2

In [59]:
d

Unnamed: 0,0,1,sum
0,0.999504,1.000000e+00,0.999752
1,0.095038,5.191709e-08,0.047519
2,0.999741,9.999009e-01,0.999821
3,0.994797,1.000000e+00,0.997398
4,0.766031,8.577916e-01,0.811911
...,...,...,...
31059,0.991579,9.626039e-01,0.977091
31060,0.636897,1.557382e-04,0.318526
31061,0.013173,9.627935e-05,0.006634
31062,0.078997,9.921356e-01,0.535566


In [60]:
pd.DataFrame(zip(
    test_id,
    d['sum'] 
), columns=['id','target']).to_csv('predict_space_matrix_with_w2v.csv', sep=',', index=False)