We run a niche job site and have been given a large dataset of vacancies. Some vacancies are of interest to us because of their subject matter and others are not (target 1 and 0, respectively). Part of the dataset was labeled by human annotators.
Your task is to train a classifier that, based on the labeled sample, identifies the vacancies that are of interest to our site.
> -  Quality metric: ROC_AUC.
> -  USING EXTERNAL DATA FROM JOB SITES IS FORBIDDEN
> -  USING other EXTERNAL DATA is allowed only with the organizer's permission (see Discussion)
> -  A result counts only if it comes with code that reproduces it

## Data description
-  train.csv - training data
-  test.csv - data for preparing the submission and for evaluation
-  sampleSubmission.csv - an example of a valid but useless submission
-  other.csv - optional data for extra statistics and other tricks (for example, training word2vec)

## Field description
-  id - internal identifier
-  name - vacancy title
-  description - vacancy text
-  target - interest class

In [1]:
import pandas as pd
from bs4 import BeautifulSoup
from tqdm import tqdm_notebook

In [2]:
df = pd.read_csv('train.csv', engine='python', sep = '\t', encoding = 'UTF-8')

In [3]:
df.head(3)

Unnamed: 0,id,name,description,target
0,0,Заведующий отделом/секцией в магазин YORK (Уру...,<p><strong>В НОВЫЙ МАГАЗИН YORK (хозтовары) пр...,1
1,1,Наладчик станков и манипуляторов с ПУ,Обязанности:работа на токарных станках с ЧПУ T...,0
2,2,Разработчик С++ (Криптограф),<strong>Требования:</strong> <ul> <li>Опыт про...,0


In [27]:
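def Preprocessing(df):
    # Strip HTML markup from the description and drop zero-width spaces (U+200B)
    temp = df.copy()
    temp.description = temp.description.map(
        lambda x: BeautifulSoup(x, 'lxml').get_text().replace(u'\u200b', u''))
    return temp

def clean_df(df, algorithm=Preprocessing):
    # Apply the chosen preprocessing function to a copy of the frame
    return algorithm(df.copy())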

In [28]:
df_preprocessed = clean_df(df)

In [26]:
import re
import nltk

In [31]:
from nltk.stem import snowball
# Create the stemmer once instead of on every iteration
ps = snowball.RussianStemmer()
corpus = []
# Iterate over all rows rather than a hard-coded count
for i in range(len(df_preprocessed)):
    review = df_preprocessed['description'][i]
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review]
    review = ' '.join(review)
    corpus.append(review)
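
The same lowercase, split, stem, and join steps are applied to both the training and the test descriptions, so they can be wrapped in one helper. The following is a minimal sketch, not the notebook's own code: `stem_text`, `demo`, and `demo_corpus` are hypothetical names, and the two sample rows are made up for illustration.

```python
import pandas as pd
from nltk.stem.snowball import RussianStemmer

# One shared stemmer instance; creating it is not needed per document
_stemmer = RussianStemmer()

def stem_text(text):
    """Lowercase, split on whitespace, stem each token, and re-join with spaces."""
    return ' '.join(_stemmer.stem(word) for word in str(text).lower().split())

# Toy frame standing in for df_preprocessed (assumption: illustrative rows only)
demo = pd.DataFrame({'description': ['Работа на токарных станках',
                                     'Опыт программирования']})

# Series.map applies the helper to every row without a hard-coded row count
demo_corpus = demo['description'].map(stem_text).tolist()
print(demo_corpus)
```

With such a helper, the corpus for any frame could be built in a single `map` call regardless of how many rows it has.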

In [37]:
# Creating the Bag of Words model
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 500)

In [38]:
X = cv.fit_transform(corpus).toarray()
y = df_preprocessed['target'].values

In [39]:
# Splitting the dataset into the Training set and Test set
# sklearn.cross_validation was removed from scikit-learn; train_test_split lives in model_selection
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)



In [55]:
# Fitting Random Forest to the Training set
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 100, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=0, verbose=0, warm_start=False)

In [41]:
# Predicting the Test set results
y_pred = classifier.predict(X_test)

In [42]:
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm

array([[20210,   995],
       [ 1023, 17772]], dtype=int64)
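
The confusion matrix describes hard 0/1 predictions, while the competition metric is ROC AUC, which is computed from a ranking of scores. The sketch below shows one way to evaluate a random forest with `roc_auc_score` using class-1 probabilities from `predict_proba`. It runs on synthetic data from `make_classification` (an assumption made so the example is self-contained), and all `demo_*` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the bag-of-words features (assumption: not the real data)
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

demo_clf = RandomForestClassifier(n_estimators=50, random_state=0)
demo_clf.fit(X_tr, y_tr)

# Column 1 of predict_proba is the estimated probability of class 1
val_scores = demo_clf.predict_proba(X_val)[:, 1]

# AUC from probabilities uses the full ranking of the validation rows
auc_from_scores = roc_auc_score(y_val, val_scores)
# AUC from hard labels only sees two distinct score values
auc_from_labels = roc_auc_score(y_val, demo_clf.predict(X_val))
print(auc_from_scores, auc_from_labels)
```

Because hard labels collapse the ranking to two levels, the score-based value is usually the more informative estimate of how a submission would be judged.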

In [44]:
df2 = pd.read_csv('test.csv', engine='python', sep = '\t', encoding = 'UTF-8')

In [45]:
df2_preprocessed = clean_df(df2)

In [47]:
len(df2_preprocessed)

170179

In [48]:
from nltk.stem import snowball
# Create the stemmer once instead of on every iteration
ps = snowball.RussianStemmer()
corpus2 = []
# Iterate over all rows rather than a hard-coded count
for i in range(len(df2_preprocessed)):
    review = df2_preprocessed['description'][i]
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review]
    review = ' '.join(review)
    corpus2.append(review)

In [49]:
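# Reuse the vocabulary learned on the training corpus; refitting on the test
# corpus would change which word each feature column stands for
X_2 = cv.transform(corpus2).toarray()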
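
A vectorizer's vocabulary fixes the meaning and order of the feature columns, so a classifier trained on one vocabulary expects exactly the same columns at prediction time. The sketch below illustrates this with toy documents (an assumption, not the competition data): `fit_transform` learns the vocabulary on the training texts, `transform` reuses it for new texts, and a vectorizer refit on the new texts ends up with a different vocabulary. The `demo_*` and `refit_cv` names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents standing in for the stemmed corpora (assumption)
train_docs = ["python developer", "sales manager", "python data engineer"]
test_docs = ["java developer", "sales assistant manager"]

demo_cv = CountVectorizer()
X_train_demo = demo_cv.fit_transform(train_docs)  # learns the vocabulary
X_test_demo = demo_cv.transform(test_docs)        # reuses the same columns

# A vectorizer refit on the test documents learns different columns
refit_cv = CountVectorizer()
refit_cv.fit(test_docs)

print(sorted(demo_cv.vocabulary_))
print(sorted(refit_cv.vocabulary_))
print(X_test_demo.toarray())
```

Words that appear only in the test documents ("java", "assistant") are ignored by `transform`, which is the expected behavior for a model that never saw them during training.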

In [56]:
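# ROC AUC is computed from a ranking, so keep the class-1 probabilities
# instead of hard 0/1 labels
y_pred = classifier.predict_proba(X_2)[:, 1]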

In [57]:
final = pd.DataFrame()

In [58]:
final['id'] = df2_preprocessed['id']

In [59]:
final['target'] = y_pred

In [60]:
final.to_csv('submit.csv',index=False)

In [67]:
from joblib import Parallel, delayed
import multiprocessing

In [108]:
num_cores = multiprocessing.cpu_count()
n = 1000  # chunk size in rows
list_df = [df[i:i+n] for i in range(0, df.shape[0], n)]
# parce_text must be defined before this cell runs; it is applied to each chunk
texts = Parallel(n_jobs=num_cores, verbose=50)(
    delayed(parce_text)(i) for i in list_df)
#new_df = pd.concat(texts)

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   1 tasks      | elapsed:    2.0s
[Parallel(n_jobs=4)]: Done   2 tasks      | elapsed:    2.1s
[Parallel(n_jobs=4)]: Done   3 tasks      | elapsed:    2.1s
[Parallel(n_jobs=4)]: Done   4 tasks      | elapsed:    2.1s
[Parallel(n_jobs=4)]: Done   5 tasks      | elapsed:    4.7s
[Parallel(n_jobs=4)]: Done   6 tasks      | elapsed:    4.7s
[Parallel(n_jobs=4)]: Done   7 tasks      | elapsed:    4.9s
[Parallel(n_jobs=4)]: Done   8 tasks      | elapsed:    4.9s
[Parallel(n_jobs=4)]: Done   9 tasks      | elapsed:    7.2s
[Parallel(n_jobs=4)]: Done  10 tasks      | elapsed:    7.2s
[Parallel(n_jobs=4)]: Done  11 tasks      | elapsed:    7.3s
[Parallel(n_jobs=4)]: Done  12 tasks      | elapsed:    7.4s
[Parallel(n_jobs=4)]: Done  13 tasks      | elapsed:    9.4s
[Parallel(n_jobs=4)]: Done  14 tasks      | elapsed:    9.5s
[Parallel(n_jobs=4)]: Done  15 tasks      | elapsed:    9.5s
[Parallel(

[Parallel(n_jobs=4)]: Done 135 tasks      | elapsed:  1.4min
[Parallel(n_jobs=4)]: Done 136 tasks      | elapsed:  1.4min
[Parallel(n_jobs=4)]: Done 137 tasks      | elapsed:  1.4min
[Parallel(n_jobs=4)]: Done 138 tasks      | elapsed:  1.4min
[Parallel(n_jobs=4)]: Done 139 tasks      | elapsed:  1.4min
[Parallel(n_jobs=4)]: Done 140 tasks      | elapsed:  1.4min
[Parallel(n_jobs=4)]: Done 141 tasks      | elapsed:  1.5min
[Parallel(n_jobs=4)]: Done 142 tasks      | elapsed:  1.5min
[Parallel(n_jobs=4)]: Done 143 tasks      | elapsed:  1.5min
[Parallel(n_jobs=4)]: Done 144 tasks      | elapsed:  1.5min
[Parallel(n_jobs=4)]: Done 145 tasks      | elapsed:  1.5min
[Parallel(n_jobs=4)]: Done 146 tasks      | elapsed:  1.5min
[Parallel(n_jobs=4)]: Done 147 tasks      | elapsed:  1.5min
[Parallel(n_jobs=4)]: Done 148 tasks      | elapsed:  1.5min
[Parallel(n_jobs=4)]: Done 149 tasks      | elapsed:  1.5min
[Parallel(n_jobs=4)]: Done 150 tasks      | elapsed:  1.6min
[Parallel(n_jobs=4)]: Do