The online store «Викишоп» is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. You have a dataset labeled for the toxicity of edits.

In [4]:
# import sys
# !{sys.executable} -m pip install spacy
# !{sys.executable} -m spacy download en
# import spacy

In [5]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.utils import resample
from sklearn.metrics import f1_score


In [6]:
from pprint import pprint
from IPython.display import Markdown, display, display_html, HTML

def printmd(string, color=None):
    if color is not None:
        string = "<span style='color:{}'>{}</span>".format(color, string)
    display(Markdown(string))    

def print_bold(string, color=None):
    printmd("**{}**".format(string), color)
    
def print_italic(string, color=None):
    printmd("*{}*".format(string), color)
    
def print_header(string):
    printmd("### {}".format(string))

def display_side_by_side(*args):
    html_str=''
    for df in args:
        html_str += df.to_html()
    display_html(html_str.replace('table','table style="display:inline;margin-right: 10px;"'),raw=True)

def print_df(df):
    display_side_by_side(df)
    
# printmd('__bold__')
# print_bold('bold')
# print_header('print_header')
# print_italic('italic')

In [7]:
import sys

def sizeof_fmt(num, suffix='B'):
    for unit in ['','Ki','Mi','Gi','Ti','Pi','Ei','Zi']:
        if abs(num) < 1024.0:
            return "%3.1f%s%s" % (num, unit, suffix)
        num /= 1024.0
    return "%.1f%s%s" % (num, 'Yi', suffix)

def _sizeof_fmt(elem):
    elem = list(elem)
    elem[1] = sizeof_fmt(elem[1])
    return elem


def print_mem_usage_vars(_dir):
    # These are the usual ipython objects, including this one you are creating
    ipython_vars = ['In', 'Out', 'exit', 'quit', 'get_ipython', 'ipython_vars']

    mem_usage = [(x, sys.getsizeof(globals().get(x))) for x in _dir if not x.startswith('_') and x not in sys.modules and x not in ipython_vars]
    mem_usage = sorted(mem_usage, key=lambda x: x[1], reverse=True)

    pprint(list(map(_sizeof_fmt, mem_usage[:10])))

# print_mem_usage_vars(dir())

# Loading and exploring the data

In [8]:
data = pd.read_csv('/datasets/toxic_comments.csv')
data.head(3)

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0


In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
text     159571 non-null object
toxic    159571 non-null int64
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


In [10]:
data.shape

(159571, 2)

In [11]:
data.isna().sum()

text     0
toxic    0
dtype: int64

In [12]:
data.duplicated(subset=['text'], keep=False).sum()

0

In [13]:
def print_class_balance(data):
    cnt_total = data.shape[0]
    cnt_toxic = data[data['toxic'] == 1]['toxic'].count()
    cnt_not_toxic = data[data['toxic'] == 0]['toxic'].count()
    prc_toxic = round((cnt_toxic*100)/cnt_total,2)

    print('Total: {}, toxic {} ({}%), non-toxic {}'.format(
        cnt_total,
        cnt_toxic,
        prc_toxic,
        cnt_not_toxic
    ))
    
print_class_balance(data)    

Total: 159571, toxic 16225 (10.17%), non-toxic 143346


# Conclusions

Total: 159571, toxic 16225 (10.17%), non-toxic 143346

* The dataset is imbalanced
* It contains only English phrases
* There are no duplicates
* There are no missing values
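
The imbalance is why F1 is used later instead of accuracy. A minimal sketch with the counts printed above, assuming a hypothetical baseline that labels every comment as non-toxic (this baseline is not part of the notebook):

```python
# Counts taken from the class-balance output above.
n_toxic = 16225
n_non_toxic = 143346
n_total = n_toxic + n_non_toxic

# Hypothetical baseline: label every comment as non-toxic (class 0).
tp = 0              # toxic comments correctly flagged
fp = 0              # non-toxic comments wrongly flagged
fn = n_toxic        # toxic comments missed
tn = n_non_toxic    # non-toxic comments correctly passed

accuracy = (tp + tn) / n_total
# F1 = 2*TP / (2*TP + FP + FN); taken as 0 when there are no true positives.
f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0

print(f"accuracy = {accuracy:.4f}")
print(f"f1 = {f1:.1f}")
```

Accuracy comes out close to 0.90 even though this baseline never finds a single toxic comment, while F1 for the toxic class is 0.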

## Preprocessing

In [14]:
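import re

import nltk
from nltk.stem import SnowballStemmer

stem = SnowballStemmer("english")

def lemmatize(sentence):
    # Despite the name, this stems each token with SnowballStemmer.
    return ' '.join([stem.stem(w) for w in nltk.word_tokenize(sentence)])

def clear_text(text):
    # Replace everything except Latin letters, backticks, apostrophes and spaces.
    return " ".join(re.sub(r'[^A-Za-z`\' ]', ' ', text).split())

def text_process(sentence):
    return lemmatize(clear_text(sentence))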

# test
# sentence = data.iloc[0]['text']
# print(sentence)
# print('--')
# print(text_process(sentence))


In [15]:
# nlp = spacy.load('en', disable=['parser', 'ner'])

# def lemmatize2(sentence):
#     return ' '.join([token.lemma_ for token in nlp(sentence)])

# def text_process2(sentence):
#     return lemmatize2(clear_text(sentence))

# test2
# sentence = data.iloc[0]['text']
# print(sentence)
# print('--')
# print(text_process2(sentence))

In [16]:
# %%time
#
# Using SnowballStemmer
#

# data['lemm_text'] = data['text'].apply(
#     lambda x: text_process(x)
# )
# data.to_csv(r'toxic_comments_lemmatized.csv', index = False)

# # CPU times: user 4min 4s, sys: 360 ms, total: 4min 4s
# # Wall time: 4min 7s

data = pd.read_csv('toxic_comments_lemmatized.csv')[['lemm_text', 'toxic']]
data.shape

(159571, 2)

In [17]:
# %%time
#
# Using spacy
#

# data['lemm_text'] = data['text'].apply(
#     lambda x: text_process2(x)
# )
# data.to_csv(r'toxic_comments_lemmatized2.csv', index = False)

# CPU times: user 15min 34s, sys: 1.56 s, total: 15min 36s
# Wall time: 15min 53s

# data = pd.read_csv('toxic_comments_lemmatized2.csv')[['lemm_text', 'toxic']]
# data.shape

In [18]:
data.isna().sum()

lemm_text    6
toxic        0
dtype: int64

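## Conclusions

Text preparation:

* Replaced every character matching the mask r'[^A-Za-z`\' ]' (anything other than Latin letters, backticks, apostrophes, and spaces) with a space
* Used SnowballStemmer, although I really wanted to use the NLTK WordNet lemmatizer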
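
A small sketch of what the mask does, using made-up input strings. The function body repeats `clear_text` from the cell above so that the block runs on its own.

```python
import re

def clear_text(text):
    # Replace every character that is not a Latin letter, backtick,
    # apostrophe, or space with a space, then collapse whitespace runs.
    return " ".join(re.sub(r"[^A-Za-z`' ]", " ", text).split())

sample = "Hello, world!\nIt's 2020."    # made-up input for illustration
cleaned = clear_text(sample)
print(cleaned)                          # Hello world It's

# A comment made only of digits and punctuation is reduced to an empty string.
only_symbols = clear_text("12345 !!!")
print(repr(only_symbols))               # ''
```

Empty strings like the second result are a likely (but unverified) source of the six missing `lemm_text` values reported above, because `pd.read_csv` reads empty fields back as NaN by default.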

In [19]:
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer 
lemmatizer = WordNetLemmatizer()

# sentence = data.iloc[0]['text']
# sentence = clear_text(sentence) 
# word_list = nltk.word_tokenize(sentence)
# lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
# lemmatized_output = ' '.join([wordnet.synsets(w) for w in word_list])
#     return lemmatized_output
# print('**Source**')
# print(sentence)
# print()
# print('**lematized**')
# print(lemmatized_output)
# print()

print('**Sample**')
print(lemmatizer.lemmatize("bats"))
print(lemmatizer.lemmatize("feet"))
print(lemmatizer.lemmatize("edits"))
print(lemmatizer.lemmatize("voted"))
print(lemmatizer.lemmatize("driving"))
print()



**Sample**
bat
foot
edits
voted
driving



<font color=maroon>
Judging by bats and feet, it works to some extent, but I also expected these conversions:

* edits -> edit
* voted -> vote
* driving -> drive

None of them happen.
</font>

<font color='green'>Yes: to avoid such errors, the part of speech has to be specified correctly for each word, see https://stackoverflow.com/questions/32957895/wordnetlemmatizer-not-returning-the-right-lemma-unless-pos-is-explicit-python  </font>
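
A minimal sketch of the approach suggested in that comment, assuming tokens arrive as `(word, Penn Treebank tag)` pairs such as those returned by `nltk.pos_tag`. The helper names `penn_to_wordnet` and `lemmatize_tagged` are hypothetical and not part of NLTK. The lemmatizer is passed in as a function, so the block itself runs without NLTK data.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"   # adjective
    if tag.startswith("V"):
        return "v"   # verb
    if tag.startswith("R"):
        return "r"   # adverb
    return "n"       # noun, which is also the lemmatizer's default

def lemmatize_tagged(tagged_tokens, lemmatize_fn):
    """Lemmatize (word, penn_tag) pairs with a POS-aware lemmatize function."""
    return [lemmatize_fn(word, penn_to_wordnet(tag)) for word, tag in tagged_tokens]

# Intended wiring with NLTK (needs the 'wordnet' and POS tagger data):
#   from nltk import pos_tag, word_tokenize
#   from nltk.stem import WordNetLemmatizer
#   lemmas = lemmatize_tagged(pos_tag(word_tokenize(text)),
#                             WordNetLemmatizer().lemmatize)

# Stand-in lemmatizer that only shows which POS letter each word receives.
demo = lemmatize_tagged([("voted", "VBD"), ("bats", "NNS")],
                        lambda word, pos: f"{word}/{pos}")
print(demo)  # ['voted/v', 'bats/n']
```

With the verb tag passed through, `WordNetLemmatizer` is expected to reduce forms such as voted and driving to their base verbs, which the default noun setting cannot do.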


## Model analysis

In [20]:
def do_upsampling(df):
    df_majority = df[df.toxic==0]
    df_minority = df[df.toxic==1]
    n_samples = df_majority.shape[0]

    df_minority_upsampled = resample(df_minority, 
                                     replace=True,     # sample with replacement
                                     n_samples=n_samples,    # to match majority class
                                     random_state=42) # reproducible results

    df_upsampled = pd.concat([df_majority, df_minority_upsampled])
    print_class_balance(df_upsampled)
    return df_upsampled


def do_downsampling(df):
    df_majority = df[df.toxic==0]
    df_minority = df[df.toxic==1]
    n_samples = df_minority.shape[0]

    df_majority_downsampled = resample(df_majority, 
                                     replace=False,    # sample without replacement
                                     n_samples=n_samples,     # to match minority class
                                     random_state=42) # reproducible results


    df_downsampled = pd.concat([df_majority_downsampled, df_minority])
    print_class_balance(df_downsampled)
    return df_downsampled


In [21]:
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

def create_tf_idf(df_train, df_test):
    # Earlier variant (killed the kernel): df['lemm_text'].values.astype('U')
    # Cast every value to str; NaN rows become the string 'nan'.
    corpus_train = df_train['lemm_text'].apply(lambda x: np.str_(x))
    corpus_test = df_test['lemm_text'].apply(lambda x: np.str_(x))

    # nltk.download('stopwords')
    # Recent scikit-learn versions validate stop_words as a list, not a set.
    stopwords = nltk_stopwords.words('english')

    tf_idf_model = TfidfVectorizer(
        min_df=5, max_df=0.7, stop_words=stopwords
    )

    # Fit on the training corpus only, then reuse the vocabulary for the test corpus.
    tf_idf_train = tf_idf_model.fit_transform(corpus_train)
    tf_idf_test = tf_idf_model.transform(corpus_test)

    print("Matrix shape train:", tf_idf_train.shape, 'test:', tf_idf_test.shape)
    return (tf_idf_train, tf_idf_test)

In [22]:
def commit_result(nick, y_train, y_pred_train, y_test, y_pred_test):
    f1_score_train =  f1_score(y_train, y_pred_train)
    f1_score_test = f1_score(y_test, y_pred_test)
    print(nick)
    print('train f1_score', f1_score_train)
    print('test f1_score', f1_score_test)

    return (f1_score_train, f1_score_test)

# test
# commit_result('test', [1, 0], [1, 0], [1, 0], [1, 0])

In [23]:
def df_result_init():
    return pd.DataFrame(columns=['nick', 'f1_train', 'f1_test'])

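def df_result_add(df_result, nick, f1_train, f1_test):
    # DataFrame.append was removed in pandas 2.0; pd.concat works across versions.
    row = pd.DataFrame([{'nick': nick, 'f1_train': f1_train, 'f1_test': f1_test}])
    return pd.concat([df_result, row], ignore_index=True)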

df_result = df_result_init().copy()

# test
# commit_result(nick, y_train, y_pred_train, y_test, y_pred_test)
# f1_train, f1_test = commit_result('test', [1, 0], [1, 0], [1, 0], [1, 0])
# df_result = df_result_add(df_result, 'test', f1_train, f1_test)
# df_result


In [24]:
data_train_full, data_valid = train_test_split(
    data, test_size = 0.1, random_state = 42
)

In [25]:
print_italic('Class balance of the training set')
print_class_balance(data_train_full)

print_italic('Class balance of the validation set')
print_class_balance(data_valid)


*Class balance of the training set*

Total: 143613, toxic 14650 (10.2%), non-toxic 128963


*Class balance of the validation set*

Total: 15958, toxic 1575 (9.87%), non-toxic 14383


In [26]:
print_italic('Building a class-balanced DataFrame')
printmd(
    "* Took the full training data\n"+
    "* Built a dataset from it containing all toxic==1 rows and the same number of toxic==0 rows\n"+
    "* Shuffled the resulting dataset\n"
)

df_c1 = data_train_full[data_train_full['toxic']==1]
df_c0 = data_train_full[data_train_full['toxic']==0]

c1_cnt = df_c1.shape[0]

df_c0_balanced = df_c0.sample(c1_cnt, random_state=42)
data_train_balanced = pd.concat([df_c1, df_c0_balanced])
data_train_balanced = data_train_balanced.sample(frac=1)

print_class_balance(data_train_balanced)


*Building a class-balanced DataFrame*

* Took the full training data
* Built a dataset from it containing all toxic==1 rows and the same number of toxic==0 rows
* Shuffled the resulting dataset


Total: 29300, toxic 14650 (50.0%), non-toxic 14650


In [27]:
data_smp10k = data_train_full.sample(10000, random_state=42).copy()
data_smp10k_balanced = data_train_balanced.sample(10000, random_state=42).copy()

print_italic('10,000-row data sample')
print_class_balance(data_smp10k)

print_italic('10,000-row data sample (balanced)')
print_class_balance(data_smp10k_balanced)


*10,000-row data sample*

Total: 10000, toxic 1014 (10.14%), non-toxic 8986


*10,000-row data sample (balanced)*

Total: 10000, toxic 4966 (49.66%), non-toxic 5034


In [28]:
result = []

## LogisticRegression(class_weight='balanced')

In [29]:
data_train, data_test = train_test_split(
    data_smp10k, test_size = 0.1, random_state = 42
)

(X_train, X_test) = create_tf_idf(data_train, data_test)

y_train = data_train['toxic']
y_test = data_test['toxic']

Matrix shape train: (9000, 4468) test: (1000, 4468)


In [30]:
model = LogisticRegression(class_weight='balanced', solver='lbfgs')
model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

nick = 'LogisticRegression origin'
f1_train, f1_test = commit_result(nick, y_train, y_pred_train, y_test, y_pred_test)
result.append({
    'nick':nick, 'train_size':X_train.shape[0], 'test_size':X_test.shape[0],
    'f1_train':f1_train, 'f1_test':f1_test}
)
df_result = df_result_add(df_result, nick, f1_train, f1_test)
df_result


LogisticRegression origin
train f1_score 0.8539651837524178
test f1_score 0.7079646017699114


Unnamed: 0,nick,f1_train,f1_test
0,LogisticRegression origin,0.853965,0.707965


## LogisticRegression balanced

In [31]:
data_train, data_test = train_test_split(
    data_smp10k_balanced, test_size = 0.1, random_state = 42
)

(X_train, X_test) = create_tf_idf(data_train, data_test)

y_train = data_train['toxic']
y_test = data_test['toxic']

Matrix shape train: (9000, 4254) test: (1000, 4254)


In [32]:
model = LogisticRegression(class_weight='balanced', solver='lbfgs')
model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

nick = 'LogisticRegression balanced full'
f1_train, f1_test = commit_result(nick, y_train, y_pred_train, y_test, y_pred_test)
result.append({
    'nick':nick, 'train_size':X_train.shape[0], 'test_size':X_test.shape[0],
    'f1_train':f1_train, 'f1_test':f1_test}
)
df_result = df_result_add(df_result, nick, f1_train, f1_test)
df_result


LogisticRegression balanced full
train f1_score 0.9261929282526604
test f1_score 0.8891213389121339


Unnamed: 0,nick,f1_train,f1_test
0,LogisticRegression origin,0.853965,0.707965
1,LogisticRegression balanced full,0.926193,0.889121


## LogisticRegression upsampling

In [33]:
print_class_balance(data_smp10k)

Total: 10000, toxic 1014 (10.14%), non-toxic 8986


In [34]:
data_train, data_test = train_test_split(
    data_smp10k, test_size = 0.1, random_state = 42
)

data_train_upsampled = do_upsampling(data_train)

(X_train, X_test) = create_tf_idf(data_train_upsampled, data_test)

y_train = data_train_upsampled['toxic']
y_test = data_test['toxic']

Total: 16194, toxic 8097 (50.0%), non-toxic 8097
Matrix shape train: (16194, 6620) test: (1000, 6620)


In [35]:
model = LogisticRegression(class_weight='balanced', solver='lbfgs')
model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

nick = 'LogisticRegression upsampling'
f1_train, f1_test = commit_result(nick, y_train, y_pred_train, y_test, y_pred_test)
result.append({
    'nick':nick, 'train_size':X_train.shape[0], 'test_size':X_test.shape[0],
    'f1_train':f1_train, 'f1_test':f1_test}
)
df_result = df_result_add(df_result, nick, f1_train, f1_test)
df_result

LogisticRegression upsampling
train f1_score 0.9834700624464308
test f1_score 0.7053571428571429


Unnamed: 0,nick,f1_train,f1_test
0,LogisticRegression origin,0.853965,0.707965
1,LogisticRegression balanced full,0.926193,0.889121
2,LogisticRegression upsampling,0.98347,0.705357


## LogisticRegression downsampling

In [36]:
data_train, data_test = train_test_split(
    data_smp10k, test_size = 0.1, random_state = 42
)

data_train_downsampling = do_downsampling(data_train)

(X_train, X_test) = create_tf_idf(data_train_downsampling, data_test)

y_train = data_train_downsampling['toxic']
y_test = data_test['toxic']

Total: 1806, toxic 903 (50.0%), non-toxic 903
Matrix shape train: (1806, 1619) test: (1000, 1619)


In [37]:
model = LogisticRegression(class_weight='balanced', solver='lbfgs')
model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

nick = 'LogisticRegression downsampling'
f1_train, f1_test = commit_result(nick, y_train, y_pred_train, y_test, y_pred_test)
result.append({
    'nick':nick, 'train_size':X_train.shape[0], 'test_size':X_test.shape[0],
    'f1_train':f1_train, 'f1_test':f1_test}
)
df_result = df_result_add(df_result, nick, f1_train, f1_test)
df_result

LogisticRegression downsampling
train f1_score 0.931830985915493
test f1_score 0.6319444444444445


Unnamed: 0,nick,f1_train,f1_test
0,LogisticRegression origin,0.853965,0.707965
1,LogisticRegression balanced full,0.926193,0.889121
2,LogisticRegression upsampling,0.98347,0.705357
3,LogisticRegression downsampling,0.931831,0.631944


## Results

In [38]:
df_result

Unnamed: 0,nick,f1_train,f1_test
0,LogisticRegression origin,0.853965,0.707965
1,LogisticRegression balanced full,0.926193,0.889121
2,LogisticRegression upsampling,0.98347,0.705357
3,LogisticRegression downsampling,0.931831,0.631944


In [39]:
for row in result:
    print(row['nick'])
    print('f1_train', row['f1_train'])
    print('f1_test', row['f1_test'])
    print('train_size', row['train_size'])
    print('test_size', row['test_size'])
    print()

LogisticRegression origin
f1_train 0.8539651837524178
f1_test 0.7079646017699114
train_size 9000
test_size 1000

LogisticRegression balanced full
f1_train 0.9261929282526604
f1_test 0.8891213389121339
train_size 9000
test_size 1000

LogisticRegression upsampling
f1_train 0.9834700624464308
f1_test 0.7053571428571429
train_size 16194
test_size 1000

LogisticRegression downsampling
f1_train 0.931830985915493
f1_test 0.6319444444444445
train_size 1806
test_size 1000



## Conclusions

#### Results on the 10,000-row data sample

LogisticRegression origin
* f1_train 0.8539651837524178
* f1_test 0.7079646017699114
* train_size 9000
* test_size 1000

LogisticRegression upsampling
* f1_train 0.9834700624464308
* f1_test 0.7053571428571429
* train_size 16194
* test_size 1000

LogisticRegression downsampling
* f1_train 0.931830985915493
* f1_test 0.6319444444444445
* train_size 1806
* test_size 1000


#### Balanced dataset
* Took the full training data
* Built a dataset from it containing all toxic==1 rows and the same number of toxic==0 rows
* Shuffled the resulting dataset

LogisticRegression balanced full
* f1_train 0.9261929282526604
* f1_test 0.8891213389121339
* train_size 9000
* test_size 1000
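
The balanced run is scored on a test split that is itself about 50% toxic, so its F1 is not directly comparable with the runs scored on the natural 10% split. A rough sketch of how F1 moves with the share of positives, assuming a hypothetical classifier with fixed recall and false-positive rate (these two numbers are illustrative, not measured in this notebook):

```python
def f1_at_prevalence(recall, false_positive_rate, prevalence):
    """F1 for a classifier with fixed recall and false-positive rate,
    evaluated on data where `prevalence` is the share of positives."""
    tp = recall * prevalence
    fn = (1 - recall) * prevalence
    fp = false_positive_rate * (1 - prevalence)
    return 2 * tp / (2 * tp + fp + fn)

# Hypothetical operating point, chosen only for illustration.
recall, fpr = 0.90, 0.10
f1_balanced = f1_at_prevalence(recall, fpr, prevalence=0.5)
f1_natural = f1_at_prevalence(recall, fpr, prevalence=0.1)
print(round(f1_balanced, 3), round(f1_natural, 3))  # 0.9 0.643
```

The same classifier scores noticeably lower once negatives outnumber positives nine to one, because the same false-positive rate then produces many more false alarms per true hit.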



For training I used the hyperparameter class_weight='balanced'
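
As far as I understand the scikit-learn documentation, `class_weight='balanced'` sets each class weight to `n_samples / (n_classes * class_count)`. A minimal sketch of that formula in plain Python, using the class counts of the full training split printed earlier (the helper name `balanced_class_weights` is mine, not a scikit-learn function):

```python
def balanced_class_weights(class_counts):
    """weight_c = n_samples / (n_classes * count_c)."""
    n_samples = sum(class_counts.values())
    n_classes = len(class_counts)
    return {c: n_samples / (n_classes * n) for c, n in class_counts.items()}

# Class counts of the full training split reported above.
weights = balanced_class_weights({0: 128963, 1: 14650})
for label, w in weights.items():
    print(label, round(w, 3))

# On an already resampled 50/50 set both weights are exactly 1.0.
weights_resampled = balanced_class_weights({0: 903, 1: 903})
```

Under this formula a toxic comment weighs roughly 8.8 times as much as a non-toxic one on the full split, while on the upsampled and downsampled sets the option changes nothing.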

# 2. Training

In [40]:
result_final = []

## LogisticRegression full balanced

In [41]:
print_class_balance(data_train_balanced)

Total: 29300, toxic 14650 (50.0%), non-toxic 14650


In [42]:
%%time

(X_train, X_test) = create_tf_idf(data_train_balanced, data_valid)

y_train = data_train_balanced['toxic']
y_test = data_valid['toxic']

Matrix shape train: (29300, 8600) test: (15958, 8600)
CPU times: user 5.84 s, sys: 64 ms, total: 5.91 s
Wall time: 6.03 s


In [43]:
%%time

model = LogisticRegression(class_weight='balanced', solver='lbfgs')
model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

nick = 'LogisticRegression full balanced'
f1_train, f1_test = commit_result(nick, y_train, y_pred_train, y_test, y_pred_test)
result_final.append({
    'nick':nick, 'train_size':X_train.shape[0], 'test_size':X_test.shape[0],
    'f1_train':f1_train, 'f1_test':f1_test}
)

LogisticRegression full balanced
train f1_score 0.9264588050862219
test f1_score 0.6862213458220361
CPU times: user 484 ms, sys: 4 ms, total: 488 ms
Wall time: 495 ms


## LogisticRegression full

In [44]:
print_class_balance(data_train_full)

Total: 143613, toxic 14650 (10.2%), non-toxic 128963


In [45]:
%%time

(X_train, X_test) = create_tf_idf(data_train_full, data_valid)

y_train = data_train_full['toxic']
y_test = data_valid['toxic']

Matrix shape train: (143613, 23047) test: (15958, 23047)
CPU times: user 21 s, sys: 300 ms, total: 21.3 s
Wall time: 21.6 s


In [46]:
%%time

model = LogisticRegression(class_weight='balanced', solver='lbfgs')
model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

nick = 'LogisticRegression full'
f1_train, f1_test = commit_result(nick, y_train, y_pred_train, y_test, y_pred_test)
result_final.append({
    'nick':nick, 'train_size':X_train.shape[0], 'test_size':X_test.shape[0],
    'f1_train':f1_train, 'f1_test':f1_test}
)



LogisticRegression full
train f1_score 0.8003239241092086
test f1_score 0.7518918918918919
CPU times: user 18.3 s, sys: 33.1 s, total: 51.5 s
Wall time: 51.5 s


## LogisticRegression full upsampling

In [47]:
data_train_upsampled = do_upsampling(data_train_full)


Total: 257926, toxic 128963 (50.0%), non-toxic 128963


In [48]:
%%time

(X_train, X_test) = create_tf_idf(data_train_upsampled, data_valid)

y_train = data_train_upsampled['toxic']
y_test = data_valid['toxic']

Matrix shape train: (257926, 32483) test: (15958, 32483)
CPU times: user 30.8 s, sys: 0 ns, total: 30.8 s
Wall time: 31 s


In [49]:
%%time

model = LogisticRegression(class_weight='balanced', solver='lbfgs')
model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

nick = 'LogisticRegression full upsampling'
f1_train, f1_test = commit_result(nick, y_train, y_pred_train, y_test, y_pred_test)
result_final.append({
    'nick':nick, 'train_size':X_train.shape[0], 'test_size':X_test.shape[0],
    'f1_train':f1_train, 'f1_test':f1_test}
)



LogisticRegression full upsampling
train f1_score 0.9635815409799358
test f1_score 0.756711873789095
CPU times: user 22.9 s, sys: 32.9 s, total: 55.7 s
Wall time: 55.8 s


## LogisticRegression full spacy

In [50]:
data_spacy = pd.read_csv('toxic_comments_lemmatized2.csv')[['lemm_text', 'toxic']]

In [51]:
print_class_balance(data_spacy)

Total: 159571, toxic 16225 (10.17%), non-toxic 143346


In [52]:
data_spacy_train_full, data_spacy_valid = train_test_split(
    data_spacy, test_size = 0.1, random_state = 42
)

In [53]:
%%time

(X_train, X_test) = create_tf_idf(data_spacy_train_full, data_spacy_valid)

y_train = data_spacy_train_full['toxic']
y_test = data_spacy_valid['toxic']

Matrix shape train: (143613, 27657) test: (15958, 27657)
CPU times: user 21.3 s, sys: 0 ns, total: 21.3 s
Wall time: 21.4 s


In [54]:
%%time

model = LogisticRegression(class_weight='balanced', solver='lbfgs')
model.fit(X_train, y_train)

y_pred_train = model.predict(X_train)
y_pred_test = model.predict(X_test)

nick = 'LogisticRegression full spacy'
f1_train, f1_test = commit_result(nick, y_train, y_pred_train, y_test, y_pred_test)
result_final.append({
    'nick':nick, 'train_size':X_train.shape[0], 'test_size':X_test.shape[0],
    'f1_train':f1_train, 'f1_test':f1_test}
)



LogisticRegression full spacy
train f1_score 0.8067168298422475
test f1_score 0.7469421038325632
CPU times: user 16.3 s, sys: 26.7 s, total: 43 s
Wall time: 43.2 s


## Results

In [55]:
for row in result_final:
    print(row['nick'])
    print('f1_train', row['f1_train'])
    print('f1_test', row['f1_test'])
    print('train_size', row['train_size'])
    print('test_size', row['test_size'])
    print()

LogisticRegression full balanced
f1_train 0.9264588050862219
f1_test 0.6862213458220361
train_size 29300
test_size 15958

LogisticRegression full
f1_train 0.8003239241092086
f1_test 0.7518918918918919
train_size 143613
test_size 15958

LogisticRegression full upsampling
f1_train 0.9635815409799358
f1_test 0.756711873789095
train_size 257926
test_size 15958

LogisticRegression full spacy
f1_train 0.8067168298422475
f1_test 0.7469421038325632
train_size 143613
test_size 15958



# 3. Conclusions

Most of the effort went into fighting a dead kernel rather than the task itself. Still, an F1 score of 0.75 was reached! =)

### Fighting the dead kernel
**Most of the changes are in create_tf_idf(df_train, df_test)**<br>
Instead of ```corpus_train = df_train['lemm_text'].values.astype('U')``` I use ```corpus_train = df_train['lemm_text'].apply(lambda x: np.str_(x))```

I use TfidfVectorizer with the parameters ```TfidfVectorizer(min_df=5, max_df=0.7, stop_words=stopwords)```
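
A rough pure-Python stand-in for the vocabulary pruning these two parameters cause, assuming (as I read the scikit-learn documentation) that an integer `min_df` is an absolute document count and a float `max_df` is a share of documents. The function `filter_vocabulary` and the toy corpus are made up for illustration.

```python
from collections import Counter

def filter_vocabulary(docs, min_df=5, max_df=0.7):
    """Keep terms whose document frequency is >= min_df (absolute count)
    and <= max_df * n_docs (share of documents)."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))   # count each term once per document
    upper = max_df * len(docs)
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= upper)

# Toy corpus: 'edit' is in all 10 docs, 'revert' in 5, 'typo' in 1.
docs = ["edit revert"] * 5 + ["edit"] * 4 + ["edit typo"]
kept = filter_vocabulary(docs)
print(kept)  # ['revert']
```

Dropping rare terms in this way shrinks the matrix considerably, which probably helped with memory on the full dataset.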


**Lemmatization / stemming**<br>
I saved the dataset with the transformed column to a file and load only the required columns when using it<br>
```data = pd.read_csv('toxic_comments_lemmatized2.csv')[['lemm_text', 'toxic']]```

### Results
LogisticRegression full balanced
* f1_train 0.9264588050862219
* f1_test 0.6862213458220361
* train_size 29300
* test_size 15958

LogisticRegression full
* f1_train 0.8003239241092086
* f1_test 0.7518918918918919
* train_size 143613
* test_size 15958

LogisticRegression full upsampling
* f1_train 0.9635815409799358
* f1_test 0.756711873789095
* train_size 257926
* test_size 15958

LogisticRegression full spacy
* f1_train 0.8067168298422475
* f1_test 0.7469421038325632
* train_size 143613
* test_size 15958

### Summary
* I used the SnowballStemmer in every training run except one
* Dataset size and diversity turn out to matter more than class balance: *LogisticRegression full balanced* (test F1 0.686) vs *LogisticRegression full* (test F1 0.752). I think the size of the vectorized vocabulary produced by TfidfVectorizer is critical here (8600 vs 23047 features)
* SnowballStemmer gave slightly better results than spaCy (test F1 0.752 vs 0.747) with the same sample sizes and other parameters; perhaps I just do not know how to use spaCy properly
