<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preparation</a></span><ul class="toc-item"><li><span><a href="#Общий-обзор-данных" data-toc-modified-id="Общий-обзор-данных-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Data overview</a></span></li><li><span><a href="#Векторизация-текста" data-toc-modified-id="Векторизация-текста-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Text vectorization</a></span></li></ul></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training</a></span><ul class="toc-item"><li><span><a href="#Логистическая-регрессия" data-toc-modified-id="Логистическая-регрессия-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Logistic regression</a></span></li><li><span><a href="#Случайный-лес" data-toc-modified-id="Случайный-лес-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Random forest</a></span></li></ul></li><li><span><a href="#Выводы" data-toc-modified-id="Выводы-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Conclusions</a></span></li></ul></div>

## Preparation

### Data overview

In this step we get a general picture of the data: look at the table format and check for missing values and duplicates.

In [1]:
import pandas as pd
import numpy as np
import nltk
import re
import matplotlib.pyplot as plt
from tqdm import tqdm
import pymystem3
from pymystem3 import Mystem
import spacy


from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer 


from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer 
from nltk import pos_tag, word_tokenize
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\freak\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\freak\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\freak\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\freak\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [2]:
try:
    data = pd.read_csv('/datasets/toxic_comments.csv')
except FileNotFoundError:
    # Fall back to the local copy when the hosted path is unavailable
    data = pd.read_csv('C:\\Users\\freak\\Desktop\\Python\\ML_for_texts_project\\toxic_comments.csv')

In [3]:
data.head()

| | text | toxic |
|---|---|---|
| 0 | Explanation\nWhy the edits made under my usern... | 0 |
| 1 | D'aww! He matches this background colour I'm s... | 0 |
| 2 | Hey man, I'm really not trying to edit war. It... | 0 |
| 3 | "\nMore\nI can't make any real suggestions on ... | 0 |
| 4 | You, sir, are my hero. Any chance you remember... | 0 |


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159571 non-null  object
 1   toxic   159571 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


In [5]:
data.duplicated().sum()

0

In [6]:
data.isna().sum()

text     0
toxic    0
dtype: int64

The data contains no missing values and no exact duplicates.

### Text vectorization

In [8]:
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

In [9]:
%%time
wnl = WordNetLemmatizer()

def clear_text(text):
    # Keep Latin letters only, then lemmatize each token using its POS tag
    tokens = re.sub(r'[^a-zA-Z]', ' ', text).split()
    return " ".join(wnl.lemmatize(token, get_wordnet_pos(token)) for token in tokens)

Wall time: 0 ns
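To make the cleaning step concrete, the sketch below applies the same regular expression to the start of one comment from the preview above, without the lemmatization part. It uses only the standard library, and the variable names are illustrative.

```python
import re

# Beginning of a comment shown in the data preview above.
sample = "D'aww! He matches this background colour"

# Replace every character that is not a Latin letter with a space,
# then split on whitespace to get the tokens handed to the lemmatizer.
tokens = re.sub(r'[^a-zA-Z]', ' ', sample).split()
print(tokens)
# ['D', 'aww', 'He', 'matches', 'this', 'background', 'colour']
```

Apostrophes act as separators, so contractions break into fragments such as `D` and `aww`, and letter case is left unchanged at this stage.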


Add a column with the cleaned and lemmatized comments to the table; the models are trained on it.

In [10]:
%%time
data['lemm_text'] = data['text'].apply(clear_text)

Wall time: 59min 57s
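A large share of this run time probably goes to tagging and lemmatizing the same words over and over. Because `get_wordnet_pos` tags each word on its own, the result depends only on the word, so one possible speed-up is to memoize the per-word step. The sketch below illustrates the idea with a stand-in lemmatizer; `make_cached_cleaner` and `toy_lemmatize` are hypothetical names, and in the notebook the wrapped function would be something like `lambda w: wnl.lemmatize(w, get_wordnet_pos(w))`.

```python
import re
from functools import lru_cache

def make_cached_cleaner(lemmatize_word):
    """Return a clear_text-like function that lemmatizes each distinct word once."""
    cached = lru_cache(maxsize=None)(lemmatize_word)

    def clean(text):
        tokens = re.sub(r'[^a-zA-Z]', ' ', text).split()
        return " ".join(cached(token) for token in tokens)

    return clean, cached

calls = []

def toy_lemmatize(word):
    # Stand-in for the real lemmatizer: records each call and strips a final "s".
    calls.append(word)
    return word[:-1] if word.endswith("s") else word

clean, cached = make_cached_cleaner(toy_lemmatize)
result = clean("edits edits edits wars")
print(result)      # edit edit edit war
print(len(calls))  # 2, only the distinct words reached the lemmatizer
```

How much this helps on the real data was not measured here; it depends on how often words repeat across comments.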


In [11]:
data.head()

| | text | toxic | lemm_text |
|---|---|---|---|
| 0 | Explanation\nWhy the edits made under my usern... | 0 | Explanation Why the edits make under my userna... |
| 1 | D'aww! He matches this background colour I'm s... | 0 | D aww He match this background colour I m seem... |
| 2 | Hey man, I'm really not trying to edit war. It... | 0 | Hey man I m really not try to edit war It s ju... |
| 3 | "\nMore\nI can't make any real suggestions on ... | 0 | More I can t make any real suggestion on impro... |
| 4 | You, sir, are my hero. Any chance you remember... | 0 | You sir be my hero Any chance you remember wha... |


The data is ready for training.

## Training

In [12]:
train, test = train_test_split(data, random_state = 123, test_size = 0.4)
valid, test = train_test_split(test, random_state = 123, test_size = 0.5)

In [13]:
X_train = train['lemm_text']
X_valid = valid['lemm_text']
X_test = test['lemm_text']
y_train = train['toxic']
y_valid = valid['toxic']
y_test = test['toxic']

In [14]:
print('Training set')
print(X_train.shape)
print(y_train.shape)
print('________')
print('Test set')
print(X_test.shape)
print(y_test.shape)
print('________')
print('Validation set')
print(X_valid.shape)
print(y_valid.shape)

Training set
(95742,)
(95742,)
________
Test set
(31915,)
(31915,)
________
Validation set
(31914,)
(31914,)
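The sizes above follow from the two `test_size` fractions, which give roughly a 60/20/20 split. As a quick check, and assuming `train_test_split` rounds the held-out part up and leaves the remainder to the other part (the printed shapes are consistent with this), the arithmetic can be reproduced as follows:

```python
import math

n_total = 159571                      # rows reported by data.info()

n_holdout = math.ceil(n_total * 0.4)  # first split: 40% held out
n_train = n_total - n_holdout

n_test = math.ceil(n_holdout * 0.5)   # second split: half of the held-out rows
n_valid = n_holdout - n_test

print(n_train, n_valid, n_test)       # 95742 31914 31915
```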


In [15]:
nltk.download('stopwords')
stopwords = list(stopwords.words('english'))  # a list, because recent scikit-learn versions reject a set for stop_words

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\freak\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Logistic regression

**TfidfVectorizer**

In [16]:
count_tf_idf = TfidfVectorizer(min_df = 0.0001, stop_words=stopwords, ngram_range=(1, 3))
# Fit the vocabulary on the training set only and transform it in one pass
train_tfidf = count_tf_idf.fit_transform(X_train)
test_tfidf = count_tf_idf.transform(X_test)
valid_tfidf = count_tf_idf.transform(X_valid)

In [17]:
print('Training set')
print(train_tfidf.shape)
print('________')
print('Test set')
print(test_tfidf.shape)
print('________')
print('Validation set')
print(valid_tfidf.shape)

Training set
(95742, 45737)
________
Test set
(31915, 45737)
________
Validation set
(31914, 45737)
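Each of the 45,737 columns holds a TF-IDF weight. As an illustration of what such a weight is, the sketch below computes it by hand for three toy documents. It assumes the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is the variant the scikit-learn documentation describes as the `TfidfVectorizer` default. It is not a reimplementation: stop words, n-grams and `min_df` are left out, and all names are illustrative.

```python
import math
from collections import Counter

docs = [
    "you be my hero",
    "you be wrong",
    "stop the edit war",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents each term occurs in.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_vector(tokens):
    """Return {term: weight} using smoothed IDF and L2 normalization."""
    tf = Counter(tokens)
    raw = {t: count * (math.log((1 + n_docs) / (1 + df[t])) + 1)
           for t, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

vec = tfidf_vector(tokenized[0])
for term, weight in sorted(vec.items()):
    print(f"{term}: {weight:.3f}")
```

Terms that occur in fewer documents (`hero`, `my`) get a larger weight than terms shared across documents (`you`, `be`).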


In [18]:
regression = LogisticRegression(fit_intercept=True, 
                                class_weight='balanced', 
                                random_state=123,
                                solver='liblinear')
regression_parametrs = {'C': [0.1, 1, 10], 'max_iter':[1000,2000,100]}

regression_grid = GridSearchCV(regression, regression_parametrs, scoring='f1', cv=5)
regression_grid.fit(train_tfidf, y_train)

regression.fit(train_tfidf, y_train)

LogisticRegression(class_weight='balanced', random_state=123,
                   solver='liblinear')
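The models in this notebook are created with `class_weight='balanced'`. According to the scikit-learn documentation, this mode weights each class by `n_samples / (n_classes * count)`, so the rarer class contributes more to the loss. A minimal sketch of that formula on toy labels follows; the helper name is hypothetical and the 9:1 ratio is not the real class ratio of the dataset.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Toy labels with a 9:1 imbalance.
y_toy = [0] * 9 + [1] * 1
weights = balanced_class_weights(y_toy)
print(weights)
```

In this toy case the single positive example weighs nine times as much as each negative one.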

In [19]:
reg_params = regression_grid.best_params_
print(reg_params)

{'C': 10, 'max_iter': 1000}
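The cost of a grid search grows with the product of the grid sizes and the number of folds. The sketch below enumerates the grid used above with the standard library to show how many fits it implies; the variable names are illustrative.

```python
from itertools import product

param_grid = {'C': [0.1, 1, 10], 'max_iter': [1000, 2000, 100]}
cv_folds = 5

# Every combination of the listed values, as GridSearchCV would try them.
combinations = [dict(zip(param_grid, values))
                for values in product(*param_grid.values())]
n_fits = len(combinations) * cv_folds

print(len(combinations))  # 9
print(n_fits)             # 45
```

With the default `refit=True`, `GridSearchCV` also fits the best combination once more on the whole training set, so dropping one parameter from the grid cuts the number of fits roughly threefold here.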


In [20]:
model_lr = LogisticRegression(fit_intercept=True, class_weight='balanced', n_jobs=-1, max_iter=reg_params['max_iter'], C=reg_params['C'], random_state=123).fit(train_tfidf, y_train)
y_pred = model_lr.predict(valid_tfidf)
lr_f1 = round(f1_score(y_valid, y_pred), 3)

In [21]:
print(lr_f1)

0.748
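For reference, F1 is the harmonic mean of precision and recall for the positive class (here, toxic comments). The sketch below computes it from raw counts on toy labels that are not taken from the dataset; `f1_from_labels` is an illustrative helper, not a replacement for `f1_score`.

```python
def f1_from_labels(y_true, y_pred):
    """F1 for the positive class (label 1) computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels: 3 toxic comments, 2 of them found, plus 1 false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
score = f1_from_labels(y_true, y_pred)
print(round(score, 3))  # 0.667
```

True negatives do not enter the formula, which is why F1 suits a task where the positive class is the rare one.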


**CountVectorizer**

In [22]:
count_vect = CountVectorizer(min_df = 0.0001, stop_words=stopwords, ngram_range=(1, 3))
n_gramm_train = count_vect.fit_transform(X_train)
n_gramm_test = count_vect.transform(X_test)
n_gramm_valid = count_vect.transform(X_valid)

In [23]:
print('Training set')
print(n_gramm_train.shape)
print('________')
print('Test set')
print(n_gramm_test.shape)
print('________')
print('Validation set')
print(n_gramm_valid.shape)

Training set
(95742, 45737)
________
Test set
(31915, 45737)
________
Validation set
(31914, 45737)
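Both vectorizers use the same `min_df`, stop-word list and `ngram_range=(1, 3)`, which is consistent with the identical 45,737 columns above: the vocabulary is the same and only the cell values differ (raw counts here, TF-IDF weights before). The sketch below shows what word n-grams of length 1 to 3 look like for a short token list; `word_ngrams` is an illustrative helper, not the library's implementation.

```python
def word_ngrams(tokens, ngram_range=(1, 3)):
    """List all contiguous word n-grams with lengths in ngram_range (inclusive)."""
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

features = word_ngrams("stop edit war".split())
print(features)
# ['stop', 'edit', 'war', 'stop edit', 'edit war', 'stop edit war']
```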


In [24]:
%%time
regression = LogisticRegression(fit_intercept=True, 
                                class_weight='balanced', 
                                random_state=123,
                                solver='liblinear'
                               )
regression_parametrs = {'C': [0.1, 1, 10]}  # max_iter is left out of the grid: with it the cell ran for over an hour

regression_grid = GridSearchCV(regression, regression_parametrs, scoring='f1', cv=5)
regression_grid.fit(n_gramm_train, y_train)

regression.fit(n_gramm_train, y_train)



Wall time: 2min 18s




LogisticRegression(class_weight='balanced', random_state=123,
                   solver='liblinear')

In [25]:
reg_params = regression_grid.best_params_

In [26]:
model_lr2 = LogisticRegression(class_weight='balanced', n_jobs=-1, C=reg_params['C'], max_iter=1500, random_state=123).fit(n_gramm_train, y_train)
y_pred2 = model_lr2.predict(n_gramm_valid)
lr2_f1 = round(f1_score(y_valid, y_pred2), 3)

In [27]:
print(lr2_f1)

0.744


### Random forest

Train a random forest model, tuning its hyperparameters with GridSearchCV.

**TfidfVectorizer**

In [28]:
forest = RandomForestClassifier(class_weight='balanced', n_jobs=-1, random_state=123)
forest_parametrs = { 'n_estimators': range(20, 40, 5),
                     'max_depth': range(4, 8, 2)}
forest_grid = GridSearchCV(forest, forest_parametrs, scoring='f1', cv=3)
forest_grid.fit(train_tfidf, y_train)

GridSearchCV(cv=3,
             estimator=RandomForestClassifier(class_weight='balanced',
                                              n_jobs=-1, random_state=123),
             param_grid={'max_depth': range(4, 8, 2),
                         'n_estimators': range(20, 40, 5)},
             scoring='f1')

In [29]:
forest_params = forest_grid.best_params_
print(forest_params)

{'max_depth': 6, 'n_estimators': 35}


In [30]:
forest_model = RandomForestClassifier(random_state=123, n_jobs=-1, class_weight='balanced',
                                     max_depth=forest_params['max_depth'],
                                     n_estimators = forest_params['n_estimators'])

forest_model.fit(train_tfidf, y_train)

In [31]:
forest_predictions = forest_model.predict(valid_tfidf)
forest_f1 =  round(f1_score(y_valid, forest_predictions), 3)
print(forest_f1)

0.32


**CountVectorizer**

In [32]:
forest2 = RandomForestClassifier(class_weight='balanced', n_jobs=-1, random_state=123)
forest2_parametrs = { 'n_estimators': range(20, 40, 5),
                     'max_depth': range(4, 8, 2)}

forest2_grid = GridSearchCV(forest2, forest2_parametrs, scoring='f1', cv=3)
forest2_grid.fit(n_gramm_train, y_train)

GridSearchCV(cv=3,
             estimator=RandomForestClassifier(class_weight='balanced',
                                              n_jobs=-1, random_state=123),
             param_grid={'max_depth': range(4, 8, 2),
                         'n_estimators': range(20, 40, 5)},
             scoring='f1')

In [33]:
forest2_params = forest2_grid.best_params_
print(forest2_params)

{'max_depth': 6, 'n_estimators': 35}


In [34]:
forest2_model = RandomForestClassifier(random_state=123, n_jobs=-1, class_weight='balanced',
                                     max_depth=forest2_params['max_depth'],
                                     n_estimators = forest2_params['n_estimators'])

forest2_model.fit(n_gramm_train, y_train)

In [35]:
forest2_predictions = forest2_model.predict(n_gramm_valid)
forest2_f1 =  round(f1_score(y_valid, forest2_predictions), 3)
print(forest2_f1)

0.322


## Conclusions

**Predictions on the test set**

In [36]:
logreg_tfidf_preds_test = model_lr.predict(test_tfidf)
logreg_ngramm_preds_test = model_lr2.predict(n_gramm_test)

In [37]:
rand_forest_tfidf_preds_test = forest_model.predict(test_tfidf)
rand_forest_ngramm_preds_test = forest2_model.predict(n_gramm_test)

**F1 on the test set**

In [38]:
logreg_tfidf_f1_test = round(f1_score(y_test, logreg_tfidf_preds_test), 3)
logreg_ngramm_f1_test = round(f1_score(y_test, logreg_ngramm_preds_test), 3)

In [39]:
forest_tfidf_f1_test = round(f1_score(y_test, rand_forest_tfidf_preds_test), 3)
forest_ngramm_f1_test = round(f1_score(y_test, rand_forest_ngramm_preds_test), 3)

Build a table with the models' results:

In [40]:
results = pd.DataFrame(columns = ['LogReg TF-IDF','LogReg CountVectorizer'], index = ['F1'])
results.iloc[0] = [logreg_tfidf_f1_test, logreg_ngramm_f1_test]

In [41]:
results

| | LogReg TF-IDF | LogReg CountVectorizer |
|---|---|---|
| F1 | 0.751 | 0.758 |


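**Logistic regression with CountVectorizer showed the best result on the test set: F1 = 0.758, against 0.751 with TF-IDF.**

Random forest scored about 0.32 on the validation set with both vectorizers and is not included in the table.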