The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. You have a dataset in which edits are labeled for toxicity.

Build a model with an *F1* score of at least 0.75.

### Project instructions

1. Load and prepare the data.
2. Train several models.
3. Draw conclusions.

### Data description

The data is stored in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target.

# 1. Preparation

In [1]:
import pandas as pd
from tqdm import notebook
import nltk
import re
import numpy as np


df = pd.read_csv('../datasets/toxic_comments.csv')

In [2]:
df.head()

,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
text     159571 non-null object
toxic    159571 non-null int64
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


In [4]:
df.isna().sum(), \
df.duplicated().sum()

(text     0
 toxic    0
 dtype: int64, 0)

In [5]:
import seaborn as sns

# Class balance of the target; pass the column by keyword, as recent seaborn versions require
sns.countplot(x=df['toxic'])

<matplotlib.axes._subplots.AxesSubplot at 0x26f30da0388>

Split the data into train and test sets.

In [6]:
from sklearn.model_selection import train_test_split

X = df[['text']]
y = df['toxic']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [7]:
X_train.shape ,X_test.shape

((127656, 1), (31915, 1))
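
Because the toxic class is a small minority (3,244 of the 31,915 test comments, about 10%), a stratified split is one option for keeping that share the same in both parts. The sketch below is an illustration on synthetic labels, not the split used in this notebook; the toy data and the `stratify` argument are assumptions added here.

```python
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 90 non-toxic and 10 toxic comments
texts = [f"comment {i}" for i in range(100)]
labels = [0] * 90 + [1] * 10

# stratify=labels keeps the 10% toxic share in both the train and test parts
texts_train, texts_test, labels_train, labels_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Share of toxic comments in each part
print(sum(labels_train) / len(labels_train))
print(sum(labels_test) / len(labels_test))
```

Without `stratify`, the toxic share in the test set depends on the random seed, which can shift F1 slightly between runs.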

# 2. Training


### TF-IDF 

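For training we will use sklearn pipelines that chain the following steps:
* Tokenization with NLTK's `TweetTokenizer` (lemmatization with POS tags was planned, but the code below does not perform it)
* Vectorization (the code below uses `CountVectorizer`, that is, raw token counts rather than TF-IDF weights)
* Classification with `LogisticRegression`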
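
As a point of comparison, a TF-IDF variant of such a pipeline could look like the sketch below. It is an illustration on a hand-written toy corpus, not the pipeline trained in this notebook; the sample texts, the variable names, and `max_iter=1000` are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus (assumption): 0 = non-toxic, 1 = toxic
texts = [
    "thank you for the helpful edit",
    "you are a stupid idiot",
    "great article, well sourced",
    "shut up you idiot",
]
labels = [0, 1, 0, 1]

tfidf_pipeline = Pipeline([
    # Turns each text into a vector of TF-IDF weights
    ("vectorizer", TfidfVectorizer(lowercase=True)),
    # A larger max_iter gives the solver more room to converge
    ("classifier", LogisticRegression(max_iter=1000)),
])
tfidf_pipeline.fit(texts, labels)

# One row per text, one column per vocabulary term
features = tfidf_pipeline.named_steps["vectorizer"].transform(texts)
print(features.shape)

# The pipeline applies the same vectorizer to new text before classifying
prediction = tfidf_pipeline.predict(["what an idiot"])
print(prediction)
```

TF-IDF downweights terms that occur in many documents, so frequent neutral words contribute less than rare, informative ones.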

In [8]:
from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tweet-aware tokenizer: lowercases text and shortens runs of 3+ repeated characters
tokenizer = TweetTokenizer(preserve_case=False, reduce_len=True)
# Bag-of-words token counts built with the custom tokenizer
count_vect = CountVectorizer(tokenizer=tokenizer.tokenize)
classifier = LogisticRegression()

In [9]:
sentiment_pipeline = Pipeline([
        ('vectorizer', count_vect),
        ('classifier', classifier)
    ])

In [10]:
sentiment_pipeline.fit(X_train['text'],y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html.
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Pipeline(memory=None,
         steps=[('vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<bound method T...f <nltk.tokenize.casual.TweetTokenizer object at 0x0000026F2D023148>>,
                                 vocabulary=None)),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                               

In [11]:
predicts = sentiment_pipeline.predict(X_test['text'])

In [48]:
from sklearn import metrics
# f1_score expects (y_true, y_pred)
print(metrics.classification_report(y_test, predicts), metrics.f1_score(y_test, predicts))

              precision    recall  f1-score   support

           0       0.97      0.99      0.98     28671
           1       0.87      0.70      0.78      3244

    accuracy                           0.96     31915
   macro avg       0.92      0.85      0.88     31915
weighted avg       0.96      0.96      0.96     31915
 0.7756421160061234


F1 for the toxic class is 0.78, above the required 0.75. Next we try to improve it by tuning the model.
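
As a sanity check, F1 is the harmonic mean of precision and recall, so the score for the toxic class can be recomputed by hand from the rounded values in the report above (precision 0.87, recall 0.70). The helper name below is hypothetical.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall for class 1 (toxic) from the report
f1_toxic = f1_from_precision_recall(0.87, 0.70)
print(round(f1_toxic, 2))  # 0.78
```

The small gap to the exact value printed by `f1_score` comes from using precision and recall rounded to two decimals.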

In [19]:
sentiment_pipeline

Pipeline(memory=None,
         steps=[('vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<bound method T...f <nltk.tokenize.casual.TweetTokenizer object at 0x0000026F2D023148>>,
                                 vocabulary=None)),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                               

In [18]:
from sklearn.model_selection import GridSearchCV

In [39]:
# 'auto' is not a valid class_weight in current scikit-learn; None means no reweighting
parameters = {'vectorizer__ngram_range': [(1, 1), (1, 2)],
              'vectorizer__min_df': [0.01, 0.1, 0.2, 1],
              'vectorizer__max_features': [None, 1000, 10000],
              'classifier__class_weight': [None, 'balanced']}

In [50]:
gs_clf = GridSearchCV(sentiment_pipeline, parameters, n_jobs=4, scoring='f1')
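
Parameter names in the grid follow the `<step name>__<parameter>` convention, which is how `GridSearchCV` routes each value to the right pipeline step. A minimal sketch on a toy corpus, assuming two-fold cross-validation and a much smaller grid than the one above; the texts and variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus (assumption): 0 = non-toxic, 1 = toxic
texts = [
    "thank you for the helpful edit",
    "you are a stupid idiot",
    "great article, well sourced",
    "shut up you idiot",
    "nice work on the references",
    "nobody wants your stupid opinion",
    "i agree with this change",
    "get lost you moron",
]
labels = [0, 1, 0, 1, 0, 1, 0, 1]

pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# "vectorizer__..." targets the first step, "classifier__..." the second
param_grid = {
    "vectorizer__ngram_range": [(1, 1), (1, 2)],
    "classifier__class_weight": [None, "balanced"],
}

search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=2)
search.fit(texts, labels)

# 2 x 2 = 4 parameter combinations are evaluated
n_candidates = len(search.cv_results_["params"])
print(n_candidates)
print(search.best_params_)
```

The number of fits grows as the product of the grid sizes times the number of folds, which is why the search in this notebook is run on a subset.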

In [51]:
# Fit on a small subset to speed up the search

gs_clf = gs_clf.fit(X_train['text'][:4000], y_train[:4000])

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html.
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [52]:
gs_clf.best_estimator_

Pipeline(memory=None,
         steps=[('vectorizer',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<bound method T....tokenize.casual.TweetTokenizer object at 0x0000026F52FDAD48>>,
                                 vocabulary=None)),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight='balanced', dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                

In [53]:
g_predicts = gs_clf.predict(X_test['text'])

In [54]:
# The f1-score column for class 1 is the F1 of the tuned model on the test set
print(metrics.classification_report(y_test, g_predicts))

              precision    recall  f1-score   support

           0       0.96      0.96      0.96     28671
           1       0.65      0.63      0.64      3244

    accuracy                           0.93     31915
   macro avg       0.81      0.79      0.80     31915
weighted avg       0.93      0.93      0.93     31915


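The tuned model did not improve: on the test set, F1 for the toxic class dropped from 0.78 to 0.64. One likely reason is that the search, and the refit of `best_estimator_`, used only the first 4,000 training comments. The selected parameters, including `class_weight='balanced'`, still suit our task.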
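
`GridSearchCV` refits `best_estimator_` only on the data passed to `fit` (4,000 comments here), so one possible follow-up is to copy the selected configuration and train it on the full training set. A minimal sketch on toy data, assuming `sklearn.base.clone` is used to get an unfitted copy; the texts and variable names are illustrative, and `subset_model` merely stands in for `gs_clf.best_estimator_`.

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus (assumption): 0 = non-toxic, 1 = toxic
full_texts = [
    "thank you for the helpful edit",
    "you are a stupid idiot",
    "great article, well sourced",
    "shut up you idiot",
    "nice work on the references",
    "nobody wants your stupid opinion",
    "i agree with this change",
    "get lost you moron",
]
full_labels = [0, 1, 0, 1, 0, 1, 0, 1]

# Stand-in for the best estimator, fitted on a small subset only
subset_model = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
subset_model.fit(full_texts[:4], full_labels[:4])

# clone() copies the hyperparameters but not the fitted state
full_model = clone(subset_model)
full_model.fit(full_texts, full_labels)

# The model trained on all data has seen a larger vocabulary
subset_vocab = len(subset_model.named_steps["vectorizer"].vocabulary_)
full_vocab = len(full_model.named_steps["vectorizer"].vocabulary_)
print(subset_vocab, full_vocab)
```

Whether this recovers or exceeds the baseline F1 would have to be checked on the test set; the sketch only shows the mechanics.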