

# Toxic comment detection



Train a model to classify comments as positive or negative. You are given a dataset of edits labeled for toxicity.

Build a model with an *F1* score of at least 0.75.

## Preparation

In [None]:
! pip install catboost

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import pandas as pd

import nltk
from nltk.stem import WordNetLemmatizer 
from nltk.corpus import stopwords
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

import re
import string

from sklearn.pipeline import Pipeline

from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier 
from sklearn.ensemble import RandomForestClassifier 

import lightgbm as lgb

from xgboost import XGBClassifier 

from catboost import Pool, cv
from catboost import CatBoostClassifier

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold

from sklearn.metrics import make_scorer
from sklearn.metrics import f1_score

from tqdm import notebook

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
import warnings
warnings.filterwarnings('ignore')
pd.options.mode.chained_assignment = None

In [None]:
df_tweets = pd.read_csv('/content/toxic_comments.csv')

In [None]:
df_tweets.head()

| | text | toxic |
|---|---|---|
| 0 | Explanation\nWhy the edits made under my usern... | 0 |
| 1 | D'aww! He matches this background colour I'm s... | 0 |
| 2 | Hey man, I'm really not trying to edit war. It... | 0 |
| 3 | "\nMore\nI can't make any real suggestions on ... | 0 |
| 4 | You, sir, are my hero. Any chance you remember... | 0 |


**Data description**

'text' - comment text

'toxic' - target feature

In [None]:
df_tweets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159571 non-null  object
 1   toxic   159571 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


In [None]:
df_tweets.duplicated().sum()

0

In [None]:
df_tweets['toxic'].value_counts(normalize=True)

0    0.898321
1    0.101679
Name: toxic, dtype: float64

The classes are imbalanced: only about 10% of the comments are labeled toxic.
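
With roughly 10% positives, accuracy alone would be misleading, which is presumably why the task is stated in terms of F1. The sketch below uses hypothetical toy labels with the same positive rate; `f1_binary` is an illustrative helper, not part of the notebook. It shows that a classifier that always predicts "not toxic" scores 0.9 accuracy but 0.0 F1.

```python
def f1_binary(y_true, y_pred):
    """F1 for the positive class (label 1), computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical toy labels: 10 toxic, 90 non-toxic
y_true = [1] * 10 + [0] * 90
always_negative = [0] * 100  # a "model" that never flags anything

accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
print(accuracy)                            # 0.9
print(f1_binary(y_true, always_negative))  # 0.0
```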

---------

Clean the texts of special characters, extra whitespace, and single letters.

In [None]:
df_tweets['text'][0]

"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27"

In [None]:
text_clear = df_tweets['text'].apply(
    lambda x: ' '.join(re.sub(r'\s+[a-zA-Z]\s', ' ', str(x)).split()))

In [None]:
text_clear[0]

"Explanation Why the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27"

In [None]:
text_clear = text_clear.apply(
    lambda x: ' '.join(re.sub(r'[^a-zA-Z]', ' ', str(x).lower()).split()))

In [None]:
text_clear[0]

'explanation why the edits made under my username hardcore metallica fan were reverted they weren t vandalisms just closure on some gas after voted at new york dolls fac and please don t remove the template from the talk page since i m retired now'

In [None]:
text_clear = text_clear.apply(
    lambda x: ' '.join(re.sub(r'\s+[a-zA-Z]\s', ' ', str(x)).split()))

In [None]:
text_clear[0]

'explanation why the edits made under my username hardcore metallica fan were reverted they weren vandalisms just closure on some gas after voted at new york dolls fac and please don remove the template from the talk page since m retired now'
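
Note that a stray `m` survives in the output above ("since m retired now"). The pattern `\s+[a-zA-Z]\s` consumes the whitespace after the letter it removes, so in a run such as `i m` only the first letter is matched. A minimal alternative sketch, assuming it is acceptable to drop every one-letter token, filters by token length instead; `clean_text` is a hypothetical helper name, not the notebook's method.

```python
import re

def clean_text(text):
    """Lowercase, keep only Latin letters, and drop one-letter tokens."""
    # Replace every non-letter character with a space
    letters_only = re.sub(r'[^a-zA-Z]', ' ', str(text).lower())
    # split() collapses repeated whitespace; the filter drops single letters
    return ' '.join(w for w in letters_only.split() if len(w) > 1)

sample = "since I'm retired now.89.205.38.27"
print(clean_text(sample))  # since retired now
```

It could be applied in one pass with `df_tweets['text'].apply(clean_text)` in place of the three separate steps.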

-----

Tokenize each text.

In [None]:
tokenized = text_clear.apply(lambda x: nltk.word_tokenize(x))

In [None]:
tokenized

0         [explanation, why, the, edits, made, under, my...
1         [d, aww, he, matches, this, background, colour...
2         [hey, man, m, really, not, trying, to, edit, w...
3         [more, can, make, any, real, suggestions, on, ...
4         [you, sir, are, my, hero, any, chance, you, re...
                                ...                        
159566    [and, for, the, second, time, of, asking, when...
159567    [you, should, be, ashamed, of, yourself, that,...
159568    [spitzer, umm, theres, no, actual, article, fo...
159569    [and, it, looks, like, it, was, actually, you,...
159570    [and, really, don, think, you, understand, cam...
Name: text, Length: 159571, dtype: object

--------

Lemmatize the words.

In [None]:
lemmatizer = WordNetLemmatizer()

In [None]:
lemmatized = tokenized.apply(
    lambda x: ' '.join([lemmatizer.lemmatize(w) for w in x]))

In [None]:
lemmatized

0         explanation why the edits made under my userna...
1         d aww he match this background colour m seemin...
2         hey man m really not trying to edit war it jus...
3         more can make any real suggestion on improveme...
4         you sir are my hero any chance you remember wh...
                                ...                        
159566    and for the second time of asking when your vi...
159567    you should be ashamed of yourself that is horr...
159568    spitzer umm there no actual article for prosti...
159569    and it look like it wa actually you who put on...
159570    and really don think you understand came here ...
Name: text, Length: 159571, dtype: object

--------

Split the data into training and test sets.

In [None]:
features = lemmatized

In [None]:
target = df_tweets['toxic']

In [None]:
features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.2, random_state=12345, stratify = target) 

In [None]:
features.shape, features_train.shape, features_test.shape

((159571,), (127656,), (31915,))

In [None]:
target.shape, target_train.shape, target_test.shape

((159571,), (127656,), (31915,))
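
The `stratify` argument matters here because of the class imbalance: it keeps the share of toxic comments the same in both parts. A small sketch on a hypothetical toy target (the arrays `X` and `y` below are made up for illustration) shows the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy target with a 10% positive rate
y = np.array([1] * 100 + [0] * 900)
X = np.arange(len(y))  # placeholder features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345, stratify=y)

# Both parts keep the original 10% share of positives
print(y_train.mean(), y_test.mean())
```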

-------

## Training

In [None]:
stop_words = stopwords.words('english')

In [None]:
scoring = make_scorer(f1_score, greater_is_better=True)

In [None]:
skf = StratifiedKFold(n_splits=5)

In [None]:
df = [] # rows of the results table
col_data = ['model', 'type_data', 'F1']

### LogisticRegression

In [None]:
pipe_lr = Pipeline([('tfidf', TfidfVectorizer(stop_words=stop_words)), # build TF-IDF features
                    ('model', LogisticRegression(random_state=12345))])

In [None]:
params = {'model__penalty' : ['l2'], #'l1', 
          'model__C' : [1, 10, 50],#[0.001,0.01,0.1,1,10,100]
          'model__solver' : ['lbfgs', 'sag', 'saga']
}

In [None]:
model = GridSearchCV(
    pipe_lr, params, cv=skf, scoring=scoring, verbose=1)

In [None]:
model.fit(features_train, target_train)

Fitting 5 folds for each of 9 candidates, totalling 45 fits


GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=None, shuffle=False),
             estimator=Pipeline(steps=[('tfidf',
                                        TfidfVectorizer(stop_words=['i', 'me',
                                                                    'my',
                                                                    'myself',
                                                                    'we', 'our',
                                                                    'ours',
                                                                    'ourselves',
                                                                    'you',
                                                                    "you're",
                                                                    "you've",
                                                                    "you'll",
                                                                    "you'd",
                 

In [None]:
model.best_params_

{'model__C': 10, 'model__penalty': 'l2', 'model__solver': 'lbfgs'}

In [None]:
model.best_score_

0.7702656433820121

In [None]:
df.append(['LogisticRegression', 'train', model.best_score_])

-------

### Decision tree

In [None]:
best_model_DT = None
best_f1 = 0
best_depth = 0

for depth in notebook.tqdm(range(1, 100, 5)): # vary the hyperparameter: maximum tree depth
    
    pipe_dt = Pipeline([('tfidf', TfidfVectorizer(stop_words=stop_words)), 
                        ('model', DecisionTreeClassifier(random_state=12345, 
                                                         max_depth=depth))])

    f1 = sum(cross_val_score(pipe_dt, features_train, target_train, 
            scoring=scoring, cv=skf)) / 5

    if f1 > best_f1:
        best_model_DT = pipe_dt[1]
        best_f1 = f1
        best_depth = depth
        
print("Decision tree model, F1:", best_f1, "tree depth:", best_depth)

  0%|          | 0/20 [00:00<?, ?it/s]

Decision tree model, F1: 0.7206939912383155 tree depth: 96


In [None]:
df.append(['DecisionTreeClassifier', 'train', best_f1])

------

### Random forest

In [None]:
best_modelRF = None
best_depthRF = 0
best_est = 0
best_f1 = 0

for est in notebook.tqdm(range(1, 4, 2)): # vary the hyperparameter: number of trees
    for depth in notebook.tqdm(range(1, 80, 2)): # vary the hyperparameter: maximum tree depth

        pipe_rf = Pipeline([('tfidf', TfidfVectorizer(stop_words=stop_words)), 
                            ('model', RandomForestClassifier(random_state=12345, n_estimators=est, max_depth=depth))])

        f1 = sum(cross_val_score(
            pipe_rf, features_train, target_train, 
            scoring=scoring, cv=skf)) / 5

        if f1 > best_f1:
            best_modelRF = pipe_rf[1]
            best_depthRF = depth
            best_est = est
            best_f1 = f1

            
print("Random forest model, F1:", best_f1, 
      ", number of trees:", best_est, 
      ", tree depth:", best_depthRF)

  0%|          | 0/2 [00:00<?, ?it/s]

  0%|          | 0/40 [00:00<?, ?it/s]

  0%|          | 0/40 [00:00<?, ?it/s]

Random forest model, F1: 0.35488718721489987 , number of trees: 1 , tree depth: 79


Random forest model, F1: 0.31033143076975883 , number of trees: 1 , tree depth: 75

In [None]:
df.append(['RandomForestClassifier', 'train', best_f1])

------

### XGBoost

In [None]:
#model_ = XGBClassifier()
pipe_xgb = Pipeline([('tfidf', TfidfVectorizer(stop_words=stop_words)), 
                     ('model', XGBClassifier())])

In [None]:
f1 = sum(cross_val_score(pipe_xgb, features_train, target_train, 
            scoring=scoring, cv=skf)) / 5

In [None]:
f1

0.570393403674864

In [None]:
df.append(['XGBClassifier', 'train', f1])

------

### LightGBM

In [None]:
#model_ = lgb.LGBMClassifier()
pipe_lgb = Pipeline([('tfidf', TfidfVectorizer(stop_words=stop_words)), 
                     ('model', lgb.LGBMClassifier())])

In [None]:
f1 = sum(cross_val_score(pipe_lgb, features_train, target_train, 
            scoring=scoring, cv=skf)) / 5

In [None]:
f1

0.7476268472334835

In [None]:
df.append(['LGBMClassifier', 'train', f1])

-----------

In [None]:
df_result = pd.DataFrame(data=df, columns=col_data) # results table

In [None]:
df_result

| | model | type_data | F1 |
|---|---|---|---|
| 0 | LogisticRegression | train | 0.770266 |
| 1 | DecisionTreeClassifier | train | 0.720694 |
| 2 | RandomForestClassifier | train | 0.354887 |
| 3 | XGBClassifier | train | 0.570393 |
| 4 | LGBMClassifier | train | 0.747627 |


Logistic regression achieved the best cross-validated F1 on the training set: 0.770266.

--------

## Testing

We test the logistic regression model, which achieved the best F1 on the training set.

In [None]:
df.append(['-', '-', '-'])

In [None]:
model

GridSearchCV(cv=StratifiedKFold(n_splits=5, random_state=None, shuffle=False),
             estimator=Pipeline(steps=[('tfidf',
                                        TfidfVectorizer(stop_words=['i', 'me',
                                                                    'my',
                                                                    'myself',
                                                                    'we', 'our',
                                                                    'ours',
                                                                    'ourselves',
                                                                    'you',
                                                                    "you're",
                                                                    "you've",
                                                                    "you'll",
                                                                    "you'd",
                 

In [None]:
predictions = model.predict(features_test) # predictions on the test set

In [None]:
f1_score(target_test, predictions) # F1 on the test set

0.7774327122153211

In [None]:
df.append(['LogisticRegression', 'test', f1_score(target_test, predictions)])

In [None]:
df_result = pd.DataFrame(data=df, columns=col_data) # results table

In [None]:
df_result

| | model | type_data | F1 |
|---|---|---|---|
| 0 | LogisticRegression | train | 0.770266 |
| 1 | DecisionTreeClassifier | train | 0.720694 |
| 2 | RandomForestClassifier | train | 0.354887 |
| 3 | XGBClassifier | train | 0.570393 |
| 4 | LGBMClassifier | train | 0.747627 |
| 5 | - | - | - |
| 6 | LogisticRegression | test | 0.777433 |


On the test set, the logistic regression model reached an F1 of 0.777433.

-------

## Conclusions

* The comment texts were preprocessed for model training: special characters, extra whitespace, single letters, and stop words were removed. Tokenization and lemmatization were performed.

* Features were built with TfidfVectorizer.

* Five models were built: logistic regression, decision tree, random forest, XGBClassifier, and LGBMClassifier.

* The task required a model with an F1 score of at least 0.75. On the training set, logistic regression achieved the best cross-validated F1: 0.770266.

* The logistic regression model, which achieved the best F1 on the training set, was evaluated on the test set and reached an F1 of 0.777433.
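
For reference, the selected configuration can be sketched end to end on a tiny corpus. The texts and labels below are hypothetical, and scikit-learn's built-in `'english'` stop-word list is assumed in place of the NLTK list so that the example needs no downloads. Keeping the vectorizer inside the `Pipeline` means the TF-IDF vocabulary is fitted only on the data passed to `fit`, which is what lets cross-validation refit it on each training fold.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; 1 = toxic, 0 = not toxic
texts = ["you are an idiot", "thank you for the edit",
         "stupid idiot go away", "nice article thanks",
         "what an idiot", "thanks for the help"]
labels = [1, 0, 1, 0, 1, 0]

pipe = Pipeline([
    # Assumption: sklearn's built-in stop-word list instead of NLTK's
    ('tfidf', TfidfVectorizer(stop_words='english')),
    # Best parameters reported by the grid search above
    ('model', LogisticRegression(C=10, penalty='l2', solver='lbfgs',
                                 random_state=12345)),
])
pipe.fit(texts, labels)

predictions = pipe.predict(["idiot", "thanks"])
print(predictions)
```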

------