# Detecting Toxic Comments

The online store Wikishop is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on the changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

The task is to train a model that classifies comments as positive or negative. A dataset labelled for the toxicity of edits is available.

The model must reach an *F1* score of at least 0.75.

Plan:

1. Load and prepare the data.
2. Train several models.
3. Draw conclusions.

**Data description**

The data is stored in the file `toxic_comments.csv`. Its *text* column holds the comment text in English, and *toxic* is the target feature.

In [1]:
#!pip install pattern

In [2]:
from pattern.en import lemma

import pandas as pd
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords 
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

import warnings
warnings.filterwarnings("ignore")

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Vasekk\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Vasekk\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Vasekk\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


## Preparation
Plan
1. load and inspect the data
2. lemmatization
3. removal of extraneous characters
4. stop-word removal
5. split into training and test sets
6. TF-IDF computation
7. feature preparation

In [3]:
df = pd.read_csv('f:/yandex-practicum/datasets/toxic_comments.csv')
df

Unnamed: 0.1,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0
...,...,...,...
159287,159446,""":::::And for the second time of asking, when ...",0
159288,159447,You should be ashamed of yourself \n\nThat is ...,0
159289,159448,"Spitzer \n\nUmm, theres no actual article for ...",0
159290,159449,And it looks like it was actually you who put ...,0


In [4]:
# Lemmatization and character-cleaning functions


# Takes a text and returns it with only letters and apostrophes kept;
# split and join remove the extra spaces between word lemmas;
# lemmas come from the `lemma` function of the pattern library
def lemm_clear_text(text):
    return " ".join(lemma(word) for word in re.sub(r"[^a-zA-Z']", ' ', text).split())


lemmatizer = WordNetLemmatizer()

def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

def lemm_clear_text_nltk(text):
    return " ".join(lemmatizer.lemmatize(word, get_wordnet_pos(word)) for word in re.sub(r"[^a-zA-Z']", ' ', text).split())
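To see what the regular-expression step does on its own, apart from lemmatization, here is a minimal sketch. The helper name `clean_text` and the sample string are made up for illustration and are not part of the notebook.

```python
import re

def clean_text(text):
    # Replace every character that is not a Latin letter or an apostrophe
    # with a space, then split on whitespace and re-join so that runs of
    # spaces collapse into one.
    return " ".join(re.sub(r"[^a-zA-Z']", " ", text).split())

sample = "Hey man,\nI'm really   not trying to edit-war (2 times)!"
print(clean_text(sample))
```

Digits, punctuation, and line breaks all become word separators, while apostrophes survive so that contractions stay intact for later handling.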

In [5]:
# Ready-made dictionary of contractions
contractions_dict = { "ain't": "are not", "'s":" is", "aren't": "are not", "can't": "cannot", 
                     "can't've": "cannot have", "‘cause": "because", "could've": "could have", 
                     "couldn't": "could not", "couldn't've": "could not have", "didn't": "did not", 
                     "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hadn't've": "had not have", 
                     "hasn't": "has not", "haven't": "have not", "he'd": "he would", "he'd've": "he would have", 
                     "he'll": "he will", "he'll've": "he will have", "how'd": "how did", "how'd'y": "how do you", 
                     "how'll": "how will", "I'd": "I would", "I'd've": "I would have", "I'll": "I will", 
                     "I'll've": "I will have", "I'm": "I am", "I've": "I have", "isn't": "is not", "it'd": "it would", 
                     "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have", "let's": "let us", 
                     "ma'am": "madam", "mayn't": "may not", "might've": "might have", "mightn't": "might not", 
                     "mightn't've": "might not have", "must've": "must have", "mustn't": "must not", 
                     "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have", 
                     "o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", 
                     "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", 
                     "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", 
                     "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", 
                     "so've": "so have", "that'd": "that would", "that'd've": "that would have", "there'd": "there would", 
                     "there'd've": "there would have", "they'd": "they would", "they'd've": "they would have",
                     "they'll": "they will","they'll've": "they will have", "they're": "they are", "they've": "they have", 
                     "to've": "to have", "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", 
                     "we'll": "we will", "we'll've": "we will have", "we're": "we are", "we've": "we have", 
                     "weren't": "were not","what'll": "what will", "what'll've": "what will have", "what're": "what are", 
                     "what've": "what have", "when've": "when have", "where'd": "where did", "where've": "where have", 
                     "who'll": "who will", "who'll've": "who will have", "who've": "who have", "why've": "why have", 
                     "will've": "will have", "won't": "will not", "won't've": "will not have", "would've": "would have", 
                     "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would", 
                     "y'all'd've": "you all would have", "y'all're": "you all are", "y'all've": "you all have", 
                     "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have", 
                     "you're": "you are", "you've": "you have"}

In [6]:
# function that expands contractions
contractions_re = re.compile('(%s)'%'|'.join(contractions_dict.keys()))
def expand_contractions(s, contractions_dict=contractions_dict):
    def replace(match):
        return contractions_dict[match.group(0)].lower()
    return contractions_re.sub(replace, s)
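One thing to keep in mind with a regex built by joining dictionary keys: alternation tries the alternatives in order, so a short key such as `can't` wins over a longer key that starts with it, such as `can't've`. The sketch below, on a small made-up subset of the dictionary, compares insertion order with longest-first order. The names `naive_re`, `ordered_re`, and `expand` are hypothetical; this is one possible variant, not the notebook's method.

```python
import re

# A small subset of the contraction dictionary, for illustration only.
contractions = {
    "can't": "cannot",
    "can't've": "cannot have",
    "he'd": "he would",
    "he'd've": "he would have",
}

# Insertion order: "can't" is tried before "can't've".
naive_re = re.compile("(%s)" % "|".join(contractions.keys()))

# Longest-first order with escaped keys: longer contractions are tried first.
ordered_keys = sorted(contractions, key=len, reverse=True)
ordered_re = re.compile("|".join(re.escape(k) for k in ordered_keys))

def expand(pattern, s):
    # Replace every match with its expansion from the dictionary.
    return pattern.sub(lambda m: contractions[m.group(0)], s)

text = "I can't've seen it, he'd've told me"
print(expand(naive_re, text))    # leaves a dangling 've
print(expand(ordered_re, text))  # expands the full contraction
```

Because the later cleaning step keeps apostrophes, leftovers such as `'ve` would otherwise reach the vectorizer as separate fragments.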

In [8]:
# inspect the lemmatization results
ind = 3
print('Before lemmatization:\n', df.loc[ind,'text'],'\n\nAfter lemmatization:')
print('NLTK:\n', lemm_clear_text_nltk(expand_contractions(df.loc[ind,'text'])))
print('Pattern:\n', lemm_clear_text(expand_contractions(df.loc[ind,'text'])))


Before lemmatization:
 "
More
I can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of "types of accidents"  -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.

There appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It's listed in the relevant form eg Wikipedia:Good_article_nominations#Transport  " 

After lemmatization:
NLTK:
 More I cannot make any real suggestion on improvement I wonder if the section statistic should be later on or a subsection of type of accident I think the reference may need tidy so that they be all in the exact same format ie date format etc I can do that later on if no one else do first if you have any preference for format sty

In [9]:
%%time
# Loop-based equivalent of the apply chain below:
# corpus = list(df['text'])
# corpus_lemm = []
# for i in corpus:
#     corpus_lemm.append(lemm_clear_text(expand_contractions(i)))
    
corpus_lemm = df['text'].apply(expand_contractions).apply(lemm_clear_text)
# NLTK with POS tags: CPU times: total: 59min 58s, Wall time: 1h 12s
# Pattern: Wall time: 19.5 s

CPU times: total: 19.6 s
Wall time: 19.6 s


Words are reduced to a single case and contractions are expanded.
Lemmatization with POS tags takes more than 60 minutes, so the NLTK lemmatizer was replaced with Pattern (run time about 20 seconds).
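Since the same words recur throughout the corpus, one option that was not tried here is to memoize the per-word lemmatization, so that each distinct word is processed only once. The sketch below uses a hypothetical stand-in `slow_lemma` in place of a real lemmatizer, and the names `cached_lemma` and `lemmatize_text` are illustrative. It shows the caching idea only and makes no claim about the speed-up on the real data.

```python
from functools import lru_cache

calls = {"count": 0}

def slow_lemma(word):
    # Hypothetical stand-in for an expensive per-word lemmatizer
    # (for example, a POS lookup followed by a WordNet lemmatization).
    calls["count"] += 1
    return word.lower().rstrip("s")

@lru_cache(maxsize=None)
def cached_lemma(word):
    # Each distinct word is processed once; repeats come from the cache.
    return slow_lemma(word)

def lemmatize_text(text):
    return " ".join(cached_lemma(w) for w in text.split())

corpus = ["the edits were reverted", "the edits look fine", "reverted edits again"]
result = [lemmatize_text(doc) for doc in corpus]
print(result)
print("expensive calls:", calls["count"], "for", sum(len(d.split()) for d in corpus), "tokens")
```

Caching by word ignores sentence context, but so does `get_wordnet_pos`, which tags each word in isolation, so the cached result would be the same as the uncached one.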

In [10]:
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(corpus_lemm, df['toxic'], test_size=0.25, random_state=12345)
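The target classes are imbalanced, so a stratified split is a variant worth considering: it keeps the share of toxic comments the same in both parts. The following is a minimal sketch on made-up labels with about 10% positives (an assumption for illustration). It is not the split used in this notebook.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy labels with roughly 10% positives (illustrative assumption).
y = [1] * 10 + [0] * 90
X = [f"comment {i}" for i in range(len(y))]

# stratify=y preserves the class proportions in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y
)
print(Counter(y_tr), Counter(y_te))
```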

### Conclusion

1. The data was loaded and inspected. The dataset contains 159571 records.
2. A function was prepared that performs lemmatization and removes extraneous characters.
3. Lemmatization was performed.


## Training
Different models are trained through a Pipeline so that cross-validation can be computed.
Hyperparameters are tuned.
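Passing the whole pipeline to `cross_val_score` means the vectorizer is refit on the training part of every fold, so the vocabulary and IDF weights never see the validation part. A minimal sketch on a tiny made-up corpus follows. The texts, labels, and `max_iter` value are illustrative assumptions, not the notebook's data or settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Tiny made-up corpus, for illustration only.
texts = [
    "you are an idiot", "thanks for the helpful edit", "stupid idiot go away",
    "nice work on the article", "what a stupid comment", "please see the talk page",
] * 5
labels = [1, 0, 1, 0, 1, 0] * 5

# Vectorizer and classifier are fitted together inside each CV fold.
pipe = Pipeline(steps=[
    ("tfidf", TfidfVectorizer()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, texts, labels, cv=3, scoring="f1")
print(scores.mean())
```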


In [11]:
# NLTK stop-word list (kept as a list, which every scikit-learn version accepts)
stop_words = stopwords.words('english')
# TF-IDF vectorizer
count_tf_idf = TfidfVectorizer(stop_words=stop_words)
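For intuition about what the vectorizer computes, here is a plain-Python sketch of TF-IDF weighting with smoothed IDF and L2 normalization, which is what scikit-learn documents as the `TfidfVectorizer` default. The three toy documents and the helper names `idf` and `tfidf` are made up for illustration.

```python
import math
from collections import Counter

# Three tokenized toy documents.
docs = [["good", "edit"], ["bad", "edit", "bad"], ["good", "article"]]
n_docs = len(docs)

# Document frequency: in how many documents each term occurs.
df_counts = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df_counts[term])) + 1

def tfidf(doc):
    # Raw term counts times IDF, then scaled to unit Euclidean length.
    tf = Counter(doc)
    weights = {t: tf[t] * idf(t) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

for doc in docs:
    print(tfidf(doc))
```

Terms that occur in fewer documents get a larger IDF, so rare words weigh more than common ones within a comment.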

In [12]:
%%time
clf = Pipeline(steps=[("TFIDF", count_tf_idf), 
                      ("classifier", LogisticRegression(
                          random_state=12345, solver='saga', 
                          penalty='l2', class_weight='balanced'))])
print(cross_val_score(clf, X_train, y_train, cv=3, scoring='f1',n_jobs=-1).mean())
#print(cross_val_score(clf, corpus_lemm, df['toxic'], cv=3, scoring='f1').mean())

# 0.7411370737064806
# CPU times: total: 28.1 s
# Wall time: 28.1 s

0.7411370737064806
CPU times: total: 609 ms
Wall time: 13.1 s
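The `class_weight='balanced'` option used above reweights classes in inverse proportion to their frequency, following the heuristic `n_samples / (n_classes * count_of_class)`. Here is a small sketch of that arithmetic on made-up labels with about 10% positives (an illustrative assumption).

```python
from collections import Counter

# Toy labels with about 10% positives (illustrative assumption).
y = [1] * 10 + [0] * 90

counts = Counter(y)
n_samples, n_classes = len(y), len(counts)

# "balanced" heuristic: n_samples / (n_classes * count_of_class)
weights = {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}
print(weights)
```

Upweighting the rare class typically raises recall at the cost of precision, so its effect on F1 has to be checked empirically, as the cross-validation runs in this section do.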


In [13]:
%%time
clf = Pipeline(steps=[("TFIDF", count_tf_idf), 
                      ("classifier", RandomForestClassifier(random_state=12345, max_depth=40, max_features=4000))])

print(cross_val_score(clf, X_train, y_train, cv=3, scoring='f1',n_jobs=-1).mean())
#print(cross_val_score(clf, corpus_lemm, df['toxic'], cv=3, scoring='f1').mean())
# 0.6963064457043298
# CPU times: total: 281 ms
# Wall time: 2min

0.6963064457043298
CPU times: total: 312 ms
Wall time: 1min 55s


In [14]:
%%time
clf = Pipeline(steps=[("TFIDF", count_tf_idf), 
                      ("classifier", DecisionTreeClassifier(random_state=12345, max_depth=80))])
print(cross_val_score(clf, X_train, y_train, cv=3, scoring='f1',n_jobs=-1).mean())
#print(cross_val_score(clf, corpus_lemm, df['toxic'], cv=3, scoring='f1').mean())
# 0.7087821503051828
# CPU times: total: 469 ms
# Wall time: 35.8 s

0.7087821503051828
CPU times: total: 344 ms
Wall time: 30.9 s


In [15]:
%%time
clf = Pipeline(steps=[("TFIDF", count_tf_idf), 
                      ("classifier", LogisticRegression(random_state=12345, solver='saga',  penalty='l1'))])
print(cross_val_score(clf, X_train, y_train, cv=3, scoring='f1',n_jobs=-1).mean())
# 0.759614706290583
# CPU times: total: 250 ms
# Wall time: 22.8 s

0.759614706290583
CPU times: total: 328 ms
Wall time: 20.1 s


In [16]:
# Metric on the test set
clf.fit(X_train, y_train)
f1_score(y_test, clf.predict(X_test))
# 0.7851436580238262

0.7851436580238262

In [17]:
# Constant model (always predicts the toxic class)
f1_score(df['toxic'], [1]*len(df['toxic']))
# 0.18447896602423097

0.18447896602423097
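The baseline value can be checked by hand. A model that always predicts the positive class has recall 1 and precision equal to the share of positives `p`, so its F1 is `2p / (1 + p)`. Inverting the formula gives the class share implied by the reported score. The function names below are illustrative.

```python
def constant_positive_f1(p):
    # Always predicting the positive class gives recall = 1 and
    # precision = p (the share of positives), so F1 = 2 * p / (1 + p).
    precision, recall = p, 1.0
    return 2 * precision * recall / (precision + recall)

def positive_share_from_f1(f1):
    # Inverse of the formula above: p = F1 / (2 - F1).
    return f1 / (2 - f1)

f1_reported = 0.18447896602423097  # value printed by the notebook
p = positive_share_from_f1(f1_reported)
print(round(p, 4))
```

The reported baseline corresponds to roughly one toxic comment in ten, which is the class imbalance the models have to cope with.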

### Conclusion

LogisticRegression, RandomForestClassifier, and DecisionTreeClassifier models were trained.
Hyperparameters were found that give F1 > 0.75.


## Conclusions
The data was loaded and the features were prepared for training.

Using pipelines and cross-validation, the quality of LogisticRegression, RandomForestClassifier, and DecisionTreeClassifier models was checked. The best model is LogisticRegression with the selected hyperparameters (solver='saga', penalty='l1'), which reaches a cross-validated F1 of 0.7596. On the test set it scores F1 = 0.7851. For comparison, the F1 of the constant model is 0.1845.

The resulting model makes it possible to detect toxic comments and send them for moderation.
