# Project for «Викишоп» (Wikishop)

The online store Wikishop is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. You have a dataset labeled for the toxicity of edits.

Build a model with an *F1* score of at least 0.75.


**Data description**

The data is in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target.

## Preparation

Import the libraries, load the data, and take a look at it.

In [1]:
import re

import numpy as np
import pandas as pd

import nltk
from nltk.corpus import stopwords as nltk_stopwords
from nltk.corpus import wordnet

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils import shuffle

# NLTK resources used below for lemmatization, POS tagging, and stop-word removal.
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /home/jovyan/nltk_data...
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [2]:
try:
    # Try the absolute path first.
    df = pd.read_csv("/datasets/toxic_comments.csv", index_col=[0])
except FileNotFoundError:
    # Fall back to a copy in the working directory.
    df = pd.read_csv("toxic_comments.csv", index_col=[0])

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 159292 entries, 0 to 159450
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159292 non-null  object
 1   toxic   159292 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 3.6+ MB


In [4]:
df.duplicated().sum() 

0

OK, there are no duplicates and no missing values, so nothing needs cleaning up here.

Define the features and the target.

In [5]:
features = df['text']
target = df['toxic']

In [6]:
target.value_counts()

0    143106
1     16186
Name: toxic, dtype: int64

Most comments in the sample are non-toxic: there are about 8.8 non-toxic comments for every toxic one, so the classes are clearly imbalanced.

Split the data into training and test sets, then downsample the non-toxic class in the training set.

In [7]:
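# Note: train_test_split returns (train, test). With this unpacking order the names are swapped,
# so `features_test` holds 80% of the rows and `features_train` 20% (see the lengths printed below).
#features_test, features_train, target_test, target_train = train_test_split(features, target, test_size = 0.20, random_state = 12345, stratify = target)
features_test, features_train, target_test, target_train = train_test_split(features, target, test_size = 0.20, random_state = 12345)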
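
For comparison, here is a minimal sketch of a stratified split with the conventional unpacking order. `train_test_split` returns the training part first, and `stratify` keeps the share of each class the same in both parts. The toy `texts` and `labels` are made up for illustration and are not the project data.

```python
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 50 comments, every fifth one marked toxic.
texts = [f"comment {i}" for i in range(50)]
labels = [1 if i % 5 == 0 else 0 for i in range(50)]

# Returned order: train features, test features, train target, test target.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.20, random_state=12345, stratify=labels
)

print(len(X_train), len(X_test))   # sizes of the training and test parts
print(sum(y_train), sum(y_test))   # number of toxic comments in each part
```

With this order the model would be trained on 80% of the rows and evaluated on 20%, which is the usual reading of `test_size=0.20`.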

In [8]:
def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_downsampled = pd.concat(
        [features_zeros.sample(frac=fraction, random_state=12345)] + [features_ones])
    target_downsampled = pd.concat(
        [target_zeros.sample(frac=fraction, random_state=12345)] + [target_ones])
    
    features_downsampled, target_downsampled = shuffle(
        features_downsampled, target_downsampled, random_state=12345)
    
    return features_downsampled, target_downsampled

In [9]:
features_downsampled, target_downsampled = downsample(features_train, target_train, 0.4)

In [10]:
target_downsampled.value_counts()

0    11452
1     3230
Name: toxic, dtype: int64

OK, the classes are now less skewed: about 3.5 non-toxic comments per toxic one (11452 vs 3230), though still not equal. Let's look at the data.
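
A quick check of how much downsampling changed the class ratio, using only the counts printed by the two `value_counts()` calls above. The helper name `imbalance_ratio` is introduced here for illustration.

```python
def imbalance_ratio(n_negative, n_positive):
    """Number of non-toxic comments per toxic comment."""
    return n_negative / n_positive

# Class counts copied from the value_counts() outputs above.
full = imbalance_ratio(143106, 16186)
downsampled = imbalance_ratio(11452, 3230)
toxic_share = 3230 / (11452 + 3230)

print(f"full data:   {full:.2f} : 1")
print(f"downsampled: {downsampled:.2f} : 1")
print(f"toxic share after downsampling: {toxic_share:.1%}")
```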

In [11]:
features_downsampled

136215    yeah I did read that source but not read that ...
150369    Conspiracy/Complicity\nHi, I just posted a mes...
65337     "\n\n Ambedkar's response to the Poona Pact \n...
142056    "\nHere's a good look at the next-generation T...
76182     Was in Scotland, Loch ness, Inverness, Edinbor...
                                ...                        
8734         because Wikipedia is poop-racist!99.99.167.130
129231    "\n\n More 2012 \n\n... and ... ""He spends a ...
15303     "\nI'm not trying to make a point here, I'm tr...
69365     "\n\n Douchebag award \n\n  Being a massive di...
51737     "\n\nBiased Viewponit\n\nJust reading this art...
Name: text, Length: 14682, dtype: object

In [12]:
features_test

97494     Bushranger you're a GRASS with no sense of hum...
4383      "\n\n Need administrative help \n\nI have been...
103777    I'd also like to point out that he has used a ...
38619      You cant block me you fucking retard. BRB nigger
128443    I believe that the frequency of the wave needs...
                                ...                        
110090    Hahaha. ) I dont live in a lie like you and do...
85493                    March 2006 – March 2006]]\n \n\n|}
133387    "\n\nAgreed.  We really should try to stick to...
130469    "\n\n Umm killer \n\nDo you not like that he c...
77361     Bradford City \n\nI am removing unreferanced c...
Name: text, Length: 127433, dtype: object

Remove extra characters such as punctuation, keeping only letters and spaces.

In [13]:
def re_func(text):
    return re.sub(r'[^a-zA-Z ]',"", str(text))

In [14]:
features_downsampled = features_downsampled.apply(re_func)
features_test = features_test.apply(re_func)
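
Note that `re_func` deletes newlines and slashes outright, so words separated only by such characters get glued together (for example `ConspiracyComplicityHi` in the cleaned output further down). A possible alternative, shown here as a sketch rather than the method used in this notebook, is to replace every run of non-letters with a single space. The name `clean_text` is hypothetical.

```python
import re

def strip_only(text):
    """Same pattern as re_func: non-letter characters are deleted."""
    return re.sub(r'[^a-zA-Z ]', "", str(text))

def clean_text(text):
    """Alternative: each run of non-letter characters becomes one space."""
    return re.sub(r'[^a-zA-Z]+', " ", str(text)).strip()

sample = "Conspiracy/Complicity\nHi, I just posted"
print(strip_only(sample))
print(clean_text(sample))
```

The trade-off is that contractions are split as well (`I'm` becomes `I m`), so the choice depends on how the tokens are used afterwards.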

Now lemmatize the comments, that is, reduce each word to its base form.

In [15]:
def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

In [16]:
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()
def lemmatize_text(text):
    return (" ".join(lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in w_tokenizer.tokenize(text)))
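
Lemmatization is the slowest step in this notebook, and `get_wordnet_pos` is called once per word occurrence. Since the same words repeat across comments, caching per-word results is one likely way to reduce the cost. The sketch below shows the idea with `functools.lru_cache`; `make_cached_lemmatizer` and `toy_lemma` are hypothetical names, and `toy_lemma` only stands in for the real, expensive call.

```python
from functools import lru_cache

def make_cached_lemmatizer(lemmatize_word):
    """Wrap a per-word lemmatizer so each distinct word is processed only once."""
    cached_word = lru_cache(maxsize=None)(lemmatize_word)

    def lemmatize_text(text):
        return " ".join(cached_word(word) for word in text.split())

    return lemmatize_text, cached_word

# Hypothetical stand-in for the expensive call. In this notebook the wrapped
# function could be: lambda w: lemmatizer.lemmatize(w, get_wordnet_pos(w))
calls = []

def toy_lemma(word):
    calls.append(word)
    return word.lower()

lemmatize_cached, cached_word = make_cached_lemmatizer(toy_lemma)
result = lemmatize_cached("You block you BLOCK you")

print(result)
print(len(calls), cached_word.cache_info())
```

The cache is keyed by the exact word string, so `You` and `you` are still processed separately.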

In [17]:
%%time
features_downsampled = features_downsampled.apply(lemmatize_text)

CPU times: user 1min 26s, sys: 8.61 s, total: 1min 34s
Wall time: 1min 34s


In [18]:
%%time
features_test = features_test.apply(lemmatize_text)

CPU times: user 12min 18s, sys: 1min 16s, total: 13min 34s
Wall time: 13min 34s


Remove extra whitespace.

In [19]:
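def re_space(text):
    # Collapse any run of whitespace into a single space.
    # (The original pattern "r'\s+'" was a plain string containing the characters r and ', so it never matched.)
    return re.sub(r'\s+', " ", str(text))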

In [20]:
features_downsampled = features_downsampled.apply(re_space)
features_test = features_test.apply(re_space)

In [21]:
features_downsampled

136215    yeah I do read that source but not read that t...
150369    ConspiracyComplicityHi I just post a message h...
65337     Ambedkars response to the Poona Pact A transla...
142056    Heres a good look at the nextgeneration Toyota...
76182     Was in Scotland Loch ness Inverness Edinboroug...
                                ...                        
8734                        because Wikipedia be poopracist
129231    More and He spends a lot of time online and ha...
15303     Im not try to make a point here Im try to ensu...
69365     Douchebag award Being a massive dick YEEEY You...
51737     Biased ViewponitJust reading this article Ive ...
Name: text, Length: 14682, dtype: object

In [22]:
features_test

97494     Bushranger youre a GRASS with no sense of humo...
4383      Need administrative help I have be block iniqu...
103777    Id also like to point out that he have use a t...
38619          You cant block me you fuck retard BRB nigger
128443    I believe that the frequency of the wave need ...
                                ...                        
110090    Hahaha I dont live in a lie like you and dont ...
85493                                           March March
133387    Agreed We really should try to stick to the su...
130469    Umm killer Do you not like that he copy your w...
77361        Bradford City I be remove unreferanced content
Name: text, Length: 127433, dtype: object

Remove stop words and apply the tf-idf transformation to the text.

In [23]:
# Passed as a list: recent scikit-learn versions reject a set for `stop_words`.
stopwords = list(nltk_stopwords.words('english'))
count_tf_idf = TfidfVectorizer(stop_words=stopwords)

#count_tf_idf = TfidfVectorizer(stop_words=stopwords, lowercase=True, min_df=0.0001)

In [24]:
features_train_tf_idf = count_tf_idf.fit_transform(features_downsampled)
features_test_tf_idf = count_tf_idf.transform(features_test)
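
To make the weighting concrete, here is a dependency-free sketch of tf-idf using the smoothed idf `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is what scikit-learn documents as the `TfidfVectorizer` defaults. It is an illustration of the idea only: tokenization is a plain whitespace split, stop words are not removed, and the three toy documents are made up.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

    vectors = []
    for tokens in tokenized:
        weights = {term: count * idf[term] for term, count in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

docs = ["you be rude", "you be kind", "you be helpful"]
vectors = tfidf(docs)
print(vectors[0])
```

Terms that occur in every document (`you`, `be`) get a lower weight than a term that occurs in only one (`rude`). As in the cell above, the document frequencies should come from the training texts only.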

## Training

Test three models (LogisticRegression, RandomForestClassifier, KNeighborsClassifier) and compute the F1 score for each.

In [41]:
clf = LogisticRegression()

parametrs = { 'C': np.arange(5.0, 10.0, 1.0)}

grid = GridSearchCV(clf, parametrs, cv=5)
grid.fit(features_train_tf_idf, target_downsampled)
grid.best_params_

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

{'C': 9.0}
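
Two details of this search are worth noting. Unless `scoring` is set, `GridSearchCV` ranks candidates by the estimator's default score (accuracy for classifiers), while the project target is F1. The convergence warnings also suggest raising `max_iter`. The sketch below shows one way to address both, and wraps the vectorizer in a `Pipeline` so that tf-idf is refit inside each fold. The toy corpus and the grid values are assumptions for illustration, not the settings used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus (hypothetical), repeated so that every CV fold contains both classes.
texts = ["you are an idiot", "thanks for the helpful edit",
         "shut up you moron", "i added a reference to the article"] * 10
labels = [1, 0, 1, 0] * 10

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),                 # refit on the training part of each fold
    ("clf", LogisticRegression(max_iter=1000)),   # higher than the default of 100 iterations
])

grid = GridSearchCV(
    pipeline,
    param_grid={"clf__C": [1.0, 5.0, 10.0]},
    scoring="f1",   # rank candidates by F1 instead of accuracy
    cv=5,
)
grid.fit(texts, labels)

print(grid.best_params_, grid.best_score_)
```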

In [63]:
# class_weight='dict' from the original cell is not a valid option (use 'balanced', a dict, or None), so it is omitted.
model = LogisticRegression(C=8.5, multi_class = 'ovr', random_state = 12345)
model.fit(features_train_tf_idf, target_downsampled)
predictions = pd.DataFrame(model.predict(features_test_tf_idf))
print("F1:", f1_score(target_test, predictions))

F1: 0.7376842789262071


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [27]:
clf = RandomForestClassifier()

parametrs = { 'n_estimators': range (70, 101, 10),
              'max_depth': range (1,13, 2)}

grid = GridSearchCV(clf, parametrs, cv=5)
grid.fit(features_train_tf_idf, target_downsampled)
grid.best_params_

{'max_depth': 11, 'n_estimators': 70}

In [31]:
model = RandomForestClassifier(random_state = 12345, n_estimators = 70, max_depth = 11)
model.fit(features_train_tf_idf, target_downsampled)
predictions = pd.DataFrame(model.predict(features_test_tf_idf))
print("F1:", f1_score(target_test, predictions))

F1: 0.0007715454054471106


In [35]:
clf = KNeighborsClassifier()

parametrs = {'n_neighbors': range (10, 41, 10)}

grid = GridSearchCV(clf, parametrs, cv=5)
grid.fit(features_train_tf_idf, target_downsampled)
grid.best_params_

{'n_neighbors': 30}

In [34]:
model = KNeighborsClassifier(n_neighbors = 30)
model.fit(features_train_tf_idf, target_downsampled)
predictions = pd.DataFrame(model.predict(features_test_tf_idf))
print("F1:", f1_score(target_test, predictions))

F1: 0.6088184845121581


Logistic regression showed the best result:
- F1 LogisticRegression = 0.74
- F1 KNeighborsClassifier = 0.61
- F1 RandomForestClassifier = 0.00
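
The near-zero score of the random forest is easier to interpret from the definition of the metric, F1 = 2·TP / (2·TP + FP + FN). The sketch below plugs in one set of counts that reproduces the reported value. These counts are an assumption: the notebook does not print a confusion matrix, and other combinations give the same score. Only the number of toxic comments in the evaluation set (16186 − 3230) follows from the outputs above.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN), the harmonic mean of precision and recall."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

# Toxic comments that were not used for training (from the value_counts() outputs).
toxic_in_test = 16186 - 3230

# Hypothetical counts: only 5 comments flagged as toxic, all of them correctly.
tp, fp = 5, 0
fn = toxic_in_test - tp

score = f1_from_counts(tp, fp, fn)
print(score)
```

Under this reading the forest labels almost every comment as non-toxic, so its recall, and with it F1, is close to zero.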

## Conclusions

- The source data contains almost 160 thousand comments labeled as positive (non-toxic) or negative (toxic), with no missing values or duplicates
- Split the data into training and test sets. Because the outputs of `train_test_split` were unpacked in swapped order, the models were trained on a sample drawn from 20% of the data and evaluated on the remaining 80%
- Toxic comments are about 9 times rarer than non-toxic ones. To compensate, downsampled the non-toxic class: this reduced the imbalance to about 3.5:1 and shrank the training set to about 14.7 thousand comments
- Comment texts were tokenized, lemmatized, and stripped of extra characters and stop words
- Tested three models (LogisticRegression, RandomForestClassifier, KNeighborsClassifier) and computed the F1 score for each
- Logistic regression showed the best result, F1 = 0.74, which is slightly below the required 0.75