The online store «Викишоп» (Wikishop) is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them to moderation.

Train a model to classify comments as positive or negative. You have a dataset labelled for the toxicity of edits.

Build a model with an *F1* score of at least 0.75.

### Data description

The data is in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target.
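
For reference, F1 is the harmonic mean of precision and recall computed for the positive (toxic) class. A minimal sketch in plain Python; the helper `f1_from_counts` is hypothetical and the counts are made up, not taken from the dataset:

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 for the positive class from true positives, false positives, false negatives."""
    if tp == 0:
        return 0.0  # no correct positive predictions: define F1 as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up counts, for illustration only
example_f1 = f1_from_counts(tp=80, fp=20, fn=30)
print(round(example_f1, 3))  # 0.762
```

Because both precision and recall enter the formula, a model cannot reach the 0.75 target by flagging everything or by flagging almost nothing.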

# 1. Preparation

In [1]:
import pandas as pd
import numpy as np
data = pd.read_csv('/datasets/toxic_comments.csv')
data.head()

Unnamed: 0,text,toxic
0,Explanation\nWhy the edits made under my usern...,0
1,D'aww! He matches this background colour I'm s...,0
2,"Hey man, I'm really not trying to edit war. It...",0
3,"""\nMore\nI can't make any real suggestions on ...",0
4,"You, sir, are my hero. Any chance you remember...",0


In [2]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
text     159571 non-null object
toxic    159571 non-null int64
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


### To speed things up, we will work with a reduced sample from here on

In [3]:
df = data.head(40000)
df['text'][:2]

0    Explanation\nWhy the edits made under my usern...
1    D'aww! He matches this background colour I'm s...
Name: text, dtype: object

### Build the text corpus

In [4]:
corpus = df['text'].values.astype('U')
corpus[:2]

array(["Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",
       "D'aww! He matches this background colour I'm seemingly stuck with. Thanks.  (talk) 21:51, January 11, 2016 (UTC)"],
      dtype='<U5000')

### Lemmatization. We will use WordNet, a lexical database for English.
### It is available through the NLTK library

In [5]:
import nltk

nltk.download('wordnet') # download the WordNet database

from nltk.stem import WordNetLemmatizer # import the lemmatizer

lemmatizer = WordNetLemmatizer()

[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


### Since the lemmatize function handles only individual words, we tokenize each text in the corpus into words and find the lemma of each one

In [6]:
from tqdm import notebook
def lemm(corpus):
    corpus_lem = []
    for i in notebook.tqdm(range(len(corpus))): # show progress with notebook.tqdm()
        word_list = nltk.word_tokenize(corpus[i]) # tokenize the text into a list of words
        # lemmatize each word in the list and join the lemmas into one space-separated string
        lemmatized_output = ' '.join([lemmatizer.lemmatize(j) for j in word_list])
        corpus_lem.append(lemmatized_output) # append the string to the list
    return corpus_lem

In [7]:
corpus = np.array(lemm(corpus))
corpus[:2]

HBox(children=(FloatProgress(value=0.0, max=40000.0), HTML(value='')))




array(["Explanation Why the edits made under my username Hardcore Metallica Fan were reverted ? They were n't vandalism , just closure on some GAs after I voted at New York Dolls FAC . And please do n't remove the template from the talk page since I 'm retired now.89.205.38.27",
       "D'aww ! He match this background colour I 'm seemingly stuck with . Thanks . ( talk ) 21:51 , January 11 , 2016 ( UTC )"],
      dtype='<U6748')

### Remove unneeded characters from the corpus

In [8]:
import re
corpus_lem = []
for i in range(len(corpus)):
    word_eng = re.sub(r'[^a-zA-Z ]', ' ', corpus[i]) # keep only English letters in the text
    split = ' '.join(word_eng.split()) # remove extra spaces
    low = split.lower() # convert all letters to lowercase
    corpus_lem.append(low)

In [9]:
corpus = np.array(corpus_lem)
corpus[:3]

array(['explanation why the edits made under my username hardcore metallica fan were reverted they were n t vandalism just closure on some gas after i voted at new york dolls fac and please do n t remove the template from the talk page since i m retired now',
       'd aww he match this background colour i m seemingly stuck with thanks talk january utc',
       'hey man i m really not trying to edit war it s just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page he seems to care more about the formatting than the actual info'],
      dtype='<U5000')
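
The cleaning step can be tried on a single string. A minimal sketch using only the standard `re` module; the helper name `clean_text` and the sample sentence are illustrative. Apostrophes are replaced by spaces, so tokenized contractions end up as fragments such as `n t`, which is what the output above shows.

```python
import re


def clean_text(text: str) -> str:
    """Keep only ASCII letters, collapse whitespace, and lowercase."""
    letters_only = re.sub(r'[^a-zA-Z ]', ' ', text)  # every other character becomes a space
    collapsed = ' '.join(letters_only.split())       # drop repeated, leading and trailing spaces
    return collapsed.lower()


sample = "They were n't vandalism , just closure on some GAs ."
cleaned = clean_text(sample)
print(cleaned)  # they were n t vandalism just closure on some gas
```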

# TF-IDF and stop-word removal

## Convert the corpus into a matrix for model training

In [10]:
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('stopwords')
stopwords = set(stopwords.words('english'))

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [11]:
count = TfidfVectorizer(stop_words=stopwords)
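
To make the weighting concrete, here is a rough pure-Python sketch of the formula `TfidfVectorizer` applies with its default settings: smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The function name `tfidf_rows` and the three toy documents are illustrative, and tokenization is simplified to whitespace splitting, which differs from the vectorizer's own token pattern.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Return (vocabulary, rows): L2-normalized TF-IDF weights per document."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0  # guard against empty documents
        rows.append([w / norm for w in raw])
    return vocab, rows


toy_docs = ["thanks for the edit", "stupid edit", "thanks"]
vocab, rows = tfidf_rows(toy_docs)
for doc, row in zip(toy_docs, rows):
    print(doc, [round(w, 2) for w in row])
```

In the second toy document the rarer word `stupid` gets a larger weight than `edit`, which appears in two documents. This is the property that makes TF-IDF features useful for spotting distinctive vocabulary.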

In [12]:
# X_train = count.fit_transform(corpus)
# y_train = df['toxic']

In [13]:
X_train = corpus
y_train = df['toxic']

## <span style="color:purple"> Splitting into train and test sets </span>

In [14]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

## <span style="color:purple"> Vectorization </span>

In [15]:
X_train = count.fit_transform(X_train)
X_test = count.transform(X_test)

# 2. Training

### Using cross-validation, we will find the best model on the small sample. Then we will train and test it on the full dataset.

## 2.1 LogisticRegression

In [16]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score
from sklearn.linear_model import LogisticRegression

model_log = LogisticRegression()
cv = 4
cv_model = cross_val_score(model_log, X_train, y_train, cv=cv, scoring='f1')
cv_model



array([0.59718776, 0.59      , 0.585     , 0.59228188])

In [18]:
cv_model.mean() # mean F1 across the folds

0.5911174094181779
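
The four scores above come from 4-fold cross-validation: each fold serves once as the validation part while the model is fitted on the other three, and the scores are then averaged. A simplified index-level sketch follows. The helper `kfold_indices` is hypothetical and, unlike the stratified splitter scikit-learn uses for classifiers, it does not preserve class proportions.

```python
def kfold_indices(n_samples: int, n_folds: int):
    """Yield (train_idx, valid_idx) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)  # spread the remainder over the first folds
        valid_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, valid_idx
        start += size


folds = list(kfold_indices(n_samples=10, n_folds=4))
for train_idx, valid_idx in folds:
    print(len(train_idx), valid_idx)
```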

## 2.2 DecisionTreeClassifier

In [19]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

In [20]:
param_grid = {'max_depth': list(range(1,10)), 'random_state': [12345] }

model_tree = DecisionTreeClassifier()

grid = GridSearchCV(model_tree, param_grid, refit = True, verbose = 3, cv=cv, scoring='f1')


In [21]:
grid.fit(X_train, y_train)
display(grid.best_params_, grid.best_score_)

Fitting 4 folds for each of 9 candidates, totalling 36 fits
[CV] max_depth=1, random_state=12345 .................................


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


[CV] ..... max_depth=1, random_state=12345, score=0.272, total=   1.7s
[CV] max_depth=1, random_state=12345 .................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.7s remaining:    0.0s


[CV] ..... max_depth=1, random_state=12345, score=0.281, total=   1.2s
[CV] max_depth=1, random_state=12345 .................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:    2.8s remaining:    0.0s


[CV] ..... max_depth=1, random_state=12345, score=0.288, total=   1.3s
[CV] max_depth=1, random_state=12345 .................................
[CV] ..... max_depth=1, random_state=12345, score=0.264, total=   1.2s
[CV] max_depth=2, random_state=12345 .................................
[CV] ..... max_depth=2, random_state=12345, score=0.388, total=   1.8s
[CV] max_depth=2, random_state=12345 .................................
[CV] ..... max_depth=2, random_state=12345, score=0.377, total=   1.3s
[CV] max_depth=2, random_state=12345 .................................
[CV] ..... max_depth=2, random_state=12345, score=0.377, total=   1.3s
[CV] max_depth=2, random_state=12345 .................................
[CV] ..... max_depth=2, random_state=12345, score=0.376, total=   1.4s
[CV] max_depth=3, random_state=12345 .................................
[CV] ..... max_depth=3, random_state=12345, score=0.439, total=   1.4s
[CV] max_depth=3, random_state=12345 .................................
[CV] .

[Parallel(n_jobs=1)]: Done  36 out of  36 | elapsed:  1.1min finished


{'max_depth': 9, 'random_state': 12345}

0.5701900509811563
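
The log reports 36 fits because the grid holds 9 depth values and a single `random_state`, and each candidate is evaluated on 4 folds. A small sketch of how such a grid expands, using `itertools.product`; the helper `expand_grid` is illustrative and is not part of scikit-learn:

```python
from itertools import product


def expand_grid(param_grid: dict) -> list:
    """Expand a dict of value lists into a list of parameter dicts."""
    names = sorted(param_grid)
    return [dict(zip(names, values))
            for values in product(*(param_grid[n] for n in names))]


param_grid = {'max_depth': list(range(1, 10)), 'random_state': [12345]}
candidates = expand_grid(param_grid)
n_fits = len(candidates) * 4  # 4 cross-validation folds
print(len(candidates), n_fits)  # 9 36
```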

## <span style="color:purple"> Testing the models </span>

In [22]:
model_log = LogisticRegression()
model_log.fit(X_train, y_train)
predict_log = model_log.predict(X_test)
F1_log = f1_score(y_test, predict_log)

predict_tree = grid.predict(X_test)
F1_tree = f1_score(y_test, predict_tree)



In [23]:
print("F1 on the test set:")
print('Logistic regression:', F1_log)
print('Decision tree:', F1_tree)

F1 on the test set:
Logistic regression: 0.6370967741935484
Decision tree: 0.5768282662284305


## <span style="color:purple"> Tested the models trained on the reduced dataset; logistic regression came out best here as well :)</span>

In [24]:
#param_grid = {'max_depth': list(range(1,20)), 'random_state': [12345] }

#model = DecisionTreeClassifier()

#grid = GridSearchCV(model, param_grid, refit = True, verbose = 3, scoring='f1', cv=cv)

#grid.fit(X_train, y_train)
#display(grid.best_params_, grid.best_score_)

# Repeat the preprocessing for the full dataset, rewriting the functions slightly to save memory

In [25]:
def clean(text):
    word_eng = re.sub(r'[^a-zA-Z ]', ' ', text) # keep only English letters in the text
    split = ' '.join(word_eng.split()) # remove extra spaces
    low = split.lower() # convert all letters to lowercase
    return low
clean(data['text'][3])

'more i can t make any real suggestions on improvement i wondered if the section statistics should be later on or a subsection of types of accidents i think the references may need tidying so that they are all in the exact same format ie date format etc i can do that later on if no one else does first if you have any preferences for formatting style on references or want to do it yourself please let me know there appears to be a backlog on articles for review so i guess there may be a delay until a reviewer turns up it s listed in the relevant form eg wikipedia good article nominations transport'

In [26]:
data['text'] = data['text'].apply(clean)

In [27]:
data.head()

Unnamed: 0,text,toxic
0,explanation why the edits made under my userna...,0
1,d aww he matches this background colour i m se...,0
2,hey man i m really not trying to edit war it s...,0
3,more i can t make any real suggestions on impr...,0
4,you sir are my hero any chance you remember wh...,0


In [28]:
def lemm(text):
    word_list = nltk.word_tokenize(text)
    lemmatized_output = ' '.join([lemmatizer.lemmatize(j) for j in word_list])
    return lemmatized_output

In [29]:
data['text'] = data['text'].apply(lemm)

In [30]:
data.head()

Unnamed: 0,text,toxic
0,explanation why the edits made under my userna...,0
1,d aww he match this background colour i m seem...,0
2,hey man i m really not trying to edit war it s...,0
3,more i can t make any real suggestion on impro...,0
4,you sir are my hero any chance you remember wh...,0


## Split the dataset into train and test (70 : 30)

In [31]:
from sklearn.model_selection import train_test_split
target = data['toxic']
features = data['text']

In [32]:
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=42)

In [33]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(111699,)
(111699,)
(47872,)
(47872,)


In [34]:
y_train.mean()

0.1016839900088631

### About 10% of the objects belong to the positive class: a strong imbalance that needs correcting

### Upsample the positive class

In [35]:
from sklearn.utils import shuffle

def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)
    
    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=12345)
    
    return features_upsampled, target_upsampled

X_train, y_train = upsample(X_train, y_train, 6)

In [36]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(168489,)
(168489,)
(47872,)
(47872,)


In [37]:
y_train.mean()

0.40446557342022327
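
The new class share follows directly from the repeat factor: if the positive share was p and every positive row is kept r times, the share becomes r·p / (1 + (r − 1)·p). A quick check in plain Python, using the share printed before upsampling (the function name `upsampled_share` is illustrative):

```python
def upsampled_share(p: float, repeat: int) -> float:
    """Positive-class share after keeping each positive row `repeat` times."""
    return repeat * p / (1 + (repeat - 1) * p)


share_before = 0.1016839900088631  # y_train.mean() before upsampling
share_after = upsampled_share(share_before, repeat=6)
print(round(share_after, 4))  # 0.4045
```

Only the training part is upsampled. The test part keeps its original class balance, so F1 measured there reflects the real share of toxic comments.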

## 40% is already better :)

# TF-IDF

In [38]:
X_train = count.fit_transform(X_train)
X_test = count.transform(X_test)

In [39]:
model = LogisticRegression()
model.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

In [40]:
predict = model.predict(X_test)
F1 = f1_score(y_test, predict)

In [41]:
F1

0.764038405586267

## Sanity-check the model by comparing it with a model that predicts the positive class for every object

In [42]:
predict = [1 for i in range(len(y_test))]

In [43]:
F1 = f1_score(y_test, predict)

In [44]:
F1

0.18456929407080153
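
The baseline value can also be derived by hand: a model that always predicts 1 has recall 1 and precision equal to the positive share p of the test set, so F1 = 2p / (1 + p). A small sketch in plain Python on made-up labels (the helper name `f1_all_positive` is illustrative):

```python
def f1_all_positive(y_true) -> float:
    """F1 of a constant model that predicts the positive class for every object."""
    positives = sum(y_true)
    if positives == 0:
        return 0.0
    p = positives / len(y_true)  # precision of the constant model; its recall is 1
    return 2 * p / (1 + p)


toy_labels = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # 10% positives, made up
baseline_f1 = f1_all_positive(toy_labels)
print(round(baseline_f1, 4))  # 0.1818
```

With roughly 10% positives the formula gives about 0.18, in line with the value above.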

# 3. Conclusions

## So far this project has been the hardest one for me; by the end I just wanted to get F1 above 0.75 and forget it like a bad dream :). Unfortunately, I only managed to try 2 models. Logistic regression turned out to be the better of the two; it also trains in a matter of seconds and reaches F1 = 0.76 on the test set, against 0.18 for the constant model that marks every comment as negative (toxic). I am not sure what else to add here :)