# Project for Wikishop («Викишоп»)

The Wikishop online store is launching a new service: users can now edit and extend product descriptions, as in wiki communities. Customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. A dataset with toxicity labels for the edits is available.

Build a model with an *F1* score of at least 0.75.

**Project instructions**

1. Load and prepare the data.
2. Train several models.
3. Draw conclusions.

Using *BERT* is not required for this project, but you may try it.

**Data description**

The data is stored in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target.

## Preparation

In [1]:
import re
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords as nltk_stopwords

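# Download the NLTK resources used below: the tokenizer model,
# the WordNet dictionary, and the stop-word list.
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')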

True

In [2]:
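lem = WordNetLemmatizer()
def lemmatization(text):
    # Tokenize, lemmatize each token, and join the result back into one string.
    # Without a POS tag, lemmatize() treats every token as a noun,
    # so e.g. "was" becomes "wa" in the output below.
    word_list = nltk.word_tokenize(text)
    return ' '.join([lem.lemmatize(w) for w in word_list])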

In [3]:
def clean(text):
    text = text.lower()    
    text = re.sub(r"(?:\n|\r)", " ", text)
    text = re.sub(r"[^a-zA-Z ]+", "", text).strip()
    return text
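
As a quick check of what `clean` does, the sketch below applies the same three steps to a made-up comment. The sample string and the `clean_collapsed` variant are illustrative assumptions, not part of the original notebook. Apostrophes are dropped rather than replaced with a space (`That's` becomes `thats`), and the spaces left behind by removed newlines are kept unless they are collapsed explicitly.

```python
import re

def clean(text):
    # Same steps as the notebook's clean(): lowercase, newlines -> spaces,
    # then drop every character that is not a Latin letter or a space.
    text = text.lower()
    text = re.sub(r"(?:\n|\r)", " ", text)
    text = re.sub(r"[^a-zA-Z ]+", "", text).strip()
    return text

def clean_collapsed(text):
    # Hypothetical variant: additionally collapse runs of spaces into one.
    return re.sub(r" +", " ", clean(text))

sample = "You should be ashamed of yourself \n\nThat's NOT ok!!!"
print(repr(clean(sample)))
print(repr(clean_collapsed(sample)))
```

The extra spaces are harmless in this notebook, because the lemmatization step tokenizes the text and joins the tokens back with single spaces.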

In [4]:
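url = 'toxic_comments.csv'  # assumption: `url` is not defined above; file name taken from the data description
df_toxic = pd.read_csv(url, index_col=0)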

In [5]:
df_toxic.head()

| | text | toxic |
|---|---|---|
| 0 | Explanation\nWhy the edits made under my usern... | 0 |
| 1 | D'aww! He matches this background colour I'm s... | 0 |
| 2 | Hey man, I'm really not trying to edit war. It... | 0 |
| 3 | "\nMore\nI can't make any real suggestions on ... | 0 |
| 4 | You, sir, are my hero. Any chance you remember... | 0 |


In [6]:
df_toxic.info()

There are no missing values. Next, clean and lemmatize the text in the `text` column.

In [7]:
df_toxic['clean_text'] = df_toxic['text'].apply(clean)

In [8]:
df_toxic['lemmatize_text'] = df_toxic['clean_text'].apply(lemmatization)

In [9]:
df_toxic

| | text | toxic | clean_text | lemmatize_text |
|---|---|---|---|---|
| 0 | Explanation\nWhy the edits made under my usern... | 0 | explanation why the edits made under my userna... | explanation why the edits made under my userna... |
| 1 | D'aww! He matches this background colour I'm s... | 0 | daww he matches this background colour im seem... | daww he match this background colour im seemin... |
| 2 | Hey man, I'm really not trying to edit war. It... | 0 | hey man im really not trying to edit war its j... | hey man im really not trying to edit war it ju... |
| 3 | "\nMore\nI can't make any real suggestions on ... | 0 | more i cant make any real suggestions on impro... | more i cant make any real suggestion on improv... |
| 4 | You, sir, are my hero. Any chance you remember... | 0 | you sir are my hero any chance you remember wh... | you sir are my hero any chance you remember wh... |
| ... | ... | ... | ... | ... |
| 159446 | ":::::And for the second time of asking, when ... | 0 | and for the second time of asking when your vi... | and for the second time of asking when your vi... |
| 159447 | You should be ashamed of yourself \n\nThat is ... | 0 | you should be ashamed of yourself that is a ... | you should be ashamed of yourself that is a ho... |
| 159448 | Spitzer \n\nUmm, theres no actual article for ... | 0 | spitzer umm theres no actual article for pro... | spitzer umm there no actual article for prosti... |
| 159449 | And it looks like it was actually you who put ... | 0 | and it looks like it was actually you who put ... | and it look like it wa actually you who put on... |


In [10]:
X_train, X_test, y_train, test_y = train_test_split(df_toxic.drop(['text', 'toxic', 'clean_text'],axis=1),
                                                    df_toxic['toxic'], 
                                                    test_size=0.2, stratify=df_toxic['toxic'],
                                                    random_state=1234)
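
The `stratify` argument matters when the target is imbalanced: it keeps the share of each class roughly the same in the train and test parts. The sketch below illustrates this on toy labels; the 10% positive rate and all variable names are made-up assumptions, not figures from this dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy target (assumption): 100 positive and 900 negative labels.
y = np.array([1] * 100 + [0] * 900)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1234
)

# Share of the positive class overall, in train, and in test.
print(y.mean(), y_tr.mean(), y_te.mean())
```

Without `stratify`, the positive share in a small test set can drift from the overall rate, which makes F1 on the test set noisier.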

In [11]:
X_train

| | lemmatize_text |
|---|---|
| 49286 | question i believe that ive understood and cor... |
| 92850 | before returning home and being disbanded in june |
| 138647 | i have respectfully appended a reassurance to ... |
| 2562 | a i mentioned in the edit summary again did yo... |
| 6835 | hi dutchbloke i said might give reason to pres... |
| ... | ... |
| 63165 | one will undoubtedly come out a soon a the dem... |
| 134039 | i wa reasonable with him they got a week notic... |
| 63145 | where the four grouping are actually found |
| 93913 | can you source that like i did with mine becau... |


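Prepare TF-IDF vectorization. The vectorizer is fitted inside the pipelines in the training section, so the standalone calls in this cell are left commented out.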

In [12]:
stop_words = nltk_stopwords.words('english')  # kept as a list: recent scikit-learn releases reject a set for stop_words
# tfidf_vectorizer = TfidfVectorizer(stop_words=stop_words, ngram_range=(1,1))
# X_train_vectorizer = tfidf_vectorizer.fit_transform(X_train['lemmatize_text'].astype('U'))
# X_test_vectorizer =  tfidf_vectorizer.transform(X_test['lemmatize_text'].astype('U'))
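
To make the TF-IDF weighting concrete, here is a minimal pure-Python sketch. It assumes the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, and the L2 row normalization that scikit-learn documents as the `TfidfVectorizer` defaults. The whitespace tokenization, the `tfidf` helper, and the toy documents are simplifications introduced for illustration only.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # L2-normalize the row; fall back to 1.0 for an empty document.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

docs = ["you are my hero", "you are wrong", "hero article"]
rows = tfidf(docs)
for doc, row in zip(docs, rows):
    print(doc, {t: round(w, 3) for t, w in row.items()})
```

In the second document, `wrong` receives a larger weight than `you` or `are`, because it occurs in fewer documents.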

## Training

### LinearSVC

In [13]:
pipe_svc = Pipeline([('vectorizer', TfidfVectorizer(stop_words=stop_words, ngram_range=(1,1))),
                     ('model', LinearSVC(random_state=42))])
grid_params_svc = [{'model__max_iter' :[500, 1000],
                    'model__C': [1, 7, 9]}]

In [14]:
jobs = -1

SVC = GridSearchCV(estimator=pipe_svc,
                   param_grid=grid_params_svc,
                   scoring='f1',
                   cv=10,
                   n_jobs=jobs)
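
Placing the vectorizer inside the pipeline means that, during cross-validation, it is refitted on the training part of every fold, so the held-out part does not influence the vocabulary or the IDF weights. The sketch below shows the same pattern on a toy corpus; the texts, labels, and the `C` grid are made-up assumptions. Grid keys follow the `<step name>__<parameter>` convention.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy corpus (assumption): 1 = toxic, 0 = not toxic.
texts = ["you are an idiot", "thanks for the edit", "stupid idiot go away",
         "nice article thanks", "what a stupid edit", "great work on the article",
         "go away idiot", "thanks great work"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

pipe = Pipeline([("vectorizer", TfidfVectorizer()),
                 ("model", LinearSVC(random_state=42))])

# "model__C" addresses the C parameter of the step named "model".
grid = GridSearchCV(pipe, param_grid={"model__C": [0.1, 1, 10]},
                    scoring="f1", cv=2)
grid.fit(texts, labels)

print(grid.best_params_, grid.best_score_)
```

After fitting, `grid.best_estimator_` is the whole pipeline refitted on all training data, so `grid.predict` can be called directly on raw text.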

In [15]:
SVC.fit(X_train['lemmatize_text'], y_train)

GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('vectorizer',
                                        TfidfVectorizer(stop_words={'a',
                                                                    'about',
                                                                    'above',
                                                                    'after',
                                                                    'again',
                                                                    'against',
                                                                    'ain',
                                                                    'all', 'am',
                                                                    'an', 'and',
                                                                    'any',
                                                                    'are',
                                                                    'aren',
   

In [21]:
(SVC.best_params_, SVC.best_score_)

({'model__C': 1, 'model__max_iter': 500}, 0.775550132219754)

### LogisticRegression

In [22]:
pipe_lr = Pipeline([('tfidf', TfidfVectorizer(stop_words=stop_words, ngram_range=(1,1))),
                    ('clf', LogisticRegression(random_state=42))])
grid_params_lr = [{'clf__C': [1, 7, 9],
                    'clf__solver': ['lbfgs','liblinear'],
                    'clf__max_iter': [500,1000]}]

In [23]:
LR = GridSearchCV(estimator=pipe_lr,
                   param_grid=grid_params_lr,
                   scoring='f1',
                   cv=10,
                   n_jobs=jobs)

In [25]:
LR.fit(X_train['lemmatize_text'], y_train)

GridSearchCV(cv=10,
             estimator=Pipeline(steps=[('tfidf',
                                        TfidfVectorizer(stop_words={'a',
                                                                    'about',
                                                                    'above',
                                                                    'after',
                                                                    'again',
                                                                    'against',
                                                                    'ain',
                                                                    'all', 'am',
                                                                    'an', 'and',
                                                                    'any',
                                                                    'are',
                                                                    'aren',
        

## Conclusions

In [26]:
predict_logistic = LR.predict(X_test['lemmatize_text'])

In [27]:
f1_score(test_y, predict_logistic)

0.7760524499654934

In [19]:
predict_SVC = SVC.predict(X_test['lemmatize_text'])
f1_score(test_y, predict_SVC)

0.7814478863597466

In [28]:
pd.DataFrame({'LinearSVC': [f1_score(test_y, predict_SVC)],
              'LogisticRegression': [f1_score(test_y, predict_logistic)]}, index=["F1_score"])

| | LinearSVC | LogisticRegression |
|---|---|---|
| F1_score | 0.781448 | 0.776052 |
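
For reference, F1 is the harmonic mean of precision and recall for the positive class. The sketch below computes it by hand on made-up labels; `f1_from_labels` is a hypothetical helper written for illustration, not the `f1_score` implementation from scikit-learn.

```python
def f1_from_labels(y_true, y_pred):
    # Count true positives, false positives, and false negatives for class 1.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no correct positive predictions
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels: 3 toxic comments out of 10.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, f1_from_labels(y_true, y_pred))
```

On imbalanced labels the two metrics diverge: accuracy is inflated by the many correctly predicted negatives, while F1 reflects only how well the positive class is found.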


### Conclusion
- The data was cleaned and lemmatized.
- TF-IDF vectorization was applied.
- Both LinearSVC and LogisticRegression reached an F1 above 0.75 on the test set; LinearSVC scored slightly higher (0.781 vs. 0.776).