# Project for Wikishop («Викишоп»)

The online store Wikishop is launching a new service. Users can now edit and extend product descriptions, as in wiki communities: customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them to moderation.

Train a model to classify comments as positive or negative. You have a dataset in which edits are labeled for toxicity. Build a model whose *F1* quality metric is at least 0.75.

**Data description**

The data is in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target feature.

## Preparation

In [1]:
import pandas as pd
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from tqdm import tqdm
import re 
from nltk.corpus import stopwords as nltk_stopwords
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from catboost import CatBoostClassifier
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform as sp_randFloat
from scipy.stats import randint as sp_randInt 
from sklearn.model_selection import train_test_split
import transformers
from sklearn.model_selection import cross_val_score


In [2]:
try:
    data = pd.read_csv('/C:/Users/Lena/Downloads/toxic_comments.csv')
except OSError:  # local file not available, fall back to the shared dataset path
    data = pd.read_csv('/datasets/toxic_comments.csv')

In [3]:
print("\n Sample overview \n")
display(data.head())
print("\n")
data.info()  # info() prints its report itself and returns None
print("\n")
display(data.describe())
print("\n")
print(data.shape)
print("\n Exact duplicates:", data.duplicated().sum())
print("\n Missing values:\n", data.isna().sum())
print("\n Target class proportions:\n", data['toxic'].value_counts(normalize=True))


 Sample overview 

,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


,Unnamed: 0,toxic
count,159292.0,159292.0
mean,79725.697242,0.101612
std,46028.837471,0.302139
min,0.0,0.0
25%,39872.75,0.0
50%,79721.5,0.0
75%,119573.25,0.0
max,159450.0,1.0


(159292, 3)

 Exact duplicates: 0

 Missing values:
 Unnamed: 0    0
text          0
toxic         0
dtype: int64

 Target class proportions:
 0    0.898388
1    0.101612
Name: toxic, dtype: float64
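
With only about 10% of comments labeled toxic, accuracy would be a misleading metric, which is presumably why the task is stated in terms of F1. The sketch below uses synthetic labels (a rounded 90/10 split) and two illustrative helpers, `accuracy` and `f1`, that are not part of the notebook. It shows that a model which never flags anything looks good on accuracy and useless on F1.

```python
def accuracy(y_true, y_pred):
    """Share of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1(y_true, y_pred):
    """F1 for the positive class (label 1); 0.0 when there are no true positives."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


# Synthetic labels: 90 clean comments, 10 toxic ones (assumed, rounded proportions)
y_true = [0] * 90 + [1] * 10
always_clean = [0] * 100  # a "model" that never predicts toxic

print("accuracy:", accuracy(y_true, always_clean))
print("F1:      ", f1(y_true, always_clean))
```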


In [4]:
data = data.drop(['Unnamed: 0'], axis = 1)

In [5]:
data['text'] = data['text'].str.lower()  # convert to lowercase

In [6]:
# https://webdevblog.ru/podhody-lemmatizacii-s-primerami-v-python/
# Lemmatization with POS tags: each word is labeled with its part of speech
# so that the lemmatizer can pick the matching base form.

def get_wordnet_pos(word):
    # first letter of the Penn Treebank tag (J, N, V, R, ...)
    tag = nltk.pos_tag([word])[0][1][0].upper()
    # map the first letter of the tag to a WordNet part of speech
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    # any other tag falls back to noun
    return tag_dict.get(tag, wordnet.NOUN)

nltk.download('averaged_perceptron_tagger')

lemmatizer = WordNetLemmatizer()

def lemm_text(text):
    # tokenize, lemmatize each token with its POS tag, then rejoin with spaces
    tokens = nltk.word_tokenize(text)
    return ' '.join(lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in tokens)

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
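
The mapping step in `get_wordnet_pos` can be looked at in isolation. The sketch below reproduces only that step without NLTK. The function name `penn_to_wordnet` is hypothetical, and the single-letter values are written out as literals on the assumption that they match `wordnet.ADJ`, `wordnet.NOUN`, `wordnet.VERB` and `wordnet.ADV`.

```python
# Literal values assumed to mirror nltk.corpus.wordnet.ADJ / NOUN / VERB / ADV
WORDNET_POS = {"J": "a", "N": "n", "V": "v", "R": "r"}


def penn_to_wordnet(penn_tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    return WORDNET_POS.get(penn_tag[:1].upper(), "n")


# A few common Penn Treebank tags: adjective, plural noun, past-tense verb,
# adverb, and a determiner (which has no WordNet counterpart).
for tag in ["JJ", "NNS", "VBD", "RB", "DT"]:
    print(tag, "->", penn_to_wordnet(tag))
```

Tags with no WordNet counterpart, such as determiners, fall back to noun, which is the same default the notebook uses.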


In [7]:
tqdm.pandas()

In [8]:
# lemmatization
data['text'] = data['text'].progress_apply(lemm_text) 

100%|██████████| 159292/159292 [19:33<00:00, 135.77it/s]
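
Lemmatization took almost 20 minutes. Since `get_wordnet_pos` tags every word on its own, its result depends only on the word, so one possible speed-up is to cache results per word. The sketch below shows the idea with `functools.lru_cache` on a hypothetical stand-in function, `slow_pos_lookup`, which replaces the real tagger call so that the example runs without NLTK data.

```python
from functools import lru_cache

calls = 0  # counts how often the expensive function body actually runs


@lru_cache(maxsize=None)
def slow_pos_lookup(word: str) -> str:
    """Hypothetical stand-in for get_wordnet_pos: pretend this is expensive."""
    global calls
    calls += 1
    return "v" if word.endswith("ing") else "n"


tokens = "the cat is chasing the other cat".split()
tags = [slow_pos_lookup(token) for token in tokens]

print(calls, "lookups for", len(tokens), "tokens")
```

Repeated words are served from the cache, so the saving grows with how often a corpus repeats its vocabulary. The actual gain on this dataset has not been measured here.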


In [9]:
# keep only Latin letters, digits, and spaces
def clear_text(text):
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text)
    return " ".join(text.split())
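
To see what this cleaning does to already tokenized text, here is the same function applied to a made-up sample string (the sample is an assumption, not taken from the dataset).

```python
import re


def clear_text(text):
    # replace every character that is not a Latin letter or digit with a space
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text)
    # collapse runs of whitespace into single spaces
    return " ".join(text.split())


sample = "i ca n't believe it 's 100 % true !"
print(clear_text(sample))
```

Because the tokenizer has already split contractions, fragments such as `n`, `t` and `s` survive as separate tokens, which matches the `ca n t` and `i m` pieces visible in the cleaned data.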

In [10]:
data['text'] = data['text'].apply(clear_text) 

In [11]:
display(data.head())

,text,toxic
0,explanation why the edits make under my userna...,0
1,d aww he match this background colour i m seem...,0
2,hey man i m really not try to edit war it s ju...,0
3,more i ca n t make any real suggestion on impr...,0
4,you sir be my hero any chance you remember wha...,0


In [12]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159292 non-null  object
 1   toxic   159292 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB


In [13]:
nltk.download('stopwords')
# recent scikit-learn versions expect stop_words as a list, not a set
stopwords = sorted(set(nltk_stopwords.words('english')))
count_tf_idf = TfidfVectorizer(stop_words=stopwords)


[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
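
As a rough illustration of what the vectorizer computes, the sketch below builds TF-IDF weights by hand for a three-document toy corpus. It assumes the smoothed IDF formula that scikit-learn documents for its default settings, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization. It ignores stop-word removal and scikit-learn's token pattern, so it is not a drop-in replacement.

```python
import math
from collections import Counter

# Toy corpus (assumed for illustration only)
docs = [
    "you are my hero",
    "you are an idiot",
    "thank you for the edit",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# document frequency: in how many documents each term occurs
df = Counter(term for doc in tokenized for term in set(doc))


def idf(term):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1


def tfidf_vector(doc_tokens):
    """Raw term counts times IDF, scaled to unit L2 norm."""
    counts = Counter(doc_tokens)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


vec = tfidf_vector(tokenized[1])
for term, weight in sorted(vec.items(), key=lambda kv: -kv[1]):
    print(f"{term:6s} {weight:.3f}")
```

Terms that occur in every document, such as `you`, get the lowest weight, while rare terms such as `idiot` dominate the vector.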


In [14]:
target = data['toxic'].values
features = data.drop(['toxic'], axis=1)

features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=.1, random_state=12345)
features_train = count_tf_idf.fit_transform(features_train['text'].values)
features_test = count_tf_idf.transform(features_test['text'].values)
print('shape:', features_train.shape, features_test.shape, target_train.shape, target_test.shape)


shape: (143362, 158945) (15930, 158945) (143362,) (15930,)
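
The split above is not stratified, so the share of toxic comments in the test set can drift from the 10% in the full data. The sketch below shows how the `stratify` argument of `train_test_split` keeps class proportions equal. It uses synthetic labels rather than the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 90 clean and 10 toxic labels (assumed, rounded proportions)
y = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=12345, stratify=y
)

print("toxic share, train:", y_train.mean())
print("toxic share, test: ", y_test.mean())
```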


## Training

### LogisticRegression()

In [15]:
%%time
model_lr = LogisticRegression(random_state=12345)
model_lr.fit(features_train, target_train)
f1_lr = cross_val_score(model_lr,features_train, target_train, 
                         cv = 3, 
                         scoring = 'f1').mean()
print('Cross-validated F1 for logistic regression:', f1_lr)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Cross-validated F1 for logistic regression: 0.7084849543194324
CPU times: user 1min 21s, sys: 1min 35s, total: 2min 57s
Wall time: 2min 57s
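
The cross-validation above runs on features that were vectorized once on the whole training set, so each validation fold has already influenced the vocabulary and IDF weights. One way to avoid this, sketched below on a small made-up corpus, is to put the vectorizer inside the `Pipeline` so that it is refitted on the training part of every fold. The texts, labels and the `max_iter` value are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Tiny made-up corpus: 6 toxic and 6 clean comments
texts = [
    "you are an idiot", "what a stupid edit", "idiot stop editing",
    "this is stupid nonsense", "you stupid idiot", "shut up idiot",
    "thank you for the edit", "great work on this article",
    "please add a source", "thanks for the helpful edit",
    "nice article good work", "could you add a reference please",
]
labels = np.array([1] * 6 + [0] * 6)

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),  # refitted on each training fold
    ("clf", LogisticRegression(random_state=12345, max_iter=1000)),
])

scores = cross_val_score(pipe, texts, labels, cv=3, scoring="f1")
print("F1 per fold:", scores)
```

The same pipeline can be passed to `GridSearchCV`, with vectorizer and classifier parameters addressed as `tfidf__...` and `clf__...`.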


In [17]:
%%time
lr_pipeline = Pipeline([ 
    ("clf", LogisticRegression(random_state=12345, class_weight='balanced'))
     ])
    
para_grid = {'clf__C': [5, 10]} 

grid_ = GridSearchCV(estimator=lr_pipeline, param_grid=para_grid, scoring='f1', cv=3, n_jobs=-1)

grid_.fit(features_train, target_train)

print(grid_.best_params_, 'F1 for logistic regression:', grid_.best_score_)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


{'clf__C': 10} F1 for logistic regression: 0.7573672414487199
CPU times: user 2min 28s, sys: 2min 54s, total: 5min 22s
Wall time: 5min 22s


Despite the convergence warnings, I will not increase the number of iterations (`max_iter`): the cross-validated F1 of 0.757 already exceeds the required 0.75.
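
If the warnings did need to be addressed, the usual first step is the one the message itself suggests: raise `max_iter`. The sketch below does this on a small synthetic dataset and then checks the fitted `n_iter_` attribute to confirm the solver stopped before the limit. The dataset and the value 1000 are assumptions, and the real TF-IDF matrix may need a different limit or a different solver.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic problem standing in for the real features
X, y = make_classification(n_samples=500, n_features=20, random_state=12345)

clf = LogisticRegression(C=10, max_iter=1000, random_state=12345)
clf.fit(X, y)

# n_iter_ holds the number of iterations the solver actually used
print("iterations used:", clf.n_iter_[0], "of", clf.max_iter)
```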

### Decision tree

In [18]:
%%time
model_3 = DecisionTreeClassifier()
pipelin_3 = Pipeline(steps=[("model_3", model_3)])
param_grid_3 = {"model_3__max_depth":range(1,10,1)} 
search_3 = GridSearchCV(pipelin_3, param_grid_3, scoring = 'f1', cv=3, n_jobs=-1)
search_3.fit(features_train, target_train)
print(search_3.best_params_, search_3.best_score_)


{'model_3__max_depth': 9} 0.5942222209625517
CPU times: user 3min 49s, sys: 1.65 s, total: 3min 51s
Wall time: 3min 51s


### CatBoostClassifier

In [19]:
%%time

model_1 = CatBoostClassifier()
parameters = {'depth' : sp_randInt(4, 6),
                  'learning_rate' : sp_randFloat(.03, .5),
                  'iterations'    : sp_randInt(1,5)
                 }
    
randm = RandomizedSearchCV(estimator=model_1, param_distributions = parameters, 
                               cv = 3, scoring = 'f1', n_iter = 5)
randm.fit(features_train, target_train)
print("\n F1:", randm.best_score_)
print("\n Best parameters:\n", randm.best_params_)


# parameters = {'learning_rate': (.03, .1),
#         'depth': (4, 6, 10),
#         'iterations': (1, 10,5)}

# random = RandomizedSearchCV()

# grid_search_result = model_1.GridSearchCV(params,
#                                        X=features_train,
#                                        y=target_train,
#                                        cv=3,
#                                        plot=False)

# print("\nBest Params : ", max(grid_search_result.cv_results_['mean_test_score'])

0:	learn: 0.5535818	total: 1.6s	remaining: 4.8s
1:	learn: 0.4634901	total: 2.76s	remaining: 2.76s
2:	learn: 0.3984969	total: 4.07s	remaining: 1.35s
3:	learn: 0.3524282	total: 5.28s	remaining: 0us
0:	learn: 0.5591919	total: 1.53s	remaining: 4.58s
1:	learn: 0.4649362	total: 2.78s	remaining: 2.78s
2:	learn: 0.4013784	total: 4.05s	remaining: 1.35s
3:	learn: 0.3545599	total: 5.34s	remaining: 0us
0:	learn: 0.5594747	total: 1.53s	remaining: 4.59s
1:	learn: 0.4673014	total: 2.73s	remaining: 2.73s
2:	learn: 0.4010165	total: 3.97s	remaining: 1.32s
3:	learn: 0.3522806	total: 5.22s	remaining: 0us
0:	learn: 0.3475511	total: 1.55s	remaining: 4.65s
1:	learn: 0.2660086	total: 2.74s	remaining: 2.74s
2:	learn: 0.2376642	total: 3.99s	remaining: 1.33s
3:	learn: 0.2244736	total: 5.22s	remaining: 0us
0:	learn: 0.3588459	total: 1.53s	remaining: 4.6s
1:	learn: 0.2704098	total: 2.78s	remaining: 2.78s
2:	learn: 0.2422842	total: 4.01s	remaining: 1.34s
3:	learn: 0.2294573	total: 5.26s	remaining: 0us
0:	learn: 0.3
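
The search space above is narrower than it may look, because of how the SciPy distributions are parameterized: `randint(low, high)` excludes `high`, and `uniform(loc, scale)` covers `[loc, loc + scale]` rather than `[loc, scale]`. The sketch below samples from the same three distributions to show the ranges actually explored. The sample size and seed are arbitrary.

```python
from scipy.stats import randint, uniform

depth = randint(4, 6)               # integers 4 and 5 only
learning_rate = uniform(0.03, 0.5)  # floats in [0.03, 0.53]
iterations = randint(1, 5)          # integers 1 to 4

depth_samples = depth.rvs(size=1000, random_state=0).tolist()
lr_samples = learning_rate.rvs(size=1000, random_state=0)
iter_samples = iterations.rvs(size=1000, random_state=0).tolist()

print("depth values:     ", sorted(set(depth_samples)))
print("learning_rate min:", lr_samples.min(), "max:", lr_samples.max())
print("iterations values:", sorted(set(iter_samples)))
```

With at most four boosting iterations the CatBoost models are very small, which likely limits the F1 they can reach in this search.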

## Conclusions

Logistic regression showed the best F1 in cross-validation (0.757 with `C=10` and balanced class weights; the decision tree reached 0.594), so this is the model I evaluate on the test set.

In [21]:
%%time
# C=10 comes from the grid search above; class_weight stays at its default here
model_lr = LogisticRegression(random_state=12345, C=10)
model_lr.fit(features_train, target_train)
prediction = model_lr.predict(features_test)
print('F1', f1_score(target_test, prediction))
print('\n Confusion matrix')
print(confusion_matrix(target_test, prediction))

F1 0.7789766794291679

 Confusion matrix
[[14176   142]
 [  493  1119]]
CPU times: user 22.7 s, sys: 26.6 s, total: 49.3 s
Wall time: 49.4 s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


The logistic regression model was evaluated on the test set and reached F1 = 0.779, which is above the required 0.75.
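
The reported F1 can be recomputed from the confusion matrix above, which also shows where the remaining errors are. The sketch assumes the scikit-learn layout, with rows as true classes and columns as predicted classes.

```python
# Confusion matrix from the test run: [[TN, FP], [FN, TP]]
tn, fp = 14176, 142
fn, tp = 493, 1119

precision = tp / (tp + fp)  # share of flagged comments that are truly toxic
recall = tp / (tp + fn)     # share of toxic comments that were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.4f}")
```

Precision is noticeably higher than recall: the model rarely flags clean comments, but it misses 493 of the 1612 toxic ones.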