<h1>Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Введение" data-toc-modified-id="Введение-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></span></li><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Preparation</a></span><ul class="toc-item"><li><span><a href="#Проверка-типов-и-пропусков" data-toc-modified-id="Проверка-типов-и-пропусков-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Checking types and missing values</a></span></li><li><span><a href="#Проверка-дубликатов" data-toc-modified-id="Проверка-дубликатов-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Checking for duplicates</a></span></li></ul></li><li><span><a href="#Подготовка" data-toc-modified-id="Подготовка-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Preparation</a></span></li><li><span><a href="#Обучение" data-toc-modified-id="Обучение-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Training</a></span></li><li><span><a href="#Вывод" data-toc-modified-id="Вывод-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Conclusion</a></span></li></ul></div>

# Introduction
The online store "Викишоп" (Wikishop) is launching a new service: users can now edit and extend product descriptions, as in wiki communities. Customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them to moderation. This notebook trains a binary classifier on labeled comments and selects a model by its F1 score.


In [1]:

import pandas as pd
import numpy as np
import nltk
import re
import matplotlib.pyplot as plt
from tqdm import tqdm

from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer 


from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer 
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
stop_words = stopwords.words('english')

pd.set_option('display.max_colwidth', 1000)

[nltk_data] Downloading package stopwords to /Users/anton/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/anton/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /Users/anton/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# Preparation

In [2]:
frame = pd.read_csv("toxic_comments.csv")

In [3]:
frame.head(10)

Unnamed: 0,text,toxic
0,"Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27",0
1,"D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)",0
2,"Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info.",0
3,"""\nMore\nI can't make any real suggestions on improvement - I wondered if the section statistics should be later on, or a subsection of """"types of accidents"""" -I think the references may need tidying so that they are all in the exact same format ie date format etc. I can do that later on, if no-one else does first - if you have any preferences for formatting style on references or want to do it yourself please let me know.\n\nThere appears to be a backlog on articles for review so I guess there may be a delay until a reviewer turns up. It's listed in the relevant form eg Wikipedia:Good_article_nominations#Transport """,0
4,"You, sir, are my hero. Any chance you remember what page that's on?",0
5,"""\n\nCongratulations from me as well, use the tools well. · talk """,0
6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1
7,"Your vandalism to the Matt Shirvington article has been reverted. Please don't do it again, or you will be banned.",0
8,"Sorry if the word 'nonsense' was offensive to you. Anyway, I'm not intending to write anything in the article(wow they would jump on me for vandalism), I'm merely requesting that it be more encyclopedic so one can use it for school as a reference. I have been to the selective breeding page but it's almost a stub. It points to 'animal breeding' which is a short messy article that gives you no info. There must be someone around with expertise in eugenics? 93.161.107.169",0
9,alignment on this subject and which are contrary to those of DuLithgow,0


## Checking types and missing values

In [4]:
frame.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    159571 non-null  object
 1   toxic   159571 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 2.4+ MB
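
As a complement to `info()`, missing values can also be counted directly per column. The sketch below illustrates the idea on a small made-up frame (`toy` is hypothetical and stands in for the real data).

```python
import pandas as pd

# Hypothetical toy frame, not the real dataset
toy = pd.DataFrame({"text": ["fine", None, "also fine"], "toxic": [0, 1, 0]})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Column dtypes, to confirm that the label column is numeric
print(toy.dtypes)
```

On the real data the same call would be `frame.isna().sum()`.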


## Checking for duplicates

In [5]:
print(f"Number of duplicates: {frame.duplicated().sum()}")

Number of duplicates: 0
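
Note that `duplicated()` without arguments compares whole rows, so two identical comments with different values in another column are not counted. A minimal sketch of a stricter check on the text column alone, using a made-up frame (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy frame: the same comment appears twice with different labels
toy = pd.DataFrame({"text": ["hello", "hello", "bye"], "toxic": [0, 1, 0]})

# Whole-row comparison: the two "hello" rows differ in the label
full_row_duplicates = int(toy.duplicated().sum())

# Comparison on the text column only
text_duplicates = int(toy.duplicated(subset=["text"]).sum())

print(full_row_duplicates, text_duplicates)
```

Whether repeated comments should be dropped is a modelling decision; the sketch only shows how to count them.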


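# Preparation

The class below bundles text cleaning, lemmatization, the train/validation/test split, vectorization, and model selection.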

In [10]:
class toxic_classification():
    """Runs every preprocessing stage and trains the comment classifiers."""
    def __init__(self,
                 models_and_params: list,
                 score: str,
                 solvers: list,
                 stop_words,
                 start_frame,
                 target_column,
                 data_column):
        # Store the constructor arguments
        self.models_and_params = models_and_params  # list of (estimator, param_grid) pairs
        self.score = score                          # scoring name passed to GridSearchCV
        self.solvers = solvers                      # vectorizer classes, e.g. TfidfVectorizer
        self.stop_words = stop_words
        self.start_frame = start_frame
        self.target_column = target_column
        self.data_column = data_column
        # Filled in by the preprocessing stages and by fit()
        self.lemm_corpus = None
        self.splited_data = None
        self.vect = None
        self.max_score = -1
        self.best_model = None

        # Preprocessing: clean -> lemmatize -> split -> vectorize
        self.__text_clearning()
        print("Stage 1 complete")
        self.__lemmatisation()
        print("Stage 2 complete")
        self.__splitter()
        print("Stage 3 complete")
        self.__vectorisation()
        print("Preparation complete")

        
    def __splitter(self):
        """Split the lemmatized corpus into training (80%), test (10%),
        and validation (10%) parts.
        """
        # First split: 80% training, 20% held out
        presplited_data = train_test_split(self.lemm_corpus,
                                           self.start_frame[self.target_column],
                                           test_size=0.2, random_state=42)
        # Second split: halve the held-out part into test and validation
        splited_data_w_val = train_test_split(presplited_data[1],
                                              presplited_data[3],
                                              test_size=0.5, random_state=42)
        # Layout: [X_train, X_test, X_valid, y_train, y_test, y_valid]
        self.splited_data = [presplited_data[0], splited_data_w_val[0], splited_data_w_val[1],
                             presplited_data[2], splited_data_w_val[2], splited_data_w_val[3]]

        
    def __lemmatisation(self):
        """Lemmatize every token of the cleaned corpus."""
        lemmatizer = WordNetLemmatizer()
        # Tokenize each comment and lemmatize every token as a noun (pos="n")
        self.lemm_corpus = self.corpus.apply(
            lambda sentence: " ".join(lemmatizer.lemmatize(w, "n")
                                      for w in nltk.word_tokenize(sentence)))

        
    def __text_clearning(self):
        """Strip every character that is not a Latin letter from the corpus."""
        # Select the text column for further processing
        corpus = self.start_frame[self.data_column]
        # Replace each non-letter character with a space
        self.corpus = corpus.apply(lambda sentence: re.sub(r'[^a-zA-Z]', ' ', sentence))
        
    def __vectorisation(self):
        """Vectorize the split corpus with every requested vectorizer."""
        # Maps a vectorizer name, e.g. "TfidfVectorizer()", to its vectorized splits
        self.vect = {}
        for vectorizer in self.solvers:
            key = str(vectorizer())
            # Create the vectorizer with the given stop words
            vector = vectorizer(stop_words=self.stop_words)
            self.vect[key] = {
                # Fit the vocabulary on the training split only
                'train_data': vector.fit_transform(self.splited_data[0]),
                'train_target': self.splited_data[3],
                # Transform the test split with the fitted vocabulary
                'test_data': vector.transform(self.splited_data[1]),
                'test_target': self.splited_data[4],
                # Transform the validation split with the fitted vocabulary
                'valid_data': vector.transform(self.splited_data[2]),
                'valid_target': self.splited_data[5],
            }
            
            
    def fit(self):
        """Train every (model, vectorizer) pair and keep the best model."""
        # Results keyed by model name, then by vectorizer name
        self.black_boxes = {str(name): {} for name, _ in self.models_and_params}
        # Best validation F1 seen so far; used only for model selection
        best_valid_score = -1
        for model, params in tqdm(self.models_and_params):
            for vectorizer, data in tqdm(self.vect.items(), desc=str(model)):
                # Cross-validated grid search on the training split
                grid = GridSearchCV(model, params, cv=3, scoring=self.score)
                grid.fit(data['train_data'], data['train_target'])
                best = grid.best_estimator_
                result = {
                    "grid_object": grid,
                    "best_model": best,
                    # Mean cross-validated score on the training split
                    "best_score_train": grid.best_score_,
                    "best_score_valid": f1_score(data['valid_target'], best.predict(data['valid_data'])),
                    "best_score_test": f1_score(data['test_target'], best.predict(data['test_data'])),
                }
                self.black_boxes[str(model)][str(vectorizer)] = result
                # Select by validation score; report the test score of the selected model.
                # (The original compared validation scores against a stored test score.)
                if result["best_score_valid"] > best_valid_score:
                    best_valid_score = result["best_score_valid"]
                    self.max_score = result["best_score_test"]
                    self.best_model = best

        return {"max_score": self.max_score, "best_model": self.best_model}
    
    def get_info(self):
        """Return everything collected during training."""
        return self.black_boxes
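
The split and vectorization stages of the class can be sketched in isolation. The example below is a minimal illustration on a made-up corpus (`texts` and `labels` are hypothetical): it holds out 20% of the data, halves the held-out part, and fits the vectorizer vocabulary on the training part only so that no information from the other parts leaks into the features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus: 100 short documents with alternating labels
texts = [f"comment number {i} word{i % 7}" for i in range(100)]
labels = [i % 2 for i in range(100)]

# Step 1: hold out 20% of the data
X_train, X_rest, y_train, y_rest = train_test_split(
    texts, labels, test_size=0.2, random_state=42)

# Step 2: halve the held-out part into test and validation sets
X_test, X_valid, y_test, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42)

# Fit the vocabulary on the training part only, then reuse it unchanged
vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(X_train)
test_matrix = vectorizer.transform(X_test)
valid_matrix = vectorizer.transform(X_valid)

print(len(X_train), len(X_test), len(X_valid))
print(train_matrix.shape[1], test_matrix.shape[1], valid_matrix.shape[1])
```

All three matrices share the same number of columns because they use the vocabulary learned from the training part.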
            
        
    

# Training

In [11]:
# Parameter grids. Each list enumerates the candidate values to try
# (it is not a start/stop/step range).
params_Log = {"max_iter":[1000,2000,100]}
params_RF = {"n_estimators":[40,200,20],"max_depth":[2,10]}
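
For a grid given as a dictionary, `GridSearchCV` evaluates every combination of the listed values. The plain-Python sketch below only illustrates that expansion for `params_RF`; it is not how scikit-learn implements it, and `combinations` is a name introduced here for illustration.

```python
from itertools import product

params_RF = {"n_estimators": [40, 200, 20], "max_depth": [2, 10]}

# Cartesian product of all candidate values, one dict per combination
names = list(params_RF)
combinations = [dict(zip(names, values))
                for values in product(*params_RF.values())]

for combo in combinations:
    print(combo)
print(len(combinations))
```

With `cv = 3`, each of these combinations is fitted three times during the search, which accounts for most of the training time.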

In [12]:
# Create the classifier object (this runs the whole preprocessing)
cl = toxic_classification([(LogisticRegression(random_state = 42,class_weight='balanced',n_jobs = -1),params_Log),
                           (RandomForestClassifier(random_state = 42,class_weight='balanced',n_jobs = -1),params_RF)],
                          'f1',[TfidfVectorizer,CountVectorizer],stop_words,frame,'toxic',"text")

Stage 1 complete
Stage 2 complete
Stage 3 complete
Preparation complete


In [13]:
cl.fit()

LogisticRegression(class_weight='balanced', n_jobs=-1, random_state=42): 100%|██████████| 2/2 [01:52<00:00, 56.24s/it]
RandomForestClassifier(class_weight='balanced', n_jobs=-1, random_state=42): 100%|██████████| 2/2 [01:07<00:00, 33.85s/it]
100%|██████████| 2/2 [03:00<00:00, 90.09s/it]


{'max_score': 0.758893280632411,
 'best_model': LogisticRegression(class_weight='balanced', max_iter=1000, n_jobs=-1,
                    random_state=42)}
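
The returned `best_model` expects already vectorized input, and the class above does not keep its fitted vectorizers, so scoring new comments requires the vectorizer to travel with the model. One possible way to do that is scikit-learn's `Pipeline`. The sketch below assumes a tiny made-up corpus (`train_texts`, `train_labels`, and `new_comments` are hypothetical) and is not the configuration selected above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy training data: 1 = toxic, 0 = not toxic
train_texts = [
    "you are an idiot",
    "thanks for the helpful edit",
    "stupid idiot go away",
    "nice work on the article",
    "what a stupid comment",
    "please see the talk page",
]
train_labels = [1, 0, 1, 0, 1, 0]

# Vectorizer and classifier chained into one estimator
toxicity_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", LogisticRegression(class_weight="balanced",
                               max_iter=1000, random_state=42)),
])
toxicity_pipeline.fit(train_texts, train_labels)

# Raw strings go in; the pipeline vectorizes them before predicting
new_comments = ["what an idiot", "thanks for the article"]
predictions = toxicity_pipeline.predict(new_comments)
print(predictions)
```

A pipeline like this can also be passed to `GridSearchCV` directly, so that the vectorizer is refitted inside each cross-validation fold.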

# Conclusion


The required model has been found: its F1 score on the test set meets the task requirement. The data contain no missing values and no duplicate rows. The best model is a LogisticRegression.

- F1 score on the test set: 0.759
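
For reference, the F1 score reported above is the harmonic mean of precision and recall for the positive (toxic) class. The sketch below computes it by hand on made-up labels; `f1_from_labels` is a hypothetical helper written for illustration, and `sklearn.metrics.f1_score` is what the notebook actually uses.

```python
def f1_from_labels(y_true, y_pred):
    """F1 for the positive class (label 1) of a binary problem."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 2 true positives, 1 false positive, 1 false negative
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
score = f1_from_labels(y_true, y_pred)
print(round(score, 3))
```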