# Project for "Wikishop"

The online store "Wikishop" is launching a new service. Users can now edit and extend product descriptions, as in wiki communities. In other words, customers propose their own edits and comment on changes made by others. The store needs a tool that finds toxic comments and sends them for moderation.

Train a model to classify comments as positive or negative. A dataset with toxicity labels for the edits is available.

Build a model with an *F1* score of at least 0.75.
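
As a reminder of what the target metric measures, here is a minimal sketch that computes F1 by hand from precision and recall and compares it with `f1_score` from scikit-learn. The labels are made up for illustration and are not taken from the dataset.

```python
from sklearn.metrics import f1_score

# Toy labels (made up for illustration): 1 = toxic, 0 = non-toxic.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # toxic, flagged
fp = sum(t == 0 and p == 1 for t, p in pairs)  # non-toxic, flagged
fn = sum(t == 1 and p == 0 for t, p in pairs)  # toxic, missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
# F1 is the harmonic mean of precision and recall.
f1_manual = 2 * precision * recall / (precision + recall)

f1_sklearn = f1_score(y_true, y_pred)
print(round(f1_manual, 3), round(f1_sklearn, 3))
```

Because F1 ignores true negatives, it is not inflated by the large non-toxic majority the way accuracy would be.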

**Project instructions**

1. Download and prepare the data.
2. Train several different models.
3. Draw conclusions.

Using *BERT* is not required for the project, but you may try it.

**Data description**

The data is in the file `toxic_comments.csv`. The *text* column contains the comment text, and *toxic* is the target feature.

## Preparation

In [1]:
import numpy as np
import pandas as pd
import re
import nltk

from nltk import pos_tag
from nltk.corpus import stopwords as nltk_stopwords
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import f1_score
from sklearn.utils import shuffle

from tqdm import tqdm
tqdm.pandas()

import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/dstrokov/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [2]:
df = pd.read_csv('/datasets/toxic_comments.csv')

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159292 entries, 0 to 159291
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   Unnamed: 0  159292 non-null  int64 
 1   text        159292 non-null  object
 2   toxic       159292 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 3.6+ MB


In [4]:
df.head(10)

Unnamed: 0.1,Unnamed: 0,text,toxic
0,0,Explanation\nWhy the edits made under my usern...,0
1,1,D'aww! He matches this background colour I'm s...,0
2,2,"Hey man, I'm really not trying to edit war. It...",0
3,3,"""\nMore\nI can't make any real suggestions on ...",0
4,4,"You, sir, are my hero. Any chance you remember...",0
5,5,"""\n\nCongratulations from me as well, use the ...",0
6,6,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1
7,7,Your vandalism to the Matt Shirvington article...,0
8,8,Sorry if the word 'nonsense' was offensive to ...,0
9,9,alignment on this subject and which are contra...,0


In [5]:
# Drop the unnamed column: it only duplicates the row index

df = df.drop(['Unnamed: 0'], axis=1)

In [6]:
print('Duplicates -', df.duplicated().sum())

Duplicates - 0


In [7]:
print('Missing values:')
display(df.isna().sum())

Missing values:


text     0
toxic    0
dtype: int64

In [8]:
print('Class balance:')
display(df.toxic.value_counts(normalize=True))

Class balance:


0    0.898388
1    0.101612
Name: toxic, dtype: float64

What we have:

* 159292 comments
* No duplicates or missing values
* 2 columns: toxic (1 = toxic, 0 = non-toxic) and text (the comment)
* About 90% of the comments are non-toxic
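
With this imbalance, a classifier that always predicts 0 would be about 90% accurate while having an F1 of 0 for the toxic class. The models later in this notebook use `class_weight='balanced'` to compensate. The sketch below shows what that setting computes, using a toy target with a 9:1 ratio (an illustrative stand-in, not the real data).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy target with a 9:1 imbalance, similar to the dataset (illustrative only).
y = np.array([0] * 9 + [1] * 1)

classes = np.unique(y)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# 'balanced' is defined as n_samples / (n_classes * count_per_class).
manual = len(y) / (len(classes) * np.bincount(y))

print(dict(zip(classes.tolist(), weights.round(3).tolist())))
```

The rare class receives a weight roughly nine times larger, so each toxic example contributes more to the training loss.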

In [9]:
# Define a function that cleans the comment text, then apply it

def clear_text(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z]', ' ', text)   
    text = ' '.join(text.split())
    return text
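
To make the effect of the cleaning step concrete, the sketch below repeats the same function and runs it on a made-up sample string (not a row from the dataset).

```python
import re

def clear_text(text):
    text = text.lower()                     # normalize case
    text = re.sub(r'[^a-zA-Z]', ' ', text)  # keep Latin letters only
    text = ' '.join(text.split())           # collapse repeated whitespace
    return text

# Hypothetical sample written in the style of the comments.
sample = "Hey, I'm NOT trying to edit-war: see rule #3!"
cleaned = clear_text(sample)
print(cleaned)
```

Note the side effects: digits disappear and contractions are split into stray letters ("i m"), which matches the cleaned rows shown in the notebook.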

In [10]:
%%time
df['text'] = df['text'].apply(clear_text)

CPU times: user 3.17 s, sys: 25 ms, total: 3.2 s
Wall time: 3.2 s


In [11]:
# Preliminary result

df.head(10)

Unnamed: 0,text,toxic
0,explanation why the edits made under my userna...,0
1,d aww he matches this background colour i m se...,0
2,hey man i m really not trying to edit war it s...,0
3,more i can t make any real suggestions on impr...,0
4,you sir are my hero any chance you remember wh...,0
5,congratulations from me as well use the tools ...,0
6,cocksucker before you piss around on my work,1
7,your vandalism to the matt shirvington article...,0
8,sorry if the word nonsense was offensive to yo...,0
9,alignment on this subject and which are contra...,0


In [12]:
lemmatizer = WordNetLemmatizer()

In [13]:
def get_wordnet_pos(treebank_tag):
    """
    Map a Treebank POS tag to the corresponding WordNet POS constant.
    """
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN 
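
The same mapping can also be written as a dictionary lookup. The sketch below is a hypothetical alternative (`treebank_to_wordnet` and `PREFIX_TO_POS` are names introduced here). To stay runnable without downloading NLTK data, it uses the plain strings `'a'`, `'v'`, `'n'`, `'r'`, which are assumed to equal `wordnet.ADJ`, `wordnet.VERB`, `wordnet.NOUN` and `wordnet.ADV`.

```python
# WordNet POS constants as plain strings (assumed to match
# nltk.corpus.wordnet.ADJ / VERB / NOUN / ADV).
ADJ, VERB, NOUN, ADV = "a", "v", "n", "r"

# The first letter of a Treebank tag identifies the coarse word class.
PREFIX_TO_POS = {"J": ADJ, "V": VERB, "N": NOUN, "R": ADV}

def treebank_to_wordnet(tag):
    """Return the WordNet POS for a Treebank tag, defaulting to noun."""
    return PREFIX_TO_POS.get(tag[:1], NOUN)

for tag in ["JJ", "VBD", "NNS", "RB", "PRP"]:
    print(tag, "->", treebank_to_wordnet(tag))
```

The tag matters because the lemmatizer treats every word as a noun unless told otherwise. For example, turning "made" into "make" requires the verb tag.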

In [14]:
def lemmatize_text(text):
    words = text.split()
    tagged_words = pos_tag(words)
    lemmatized_words = []
    for word, tag in tagged_words:
        pos = get_wordnet_pos(tag)
        lemmatized_words.append(lemmatizer.lemmatize(word, pos)) 
    return " ".join(lemmatized_words)

In [15]:
lemmatized_df = df['text'].progress_apply(lemmatize_text)

lemmatized_df.head()

100%|██████████████████████████████████| 159292/159292 [05:47<00:00, 458.40it/s]


0    explanation why the edits make under my userna...
1    d aww he match this background colour i m seem...
2    hey man i m really not try to edit war it s ju...
3    more i can t make any real suggestion on impro...
4    you sir be my hero any chance you remember wha...
Name: text, dtype: object

In [16]:
df['text'] = lemmatized_df

In [18]:
features = df['text'] 
target = df['toxic']

features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=.1, random_state=42, stratify=target)

In [21]:

for i in [features_train, target_train, features_test, target_test]:
    print(i.shape)

(143362,)
(143362,)
(15930,)
(15930,)
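
The split above passes `stratify=target`, which keeps the share of toxic comments the same in both parts. A minimal sketch on synthetic labels (10% positives, not the real data) illustrates the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic target with 10% positives (illustrative, not the real data).
y = np.array([1] * 100 + [0] * 900)
X = np.arange(len(y)).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42, stratify=y
)

# Both shares should be close to the overall 10%.
print(y_train.mean(), y_test.mean())
```

Without stratification, a small test set could by chance contain noticeably more or fewer toxic comments, which would make the test F1 less reliable.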


## Training

### Decision tree

In [22]:
%%time

pipeline = Pipeline([("vect", TfidfVectorizer(stop_words='english')), 
                     ("dtc", DecisionTreeClassifier())])
    
parameters = {'dtc__max_depth': ([x for x in range(1, 25)]),
              'dtc__random_state': ([42]), 
              'dtc__class_weight': (['balanced'])}

dtc = GridSearchCV(pipeline, parameters, scoring='f1', cv=3, n_jobs=-1, return_train_score=True)
dtc.fit(features_train, target_train)

mts = dtc.cv_results_['mean_test_score']
dtc_train_f1 = max(mts)

print('Decision tree F1 =', round(dtc_train_f1,2))
print('with parameters', dtc.best_params_)
print()

Decision tree F1 = 0.63
with parameters {'dtc__class_weight': 'balanced', 'dtc__max_depth': 23, 'dtc__random_state': 42}

CPU times: user 22.3 s, sys: 6.21 s, total: 28.5 s
Wall time: 3min 28s
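
Two details of this setup are worth spelling out. Placing `TfidfVectorizer` inside the `Pipeline` means the vocabulary and IDF weights are refit on the training part of each cross-validation fold, so nothing leaks from the validation part. Also, the maximum of `mean_test_score` is the same number that `GridSearchCV` exposes as `best_score_`. The sketch below checks this on a tiny made-up corpus with a reduced grid, both of which are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up corpus: 1 = toxic, 0 = non-toxic.
texts = [
    "you are an idiot", "stupid idiot go away", "what an idiot edit",
    "idiot vandal stop", "shut up idiot", "you stupid vandal",
    "thanks for the edit", "nice article thanks", "please see the talk page",
    "good source thanks", "i fixed the link", "welcome to the project",
]
labels = [1] * 6 + [0] * 6

# The vectorizer is a pipeline step, so it is fit per training fold.
pipe = Pipeline([
    ("vect", TfidfVectorizer(stop_words="english")),
    ("dtc", DecisionTreeClassifier(random_state=42)),
])
grid = {"dtc__max_depth": [1, 2, 3]}

search = GridSearchCV(pipe, grid, scoring="f1", cv=3)
search.fit(texts, labels)

best_from_results = np.max(search.cv_results_["mean_test_score"])
print(search.best_params_, search.best_score_, best_from_results)
```

The value reported as the model's F1 in this notebook is therefore a cross-validated score, not a score on the training data.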


### RandomForestClassifier

In [23]:
%%time

pipeline = Pipeline([("vect", TfidfVectorizer(stop_words='english')), 
                     ("rfc", RandomForestClassifier())])
parameters = {'rfc__n_estimators': ([x for x in range(10, 30)]),
              'rfc__random_state': ([12345]),
              'rfc__max_depth': ([x for x in range(1, 10)]),
              'rfc__criterion': (['entropy']), 
              'rfc__class_weight': (['balanced'])}

rfc = GridSearchCV(pipeline, parameters, scoring='f1', cv=3, n_jobs=-1)
rfc.fit(features_train, target_train)
mts = rfc.cv_results_['mean_test_score']
rfc_train_f1 = max(mts)

print('Random forest F1 =', round(rfc_train_f1,2))
print('with parameters', rfc.best_params_)
print()

Random forest F1 = 0.32
with parameters {'rfc__class_weight': 'balanced', 'rfc__criterion': 'entropy', 'rfc__max_depth': 9, 'rfc__n_estimators': 28, 'rfc__random_state': 12345}

CPU times: user 45.2 s, sys: 46.9 s, total: 1min 32s
Wall time: 10min 41s


### LogisticRegression

In [24]:
%%time

pipeline = Pipeline([("vect", TfidfVectorizer(stop_words='english')), 
                     ("lr", LogisticRegression())])
    
parameters = {'lr__C': (10, 15), 
              'lr__class_weight': (['balanced'])}

lr = GridSearchCV(pipeline, parameters, scoring='f1', cv=3, n_jobs=-1)
lr.fit(features_train, target_train)
mts = lr.cv_results_['mean_test_score']
lr_train_f1 = max(mts)

print('Logistic regression F1 =', round(lr_train_f1,2))
print('with parameters', lr.best_params_)
print()

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Logistic regression F1 = 0.75
with parameters {'lr__C': 10, 'lr__class_weight': 'balanced'}

CPU times: user 23.5 s, sys: 3.29 s, total: 26.8 s
Wall time: 19.4 s
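
The convergence warning in the output above means the solver stopped at its default iteration limit before fully converging. One common remedy is to raise `max_iter`. The sketch below shows where that parameter goes, using a tiny made-up corpus. The value 1000 is an assumption, and the results reported in this notebook were not re-run with it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny made-up corpus: 1 = toxic, 0 = non-toxic.
texts = [
    "you are an idiot", "stupid idiot go away", "idiot vandal stop",
    "thanks for the edit", "nice article thanks", "i fixed the link",
]
labels = [1, 1, 1, 0, 0, 0]

pipe = Pipeline([
    ("vect", TfidfVectorizer(stop_words="english")),
    # max_iter=1000 is an assumed value; the default limit is lower.
    ("lr", LogisticRegression(C=10, class_weight="balanced", max_iter=1000)),
])
pipe.fit(texts, labels)

# n_iter_ reports how many iterations the solver actually used.
iterations_used = int(pipe.named_steps["lr"].n_iter_[0])
print(iterations_used, "iterations used out of", pipe.named_steps["lr"].max_iter)
```

If `n_iter_` stays below `max_iter`, the solver converged and the warning should no longer appear.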


### Testing the best model

In [26]:
%%time

predictions_test = lr.predict(features_test)
lr_test_f1 = f1_score(target_test, predictions_test)
print('Final logistic regression F1 =', round(lr_test_f1,2))

Final logistic regression F1 = 0.77
CPU times: user 538 ms, sys: 58.8 ms, total: 596 ms
Wall time: 604 ms


## Conclusions

Training the three models gave the following cross-validated results:

* Decision tree F1 = 0.63
* Logistic regression F1 = 0.75
* Random forest F1 = 0.32

On the test set, logistic regression reached F1 = 0.77, which meets the 0.75 target, so it is the recommended model.