# **Practice 1**: Abusive language detection on Twitter




This tutorial presents a machine learning model for detecting abusive language on Twitter, based on the data collected by Waseem and Hovy (2016).

In [None]:
# optional: you can upload the data or mount your Drive
#from google.colab import drive
#drive.mount('/content/drive')

We import some of the packages needed for this tutorial. We will use the Python machine learning library **scikit-learn**.

In [2]:
import numpy as np
import pandas as pd 

from collections import Counter
import matplotlib.pyplot as plt

!pip install scikit-learn
from sklearn.utils import shuffle
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report
from sklearn.metrics import precision_recall_fscore_support
from sklearn.datasets import make_blobs
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer

In [4]:
# As in the previous class, we can install and import the tweet preprocessing module.
!pip install nlpretext
from nlpretext.social.preprocess import remove_emoji
from nlpretext.social.preprocess import remove_hashtag
from nlpretext.social.preprocess import remove_mentions
from nlpretext.basic.preprocess import replace_urls


In [None]:
# optional: here we check the current working directory
import os
os.getcwd()

'/content'

We assign variables for the two sets, train and test. The data for this tutorial is split across several text files.

In [None]:
import pandas as pd
import numpy as np

# Paths assume Google Drive is mounted at /content/drive; each file is read as one string.
with open('/content/drive/MyDrive/ML_Florida_Tutorial/Data/Waseem/waseemtrain.txt') as f:
    data_tweets_train = f.read()
with open('/content/drive/MyDrive/ML_Florida_Tutorial/Data/Waseem/waseemtrainGold.txt') as f:
    data_labels_train = f.read()

with open('/content/drive/MyDrive/ML_Florida_Tutorial/Data/Waseem/waseemtest.txt') as f:
    data_tweets_test = f.read()
with open('/content/drive/MyDrive/ML_Florida_Tutorial/Data/Waseem/waseemtestGold.txt') as f:
    data_labels_test = f.read()

We clean the data by removing what we do not need, such as emojis, hashtags, mentions and URLs. We can use our function from the previous class or do it manually with *for loops*, as here.

In [None]:
tweets_train = []
labels_train = []

tweets_test = []
labels_test = []

for line in data_tweets_train.split("\n"):
    line = remove_emoji(line)
    line = remove_hashtag(line)
    line = remove_mentions(line)
    line = replace_urls(line, "")
    tweets_train.append(line)
    
for label in data_labels_train.split("\n"):
    labels_train.append(label)
    

for line in data_tweets_test.split("\n"):
    line = remove_emoji(line)
    line = remove_hashtag(line)
    line = remove_mentions(line)
    line = replace_urls(line, "")
    tweets_test.append(line)
    
    
for label in data_labels_test.split("\n"):
    labels_test.append(label)
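
One detail of the loops above is worth checking: `str.split("\n")` returns a trailing empty string when the text ends with a newline, and that empty string is then appended as an extra (empty) tweet or label. The classification reports further down list an unnamed class with support 1, which is consistent with this, although the files themselves are not shown here. A minimal sketch of the difference, using a made-up string (`raw_labels` is hypothetical, not the real file content):

```python
# Hypothetical file content that ends with a newline, as many text files do.
raw_labels = "1\n2\n2\n"

with_split = raw_labels.split("\n")        # keeps a trailing empty string
with_splitlines = raw_labels.splitlines()  # drops it

print(with_split)
print(with_splitlines)
```

If the empty row is unwanted, using `splitlines()` for both the tweets and the labels should keep the two lists aligned; the row counts and scores reported in this notebook were obtained with `split("\n")`.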

Here are some examples from both sets:

In [None]:
# 1 = abusive; 2 = non-abusive
print(tweets_train[23], labels_train[23])
print(tweets_train[12394], labels_train[12394])

Come and get your Jizya scumbag. I have it waiting in 0.4 cal copper and lead. 1
Read about the Muslim invasion of India from historian Will Durant. 2


In [None]:
# 1 = abusive; 2 = non-abusive
print(tweets_test[23], labels_test[23])
print(tweets_test[754], labels_test[754])

Can we kill for Don? And does he give us a bunch of virgins in heaven for doing it? 1
RT : Bianca is feeling sick. She tried the baked Greek eggs. 2


Now we will create a dataframe from the cleaned data using Pandas.

In [None]:
corpus = pd.DataFrame()
corpus['tweet'] = tweets_train
corpus['label'] = labels_train

As we can see, the corpus consists of 14143 rows (one per tweet) and two columns (`tweet` and `label`).

In [None]:
print('Shape of train set {}'.format(corpus.shape))

Shape of train set (14143, 2)


This is the class balance of the labels.

In [None]:
corpus['label'].value_counts()

2    9683
1    4460
Name: label, dtype: int64
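
These counts also give a reference point for the accuracy figures later on. As a rough sketch (plain arithmetic on the two counts printed above; the variable names are only illustrative), a classifier that always predicted the majority label would already be right on roughly two thirds of the training tweets:

```python
# Label counts copied from the value_counts() output above.
label_counts = {"2": 9683, "1": 4460}

total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)
baseline_accuracy = label_counts[majority_label] / total

print(total)                        # 14143
print(majority_label)               # 2
print(round(baseline_accuracy, 3))  # 0.685
```

Any model is worth keeping only if it clearly beats this kind of baseline, which is why the per-class recall for label 1 matters as much as the overall accuracy.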


---



Now that our data is clean and organized in a dataframe, we can start experimenting with different statistical modelling methods. In this tutorial we will use only one model (Support Vector Classification, or SVC), but we will play a little with the features (the input variables built from the text) and the parameters (the settings that tune the model). You can also try other models, such as Naive Bayes, Random Forest, etc. (https://scikit-learn.org/stable/supervised_learning.html#supervised-learning)

Parameters: **Unigrams**

Let's start by using unigrams for our model.

In [None]:
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report

scikit-learn works only with numeric vectors, not with words, so we first need to convert our data into vectors and then train the model.
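
As a minimal sketch of what that conversion looks like, the snippet below runs `CountVectorizer` on two made-up sentences (`toy_tweets` is illustrative, not taken from the Waseem data) with the same `analyzer`, `ngram_range` and `token_pattern` settings used in this notebook. Each column is one vocabulary word and each row holds the word counts for one sentence:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences, only for illustration.
toy_tweets = ["you are great", "you are not great"]

vectorizer = CountVectorizer(analyzer="word",
                             ngram_range=(1, 1),
                             token_pattern=r"\b\w+\b")
X = vectorizer.fit_transform(toy_tweets)  # sparse document-term matrix

# Vocabulary words ordered by their column index in X.
columns = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
counts = X.toarray().tolist()

print(columns)  # ['are', 'great', 'not', 'you']
print(counts)   # [[1, 1, 0, 1], [1, 1, 1, 1]]
```

The two sentences differ only in the `not` column, which is the kind of information the classifier has to work with.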

In [None]:
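# Unigram counts (bag of words) followed by a support vector classifier.
svc = Pipeline([
        ("count_vectorizer", CountVectorizer(analyzer = 'word', 
                                             ngram_range = (1,1),       # single words only
                                             token_pattern=r'\b\w+\b',  # also keep one-character tokens
                                             min_df=1)),
        ("svc", SVC(kernel="rbf")) # try with "linear", "poly", "rbf" or "sigmoid"
    ])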

In [None]:
model = svc
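# Setting `probability` on the Pipeline object has no effect on the SVC step, so it is omitted;
# for class probabilities, construct the step as SVC(kernel="rbf", probability=True).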
model.fit(tweets_train, labels_train)
y_pred = model.predict(tweets_test)
print(accuracy_score(labels_test, y_pred))
print(classification_report(labels_test, y_pred))

0.8150031786395423
              precision    recall  f1-score   support

                   0.00      0.00      0.00         1
           1       0.79      0.56      0.66       496
           2       0.82      0.93      0.87      1076

    accuracy                           0.82      1573
   macro avg       0.54      0.50      0.51      1573
weighted avg       0.81      0.82      0.80      1573



  


How to read the table:   
**Precision**: of all the observations the model assigned to a class, the fraction that truly belong to it.   
**Recall**: of all the observations that truly belong to a class, the fraction the model assigned to it.   
**F1-score**: the harmonic mean of precision and recall.    
**Support**: the number of test observations in each class. The first row, with no class name and support 1, most likely corresponds to one empty label in the test labels.   
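
The three scores can also be computed by hand, which helps when reading the report. The sketch below uses a hypothetical helper (`precision_recall_f1`) and made-up labels; it follows the standard definitions and is meant as an illustration, not as a replacement for `classification_report`:

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class, treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up gold labels and predictions, only for illustration.
toy_true = ["1", "1", "1", "2", "2", "2"]
toy_pred = ["1", "2", "2", "1", "2", "2"]

precision, recall, f1 = precision_recall_f1(toy_true, toy_pred, positive="1")
print(precision, recall, f1)
```

Plugging the class 1 values from the unigram report (precision 0.79, recall 0.56) into the same F1 formula gives about 0.66, which matches the f1-score column.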

We can also pass individual sentences to see the label the model assigns.

In [None]:
sentence = (['Islam is worse than the Nazi party ever was.'])
label_prediction = model.predict(sentence)
print(label_prediction) 

['1']


In [None]:
sentence = (['Islamic State Executes 5 Men In Mosul After Their Wives Fail To Wear New “Afghan-Style” Hijab…'])
label_prediction = model.predict(sentence)
print(label_prediction) 

['2']
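
The same pipeline layout also makes it easy to try the other classifiers mentioned earlier, such as Naive Bayes. A minimal sketch, assuming a tiny made-up training set (`toy_tweets`, `toy_labels` and `nb_model` are illustrative names, and `MultinomialNB` is swapped in for the SVC; nothing here was run on the Waseem data):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data: "1" = abusive, "2" = non-abusive.
toy_tweets = ["you are an idiot", "idiot go away",
              "have a nice day", "what a nice movie"]
toy_labels = ["1", "1", "2", "2"]

nb_model = Pipeline([
    ("count_vectorizer", CountVectorizer(analyzer="word",
                                         ngram_range=(1, 1),
                                         token_pattern=r"\b\w+\b")),
    ("naive_bayes", MultinomialNB()),  # replaces the SVC step
])

nb_model.fit(toy_tweets, toy_labels)
toy_predictions = nb_model.predict(["idiot", "nice day"])
print(toy_predictions)
```

To compare it with the SVC on the real data, the `fit`, `predict` and `classification_report` calls from the cells above can be reused unchanged.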


---

Now, to try to improve our model, we can try bigrams:

Parameters: **Bigrams**
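
To make the difference concrete, here is a minimal pure-Python sketch of what word n-grams are. `word_ngrams` is a hypothetical helper written for illustration; it mimics the lowercasing and the `token_pattern` used in this notebook, but it is not how `CountVectorizer` is implemented:

```python
import re

def word_ngrams(text, n):
    """Return the list of word n-grams of size n, in order of appearance."""
    tokens = re.findall(r"\b\w+\b", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

example = "Read about the history of India"
unigrams = word_ngrams(example, 1)
bigrams = word_ngrams(example, 2)

print(unigrams)
print(bigrams)
```

With `ngram_range=(2, 2)` only the word pairs are used as features, so each feature occurs less often than a single word does; `ngram_range=(1, 2)` would keep both unigrams and bigrams.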


In [None]:
# Bigram counts followed by a support vector classifier.
svc = Pipeline([
        ("count_vectorizer", CountVectorizer(analyzer = 'word',
                                             ngram_range=(2, 2),        # pairs of consecutive words only
                                             token_pattern=r'\b\w+\b', 
                                             min_df=1)),
        ("svc", SVC(kernel="rbf")) # try with "linear", "poly", "rbf" or "sigmoid"
    ])

In [None]:
model = svc
# (probability=True must be passed to SVC itself; setting it on the Pipeline has no effect)
model.fit(tweets_train, labels_train)
y_pred = model.predict(tweets_test)
print(accuracy_score(labels_test, y_pred))
print(classification_report(labels_test, y_pred))

0.7768595041322314
              precision    recall  f1-score   support

                   0.00      0.00      0.00         1
           1       0.87      0.35      0.50       496
           2       0.76      0.98      0.86      1076

    accuracy                           0.78      1573
   macro avg       0.54      0.44      0.45      1573
weighted avg       0.80      0.78      0.74      1573



  


Overall accuracy has dropped (from 0.82 to 0.78): precision for class 1 rose from 0.79 to 0.87, but its recall fell from 0.56 to 0.35. Let's try a trigram model:

Parameters: **Trigrams**

In [None]:
# Trigram counts followed by a support vector classifier.
svc = Pipeline([
        ("count_vectorizer", CountVectorizer(analyzer = 'word',
                                             ngram_range=(3, 3),        # triples of consecutive words only
                                             token_pattern=r'\b\w+\b', 
                                             min_df=1)),
        ("svc", SVC(kernel="rbf")) # try with "linear", "poly", "rbf" or "sigmoid"
    ])

In [None]:
model = svc
# (probability=True must be passed to SVC itself; setting it on the Pipeline has no effect)
model.fit(tweets_train, labels_train)
y_pred = model.predict(tweets_test)
print(accuracy_score(labels_test, y_pred))
print(classification_report(labels_test, y_pred))

0.7457088366179275
              precision    recall  f1-score   support

                   0.00      0.00      0.00         1
           1       0.88      0.23      0.36       496
           2       0.73      0.99      0.84      1076

    accuracy                           0.75      1573
   macro avg       0.54      0.40      0.40      1573
weighted avg       0.78      0.75      0.69      1573



  


This does not give good results compared with the unigram model either: accuracy falls to 0.75 and recall for class 1 to 0.23. Let's try a character-based model:

Parameters: **Character n-grams**
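
Character n-grams are overlapping substrings rather than words. A minimal pure-Python sketch of the idea follows; `char_ngrams` is a hypothetical helper for illustration that lowercases the text and collapses whitespace before slicing, and it is only an approximation of what `CountVectorizer(analyzer='char')` does internally:

```python
def char_ngrams(text, min_n, max_n):
    """Return all character n-grams with min_n <= n <= max_n."""
    text = " ".join(text.lower().split())  # lowercase and collapse whitespace
    grams = []
    for n in range(min_n, max_n + 1):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams

grams = char_ngrams("so bad", 3, 5)
print(grams)
```

Because these features are substrings, misspelled or deliberately obfuscated words still share many n-grams with their standard forms. `analyzer='char_wb'` is a related option that builds n-grams only inside word boundaries.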

In [None]:
# Character n-gram counts (3 to 5 characters) followed by a support vector classifier.
svc = Pipeline([
        ("count_vectorizer", CountVectorizer(analyzer = 'char',
                                             ngram_range=(3, 5))),
        ("svc", SVC(kernel="rbf")) # try with "linear", "poly", "rbf" or "sigmoid"
    ])

In [None]:
model = svc
# (probability=True must be passed to SVC itself; setting it on the Pipeline has no effect)
model.fit(tweets_train, labels_train)
y_pred = model.predict(tweets_test)
print(accuracy_score(labels_test, y_pred))
print(classification_report(labels_test, y_pred))