# First approach to a machine learning model for sexist texts

In this first approach we train a model on the dataset we have available, aiming for a reasonable baseline at classifying sexist texts.

The dataset contains the following fields:
1. Sentence id
2. Source dataset the sentence was taken from
3. Text (the input we will have at prediction time)
4. Toxicity (a possible output of interest)
5. Sexist flag (the required output)
6. "of_id" (not used here)

We need to build a classifier that tells us whether a given text is sexist. To do so we use the text column as input and the sexist flag as the target, with careful preprocessing to get the best results we can.

## Imports and preliminary considerations

In [8]:
# Dataset handling
import pandas as pd
import numpy as np

# Text preprocessing
import re
import string  # string.punctuation is used by preprocessor()
from nltk.corpus import stopwords  # needs nltk.download('stopwords') once
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Model training
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Saving the model
import pickle

## Loading the data and transforming the dataset

In [9]:
# Show the full text of each column
pd.set_option('display.max_colwidth', None)

df = pd.read_csv("Datasets/sexism_data.csv")
df.head(10)

Unnamed: 0,id,dataset,text,toxicity,sexist,of_id
0,0,other,MENTION3481 i didn't even know random was an option!,0.11818,False,-1
1,1,other,Bottom two should've gone! #mkr,0.25185,False,-1
2,2,callme,MENTION3111 MENTION3424 ladyboner deserves so much more credit than dudeboner. #bonerdebate #reddetails,0.113331,False,-1
3,3,other,She shall now be known as Sourpuss #MKR #KatAndre #FailedFoodies,0.531153,False,-1
4,4,other,Tarah W threw a bunch of women under the bus so she could get Wadhwa's support for her Women in Tech book.,0.118718,False,-1
5,5,callme,I just don't trust an adult who uses coupons.,0.11959,False,13227
6,6,other,Not again #MKR,0.198415,False,-1
7,7,other,"#MKR2016 returns in 2020 once the all the couples, intruders, gate crashers and second-chancers are eliminated MENTION786 #MKR #MKR2015",0.447262,False,-1
8,8,other,MENTION526 most abuse comes from gamergate accounts that are 30-90 days old.,0.254602,False,-1
9,9,other,Great to see the local National Park workers tucking into a free feed. How about you empty those loo's instead. #MKR,0.392942,False,-1


If we keep only the columns we are interested in:

In [10]:
df = df[['text', 'sexist']]
df.head(10)

Unnamed: 0,text,sexist
0,MENTION3481 i didn't even know random was an option!,False
1,Bottom two should've gone! #mkr,False
2,MENTION3111 MENTION3424 ladyboner deserves so much more credit than dudeboner. #bonerdebate #reddetails,False
3,She shall now be known as Sourpuss #MKR #KatAndre #FailedFoodies,False
4,Tarah W threw a bunch of women under the bus so she could get Wadhwa's support for her Women in Tech book.,False
5,I just don't trust an adult who uses coupons.,False
6,Not again #MKR,False
7,"#MKR2016 returns in 2020 once the all the couples, intruders, gate crashers and second-chancers are eliminated MENTION786 #MKR #MKR2015",False
8,MENTION526 most abuse comes from gamergate accounts that are 30-90 days old.,False
9,Great to see the local National Park workers tucking into a free feed. How about you empty those loo's instead. #MKR,False


In [11]:
# Dataset analysis
df.shape

(13631, 2)

In [12]:
df['sexist'].value_counts()

False    11822
True      1809
Name: sexist, dtype: int64

Our dataset has 13631 records, which is plenty for our experiments.

Keep in mind that there may be null or empty records that can noticeably hurt training, so we deal with them during preprocessing.

Also note that the large majority of the texts are non-sexist (11822 versus 1809), so we must pay particular attention to how well the sexist ones are detected.
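To make the imbalance concrete, the following sketch uses only the class counts printed above to compute the share of each class and the accuracy that a trivial classifier would reach by always answering "not sexist". The helper name majority_baseline_accuracy is an assumption introduced for illustration, not part of the notebook.

```python
# Class counts copied from the value_counts() output above
counts = {False: 11822, True: 1809}


def majority_baseline_accuracy(class_counts):
    """Accuracy of a classifier that always predicts the most frequent class."""
    total = sum(class_counts.values())
    return max(class_counts.values()) / total


total = sum(counts.values())
sexist_share = counts[True] / total
baseline = majority_baseline_accuracy(counts)

print(f"sexist share: {sexist_share:.3f}")                   # 0.133
print(f"majority-class baseline accuracy: {baseline:.3f}")  # 0.867
```

Any accuracy figure reported later should be read against this baseline of roughly 0.87, which is why per-class precision and recall matter more here than accuracy alone.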

In [13]:
# Look at some of the sexist sentences
df[df['sexist'] == True].head(10)

Unnamed: 0,text,sexist
10,All my sons have grown up with computer games but I'm not interested. I see them as a male thing. MENTION203,True
14,Women have more intuition than men.,True
30,Women do not belong in politics,True
45,"On the average, men are more arrogant than women",True
55,A woman will never be truly fulfilled in life if she doesn’t have a committed long-term relationship with a man,True
86,"mmmm Good luck MENTION3536 😄 let us know how you go? Still think u & MENTION395 should enter #MKR 😄 more cooking, less bitching mmmm",True
104,I hate having guys for teachers,True
105,A female can't tell me nothing bout sports. Sorry I grew up in a sports crazed house.,True
119,women can't drive. The only exception: danica patrick.,True
123,RT MENTION2547 I'm not sexist but the Men's skill sets and abilities are just leagues above the women's. Much more entertaining. #Olympics,True


In [14]:
# Look at some of the non-sexist sentences
df[df['sexist'] == False].head(10)

Unnamed: 0,text,sexist
0,MENTION3481 i didn't even know random was an option!,False
1,Bottom two should've gone! #mkr,False
2,MENTION3111 MENTION3424 ladyboner deserves so much more credit than dudeboner. #bonerdebate #reddetails,False
3,She shall now be known as Sourpuss #MKR #KatAndre #FailedFoodies,False
4,Tarah W threw a bunch of women under the bus so she could get Wadhwa's support for her Women in Tech book.,False
5,I just don't trust an adult who uses coupons.,False
6,Not again #MKR,False
7,"#MKR2016 returns in 2020 once the all the couples, intruders, gate crashers and second-chancers are eliminated MENTION786 #MKR #MKR2015",False
8,MENTION526 most abuse comes from gamergate accounts that are 30-90 days old.,False
9,Great to see the local National Park workers tucking into a free feed. How about you empty those loo's instead. #MKR,False


## Text preprocessing

The dataset gives us the texts together with the label we want to predict for the texts we use later on. For everything to work as well as possible we need good text preprocessing, so that the model does not learn from factors that are irrelevant to us, such as "#" symbols or mentions, which at first glance are the first things to remove.

The preprocessing starts with a simple cleaning pass, in which we remove punctuation, hashtags and mentions, and then normalise the text to remove the extra spaces and leftover fragments that these transformations can create.

Afterwards we use lemmatization and a bag of words to drop words that carry little meaning.
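The cleaning pass described above can be sketched on a single string before it is applied to a whole column. This is a minimal illustration, assuming mentions appear as MENTION followed by digits, as in the rows shown earlier. The function name clean_text and the simplified URL pattern are assumptions, not the notebook's final function.

```python
import re
import string

URL_RE = re.compile(r'https?://\S+')   # simplified link pattern (assumption)
MENTION_RE = re.compile(r'MENTION\d+')  # anonymised mentions, e.g. MENTION3481
HASHTAG_RE = re.compile(r'#\w+')        # hashtags such as #MKR

PUNCTUATION_TABLE = str.maketrans('', '', string.punctuation)


def clean_text(text):
    """Strip links, mentions, hashtags, punctuation and digits, then lowercase."""
    for pattern in (URL_RE, MENTION_RE, HASHTAG_RE):
        text = pattern.sub(' ', text)
    # Delete ASCII punctuation so that "didn't" becomes "didnt"
    text = text.translate(PUNCTUATION_TABLE)
    # Replace anything that is not a letter or whitespace (digits, emojis, ...)
    text = re.sub(r'[^a-zA-Z\s]', ' ', text)
    # Lowercase and collapse repeated whitespace
    return ' '.join(text.lower().split())


print(clean_text("MENTION3481 i didn't even know random was an option!"))
print(clean_text("Bottom two should've gone!  #mkr"))
```

Working on one string first makes each rule easy to check by eye; the same function could then be applied to the column with df['text'].apply(clean_text).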

### Trial run of the text removal, before writing the general function

In [15]:
prueba = df['text'].head(20)

In [16]:
ser = pd.Series(df['text'][86])

In [17]:
# Show the full text of each column
pd.set_option('display.max_colwidth', None)
# Series.append was removed in pandas 2.0; pd.concat is the equivalent
prueba = pd.concat([prueba, ser])
prueba

0                                                                                        MENTION3481 i didn't even know random was an option!
1                                                                                                            Bottom two should've gone!  #mkr
2                                     MENTION3111 MENTION3424 ladyboner deserves so much more credit than dudeboner. #bonerdebate #reddetails
3                                                                            She shall now be known as Sourpuss #MKR #KatAndre #FailedFoodies
4                                  Tarah W threw a bunch of women under the bus so she could get Wadhwa's support for her Women in Tech book.
5                                                                                               I just don't trust an adult who uses coupons.
6                                                                                                                              Not again #MKR
7     

1. Working in order, the first thing we remove is links

In [18]:
# We will extend this function as we add more steps

def preprocessor(title_text):
    # First, remove links
    title_text = title_text.apply(lambda x: re.sub(r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b', '', x, flags=re.MULTILINE))
    return title_text

prueba_noLink = preprocessor(prueba.copy())
prueba_noLink

0                                                                                        MENTION3481 i didn't even know random was an option!
1                                                                                                            Bottom two should've gone!  #mkr
2                                     MENTION3111 MENTION3424 ladyboner deserves so much more credit than dudeboner. #bonerdebate #reddetails
3                                                                            She shall now be known as Sourpuss #MKR #KatAndre #FailedFoodies
4                                  Tarah W threw a bunch of women under the bus so she could get Wadhwa's support for her Women in Tech book.
5                                                                                               I just don't trust an adult who uses coupons.
6                                                                                                                              Not again #MKR
7     

2. Removing mentions

In [19]:
# We will extend this function as we add more steps

def preprocessor(title_text):
    # First, remove links
    title_text = title_text.apply(lambda x: re.sub(r'(https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b', '', x, flags=re.MULTILINE))
    # Second, remove mentions
    title_text = title_text.apply(lambda x: re.sub(r'MENTION[0-9]*', '', x, flags=re.MULTILINE))
    return title_text

prueba_noMEN = preprocessor(prueba.copy())
prueba_noMEN

0                                                                                          i didn't even know random was an option!
1                                                                                                  Bottom two should've gone!  #mkr
2                                                   ladyboner deserves so much more credit than dudeboner. #bonerdebate #reddetails
3                                                                  She shall now be known as Sourpuss #MKR #KatAndre #FailedFoodies
4                        Tarah W threw a bunch of women under the bus so she could get Wadhwa's support for her Women in Tech book.
5                                                                                     I just don't trust an adult who uses coupons.
6                                                                                                                    Not again #MKR
7     #MKR2016 returns in 2020 once the all the couples, intruders, gate cra

In [20]:
# Combine the steps into a single pattern
def preprocessor(title_text):
    # Remove links, mentions and hashtags in one pass
    title_text = title_text.apply(lambda x: re.sub(r'((https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b)|(MENTION[0-9]*)|(#[a-zA-Z0-9]*)', ' ', x, flags=re.MULTILINE))

    return title_text

prueba_noMEN = preprocessor(prueba.copy())
prueba_noMEN

0                                                                                   i didn't even know random was an option!
1                                                                                              Bottom two should've gone!   
2                                                                 ladyboner deserves so much more credit than dudeboner.    
3                                                                                   She shall now be known as Sourpuss      
4                 Tarah W threw a bunch of women under the bus so she could get Wadhwa's support for her Women in Tech book.
5                                                                              I just don't trust an adult who uses coupons.
6                                                                                                                Not again  
7                returns in 2020 once the all the couples, intruders, gate crashers and second-chancers are eliminated      


3. Removing emojis

In [21]:
# Builds the emoji pattern and removes every match from the text
def removeEmoji(text):
    emoji_pattern = re.compile("["
      u"\U0001F600-\U0001F64F"  # emoticons
      u"\U0001F300-\U0001F5FF"  # symbols & pictographs
      u"\U0001F680-\U0001F6FF"  # transport & map symbols
      u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
      u"\U00002500-\U00002BEF"  # box drawing, shapes, misc symbols, arrows
      u"\U00002702-\U000027B0"
      u"\U00002702-\U000027B0"
      u"\U000024C2-\U0001F251"
      u"\U0001f926-\U0001f937"
      u"\U00010000-\U0010ffff"
      u"\u2640-\u2642"
      u"\u2600-\u2B55"
      u"\u200d"
      u"\u23cf"
      u"\u23e9"
      u"\u231a"
      u"\ufe0f"  # dingbats
      u"\u3030"
      "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'',text)

In [22]:
def preprocessor(title_text):
    # First and second, remove links and mentions
    title_text = title_text.apply(lambda x: re.sub(r'((https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b)|(MENTION[0-9]*)', '', x, flags=re.MULTILINE))
    # Third, remove emojis
    title_text = title_text.apply(lambda x: removeEmoji(x))
    return title_text

prueba_noEmo = preprocessor(prueba.copy())
prueba_noEmo

0                                                                                          i didn't even know random was an option!
1                                                                                                  Bottom two should've gone!  #mkr
2                                                   ladyboner deserves so much more credit than dudeboner. #bonerdebate #reddetails
3                                                                  She shall now be known as Sourpuss #MKR #KatAndre #FailedFoodies
4                        Tarah W threw a bunch of women under the bus so she could get Wadhwa's support for her Women in Tech book.
5                                                                                     I just don't trust an adult who uses coupons.
6                                                                                                                    Not again #MKR
7     #MKR2016 returns in 2020 once the all the couples, intruders, gate cra

4. Next we remove the **#hashtags** typical of Twitter.

In [23]:
def preprocessor(title_text):
    # First, second and fourth: remove links, mentions and hashtags
    title_text = title_text.apply(lambda x: re.sub(r'((https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b)|(MENTION[0-9]*)|(#[a-zA-Z0-9]*)', '', x, flags=re.MULTILINE))
    # Third, remove emojis
    title_text = title_text.apply(lambda x: removeEmoji(x))
    return title_text

prueba_noHas = preprocessor(prueba.copy())
prueba_noHas

0                                                                                  i didn't even know random was an option!
1                                                                                              Bottom two should've gone!  
2                                                                  ladyboner deserves so much more credit than dudeboner.  
3                                                                                     She shall now be known as Sourpuss   
4                Tarah W threw a bunch of women under the bus so she could get Wadhwa's support for her Women in Tech book.
5                                                                             I just don't trust an adult who uses coupons.
6                                                                                                                Not again 
7                  returns in 2020 once the all the couples, intruders, gate crashers and second-chancers are eliminated   
8       

5. Removing punctuation, double spaces, numbers and characters such as \n

In [24]:
def preprocessor(title_text):

    # First, second and fourth: remove links, mentions and hashtags, then newlines and tabs
    title_text = title_text.apply(lambda x: re.sub(r'((https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b)|(MENTION[0-9]*)|(#[a-zA-Z0-9]*)', '', x, flags=re.MULTILINE))
    title_text = title_text.str.replace('\n','')
    title_text = title_text.str.replace('\t','')

    # Remove punctuation (string is the standard-library module and must be imported)
    title_text = title_text.str.translate(str.maketrans('', '', string.punctuation))

    # Remove any leftover symbols and some expressions that carry no information.
    # The alternatives are not anchored with \b, so "rt" is also removed inside
    # words ("support" -> "suppo" in the output below), and because this runs
    # before lowercasing, an uppercase "RT" is not removed.
    title_text = title_text.apply(lambda x: re.sub(r'([^0-9a-zA-Z:,\s]+)|(rt)|(lol)|(lmao)|(lmfao)', '', x, flags=re.MULTILINE))

    # After removing punctuation, remove numbers
    title_text = title_text.apply(lambda x: re.sub(r'([0-9]+)', ' ', x, flags=re.MULTILINE))
    # Third, remove emojis
    title_text = title_text.apply(lambda x: removeEmoji(x))
    # Remove runs of four or more identical characters
    title_text = title_text.apply(lambda x: re.sub(r'(.)\1{3,}', ' ', x, flags=re.MULTILINE))

    # Lowercase everything and trim
    title_text = title_text.str.lower()
    title_text = title_text.str.strip()
    # Remove extra spaces
    title_text = title_text.apply(lambda x: ' '.join([y for y in x.split(' ') if y != '']))

    return title_text


prueba_clean = preprocessor(prueba.copy())
prueba_clean

0                                                                           i didnt even know random was an option
1                                                                                         bottom two shouldve gone
2                                                            ladyboner deserves so much more credit than dudeboner
3                                                                               she shall now be known as sourpuss
4           tarah w threw a bunch of women under the bus so she could get wadhwas suppo for her women in tech book
5                                                                      i just dont trust an adult who uses coupons
6                                                                                                        not again
7                    returns in once the all the couples intruders gate crashers and secondchancers are eliminated
8                                                       most abuse comes from ga
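In the output above, "support" has become "suppo": the alternatives (rt), (lol), (lmao) and (lmfao) are not anchored, so they also match inside longer words. One possible fix, sketched here with hypothetical names (FILLER_RE, drop_fillers), is to anchor the pattern with word boundaries and ignore case. Applying it would change the cleaned texts and therefore the results reported later, so it is shown only as an option.

```python
import re

# Unanchored alternatives, as in the cell above: they match inside words too
UNANCHORED_RE = re.compile(r'(rt)|(lol)|(lmao)|(lmfao)')

# Hypothetical alternative: whole words only, in any letter case
FILLER_RE = re.compile(r'\b(?:rt|lol|lmao|lmfao)\b', flags=re.IGNORECASE)


def drop_fillers(text, pattern):
    """Remove every match of pattern and collapse the leftover whitespace."""
    return ' '.join(pattern.sub(' ', text).split())


sample = "RT get wadhwas support lol"
print(drop_fillers(sample, UNANCHORED_RE))  # RT get wadhwas suppo
print(drop_fillers(sample, FILLER_RE))      # get wadhwas support
```

The anchored version leaves "support" intact and also removes the uppercase "RT" that the unanchored, case-sensitive pattern misses.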

### Applying it to the whole dataset

With the function in place, we can apply it to the whole dataset

In [25]:
# Helper function that removes all emoji characters
# https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b
def removeEmoji(text):
    emoji_pattern = re.compile("["
      u"\U0001F600-\U0001F64F"  # emoticons
      u"\U0001F300-\U0001F5FF"  # symbols & pictographs
      u"\U0001F680-\U0001F6FF"  # transport & map symbols
      u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
      u"\U00002500-\U00002BEF"  # box drawing, shapes, misc symbols, arrows
      u"\U00002702-\U000027B0"
      u"\U00002702-\U000027B0"
      u"\U000024C2-\U0001F251"
      u"\U0001f926-\U0001f937"
      u"\U00010000-\U0010ffff"
      u"\u2640-\u2642"
      u"\u2600-\u2B55"
      u"\u200d"
      u"\u23cf"
      u"\u23e9"
      u"\u231a"
      u"\ufe0f"  # dingbats
      u"\u3030"
      "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'',text)

# Preprocessor function
def preprocessor(title_text):

    # First, second and fourth: remove links, mentions and hashtags, then newlines and tabs
    title_text = title_text.apply(lambda x: re.sub(r'((https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b)|(MENTION[0-9]*)|(#[a-zA-Z0-9]*)', ' ', x, flags=re.MULTILINE))
    title_text = title_text.str.replace('\n',' ')
    title_text = title_text.str.replace('\t',' ')

    # Remove punctuation
    title_text = title_text.str.translate(str.maketrans('', '', string.punctuation))

    # Remove any leftover symbols and some expressions that carry no information
    title_text = title_text.apply(lambda x: re.sub(r'([^0-9a-zA-Z:,\s]+)|(rt)|(lol)|(lmao)|(lmfao)', '', x, flags=re.MULTILINE))

    # After removing punctuation, remove numbers
    title_text = title_text.apply(lambda x: re.sub(r'([0-9]+)', ' ', x, flags=re.MULTILINE))
    # Third, remove emojis
    title_text = title_text.apply(lambda x: removeEmoji(x))
    # Remove runs of four or more identical characters
    title_text = title_text.apply(lambda x: re.sub(r'(.)\1{3,}', ' ', x, flags=re.MULTILINE))

    # Lowercase everything and trim
    title_text = title_text.str.lower()
    title_text = title_text.str.strip()
    # Remove extra spaces
    title_text = title_text.apply(lambda x: ' '.join([y for y in x.split(' ') if y != '']))

    # Lemmatization (disabled here; applied as a separate step further on)
    #title_text = title_text.apply(lambda x: lemmatization(x))

    return title_text

In [26]:
df['text'] = preprocessor(df['text'])

With the preprocessing done, we can filter the dataframe to drop rows that are empty or whose text is 10 characters or shorter (the threshold is just an example)

In [27]:
min_lenght = 10

df = df[df['text'].str.len() > min_lenght]

After filtering, let us see how many records remain in each class

In [28]:
df['sexist'].value_counts()

False    11218
True      1807
Name: sexist, dtype: int64

In [29]:
df.head(10)

Unnamed: 0,text,sexist
0,i didnt even know random was an option,False
1,bottom two shouldve gone,False
2,ladyboner deserves so much more credit than dudeboner,False
3,she shall now be known as sourpuss,False
4,tarah w threw a bunch of women under the bus so she could get wadhwas suppo for her women in tech book,False
5,i just dont trust an adult who uses coupons,False
7,returns in once the all the couples intruders gate crashers and secondchancers are eliminated,False
8,most abuse comes from gamergate accounts that are days old,False
9,great to see the local national park workers tucking into a free feed how about you empty those loos instead,False
10,all my sons have grown up with computer games but im not interested i see them as a male thing,True


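### Lemmatization and Bag of Words

We use the techniques covered during the master's course to finish preprocessing the data

Lemmatization: we reduce each word to its lemma. Note that WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is passed, which is why "was" becomes "wa" in the output below instead of "be".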

In [30]:
from nltk.stem import WordNetLemmatizer

stemmer = WordNetLemmatizer()

def lemmatization(text):
    text = text.split()

    text = [stemmer.lemmatize(word) for word in text]
    text = ' '.join(text)
    return text
    

df['text'] = df['text'].apply(lambda x: lemmatization(x))

In [31]:
df.head()


Unnamed: 0,text,sexist
0,i didnt even know random wa an option,False
1,bottom two shouldve gone,False
2,ladyboner deserves so much more credit than dudeboner,False
3,she shall now be known a sourpuss,False
4,tarah w threw a bunch of woman under the bus so she could get wadhwas suppo for her woman in tech book,False


In [32]:
X_texto = df['text']
vectorizer = CountVectorizer(max_features=1500, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
X = vectorizer.fit_transform(X_texto).toarray()

We apply the TF-IDF transformation to obtain a more representative value for how often a word appears in the texts:

Term Frequency
TF(word) = (Number of occurrences of the word in the document)/(Total words in the document)

Inverse Document Frequency
IDF(word) = Log((Total number of documents)/(Number of documents containing the word))

The TF-IDF value of a word in a document is the product TF(word) * IDF(word).
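The two formulas can be checked by hand on a toy corpus. The sketch below implements them literally in plain Python; the three short documents and the function names are made up for illustration and are not taken from the dataset.

```python
import math

# Toy corpus: each document is already tokenised into words
docs = [
    "women belong in politics".split(),
    "women can drive".split(),
    "not again".split(),
]


def term_frequency(word, doc):
    """Occurrences of word in doc divided by the number of words in doc."""
    return doc.count(word) / len(doc)


def inverse_document_frequency(word, docs):
    """Log of (number of documents / number of documents containing word).

    The word must appear in at least one document, otherwise this divides by zero.
    """
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)


def tf_idf(word, doc, docs):
    return term_frequency(word, doc) * inverse_document_frequency(word, docs)


# "women" appears in two of three documents, "politics" in only one
score_common = tf_idf("women", docs[0], docs)    # 0.25 * ln(3/2)
score_rare = tf_idf("politics", docs[0], docs)   # 0.25 * ln(3)
print(score_common, score_rare)
```

The rarer word gets the higher weight. The values produced by scikit-learn's TfidfTransformer will differ from this textbook version: with its default settings it uses raw counts as the term frequency, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalisation of each row.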


In [33]:
tfidfconverter = TfidfTransformer()
X = tfidfconverter.fit_transform(X).toarray()

### Text preprocessing conclusions

In [34]:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('brown')

stemmer = WordNetLemmatizer()
def lemmatization(text):
    
    text = text.split()
    text = [stemmer.lemmatize(word) for word in text]
    text = ' '.join(text)
    return text

# Helper function that removes all emoji characters
# https://gist.github.com/slowkow/7a7f61f495e3dbb7e3d767f97bd7304b
def removeEmoji(text):
    emoji_pattern = re.compile("["
      u"\U0001F600-\U0001F64F"  # emoticons
      u"\U0001F300-\U0001F5FF"  # symbols & pictographs
      u"\U0001F680-\U0001F6FF"  # transport & map symbols
      u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
      u"\U00002500-\U00002BEF"  # box drawing, shapes, misc symbols, arrows
      u"\U00002702-\U000027B0"
      u"\U00002702-\U000027B0"
      u"\U000024C2-\U0001F251"
      u"\U0001f926-\U0001f937"
      u"\U00010000-\U0010ffff"
      u"\u2640-\u2642"
      u"\u2600-\u2B55"
      u"\u200d"
      u"\u23cf"
      u"\u23e9"
      u"\u231a"
      u"\ufe0f"  # dingbats
      u"\u3030"
      "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'',text)

# Preprocessor function
def preprocessor2(title_text):

    # First, second and fourth: remove links, mentions and hashtags, then newlines and tabs
    title_text = title_text.apply(lambda x: re.sub(r'((https|http)?:\/\/(\w|\.|\/|\?|\=|\&|\%)*\b)|(MENTION[0-9]*)|(#[a-zA-Z0-9]*)', ' ', x, flags=re.MULTILINE))
    title_text = title_text.str.replace('\n',' ')
    title_text = title_text.str.replace('\t',' ')

    # Remove punctuation
    title_text = title_text.str.translate(str.maketrans('', '', string.punctuation))

    # Remove any leftover symbols and some expressions that carry no information
    title_text = title_text.apply(lambda x: re.sub(r'([^0-9a-zA-Z:,\s]+)|(rt)|(lol)|(lmao)|(lmfao)', '', x, flags=re.MULTILINE))

    # After removing punctuation, remove numbers
    title_text = title_text.apply(lambda x: re.sub(r'([0-9]+)', ' ', x, flags=re.MULTILINE))
    # Third, remove emojis
    title_text = title_text.apply(lambda x: removeEmoji(x))
    # Remove runs of four or more identical characters
    title_text = title_text.apply(lambda x: re.sub(r'(.)\1{3,}', ' ', x, flags=re.MULTILINE))

    # Lowercase everything and trim
    title_text = title_text.str.lower()
    title_text = title_text.str.strip()
    # Remove extra spaces
    title_text = title_text.apply(lambda x: ' '.join([y for y in x.split(' ') if y != '']))

    # Lemmatization
    title_text = title_text.apply(lambda x: lemmatization(x))

    return title_text



Usage

In [35]:
min_lenght = 10

# Remove short tweets
df['text'] = preprocessor(df['text'])
df = df[df['text'].str.len()>min_lenght]

In [36]:
X_texto = df['text']
X_texto = preprocessor(X_texto)

# Bag of Words (including stop-word removal)
vectorizer = CountVectorizer(max_features=300, min_df=5, max_df=0.7, stop_words=stopwords.words('english'))
X = vectorizer.fit_transform(X_texto).toarray()

# TF-IDF
tfidfconverter = TfidfTransformer()
X = tfidfconverter.fit_transform(X).toarray()

In [37]:
X[0].shape

(300,)

## Building the model

After preprocessing, the matrix X already holds the sentences ready for training

In [38]:
# Split the data with train_test_split
# X comes from the preprocessing step
y = df['sexist']
y = y.astype(int)
y

0        0
1        0
2        0
3        0
4        0
        ..
13625    0
13626    0
13627    0
13629    0
13630    0
Name: sexist, Length: 13014, dtype: int32

In [39]:
X.shape , y.shape

((13014, 300), (13014,))

In [40]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

### Random Forest model

In [42]:
classifier = RandomForestClassifier(n_estimators=1000, random_state=0)
classifier.fit(X_train, y_train) 

RandomForestClassifier(n_estimators=1000, random_state=0)

In [43]:
pred_RF = classifier.predict(X_test)

**EVALUATION**

In [44]:
print(confusion_matrix(y_test, pred_RF))
print(classification_report(y_test, pred_RF))
print('Accuracy Score:', accuracy_score(y_test, pred_RF)) 

[[3200  143]
 [ 275  287]]
              precision    recall  f1-score   support

           0       0.92      0.96      0.94      3343
           1       0.67      0.51      0.58       562

    accuracy                           0.89      3905
   macro avg       0.79      0.73      0.76      3905
weighted avg       0.88      0.89      0.89      3905

Accuracy Score: 0.8929577464788733
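The figures for class 1 in the report can be recomputed directly from the confusion matrix, which helps when reading the result. The sketch below uses only the four numbers printed above; the variable names are chosen for illustration.

```python
# Confusion matrix printed above: rows are true classes, columns are predictions
tn, fp = 3200, 143   # true class 0
fn, tp = 275, 287    # true class 1 (sexist)

precision = tp / (tp + fp)   # share of predicted-sexist texts that are sexist
recall = tp / (tp + fn)      # share of sexist texts that were detected
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.4f}")
# precision=0.67 recall=0.51 f1=0.58 accuracy=0.8930
```

Despite an accuracy of about 0.89, the model misses 275 of the 562 sexist texts in the test set, which is the weakness the class imbalance warned about.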


**SAVING THE MODEL**

In [None]:
# Write
with open('randomForestSimpleINT', 'wb') as picklefile:
    pickle.dump(classifier,picklefile)

In [153]:
# Read (this opens 'randomForestSimple', not the 'randomForestSimpleINT' file written above)
with open('randomForestSimple', 'rb') as training_model:
    model = pickle.load(training_model)
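The saved classifier only accepts TF-IDF vectors, so classifying new raw text later also requires the fitted CountVectorizer and TfidfTransformer. One way to keep them together, sketched here with plain dictionaries standing in for the fitted objects, is to pickle a single bundle. The file name model_bundle.pkl and the dictionary keys are assumptions.

```python
import os
import pickle
import tempfile

# Plain dictionaries stand in for the fitted objects from the notebook
# (vectorizer, tfidfconverter and classifier); real objects pickle the same way.
bundle = {
    "vectorizer": {"max_features": 300},
    "tfidf": {"kind": "TfidfTransformer"},
    "classifier": {"n_estimators": 1000},
}

# Hypothetical file name, written to a temporary directory for this sketch
path = os.path.join(tempfile.mkdtemp(), "model_bundle.pkl")

# Write
with open(path, "wb") as picklefile:
    pickle.dump(bundle, picklefile)

# Read
with open(path, "rb") as picklefile:
    restored = pickle.load(picklefile)

print(restored == bundle)  # True
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.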


### Gaussian model (worse)

In [45]:
from sklearn.naive_bayes import GaussianNB
# Build a simple Gaussian Naive Bayes model
NB = GaussianNB()
NB.fit(X_train, y_train)

GaussianNB()

In [46]:
# Predict
NB_pred= NB.predict(X_test)


**Evaluation**

In [47]:
print(confusion_matrix(y_test, NB_pred))
print(classification_report(y_test, NB_pred))
# Accuracy of the Naive Bayes predictions
print('Accuracy Score:', round(accuracy_score(y_test, NB_pred), 4))

[[1392 1951]
 [  50  512]]
              precision    recall  f1-score   support

           0       0.97      0.42      0.58      3343
           1       0.21      0.91      0.34       562

    accuracy                           0.49      3905
   macro avg       0.59      0.66      0.46      3905
weighted avg       0.86      0.49      0.55      3905

Accuracy Score: 0.4876


In [29]:
# Write
with open('gaussianSimple', 'wb') as picklefile:
    pickle.dump(NB, picklefile)

### Logistic Regression

In [48]:
from sklearn.linear_model import LogisticRegression


logreg = LogisticRegression()

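# Note: this re-splits the data with a different test_size and random_state,
# so the scores below are not computed on the same test set as the models above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.33, random_state=42)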

In [49]:
logreg.fit(X_train, y_train)

LogisticRegression()

In [50]:
logreg.score(X_test, y_test)

0.8873108265424913

In [51]:
from sklearn.metrics import classification_report

print(classification_report(y_test, logreg.predict(X_test)))

              precision    recall  f1-score   support

           0       0.91      0.96      0.94      3718
           1       0.62      0.41      0.49       577

    accuracy                           0.89      4295
   macro avg       0.77      0.69      0.72      4295
weighted avg       0.87      0.89      0.88      4295



In [52]:
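import numpy as np
# Absolute value of each coefficient, as a rough measure of how much a word weighs
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
coef = pd.DataFrame({"nombre_columna":vectorizer.get_feature_names_out(),
                     "coeficientes" : np.abs(logreg.coef_[0])} )

coef.sort_values(by = "coeficientes", ascending=False)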



Unnamed: 0,nombre_columna,coeficientes
76,female,5.331903
281,woman,4.548504
91,girl,4.253068
162,men,4.170220
156,man,3.666417
...,...,...
4,actually,0.036290
79,find,0.029334
289,would,0.024101
216,seen,0.003206


### Random Forest model (Simple balanced - Obs)

In [31]:
classifier = RandomForestClassifier(n_estimators=1000, random_state=0)

# Use the balanced training sets
# Read
with open('X_train_balanced', 'rb') as training_model:
    X_train_balanced = pickle.load(training_model)

with open('y_train_balanced', 'rb') as training_model:
    y_train_balanced = pickle.load(training_model)

classifier.fit(X_train_balanced, y_train_balanced) 

RandomForestClassifier(n_estimators=1000, random_state=0)

In [32]:
pred_RF = classifier.predict(X_test)
print(confusion_matrix(y_test, pred_RF))
print(classification_report(y_test, pred_RF))
print('Accuracy Score:', accuracy_score(y_test, pred_RF)) 

[[3154  208]
 [ 336  206]]
              precision    recall  f1-score   support

       False       0.90      0.94      0.92      3362
        True       0.50      0.38      0.43       542

    accuracy                           0.86      3904
   macro avg       0.70      0.66      0.68      3904
weighted avg       0.85      0.86      0.85      3904

Accuracy Score: 0.860655737704918
