# 2. Text preprocessing stage

The student will prepare a **review preprocessing stage** that converts the reviews
into a format better suited to modeling. It is the stage that precedes the training of the
sentiment model.

**All preprocessing must be wrapped in a single Python function** that contains
the entire text-processing pipeline. This function may (and ideally should) call other functions
that perform more specific tasks (removing stopwords, removing punctuation, etc.).

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Loading the datasets

In [2]:
import pandas as pd
from gensim.models import Word2Vec

# Train and test datasets
train_data = pd.read_csv('/content/drive/MyDrive/train_data.csv', sep=';')
test_data = pd.read_csv('/content/drive/MyDrive/test_data.csv', sep=';')

train_data.shape, test_data.shape

((58448, 2), (14612, 2))

In [3]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 58448 entries, 0 to 58447
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   rating  58448 non-null  int64 
 1   text    58441 non-null  object
dtypes: int64(1), object(1)
memory usage: 913.4+ KB


In [4]:
# The text column contains NaN values, so we drop those rows
train_data = train_data.dropna()
train_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 58441 entries, 0 to 58447
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   rating  58441 non-null  int64 
 1   text    58441 non-null  object
dtypes: int64(1), object(1)
memory usage: 1.3+ MB


In [5]:
train_data['rating'].value_counts()

rating
1    45824
0    12617
Name: count, dtype: int64

In [6]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14612 entries, 0 to 14611
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   rating  14612 non-null  int64 
 1   text    14608 non-null  object
dtypes: int64(1), object(1)
memory usage: 228.4+ KB


In [7]:
# The text column contains NaN values, so we drop those rows
test_data = test_data.dropna()
test_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 14608 entries, 0 to 14611
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   rating  14608 non-null  int64 
 1   text    14608 non-null  object
dtypes: int64(1), object(1)
memory usage: 342.4+ KB


In [8]:
test_data['rating'].value_counts()

rating
1    11540
0     3068
Name: count, dtype: int64

## Preprocessing functions

In [9]:
!pip install num2words
!pip install nltk
!pip install autocorrect

Collecting num2words
  Downloading num2words-0.5.13-py3-none-any.whl (143 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 143.3/143.3 kB 5.8 MB/s eta 0:00:00
Collecting docopt>=0.6.2 (from num2words)
  Downloading docopt-0.6.2.tar.gz (25 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: docopt
  Building wheel for docopt (setup.py) ... done
  Created wheel for docopt: filename=docopt-0.6.2-py2.py3-none-any.whl size=13706 sha256=c15693dc6b5fc8a0c3a746ac133dd15bae5334551da1a5fba3af75205ba47add
  Stored in directory: /root/.cache/pip/wheels/fc/ab/d4/5da2067ac95b36618c629a5f93f809425700506f72c9732fac
Successfully built docopt
Installing collected packages: docopt, num2words
Successfully installed docopt-0.6.2 num2words-0.5.13
Collecting autocorrect
  Downloading autocorre

In [10]:
import numpy as np
import pandas as pd
import matplotlib
import scipy
import re

from num2words import num2words
import nltk; nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer
from autocorrect import Speller

from bs4 import BeautifulSoup

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


### Helper functions

In [11]:
def digits_to_words(match):
  """
  Convert a matched string of digits to English words. The function distinguishes
  between cardinal and ordinal numbers.
  E.g. "2" becomes "two", while "2nd" becomes "second"

  Input: re.Match (the match object that re.sub passes to its callback)
  Output: str
  """
  suffixes = ('st', 'nd', 'rd', 'th')
  # Lower-case here so the function does not rely on earlier pipeline steps:
  string = match[0].lower()
  if string.endswith(suffixes):
    kind = 'ordinal'  # renamed from `type` to avoid shadowing the builtin
    string = string[:-2]
  else:
    kind = 'cardinal'

  # Pass an int so num2words does not have to parse the string itself:
  return num2words(int(string), to=kind)


In [12]:
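# Build the speller once: re-creating it on every call adds avoidable overhead per review.
_SPELLER = Speller()

def spelling_correction(text):
    """
    Replace misspelled words with the correct spelling.

    Input: str
    Output: str
    """
    # Correct each whitespace-separated word independently.
    return " ".join(_SPELLER(word) for word in text.split())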
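Spelling correction is applied word by word, and the same words recur across many reviews, so memoizing the per-word result may reduce the running time. The sketch below illustrates the idea with `functools.lru_cache`. The `slow_corrector` function and its tiny lookup table are hypothetical stand-ins for the real speller, used only so the example runs without extra dependencies.

```python
from functools import lru_cache


def slow_corrector(word):
    """Hypothetical stand-in for an expensive per-word spell checker."""
    slow_corrector.calls += 1  # count how often the expensive path runs
    return {"grat": "great", "moive": "movie"}.get(word, word)


slow_corrector.calls = 0


@lru_cache(maxsize=None)
def correct_word(word):
    """Correct a single word, caching the result for repeated words."""
    return slow_corrector(word)


def spelling_correction_cached(text):
    """Correct every whitespace-separated word, reusing cached corrections."""
    return " ".join(correct_word(word) for word in text.split())


result = spelling_correction_cached("grat moive grat moive grat")
print(result)                # corrected sentence
print(slow_corrector.calls)  # number of uncached corrections
```

Five words are processed here, but the stand-in corrector only runs once per distinct word. Whether this helps in practice depends on how repetitive the vocabulary of the reviews is.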

In [13]:
# Build the stopword set once instead of reloading it for every review.
_STOPWORDS = set(stopwords.words('english'))

def remove_stop_words(text):
    """
    Remove stopwords.

    Input: str
    Output: str
    """
    return " ".join(word for word in text.split() if word not in _STOPWORDS)

In [14]:
def stemming(text):
    """
    Perform stemming of each word individually.

    Input: str
    Output: str
    """
    stemmer = PorterStemmer()
    return " ".join([stemmer.stem(word) for word in text.split()])


In [15]:
def lemmatizing(text):
    """
    Perform lemmatization for each word individually.

    Input: str
    Output: str
    """
    lemmatizer = WordNetLemmatizer()
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])


### Main function

In [16]:
def preprocessing(input_text):
  """
  This function represents a complete pipeline for text preprocessing.

  Input: str
  Output: str
  """

  output = str(input_text)

  # Remove HTML tags:
  output = BeautifulSoup(output, "html5lib").get_text()

  # Lower casing:
  output = output.lower()

  # Convert digits to words:
  # The regex matches consecutive digits optionally followed by a single ordinal suffix:
  output = re.sub(r'\d+(?:st|nd|rd|th)?', digits_to_words, output, flags=re.IGNORECASE)

  # Remove punctuation and other special characters:
  output = re.sub('[^ A-Za-z0-9]+', ' ', output)

  # Spelling corrections:
  output = spelling_correction(output)

  # Remove stop words:
  output = remove_stop_words(output)

  # Stemming:
  output = stemming(output)

  # Lemmatizing:
  output = lemmatizing(output)

  return output

In [17]:
# Test with dummy document
preprocessing("""This is just a <em>test</em>.<br/><br />
But if it wasn't a test, it would make for a <b>Great</b> movie review!""")

'test test would make great movi review'
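The digit-conversion step relies on `re.sub` accepting a function as the replacement: the function receives each match object and returns the text to substitute. The following is a minimal, dependency-free sketch of that mechanism. The `CARDINALS` and `ORDINALS` tables are hypothetical stand-ins for `num2words`, and the alternation-based pattern is an assumption that allows at most one ordinal suffix after the digits.

```python
import re

# Hypothetical lookup tables standing in for num2words, for illustration only.
CARDINALS = {1: "one", 2: "two", 3: "three"}
ORDINALS = {1: "first", 2: "second", 3: "third"}

# Digits optionally followed by exactly one ordinal suffix.
NUMBER_PATTERN = re.compile(r"\d+(?:st|nd|rd|th)?")


def replace_number(match):
    """Return the word form of the matched number, or the token itself if unknown."""
    token = match[0]
    is_ordinal = token.endswith(("st", "nd", "rd", "th"))
    number = int(token[:-2]) if is_ordinal else int(token)
    table = ORDINALS if is_ordinal else CARDINALS
    return table.get(number, token)


converted = NUMBER_PATTERN.sub(replace_number, "the 2nd of 3 cds")
print(converted)
```

Numbers missing from the stand-in tables are left unchanged, which keeps the sketch from raising on unexpected input.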

## Reducing the dataset size because there are too many rows to process

In [18]:
train_data_slim = train_data[:800]
train_data_slim['rating'].value_counts()

rating
1    616
0    184
Name: count, dtype: int64

In [19]:
test_data_slim = test_data[:200]
test_data_slim['rating'].value_counts()

rating
1    155
0     45
Name: count, dtype: int64

## Preprocessing the datasets

In [20]:
# Work on an explicit copy so the assignment does not trigger pandas' SettingWithCopyWarning
train_data_slim = train_data_slim.copy()
train_data_slim["text"] = [preprocessing(text) for text in train_data_slim["text"]]

  output = BeautifulSoup(output, "html5lib").get_text()


In [21]:
# Work on an explicit copy so the assignment does not trigger pandas' SettingWithCopyWarning
test_data_slim = test_data_slim.copy()
test_data_slim["text"] = [preprocessing(text) for text in test_data_slim["text"]]

  output = BeautifulSoup(output, "html5lib").get_text()


In [22]:
train_data_slim.head()

|   | rating | text |
|---|--------|------|
| 0 | 0 | want bunch vagu background rector stuff buy al... |
| 1 | 1 | great set cd like old style countri music |
| 2 | 1 | sound wonder came time promis packag good |
| 3 | 1 | barton sweeney regular colleg town saw least m... |
| 4 | 0 | suppos two cd one |


In [23]:
test_data_slim.head()

|   | rating | text |
|---|--------|------|
| 0 | 1 | need cd asap want cd long long time wish luke ... |
| 1 | 1 | surpris receiv cd russia imagin thrill toni be... |
| 2 | 1 | love littl feat believ last one craig fuller g... |
| 3 | 1 | absolut fabul |
| 4 | 1 | true tradit harmonica work nyc bred chicago ba... |


In [24]:
train_data_slim.info()

<class 'pandas.core.frame.DataFrame'>
Index: 800 entries, 0 to 799
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   rating  800 non-null    int64 
 1   text    800 non-null    object
dtypes: int64(1), object(1)
memory usage: 18.8+ KB


In [25]:
test_data_slim.info()

<class 'pandas.core.frame.DataFrame'>
Index: 200 entries, 0 to 199
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   rating  200 non-null    int64 
 1   text    200 non-null    object
dtypes: int64(1), object(1)
memory usage: 4.7+ KB


## Saving the new datasets

In [26]:
# Save the datasets
train_data_slim.to_csv('train_data_preprocesado.csv', sep=';', index=False)
test_data_slim.to_csv('test_data_preprocesado.csv', sep=';', index=False)

In [27]:
!cp train_data_preprocesado.csv /content/drive/MyDrive/train_data_preprocesado.csv
!cp test_data_preprocesado.csv /content/drive/MyDrive/test_data_preprocesado.csv

In [28]:
# Check the files by loading them again
train_data = pd.read_csv("/content/drive/MyDrive/train_data_preprocesado.csv", sep=';')
test_data = pd.read_csv("/content/drive/MyDrive/test_data_preprocesado.csv", sep=';')
train_data.shape, test_data.shape

((800, 2), (200, 2))

In [29]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 800 entries, 0 to 799
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   rating  800 non-null    int64 
 1   text    798 non-null    object
dtypes: int64(1), object(1)
memory usage: 12.6+ KB


In [30]:
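# Two text values come back as NaN after saving and reloading the file.
# The most likely cause is that preprocessing reduced those reviews to empty strings,
# which read_csv parses as NaN by default.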
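A plausible explanation for the NaN values is that a review made up entirely of stopwords or punctuation becomes an empty string after preprocessing, and `pd.read_csv` treats empty fields as missing by default. The sketch below reproduces this with a small made-up DataFrame (the rows are illustrative, not taken from the dataset) and shows one way to keep the empty strings, using the `keep_default_na` parameter of `read_csv`.

```python
import io

import pandas as pd

# Illustrative data: the second review is empty after preprocessing.
df = pd.DataFrame({"rating": [1, 0, 1], "text": ["great cd", "", "love it"]})

# Round-trip through CSV in memory, using the same separator as the notebook.
buffer = io.StringIO()
df.to_csv(buffer, sep=";", index=False)

# Default behaviour: the empty field is read back as NaN.
buffer.seek(0)
reloaded = pd.read_csv(buffer, sep=";")
n_missing = int(reloaded["text"].isna().sum())

# keep_default_na=False keeps the empty field as an empty string.
buffer.seek(0)
reloaded_kept = pd.read_csv(buffer, sep=";", keep_default_na=False)

print(n_missing)
print(reloaded_kept["text"].tolist())
```

Since an empty review carries no signal for the sentiment model, dropping those rows after reloading (for example with `dropna()`) may be a reasonable alternative to keeping them.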

In [31]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   rating  200 non-null    int64 
 1   text    200 non-null    object
dtypes: int64(1), object(1)
memory usage: 3.2+ KB


## Conclusion

When we tried to preprocess the full data, Colab estimated that the process could take more than 40 hours, so we decided to build two smaller train and test datasets (800 and 200 rows, respectively).
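A running-time estimate like this can also be obtained by timing the pipeline on a small sample and extrapolating linearly. The sketch below shows that arithmetic. The function names are hypothetical, `str.lower` is only a stand-in for the real `preprocessing` function, and linear scaling is an assumption that ignores variation in review length.

```python
import time


def extrapolate_hours(elapsed_seconds, sample_size, total_rows):
    """Scale a duration measured on sample_size rows up to total_rows, in hours."""
    return elapsed_seconds / sample_size * total_rows / 3600


def estimate_hours(func, sample, total_rows):
    """Time func over every item in sample and extrapolate to total_rows."""
    start = time.perf_counter()
    for item in sample:
        func(item)
    elapsed = time.perf_counter() - start
    return extrapolate_hours(elapsed, len(sample), total_rows)


# Stand-in usage: str.lower replaces the real preprocessing function here.
sample_reviews = ["Great CD!", "Not what I expected.", "Would buy again."]
estimate = estimate_hours(str.lower, sample_reviews, total_rows=58441)
print(f"Estimated time for the full train set: {estimate:.6f} hours")
```

For instance, a sample of 100 rows that takes 36 seconds would extrapolate to one hour for 10,000 rows.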

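We verified that both subsets still contain positive and negative reviews and that the class ratio is not badly skewed: about 77% of the reviews are positive in each subset, close to the 78 to 79% observed in the full datasets.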
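The class-balance check can be reproduced with simple arithmetic on the `value_counts()` outputs shown earlier in the notebook. The sketch below hard-codes those counts. The helper name `positive_share` is hypothetical.

```python
# Class counts copied from the value_counts() outputs in this notebook,
# stored as (positive, negative) pairs.
counts = {
    "train_full": (45824, 12617),
    "train_slim": (616, 184),
    "test_full": (11540, 3068),
    "test_slim": (155, 45),
}


def positive_share(positive, negative):
    """Return the fraction of reviews labelled 1 (positive)."""
    return positive / (positive + negative)


shares = {name: positive_share(*pair) for name, pair in counts.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%} positive")
```

Taking the first rows of each dataset happened to preserve the ratio here. If it had not, a stratified sample would be one way to enforce it.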