# Classifying Short Texts into Emotions

## Context:
Sentiment analysis creates value across many industries. NLP techniques are commonly used to quickly classify texts as positive, negative, or neutral. However, this classification is quite limited and does not allow a deeper understanding. For this reason, short texts need to be classified across a spectrum of emotions.

This case takes a classical view of emotions, in which they are grouped into 7 main emotions (shown with the labels used in the dataset):
1) `shame`
2) `sadness`
3) `joy`
4) `guilt`
5) `fear`
6) `disgust`
7) `anger`

The **ISEAR** dataset is used. It consists of 7,666 survey responses collected by multiple psychologists in the 1990s across several countries. In the survey, each respondent was presented with an emotion and asked to describe a situation that represented it. This dataset was chosen because it has been widely used to train models and compare their performance in emotion detection.

## Objective:
The objective of this notebook is to explore the **ISEAR** dataset, clean it, and store the curated dataset so that it is ready to be processed by analytical models.

## Data Source:
The original survey dataset was taken from Kaggle: https://www.kaggle.com/datasets/radedaevi/emotions-description

## Libraries

In [58]:
import re
import pandas as pd
import plotly.express as px

## Loading the Dataset

In [59]:
pd.set_option("display.max_colwidth", None)
pd.options.display.max_rows = 100

df = pd.read_excel("../data/raw/Emotions description.xlsx")
df = df.rename(columns={"Field1": "emotion", "SIT": "text"})
n_raw_obs = df.shape[0]
print("Number of observations = ", n_raw_obs)
df.head()

Number of observations =  7666


| index | emotion | text |
|---|---|---|
| 0 | anger | When a boy tried to fool me so he would be OK trying to show me á\nthat he is a gook boy. |
| 1 | anger | I felt anger when I saw that I was being misleaded by my á\nboyfriend, he went out with other girls. I felt anger for his á\nfalsity. |
| 2 | anger | Once a friend had pushed me and I had fallen on to a window which á\nthen broke. I was taken to tthe principal's office and he á\naccused me of having broken the window. |
| 3 | anger | When I was misleaded by a person who assured that something would á\nnot occur, that I had no reason to prepccupy myself, and suddenly á\nI saw myself implicated by the fact, because of the incompetence, á\nand irresponsibility of that person. |
| 4 | anger | I don't use to lie to my parets about what I do, and the two á\ntimes that I felt anger were when they doubted me I said that I á\nwas going to the club, and they didn't believe me because the day á\nbefore they had met me at FLIPERAMA. I had the wish to kil |


The dataset consists of 7,666 records with the following 2 columns:
* `text`: The text describing a specific situation.
* `emotion`: The emotion that the respondent assigned to the situation described in the text.
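
Before exploring further, it can help to confirm that the two columns behave as described. The following is a minimal sketch and not part of the original notebook: `EXPECTED_EMOTIONS` and `check_schema` are hypothetical names, and the toy frame stands in for `df`.

```python
import pandas as pd

# Assumption: the seven labels listed in the Context section
EXPECTED_EMOTIONS = {"shame", "sadness", "joy", "guilt", "fear", "disgust", "anger"}


def check_schema(frame: pd.DataFrame) -> dict:
    """Return basic sanity checks for the `text` and `emotion` columns."""
    labels = set(frame["emotion"].dropna().unique())
    return {
        "missing_text": int(frame["text"].isna().sum()),
        "missing_emotion": int(frame["emotion"].isna().sum()),
        "unexpected_labels": sorted(labels - EXPECTED_EMOTIONS),
    }


# Toy data standing in for the real DataFrame
toy = pd.DataFrame(
    {
        "emotion": ["anger", "joy", "surprise", None],
        "text": ["A murder.", None, "Got ill.", "Peeping."],
    }
)
report = check_schema(toy)
print(report)
```

Missing texts matter here because `len` would fail on them in the length analysis that follows.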

## Exploratory Data Analysis (EDA)

### Analyzing Text by Length

In [60]:
df["text_length"] = df["text"].apply(len)
fig = px.box(
    df,
    x="text_length",
    title="Distribution of Text Lengths",
    labels={"text_length": "Length of Text"},
)
fig.show()
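
The box plot summarizes the lengths through quartiles and whiskers. As a complement, similar quantities can be computed numerically. This is a sketch assuming the common 1.5 × IQR fence rule and pandas' default linear interpolation, so the values may differ slightly from what the plot draws. `length_summary` is a hypothetical helper, and the toy series stands in for `df["text_length"]`.

```python
import pandas as pd


def length_summary(lengths: pd.Series) -> dict:
    """Quartiles and 1.5 * IQR fences for a series of text lengths."""
    q1, median, q3 = lengths.quantile([0.25, 0.5, 0.75]).tolist()
    iqr = q3 - q1
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_fence": q1 - 1.5 * iqr,
        "upper_fence": q3 + 1.5 * iqr,
    }


# Toy lengths standing in for the real column
toy_lengths = pd.Series([5, 8, 9, 11, 12, 88, 131, 168, 240, 255])
summary = length_summary(toy_lengths)
print(summary)
```

Because text lengths are bounded below by zero, the lower fence is rarely informative; very short texts are better inspected directly.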

## Cleaning Dataset

Some observations have too few characters to represent a meaningful sentence, so the data is examined in detail, paying special attention to the short texts.

In [61]:
df.sort_values(by="text_length", ascending=True).head(20)

| index | emotion | text | text_length |
|---|---|---|---|
| 6923 | shame | None. | 5 |
| 4713 | joy | Blank. | 6 |
| 1413 | disgust | Nothing. | 8 |
| 2907 | fear | Got ill. | 8 |
| 3557 | guilt | Peeping. | 8 |
| 3215 | fear | [ Never.] | 9 |
| 419 | anger | A murder. | 9 |
| 7387 | shame | NO RESPONSE | 11 |
| 1204 | disgust | NO RESPONSE | 11 |
| 3318 | guilt | NO RESPONSE. | 12 |
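
Scanning the shortest rows by eye works for a handful of cases. A complementary way to surface placeholder answers is to count repeated texts, on the assumption that a genuine situation is rarely written identically by several respondents. The sketch below is not part of the original notebook: `repeated_texts` is a hypothetical helper and the toy frame stands in for `df`.

```python
import pandas as pd


def repeated_texts(frame: pd.DataFrame, min_count: int = 2) -> pd.Series:
    """Count texts occurring at least `min_count` times, ignoring case and outer spaces."""
    normalized = frame["text"].str.strip().str.lower()
    counts = normalized.value_counts()
    return counts[counts >= min_count]


# Toy data standing in for the real DataFrame
toy = pd.DataFrame(
    {
        "text": [
            "NO RESPONSE",
            "no response",
            " NO RESPONSE ",
            "Got ill.",
            "None.",
            "None.",
        ]
    }
)
candidates = repeated_texts(toy)
print(candidates)
```

Repeated texts are only candidates: a short but valid answer can also occur more than once, so the result still needs manual review.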


### Removing invalid text

Some observations contain text that is not a valid response, for example "NO RESPONSE" or "None". All invalid texts are therefore identified and removed from the dataset.

In [77]:
# Define a list of invalid responses
invalid_texts = [
    "None.",
    "Blank.",
    "Nothing.",
    "[ Never.]",
    "NO RESPONSE",
    "NO RESPONSE.",
    "Doesn't apply.",
    "[ No reponse.]",
    "Not applicable.",
    "[ No response.]",
    "[ Do not know.]",
    "Does not apply.",
    " [ No response.]",
    "[ No description.]",
    "[ Not applicable.]",
    "[ Can not remember.]",
    "DO NOT REMEMBER.",
    "[ Same as in anger.]",
    "[ Never experienced.]",
    "The same as in SHAME.",
    "[ The same as in anger.]",
    "[ The same as in shame.]",
    "[ I can not recall one.]",
    "[ The same as in guilt.]",
    "[ Never felt the emotion.]",
    "[ Can not think of anything.]",
    "[ Do not remember any incident.]",
    "[ I have not felt this emotion.]",
    "[ I have never felt this emotion.]",
    "[ Can not think of any situation.]",
    "[ I do not recall one here either.]",
    "[ I have not felt this emotion yet.]",
    "[ Normally I do not feel disgusted.]",
    "[ I have not felt this emotion in my life.]",
]

# Remove invalid responses; .copy() makes df_cleaned an independent DataFrame
df_cleaned = df[~df["text"].isin(invalid_texts)].copy()
n_cleaned_obs = df_cleaned.shape[0]
print("Number of observations after removing invalid texts = ", n_cleaned_obs)
df_cleaned.sort_values(by="text_length", ascending=True).head()

Number of observations after removing invalid texts =  7534


| index | emotion | text | text_length |
|---|---|---|---|
| 2907 | fear | Got ill. | 8 |
| 3557 | guilt | Peeping. | 8 |
| 419 | anger | A murder. | 9 |
| 2939 | fear | Getting ill. | 12 |
| 3158 | fear | Getting ill. | 12 |
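
The exact-match list above contains several spellings of the same answer, for example with and without brackets or a final period. One possible alternative is to normalize each text first and compare it against a shorter set. This sketch assumes that case, square brackets, surrounding whitespace, and trailing periods never distinguish a valid answer from an invalid one. `normalize_response` and `INVALID_NORMALIZED` are hypothetical names, and the set shown is only an illustrative subset.

```python
import re

import pandas as pd

# Illustrative subset of normalized invalid answers (assumption, not the full list)
INVALID_NORMALIZED = {"none", "blank", "nothing", "never", "no response", "not applicable"}


def normalize_response(text: str) -> str:
    """Lowercase, drop square brackets, and strip outer whitespace and periods."""
    text = re.sub(r"[\[\]]", "", text.lower())
    return text.strip(" .")


# Toy data standing in for the real DataFrame
toy = pd.DataFrame(
    {"text": ["[ No response.]", "NO RESPONSE", " [ No response.]", "Got ill.", "None."]}
)
is_invalid = toy["text"].map(normalize_response).isin(INVALID_NORMALIZED)
kept = toy[~is_invalid]
print(kept)
```

The trade-off is that a normalized set is shorter to maintain but can match more rows than intended, so comparing its result with the explicit list is a sensible check.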


### Character Cleaning
* Unhelpful bracketed text ("[...]") present in the texts is removed, brackets included.
* Trailing "." characters are removed.
* The 'á\n' sequence is removed, since it appears at every line break.
* The text is converted to lowercase.

In [80]:
# Function to clean the text
def clean_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove text enclosed in brackets
    text = re.sub(r"\[.*?\]", "", text)
    # Remove trailing periods
    text = text.rstrip(".")
    return text.strip()


# Apply the clean_text function to the 'text' column using .loc
df_cleaned.loc[:, "text"] = df_cleaned["text"].apply(clean_text)

# Replace the ' á\n' line-break artifact in df_cleaned with a single space
df_cleaned.loc[:, "text"] = df_cleaned["text"].str.replace(" á\n", " ", regex=False)

# Display the first few rows of the cleaned DataFrame
df_cleaned.head()

| index | emotion | text | text_length |
|---|---|---|---|
| 0 | anger | when a boy tried to fool me so he would be ok trying to show me that he is a gook boy | 88 |
| 1 | anger | i felt anger when i saw that i was being misleaded by my boyfriend, he went out with other girls. i felt anger for his falsity | 131 |
| 2 | anger | once a friend had pushed me and i had fallen on to a window which then broke. i was taken to tthe principal's office and he accused me of having broken the window | 168 |
| 3 | anger | when i was misleaded by a person who assured that something would not occur, that i had no reason to prepccupy myself, and suddenly i saw myself implicated by the fact, because of the incompetence, and irresponsibility of that person | 240 |
| 4 | anger | i don't use to lie to my parets about what i do, and the two times that i felt anger were when they doubted me i said that i was going to the club, and they didn't believe me because the day before they had met me at fliperama. i had the wish to kil | 255 |
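
Two details of the cell above depend on the order of operations. The pattern `\[.*?\]` does not cross line breaks, because `.` excludes `\n` by default, and trailing periods are stripped before the `á\n` artifact is replaced. A variant that replaces the artifact first could look like the sketch below. It is not the notebook's method: `clean_text_v2` is a hypothetical name, and it assumes the artifact always appears as `á` immediately followed by a newline.

```python
import re


def clean_text_v2(text: str) -> str:
    """Variant of clean_text that removes the line-break artifact first."""
    # Replace the artifact and any whitespace around it with a single space
    text = re.sub(r"\s*á\n\s*", " ", text)
    text = text.lower()
    # Remove bracketed spans, which now sit on a single line
    text = re.sub(r"\[.*?\]", "", text)
    # Drop outer whitespace and trailing periods
    return text.strip().rstrip(".").strip()


sample = "I felt anger for his á\nfalsity."
cleaned = clean_text_v2(sample)
print(cleaned)

# A text made only of a bracketed span becomes empty after cleaning
only_brackets = clean_text_v2("[ No response.]")
print(repr(only_brackets))
```

Since bracketed spans are removed entirely, any bracket-only answer missing from the invalid list would end up as an empty string, so checking for empty texts after this step is worthwhile.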


### Analyzing text length distribution after the cleaning process

In [81]:
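# Recompute lengths on the cleaned text so the plot reflects the cleaning step
df_cleaned = df_cleaned.assign(text_length=df_cleaned["text"].str.len())
fig = px.box(
    df_cleaned,
    x="text_length",
    title="Distribution of Text Lengths after Cleaning",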
    labels={"text_length": "Length of Text"},
)
fig.show()

Although short texts remain, they are now known to be valid responses.

### Distribution of texts by emotion

In [82]:
emotion_counts = df_cleaned.groupby("emotion").size().reset_index(name="count")
fig = px.bar(
    emotion_counts,
    y="emotion",
    x="count",
    title="Counts of Emotions in cleaned ISEAR Dataset",
    text="count",
)
fig.update_traces(texttemplate="%{text}", textposition="outside")
fig.update_layout(xaxis_title="number of observations")
fig.show()

### Cleaning Detail

The cleaning process removes 132 records, a change of -1.72%, going from **7,666** to **7,534** observations.

In [83]:
n_dif = n_raw_obs - n_cleaned_obs
n_var = ((n_cleaned_obs / n_raw_obs) - 1) * 100

print("Number of raw observations = ", n_raw_obs)
print("Number of cleaned observations = ", n_cleaned_obs)
print("Difference of observations = ", n_dif)
print("Percentage difference of observations = {:.2f}%".format(n_var))

Number of raw observations =  7666
Number of cleaned observations =  7534
Difference of observations =  132
Percentage difference of observations = -1.72%


This reduction is spread fairly evenly across emotions, as shown below. For the five emotions displayed by `head()`, the loss ranges from 0.55% (`joy`) to 2.56% (`guilt`):

In [84]:
emotion_counts_raw = df.groupby("emotion").size().reset_index(name="count")
emotion_counts_cleaned = df_cleaned.groupby("emotion").size().reset_index(name="count")
emotion_counts = emotion_counts_raw.merge(emotion_counts_cleaned, on="emotion")
emotion_counts.rename(
    columns={"count_x": "raw_count", "count_y": "cleaned_count"}, inplace=True
)
emotion_counts["dif"] = emotion_counts["raw_count"] - emotion_counts["cleaned_count"]
emotion_counts["%var"] = (
    ((emotion_counts["cleaned_count"] / emotion_counts["raw_count"]) - 1) * 100
).apply(lambda x: "{:.2f}%".format(x))
emotion_counts.head()

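| index | emotion | raw_count | cleaned_count | dif | %var |
|---|---|---|---|---|---|
| 0 | anger | 1096 | 1086 | 10 | -0.91% |
| 1 | disgust | 1096 | 1070 | 26 | -2.37% |
| 2 | fear | 1095 | 1084 | 11 | -1.00% |
| 3 | guilt | 1093 | 1065 | 28 | -2.56% |
| 4 | joy | 1094 | 1088 | 6 | -0.55% |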
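
The same per-emotion comparison can also be obtained without a merge, by flagging the removed rows and aggregating the flag within each emotion. This is a sketch with toy data: `removal_rate_by_emotion` is a hypothetical helper, and its `invalid` argument stands in for `invalid_texts`.

```python
import pandas as pd


def removal_rate_by_emotion(frame: pd.DataFrame, invalid: list) -> pd.DataFrame:
    """Count and share of removed rows for each emotion."""
    # True for rows that the cleaning step would drop
    removed = frame["text"].isin(invalid)
    grouped = removed.groupby(frame["emotion"])
    return pd.DataFrame(
        {
            "raw_count": grouped.size(),
            "removed": grouped.sum(),
            "removed_share": grouped.mean(),
        }
    )


# Toy data standing in for the real DataFrame
toy = pd.DataFrame(
    {
        "emotion": ["anger", "anger", "joy", "joy", "joy", "joy"],
        "text": ["NO RESPONSE", "A murder.", "None.", "a", "b", "c"],
    }
)
rates = removal_rate_by_emotion(toy, ["NO RESPONSE", "None."])
print(rates)
```

Displaying the full result rather than only its first rows makes it easier to judge whether every emotion is affected to a similar degree.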


### Saving cleaned dataset

In [88]:
# Selecting only needed columns

df_cleaned = df_cleaned[["text", "emotion"]]
df_cleaned.to_parquet("../data/cleaned/isear_cleaned.parquet")