# **Universidad Icesi - Maestría en Inteligencia Artificial Aplicada**

***

### **Team:**

1. Alvaro Acosta
2. Jhonatan Estrada
3. Cristian Gonzalez
4. Danny Martinez

***

# Sentiment Analysis of McDonald's Restaurant Reviews in the U.S.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/Ohtar10/icesi-nlp/blob/main/Sesion1/7-sentiment-analysis.ipynb)

Now let's put some of these concepts into practice on a more realistic case. In this exercise we run a sentiment analysis on anonymous reviews of McDonald's restaurants in the U.S.

### References
* [Natural Language Processing in Action](https://www.manning.com/books/natural-language-processing-in-action)

In [None]:
import sys
import warnings

warnings.filterwarnings('ignore')

# Detect Google Colab by checking for its preloaded module; this avoids
# pkg_resources, which is deprecated and emits a warning on import
IN_COLAB = 'google.colab' in sys.modules


In [None]:
from pathlib import Path

# Define the dependencies for the requirements.txt file
requirements_text = """# Updated requirements
numpy==1.26.4
pandas==2.2.2
matplotlib==3.8.0
seaborn==0.12.2
scikit-learn==1.6.1
statsmodels==0.14.0
tqdm>=4.67.0
torch==2.2.0
torchvision==0.17.0
torchaudio==2.2.0
lightning==2.2.0.post0
tensorboard==2.19.0
bokeh==3.7.0
transformers[torch]==4.41.2
datasets==2.19.1
torchinfo==1.8.0
accelerate==0.30.1
evaluate==0.4.2
sentence-transformers==3.0.1
gradio==5.42.0
ollama==0.5.3
spacy==3.8.7
thinc>=8.3.4,<8.4.0
nltk==3.9.1
httpx[http2]==0.28.1
websockets>=14.0,<15.1
fsspec==2024.3.1
gcsfs==2024.3.1
"""

# Define the output file path
path = Path("requirements.txt")

# Write the dependencies in the output file
path.write_text(requirements_text.strip() + "\n", encoding="utf-8")

# Print the absolute path of the generated file
print(f"Saved to {path.resolve()}")

Saved to /content/requirements.txt


In [None]:
# Colab: uninstall OpenCV (prevents NumPy≥2), install requirements, force-reinstall spaCy/thinc, download model, pin NumPy 1.26.4, then check dependencies
!test '{IN_COLAB}' = 'True' && pip uninstall -y opencv-python opencv-python-headless opencv-contrib-python || true && pip install -U --no-cache-dir -r requirements.txt --force-reinstall "spacy==3.8.7" "thinc>=8.3.4,<8.4.0" && python -m spacy download en_core_web_sm && pip check || true

Collecting spacy==3.8.7
  Downloading spacy-3.8.7-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (27 kB)
Collecting thinc<8.4.0,>=8.3.4
  Downloading thinc-8.3.6-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (15 kB)
Collecting numpy==1.26.4 (from -r requirements.txt (line 2))
  Downloading numpy-1.26.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
     61.0/61.0 kB 122.1 MB/s eta 0:00:00
Collecting pandas==2.2.2 (from -r requirements.txt (line 3))
  Downloading pandas-2.2.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (19 kB)
Collecting matplotlib==3.8.0 (from -r requirements.txt (line 4))
  Downloading matplotlib-3.8.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.8 kB)
Collecting seaborn==0.12.2 (from -r requirements.txt (line 5))
  Downloading seaborn-0.12.2-py3-none-any.whl.

In [None]:
# Restart the kernel
import IPython; IPython.Application.instance().kernel.do_shutdown(True)

{'status': 'ok', 'restart': True}

Let's start by loading the dataset, which can be downloaded from:

https://www.kaggle.com/datasets/nelgiriyewithana/mcdonalds-store-reviews

This dataset contains more than 33,000 anonymized reviews of McDonald's restaurants in the U.S., extracted from Google Reviews. It captures customer experiences and opinions per location and includes the store name, category, address, coordinates, rating, review text, and timestamp.

The CSV file must be located in the same directory as the notebook.

In [None]:
import pandas as pd
import numpy as np

# Load the dataset
reviews = pd.read_csv("McDonald_s_Reviews.csv", encoding='latin-1')

# Extract star value (1–5) from strings like "1 star", "4 stars", "5 stars", etc
stars = pd.to_numeric(
    reviews["rating"].astype(str).str.extract(r"([1-5])")[0],
    errors="coerce"
)

# Map: 1–2 => neg, 3 => neu, 4–5 => pos
label_map = {1: "neg", 2: "neg", 3: "neu", 4: "pos", 5: "pos"}
reviews["label"] = stars.map(label_map)

reviews.head()

Unnamed: 0,reviewer_id,store_name,category,store_address,latitude,longitude,rating_count,review_time,review,rating,label
0,1,McDonald's,Fast food restaurant,"13749 US-183 Hwy, Austin, TX 78750, United States",30.460718,-97.792874,1240,3 months ago,Why does it look like someone spit on my food?...,1 star,neg
1,2,McDonald's,Fast food restaurant,"13749 US-183 Hwy, Austin, TX 78750, United States",30.460718,-97.792874,1240,5 days ago,It'd McDonalds. It is what it is as far as the...,4 stars,pos
2,3,McDonald's,Fast food restaurant,"13749 US-183 Hwy, Austin, TX 78750, United States",30.460718,-97.792874,1240,5 days ago,Made a mobile order got to the speaker and che...,1 star,neg
3,4,McDonald's,Fast food restaurant,"13749 US-183 Hwy, Austin, TX 78750, United States",30.460718,-97.792874,1240,a month ago,My mc. Crispy chicken sandwich was ï¿½ï¿½ï¿½ï¿...,5 stars,pos
4,5,McDonald's,Fast food restaurant,"13749 US-183 Hwy, Austin, TX 78750, United States",30.460718,-97.792874,1240,2 months ago,"I repeat my order 3 times in the drive thru, a...",1 star,neg


Next, let's do some cleanup by removing nulls and empty values:

In [None]:
reviews.dropna(inplace=True)
reviews.review = reviews.review.apply(lambda r: r.strip())
blanks = reviews[reviews.review == ''].index
reviews.drop(blanks, inplace=True)

In [None]:
reviews[reviews.review == ''].index

Index([], dtype='int64')

In [None]:
reviews.label.value_counts()

Unnamed: 0_level_0,count
label,Unnamed: 1_level_1
pos,15705
neg,12325
neu,4706


The dataset is imbalanced across classes: 15,705 positive, 12,325 negative, and 4,706 neutral reviews. This skew must be taken into account in the analysis to avoid bias in later steps.
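
To put the imbalance in numbers, the following minimal sketch turns the counts reported above into class shares and a majority-class baseline (the accuracy a classifier would get by always predicting the most frequent label). The variable names are illustrative and not part of the original notebook.

```python
# Class counts copied from the value_counts() output above
counts = {"pos": 15705, "neg": 12325, "neu": 4706}

total = sum(counts.values())

# Share of each class in the cleaned dataset
shares = {label: n / total for label, n in counts.items()}

# Always predicting the most frequent class yields this accuracy
majority_label = max(counts, key=counts.get)
majority_baseline = shares[majority_label]

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"Majority baseline ({majority_label}): {majority_baseline:.1%}")
```

Any model evaluated on this data should be compared against that baseline rather than against 50%.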

To keep things simple, we will use VADER to compute the positive, neutral, and negative scores. This model is already implemented in NLTK.

In [None]:
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [None]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()
reviews['scores'] = reviews.review.apply(lambda r: sid.polarity_scores(r))
reviews.head()

Unnamed: 0,reviewer_id,store_name,category,store_address,latitude,longitude,rating_count,review_time,review,rating,label,scores
0,1,McDonald's,Fast food restaurant,"13749 US-183 Hwy, Austin, TX 78750, United States",30.460718,-97.792874,1240,3 months ago,Why does it look like someone spit on my food?...,1 star,neg,"{'neg': 0.027, 'neu': 0.879, 'pos': 0.094, 'co..."
1,2,McDonald's,Fast food restaurant,"13749 US-183 Hwy, Austin, TX 78750, United States",30.460718,-97.792874,1240,5 days ago,It'd McDonalds. It is what it is as far as the...,4 stars,pos,"{'neg': 0.0, 'neu': 0.791, 'pos': 0.209, 'comp..."
2,3,McDonald's,Fast food restaurant,"13749 US-183 Hwy, Austin, TX 78750, United States",30.460718,-97.792874,1240,5 days ago,Made a mobile order got to the speaker and che...,1 star,neg,"{'neg': 0.051, 'neu': 0.949, 'pos': 0.0, 'comp..."
3,4,McDonald's,Fast food restaurant,"13749 US-183 Hwy, Austin, TX 78750, United States",30.460718,-97.792874,1240,a month ago,My mc. Crispy chicken sandwich was ï¿½ï¿½ï¿½ï¿...,5 stars,pos,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound..."
4,5,McDonald's,Fast food restaurant,"13749 US-183 Hwy, Austin, TX 78750, United States",30.460718,-97.792874,1240,2 months ago,"I repeat my order 3 times in the drive thru, a...",1 star,neg,"{'neg': 0.143, 'neu': 0.857, 'pos': 0.0, 'comp..."
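
For context on the `compound` value inside these dictionaries: in the reference VADER implementation, as far as I understand it, the adjusted word valences are summed and then squashed into the range [-1, 1] with x / sqrt(x² + α), where α = 15. The sketch below reproduces only that normalization step. The function name `normalize_compound` and the input sums are hypothetical and chosen for illustration; they are not part of NLTK's API.

```python
import math


def normalize_compound(valence_sum: float, alpha: float = 15.0) -> float:
    """Squash an unbounded valence sum into [-1, 1] (assumed VADER-style normalization)."""
    score = valence_sum / math.sqrt(valence_sum * valence_sum + alpha)
    # Clamp defensively so the result never leaves the [-1, 1] interval
    return max(-1.0, min(1.0, score))


# Hypothetical valence sums, not taken from the dataset
for valence_sum in (0.0, 1.5, -3.0, 10.0):
    print(f"{valence_sum:+5.1f} -> {normalize_compound(valence_sum):+.4f}")
```

A sum of zero maps to a compound of zero, which is why reviews with no lexicon matches end up exactly at 0.0.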


With these scores we can now convert the result into a predicted label:

In [None]:
reviews['compound'] = reviews.scores.apply(lambda s: s['compound'])
reviews['prediction'] = reviews['compound'].apply(lambda c: 'pos' if c >= 0.0001 else ('neg' if c <= -0.0001 else 'neu'))
reviews.head()

Unnamed: 0,reviewer_id,store_name,category,store_address,latitude,longitude,rating_count,review_time,review,rating,label,scores,compound,prediction
0,1,McDonald's,Fast food restaurant,"13749 US-183 Hwy, Austin, TX 78750, United States",30.460718,-97.792874,1240,3 months ago,Why does it look like someone spit on my food?...,1 star,neg,"{'neg': 0.027, 'neu': 0.879, 'pos': 0.094, 'co...",0.5215,pos
1,2,McDonald's,Fast food restaurant,"13749 US-183 Hwy, Austin, TX 78750, United States",30.460718,-97.792874,1240,5 days ago,It'd McDonalds. It is what it is as far as the...,4 stars,pos,"{'neg': 0.0, 'neu': 0.791, 'pos': 0.209, 'comp...",0.8687,pos
2,3,McDonald's,Fast food restaurant,"13749 US-183 Hwy, Austin, TX 78750, United States",30.460718,-97.792874,1240,5 days ago,Made a mobile order got to the speaker and che...,1 star,neg,"{'neg': 0.051, 'neu': 0.949, 'pos': 0.0, 'comp...",-0.3535,neg
3,4,McDonald's,Fast food restaurant,"13749 US-183 Hwy, Austin, TX 78750, United States",30.460718,-97.792874,1240,a month ago,My mc. Crispy chicken sandwich was ï¿½ï¿½ï¿½ï¿...,5 stars,pos,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,neu
4,5,McDonald's,Fast food restaurant,"13749 US-183 Hwy, Austin, TX 78750, United States",30.460718,-97.792874,1240,2 months ago,"I repeat my order 3 times in the drive thru, a...",1 star,neg,"{'neg': 0.143, 'neu': 0.857, 'pos': 0.0, 'comp...",-0.802,neg
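
The thresholding rule can also be written as a small standalone function, which makes the cutoff easy to change. This is a sketch, not the notebook's own code: the name `label_from_compound` is hypothetical, the default cutoff of 0.0001 mirrors the cell above, and 0.05 is the cutoff usually suggested in VADER's documentation, included here only as a point of comparison.

```python
def label_from_compound(compound: float, threshold: float = 0.0001) -> str:
    """Map a VADER compound score to 'pos', 'neg', or 'neu' using a symmetric cutoff."""
    if compound >= threshold:
        return "pos"
    if compound <= -threshold:
        return "neg"
    return "neu"


# Compound scores of the first five reviews shown in the table above
sample = [0.5215, 0.8687, -0.3535, 0.0, -0.802]

narrow = [label_from_compound(c) for c in sample]               # cutoff used in this notebook
wide = [label_from_compound(c, threshold=0.05) for c in sample]  # assumed conventional cutoff

print(narrow)
print(wide)
```

A wider neutral band moves weakly polarized reviews into `neu`, so the choice of cutoff directly affects the precision and recall of the neutral class.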


Finally, let's compute a few quality metrics for the model:

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

y_true = reviews.label.values
y_pred = reviews.prediction.values

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)
cr = classification_report(y_true, y_pred)

print(f"Accuracy:\n{acc}\n")
print(f"Classification Report:\n{cr}")
print(f"Confusion Matrix:\n{cm}")

Accuracy:
0.7010935972629521

Classification Report:
              precision    recall  f1-score   support

         neg       0.83      0.64      0.72     12325
         neu       0.33      0.42      0.37      4706
         pos       0.76      0.83      0.80     15705

    accuracy                           0.70     32736
   macro avg       0.64      0.63      0.63     32736
weighted avg       0.72      0.70      0.71     32736

Confusion Matrix:
[[ 7895  2072  2358]
 [  981  1961  1764]
 [  651  1959 13095]]


The model reaches an accuracy of 70%, well above the roughly 48% that always predicting the majority class (pos, 15,705 of 32,736 reviews) would achieve. However, performance is uneven across classes. Positive reviews are handled well (F1 = 0.80, recall = 0.83). Negative reviews show acceptable performance (F1 = 0.72), although the model misses 36% of them (recall = 0.64). The greatest difficulty lies in neutral reviews, where performance is low (F1 = 0.37), indicating that the model does not separate them adequately from the other two classes. In short, although overall performance is satisfactory, additional techniques such as class balancing are needed to improve results on neutral reviews.
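
The per-class figures in the classification report can be recomputed by hand from the confusion matrix, which helps when checking where the errors concentrate. The sketch below uses only the matrix printed above; scikit-learn orders rows (true labels) and columns (predicted labels) alphabetically, so the order is neg, neu, pos. Variable names are illustrative.

```python
labels = ["neg", "neu", "pos"]

# Rows are true labels, columns are predicted labels (values copied from the output above)
cm = [
    [7895, 2072, 2358],
    [981, 1961, 1764],
    [651, 1959, 13095],
]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(labels))) / total

metrics = {}
for i, label in enumerate(labels):
    true_positives = cm[i][i]
    predicted_as_label = sum(row[i] for row in cm)  # column sum
    actually_label = sum(cm[i])                     # row sum
    precision = true_positives / predicted_as_label
    recall = true_positives / actually_label
    f1 = 2 * precision * recall / (precision + recall)
    metrics[label] = (precision, recall, f1)

print(f"accuracy: {accuracy:.4f}")
for label, (precision, recall, f1) in metrics.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Reading the neutral row directly, 1,764 of the 4,706 neutral reviews are predicted as positive and 981 as negative, so most neutral errors lean toward the positive class.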