Exercises
Julian Andrés Quiroga
Juan Camilo Ramos

Exercise 1

Give an example of a question we might be able to answer with this sort of data, and another question that we'd need additional data to answer. Assume for now
that all of the reviews are coming from one business.

Question 1: Answerable with this data

What is the relationship between the number of stars awarded and the number of "helpful", "funny" or "great" votes the reviews receive?

Question 2: Requires additional data

What are the key factors that influence a customer rating a restaurant 5 stars instead of 3 or less?


Exercise 2

Conduct an exploratory analysis of the sizes of reviews: find the shortest and longest reviews, then plot a histogram showing the distribution of review lengths.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Load the CSV file
data = pd.read_csv('sdata.csv')
data.head()

# Keep only the text column
AllReviews = data['text']
AllReviews.head()

# Measure review length as a word count: `str.split()` splits on whitespace
# and `len()` counts the resulting words
tamanios_resenas = AllReviews.apply(lambda x: len(x.split()))

# Find the longest and shortest reviews
resena_larga = AllReviews[tamanios_resenas.idxmax()]
resena_corta = AllReviews[tamanios_resenas.idxmin()]
print(f"Shortest review ({tamanios_resenas.min()} words): {resena_corta}")
print(f"Longest review ({tamanios_resenas.max()} words): {resena_larga}")

# Histogram of review lengths
plt.figure(figsize=(10,6))
plt.hist(tamanios_resenas, bins=30, edgecolor='black')
plt.title('Review length in number of words')
plt.xlabel('Number of words')
plt.ylabel('Count')

# Show the plot
plt.show()

Exercise 3

Write a function word_cloud_rating(data, star_value) that constructs a word cloud from the subset of data that exhibit a certain star_value. Visualize the results of
this function for 1-star reviews.

In [None]:
import matplotlib.pyplot as plt
from wordcloud import WordCloud, STOPWORDS

def word_cloud_rating(datos, valor_estrella):
    # Select the reviews whose rating equals the requested star value
    reseñas_filtradas = datos[datos['stars'] == valor_estrella]['text']

    # Join them into a single text
    texto = " ".join(reseñas_filtradas)

    # Build the stopword set to exclude from the word cloud
    stopwords = set(STOPWORDS)

    # Create the word cloud
    nube_palabras = WordCloud(width=800, height=400, background_color='white', stopwords=stopwords).generate(texto)

    # Display the cloud
    plt.figure(figsize=(10, 5))
    plt.imshow(nube_palabras, interpolation='bilinear')
    plt.axis('off')
    plt.show()

# Visualize the 1-star reviews (uses the `data` DataFrame loaded in Exercise 2)
word_cloud_rating(data, 1)

Exercise 4

The word "good" seems to appear quite frequently in the negative reviews. Investigate why that is and come up with a reasonable explanation.

Answer

The word "good" often appears in negative reviews as a way of softening criticism. Reviewers use it to tone down their feedback, sound more diplomatic, and avoid coming across as excessively harsh. By acknowledging a positive aspect, however small, they balance the critique and present a fairer, more reasonable opinion. It also reflects a wish to stay courteous, avoid confrontation, and make the feedback read as constructive rather than purely negative. In this way, "good" acts as a buffer for expressing dissatisfaction without being aggressive.
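
A raw word count cannot tell a softening "good" apart from a negated one ("not good") or one that sets up a contrast ("good, but ..."). The following is a minimal sketch of how that could be checked. The sample reviews, the `NEGATIONS` and `CONTRASTS` word lists, and the function name `good_contexts` are assumptions made for illustration; in the notebook the input would be the text of the low-star reviews.

```python
import re
from collections import Counter

# Hypothetical sample; in practice use the low-star subset of data['text']
negative_reviews = [
    "The food was not good and the service was slow.",
    "Good location, but the room was dirty.",
    "It used to be good. Not anymore.",
    "Nothing here is as good as they claim.",
    "The fries were good but everything else was cold.",
]

# Assumed, deliberately short word lists
NEGATIONS = {"not", "no", "never", "isn't", "wasn't", "nothing"}
CONTRASTS = {"but", "however", "though"}


def good_contexts(reviews, target="good", window=3):
    """Count how often `target` is negated or followed by a contrast word."""
    stats = Counter()
    for review in reviews:
        tokens = re.findall(r"[a-z']+", review.lower())
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            stats["total"] += 1
            # Negation word within `window` tokens before the target
            if NEGATIONS & set(tokens[max(0, i - window):i]):
                stats["negated"] += 1
            # Contrast word within `window` tokens after the target
            if CONTRASTS & set(tokens[i + 1:i + 1 + window]):
                stats["followed_by_contrast"] += 1
    return stats


stats = good_contexts(negative_reviews)
print(dict(stats))
```

If a large share of the occurrences turn out to be negated or followed by a contrast, that would suggest the explanation above is only part of the picture.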

Exercise 5

Find all the high-frequency (top 1%) and low-frequency (bottom 1%) words in the reviews overall. (Hint: import the Counter class from the collections module.)

In [None]:
import pandas as pd
import nltk
from collections import Counter

# Download the NLTK resources
nltk.download('stopwords')
nltk.download('punkt')
from nltk.corpus import stopwords

# Load the data (sdata.csv)
data = pd.read_csv('sdata.csv')

# Select the text column
AllReviews = data['text']

# Step 1: Tokenize all the reviews and remove the stopwords
stop_words = set(stopwords.words('english'))

# Combine all the reviews into a single block of text
all_text = ' '.join(AllReviews.astype(str))

# Tokenize into words, keeping only alphanumeric tokens that are not stopwords
words = nltk.word_tokenize(all_text.lower())
filtered_words = [word for word in words if word.isalnum() and word not in stop_words]

# Step 2: Count the frequency of each word
word_counts = Counter(filtered_words)

# Total number of words (after filtering)
total_words = sum(word_counts.values())

# Step 3: Find the top 1% and bottom 1% of the vocabulary
# Sort the words by frequency, highest first
sorted_word_counts = sorted(word_counts.items(), key=lambda x: x[1], reverse=True)

# Compute the 1% cut-off positions in the sorted list
top_1_percent_index = int(len(sorted_word_counts) * 0.01)
low_1_percent_index = int(len(sorted_word_counts) * 0.99)

# Step 4: Extract the high- and low-frequency words
top_1_percent_words = sorted_word_counts[:top_1_percent_index]
low_1_percent_words = sorted_word_counts[low_1_percent_index:]

# Print the high- and low-frequency words
print("High-frequency words (top 1%):")
for word, count in top_1_percent_words:
    print(f"{word}: {count}")

print("\nLow-frequency words (bottom 1%):")
for word, count in low_1_percent_words:
    print(f"{word}: {count}")
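
One caveat with slicing a sorted list: in review text many words often occur exactly once, so the "bottom 1%" slice is an arbitrary subset of words that all tie at the same count. The sketch below is one possible alternative that keeps every word tying with the cut-off count. The function name `frequency_band` and the toy token list are hypothetical and stand in for `word_counts` from the cell above.

```python
from collections import Counter

# Toy tokens standing in for `filtered_words`
tokens = "good food good service slow food good cold rude stale".split()
word_counts = Counter(tokens)


def frequency_band(counts, fraction=0.01, top=True):
    """Return every word whose count ties with or beats the cut-off count.

    The cut-off is the count found at the `fraction` position of the
    sorted counts (at least one word is always included).
    """
    ordered = sorted(counts.values(), reverse=top)
    cutoff_index = max(1, int(len(ordered) * fraction)) - 1
    cutoff = ordered[cutoff_index]
    if top:
        return {w: c for w, c in counts.items() if c >= cutoff}
    return {w: c for w, c in counts.items() if c <= cutoff}


high = frequency_band(word_counts, 0.01, top=True)
low = frequency_band(word_counts, 0.01, top=False)
print("High:", high)
print("Low:", low)
```

With this variant the low-frequency band may hold far more than 1% of the vocabulary, which is a trade-off: the result no longer depends on how ties happen to be ordered.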

Exercise 6

Write a function called top_k_ngrams(word_tokens, n, k) for printing out the top k n-grams. Use this function to get the top 10 1-grams, 2-grams, and 3-grams from
the first 1000 reviews in our dataset.

In [None]:
import pandas as pd
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

# Download the NLTK resources
nltk.download('stopwords')
nltk.download('punkt')

# Load the data (sdata.csv), reading only the first 1000 reviews
data = pd.read_csv('sdata.csv', nrows=1000)

# Extract the text of those 1000 reviews
AllReviews = data['text']

# Tokenize and clean the reviews (remove stopwords and non-alphanumeric tokens)
stop_words = set(stopwords.words('english'))
all_text = ' '.join(AllReviews.astype(str))
words = nltk.word_tokenize(all_text.lower())
filtered_words = [word for word in words if word.isalnum() and word not in stop_words]

# Function that prints the top k n-grams
def top_k_ngrams(word_tokens, n, k):
    # Join the tokens into a single text
    word_text = ' '.join(word_tokens)

    # Use CountVectorizer to count the n-grams
    vec = CountVectorizer(ngram_range=(n, n)).fit([word_text])
    bag_of_words = vec.transform([word_text])
    sum_words = bag_of_words.sum(axis=0)

    # Collect the n-grams with their frequencies, most frequent first
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)

    # Print the k most frequent n-grams
    print(f"Top {k} {n}-grams:")
    for word, freq in words_freq[:k]:
        print(f"{word}: {freq}")
    print("\n")

# Apply the function to get the top 10 1-grams, 2-grams, and 3-grams
top_k_ngrams(filtered_words, 1, 10)  # 1-grams
top_k_ngrams(filtered_words, 2, 10)  # 2-grams
top_k_ngrams(filtered_words, 3, 10)  # 3-grams
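
Because the cell above joins all reviews into one string, some of the counted n-grams span the boundary between two consecutive reviews. A minimal sketch of counting n-grams within each review instead, using only the standard library, is shown here. The function names `ngrams` and `top_k_ngrams_per_review` and the toy token lists are hypothetical; each inner list stands for the tokens of one review.

```python
from collections import Counter


def ngrams(tokens, n):
    """Yield consecutive n-token tuples from one token list."""
    return zip(*(tokens[i:] for i in range(n)))


def top_k_ngrams_per_review(reviews_tokens, n, k):
    """Count n-grams review by review so none crosses a review boundary."""
    counts = Counter()
    for tokens in reviews_tokens:
        counts.update(" ".join(gram) for gram in ngrams(tokens, n))
    return counts.most_common(k)


# Toy input: one token list per review
reviews_tokens = [
    ["great", "food", "great", "service"],
    ["great", "food", "slow", "service"],
]
top_bigrams = top_k_ngrams_per_review(reviews_tokens, 2, 10)
print(top_bigrams)
```

Here "service great" never appears, although it would if the two reviews were concatenated first.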

Exercise 7

7.1
Filter out all of the stop words in the first review of the Yelp review data and print out your answer. Additionally, print out (separately) the stopwords you found in
this review.

In [None]:
import nltk
from nltk.corpus import stopwords
import pandas as pd

# Download the NLTK resources
nltk.download('stopwords')
nltk.download('punkt')

# Load the data (sdata.csv)
data = pd.read_csv('sdata.csv', nrows=5000)

# Get the first review
first_review = data['text'][0]

# Tokenize the first review into individual words
words_in_review = nltk.word_tokenize(first_review.lower())

# Get the English stopwords
stop_words = set(stopwords.words('english'))

# Separate the stopwords from the remaining words
stopwords_in_review = [word for word in words_in_review if word in stop_words]
filtered_review = [word for word in words_in_review if word not in stop_words]

# Print the stopwords that were found
print("Stopwords found in the first review:")
print(stopwords_in_review)

# Print the review without the stopwords
print("\nFirst review without stopwords:")
print(' '.join(filtered_review))


7.2
Modify the function top_k_ngrams(word_tokens, n, k) to remove stop words before determining the top n-grams.

In [None]:
import pandas as pd
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

# Download the NLTK resources (if not already downloaded)
nltk.download('stopwords')
nltk.download('punkt')

# Get the English stopwords
stop_words = set(stopwords.words('english'))

# Modified function: prints the top k n-grams after removing stopwords
def top_k_ngrams(word_tokens, n, k):
    # Remove the stopwords from the tokens
    filtered_tokens = [word for word in word_tokens if word not in stop_words]

    # Join the filtered tokens into a single text
    word_text = ' '.join(filtered_tokens)

    # Use CountVectorizer to count the n-grams of the remaining (non-stopword) tokens
    vec = CountVectorizer(ngram_range=(n, n)).fit([word_text])
    bag_of_words = vec.transform([word_text])
    sum_words = bag_of_words.sum(axis=0)

    # Collect the n-grams with their frequencies, most frequent first
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)

    # Print the k most frequent n-grams
    print(f"Top {k} {n}-grams without stopwords:")
    for word, freq in words_freq[:k]:
        print(f"{word}: {freq}")
    print("\n")

# Usage example with the first 1000 reviews
data = pd.read_csv('sdata.csv', nrows=1000)
AllReviews = data['text']

# Tokenize the reviews
all_text = ' '.join(AllReviews.astype(str))
words = nltk.word_tokenize(all_text.lower())

# Get the top 10 1-grams, 2-grams, and 3-grams with stopwords removed
top_k_ngrams(words, 1, 10)  # 1-grams
top_k_ngrams(words, 2, 10)  # 2-grams
top_k_ngrams(words, 3, 10)  # 3-grams


Exercise 8

8.1

Divide the data into "good reviews" (i.e. stars rating was greater than 3) and "bad reviews" (i.e. stars rating was less or equal than 3) and make a bar plot of the top
20 words in each case. Are these results different from above?

In [None]:
import pandas as pd
import nltk
from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt
from nltk.corpus import stopwords

# Download the NLTK resources
nltk.download('stopwords')
nltk.download('punkt')

# Load the dataset
data = pd.read_csv('sdata.csv', nrows=5000)

# English stopwords as a list, to pass to CountVectorizer
stop_words = list(stopwords.words('english'))

# Function that returns the n most frequent words
def get_top_n_words(corpus, n=20):
    vec = CountVectorizer(stop_words=stop_words).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

# Split the reviews into good (>3 stars) and bad (<=3 stars)
good_reviews = data[data['stars'] > 3]['text']
bad_reviews = data[data['stars'] <= 3]['text']

# Get the 20 most frequent words in the good reviews
top_good_words = get_top_n_words(good_reviews, 20)

# Get the 20 most frequent words in the bad reviews
top_bad_words = get_top_n_words(bad_reviews, 20)

# Build a DataFrame for each set of words
df_good = pd.DataFrame(top_good_words, columns=['Word', 'Frequency'])
df_bad = pd.DataFrame(top_bad_words, columns=['Word', 'Frequency'])

# Bar plot of the top 20 words in the good reviews
plt.figure(figsize=(10, 6))
df_good.groupby('Word').sum()['Frequency'].sort_values(ascending=False).plot(
    kind='bar', title='Top 20 words in "good reviews"')
plt.xticks(rotation=45)
plt.show()

# Bar plot of the top 20 words in the bad reviews
plt.figure(figsize=(10, 6))
df_bad.groupby('Word').sum()['Frequency'].sort_values(ascending=False).plot(
    kind='bar', title='Top 20 words in "bad reviews"')
plt.xticks(rotation=45)
plt.show()

The difference between the most common words in "good" and "bad" reviews can be quite informative. For example, good reviews might contain more words associated with satisfaction, good vibes, successful choices, and tastes, whereas bad reviews would tend to contain more negative words, such as "dislike", "bitter", or "upset". This analysis shows how the sentiment of a review affects the frequency of certain words in each category.
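
Generic words such as "food" may rank highly in both groups, which can make the two top-20 lists look alike. One way to surface the words that set the groups apart is to compare relative frequencies. The following is a rough sketch under that assumption; the function names `relative_freq` and `distinctive_words` and the toy token lists are hypothetical and only illustrate the idea.

```python
from collections import Counter


def relative_freq(token_lists):
    """Map each word to its share of all tokens in the group."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def distinctive_words(group_a, group_b, k=3):
    """Rank words by how much more frequent they are in group_a than in group_b."""
    fa, fb = relative_freq(group_a), relative_freq(group_b)
    diff = {w: fa[w] - fb.get(w, 0.0) for w in fa}
    return sorted(diff, key=diff.get, reverse=True)[:k]


# Toy input: one token list per review
good = [["great", "food", "friendly", "staff"], ["great", "place", "food"]]
bad = [["cold", "food", "rude", "staff"], ["food", "slow", "cold"]]

print("Distinctive of good reviews:", distinctive_words(good, bad))
print("Distinctive of bad reviews:", distinctive_words(bad, good))
```

Words shared equally by both groups, like "food" in this toy data, get a difference of zero and drop out of the ranking.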

8.2

Use the get_top_n_words() function to find the top 20 bigrams and trigrams (In both, bad and good reviews). Do the results seem useful?



In [None]:
import pandas as pd
import nltk
from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt
from nltk.corpus import stopwords

# Download the NLTK resources
nltk.download('stopwords')
nltk.download('punkt')

# Load the dataset
data = pd.read_csv('sdata.csv', nrows=5000)

# English stopwords as a list, to pass to CountVectorizer
stop_words = list(stopwords.words('english'))

# Function that returns the n most frequent n-grams
def get_top_n_words(corpus, n=20, ngram_range=(1, 1)):
    # ngram_range selects the n-gram size: (2, 2) for bigrams, (3, 3) for trigrams
    vec = CountVectorizer(stop_words=stop_words, ngram_range=ngram_range).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]

# Split the reviews into good (>3 stars) and bad (<=3 stars)
good_reviews = data[data['stars'] > 3]['text']
bad_reviews = data[data['stars'] <= 3]['text']

# Get the 20 most frequent bigrams in the good reviews
top_good_bigrams = get_top_n_words(good_reviews, 20, ngram_range=(2, 2))

# Get the 20 most frequent bigrams in the bad reviews
top_bad_bigrams = get_top_n_words(bad_reviews, 20, ngram_range=(2, 2))

# Get the 20 most frequent trigrams in the good reviews
top_good_trigrams = get_top_n_words(good_reviews, 20, ngram_range=(3, 3))

# Get the 20 most frequent trigrams in the bad reviews
top_bad_trigrams = get_top_n_words(bad_reviews, 20, ngram_range=(3, 3))

# Build a DataFrame for each set of n-grams
df_good_bigrams = pd.DataFrame(top_good_bigrams, columns=['Bigram', 'Frequency'])
df_bad_bigrams = pd.DataFrame(top_bad_bigrams, columns=['Bigram', 'Frequency'])
df_good_trigrams = pd.DataFrame(top_good_trigrams, columns=['Trigram', 'Frequency'])
df_bad_trigrams = pd.DataFrame(top_bad_trigrams, columns=['Trigram', 'Frequency'])

# Bar plot of the top 20 bigrams in the good reviews
plt.figure(figsize=(10, 6))
df_good_bigrams.groupby('Bigram').sum()['Frequency'].sort_values(ascending=False).plot(
    kind='bar', title='Top 20 bigrams in "good reviews"')
plt.xticks(rotation=45)
plt.show()

# Bar plot of the top 20 bigrams in the bad reviews
plt.figure(figsize=(10, 6))
df_bad_bigrams.groupby('Bigram').sum()['Frequency'].sort_values(ascending=False).plot(
    kind='bar', title='Top 20 bigrams in "bad reviews"')
plt.xticks(rotation=45)
plt.show()

# Bar plot of the top 20 trigrams in the good reviews
plt.figure(figsize=(10, 6))
df_good_trigrams.groupby('Trigram').sum()['Frequency'].sort_values(ascending=False).plot(
    kind='bar', title='Top 20 trigrams in "good reviews"')
plt.xticks(rotation=45)
plt.show()

# Bar plot of the top 20 trigrams in the bad reviews
plt.figure(figsize=(10, 6))
df_bad_trigrams.groupby('Trigram').sum()['Frequency'].sort_values(ascending=False).plot(
    kind='bar', title='Top 20 trigrams in "bad reviews"')
plt.xticks(rotation=45)
plt.show()

Are the results useful?
Bigrams and trigrams:
Bigrams and trigrams can provide additional context by showing common combinations of words that appear together,
such as "great service" or "bad experience".

In good reviews, you are likely to find positive combinations such as "friendly staff" or "delicious food".

In bad reviews, you might find negative bigrams or trigrams such as "bad service" or "wait long time".

Compared with the most common individual words, bigrams and trigrams offer a
more detailed view of the context and can help detect recurring phrases in both positive
and negative experiences.

Exercise 9

9.1

You may have noticed that many of the important "bad" bigrams included the words "like" or "just" but didn't seem very informative (e.g. "felt like", "food just").
Give some ideas of how to use this sort of observation in future pre-processing of reviews, based on the pre-processing ideas we have already studied.

Answer

Based on the observation that "bad" bigrams containing words such as "like" and "just" seem uninformative for sentiment analysis, there are several preprocessing strategies we can apply to improve the quality of the data:

1. Stopword Removal or Modification: Words like "like" and "just" may function as stopwords, adding little meaning in most contexts. Since they frequently appear in uninformative bigrams, it might be useful to either remove them during preprocessing or handle them differently depending on the context. For example, removing them when they appear in bigrams like "felt like" or "food just" could reduce noise and improve model performance.

2. Bigrams as Features: In future analyses, we could flag bigrams that contain these non-informative words as low-value features. By reducing their weight or excluding them from the model, we would be focusing more on phrases that convey significant sentiment or opinions.

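3. POS Tagging and Filtering: These words often act as fillers rather than carrying semantic weight. Part-of-Speech (POS) tagging can tell the two roles apart: "like" used as a verb ("I like the fries") carries sentiment, while "like" used as a preposition ("felt like") usually does not. Filtering out the filler uses reduces the overall feature space and helps the model focus on more informative content.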

4. Custom Word Lists: Instead of globally removing these words, we could build a custom list of terms that often form weak bigrams, like "felt like" or "food just," and exclude or modify them during preprocessing. This approach allows us to retain some instances of these words when they contribute meaning but eliminate them when they detract from analysis quality.

These strategies aim to filter out uninformative patterns that could dilute the insights gained from review data.
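
As an illustration of strategies 1 and 4, the sketch below drops every bigram that contains a word from a small custom filler list. The list `FILLER_WORDS`, the function name `informative_bigrams`, and the bigram counts are all hypothetical values chosen for the example, not results from the dataset.

```python
from collections import Counter

# Assumed custom list, taken from the examples discussed above
FILLER_WORDS = {"like", "just"}


def informative_bigrams(bigram_counts, fillers=FILLER_WORDS):
    """Return the bigram counts without any bigram that contains a filler word."""
    return Counter({
        bigram: count
        for bigram, count in bigram_counts.items()
        if not fillers & set(bigram.split())
    })


# Made-up counts, only to show the filtering step
bigram_counts = Counter({
    "felt like": 40,
    "bad service": 30,
    "food just": 25,
    "long wait": 18,
})
kept = informative_bigrams(bigram_counts)
print(kept.most_common())
```

The same set could instead be appended to the `stop_words` list passed to `CountVectorizer`, which would remove the words before the n-grams are formed rather than after.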

9.2

Building on the previous question, we note that most of the most important complaints and compliments can't be completely observed by looking at bigrams or
trigrams. This can often be fixed by small modifications. Do the following:
1. Write down a complaint that is unlikely to be (completely) picked up by bigram analysis. Hint: what might you write if your hamburger was served cold?
2. Write down a processing step that would fix this problem. Try to find a solution that would work for several similar problems without additional human
input.


Answer

1. Complaint Example: "I was so disappointed because my hamburger was served cold and it took forever for the server to bring it back hot."

2. Processing Step: One effective solution could be to implement dependency parsing to capture relationships between words and understand the context more accurately. This would help identify that "hamburger" is the subject, "served" is the action, and "cold" is the key complaint, even if these words aren't in adjacent positions (which bigram or trigram analysis might miss). By recognizing such dependencies, the model can pick up on more complex complaints, such as food being cold, late service, or other nuanced grievances. This approach can be applied broadly to handle various review types without requiring specific human input each time.
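
Dependency parsing requires a trained parser, so the sketch below uses a simpler stand-in rather than the approach described above: counting ordered word pairs that occur within a small window of each other. It is not dependency parsing and captures no grammatical roles, but it shows how "hamburger" and "cold" can be linked even when they are not adjacent. The function name `window_pairs`, the window size, and the sample sentences are assumptions made for illustration.

```python
import re
from collections import Counter


def window_pairs(text, window=4):
    """Yield ordered (left, right) word pairs at most `window` tokens apart."""
    tokens = re.findall(r"[a-z]+", text.lower())
    for i, left in enumerate(tokens):
        for right in tokens[i + 1:i + 1 + window]:
            yield (left, right)


# Hypothetical complaints where the key words are never adjacent
reviews = [
    "My hamburger was served cold.",
    "The hamburger arrived completely cold again.",
]
pair_counts = Counter(pair for review in reviews for pair in window_pairs(review))
print(pair_counts[("hamburger", "cold")])
```

A plain bigram count would miss this pair in both sentences, while the windowed count finds it in each. The cost is many more candidate pairs, so stopword filtering beforehand becomes more important.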