# LDA MODEL

LDA (Latent Dirichlet Allocation) is a topic-modeling technique used in natural language processing. Its goal is to discover the underlying topics in a collection of texts.

In LDA, a topic is a probability distribution over the vocabulary, dominated by words that frequently appear together in the documents. Each topic represents a common theme or concept in the text, which helps identify recurring themes mentioned in reviews (such as "service quality" or "cozy atmosphere").
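To make the idea of "documents as mixtures of topics" concrete, the following is a minimal sketch of the generative story that LDA assumes. The vocabulary, the two topics, the `alpha` values and the seed are all made up for illustration; none of them come from the review data.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

# Hypothetical toy vocabulary and two hand-written topics (word distributions)
vocab = ["service", "staff", "friendly", "pizza", "crust", "cheese"]
topic_word = np.array([
    [0.40, 0.30, 0.30, 0.00, 0.00, 0.00],  # topic 0: customer service
    [0.00, 0.00, 0.00, 0.50, 0.25, 0.25],  # topic 1: pizza
])

def generate_document(n_words, alpha=(0.5, 0.5)):
    """Generate one document following LDA's generative story."""
    theta = rng.dirichlet(alpha)  # per-document topic mixture
    words = []
    for _ in range(n_words):
        z = rng.choice(len(theta), p=theta)          # pick a topic for this word
        w = rng.choice(len(vocab), p=topic_word[z])  # pick a word from that topic
        words.append(vocab[w])
    return theta, words

theta, words = generate_document(8)
print(theta.round(2), words)
```

Fitting an LDA model runs this story in reverse: given only the words, it estimates the topic mixtures and the word distribution of each topic.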

In [1]:
# Import libraries
import pandas as pd
import numpy as np
from gensim import corpora, models
import pyLDAvis.gensim_models as gensimvis
import pyLDAvis
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [2]:
# Load the DataFrame
df_review = pd.read_parquet('../review_final_final.parquet')
df_review

Unnamed: 0,text,gmap_id,fecha,sentiment
0,make korean traditional food properly,0x80c2c778e3b73d33:0xbdc58662a4a97d49,2016-01-30 19:38:55,positive
1,great food price portion large,0x80c2c778e3b73d33:0xbdc58662a4a97d49,2016-07-15 13:11:12,positive
2,chicken sandwich delicious definitely twist fl...,0x80dd2b4c8555edb7:0xfc33d65c4bdbef42,2013-12-21 05:26:13,positive
3,love place fry garlic chicken crispy savory al...,0x80dd2b4c8555edb7:0xfc33d65c4bdbef42,2022-09-20 07:51:08,positive
4,delicious variety food good place go either qu...,0x80c2d765f8c90a3d:0x16afb75943e7ad50,2013-06-06 18:41:37,positive
...,...,...,...,...
169904,maybe order delivery noodle hard eat soup room...,0x808fe955b0beae57:0xb3159fe6572670c3,2014-09-04 00:38:44,positive
169905,great food staff kind gentleman help tonight g...,0x808fe955b0beae57:0xb3159fe6572670c3,2018-06-05 03:31:51,positive
169906,place take osaka raman try black garlic raman ...,0x808fe955b0beae57:0xb3159fe6572670c3,2017-07-02 19:41:03,negative
169907,delicious raman clean din room good service,0x808fe955b0beae57:0xb3159fe6572670c3,2020-11-05 01:30:44,positive


In [3]:
# Keep only the positive reviews; .copy() avoids SettingWithCopyWarning when columns are added later
df_positive = df_review[df_review['sentiment'] == 'positive'].copy()


In [4]:
# Tokenize each review: lowercase it and split on whitespace
df_positive['tokens'] = df_positive['text'].apply(lambda x: x.lower().split())
# Build the dictionary and the bag-of-words corpus
dictionary = corpora.Dictionary(df_positive['tokens'])
corpus = [dictionary.doc2bow(text) for text in df_positive['tokens']]
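As a rough illustration of what the dictionary and the bag-of-words corpus contain, here is a pure-Python sketch. The helpers `build_vocab` and `to_bow` are hypothetical stand-ins written for this example, not gensim's implementation, and the ids gensim assigns may differ.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in tokenized_docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Return sorted (token_id, count) pairs for the tokens known to the vocabulary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Two made-up reviews, already tokenized
docs = [
    "great food great price".split(),
    "friendly staff great service".split(),
]
token2id = build_vocab(docs)
bows = [to_bow(doc, token2id) for doc in docs]
print(token2id)
print(bows)
```

Each document becomes a sparse list of `(token_id, count)` pairs, so word order is discarded and only frequencies remain.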



In [5]:
# Fit the LDA model
num_topics = 15
lda_model = models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=15)

In [6]:
# Visualize the LDA model and save it as an HTML file
lda_visualization = gensimvis.prepare(lda_model, corpus, dictionary)
pyLDAvis.save_html(lda_visualization, 'lda_visualization.html')

In [7]:
# Print the topics
for idx, topic in lda_model.print_topics(num_words=10):
    print(f"Topic {idx}: {topic}")

Topic 0: 0.218*"good" + 0.115*"really" + 0.092*"place" + 0.089*"food" + 0.052*"nice" + 0.036*"eat" + 0.034*"like" + 0.024*"go" + 0.023*"people" + 0.023*"love"
Topic 1: 0.142*"pizza" + 0.063*"wait" + 0.041*"long" + 0.038*"order" + 0.035*"good" + 0.033*"drive" + 0.030*"take" + 0.027*"worth" + 0.022*"line" + 0.021*"get"
Topic 2: 0.159*"great" + 0.132*"food" + 0.124*"service" + 0.061*"friendly" + 0.049*"good" + 0.047*"staff" + 0.032*"excellent" + 0.025*"place" + 0.024*"fast" + 0.023*"customer"
Topic 3: 0.076*"lunch" + 0.062*"great" + 0.045*"breakfast" + 0.037*"special" + 0.036*"happy" + 0.035*"dinner" + 0.034*"home" + 0.033*"make" + 0.030*"hour" + 0.025*"coffee"
Topic 4: 0.044*"time" + 0.040*"go" + 0.038*"back" + 0.035*"come" + 0.021*"food" + 0.017*"order" + 0.017*"definitely" + 0.016*"get" + 0.015*"make" + 0.015*"first"
Topic 5: 0.102*"recommend" + 0.058*"super" + 0.053*"highly" + 0.051*"delicious" + 0.050*"place" + 0.050*"amazing" + 0.043*"food" + 0.039*"try" + 0.035*"definitely" + 0.026

Creating a dictionary that assigns a label to each topic

In [8]:
topic_mapping2 = {
    0: "Tiempo de servicio o entrega optimo",
    1: "Buena comida y relación calidad-precio",
    2: "Calidad del servicio al cliente por parte del staff",
    3: "Opciones de platos específicos y comidas saludables",
    4: "Fidelización de clientes",
    5: "Porciones y precios razonables",
    6: "Variedad de desayunos y opciones de café",
    7: "Platos siempre frescos y bien preparados",
    8: "Ambiente familiar y agradable",
    9: "Comida auténtica y selección variada",
    10: "Buena experiencia gastronómica",
    11: "Comodidad y experiencia hogareña",
    12: "Servicio excelente y rápido",
    13: "Platos populares, simples y funcionales",
    14: " ",
    15: ' '
}
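Because the labels are written by hand, it can help to check them against the number of topics. With `num_topics = 15` the valid topic ids are 0 to 14, so a key such as 15 never matches. The sketch below uses a hypothetical helper, `check_topic_labels`, and a made-up three-topic mapping rather than the one above.

```python
def check_topic_labels(mapping, num_topics):
    """Return (missing ids, out-of-range ids, ids with blank labels)."""
    valid_ids = set(range(num_topics))
    keys = set(mapping)
    missing = sorted(valid_ids - keys)
    out_of_range = sorted(keys - valid_ids)
    blank = sorted(k for k, v in mapping.items() if not v.strip())
    return missing, out_of_range, blank

# Hypothetical three-topic mapping, used only for illustration
example_mapping = {0: "Service", 1: "Prices", 3: " "}
missing, out_of_range, blank = check_topic_labels(example_mapping, num_topics=3)
print(missing, out_of_range, blank)
```

A check like this is also worth repeating after retraining, since topic ids are not guaranteed to keep the same meaning from one run to the next.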


In [9]:
# Get the main topics for each review
def get_top_topics(lda_model, bow, top_n=2):
    topics = sorted(lda_model[bow], key=lambda x: -x[1])
    return topics[:top_n]
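The sorting step can be tried without a trained model. This sketch assumes that `lda_model[bow]` yields a list of `(topic_id, probability)` pairs; the list below is hand-written and `top_n_topics` is an illustrative name.

```python
def top_n_topics(topic_probs, top_n=2):
    """Return the top_n (topic_id, probability) pairs, most probable first."""
    return sorted(topic_probs, key=lambda pair: pair[1], reverse=True)[:top_n]

# Hand-written stand-in for the topic distribution of one review
example = [(2, 0.15), (7, 0.55), (11, 0.30)]
print(top_n_topics(example))
```

gensim leaves out topics whose probability falls below the model's `minimum_probability`, so the returned list can be shorter than `top_n`.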

In [10]:
# Apply the function to get the main topics of each review
df_positive['top_topics'] = df_positive['tokens'].apply(
    lambda x: get_top_topics(lda_model, dictionary.doc2bow(x))
)



In [11]:
# Extract the id of the most probable topic into its own column
df_positive['topic_1'] = df_positive['top_topics'].apply(lambda x: x[0][0] if len(x) > 0 else None)

# Convert to int
df_positive['topic_1'] = df_positive['topic_1'].astype(int)

# Map the topic ids to labels using the dictionary
df_positive['topic_1'] = df_positive['topic_1'].map(topic_mapping2) 
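The guard `if len(x) > 0 else None` and the later `astype(int)` do not fit together well: a plain integer column cannot hold a missing value. One possible variant, sketched here on made-up data with made-up labels, uses pandas' nullable `Int64` dtype so that empty topic lists survive the conversion.

```python
import pandas as pd

# Hypothetical data: each row holds a list of (topic_id, probability) pairs
toy = pd.DataFrame({"top_topics": [[(4, 0.6), (1, 0.2)], [], [(1, 0.9)]]})
labels = {1: "Prices", 4: "Loyalty"}  # made-up labels

# First topic id, or pd.NA when the list is empty
first_id = toy["top_topics"].apply(lambda t: t[0][0] if t else pd.NA)

# "Int64" (capital I) is the nullable integer dtype; plain int would fail on missing values
toy["topic_id"] = first_id.astype("Int64")
toy["topic_label"] = toy["topic_id"].map(labels)
print(toy[["topic_id", "topic_label"]])
```

Rows without a topic end up with a missing label instead of raising an error.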


In [12]:
# Load the business data and drop the columns that are not needed
df_business = pd.read_csv('../restaurantes_california.csv')
df_business.drop(columns=['Unnamed: 0', 'cluster', 'primary_category'], inplace=True)
df_business.head()

Unnamed: 0,gmap_id,name,latitude,longitude,combined_categories,num_of_reviews,avg_rating
0,0x80c2c778e3b73d33:0xbdc58662a4a97d49,San Soo Dang,34.058092,-118.29213,Korean restaurant,18,4.4
1,0x80dd2b4c8555edb7:0xfc33d65c4bdbef42,Vons Chicken,33.916402,-118.010855,Restaurant,18,4.5
2,0x808f879f35b5088b:0xe3541cec7a95bd88,TACOS LA CABANA,37.789076,-122.233884,Taco restaurant,2,5.0
3,0x808f87f90c1f661f:0xf384e804a61e0c0b,Mariscos el poblano,37.764203,-122.214647,Restaurant,3,5.0
4,0x80dcd95d192d988b:0x68795f58e35bf888,Off The Hoof,33.748329,-117.866045,Restaurant,3,4.0


In [13]:
# Group by gmap_id and count how often each topic_1 value occurs
topic_counts = df_positive.groupby(['gmap_id', 'topic_1']).size().reset_index(name='count')

# Get the most frequent topic for each gmap_id
top_topic = topic_counts.loc[topic_counts.groupby('gmap_id')['count'].idxmax()]

In [14]:
# Rename the topic_1 column to top_topic_1 to tell it apart in the merge
top_topic = top_topic.rename(columns={'topic_1': 'top_topic_1'})


# Merge with df_business on gmap_id
df_merged = pd.merge(df_business, top_topic[['gmap_id', 'top_topic_1']], on='gmap_id', how='left')
df_merged.head(10)


Unnamed: 0,gmap_id,name,latitude,longitude,combined_categories,num_of_reviews,avg_rating,top_topic_1
0,0x80c2c778e3b73d33:0xbdc58662a4a97d49,San Soo Dang,34.058092,-118.29213,Korean restaurant,18,4.4,Fidelización de clientes
1,0x80dd2b4c8555edb7:0xfc33d65c4bdbef42,Vons Chicken,33.916402,-118.010855,Restaurant,18,4.5,Servicio excelente y rápido
2,0x808f879f35b5088b:0xe3541cec7a95bd88,TACOS LA CABANA,37.789076,-122.233884,Taco restaurant,2,5.0,
3,0x808f87f90c1f661f:0xf384e804a61e0c0b,Mariscos el poblano,37.764203,-122.214647,Restaurant,3,5.0,
4,0x80dcd95d192d988b:0x68795f58e35bf888,Off The Hoof,33.748329,-117.866045,Restaurant,3,4.0,
5,0x80c2baf50d29bf63:0x5bd904b842b9fcc,La Potranca,34.000181,-118.441249,Restaurant,13,4.2,
6,0x80c2cc53e00f8d11:0x8b92407c6db84cf1,Atlantis Burgers,33.929756,-118.165255,Restaurant,7,3.7,
7,0x80c2d765f8c90a3d:0x16afb75943e7ad50,Cowboy Burgers & BBQ,34.079995,-117.988951,Hamburger restaurant American restaurant Barbe...,38,3.7,Fidelización de clientes
8,0x54d15b5c2681df95:0xa611357b2e497e58,Beau Pre Cafe,40.962914,-124.096238,Cafe Breakfast restaurant Hamburger restaurant...,18,4.6,Porciones y precios razonables
9,0x808580d0baf51259:0x24736823db702c96,Taco Bell,37.799632,-122.436278,Mexican restaurant Breakfast restaurant Burrit...,4,3.3,
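The empty `top_topic_1` cells come from the left merge: businesses with no row in the topic table keep their data but get a missing value. A small sketch with invented frames shows how the `indicator` option of `pd.merge` can list those businesses.

```python
import pandas as pd

# Invented stand-ins for the business table and the per-business topic table
business = pd.DataFrame({"gmap_id": ["a", "b", "c"], "name": ["A", "B", "C"]})
topics = pd.DataFrame({"gmap_id": ["a", "c"], "top_topic_1": ["Service", "Prices"]})

# indicator=True adds a _merge column telling where each row came from
merged = pd.merge(business, topics, on="gmap_id", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "gmap_id"].tolist()
print(unmatched)
```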


In [15]:
# Add the second most frequent topic
topic_counts_sorted = topic_counts.sort_values(by=['gmap_id', 'count'], ascending=[True, False])
second_top_topic = topic_counts_sorted.groupby('gmap_id').nth(1).reset_index()

# Rename the topic_1 column to second_top_topic to tell it apart in the merge
second_top_topic = second_top_topic.rename(columns={'topic_1': 'second_top_topic'})

# Merge into df_merged to add the second most frequent topic
df_merged = pd.merge(df_merged, second_top_topic[['gmap_id', 'second_top_topic']], on='gmap_id', how='left')
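As far as I know, the index returned by `groupby(...).nth()` has differed between pandas versions. An alternative that does not depend on it is to rank the rows inside each group with `cumcount`. This is a sketch on invented counts, not a replacement tested against the real data.

```python
import pandas as pd

# Invented topic counts per business
counts = pd.DataFrame({
    "gmap_id": ["a", "a", "a", "b"],
    "topic_1": ["Service", "Prices", "Ambience", "Prices"],
    "count": [5, 3, 1, 2],
})

ordered = counts.sort_values(["gmap_id", "count"], ascending=[True, False])
# Rank rows within each business: 0 = most frequent, 1 = second most frequent, ...
ordered["rank"] = ordered.groupby("gmap_id").cumcount()
second = ordered.loc[ordered["rank"] == 1, ["gmap_id", "topic_1"]]
second = second.rename(columns={"topic_1": "second_top_topic"})
print(second)
```

Businesses with a single topic, such as `b` here, simply produce no row and receive a missing value in the later left merge.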

In [16]:
# Replace nulls (assign the result back instead of calling fillna with inplace=True on a column)
df_merged['top_topic_1'] = df_merged['top_topic_1'].fillna('Establecimiento sin reseñas o rating insuficiente')
df_merged['second_top_topic'] = df_merged['second_top_topic'].fillna('')

# Combine both topics into a single column
df_merged['caracteristicas_clave'] = df_merged['top_topic_1'].astype(str) + ". " + df_merged['second_top_topic'].astype(str)

# Drop the columns that are no longer needed
df_merged.drop(columns=['top_topic_1', 'second_top_topic'], inplace=True)
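Plain string concatenation leaves a trailing ". " whenever the second topic is empty. One way to avoid this, sketched with a hypothetical helper `join_labels` on made-up rows, is to join only the non-blank labels.

```python
import pandas as pd

# Made-up rows: the second business has no second topic
toy = pd.DataFrame({
    "top_topic_1": ["Service", "No reviews"],
    "second_top_topic": ["Prices", ""],
})

def join_labels(row):
    """Join the non-blank labels with '. ', skipping empty or whitespace-only ones."""
    parts = [row["top_topic_1"], row["second_top_topic"]]
    return ". ".join(p for p in parts if isinstance(p, str) and p.strip())

toy["caracteristicas_clave"] = toy.apply(join_labels, axis=1)
print(toy["caracteristicas_clave"].tolist())
```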



In [17]:
# Save the data
#df_merged.to_parquet('3-caracteristicas.parquet', index=False)