# Restaurant data processing

This notebook splits the Yelp dataset into a manageable sample.

We use the `yelp_dataset` folder, which contains the file `yelp_academic_dataset_business.json`, to extract restaurant-related data and generate a sample.

To do this, upload the `yelp_dataset` folder from the GitHub repository to your Google Drive, then update the paths in the cells below so they point to where you placed it.
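
Before running the cells, it can help to confirm that the folder is where the paths expect it. The following is a minimal sketch, not part of the original notebook: `BASE_DIR` reuses the path that appears in the cells below and must be adjusted to your own Drive layout, and `missing_files` is a hypothetical helper introduced only for illustration.

```python
from pathlib import Path

# Assumption: same folder as in the cells below; change it to your own location.
BASE_DIR = Path("/content/drive/MyDrive/IIC3633-2024-2/Restaurante/yelp_dataset")

# The two Yelp files that this notebook reads.
EXPECTED_FILES = (
    "yelp_academic_dataset_business.json",
    "yelp_academic_dataset_review.json",
)


def missing_files(base_dir, names=EXPECTED_FILES):
    """Return the expected file names that are not present in base_dir."""
    base = Path(base_dir)
    return [name for name in names if not (base / name).is_file()]


missing = missing_files(BASE_DIR)
if missing:
    print(f"Missing from {BASE_DIR}: {missing}")
else:
    print("All expected files found.")
```

Run it after mounting Drive; if the list of missing files is not empty, fix the path before continuing.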

In [1]:
import pandas as pd
import json
from sklearn.model_selection import train_test_split
from google.colab import drive

# Mount Google Drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [13]:
# Path to the JSON file
dataset_path = '/content/drive/MyDrive/IIC3633-2024-2/Restaurante/yelp_dataset/yelp_academic_dataset_business.json'

# Read the file line by line: it stores one JSON object per line
business_data = []
with open(dataset_path, 'r', encoding='utf-8') as file:
    for line in file:
        business_data.append(json.loads(line))

# Convert to a DataFrame
business_df = pd.DataFrame(business_data)
business_df.head()

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,Pns2l4eNsfO8kk83dixA6A,"Abby Rappoport, LAC, CMQ","1616 Chapala St, Ste 2",Santa Barbara,CA,93101,34.426679,-119.711197,5.0,7,0,{'ByAppointmentOnly': 'True'},"Doctors, Traditional Chinese Medicine, Naturop...",
1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,MO,63123,38.551126,-90.335695,3.0,15,1,{'BusinessAcceptsCreditCards': 'True'},"Shipping Centers, Local Services, Notaries, Ma...","{'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ..."
2,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,AZ,85711,32.223236,-110.880452,3.5,22,0,"{'BikeParking': 'True', 'BusinessAcceptsCredit...","Department Stores, Shopping, Fashion, Home & G...","{'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ..."
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,PA,19107,39.955505,-75.155564,4.0,80,1,"{'RestaurantsDelivery': 'False', 'OutdoorSeati...","Restaurants, Food, Bubble Tea, Coffee & Tea, B...","{'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ..."
4,mWMc6_wTdE0EUBKIGXDVfA,Perkiomen Valley Brewery,101 Walnut St,Green Lane,PA,18054,40.338183,-75.471659,4.5,13,1,"{'BusinessAcceptsCreditCards': 'True', 'Wheelc...","Brewpubs, Breweries, Food","{'Wednesday': '14:0-22:0', 'Thursday': '16:0-2..."


In [14]:
# Keep only the relevant rows: businesses whose categories mention restaurants
restaurant_df = business_df[business_df['categories'].str.contains('Restaurant', na=False)]

# Split the restaurants: train_sample keeps 99% of the rows, test_sample the remaining 1%
train_sample, test_sample = train_test_split(restaurant_df, test_size=0.01, random_state=42)

# Save train_sample as the restaurant sample CSV
sample_path = '/content/drive/MyDrive/IIC3633-2024-2/Restaurante/yelp_dataset/restaurant_sample.csv'
train_sample.to_csv(sample_path, index=False)
print(f'Sample saved to: {sample_path}')

Sample saved to: /content/drive/MyDrive/IIC3633-2024-2/Restaurante/yelp_dataset/restaurant_sample.csv
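
The filter in this cell matches the substring `'Restaurant'`, so any category label containing that text is accepted. If a stricter rule is wanted, one option is to split the comma-separated `categories` string and require an exact label. This is a sketch of that alternative, not the notebook's method: `has_category` is a hypothetical helper, and the second example string is made up for illustration.

```python
def has_category(categories, wanted="Restaurants"):
    """Return True if `wanted` is one of the comma-separated category labels."""
    # Missing values arrive as None or NaN, neither of which is a string.
    if not isinstance(categories, str):
        return False
    labels = {label.strip() for label in categories.split(",")}
    return wanted in labels


examples = [
    "Restaurants, Food, Bubble Tea",
    "Restaurant Supplies, Shopping",  # made-up row: substring match, but no exact label
    None,
]
for text in examples:
    print(repr(text), "->", has_category(text))
```

With pandas, the same rule could be applied as `business_df[business_df['categories'].apply(has_category)]`. Whether the stricter rule is preferable depends on which businesses the sample is meant to include.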


Next, we keep only the reviews whose `business_id` appears in the restaurant sample and save them.

In [2]:
# Load the restaurant sample saved earlier
data_path = '/content/drive/MyDrive/IIC3633-2024-2/Restaurante/yelp_dataset/restaurant_sample.csv'

train_sample = pd.read_csv(data_path)

In [3]:
# Path to the reviews file
data_path = '/content/drive/MyDrive/IIC3633-2024-2/Restaurante/yelp_dataset/yelp_academic_dataset_review.json'

# List that collects the filtered reviews
filtered_reviews = []

# Business IDs to keep, stored in a set for fast membership checks
relevant_business_ids = set(train_sample['business_id'])

# Read the file line by line and keep only reviews of the sampled businesses
with open(data_path, 'r', encoding='utf-8') as file:
    for line in file:
        review = json.loads(line)
        if review['business_id'] in relevant_business_ids:
            filtered_reviews.append(review)

# Convert the filtered reviews to a DataFrame
review_data = pd.DataFrame(filtered_reviews)

# Show the first rows
print("Processed data:")
review_data.head()

Processed data:


Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date
0,KU_O5udG6zpxOg-VcAEodg,mh_-eMZ6K5RLWhZyISBhwA,XQfwVwDr-v0ZS3_CbbE5Xw,3.0,0,0,0,"If you decide to eat here, just be aware it is...",2018-07-07 22:09:11
1,saUsX_uimxRlCVr67Z4Jig,8g_iMtfSiwikVnbP2etR0A,YjUWPpI6HXG530lwP-fb2A,3.0,0,0,0,Family diner. Had the buffet. Eclectic assortm...,2014-02-05 20:30:30
2,AqPFMleE6RsU23_auESxiA,_7bHUi9Uuf5__HHc_Q8guQ,kxX2SOes4o-D3ZQBkiMRfA,5.0,1,0,1,"Wow! Yummy, different, delicious. Our favo...",2015-01-04 00:01:03
3,Sx8TMOWLNuJBWer-0pcmoA,bcjbaE6dDog4jkNY91ncLQ,e4Vwtrqf-wpJfwesgvdgxQ,4.0,1,0,1,Cute interior and owner (?) gave us tour of up...,2017-01-14 20:54:15
4,JrIxlS1TzJ-iCu79ul40cQ,eUta8W_HdHMXPzLBBZhL1A,04UD14gamNjLY0IDYVhHJg,1.0,1,2,1,I am a long term frequent customer of this est...,2015-09-23 23:10:31
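
The line-by-line filter above can also be wrapped in a reusable generator. The sketch below is one possible refactoring under stated assumptions: `filter_json_lines` and its `fields` parameter are hypothetical names, not part of the original notebook. Dropping unused fields such as `text` is optional and only makes sense if later steps do not need them.

```python
import json


def filter_json_lines(path, key, allowed, fields=None):
    """Yield records from a JSON Lines file whose `key` value is in `allowed`.

    If `fields` is given, each yielded record keeps only those fields.
    """
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:  # skip blank lines instead of failing on them
                continue
            record = json.loads(line)
            if record.get(key) in allowed:
                if fields is not None:
                    record = {name: record.get(name) for name in fields}
                yield record
```

A possible use, with the variables from the cell above: `pd.DataFrame(list(filter_json_lines(data_path, 'business_id', relevant_business_ids)))`.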


In [4]:
# Save the filtered reviews as CSV
sample_path = '/content/drive/MyDrive/IIC3633-2024-2/Restaurante/yelp_dataset/review_sample.csv'
review_data.to_csv(sample_path, index=False)
print(f'Sample saved to: {sample_path}')

Sample saved to: /content/drive/MyDrive/IIC3633-2024-2/Restaurante/yelp_dataset/review_sample.csv
