## 🛠️ **ETL (Extract, Transform, Load)**



In this notebook we transform the files: we change their data types and apply other cleaning steps.

We will also use the custom function `data_type_check`, imported from `data_utils.py`, to get a quick overview of each dataframe and inspect:
- Categorical variables
- Numerical variables
- Dataframe dimensions
- Nulls
- Data types
- Information about the missing or null values in each column
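
The source of `data_type_check` is not shown in this notebook, so the following is only a minimal sketch of how such a summary could be built. The name `data_type_check_sketch` and the return value are assumptions. The column labels are copied from the printed output further down, and the sketch covers only the per-column null and dtype table, not the categorical/numerical split.

```python
import pandas as pd


def data_type_check_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for data_type_check; not the project's real code."""
    total = max(len(df), 1)  # avoid dividing by zero on an empty dataframe
    nulls = df.isna().sum()
    pct_nulls = (nulls / total * 100).round(2)
    summary = pd.DataFrame(
        {
            "columna": df.columns,
            "%_no_nulos": (100 - pct_nulls).to_numpy(),
            "%_nulos": pct_nulls.to_numpy(),
            "total_nulos": nulls.to_numpy(),
            "tipo_dato": df.dtypes.astype(str).to_numpy(),
        }
    )
    print("Dimensiones: ", df.shape)
    print(summary)
    return summary


# Made-up demo data, only to exercise the sketch
demo = pd.DataFrame(
    {"city": ["Tampa", None, "Miami", "Tampa"], "stars": [4.0, 3.5, 5.0, 2.0]}
)
summary = data_type_check_sketch(demo)
```

Because the percentages are rounded to two decimals, a column can show `0.00` in `%_nulos` while `total_nulos` is still greater than zero, as happens with `state` in the `business` summary below.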


#### **Import the libraries we will use**


In [31]:

import json
import re
import warnings

import matplotlib.pyplot as plt
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import seaborn as sns

from data_utils import data_type_check

# Silence all warnings for the rest of the notebook
warnings.filterwarnings('ignore')

#### 📦 **Extraction** of the data and first exploration


In [19]:
#Yelp
business = pd.read_parquet('../0_Dataset/Data_Sucia/Yelp/business.parquet')
df_rev_FL = pd.read_parquet('../0_Dataset/Data_Sucia/Yelp/review_reducido.parquet')
df_checkin_FL = pd.read_parquet('../0_Dataset/Data_Sucia/Yelp/checkin_reducido.parquet')
df_tip_FL = pd.read_parquet('../0_Dataset/Data_Sucia/Yelp/tip.parquet')
df_user_FL = pd.read_parquet('../0_Dataset/Data_Sucia/Yelp/user_reducido.parquet')

#Google
df_G_review_FL = pd.read_parquet('../0_Dataset/Data_Sucia/Google/G_review_FL_reducido.parquet')
df_G_metadata_FL = pd.read_parquet('../0_Dataset/Data_Sucia/Google/G_metadata_FL_reducido.parquet')


___

## Yelp Dataset

### business

In [20]:
data_type_check(business)
business.sample(2)


 Resumen del dataframe:

Dimensiones:  (150243, 14)
         columna  %_no_nulos  %_nulos  total_nulos tipo_dato
0    business_id      100.00     0.00            0    object
1           name      100.00     0.00            0    object
2        address      100.00     0.00            0    object
3           city      100.00     0.00            0    object
4          state      100.00     0.00            3    object
5    postal_code      100.00     0.00            0    object
6       latitude      100.00     0.00            0   float64
7      longitude      100.00     0.00            0   float64
8          stars      100.00     0.00            0   float64
9   review_count      100.00     0.00            0     int64
10       is_open      100.00     0.00            0     int64
11    attributes       90.92     9.08        13642    object
12    categories      100.00     0.00            0    object
13         hours       84.61    15.39        23120    object


Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
89762,1nFyIGNUFQXiO6yikhQJzg,Eva Nails,"8110 S Houghton Rd, Ste 126",Tucson,FL,85747,32.104641,-110.774687,3.0,47,1,"{'AcceptsInsurance': None, 'AgesAllowed': None...","Beauty & Spas, Nail Salons, Hair Removal","{'Friday': '9:0-19:0', 'Monday': '9:0-19:0', '..."
133708,shbDs7N86yTsfZLh4Nie_A,Little Chicago,1524A Demonbreun St,Nashville,AZ,37203,36.153051,-86.790546,2.5,243,1,"{'AcceptsInsurance': None, 'AgesAllowed': None...","Restaurants, Burgers, Hot Dogs, Italian, Food ...","{'Friday': '10:0-4:0', 'Monday': '10:0-3:0', '..."


🔁 TRANSFORM

Convert the `attributes` and `hours` columns to strings so that nulls can be handled:

In [23]:
business['attributes'] = business['attributes'].apply(lambda x: json.dumps(x) if isinstance(x, dict) else x)
business['hours'] = business['hours'].apply(lambda x: json.dumps(x) if isinstance(x, dict) else x)
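
As a small illustration of the conversion above, the sketch below applies the same `json.dumps` pattern to a made-up two-row column (the `toy` dataframe is an assumption, not project data). Dict cells become JSON strings, while missing values pass through untouched so they can be filled afterwards.

```python
import json

import pandas as pd

# Toy rows (made up): one dict-valued cell and one missing value
toy = pd.DataFrame({"hours": [{"Monday": "9:0-19:0", "Friday": None}, None]})

# Same pattern as the notebook cell: serialise dicts, pass everything else through
toy["hours"] = toy["hours"].apply(
    lambda x: json.dumps(x) if isinstance(x, dict) else x
)

print(toy["hours"].tolist())
```

One practical reason for serialising is that dict cells are unhashable, which operations such as `drop_duplicates` generally cannot handle, whereas plain strings are fine.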


Fill null values in `attributes` and `hours`:

In [24]:
# Assign the result back instead of using inplace=True on a column selection
business['attributes'] = business['attributes'].fillna('No disponible')
business['hours'] = business['hours'].fillna('No disponible')


Drop rows with a null `state`:

In [25]:
business.dropna(subset=['state'], inplace=True)

Drop duplicate rows:

In [None]:
business.drop_duplicates(inplace=True)

Category standardization:

- The `categories` column contains multiple categories separated by commas. Splitting them into a list makes category-level analysis easier.

In [26]:
business['categories'] = business['categories'].str.split(', ')
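
To show what the split enables, here is a minimal sketch on made-up rows (the `toy` dataframe is an assumption). Once `categories` holds lists, `explode` gives one row per business/category pair, which can then be counted or grouped.

```python
import pandas as pd

# Made-up rows in the same shape as the business table
toy = pd.DataFrame(
    {
        "business_id": ["b1", "b2"],
        "categories": [
            "Beauty & Spas, Nail Salons",
            "Restaurants, Burgers, Nail Salons",
        ],
    }
)
toy["categories"] = toy["categories"].str.split(", ")

# One row per (business, category) pair, then count businesses per category
per_category = toy.explode("categories")
counts = per_category["categories"].value_counts()
print(counts)
```

Note that list cells are unhashable, so duplicate removal is best done before the split, which is the order this notebook follows.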

Extracting opening hours

In [27]:
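def parse_hours(hours):
    """Parse an hours string into a dict; return {} when it cannot be parsed."""
    try:
        # Accept single-quoted dict strings by normalising them to JSON quotes
        return json.loads(hours.replace("'", '"'))
    except (AttributeError, ValueError):
        # Non-string input, or text that is not valid JSON (e.g. 'No disponible')
        return {}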

business['parsed_hours'] = business['hours'].apply(parse_hours)
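
The `parsed_hours` dictionaries still hold each day's schedule as a string such as `'9:0-19:0'`. As a rough sketch of a possible follow-up step, the hypothetical helper `split_range` below (not part of the notebook) turns such a string into opening and closing `(hour, minute)` pairs, using made-up sample data.

```python
import pandas as pd


def split_range(time_range):
    """Hypothetical helper: 'H:M-H:M' -> ((h, m), (h, m)); None if missing."""
    if not time_range:
        return None
    start, end = time_range.split("-")
    return tuple(tuple(int(part) for part in t.split(":")) for t in (start, end))


# Made-up stand-in for business['parsed_hours']: one full dict, one empty dict
parsed_hours = pd.Series([{"Friday": "9:0-19:0", "Monday": "10:0-3:0"}, {}])

# Look up Monday for every business; businesses without hours yield None
monday = parsed_hours.apply(lambda d: split_range(d.get("Monday")))
print(monday.tolist())
```

A closing time earlier than the opening time, as in `'10:0-3:0'`, presumably means the business closes after midnight, so any duration calculation would need to account for that.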


In [29]:
data_type_check(business)
business.sample(1)


 Resumen del dataframe:

Dimensiones:  (150240, 15)
         columna  %_no_nulos  %_nulos  total_nulos tipo_dato
0    business_id       100.0      0.0            0    object
1           name       100.0      0.0            0    object
2        address       100.0      0.0            0    object
3           city       100.0      0.0            0    object
4          state       100.0      0.0            0    object
5    postal_code       100.0      0.0            0    object
6       latitude       100.0      0.0            0   float64
7      longitude       100.0      0.0            0   float64
8          stars       100.0      0.0            0   float64
9   review_count       100.0      0.0            0     int64
10       is_open       100.0      0.0            0     int64
11    attributes       100.0      0.0            0    object
12    categories       100.0      0.0            0    object
13         hours       100.0      0.0            0    object
14  parsed_hours       100.0    

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours,parsed_hours
5852,9UMCi0Lh-nAn9Bona7rQgg,Bristol & Taylor Garage,2429 Bristol Rd,Bensalem,TN,19020,40.13253,-74.92489,4.0,7,1,"{""AcceptsInsurance"": null, ""AgesAllowed"": null...","[Automotive, Education, Oil Change Stations, T...","{""Friday"": ""0:0-0:0"", ""Monday"": ""0:0-0:0"", ""Sa...","{'Friday': '0:0-0:0', 'Monday': '0:0-0:0', 'Sa..."


📤 LOAD

In [30]:
# Save as parquet
business.to_parquet("../0_Dataset/Data_Limpia/Yelp/business.parquet", engine="pyarrow")

### review

In [5]:
# Inspect the dataframe loaded from review_reducido.parquet
data_type_check(df_rev_FL)
df_rev_FL.sample(2)


 Resumen del dataframe:

Dimensiones:  (209708, 9)
       columna  %_no_nulos  %_nulos  total_nulos       tipo_dato
0    review_id       100.0      0.0            0          object
1      user_id       100.0      0.0            0          object
2  business_id       100.0      0.0            0          object
3        stars       100.0      0.0            0           int64
4       useful       100.0      0.0            0           int64
5        funny       100.0      0.0            0           int64
6         cool       100.0      0.0            0           int64
7         text       100.0      0.0            0          object
8         date       100.0      0.0            0  datetime64[ns]


Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date
24362,M4APWfNwyhI8Y9Gly-d-lA,CMRTvF6kjcUrcZssmnWAbw,2oh4NJDtqTUO5ZuvCe0hWw,3,0,0,0,I had such high hopes for this place. A much h...,2013-09-22 16:24:33
2645,fvzI2SI6o7s07opf_bt3Ig,GVODdkpncBOXGMbwqqduHw,dISs1oH_xeNAOOEcmJiGZQ,5,0,0,0,"While the Farmer's Market is a little weird, C...",2010-08-25 04:08:44


Drop duplicate rows

In [None]:
df_rev_FL.drop_duplicates(inplace=True)


Text normalization: for text analysis, it is useful to clean and normalize the strings.

In [32]:
# Function to clean the text:
# - converts the text to lowercase
# - replaces non-word characters (anything other than letters, digits and underscore) with spaces
# - collapses runs of whitespace into a single space
# - strips leading and trailing whitespace
def limpiar_texto(texto):
    texto = texto.lower()
    texto = re.sub(r"\W", " ", texto)
    texto = re.sub(r"\s+", " ", texto)
    return texto.strip()


# Apply limpiar_texto to the 'text' column of df_rev_FL
df_rev_FL["text"] = df_rev_FL["text"].apply(limpiar_texto)
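
To make the effect of `limpiar_texto` concrete, the sketch below repeats the function and runs it on two short strings. The first is taken from the sample output above; the second is made up to show edge cases.

```python
import re


def limpiar_texto(texto):
    texto = texto.lower()
    texto = re.sub(r"\W", " ", texto)   # non-word characters -> space
    texto = re.sub(r"\s+", " ", texto)  # collapse whitespace runs
    return texto.strip()


print(limpiar_texto("While the Farmer's Market is a little weird, C..."))
# -> while the farmer s market is a little weird c

print(limpiar_texto("  Café_Bar: 5/5!!  "))
# -> café_bar 5 5
```

Two details are worth keeping in mind: `\W` treats underscores and accented letters as word characters, so they survive, and apostrophes are replaced by spaces, which splits contractions and possessives (`farmer's` becomes `farmer s`). The function also expects a string, so a column with null text would need to be filled or filtered first.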

Possible next step: create a column here for sentiment analysis.
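
The notebook does not say how that column would be built. One simple option, shown here purely as an assumption, is to derive a coarse label from the existing `stars` rating. The thresholds and the column name `sentiment` are illustrative choices, and the rows are made up.

```python
import pandas as pd

# Made-up ratings in the same 1-5 range as the review table
toy = pd.DataFrame({"stars": [1, 3, 5, 4, 2]})

# Assumed thresholds: 1-2 negative, 3 neutral, 4-5 positive
toy["sentiment"] = pd.cut(
    toy["stars"],
    bins=[0, 2, 3, 5],
    labels=["negative", "neutral", "positive"],
)
print(toy)
```

A label like this can serve as a baseline or as a training target. A text-based sentiment score would instead be computed from the cleaned `text` column.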


📤 LOAD

In [35]:
data_type_check(df_rev_FL)
df_rev_FL.sample(2)


 Resumen del dataframe:

Dimensiones:  (209708, 9)
       columna  %_no_nulos  %_nulos  total_nulos       tipo_dato
0    review_id       100.0      0.0            0          object
1      user_id       100.0      0.0            0          object
2  business_id       100.0      0.0            0          object
3        stars       100.0      0.0            0           int64
4       useful       100.0      0.0            0           int64
5        funny       100.0      0.0            0           int64
6         cool       100.0      0.0            0           int64
7         text       100.0      0.0            0          object
8         date       100.0      0.0            0  datetime64[ns]


Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date
75836,Lk1xSh_nfvX0TgbPM3oSPg,F0sRh5b3AZ9V1fwZLdg-DA,v5hCB55uWA97qj_Ww8DHqw,5,1,0,1,delicious food i went to fashion mall location...,2018-04-23 19:35:56
56384,M4N3wpc2RzNpOJ2vioCljQ,jP-eUX6BCDab_OPvN1xqnA,XCJ8N1GrV9Gdgbyp1U05NQ,5,1,0,0,this place is awesome i have no idea why they ...,2017-12-02 01:56:55


In [6]:
# Save as parquet
df_rev_FL.to_parquet("../0_Dataset/Data_Limpia/Yelp/review_FL_reducido.parquet", engine="pyarrow")

### checkin

In [7]:
# Inspect the check-in dataframe
data_type_check(df_checkin_FL)
df_checkin_FL.sample(2)



 Resumen del dataframe:

Dimensiones:  (92351, 2)
       columna  %_no_nulos  %_nulos  total_nulos tipo_dato
0  business_id       100.0      0.0            0    object
1         date       100.0      0.0            0    object


Unnamed: 0,business_id,date
39012,HtEd7ZDKlR0nQZbc_QWn0A,"2016-01-01 07:06:28, 2016-06-03 23:23:34, 2016..."
32092,ERxAPTa09fE0tZwvfht26Q,2017-08-26 17:02:10


Drop duplicate rows

In [None]:
df_checkin_FL.drop_duplicates(inplace=True)


Data type transformation

In [36]:
df_checkin_FL['date'] = df_checkin_FL['date'].apply(lambda x: x.split(', ')).apply(lambda x: [pd.to_datetime(date) for date in x])

Explode the dates into one row per check-in

In [None]:
df_checkin_FL = df_checkin_FL.explode('date')
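
The two cells above parse every timestamp inside a Python list and then explode. A possible alternative, sketched here on made-up rows, is to split and explode the raw strings first and convert the whole column once at the end, which leaves it with a proper datetime dtype. The `toy` and `exploded` names are assumptions.

```python
import pandas as pd

# Made-up rows in the same shape as the check-in table
toy = pd.DataFrame(
    {
        "business_id": ["a", "b"],
        "date": [
            "2016-01-01 07:06:28, 2016-06-03 23:23:34",
            "2017-08-26 17:02:10",
        ],
    }
)

# Split the comma-separated string, then give each timestamp its own row
exploded = toy.assign(date=toy["date"].str.split(", ")).explode(
    "date", ignore_index=True
)

# A single vectorised conversion on the exploded column
exploded["date"] = pd.to_datetime(exploded["date"])
print(exploded.dtypes)
```

Converting after the explode avoids creating one `Timestamp` object per list element and means the `.dt` accessor is available straight away.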


Group check-ins by month/year

In [None]:
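# After explode the column has object dtype, so convert it before using .dt
df_checkin_FL['date'] = pd.to_datetime(df_checkin_FL['date'])
df_checkin_FL['month_year'] = df_checkin_FL['date'].dt.to_period('M')
checkins_by_month = df_checkin_FL.groupby(['business_id', 'month_year']).size().reset_index(name='checkins_count')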
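
As a quick check of the monthly aggregation, this sketch runs the same `to_period('M')` and `groupby` steps on four made-up check-ins (the `checkins` dataframe is an assumption) so the resulting counts are easy to verify by eye.

```python
import pandas as pd

# Made-up check-ins, already one row per timestamp
checkins = pd.DataFrame(
    {
        "business_id": ["a", "a", "a", "b"],
        "date": pd.to_datetime(
            [
                "2016-01-01 07:06:28",
                "2016-01-15 12:00:00",
                "2016-06-03 23:23:34",
                "2017-08-26 17:02:10",
            ]
        ),
    }
)

# Truncate each timestamp to its month, then count rows per business and month
checkins["month_year"] = checkins["date"].dt.to_period("M")
by_month = (
    checkins.groupby(["business_id", "month_year"])
    .size()
    .reset_index(name="checkins_count")
)
print(by_month)
```

Business `a` ends up with two check-ins in 2016-01 and one in 2016-06, and business `b` with one in 2017-08.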


📤 LOAD

In [None]:
# Save as parquet
df_checkin_FL.to_parquet("../0_Dataset/Data_Limpia/Yelp/checkin_FL_reducido.parquet", engine="pyarrow")


### tip

In [8]:
# Inspect the tip dataframe
data_type_check(df_tip_FL)
df_tip_FL.sample(2)



 Resumen del dataframe:

Dimensiones:  (908915, 5)
            columna  %_no_nulos  %_nulos  total_nulos       tipo_dato
0           user_id       100.0      0.0            0          object
1       business_id       100.0      0.0            0          object
2              text       100.0      0.0            0          object
3              date       100.0      0.0            0  datetime64[ns]
4  compliment_count       100.0      0.0            0           int64


Unnamed: 0,user_id,business_id,text,date,compliment_count
195890,CfX4sTIFFNaRchNswqhVfg,ObZu0-S1VAqAuCbAF5AFvw,I'm not impressed,2014-12-24 00:41:18,0
457543,5tXRxr4T24Awl7vjyCvIcQ,WXgV2lOUgas7DzTLeDau-w,"$2, $3, $4, & $5 drink specials 3:00pm - 6:30p...",2015-08-07 16:10:42,0


### user


In [9]:
# Inspect the user dataframe
data_type_check(df_user_FL)
df_user_FL.sample(2)



 Resumen del dataframe:

Dimensiones:  (63168, 22)
               columna  %_no_nulos  %_nulos  total_nulos tipo_dato
0              user_id       100.0      0.0            0    object
1                 name       100.0      0.0            0    object
2         review_count       100.0      0.0            0     int64
3        yelping_since       100.0      0.0            0    object
4               useful       100.0      0.0            0     int64
5                funny       100.0      0.0            0     int64
6                 cool       100.0      0.0            0     int64
7                elite       100.0      0.0            0    object
8              friends       100.0      0.0            0    object
9                 fans       100.0      0.0            0     int64
10       average_stars       100.0      0.0            0   float64
11      compliment_hot       100.0      0.0            0     int64
12     compliment_more       100.0      0.0            0     int64
13  compli

Unnamed: 0,user_id,name,review_count,yelping_since,useful,funny,cool,elite,friends,fans,...,compliment_more,compliment_profile,compliment_cute,compliment_list,compliment_note,compliment_plain,compliment_cool,compliment_funny,compliment_writer,compliment_photos
19542,iJS9bMpakeU9_DBn16mCcw,Vanessa,20,2016-01-14 00:15:27,2,0,3,,,0,...,0,0,0,0,0,0,0,0,0,0
39149,cZ3Bn4DDAMRAEnNWa82ZIQ,Tina,15,2011-03-30 18:46:37,20,1,6,,"dCA2DkZ5Q8FH_4XZxEjTQg, eADqTJT1iQIvAnicpsHicQ...",0,...,0,0,0,0,1,0,0,0,0,0


In [10]:
# Save the changes to file. Note: this cell writes df_checkin_FL again; df_user_FL is not saved here.
df_checkin_FL.to_parquet("../0_Dataset/Data_Limpia/Yelp/checkin_FL_reducido.parquet", engine="pyarrow")

___
## Google Dataset


### Florida Reviews

In [11]:
data_type_check(df_G_review_FL)
df_G_review_FL.sample(2)



 Resumen del dataframe:

Dimensiones:  (712500, 8)
   columna  %_no_nulos  %_nulos  total_nulos tipo_dato
0  user_id      100.00     0.00            0   float64
1     name      100.00     0.00            0    object
2     time      100.00     0.00            0     int64
3   rating      100.00     0.00            0     int64
4     text       62.09    37.91       270081    object
5     pics        3.66    96.34       686417    object
6     resp       16.02    83.98       598322    object
7  gmap_id      100.00     0.00            0    object


Unnamed: 0,user_id,name,time,rating,text,pics,resp,gmap_id
80158,1.056812e+20,Link up we The people are we alike,1549578391456,5,,,,0x88e69d4373d8144d:0xbd8688de068cf541
1179511,1.123179e+20,TheOfficialWolf,1537880160668,5,,,,0x88dd4d12ce28aad1:0x1fe8c9fdb6373808
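
The summary above reports `time` as `int64`. The sample values (for example `1549578391456`) look like Unix timestamps in milliseconds. Assuming that is what they are, a conversion along these lines would give a proper datetime column. The `toy` dataframe and the `datetime` column name are illustrative assumptions.

```python
import pandas as pd

# The two `time` values from the sample rows above
toy = pd.DataFrame({"time": [1549578391456, 1537880160668]})

# Interpret the integers as milliseconds since the Unix epoch (assumption)
toy["datetime"] = pd.to_datetime(toy["time"], unit="ms")
print(toy)
```

Under that assumption the first value falls on 2019-02-07, which is a plausible review date. The same summary shows `user_id` stored as `float64`, which may not preserve such long identifiers exactly, so reading it as a string could be worth considering.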


#### **📤 LOAD**

In [12]:
# Save the changes to file
df_G_review_FL.to_parquet("../0_Dataset/Data_Limpia/Google/G_review_FL_reducido.parquet", engine="pyarrow")

### Metadata-sitios

In [13]:
data_type_check(df_G_metadata_FL)
df_G_metadata_FL.sample(2)


 Resumen del dataframe:

Dimensiones:  (220001, 15)
             columna  %_no_nulos  %_nulos  total_nulos tipo_dato
0               name      100.00     0.00            5    object
1            address       97.15     2.85         6275    object
2            gmap_id      100.00     0.00            0    object
3        description        7.16    92.84       204255    object
4           latitude      100.00     0.00            0   float64
5          longitude      100.00     0.00            0   float64
6           category       99.35     0.65         1428    object
7         avg_rating      100.00     0.00            0   float64
8     num_of_reviews      100.00     0.00            0     int64
9              price        7.68    92.32       203095    object
10             hours       72.82    27.18        59802    object
11              MISC       75.58    24.42        53728    object
12             state       74.25    25.75        56652    object
13  relative_results       89.49    1

Unnamed: 0,name,address,gmap_id,description,latitude,longitude,category,avg_rating,num_of_reviews,price,hours,MISC,state,relative_results,url
135755,Willis & Associates,"Willis & Associates, 201 Penn Center Blvd #310...",0x8834ebe1eefa8bc7:0x8d542a5f62de011c,,40.428674,-79.811323,[Bankruptcy attorney],4.6,33,,"[[Saturday, 7AM–12PM], [Sunday, Closed], [Mond...",{'Accessibility': ['Wheelchair accessible entr...,Closed ⋅ Opens 7AM,"[0x8834f1509f85476d:0x87296006d454172, 0x8834f...",https://www.google.com/maps/place//data=!4m2!3...
1167687,Curves,"Curves, 3991 E Williamsburg Rd, Sandston, VA 2...",0x89b11c4d26e0ad29:0x8fe3c98c7abe5e80,,37.509408,-77.241078,"[Gym, Fitness center, Personal trainer, Physic...",4.7,3,,"[[Thursday, 8AM–12PM], [Friday, Closed], [Satu...",,Permanently closed,"[0x89b11a6b5dbe9483:0xb84f659d2655296d, 0x89b1...",https://www.google.com/maps/place//data=!4m2!3...


📤 LOAD

In [14]:
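# Save the changes to file. Note: this cell writes df_rev_FL (the Yelp reviews), not df_G_metadata_FL.
# The path also differs from the other cells (Yelp/Data_Limpia instead of Data_Limpia/Yelp).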
df_rev_FL.to_parquet("../0_Dataset/Yelp/Data_Limpia/review_FL_reducido.parquet", engine="pyarrow")
