## 🛠️ **ETL (Extract, Transform, Load)**



En este ipynb vamos a transformar los archivos, cambiando sus tipos de datos y demas. 

Tambien vamos a estar utilizando la funcion personalizada personalizada `data_type_check` invocada desde `data_utils.py` para dejar un vistazo raápido del dataframe y  poder observar:
- Variables categóricas
- Variables numéricas
- Dimensiones del dataframe
- Nulos
- Tipos de datos
- Informacion acerca de los datos faltantes o nulos de cada columna    


####  **Importamos las librerías que vamos a usar**


In [1]:

import warnings
import re
import pandas as pd
from data_utils import data_type_check

warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import pyarrow as pa
import pyarrow.parquet as pq
import seaborn as sns
import json

#### 📦 **Extraccion** de los datos y primera exploración 


In [2]:
#Yelp
business = pd.read_parquet('../0_Dataset/Data_Sucia/Yelp/business.parquet')
df_rev_FL = pd.read_parquet('../0_Dataset/Data_Sucia/Yelp/review_reducido.parquet')
df_checkin_FL = pd.read_parquet('../0_Dataset/Data_Sucia/Yelp/checkin_reducido.parquet')
df_tip_FL = pd.read_parquet('../0_Dataset/Data_Sucia/Yelp/tip.parquet')
df_user_FL = pd.read_parquet('../0_Dataset/Data_Sucia/Yelp/user_reducido.parquet')

#Google
df_G_review_FL = pd.read_parquet('../0_Dataset/Data_Sucia/Google/G_review_FL_reducido.parquet')
df_G_metadata_FL = pd.read_parquet('../0_Dataset/Data_Sucia/Google/G_metadata_FL_reducido.parquet')


___

## Dataset Yelp

### business

In [3]:
data_type_check(business)
business.sample(2)


 Resumen del dataframe:

Dimensiones:  (150240, 15)
         columna  %_no_nulos  %_nulos  total_nulos tipo_dato
0    business_id       100.0      0.0            0    object
1           name       100.0      0.0            0    object
2        address       100.0      0.0            0    object
3           city       100.0      0.0            0    object
4          state       100.0      0.0            0    object
5    postal_code       100.0      0.0            0    object
6       latitude       100.0      0.0            0   float64
7      longitude       100.0      0.0            0   float64
8          stars       100.0      0.0            0   float64
9   review_count       100.0      0.0            0     int64
10       is_open       100.0      0.0            0     int64
11    attributes       100.0      0.0            0    object
12    categories       100.0      0.0            0    object
13         hours       100.0      0.0            0    object
14  parsed_hours       100.0    

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours,parsed_hours
73095,ulOwGfwsixQlpXnaXONDgw,Walgreens,20 W Main St,Brownsburg,LA,46112,39.84388,-86.39834,2.5,9,1,"{""AcceptsInsurance"": null, ""AgesAllowed"": null...","[Food, Drugstores, Beauty & Spas, Cosmetics & ...","{""Friday"": ""7:0-23:0"", ""Monday"": ""7:0-23:0"", ""...","{'Friday': '7:0-23:0', 'Monday': '7:0-23:0', '..."
115021,dHYGLQkSxp8nFyzydmgeVw,Starbucks,6575 Central Ave,St. Petersburg,IN,33710,27.771167,-82.728097,3.0,45,1,"{""AcceptsInsurance"": null, ""AgesAllowed"": null...","[Food, Coffee & Tea]","{""Friday"": ""5:0-21:0"", ""Monday"": ""0:0-0:0"", ""S...","{'Friday': '5:0-21:0', 'Monday': '0:0-0:0', 'S..."


🔁 TRANSFORM

Convertir columnas attributes y hours a cadenas para manejar nulos:

In [4]:
business['attributes'] = business['attributes'].apply(lambda x: json.dumps(x) if isinstance(x, dict) else x)
business['hours'] = business['hours'].apply(lambda x: json.dumps(x) if isinstance(x, dict) else x)


Rellenar valores nulos en attributes y hours

In [5]:
business['attributes'].fillna('No disponible', inplace=True)
business['hours'].fillna('No disponible', inplace=True)


Eliminar nulos en state

In [6]:
business.dropna(subset=['state'], inplace=True)

Eliminar filas duplicadas:

In [7]:
#business.drop_duplicates(inplace=True)

Estandarización de Categorías:

- La columna categories contiene múltiples categorías separadas por comas. Es útil dividir estas categorías para análisis específicos

In [8]:
business['categories'] = business['categories'].str.split(', ')

Extracción de Horarios

In [9]:
def parse_hours(hours):
    try:
        return json.loads(hours.replace("'", '"'))
    except:
        return {}

business['parsed_hours'] = business['hours'].apply(parse_hours)


In [10]:
data_type_check(business)
business.sample(1)


 Resumen del dataframe:

Dimensiones:  (150240, 15)
         columna  %_no_nulos  %_nulos  total_nulos tipo_dato
0    business_id       100.0      0.0            0    object
1           name       100.0      0.0            0    object
2        address       100.0      0.0            0    object
3           city       100.0      0.0            0    object
4          state       100.0      0.0            0    object
5    postal_code       100.0      0.0            0    object
6       latitude       100.0      0.0            0   float64
7      longitude       100.0      0.0            0   float64
8          stars       100.0      0.0            0   float64
9   review_count       100.0      0.0            0     int64
10       is_open       100.0      0.0            0     int64
11    attributes       100.0      0.0            0    object
12    categories         0.0    100.0       150240   float64
13         hours       100.0      0.0            0    object
14  parsed_hours       100.0    

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours,parsed_hours
58468,gDvwpmD_Lqql3R81SeYGgQ,Designer Finds,2210 Crestmoor Rd,Nashville,FL,37215,36.10923,-86.814884,3.5,17,1,"{""AcceptsInsurance"": null, ""AgesAllowed"": null...",,"{""Friday"": ""10:0-17:0"", ""Monday"": ""10:0-17:0"",...","{'Friday': '10:0-17:0', 'Monday': '10:0-17:0',..."


📤 LOAD

In [11]:
#guardar en parquet
business.to_parquet("../0_Dataset/Data_Limpia/Yelp/business.parquet", engine="pyarrow")

### review

In [12]:
#abrir el parquet review_reducido
data_type_check(df_rev_FL)
df_rev_FL.sample(2)


 Resumen del dataframe:

Dimensiones:  (209708, 9)
       columna  %_no_nulos  %_nulos  total_nulos       tipo_dato
0    review_id       100.0      0.0            0          object
1      user_id       100.0      0.0            0          object
2  business_id       100.0      0.0            0          object
3        stars       100.0      0.0            0           int64
4       useful       100.0      0.0            0           int64
5        funny       100.0      0.0            0           int64
6         cool       100.0      0.0            0           int64
7         text       100.0      0.0            0          object
8         date       100.0      0.0            0  datetime64[ns]


Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date
192466,1TqRRXRmDWTwfm1dpNGbUw,Or4Y8elDmW2i6_cluqiUfg,Bq0CQcwk5R8yhm-MGfHxCA,4,0,0,0,"On a recent trip to Tucson, my husband and I f...",2014-05-18 16:35:36
142152,lV5mO279ezOSqWg9GeisZw,4MF8vS-qJGpCDFwWvZcNFQ,3XTQcobtEsU11sSN2Jn7Bw,1,1,0,0,Used Pay worst for parking 12/18-12/29 When I ...,2016-01-05 01:34:52


Eliminar filas duplicadas

In [13]:
df_rev_FL.drop_duplicates(inplace=True)


Normalización de Texto: Para análisis de texto, es útil limpiar y normalizar las cadenas

In [14]:
# Función para limpiar el texto
# Convierte el texto a minúsculas
# Remueve caracteres no alfanuméricos y los reemplaza por espacios
# Remueve espacios múltiples y los reemplaza por un solo espacio
# Elimina espacios al inicio y al final del texto
def limpiar_texto(texto):
    texto = texto.lower()
    texto = re.sub(r"\W", " ", texto)
    texto = re.sub(r"\s+", " ", texto)
    return texto.strip()


# Aplica la función limpiar_texto a la columna 'text' del DataFrame df_rev_FL
df_rev_FL["text"] = df_rev_FL["text"].apply(limpiar_texto)

posiblemente aca crear una columna para analisis de sentimiento


📤 LOAD

In [15]:
data_type_check(df_rev_FL)
df_rev_FL.sample(2)


 Resumen del dataframe:

Dimensiones:  (209708, 9)
       columna  %_no_nulos  %_nulos  total_nulos       tipo_dato
0    review_id       100.0      0.0            0          object
1      user_id       100.0      0.0            0          object
2  business_id       100.0      0.0            0          object
3        stars       100.0      0.0            0           int64
4       useful       100.0      0.0            0           int64
5        funny       100.0      0.0            0           int64
6         cool       100.0      0.0            0           int64
7         text       100.0      0.0            0          object
8         date       100.0      0.0            0  datetime64[ns]


Unnamed: 0,review_id,user_id,business_id,stars,useful,funny,cool,text,date
183500,RS5cfqwYVCzwQqAZ60mWcg,8oeyYq6fdwa9XlYaFgxapA,QBnteiO92wOvE1WCYjMC_w,5,0,0,0,wow we loved this place it was very clean the ...,2021-10-29 23:02:20
82339,96krRm_f5dYabNLjEzUsnA,L27OShvycGmVOkAxD1h86A,e_MDOxwYA6b8XThonDDI_g,4,8,0,2,i am a west side kind of girl so i really don ...,2011-11-29 22:32:02


In [16]:
#guardar en parquet
df_rev_FL.to_parquet("../0_Dataset/Data_Limpia/Yelp/review_FL_reducido.parquet", engine="pyarrow")

### checkin

In [17]:
#abrir el parquet checkin
data_type_check(df_checkin_FL)
df_checkin_FL.sample(2)



 Resumen del dataframe:

Dimensiones:  (92351, 2)
       columna  %_no_nulos  %_nulos  total_nulos tipo_dato
0  business_id       100.0      0.0            0    object
1         date       100.0      0.0            0    object


Unnamed: 0,business_id,date
12106,4ppN9-rsEyh-nkbDISeJcg,"2011-10-01 17:59:48, 2011-11-19 18:30:39, 2012..."
110743,pq7CAQGsxjaFcMLmhdbbvA,"2010-05-01 20:29:14, 2010-09-25 01:46:49, 2010..."


Eliminar filas duplicadas

In [18]:
df_checkin_FL.drop_duplicates(inplace=True)


Transformación de Tipos de Datos

In [19]:
#df_checkin_FL['date'] = df_checkin_FL['date'].apply(lambda x: x.split(', ')).apply(lambda x: [pd.to_datetime(date) for date in x])
#no funciona tarda mucho

Desglosar las fechas

In [20]:
df_checkin_FL = df_checkin_FL.explode('date')


Agrupar Checkins por Mes/Año

In [21]:
#df_checkin_FL['month_year'] = df_checkin_FL['date'].dt.to_period('M')
#checkins_by_month = df_checkin_FL.groupby(['business_id', 'month_year']).size().reset_index(name='checkins_count')


📤 LOAD

In [22]:
#guardar en parquet
df_checkin_FL.to_parquet("../0_Dataset/Data_Limpia/Yelp/checkin_reducido.parquet", engine="pyarrow")


### tip

In [23]:
#abrir el parquet tip
data_type_check(df_tip_FL)
df_tip_FL.sample(2)



 Resumen del dataframe:

Dimensiones:  (908915, 5)
            columna  %_no_nulos  %_nulos  total_nulos       tipo_dato
0           user_id       100.0      0.0            0          object
1       business_id       100.0      0.0            0          object
2              text       100.0      0.0            0          object
3              date       100.0      0.0            0  datetime64[ns]
4  compliment_count       100.0      0.0            0           int64


Unnamed: 0,user_id,business_id,text,date,compliment_count
17677,40zlq7PmkCNKFm71OZUlHQ,1FCxivPMoHC6xp7EpeHTVw,Go on monday for cheap $2.00 milkshakes,2012-08-26 00:03:28,0
593928,P6N1WMrAiWRoPHT9x5-v5Q,Gfegg1vwKJJo-EfEujQ-nw,Roast Beef Po'boy is a must!,2011-09-13 20:54:40,0


In [24]:
#Guardar tip como parquet
df_tip_FL.to_parquet("../0_Dataset/Data_Limpia/Yelp/tip.parquet", engine="pyarrow")

### user


In [25]:
#abrir el parquet user
data_type_check(df_user_FL)
df_user_FL.sample(2)



 Resumen del dataframe:

Dimensiones:  (63168, 22)
               columna  %_no_nulos  %_nulos  total_nulos tipo_dato
0              user_id       100.0      0.0            0    object
1                 name       100.0      0.0            0    object
2         review_count       100.0      0.0            0     int64
3        yelping_since       100.0      0.0            0    object
4               useful       100.0      0.0            0     int64
5                funny       100.0      0.0            0     int64
6                 cool       100.0      0.0            0     int64
7                elite       100.0      0.0            0    object
8              friends       100.0      0.0            0    object
9                 fans       100.0      0.0            0     int64
10       average_stars       100.0      0.0            0   float64
11      compliment_hot       100.0      0.0            0     int64
12     compliment_more       100.0      0.0            0     int64
13  compli

Unnamed: 0,user_id,name,review_count,yelping_since,useful,funny,cool,elite,friends,fans,...,compliment_more,compliment_profile,compliment_cute,compliment_list,compliment_note,compliment_plain,compliment_cool,compliment_funny,compliment_writer,compliment_photos
16963,R8ZZkuxMx041F9GnBYwFXg,Tara,27,2012-08-22 20:14:21,26,7,6,,"q8uX7lKoMfBTeFLSzXYxjg, WpNZwRXAldu565YK0WcE4g...",0,...,0,0,0,0,1,0,0,0,1,3
20738,D2LJoZvVuqejhcYeGz-SOQ,Hali'a,1,2020-03-26 22:48:28,0,0,0,,,0,...,0,0,0,0,0,0,0,0,0,0


In [26]:
#Guardar los cambios al archivo
df_user_FL.to_parquet("../0_Dataset/Data_Limpia/Yelp/user_reducido.parquet", engine="pyarrow")

___

___
## Dataset Gogle


### Reviews Florida

In [27]:
data_type_check(df_G_review_FL)
df_G_review_FL.sample(2)



 Resumen del dataframe:

Dimensiones:  (712500, 8)
   columna  %_no_nulos  %_nulos  total_nulos tipo_dato
0  user_id      100.00     0.00            0   float64
1     name      100.00     0.00            0    object
2     time      100.00     0.00            0     int64
3   rating      100.00     0.00            0     int64
4     text       62.09    37.91       270081    object
5     pics        3.66    96.34       686417    object
6     resp       16.02    83.98       598322    object
7  gmap_id      100.00     0.00            0    object


Unnamed: 0,user_id,name,time,rating,text,pics,resp,gmap_id
1410168,1.077266e+20,Rebecca Bowers,1541375423619,5,"Timely Togo order, and friendly prompt service...",,,0x88e77e241c851873:0x5a1332b614931bc5
552643,1.051432e+20,Cleve Baker,1567540584385,3,,,,0x88d9ca72eb2926cb:0x74e712c0719db790


#### **📤 LOAD**

In [28]:
#Guardar los cambios al archivo
df_G_review_FL.to_parquet("../0_Dataset/Data_Limpia/Google/G_review_FL_reducido.parquet", engine="pyarrow")

### Metadata-sitios

In [29]:
data_type_check(df_G_metadata_FL)
df_G_metadata_FL.sample(2)


 Resumen del dataframe:

Dimensiones:  (220001, 15)
             columna  %_no_nulos  %_nulos  total_nulos tipo_dato
0               name      100.00     0.00            5    object
1            address       97.15     2.85         6275    object
2            gmap_id      100.00     0.00            0    object
3        description        7.16    92.84       204255    object
4           latitude      100.00     0.00            0   float64
5          longitude      100.00     0.00            0   float64
6           category       99.35     0.65         1428    object
7         avg_rating      100.00     0.00            0   float64
8     num_of_reviews      100.00     0.00            0     int64
9              price        7.68    92.32       203095    object
10             hours       72.82    27.18        59802    object
11              MISC       75.58    24.42        53728    object
12             state       74.25    25.75        56652    object
13  relative_results       89.49    1

Unnamed: 0,name,address,gmap_id,description,latitude,longitude,category,avg_rating,num_of_reviews,price,hours,MISC,state,relative_results,url
81928,Betty Jean's BBQ,"Betty Jean's BBQ, Main BX Food Court, 101 W Sp...",0x549e1443082a4eb7:0x1957b6205f4bf16a,,47.638971,-117.65277,"[Barbecue restaurant, Caterer]",4.4,38,$,"[[Monday, Closed], [Tuesday, 11AM–4PM], [Wedne...",{'Accessibility': ['Wheelchair accessible entr...,Closed ⋅ Opens 11AM Tue,"[0x549e17baf581c371:0xf830df636362e17a, 0x549e...",https://www.google.com/maps/place//data=!4m2!3...
172694,Nufab Rebar Slidell LLC,"Nufab Rebar Slidell LLC, 250 Stone Rd, Slidell...",0x889de5d3b1302b0b:0x66cd01e927f1140e,,30.30117,-89.782166,"[Construction, Professional services]",3.2,5,,,{'Accessibility': ['Wheelchair accessible entr...,Open now,"[0x889de8a0827d62f9:0x8b9d05e7c29e57b1, 0x8627...",https://www.google.com/maps/place//data=!4m2!3...


📤 LOAD

In [30]:
#Guardar los cambios al archivo
df_rev_FL.to_parquet("../0_Dataset/Data_Limpia/Google/G_metadata_FL.parquet", engine="pyarrow")
