## 🛠️ **ETL (Extract, Transform, Load)**



In this notebook we transform the files, changing their data types and making other adjustments.

We also use the custom function `data_type_check`, imported from `data_utils.py`, to get a quick overview of each dataframe and inspect:
- Categorical variables
- Numerical variables
- Dataframe dimensions
- Nulls
- Data types
- Information about the missing or null values in each column
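The implementation of `data_type_check` lives in `data_utils.py` and is not shown in this notebook. As a rough illustration of the kind of per-column summary listed above, here is a minimal sketch. The name `summarize_dataframe` and the toy data are hypothetical; only the output column names are taken from the summaries printed later in this notebook.

```python
import pandas as pd


def summarize_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for data_type_check: one summary row per column."""
    total_nulls = df.isna().sum()
    pct_nulls = (df.isna().mean() * 100).round(2)
    summary = pd.DataFrame({
        "columna": df.columns,
        "%_no_nulos": (100 - pct_nulls).round(2).to_numpy(),
        "%_nulos": pct_nulls.to_numpy(),
        "total_nulos": total_nulls.to_numpy(),
        "tipo_dato": df.dtypes.astype(str).to_numpy(),
    })
    print("Dimensions:", df.shape)
    return summary


# Toy data (not from the Yelp files): one missing value in "state".
demo = pd.DataFrame({"state": ["FL", None, "NJ", "FL"], "stars": [3.5, 4.0, 2.5, 5.0]})
report = summarize_dataframe(demo)
print(report)
```

Because percentages are rounded to two decimals, a column with only a handful of nulls in a large dataframe can show `0.00` in `%_nulos` while `total_nulos` is non-zero.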


#### **Import the libraries we will use**


In [1]:
import warnings

import pandas as pd
from data_utils import data_type_check

warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import pyarrow as pa
import pyarrow.parquet as pq
import seaborn as sns

___

## Dataset Yelp

### business

In [2]:
business = pd.read_parquet('../0_Dataset/Yelp/business.parquet')
data_type_check(business)
business.sample(2)


 Resumen del dataframe:

Dimensiones:  (150243, 14)
         columna  %_no_nulos  %_nulos  total_nulos tipo_dato
0    business_id      100.00     0.00            0    object
1           name      100.00     0.00            0    object
2        address      100.00     0.00            0    object
3           city      100.00     0.00            0    object
4          state      100.00     0.00            3    object
5    postal_code      100.00     0.00            0    object
6       latitude      100.00     0.00            0   float64
7      longitude      100.00     0.00            0   float64
8          stars      100.00     0.00            0   float64
9   review_count      100.00     0.00            0   float64
10       is_open      100.00     0.00            0   float64
11    attributes       90.92     9.08        13642    object
12    categories      100.00     0.00            0    object
13         hours       84.61    15.39        23120    object


| | business_id | name | address | city | state | postal_code | latitude | longitude | stars | review_count | is_open | attributes | categories | hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15405.0 | sT5AGaXjIp0SNLPphyAbOQ | What's the Scoop | 550 S Oak Ave | Primos | NJ | 19018 | 39.919949 | -75.298495 | 3.5 | 6.0 | 0.0 | {'AcceptsInsurance': None, 'AgesAllowed': None... | Food, Ice Cream & Frozen Yogurt | |
| 5995.0 | IW_uYAP59YwLVurBUfhs4A | Ratchada Thai Restaurant and Sushi Bar | 270 1st Avenue North | St. Petersburg | FL | 33701 | 27.771826 | -82.636816 | 2.5 | 22.0 | 0.0 | {'AcceptsInsurance': None, 'AgesAllowed': None... | Restaurants, Sushi Bars, Thai | |


There are duplicate columns, so we remove them.


<class 'pandas.core.frame.DataFrame'>
Index: 150243 entries, 0.0 to 150345.0
Data columns (total 14 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   business_id   150243 non-null  object 
 1   name          150243 non-null  object 
 2   address       150243 non-null  object 
 3   city          150243 non-null  object 
 4   state         150240 non-null  object 
 5   postal_code   150243 non-null  object 
 6   latitude      150243 non-null  float64
 7   longitude     150243 non-null  float64
 8   stars         150243 non-null  float64
 9   review_count  150243 non-null  float64
 10  is_open       150243 non-null  float64
 11  attributes    136601 non-null  object 
 12  categories    150243 non-null  object 
 13  hours         127123 non-null  object 
dtypes: float64(5), object(9)
memory usage: 17.2+ MB
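The cell that removed the duplicate columns mentioned above is not included in this export. One common way to drop repeated column labels in pandas is sketched below; the toy frame is hypothetical and this is not necessarily the approach the notebook used.

```python
import pandas as pd

# Toy frame with a repeated column label (hypothetical data).
raw = pd.DataFrame(
    [["a1", "FL", "FL"], ["b2", "NJ", "NJ"]],
    columns=["business_id", "state", "state"],
)

# Index.duplicated() flags every repeat of a label after its first occurrence,
# so negating the mask keeps only the first copy of each column.
deduped = raw.loc[:, ~raw.columns.duplicated()]
print(list(deduped.columns))
```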


#### 🔁 **TRANSFORM**

Since different businesses can share the same `categories` value, we do not drop the 67083 duplicates.
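To illustrate why repeated `categories` values are not a reason to drop rows, the sketch below counts duplicates per column on hypothetical toy data. It assumes the duplicate count was obtained with a column-level check like this one, which the notebook does not show.

```python
import pandas as pd

# Toy data: two distinct businesses share the same categories string.
biz_toy = pd.DataFrame({
    "business_id": ["a1", "b2", "c3"],
    "categories": ["Food, Ice Cream", "Food, Ice Cream", "Restaurants, Thai"],
})

# Rows whose categories value already appeared earlier in the frame.
repeated_categories = int(biz_toy.duplicated(subset=["categories"]).sum())
# Rows whose business_id already appeared: these would be true duplicates.
repeated_businesses = int(biz_toy.duplicated(subset=["business_id"]).sum())

print(repeated_categories, repeated_businesses)
```

A non-zero count on `categories` with a zero count on `business_id` means every row is still a distinct business.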


In [5]:
data_type_check(business)


 Resumen del dataframe:



Dimensiones:  (150243, 14)
         columna  %_no_nulos  %_nulos  total_nulos tipo_dato
0    business_id      100.00     0.00            0    object
1           name      100.00     0.00            0    object
2        address      100.00     0.00            0    object
3           city      100.00     0.00            0    object
4          state      100.00     0.00            3    object
5    postal_code      100.00     0.00            0    object
6       latitude      100.00     0.00            0   float64
7      longitude      100.00     0.00            0   float64
8          stars      100.00     0.00            0   float64
9   review_count      100.00     0.00            0   float64
10       is_open      100.00     0.00            0   float64
11    attributes       90.92     9.08        13642    object
12    categories      100.00     0.00            0    object
13         hours       84.61    15.39        23120    object
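The summary above still reports `review_count` and `is_open` as `float64` with no nulls. If integer types are wanted for these columns, a cast along the following lines could be used. This is a sketch on toy data, and the target dtypes are an assumption rather than a choice made in the notebook.

```python
import pandas as pd

# Toy rows mirroring the float-typed columns seen in the summary.
dtype_toy = pd.DataFrame({
    "review_count": [6.0, 22.0],
    "is_open": [0.0, 1.0],
    "stars": [3.5, 2.5],
})

# astype with a mapping converts only the listed columns.
# Plain integer dtypes require that the columns contain no nulls.
converted = dtype_toy.astype({"review_count": "int64", "is_open": "int8"})
print(converted.dtypes)
```

If a column could contain missing values, the nullable `"Int64"` dtype is the usual alternative, since plain `int64` cannot hold NaN.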


#### **📤 LOAD**

In [6]:
# Save to parquet
business.to_parquet("../0_Dataset/Yelp/business.parquet", engine="pyarrow")

### review

In [7]:
# Open the review_reducido parquet file
df_rev_FL = pd.read_parquet('../0_Dataset/Yelp/review_reducido.parquet')
data_type_check(df_rev_FL)
df_rev_FL.sample(2)


 Resumen del dataframe:

Dimensiones:  (209708, 9)
       columna  %_no_nulos  %_nulos  total_nulos       tipo_dato
0    review_id       100.0      0.0            0          object
1      user_id       100.0      0.0            0          object
2  business_id       100.0      0.0            0          object
3        stars       100.0      0.0            0           int64
4       useful       100.0      0.0            0           int64
5        funny       100.0      0.0            0           int64
6         cool       100.0      0.0            0           int64
7         text       100.0      0.0            0          object
8         date       100.0      0.0            0  datetime64[ns]


| | review_id | user_id | business_id | stars | useful | funny | cool | text | date |
|---|---|---|---|---|---|---|---|---|---|
| 138880 | zDlQsg8fIAsDWKCzjHT_zg | ev6jFu0ecPL1g85uO9Qc6w | kXXjd8WebA6u9QaAi9OdfA | 5 | 0 | 0 | 0 | This place never fails..food/staff is always p... | 2019-06-09 15:45:22 |
| 153516 | S0ABI1FHvWOr1M11YbaziA | c9ZuyshW5arT17cH1Egr7A | rt5c08hpGnZ3DCI1C_LxCQ | 5 | 2 | 0 | 0 | Great place all kinds of stuff. They have an a... | 2017-06-22 01:22:53 |



#### **📤 LOAD**

In [None]:
# Save to parquet
df_rev_FL.to_parquet("../0_Dataset/Yelp/review_FL_reducido.parquet", engine="pyarrow")


### checkin

In [8]:
# Open the checkin parquet file
df_checkin_FL = pd.read_parquet('../0_Dataset/Yelp/checkin_reducido.parquet')
data_type_check(df_checkin_FL)
df_checkin_FL.sample(2)



 Resumen del dataframe:

Dimensiones:  (92351, 2)
       columna  %_no_nulos  %_nulos  total_nulos tipo_dato
0  business_id       100.0      0.0            0    object
1         date       100.0      0.0            0    object


| | business_id | date |
|---|---|---|
| 77267 | _V6hl1oGkTV2KbGeax_HPA | 2010-02-13 01:01:05, 2010-03-06 01:37:04, 2010... |
| 115580 | sAqpTpSWYi6njHpbQCR08A | 2012-03-06 14:40:02, 2012-03-07 13:42:55, 2012... |
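In the sample above, `date` is an `object` column that holds a comma-separated string of timestamps per business. If one row per check-in is needed, the string can be split and exploded before parsing. The sketch below uses hypothetical toy rows and assumes the `", "` separator and timestamp format visible in the sample.

```python
import pandas as pd

# Toy rows shaped like the checkin sample: many timestamps in one string.
checkin_toy = pd.DataFrame({
    "business_id": ["biz_a", "biz_b"],
    "date": ["2010-02-13 01:01:05, 2010-03-06 01:37:04", "2012-03-06 14:40:02"],
})

# Split each string into a list, then emit one row per list element.
checkin_long = (
    checkin_toy.assign(date=checkin_toy["date"].str.split(", "))
    .explode("date", ignore_index=True)
)

# Parse the individual strings into proper datetimes.
checkin_long["date"] = pd.to_datetime(checkin_long["date"], format="%Y-%m-%d %H:%M:%S")
print(checkin_long)
```

The long format multiplies the row count, so it may be worth applying only to the subset of businesses under analysis.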


### tip

In [10]:
# Open the tip parquet file
df_tip_FL = pd.read_parquet('../0_Dataset/Yelp/tip.parquet')
data_type_check(df_tip_FL)
df_tip_FL.sample(2)



 Resumen del dataframe:

Dimensiones:  (908915, 5)
            columna  %_no_nulos  %_nulos  total_nulos       tipo_dato
0           user_id       100.0      0.0            0          object
1       business_id       100.0      0.0            0          object
2              text       100.0      0.0            0          object
3              date       100.0      0.0            0  datetime64[ns]
4  compliment_count       100.0      0.0            0           int64


| | user_id | business_id | text | date | compliment_count |
|---|---|---|---|---|---|
| 640645 | H4plkcLEFUUnpwlsbjkIfQ | Pm8R9r7qBJ1XIHGEDUWpqA | Excellent white veggie pizza | 2018-05-12 18:18:37 | 0 |
| 394275 | HKI5IfOrMYMl46DWMvWT3w | UYVsGmMkFJq8hOBOUU8ZHA | Sometimes it's good and rinses all the Soap an... | 2011-06-14 21:51:40 | 0 |


### user


In [11]:
# Open the user parquet file
df_user_FL = pd.read_parquet('../0_Dataset/Yelp/user_reducido.parquet')
data_type_check(df_user_FL)
df_user_FL.sample(2)



 Resumen del dataframe:

Dimensiones:  (63168, 22)
               columna  %_no_nulos  %_nulos  total_nulos tipo_dato
0              user_id       100.0      0.0            0    object
1                 name       100.0      0.0            0    object
2         review_count       100.0      0.0            0     int64
3        yelping_since       100.0      0.0            0    object
4               useful       100.0      0.0            0     int64
5                funny       100.0      0.0            0     int64
6                 cool       100.0      0.0            0     int64
7                elite       100.0      0.0            0    object
8              friends       100.0      0.0            0    object
9                 fans       100.0      0.0            0     int64
10       average_stars       100.0      0.0            0   float64
11      compliment_hot       100.0      0.0            0     int64
12     compliment_more       100.0      0.0            0     int64
13  compli

| | user_id | name | review_count | yelping_since | useful | funny | cool | elite | friends | fans | ... | compliment_more | compliment_profile | compliment_cute | compliment_list | compliment_note | compliment_plain | compliment_cool | compliment_funny | compliment_writer | compliment_photos |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8863 | xP8rQT9dksFkhQnf-ScJmQ | Samuel | 13 | 2014-05-12 02:41:01 | 16 | 3 | 1 | | zO9APJ9Csbwn26k4LSA3bg, uyMEiUx6ZDvYCS4yWZeohA | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5877 | CecFUKt7wumI4YcVdEfbJA | Kirk | 2 | 2014-04-22 20:51:26 | 0 | 0 | 0 | | | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
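The summary lists `yelping_since` as `object`, and `friends` holds a comma-separated string that appears empty for users with no friends listed. A possible cleanup is sketched below on hypothetical toy rows. The `friend_count` column is an illustrative addition, and the empty-string convention is an assumption based on the sample.

```python
import pandas as pd

# Toy rows shaped like the user sample.
user_toy = pd.DataFrame({
    "user_id": ["u1", "u2"],
    "yelping_since": ["2014-05-12 02:41:01", "2014-04-22 20:51:26"],
    "friends": ["f1, f2", ""],
})

# Parse the signup timestamp so date arithmetic and filtering work.
user_toy["yelping_since"] = pd.to_datetime(
    user_toy["yelping_since"], format="%Y-%m-%d %H:%M:%S"
)

# Count listed friends; an empty string is treated as zero friends.
user_toy["friend_count"] = user_toy["friends"].apply(
    lambda s: len(s.split(", ")) if s else 0
)
print(user_toy[["user_id", "yelping_since", "friend_count"]])
```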


In [12]:
# Save the changes to the file
df_checkin_FL.to_parquet("../0_Dataset/Yelp/checkin_FL_reducido.parquet", engine="pyarrow")

___
## Dataset Google


#### **📂 Processing the first file: `Google Maps/metadata-sitios/review-Florida-`**

### Reviews Florida

In [13]:
# Open the G_review parquet file
df_G_review_FL = pd.read_parquet('../0_Dataset/Google/G_review_reducido.parquet')
data_type_check(df_G_review_FL)
df_G_review_FL.sample(2)


FileNotFoundError: [Errno 2] No such file or directory: '../0_Dataset/Google/G_review_reducido.parquet'
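The `FileNotFoundError` above means the relative path did not resolve from the notebook's working directory. One way to make such a failure easier to diagnose is to check the path before reading. The helper below is a hypothetical sketch, not part of `data_utils.py`.

```python
from pathlib import Path
from typing import Optional

import pandas as pd


def read_parquet_if_exists(path: str) -> Optional[pd.DataFrame]:
    """Hypothetical helper: return the dataframe, or None if the file is missing."""
    parquet_path = Path(path)
    if not parquet_path.is_file():
        # Printing the resolved path shows where the notebook actually looked.
        print(f"File not found: {parquet_path.resolve()}")
        return None
    return pd.read_parquet(parquet_path)


# Deliberately missing path, used only to demonstrate the guard.
df_demo = read_parquet_if_exists("no_such_dir/missing_file.parquet")
```

Printing the resolved absolute path usually reveals whether the file name is wrong or the notebook was started from a different directory.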

#### **📤 LOAD**

In [None]:
# Save the changes to the file
df_G_review_FL.to_parquet("../0_Dataset/Google/G_review_FL_reducido.parquet", engine="pyarrow")

### Metadata-sitios

In [None]:
# Open the G_metadata_FL_reducido.parquet file
df_G_metadata_FL = pd.read_parquet('../0_Dataset/Google/G_metadata_FL_reducido.parquet')
data_type_check(df_G_metadata_FL)
df_G_metadata_FL.sample(2)

#### **📤 LOAD**

In [None]:
# Save the changes to the file (assumption: written back to the path it was loaded from)
df_G_metadata_FL.to_parquet("../0_Dataset/Google/G_metadata_FL_reducido.parquet", engine="pyarrow")
