#### **Draft: first ETL steps for the Cost of Living dataset**

1. Extract the raw data from the Cost of Living CSV.
2. Quick exploratory analysis of the variables: null counts per variable and general summary statistics.
3. Build a variable mapping: we add a CSV with new, descriptive variable names and the cost type of each variable.
4. Add that mapping CSV to the /src/data/ and /src/notebooks/data folders, and to the repository README.
5. Rename the variable columns to get business-oriented naming that helps the later EDA.
6. Null analysis per variable.
7. Clean the data so the dataset is fully prepared for the later analysis.
8. Feature engineering to obtain variables that allow a better-weighted analysis. Some variables, taken in isolation, do not have a business narrative.

##### **1. Raw data extraction from Dataset 2: [Cost of Living](https://www.kaggle.com/datasets/mvieira101/global-cost-of-living/code)**

We duplicate the datasets folder inside /src/notebooks/ to make the data easier to load.

In [1]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns

pd.options.mode.copy_on_write = True  # Copy-on-Write becomes the default behavior in pandas 3.0

In [7]:
# Load the dataset and inspect some of the data it returns
# A first decision here is whether to work with the agnostic variable names or replace them with descriptive names
# In principle it makes sense to switch to descriptive names if we want an exploratory analysis aimed at real business problems

df_cost = pd.read_csv("./data/cost-of-living.csv")
df_cost

Unnamed: 0,city,country,x1,x2,x3,x4,x5,x6,x7,x8,...,x47,x48,x49,x50,x51,x52,x53,x54,x55,data_quality
0,Seoul,South Korea,7.68,53.78,6.15,3.07,4.99,3.93,1.48,0.79,...,110.36,742.54,557.52,2669.12,1731.08,22067.70,10971.90,2689.62,3.47,1
1,Shanghai,China,5.69,39.86,5.69,1.14,4.27,3.98,0.53,0.33,...,123.51,1091.93,569.88,2952.70,1561.59,17746.11,9416.35,1419.87,5.03,1
2,Guangzhou,China,4.13,28.47,4.98,0.85,1.71,3.54,0.44,0.33,...,43.89,533.28,317.45,1242.24,688.05,12892.82,5427.45,1211.68,5.19,1
3,Mumbai,India,3.68,18.42,3.68,2.46,4.30,2.48,0.48,0.19,...,41.17,522.40,294.05,1411.12,699.80,6092.45,2777.51,640.81,7.96,1
4,Delhi,India,4.91,22.11,4.30,1.84,3.68,1.77,0.49,0.19,...,36.50,229.84,135.31,601.02,329.15,2506.73,1036.74,586.46,8.06,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4951,Peterborough,Australia,,,,,,,,,...,,,,,,,,,,0
4952,Georgetown,Australia,,,,,,,,,...,,,,,,,,,,0
4953,Ixtapa Zihuatanejo,Mexico,5.16,30.94,12.89,0.98,,1.80,0.62,0.41,...,103.14,412.55,257.84,515.69,412.55,,,,,0
4954,Iqaluit,Canada,29.65,74.27,13.71,6.67,8.89,3.71,3.52,4.08,...,,,,2964.60,2964.60,,,,6.53,0


##### **2. Exploratory analysis of the variables in Dataset 2: [Cost of Living](https://www.kaggle.com/datasets/mvieira101/global-cost-of-living/code)**

We look at some general information about the dataset: structure, nulls, and summary statistics.

In [None]:
# Display the first 10 rows of the dataset to get a sense of how the variables (columns) relate

df_cost.head(10)

Unnamed: 0,city,country,x1,x2,x3,x4,x5,x6,x7,x8,...,x47,x48,x49,x50,x51,x52,x53,x54,x55,data_quality
0,Seoul,South Korea,7.68,53.78,6.15,3.07,4.99,3.93,1.48,0.79,...,110.36,742.54,557.52,2669.12,1731.08,22067.7,10971.9,2689.62,3.47,1
1,Shanghai,China,5.69,39.86,5.69,1.14,4.27,3.98,0.53,0.33,...,123.51,1091.93,569.88,2952.7,1561.59,17746.11,9416.35,1419.87,5.03,1
2,Guangzhou,China,4.13,28.47,4.98,0.85,1.71,3.54,0.44,0.33,...,43.89,533.28,317.45,1242.24,688.05,12892.82,5427.45,1211.68,5.19,1
3,Mumbai,India,3.68,18.42,3.68,2.46,4.3,2.48,0.48,0.19,...,41.17,522.4,294.05,1411.12,699.8,6092.45,2777.51,640.81,7.96,1
4,Delhi,India,4.91,22.11,4.3,1.84,3.68,1.77,0.49,0.19,...,36.5,229.84,135.31,601.02,329.15,2506.73,1036.74,586.46,8.06,1
5,Dhaka,Bangladesh,1.95,11.71,4.88,5.85,5.12,1.95,0.29,0.16,...,41.53,142.09,87.79,347.57,208.5,1119.98,571.72,280.73,9.26,1
6,Osaka,Japan,7.45,48.39,5.36,3.35,3.72,3.28,1.09,0.81,...,132.61,674.96,376.14,1737.21,993.17,8043.38,4825.58,2322.46,1.49,1
7,Jakarta,Indonesia,2.59,22.69,3.57,2.06,3.24,2.23,0.61,0.27,...,79.85,505.59,277.43,1172.14,615.04,2632.8,1241.09,509.12,9.05,1
8,Shenzhen,China,4.27,28.47,4.98,1.14,3.99,4.2,0.47,0.34,...,106.77,738.75,435.07,1682.3,886.16,17898.73,8091.57,1572.22,4.99,1
9,Kinshasa,Congo,15.11,42.63,10.08,1.74,2.5,4.35,2.78,0.84,...,93.33,2000.0,725.0,4500.0,1160.0,6170.63,933.33,400.0,19.33,0


In [3]:
df_cost.info() 

# Here we see that some variables contain nulls
# The city, country and data_quality columns have no missing values --> total: 4956 non-null


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4956 entries, 0 to 4955
Data columns (total 58 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   city          4956 non-null   object 
 1   country       4956 non-null   object 
 2   x1            4528 non-null   float64
 3   x2            4505 non-null   float64
 4   x3            4622 non-null   float64
 5   x4            4460 non-null   float64
 6   x5            4516 non-null   float64
 7   x6            4612 non-null   float64
 8   x7            4511 non-null   float64
 9   x8            4640 non-null   float64
 10  x9            4578 non-null   float64
 11  x10           4543 non-null   float64
 12  x11           4613 non-null   float64
 13  x12           4449 non-null   float64
 14  x13           4478 non-null   float64
 15  x14           4398 non-null   float64
 16  x15           4390 non-null   float64
 17  x16           4584 non-null   float64
 18  x17           4575 non-null 
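
The non-null counts above can be turned into an explicit per-variable null report. The following is a minimal sketch, not the notebook's actual code: `null_summary` is a hypothetical helper and `toy` is a small made-up frame standing in for `df_cost`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_cost (hypothetical values, not taken from the real CSV)
toy = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "x1": [1.0, np.nan, 3.0, 4.0],
    "x2": [np.nan, np.nan, 2.5, 1.0],
})

def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return null count and null percentage per column, most incomplete first."""
    n_null = df.isna().sum()
    summary = pd.DataFrame({
        "n_null": n_null,
        "pct_null": (n_null / len(df) * 100).round(2),
    })
    return summary.sort_values("n_null", ascending=False)

nulls = null_summary(toy)
print(nulls)
```

On the real data, `null_summary(df_cost)` would give the same kind of table for all 58 columns, which could help decide which variables to drop or impute during cleaning.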

In [4]:
df_cost.describe()  # General summary statistics for the dataset

Unnamed: 0,x1,x2,x3,x4,x5,x6,x7,x8,x9,x10,...,x47,x48,x49,x50,x51,x52,x53,x54,x55,data_quality
count,4528.0,4505.0,4622.0,4460.0,4516.0,4612.0,4511.0,4640.0,4578.0,4543.0,...,4403.0,3593.0,3525.0,3476.0,3444.0,2729.0,2653.0,3524.0,3950.0,4956.0
mean,10.346705,43.357811,6.929697,3.379774,3.942465,2.754952,1.523873,1.119804,1.148554,1.612855,...,85.976357,711.574943,560.159957,1253.982914,974.082854,3235.270854,2341.933102,1821.186305,6.51803,0.186239
std,7.157058,25.528312,2.704435,2.105957,2.100146,1.358821,0.907025,0.755257,0.517186,1.089164,...,40.434524,648.869301,527.492342,1219.395297,921.331271,6258.792897,3570.623878,1631.46558,5.413828,0.389339
min,0.45,3.25,1.08,0.33,0.31,0.22,0.11,0.07,0.26,0.05,...,7.6,21.7,12.2,61.51,27.12,111.01,49.24,18.0,0.78,0.0
25%,4.31,22.44,4.98,1.6,2.2,1.58,0.77,0.48,0.83,0.78,...,58.43,256.24,171.05,465.2,325.71,1120.11,790.31,478.98,3.0,0.0
50%,9.6,40.0,6.99,3.0,3.69,2.63,1.48,1.05,1.045,1.32,...,84.26,526.87,403.94,921.075,712.83,2107.49,1580.62,1290.84,5.2,0.0
75%,15.0,60.0,8.5,5.0,5.27,3.77,2.11,1.58,1.32,2.2,...,106.77,992.61,790.31,1710.225,1369.87,3843.38,2798.25,2770.855,8.33,0.0
max,57.14,213.69,22.13,20.6,17.5,10.0,8.0,5.85,6.81,8.82,...,542.74,12608.83,8989.37,27397.38,17868.18,240963.67,80321.22,12821.4,61.33,1.0
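
The `data_quality` column has a mean of about 0.186 with values between 0 and 1, so roughly 19% of the rows carry the flag 1. One possible follow-up is to check how the flag relates to missing values. The sketch below assumes `data_quality` is a 0/1 flag (its exact meaning is not stated in this notebook) and uses a four-row toy frame built from values shown in the outputs above.

```python
import numpy as np
import pandas as pd

# Four rows copied from the outputs above (Seoul, Shanghai, Peterborough, Iqaluit),
# restricted to three value columns for brevity
toy = pd.DataFrame({
    "x1": [7.68, 5.69, np.nan, 29.65],
    "x2": [53.78, 39.86, np.nan, 74.27],
    "x55": [3.47, 5.03, np.nan, 6.53],
    "data_quality": [1, 1, 0, 0],
})

# Share of rows per flag value
flag_share = toy["data_quality"].value_counts(normalize=True)

# Fraction of missing value columns per row, then averaged per flag value
value_cols = toy.columns.drop("data_quality")
row_null_pct = toy[value_cols].isna().mean(axis=1)
null_by_flag = row_null_pct.groupby(toy["data_quality"]).mean()

print(flag_share)
print(null_by_flag)
```

If rows flagged 0 turn out to be much sparser on the full dataset, that would be worth keeping in mind when choosing a cleaning strategy.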


##### **3. Building the variable mapping for Dataset 2: [Cost of Living](https://www.kaggle.com/datasets/mvieira101/global-cost-of-living/code)**

We build a map that lets us rename the variables of the Cost of Living dataset with names that support the business analysis.

In [13]:
# Load a CSV with the new variable names, plus a descriptive column with the cost type.
# The cost type column can help us cluster the variables, group them, and compute useful metrics for the analysis

pd.read_csv("./data/cost-of-living-vars-map.csv")

Unnamed: 0,original_name_var,name_var,description_var,cost_type
0,city,city_name,City name,
1,country,country_name,Country name,
2,x1,meal_inexpensive_restaurant,"Meal, Inexpensive Restaurant (USD)",Restaurantes y bebidas
3,x2,meal_midrange_restaurant_2p,"Meal for 2 People, Mid-Range Restaurant, Three...",Restaurantes y bebidas
4,x3,mcmeal_fastfood,McMeal at McDonald’s (or Equivalent Combo Meal...,Restaurantes y bebidas
5,x4,beer_domestic_restaurant_0_5l,"Domestic Beer (0.5 liter draught, in restauran...",Restaurantes y bebidas
6,x5,beer_imported_restaurant_0_33l,"Imported Beer (0.33 liter bottle, in restauran...",Restaurantes y bebidas
7,x6,cappuccino_restaurant,"Cappuccino (regular, in restaurant) (USD)",Restaurantes y bebidas
8,x7,soda_restaurant_0_33l,"Coke/Pepsi (0.33 liter bottle, in restaurant) ...",Restaurantes y bebidas
9,x8,water_restaurant_0_33l,"Water (0.33 liter bottle, in restaurant) (USD)",Restaurantes y bebidas
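
As an illustration of how `cost_type` could be used to group variables, the sketch below collects the columns belonging to each cost type and computes a per-city total for each group. The toy frames reuse a few values from the outputs above. The choice of a row-wise sum as the metric is an assumption made for the example, not something the notebook prescribes.

```python
import pandas as pd

# Toy stand-in for the mapping CSV (subset of the rows shown above)
toy_map = pd.DataFrame({
    "original_name_var": ["city", "country", "x1", "x3", "x6"],
    "cost_type": [None, None, "Restaurantes y bebidas",
                  "Restaurantes y bebidas", "Restaurantes y bebidas"],
})

# Toy stand-in for df_cost (Seoul and Shanghai, three price columns)
toy_cost = pd.DataFrame({
    "city": ["Seoul", "Shanghai"],
    "country": ["South Korea", "China"],
    "x1": [7.68, 5.69],
    "x3": [6.15, 5.69],
    "x6": [3.93, 3.98],
})

# groupby drops rows whose key is missing, so city and country are excluded
cols_by_type = (
    toy_map.groupby("cost_type")["original_name_var"].apply(list).to_dict()
)

# One column per cost type; min_count=1 keeps NaN when every price is missing
totals = pd.DataFrame({
    cost_type: toy_cost[cols].sum(axis=1, min_count=1)
    for cost_type, cols in cols_by_type.items()
})
print(totals)
```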


##### **5. Renaming variable columns to get business naming for Dataset 2: [Cost of Living](https://www.kaggle.com/datasets/mvieira101/global-cost-of-living/code)**

We rename all the columns and add the cost_type column to help the later EDA.
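
A minimal sketch of the renaming step, assuming the mapping has the `original_name_var` and `name_var` columns shown in section 3. `toy_cost` and `toy_map` are small stand-ins for `df_cost` and the mapping CSV, and the `missing` check is an extra safeguard added for the example.

```python
import pandas as pd

# Toy stand-ins built from rows shown earlier in the notebook
toy_cost = pd.DataFrame({
    "city": ["Seoul"],
    "country": ["South Korea"],
    "x1": [7.68],
    "x2": [53.78],
})
toy_map = pd.DataFrame({
    "original_name_var": ["city", "country", "x1", "x2"],
    "name_var": ["city_name", "country_name",
                 "meal_inexpensive_restaurant", "meal_midrange_restaurant_2p"],
})

# Build the old-name -> new-name dictionary from the mapping
rename_map = dict(zip(toy_map["original_name_var"], toy_map["name_var"]))

# Columns present in the data but absent from the mapping would keep their old name
missing = set(toy_cost.columns) - set(rename_map)

renamed = toy_cost.rename(columns=rename_map)
print(renamed.columns.tolist())
```

Checking `missing` before renaming makes it easier to spot a variable that was left out of the mapping file.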