![banner_etl](https://github.com/cistelsa/Commerce_Data_Analysis_and_Recommendations/blob/main/5_Sources/Images/banner_automatizacion.gif?raw=true)

**<mark style="background:#2bfe9c">Automation script</mark> derived from the notebook <mark>ETL_hb_hotels</mark>**
credits: **[@MayrenS95](https://github.com/MayrenS95)**, **[@cistelsa](https://github.com/cistelsa)**

In [1]:
# We use the pandas and json libraries to read the files
import pandas as pd
import json

In [2]:
# Base path where the JSON files are located
path_base = "/lakehouse/default/Files/data/original/HotelBeds/"

### Reading the dataset

In [3]:
# Read the previously extracted dataset (stored as JSON).
df_hotels_c = pd.read_json(path_base + "hotels_dataset.json")

In [4]:
# Inspect its structure and data types
df_hotels_c.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39963 entries, 0 to 39962
Data columns (total 31 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   code                   39963 non-null  int64 
 1   name                   39963 non-null  object
 2   description            39923 non-null  object
 3   countryCode            39963 non-null  object
 4   stateCode              39963 non-null  object
 5   destinationCode        39963 non-null  object
 6   zoneCode               39963 non-null  int64 
 7   coordinates            39963 non-null  object
 8   categoryCode           39963 non-null  object
 9   categoryGroupCode      39876 non-null  object
 10  chainCode              33551 non-null  object
 11  accommodationTypeCode  39963 non-null  object
 12  boardCodes             38280 non-null  object
 13  segmentCodes           39166 non-null  object
 14  address                39963 non-null  object
 15  postalCode         

We examine columns such as `license`. According to the data dictionary it is an internal Hotelbeds license, and more than 99% of its values are empty. It is also irrelevant to the KPIs and objectives, so the column is a candidate for removal.
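
The missing-value share that motivates dropping `license` can be measured directly. The following is a minimal sketch on a hypothetical four-row frame (the real table has 39,963 rows); the 70% cutoff is an assumption chosen only so the toy data triggers it.

```python
import pandas as pd

# Hypothetical miniature of the hotels table, for illustration only.
df = pd.DataFrame({
    "code": [1, 2, 3, 4],
    "license": [None, None, None, "ABC-1"],
    "chainCode": ["X", None, "Y", "Z"],
})

# Fraction of missing values per column, highest first.
null_share = df.isna().mean().sort_values(ascending=False)

# Columns whose missing share exceeds the assumed cutoff.
threshold = 0.70
mostly_empty = null_share[null_share > threshold].index.tolist()

print(null_share)
print(mostly_empty)
```

On the real dataset the same two lines would be applied to `df_hotels_c`, with a cutoff that matches the project's own criteria.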

In [5]:
# Set the display option first so it applies to the preview below
# (0 asks pandas to auto-detect the display width).
pd.options.display.max_columns = 0
df_hotels_c.head()

A **list** column only holds reference codes that point to another related table; this is the case for `boardCodes` and `segmentCodes`.

A **list of dictionaries** column holds additional parameters or descriptions of the hotel's services. Most of these can be handled as related tables, which is part of the ETL and table normalization process.

The columns stored as **dictionaries** are `name`, `description`, `coordinates`, `address`, and `city`. These columns are unnested so that their content ends up as plain strings.
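
One way to tell these three shapes apart programmatically is to look at the Python type of the first non-null value in each column. This is only a sketch: `classify_column` is a hypothetical helper, and the sample rows (including the keys inside `phones`) are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows that mimic the nested shapes described above.
df = pd.DataFrame({
    "code": [1, 2],
    "name": [{"content": "Hotel A"}, {"content": "Hotel B"}],
    "boardCodes": [["BB", "RO"], ["HB"]],
    "phones": [[{"phoneNumber": "123", "phoneType": "PHONEBOOKING"}], []],
})

def classify_column(series: pd.Series) -> str:
    """Label a column by the Python type of its first non-null value."""
    non_null = series.dropna()
    if non_null.empty:
        return "empty"
    first = non_null.iloc[0]
    if isinstance(first, dict):
        return "dict"
    if isinstance(first, list):
        # A non-empty list whose first item is a dict is treated as a list of dicts.
        if first and isinstance(first[0], dict):
            return "list of dicts"
        return "list"
    return "scalar"

kinds = {col: classify_column(df[col]) for col in df.columns}
print(kinds)
```

Because only the first value is inspected, a column with mixed shapes would need a fuller scan.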

In [6]:
# Define a function that extracts the value stored under 'content'
def extract_content(json_dict):
    if isinstance(json_dict, dict):
        return json_dict.get('content', '')
    else:
        return ''

In [7]:
# Apply the function to each column, overwriting it with the extracted content
df_hotels_c['name'] = df_hotels_c['name'].apply(extract_content)
df_hotels_c['description'] = df_hotels_c['description'].apply(extract_content)
df_hotels_c['address'] = df_hotels_c['address'].apply(extract_content)
df_hotels_c['city'] = df_hotels_c['city'].apply(extract_content)

In [8]:
# Split coordinates into two columns: longitude and latitude
def extract_longitude(json_dict):
    if isinstance(json_dict, dict):
        return json_dict.get('longitude', '')
    else:
        return ''

def extract_latitude(json_dict):
    if isinstance(json_dict, dict):
        return json_dict.get('latitude', '')
    else:
        return ''

In [9]:
# Apply the functions
df_hotels_c['longitude'] = df_hotels_c['coordinates'].apply(extract_longitude)
df_hotels_c['latitude'] = df_hotels_c['coordinates'].apply(extract_latitude)
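
The three extractor functions above differ only in the key they read, so they could be folded into one. The sketch below is an alternative, not the notebook's code: `extract_key` is a hypothetical helper, the sample rows are invented, and using `NaN` as the default for coordinates is an assumption made so those columns stay numeric.

```python
import pandas as pd

def extract_key(value, key, default=""):
    """Return value[key] when value is a dict, otherwise the default."""
    if isinstance(value, dict):
        return value.get(key, default)
    return default

# Hypothetical rows: one complete record and one with missing nested fields.
df = pd.DataFrame({
    "name": [{"content": "Hotel A"}, None],
    "coordinates": [{"longitude": -87.62, "latitude": 41.88}, None],
})

# Series.apply forwards extra keyword arguments to the function.
df["name"] = df["name"].apply(extract_key, key="content")
df["longitude"] = df["coordinates"].apply(extract_key, key="longitude", default=float("nan"))
df["latitude"] = df["coordinates"].apply(extract_key, key="latitude", default=float("nan"))

print(df[["name", "longitude", "latitude"]])
```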

We drop the columns that already exist in the `hotels_details_dataset.csv` dataset: `terminals, issues, wildcards, images, interestPoints, facilities, rooms`. The `license` column is dropped because 99% of its values are empty and we have verified that the data is not relevant. `coordinates` is dropped because it was split into two columns.

In [10]:
# Drop the columns already covered elsewhere or mostly empty: 'terminals', 'license', 'issues', 'wildcards', 'images', 'interestPoints', 'facilities', 'rooms'
df_hotels_c = df_hotels_c.drop(columns=['terminals', 'license', 'issues', 'wildcards', 'images', 'interestPoints', 'facilities', 'rooms'])
# Drop the column that has already been split into longitude and latitude
df_hotels_c = df_hotels_c.drop(columns=['coordinates'])

In [11]:
df_hotels_c.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39963 entries, 0 to 39962
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   code                   39963 non-null  int64  
 1   name                   39963 non-null  object 
 2   description            39963 non-null  object 
 3   countryCode            39963 non-null  object 
 4   stateCode              39963 non-null  object 
 5   destinationCode        39963 non-null  object 
 6   zoneCode               39963 non-null  int64  
 7   categoryCode           39963 non-null  object 
 8   categoryGroupCode      39876 non-null  object 
 9   chainCode              33551 non-null  object 
 10  accommodationTypeCode  39963 non-null  object 
 11  boardCodes             38280 non-null  object 
 12  segmentCodes           39166 non-null  object 
 13  address                39963 non-null  object 
 14  postalCode             39904 non-null  object 
 15  ci

In [12]:
# Rename the code columns to snake_case (code becomes hotel_id)
df_hotels_c = df_hotels_c.rename(columns={
    "code":"hotel_id",
    "countryCode": "country_code",
    "stateCode": "state_code",
    "destinationCode": "destination_code",
    "zoneCode": "zone_code",
    "categoryCode": "category_code",
    "categoryGroupCode": "categorygroup_code",
    "chainCode": "chain_code",
    "accommodationTypeCode": "accommodationtype_code"
})


In [13]:
# Drop boardCodes and segmentCodes, since they are related through their own tables
df_hotels_c = df_hotels_c.drop(['boardCodes', 'segmentCodes'], axis=1)

In [14]:
# Remove the special character "*" from the "S2C" column
# (regex=False treats "*" as a literal character and avoids the pandas FutureWarning)
df_hotels_c['S2C'] = df_hotels_c['S2C'].str.replace('*', '', regex=False)


In [15]:
# Change the data type of the "S2C" column
df_hotels_c['S2C'] = df_hotels_c['S2C'].astype('float')
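
`astype('float')` raises an error if any value still cannot be parsed after the asterisks are removed. If that is a concern, `pd.to_numeric` with `errors="coerce"` turns such values into `NaN` instead. The sketch below uses invented sample values, since the actual contents of `S2C` are not shown in this notebook.

```python
import pandas as pd

# Hypothetical S2C values; the real column contents are an assumption here.
s2c = pd.Series(["4*", "3.5*", "5", None, "N/A"])

# Literal replacement of "*", then a tolerant numeric conversion.
cleaned = s2c.str.replace("*", "", regex=False)
numeric = pd.to_numeric(cleaned, errors="coerce")

print(numeric)
```

Counting `numeric.isna()` before and after the conversion shows how many values were lost to coercion.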

#### **Start of the script that extracts the cities and assigns them to the corresponding column and row; it is run only once**

In [47]:
import os
import requests
# Read the Google Maps API key from the environment instead of hardcoding it in the notebook.
# GOOGLE_MAPS_API_KEY is an assumed variable name; adjust it to your own setup.
api_key = os.environ['GOOGLE_MAPS_API_KEY']

In [44]:
def obtener_ciudad(lat, lon, api_key):
    # Build the URL for the Google Maps reverse geocoding request.
    url = f'https://maps.googleapis.com/maps/api/geocode/json?latlng={lat},{lon}&key={api_key}'

    # Send the request to the Google Maps API (the timeout avoids hanging on a stalled connection).
    response = requests.get(url, timeout=10)
    data = response.json()

    # Parse the response and extract the city.
    city = None
    if 'results' in data:
        for result in data['results']:
            for component in result['address_components']:
                if 'locality' in component['types']:
                    city = component['long_name']
                    break
            if city:
                break

    return city
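
The parsing half of `obtener_ciudad` can be separated from the network call, which makes it testable without an API key. The following sketch uses a hypothetical helper, `extract_locality`, and a hand-written response that only mirrors the fields the function above reads (`results`, `address_components`, `types`, `long_name`).

```python
def extract_locality(data):
    """Return the first 'locality' long_name in a geocoding response dict, or None."""
    for result in data.get("results", []):
        for component in result.get("address_components", []):
            if "locality" in component.get("types", []):
                return component.get("long_name")
    return None

# Hand-written sample payload, not a captured API response.
sample = {
    "results": [
        {
            "address_components": [
                {"long_name": "Cook County", "types": ["administrative_area_level_2", "political"]},
                {"long_name": "Chicago", "types": ["locality", "political"]},
            ]
        }
    ],
    "status": "OK",
}

print(extract_locality(sample))
print(extract_locality({"results": [], "status": "ZERO_RESULTS"}))
```

Results without a `locality` component return `None`, which is consistent with the records that are later filled in manually.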

In [48]:
# Apply obtener_ciudad to the latitude and longitude columns to get the city (one API request per row).
df_hotels_c['city_2'] = df_hotels_c.apply(lambda row: obtener_ciudad(row['latitude'], row['longitude'], api_key), axis=1)
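
Applying the lookup row by row issues one request per hotel. If several hotels share the same coordinates, the calls can be limited to distinct pairs. This is a sketch under that assumption: `add_city_column` and `fake_geocode` are hypothetical names, and the fake geocoder stands in for `obtener_ciudad` so the example runs offline.

```python
import pandas as pd

def add_city_column(df, geocode, lat_col="latitude", lon_col="longitude"):
    """Call geocode(lat, lon) once per distinct coordinate pair and map the result back."""
    pairs = df[[lat_col, lon_col]].drop_duplicates()
    lookup = {
        (lat, lon): geocode(lat, lon)
        for lat, lon in pairs.itertuples(index=False, name=None)
    }
    out = df.copy()
    # .get returns None for pairs that cannot be matched (for example NaN coordinates).
    out["city_2"] = [lookup.get((lat, lon)) for lat, lon in zip(df[lat_col], df[lon_col])]
    return out

calls = []

def fake_geocode(lat, lon):
    """Offline stand-in for the real geocoding call; records each invocation."""
    calls.append((lat, lon))
    return "Chicago" if lat > 41 else "New Orleans"

hotels = pd.DataFrame({
    "hotel_id": [1, 2, 3],
    "latitude": [41.88, 41.88, 29.95],
    "longitude": [-87.62, -87.62, -90.07],
})

result = add_city_column(hotels, fake_geocode)
print(result)
print(len(calls))
```

In the notebook, the `geocode` argument could be `lambda lat, lon: obtener_ciudad(lat, lon, api_key)`.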

#### Note that the following hotel_id records came back empty, so they are looked up manually

In [12]:
# Set the city for hotel_id 624324.
df_hotels_c.loc[df_hotels_c['hotel_id'] == 624324, 'city_2'] = 'Pocono Township'

# Set the city for hotel_id 485021.
df_hotels_c.loc[df_hotels_c['hotel_id'] == 485021, 'city_2'] = 'South Whitehall Township'

# Set the city for hotel_id 714817 and swap its longitude and latitude values.
df_hotels_c.loc[df_hotels_c['hotel_id'] == 714817, 'city_2'] = 'Ishpeming'
lat_714817 = df_hotels_c.loc[df_hotels_c['hotel_id'] == 714817, 'longitude']
lon_714817 = df_hotels_c.loc[df_hotels_c['hotel_id'] == 714817, 'latitude']
df_hotels_c.loc[df_hotels_c['hotel_id'] == 714817, 'longitude'] = lon_714817
df_hotels_c.loc[df_hotels_c['hotel_id'] == 714817, 'latitude'] = lat_714817

# Set the city for hotel_id 714868 and swap its longitude and latitude values.
df_hotels_c.loc[df_hotels_c['hotel_id'] == 714868, 'city_2'] = 'Torrance'
lat_714868 = df_hotels_c.loc[df_hotels_c['hotel_id'] == 714868, 'longitude']
lon_714868 = df_hotels_c.loc[df_hotels_c['hotel_id'] == 714868, 'latitude']
df_hotels_c.loc[df_hotels_c['hotel_id'] == 714868, 'longitude'] = lon_714868
df_hotels_c.loc[df_hotels_c['hotel_id'] == 714868, 'latitude'] = lat_714868

# Set the city for hotel_id 735342 and swap its longitude and latitude values.
df_hotels_c.loc[df_hotels_c['hotel_id'] == 735342, 'city_2'] = 'Mount Vernon'
lat_735342 = df_hotels_c.loc[df_hotels_c['hotel_id'] == 735342, 'longitude']
lon_735342 = df_hotels_c.loc[df_hotels_c['hotel_id'] == 735342, 'latitude']
df_hotels_c.loc[df_hotels_c['hotel_id'] == 735342, 'longitude'] = lon_735342
df_hotels_c.loc[df_hotels_c['hotel_id'] == 735342, 'latitude'] = lat_735342

# Set the city for hotel_id 749641 and assign coordinates found manually.
df_hotels_c.loc[df_hotels_c['hotel_id'] == 749641, 'city_2'] = 'Caribou'
lat_749641 = '46.8730282'
lon_749641 = '-67.9993441'
df_hotels_c.loc[df_hotels_c['hotel_id'] == 749641, 'longitude'] = lon_749641
df_hotels_c.loc[df_hotels_c['hotel_id'] == 749641, 'latitude'] = lat_749641

# Set the city for hotel_id 913405 and swap its longitude and latitude values.
df_hotels_c.loc[df_hotels_c['hotel_id'] == 913405, 'city_2'] = 'Miami Beach'
lat_913405 = df_hotels_c.loc[df_hotels_c['hotel_id'] == 913405, 'longitude']
lon_913405 = df_hotels_c.loc[df_hotels_c['hotel_id'] == 913405, 'latitude']
df_hotels_c.loc[df_hotels_c['hotel_id'] == 913405, 'longitude'] = lon_913405
df_hotels_c.loc[df_hotels_c['hotel_id'] == 913405, 'latitude'] = lat_913405

# Set the city for hotel_id 913409 and swap its longitude and latitude values.
df_hotels_c.loc[df_hotels_c['hotel_id'] == 913409, 'city_2'] = 'Orlando'
lat_913409 = df_hotels_c.loc[df_hotels_c['hotel_id'] == 913409, 'longitude']
lon_913409 = df_hotels_c.loc[df_hotels_c['hotel_id'] == 913409, 'latitude']
df_hotels_c.loc[df_hotels_c['hotel_id'] == 913409, 'longitude'] = lon_913409
df_hotels_c.loc[df_hotels_c['hotel_id'] == 913409, 'latitude'] = lat_913409

# Set the city for hotel_id 995721 and assign coordinates found manually.
df_hotels_c.loc[df_hotels_c['hotel_id'] == 995721, 'city_2'] = 'Warren'
lat_995721 = '42.5149271'
lon_995721 = '-83.0270149'
df_hotels_c.loc[df_hotels_c['hotel_id'] == 995721, 'longitude'] = lon_995721
df_hotels_c.loc[df_hotels_c['hotel_id'] == 995721, 'latitude'] = lat_995721
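
The manual corrections above follow three patterns: set a city, swap the two coordinate columns, or assign coordinates that were looked up by hand. A table-driven version might look like the sketch below. It runs on a hypothetical three-row frame with invented coordinate values, covers only two of the hotel IDs, and stores the manual coordinates as floats rather than strings, which is an assumption.

```python
import pandas as pd

# Hypothetical rows; 714817 is given coordinates in swapped order for illustration.
df = pd.DataFrame({
    "hotel_id": [714817, 749641, 100],
    "latitude": [-87.66, None, 41.88],
    "longitude": [46.49, None, -87.62],
    "city_2": [None, None, "Chicago"],
})

city_fixes = {714817: "Ishpeming", 749641: "Caribou"}
swap_ids = [714817]
coord_fixes = {749641: (46.8730282, -67.9993441)}  # (latitude, longitude)

# 1) Fill in the city names.
for hotel_id, city in city_fixes.items():
    df.loc[df["hotel_id"] == hotel_id, "city_2"] = city

# 2) Swap latitude and longitude. to_numpy() drops the column labels;
#    without it pandas would realign by name and nothing would change.
swap_mask = df["hotel_id"].isin(swap_ids)
df.loc[swap_mask, ["latitude", "longitude"]] = (
    df.loc[swap_mask, ["longitude", "latitude"]].to_numpy()
)

# 3) Assign coordinates that were looked up manually.
for hotel_id, (lat, lon) in coord_fixes.items():
    df.loc[df["hotel_id"] == hotel_id, ["latitude", "longitude"]] = [lat, lon]

print(df)
```

Keeping the corrections in dictionaries makes it easier to review which hotels were touched and why.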

In [49]:
# Save the dataframe with the cities
df_hotels_c.to_csv(path_base + "hotels_with_cities_dataset.csv", index=False)

#### **End of the city extraction code for Hotelbeds**

In [15]:
df_hotels_c = pd.read_csv(path_base + "hotels_with_cities_dataset.csv")


In [16]:
df_city_code = pd.DataFrame()
df_city_code['city_name'] = df_hotels_c['city_2']
df_city_code['state_code'] = df_hotels_c['state_code']
# Drop duplicates
df_city_code.drop_duplicates(inplace=True)

In [17]:
df_city_code['city_code'] = range(100, 100 + len(df_city_code))
# Reorder the columns
df_city_code = df_city_code[['city_code', 'state_code', 'city_name']]
df_city_code.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6125 entries, 0 to 39961
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   city_code   6125 non-null   int64 
 1   state_code  6125 non-null   object
 2   city_name   6125 non-null   object
dtypes: int64(1), object(2)
memory usage: 191.4+ KB
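
The `info()` output shows that the city table keeps the row labels of the hotels table (`0 to 39961`) after `drop_duplicates`. If a contiguous index is preferred, the same table could be built as in this sketch; the sample rows are hypothetical, and `reset_index(drop=True)` is an optional addition rather than part of the notebook.

```python
import pandas as pd

# Hypothetical hotels sample; note that "Springfield" appears in two states.
hotels = pd.DataFrame({
    "city_2": ["Chicago", "Los Angeles", "Chicago", "Springfield", "Springfield"],
    "state_code": ["IL", "CA", "IL", "IL", "MA"],
})

cities = (
    hotels[["city_2", "state_code"]]
    .rename(columns={"city_2": "city_name"})
    .drop_duplicates()          # one row per (city_name, state_code) pair
    .reset_index(drop=True)     # contiguous 0..n-1 index
)

# Sequential codes starting at 100, as in the notebook.
cities["city_code"] = range(100, 100 + len(cities))
cities = cities[["city_code", "state_code", "city_name"]]

print(cities)
```

Deduplicating on the city and state pair is what keeps same-named cities in different states as separate entries.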


In [18]:
df_city_code


| | city_code | state_code | city_name |
|---|---|---|---|
| 0 | 100 | IL | Chicago |
| 1 | 101 | CA | Los Angeles |
| 2 | 102 | CA | Universal City |
| 3 | 103 | MA | Boston |
| 4 | 104 | LA | New Orleans |
| ... | ... | ... | ... |
| 39905 | 6220 | KS | Parkville |
| 39939 | 6221 | CA | Yucaipa |
| 39958 | 6222 | WV | Ellenboro |
| 39959 | 6223 | PA | O'Hara Township |


In [19]:
# Important: creates the cities.csv dataset
df_city_code.to_csv("/lakehouse/default/Files/data/launch/HotelBeds/cities.csv", index=False)

In [20]:
df_hotels_c.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39963 entries, 0 to 39962
Data columns (total 23 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   hotel_id                39963 non-null  int64  
 1   name                    39963 non-null  object 
 2   description             39923 non-null  object 
 3   country_code            39963 non-null  object 
 4   state_code              39963 non-null  object 
 5   destination_code        39963 non-null  object 
 6   zone_code               39963 non-null  int64  
 7   category_code           39963 non-null  object 
 8   categorygroup_code      39876 non-null  object 
 9   chain_code              33551 non-null  object 
 10  accommodationtype_code  39963 non-null  object 
 11  address                 39961 non-null  object 
 12  postalCode              39904 non-null  object 
 13  city                    39963 non-null  object 
 14  email                   20233 non-null

In [21]:
# Attach city_code to each hotel by matching on city name and state code
df_hotels_c_join = df_hotels_c.merge(
    df_city_code[['city_code', 'city_name', 'state_code']],
    left_on=['city_2', 'state_code'],
    right_on=['city_name', 'state_code'],
    how='inner'
)
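
An inner join silently drops hotels that have no matching city. One way to make that visible is a left join with `validate` and `indicator`, both standard `DataFrame.merge` parameters. The sketch below uses two small hypothetical frames rather than the notebook's data.

```python
import pandas as pd

# Hypothetical hotels and cities tables with the same key columns as above.
hotels = pd.DataFrame({
    "hotel_id": [1, 2, 3],
    "city_2": ["Chicago", "Boston", "Chicago"],
    "state_code": ["IL", "MA", "IL"],
})
cities = pd.DataFrame({
    "city_code": [100, 101],
    "state_code": ["IL", "MA"],
    "city_name": ["Chicago", "Boston"],
})

joined = hotels.merge(
    cities,
    left_on=["city_2", "state_code"],
    right_on=["city_name", "state_code"],
    how="left",
    validate="many_to_one",   # raises if a (city, state) pair is duplicated in cities
    indicator=True,           # adds a _merge column: 'both' or 'left_only'
)

# Hotels that found no city code.
unmatched = joined[joined["_merge"] == "left_only"]

print(joined)
print(len(unmatched))
```

When `unmatched` is empty and the row count equals that of the hotels table, the inner join used in the notebook loses no rows.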

In [22]:
df_hotels_c_join.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 39963 entries, 0 to 39962
Data columns (total 25 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   hotel_id                39963 non-null  int64  
 1   name                    39963 non-null  object 
 2   description             39923 non-null  object 
 3   country_code            39963 non-null  object 
 4   state_code              39963 non-null  object 
 5   destination_code        39963 non-null  object 
 6   zone_code               39963 non-null  int64  
 7   category_code           39963 non-null  object 
 8   categorygroup_code      39876 non-null  object 
 9   chain_code              33551 non-null  object 
 10  accommodationtype_code  39963 non-null  object 
 11  address                 39961 non-null  object 
 12  postalCode              39904 non-null  object 
 13  city                    39963 non-null  object 
 14  email                   20233 non-null

In [23]:
# Drop city_2, city and city_name, since cities are now related through their own table
df_hotels_c_join = df_hotels_c_join.drop(['city_2', 'city', 'city_name'], axis=1)

In [24]:
df_hotels_c_join.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 39963 entries, 0 to 39962
Data columns (total 22 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   hotel_id                39963 non-null  int64  
 1   name                    39963 non-null  object 
 2   description             39923 non-null  object 
 3   country_code            39963 non-null  object 
 4   state_code              39963 non-null  object 
 5   destination_code        39963 non-null  object 
 6   zone_code               39963 non-null  int64  
 7   category_code           39963 non-null  object 
 8   categorygroup_code      39876 non-null  object 
 9   chain_code              33551 non-null  object 
 10  accommodationtype_code  39963 non-null  object 
 11  address                 39961 non-null  object 
 12  postalCode              39904 non-null  object 
 13  email                   20233 non-null  object 
 14  phones                  39505 non-null

In [25]:
# Generate the hotels_hb_dataset.csv dataset
df_hotels_c_join.to_csv("/lakehouse/default/Files/data/beta/Hotelbeds/hotels_hb_dataset.csv", index=False)