# <h1 align=center> **GROUP PROJECT** </h1>
# <h2 align=center>**`ETL YELP (BUSINESS.PKL) - PART II`**</h2>

After the EDA, we found that the filter initially applied to the _categories_ column during the ETL process did not correctly extract the records for the **hotels** category and similar businesses. In addition, some records are located in Canada rather than the United States, which is the country under study. This notebook therefore runs a second ETL pass to correct both problems.


Import the required libraries.

In [2]:
import pandas as pd
import numpy as np
import os
import pickle
import geopandas as gpd                     
from shapely.geometry import Point
import matplotlib.pyplot as plt
from geopy.geocoders import Nominatim

Load the original _business.pkl_ file into a dataframe from the local path where it is stored.

In [3]:
with open('D:\\Aldemar\\cursos\\SoyHenry\\Cohorte DataPT02\\PF_DS\\Yelp\\business.pkl', 'rb') as f: #read the .pkl file and close it afterwards
    df_business = pickle.load(f)
pd.options.display.max_columns=0 #show all columns
df_business = df_business.iloc[:,0:14] #keep only the required columns
df_business.head(4)

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
0,Pns2l4eNsfO8kk83dixA6A,"Abby Rappoport, LAC, CMQ","1616 Chapala St, Ste 2",Santa Barbara,,93101,34.426679,-119.711197,5.0,7,0,{'ByAppointmentOnly': 'True'},"Doctors, Traditional Chinese Medicine, Naturop...",
1,mpf3x-BjTdTEA3yCZrAYPw,The UPS Store,87 Grasso Plaza Shopping Center,Affton,,63123,38.551126,-90.335695,3.0,15,1,{'BusinessAcceptsCreditCards': 'True'},"Shipping Centers, Local Services, Notaries, Ma...","{'Monday': '0:0-0:0', 'Tuesday': '8:0-18:30', ..."
2,tUFrWirKiKi_TAnsVWINQQ,Target,5255 E Broadway Blvd,Tucson,,85711,32.223236,-110.880452,3.5,22,0,"{'BikeParking': 'True', 'BusinessAcceptsCredit...","Department Stores, Shopping, Fashion, Home & G...","{'Monday': '8:0-22:0', 'Tuesday': '8:0-22:0', ..."
3,MTSW4McQd7CbVtyjqoe9mw,St Honore Pastries,935 Race St,Philadelphia,CA,19107,39.955505,-75.155564,4.0,80,1,"{'RestaurantsDelivery': 'False', 'OutdoorSeati...","Restaurants, Food, Bubble Tea, Coffee & Tea, B...","{'Monday': '7:0-20:0', 'Tuesday': '7:0-20:0', ..."


First, the _NaN_ records in the _categories_ column are dropped. A function then keeps only the rows where the word _Hotels_ appears in the first or second position of _categories_, which ensures that the business's main category is hotels or something similar (guest house, hostel, etc.).

In [4]:
df_business = df_business.dropna(subset=['categories']) #drop the NaN records in the categories column

def filter_categories_hotels(categories:list): #define the function (takes a list as its parameter)
    for item in categories[0:2]: #check whether 'Hotels' appears in the 1st or 2nd position of the list
        if isinstance(item, str) and item.strip() == 'Hotels':
            return True
    return False

df_business['categories'] = df_business['categories'].str.split(',') #turn each value of the 'categories' column into a list
df_business = df_business[df_business['categories'].apply(filter_categories_hotels)] #apply the function to the 'categories' column
df_business.head(4)

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours
34,w_AMNoI1iG9eay7ncmc67w,River 127,100 Iberville St,New Orleans,PA,70130,29.951359,-90.064672,3.0,12,1,"{'BusinessAcceptsCreditCards': 'True', 'WiFi':...","[Event Planning & Services, Hotels, Hotels &...",
55,xM6LoUcnpDpMBzXs_7dXAg,Fairfield Inn & Suites,719 E Baltimore Pike,Kennett Square,AB,19348,39.856248,-75.69461,3.0,37,1,"{'BusinessAcceptsCreditCards': 'True', 'WiFi':...","[Hotels, Hotels & Travel, Event Planning & S...",
155,hUQ9Z7kQeabvhPOAQOVV1A,Rathbone Mansions,1244 Esplanade Ave,New Orleans,IN,70116,29.967055,-90.065828,3.5,67,1,"{'WiFi': 'u'free'', 'BusinessAcceptsCreditCard...","[Hotels, Hotels & Travel, Bed & Breakfast, ...",
259,vjLSYNGFkPu4Y5HKoJlzYg,Rancho 777,777 E 4th St,Reno,FL,89512,39.532347,-119.804255,2.0,5,1,"{'BusinessAcceptsCreditCards': 'True', 'Restau...","[Event Planning & Services, Hotels, Hotels &...","{'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W..."
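
The position-based filter can be checked on a few hand-written category strings. The sketch below is a minimal standalone version, assuming the raw `categories` value is a comma-separated string as in the dataset. The name `has_hotels_in_top_two` and the sample strings are illustrative and not part of the notebook.

```python
def has_hotels_in_top_two(categories) -> bool:
    """Return True when 'Hotels' is the first or second comma-separated category."""
    if not isinstance(categories, str):
        return False  # NaN or any other non-string value never matches
    top_two = [item.strip() for item in categories.split(",")[:2]]
    return "Hotels" in top_two


# Hand-written samples (not taken from the dataset)
samples = [
    "Hotels, Hotels & Travel, Event Planning & Services",  # 'Hotels' first
    "Event Planning & Services, Hotels, Hotels & Travel",  # 'Hotels' second
    "Hotels & Travel, Tours, Hotels",                      # 'Hotels' only third
    "Restaurants, Food, Bubble Tea",                       # unrelated business
]
matches = [has_hotels_in_top_two(text) for text in samples]
```

Because the comparison is an exact match after stripping whitespace, a category such as `Hotels & Travel` does not qualify on its own.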


Once the dataset has been filtered to hotels, the records located outside the United States are removed. This is done with the _GeoPy_ library and its _Nominatim_ geocoder, which returns each hotel's full address from the latitude and longitude already present in the dataset.

In [None]:
def search_state(lati_long:str): #return the location of each hotel (takes a "latitude,longitude" string)
    geolocalizador = Nominatim(user_agent="aaaa")
    ubicacion = geolocalizador.reverse(lati_long)
    return ubicacion.address.split(',') #list with the parts of the hotel's full address

df_business = df_business.astype({'latitude':'str', 'longitude':'str'}) #convert the 'latitude' and 'longitude' columns to strings
df_business['lati_long'] = df_business['latitude'] + ',' + df_business['longitude'] #join 'latitude' and 'longitude' into a single column
df_business['state_en'] = df_business['lati_long'].apply(search_state) #store what the function returns in the 'state_en' column
df_business.head(4)
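
Splitting the formatted address is one way to reach the state. Nominatim responses also carry a structured `address` dictionary, which geopy exposes through `location.raw` when address details are requested. The sketch below is an alternative, not the notebook's method: it reads the state and country code from such a dictionary. The helper name `state_and_country` and both sample payloads are hypothetical and only mimic the shape of a real response.

```python
def state_and_country(raw: dict) -> tuple:
    """Return (state, upper-case country code) from a Nominatim-style raw payload."""
    address = raw.get("address", {})
    country = address.get("country_code")
    return address.get("state"), (country.upper() if country else None)


# Hand-written payloads that imitate location.raw (assumed shape)
sample_us = {"address": {"city": "New Orleans", "state": "Louisiana", "country_code": "us"}}
sample_ca = {"address": {"city": "Edmonton", "state": "Alberta", "country_code": "ca"}}

us_result = state_and_country(sample_us)
ca_result = state_and_country(sample_ca)
```

With this approach a record could be kept only when the country code is `US`, without comparing address parts against a list of state names.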

A copy of the previous dataframe is made because the geocoding service used through _GeoPy_ is limited. The copy can then be overwritten without affecting the original.

In [704]:
df_b = df_business.copy()
df_b.head(2)

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours,lati_long,state_en
34,w_AMNoI1iG9eay7ncmc67w,River 127,100 Iberville St,New Orleans,PA,70130,29.951359,-90.0646715,3.0,12,1,"{'BusinessAcceptsCreditCards': 'True', 'WiFi':...","[Event Planning & Services, Hotels, Hotels &...",,"29.951359,-90.0646715","[One Canal Place, 365, Canal Street, French..."
55,xM6LoUcnpDpMBzXs_7dXAg,Fairfield Inn & Suites,719 E Baltimore Pike,Kennett Square,AB,19348,39.8562475317,-75.6946098804,3.0,37,1,"{'BusinessAcceptsCreditCards': 'True', 'WiFi':...","[Hotels, Hotels & Travel, Event Planning & S...",,"39.8562475317,-75.6946098804","[East Baltimore Pike, Kennett Township, Ches..."
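
Because the public Nominatim service asks clients to stay at or below one request per second, it helps to avoid repeating lookups. The following sketch wraps any reverse-geocoding callable with an in-memory cache and a fixed pause between real calls. `make_cached_lookup`, `fake_reverse` and the delay value are assumptions for illustration; in the notebook the callable would be something like `geolocalizador.reverse`.

```python
import time


def make_cached_lookup(reverse_fn, delay_seconds=1.0):
    """Wrap reverse_fn so repeated coordinates are served from a cache."""
    cache = {}

    def lookup(coords: str):
        if coords not in cache:
            if cache:  # pause before every real call except the first one
                time.sleep(delay_seconds)
            cache[coords] = reverse_fn(coords)
        return cache[coords]

    return lookup


calls = []  # records which coordinates actually reached the "service"


def fake_reverse(coords):
    """Stand-in for a real geocoder so the example runs offline."""
    calls.append(coords)
    return f"address for {coords}"


lookup = make_cached_lookup(fake_reverse, delay_seconds=0.0)
results = [lookup(c) for c in ["29.95,-90.06", "39.85,-75.69", "29.95,-90.06"]]
```

Only two of the three lookups reach `fake_reverse`; the repeated coordinate is answered from the cache. For the throttling part alone, geopy also ships `geopy.extra.rate_limiter.RateLimiter`.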


Separately, the 50 US states, written in Spanish and English along with their abbreviations, are loaded into a dataframe.

In [705]:
df_state_EEUU = pd.read_csv('D:\\Aldemar\\cursos\\SoyHenry\\Cohorte DataPT02\\PF_DS\\estados_EEUU.csv')
df_state_EEUU = df_state_EEUU.iloc[:,0:3]
df_state_EEUU

Unnamed: 0,state_sp,state_en,state_abbreviation
0,Alabama,Alabama,AL
1,Alaska,Alaska,AK
2,Arizona,Arizona,AZ
3,Arkansas,Arkansas,AR
4,California,California,CA
5,Carolina del Norte,North Carolina,NC
6,Carolina del Sur,South Carolina,SC
7,Colorado,Colorado,CO
8,Connecticut,Connecticut,CT
9,Dakota del Norte,North Dakota,ND


Since the full address of each hotel is available and only the state is needed, a function is defined that returns the state by matching the address parts in the _state_en_ column of _df_b_ against the _state_en_ column of _df_state_EEUU_.

In [707]:
lista_state_EEUU_en = df_state_EEUU['state_en'].to_list() #convert the 'state_en' column into a list

def search_state(fila_state_en:list): #define the function (note: this redefines the earlier search_state)
    for item in fila_state_en:
        if item.strip() in lista_state_EEUU_en: #look up each part of the address in the list of states
            return item.strip() #return the state without surrounding whitespace
    return None #no US state found in the address

df_b['state_en'] = df_b['state_en'].apply(search_state) #store what the function returns in the 'state_en' column
df_b

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours,lati_long,state_en
34,w_AMNoI1iG9eay7ncmc67w,River 127,100 Iberville St,New Orleans,PA,70130,29.951359,-90.0646715,3.0,12,1,"{'BusinessAcceptsCreditCards': 'True', 'WiFi':...","[Event Planning & Services, Hotels, Hotels &...",,"29.951359,-90.0646715",Louisiana
55,xM6LoUcnpDpMBzXs_7dXAg,Fairfield Inn & Suites,719 E Baltimore Pike,Kennett Square,AB,19348,39.8562475317,-75.6946098804,3.0,37,1,"{'BusinessAcceptsCreditCards': 'True', 'WiFi':...","[Hotels, Hotels & Travel, Event Planning & S...",,"39.8562475317,-75.6946098804",Pennsylvania
155,hUQ9Z7kQeabvhPOAQOVV1A,Rathbone Mansions,1244 Esplanade Ave,New Orleans,IN,70116,29.9670548,-90.0658282,3.5,67,1,"{'WiFi': 'u'free'', 'BusinessAcceptsCreditCard...","[Hotels, Hotels & Travel, Bed & Breakfast, ...",,"29.9670548,-90.0658282",Louisiana
259,vjLSYNGFkPu4Y5HKoJlzYg,Rancho 777,777 E 4th St,Reno,FL,89512,39.5323474847,-119.804255405,2.0,5,1,"{'BusinessAcceptsCreditCards': 'True', 'Restau...","[Event Planning & Services, Hotels, Hotels &...","{'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W...","39.5323474847,-119.804255405",Nevada
268,xloFoRiYlH4IKGz3FhTDpA,1-275 Rest Area Manatee County Mile 7,13018 Rest Area,Terra Ceia,LA,34250,27.5843003458,-82.6139500829,4.0,5,1,"{'RestaurantsPriceRange2': '1', 'BusinessAccep...","[Hotels, Rest Stops, Event Planning & Servic...",,"27.5843003458,-82.6139500829",Florida
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
149956,XGGPXLaa_B2Qc79cq9YPlg,Beach House Inn,320 W Yanonali St,Santa Barbara,FL,93101,34.4103881,-119.6958943,4.5,96,1,"{'RestaurantsPriceRange2': '2', 'BusinessAccep...","[Apartments, Hotels, Hotels & Travel, Home ...",,"34.4103881,-119.6958943",California
150035,sfGALNhZEYz4HyrVVRaf3A,Beach House Suites by the Don Cesar,3860 Gulf Blvd,St. Pete Beach,PA,33706,27.7162027128,-82.7391078,3.5,39,1,"{'WiFi': ''free'', 'BusinessAcceptsCreditCards...","[Hotels, Event Planning & Services, Hotels &...","{'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W...","27.7162027128,-82.7391078",Florida
150046,1pBiQhcwaI_kF3urOdnG5A,Hotel Indigo Nashville,1719 W End Ave,Nashville,NV,37203,36.1529887,-86.7957085,3.0,39,0,"{'BusinessAcceptsCreditCards': 'True', 'Outdoo...","[Hotels, Shopping, Hotels & Travel, Venues ...",,"36.1529887,-86.7957085",Tennessee
150086,J2QLTuhFwXxnGHbEUP5e1A,Super 8 by Wyndham Indianapolis,"4530 S Emerson Ave, I-465, Exit 52",Indianapolis,LA,46203,39.7009908,-86.0838894,3.0,10,1,"{'RestaurantsPriceRange2': '2', 'WiFi': 'u'fre...","[Hotels, Hotels & Travel, Event Planning & S...",,"39.7009908,-86.0838894",Indiana
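
The lookup walks the address parts from the start and returns the first exact match against the state list. A variant worth considering, shown here only as a sketch, scans from the end of the address, where Nominatim usually places the state, postcode and country. This reduces the chance that a town or county sharing a state's name is picked first. The state subset, the function name `find_state` and both sample addresses are made up for the example.

```python
US_STATES = {"Louisiana", "Pennsylvania", "Indiana", "Nevada"}  # illustrative subset only


def find_state(address_parts, states=US_STATES):
    """Return the last address part that is a known state name, or None."""
    for part in reversed(address_parts):
        name = part.strip()
        if name in states:
            return name
    return None


# Hand-written address: the town of Indiana lies in Pennsylvania
us_address = ["Philadelphia Street", " Indiana", " Indiana County",
              " Pennsylvania", " 15701", " United States"]
# Hand-written Canadian address: no US state should be found
ca_address = ["Jasper Avenue", " Edmonton", " Alberta", " Canada"]

us_state = find_state(us_address)
ca_state = find_state(ca_address)
```

A front-to-back scan of `us_address` would stop at the town name and report Indiana, whereas the reversed scan reaches Pennsylvania first.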


The output below shows how many null values remain in the dataframe.

In [712]:
df_b.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1626 entries, 34 to 150331
Data columns (total 16 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   business_id   1626 non-null   object
 1   name          1626 non-null   object
 2   address       1626 non-null   object
 3   city          1626 non-null   object
 4   state         1626 non-null   object
 5   postal_code   1626 non-null   object
 6   latitude      1626 non-null   object
 7   longitude     1626 non-null   object
 8   stars         1626 non-null   object
 9   review_count  1626 non-null   object
 10  is_open       1626 non-null   object
 11  attributes    1517 non-null   object
 12  categories    1626 non-null   object
 13  hours         1257 non-null   object
 14  lati_long     1626 non-null   object
 15  state_en      1592 non-null   object
dtypes: object(16)
memory usage: 280.5+ KB


The null values in the _state_en_ column correspond to hotels that are not in the United States. They are therefore dropped, and the column is then joined with the states dataframe to bring in the state names in Spanish for easier reading.

In [714]:
df_b = df_b.dropna(subset=["state_en"]) #drop null values (records outside the US)
df_b = pd.merge(df_b, df_state_EEUU.iloc[:,0:2], on='state_en', how='left') #bring in the state names in Spanish
df_b

Unnamed: 0,business_id,name,address,city,state,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours,lati_long,state_en,state_sp
0,w_AMNoI1iG9eay7ncmc67w,River 127,100 Iberville St,New Orleans,PA,70130,29.951359,-90.0646715,3.0,12,1,"{'BusinessAcceptsCreditCards': 'True', 'WiFi':...","[Event Planning & Services, Hotels, Hotels &...",,"29.951359,-90.0646715",Louisiana,Luisiana
1,xM6LoUcnpDpMBzXs_7dXAg,Fairfield Inn & Suites,719 E Baltimore Pike,Kennett Square,AB,19348,39.8562475317,-75.6946098804,3.0,37,1,"{'BusinessAcceptsCreditCards': 'True', 'WiFi':...","[Hotels, Hotels & Travel, Event Planning & S...",,"39.8562475317,-75.6946098804",Pennsylvania,Pensilvania
2,hUQ9Z7kQeabvhPOAQOVV1A,Rathbone Mansions,1244 Esplanade Ave,New Orleans,IN,70116,29.9670548,-90.0658282,3.5,67,1,"{'WiFi': 'u'free'', 'BusinessAcceptsCreditCard...","[Hotels, Hotels & Travel, Bed & Breakfast, ...",,"29.9670548,-90.0658282",Louisiana,Luisiana
3,vjLSYNGFkPu4Y5HKoJlzYg,Rancho 777,777 E 4th St,Reno,FL,89512,39.5323474847,-119.804255405,2.0,5,1,"{'BusinessAcceptsCreditCards': 'True', 'Restau...","[Event Planning & Services, Hotels, Hotels &...","{'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W...","39.5323474847,-119.804255405",Nevada,Nevada
4,xloFoRiYlH4IKGz3FhTDpA,1-275 Rest Area Manatee County Mile 7,13018 Rest Area,Terra Ceia,LA,34250,27.5843003458,-82.6139500829,4.0,5,1,"{'RestaurantsPriceRange2': '1', 'BusinessAccep...","[Hotels, Rest Stops, Event Planning & Servic...",,"27.5843003458,-82.6139500829",Florida,Florida
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1587,XGGPXLaa_B2Qc79cq9YPlg,Beach House Inn,320 W Yanonali St,Santa Barbara,FL,93101,34.4103881,-119.6958943,4.5,96,1,"{'RestaurantsPriceRange2': '2', 'BusinessAccep...","[Apartments, Hotels, Hotels & Travel, Home ...",,"34.4103881,-119.6958943",California,California
1588,sfGALNhZEYz4HyrVVRaf3A,Beach House Suites by the Don Cesar,3860 Gulf Blvd,St. Pete Beach,PA,33706,27.7162027128,-82.7391078,3.5,39,1,"{'WiFi': ''free'', 'BusinessAcceptsCreditCards...","[Hotels, Event Planning & Services, Hotels &...","{'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W...","27.7162027128,-82.7391078",Florida,Florida
1589,1pBiQhcwaI_kF3urOdnG5A,Hotel Indigo Nashville,1719 W End Ave,Nashville,NV,37203,36.1529887,-86.7957085,3.0,39,0,"{'BusinessAcceptsCreditCards': 'True', 'Outdoo...","[Hotels, Shopping, Hotels & Travel, Venues ...",,"36.1529887,-86.7957085",Tennessee,Tennessee
1590,J2QLTuhFwXxnGHbEUP5e1A,Super 8 by Wyndham Indianapolis,"4530 S Emerson Ave, I-465, Exit 52",Indianapolis,LA,46203,39.7009908,-86.0838894,3.0,10,1,"{'RestaurantsPriceRange2': '2', 'WiFi': 'u'fre...","[Hotels, Hotels & Travel, Event Planning & S...",,"39.7009908,-86.0838894",Indiana,Indiana


As the last step before exporting the dataframe, the columns that are no longer needed are dropped and the _state_sp_ column is renamed.

In [715]:
df_b = df_b.drop(['state','lati_long','state_en'], axis=1) #drop columns
df_b = df_b.rename(columns={'state_sp':'state'}) #rename column
df_b.head(4)

Unnamed: 0,business_id,name,address,city,postal_code,latitude,longitude,stars,review_count,is_open,attributes,categories,hours,state
0,w_AMNoI1iG9eay7ncmc67w,River 127,100 Iberville St,New Orleans,70130,29.951359,-90.0646715,3.0,12,1,"{'BusinessAcceptsCreditCards': 'True', 'WiFi':...","[Event Planning & Services, Hotels, Hotels &...",,Luisiana
1,xM6LoUcnpDpMBzXs_7dXAg,Fairfield Inn & Suites,719 E Baltimore Pike,Kennett Square,19348,39.8562475317,-75.6946098804,3.0,37,1,"{'BusinessAcceptsCreditCards': 'True', 'WiFi':...","[Hotels, Hotels & Travel, Event Planning & S...",,Pensilvania
2,hUQ9Z7kQeabvhPOAQOVV1A,Rathbone Mansions,1244 Esplanade Ave,New Orleans,70116,29.9670548,-90.0658282,3.5,67,1,"{'WiFi': 'u'free'', 'BusinessAcceptsCreditCard...","[Hotels, Hotels & Travel, Bed & Breakfast, ...",,Luisiana
3,vjLSYNGFkPu4Y5HKoJlzYg,Rancho 777,777 E 4th St,Reno,89512,39.5323474847,-119.804255405,2.0,5,1,"{'BusinessAcceptsCreditCards': 'True', 'Restau...","[Event Planning & Services, Hotels, Hotels &...","{'Monday': '0:0-0:0', 'Tuesday': '0:0-0:0', 'W...",Nevada


Export the dataframe, filtered to the _hotel_ category and restricted to the United States.

In [None]:
df_b.to_csv('df_business.csv')
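
One detail to keep in mind when exporting: the _categories_ column holds Python lists at this point, and a list written to CSV is stored as its text representation, brackets and quotes included. The sketch below uses only the standard `csv` module to show one possible way of joining the list back into a clean string before writing. The helper `categories_to_text` and the sample row are hypothetical.

```python
import csv
import io


def categories_to_text(categories):
    """Join a list of category names into one clean comma-separated string."""
    return ", ".join(item.strip() for item in categories)


# Hand-written row that mimics the shape of the notebook's data
row = {"name": "Rancho 777",
       "categories": ["Event Planning & Services", " Hotels", " Hotels & Travel"]}

buffer = io.StringIO()  # in-memory stand-in for the output file
writer = csv.DictWriter(buffer, fieldnames=["name", "categories"])
writer.writeheader()
writer.writerow({"name": row["name"],
                 "categories": categories_to_text(row["categories"])})

buffer.seek(0)
restored = next(csv.DictReader(buffer))  # read the row back as plain text
```

In the notebook, the same helper could be applied to the column with `df_b['categories'].apply(categories_to_text)` before calling `to_csv`.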