## Data import and cleaning

In [1]:
import pandas as pd
import numpy as np

In [2]:
df_airbnb = pd.read_csv("listings.csv", sep = ",")

In [3]:
df_airbnb.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,21853,https://www.airbnb.com/rooms/21853,20241212051353,2024-12-12,city scrape,Bright and airy room,We have a quiet and sunny room with a good vie...,We live in a leafy neighbourhood with plenty o...,https://a0.muscache.com/pictures/68483181/87bc...,83531,...,4.82,4.21,4.67,,f,2,0,2,0,0.27
1,30320,https://www.airbnb.com/rooms/30320,20241212051353,2024-12-12,previous scrape,Great Vacational Apartments,,,https://a0.muscache.com/pictures/336868/f67409...,130907,...,4.78,4.9,4.69,,f,3,3,0,0,0.98
2,30959,https://www.airbnb.com/rooms/30959,20241212051353,2024-12-12,previous scrape,Beautiful loft in Madrid Center,Beautiful Loft 60m2 size just in the historica...,,https://a0.muscache.com/pictures/78173471/835e...,132883,...,4.63,4.88,4.25,,f,1,1,0,0,0.07
3,40916,https://www.airbnb.com/rooms/40916,20241212051353,2024-12-12,previous scrape,Holiday Apartment Madrid Center,,,https://a0.muscache.com/pictures/336736/c3b486...,130907,...,4.79,4.88,4.55,,f,3,3,0,0,0.29
4,62423,https://www.airbnb.com/rooms/62423,20241212051353,2024-12-12,city scrape,MAGIC ARTISTIC HOUSE IN THE CENTER OF MADRID,INCREDIBLE HOME OF AN ARTIST SURROUNDED BY PAI...,DISTRICT WITH VERY GOOD VIBES IN THE MIDDLE OF...,https://a0.muscache.com/pictures/miso/Hosting-...,303845,...,4.85,4.97,4.58,,f,3,1,2,0,2.73


In [4]:
df_airbnb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26760 entries, 0 to 26759
Data columns (total 75 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            26760 non-null  int64  
 1   listing_url                                   26760 non-null  object 
 2   scrape_id                                     26760 non-null  int64  
 3   last_scraped                                  26760 non-null  object 
 4   source                                        26760 non-null  object 
 5   name                                          26760 non-null  object 
 6   description                                   25740 non-null  object 
 7   neighborhood_overview                         12228 non-null  object 
 8   picture_url                                   26759 non-null  object 
 9   host_id                                       26760 non-null 

Based on the queries to be run in Mongo and Neo4j, the columns of interest are: ["id", "name", "neighbourhood_cleansed", "neighbourhood_group_cleansed", "latitude", "longitude", "bathrooms", "bedrooms", "amenities", "price", "number_of_reviews", "review_scores_rating"]

In [5]:
columns_to_keep = ["id", "name", "neighbourhood_cleansed", "neighbourhood_group_cleansed", "latitude", "longitude", "bathrooms", "bedrooms", "amenities", "price", "number_of_reviews", "review_scores_rating"]

In [6]:
df_airbnb = df_airbnb[columns_to_keep].copy()  # work on a copy so later column assignments do not trigger SettingWithCopyWarning
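The same selection could also be made while reading the file. A minimal sketch, assuming a small in-memory CSV as a stand-in for `listings.csv` (the sample rows and the reduced column list below are illustrative, not taken from the dataset):

```python
import io
import pandas as pd

# Hypothetical stand-in for listings.csv, with one extra column we do not need
csv_text = "id,name,host_id,price\n1,Room A,10,$31.00\n2,Room B,11,$69.00\n"

columnas_demo = ["id", "name", "price"]

# usecols tells read_csv to parse only the listed columns
df_demo = pd.read_csv(io.StringIO(csv_text), usecols=columnas_demo)
print(df_demo.columns.tolist())
```

With the real file this would avoid loading the remaining columns into memory, at the cost of failing early if one of the names is misspelled.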

In [7]:
df_airbnb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26760 entries, 0 to 26759
Data columns (total 12 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   id                            26760 non-null  int64  
 1   name                          26760 non-null  object 
 2   neighbourhood_cleansed        26760 non-null  object 
 3   neighbourhood_group_cleansed  26760 non-null  object 
 4   latitude                      26760 non-null  float64
 5   longitude                     26760 non-null  float64
 6   bathrooms                     20834 non-null  float64
 7   bedrooms                      24228 non-null  float64
 8   amenities                     26760 non-null  object 
 9   price                         20815 non-null  object 
 10  number_of_reviews             26760 non-null  int64  
 11  review_scores_rating          21293 non-null  float64
dtypes: float64(5), int64(2), object(5)
memory usage: 2.5+ MB


In [8]:
df_airbnb.head()

Unnamed: 0,id,name,neighbourhood_cleansed,neighbourhood_group_cleansed,latitude,longitude,bathrooms,bedrooms,amenities,price,number_of_reviews,review_scores_rating
0,21853,Bright and airy room,Cármenes,Latina,40.40381,-3.7413,1.0,1.0,"[""First aid kit"", ""Wifi"", ""Kitchen"", ""Essentia...",$31.00,33,4.58
1,30320,Great Vacational Apartments,Sol,Centro,40.41476,-3.70418,,1.0,"[""Heating"", ""Wifi"", ""TV with standard cable"", ...",,172,4.63
2,30959,Beautiful loft in Madrid Center,Embajadores,Centro,40.41259,-3.70105,,1.0,"[""Breakfast"", ""Heating"", ""Wifi"", ""Smoking allo...",,8,4.38
3,40916,Holiday Apartment Madrid Center,Universidad,Centro,40.42247,-3.70577,,1.0,"[""Heating"", ""Wifi"", ""Pets allowed"", ""Kitchen"",...",,49,4.65
4,62423,MAGIC ARTISTIC HOUSE IN THE CENTER OF MADRID,Justicia,Centro,40.41884,-3.69655,1.5,1.0,"[""Books and reading material"", ""First aid kit""...",$69.00,219,4.64


Make sure the object-typed columns have no leading or trailing whitespace

In [9]:
for column in df_airbnb.columns:
    if df_airbnb[column].dtype == "object":
        df_airbnb[column] = df_airbnb[column].str.strip()
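One caveat with `.str.strip()` on object columns: any value that is not a string comes back as missing. The sketch below uses a toy Series (illustrative values, not from the dataset) to show that behaviour next to a more defensive variant that only touches strings:

```python
import pandas as pd

# Toy object column mixing a string, a missing value and a number
mixta = pd.Series(["  Sol ", None, 42], dtype=object)

# .str.strip() trims strings, but non-string values become missing
con_str = mixta.str.strip()

# Defensive variant: strip only real strings, leave everything else untouched
segura = mixta.map(lambda v: v.strip() if isinstance(v, str) else v)

print(con_str.tolist())
print(segura.tolist())
```

In this notebook the object columns hold text, so the loop above should be safe; the second form is only worth considering if a column might mix types.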

Remove the "$" and "," symbols from the price column and convert it to float

In [10]:
df_airbnb["price"] = df_airbnb["price"].str.replace("$", "", regex=False)  # regex=False: treat "$" as a literal character, not an end-of-string anchor

In [11]:
df_airbnb["price"] = df_airbnb["price"].str.replace(",","")

In [12]:
df_airbnb["price"] = df_airbnb["price"].astype(float)
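The three steps above can also be written as a single expression. A minimal sketch on a toy Series (the values are illustrative), using a character class to drop both symbols at once and `pd.to_numeric` to convert:

```python
import pandas as pd

# Toy prices in the same format as the raw column, including a missing value
precios_raw = pd.Series(["$31.00", "$1,250.00", None], dtype=object)

# Remove "$" and "," in one pass, then convert; unparseable values become NaN
precios = pd.to_numeric(
    precios_raw.str.replace(r"[$,]", "", regex=True),
    errors="coerce",
)
print(precios.tolist())
```

`errors="coerce"` is a design choice here: it turns anything unexpected into NaN instead of raising, which is convenient for exploration but can hide malformed values.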

Remove the accents from the neighbourhood_cleansed and neighbourhood_group_cleansed fields and convert them to uppercase

In [13]:
df_airbnb["neighbourhood_cleansed"] = df_airbnb["neighbourhood_cleansed"].str.normalize('NFKD').str.encode('ascii', 'ignore').str.decode('utf-8').str.upper()

In [14]:
df_airbnb["neighbourhood_group_cleansed"] = df_airbnb["neighbourhood_group_cleansed"].str.normalize('NFKD').str.encode('ascii', 'ignore').str.decode('utf-8').str.upper()
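To see what this chain does to a single value, here is the same idea as a plain function built on the standard library. The function name is hypothetical; the logic mirrors the normalize, encode, decode and upper steps used above:

```python
import unicodedata

def sin_tildes_mayusculas(texto):
    """Strip diacritics and uppercase a string (hypothetical helper)."""
    # NFKD splits accented letters into base letter + combining mark
    descompuesto = unicodedata.normalize("NFKD", texto)
    # Encoding to ASCII with "ignore" drops the combining marks
    solo_ascii = descompuesto.encode("ascii", "ignore").decode("ascii")
    return solo_ascii.upper()

print(sin_tildes_mayusculas("Cármenes"))    # CARMENES
print(sin_tildes_mayusculas("Niño Jesús"))  # NINO JESUS
```

Note that the tilde of "ñ" is also a combining mark, so "ñ" collapses to "N"; any name matched against these columns later needs the same treatment.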

In [15]:
df_airbnb.head()

Unnamed: 0,id,name,neighbourhood_cleansed,neighbourhood_group_cleansed,latitude,longitude,bathrooms,bedrooms,amenities,price,number_of_reviews,review_scores_rating
0,21853,Bright and airy room,CARMENES,LATINA,40.40381,-3.7413,1.0,1.0,"[""First aid kit"", ""Wifi"", ""Kitchen"", ""Essentia...",31.0,33,4.58
1,30320,Great Vacational Apartments,SOL,CENTRO,40.41476,-3.70418,,1.0,"[""Heating"", ""Wifi"", ""TV with standard cable"", ...",,172,4.63
2,30959,Beautiful loft in Madrid Center,EMBAJADORES,CENTRO,40.41259,-3.70105,,1.0,"[""Breakfast"", ""Heating"", ""Wifi"", ""Smoking allo...",,8,4.38
3,40916,Holiday Apartment Madrid Center,UNIVERSIDAD,CENTRO,40.42247,-3.70577,,1.0,"[""Heating"", ""Wifi"", ""Pets allowed"", ""Kitchen"",...",,49,4.65
4,62423,MAGIC ARTISTIC HOUSE IN THE CENTER OF MADRID,JUSTICIA,CENTRO,40.41884,-3.69655,1.5,1.0,"[""Books and reading material"", ""First aid kit""...",69.0,219,4.64


In [16]:
df_airbnb.neighbourhood_group_cleansed.unique()

array(['LATINA', 'CENTRO', 'SALAMANCA', 'FUENCARRAL - EL PARDO',
       'CIUDAD LINEAL', 'CHAMBERI', 'VILLAVERDE', 'RETIRO', 'HORTALEZA',
       'BARAJAS', 'USERA', 'ARGANZUELA', 'CARABANCHEL', 'CHAMARTIN',
       'TETUAN', 'VILLA DE VALLECAS', 'PUENTE DE VALLECAS',
       'SAN BLAS - CANILLEJAS', 'MONCLOA - ARAVACA', 'MORATALAZ',
       'VICALVARO'], dtype=object)

Remove the spaces before and after the hyphens

In [17]:
df_airbnb["neighbourhood_group_cleansed"] = df_airbnb["neighbourhood_group_cleansed"].str.replace(" - ", "-")
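The literal replacement only matches exactly one space on each side of the hyphen. If the spacing were less regular, a regular expression would be more tolerant. A small sketch on toy values (the second one is deliberately malformed for illustration):

```python
import pandas as pd

distritos = pd.Series(["FUENCARRAL - EL PARDO", "MONCLOA -ARAVACA", "CENTRO"])

# \s*-\s* matches a hyphen with any amount of whitespace around it
normalizados = distritos.str.replace(r"\s*-\s*", "-", regex=True)
print(normalizados.tolist())
```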

In [18]:
df_airbnb.neighbourhood_cleansed.unique()

array(['CARMENES', 'SOL', 'EMBAJADORES', 'UNIVERSIDAD', 'JUSTICIA',
       'RECOLETOS', 'VALVERDE', 'PUEBLO NUEVO', 'RIOS ROSAS', 'PALACIO',
       'LOS ANGELES', 'TRAFALGAR', 'IBIZA', 'CORTES', 'PUERTA DEL ANGEL',
       'PIOVERA', 'CASTELLANA', 'CASCO HISTORICO DE BARAJAS',
       'SAN FERMIN', 'CANILLAS', 'ARAPILES', 'CHOPERA', 'VALDEFUENTES',
       'COMILLAS', 'PALOS DE MOGUER', 'LUCERO', 'PROSPERIDAD', 'ACACIAS',
       'CASTILLA', 'DELICIAS', 'ALMENARA', 'CASCO HISTORICO DE VALLECAS',
       'PUERTA BONITA', 'FUENTE DEL BERRO', 'NINO JESUS', 'PINAR DEL REY',
       'COSTILLARES', 'LISTA', 'ALAMEDA DE OSUNA', 'SAN DIEGO', 'ALMAGRO',
       'ROSAS', 'CONCEPCION', 'ARGUELLES', 'PACIFICO', 'HISPANOAMERICA',
       'SAN JUAN BAUTISTA', 'ARCOS', 'CASTILLEJOS',
       'CIUDAD UNIVERSITARIA', 'SIMANCAS', 'TIMON', 'SALVADOR',
       'JERONIMOS', 'NUMANCIA', 'GOYA', 'MARROQUINA', 'GAZTAMBIDE',
       'SAN ANDRES', 'LOS ROSALES', 'PALOMERAS SURESTE', 'ATOCHA',
       'NUEVA ESPANA', 'CUATR

Rename the columns to make the queries easier to read and write

In [19]:
nombres_columnas = ["id_airbnb", "nombre_airbnb", "barrio_airbnb", "distrito_airbnb", "latitud", "longitud", "numero_baños",
                    "numero_habitaciones", "servicios", "precio", "numero_reseñas", "puntuacion"]

In [20]:
df_airbnb.columns = nombres_columnas
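Assigning a list to `.columns` relies on the list being in exactly the same order as the existing columns. An alternative sketch, using an explicit old-to-new mapping on a toy frame (only three of the twelve columns are shown):

```python
import pandas as pd

df_demo = pd.DataFrame({"id": [21853], "name": ["Bright and airy room"], "price": [31.0]})

# Each old name is paired with its new name, so column order does not matter
renombrar = {"id": "id_airbnb", "name": "nombre_airbnb", "price": "precio"}
df_demo = df_demo.rename(columns=renombrar)
print(df_demo.columns.tolist())
```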

In [21]:
df_airbnb.head()

Unnamed: 0,id_airbnb,nombre_airbnb,barrio_airbnb,distrito_airbnb,latitud,longitud,numero_baños,numero_habitaciones,servicios,precio,numero_reseñas,puntuacion
0,21853,Bright and airy room,CARMENES,LATINA,40.40381,-3.7413,1.0,1.0,"[""First aid kit"", ""Wifi"", ""Kitchen"", ""Essentia...",31.0,33,4.58
1,30320,Great Vacational Apartments,SOL,CENTRO,40.41476,-3.70418,,1.0,"[""Heating"", ""Wifi"", ""TV with standard cable"", ...",,172,4.63
2,30959,Beautiful loft in Madrid Center,EMBAJADORES,CENTRO,40.41259,-3.70105,,1.0,"[""Breakfast"", ""Heating"", ""Wifi"", ""Smoking allo...",,8,4.38
3,40916,Holiday Apartment Madrid Center,UNIVERSIDAD,CENTRO,40.42247,-3.70577,,1.0,"[""Heating"", ""Wifi"", ""Pets allowed"", ""Kitchen"",...",,49,4.65
4,62423,MAGIC ARTISTIC HOUSE IN THE CENTER OF MADRID,JUSTICIA,CENTRO,40.41884,-3.69655,1.5,1.0,"[""Books and reading material"", ""First aid kit""...",69.0,219,4.64


In [22]:
df_airbnb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26760 entries, 0 to 26759
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   id_airbnb            26760 non-null  int64  
 1   nombre_airbnb        26760 non-null  object 
 2   barrio_airbnb        26760 non-null  object 
 3   distrito_airbnb      26760 non-null  object 
 4   latitud              26760 non-null  float64
 5   longitud             26760 non-null  float64
 6   numero_baños         20834 non-null  float64
 7   numero_habitaciones  24228 non-null  float64
 8   servicios            26760 non-null  object 
 9   precio               20815 non-null  float64
 10  numero_reseñas       26760 non-null  int64  
 11  puntuacion           21293 non-null  float64
dtypes: float64(6), int64(2), object(4)
memory usage: 2.5+ MB


In [23]:
import pandas as pd
from pymongo import MongoClient

# Connect to MongoDB
client = MongoClient("mongodb://localhost:27017/")  # Replace with your URI
db = client["locales_madrid_mongo"]  # Replace with your database name
collection = db["locales_madrid"]  # Replace with your collection name

def mapear_id(nombre_barrio, tipo="barrio"):
    """
    Map the name of a neighbourhood (barrio) or district (distrito) to its ID.

    Args:
        nombre_barrio: The name of the neighbourhood or district.
        tipo: The kind of value being mapped ("barrio" or "distrito").

    Returns:
        The ID matching the name, or None if it is not found.
    """
    try:
        if tipo == "barrio":
            resultado = collection.find_one({"desc_barrio_local": nombre_barrio})
            if resultado:
                return resultado["id_barrio_local"]
        elif tipo == "distrito":
            resultado = collection.find_one({"desc_distrito_local": nombre_barrio})
            if resultado:
                return resultado["id_distrito_local"]
        return None
    except Exception as e:
        print(f"Error al mapear ID: {e}")
        return None

In [None]:
# ... (previous code)

# Apply the mapping function to the neighbourhood column
df_airbnb["id_barrio_alojamiento"] = df_airbnb["barrio_airbnb"].apply(mapear_id)

# Apply the mapping function to the district column
df_airbnb["id_distrito_alojamiento"] = df_airbnb["distrito_airbnb"].apply(lambda x: mapear_id(x, tipo="distrito"))

print(df_airbnb.head())
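Applying `mapear_id` row by row issues one `find_one` query per listing and per column. A possible alternative is to build the name-to-ID lookup once and map the whole column in memory. The sketch below runs without a database: `documentos` is a hypothetical sample shaped like the documents `mapear_id` reads, and the ID values are made up.

```python
import pandas as pd

# Hypothetical documents with the two fields mapear_id uses for barrios
documentos = [
    {"desc_barrio_local": "SOL", "id_barrio_local": 101},
    {"desc_barrio_local": "JUSTICIA", "id_barrio_local": 102},
]

# Build the lookup once: barrio name -> barrio ID
barrio_a_id = {d["desc_barrio_local"]: d["id_barrio_local"] for d in documentos}

df_demo = pd.DataFrame({"barrio_airbnb": ["SOL", "CARMENES", "JUSTICIA"]})

# Names without a match end up as NaN instead of raising
df_demo["id_barrio_alojamiento"] = df_demo["barrio_airbnb"].map(barrio_a_id)
print(df_demo)
```

Against the real collection, `documentos` could presumably come from a single `collection.find()` call with a projection on those two fields; that part is an assumption and is not exercised here.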

In [23]:
df_airbnb.head()

Unnamed: 0,id_airbnb,nombre_airbnb,barrio_airbnb,distrito_airbnb,latitud,longitud,numero_baños,numero_habitaciones,servicios,precio,numero_reseñas,puntuacion
0,21853,Bright and airy room,CARMENES,LATINA,40.40381,-3.7413,1.0,1.0,"[""First aid kit"", ""Wifi"", ""Kitchen"", ""Essentia...",31.0,33,4.58
1,30320,Great Vacational Apartments,SOL,CENTRO,40.41476,-3.70418,,1.0,"[""Heating"", ""Wifi"", ""TV with standard cable"", ...",,172,4.63
2,30959,Beautiful loft in Madrid Center,EMBAJADORES,CENTRO,40.41259,-3.70105,,1.0,"[""Breakfast"", ""Heating"", ""Wifi"", ""Smoking allo...",,8,4.38
3,40916,Holiday Apartment Madrid Center,UNIVERSIDAD,CENTRO,40.42247,-3.70577,,1.0,"[""Heating"", ""Wifi"", ""Pets allowed"", ""Kitchen"",...",,49,4.65
4,62423,MAGIC ARTISTIC HOUSE IN THE CENTER OF MADRID,JUSTICIA,CENTRO,40.41884,-3.69655,1.5,1.0,"[""Books and reading material"", ""First aid kit""...",69.0,219,4.64


In [26]:
df_airbnb.describe()

Unnamed: 0,id_airbnb,latitud,longitud,numero_baños,numero_habitaciones,precio,numero_reseñas,puntuacion
count,26760.0,26760.0,26760.0,20834.0,24228.0,20815.0,26760.0,21293.0
mean,6.092656e+17,40.421609,-3.69355,1.278271,1.438418,134.071295,46.983782,4.656892
std,5.191434e+17,0.023779,0.028177,0.641976,1.004122,420.808453,87.188354,0.466173
min,21853.0,40.3314,-3.833071,0.0,0.0,8.0,0.0,1.0
25%,35512280.0,40.409288,-3.707319,1.0,1.0,63.0,1.0,4.56
50%,7.71072e+17,40.42044,-3.70084,1.0,1.0,96.0,11.0,4.77
75%,1.104581e+18,40.431995,-3.683778,1.5,2.0,139.0,52.0,4.93
max,1.309432e+18,40.53553,-3.545904,15.0,25.0,21347.0,1136.0,5.0


## Loading the data into Mongo

Because the data model already in Mongo is embedded in the premises (locales) documents, the Airbnb data is handled as a separate collection within the same database. Each listing's amenities (servicios) are nested in an array.

In [24]:
from pymongo import MongoClient
import ast

In [25]:
client = MongoClient("mongodb://localhost:27017/")
db = client["locales_madrid_mongo"]
alojamientos_collection = db["alojamientos"]

In [26]:


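def preparar_datos(row):
  """Build a dictionary ready to insert into MongoDB."""
  documento = row.to_dict()

  # Make sure "servicios" is an array. The raw value is a string such as
  # '["Wifi", "Kitchen"]', so we parse it as a list literal; splitting on
  # "," would break amenity names that themselves contain commas.
  if isinstance(documento["servicios"], str):
    try:
      documento["servicios"] = [str(s).strip() for s in ast.literal_eval(documento["servicios"])]
    except (ValueError, SyntaxError, TypeError):
      # Fall back to a plain split if the string is not a valid list literal
      documento["servicios"] = [s.strip() for s in documento["servicios"].strip("[]").replace('"', '').split(",")]
  elif not isinstance(documento["servicios"], list):  # Neither a string nor a list: leave it empty
    documento["servicios"] = []

  return documento

# Convert the DataFrame into a list of dictionaries
datos_para_insertar = [preparar_datos(row) for _, row in df_airbnb.iterrows()]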
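Several columns (`numero_baños`, `numero_habitaciones`, `precio`, `puntuacion`) contain missing values, and `row.to_dict()` passes them on as float NaN. A possible extra step, sketched here with a hypothetical helper, is to replace NaN with `None` so the field would be stored as null rather than as a NaN number:

```python
import math

def nan_a_none(documento):
    """Return a copy of the document with float NaN values replaced by None (hypothetical helper)."""
    return {
        clave: (None if isinstance(valor, float) and math.isnan(valor) else valor)
        for clave, valor in documento.items()
    }

# Toy document shaped like one row of df_airbnb with missing bathrooms and price
doc = {
    "id_airbnb": 30320,
    "numero_baños": float("nan"),
    "precio": float("nan"),
    "servicios": ["Heating", "Wifi"],
}

limpio = nan_a_none(doc)
print(limpio)
```

Whether null or NaN is preferable depends on how the later queries filter missing prices; this is only one option.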

In [27]:
# Insert the documents
if datos_para_insertar:  # Check that there is data to insert
  alojamientos_collection.insert_many(datos_para_insertar)
  print(f"Se insertaron {len(datos_para_insertar)} documentos en la colección.")
else:
  print("No hay datos para insertar.")

# Close the connection
client.close()

Se insertaron 26760 documentos en la colección.


The simplest way to relate this data to the premises (locales) data is through the neighbourhood (Barrio) and district (Distrito) information
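One way such a relation might be expressed is a `$lookup` stage that joins each listing to the premises in the same neighbourhood. The sketch below only builds the pipeline as plain Python data, so it runs without a database. It assumes that `desc_barrio_local` in `locales_madrid` uses the same uppercase, accent-free names as `barrio_airbnb`; the function name and the output field `locales_del_barrio` are hypothetical.

```python
def pipeline_locales_por_barrio(distrito):
    """Build an aggregation pipeline joining listings to premises by barrio (hypothetical helper)."""
    return [
        # Keep only the listings of one district to limit the join
        {"$match": {"distrito_airbnb": distrito}},
        # Attach the premises whose barrio name equals the listing's barrio
        {"$lookup": {
            "from": "locales_madrid",
            "localField": "barrio_airbnb",
            "foreignField": "desc_barrio_local",
            "as": "locales_del_barrio",
        }},
    ]

pipeline = pipeline_locales_por_barrio("CENTRO")
print(pipeline)
```

The result would be passed to `alojamientos_collection.aggregate(pipeline)`. Matching on names only works if both collections normalise them the same way, which is why joining on the barrio and district IDs may turn out to be more robust.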