# **Exploratory Data Analysis** #



**Description of the dataset obtained from Kaggle:**

Since 2008, guests and hosts have used ***Airbnb*** to expand travel possibilities and offer a more unique, personalized way of experiencing the world. Today, Airbnb is a service used and recognized worldwide. Analyzing the data from the millions of listings offered through Airbnb is crucial for the company. These listings generate a large amount of data that can be analyzed and used for security, business decisions, understanding the behavior and performance of customers and providers (hosts) on the platform, guiding marketing initiatives, implementing innovative additional services, and much more.

The dataset consists of 48,895 observations, each identified by an ID and representing an Airbnb listing in New York City.


# Measures of Central Tendency #

In [None]:
# Import the libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Load the dataset
df = pd.read_csv('sample_data/NYC_Airbnb.csv')
df.head()


Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,43392.0,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,43606.0,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,43651.0,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,43423.0,0.1,1,0
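
The `last_review` values in the preview (for example `43392.0`) are numbers rather than readable dates. A minimal sketch of one way to convert them, assuming they are Excel-style serial day counts (an assumption, since the source does not state the encoding); the sample values are copied from the rows above:

```python
import numpy as np
import pandas as pd

# Sample values copied from the `last_review` column in the preview;
# NaN marks a listing with no reviews.
last_review = pd.Series([43392.0, 43606.0, np.nan, 43651.0, 43423.0])

# Assumption: the values are Excel serial dates, whose day 0 is 1899-12-30.
last_review_dates = pd.to_datetime(last_review, unit="D", origin="1899-12-30")
print(last_review_dates)
```

If the assumption holds, missing values become `NaT` and the column can then be used for date-based filtering or grouping.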


In [None]:
# Get information about the variables
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   name                            48879 non-null  object 
 2   host_id                         48895 non-null  int64  
 3   host_name                       48874 non-null  object 
 4   neighbourhood_group             48895 non-null  object 
 5   neighbourhood                   48895 non-null  object 
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object 
 9   price                           48895 non-null  int64  
 10  minimum_nights                  48895 non-null  int64  
 11  number_of_reviews               48895 non-null  int64  
 12  last_review                     

## Mean (average) price


In [None]:
# Manual calculation
average = round(sum(df.price) / len(df.price),2)
average

152.72

In [None]:
# Calculation with pandas
average_pd = round(df["price"].mean(),2)
average_pd

152.72
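
Both calculations agree here because `price` has no missing values (48,895 non-null entries in `df.info()`). The sketch below, on a small made-up series, shows how the two approaches would differ if a column did contain `NaN`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value.
toy = pd.Series([100.0, 200.0, np.nan])

manual = sum(toy) / len(toy)   # NaN propagates through the built-in sum
with_pandas = toy.mean()       # pandas skips NaN by default (skipna=True)

print(manual, with_pandas)
```

For columns with missing values, such as `reviews_per_month`, the pandas method is therefore the safer choice.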

In [None]:
# Average price by neighbourhood group (borough)
price_by_nh = df.groupby(['neighbourhood_group'])['price'].mean()
price_by_nh

neighbourhood_group
Bronx             87.496792
Brooklyn         124.383207
Manhattan        196.875814
Queens            99.517649
Staten Island    114.812332
Name: price, dtype: float64
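
Several statistics can be computed per group in a single call with `agg`. A minimal sketch on made-up data (the values below are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical toy listings.
toy = pd.DataFrame(
    {
        "neighbourhood_group": ["Bronx", "Bronx", "Manhattan", "Manhattan", "Manhattan"],
        "price": [60, 100, 100, 200, 900],
    }
)

# One row per group, one column per statistic.
summary_by_group = toy.groupby("neighbourhood_group")["price"].agg(["mean", "median", "count"])
print(summary_by_group)
```

Including `count` alongside the mean helps judge how much weight each group's average deserves.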

## Median price

In [None]:
# Calculation with pandas
median_price = round(df["price"].median(),2)
median_price

106.0
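
The median (106.0) is well below the mean (152.72), which is what one would expect when a few very expensive listings pull the mean upward. A small sketch with made-up prices illustrates the effect:

```python
import pandas as pd

# Hypothetical prices: four typical listings and one extreme value.
prices = pd.Series([50, 80, 100, 120, 5000])

mean_price = prices.mean()      # sensitive to the extreme value
median_value = prices.median()  # robust to the extreme value

print(mean_price, median_value)
```

In this toy case a single outlier moves the mean far above the median, while the median stays at a typical price.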

## Mode of all variables

In [None]:
# Calculation with pandas. First drop the identifier columns, since computing metrics on them is not meaningful.
df_s = df.drop(['id', 'host_id','host_name','name'], axis=1)
df_s

Unnamed: 0,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,43392.0,0.21,6,365
1,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,43606.0,0.38,2,355
2,Manhattan,Harlem,40.80902,-73.94190,Private room,150,3,0,,,1,365
3,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,43651.0,4.64,1,194
4,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,43423.0,0.10,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...
48890,Brooklyn,Bedford-Stuyvesant,40.67853,-73.94995,Private room,70,2,0,,,2,9
48891,Brooklyn,Bushwick,40.70184,-73.93317,Private room,40,4,0,,,2,36
48892,Manhattan,Harlem,40.81475,-73.94867,Entire home/apt,115,10,0,,,1,27
48893,Manhattan,Hell's Kitchen,40.75751,-73.99112,Shared room,55,1,0,,,6,2


The mode can be computed along rows or along columns. You can also restrict it to numeric variables and choose whether missing values are dropped. By default, `mode()` works column by column, covers every variable, and ignores missing values (NaN).

In [None]:
z = df_s.mode()
z

Unnamed: 0,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,Manhattan,Williamsburg,40.71813,-73.95677,Entire home/apt,100.0,1.0,0.0,43639.0,0.02,1.0,0.0
1,,,,-73.95427,,,,,,,,
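
The result has two rows because `longitude` has two values tied for most frequent; columns with a single mode are padded with `NaN` in the extra row. A minimal sketch on made-up data reproduces this behaviour and shows the `numeric_only` option:

```python
import pandas as pd

# Hypothetical toy data: `room_type` has one mode, `longitude` has a two-way tie.
toy = pd.DataFrame(
    {
        "room_type": ["Entire home/apt", "Entire home/apt", "Private room", "Shared room"],
        "longitude": [-73.95677, -73.95677, -73.95427, -73.95427],
    }
)

modes = toy.mode()                          # defaults: axis=0, numeric_only=False, dropna=True
numeric_modes = toy.mode(numeric_only=True)  # keeps only numeric columns

print(modes)
print(numeric_modes)
```

When only one value per column is needed, taking the first row (`modes.iloc[0]`) is a common choice, at the cost of hiding ties.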


# Measures of Dispersion #

## Standard Deviation


In [None]:
std_price = round(df["price"].std(),2)
std_price

240.15

In [None]:
# Standard deviation by neighbourhood group (borough)
std_by_nh = df.groupby(['neighbourhood_group'])['price'].std()
std_by_nh

neighbourhood_group
Bronx            106.709349
Brooklyn         186.873538
Manhattan        291.383183
Queens           167.102155
Staten Island    277.620403
Name: price, dtype: float64

Manhattan shows the greatest price dispersion, while the Bronx shows the smallest, so its prices stay closest to its own mean.
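
Standard deviations are easier to compare across groups when scaled by each group's mean. The sketch below computes the coefficient of variation (standard deviation divided by mean), a technique not used in the original analysis, from the borough means and standard deviations printed earlier:

```python
import pandas as pd

# Borough-level means and standard deviations copied from the outputs above.
stats = pd.DataFrame(
    {
        "mean": [87.496792, 124.383207, 196.875814, 99.517649, 114.812332],
        "std": [106.709349, 186.873538, 291.383183, 167.102155, 277.620403],
    },
    index=["Bronx", "Brooklyn", "Manhattan", "Queens", "Staten Island"],
)

# Coefficient of variation: dispersion relative to the group's own mean.
stats["cv"] = stats["std"] / stats["mean"]
print(stats.sort_values("cv", ascending=False).round(2))
```

With these figures, Staten Island has the largest relative dispersion even though Manhattan has the largest absolute standard deviation; the Bronx is the lowest on both measures.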

## Variance

In [None]:
var_price = round(df["price"].var(),2)
var_price

57674.03
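
The variance is the square of the standard deviation. Note that pandas computes the sample version by default (`ddof=1`), whereas NumPy's `np.var` defaults to the population version (`ddof=0`). A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Hypothetical toy values.
values = pd.Series([1.0, 2.0, 3.0, 4.0])

sample_var = values.var()                   # pandas default: ddof=1
sample_std = values.std()                   # pandas default: ddof=1
population_var = np.var(values.to_numpy())  # NumPy default: ddof=0

print(sample_var, sample_std ** 2, population_var)
```

On a dataset with tens of thousands of rows the two versions are nearly identical, but the difference matters for small samples.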

One function that summarizes many of the metrics above is *describe*.

In [None]:
df.describe()

Unnamed: 0,id,host_id,latitude,longitude,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
count,48895.0,48895.0,48895.0,48895.0,48895.0,48895.0,48895.0,38843.0,38843.0,48895.0,48895.0
mean,19017140.0,67620010.0,40.728949,-73.95217,152.720687,7.029962,23.274466,43377.074582,1.373221,7.143982,112.781327
std,10983110.0,78610970.0,0.05453,0.046157,240.15417,20.51055,44.550582,413.916984,1.680442,32.952519,131.622289
min,2539.0,2438.0,40.49979,-74.24442,0.0,1.0,0.0,40630.0,0.01,1.0,0.0
25%,9471945.0,7822033.0,40.6901,-73.98307,69.0,1.0,1.0,43289.0,0.19,1.0,0.0
50%,19677280.0,30793820.0,40.72307,-73.95568,106.0,3.0,5.0,43604.0,0.72,1.0,45.0
75%,29152180.0,107434400.0,40.763115,-73.936275,175.0,5.0,24.0,43639.0,2.02,2.0,227.0
max,36487240.0,274321300.0,40.91306,-73.71299,10000.0,1250.0,629.0,43654.0,58.5,327.0,365.0
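
Each row of the `describe` output corresponds to one of the individual methods used earlier. A small sketch on made-up data checks that correspondence:

```python
import pandas as pd

# Hypothetical toy data.
toy = pd.DataFrame({"price": [69, 106, 175, 250], "minimum_nights": [1, 3, 5, 30]})

summary = toy.describe()

# The "mean", "std", and "50%" rows match mean(), std(), and median().
print(summary.loc["mean", "price"], toy["price"].mean())
print(summary.loc["std", "price"], toy["price"].std())
print(summary.loc["50%", "price"], toy["price"].median())
```

By default `describe` covers only numeric columns; passing `include="all"` also summarizes text columns such as `room_type`.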
