So far we have tried to predict the price with a single variable: first `accommodates`, then `bedrooms`.

In reality, many more variables influence the price.

There are two ways to try to improve accuracy:

1. Change k
2. Increase the number of attributes (e.g. use a combination of `accommodates`, `bathrooms`, `bedrooms`, ...)

Our goal now is to add columns to the model. We have to be careful with columns that do not work well with the distance equation:

- Non-numeric values (the neighbourhood, for example): the Euclidean distance equation expects numbers.
- Null values: the distance equation cannot handle them.
- Non-ordinal values: numbers whose order does not mean "more" or "less" of something. Latitude and longitude are an example.
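
The first two checks can be sketched with pandas. This is a minimal illustration on a made-up frame (the values are hypothetical, not taken from the Airbnb file); `select_dtypes` and `isna` are standard pandas calls. Non-ordinal columns cannot be detected automatically and have to be identified by knowing what each column means.

```python
import pandas as pd

# Hypothetical toy data, only to illustrate the checks
toy_df = pd.DataFrame({
    "neighbourhood": ["Ruzafa", None, "Carmen"],  # non-numeric, has a null
    "bedrooms": [1.0, None, 2.0],                 # numeric, has a null
    "accommodates": [2, 6, 3],                    # numeric, complete
})

# Columns that are not numeric and cannot go into the distance equation as-is
non_numeric = toy_df.select_dtypes(exclude="number").columns.tolist()

# Number of null values per column
null_counts = toy_df.isna().sum()

print(non_numeric)
print(null_counts)
```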

In [1]:
import pandas as pd
import numpy as np
import math

valencia_df = pd.read_csv('airbnb_valencia.csv')
# clean up the price column (remove thousands separators and the dollar sign)
valencia_df['price'] = valencia_df['price'].str.replace(',', '', regex=False).str.replace('$', '', regex=False).astype('float')
# shuffle the data - it is important to do this BEFORE creating the training and test sets
np.random.seed(2)
valencia_df = valencia_df.loc[np.random.permutation(len(valencia_df))]

In [2]:
valencia_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6025 entries, 103 to 2575
Data columns (total 74 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            6025 non-null   int64  
 1   listing_url                                   6025 non-null   object 
 2   scrape_id                                     6025 non-null   int64  
 3   last_scraped                                  6025 non-null   object 
 4   name                                          6025 non-null   object 
 5   description                                   5848 non-null   object 
 6   neighborhood_overview                         3458 non-null   object 
 7   picture_url                                   6025 non-null   object 
 8   host_id                                       6025 non-null   int64  
 9   host_url                                      6025 non-null  

Columns that contain non-numeric values:
- neighbourhood
- license
- name
- description

Columns that contain numeric values but are not ordinal:
- latitude
- longitude
- id
- postal code (not in this dataset, but a classic non-ordinal example)


In [3]:
valencia_df[['host_response_rate', 'host_acceptance_rate','host_listings_count']]

Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count
103,100%,50%,1
257,,0%,0
3299,93%,93%,5
3056,100%,98%,4
696,100%,100%,12
...,...,...,...
1099,100%,50%,7
2514,100%,94%,5
3606,94%,100%,1
5704,85%,69%,4


In [4]:
valencia_df.columns

Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'description',
       'neighborhood_overview', 'picture_url', 'host_id', 'host_url',
       'host_name', 'host_since', 'host_location', 'host_about',
       'host_response_time', 'host_response_rate', 'host_acceptance_rate',
       'host_is_superhost', 'host_thumbnail_url', 'host_picture_url',
       'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'neighbourhood',
       'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude',
       'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms',
       'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'calendar_upd

In [5]:
drop_columns = ['id', 'listing_url', 'scrape_id', 'last_scraped', 'description',
       'neighborhood_overview', 'picture_url', 'host_id', 'host_url',
       'host_name', 'host_since', 'host_location', 'host_about',
       'host_response_time', 'host_response_rate', 'host_acceptance_rate',
       'host_is_superhost', 'host_thumbnail_url', 'host_picture_url',
       'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified',
       'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude',
       'longitude', 'property_type', 'room_type',
       'bathrooms_text', 'amenities',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'calendar_updated', 'has_availability',
       'availability_30', 'availability_60', 'availability_90',
       'availability_365', 'calendar_last_scraped',
       'number_of_reviews_ltm', 'number_of_reviews_l30d', 'first_review',
       'last_review', 'review_scores_rating', 'review_scores_accuracy',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'license',
       'calculated_host_listings_count',
       'calculated_host_listings_count_entire_homes',
       'calculated_host_listings_count_private_rooms',
       'calculated_host_listings_count_shared_rooms']

In [6]:
valencia_df = valencia_df.drop(drop_columns, axis=1)

In [7]:
valencia_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6025 entries, 103 to 2575
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   name               6025 non-null   object 
 1   neighbourhood      3458 non-null   object 
 2   accommodates       6025 non-null   int64  
 3   bathrooms          0 non-null      float64
 4   bedrooms           5717 non-null   float64
 5   beds               5917 non-null   float64
 6   price              6025 non-null   float64
 7   number_of_reviews  6025 non-null   int64  
 8   instant_bookable   6025 non-null   object 
 9   reviews_per_month  5104 non-null   float64
dtypes: float64(5), int64(2), object(3)
memory usage: 517.8+ KB


`bedrooms` has only 5717 non-null values out of 6025 rows, so it contains nulls. The same is true of `beds` (5917 non-null).
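
Two common ways to handle such gaps are dropping the affected rows or filling them with a representative value. A small sketch of both options on hypothetical data (not the Airbnb file), using the standard pandas calls `dropna` and `fillna`:

```python
import pandas as pd

# Hypothetical toy data with one gap in each column
toy_df = pd.DataFrame({
    "bedrooms": [1.0, None, 3.0, 2.0],
    "beds": [1.0, 2.0, None, 3.0],
})

# Option 1: drop the rows where 'beds' is missing
dropped = toy_df.dropna(subset=["beds"])

# Option 2: fill missing 'bedrooms' with the column mean (mean() skips NaN)
filled = toy_df.copy()
filled["bedrooms"] = filled["bedrooms"].fillna(filled["bedrooms"].mean())

print(len(dropped))
print(filled["bedrooms"].tolist())
```

Dropping rows loses data but invents nothing. Filling keeps every row at the cost of introducing values that were never observed.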

In [8]:
valencia_df['bathrooms'].describe()

count    0.0
mean     NaN
std      NaN
min      NaN
25%      NaN
50%      NaN
75%      NaN
max      NaN
Name: bathrooms, dtype: float64

In [9]:
valencia_df = valencia_df.drop(['bathrooms'], axis=1)

In [10]:
# What do we do with the missing values (bedrooms and beds)?

In [11]:
valencia_df.dropna(subset=['beds'])

Unnamed: 0,name,neighbourhood,accommodates,bedrooms,beds,price,number_of_reviews,instant_bookable,reviews_per_month
103,Habitación doble en Ruzafa con balcón,"Valencia, Valencian Community, Spain",2,1.0,1.0,88.0,2,f,0.02
257,Two bedroom flat; centre of Russafa,"Valencia, Valencian Community, Spain",6,2.0,4.0,95.0,6,f,0.06
3299,BIG AND CENTRAL PRIVATE ROOM,"València, Comunidad Valenciana, Spain",2,1.0,1.0,21.0,91,t,2.99
3056,Baño privado *WiFi* |Desayuno| 5M Estadio/Metro.,"València, Comunidad Valenciana, Spain",1,1.0,1.0,31.0,24,t,0.72
696,Apartment in Carmen,,3,1.0,2.0,41.0,422,t,6.34
...,...,...,...,...,...,...,...,...,...
1099,PISO ACOGEDOR EN EL CENTRO (Pl. Pilar),,6,2.0,3.0,89.0,45,f,0.73
2514,Loft 2 independiente Piscina Centro Valencia,"València, Comunidad Valenciana, Spain",4,1.0,1.0,64.0,94,f,2.27
3606,Habitación amplia y luminosa !,,2,1.0,1.0,36.0,3,f,2.81
5704,"FV-Nice Flat 2 rooms, good located,wifi high s...",,3,2.0,2.0,73.0,0,f,


In [12]:
valencia_df = valencia_df.dropna(subset=['beds'])

In [13]:
valencia_df.shape


(5917, 9)

In [14]:
valencia_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5917 entries, 103 to 2575
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   name               5917 non-null   object 
 1   neighbourhood      3411 non-null   object 
 2   accommodates       5917 non-null   int64  
 3   bedrooms           5623 non-null   float64
 4   beds               5917 non-null   float64
 5   price              5917 non-null   float64
 6   number_of_reviews  5917 non-null   int64  
 7   instant_bookable   5917 non-null   object 
 8   reviews_per_month  5032 non-null   float64
dtypes: float64(4), int64(2), object(3)
memory usage: 462.3+ KB


In [15]:
math.ceil(valencia_df['bedrooms'].mean())

2

In [16]:
valencia_df['bedrooms'] = valencia_df['bedrooms'].fillna(2)

In [17]:
valencia_df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 5917 entries, 103 to 2575
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   name               5917 non-null   object 
 1   neighbourhood      3411 non-null   object 
 2   accommodates       5917 non-null   int64  
 3   bedrooms           5917 non-null   float64
 4   beds               5917 non-null   float64
 5   price              5917 non-null   float64
 6   number_of_reviews  5917 non-null   int64  
 7   instant_bookable   5917 non-null   object 
 8   reviews_per_month  5032 non-null   float64
dtypes: float64(4), int64(2), object(3)
memory usage: 462.3+ KB


In [18]:
valencia_df = valencia_df.drop(['name', 'neighbourhood','instant_bookable', 'reviews_per_month'], axis=1)

In [19]:
valencia_df

Unnamed: 0,accommodates,bedrooms,beds,price,number_of_reviews
103,2,1.0,1.0,88.0,2
257,6,2.0,4.0,95.0,6
3299,2,1.0,1.0,21.0,91
3056,1,1.0,1.0,31.0,24
696,3,1.0,2.0,41.0,422
...,...,...,...,...,...
1099,6,2.0,3.0,89.0,45
2514,4,1.0,1.0,64.0,94
3606,2,1.0,1.0,36.0,3
5704,3,2.0,2.0,73.0,0


# Normalization



In [20]:
valencia_df['bedrooms'].describe()

count    5917.000000
mean        1.810039
std         0.979030
min         1.000000
25%         1.000000
50%         2.000000
75%         2.000000
max        10.000000
Name: bedrooms, dtype: float64

In [21]:
valencia_df['number_of_reviews'].describe()

count    5917.000000
mean       41.168836
std        66.148646
min         0.000000
25%         2.000000
50%        13.000000
75%        50.000000
max       689.000000
Name: number_of_reviews, dtype: float64

Standard normal distribution: mean 0, standard deviation 1.

To normalize a column:
- subtract the column mean from each value
- divide the result by the column standard deviation
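
The two steps amount to the z-score, z = (x - mean) / std. A minimal sketch with only the standard library, on hypothetical values; `statistics.stdev` is the sample standard deviation, which is also what pandas' `.std()` computes by default:

```python
from statistics import mean, stdev

# Hypothetical guest capacities
values = [2, 6, 2, 1, 3]

mu = mean(values)
sigma = stdev(values)  # sample standard deviation (n - 1 in the denominator)

# Subtract the mean, then divide by the standard deviation
z_scores = [(v - mu) / sigma for v in values]

print(mean(z_scores))   # approximately 0, up to floating-point error
print(stdev(z_scores))  # approximately 1
```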

In [22]:
valencia_df['accommodates'].mean()

3.663680919384823

In [23]:
(valencia_df['accommodates'] - valencia_df['accommodates'].mean()).mean()

6.484588090425116e-17

In [24]:
valencia_df['accommodates'].std()

2.0292288221988244

In [25]:
(valencia_df['accommodates'] / valencia_df['accommodates'].std()).std()

0.9999999999999998

In [26]:
(valencia_df['accommodates'] - valencia_df['accommodates'].mean())/valencia_df['accommodates'].std()

103    -0.819859
257     1.151333
3299   -0.819859
3056   -1.312657
696    -0.327061
          ...   
1099    1.151333
2514    0.165737
3606   -0.819859
5704   -0.327061
2575   -0.819859
Name: accommodates, Length: 5917, dtype: float64

In [27]:
valencia_df.head()

Unnamed: 0,accommodates,bedrooms,beds,price,number_of_reviews
103,2,1.0,1.0,88.0,2
257,6,2.0,4.0,95.0,6
3299,2,1.0,1.0,21.0,91
3056,1,1.0,1.0,31.0,24
696,3,1.0,2.0,41.0,422


In [28]:
valencia_df.mean()

accommodates          3.663681
bedrooms              1.810039
beds                  2.421159
price                90.420145
number_of_reviews    41.168836
dtype: float64

In [29]:
normalized_valencia_df = (valencia_df - valencia_df.mean())/valencia_df.std()

In [30]:
normalized_valencia_df.describe()

Unnamed: 0,accommodates,bedrooms,beds,price,number_of_reviews
count,5917.0,5917.0,5917.0,5917.0,5917.0
mean,6.964928e-17,-1.561105e-17,6.244418000000001e-17,-3.1222090000000005e-17,-5.5239080000000004e-17
std,1.0,1.0,1.0,1.0,1.0
min,-1.312657,-0.8273893,-0.8504439,-0.74006,-0.6223685
25%,-0.8198587,-0.8273893,-0.8504439,-0.4128417,-0.5921336
50%,0.1657374,0.19403,-0.2520283,-0.1765174,-0.4258415
75%,0.6585354,0.19403,0.3463872,0.0961645,0.1335048
max,6.079314,8.365384,11.71628,28.11878,9.793566


In [31]:
normalized_valencia_df.head()

Unnamed: 0,accommodates,bedrooms,beds,price,number_of_reviews
103,-0.819859,-0.827389,-0.850444,-0.021998,-0.592134
257,1.151333,0.19403,0.944803,0.041628,-0.531664
3299,-0.819859,-0.827389,-0.850444,-0.630987,0.753321
3056,-1.312657,-0.827389,-0.850444,-0.540093,-0.259549
696,-0.327061,-0.827389,-0.252028,-0.449199,5.757203


In [32]:
# restore the original (non-normalized) price: it is the target, not a feature
normalized_valencia_df['price'] = valencia_df['price']

In [33]:
normalized_valencia_df.head()

Unnamed: 0,accommodates,bedrooms,beds,price,number_of_reviews
103,-0.819859,-0.827389,-0.850444,88.0,-0.592134
257,1.151333,0.19403,0.944803,95.0,-0.531664
3299,-0.819859,-0.827389,-0.850444,21.0,0.753321
3056,-1.312657,-0.827389,-0.850444,31.0,-0.259549
696,-0.327061,-0.827389,-0.252028,41.0,5.757203


In [34]:
normalized_valencia_df.describe()

Unnamed: 0,accommodates,bedrooms,beds,price,number_of_reviews
count,5917.0,5917.0,5917.0,5917.0,5917.0
mean,6.964928e-17,-1.561105e-17,6.244418000000001e-17,90.420145,-5.5239080000000004e-17
std,1.0,1.0,1.0,110.018298,1.0
min,-1.312657,-0.8273893,-0.8504439,9.0,-0.6223685
25%,-0.8198587,-0.8273893,-0.8504439,45.0,-0.5921336
50%,0.1657374,0.19403,-0.2520283,71.0,-0.4258415
75%,0.6585354,0.19403,0.3463872,101.0,0.1335048
max,6.079314,8.365384,11.71628,3184.0,9.793566


In [35]:
from scipy.spatial import distance 

In [36]:
distance.euclidean([4,3], [6,7])

4.47213595499958

In [37]:
20**0.5

4.47213595499958
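
The two results match because the Euclidean distance is the square root of the summed squared differences: (6 - 4)² + (7 - 3)² = 4 + 16 = 20. A small sketch of the same computation using only the standard library (`math.dist` requires Python 3.8 or later):

```python
import math

p = [4, 3]
q = [6, 7]

# Sum of squared differences, coordinate by coordinate
squared_sum = sum((a - b) ** 2 for a, b in zip(p, q))
manual = math.sqrt(squared_sum)

# math.dist computes the same Euclidean distance directly
builtin = math.dist(p, q)

print(squared_sum)
print(manual, builtin)
```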

In [38]:
# compute the Euclidean distance, using accommodates and bedrooms, between the first row and the fifth row

distance.euclidean(
    normalized_valencia_df.iloc[0][['accommodates', 'bedrooms']],
    normalized_valencia_df.iloc[4][['accommodates', 'bedrooms']],
)

0.4927980467557245

In [39]:
normalized_valencia_df

Unnamed: 0,accommodates,bedrooms,beds,price,number_of_reviews
103,-0.819859,-0.827389,-0.850444,88.0,-0.592134
257,1.151333,0.194030,0.944803,95.0,-0.531664
3299,-0.819859,-0.827389,-0.850444,21.0,0.753321
3056,-1.312657,-0.827389,-0.850444,31.0,-0.259549
696,-0.327061,-0.827389,-0.252028,41.0,5.757203
...,...,...,...,...,...
1099,1.151333,0.194030,0.346387,89.0,0.057918
2514,0.165737,-0.827389,-0.850444,64.0,0.798673
3606,-0.819859,-0.827389,-0.850444,36.0,-0.577016
5704,-0.327061,0.194030,-0.252028,73.0,-0.622369


In [40]:
distance.euclidean(
    normalized_valencia_df.iloc[0][['accommodates', 'bedrooms', 'beds', 'number_of_reviews']],
    normalized_valencia_df.iloc[4][['accommodates', 'bedrooms', 'beds', 'number_of_reviews']],
)

6.396485161317652

# scikit-learn

scikit-learn includes almost all of the main machine learning algorithms.

Workflow:

- Instantiate the machine learning algorithm we want to use. Each algorithm is a Python class, so the first step is to identify the right one. We want the regressor because the target is a continuous variable: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html
- Fit the model to the training data (*fit*), which creates the "predictor" function
- Make predictions with the model
- Evaluate the accuracy
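
The four steps can be sketched end to end on a tiny made-up dataset. The arrays below are hypothetical and chosen so the result is easy to check by hand; only the scikit-learn calls already used in this notebook appear.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical data: one feature, target is exactly 10 times the feature
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y_train = np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])
X_test = np.array([[2.0], [5.0]])
y_test = np.array([20.0, 50.0])

# 1. Instantiate the algorithm
model = KNeighborsRegressor(n_neighbors=3)

# 2. Fit it to the training data
model.fit(X_train, y_train)

# 3. Predict on the test rows
predictions = model.predict(X_test)

# 4. Evaluate with the RMSE
rmse = mean_squared_error(y_test, predictions) ** 0.5

print(predictions)
print(rmse)
```

Each test point has three symmetric nearest neighbors here, so their mean equals the true target and the error is zero. Real data will not behave this neatly.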

In [41]:
from sklearn.neighbors import KNeighborsRegressor

In [42]:
# create an instance of the class that runs the algorithm
knn = KNeighborsRegressor()

In [43]:
knn

According to the documentation:
```python
class sklearn.neighbors.KNeighborsRegressor(n_neighbors=5, *, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None)
```

- `n_neighbors=5`: the number of neighbors
- `algorithm`: how the nearest neighbors are searched. `'brute'` computes the distance to every training point; the other options use optimized structures that skip some of those computations
- `p`: the power parameter of the Minkowski metric; `p=2` gives the Euclidean distance
- `weights`: the weight each of the nearest neighbors has in the result. `'uniform'` weights them equally, so the prediction is their plain mean
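
To make `n_neighbors`, `'brute'` and `'uniform'` concrete, here is a hand-written sketch of what a brute-force k-nearest-neighbors regressor does for a single query. The helper `knn_predict` and the data are hypothetical; this illustrates the idea and is not scikit-learn's implementation.

```python
import math

def knn_predict(train_rows, train_prices, query, k):
    """Predict a price as the plain mean of the k nearest training prices."""
    # Brute force: compute the distance from the query to every training row
    distances = [
        (math.dist(row, query), price)
        for row, price in zip(train_rows, train_prices)
    ]
    # Keep the k closest rows
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Uniform weighting: every neighbor counts the same
    return sum(price for _, price in nearest) / k

# Hypothetical normalized features: (accommodates, bedrooms)
train_rows = [(-0.8, -0.8), (1.2, 0.2), (-0.3, -0.8), (3.1, 2.2)]
train_prices = [40.0, 95.0, 50.0, 130.0]

prediction = knn_predict(train_rows, train_prices, query=(-0.5, -0.8), k=2)
print(prediction)
```

The two closest rows are the first and the third, so the prediction is the mean of 40 and 50.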

To reproduce what we have done so far, it would be something like:

```python
knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute')
```





In [44]:
knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute')

In [45]:
# fit the model
# for every model, the fit method takes two arguments:
# 1. DataFrame/matrix/NumPy array with the columns used to predict the result; the target value cannot be used as a predictor
# 2. Series/array/NumPy array with the target values (labels), in this case the price



In [46]:
# create the train and test sets
split_value = math.trunc(len(valencia_df)*0.8)

train_df = normalized_valencia_df[0:split_value].copy()
test_df = normalized_valencia_df[split_value:].copy()

In [47]:
train_df.shape

(4733, 5)

In [48]:
test_df.shape

(1184, 5)

In [49]:
# estimate using only accommodates and bedrooms
train_features = train_df[['accommodates', 'bedrooms']]
train_target = train_df['price']

In [50]:
knn.fit(train_features, train_target)

When `fit()` is called, sklearn stores the training data inside the `knn` instance.

If we pass data containing nulls or non-numeric values, the `fit()` method of this algorithm raises an error.


In [51]:
predicciones = knn.predict( test_df[['accommodates', 'bedrooms']] )

In [52]:
test_df

Unnamed: 0,accommodates,bedrooms,beds,price,number_of_reviews
3617,0.658535,1.215449,0.944803,31.0,-0.607251
859,0.165737,-0.827389,-0.252028,131.0,-0.622369
12,-0.819859,-0.827389,-0.252028,80.0,0.027683
4852,1.151333,1.215449,2.141634,200.0,-0.531664
5894,1.151333,1.215449,0.346387,140.0,-0.607251
...,...,...,...,...,...
1099,1.151333,0.194030,0.346387,89.0,0.057918
2514,0.165737,-0.827389,-0.850444,64.0,0.798673
3606,-0.819859,-0.827389,-0.850444,36.0,-0.577016
5704,-0.327061,0.194030,-0.252028,73.0,-0.622369


In [53]:
test_df['predicted_price'] = predicciones.copy()

In [54]:
test_df

Unnamed: 0,accommodates,bedrooms,beds,price,number_of_reviews,predicted_price
3617,0.658535,1.215449,0.944803,31.0,-0.607251,98.4
859,0.165737,-0.827389,-0.252028,131.0,-0.622369,174.2
12,-0.819859,-0.827389,-0.252028,80.0,0.027683,54.2
4852,1.151333,1.215449,2.141634,200.0,-0.531664,127.4
5894,1.151333,1.215449,0.346387,140.0,-0.607251,127.4
...,...,...,...,...,...,...
1099,1.151333,0.194030,0.346387,89.0,0.057918,95.6
2514,0.165737,-0.827389,-0.850444,64.0,0.798673,174.2
3606,-0.819859,-0.827389,-0.850444,36.0,-0.577016,54.2
5704,-0.327061,0.194030,-0.252028,73.0,-0.622369,83.8


Computing the MSE with sklearn: `mean_squared_error` (the RMSE is its square root).
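
For reference, both metrics are simple enough to compute by hand. A minimal sketch on hypothetical prices, using only the standard library:

```python
import math

# Hypothetical true and predicted prices
actual = [3.0, 5.0]
predicted = [1.0, 5.0]

# MSE: mean of the squared errors
mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# RMSE: square root of the MSE, expressed in the same units as the price
rmse = math.sqrt(mse)

print(mse, rmse)
```

The squared errors are 4 and 0, so the MSE is 2 and the RMSE is the square root of 2.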

In [55]:
from sklearn.metrics import mean_squared_error

In [56]:
mse = mean_squared_error(test_df['price'], test_df['predicted_price'])

In [57]:
rmse = mse**0.5

In [58]:
rmse

83.19114184263299

RMSE so far:

- `accommodates`: 81
- `bedrooms`: 89
- `bedrooms` + `accommodates`: 83

#### Now let's make a prediction with all four (bedrooms, beds, accommodates, number_of_reviews)

In [59]:
knn_4 = KNeighborsRegressor(n_neighbors=5, algorithm='brute')

In [60]:
train_df

Unnamed: 0,accommodates,bedrooms,beds,price,number_of_reviews
103,-0.819859,-0.827389,-0.850444,88.0,-0.592134
257,1.151333,0.194030,0.944803,95.0,-0.531664
3299,-0.819859,-0.827389,-0.850444,21.0,0.753321
3056,-1.312657,-0.827389,-0.850444,31.0,-0.259549
696,-0.327061,-0.827389,-0.252028,41.0,5.757203
...,...,...,...,...,...
4155,1.151333,0.194030,0.944803,83.0,-0.304902
4486,-0.327061,-0.827389,0.346387,63.0,-0.440959
1094,3.122526,2.236868,4.535296,130.0,-0.199079
5179,0.165737,0.194030,-0.252028,190.0,-0.592134


In [61]:
knn_4.fit(train_df[['accommodates', 'bedrooms', 'beds', 'number_of_reviews']], train_df['price'])

In [62]:
test_df['predictions_4'] = knn_4.predict(test_df[['accommodates', 'bedrooms', 'beds', 'number_of_reviews']])

In [63]:
test_df

Unnamed: 0,accommodates,bedrooms,beds,price,number_of_reviews,predicted_price,predictions_4
3617,0.658535,1.215449,0.944803,31.0,-0.607251,98.4,141.2
859,0.165737,-0.827389,-0.252028,131.0,-0.622369,174.2,106.0
12,-0.819859,-0.827389,-0.252028,80.0,0.027683,54.2,34.0
4852,1.151333,1.215449,2.141634,200.0,-0.531664,127.4,99.4
5894,1.151333,1.215449,0.346387,140.0,-0.607251,127.4,128.2
...,...,...,...,...,...,...,...
1099,1.151333,0.194030,0.346387,89.0,0.057918,95.6,89.2
2514,0.165737,-0.827389,-0.850444,64.0,0.798673,174.2,72.4
3606,-0.819859,-0.827389,-0.850444,36.0,-0.577016,54.2,85.6
5704,-0.327061,0.194030,-0.252028,73.0,-0.622369,83.8,104.6


In [64]:
mean_squared_error(test_df['price'], test_df['predictions_4'])**0.5

83.75370322591148

# Distance to the city center

City center: 39.470223, -0.376666

In [65]:
train_df

Unnamed: 0,accommodates,bedrooms,beds,price,number_of_reviews
103,-0.819859,-0.827389,-0.850444,88.0,-0.592134
257,1.151333,0.194030,0.944803,95.0,-0.531664
3299,-0.819859,-0.827389,-0.850444,21.0,0.753321
3056,-1.312657,-0.827389,-0.850444,31.0,-0.259549
696,-0.327061,-0.827389,-0.252028,41.0,5.757203
...,...,...,...,...,...
4155,1.151333,0.194030,0.944803,83.0,-0.304902
4486,-0.327061,-0.827389,0.346387,63.0,-0.440959
1094,3.122526,2.236868,4.535296,130.0,-0.199079
5179,0.165737,0.194030,-0.252028,190.0,-0.592134


In [66]:
valencia_df

Unnamed: 0,accommodates,bedrooms,beds,price,number_of_reviews
103,2,1.0,1.0,88.0,2
257,6,2.0,4.0,95.0,6
3299,2,1.0,1.0,21.0,91
3056,1,1.0,1.0,31.0,24
696,3,1.0,2.0,41.0,422
...,...,...,...,...,...
1099,6,2.0,3.0,89.0,45
2514,4,1.0,1.0,64.0,94
3606,2,1.0,1.0,36.0,3
5704,3,2.0,2.0,73.0,0


In [67]:
import pandas as pd
import numpy as np
import math

valencia_df = pd.read_csv('airbnb_valencia.csv')
# clean up the price column (remove thousands separators and the dollar sign)
valencia_df['price'] = valencia_df['price'].str.replace(',', '', regex=False).str.replace('$', '', regex=False).astype('float')
# shuffle the data - it is important to do this BEFORE creating the training and test sets
np.random.seed(2)
valencia_df = valencia_df.loc[np.random.permutation(len(valencia_df))]

In [68]:
valencia_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6025 entries, 103 to 2575
Data columns (total 74 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            6025 non-null   int64  
 1   listing_url                                   6025 non-null   object 
 2   scrape_id                                     6025 non-null   int64  
 3   last_scraped                                  6025 non-null   object 
 4   name                                          6025 non-null   object 
 5   description                                   5848 non-null   object 
 6   neighborhood_overview                         3458 non-null   object 
 7   picture_url                                   6025 non-null   object 
 8   host_id                                       6025 non-null   int64  
 9   host_url                                      6025 non-null  

In [69]:
centro = (39.470223, -0.376666)

In [70]:
from haversine import haversine

In [71]:
valencia_df[['latitude', 'longitude']]

Unnamed: 0,latitude,longitude
103,39.458468,-0.37205
257,39.461750,-0.37140
3299,39.468290,-0.39280
3056,39.481830,-0.36333
696,39.479470,-0.37757
...,...,...
1099,39.471860,-0.38231
2514,39.457120,-0.36859
3606,39.464290,-0.36240
5704,39.463840,-0.33938


In [72]:
def calcular_distancia_centro(punto):
    return haversine(punto, centro)

In [73]:
valencia_df['distancia_centro'] = valencia_df[['latitude', 'longitude']].apply(calcular_distancia_centro, axis= 1)
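
The `haversine` package returns the great-circle distance between two (latitude, longitude) points, in kilometers by default. As a rough sketch of what that computation involves, the formula can be written with the standard library alone. The helper `haversine_km` and the radius constant are assumptions of this sketch, not the package's code:

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; an assumption of this sketch

def haversine_km(point_a, point_b):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, point_a)
    lat2, lon2 = map(math.radians, point_b)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    # Haversine formula
    h = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

centro = (39.470223, -0.376666)
listing = (39.458468, -0.37205)  # first coordinates shown in the table above

print(haversine_km(listing, centro))
```

Unlike raw latitude and longitude, the resulting distance is ordinal: a larger value always means farther from the center, so it can be used as a feature.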

In [74]:
drop_columns = ['id', 'listing_url', 'scrape_id', 'last_scraped', 'description',
       'neighborhood_overview', 'picture_url', 'host_id', 'host_url',
       'host_name', 'host_since', 'host_location', 'host_about',
       'host_response_time', 'host_response_rate', 'host_acceptance_rate',
       'host_is_superhost', 'host_thumbnail_url', 'host_picture_url',
       'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified',
       'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude',
       'longitude', 'property_type', 'room_type',
       'bathrooms_text', 'amenities',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'calendar_updated', 'has_availability',
       'availability_30', 'availability_60', 'availability_90',
       'availability_365', 'calendar_last_scraped',
       'number_of_reviews_ltm', 'number_of_reviews_l30d', 'first_review',
       'last_review', 'review_scores_rating', 'review_scores_accuracy',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'license',
       'calculated_host_listings_count',
       'calculated_host_listings_count_entire_homes',
       'calculated_host_listings_count_private_rooms',
       'calculated_host_listings_count_shared_rooms']

In [75]:
valencia_df = valencia_df.drop(drop_columns, axis=1)

In [76]:
valencia_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6025 entries, 103 to 2575
Data columns (total 11 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   name               6025 non-null   object 
 1   neighbourhood      3458 non-null   object 
 2   accommodates       6025 non-null   int64  
 3   bathrooms          0 non-null      float64
 4   bedrooms           5717 non-null   float64
 5   beds               5917 non-null   float64
 6   price              6025 non-null   float64
 7   number_of_reviews  6025 non-null   int64  
 8   instant_bookable   6025 non-null   object 
 9   reviews_per_month  5104 non-null   float64
 10  distancia_centro   6025 non-null   float64
dtypes: float64(6), int64(2), object(3)
memory usage: 564.8+ KB


In [77]:
valencia_df = valencia_df.drop(['name', 'neighbourhood', 'bathrooms', 'bedrooms', 'beds', 'number_of_reviews', 'instant_bookable', 'reviews_per_month'], axis=1)

In [78]:
normalized_valencia_df_distancia = (valencia_df - valencia_df.mean())/valencia_df.std()

In [79]:
normalized_valencia_df_distancia

Unnamed: 0,accommodates,price,distancia_centro
103,-0.804854,-0.017261,-0.379824
257,1.163783,0.046071,-0.488556
3299,-0.804854,-0.623441,-0.367749
3056,-1.297013,-0.532967,-0.258099
696,-0.312695,-0.442492,-0.493248
...,...,...,...
1099,1.163783,-0.008214,-0.667289
2514,0.179465,-0.234401,-0.295904
3606,-0.804854,-0.487729,-0.371306
5704,-0.312695,-0.152973,0.268261


In [83]:
normalized_valencia_df_distancia['price'] = valencia_df['price']



In [81]:
knn_da = KNeighborsRegressor(n_neighbors=5, algorithm='brute')


In [82]:
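# first attempt: this fits on the full dataset; the model is refit on the training split only in a later cell
knn_da.fit(normalized_valencia_df_distancia[['accommodates', 'distancia_centro']], normalized_valencia_df_distancia['price'])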

In [85]:
len(normalized_valencia_df_distancia)*0.8

4820.0

In [87]:
nv_train = normalized_valencia_df_distancia[0:4820].copy()
nv_test = normalized_valencia_df_distancia[4820:].copy()


In [89]:
knn_da = KNeighborsRegressor(n_neighbors=5, algorithm='brute')
knn_da.fit(nv_train[['accommodates', 'distancia_centro']], nv_train['price'])

In [91]:
nv_test['predicted'] = knn_da.predict(nv_test[['accommodates', 'distancia_centro']])

In [92]:
nv_test

Unnamed: 0,accommodates,price,distancia_centro,predicted
2469,0.179465,61.0,0.424645,71.6
3617,0.671624,31.0,-0.098633,68.4
859,0.179465,131.0,-0.793968,80.8
12,-0.804854,80.0,-0.719579,84.6
4852,1.163783,200.0,0.697574,144.4
...,...,...,...,...
1099,1.163783,89.0,-0.667289,109.2
2514,0.179465,64.0,-0.295904,88.6
3606,-0.804854,36.0,-0.371306,33.6
5704,-0.312695,73.0,0.268261,103.0


In [93]:
mean_squared_error(nv_test['price'], nv_test['predicted'])**0.5

89.54033931779145