## Checking prediction quality

Split the dataframe into two partitions:

- The training set (75-80% of the rows)
- The test set (the remaining rows)
- The two sets must not overlap: their intersection is empty

For each test row we look up its k nearest neighbours in the training set and compare the resulting prediction with the actual price.
The predictions are stored in the 'predicted_price' column.

In [1]:
import pandas as pd
import numpy as np
import math


In [2]:
madrid = pd.read_csv('data/airbnb-madrid.csv')

In [3]:
# data cleaning: strip '$' and thousands separators from price, then cast to float
madrid['price'] = madrid['price'].apply(lambda x: x.replace('$', '').replace(',', '')).astype(float)


In [4]:
np.random.seed(3943)
madrid = madrid.iloc[
    np.random.permutation(madrid.shape[0])
]

In [5]:
# split point (at 80%)
corte = math.trunc(len(madrid)*0.8)

In [6]:
corte

16544

In [7]:
train = madrid.iloc[:corte].copy()
test = madrid.iloc[corte:].copy()

In [8]:
train.shape

(16544, 75)

In [9]:
test.shape

(4137, 75)
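
The properties listed at the top of this section (an 80/20 split with an empty intersection) can be checked directly on the row indexes. A minimal sketch on a toy frame; `toy`, `toy_train`, `toy_test` and `cut` are hypothetical names that are not part of the notebook:

```python
import math

import numpy as np
import pandas as pd

# Toy stand-in for the real dataframe (hypothetical data).
toy = pd.DataFrame({"price": np.arange(10, dtype=float)})

# Shuffle the rows, then cut at 80%, mirroring the cells above.
rng = np.random.default_rng(0)
toy = toy.iloc[rng.permutation(len(toy))]
cut = math.trunc(len(toy) * 0.8)
toy_train = toy.iloc[:cut].copy()
toy_test = toy.iloc[cut:].copy()

# The partitions must not share rows and must cover the whole frame.
shared = toy_train.index.intersection(toy_test.index)
is_disjoint = shared.empty
covers_all = len(toy_train) + len(toy_test) == len(toy)
```

Because the split is positional (`iloc`) over a frame with a unique index, the two checks should hold by construction; the sketch only makes that explicit.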

In [10]:
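def predict_price(mi_capacidad, train_df, k=5):
    # k-nearest-neighbours estimate: mean price of the k training rows
    # whose 'accommodates' value is closest to mi_capacidad
    _df = train_df.copy()
    _df['distance'] = _df['accommodates'].apply(lambda x: np.abs(mi_capacidad - x))
    return _df.sort_values('distance').iloc[0:k]['price'].mean()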
    

In [11]:
test['predicted_price'] = test['accommodates'].apply(
    lambda x: predict_price(x, train, 5)
)

In [12]:
# number of distance computations performed above: train rows x test rows
train.shape[0]*test.shape[0]

68442528
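
That product is the cost of scanning the whole training set once per test row. Since the prediction depends only on the value of 'accommodates', one possible shortcut is to predict once per distinct value and map the result back. A sketch on hypothetical toy data (`toy_train`, `toy_test` and `cache` are illustrative names), assuming the prediction function is deterministic for a given input:

```python
import pandas as pd


def predict_price(mi_capacidad, train_df, k=5):
    # Same idea as the notebook's function: mean price of the k nearest rows.
    _df = train_df.copy()
    _df["distance"] = (_df["accommodates"] - mi_capacidad).abs()
    return _df.sort_values("distance").iloc[0:k]["price"].mean()


# Hypothetical toy data standing in for train and test.
toy_train = pd.DataFrame(
    {"accommodates": [1, 2, 2, 4, 5], "price": [20.0, 30.0, 40.0, 80.0, 100.0]}
)
toy_test = pd.DataFrame({"accommodates": [2, 2, 4, 2, 4]})

# One prediction per distinct value instead of one per test row.
unique_values = toy_test["accommodates"].unique()
cache = {v: predict_price(v, toy_train, k=2) for v in unique_values}
toy_test["predicted_price"] = toy_test["accommodates"].map(cache)
```

Here five test rows need only two calls to `predict_price`, because 'accommodates' takes just two distinct values.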

In [14]:
# compare the actual prices with their predictions
# if the predicted value is $75 and the actual price is $80, the difference is $5
test[['price','predicted_price']]

       price  predicted_price
8870   300.0             34.8
4076    25.0             34.8
19649  181.0             95.6
7395   142.0            161.2
18364  315.0            193.8
...      ...              ...
3351    25.0             32.0
970     20.0             32.0
4766    20.0             34.8
4563    32.0             34.8


In [16]:
(test['price']-test['predicted_price']).mean() # not a valid error measure: negative and positive differences cancel out

69.43988397389413

In [17]:
(np.abs(test['price']-test['predicted_price'])).mean() # MAE: mean absolute error

93.85022963500121

In [20]:
(np.abs(test['price']-test['predicted_price'])).value_counts(bins=10)

(-40.68, 4067.92]       4130
(4067.92, 8135.84]         4
(8135.84, 12203.76]        2
(36611.28, 40679.2]        1
(12203.76, 16271.68]       0
(16271.68, 20339.6]        0
(20339.6, 24407.52]        0
(24407.52, 28475.44]       0
(28475.44, 32543.36]       0
(32543.36, 36611.28]       0
dtype: int64
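
With equal-width bins, a handful of extreme errors stretch the range so far that almost every row falls into the first bin. Quantiles are one way to look at the bulk of the distribution instead. A sketch on hypothetical absolute errors (`abs_errors` is an illustrative series, not the notebook's data):

```python
import pandas as pd

# Hypothetical absolute errors: mostly small, with one extreme value.
abs_errors = pd.Series([5.0, 10.0, 5.0, 10.0, 5.0, 10.0, 5.0, 10.0, 5.0, 1000.0])

# Quantiles describe where most errors lie, regardless of the extreme value.
summary = abs_errors.quantile([0.5, 0.9])
median_error = summary.loc[0.5]

# The mean, by contrast, is pulled up by the single large error.
mean_error = abs_errors.mean()
```

In this toy series the median error is 7.5 while the mean is 106.5, which is the same kind of gap the bins above hint at.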

In [21]:
mae_accommodates = (np.abs(test['price']-test['predicted_price'])).mean() # MAE: mean absolute error

Trying another variable

In [22]:
madrid.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 20681 entries, 682 to 15043
Data columns (total 75 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            20681 non-null  int64  
 1   listing_url                                   20681 non-null  object 
 2   scrape_id                                     20681 non-null  int64  
 3   last_scraped                                  20681 non-null  object 
 4   source                                        20681 non-null  object 
 5   name                                          20677 non-null  object 
 6   description                                   19971 non-null  object 
 7   neighborhood_overview                         11472 non-null  object 
 8   picture_url                                   20680 non-null  object 
 9   host_id                                       20681 non-nul

In [23]:
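def predict_price_by_bedrooms(mi_capacidad, train_df, k=5):
    # same k-nearest-neighbours estimate, but the distance is measured on 'bedrooms'
    # (mi_capacidad here holds the number of bedrooms of the test row)
    _df = train_df.copy()
    _df['distance'] = _df['bedrooms'].apply(lambda x: np.abs(mi_capacidad - x))
    return _df.sort_values('distance').iloc[0:k]['price'].mean()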

In [24]:
test['predicted_price_bedrooms'] = test['bedrooms'].apply(
    lambda x: predict_price_by_bedrooms(x, train, 5)
)

In [25]:
mae_bedrooms = (np.abs(test['predicted_price_bedrooms']-test['price'])).mean()


In [26]:
mae_bedrooms

104.30679236161471
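
The two prediction functions differ only in the column used for the distance, so they could be folded into one function that takes the column name as a parameter. A sketch, where `predict_price_by` and the toy frame are hypothetical; it uses a stable sort, so ties may be broken differently than in the notebook's functions:

```python
import pandas as pd


def predict_price_by(column, value, train_df, k=5):
    # Distance is the absolute difference on the chosen column.
    distance = (train_df[column] - value).abs()
    # Pick the k rows with the smallest distance (stable sort for ties).
    nearest = distance.sort_values(kind="stable").index[:k]
    # Average the price of those rows (assumes a unique index).
    return train_df.loc[nearest, "price"].mean()


# Hypothetical toy training data.
toy_train = pd.DataFrame(
    {
        "accommodates": [1, 2, 4, 6],
        "bedrooms": [1, 1, 2, 3],
        "price": [20.0, 30.0, 80.0, 120.0],
    }
)

by_accommodates = predict_price_by("accommodates", 2, toy_train, k=2)
by_bedrooms = predict_price_by("bedrooms", 3, toy_train, k=2)
```

With a helper like this, trying a third variable would only require changing the column name.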

In [27]:
test[
    ['predicted_price', 'predicted_price_bedrooms', 'price']
]

       predicted_price  predicted_price_bedrooms  price
8870              34.8                      63.6  300.0
4076              34.8                      63.6   25.0
19649             95.6                     221.4  181.0
7395             161.2                      63.6  142.0
18364            193.8                     221.4  315.0
...                ...                       ...    ...
3351              32.0                      63.6   25.0
970               32.0                      63.6   20.0
4766              34.8                      63.6   20.0
4563              34.8                      63.6   32.0


In [28]:
mae_accommodates

93.85022963500121

MSE: Mean Squared Error

In [30]:
mse_bedrooms = ((test['predicted_price_bedrooms']-test['price'])**2).mean()

In [31]:
mse_bedrooms

534826.5116364516

RMSE: Root Mean Squared Error (the square root of the MSE)

In [32]:
rmse_bedrooms = mse_bedrooms**0.5

In [33]:
rmse_bedrooms

731.3183380966537

In [34]:
mse_accommodates = ((test['predicted_price']-test['price'])**2).mean()
mse_accommodates

535645.3508145998
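
The three metrics used so far can be wrapped in small helpers so each one is written only once. A sketch, where `mae`, `mse` and `rmse` are hypothetical helper names and the two series are toy values (the first pair reuses the $80 actual / $75 predicted example from the earlier comment):

```python
import pandas as pd


def mae(actual, predicted):
    # Mean absolute error: average size of the errors, ignoring their sign.
    return (actual - predicted).abs().mean()


def mse(actual, predicted):
    # Mean squared error: average of the squared errors.
    return ((actual - predicted) ** 2).mean()


def rmse(actual, predicted):
    # Root mean squared error: back in the same units as the price.
    return mse(actual, predicted) ** 0.5


# Toy values: errors of 5, -10 and 0.
actual = pd.Series([80.0, 25.0, 100.0])
predicted = pd.Series([75.0, 35.0, 100.0])

toy_mae = mae(actual, predicted)
toy_mse = mse(actual, predicted)
toy_rmse = rmse(actual, predicted)
```

On the notebook's data these would be called as, for example, `mae(test['price'], test['predicted_price'])`.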

In [41]:
test['distancia'] = test['price'] - test['predicted_price']

In [43]:
test.sort_values('distancia').tail()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,predicted_price,predicted_price_bedrooms,distancia
14223,51921321,https://www.airbnb.com/rooms/51921321,20220911230855,2022-09-12,city scrape,Apartamento,Habitación tranquila,,https://a0.muscache.com/pictures/7cf526d7-e68d...,409616536,...,,t,1,0,1,0,,34.8,63.6,7965.2
11327,42524503,https://www.airbnb.com/rooms/42524503,20220911230855,2022-09-12,previous scrape,Apartamento único con vistas a la Gran Vía,Espectacular apartamento en la planta 11 del P...,,https://a0.muscache.com/pictures/c2c1f4a4-5cb6...,179623389,...,VT-6567,t,1,1,0,0,,34.8,63.6,7965.2
11940,44305807,https://www.airbnb.com/rooms/44305807,20220911230855,2022-09-12,city scrape,Alquilo Apartamento Centro,,,https://a0.muscache.com/pictures/e56fcee2-0071...,290977158,...,,t,1,1,0,0,,32.0,63.6,8437.0
10740,40887842,https://www.airbnb.com/rooms/40887842,20220911230855,2022-09-12,previous scrape,Great apartment in Chamberi Free WiFi,This exterior apartment of 1 bedroom could be ...,Is located in the touristic Madrid and with ex...,https://a0.muscache.com/pictures/76783e8b-2dfa...,4062786,...,VT-9885,t,4,4,0,0,0.16,95.6,63.6,9903.4
3831,19078457,https://www.airbnb.com/rooms/19078457,20220911230855,2022-09-12,city scrape,Double room with balcony in Atocha,IMPORTANT: In case you come 1 or 2 people you ...,"It is a central neighborhood of Madrid, with t...",https://a0.muscache.com/pictures/9011bf26-6b18...,60469442,...,,t,1,0,1,0,2.06,34.8,63.6,40679.2


Why MAE and RMSE differ

In [45]:
diferencias_uno = pd.Series(
    [5, 10, 5, 10, 5, 10, 5, 10, 5, 10 , 5, 10 ]
)
diferencias_dos = pd.Series(
    [5, 10, 5, 10, 5, 10, 5, 10, 5, 10 , 1000]
)

In [46]:
diferencias_uno.sum()/len(diferencias_uno) # MAE of the first series

7.5

In [48]:
((diferencias_uno**2).sum()/len(diferencias_uno))**0.5 # RMSE of the first series

7.905694150420948

In [49]:
diferencias_dos.sum()/len(diferencias_dos) # MAE of the second series

97.72727272727273

In [50]:
((diferencias_dos**2).sum()/len(diferencias_dos))**0.5 # RMSE of the second series

301.60555215530945
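
One way to see why a single large error moves RMSE so much more than MAE is to measure how much of each total that error accounts for. A sketch reusing the second series; `share_abs` and `share_sq` are hypothetical names:

```python
import pandas as pd

diferencias_dos = pd.Series([5, 10, 5, 10, 5, 10, 5, 10, 5, 10, 1000])

# Share of the total absolute error contributed by the largest value.
share_abs = diferencias_dos.max() / diferencias_dos.sum()

# Share of the total squared error contributed by the same value.
share_sq = diferencias_dos.max() ** 2 / (diferencias_dos ** 2).sum()
```

The largest value makes up 1000 of 1075 in the absolute sum but 1000000 of 1000625 in the squared sum, so squaring lets it dominate the metric almost entirely. That is why RMSE is roughly three times MAE for the second series, while the two stay close for the first.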