So far we have tried to predict the price from a single variable at a time (`accommodates` and `bedrooms`).

However, other variables also influence the price.

There are several ways to try to improve accuracy:
1. Change the value of k
2. Change the variables (increase the number of attributes, e.g. `accommodates`, `bedrooms`, `bathrooms`)


We are now going to add columns to the model. We have to be careful with variables that will not work with the distance equation:

- Non-numeric (categorical) values: the distance needs numbers
- Null values: the distance cannot handle nulls
- Non-ordinal values: numeric values that do not represent an order (e.g. latitude and longitude)
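
The first two problems can be screened for mechanically by looking at dtypes and null counts. A minimal sketch, assuming a small made-up frame (`toy` and its values are invented for illustration; only the column names mirror the listings dataset):

```python
import numpy as np
import pandas as pd

# Invented toy data; only the column names mirror the Airbnb dataset.
toy = pd.DataFrame({
    "accommodates": [4, 2, 1],
    "bedrooms": [1.0, np.nan, 1.0],
    "room_type": ["Entire home/apt", "Private room", "Private room"],
})

# Columns whose dtype is not numeric cannot go into a distance formula as-is.
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()

# Columns with at least one null need rows dropped or values filled in.
with_nulls = [col for col in toy.columns if toy[col].isna().any()]

print("non-numeric:", non_numeric)
print("with nulls:", with_nulls)
```

The third problem (numeric but non-ordinal columns) cannot be detected from the dtype and has to be judged column by column.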

In [1]:
import pandas as pd
import numpy as np
import math


In [2]:
madrid = pd.read_csv('data/airbnb-madrid.csv')
# data cleaning: strip the currency symbol and thousands separator, then cast to float
madrid['price'] = madrid['price'].apply(lambda x: x.replace('$', '').replace(',', '')).astype(float)
np.random.seed(3943)
madrid = madrid.iloc[
    np.random.permutation(madrid.shape[0])
]


# drop outliers (listings that are not in the same league)
madrid = madrid[madrid['price']<300]

In [3]:
madrid.shape

(19130, 75)

In [4]:
madrid.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19130 entries, 682 to 15043
Data columns (total 75 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            19130 non-null  int64  
 1   listing_url                                   19130 non-null  object 
 2   scrape_id                                     19130 non-null  int64  
 3   last_scraped                                  19130 non-null  object 
 4   source                                        19130 non-null  object 
 5   name                                          19126 non-null  object 
 6   description                                   18518 non-null  object 
 7   neighborhood_overview                         10750 non-null  object 
 8   picture_url                                   19129 non-null  object 
 9   host_id                                       19130 non-nul

In [5]:
madrid[
    ['bathrooms', 'bathrooms_text']
]

Unnamed: 0,bathrooms,bathrooms_text
682,,1 bath
11870,,1 bath
3526,,2 baths
13270,,1 bath
15946,,1 bath
...,...,...
3351,,1 bath
970,,1.5 shared baths
4766,,1.5 shared baths
4563,,1 shared bath


In [6]:
madrid['bathrooms_text'].info()

<class 'pandas.core.series.Series'>
Int64Index: 19130 entries, 682 to 15043
Series name: bathrooms_text
Non-Null Count  Dtype 
--------------  ----- 
19108 non-null  object
dtypes: object(1)
memory usage: 298.9+ KB


In [7]:
madrid

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
682,3389625,https://www.airbnb.com/rooms/3389625,20220911230855,2022-09-12,city scrape,APT CALME DANS LE CENTRE HISTORIQUE,"Located in the heart of the typical Madrid, wa...",The most tipical neighborhood in Madrid where ...,https://a0.muscache.com/pictures/89933321/7c4f...,4794809,...,4.97,4.85,4.84,06/014204.9/17,f,1,1,0,0,2.23
11870,44149063,https://www.airbnb.com/rooms/44149063,20220911230855,2022-09-12,city scrape,A Renovated Apartment in the Downtown_3,"This is a sweet, clean apartment, with elegant...","The apartment is in the downtown, there are a ...",https://a0.muscache.com/pictures/806691c6-6db0...,341778007,...,4.62,4.92,4.48,,t,7,7,0,0,3.90
3526,18154095,https://www.airbnb.com/rooms/18154095,20220911230855,2022-09-12,city scrape,Flamingo Madrid Apartment,Premium Location ( walking distance from Madri...,Madrid's most central and fancy residential ne...,https://a0.muscache.com/pictures/28bdb3d1-97c3...,5144517,...,4.31,4.93,4.03,,t,8,8,0,0,0.46
13270,49645907,https://www.airbnb.com/rooms/49645907,20220911230855,2022-09-12,city scrape,ADELFAS 03 RESIDENTIAL ONE BEDROOM,Our apartment is located in a modern building ...,The Pacific neighborhood occupies the entire s...,https://a0.muscache.com/pictures/ad0f6fcc-2ff7...,392790154,...,4.50,4.00,3.00,,t,48,48,0,0,0.23
15946,555230367661086584,https://www.airbnb.com/rooms/555230367661086584,20220911230855,2022-09-12,city scrape,BLUE SPACE DELUXE,Precioso y lujoso apartamento de un dormitorio...,,https://a0.muscache.com/pictures/miso/Hosting-...,52530675,...,4.83,3.83,4.50,,f,44,44,0,0,0.91
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3351,17853779,https://www.airbnb.com/rooms/17853779,20220911230855,2022-09-12,previous scrape,Vallehermoso,"Very quiet house, very close to the city cente...",,https://a0.muscache.com/pictures/da5e9279-c7de...,53928475,...,4.92,4.79,4.83,,f,1,0,1,0,0.38
970,5350676,https://www.airbnb.com/rooms/5350676,20220911230855,2022-09-12,city scrape,Luminosa y tranquila hab individual,Alquilo habitación sencilla en piso acogedor e...,,https://a0.muscache.com/pictures/1cc85ba2-91c7...,3360175,...,4.90,4.63,4.58,,f,4,0,4,0,0.78
4766,21723324,https://www.airbnb.com/rooms/21723324,20220911230855,2022-09-12,previous scrape,Habitación doble luminosa en Marqués de Vadillo,Amplia habitación en amplio piso compartido. T...,,https://a0.muscache.com/pictures/9ec4fb08-db4f...,29642251,...,4.70,4.30,4.50,,f,2,0,2,0,0.17
4563,20810915,https://www.airbnb.com/rooms/20810915,20220911230855,2022-09-12,previous scrape,"Cama doble colchon viscolastico, piso centrico","Cama doble, armario, cajonera, estantería. Cas...",,https://a0.muscache.com/pictures/2a784d05-6df9...,149233829,...,,,,,t,1,0,1,0,0.02


In [8]:
def convert_half_in_number(x):
    if ('half' in str(x).lower()):
        return '0.5'
    else:
        return x
    
convert_half_in_number("single  batch")

'single  batch'

In [9]:
madrid['bathrooms_text'].apply(
    convert_half_in_number
)

682                1 bath
11870              1 bath
3526              2 baths
13270              1 bath
15946              1 bath
               ...       
3351               1 bath
970      1.5 shared baths
4766     1.5 shared baths
4563        1 shared bath
15043      3 shared baths
Name: bathrooms_text, Length: 19130, dtype: object

In [10]:
import re

def extract_number(x):
    # leading run of digits and dots; '|' inside a character class would be a literal pipe
    return re.search(r"[0-9.]*", str(x)).group(0)

In [11]:
madrid['bathrooms'] = madrid['bathrooms_text'].apply(
    convert_half_in_number
).apply(extract_number).apply(lambda x: '0' if str(x) == '' else x).astype('float')

In [12]:
madrid['bathrooms'].describe()

count    19130.000000
mean         1.258730
std          0.595462
min          0.000000
25%          1.000000
50%          1.000000
75%          1.500000
max         20.000000
Name: bathrooms, dtype: float64

In [13]:
madrid['number_of_reviews'].describe()

count    19130.000000
mean        40.295661
std         75.359670
min          0.000000
25%          1.000000
50%          9.000000
75%         43.000000
max        845.000000
Name: number_of_reviews, dtype: float64

Standard normal distribution: mean 0 and standard deviation 1

To normalize (standardize) a column:
 - subtract the column mean from each value
 - divide the result by the column standard deviation
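
In symbols, each value $x$ of a column with mean $\mu$ and standard deviation $\sigma$ becomes a z-score. As a worked check, taking the `accommodates` mean (about 2.977) and standard deviation (about 1.745) of this dataset, a listing that accommodates 4 guests maps to roughly:

$$
z = \frac{x - \mu}{\sigma}, \qquad z_{x=4} = \frac{4 - 2.977}{1.745} \approx 0.586
$$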
 

In [14]:
madrid['accommodates'].mean()

2.977051751176163

In [15]:
madrid['accommodates'].std()

1.7447999066793247

In [16]:
accommodates_normalized = (madrid['accommodates']-madrid['accommodates'].mean())/madrid['accommodates'].std()

In [17]:
madrid['accommodates']

682      4
11870    4
3526     4
13270    4
15946    4
        ..
3351     1
970      1
4766     2
4563     2
15043    2
Name: accommodates, Length: 19130, dtype: int64

In [18]:
accommodates_normalized.mean()

1.0288569670964336e-16

In [19]:
accommodates_normalized.std()


1.0

In [20]:
accommodates_normalized.head()

682      0.586284
11870    0.586284
3526     0.586284
13270    0.586284
15946    0.586284
Name: accommodates, dtype: float64

In [21]:
madrid.columns

Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'source', 'name',
       'description', 'neighborhood_overview', 'picture_url', 'host_id',
       'host_url', 'host_name', 'host_since', 'host_location', 'host_about',
       'host_response_time', 'host_response_rate', 'host_acceptance_rate',
       'host_is_superhost', 'host_thumbnail_url', 'host_picture_url',
       'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'neighbourhood',
       'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude',
       'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms',
       'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'ca

In [22]:
df = madrid[['accommodates', 'bathrooms', 'bedrooms', 'price']].copy()

In [23]:
df.head()

Unnamed: 0,accommodates,bathrooms,bedrooms,price
682,4,1.0,1.0,95.0
11870,4,1.0,1.0,76.0
3526,4,2.0,2.0,161.0
13270,4,1.0,1.0,108.0
15946,4,1.0,1.0,79.0


In [24]:
df = df[df['bedrooms'].notna()]

In [25]:
df

Unnamed: 0,accommodates,bathrooms,bedrooms,price
682,4,1.0,1.0,95.0
11870,4,1.0,1.0,76.0
3526,4,2.0,2.0,161.0
13270,4,1.0,1.0,108.0
15946,4,1.0,1.0,79.0
...,...,...,...,...
3351,1,1.0,1.0,25.0
970,1,1.5,1.0,20.0
4766,2,1.5,1.0,20.0
4563,2,1.0,1.0,32.0


In [26]:
df['bedrooms'].isna().sum()

0

In [27]:
target = df['price']

In [28]:
target.head()

682       95.0
11870     76.0
3526     161.0
13270    108.0
15946     79.0
Name: price, dtype: float64

In [29]:
df = (df- df.mean())/df.std()

In [30]:
df

Unnamed: 0,accommodates,bathrooms,bedrooms,price
682,0.549637,-0.455273,-0.502651,0.074486
11870,0.549637,-0.455273,-0.502651,-0.234200
3526,0.549637,1.186654,0.713334,1.146762
13270,0.549637,-0.455273,-0.502651,0.285692
15946,0.549637,-0.455273,-0.502651,-0.185460
...,...,...,...,...
3351,-1.133135,-0.455273,-0.502651,-1.062777
970,-1.133135,0.365690,-0.502651,-1.144010
4766,-0.572211,0.365690,-0.502651,-1.144010
4563,-0.572211,-0.455273,-0.502651,-0.949051


In [31]:
df['price'] = target

In [32]:
df.head()

Unnamed: 0,accommodates,bathrooms,bedrooms,price
682,0.549637,-0.455273,-0.502651,95.0
11870,0.549637,-0.455273,-0.502651,76.0
3526,0.549637,1.186654,0.713334,161.0
13270,0.549637,-0.455273,-0.502651,108.0
15946,0.549637,-0.455273,-0.502651,79.0


In [33]:
# requires the scipy package
# install it in the environment with: pip install scipy
from scipy.spatial import distance

In [34]:
distance.euclidean([3], [2])

1.0

In [35]:
distance.euclidean([1,2], [4,0])

3.605551275463989

In [36]:
distance.euclidean([1,2,7,5], [7,3,8,0])

7.937253933193772

In [37]:
distance.euclidean(
    df[['accommodates','bathrooms', 'bedrooms']].iloc[0],
    df[['accommodates','bathrooms', 'bedrooms']].iloc[55]
)

2.8012506720042434

# scikit-learn

It provides almost every classical machine learning algorithm, although it does not specialize in neural networks (*deep learning*).

The library is imported as `sklearn` and installed with `pip install scikit-learn`.



Workflow:
- Import the algorithm we want to use. Each one is a Python class, so we first have to identify which algorithm we need
- Create an instance of the algorithm
- Fit the model to the training data (*fit*), which 'creates the predictor function'
- Make predictions with the model
- Evaluate the predictions
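
The five steps can be sketched end to end as follows. This is only an illustration on synthetic data (the arrays, the 160/40 split and the name `model` are assumptions, not part of the Madrid analysis):

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor  # step 1: import the class

# Synthetic data invented for illustration: y depends on two features plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Simple positional split: first 160 rows to train, last 40 to test.
X_train, X_test = X[:160], X[160:]
y_train, y_test = y[:160], y[160:]

model = KNeighborsRegressor(n_neighbors=5)         # step 2: create an instance
model.fit(X_train, y_train)                        # step 3: fit to training data
y_pred = model.predict(X_test)                     # step 4: predict
rmse = mean_squared_error(y_test, y_pred) ** 0.5   # step 5: evaluate

print(rmse)
```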


### 1. Import the algorithm we want
- We have to import it
- Which one? https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor, because we want to predict a numeric value
    

In [38]:
from sklearn.neighbors import KNeighborsRegressor

In [39]:
# create the instance of the algorithm
knn = KNeighborsRegressor()

In [40]:
knn


Documentation:

```python
class sklearn.neighbors.KNeighborsRegressor(n_neighbors=5, *, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None)
```

- `n_neighbors=5`: the number of neighbors, i.e. k
- `p`: the power parameter of the Minkowski metric (1 is Manhattan distance, 2 is Euclidean)
- `weights`: the weight each neighbor has in the result; `'uniform'` gives the plain mean of the neighbors
- `algorithm`: the method used to search for the nearest neighbors. `'brute'` compares against every point to find the closest ones; the other options speed up the search, not the result
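
To make the effect of `weights` concrete, here is a small sketch on invented one-feature data (the arrays and the query point are assumptions chosen so the arithmetic is easy to follow by hand):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Invented toy data: three points on a line.
X = np.array([[0.0], [1.0], [3.0]])
y = np.array([0.0, 10.0, 30.0])
query = np.array([[0.25]])  # its two nearest neighbors are x=0 and x=1

uniform = KNeighborsRegressor(n_neighbors=2, weights="uniform").fit(X, y)
weighted = KNeighborsRegressor(n_neighbors=2, weights="distance").fit(X, y)

# 'uniform': plain mean of the two neighbors' targets, (0 + 10) / 2
pred_uniform = uniform.predict(query)[0]
# 'distance': weights 1/0.25 and 1/0.75, so the closer neighbor (y=0) counts more
pred_weighted = weighted.predict(query)[0]

print(pred_uniform, pred_weighted)
```

With `weights='distance'` the prediction is pulled toward the closer neighbor, which is why it comes out lower than the plain mean here.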

In [41]:
knn = KNeighborsRegressor(n_neighbors=5, algorithm='brute') # every point is compared, with k=5

In [42]:
split_position = math.trunc(df.shape[0]*0.8)

In [43]:
split_position

14193

In [44]:
train = df.iloc[:split_position].copy()
test = df.iloc[split_position:].copy()

In [45]:
train.shape

(14193, 4)

In [46]:
knn.fit(
    train[['accommodates','bathrooms', 'bedrooms']],
    train['price']
)

When we call `fit()`, sklearn stores the training data inside the knn instance. If we pass it any value that is invalid for the algorithm (nulls, non-numeric values), the error is raised at *fit* time.
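
A minimal sketch of that failure mode, assuming the default metric and a tiny invented array with one missing value:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Invented data: the second row contains a NaN.
X_bad = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0]])
y = np.array([10.0, 20.0, 30.0])

try:
    KNeighborsRegressor(n_neighbors=2).fit(X_bad, y)
    failed_at_fit = False
except ValueError as err:
    # Input validation rejects the NaN before any neighbor search happens.
    failed_at_fit = True
    print(type(err).__name__)
```

This is why the rows with null `bedrooms` were dropped before fitting.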

In [47]:
# make predictions with predict()

predicciones = knn.predict(test[['accommodates','bathrooms', 'bedrooms']])

In [48]:
predicciones

array([213.4, 177.8,  92.4, ...,  60.4,  78.6,  38.4])

In [49]:
test['predicted_price'] = predicciones.copy()

In [50]:
test.tail()

Unnamed: 0,accommodates,bathrooms,bedrooms,price,predicted_price
3351,-1.133135,-0.455273,-0.502651,25.0,37.0
970,-1.133135,0.36569,-0.502651,20.0,29.8
4766,-0.572211,0.36569,-0.502651,20.0,60.4
4563,-0.572211,-0.455273,-0.502651,32.0,78.6
15043,-0.572211,2.828581,-0.502651,29.0,38.4


Let's compute the MSE with sklearn

In [51]:
from sklearn.metrics import mean_squared_error

In [52]:
mse = mean_squared_error(test['price'], test['predicted_price'] )

In [53]:
mse

2481.853998309383

In [54]:
rmse = mse**0.5

In [55]:
rmse

49.81820950525403

# What if we include the location?

From the latitude and longitude values we can compute the distance to the city center
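
Latitude and longitude are angles on a sphere, so the distance between two points is usually computed with the haversine formula rather than a plain Euclidean distance. A standard-library sketch of that formula (the function name `haversine_m` and the Earth radius constant are assumptions for illustration; the coordinates are the Puerta de Alcalá reference point and the first listing in this dataset):

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in metres (assumed value)

def haversine_m(p1, p2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, p1)
    lat2, lon2 = map(math.radians, p2)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

center = (40.419991, -3.688737)   # Puerta de Alcalá
listing = (40.40945, -3.70965)    # first listing in the shuffled dataset

d = haversine_m(center, listing)
print(d)
```

The result should land close to what the `haversine` package returns for the same pair of points; small differences can come from the radius constant.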



In [59]:
madrid[['latitude', 'longitude']]

Unnamed: 0,latitude,longitude
682,40.40945,-3.70965
11870,40.41662,-3.70406
3526,40.42671,-3.69040
13270,40.40112,-3.66967
15946,40.37992,-3.69413
...,...,...
3351,40.43522,-3.70723
970,40.42700,-3.63119
4766,40.39059,-3.71294
4563,40.39789,-3.70153


In [63]:
# haversine distance
# pip install haversine
from haversine import haversine, Unit

# Puerta de Alcalá
center = (40.419991, -3.688737)
first = (madrid.iloc[0]['latitude'], madrid.iloc[0]['longitude'])

haversine(center, first, Unit.METERS)


2123.3335776464714

In [72]:
def calcula_distancia(punto):
    return haversine(punto, center, Unit.METERS)

madrid['distance'] = madrid[['latitude', 'longitude']].apply(calcula_distancia, axis=1)

In [74]:
df = madrid[['accommodates', 'distance', 'price']]

In [75]:
target = df['price'].copy()


In [76]:
df = (df - df.mean())/df.std()

In [77]:
df

Unnamed: 0,accommodates,distance,price
682,0.586284,-0.336567,0.086201
11870,0.586284,-0.654523,-0.228800
3526,0.586284,-0.897181,1.180414
13270,0.586284,-0.120989,0.301728
15946,0.586284,0.632309,-0.179063
...,...,...,...
3351,-1.133111,-0.261420,-1.074329
970,-1.133111,0.819137,-1.157224
4766,-0.559979,0.377072,-1.157224
4563,-0.559979,-0.105297,-0.958276


In [89]:
df['price'] = target

In [90]:
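# NOTE: split_position (14193) was computed on the earlier frame, after dropping null bedrooms;
# on these 19130 rows it gives roughly a 74/26 split and a different test set than before
train = df.iloc[:split_position].copy()
test = df.iloc[split_position:].copy()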

In [91]:
train

Unnamed: 0,accommodates,distance,price
682,0.586284,-0.336567,95.0
11870,0.586284,-0.654523,76.0
3526,0.586284,-0.897181,161.0
13270,0.586284,-0.120989,108.0
15946,0.586284,0.632309,79.0
...,...,...,...
12905,-0.559979,0.696678,80.0
6056,0.586284,-0.733281,155.0
19445,-0.559979,-0.017507,120.0
2641,-1.133111,-0.503937,21.0


In [92]:
X = train[['accommodates', 'distance']]
y = train['price']

In [93]:
knn_2 = KNeighborsRegressor(n_neighbors=5, algorithm='brute') # every point is compared, with k=5

In [94]:
knn_2.fit(X, y)

In [96]:
test['predicted']  = knn_2.predict(test[['accommodates', 'distance']])

In [98]:
test.head()

Unnamed: 0,accommodates,distance,price,predicted
834,-0.559979,-0.39198,49.0,75.6
9654,-0.559979,-0.711395,198.0,103.6
6340,2.305679,-0.524096,80.0,156.0
459,-0.559979,-0.208089,100.0,41.8
11680,-0.559979,-0.727322,250.0,131.2


In [100]:
mean_squared_error(test['price'], test['predicted'])**0.5

50.632019124909306

