# Multivariate

So far, we have tried to predict using a single variable at a time (`accommodates` or `bedrooms`).

There are two ways to try to improve accuracy:

1. Change the value of k (hyperparameter tuning).
2. Change the variables, possibly increasing their number, e.g. [`accommodates`, `bedrooms`, `bathrooms`].


What we are going to do in the multivariate case is add columns to the model. For now, these columns must be:

- numeric
- non-null (the distance calculation cannot handle missing values)
- meaningful as magnitudes: we have to drop variables in which '2' does not represent more than '1'. For example, `id` is not a value where being larger or smaller implies anything; the same goes for `host_id`, `scrape_id`, and the raw `latitude` and `longitude`.
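
A minimal sketch of how those three criteria might be applied with pandas. The toy DataFrame and the hand-picked `id_like` list are illustrative assumptions, not the actual Airbnb data; deciding which numeric columns are mere identifiers remains a manual judgment.

```python
import pandas as pd

# Toy data standing in for the listings table (values are made up).
toy = pd.DataFrame({
    "id": [101, 102, 103],
    "name": ["a", "b", "c"],
    "accommodates": [2, 4, 6],
    "bedrooms": [1.0, None, 3.0],
    "beds": [1.0, 2.0, 3.0],
})

# 1. Keep only numeric columns.
numeric = toy.select_dtypes(include="number")

# 2. Drop columns that contain any null value.
non_null = numeric.dropna(axis="columns")

# 3. Drop identifier-like columns, chosen by hand (assumption).
id_like = ["id"]
candidates = non_null.drop(columns=id_like)

print(list(candidates.columns))
```

When only a few values are missing, dropping the affected rows instead of the whole column (as this notebook does later with `dropna()`) keeps the variable available.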

In [1]:
import pandas as pd
import numpy as np
import math

In [2]:
valencia = pd.read_csv('https://raw.githubusercontent.com/afoone/caipc-mar-2023/master/data/airbnb.csv')


In [3]:
valencia.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6711 entries, 0 to 6710
Data columns (total 75 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            6711 non-null   int64  
 1   listing_url                                   6711 non-null   object 
 2   scrape_id                                     6711 non-null   int64  
 3   last_scraped                                  6711 non-null   object 
 4   source                                        6711 non-null   object 
 5   name                                          6711 non-null   object 
 6   description                                   6558 non-null   object 
 7   neighborhood_overview                         3759 non-null   object 
 8   picture_url                                   6711 non-null   object 
 9   host_id                                       6711 non-null   i

In [4]:
## clean the data

# fix price (it contains '$' and thousands separators, so it is not numeric)
valencia['price'] = valencia['price'].apply(lambda x: x.replace('$', '').replace(',', '')).astype('float')

valencia = valencia[valencia['price']<700]

In [5]:
# shuffle

np.random.seed(3837980)
valencia = valencia.iloc[
    np.random.permutation(len(valencia))
]

In [6]:
valencia.shape

(6493, 75)

In [7]:
valencia.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6493 entries, 5738 to 2287
Data columns (total 75 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            6493 non-null   int64  
 1   listing_url                                   6493 non-null   object 
 2   scrape_id                                     6493 non-null   int64  
 3   last_scraped                                  6493 non-null   object 
 4   source                                        6493 non-null   object 
 5   name                                          6493 non-null   object 
 6   description                                   6341 non-null   object 
 7   neighborhood_overview                         3567 non-null   object 
 8   picture_url                                   6493 non-null   object 
 9   host_id                                       6493 non-null 

In [10]:
valencia[['bathrooms', 'bedrooms', 'accommodates']]

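| | bathrooms | bedrooms | accommodates |
|---|---|---|---|
| 5738 | NaN | 1.0 | 1 |
| 1559 | NaN | 1.0 | 2 |
| 5461 | NaN | 2.0 | 4 |
| 25 | NaN | 1.0 | 3 |
| 3559 | NaN | 2.0 | 6 |
| ... | ... | ... | ... |
| 5882 | NaN | NaN | 2 |
| 2961 | NaN | 1.0 | 1 |
| 1550 | NaN | 3.0 | 6 |
| 2312 | NaN | 1.0 | 2 |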


In [14]:
valencia['bathrooms'].isnull().sum()

6493

In [15]:
valencia['bathrooms_text']

5738    1.5 shared baths
1559              1 bath
5461              1 bath
25                1 bath
3559              1 bath
              ...       
5882              1 bath
2961      1 private bath
1550              1 bath
2312       1 shared bath
2287              1 bath
Name: bathrooms_text, Length: 6493, dtype: object

In [16]:
valencia['bathrooms_text'].info()

<class 'pandas.core.series.Series'>
Int64Index: 6493 entries, 5738 to 2287
Series name: bathrooms_text
Non-Null Count  Dtype 
--------------  ----- 
6489 non-null   object
dtypes: object(1)
memory usage: 101.5+ KB


In [17]:
valencia['bathrooms_text'].value_counts()

1 bath              2950
2 baths             1087
1 shared bath        957
1.5 baths            483
1 private bath       357
1.5 shared baths     240
2 shared baths       155
2.5 baths             70
3 baths               57
2.5 shared baths      26
3 shared baths        23
4 baths               19
3.5 baths             14
0 shared baths        11
Shared half-bath      10
4 shared baths         8
0 baths                6
5 baths                5
4.5 baths              3
Half-bath              2
5.5 baths              1
4.5 shared baths       1
6.5 baths              1
3.5 shared baths       1
7 baths                1
6 baths                1
Name: bathrooms_text, dtype: int64

In [18]:
def half_bath(x):
    if ('half' in str(x).lower()):
        return '0.5'
    else:
        return x
    

In [19]:
valencia['bathrooms_text'] = valencia['bathrooms_text'].apply(half_bath)

In [20]:
valencia['bathrooms_text'].value_counts()

1 bath              2950
2 baths             1087
1 shared bath        957
1.5 baths            483
1 private bath       357
1.5 shared baths     240
2 shared baths       155
2.5 baths             70
3 baths               57
2.5 shared baths      26
3 shared baths        23
4 baths               19
3.5 baths             14
0.5                   12
0 shared baths        11
4 shared baths         8
0 baths                6
5 baths                5
4.5 baths              3
5.5 baths              1
4.5 shared baths       1
6.5 baths              1
3.5 shared baths       1
7 baths                1
6 baths                1
Name: bathrooms_text, dtype: int64

In [32]:
import re

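def txt_to_number(x):
    # Extract the leading number (e.g. '1.5' from '1.5 shared baths').
    # Anything without a leading number, including NaN, becomes '0'.
    match = re.match(r"[0-9.]+", str(x))
    if match is None:
        return '0'
    else:
        return match.group(0)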

txt_to_number('9.9df')

'9.9'

In [35]:
valencia['bathrooms'] = valencia['bathrooms_text'].apply(txt_to_number).astype('float')

In [37]:
valencia[['bathrooms', 'bedrooms', 'accommodates']]

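| | bathrooms | bedrooms | accommodates |
|---|---|---|---|
| 5738 | 1.5 | 1.0 | 1 |
| 1559 | 1.0 | 1.0 | 2 |
| 5461 | 1.0 | 2.0 | 4 |
| 25 | 1.0 | 1.0 | 3 |
| 3559 | 1.0 | 2.0 | 6 |
| ... | ... | ... | ... |
| 5882 | 1.0 | NaN | 2 |
| 2961 | 1.0 | 1.0 | 1 |
| 1550 | 1.0 | 3.0 | 6 |
| 2312 | 1.0 | 1.0 | 2 |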


In [43]:
df = valencia[['bathrooms', 'bedrooms', 'accommodates', 'price']].copy()

In [46]:
df = df.dropna()

# Normalize

Because the variables are on different scales, we must 'normalize' (standardize) the data: rescale each column to mean 0 and standard deviation 1.
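
To see why scale matters for a distance-based model, here is a small sketch with made-up values: one feature counted in guests and another measured in meters. The numbers and the `zscore` helper are illustrative assumptions, not part of the dataset.

```python
import math
import statistics

def zscore(values):
    """Rescale values to mean 0 and sample standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample std (ddof=1), like pandas .std()
    return [(v - mean) / std for v in values]

# Made-up listings: number of guests and a distance in meters.
guests = [2, 6, 2]
meters = [500.0, 520.0, 3000.0]

# Raw Euclidean distance from listing 0 to listings 1 and 2.
raw_01 = math.dist((guests[0], meters[0]), (guests[1], meters[1]))
raw_02 = math.dist((guests[0], meters[0]), (guests[2], meters[2]))

# The same distances after standardizing each feature.
g, m = zscore(guests), zscore(meters)
std_01 = math.dist((g[0], m[0]), (g[1], m[1]))
std_02 = math.dist((g[0], m[0]), (g[2], m[2]))

print(raw_01, raw_02)  # the feature in meters dominates
print(std_01, std_02)  # both features now contribute comparably
```

Before standardizing, a difference of four guests is almost invisible next to a few hundred meters; afterwards, each feature is measured in units of its own standard deviation.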





In [47]:
valencia['accommodates'].mean()

3.526721084244571

In [49]:
valencia['accommodates'].std()

1.9522414677994482

In [52]:
(
    valencia['accommodates'] - valencia['accommodates'].mean()
) / valencia['accommodates'].std()

5738   -1.294267
1559   -0.782035
5461    0.242428
25     -0.269803
3559    1.266892
          ...   
5882   -0.782035
2961   -1.294267
1550    1.266892
2312   -0.782035
2287   -0.782035
Name: accommodates, Length: 6493, dtype: float64

In [55]:
(
    (valencia['accommodates'] - valencia['accommodates'].mean())
    / valencia['accommodates'].std()
).std()

1.0

In [57]:
df_predictoras = df[['bathrooms', 'bedrooms', 'accommodates']]

In [63]:
df[['bathrooms', 'bedrooms', 'accommodates']] = (df_predictoras - df_predictoras.mean()) / df_predictoras.std()

In [64]:
df_predictoras

| | bathrooms | bedrooms | accommodates |
|---|---|---|---|
| 5738 | 0.321845 | -0.762584 | -1.301771 |
| 1559 | -0.601573 | -0.762584 | -0.794948 |
| 5461 | -0.601573 | 0.263767 | 0.218697 |
| 25 | -0.601573 | -0.762584 | -0.288126 |
| 3559 | -0.601573 | 0.263767 | 1.232342 |
| ... | ... | ... | ... |
| 650 | -0.601573 | -0.762584 | -1.301771 |
| 5933 | 1.245263 | 1.290118 | 1.232342 |
| 2961 | -0.601573 | -0.762584 | -1.301771 |
| 1550 | -0.601573 | 1.290118 | 1.232342 |


In [65]:
df

| | bathrooms | bedrooms | accommodates | price |
|---|---|---|---|---|
| 5738 | 0.321845 | -0.762584 | -1.301771 | 37.0 |
| 1559 | -0.601573 | -0.762584 | -0.794948 | 95.0 |
| 5461 | -0.601573 | 0.263767 | 0.218697 | 61.0 |
| 25 | -0.601573 | -0.762584 | -0.288126 | 120.0 |
| 3559 | -0.601573 | 0.263767 | 1.232342 | 94.0 |
| ... | ... | ... | ... | ... |
| 650 | -0.601573 | -0.762584 | -1.301771 | 20.0 |
| 5933 | 1.245263 | 1.290118 | 1.232342 | 200.0 |
| 2961 | -0.601573 | -0.762584 | -1.301771 | 21.0 |
| 1550 | -0.601573 | 1.290118 | 1.232342 | 86.0 |


In [66]:
from scipy.spatial import distance

In [67]:
distance.euclidean([3],[4])

1.0

In [68]:
distance.euclidean([1,4], [5,0])

5.656854249492381

In [69]:
distance.euclidean([1,3,5,6], [5,6,12,3])

9.1104335791443
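
For reference, `distance.euclidean` computes the square root of the sum of squared coordinate differences. The helper below is a plain-Python sketch of that formula (the name `euclidean` is ours, not SciPy's), applied to the same inputs used above.

```python
import math

def euclidean(u, v):
    """Square root of the sum of squared differences between u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

print(euclidean([3], [4]))                      # sqrt(1)
print(euclidean([1, 4], [5, 0]))                # sqrt(16 + 16) = sqrt(32)
print(euclidean([1, 3, 5, 6], [5, 6, 12, 3]))   # sqrt(16 + 9 + 49 + 9) = sqrt(83)
```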

In [None]:
distance.euclidean(
    df[['bathrooms', 'bedrooms', 'accommodates']].iloc[1],
    df[['bathrooms', 'bedrooms', 'accommodates']].iloc[16],
)

# scikit-learn

Workflow:
- Import the algorithm we want to use. Each algorithm is a Python class.
- Create an instance of the algorithm.
- Create the train and test sets.
- Fit the model to the training data (*fit*); that is, *build the predictor function*.
- Make predictions.
- Evaluate the predictions.
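
The steps above can be sketched end to end on synthetic data. Everything about the data here (the random features, the linear "price" rule, the seed) is an assumption made for illustration; only the scikit-learn calls mirror the workflow.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor      # 1. import the algorithm
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                           # synthetic features
y = 80 + 30 * X[:, 0] + rng.normal(scale=5, size=200)   # synthetic "price"

knn_demo = KNeighborsRegressor(n_neighbors=5)           # 2. create an instance

cut = int(len(X) * 0.8)                                 # 3. 80/20 train/test split
X_train, X_test = X[:cut], X[cut:]
y_train, y_test = y[:cut], y[cut:]

knn_demo.fit(X_train, y_train)                          # 4. fit on the training data
predictions = knn_demo.predict(X_test)                  # 5. make predictions

rmse_demo = mean_squared_error(y_test, predictions) ** 0.5  # 6. evaluate
print(rmse_demo)
```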

In [71]:
# Import the algorithm we want to use. Each algorithm is a Python class.
from sklearn.neighbors import KNeighborsRegressor

In [72]:
# Create an instance of the algorithm.
knn = KNeighborsRegressor()

In [73]:
knn

### Documentation

```python
 class sklearn.neighbors.KNeighborsRegressor(n_neighbors=5, *, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', metric_params=None, n_jobs=None)
```

- **n_neighbors**: default 5; the number of neighbors, i.e. **k**.
- **metric**: the distance function. With the default `'minkowski'`, the parameter `p` selects the variant: `p=2` is Euclidean distance and `p=1` is Manhattan distance.
- **weights**: with `'uniform'`, every neighbor counts equally, so the prediction is the plain mean (`mean()`) of the neighbors' values.
- **algorithm**: the strategy used to search for the neighbors. `'brute'` is what we had done so far: computing the distance to ALL elements. The others (`'ball_tree'`, `'kd_tree'`) use tree structures to prune the search. This affects computation time but not, in principle, the result.
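
To make these parameters concrete, the following plain-Python sketch reimplements what a brute-force k-nearest-neighbors regressor with uniform weights does conceptually. It is not scikit-learn's implementation; the function names and toy data are assumptions.

```python
def minkowski(u, v, p=2):
    """Minkowski distance: p=2 is Euclidean, p=1 is Manhattan."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)

def knn_predict(X_train, y_train, query, n_neighbors=5, p=2):
    """Brute force: measure the distance to every training row,
    keep the n_neighbors closest, and average their targets
    (the 'uniform' weighting)."""
    ranked = sorted(
        zip(X_train, y_train),
        key=lambda row: minkowski(row[0], query, p),
    )
    nearest = [target for _, target in ranked[:n_neighbors]]
    return sum(nearest) / len(nearest)

# Toy training set: one feature, made-up prices.
X_toy = [[1], [2], [3], [10]]
y_toy = [20.0, 30.0, 40.0, 200.0]

print(knn_predict(X_toy, y_toy, [2], n_neighbors=3))  # mean of 30, 20 and 40
print(minkowski([1, 4], [5, 0], p=1))                 # Manhattan: 4 + 4
```

With `n_neighbors=3` the distant listing priced at 200 is ignored; raising k to 4 would pull it into the average, which is why k is worth tuning.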



In [74]:
import math
corte = math.trunc(df.shape[0]*0.8)

In [75]:
corte

4946

In [76]:
train = df[:corte].copy()
test = df[corte:].copy()

In [77]:
# - Fit the model to the training data (*fit*); that is, *build the predictor function*
knn.fit(
    train[['bathrooms', 'bedrooms', 'accommodates']],
    train['price']
)

When we call `fit`, sklearn stores the data inside the `knn` instance and checks that it is valid; in this distance-based case, if any value were null, it would raise an error.

In [78]:
# - Make predictions

# use the test set
predicciones = knn.predict(test[['bathrooms', 'bedrooms', 'accommodates']])

In [79]:
predicciones

array([ 89.4,  26.2, 127.4, ...,  25.8,  80. ,  86.2])

In [80]:
test['predicted_price'] = predicciones

In [81]:
test[ ['price', 'predicted_price'] ]

| | price | predicted_price |
|---|---|---|
| 3048 | 84.0 | 89.4 |
| 2379 | 24.0 | 26.2 |
| 6562 | 104.0 | 127.4 |
| 3530 | 125.0 | 107.2 |
| 234 | 63.0 | 111.6 |
| ... | ... | ... |
| 650 | 20.0 | 25.8 |
| 5933 | 200.0 | 258.8 |
| 2961 | 21.0 | 25.8 |
| 1550 | 86.0 | 80.0 |


In [82]:
from sklearn.metrics import mean_squared_error

In [85]:
mse = mean_squared_error(test['price'], test['predicted_price'])

In [86]:
mse

3737.592145270816

In [88]:
rmse = mse**0.5

In [89]:
rmse

61.13584991861008
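
The RMSE is the square root of the mean of the squared errors, so it is expressed in the same units as the price. Below is a plain-Python sketch of that computation (the helper name `rmse_of` is ours), applied only to the first five price/prediction pairs visible in the output above, so the result differs from the full test-set value.

```python
import math

def rmse_of(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# First five rows shown in the price / predicted_price output.
sample_prices = [84.0, 24.0, 104.0, 125.0, 63.0]
sample_predictions = [89.4, 26.2, 127.4, 107.2, 111.6]

print(rmse_of(sample_prices, sample_predictions))
```

Because the errors are squared before averaging, a single large miss (such as 63.0 predicted as 111.6) weighs far more than several small ones.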

# Adding the distance to the city center



In [90]:
valencia[['latitude', 'longitude']]

| | latitude | longitude |
|---|---|---|
| 5738 | 39.49741 | -0.37455 |
| 1559 | 39.45981 | -0.37301 |
| 5461 | 39.48228 | -0.39761 |
| 25 | 39.47625 | -0.37203 |
| 3559 | 39.47267 | -0.37796 |
| ... | ... | ... |
| 5882 | 39.47265 | -0.32772 |
| 2961 | 39.47267 | -0.35266 |
| 1550 | 39.46857 | -0.32447 |
| 2312 | 39.47053 | -0.33897 |


In [99]:
# Plaza del Ayuntamiento, Valencia: 39.46983569781667, -0.3765216511359857

# haversine

from haversine import haversine, Unit

centro = (39.46983569781667, -0.3765216511359857)

primero = (valencia.iloc[3]['latitude'], valencia.iloc[3]['longitude'])


In [100]:
primero

(39.47625, -0.37203)

In [101]:
haversine(centro, primero, Unit.METERS)

810.7702433948596
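
For readers without the `haversine` package, the great-circle formula can be sketched with the standard library alone. The mean Earth radius used here (6371.0088 km) is an assumption about the constant; a different radius shifts the result slightly. The helper name `haversine_m` is ours.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in meters (assumed constant)

def haversine_m(point1, point2):
    """Great-circle distance in meters between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, point1)
    lat2, lon2 = map(math.radians, point2)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = (
        math.sin(dlat / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

centro = (39.46983569781667, -0.3765216511359857)
primero = (39.47625, -0.37203)

print(haversine_m(centro, primero))  # should be close to the 810.77 m obtained above
```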

In [102]:
valencia.iloc[3]

id                                                                           2053439
listing_url                                     https://www.airbnb.com/rooms/2053439
scrape_id                                                             20221221170335
last_scraped                                                              2022-12-21
source                                                                   city scrape
                                                                ...                 
calculated_host_listings_count                                                     1
calculated_host_listings_count_entire_homes                                        1
calculated_host_listings_count_private_rooms                                       0
calculated_host_listings_count_shared_rooms                                        0
reviews_per_month                                                               2.07
Name: 25, Length: 75, dtype: object

In [105]:
def distancia_al_centro(punto):
    return haversine(punto, centro, Unit.METERS)


In [109]:

valencia['distancia'] = valencia[['latitude', 'longitude']].apply(distancia_al_centro, axis=1)

In [117]:
df = valencia[['bathrooms', 'bedrooms', 'accommodates', 'distancia', 'price']].copy()
df_predictoras = df[['bathrooms', 'bedrooms', 'accommodates', 'distancia']]

In [118]:
df[['bathrooms', 'bedrooms', 'accommodates', 'distancia']] = (df_predictoras - df_predictoras.mean()) / df_predictoras.std()

In [147]:
df.head()
df = df.dropna()

In [148]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df[['bathrooms', 'bedrooms', 'accommodates', 'distancia']], df['price'], shuffle=False, test_size=0.2)

In [149]:
X_train

| | bathrooms | bedrooms | accommodates | distancia |
|---|---|---|---|---|
| 5738 | 0.338578 | -0.762584 | -1.294267 | 0.287466 |
| 1559 | -0.583177 | -0.762584 | -0.782035 | -0.456145 |
| 5461 | -0.583177 | 0.263767 | 0.242428 | -0.020090 |
| 25 | -0.583177 | -0.762584 | -0.269803 | -0.589687 |
| 3559 | -0.583177 | 0.263767 | 1.266892 | -0.772991 |
| ... | ... | ... | ... | ... |
| 6142 | -0.583177 | -0.762584 | -0.782035 | -0.290703 |
| 5576 | -0.583177 | 0.263767 | -0.269803 | -0.478202 |
| 4914 | -0.583177 | 0.263767 | 0.242428 | 0.428297 |
| 5833 | 0.338578 | -0.762584 | -0.782035 | -0.064695 |


In [150]:
y_train.shape

(4946,)

In [151]:
y_test.shape

(1237,)

In [152]:
knn2 = KNeighborsRegressor(n_neighbors=10, algorithm='brute')

In [153]:
knn2.fit(
    X_train,
    y_train
)

In [154]:
predicted = knn2.predict(X_test)

In [155]:
X_test


| | bathrooms | bedrooms | accommodates | distancia |
|---|---|---|---|---|
| 3048 | -0.583177 | 0.263767 | 0.242428 | 0.769653 |
| 2379 | 2.182089 | -0.762584 | -0.782035 | 0.079632 |
| 6562 | -0.583177 | 0.263767 | 0.754660 | 0.788643 |
| 3530 | 0.338578 | 0.263767 | 0.242428 | -0.502373 |
| 234 | -0.583177 | -0.762584 | 0.242428 | 0.651002 |
| ... | ... | ... | ... | ... |
| 650 | -0.583177 | -0.762584 | -1.294267 | 0.108884 |
| 5933 | 1.260333 | 1.290118 | 1.266892 | 7.490771 |
| 2961 | -0.583177 | -0.762584 | -1.294267 | -0.100065 |
| 1550 | -0.583177 | 1.290118 | 1.266892 | 0.830627 |


In [157]:
predicted

array([ 90. ,  49.9,  80. , ...,  19.8, 112.6,  67.8])

In [159]:
mean_squared_error(y_test, predicted) ** 0.5

54.99040830879203