# Multivariate

So far we have tried to predict the price from a single variable at a time (`accommodates`, then `bedrooms`).

There are two ways to try to improve accuracy:

1. Change the value of k (hyperparameter tuning).
2. Change the variables, including using more of them, e.g. [`accommodates`, `bedrooms`, `bathrooms`].


In the multivariate version we add columns to the model. For now, the columns must be:

- numeric
- non-null (the distance calculation cannot handle missing values)
- ordered in a meaningful way: we must drop variables in which '2' does not represent more than '1'. For example, `id` is not a value whose being larger or smaller implies anything; the same goes for `host_id`, `scrape_id`, and the latitude and longitude.
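The three requirements above can be checked programmatically. The following is a minimal sketch on a small made-up DataFrame; the toy data and the `id_like` list are assumptions for illustration and are not taken from the Airbnb file.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "id": [101, 102, 103, 104],
    "name": ["a", "b", "c", "d"],
    "accommodates": [2, 4, 1, 6],
    "bedrooms": [1.0, 2.0, np.nan, 3.0],
    "latitude": [39.47, 39.48, 39.46, 39.45],
})

# Columns whose magnitude carries no meaning for the model (assumed list).
id_like = ["id", "host_id", "scrape_id", "latitude", "longitude"]

# Requirement 1: keep numeric columns only.
numeric = toy.select_dtypes(include="number")
# Requirement 3: drop identifier-like columns if they are present.
numeric = numeric.drop(columns=id_like, errors="ignore")
# Requirement 2: keep only columns with no missing values.
candidates = numeric.columns[numeric.notna().all()]

print(list(candidates))
```

Discarding a whole column because of a few nulls is often too strict. An alternative is to keep the column and drop the affected rows with `dropna()`.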

In [1]:
import pandas as pd
import numpy as np
import math

In [2]:
valencia = pd.read_csv('https://raw.githubusercontent.com/afoone/caipc-mar-2023/master/data/airbnb.csv')


In [3]:
valencia.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6711 entries, 0 to 6710
Data columns (total 75 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            6711 non-null   int64  
 1   listing_url                                   6711 non-null   object 
 2   scrape_id                                     6711 non-null   int64  
 3   last_scraped                                  6711 non-null   object 
 4   source                                        6711 non-null   object 
 5   name                                          6711 non-null   object 
 6   description                                   6558 non-null   object 
 7   neighborhood_overview                         3759 non-null   object 
 8   picture_url                                   6711 non-null   object 
 9   host_id                                       6711 non-null   i

In [4]:
## clean the data

# fix price (it contains '$' and thousands separators, so it is not numeric)
valencia['price'] = valencia['price'].apply(lambda x: x.replace('$', '').replace(',', '')).astype('float')

# keep only listings priced below 700
valencia = valencia[valencia['price']<700]

In [5]:
# shuffle the rows (fixed seed so the order is reproducible)

np.random.seed(3837980)
valencia = valencia.iloc[
    np.random.permutation(len(valencia))
]

In [6]:
valencia.shape

(6493, 75)

In [7]:
valencia.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6493 entries, 5738 to 2287
Data columns (total 75 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            6493 non-null   int64  
 1   listing_url                                   6493 non-null   object 
 2   scrape_id                                     6493 non-null   int64  
 3   last_scraped                                  6493 non-null   object 
 4   source                                        6493 non-null   object 
 5   name                                          6493 non-null   object 
 6   description                                   6341 non-null   object 
 7   neighborhood_overview                         3567 non-null   object 
 8   picture_url                                   6493 non-null   object 
 9   host_id                                       6493 non-null 

In [10]:
valencia[['bathrooms', 'bedrooms', 'accommodates']]

      bathrooms  bedrooms  accommodates
5738        NaN       1.0             1
1559        NaN       1.0             2
5461        NaN       2.0             4
25          NaN       1.0             3
3559        NaN       2.0             6
...         ...       ...           ...
5882        NaN       NaN             2
2961        NaN       1.0             1
1550        NaN       3.0             6
2312        NaN       1.0             2


In [14]:
valencia['bathrooms'].isnull().sum()

6493

In [15]:
valencia['bathrooms_text']

5738    1.5 shared baths
1559              1 bath
5461              1 bath
25                1 bath
3559              1 bath
              ...       
5882              1 bath
2961      1 private bath
1550              1 bath
2312       1 shared bath
2287              1 bath
Name: bathrooms_text, Length: 6493, dtype: object

In [16]:
valencia['bathrooms_text'].info()

<class 'pandas.core.series.Series'>
Int64Index: 6493 entries, 5738 to 2287
Series name: bathrooms_text
Non-Null Count  Dtype 
--------------  ----- 
6489 non-null   object
dtypes: object(1)
memory usage: 101.5+ KB


In [17]:
valencia['bathrooms_text'].value_counts()

1 bath              2950
2 baths             1087
1 shared bath        957
1.5 baths            483
1 private bath       357
1.5 shared baths     240
2 shared baths       155
2.5 baths             70
3 baths               57
2.5 shared baths      26
3 shared baths        23
4 baths               19
3.5 baths             14
0 shared baths        11
Shared half-bath      10
4 shared baths         8
0 baths                6
5 baths                5
4.5 baths              3
Half-bath              2
5.5 baths              1
4.5 shared baths       1
6.5 baths              1
3.5 shared baths       1
7 baths                1
6 baths                1
Name: bathrooms_text, dtype: int64

In [18]:
def half_bath(x):
    # 'Half-bath' and 'Shared half-bath' contain no digit, so map them to '0.5'
    if 'half' in str(x).lower():
        return '0.5'
    return x
    

In [19]:
valencia['bathrooms_text'] = valencia['bathrooms_text'].apply(half_bath)

In [20]:
valencia['bathrooms_text'].value_counts()

1 bath              2950
2 baths             1087
1 shared bath        957
1.5 baths            483
1 private bath       357
1.5 shared baths     240
2 shared baths       155
2.5 baths             70
3 baths               57
2.5 shared baths      26
3 shared baths        23
4 baths               19
3.5 baths             14
0.5                   12
0 shared baths        11
4 shared baths         8
0 baths                6
5 baths                5
4.5 baths              3
5.5 baths              1
4.5 shared baths       1
6.5 baths              1
3.5 shared baths       1
7 baths                1
6 baths                1
Name: bathrooms_text, dtype: int64

In [32]:
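import re

def txt_to_number(x):
    # take the leading run of digits and dots; text with no leading number yields '0'
    # (inside a character class '|' is a literal pipe, so it is not needed here)
    response = re.search(r"[0-9.]*", str(x)).group(0)
    if len(response) < 1:
        return '0'
    return response

txt_to_number('9.9df')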

'9.9'

In [35]:
valencia['bathrooms'] = valencia['bathrooms_text'].apply(txt_to_number).astype('float')
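As an alternative to the two helper functions, the same conversion can be sketched with pandas string methods. This is a hedged variant and not the notebook's own approach. The sample values are copied from the `bathrooms_text` counts above, plus one missing value. Unlike `txt_to_number`, which turns missing text into 0, this version leaves missing text as `NaN`.

```python
import pandas as pd

# Sample values taken from the value counts above, plus a missing entry.
text = pd.Series(["1.5 shared baths", "1 bath", "Shared half-bath", "Half-bath", None])

# Pull out the first number in each string; rows without digits become NaN.
number = text.str.extract(r"(\d+(?:\.\d+)?)", expand=False).astype(float)

# Half baths carry no digits, so set them explicitly to 0.5.
is_half = text.str.contains("half", case=False, na=False)
bathrooms = number.mask(is_half, 0.5)

print(bathrooms.tolist())
```

Keeping `NaN` lets a later `dropna()` remove listings whose bathroom count is unknown, rather than treating them as having zero bathrooms.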

In [37]:
valencia[['bathrooms', 'bedrooms', 'accommodates']]

      bathrooms  bedrooms  accommodates
5738        1.5       1.0             1
1559        1.0       1.0             2
5461        1.0       2.0             4
25          1.0       1.0             3
3559        1.0       2.0             6
...         ...       ...           ...
5882        1.0       NaN             2
2961        1.0       1.0             1
1550        1.0       3.0             6
2312        1.0       1.0             2


In [43]:
df = valencia[['bathrooms', 'bedrooms', 'accommodates', 'price']].copy()

In [46]:
df = df.dropna()

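# Normalize

Variables can differ by orders of magnitude, and the ones with larger values would dominate the distance. We therefore 'normalize' the data: each column is rescaled to mean 0 and standard deviation 1 by subtracting its mean and dividing by its standard deviation.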
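To see why scale matters for a distance-based model, the sketch below compares Euclidean distances before and after rescaling. The listings are made up, and the area column is a hypothetical feature chosen only because its values are much larger than a bedroom count.

```python
import numpy as np

# Made-up listings: columns are (bedrooms, area in square metres).
X = np.array([
    [1.0, 50.0],   # query listing
    [3.0, 52.0],   # A: two more bedrooms, similar area
    [1.0, 56.0],   # B: same bedrooms, slightly larger area
    [2.0, 90.0],
    [2.0, 30.0],
])

def euclidean(a, b):
    """Euclidean distance between two 1-D arrays."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Raw values: the area column decides the outcome, so A looks closer than B.
raw = [euclidean(X[0], X[1]), euclidean(X[0], X[2])]

# Z-score each column. ddof=1 matches the default of pandas' .std().
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Scaled values: both columns contribute comparably, so B is now closer.
scaled = [euclidean(Z[0], Z[1]), euclidean(Z[0], Z[2])]

print(raw)
print(scaled)
```

On the raw values, a difference of 6 square metres outweighs a difference of 2 bedrooms. After rescaling, the listing with the same number of bedrooms becomes the nearer neighbour.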





In [47]:
valencia['accommodates'].mean()

3.526721084244571

In [49]:
valencia['accommodates'].std()

1.9522414677994482

In [52]:
(
    valencia['accommodates'] - valencia['accommodates'].mean()
) / valencia['accommodates'].std()

5738   -1.294267
1559   -0.782035
5461    0.242428
25     -0.269803
3559    1.266892
          ...   
5882   -0.782035
2961   -1.294267
1550    1.266892
2312   -0.782035
2287   -0.782035
Name: accommodates, Length: 6493, dtype: float64

In [55]:
(
    (
        valencia['accommodates'] - valencia['accommodates'].mean()
    ) / valencia['accommodates'].std()
).std()

1.0

In [57]:
df_predictoras = df[['bathrooms', 'bedrooms', 'accommodates']]

In [63]:
# z-score every predictor column; price is left untouched
df[['bathrooms', 'bedrooms', 'accommodates']] = (df_predictoras - df_predictoras.mean()) / df_predictoras.std()

In [64]:
df_predictoras

      bathrooms  bedrooms  accommodates
5738   0.321845 -0.762584     -1.301771
1559  -0.601573 -0.762584     -0.794948
5461  -0.601573  0.263767      0.218697
25    -0.601573 -0.762584     -0.288126
3559  -0.601573  0.263767      1.232342
...         ...       ...           ...
650   -0.601573 -0.762584     -1.301771
5933   1.245263  1.290118      1.232342
2961  -0.601573 -0.762584     -1.301771
1550  -0.601573  1.290118      1.232342


In [65]:
df

      bathrooms  bedrooms  accommodates  price
5738   0.321845 -0.762584     -1.301771   37.0
1559  -0.601573 -0.762584     -0.794948   95.0
5461  -0.601573  0.263767      0.218697   61.0
25    -0.601573 -0.762584     -0.288126  120.0
3559  -0.601573  0.263767      1.232342   94.0
...         ...       ...           ...    ...
650   -0.601573 -0.762584     -1.301771   20.0
5933   1.245263  1.290118      1.232342  200.0
2961  -0.601573 -0.762584     -1.301771   21.0
1550  -0.601573  1.290118      1.232342   86.0
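With the predictors on a common scale, a multivariate prediction only needs a distance that uses all the columns at once. The sketch below assumes a k-nearest-neighbours approach with Euclidean distance and averages the price of the k closest rows. The `knn_predict` helper, the choice of k=2 and the query are illustrative assumptions, not the notebook's own implementation. The training rows are the first five rows of the normalized frame shown above.

```python
import numpy as np
import pandas as pd

features = ["bathrooms", "bedrooms", "accommodates"]

# First five rows of the normalized frame shown above.
train = pd.DataFrame(
    {
        "bathrooms": [0.321845, -0.601573, -0.601573, -0.601573, -0.601573],
        "bedrooms": [-0.762584, -0.762584, 0.263767, -0.762584, 0.263767],
        "accommodates": [-1.301771, -0.794948, 0.218697, -0.288126, 1.232342],
        "price": [37.0, 95.0, 61.0, 120.0, 94.0],
    },
    index=[5738, 1559, 5461, 25, 3559],
)

def knn_predict(train, query, features, k):
    """Average the price of the k rows closest to `query` (Euclidean distance)."""
    diffs = train[features].to_numpy() - np.asarray(query, dtype=float)
    distances = np.sqrt((diffs ** 2).sum(axis=1))
    nearest = np.argsort(distances)[:k]
    return float(train["price"].to_numpy()[nearest].mean())

# Hypothetical query: already-normalized predictors equal to those of row 5461.
query = [-0.601573, 0.263767, 0.218697]
prediction = knn_predict(train, query, features, k=2)
print(prediction)
```

Here the two nearest rows are 5461 and 3559, so the prediction is the mean of their prices, 61.0 and 94.0. In a real evaluation the query rows would come from a held-out test set rather than from the training rows themselves, and they would be normalized with the training set's mean and standard deviation.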
