## EXERCISE STATEMENT

* Airbnb NYC dataset: https://www.kaggle.com/datasets/dgomonov/new-york-city-airbnb-open-data (AB_NYC_2019.csv)
    * Drop columns: id, host_id, host_name
    * Date: try splitting the date into year, month and day fields with pandas, that is, into three columns, for example using pandas' to_datetime and the date accessors to extract the year, month and day.
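
The date item above can be sketched as follows. This is a minimal example on a toy frame, assuming the date column is `last_review` (the column name in AB_NYC_2019.csv); the `review_year`, `review_month` and `review_day` names are illustrative choices, not part of the statement.

```python
import pandas as pd

# Toy frame standing in for the real CSV; 'last_review' is assumed to be the date column.
df_dates = pd.DataFrame({"last_review": ["2019-05-21", "2018-10-19", None]})

# errors="coerce" turns missing or unparseable values into NaT instead of raising.
parsed = pd.to_datetime(df_dates["last_review"], errors="coerce")

# The .dt accessor exposes the date parts; rows holding NaT yield NaN.
df_dates["review_year"] = parsed.dt.year
df_dates["review_month"] = parsed.dt.month
df_dates["review_day"] = parsed.dt.day

print(df_dates)
```

Because missing dates produce NaN in the three new columns, they would still need an imputer in the numerical preprocessing step.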

* EDA (10%) (less emphasis in this module)
    * univariate: histograms, boxplot, countplot
    * bivariate: scatterplot
    * multivariate: corr as a heatmap, pairplot
* Preprocessing (20%)
    * numerical: imputer, scaler, transformer
    * categorical: imputer, encoder
    * Requirement: do the preprocessing with scikit-learn rather than with pandas methods
* Clustering and silhouettes (10%)
    * Create a cluster column using KMeans or any other clustering algorithm
    * Use that column as hue to color a scatterplot in an EDA chart
* Feature selection (10%)
    * SelectKBest to keep the best columns and test the result
    * PCA
* Regression (20%):
    * Predict the 'price' column
* Multiclass classification (20%)
    * Predict the 'room_type' column
* Compare model results with cross-validation (10%)
    * Show a results dataframe with the computed metrics
    * Optional: show boxplots of the cross-validation results, such as fit and prediction times and metrics
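
For the clustering and silhouette item in the list above, one possible approach is sketched below. It runs on synthetic points that stand in for the `lat` and `long` columns; the `df_geo` frame, the range of `k` values and the choice of KMeans are assumptions made for illustration only.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic coordinates standing in for the 'lat' and 'long' columns.
coords, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)
df_geo = pd.DataFrame(coords, columns=["lat", "long"])

# KMeans is distance based, so the features are scaled first.
X_scaled = StandardScaler().fit_transform(df_geo)

# Compare a few values of k by silhouette score (range -1 to 1, higher is better).
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

# Keep the k with the best silhouette and store its labels as a new column.
best_k = max(scores, key=scores.get)
df_geo["cluster"] = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit_predict(X_scaled)
print(scores, best_k)
```

The new column can then be passed as `hue`, for example `sns.scatterplot(data=df_geo, x="long", y="lat", hue="cluster")`.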

* Optional:
    * Using pipelines is optional:
        * Option 1: do the transformations separately, by hand
        * Option 2: do the transformations with pipelines
        * Option 3: a first part with manual transformations and a second part with Pipeline
        * In every case it would be interesting to compute the metrics to see which preprocessing techniques work best
    * Vectorize the text column 'name' and use TruncatedSVD
    * Multiclass classification of 'room_type' with TensorFlow-Keras
    * SMOTE if there is class imbalance in the 'room_type' multiclass classification problem
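
The optional text item above (vectorizing 'name' and applying TruncatedSVD) could look roughly like this. The listing titles are made up, and using `TfidfVectorizer` with two components is an assumption: the statement does not say which vectorizer or how many components to use.

```python
import pandas as pd
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Made-up listing titles standing in for the 'name' column.
names = pd.Series([
    "Cozy room near Central Park",
    "Sunny loft in Brooklyn",
    "Private room in Brooklyn brownstone",
    "Entire apartment near Times Square",
    None,
])

# TF-IDF produces a sparse matrix; TruncatedSVD reduces it without densifying it first.
text_pipeline = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=2, random_state=42),
)

# TfidfVectorizer cannot handle NaN, so missing titles become empty strings first.
name_features = text_pipeline.fit_transform(names.fillna(""))
print(name_features.shape)
```

In a full pipeline this text branch would be one more entry in the `ColumnTransformer`, applied to the single column 'name'.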



In [None]:

import seaborn as sns 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, PowerTransformer, FunctionTransformer, OneHotEncoder, LabelEncoder
from sklearn.metrics import r2_score, mean_absolute_error, root_mean_squared_error, mean_absolute_percentage_error
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.svm import SVR

In [None]:
df = pd.read_csv('airbnb_nyc_clean.csv').drop(['id','name','host_id','host_name','last_review', 'house_rules'], axis=1)
df.head(3)

In [None]:
df.info()

In [None]:
df.isna().sum()

In [52]:
df.describe()

| | lat | long | construction_year | price | service_fee | minimum_nights | number_of_reviews | reviews_per_month | review_rate_number | calculated_host_listings_count | availability_365 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 69305.0 | 69305.0 | 69305.0 | 69305.0 | 69305.0 | 69305.0 | 69305.0 | 69305.0 | 69305.0 | 69305.0 | 69305.0 |
| mean | 40.72807 | -73.949036 | 2012.489503 | 624.73607 | 124.894026 | 4.62033 | 28.003896 | 1.301503 | 3.321636 | 8.976755 | 153.184287 |
| std | 0.055973 | 0.05047 | 5.756144 | 331.158937 | 66.222794 | 4.356887 | 52.03518 | 1.659188 | 1.255746 | 34.808447 | 134.421373 |
| min | 40.49979 | -74.24984 | 2003.0 | 50.0 | 10.0 | 0.0 | 0.0 | 0.01 | 1.0 | 1.0 | -10.0 |
| 25% | 40.68854 | -73.98279 | 2008.0 | 339.0 | 68.0 | 2.0 | 1.0 | 0.3 | 2.0 | 1.0 | 18.0 |
| 50% | 40.72265 | -73.95439 | 2012.0 | 624.73607 | 124.894026 | 3.0 | 7.0 | 0.79 | 3.0 | 1.0 | 127.0 |
| 75% | 40.76273 | -73.93138 | 2017.0 | 911.0 | 182.0 | 6.0 | 30.0 | 1.73 | 4.0 | 3.0 | 281.0 |
| max | 40.91697 | -73.70522 | 2022.0 | 1200.0 | 240.0 | 13.0 | 1024.0 | 90.0 | 5.0 | 332.0 | 426.0 |


In [None]:
plt.figure(figsize=(16, 8))
plt.subplot(1, 2, 1)
sns.histplot(df, x='price', kde=True)
plt.subplot(1, 2, 2)
sns.boxplot(df, hue='room_type', y='price')
#sns.countplot(df, x='room_type')

In [None]:
sns.heatmap(df.corr(numeric_only=True).round(2), annot=True, cmap='spring')

In [None]:
#sns.pairplot(df)

In [None]:
# Flag price outliers with the 1.5 * IQR rule
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# True for rows whose price lies inside the bounds
filtro = ~((df['price'] < lower_bound) | (df['price'] > upper_bound))
print(df.shape)
print(df[filtro].shape)


In [None]:
# # Create the encoder
# label_encoder = LabelEncoder()

# # Map the 'room_type' and 'neighbourhood_group' columns to numeric values
# df['room_type_encoded'] = label_encoder.fit_transform(df['room_type'])
# df['neighbourhood_group_encoded'] = label_encoder.fit_transform(df['neighbourhood_group'])

# # Check the 'room_type' encoding
# print(df[['room_type', 'room_type_encoded']].head())

In [None]:
X = df.drop('price', axis=1)
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)

In [None]:
print(X_train.select_dtypes(exclude=[np.number]).columns.to_list())

In [None]:
# numerical pipeline
numerical_cols = X_train.select_dtypes(include=[np.number]).columns.to_list()
categorical_cols = X_train.select_dtypes(exclude=[np.number]).columns.to_list()
pipeline_numerical = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value=0)),
    ('scaler', MinMaxScaler()), 
])

# categorical pipeline

pipeline_categorical = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='Otro')),
    ('encoder', OneHotEncoder(sparse_output=False, handle_unknown='ignore'))
])

# join both pipelines with ColumnTransformer
pipeline_all = ColumnTransformer([
    ('numeric', pipeline_numerical, numerical_cols),
    ('categorical', pipeline_categorical, categorical_cols)
])

# final pipeline with the model
pipeline = make_pipeline(
    pipeline_all,
    LinearRegression()
)

In [None]:
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
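
The predictions in `y_pred` still need to be scored. A minimal sketch of one way to do that follows; `regression_report` is a hypothetical helper introduced here for illustration, and the three values passed to it are made up so the block runs on its own.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return R2, MAE and RMSE for one set of predictions (hypothetical helper)."""
    return {
        "R2": r2_score(y_true, y_pred),
        "MAE": mean_absolute_error(y_true, y_pred),
        # RMSE as the square root of MSE, which also works on older scikit-learn versions.
        "RMSE": float(np.sqrt(mean_squared_error(y_true, y_pred))),
    }


# Tiny made-up example: every prediction is off by 10.
demo = regression_report([100, 200, 300], [110, 190, 310])
print(demo)
```

In the notebook the equivalent call would be `regression_report(y_test, y_pred)`.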

In [None]:
df_resultados = pd.DataFrame(columns=['Modelo', 'Preprocesado', 'R2', 'MAE', 'RMSE', 'MAPE'])

In [None]:
def calculate_metrics(X_train, X_test, y_train, y_test, preprocessing='none'):
    # Fit each regressor on the training split and append one row of test metrics per model.
    # 'preprocessing' is a free-text label stored in the 'Preprocesado' column, so that
    # every row has as many values as df_resultados has columns.
    models = {
        'LinearRegression': LinearRegression(),
        'KNN': KNeighborsRegressor(),
        'SVR': SVR(),
        'DecisionTree': DecisionTreeRegressor(random_state=42),
        'RandomForest': RandomForestRegressor(random_state=42)
    }
    for model_name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        df_resultados.loc[len(df_resultados)] = [
            model_name,
            preprocessing,
            r2_score(y_test, y_pred),
            mean_absolute_error(y_test, y_pred),
            root_mean_squared_error(y_test, y_pred),
            mean_absolute_percentage_error(y_test, y_pred),
        ]

    return df_resultados.sort_values('R2', ascending=False)

In [None]:
pipeline_all

In [None]:
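# Train and test must use the same columns, so both are restricted to the numerical ones
calculate_metrics(X_train[numerical_cols], X_test[numerical_cols], y_train, y_test)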
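
The comparison above relies on a single train/test split, while the statement asks for cross-validation. A possible sketch with `cross_validate` is shown below on synthetic data; the `X_demo`/`y_demo` arrays, the two models and the column names of `df_cv` are assumptions for illustration, not the notebook's actual results.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the preprocessed Airbnb features and the 'price' target.
X_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)

models = {
    "LinearRegression": LinearRegression(),
    "DecisionTree": DecisionTreeRegressor(random_state=42),
}

rows = []
for name, model in models.items():
    # cross_validate returns per-fold scores plus fit and scoring times.
    cv = cross_validate(
        model, X_demo, y_demo, cv=5,
        scoring=("r2", "neg_mean_absolute_error"),
    )
    rows.append({
        "model": name,
        "r2_mean": cv["test_r2"].mean(),
        # scikit-learn reports MAE negated, so the sign is flipped back.
        "mae_mean": -cv["test_neg_mean_absolute_error"].mean(),
        "fit_time_mean": cv["fit_time"].mean(),
        "score_time_mean": cv["score_time"].mean(),
    })

df_cv = pd.DataFrame(rows).sort_values("r2_mean", ascending=False)
print(df_cv)
```

Passing the full `pipeline` (preprocessing plus model) as the estimator would refit the preprocessing inside every fold, which avoids leaking test-fold statistics into the scaler or imputer.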