# **Kaggle – DataTops®**
Your TA has decided on a change of scenery and has bought a laptop store. However, Data Science is their only area of expertise, so they have decided to build an ML model to set the best prices.

Could you help your teacher improve that model?

## Key points
- Last submission:
    - Morning: February 17 at 5 pm
    - Afternoon: February 19 at 5 pm
- **Competition link**: https://www.kaggle.com/t/c5cc87b50c4b4770bdc8f5acbe15577d
- **Requirement**: Be registered on [Kaggle](https://www.kaggle.com/)

## Metric
The root mean squared error (RMSE) is a measure of the standard deviation of the residuals (prediction errors). A residual is the difference between an observed value $d_i$ and the value $f_i$ predicted by the model. The RMSE indicates how spread out these errors are: the lower the RMSE, the closer the predictions are to the actual values. In other words, the RMSE measures how well the regression model fits the data.


$$ RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}{\big(d_i - f_i\big)^2}}$$
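As a sanity check on the metric, here is a minimal sketch that computes the plain RMSE (residuals squared, averaged, then square-rooted, with no rescaling) using NumPy. The helper name `rmse` and the three prices are made up for illustration; in the notebook itself, `sklearn.metrics.root_mean_squared_error` plays this role.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    return float(np.sqrt(np.mean(residuals ** 2)))

# Hypothetical prices in euros, invented for this example only
observed = [1000.0, 1500.0, 800.0]
predicted = [1100.0, 1400.0, 800.0]

example_rmse = rmse(observed, predicted)
print("RMSE:", example_rmse)
```

Because the errors are squared before averaging, a single large miss weighs more than several small ones, which is worth keeping in mind when inspecting outliers in the price.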


## 1. Libraries

In [None]:
import numpy as np
import pandas as pd
from PIL import Image
from sklearn.model_selection import train_test_split
from sklearn.metrics import root_mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
import urllib.request
import matplotlib.pyplot as plt

## 2. Data

In [None]:
df = pd.read_csv("./data/train.csv")



In [None]:
train = pd.read_csv("./data/train.csv")
test  = pd.read_csv("./data/test.csv")
sample_sub = pd.read_csv("./data/sample_submission.csv")

print("train:", train.shape)
print("test:", test.shape)
print("sample:", sample_sub.shape)
train.head()


### 2.1 Data exploration

In [None]:
df.info()

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.describe()

In [None]:
df.isna().sum().sort_values(ascending=False).head(20)



In [None]:
df.duplicated().sum()


In [None]:
df.describe(include="all").T


In [None]:
target = "Price_in_euros"

df[target].describe()


In [None]:
plt.figure(figsize=(6,4))
plt.hist(df[target], bins=30)
plt.title("Distribution of Price_in_euros")
plt.xlabel("Price_in_euros")
plt.ylabel("Frequency")
plt.show()


In [None]:
# Here we only create more analysis-friendly (numeric) columns. The original columns are not dropped.
df_eda = df.copy()

# RAM: "8GB" -> 8
df_eda["Ram_GB"] = df_eda["Ram"].str.replace("GB", "", regex=False).astype(float)

# Weight: "1.86kg" -> 1.86
df_eda["Weight_kg"] = df_eda["Weight"].str.replace("kg", "", regex=False).astype(float)

# ScreenResolution: extract width and height
wh = df_eda["ScreenResolution"].str.extract(r'(?P<width>\d+)\s*x\s*(?P<height>\d+)')
df_eda["Screen_Width"]  = wh["width"].astype(float)
df_eda["Screen_Height"] = wh["height"].astype(float)

# PPI
df_eda["PPI"] = np.sqrt(df_eda["Screen_Width"]**2 + df_eda["Screen_Height"]**2) / df_eda["Inches"]

# Flags
df_eda["Touchscreen"] = df_eda["ScreenResolution"].str.contains("Touchscreen", case=False, na=False).astype(int)
df_eda["IPS_Panel"]   = df_eda["ScreenResolution"].str.contains("IPS", case=False, na=False).astype(int)

# CPU: brand and GHz
df_eda["Cpu_Brand"] = df_eda["Cpu"].str.split().str[0]
df_eda["Cpu_GHz"] = df_eda["Cpu"].str.extract(r'(\d+(\.\d+)?)\s*GHz')[0].astype(float)

# GPU: brand
df_eda["Gpu_Brand"] = df_eda["Gpu"].str.split().str[0]

df_eda[["Ram", "Ram_GB", "Weight", "Weight_kg", "ScreenResolution", "PPI", "Cpu", "Cpu_GHz"]].head()


In [None]:
# Correlation with the target (numeric columns only)
num_cols = df_eda.select_dtypes(include=[np.number]).columns

corr_target = df_eda[num_cols].corr()[target].drop(target)

# Sort by absolute value
corr_target = corr_target.reindex(corr_target.abs().sort_values(ascending=False).index)

corr_target.head(15)


In [None]:
# Collinearity check using the correlation between features
corr_matrix = df_eda[num_cols].corr().abs()

plt.figure(figsize=(10,7))
plt.imshow(corr_matrix, aspect="auto")
plt.colorbar()
plt.xticks(range(len(num_cols)), num_cols, rotation=90)
plt.yticks(range(len(num_cols)), num_cols)
plt.title("Absolute correlation between numeric variables")
plt.show()


In [None]:
# Inspect the categorical variables
cat_cols = df.select_dtypes(include="object").columns

cat_cols



In [None]:
df_corr = df_eda.copy()


In [None]:
# Correlation matrix (numeric columns ONLY)
num_cols = df_corr.select_dtypes(include=np.number).columns

corr_matrix = df_corr[num_cols].corr()


In [None]:
plt.figure(figsize=(10,8))

plt.imshow(corr_matrix, aspect="auto")
plt.colorbar()

plt.xticks(range(len(num_cols)), num_cols, rotation=90)
plt.yticks(range(len(num_cols)), num_cols)

plt.title("Correlation matrix (clean)")

plt.show()


In [None]:
# Correlation between the numeric variables and the target
num_cols = df_corr.select_dtypes(include=np.number).columns


In [None]:
corr_target = df_corr[num_cols].corr()[target]

# Drop the target itself
corr_target = corr_target.drop(target)

# Sort by importance (absolute value)
corr_target = corr_target.reindex(
    corr_target.abs().sort_values(ascending=False).index
)

corr_target.head(15)


In [None]:
plt.figure(figsize=(8,6))

plt.barh(corr_target.index[:15], corr_target.values[:15])

plt.title("Variables most correlated with price")
plt.xlabel("Correlation")
plt.gca().invert_yaxis()

plt.show()


### 2.3 Define X and y

In [None]:
# Define the target
target = "Price_in_euros"

# X = every column except the target
X = df.drop(columns=[target])

# y = target
y = df[target].copy()

print("Shape X:", X.shape)
print("Shape y:", y.shape)


### 2.4 Split into X_train, X_test, y_train, y_test

In [None]:
from sklearn.model_selection import train_test_split

# Train / test split (internal validation)
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,   # 20% held out as the internal test set
    random_state=42
)

print("X_train:", X_train.shape)
print("X_test :", X_test.shape)
print("y_train:", y_train.shape)
print("y_test :", y_test.shape)


In [None]:
def feature_engineering(df):

    df = df.copy()

    # RAM: "8GB" -> 8
    df["Ram_GB"] = df["Ram"].str.replace("GB","", regex=False).astype(float)

    # Weight: "1.86kg" -> 1.86
    df["Weight_kg"] = df["Weight"].str.replace("kg","", regex=False).astype(float)

    # ScreenResolution -> width and height
    wh = df["ScreenResolution"].str.extract(r'(?P<width>\d+)\s*x\s*(?P<height>\d+)')

    df["Screen_Width"] = wh["width"].astype(float)
    df["Screen_Height"] = wh["height"].astype(float)

    # PPI
    df["PPI"] = np.sqrt(df["Screen_Width"]**2 + df["Screen_Height"]**2) / df["Inches"]

    # Flags
    df["Touchscreen"] = df["ScreenResolution"].str.contains("Touchscreen", case=False, na=False).astype(int)
    df["IPS_Panel"] = df["ScreenResolution"].str.contains("IPS", case=False, na=False).astype(int)

    # CPU GHz
    df["Cpu_GHz"] = df["Cpu"].str.extract(r'(\d+(\.\d+)?)\s*GHz')[0].astype(float)

    return df


In [None]:
# Feature Engineering

# Apply FE ONLY after the split

X_train_fe = feature_engineering(X_train)
X_test_fe  = feature_engineering(X_test)


In [None]:
# Apply get_dummies to train and test
X_train_encoded = pd.get_dummies(X_train_fe, drop_first=True)
X_test_encoded  = pd.get_dummies(X_test_fe, drop_first=True)


In [None]:
# Align the columns
X_train_encoded, X_test_encoded = X_train_encoded.align(
    X_test_encoded,
    join="left",
    axis=1,
    fill_value=0
)


In [None]:
print("X_train_encoded:", X_train_encoded.shape)
print("X_test_encoded :", X_test_encoded.shape)

print("Same columns:", (X_train_encoded.columns == X_test_encoded.columns).all())


One-hot encoding is applied to the categorical variables. The columns of train and test are then aligned so that both sets have exactly the same features.
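To make the alignment step concrete, here is a small sketch on toy data (the two frames and their values are invented for illustration, and `drop_first` is left out to keep the output easy to read). It shows what `align(..., join="left", fill_value=0)` does when a category appears in only one of the two splits.

```python
import pandas as pd

# Toy frames (hypothetical data): "Acer" and "HP" only appear in train, "MSI" only in test
toy_train = pd.DataFrame({"Company": ["Acer", "Dell", "HP"], "Ram_GB": [8.0, 16.0, 4.0]})
toy_test = pd.DataFrame({"Company": ["Dell", "MSI"], "Ram_GB": [8.0, 32.0]})

train_enc = pd.get_dummies(toy_train)
test_enc = pd.get_dummies(toy_test)

# join="left" keeps exactly the train columns: a category unseen in train
# (Company_MSI) is dropped from test, and categories missing from test
# (Company_Acer, Company_HP) are added to test and filled with 0.
train_enc, test_enc = train_enc.align(test_enc, join="left", axis=1, fill_value=0)

same_columns = list(train_enc.columns) == list(test_enc.columns)
print("Same columns:", same_columns)
print(test_enc)
```

One caveat that may be worth checking in the real data: when `drop_first=True` is applied to each split separately, the dropped baseline category can differ between splits if the first category of train is absent from the other split, in which case the aligned column is filled with 0 even for rows that belong to that category.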

## 3. Data processing

Our target is the `Price_in_euros` column.

-----------------------------------------------------------------------------------------------------------------

## 4. Modeling

### 4.1 Baseline model


In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import root_mean_squared_error

# Create the baseline model
rf_model = RandomForestRegressor(
    random_state=42,
    n_jobs=-1
)

# Train
rf_model.fit(X_train_encoded, y_train)

# Predictions on the internal test set
predictions = rf_model.predict(X_test_encoded)

# RMSE metric
rmse = root_mean_squared_error(y_test, predictions)

print("RMSE baseline:", rmse)


### 4.2 Compute metrics and evaluate the models

Remember that the competition is evaluated with the ``RMSE`` metric.

In [None]:
from sklearn.metrics import root_mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

# Compute the metrics

rmse = root_mean_squared_error(y_test, predictions)
mae  = mean_absolute_error(y_test, predictions)
r2   = r2_score(y_test, predictions)

print("RMSE:", rmse)
print("MAE :", mae)
print("R2  :", r2)


### 4.3 Optimization (up to you 🫰🏻)
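One possible starting point, offered only as a sketch: a small randomized search over a few `RandomForestRegressor` hyperparameters, scored with scikit-learn's `"neg_root_mean_squared_error"` so that it optimizes the competition metric. The synthetic data from `make_regression` stands in for `X_train_encoded` and `y_train`, and the parameter grid is an arbitrary assumption, not a recommendation from the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for X_train_encoded / y_train (assumption, not the laptop data)
X_demo, y_demo = make_regression(n_samples=120, n_features=6, noise=10.0, random_state=42)

# Arbitrary illustrative search space
param_distributions = {
    "n_estimators": [20, 50],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=4,                               # number of sampled combinations
    scoring="neg_root_mean_squared_error",  # scorers are maximized, so RMSE is negated
    cv=3,
    random_state=42,
)
search.fit(X_demo, y_demo)

best_rmse = -search.best_score_  # flip the sign back to a positive RMSE
best_params = search.best_params_
print("Best CV RMSE:", best_rmse)
print("Best params :", best_params)
```

With the real data, `search.best_estimator_` could then replace `rf_model` for the final predictions; the cross-validated RMSE is usually a steadier guide than a single 80/20 split.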

-----------------------------------------------------------------

## Once the model is ready, it is time to predict ``test.csv``

**REMEMBER: APPLY TO `test.csv` EVERY TRANSFORMATION YOU APPLIED TO `train.csv`.**


For example:
- Standardization/Normalization
- Outlier removal
- Column removal
- Creation of new columns
- Handling of missing values
- And a long list of other techniques that you, as a Data Scientist, considered best for your dataset.
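A minimal sketch of that idea, using missing-value handling as the example: the fill value is learned from the training data only and then reused, unchanged, on the data to predict. The toy frames and the choice of the median are assumptions for illustration; the same pattern applies to scalers or any other fitted transformation.

```python
import pandas as pd

# Hypothetical toy data: one missing weight in each split
toy_train = pd.DataFrame({"Weight_kg": [1.2, 2.0, None, 1.6]})
toy_test = pd.DataFrame({"Weight_kg": [None, 2.4]})

# Learn the fill value from train only...
train_median = toy_train["Weight_kg"].median()

# ...and reuse that same value on both splits (no rows are dropped)
toy_train_filled = toy_train.fillna({"Weight_kg": train_median})
toy_test_filled = toy_test.fillna({"Weight_kg": train_median})

print("Median learned from train:", train_median)
print(toy_test_filled)
```

Recomputing the median on the test data instead would make the two sets inconsistent with each other, which is the kind of mismatch the reminder above warns about.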

## 1. Load the `test.csv` data to predict.


In [None]:
X_pred = pd.read_csv("./data/test.csv")
X_pred.head()

In [None]:
X_pred.tail()

In [None]:
X_pred.info()

## 2. Replicate the processing for ``test.csv``

In [None]:
X_pred_fe = feature_engineering(test)


In [None]:
# 1) Apply FE to the test set
X_pred_fe = feature_engineering(test)

# 2) Dummies
X_pred = pd.get_dummies(X_pred_fe, drop_first=True)

# 3) Align with train
X_pred_aligned = X_pred.reindex(columns=X_train_encoded.columns, fill_value=0)

# 4) Predict
predictions_submit = rf_model.predict(X_pred_aligned)

len(predictions_submit)


In [None]:
X_pred

In [None]:
predictions_submit = rf_model.predict(X_pred_aligned)
predictions_submit[:5]


In [None]:
len(predictions_submit)


**WATCH OUT! Why am I getting an error?**

IMPORTANT:

- IF THE ARRAY YOU CALLED `.fit()` ON HAD 4 COLUMNS, THE ARRAY FOR `.predict()` MUST HAVE THE SAME ONES
- IF YOU NORMALIZED THE ARRAY YOU CALLED `.fit()` ON, YOU MUST NORMALIZE THE ONE FOR `.predict()` TOO
- EVERYTHING STAYS THE SAME EXCEPT **DELETING ROWS**: THE NUMBER OF ROWS IN THIS SET MUST NOT CHANGE, BECAUSE THE PREDICTION MUST HAVE **391 ROWS**, NO MATTER WHAT
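These three conditions can be checked before calling `.predict()`. The helper below is a hypothetical sketch (the name `check_predict_input` and the toy frames are not part of the notebook): it compares the columns used at fit time with those of the new data and verifies the row count.

```python
import pandas as pd

def check_predict_input(train_columns, X_new, expected_rows):
    """Return a list of problems found; an empty list means X_new looks usable."""
    problems = []
    missing = [c for c in train_columns if c not in X_new.columns]
    extra = [c for c in X_new.columns if c not in train_columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    if extra:
        problems.append(f"unexpected columns: {extra}")
    if not missing and not extra and list(X_new.columns) != list(train_columns):
        problems.append("columns are in a different order than at fit time")
    if len(X_new) != expected_rows:
        problems.append(f"expected {expected_rows} rows, got {len(X_new)}")
    return problems

# Hypothetical toy example
fit_columns = ["Ram_GB", "Weight_kg"]
good = pd.DataFrame({"Ram_GB": [8.0, 16.0], "Weight_kg": [1.3, 2.1]})
bad = pd.DataFrame({"Ram_GB": [8.0], "Inches": [15.6]})

good_report = check_predict_input(fit_columns, good, expected_rows=2)
bad_report = check_predict_input(fit_columns, bad, expected_rows=2)
print(good_report)
print(bad_report)
```

In this notebook the call would presumably look like `check_predict_input(X_train_encoded.columns, X_pred_aligned, expected_rows=391)`.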

**So, if you used `index_col=0` when loading ``train.csv``, do I have to do the same for `test.csv`?**

In [None]:
# What do you think?
# Yes or no?
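A quick way to see why the answer matters is to load the same CSV both ways. The sketch below uses a tiny in-memory CSV with invented contents (the column order of the real files is an assumption here): with `index_col=0` the first column becomes the index and is no longer a feature, so both files need to be loaded the same way for the feature columns to match.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for train.csv / test.csv (hypothetical contents)
csv_text = "laptop_ID,Company,Inches\n10,Dell,15.6\n11,HP,14.0\n"

with_id_column = pd.read_csv(io.StringIO(csv_text))
with_id_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

cols_default = list(with_id_column.columns)  # laptop_ID stays as a regular column
cols_indexed = list(with_id_index.columns)   # laptop_ID has moved to the index

print(cols_default)
print(cols_indexed)
```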

![wow.jpeg](attachment:wow.jpeg)

## 3. **What will you upload to Kaggle?**

**To upload your prediction to Kaggle, it must have a specific shape.**

In this case, the **SAME** shape as `sample_submission.csv`.

In [None]:
# 1) Reload a clean sample (the variable used by the checker)
sample = pd.read_csv("./data/sample_submission.csv")

# 2) Rebuild the submission from the template
submission = sample.copy()

# 3) Insert the predictions (make sure predictions_submit exists and has 391 values)
price_col = [c for c in submission.columns if "price" in c.lower()][0]
submission[price_col] = predictions_submit

print("sample shape:", sample.shape)
print("submission shape:", submission.shape)
print("Same IDs:", submission["laptop_ID"].equals(sample["laptop_ID"]))
submission.head()


In [None]:
sample = pd.read_csv("./data/sample_submission.csv")  # VERY IMPORTANT: the checker uses 'sample'
print("sample shape:", sample.shape)
print(sample.head())


In [None]:
submission = sample.copy()

price_col = [c for c in submission.columns if "price" in c.lower()][0]
submission[price_col] = predictions_submit

print("submission shape:", submission.shape)
print(submission.head())


## 4. Put your predictions in a dataframe called ``submission``.

In [None]:
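# How do we build the submission? Start from the template and overwrite the price column.
submission = sample.copy()
submission[price_col] = predictions_submit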

In [None]:
submission.head()

In [None]:
submission.shape

In [None]:
print("sample shape:", sample.shape)
print("submission shape:", submission.shape)

print(sample.columns)
print(submission.columns)


## 5. Run the CHECKER (`chequeador`) to confirm it is ready to upload to Kaggle.

In [None]:
def chequeador(df_to_submit):
    """
    This function makes sure your submission has the shape Kaggle requires.

    If it does, the dataframe is saved to a `csv` file, ready to upload to Kaggle.

    If it does not, READ THE MESSAGE AND DO WHAT IT SAYS.

    If it still does not work:
    - turn off your computer,
    - go for a walk,
    - turn it on again,
    - open this notebook and
    - read it all again.
    We all deserve a second chance. So do you.
    """
    if df_to_submit.shape == sample.shape:
        # Compare element-wise; `a.all() == b.all()` would only compare two booleans
        if (df_to_submit.columns == sample.columns).all():
            if (df_to_submit.laptop_ID.to_numpy() == sample.laptop_ID.to_numpy()).all():
                print("You're ready to submit!")
                df_to_submit.to_csv("submission.csv", index = False) # index = False is very important
                urllib.request.urlretrieve("https://www.mihaileric.com/static/evaluation-meme-e0a350f278a36346e6d46b139b1d0da0-ed51e.jpg", "gfg.png")
                img = Image.open("gfg.png")
                img.show()
            else:
                print("Check the ids and try again")
        else:
            print("Check the names of the columns and try again")
    else:
        print("Check the number of rows and/or columns and try again")
        print("\nSecret message from the TA: I can't believe that after this whole notebook you changed the rows of `test.csv`. I'm crying.")

In [None]:
chequeador(submission)