## Multiple Linear Regression


**The difference from the previous (simple) linear regression is that there is now more than one predictor variable, and together they are used to estimate the value of $y$.**

$$ \widehat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n $$

$$ \widehat{y} = \theta^T X $$

$$ \theta^T = \begin{bmatrix}
\theta_0 & \theta_1 & \theta_2 & \cdots & \theta_n
\end{bmatrix} $$

$$X = \begin{bmatrix}
1\\
x_1\\
x_2\\
\vdots\\
x_n
\end{bmatrix}$$
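
As a quick check that the vectorized form $\theta^T X$ matches the expanded sum, here is a minimal NumPy sketch. The values of `theta` and `features` are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical parameter values, chosen only for illustration
theta = np.array([2.0, 0.5, -1.0, 3.0])   # [theta_0, theta_1, theta_2, theta_3]
features = np.array([4.0, 2.0, 1.0])      # [x_1, x_2, x_3]

# Prepend the constant 1 so that theta_0 acts as the intercept
x = np.concatenate(([1.0], features))

# Vectorized form: y_hat = theta^T x
y_hat_vector = theta @ x

# Expanded form: theta_0 + theta_1*x_1 + ... + theta_n*x_n
y_hat_sum = theta[0] + np.sum(theta[1:] * features)

print(y_hat_vector, y_hat_sum)  # both are 5.0
```

The leading 1 in $X$ is what lets the intercept $\theta_0$ be absorbed into a single dot product.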

![ml_09.png](attachment:ml_09.png)

**Simple linear regression and multiple linear regression rely on the same concepts and the same evaluation techniques.**

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

In [None]:
df = pd.read_csv("FuelConsumptionCo2.csv")

df.head(3)

In [None]:
# This time we use 3 columns to predict "CO2EMISSIONS"

df[["ENGINESIZE", "CYLINDERS", "FUELCONSUMPTION_CITY"]].head()

In [None]:
X = np.array(df[["ENGINESIZE", "CYLINDERS", "FUELCONSUMPTION_CITY"]])

y = np.array(df["CO2EMISSIONS"])

In [None]:
X.shape, y.shape

### Train, Test

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30)

print(f"Train set: {X_train.shape, y_train.shape}")
print(f"Test set: {X_test.shape, y_test.shape}")
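
Because `train_test_split` shuffles the rows, each run can produce a different split and therefore slightly different metrics. The sketch below, which uses synthetic stand-in data rather than the real dataset and an arbitrary seed of 42, shows how passing `random_state` should make the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features (not the real dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

# random_state=42 is an arbitrary seed; any fixed integer works
split_a = train_test_split(X_demo, y_demo, test_size=0.30, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.30, random_state=42)

X_train_a, X_test_a, y_train_a, y_test_a = split_a

# With 100 rows and test_size=0.30 we expect 70 training rows and 30 test rows
print(X_train_a.shape, X_test_a.shape)

# The two calls return identical arrays because the seed is the same
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```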

In [None]:
# scikit-learn's linear regression estimator

regresion_lineal = LinearRegression()
regresion_lineal.fit(X_train, y_train)

# Fitted parameters: one weight per predictor (the model is a hyperplane, not a line) plus the intercept
print("weights:", regresion_lineal.coef_)
print("w_0:", regresion_lineal.intercept_)
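
To connect the fitted attributes with the formula $\widehat{y} = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n$, the following sketch rebuilds the predictions by hand from `coef_` and `intercept_`. It uses synthetic, noise-free data with hypothetical true weights, so the recovered parameters can be compared against known values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: 50 rows, 3 features (not the real dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))

# Hypothetical noise-free relationship: y = 1 + 2*x1 - 3*x2 + 0.5*x3
true_weights = np.array([2.0, -3.0, 0.5])
y_demo = 1.0 + X_demo @ true_weights

model = LinearRegression().fit(X_demo, y_demo)

# Manual prediction: intercept plus the weighted sum of the features
manual = model.intercept_ + X_demo @ model.coef_

print(np.allclose(manual, model.predict(X_demo)))
print(model.coef_, model.intercept_)
```

With real, noisy data the recovered weights would only approximate any underlying relationship, but the identity between `predict` and the manual formula still holds.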

### Predictions

In [None]:
yhat = regresion_lineal.predict(X_test)

for i, j in zip(yhat[:5], y_test[:5]):
    print(f"Prediction:{i} \tActual value:{j}")

### Metrics

In [None]:
# scikit-learn provides several of these metrics as ready-made functions.

from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

In [None]:
# Relative Absolute Error
RAE = np.sum(np.abs(np.subtract(y_test, yhat))) / np.sum(np.abs(np.subtract(y_test, np.mean(y_test))))

# Relative Squared Error
RSE = np.sum(np.square(np.subtract(y_test, yhat))) / np.sum(np.square(np.subtract(y_test, np.mean(y_test))))

# Adjusted R**2
r2_ajustada = 1 - (1 - regresion_lineal.score(X_test, y_test))*(len(y_test) - 1)/(len(y_test) - X_test.shape[1] - 1)
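
The three quantities above are closely related: with these definitions, $R^2 = 1 - RSE$, and the adjusted $R^2$ rescales $1 - R^2$ by $(n - 1)/(n - p - 1)$, where $n$ is the number of samples and $p$ the number of predictors. A minimal sketch on made-up numbers (the arrays and `n_features = 2` are assumptions for illustration only):

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up values for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0, 13.0])
y_pred = np.array([2.5, 5.5, 6.5, 9.5, 10.0, 13.5])
n_features = 2  # assumed number of predictors behind y_pred

# Relative Squared Error: residual sum of squares over total sum of squares
rse = np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# R**2 as computed by scikit-learn (true values first, predictions second)
r2 = r2_score(y_true, y_pred)

# Adjusted R**2 penalizes R**2 for the number of predictors
n = len(y_true)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)

print(np.isclose(r2, 1 - rse))
print(r2, adj_r2)
```

Since $(n - 1)/(n - p - 1) \geq 1$ for $p \geq 0$, the adjusted value never exceeds the plain $R^2$ as long as $R^2 \leq 1$.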

In [None]:
print(f"MAE:\t {mean_absolute_error(y_test, yhat)}")
print(f"MSE:\t {mean_squared_error(y_test, yhat)}")
print(f"R**2:\t {r2_score(y_test, yhat)}")
print(f"RAE:\t {RAE}")
print(f"RSE:\t {RSE}")
print(f"Adjusted R**2:\t {r2_ajustada}")

### y_test vs yhat

In [None]:
# Compare yhat and y_test; "diferencia" is the absolute error as a percentage of y_test

df_pred = pd.DataFrame()

df_pred["y_test"] = y_test.flatten()
df_pred["yhat"] = yhat.flatten()

df_pred["diferencia"] = round(abs((df_pred["y_test"] - df_pred["yhat"]) / df_pred["y_test"] * 100), 4)

df_pred = df_pred.sort_values("diferencia")

df_pred.head(20)
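
The per-row percentage computed above can be summarized in a single number by averaging it, which gives the mean absolute percentage error (MAPE). A minimal sketch with made-up values (not taken from the dataset):

```python
import numpy as np

# Made-up values for illustration only
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 190.0, 400.0])

# Absolute error of each prediction as a percentage of the actual value
pct_error = np.abs((y_true - y_pred) / y_true) * 100

# Averaging the per-row percentages gives the MAPE
mape = pct_error.mean()

print(pct_error)  # per-row errors: 10%, 5% and 0%
print(mape)       # 5.0
```

Note that this measure is undefined when an actual value is 0, because of the division by `y_true`.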

In [None]:
df_pred.tail(20)

In [None]:
# Compare how far apart the actual values (y_test) and the predicted values (yhat) are

plt.figure(figsize = (8, 5))

sns.scatterplot(x = y_test.flatten(), y = yhat.flatten(), alpha = 0.5, color = "blue")

plt.xlabel("Actual values (y_test)", size = 18)
plt.ylabel("Predictions (yhat)", size = 18)

plt.show()

In [None]:
# Plotting X_test vs y_test gives the point cloud of actual values

# Plotting X_test vs yhat gives the point "cloud" of predicted values

# X_test has 3 columns, so each call draws one series of points per feature

plt.figure(figsize = (8, 5))

plt.plot(X_test, y_test, marker = "o", linestyle = "", label = "y_test", alpha = 0.5)

plt.plot(X_test, yhat, marker = "o", linestyle = "", label = "yhat", alpha = 0.5)

plt.legend()
plt.show()

In [None]:
################################################################################################################################