# Linear Regression 

In this example, we will learn how to train a very basic linear regression model with **scikit-learn**. <br/>
Furthermore, we will also practice analyzing data with Pandas.

#### Dataset: California Housing

As an example dataset we will use the California Housing dataset. The data contains information from the 1990 California census.  A description of this dataset can be found here: https://www.kaggle.com/datasets/camnugent/california-housing-prices

Fortunately, the dataset is already provided in scikit-learn. So, there is no need to fetch the data from Kaggle manually. See: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html

In [None]:
# Import modules which are relevant for this project
from sklearn.datasets import fetch_california_housing
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# For 3D plotting
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt

In [None]:
# To ensure reproducibility

RANDOM_STATE = 42

np.random.seed(RANDOM_STATE)

## Load the dataset

The loading steps are explained in detail in the previous notebook.

In [None]:
dataset = fetch_california_housing(as_frame=True)

In [None]:
housing_df = dataset['data']
target_df = dataset['target']

In [None]:
# Insert the housing prices in the housing df
housing_df['HousePrice'] = target_df

## Prepare a training and testing set

### Manually splitting the dataframe

In [None]:
def custom_split_train_test(df, test_ratio):
    
    # TODO: Randomly split the data into two parts
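
One possible way to approach this exercise is sketched below. It shuffles the row positions with a NumPy permutation and slices off the first `test_ratio` share as the test set. The helper name `split_train_test_sketch`, its `seed` parameter, and the toy dataframe are illustrative assumptions and not part of the original notebook.

```python
import numpy as np
import pandas as pd


def split_train_test_sketch(df, test_ratio, seed=42):
    """Randomly split df into (train, test). Hypothetical helper for illustration."""
    rng = np.random.default_rng(seed)
    # Shuffle the row positions so the split is random
    shuffled = rng.permutation(len(df))
    test_size = int(len(df) * test_ratio)
    test_idx = shuffled[:test_size]
    train_idx = shuffled[test_size:]
    return df.iloc[train_idx], df.iloc[test_idx]


# Toy data standing in for housing_df (assumption)
toy_df = pd.DataFrame({'MedInc': np.arange(10.0),
                       'HousePrice': np.arange(10.0) * 2})
toy_train, toy_test = split_train_test_sketch(toy_df, 0.2)
print('Train size:', len(toy_train), 'Test size:', len(toy_test))
```

Note that a split like this changes whenever the seed or the row order changes, which is one reason to prefer a library routine with a fixed `random_state`.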

In [None]:
# Use 80% for training and 20% for testing
train_df, test_df = custom_split_train_test(housing_df, 0.2)

In [None]:
# Briefly check whether we have the correct set sizes
print('Test ratio: ', len(test_df) / (len(train_df) + len(test_df)))

### Splitting the dataframe (using sklearn)

In [1]:
# TODO: Split the dataset using sklearn's train_test_split()
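
As a hedged sketch of what this step could look like, the snippet below calls `train_test_split` on a small synthetic dataframe. The names `demo_df`, `demo_train`, and `demo_test` are assumptions chosen so that they do not clash with the notebook's own variables. In the notebook itself the same call would be applied to `housing_df`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for housing_df (assumption)
demo_df = pd.DataFrame({'MedInc': np.linspace(1.0, 10.0, 50),
                        'HousePrice': np.linspace(0.5, 5.0, 50)})

# test_size=0.2 keeps 20% of the rows for testing;
# random_state makes the shuffle reproducible
demo_train, demo_test = train_test_split(demo_df, test_size=0.2, random_state=42)

print('Test ratio: ', len(demo_test) / (len(demo_train) + len(demo_test)))
```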

In [None]:
# Briefly check whether we have the correct set sizes
print('Test ratio: ', len(test_df) / (len(train_df) + len(test_df)))

## Train and test the linear regression model (Features: MedInc)

### Train the model on the training set

In [None]:
# TODO: Train the model on the training set
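
A minimal sketch of fitting a single-feature model is shown below, assuming synthetic data in place of the housing set. The dataframe `toy`, the model name `sketch_model`, and the chosen slope and intercept are illustrative assumptions only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data: HousePrice depends linearly on MedInc plus a little noise (assumption)
rng = np.random.default_rng(0)
med_inc = rng.uniform(1.0, 10.0, size=200)
toy = pd.DataFrame({'MedInc': med_inc,
                    'HousePrice': 0.4 * med_inc + 0.5 + rng.normal(0.0, 0.1, size=200)})

# Double brackets keep the inputs two-dimensional, as scikit-learn expects
x_toy = toy[['MedInc']]
y_toy = toy[['HousePrice']]

sketch_model = LinearRegression()
sketch_model.fit(x_toy, y_toy)

# With a 2D target, coef_ has shape (1, n_features) and intercept_ has shape (1,)
print('Slope:', sketch_model.coef_[0, 0])
print('Intercept:', sketch_model.intercept_[0])
```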

### Evaluate the model on the training and testing set

In [None]:
# Question: What would be a good error metric?

In [2]:
# TODO: Use the error metric to evaluate the performance on the testing set
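
Two common candidates for regression are the mean absolute error (MAE) and the root mean squared error (RMSE). Both are expressed in the units of the target, and RMSE penalizes large errors more strongly. The sketch below computes both on a tiny hand-made example. The arrays `y_true` and `y_hat` are made-up values for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up targets and predictions (assumption)
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.5, 2.0, 2.0, 4.5])

# MAE: average of the absolute differences
mae = mean_absolute_error(y_true, y_hat)

# RMSE: square root of the mean squared error
rmse = np.sqrt(mean_squared_error(y_true, y_hat))

print('MAE:', mae)
print('RMSE:', rmse)
```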

In [3]:
# Question: What can we say about the performance of our model?
# Does it overfit or underfit the training data?
# How can we improve the result?

## Adding another input feature (Features: MedInc, HouseAge)

Next, we will add another input feature (HouseAge) to see whether this improves the performance, and also to check out Matplotlib's 3D plotting capabilities ;-)

Note that a linear regression model can process an arbitrarily large number of input features simply by fitting a hyperplane instead of a line.
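
To make the hyperplane idea concrete, the following sketch fits a model on two synthetic features and checks that its predictions equal the weighted sum of the inputs plus the intercept. The data, the weights `0.4` and `0.01`, and the name `plane_model` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Two synthetic input features and a noise-free planar target (assumption)
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 10.0, size=(100, 2))
y = 0.4 * X[:, 0] + 0.01 * X[:, 1] + 0.2

plane_model = LinearRegression().fit(X, y)

# A prediction is the dot product of the inputs with the coefficients plus the intercept
manual = X @ plane_model.coef_ + plane_model.intercept_

print('Coefficients:', plane_model.coef_)
print('Matches predict():', np.allclose(manual, plane_model.predict(X)))
```

The same formula applies with any number of features: each one contributes one coefficient, and the fitted surface is a line for one feature, a plane for two, and a hyperplane beyond that.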

In [None]:
# Add the additional input feature HouseAge
feature_names = ['MedInc', 'HouseAge']

x_train = train_df[feature_names]
y_train = train_df[['HousePrice']]

x_test = test_df[feature_names]
y_test = test_df[['HousePrice']]

### Visualize the input space using Matplotlib

First, let's look at how to visualize the 3D space using Matplotlib.

In [None]:
# Create a new figure
fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(111, projection='3d')

# Sample some points from the dataset to make the visualization more sparse
vis_df = train_df.sample(5000)

# Plot the chosen points in the 3D space
ax.scatter(vis_df['MedInc'], vis_df['HouseAge'], vis_df['HousePrice'], s=1, c='red')

ax.set_xlabel("MedInc")
ax.set_ylabel("HouseAge")
ax.set_zlabel("HousePrice")

plt.show()

### Train and evaluate the new model

In [None]:
# Train a fresh model on both input features
reg_model = LinearRegression()
reg_model.fit(x_train, y_train)

In [None]:
y_pred_test = reg_model.predict(x_test)
y_pred_train = reg_model.predict(x_train)

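# Compute the RMSE as the square root of the MSE
# (the `squared=False` argument is deprecated in recent scikit-learn versions)
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))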

print('Train RMSE:', rmse_train)
print('Test RMSE:', rmse_test)

As can be seen, adding the new feature gives a slight increase in performance. Unfortunately, the model still performs badly.

### Visualize the hyperplane

In [None]:
# Create a new figure
fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(111, projection='3d')

# Sample some points from the dataset to make the visualization more sparse
vis_df = train_df.sample(5000)

# Plot the chosen points in the 3D space
ax.scatter(vis_df['MedInc'], vis_df['HouseAge'], vis_df['HousePrice'], s=1, c='red')

# Obtain the coefficients from the linear regression model
coef = reg_model.coef_
intercept = reg_model.intercept_

# Obtain the range that the hyperplane should span
x_range = np.arange(vis_df['MedInc'].min(), vis_df['MedInc'].max())
y_range = np.arange(vis_df['HouseAge'].min(), vis_df['HouseAge'].max())

# Draw hyperplane
xx, yy = np.meshgrid(x_range, y_range)

zz = coef[0, 0]*xx + coef[0, 1]*yy + intercept
ax.plot_surface(xx, yy, zz, alpha=0.5)

ax.set_xlabel("MedInc")
ax.set_ylabel("HouseAge")
ax.set_zlabel("HousePrice")

# Rotate the view to better see the hyperplane
# (roll defaults to 0; omitting it also works on Matplotlib versions before 3.6)
ax.view_init(elev=10, azim=110)

plt.show()