# Linear Regression 

In this example, we will learn how to train a very basic linear regression model with **scikit-learn**. <br/>
Furthermore, we will also practice analyzing data with Pandas.

#### Dataset: California Housing

As an example dataset we will use the California Housing dataset. The data contains information from the 1990 California census.  A description of this dataset can be found here: https://www.kaggle.com/datasets/camnugent/california-housing-prices

Fortunately, the dataset is already provided in scikit-learn. So, there is no need to fetch the data from Kaggle manually. See: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html

In [2]:
# Import modules which are relevant for this project
from sklearn.datasets import fetch_california_housing
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures

# For 3D plotting
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt

In [3]:
# To ensure reproducibility
np.random.seed(42)
#np.random.seed(103)

## Load the dataset

The steps are detailed in the previous notebook.

In [4]:
dataset = fetch_california_housing(as_frame=True)

In [5]:
housing_df = dataset['data']
target_df = dataset['target']

In [6]:
# Insert the housing prices in the housing df
housing_df['HousePrice'] = target_df

## Prepare the training and testing set


### Splitting the dataframe (using sklearn)

In [7]:
train_df, test_df = train_test_split(housing_df, test_size=0.2)

In [8]:
# Briefly check whether we have the correct set sizes
print('Test ratio: ', len(test_df) / (len(train_df) + len(test_df)))

Test ratio:  0.2


In [9]:
# Separate the feature columns from the target column
target_column = 'HousePrice'
feature_columns = housing_df.columns[housing_df.columns != target_column]

x_train = train_df[feature_columns].values
y_train = train_df[[target_column]].values

x_test = test_df[feature_columns].values
y_test = test_df[[target_column]].values

### Create Polynomial features

In [10]:
poly_features = PolynomialFeatures(degree=5, include_bias=False)

x_train_poly = poly_features.fit_transform(x_train)
# Reuse the transformer fitted on the training data; do not refit on the test set
x_test_poly = poly_features.transform(x_test)
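
To get a feel for how quickly a degree-5 expansion grows, the following minimal sketch counts the generated columns for an input with 8 features, which is the number of feature columns in the California Housing data. The toy array `x_toy` and the other names are hypothetical and used only for illustration; the count is compared against the closed-form number of monomials up to the given degree, minus the bias column.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical toy input: 3 samples with 8 features each
x_toy = np.arange(24, dtype=float).reshape(3, 8)

poly = PolynomialFeatures(degree=5, include_bias=False)
x_toy_poly = poly.fit_transform(x_toy)

# Number of monomials of total degree <= 5 in 8 variables,
# minus 1 because include_bias=False drops the constant column
n_features, degree = 8, 5
expected = comb(n_features + degree, degree) - 1

print(x_toy_poly.shape[1], expected)  # 1286 1286
```

With this many columns relative to the number of training rows, regularization and careful evaluation on held-out data become more important.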

### Feature Scaling

In [11]:
scaler = StandardScaler()

scaler.fit(x_train_poly)

x_scaled_train = scaler.transform(x_train_poly)
x_scaled_test = scaler.transform(x_test_poly)
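
The scaler above is fitted on the training data only and then applied to both splits. As a rough illustration of what that implies, the sketch below uses hypothetical random data (not the housing data) to show that the training columns end up with zero mean and unit standard deviation, while the test columns are only approximately standardized because they reuse the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical toy data standing in for the polynomial feature matrices
x_toy_train = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
x_toy_test = rng.normal(loc=5.0, scale=3.0, size=(50, 4))

toy_scaler = StandardScaler()
toy_scaler.fit(x_toy_train)  # statistics come from the training split only

x_toy_train_scaled = toy_scaler.transform(x_toy_train)
x_toy_test_scaled = toy_scaler.transform(x_toy_test)

# Training columns are standardized up to floating-point error
train_mean_ok = np.allclose(x_toy_train_scaled.mean(axis=0), 0.0)
train_std_ok = np.allclose(x_toy_train_scaled.std(axis=0), 1.0)
print(train_mean_ok, train_std_ok)

# Test columns are close to, but not exactly, zero mean
print(x_toy_test_scaled.mean(axis=0))
```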

In [12]:
# Train a Ridge regression model on the scaled data
reg_model = Ridge()
reg_model.fit(x_scaled_train, y_train)

### Evaluate the new model

In [13]:
y_pred_test = reg_model.predict(x_scaled_test)
y_pred_train = reg_model.predict(x_scaled_train)

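# Take the square root of the MSE to get the RMSE. This avoids the `squared=False`
# argument, which recent scikit-learn releases no longer accept.
rmse_test = np.sqrt(mean_squared_error(y_test, y_pred_test))
rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))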

print('Train RMSE:', rmse_train)
print('Test RMSE:', rmse_test)

Train RMSE: 0.597626012798639
Test RMSE: 6.817403455351779
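
A test RMSE that is much larger than the training RMSE, as in the output above, may indicate that the model overfits. One way to investigate this is to compare several polynomial degrees side by side. The following is a minimal sketch on hypothetical synthetic data (a noisy quadratic, not the housing data); the variable names, the chosen degrees, and `alpha=1.0` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(42)

# Hypothetical synthetic data: a noisy quadratic in two features
x_syn = rng.uniform(-2.0, 2.0, size=(300, 2))
y_syn = 1.5 * x_syn[:, 0] ** 2 - x_syn[:, 1] + rng.normal(scale=0.3, size=300)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_syn, y_syn, test_size=0.2, random_state=42
)

rmse_by_degree = {}
for degree in (1, 2, 5, 8):
    # The pipeline fits every step on the training data only
    model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        StandardScaler(),
        Ridge(alpha=1.0),
    )
    model.fit(x_tr, y_tr)
    rmse_tr = np.sqrt(mean_squared_error(y_tr, model.predict(x_tr)))
    rmse_te = np.sqrt(mean_squared_error(y_te, model.predict(x_te)))
    rmse_by_degree[degree] = (rmse_tr, rmse_te)
    print(f'degree={degree}: train RMSE={rmse_tr:.3f}, test RMSE={rmse_te:.3f}')
```

The same loop could be run on the housing features by replacing the synthetic arrays with `x_train`, `y_train`, `x_test`, and `y_test`; a degree at which the test RMSE starts to rise while the training RMSE keeps falling is a reasonable upper bound for the model complexity.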
