# Linear Regression 

In this example, we will learn how to train a very basic linear regression model with **scikit-learn**. <br/>
We will also practice analyzing data with Pandas.

#### Dataset: California Housing

As an example, we will use the California Housing dataset, which contains information from the 1990 California census. A description of this dataset can be found here: https://www.kaggle.com/datasets/camnugent/california-housing-prices

Fortunately, scikit-learn already provides this dataset, so there is no need to fetch the data from Kaggle manually. See: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_california_housing.html

In [1]:
# Import modules which are relevant for this project
from sklearn.datasets import fetch_california_housing
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
import seaborn as sns

## Load the dataset

In [2]:
# as_frame: 
# If True, the data is a pandas DataFrame including columns with appropriate dtypes (numeric, string or categorical). 
# The target is a pandas DataFrame or Series depending on the number of target_columns.
dataset = fetch_california_housing(as_frame=True)

In [3]:
# Check what keys are available.
# We are interested in <data> and <target>
print(dataset.keys())

dict_keys(['data', 'target', 'frame', 'target_names', 'feature_names', 'DESCR'])


In [4]:
housing_df = dataset['data']
target_df = dataset['target']

## Analyze the data

Now that the data is loaded, our first step is to get an understanding of it.

In [2]:
# TODO: Analyze the datasets
# - How many instances does the dataset have?
# - What columns does it have and what do they represent?
# - What are the datatypes?
# - What is our target variable and what are the features?
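One possible way to approach these questions is sketched below. To keep the snippet runnable on its own, a tiny made-up DataFrame (`toy_df`, with invented values) stands in for `housing_df`; the same attributes should be available on the real data.

```python
import pandas as pd

# Hypothetical stand-in for housing_df; the values are invented.
toy_df = pd.DataFrame({
    "MedInc": [8.3, 7.2, 5.6, 3.8],
    "HouseAge": [41.0, 21.0, 52.0, 52.0],
    "AveRooms": [6.9, 6.2, 8.2, 5.8],
})

n_rows, n_cols = toy_df.shape        # number of instances and number of columns
column_names = list(toy_df.columns)  # names of the columns
column_dtypes = toy_df.dtypes        # datatype of every column

print(n_rows, n_cols)
print(column_names)
print(column_dtypes)
```

For the meaning of each column, the `DESCR` entry listed among the dataset keys above is a natural place to look.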

Let's take a closer look at the values in `housing_df`.

By comparing the mean and median values, we can already learn something about the distribution of each feature and whether it is skewed.

Recall that if a distribution is **left-skewed**, it has a long tail on the left, which typically means that Mean < Median < Mode. <br/>
If it's **right-skewed**, it has a long tail on the right, which typically means that Mode < Median < Mean.

In [None]:
# TODO: Inspect the statistics of housing_df
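As a small illustration of the mean/median comparison, the sketch below uses a made-up, right-skewed column (`toy_df`, invented values with one large outlier) instead of the real `housing_df`. `describe()` reports the summary statistics, and the median appears there as the `50%` row.

```python
import pandas as pd

# Hypothetical right-skewed sample with one large outlier; values are invented.
toy_df = pd.DataFrame({"Population": [300.0, 500.0, 800.0, 1200.0, 9000.0]})

stats = toy_df.describe()                   # count, mean, std, min, quartiles, max
mean_pop = toy_df["Population"].mean()      # pulled upwards by the outlier
median_pop = toy_df["Population"].median()  # robust against the outlier

print(stats)
print(mean_pop > median_pop)  # a mean above the median hints at a right skew
```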

However, we typically get a better understanding if we visualize the data. Let's take a look at the histograms.

In [None]:
# TODO: Use the pandas plotting capabilities to plot the distribution of individual features
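A minimal sketch of a single-feature histogram with the pandas plotting interface follows. The DataFrame `toy_df` and its values are invented for illustration; the bin count is an arbitrary choice.

```python
import pandas as pd

# Hypothetical stand-in for one column of housing_df; values are invented.
toy_df = pd.DataFrame({"MedInc": [1.5, 2.1, 2.4, 3.0, 3.2, 3.9, 4.5, 8.7]})

# Series.plot.hist draws a histogram and returns the matplotlib Axes.
ax = toy_df["MedInc"].plot.hist(bins=5, title="Distribution of MedInc")
ax.set_xlabel("MedInc")
```

In a notebook, the figure is normally rendered below the cell without further calls.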

However, doing this for every feature individually is a little bit inconvenient. Luckily, there is a faster way to do this for all features.

In [3]:
# TODO: Plot the distribution of all features in a grid
# What insights can we gain from these distributions?
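One way to get all histograms at once is `DataFrame.hist`, which draws one subplot per numeric column. The sketch below assumes a small made-up DataFrame (`toy_df`, invented values); the `bins` and `figsize` values are arbitrary choices.

```python
import pandas as pd

# Hypothetical stand-in for housing_df; the values are invented.
toy_df = pd.DataFrame({
    "MedInc": [1.5, 2.1, 2.4, 3.0, 3.2, 3.9, 4.5, 8.7],
    "HouseAge": [52.0, 41.0, 35.0, 30.0, 21.0, 18.0, 10.0, 5.0],
    "AveRooms": [4.2, 4.8, 5.0, 5.1, 5.5, 6.0, 6.3, 9.9],
})

# DataFrame.hist arranges one histogram per numeric column in a grid
# and returns the array of matplotlib Axes.
axes = toy_df.hist(bins=5, figsize=(8, 6))
```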

Since our goal is to train a linear regression model, it's important to understand which columns are well correlated with the house price. <br/>
A fast way to obtain this information is by computing the pairwise correlations between the columns.

The **Pearson correlation coefficient** measures how strongly two variables are linearly correlated. A value close to 1 indicates a strong positive correlation (exactly 1 is a perfect positive correlation). A value close to -1 indicates a strong negative correlation (exactly -1 is a perfect negative correlation). A value close to 0 indicates little or no linear correlation. See: https://en.wikipedia.org/wiki/Pearson_correlation_coefficient

In [4]:
# TODO: Compute the pairwise correlation between the different variables
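The sketch below computes a pairwise correlation matrix with `DataFrame.corr` (Pearson is its default method) and cross-checks one entry against the textbook formula. The DataFrame `toy_df` and its values are invented; `HousePrice` is constructed to be exactly linear in `MedInc` so that their coefficient comes out as 1.

```python
import numpy as np
import pandas as pd

# Hypothetical data; HousePrice is exactly 0.5 * MedInc by construction.
toy_df = pd.DataFrame({
    "MedInc": [2.0, 4.0, 6.0, 8.0],
    "HousePrice": [1.0, 2.0, 3.0, 4.0],
    "Latitude": [38.0, 36.0, 37.0, 34.0],
})

corr_matrix = toy_df.corr()  # pairwise Pearson correlation of all columns

# Manual Pearson coefficient: covariance divided by the product of the spreads.
x, y = toy_df["MedInc"], toy_df["Latitude"]
dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(corr_matrix)
print(r_manual)
```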

Alternatively, we can use [Seaborn](https://seaborn.pydata.org/index.html) to graphically visualize the correlation matrix. Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

In [5]:
# TODO: Use seaborn to create a heatmap of the correlation matrix
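A possible heatmap call is sketched below on a made-up DataFrame (`toy_df`, invented values). The color map, the number format, and the fixed color range from -1 to 1 are illustrative choices, not requirements.

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the housing data; values are invented.
toy_df = pd.DataFrame({
    "MedInc": [2.0, 4.0, 6.0, 8.0],
    "HousePrice": [1.0, 2.5, 2.9, 4.2],
    "Latitude": [38.0, 36.0, 37.0, 34.0],
})

corr_matrix = toy_df.corr()

# annot=True writes each coefficient into its cell; vmin/vmax pin the color
# scale to the full range of possible correlation values.
ax = sns.heatmap(corr_matrix, annot=True, fmt=".2f", cmap="coolwarm", vmin=-1, vmax=1)
ax.set_title("Pearson correlation")
```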

We are primarily interested in the correlation of each variable with `HousePrice`. So let's do some filtering. <br/>
What we would like to see are values close to 1 or -1.

In [14]:
# TODO: Obtain the correlation coefficient of all variables with HousePrice
# Which variable can we expect to be the most useful for predicting HousePrice?

HousePrice    1.000000
MedInc        0.688075
AveRooms      0.151948
HouseAge      0.105623
AveOccup     -0.023737
Population   -0.024650
Longitude    -0.045967
AveBedrms    -0.046701
Latitude     -0.144160
Name: HousePrice, dtype: float64
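A ranking like the one above could be produced roughly as follows. The sketch assumes that the target is first attached to the features as a column named `HousePrice`; the names `toy_features`, `toy_target`, and `full_df` as well as all values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for housing_df and target_df; values are invented.
toy_features = pd.DataFrame({
    "MedInc": [2.0, 4.0, 6.0, 8.0],
    "Latitude": [38.0, 36.0, 37.0, 34.0],
})
toy_target = pd.Series([1.0, 2.5, 2.9, 4.2], name="MedHouseVal")

# Assumption: features and target are combined into one frame, with the
# target stored under the column name HousePrice.
full_df = toy_features.assign(HousePrice=toy_target)

# Take the HousePrice column of the correlation matrix and sort it.
price_corr = full_df.corr()["HousePrice"].sort_values(ascending=False)
print(price_corr)
```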

In [6]:
# TODO: Visually inspect the correlation between the chosen variable and HousePrice
# What is a good plot type?
# What insight can we gain from this plot?
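A scatter plot is one reasonable choice for two numeric variables. The sketch below uses a made-up DataFrame (`toy_df`, invented values); the `alpha` value is an arbitrary choice that helps when many points overlap.

```python
import pandas as pd

# Hypothetical stand-in for the housing data; values are invented.
toy_df = pd.DataFrame({
    "MedInc": [1.5, 2.1, 3.0, 3.9, 4.5, 8.7],
    "HousePrice": [0.9, 1.4, 1.8, 2.6, 2.4, 5.0],
})

# One point per row: feature on the x-axis, target on the y-axis.
ax = toy_df.plot.scatter(x="MedInc", y="HousePrice", alpha=0.5)
```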

## Preview: Train our first linear regression model

To get familiar with linear regression models, we will now train our first linear regressor on two variables only: `MedInc` as the single input feature and `HousePrice` as the target. In this notebook, we will not split the dataset into training and test subsets; this will be done in the next notebook. For now, we only care about how a linear regression model can be created and trained with scikit-learn.

I recommend taking a look at the documentation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

This page provides some nice examples that show how to fit a line to some points: https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html#sphx-glr-auto-examples-linear-model-plot-ols-py

In [None]:
# TODO: Create a linear regression model

In [None]:
# TODO: Extract the training data from the pandas dataframe

In [None]:
# TODO: Train the model
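The three steps above (create the model, extract the data, fit) can be sketched as follows. The data in `toy_df` is invented and lies exactly on the line y = 0.5 * x + 1, so the fitted parameters are easy to check; with the real data, the columns would come from the housing DataFrame instead.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical training data on the line HousePrice = 0.5 * MedInc + 1.
toy_df = pd.DataFrame({
    "MedInc": [1.0, 2.0, 3.0, 4.0],
    "HousePrice": [1.5, 2.0, 2.5, 3.0],
})

model = LinearRegression()

# scikit-learn expects a 2D feature array of shape (n_samples, n_features),
# hence the double brackets; the target can stay one-dimensional.
X = toy_df[["MedInc"]].to_numpy()
y = toy_df["HousePrice"].to_numpy()

model.fit(X, y)
print(model.coef_, model.intercept_)  # slope and intercept of the fitted line
```

Converting to NumPy arrays here is a design choice: it avoids a feature-name warning when predicting later on a plain array such as `x_line`.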

That's it, you just trained your first model. Let's take a look at its predictions.

In [17]:
# Sample points at regular intervals
x_line = np.arange(0, 10, 0.01).reshape(-1, 1)

In [None]:
# TODO: Predict the house price for each test point x (x_line)
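Prediction is a single call to `predict` on a fitted model. The sketch below is self-contained, so it first fits a model on invented data (points on the line y = 0.5 * x + 1) and then evaluates it on the same `x_line` grid as defined above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data on the line y = 0.5 * x + 1.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.5, 2.0, 2.5, 3.0])
model = LinearRegression().fit(X, y)

# Sample points at regular intervals, as in the cell above.
x_line = np.arange(0, 10, 0.01).reshape(-1, 1)

# predict returns one value per row of x_line.
y_line = model.predict(x_line)
print(y_line[:3])
```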

In [7]:
# TODO: Visualize the result by plotting the test points and the training data
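One way to visualize the result is to draw the training data as a scatter plot and the predictions on `x_line` as a line on the same axes. The sketch below uses invented data and plain matplotlib; colors and labels are arbitrary choices.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical, slightly noisy training data; values are invented.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.4, 2.1, 2.4, 3.1])
model = LinearRegression().fit(X, y)

x_line = np.arange(0, 10, 0.01).reshape(-1, 1)
y_line = model.predict(x_line)

fig, ax = plt.subplots()
ax.scatter(X[:, 0], y, alpha=0.5, label="training data")          # observed points
ax.plot(x_line[:, 0], y_line, color="red", label="fitted line")   # model predictions
ax.set_xlabel("MedInc")
ax.set_ylabel("HousePrice")
ax.legend()
```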

In [None]:
# Question: 
# How can we measure the performance of our model?
# Is our model good or bad?
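As a starting point for this question, the two metrics imported at the top of the notebook can be computed as sketched below. The arrays `y_true` and `y_pred` are invented; in the notebook they would be the actual house prices and the model's predictions for the same rows.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical true values and predictions; values are invented.
y_true = np.array([3.0, 2.0, 4.0])
y_pred = np.array([2.5, 2.0, 5.0])

mae = mean_absolute_error(y_true, y_pred)  # average absolute deviation
mse = mean_squared_error(y_true, y_pred)   # average squared deviation
rmse = np.sqrt(mse)                        # back in the units of the target

print(mae, mse, rmse)
```

Whether a given error is good or bad depends on the scale of the target. Errors measured on the training data also tend to look better than errors on unseen data, which is one reason for the train/test split in the next notebook.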