## <font color='blue'>Linear regression on California housing data</font>

The California Housing data set was derived from the 1990 U.S. census. One use of it is to predict housing prices from features such as house age, location, and number of bedrooms.

In the data, the housing has been divided into "blocks", each a geographically compact area containing on average 1400 individuals. There are 20,640 data points, one per block.

Each data point has the following information about the corresponding block:
* median income (multiples of 10K) in that block
* median house age
* average number of rooms per household in that block
* average number of bedrooms per household
* population
* average occupancy (number of household members) of houses in that block
* latitude
* longitude
* median house value (multiples of 100K)

The regression problem is to predict the house value based on the other 8 features.

### <font color='blue'>1. Loading the data and getting some summary statistics</font>

In addition to `numpy` and `matplotlib` we will be using `pandas`. This gives us a handy way of storing the data in "frames" which include attribute names.

In [None]:
import pandas as pd
import numpy as np

Now let's load in the data and take a quick look at it. The display has one point per row. Notice how nice the formatting is, and how each column is named according to its feature.

In [None]:
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing(as_frame=True)
df = housing.frame  # a Pandas data-frame
display(df)

Now let's look at the correlations between these 9 variables. We can use the `corr()` method in `pandas` for this, and then display the resulting matrix using some nice formatting.

In [None]:
# Compute correlation matrix
corr_matrix = df.corr() 
# Print it nicely
corr_matrix.style \
    .background_gradient(cmap='coolwarm') \
    .format(precision=2)

<font color='magenta'>Some questions for you:</font>
* Which (other) feature is most highly correlated with median house value?
* Which pair of features is the most strongly positively correlated?
* Which pair of features is the most negatively correlated?
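
Reading the answers off a colored matrix works, but they can also be extracted programmatically. The following is a minimal sketch on a small synthetic frame, so it does not reveal the answers for the housing data. The frame `toy` and its columns `a`, `b`, `c`, `d` are made up for illustration, and `c` is arbitrarily treated as the target.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing frame (hypothetical columns a, b, c, d)
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": a + 0.1 * rng.normal(size=500),   # strongly positively related to a
    "c": -a + 0.5 * rng.normal(size=500),  # negatively related to a
    "d": rng.normal(size=500),             # unrelated noise
})

corr = toy.corr()

# Correlation of every other column with the (assumed) target column "c"
corr_with_target = corr["c"].drop("c")

# Keep only entries strictly above the diagonal, so that each pair appears
# once and the trivial self-correlations (all equal to 1) are excluded.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna().sort_values()

most_negative_pair = pairs.index[0]   # pair with the smallest correlation
most_positive_pair = pairs.index[-1]  # pair with the largest correlation
print(pairs)
```

The same steps should carry over to `corr_matrix` above, with `MedHouseVal` in place of `c`.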

### <font color='blue'>2. The regression problem</font>

Next, we'll separate the predictor variables (the first eight columns) from the response variable (the last column). 

We will also split the data into a training set and a test set. There is a handy function for this, `train_test_split`, in `sklearn.model_selection`.

In [None]:
from sklearn.model_selection import train_test_split

# Separate predictor variables (X) from response (y)
X = df.drop(columns=['MedHouseVal'])  # Features
y = df['MedHouseVal']                 # Target

# Split data into training set (X_train, y_train) and test set (X_test, y_test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

<font color='magenta'>Some questions for you:</font>
* What are the sizes of the training and test sets?
* Suppose we want to predict `y` (house value) without seeing `x`; what value of `y` would work best for the test set, and what would be the resulting mean squared error on the test set?
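
One way to explore these questions is sketched below on synthetic data, so no numbers for the housing data are given away. The arrays `X_toy` and `y_toy` are made up for illustration. The sketch relies on a standard fact: the constant that minimizes squared error on a set of values is the mean of those values, and the resulting mean squared error is their variance.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (hypothetical; stands in for X and y)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)  # sizes of the two sets

# Realistic baseline: predict the training mean for every test point
baseline_mse = mean_squared_error(y_te, np.full_like(y_te, y_tr.mean()))

# Best possible constant for the test set itself: the test mean.
# Its mean squared error equals the variance of the test responses.
oracle_mse = mean_squared_error(y_te, np.full_like(y_te, y_te.mean()))

print("baseline MSE (training mean):", baseline_mse)
print("oracle MSE (test mean):      ", oracle_mse)
```

Any regression model worth using should beat this constant baseline on the test set.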

<font color='magenta'> To do: Use `sklearn.linear_model.LinearRegression` to fit a linear function to the training data using least-squares regression. Then display the resulting coefficients of each of the 8 features and give the mean squared error on the test set.</font>
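
As a starting point, here is a minimal sketch of the fit, inspect, and evaluate pattern on synthetic data. The frame `X_toy` with columns `f1`, `f2`, `f3` is made up for illustration; for the exercise, the same calls would be applied to `X_train`, `y_train`, `X_test` and `y_test` from the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data with known coefficients (1.0, -2.0, 0.5); hypothetical
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f1", "f2", "f3"])
y_toy = (1.0 * X_toy["f1"] - 2.0 * X_toy["f2"] + 0.5 * X_toy["f3"]
         + 0.1 * rng.normal(size=1000))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Least-squares fit on the training set only
model = LinearRegression()
model.fit(X_tr, y_tr)

# Pair each fitted coefficient with its feature name
coefs = pd.Series(model.coef_, index=X_tr.columns)
print(coefs)
print("intercept:", model.intercept_)

# Evaluate on the held-out test set
test_mse = mean_squared_error(y_te, model.predict(X_te))
print("test MSE:", test_mse)
```

Because the synthetic coefficients are known, the printed values give a quick sanity check that the fit recovers them.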

<font color='magenta'> To do: Again, we'll fit a linear function (using the training set) and get the mean squared error (on the test set). However, this time we will use just a subset of the features.</font>
* Use just the two features `Latitude` and `Longitude`
* Use just one feature; which is the best choice?

In both cases, report the resulting mean squared error on the test set.
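
Fitting on a subset of features only requires selecting the relevant columns before calling `fit` and `predict`. The sketch below shows one way to organize this, again on synthetic data. The helper `subset_test_mse` and the columns `f1`, `f2`, `f3` are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def subset_test_mse(X_tr, X_te, y_tr, y_te, columns):
    """Fit on the given columns of the training set; return the test-set MSE."""
    model = LinearRegression().fit(X_tr[columns], y_tr)
    return mean_squared_error(y_te, model.predict(X_te[columns]))

# Synthetic data in which f2 carries most of the signal (hypothetical)
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f1", "f2", "f3"])
y_toy = (1.0 * X_toy["f1"] - 2.0 * X_toy["f2"] + 0.5 * X_toy["f3"]
         + 0.1 * rng.normal(size=1000))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# A fixed pair of features
two_feature_mse = subset_test_mse(X_tr, X_te, y_tr, y_te, ["f1", "f3"])

# Every single feature on its own
single_mse = {
    col: subset_test_mse(X_tr, X_te, y_tr, y_te, [col]) for col in X_tr.columns
}
best_single = min(single_mse, key=single_mse.get)

print("two-feature test MSE:", two_feature_mse)
print("single-feature test MSEs:", single_mse)
print("best single feature:", best_single)
```

Note that the columns are passed as a list even for a single feature, so that `X_tr[columns]` stays a two-dimensional frame, which is what `LinearRegression` expects.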