# Baruch ML exam group 09, Annie Yi and Daniel Tuzes

## Fetch and preview the data

In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
california_housing = fetch_california_housing(as_frame=True)
print(california_housing.DESCR)

In [None]:
california_housing.data.head()

In [None]:
california_housing.target.head()

In [None]:
# histogram of the full target (median house value per block group)
plt.hist(california_housing.target, bins=30, alpha=0.5)
plt.xlabel('Median house value ($100,000)')
plt.ylabel('Frequency')
plt.title('Distribution of house values')
plt.show()

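We observe that the distribution is cut off at about 5. The target is expressed in units of $100,000, so the cap is roughly $500,000, and house values above that may have been clipped to this value. We don't know the details of the data collection, so we leave the data as it is, believing this is not a data-quality (DQ) issue.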
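To judge how much the cap matters, one can count how many samples sit exactly at the maximum. The sketch below uses a small synthetic array as a stand-in for `california_housing.target`; the helper name `fraction_at_cap` and the lognormal toy data are illustrative assumptions, not part of the dataset.

```python
import numpy as np

def fraction_at_cap(values, tol=1e-9):
    """Return the cap (the maximum) and the share of samples equal to it."""
    values = np.asarray(values, dtype=float)
    cap = values.max()
    at_cap = np.isclose(values, cap, rtol=0.0, atol=tol)
    return cap, at_cap.mean()

# Synthetic stand-in for the target: values above 5 are clipped to 5
rng = np.random.default_rng(0)
raw = rng.lognormal(mean=0.6, sigma=0.6, size=1000)
clipped = np.minimum(raw, 5.0)

cap, share = fraction_at_cap(clipped)
print(f"cap = {cap}, share of samples at the cap = {share:.3f}")
```

On the real data the same call would be `fraction_at_cap(california_housing.target.values)`.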

Next, we show the correlation between the features and the house price, which is our target variable.

In [None]:
# correlation matrix of all features plus the target
corr_matrix = california_housing.data.join(california_housing.target).corr()
display(corr_matrix)
# draw the correlation matrix as an annotated heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')


The map shows that MedHouseVal is positively correlated with MedInc, HouseAge, and AveRooms, and negatively correlated with the other features.

- It is interesting that AveBedrms is negatively correlated with MedHouseVal, which means that the more bedrooms, the lower the house price. However, the anti-correlation is not strong.
- The negative correlation between population and house price means that some of the most expensive houses are in less populated areas.
- Geography does play a role in house prices, but we expect the price to follow regional blocks rather than latitude and longitude individually.
- The more people live in one household, the lower the price is, possibly because rich families with one kid or no kids can afford a more expensive house, while families with more kids can spend less on housing. The effect that more people need more bedrooms is not strong enough to offset this.
- There is a strong correlation between the average number of rooms and the average number of bedrooms, so we will take only one of them for the model.

To get a more detailed view of the pair distributions, we plot the pairwise relationships below.

In [None]:
sns.pairplot(california_housing.data.join(california_housing.target), diag_kind='kde')
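The observation that two features are strongly correlated can also be checked programmatically rather than read off the heatmap. Below is a minimal sketch on synthetic data; the helper `correlated_pairs`, the threshold of 0.8, and the toy columns are assumptions made for illustration.

```python
import numpy as np

def correlated_pairs(data, names, threshold=0.8):
    """List feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = np.corrcoef(data, rowvar=False)  # columns are the variables
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], corr[i, j]))
    return pairs

# Synthetic stand-in: "bedrooms" is built to track "rooms" closely
rng = np.random.default_rng(0)
rooms = rng.normal(5.0, 1.0, size=500)
bedrooms = 0.2 * rooms + rng.normal(0.0, 0.05, size=500)
income = rng.normal(4.0, 2.0, size=500)
data = np.column_stack([rooms, bedrooms, income])

pairs = correlated_pairs(data, ["rooms", "bedrooms", "income"])
for a, b, r in pairs:
    print(f"{a} vs {b}: r = {r:.2f}")
```

With the notebook's data, one could pass `california_housing.data.values` and `list(california_housing.data.columns)` instead of the toy arrays.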

## Feature selection
Based on the previous observations, and to be able to explain the model, we will use the following features: MedInc, HouseAge, AveRooms, AveOccup and Population.

## Filter and split the data
We restrict our data to the features we selected, and split it into training and test sets with an 80/20 split. We also shuffle the samples before splitting, in case the data is ordered in any way.

In [None]:

X = california_housing.data[['MedInc', 'HouseAge', 'AveRooms', 'AveOccup', 'Population']].values
# Append a column of ones to the data matrix to account for the bias term
X = np.c_[np.ones(X.shape[0]), X]
y = california_housing.target.values

# split the data into training and test sets
split_ratio = 0.8
num_samples = X.shape[0]
np.random.seed(42)
indices = np.random.permutation(num_samples)  # just reshuffle the indices, now we can split the data in order
split_index = int(num_samples * split_ratio)
train_indices = indices[:split_index]
test_indices = indices[split_index:]
X_train, X_test = X[train_indices], X[test_indices]
y_train, y_test = y[train_indices], y[test_indices]

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape  # check the shapes

Now we can solve the OLS problem as shown in the PDF using NumPy (we did this in HW3 too):
$$\beta = \left( X^T X \right)^{-1} X^T y$$
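
For reference, the closed form follows from minimizing the residual sum of squares. A short sketch of the standard derivation, assuming $X^T X$ is invertible:

$$\begin{aligned}
L(\beta) &= \lVert y - X\beta \rVert^2 = (y - X\beta)^T (y - X\beta) \\
\nabla_\beta L &= -2 X^T (y - X\beta) = 0 \\
X^T X \beta &= X^T y \\
\beta &= \left( X^T X \right)^{-1} X^T y
\end{aligned}$$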

In [None]:
B = np.linalg.inv(X_train.T @ X_train) @ X_train.T @ y_train
B
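
Explicitly inverting $X^T X$ works here, but it can be numerically fragile when features are nearly collinear. Below is a minimal sketch of two alternatives NumPy provides, `np.linalg.solve` and `np.linalg.lstsq`, on synthetic data; the coefficients and noise level are made up for illustration, and the `_demo` names are not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
# Design matrix with a leading column of ones for the bias term
X_demo = np.c_[np.ones(n), rng.normal(size=(n, 3))]
true_beta = np.array([1.0, 2.0, -0.5, 0.3])  # illustrative values
y_demo = X_demo @ true_beta + rng.normal(scale=0.01, size=n)

# Normal equations via explicit inverse, as in the cell above
beta_inv = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y_demo
# Same normal equations, solved without forming the inverse
beta_solve = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)
# Least-squares solver working on X directly
beta_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print(beta_inv, beta_solve, beta_lstsq, sep="\n")
```

All three should agree closely on well-conditioned data like this; the difference only tends to show up when $X^T X$ is close to singular.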

Now let's see how well we perform on the train set.

In [None]:
# predict the target values for the train set
y_train_pred = X_train @ B
# plot the predicted values against the actual values
plt.figure(figsize=(8, 6))
plt.scatter(y_train, y_train_pred, alpha=0.5)
plt.xlabel('Actual Values')
# plot the f(x) = x line
plt.plot([0, 5], [0, 5], 'r--')
plt.legend(['actual vs trained', 'f(x) = x'])
plt.ylabel('Predicted Values')
plt.title('Train Set: Actual vs Predicted Values')
plt.show()

Now let's see how well we perform on the test set.

In [None]:
# predict the target values for the test set
y_test_pred = X_test @ B
# plot the predicted values against the actual values
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_test_pred, alpha=0.5)
plt.xlabel('Actual Values')
# plot the f(x) = x line
plt.plot([0, 5], [0, 5], 'r--')
plt.legend(['actual vs test', 'f(x) = x'])
plt.ylabel('Predicted Values')
plt.title('Test Set: Actual vs Predicted Values')
plt.show()

We can see that on the test set the results are not much worse than on the train set, which suggests the model does not suffer from high variance. To quantify this, we calculate the R2 score from the residual sum of squares and the total sum of squares.
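
For reference, the usual definition of the score, with $\hat{y}_i$ the predictions and $\bar{y}$ the mean of the actual target values (not of the predictions):

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}, \qquad SS_{res} = \sum_i \left( y_i - \hat{y}_i \right)^2, \qquad SS_{tot} = \sum_i \left( y_i - \bar{y} \right)^2$$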

In [None]:
ss_res_train = np.sum((y_train - y_train_pred)**2)
ss_tot_train = np.sum((y_train - np.mean(y_train))**2)  # total sum of squares uses the mean of the actual values
r_squared_train = 1 - (ss_res_train / ss_tot_train)
r_squared_train

In [None]:
ss_res_test = np.sum((y_test - y_test_pred)**2)
ss_tot_test = np.sum((y_test - np.mean(y_test))**2)  # total sum of squares uses the mean of the actual values
r_squared_test = 1 - (ss_res_test / ss_tot_test)
r_squared_test

The next step will be to implement ridge regression, to shrink the coefficients of less relevant features and to avoid overfitting. We will add back the features we dropped, and also filter out the house prices that sit at the cap of 5 (about $500,000) to see if this improves the model.
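
A minimal sketch of what the ridge step could look like in closed form, $\beta = (X^T X + \lambda I)^{-1} X^T y$, on synthetic data. The function name `ridge_fit`, the parameter `lam`, and the choice to leave the bias term unpenalized are assumptions for illustration, not necessarily what the final implementation uses.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge: solve (X^T X + lam * I) beta = X^T y.

    Assumes the first column of X is the column of ones, which is left
    unpenalized (a common convention, assumed here).
    """
    penalty = lam * np.eye(X.shape[1])
    penalty[0, 0] = 0.0  # do not shrink the bias term
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)

# Synthetic data for illustration only
rng = np.random.default_rng(0)
X_demo = np.c_[np.ones(100), rng.normal(size=(100, 3))]
y_demo = X_demo @ np.array([1.0, 2.0, 0.0, -1.0]) + rng.normal(scale=0.1, size=100)

beta_ols = ridge_fit(X_demo, y_demo, lam=0.0)     # lam = 0 reduces to OLS
beta_ridge = ridge_fit(X_demo, y_demo, lam=50.0)  # larger lam shrinks the slopes
print(beta_ols, beta_ridge, sep="\n")
```

Because the penalty acts on the raw coefficient sizes, features are usually standardized before fitting so that all of them are shrunk on a comparable scale.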

## Filter out the capped house prices and redo the train/test split