# Session #2 Homework

This notebook is Homework #2 of ML-Zoomcamp.

### Dataset
In this homework, we will use the California Housing Prices dataset from [Kaggle](https://www.kaggle.com/datasets/camnugent/california-housing-prices).

The data is loaded from this CSV file: https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv

The goal of this homework is to create a regression model for predicting housing prices (column `'median_house_value'`).

### Features

For the rest of the homework, you'll need to use only these columns:

* `'latitude'`,
* `'longitude'`,
* `'housing_median_age'`,
* `'total_rooms'`,
* `'total_bedrooms'`,
* `'population'`,
* `'households'`,
* `'median_income'`,
* `'median_house_value'`

Select only these columns.

### Import packages

In [1]:
import pandas as pd
import numpy as np

### Get the data

In [2]:

url_data = 'https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv'
df = pd.read_csv(url_data)

### Inspect the data

In [3]:
df.sample(5)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
7122,-118.03,33.91,32.0,4040.0,832.0,2526.0,798.0,3.2143,160100.0,<1H OCEAN
12076,-117.6,33.87,15.0,7626.0,1570.0,3823.0,1415.0,3.4419,138100.0,INLAND
16880,-122.39,37.59,32.0,4497.0,,1846.0,715.0,6.1323,500001.0,NEAR OCEAN
2747,-115.57,32.78,20.0,1534.0,235.0,871.0,222.0,6.2715,97200.0,INLAND
15158,-117.05,32.97,17.0,9911.0,1436.0,4763.0,1414.0,5.5882,194300.0,<1H OCEAN


Select only the indicated columns.

In [4]:
columns = ['latitude', 'longitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income', 'median_house_value']
df = df[columns]

# Question 1
Find a feature with missing values. How many missing values does it have?

In [5]:
df.isnull().sum()

latitude                0
longitude               0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
dtype: int64
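
With more columns, the full listing gets long. A minimal sketch for narrowing the result to only the columns that contain missing values; the small frame `demo` is made-up data for illustration, not the housing dataset:

```python
import numpy as np
import pandas as pd

# Made-up frame for illustration only (not the housing data)
demo = pd.DataFrame({
    'total_rooms': [880.0, 7099.0, 1467.0],
    'total_bedrooms': [129.0, np.nan, np.nan],
})

missing = demo.isnull().sum()          # missing-value count per column
missing_only = missing[missing > 0]    # keep only columns with at least one NaN
print(missing_only)
```

Applied to `df`, the same filter should leave only `total_bedrooms`, matching the listing above.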

# Question 2
What's the median (50th percentile) of the variable 'population'?

In [6]:
df['population'].median()

1166.0
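
The median is the 0.5 quantile, so `Series.quantile(0.5)` gives the same number as `Series.median()`. A quick check on made-up values (not the housing data):

```python
import pandas as pd

# Made-up values for illustration only
population = pd.Series([322.0, 2401.0, 496.0, 558.0, 565.0])

median = population.median()
q50 = population.quantile(0.5)   # the 50th percentile
print(median, q50)
```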

# Regression

### Split the data

* Shuffle the initial dataset, use seed 42.
* Split your data into train/val/test sets, with a 60%/20%/20% distribution.
* Make sure that the target value ('median_house_value') is not in your dataframe.
* Apply the log transformation to the median_house_value variable using the np.log1p() function.

Shuffle the initial dataset, use seed 42.

# Validation framework
Split your data into train/val/test sets, with a 60%/20%/20% distribution.

In [7]:
np.random.seed(42)

n = len(df)

n_val = int(0.2 * n)
n_test = int(0.2 * n)
n_train = n - (n_val + n_test)

idx = np.arange(n)
np.random.shuffle(idx)

df_shuffled = df.iloc[idx]

df_train = df_shuffled.iloc[:n_train].copy()
df_val = df_shuffled.iloc[n_train:n_train+n_val].copy()
df_test = df_shuffled.iloc[n_train+n_val:].copy()

In [8]:
print(n, n_val + n_test + n_train)

20640 20640


Separate the target variable ('median_house_value') and remove it from the dataframe.

In [10]:
target_variable = 'median_house_value'
y_train = df_train[target_variable]
del df_train[target_variable]
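
The cell above separates the target for the training split only. The validation and test splits need the same treatment before they can be used for evaluation. A sketch of one way to do it, using a tiny made-up frame in place of the real splits; `split_target` is a hypothetical helper, not something from the lessons:

```python
import pandas as pd

def split_target(df, target):
    """Return (features, target values) without modifying df."""
    y = df[target].values
    X = df.drop(columns=[target])
    return X, y

# Made-up rows standing in for df_val
demo_val = pd.DataFrame({
    'median_income': [3.2, 6.1],
    'median_house_value': [160100.0, 500001.0],
})

X_demo, y_demo = split_target(demo_val, 'median_house_value')
print(list(X_demo.columns), y_demo)
```

Unlike `del`, `drop` returns a new dataframe, so the original split stays intact if it is needed again.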

Apply the log transformation to the median_house_value variable using the np.log1p() function.

In [11]:
y_train = np.log1p(y_train.values)
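
`np.log1p(x)` computes log(1 + x), and `np.expm1` is its inverse, so predictions made on the log scale can be converted back to prices. A small check on a few sample values:

```python
import numpy as np

prices = np.array([97200.0, 160100.0, 500001.0])  # sample values for illustration

log_prices = np.log1p(prices)      # log(1 + price)
restored = np.expm1(log_prices)    # exp(value) - 1 undoes log1p

print(np.allclose(restored, prices))
```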

# Question 3
* We need to deal with missing values for the column from Q1.
* We have two options: fill it with 0 or with the mean of this variable.
* Try both options. For each, train a linear regression model without regularization using the code from the lessons.
* For computing the mean, use the training set only!
* Use the validation dataset to evaluate the models and compare the RMSE of each option.
* Round the RMSE scores to 2 decimal digits using round(score, 2)
* Which option gives better RMSE?

In [12]:
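def linear_regression(X, y):
    # prepend a column of ones for the bias term
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])

    # solve the normal equation: w = (X^T X)^-1 X^T y
    XTX = X.T.dot(X)
    XTX_inv = np.linalg.inv(XTX)
    w = XTX_inv.dot(X.T).dot(y)

    return w[0], w[1:]  # bias, feature weights

def rmse(y, y_pred):
    # root mean squared error
    error = y_pred - y
    mse = (error ** 2).mean()
    return np.sqrt(mse)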
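
One way to sanity-check `linear_regression` is to fit it on synthetic data generated from known weights and compare the result with NumPy's least-squares solver. The data and weights below are made up for this check; the function body repeats the cell above so the block runs on its own:

```python
import numpy as np

def linear_regression(X, y):
    # Same approach as the cell above: add a bias column, solve the normal equation
    ones = np.ones(X.shape[0])
    X = np.column_stack([ones, X])
    w = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
    return w[0], w[1:]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w0, true_w = 2.0, np.array([0.5, -1.0, 3.0])
y = true_w0 + X @ true_w            # noise-free target, so the fit should be exact

w0, w = linear_regression(X, y)

# Reference solution from NumPy's least-squares solver
X_bias = np.column_stack([np.ones(len(X)), X])
w_ref, *_ = np.linalg.lstsq(X_bias, y, rcond=None)

print(np.allclose(np.r_[w0, w], w_ref))
```

Inverting `XTX` directly works here because the synthetic features are not collinear; `np.linalg.lstsq` is generally the more numerically robust choice.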


Linear regression (Filling missing data with 0)

In [20]:
def prepare_X(df):
    X = df.copy()
    X = X.fillna(0).values  # replace missing values with 0
    return X

X_train = prepare_X(df_train)
w_0, w = linear_regression(X_train, y_train)
y_pred = w_0 + X_train @ w
score = rmse(y_train, y_pred)  # RMSE on the training set
round(score, 2)

0.34

Linear regression (Filling missing data with mean)

In [21]:
def prepare_X(df):
    X = df.copy()
    # mean of the dataframe passed in; here that is the training set
    total_bedrooms_mean = df['total_bedrooms'].mean()
    X = X.fillna(total_bedrooms_mean).values
    return X

X_train = prepare_X(df_train)
w_0, w = linear_regression(X_train, y_train)
y_pred = w_0 + X_train @ w
score = rmse(y_train, y_pred)  # RMSE on the training set
round(score, 2)

0.34
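
Question 3 asks for the comparison on the validation set, with the mean taken from the training data only. A sketch of that evaluation, run on made-up data rather than the housing splits, so its scores say nothing about the real answer. `make_split` and the `fill_value` parameter of `prepare_X` are assumptions introduced for illustration:

```python
import numpy as np
import pandas as pd

def linear_regression(X, y):
    X = np.column_stack([np.ones(X.shape[0]), X])
    w = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
    return w[0], w[1:]

def rmse(y, y_pred):
    return np.sqrt(((y_pred - y) ** 2).mean())

def prepare_X(df, fill_value):
    # fill_value is chosen by the caller, so the training mean can be reused
    return df.fillna(fill_value).values

rng = np.random.default_rng(42)

def make_split(n):
    # Hypothetical generator of made-up data standing in for a real split
    df = pd.DataFrame({
        'total_bedrooms': rng.uniform(100, 1000, n),
        'median_income': rng.uniform(1, 10, n),
    })
    y = 10 + 0.001 * df['total_bedrooms'] + 0.2 * df['median_income'] + rng.normal(0, 0.1, n)
    df.loc[rng.random(n) < 0.1, 'total_bedrooms'] = np.nan  # inject missing values
    return df, y.values

demo_train, y_demo_train = make_split(600)
demo_val, y_demo_val = make_split(200)

train_mean = demo_train['total_bedrooms'].mean()  # computed on training data only

scores = {}
for name, fill_value in [('zero', 0), ('mean', train_mean)]:
    w_0, w = linear_regression(prepare_X(demo_train, fill_value), y_demo_train)
    y_pred = w_0 + prepare_X(demo_val, fill_value) @ w   # predict on validation
    scores[name] = round(rmse(y_demo_val, y_pred), 2)

print(scores)
```

Passing the fill value in explicitly keeps the validation rows from influencing the mean, which a `prepare_X` that computes the mean from its own argument cannot guarantee.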