# Session #3 Homework

### Dataset

In this homework, we will use the California Housing Prices data from __[Kaggle](https://www.kaggle.com/datasets/camnugent/california-housing-prices).__

We'll keep working with the 'median_house_value' variable, and we'll transform it to a classification task.

### Features

For the rest of the homework, you'll need to use only these columns:

- 'latitude'
- 'longitude'
- 'housing_median_age'
- 'total_rooms'
- 'total_bedrooms'
- 'population'
- 'households'
- 'median_income'
- 'median_house_value'
- 'ocean_proximity'


### Data preparation

- Select only the features from above and fill in the missing values with 0.
- Create a new column ***rooms_per_household*** by dividing the column ***total_rooms*** by the column ***households*** from dataframe.
- Create a new column ***bedrooms_per_room*** by dividing the column ***total_bedrooms*** by the column ***total_rooms*** from dataframe.
- Create a new column ***population_per_household*** by dividing the column ***population*** by the column ***households*** from dataframe.


In [1]:
import pandas as pd

Select only features.

In [2]:
colnames = ['latitude', 'longitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income', 'median_house_value', 'ocean_proximity']
df = pd.read_csv('housing.csv', usecols=colnames)

In [3]:
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY


In [4]:
df.columns = df.columns.str.lower().str.replace(' ', '_')

string_columns = list(df.dtypes[df.dtypes == 'object'].index)

for col in string_columns:
    df[col] = df[col].str.lower().str.replace(' ', '_')

Fill the missing values with 0.

In [5]:
df.isnull().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64

In [6]:
df.fillna(0, inplace=True)

In [7]:
df.isnull().sum()

longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64

Create a new column ***rooms_per_household*** by dividing the column ***total_rooms*** by the column ***households*** from dataframe.

In [8]:
df['bedrooms_per_room'] = df['total_rooms'] / df['households']

Create a new column ***bedrooms_per_room*** by dividing the column ***total_bedrooms*** by the column ***total_rooms*** from dataframe.

In [9]:
df['rooms_per_household'] = df['total_bedrooms'] / df['total_rooms']

Create a new column ***population_per_household*** by dividing the column ***population*** by the column ***households*** from dataframe.

In [10]:
df['population_per_household'] = df['population'] / df['households']

In [11]:
df.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity,bedrooms_per_room,rooms_per_household,population_per_household
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,near_bay,6.984127,0.146591,2.555556
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,near_bay,6.238137,0.155797,2.109842
2,-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,near_bay,8.288136,0.129516,2.80226
3,-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,near_bay,5.817352,0.184458,2.547945
4,-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,near_bay,6.281853,0.172096,2.181467


### Question 1

What is the most frequent observation (mode) for the column ocean_proximity?

Options:

- NEAR BAY
- <1H OCEAN
- INLAND
- NEAR OCEAN


In [12]:
df['ocean_proximity'].value_counts()

<1h_ocean     9136
inland        6551
near_ocean    2658
near_bay      2290
island           5
Name: ocean_proximity, dtype: int64

In [13]:
df['ocean_proximity'].mode()

0    <1h_ocean
Name: ocean_proximity, dtype: object

The answer is <1h_ocean.

### Split the data

- Split your data in train/val/test sets, with 60%/20%/20% distribution.
- Use Scikit-Learn for that (the train_test_split function) and set the seed to 42.
- Make sure that the target value (median_house_value) is not in your dataframe.

In [14]:
from sklearn.model_selection import train_test_split

In [15]:
df_train_full, df_test = train_test_split(df, test_size=0.2, random_state=42)

In [16]:
df_train, df_val = train_test_split(df_train_full, test_size=0.25, random_state=42)

In [17]:
len(df_train), len(df_val), len(df_test)

(12384, 4128, 4128)

In [18]:
len(df), (len(df_train) + len(df_val) + len(df_test))

(20640, 20640)

In [19]:
y_train = df_train['median_house_value'].values
y_val = df_val['median_house_value'].values
y_test = df_test['median_house_value'].values

In [20]:
del df_train['median_house_value']
del df_val['median_house_value']
del df_test['median_house_value']

### Question 2

- Create the correlation matrix for the numerical features of your train dataset.
    - In a correlation matrix, you compute the correlation coefficient between every pair of features in the dataset.
- What are the two features that have the biggest correlation in this dataset?

Options:

- total_bedrooms and households
- total_bedrooms and total_rooms
- population and households
- population_per_household and total_rooms

In [21]:
categorical = ['ocean_proximity']
numerical = ['longitude', 'latitude', 'housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 
             'households', 'median_income', 'bedrooms_per_room', 'rooms_per_household', 'population_per_household']

In [23]:
df_train_full[numerical].corrwith(df_train_full['median_house_value']).to_frame('correlation')

Unnamed: 0,correlation
longitude,-0.046349
latitude,-0.142983
housing_median_age,0.103706
total_rooms,0.133989
total_bedrooms,0.04798
population,-0.026032
households,0.063714
median_income,0.690647
bedrooms_per_room,0.158485
rooms_per_household,-0.257419


### Make median_house_value binary

- We need to turn the median_house_value variable from numeric into binary.
- Let's create a variable above_average which is 1 if the median_house_value is above its mean value and 0 otherwise.

In [29]:
for y in [y_train, y_val, y_test]:
    y = pd.DataFrame(y, columns={'median_house_value'})
    y['above_average'] = y['median_house_value'].apply(lambda x: 1 if x > y['median_house_value'].mean() else 0)
    y = y['median_house_value'].values

### Question 3

- Calculate the mutual information score with the (binarized) price for the categorical variable that we have. Use the training set only.
- What is the value of mutual information?
- Round it to 2 decimal digits using round(score, 2)

Options:

- 0.26
- 0
- 0.10
- 0.16

In [31]:
from sklearn.metrics import mutual_info_score

In [32]:
round(mutual_info_score(df_train['ocean_proximity'], y_train), 2)

0.57

### Question 4


- Now let's train a logistic regression
- Remember that we have one categorical variable ocean_proximity in the data. Include it using one-hot encoding.
- Fit the model on the training dataset.
    - To make sure the results are reproducible across different versions of Scikit-Learn, fit the model with these parameters:
    - model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
- Calculate the accuracy on the validation dataset and round it to 2 decimal digits.

Options:

- 0.60
- 0.72
- 0.84
- 0.95


In [33]:
from sklearn.feature_extraction import DictVectorizer

train_dict = df_train[categorical + numerical].to_dict(orient='records')
dv = DictVectorizer(sparse=False)
dv.fit(train_dict)
X_train = dv.transform(train_dict)

In [38]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

In [39]:
val_dict = df_val[categorical + numerical].to_dict(orient='records')
X_val = dv.transform(val_dict)

In [40]:
model.predict_proba(X_val)

array([[4.33498254e-008, 3.69717072e-006, 5.50182554e-037, ...,
        1.26442743e-006, 5.24431935e-004, 2.82240559e-003],
       [2.27471662e-008, 2.78970294e-007, 2.79848753e-047, ...,
        2.05401323e-007, 9.99680344e-004, 5.36940746e-004],
       [4.31576757e-027, 1.07389439e-016, 2.35541589e-154, ...,
        2.72718542e-013, 2.23631633e-003, 3.42617887e-002],
       ...,
       [1.32109000e-018, 2.37609524e-012, 2.23946844e-094, ...,
        2.85747931e-005, 4.00967234e-004, 5.36701750e-002],
       [2.60295936e-020, 2.79568864e-012, 1.18385345e-105, ...,
        6.33856203e-008, 1.64549299e-003, 1.05847926e-001],
       [6.27552116e-010, 7.71997248e-007, 1.18736643e-047, ...,
        2.41215052e-006, 1.78275598e-003, 1.23954150e-002]])

In [41]:
y_pred = model.predict_proba(X_val)[:, 1]

In [42]:
average = y_pred > 0.5

In [43]:
(y_val == average).mean()

0.0

### Question 5


- Let's find the least useful feature using the feature elimination technique.
- Train a model with all these features (using the same parameters as in Q4).
- Now exclude each feature from this set and train a model without it. Record the accuracy for each model.
- For each feature, calculate the difference between the original accuracy and the accuracy without the feature.
- Which of following feature has the smallest difference?
    - total_rooms
    - total_bedrooms
    - population
    - households

Note: the difference doesn't have to be positive


### Question 6


- For this question, we'll see how to use a linear regression model from Scikit-Learn
- We'll need to use the original column 'median_house_value'. Apply the logarithmic transformation to this column.
- Fit the Ridge regression model (model = Ridge(alpha=a, solver="sag", random_state=42)) on the training data.
- This model has a parameter alpha. Let's try the following values: [0, 0.01, 0.1, 1, 10]
- Which of these alphas leads to the best RMSE on the validation set? Round your RMSE scores to 3 decimal digits.

If there are multiple options, select the smallest alpha.

Options:

- 0
- 0.01
- 0.1
- 1
- 10