## Homework

### Set up the environment

You need to install Python, NumPy, Pandas, Matplotlib and Seaborn. To do that, follow the instructions in
[06-environment.md](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/01-intro/06-environment.md).

In [None]:
import pandas as pd
import numpy as np

import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline

### Question 1

What's the version of Pandas that you installed?

You can get the version from the `__version__` attribute:

In [None]:
pd.__version__

'2.1.0'

### Getting the data 

For this homework, we'll use the California Housing Prices dataset. Download it from 
[here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv).

> Optional: uncomment the lines below to stage the file in [GCS](https://cloud.google.com/storage/docs) (Google Cloud Storage) first:

In [None]:
# # Get data and save it in a gcs bucket
# from gcp_python_client_functions.clients import *
# import io
# import requests

# PROJECT_ID = 'dz-learning-d'

# # Cloud Storage
# stg_obj = Storage(PROJECT_ID)

In [None]:
# bucket_name = 'dz-d-stg-us-ml-zoomcamp'  # bucket name only, without the gs:// prefix
# file_location = '00_intro/housing.csv'
# online_csv_file_path = "https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv"

# # Get the client object
# storage_client = stg_obj.client

# # Get a reference to the bucket
# bucket = storage_client.bucket(bucket_name)

# # Create a blob object
# blob = bucket.blob(file_location)

# # Get the contents of the online CSV file, failing early on an HTTP error
# response = requests.get(online_csv_file_path)
# response.raise_for_status()
# csv_data = response.content

# # Upload the CSV bytes to GCS with an explicit content type
# blob.upload_from_string(csv_data, content_type="text/csv")

# # Read with Pandas (gs:// paths require the gcsfs package)
# df = pd.read_csv('gs://' + bucket_name + '/' + file_location)

Now read it with Pandas.

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv')

In [None]:
df.head(6)

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
| 5 | -122.25 | 37.85 | 52.0 | 919.0 | 213.0 | 413.0 | 193.0 | 4.0368 | 269700.0 | NEAR BAY |


### Question 2

How many columns are in the dataset?

- 10 ✅
- 6560
- 10989
- 20640

In [None]:
df.shape[1]

10

### Question 3

Which columns in the dataset have missing values?

- `total_rooms`
- `total_bedrooms` ✅
- both of the above
- no empty columns in the dataset

In [None]:
df.isnull().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64
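
To answer the question without scanning the full listing, the counts can be filtered down to the columns that actually contain gaps. The snippet below is a minimal sketch on a toy frame (the values and the names `toy`, `missing_counts` and `cols_with_missing` are made up for illustration); the same two lines should work on `df`.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) standing in for the housing data
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, np.nan, 190.0],
})

# Count missing values per column, then keep only the columns with at least one
missing_counts = toy.isnull().sum()
cols_with_missing = missing_counts[missing_counts > 0].index.tolist()
print(cols_with_missing)
```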

In [None]:
df.describe()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20433.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.53968 | 3.870671 | 206855.816909 |
| std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.38507 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | -124.35 | 32.54 | 1.0 | 2.0 | 1.0 | 3.0 | 1.0 | 0.4999 | 14999.0 |
| 25% | -121.8 | 33.93 | 18.0 | 1447.75 | 296.0 | 787.0 | 280.0 | 2.5634 | 119600.0 |
| 50% | -118.49 | 34.26 | 29.0 | 2127.0 | 435.0 | 1166.0 | 409.0 | 3.5348 | 179700.0 |
| 75% | -118.01 | 37.71 | 37.0 | 3148.0 | 647.0 | 1725.0 | 605.0 | 4.74325 | 264725.0 |
| max | -114.31 | 41.95 | 52.0 | 39320.0 | 6445.0 | 35682.0 | 6082.0 | 15.0001 | 500001.0 |


### Question 4

How many unique values does the `ocean_proximity` column have?

- 3
- 5 ✅
- 7
- 9

In [None]:
print('Unique values of ocean proximity: %d'%df.ocean_proximity.unique().shape[0])

Unique values of ocean proximity: 5
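
Pandas also offers `Series.nunique()`, which returns the count directly. One difference worth knowing: by default it skips missing values, while `unique()` includes them. The sketch below uses a made-up series (`proximity` is a hypothetical name) to show the difference; `ocean_proximity` has no missing values, so both approaches should agree there.

```python
import pandas as pd

# Toy categorical column (hypothetical) with one missing entry
proximity = pd.Series(["NEAR BAY", "INLAND", "NEAR BAY", None])

n_via_unique = proximity.unique().shape[0]        # counts the missing value as a distinct entry
n_via_nunique = proximity.nunique()               # skips missing values by default
n_with_missing = proximity.nunique(dropna=False)  # opt in to counting the missing value

print(n_via_unique, n_via_nunique, n_with_missing)
```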


### Question 5

What's the average value of the `median_house_value` for the houses located near the bay?

- 49433
- 124805
- 259212 ✅
- 380440

In [None]:
df.ocean_proximity.unique()

array(['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'],
      dtype=object)

In [None]:
df_near_bay = df[df.ocean_proximity == 'NEAR BAY']
df_near_bay.head(2)

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |


In [None]:
print('The avg median_house_value for the properties near the bay is $%.0f' % df_near_bay.median_house_value.mean())

The avg median_house_value for the properties near the bay is $259212
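
An alternative to filtering one category at a time is `groupby`, which computes the mean for every `ocean_proximity` value in a single pass. This is only a sketch: the frame `sample` holds four rows copied from the outputs in this notebook, so its means are not the full-dataset answers.

```python
import pandas as pd

# Four rows taken from the tables shown in this notebook (not the full dataset)
sample = pd.DataFrame({
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "ISLAND", "ISLAND"],
    "median_house_value": [452600.0, 358500.0, 450000.0, 414700.0],
})

# Mean house value per category, indexed by the category label
mean_by_proximity = sample.groupby("ocean_proximity")["median_house_value"].mean()
print(mean_by_proximity["NEAR BAY"])
```

On the full data, `df.groupby('ocean_proximity').median_house_value.mean()` should give the `NEAR BAY` figure reported above alongside the other four categories.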


### Question 6

1. Calculate the average of the `total_bedrooms` column in the dataset.
2. Use the `fillna` method to fill the missing values in `total_bedrooms` with the mean value from the previous step.
3. Now, calculate the average of `total_bedrooms` again.
4. Has it changed?

Has it changed?

> Hint: take into account only 3 digits after the decimal point.

- Yes
- No ✅

In [None]:
# 1. Average of total_bedrooms
avg_total_bedrooms = df.total_bedrooms.mean()
print('Average of number of bedrooms is %.3f'%avg_total_bedrooms)
# 2. Fill null total_bedroom values
df_clean = df.copy()
df_clean.total_bedrooms = df_clean.total_bedrooms.fillna(value=avg_total_bedrooms)
df_clean.isnull().sum()

Average of number of bedrooms is 537.871


longitude             0
latitude              0
housing_median_age    0
total_rooms           0
total_bedrooms        0
population            0
households            0
median_income         0
median_house_value    0
ocean_proximity       0
dtype: int64

In [None]:
# 3. Calculate average again:
print('New average of number of bedrooms is %.3f'%df_clean.total_bedrooms.mean())

New average of number of bedrooms is 537.871
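
The mean stays the same for a simple reason. If the n observed values sum to S, then filling k gaps with S/n gives a new mean of (S + k·S/n) / (n + k) = S/n. The sketch below checks this on a toy series (`bedrooms` and its values are made up for illustration).

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) with one missing entry
bedrooms = pd.Series([100.0, 200.0, np.nan, 300.0])

mean_before = bedrooms.mean()          # NaN is skipped: (100 + 200 + 300) / 3
filled = bedrooms.fillna(mean_before)  # the gap becomes 200.0
mean_after = filled.mean()             # (100 + 200 + 200 + 300) / 4

print(mean_before, mean_after)
```

Only the mean is preserved this way. Other statistics, such as the standard deviation, generally shrink after mean imputation.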


### Question 7

1. Select all the rows located on islands (`ocean_proximity` equal to `ISLAND`).
2. Select only columns `housing_median_age`, `total_rooms`, `total_bedrooms`.
3. Get the underlying NumPy array. Let's call it `X`.
4. Compute matrix-matrix multiplication between the transpose of `X` and `X`. To get the transpose, use `X.T`. Let's call the result `XTX`.
5. Compute the inverse of `XTX`.
6. Create an array `y` with values `[950, 1300, 800, 1000, 1300]`.
7. Multiply the inverse of `XTX` with the transpose of `X`, and then multiply the result by `y`. Call the result `w`.
8. What's the value of the last element of `w`?

> **Note**: You just implemented linear regression. We'll talk about it in the next lesson.

- -1.4812
- 0.001
- 5.6992 ✅
- 23.1233

In [None]:
# 1. Select island locations (exact match on the category label)
df_island = df_clean[df_clean.ocean_proximity == 'ISLAND']
df_island

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 8314 | -118.32 | 33.35 | 27.0 | 1675.0 | 521.0 | 744.0 | 331.0 | 2.1579 | 450000.0 | ISLAND |
| 8315 | -118.33 | 33.34 | 52.0 | 2359.0 | 591.0 | 1100.0 | 431.0 | 2.8333 | 414700.0 | ISLAND |
| 8316 | -118.32 | 33.33 | 52.0 | 2127.0 | 512.0 | 733.0 | 288.0 | 3.3906 | 300000.0 | ISLAND |
| 8317 | -118.32 | 33.34 | 52.0 | 996.0 | 264.0 | 341.0 | 160.0 | 2.7361 | 450000.0 | ISLAND |
| 8318 | -118.48 | 33.43 | 29.0 | 716.0 | 214.0 | 422.0 | 173.0 | 2.6042 | 287500.0 | ISLAND |


In [None]:
# 2. Select columns
df_selected = df_island[['housing_median_age', 'total_rooms', 'total_bedrooms']]
df_selected

| | housing_median_age | total_rooms | total_bedrooms |
|---|---|---|---|
| 8314 | 27.0 | 1675.0 | 521.0 |
| 8315 | 52.0 | 2359.0 | 591.0 |
| 8316 | 52.0 | 2127.0 | 512.0 |
| 8317 | 52.0 | 996.0 | 264.0 |
| 8318 | 29.0 | 716.0 | 214.0 |


In [None]:
# 3. Get the underlying NumPy array
X = df_selected.to_numpy()
X

array([[  27., 1675.,  521.],
       [  52., 2359.,  591.],
       [  52., 2127.,  512.],
       [  52.,  996.,  264.],
       [  29.,  716.,  214.]])

In [None]:
# 4. Matrix-matrix multiplication between the transpose of X and X
XTX = np.matmul(X.T, X)
XTX

array([[9.6820000e+03, 3.5105300e+05, 9.1357000e+04],
       [3.5105300e+05, 1.4399307e+07, 3.7720360e+06],
       [9.1357000e+04, 3.7720360e+06, 9.9835800e+05]])

In [None]:
# 5. Inverse of XTX
XTX_inv = np.linalg.inv(XTX)
XTX_inv

array([[ 9.19403586e-04, -3.66412216e-05,  5.43072261e-05],
       [-3.66412216e-05,  8.23303633e-06, -2.77534485e-05],
       [ 5.43072261e-05, -2.77534485e-05,  1.00891325e-04]])

In [None]:
# 6. Array y
y = np.array([950, 1300, 800, 1000, 1300])
y

array([ 950, 1300,  800, 1000, 1300])

In [None]:
# 7. Get w
w = XTX_inv.dot(X.T).dot(y)
w

array([23.12330961, -1.48124183,  5.69922946])

In [None]:
# 8. Value of the last element of `w`
print('Last element of w: %.4f'%w[-1])

Last element of w: 5.6992
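
As a cross-check, the same weights can be obtained without forming the inverse explicitly. The sketch below rebuilds `X` and `y` from the arrays printed above, solves the normal equations with `np.linalg.solve`, and compares the result with NumPy's least-squares routine. The names `w_solve` and `w_lstsq` are introduced here for illustration only.

```python
import numpy as np

# X and y copied from the outputs above
X = np.array([
    [27.0, 1675.0, 521.0],
    [52.0, 2359.0, 591.0],
    [52.0, 2127.0, 512.0],
    [52.0, 996.0, 264.0],
    [29.0, 716.0, 214.0],
])
y = np.array([950, 1300, 800, 1000, 1300])

# Solve (X^T X) w = X^T y directly instead of computing the inverse
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver applied to X itself
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_solve)
print(w_lstsq)
```

Both results should match `w` from step 7 up to floating-point error. Solving the system is generally preferred over an explicit inverse because it tends to be more numerically stable.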
