## Homework

### Set up the environment

You need to install Python, NumPy, Pandas, Matplotlib and Seaborn. To do that, follow the instructions in
[06-environment.md](https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/01-intro/06-environment.md).

### Question 1

What's the version of Pandas that you installed?

You can get the version information from the `__version__` attribute:

In [76]:
import pandas as pd
pd.__version__

'2.0.3'

### Getting the data 

For this homework, we'll use the California Housing Prices dataset. Download it from 
[here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv).

You can do it with wget:

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv
```

Or just open it with your browser and click "Save as...".
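
If you would rather stay in Python, the download can be sketched with the standard library. The helper name `download_if_missing` and the constant `DATA_URL` are made up for this illustration; only the URL itself comes from the homework.

```python
import os
import urllib.request

# Same URL as in the wget command above.
DATA_URL = "https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv"


def download_if_missing(url, path):
    """Download `url` to `path` unless the file already exists; return the path."""
    if not os.path.exists(path):
        urllib.request.urlretrieve(url, path)
    return path


# Usage (needs network access the first time):
# download_if_missing(DATA_URL, "housing.csv")
```

Skipping the download when the file is already present makes it safe to re-run the notebook from the top.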

Now read it with Pandas.

In [77]:
df = pd.read_csv('./housing.csv')
df.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| longitude | -122.23 | -122.22 | -122.24 | -122.25 | -122.25 |
| latitude | 37.88 | 37.86 | 37.85 | 37.85 | 37.85 |
| housing_median_age | 41.0 | 21.0 | 52.0 | 52.0 | 52.0 |
| total_rooms | 880.0 | 7099.0 | 1467.0 | 1274.0 | 1627.0 |
| total_bedrooms | 129.0 | 1106.0 | 190.0 | 235.0 | 280.0 |
| population | 322.0 | 2401.0 | 496.0 | 558.0 | 565.0 |
| households | 126.0 | 1138.0 | 177.0 | 219.0 | 259.0 |
| median_income | 8.3252 | 8.3014 | 7.2574 | 5.6431 | 3.8462 |
| median_house_value | 452600.0 | 358500.0 | 352100.0 | 341300.0 | 342200.0 |
| ocean_proximity | NEAR BAY | NEAR BAY | NEAR BAY | NEAR BAY | NEAR BAY |


### Question 2

How many columns are in the dataset?

- 10 ✅
- 6560
- 10989
- 20640

In [78]:
len(df.columns)

10

### Question 3

Which columns in the dataset have missing values?

- `total_rooms`
- `total_bedrooms` ✅
- both of the above
- no empty columns in the dataset

In [79]:
df.isnull().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        207
population              0
households              0
median_income           0
median_house_value      0
ocean_proximity         0
dtype: int64
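
To get just the names of the affected columns rather than reading them off the full listing, the counts can be filtered. This is a small sketch on a made-up frame (`toy` and its values are hypothetical, not the real data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data; the values are invented.
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, np.nan, 190.0],
})

# Count missing values per column, then keep only columns with at least one.
missing_counts = toy.isnull().sum()
columns_with_missing = missing_counts[missing_counts > 0].index.tolist()
print(columns_with_missing)  # ['total_bedrooms']
```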

### Question 4

How many unique values does the `ocean_proximity` column have?

- 3
- 5 ✅
- 7
- 9

In [80]:
df.ocean_proximity.nunique()

5
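
If you also want to see how many rows fall into each category, `value_counts` gives the breakdown, and its length equals the `nunique` result. A sketch on a made-up Series (the labels mimic the real ones, the counts do not):

```python
import pandas as pd

# Hypothetical category column; not the real distribution.
categories = pd.Series(
    ["NEAR BAY", "INLAND", "INLAND", "ISLAND", "NEAR BAY", "INLAND"]
)

counts = categories.value_counts()  # rows per category, most frequent first
print(len(counts))        # 3 distinct categories
print(counts["INLAND"])   # 3 rows labelled INLAND
```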

### Question 5

What's the average value of the `median_house_value` for the houses located near the bay?

- 49433
- 124805
- 259212 ✅
- 380440

In [81]:
df.ocean_proximity.unique()

array(['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'],
      dtype=object)

In [82]:
int(df[df.ocean_proximity == 'NEAR BAY'].median_house_value.mean())

259212
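
An alternative to filtering on one category is to group by `ocean_proximity` and compute the mean for every category at once. The sketch below uses a made-up frame, so its numbers do not match the answer above:

```python
import pandas as pd

# Hypothetical rows; only the column names match the real dataset.
toy = pd.DataFrame({
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "ISLAND"],
    "median_house_value": [400000.0, 300000.0, 100000.0, 450000.0],
})

# Mean house value per category, indexed by the category label.
mean_by_category = toy.groupby("ocean_proximity")["median_house_value"].mean()
print(mean_by_category["NEAR BAY"])  # 350000.0
```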

### Question 6

1. Calculate the average of the `total_bedrooms` column in the dataset.
2. Use the `fillna` method to fill the missing values in `total_bedrooms` with the mean value from the previous step.
3. Now, calculate the average of `total_bedrooms` again.
4. Has it changed?


> Hint: take into account only 3 digits after the decimal point.

- Yes
- No ✅


In [83]:
print(df.total_bedrooms.mean().round(3))
print(df.total_bedrooms.isnull().sum())
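# Assign the result back instead of using inplace=True on an attribute access,
# which newer pandas versions warn about or ignore.
df['total_bedrooms'] = df.total_bedrooms.fillna(df.total_bedrooms.mean())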
print(df.total_bedrooms.isnull().sum())
print(df.total_bedrooms.mean().round(3))

537.871
207
0
537.871
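
The mean does not change because `mean()` skips missing values: with n known values averaging m, filling k gaps with m gives (n·m + k·m) / (n + k) = m. A minimal sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

# Hypothetical values with two gaps.
s = pd.Series([100.0, 200.0, np.nan, 600.0, np.nan])

mean_before = s.mean()           # NaN is skipped: (100 + 200 + 600) / 3 = 300
filled = s.fillna(mean_before)   # both gaps become 300
mean_after = filled.mean()       # (900 + 2 * 300) / 5 = 300

print(mean_before, mean_after)
```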


### Question 7

1. Select all the rows located on islands.
2. Select only columns `housing_median_age`, `total_rooms`, `total_bedrooms`.
3. Get the underlying NumPy array. Let's call it `X`.
4. Compute matrix-matrix multiplication between the transpose of `X` and `X`. To get the transpose, use `X.T`. Let's call the result `XTX`.
5. Compute the inverse of `XTX`.
6. Create an array `y` with values `[950, 1300, 800, 1000, 1300]`.
7. Multiply the inverse of `XTX` with the transpose of `X`, and then multiply the result by `y`. Call the result `w`.
8. What's the value of the last element of `w`?

> **Note**: You just implemented linear regression. We'll talk about it in the next lesson.

- -1.4812
- 0.001
- 5.6992 ✅
- 23.1233

In [84]:
import numpy as np

# Select the rows located on islands, keep only the three feature columns,
# and take the underlying NumPy array as X.
island_rows = df[df.ocean_proximity == 'ISLAND']
X = island_rows[['housing_median_age', 'total_rooms', 'total_bedrooms']].to_numpy()

In [85]:
XTX = X.T @ X  # matrix-matrix product of the transpose of X and X

In [86]:
XTX_inverse = np.linalg.inv(XTX) # Compute the inverse of `XTX`.

In [87]:
y = np.array([950, 1300, 800, 1000, 1300])

In [88]:
w = XTX_inverse @ X.T @ y

In [89]:
w[-1]

5.699229455065594
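
The steps above compute w = (XᵀX)⁻¹Xᵀy, the normal equation. As a sanity check, the same weights can be obtained without forming the inverse explicitly, using `np.linalg.solve` or `np.linalg.lstsq`. The sketch below uses a made-up feature matrix (not the island rows from the dataset), so its weights differ from the answer above:

```python
import numpy as np

# Hypothetical 5x3 feature matrix; NOT the real island rows.
X = np.array([
    [52.0, 1000.0, 200.0],
    [27.0, 1500.0, 350.0],
    [52.0, 2300.0, 500.0],
    [52.0,  990.0, 250.0],
    [29.0,  716.0, 214.0],
])
y = np.array([950, 1300, 800, 1000, 1300])

# Same recipe as in the homework: explicit inverse of X^T X.
w_inverse = np.linalg.inv(X.T @ X) @ X.T @ y

# Solve the linear system (X^T X) w = X^T y directly.
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver working on X itself.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_inverse, w_solve, w_lstsq)
```

All three agree up to floating-point error when `X` has full column rank; the solver-based variants are usually preferred numerically.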







## Submit the results

* Submit your results here: https://forms.gle/jneGM91mzDZ23i8HA
* You can submit your solution multiple times. In this case, only the last submission will be used.
* If your answer doesn't match any of the options exactly, select the closest one.


## Deadline

The deadline for submitting is 18 September 2023 (Monday), 23:00 CEST (Berlin time).

After that, the form will be closed.