## Homework for week01

### Setup environment

- Name the environment `zoomcamp`.

```bash
micromamba create -n zoomcamp python=3.9
micromamba activate zoomcamp
micromamba install numpy pandas scikit-learn seaborn jupyter
```

### Get data

```bash
# -P sets the target directory; use -O <out_file_name> instead to set the output file path
!wget -P <path> <url>
```

In [None]:
# !wget -P ../data https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv
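
The same download can be done from Python, which avoids depending on `wget` being installed. The following is a minimal sketch, assuming a hypothetical helper `download_if_missing` (not part of the original notebook) built on the standard-library `urllib`; it skips the download when the file is already on disk.

```python
from pathlib import Path
from urllib.request import urlretrieve

DATA_URL = "https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv"


def download_if_missing(url: str, dest_dir: str) -> Path:
    """Download `url` into `dest_dir` unless the file is already there.

    Hypothetical helper for illustration; the notebook itself uses wget.
    """
    dest = Path(dest_dir) / url.rsplit("/", 1)[-1]  # keep the remote file name
    if not dest.exists():
        dest.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, dest)  # network access happens only here
    return dest
```

Calling `download_if_missing(DATA_URL, "../data")` would then return the path `../data/housing.csv`, which is the location the later `pd.read_csv` cell reads from.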

#### Data Dictionary
[source](https://www.kaggle.com/datasets/camnugent/california-housing-prices)

Note that the numbers in the dataset describe all residential buildings `within a block`, not individual single-family homes.


> 1. longitude: A measure of how far west a house is; a higher value is farther west
> 
> 2. latitude: A measure of how far north a house is; a higher value is farther north
> 
> 3. housingMedianAge: Median age of a house within a block; a lower number is a newer building
> 
> 4. totalRooms: Total number of rooms within a block
> 
> 5. totalBedrooms: Total number of bedrooms within a block
> 
> 6. population: Total number of people residing within a block
> 
> 7. households: Total number of households, a group of people residing within a home unit, for a block
> 
> 8. medianIncome: Median income for households within a block of houses (measured in tens of thousands of US Dollars)
> 
> 9. medianHouseValue: Median house value for households within a block (measured in US Dollars)
> 
> 10. oceanProximity: Location of the house w.r.t ocean/sea

#### What is a block?

[source](https://www.tripadvisor.co.uk/ShowTopic-g60763-i5-k550785-What_is_a_block-New_York_City_New_York.html)

> While a block properly so-called is just as described above, most New Yorkers (and indeed, most Americans) commonly use it to mean "the distance along a street between intersections."
> 
> If I start at the corner of Fifth Avenue and 50th Street and walk north along Fifth Avenue, when I come to 51st Street I have walked "one block", when I arrive at 53rd Street I have walked "three blocks", and so on.
> 
> Naturally, blocks are NOT the same length -- it is three or four times longer to walk the one block from Fifth to Sixth Avenues than it is to walk the one block from 35th to 36th Streets.

In [None]:
!which python

### Import packages

In [None]:
import numpy as np
import pandas as pd


### Question 1

q: What's the version of Pandas that you installed?

a: 2.1.0

In [None]:
print(f'{pd.__version__ = }')

### Question 2

q: How many columns are in the dataset?

a: 10

In [None]:
df = pd.read_csv('../data/housing.csv')
print(f'{df.shape =}')

In [None]:
df.info()

In [None]:

numerical_cols = ['housing_median_age', 'total_rooms', 'total_bedrooms', 'population', 'households', 'median_income', 'median_house_value']

In [None]:
df['housing_median_age'].plot.box()

In [None]:
df['total_bedrooms'].plot.box()


### Question 3

q: Which columns in the dataset have missing values?

a: 'total_bedrooms'

In [None]:
df.isnull().sum()

In [None]:
df.columns[df.isnull().sum() > 0]
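
To make the two cells above easier to follow, here is a small sketch on a made-up frame (the `toy` data is an assumption, not the housing dataset). It shows that `isnull().sum()` counts missing values per column and that a boolean mask over `columns` keeps only the affected names.

```python
import numpy as np
import pandas as pd

# Toy frame, not the housing data: only column "b" contains a NaN.
toy = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [1.0, np.nan, 3.0],
    "c": ["x", "y", "z"],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()

# Names of the columns that have at least one missing value.
cols_with_missing = toy.columns[toy.isnull().any()].tolist()

print(missing_per_column.to_dict())  # {'a': 0, 'b': 1, 'c': 0}
print(cols_with_missing)             # ['b']
```

Using `.any()` instead of `.sum() > 0` gives the same mask; it is a matter of taste.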

### Question 4

q: How many unique values does the `ocean_proximity` column have?

a: 5

In [None]:
df['ocean_proximity'].nunique()

In [None]:
df['ocean_proximity'].value_counts()
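
One detail worth keeping in mind: by default `nunique` and `value_counts` both ignore missing values. The sketch below uses a made-up series (an assumption for illustration; the real `ocean_proximity` column has no missing values according to Question 3) to show the difference.

```python
import numpy as np
import pandas as pd

# Toy categorical series with one missing entry.
toy = pd.Series(["INLAND", "NEAR BAY", "INLAND", np.nan])

n_unique = toy.nunique()                       # NaN is not counted by default
n_unique_with_nan = toy.nunique(dropna=False)  # NaN counted as its own value
counts = toy.value_counts()                    # frequency per non-missing value

print(n_unique)           # 2
print(n_unique_with_nan)  # 3
print(counts.to_dict())   # {'INLAND': 2, 'NEAR BAY': 1}
```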

### Question 5

q: What's the average value of the `median_house_value` for the houses located near the bay?

a: 

In [None]:
df[df['ocean_proximity'] == 'NEAR BAY']['median_house_value'].mean()
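
The filter-then-average pattern above can also be written with a single `.loc` call, or generalised to every category with `groupby`. This is a sketch on made-up numbers (the `toy` values are assumptions, not results from the housing data).

```python
import pandas as pd

# Toy data: two categories with easy-to-check averages.
toy = pd.DataFrame({
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "INLAND"],
    "median_house_value": [100.0, 50.0, 300.0, 70.0],
})

# Row mask and column selection in one .loc call.
is_near_bay = toy["ocean_proximity"] == "NEAR BAY"
near_bay_mean = toy.loc[is_near_bay, "median_house_value"].mean()

# The same average for every category at once.
means_by_category = toy.groupby("ocean_proximity")["median_house_value"].mean()

print(near_bay_mean)                # 200.0
print(means_by_category.to_dict())  # {'INLAND': 60.0, 'NEAR BAY': 200.0}
```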

### Question 6

q: 

1. Calculate the average of `total_bedrooms` column in the dataset.
1. Use the `fillna` method to fill the missing values in `total_bedrooms` with the mean value from the previous step.
1. Now, calculate the average of `total_bedrooms` again.
1. Has it changed?


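a: No

1. Mean before `fillna`: 537.871
2. Missing values filled with that mean.
3. Mean after `fillna`: 537.871
4. No, it has not changed.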

In [None]:
df['total_bedrooms'].value_counts()

In [None]:
df.columns[df.isnull().sum() > 0]

In [None]:
mean_before_fill = df['total_bedrooms'].mean()
median_before_fill = df['total_bedrooms'].median()

In [None]:
df_mean = df.copy()
df_median = df.copy()

In [None]:
df_mean['total_bedrooms'] = df['total_bedrooms'].fillna(mean_before_fill)
df_median['total_bedrooms'] = df['total_bedrooms'].fillna(median_before_fill)

In [None]:
df_mean.columns[df_mean.isnull().sum() > 0]

In [None]:
mean_after_fill = df_mean['total_bedrooms'].mean()
median_after_fill = df_median['total_bedrooms'].median()
print(f'mean before fillna: {mean_before_fill:.3f}')
print(f'mean after fillna: {mean_after_fill:.3f}')
print()
print(f'median before fillna: {median_before_fill:.3f}')
print(f'median after fillna: {median_after_fill:.3f}')

q: Is it different when `median` is used instead of `mean`?

a: No
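
Why neither statistic moves: pandas skips NaN when computing the mean, so with `n` observed values summing to `S` the mean is `S/n`. Filling `k` gaps with `S/n` gives `(S + k*S/n) / (n + k) = S/n`. Adding copies of the median likewise leaves the middle of the sorted values where it was. A small sketch on a made-up series (the values are assumptions chosen for easy arithmetic):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 6.0, np.nan, np.nan])

# Mean: NaN entries are skipped, so this is (1 + 2 + 6) / 3.
mean_before = s.mean()
mean_after = s.fillna(mean_before).mean()  # (1 + 2 + 6 + 3 + 3) / 5

# Median: middle of [1, 2, 6] before, middle of [1, 2, 2, 2, 6] after.
median_before = s.median()
median_after = s.fillna(median_before).median()

print(mean_before, mean_after)      # 3.0 3.0
print(median_before, median_after)  # 2.0 2.0
```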

### Question 7

q:

1. Select all the options located on islands.
1. Select only columns housing_median_age, total_rooms, total_bedrooms.
1. Get the underlying NumPy array. Let's call it X.
1. Compute matrix-matrix multiplication between the transpose of X and X. To get the transpose, use X.T. Let's call the result XTX.
1. Compute the inverse of XTX.
1. Create an array y with values [950, 1300, 800, 1000, 1300].
1. Multiply the inverse of XTX with the transpose of X, and then multiply the result by y. Call the result w.
1. What's the value of the last element of w?


a: code snippets below

1. Select all the options located on islands.

In [None]:
islands = df[df['ocean_proximity'] == 'ISLAND']

2. Select only columns housing_median_age, total_rooms, total_bedrooms.

In [None]:
subset = islands[['housing_median_age', 'total_rooms', 'total_bedrooms']]

3. Get the underlying NumPy array. Let's call it X.

In [None]:
X = subset.to_numpy()

In [None]:
print(f'{X.shape = }')
print(f'{X.T.shape = }')

4. Compute matrix-matrix multiplication between the transpose of X and X. To get the transpose, use X.T. Let's call the result XTX.

In [None]:
XTX = X.T @ X
XTX

5. Compute the inverse of XTX.

In [None]:
XTX_inv = np.linalg.inv(XTX)
XTX_inv

6. Create an array y with values [950, 1300, 800, 1000, 1300].

In [None]:
y = np.array([950, 1300, 800, 1000, 1300])
y

7. Multiply the inverse of XTX with the transpose of X, and then multiply the result by y. Call the result w.

In [None]:
w = XTX_inv @ X.T @ y  # same as np.matmul(np.matmul(XTX_inv, X.T), y)

In [None]:
XTX_inv @ X.T @ y

8. What's the value of the last element of w?

In [None]:
w[-1]
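
The steps above compute `w = (XᵀX)⁻¹ Xᵀ y`, which is the normal-equation solution of a least-squares problem. As a sanity check, the sketch below repeats the computation on a made-up matrix (the `X_toy` and `y_toy` values are assumptions, not the island rows) and compares it with `np.linalg.solve` and `np.linalg.lstsq`, which avoid forming the explicit inverse.

```python
import numpy as np

# Toy design matrix (5 rows, 3 features) and target; values are made up.
X_toy = np.array([
    [1.0, 2.0, 0.0],
    [0.0, 1.0, 3.0],
    [2.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
    [3.0, 2.0, 0.0],
])
y_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

XTX_toy = X_toy.T @ X_toy

# Same recipe as in the homework: explicit inverse, then two products.
w_normal = np.linalg.inv(XTX_toy) @ X_toy.T @ y_toy

# Solve the linear system (X^T X) w = X^T y without inverting.
w_solve = np.linalg.solve(XTX_toy, X_toy.T @ y_toy)

# Let NumPy's least-squares routine work on X directly.
w_lstsq, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)

print(np.allclose(w_normal, w_solve), np.allclose(w_normal, w_lstsq))
```

All three agree here because `X_toy` has full column rank; when `XTX` is close to singular, the explicit inverse is the least reliable of the three.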