Solving the [2023 Week 1 Homework](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/cohorts/2023/01-intro/homework.md)

# Section 1

In [1]:
import numpy as np
import pandas as pd

## Question 1

What's the version of Pandas that you installed?

In [2]:
pd.__version__

'2.1.0'

**Answer**: 2.1.0

# Section 2

For this homework, we'll use the California Housing Prices dataset. Download it from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/housing.csv).

Now read it with Pandas. The cell below assumes the file was saved as `data/housing.csv`.

In [3]:
housing_df = pd.read_csv('data/housing.csv')
housing_df.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |


## Question 2

How many columns are in the dataset?

In [4]:
housing_df.shape[1]

10

**Answer**: 10

## Question 3

Which columns in the dataset have missing values?

- `total_rooms`
- `total_bedrooms`
- both of the above

In [5]:
(
    housing_df.isnull()
    .sum()  # count of missing values, by column
    .gt(0)  # True if column has missing values
    .sum()  # count of columns with missing values
)

1

In [6]:
housing_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


**Answer**: `total_bedrooms` (20433 non-null values out of 20640 rows, so 207 are missing)
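The chained check above counts how many columns have missing values, but it does not name them. A minimal sketch of one way to list the names directly is shown here on a small made-up frame; `toy_df` and its values are hypothetical and are not taken from the housing dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the pattern
toy_df = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0],
    "total_bedrooms": [129.0, np.nan, 190.0],
})

# Count of missing values per column
missing_counts = toy_df.isnull().sum()

# Keep only the columns whose count is positive, then take their names
cols_with_missing = missing_counts[missing_counts > 0].index.tolist()
print(cols_with_missing)
```

Applied to `housing_df`, the same two lines should report the column that `info()` shows with fewer non-null entries than rows.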

## Question 4

How many unique values does the `ocean_proximity` column have?

In [7]:
housing_df.ocean_proximity.unique()

array(['NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND'],
      dtype=object)

In [8]:
(
    housing_df.ocean_proximity
    .nunique()  # count of unique values
)

5

**Answer**: 5

## Question 5

What's the average value of the `median_house_value` for the houses located near the bay?

In [9]:
(
    housing_df[housing_df.ocean_proximity == "NEAR BAY"]  # filters for houses located near bay
    .median_house_value  # selects column
    .mean()  # average value
)

259212.31179039303

**Answer**: 259212
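As a cross-check, the same average can be read off a per-category summary instead of a boolean filter. The sketch below uses `groupby` on a small made-up frame; `toy_df` and its numbers are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the pattern
toy_df = pd.DataFrame({
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND"],
    "median_house_value": [100.0, 300.0, 50.0],
})

# Mean of median_house_value for every ocean_proximity category
mean_by_proximity = toy_df.groupby("ocean_proximity")["median_house_value"].mean()

# Pick out a single category from the summary
near_bay_mean = mean_by_proximity.loc["NEAR BAY"]
print(near_bay_mean)
```

This has the side benefit of showing the averages for all categories at once, which makes it easy to compare them.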

## Question 6

1. Calculate the average of `total_bedrooms` column in the dataset.
2. Use the `fillna` method to fill the missing values in `total_bedrooms` with the mean value from the previous step.
3. Now, calculate the average of `total_bedrooms` again.
4. Has it changed?

Hint: take into account only 3 digits after the decimal point.

In [10]:
avg_total_bedrooms = housing_df.total_bedrooms.mean()
avg_total_bedrooms

537.8705525375618

In [11]:
housing_df["total_bedrooms"] = housing_df.total_bedrooms.fillna(avg_total_bedrooms)
housing_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20640 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB


In [12]:
housing_df.total_bedrooms.mean()

537.8705525375617

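**Answer**: No. The two means agree to well beyond three decimal places; they differ only in the last digit, which is floating-point rounding noise.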
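The result is expected: `mean()` skips missing values by default, so filling the gaps with that same mean adds values equal to the average and leaves it where it was. A minimal sketch on a made-up series (the name `s` and its values are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy series with one missing value
s = pd.Series([1.0, 2.0, np.nan, 6.0])

# NaN is skipped by default: (1 + 2 + 6) / 3
mean_before = s.mean()

# Replace the missing value with the mean computed above
filled = s.fillna(mean_before)

# All four values now count: (1 + 2 + 3 + 6) / 4
mean_after = filled.mean()

# Compare at three decimal places, as the homework hint suggests
unchanged = round(mean_before, 3) == round(mean_after, 3)
print(unchanged)
```

Comparing rounded values (or using `np.isclose`) avoids being misled by differences in the last floating-point digit.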

# Section 3

In [13]:
from numpy.linalg import inv

## Question 7

1. Select all the options located on islands.
2. Select only columns `housing_median_age`, `total_rooms`, `total_bedrooms`.
3. Get the underlying NumPy array. Let's call it `X`.
4. Compute matrix-matrix multiplication between the transpose of `X` and `X`. To get the transpose, use `X.T`. Let's call the result `XTX`.
5. Compute the inverse of `XTX`.
6. Create an array `y` with values `[950, 1300, 800, 1000, 1300]`.
7. Multiply the inverse of `XTX` with the transpose of `X`, and then multiply the result by `y`. Call the result `w`.
8. What's the value of the last element of `w`?

In [14]:
housing_df["ocean_proximity"].value_counts()

ocean_proximity
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: count, dtype: int64

In [15]:
X = (
    housing_df[housing_df["ocean_proximity"]=='ISLAND']  # island houses
    .loc[:, ["housing_median_age", "total_rooms", "total_bedrooms"]]  # required columns
    .values  # numpy array
)

X.shape

(5, 3)

In [16]:
XTX = np.matmul(X.T, X)  # matrix multiplication
XTX.shape

(3, 3)

In [17]:
XTX_inv = inv(XTX)
XTX_inv.shape

(3, 3)

In [18]:
y = [950, 1300, 800, 1000, 1300]
y

[950, 1300, 800, 1000, 1300]

In [19]:
w = np.matmul(np.matmul(XTX_inv, X.T), y)
w.shape

(3,)

In [20]:
w[-1]

5.699229455065586

**Answer**: 5.6992
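The steps above compute `w` from the normal equation, that is, the inverse of `XTX` times the transpose of `X` times `y`. This is the least-squares solution when `X` has full column rank. As a sanity check, the sketch below compares that formula with `np.linalg.lstsq` on a small made-up matrix; the values in `X` are hypothetical and are not the island rows from the dataset.

```python
import numpy as np

# Hypothetical 5x3 matrix with full column rank (not the island data)
X = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 1.0, 0.0],
    [0.0, 1.0, 4.0],
    [3.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
])
y = np.array([950, 1300, 800, 1000, 1300], dtype=float)

# Normal equation, following the same steps as the homework
w_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# Least-squares solver, which avoids forming the explicit inverse
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_normal, w_lstsq))
```

Running the same comparison with the `X` and `y` from the cells above is a quick way to confirm the value of `w[-1]`.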