## Homework #1 

### Set up the environment

You need to install Python, NumPy, Pandas, Matplotlib and Seaborn. For that, you can follow the instructions from <a href='https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/01-intro/06-environment.md'>06-environment.md</a>.

In [1]:
import numpy as np
import pandas as pd

### Question 1

What's the version of NumPy that you installed?

You can get the version information using the `__version__` field:

In [2]:
np.__version__

'1.26.4'

### Getting the data

For this homework, we'll use the Car Fuel Efficiency dataset. Download it from <a href='https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv'>here</a>.

You can do it with wget:
```bash
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv
```
Or just open it in your browser and click "Save as...".

Now read it with Pandas.

In [3]:
data = pd.read_csv("https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv")

In [4]:
data.head()

| | engine_displacement | num_cylinders | horsepower | vehicle_weight | acceleration | model_year | origin | fuel_type | drivetrain | num_doors | fuel_efficiency_mpg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 170 | 3.0 | 159.0 | 3413.433759 | 17.7 | 2003 | Europe | Gasoline | All-wheel drive | 0.0 | 13.231729 |
| 1 | 130 | 5.0 | 97.0 | 3149.664934 | 17.8 | 2007 | USA | Gasoline | Front-wheel drive | 0.0 | 13.688217 |
| 2 | 170 | NaN | 78.0 | 3079.038997 | 15.1 | 2018 | Europe | Gasoline | Front-wheel drive | 0.0 | 14.246341 |
| 3 | 220 | 4.0 | NaN | 2542.392402 | 20.2 | 2009 | USA | Diesel | All-wheel drive | 2.0 | 16.912736 |
| 4 | 210 | 1.0 | 140.0 | 3460.87099 | 14.4 | 2009 | Europe | Gasoline | All-wheel drive | 2.0 | 12.488369 |


### Question 2

How many records are in the dataset?

Here you need to specify the number of rows.

In [5]:
data.shape[0]   # alternative is len(data.index)

9704

### Question 3

How many fuel types are present in the dataset?

In [7]:
data['fuel_type'].value_counts().head()

fuel_type
Gasoline    4898
Diesel      4806
Name: count, dtype: int64
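`value_counts()` shows two distinct fuel types, so the answer is 2. If only the count is needed, `nunique()` returns it directly. The sketch below uses a small made-up frame (`toy` is an illustrative name, not part of the homework data) so it runs without downloading anything:

```python
import pandas as pd

# Toy frame standing in for the homework data (values are made up).
toy = pd.DataFrame(
    {"fuel_type": ["Gasoline", "Diesel", "Gasoline", "Diesel", "Gasoline"]}
)

# nunique() returns the number of distinct non-null values in the column.
n_fuel_types = toy["fuel_type"].nunique()

# unique() returns the distinct values themselves, in order of first appearance.
fuel_types = toy["fuel_type"].unique().tolist()

print(n_fuel_types, fuel_types)
```

On the real dataset, the same call would be `data['fuel_type'].nunique()`.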

### Question 4

How many columns in the dataset have missing values?

In [8]:
data.isnull().sum()

engine_displacement      0
num_cylinders          482
horsepower             708
vehicle_weight           0
acceleration           930
model_year               0
origin                   0
fuel_type                0
drivetrain               0
num_doors              502
fuel_efficiency_mpg      0
dtype: int64

In [10]:
(data.isnull().sum() != 0).sum()

4
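To see which columns have gaps, rather than only how many, one option is to filter on `isnull().any()`. This is a minimal sketch on a made-up frame; `toy`, `has_missing` and `cols_with_missing` are illustrative names:

```python
import numpy as np
import pandas as pd

# Toy frame with NaNs in two of its three columns (values are made up).
toy = pd.DataFrame(
    {
        "horsepower": [159.0, np.nan, 78.0],
        "num_doors": [0.0, 2.0, np.nan],
        "model_year": [2003, 2007, 2018],
    }
)

# isnull().any() is True for every column that contains at least one NaN.
has_missing = toy.isnull().any()

# Keep only the True entries and read the column names off the index.
cols_with_missing = has_missing[has_missing].index.tolist()

# Summing the boolean Series counts the columns with missing values.
n_cols_with_missing = int(has_missing.sum())

print(cols_with_missing, n_cols_with_missing)
```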

### Question 5

What's the maximum fuel efficiency of cars from Asia?

In [12]:
data.groupby('origin')['fuel_efficiency_mpg'].agg(['min', 'mean', 'max'])

| origin | min | mean | max |
|---|---|---|---|
| Asia | 6.886245 | 14.97383 | 23.759123 |
| Europe | 6.200971 | 14.942532 | 25.967222 |
| USA | 6.695483 | 15.040204 | 24.971452 |


In [13]:
data[data.origin == 'Asia']['fuel_efficiency_mpg'].max()

23.759122836520497
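The chained indexing above works for reading values. An equivalent form that selects rows and column in one step is `.loc` with a boolean mask, sketched here on made-up numbers (`toy` and `asia_max` are illustrative names):

```python
import pandas as pd

# Toy frame standing in for the homework data (values are made up).
toy = pd.DataFrame(
    {
        "origin": ["Asia", "Europe", "Asia", "USA"],
        "fuel_efficiency_mpg": [14.0, 16.5, 18.25, 15.0],
    }
)

# Boolean mask picks the rows, the column label picks the column.
asia_max = toy.loc[toy["origin"] == "Asia", "fuel_efficiency_mpg"].max()

print(asia_max)
```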

### Question 6



1. Find the median value of the "horsepower" column in the dataset.
2. Next, calculate the most frequent value of the same "horsepower" column.
3. Use the `fillna` method to fill the missing values in "horsepower" with the most frequent value from the previous step.
4. Now, calculate the median value of "horsepower" once again.

Has it changed?

In [14]:
data['horsepower'].value_counts()

horsepower
152.0    142
145.0    141
151.0    134
148.0    130
141.0    130
        ... 
46.0       1
43.0       1
53.0       1
66.0       1
61.0       1
Name: count, Length: 192, dtype: int64

In [16]:
median_horsepower = data['horsepower'].median()
median_horsepower

149.0

In [17]:
mode_horsepower = data['horsepower'].mode()[0]
mode_horsepower

152.0

In [18]:
data['horsepower'].fillna(mode_horsepower).median()

152.0

Yes, it increased from 149.0 to 152.0.
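The median moves because every filled-in value equals the mode, which pulls the middle of the sorted values toward it. A small made-up series (`toy` is an illustrative name) shows the same effect, and also that `fillna` returns a new Series instead of modifying the original:

```python
import numpy as np
import pandas as pd

# Four known values and three missing ones (values are made up).
toy = pd.Series([100.0, 120.0, 150.0, 150.0, np.nan, np.nan, np.nan])

# median() skips NaN by default: the middle of 100, 120, 150, 150.
median_before = toy.median()

# mode() returns a Series of the most frequent values; take the first one.
mode_value = toy.mode().iloc[0]

# fillna() returns a new Series; toy itself still contains the NaNs.
median_after = toy.fillna(mode_value).median()

print(median_before, mode_value, median_after)
```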

### Question 7

1. Select all the cars from Asia.
2. Select only columns "vehicle_weight" and "model_year".
3. Select the first 7 values.
4. Get the underlying NumPy array. Let's call it `X`.
5. Compute matrix-matrix multiplication between the transpose of `X` and `X`. To get the transpose, use `X.T`. Let's call the result `XTX`.
6. Invert `XTX`.
7. Create an array `y` with values `[1100, 1300, 800, 900, 1000, 1100, 1200]`.
8. Multiply the inverse of `XTX` with the transpose of `X`, and then multiply the result by `y`. Call the result `w`.
9. What's the sum of all the elements of the result?

In [27]:
Asia_df = data[data['origin'] == 'Asia']
Asia_df = Asia_df[['vehicle_weight', 'model_year']]
Asia_df = Asia_df[:7]
Asia_df

| | vehicle_weight | model_year |
|---|---|---|
| 8 | 2714.21931 | 2016 |
| 12 | 2783.868974 | 2010 |
| 14 | 3582.687368 | 2007 |
| 20 | 2231.808142 | 2011 |
| 21 | 2659.431451 | 2016 |
| 34 | 2844.227534 | 2014 |
| 38 | 3761.994038 | 2019 |


In [28]:
X = Asia_df.to_numpy()
XTX = X.T @ X
XTX_inv = np.linalg.inv(XTX)
XTX_inv

array([[ 5.71497081e-07, -8.34509443e-07],
       [-8.34509443e-07,  1.25380877e-06]])

In [31]:
y = np.array([1100, 1300, 800, 900, 1000, 1100, 1200])

In [32]:
w = (XTX_inv @ X.T) @ y

In [33]:
w.sum()

0.5187709081074007

> **Note**: we just implemented the normal equation:


$$w = (X^T X)^{-1} X^T y$$


We'll talk about it more next week (Machine Learning for Regression).
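As a cross-check of the normal equation, the same `w` can be obtained without forming the inverse explicitly, using `np.linalg.solve` or `np.linalg.lstsq`. The sketch below assumes the seven rows printed for Question 7, typed in at their displayed (rounded) precision; `w_inv`, `w_solve` and `w_lstsq` are illustrative names:

```python
import numpy as np

# The seven Asia rows as printed above: vehicle_weight, model_year.
X = np.array(
    [
        [2714.21931, 2016],
        [2783.868974, 2010],
        [3582.687368, 2007],
        [2231.808142, 2011],
        [2659.431451, 2016],
        [2844.227534, 2014],
        [3761.994038, 2019],
    ]
)
y = np.array([1100, 1300, 800, 900, 1000, 1100, 1200])

XTX = X.T @ X

# 1. Normal equation with an explicit inverse, as in the homework.
w_inv = np.linalg.inv(XTX) @ X.T @ y

# 2. Solve the linear system (X^T X) w = X^T y without inverting.
w_solve = np.linalg.solve(XTX, X.T @ y)

# 3. Least-squares solver applied to X and y directly.
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]

print(w_inv, w_solve, w_lstsq)
print(w_inv.sum())
```

All three should agree up to floating-point error. Solving the system is a common alternative to explicit inversion when `XTX` is close to singular.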