## Homework

### Set up the environment

You need to install Python, NumPy, Pandas, Matplotlib and Seaborn. For that, you can use the instructions from
[06-environment.md](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/01-intro/06-environment.md).
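
Once everything is installed, a quick check like the one below can confirm that the packages are visible to your interpreter. This is only a sketch: the helper name `installed_versions` is made up for illustration, and the package names are assumed to match the names you installed them under.

```python
import sys
from importlib.metadata import PackageNotFoundError, version

# Distribution names as used by pip (assumed to match your install).
PACKAGES = ["numpy", "pandas", "matplotlib", "seaborn"]


def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


print("Python:", sys.version.split()[0])
for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver or 'not installed'}")
```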

### Q1. Pandas version

What's the version of Pandas that you installed?

You can get the version information using the `__version__` field:

```python
import pandas as pd
pd.__version__
```

In [1]:
import pandas as pd

print("Pandas version:", pd.__version__)

Pandas version: 2.2.3


### Getting the data 

For this homework, we'll use the Car Fuel Efficiency dataset. Download it from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv).

You can do it with wget:
```bash
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv
```

Or just open it with your browser and click "Save as...".

Now read it with Pandas.
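
`pd.read_csv` accepts a local path, a URL, or a file-like object, so the download step is optional. The sketch below reads from an in-memory string so it runs without network access; the five rows are the ones shown in the `head()` output further down, and the names `SAMPLE_CSV` and `sample` are only for illustration.

```python
import io

import pandas as pd

# First five rows of the dataset, copied from the head() output in this notebook.
SAMPLE_CSV = """engine_displacement,num_cylinders,horsepower,vehicle_weight,acceleration,model_year,origin,fuel_type,drivetrain,num_doors,fuel_efficiency_mpg
170,3.0,159.0,3413.433759,17.7,2003,Europe,Gasoline,All-wheel drive,0.0,13.231729
130,5.0,97.0,3149.664934,17.8,2007,USA,Gasoline,Front-wheel drive,0.0,13.688217
170,,78.0,3079.038997,15.1,2018,Europe,Gasoline,Front-wheel drive,0.0,14.246341
220,4.0,,2542.392402,20.2,2009,USA,Diesel,All-wheel drive,2.0,16.912736
210,1.0,140.0,3460.87099,14.4,2009,Europe,Gasoline,All-wheel drive,2.0,12.488369
"""

# A path or the raw GitHub URL could be passed here instead of the StringIO object.
sample = pd.read_csv(io.StringIO(SAMPLE_CSV))
print("There are {} rows and {} columns".format(*sample.shape))
```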

### Q2. Records count

How many records are in the dataset?

> 9704


In [2]:
car_data = pd.read_csv('./data/car_fuel_efficiency.csv')

print("There are {} rows and {} columns".format(car_data.shape[0], car_data.shape[1]))

There are 9704 rows and 11 columns


### Q3. Fuel types

How many fuel types are present in the dataset?

> 2


In [3]:
car_data.head()

|   | engine_displacement | num_cylinders | horsepower | vehicle_weight | acceleration | model_year | origin | fuel_type | drivetrain | num_doors | fuel_efficiency_mpg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 170 | 3.0 | 159.0 | 3413.433759 | 17.7 | 2003 | Europe | Gasoline | All-wheel drive | 0.0 | 13.231729 |
| 1 | 130 | 5.0 | 97.0 | 3149.664934 | 17.8 | 2007 | USA | Gasoline | Front-wheel drive | 0.0 | 13.688217 |
| 2 | 170 | NaN | 78.0 | 3079.038997 | 15.1 | 2018 | Europe | Gasoline | Front-wheel drive | 0.0 | 14.246341 |
| 3 | 220 | 4.0 | NaN | 2542.392402 | 20.2 | 2009 | USA | Diesel | All-wheel drive | 2.0 | 16.912736 |
| 4 | 210 | 1.0 | 140.0 | 3460.87099 | 14.4 | 2009 | Europe | Gasoline | All-wheel drive | 2.0 | 12.488369 |


In [4]:
# Types of 'fuel_type' column
print(car_data['fuel_type'].value_counts())

fuel_type
Gasoline    4898
Diesel      4806
Name: count, dtype: int64
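
If only the number of distinct values is needed, `Series.nunique()` returns it directly. A minimal sketch, using the `fuel_type` values of the five rows shown by `head()` rather than the full file:

```python
import pandas as pd

# fuel_type values of the first five rows from the head() output.
fuel = pd.Series(
    ["Gasoline", "Gasoline", "Gasoline", "Diesel", "Gasoline"],
    name="fuel_type",
)

print(fuel.nunique())         # number of distinct fuel types
print(sorted(fuel.unique()))  # the distinct values themselves
```

On the full dataset the same call would be `car_data['fuel_type'].nunique()`.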


### Q4. Missing values

How many columns in the dataset have missing values?

> 4

In [5]:
# columns with missing values
print(car_data.isnull().sum())

engine_displacement      0
num_cylinders          482
horsepower             708
vehicle_weight           0
acceleration           930
model_year               0
origin                   0
fuel_type                0
drivetrain               0
num_doors              502
fuel_efficiency_mpg      0
dtype: int64
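
The count of affected columns can also be computed instead of read off the listing. The sketch below uses three columns from the five `head()` rows, so it finds 2 columns here; on the full `car_data` the same expression should count the four columns with non-zero totals above.

```python
import numpy as np
import pandas as pd

# Three columns from the first five rows of the head() output.
sample = pd.DataFrame({
    "num_cylinders": [3.0, 5.0, np.nan, 4.0, 1.0],
    "horsepower": [159.0, 97.0, 78.0, np.nan, 140.0],
    "model_year": [2003, 2007, 2018, 2009, 2009],
})

# True for every column that contains at least one missing value.
has_missing = sample.isnull().any()

print(has_missing.sum())                     # how many columns have missing values
print(list(has_missing[has_missing].index))  # which columns they are
```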


### Q5. Max fuel efficiency

What's the maximum fuel efficiency of cars from Asia?

> 23.75


In [7]:
# Maximum efficiency of cars from Asia
print(car_data.loc[car_data['origin'] == 'Asia', 'fuel_efficiency_mpg'].max())

23.759122836520497
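
To compare all regions at once, a `groupby` gives the maximum per origin. This is a sketch on the five `head()` rows only, which happen to contain no cars from Asia; on the full `car_data` the result would also include an `Asia` entry.

```python
import pandas as pd

# origin and fuel_efficiency_mpg of the first five rows from the head() output.
sample = pd.DataFrame({
    "origin": ["Europe", "USA", "Europe", "USA", "Europe"],
    "fuel_efficiency_mpg": [13.231729, 13.688217, 14.246341, 16.912736, 12.488369],
})

# One maximum per region, indexed by the origin label.
max_by_origin = sample.groupby("origin")["fuel_efficiency_mpg"].max()
print(max_by_origin)
```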


### Q6. Median value of horsepower

1. Find the median value of `horsepower` column in the dataset.
2. Next, calculate the most frequent value of the same `horsepower` column.
3. Use `fillna` method to fill the missing values in `horsepower` column with the most frequent value from the previous step.
4. Now, calculate the median value of `horsepower` once again.

Has it changed?

> Yes, it increased

In [8]:
# Median horsepower value
print(car_data['horsepower'].median())

149.0


In [11]:
# Most frequent value of 'horsepower' column
print(car_data['horsepower'].mode()[0])

152.0


In [19]:
# Fill missing values in 'horsepower' with the most frequent value.
# Assign the result back: calling fillna(..., inplace=True) on a column selection
# triggers a FutureWarning in pandas 2.2.3 and, per that warning, stops working in pandas 3.0.
car_data['horsepower'] = car_data['horsepower'].fillna(car_data['horsepower'].mode()[0])


In [21]:
# calculate median horsepower value again
print(car_data['horsepower'].median())

152.0
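
Why the median can move: `median()` skips missing values, so filling them with the mode adds extra copies of one value, which can shift the middle of the distribution. A toy sketch with made-up numbers (not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Made-up values for illustration: four known entries and three missing ones.
hp = pd.Series([100.0, 120.0, 150.0, 150.0, np.nan, np.nan, np.nan])

median_before = hp.median()           # NaN is skipped: median of [100, 120, 150, 150]
most_frequent = hp.mode()[0]          # most frequent known value
hp_filled = hp.fillna(most_frequent)  # returns a new Series, hp is unchanged
median_after = hp_filled.median()

print(median_before, most_frequent, median_after)
```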


### Q7. Sum of weights

1. Select all the cars from Asia
2. Select only columns `vehicle_weight` and `model_year`
3. Select the first 7 values
4. Get the underlying NumPy array. Let's call it `X`.
5. Compute matrix-matrix multiplication between the transpose of `X` and `X`. To get the transpose, use `X.T`. Let's call the result `XTX`.
6. Invert `XTX`.
7. Create an array `y` with values `[1100, 1300, 800, 900, 1000, 1100, 1200]`.
8. Multiply the inverse of `XTX` with the transpose of `X`, and then multiply the result by `y`. Call the result `w`.
9. What's the sum of all the elements of the result?

> **Note**: You just implemented linear regression. We'll talk about it in the next lesson.


> 0.51


In [22]:
# select all cars from Asia
car_asia = car_data[car_data['origin'] == 'Asia']

In [None]:
# select only columns `vehicle_weight`, `model_year`
# select first 7 values
car_asia_subset = car_asia[['vehicle_weight', 'model_year']].head(7)
print(car_asia_subset)

    vehicle_weight  model_year
8      2714.219310        2016
12     2783.868974        2010
14     3582.687368        2007
20     2231.808142        2011
21     2659.431451        2016
34     2844.227534        2014
38     3761.994038        2019


In [25]:
# convert the subset to a NumPy array
X = car_asia_subset.to_numpy()
print(X)

[[2714.21930965 2016.        ]
 [2783.86897424 2010.        ]
 [3582.68736772 2007.        ]
 [2231.8081416  2011.        ]
 [2659.43145076 2016.        ]
 [2844.22753389 2014.        ]
 [3761.99403819 2019.        ]]


In [26]:
# Compute matrix-matrix multiplication between the transpose of `X` and `X`. 
# To get the transpose, use `X.T`. Let's call the result `XTX`.
XTX = X.T.dot(X)
print(XTX)

[[62248334.33150762 41431216.5073268 ]
 [41431216.5073268  28373339.        ]]


In [28]:
# inverse of XTX
import numpy as np
XTX_inv = np.linalg.inv(XTX)
print(XTX_inv)

[[ 5.71497081e-07 -8.34509443e-07]
 [-8.34509443e-07  1.25380877e-06]]


In [None]:
# Create an array `y` with values `[1100, 1300, 800, 900, 1000, 1100, 1200]`.
y = np.array([1100, 1300, 800, 900, 1000, 1100, 1200])

# Multiply the inverse of `XTX` with the transpose of `X`,
# and then multiply the result by `y`. Call the result `w`.
w = XTX_inv.dot(X.T).dot(y)
print(w)

[0.01386421 0.5049067 ]


In [30]:
# What's the sum of all the elements of the result?
print(w.sum())

0.5187709081073995
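
As a cross-check, the same weights can be obtained with `np.linalg.lstsq`, which solves the least-squares problem without forming an explicit inverse. The sketch below copies `X` and `y` from the cells above (rounded to the printed precision); the names `w_inv` and `w_lstsq` are only for illustration.

```python
import numpy as np

# X copied from the printed array above, y from step 7.
X = np.array([
    [2714.21930965, 2016.0],
    [2783.86897424, 2010.0],
    [3582.68736772, 2007.0],
    [2231.8081416, 2011.0],
    [2659.43145076, 2016.0],
    [2844.22753389, 2014.0],
    [3761.99403819, 2019.0],
])
y = np.array([1100, 1300, 800, 900, 1000, 1100, 1200])

# Normal equation, as in the steps above: w = (X^T X)^-1 X^T y
w_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Same least-squares problem, solved by NumPy's built-in routine.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_inv, w_inv.sum())
print(np.allclose(w_inv, w_lstsq))
```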
