# Homework

## Set up the environment

You need to install Python, NumPy, Pandas, Matplotlib, and Seaborn. For that, you can use the instructions from [06-environment.md](https://github.com/DataTalksClub/machine-learning-zoomcamp/blob/master/01-intro/06-environment.md).

## Q1. Pandas version

What version of Pandas did you install?

You can get the version from the `__version__` attribute:

In [3]:
import pandas as pd

In [4]:
pd.__version__

'2.3.2'

## Getting the data

For this homework, we'll use the Car Fuel Efficiency dataset. Download it from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv).

You can download it with `wget`:

!wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv

In [7]:
df = pd.read_csv('car_fuel_efficiency.csv')

In [8]:
df.shape

(9704, 11)

## Q2. Records count

How many records are in the dataset?

- 4704
- 8704
- 9704
- 17704

**Answer: 9704**

## Q3. Fuel types

How many fuel types are present in the dataset?

- 1
- 2
- 3
- 4

**Answer: 2**

In [9]:
df.head()

| | engine_displacement | num_cylinders | horsepower | vehicle_weight | acceleration | model_year | origin | fuel_type | drivetrain | num_doors | fuel_efficiency_mpg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 170 | 3.0 | 159.0 | 3413.433759 | 17.7 | 2003 | Europe | Gasoline | All-wheel drive | 0.0 | 13.231729 |
| 1 | 130 | 5.0 | 97.0 | 3149.664934 | 17.8 | 2007 | USA | Gasoline | Front-wheel drive | 0.0 | 13.688217 |
| 2 | 170 | NaN | 78.0 | 3079.038997 | 15.1 | 2018 | Europe | Gasoline | Front-wheel drive | 0.0 | 14.246341 |
| 3 | 220 | 4.0 | NaN | 2542.392402 | 20.2 | 2009 | USA | Diesel | All-wheel drive | 2.0 | 16.912736 |
| 4 | 210 | 1.0 | 140.0 | 3460.87099 | 14.4 | 2009 | Europe | Gasoline | All-wheel drive | 2.0 | 12.488369 |


In [11]:
df['fuel_type'].unique()

array(['Gasoline', 'Diesel'], dtype=object)

## Q4. Missing values

How many columns in the dataset have missing values?

- 0
- 1
- 2
- 3
- 4

**Answer: 4**

In [15]:
df.isnull().sum()

engine_displacement      0
num_cylinders          482
horsepower             708
vehicle_weight           0
acceleration           930
model_year               0
origin                   0
fuel_type                0
drivetrain               0
num_doors              502
fuel_efficiency_mpg      0
dtype: int64
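
The count can also be computed directly instead of being read off the output above. A minimal sketch, assuming a small hypothetical frame `sample` built from four columns of the five `df.head()` rows shown earlier, so its result differs from the full dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: four columns from the five rows shown by df.head()
sample = pd.DataFrame({
    "num_cylinders": [3.0, 5.0, np.nan, 4.0, 1.0],
    "horsepower": [159.0, 97.0, 78.0, np.nan, 140.0],
    "acceleration": [17.7, 17.8, 15.1, 20.2, 14.4],
    "num_doors": [0.0, 0.0, 0.0, 2.0, 2.0],
})

# True for every column that contains at least one missing value
has_missing = sample.isnull().any()

# Summing booleans counts the True entries
n_missing_cols = int(has_missing.sum())
print(n_missing_cols)  # 2 for this five-row sample
```

On the full dataset, `df.isnull().any().sum()` should give the same count as the table above.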

## Q5. Max fuel efficiency

What's the maximum fuel efficiency of cars from Asia?

- 13.75
- 23.75
- 33.75
- 43.75

**Answer: 23.75**

In [18]:
df[df['origin']=='Asia']['fuel_efficiency_mpg'].max()

23.759122836520497
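
The same lookup can be written with `.loc`, or computed for every origin at once with `groupby`. A minimal sketch, assuming a hypothetical frame `sample` built from the five `df.head()` rows shown earlier. That sample contains no cars from Asia, so it only illustrates the pattern:

```python
import pandas as pd

# Hypothetical sample: two columns from the five rows shown by df.head()
sample = pd.DataFrame({
    "origin": ["Europe", "USA", "Europe", "USA", "Europe"],
    "fuel_efficiency_mpg": [13.231729, 13.688217, 14.246341, 16.912736, 12.488369],
})

# Maximum fuel efficiency for every origin at once
max_by_origin = sample.groupby("origin")["fuel_efficiency_mpg"].max()

# Single-origin lookup with .loc: boolean row mask and column in one step
usa_max = sample.loc[sample["origin"] == "USA", "fuel_efficiency_mpg"].max()

print(max_by_origin)
print(usa_max)
```

On the full dataset, replacing `sample` with `df` and `"USA"` with `"Asia"` should reproduce the value above.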

## Q6. Median value of horsepower

1. Find the median value of the `horsepower` column in the dataset.
2. Next, calculate the most frequent value of the same `horsepower` column.
3. Use the `fillna` method to fill the missing values in the `horsepower` column with the most frequent value from the previous step.
4. Now, calculate the median value of `horsepower` once again.

Has it changed?

- Yes, it increased
- Yes, it decreased
- No

**Answer: No**

In [24]:
df['horsepower'].median()

149.0

In [26]:
df['horsepower'].mode()

0    152.0
Name: horsepower, dtype: float64

In [30]:
df['horsepower'].isnull().sum()

np.int64(708)

In [38]:
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].mode()[0])

In [39]:
df['horsepower'].isnull().sum()

np.int64(0)

In [40]:
df['horsepower'].median()

149.0
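
The four steps of this question can be wrapped into one small function. This is a sketch under stated assumptions: the helper name `median_before_after_mode_fill` and the `toy` values are hypothetical and not taken from the dataset. Note that `mode()` returns a Series, so the first entry is extracted before it is passed to `fillna`.

```python
import numpy as np
import pandas as pd

def median_before_after_mode_fill(values: pd.Series):
    """Return (median before, most frequent value, median after the fill)."""
    before = values.median()               # NaN values are skipped by default
    most_frequent = values.mode().iloc[0]  # mode() returns a Series; take its first entry
    filled = values.fillna(most_frequent)  # returns a new Series, input is unchanged
    return before, most_frequent, filled.median()

# Hypothetical data, not taken from the dataset
toy = pd.Series([100.0, 150.0, 150.0, np.nan, 200.0, np.nan, 120.0])
before, most_frequent, after = median_before_after_mode_fill(toy)
print(before, most_frequent, after)
```

In this toy example the most frequent value equals the median, so the fill leaves the median unchanged. Whether the median moves in general depends on where the most frequent value sits relative to it and on how many values are filled.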

## Q7. Sum of weights

1. Select all the cars from Asia
2. Select only columns `vehicle_weight` and `model_year`
3. Select the first 7 values
4. Get the underlying NumPy array. Let's call it `X`.
5. Compute matrix-matrix multiplication between the transpose of `X` and `X`. To get the transpose, use `X.T`. Let's call the result `XTX`.
6. Invert `XTX`.
7. Create an array `y` with values `[1100, 1300, 800, 900, 1000, 1100, 1200]`.
8. Multiply the inverse of `XTX` with the transpose of `X`, and then multiply the result by `y`. Call the result `w`.
9. What's the sum of all the elements of the result?

**Note:** You just implemented linear regression. We'll talk about it in the next lesson.

- 0.051
- 0.51
- 5.1
- 51

**Answer: 0.51**

In [42]:
df[df['origin']=='Asia'][['vehicle_weight', 'model_year']]

| | vehicle_weight | model_year |
|---|---|---|
| 8 | 2714.219310 | 2016 |
| 12 | 2783.868974 | 2010 |
| 14 | 3582.687368 | 2007 |
| 20 | 2231.808142 | 2011 |
| 21 | 2659.431451 | 2016 |
| ... | ... | ... |
| 9688 | 3948.404625 | 2018 |
| 9692 | 3680.341381 | 2016 |
| 9693 | 2545.070139 | 2012 |
| 9698 | 3107.427820 | 2005 |


In [43]:
subset_df = df[df['origin']=='Asia'][['vehicle_weight', 'model_year']]

In [44]:
subset_df.head(7)

| | vehicle_weight | model_year |
|---|---|---|
| 8 | 2714.21931 | 2016 |
| 12 | 2783.868974 | 2010 |
| 14 | 3582.687368 | 2007 |
| 20 | 2231.808142 | 2011 |
| 21 | 2659.431451 | 2016 |
| 34 | 2844.227534 | 2014 |
| 38 | 3761.994038 | 2019 |


In [46]:
X = subset_df.head(7).values

In [65]:
XTX = X.T.dot(X)

In [73]:
import numpy as np

In [75]:
# invert XTX

np.linalg.inv(XTX)

array([[ 5.71497081e-07, -8.34509443e-07],
       [-8.34509443e-07,  1.25380877e-06]])

In [50]:
y = [1100, 1300, 800, 900, 1000, 1100, 1200]

In [84]:
(np.linalg.inv(XTX).dot(X.T)).dot(y)

array([0.01386421, 0.5049067 ])

In [81]:
w = (np.linalg.inv(XTX).dot(X.T)).dot(y)

In [86]:
w[0] + w[1]

np.float64(0.5187709081074007)
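
As a cross-check, the whole computation can be reproduced without the CSV file. The sketch below assumes the seven rows printed by `subset_df.head(7)` above, typed in at their displayed precision. It also solves the same system with `np.linalg.solve`, which is a swapped-in alternative that avoids forming the inverse explicitly, not the method the steps ask for.

```python
import numpy as np

# The seven rows shown by subset_df.head(7): vehicle_weight, model_year
X = np.array([
    [2714.219310, 2016],
    [2783.868974, 2010],
    [3582.687368, 2007],
    [2231.808142, 2011],
    [2659.431451, 2016],
    [2844.227534, 2014],
    [3761.994038, 2019],
])
y = np.array([1100, 1300, 800, 900, 1000, 1100, 1200])

XTX = X.T @ X

# Normal equation with an explicit inverse, as in the steps above
w_inv = np.linalg.inv(XTX) @ X.T @ y

# Same linear system, solved without computing the inverse
w_solve = np.linalg.solve(XTX, X.T @ y)

print(w_inv, w_inv.sum())
print(w_solve, w_solve.sum())
```

Both results should agree with each other and with the sum computed above, up to floating-point rounding.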