# Homework

## Set up the environment

You need to install Python, NumPy, Pandas, Matplotlib, and Seaborn. For that, you can use the instructions from
[06-environment.md](../../../01-intro/06-environment.md).

## Q1. Pandas version

What version of Pandas did you install?

You can get the version information from the `__version__` attribute:

```python
import pandas as pd
pd.__version__
```

### Answer

```python
>>> pd.__version__
'2.3.2'
```


## Getting the data

For this homework, we'll use the Car Fuel Efficiency dataset. Download it from [here](https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv).

You can do it with `wget`:

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv
```

Or just open it with your browser and click "Save as...".

The answers below assume the file was saved to `../data`.

Now read it with Pandas.

## Q2. Records count

How many records are in the dataset?

- 4704
- 8704
- 9704  <---
- 17704

### Answer

```python
import pandas as pd

df = pd.read_csv("../data/car_fuel_efficiency.csv")

# count() returns the number of non-null values in each column;
# columns without missing values show the total number of records
df.count()
```

```
engine_displacement    9704
num_cylinders          9222
horsepower             8996
vehicle_weight         9704
acceleration           8774
model_year             9704
origin                 9704
fuel_type              9704
drivetrain             9704
num_doors              9202
fuel_efficiency_mpg    9704
dtype: int64
```
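
Since `df.count()` counts non-null values per column, the record count is only visible in columns that have no gaps. A minimal sketch of a more direct check, using a small made-up frame (the values are hypothetical; only the column names come from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values, standing in for the real dataset
toy = pd.DataFrame({
    "horsepower": [150.0, np.nan, 160.0],
    "model_year": [2003, 2007, 2018],
})

n_rows = len(toy)          # number of records, regardless of missing values
n_rows_alt = toy.shape[0]  # same number, read from the (rows, columns) tuple
non_null = toy.count()     # per-column count of non-missing values

print(n_rows, n_rows_alt)
print(non_null)
```

On the real data, `len(df)` or `df.shape[0]` should give the record count in one step.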

## Q3. Fuel types

How many fuel types are present in the dataset?

- 1
- 2 <---
- 3
- 4

### Answer

```python
df['fuel_type'].value_counts()
```

```
fuel_type
Gasoline    4898
Diesel      4806
Name: count, dtype: int64
```
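
If only the number of distinct values is needed, `nunique()` returns it directly. A small sketch on a made-up series (the rows are hypothetical; the two labels match the ones in the output above):

```python
import pandas as pd

# Made-up sample of the fuel_type column
fuel = pd.Series(
    ["Gasoline", "Diesel", "Diesel", "Gasoline", "Diesel"],
    name="fuel_type",
)

n_types = fuel.nunique()       # number of distinct non-null values
types = sorted(fuel.unique())  # the distinct values themselves

print(n_types, types)
```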

## Q4. Missing values

How many columns in the dataset have missing values?

- 0
- 1
- 2
- 3
- 4 <---

### Answer


```python
missing_values = df.isnull().sum()
print(missing_values[missing_values > 0])
```

```
num_cylinders    482
horsepower       708
acceleration     930
num_doors        502
dtype: int64
```
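
The question asks for the number of columns rather than the per-column totals, so the count can also be computed instead of read off the printout. A minimal sketch on a made-up frame (values are hypothetical; column names are borrowed from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values: two columns contain NaN, one does not
toy = pd.DataFrame({
    "num_cylinders": [4.0, np.nan, 6.0],
    "horsepower": [np.nan, 150.0, 160.0],
    "model_year": [2003, 2007, 2018],
})

has_missing = toy.isnull().any()                # True for columns with at least one NaN
n_cols_missing = int(has_missing.sum())         # how many such columns there are
cols_missing = has_missing[has_missing].index.tolist()

print(n_cols_missing, cols_missing)
```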




## Q5. Max fuel efficiency

What's the maximum fuel efficiency of cars from Asia?

- 13.75
- 23.75 <---
- 33.75
- 43.75

### Answer


```python
asian_cars = df[df['origin'] == 'Asia']

asian_cars['fuel_efficiency_mpg'].max()
```

```
23.759122836520497
```



## Q6. Median value of horsepower

1. Find the median value of the `horsepower` column in the dataset.
2. Next, calculate the most frequent value of the same `horsepower` column.
3. Use the `fillna` method to fill the missing values in the `horsepower` column with the most frequent value from the previous step.
4. Now, calculate the median value of `horsepower` once again.

Has it changed?


- Yes, it increased  <---
- Yes, it decreased
- No

### Answer


```python
# median before filling
median_before = df['horsepower'].median()

# most frequent value, taken from value_counts()
most_freq = df['horsepower'].value_counts().idxmax()

# fill missing values with the most frequent value and recompute the median
filled = df['horsepower'].fillna(most_freq)
median_after = filled.median()

print("median before:", median_before)
print("most frequent:", most_freq)
print("median after filling:", median_after)
```

```
median before: 149.0
most frequent: 152.0
median after filling: 152.0
```
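
The increase follows from where the fill value sits: the most frequent value (152) is above the original median (149), so filling the 708 missing entries with it adds weight above the middle of the distribution. A toy sketch of the same effect, with made-up numbers and `Series.mode()` as an alternative way to get the most frequent value:

```python
import numpy as np
import pandas as pd

# Made-up horsepower values: the mode (160) lies above the median (140)
hp = pd.Series([100.0, 120.0, 140.0, 160.0, 160.0, np.nan, np.nan, np.nan])

median_before = hp.median()   # NaN values are ignored
most_freq = hp.mode()[0]      # mode() returns a Series; take its first entry
median_after = hp.fillna(most_freq).median()

print(median_before, most_freq, median_after)
```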





## Q7. Sum of weights

1. Select all the cars from Asia
2. Select only columns `vehicle_weight` and `model_year`
3. Select the first 7 values
4. Get the underlying NumPy array. Let's call it `X`.
5. Compute matrix-matrix multiplication between the transpose of `X` and `X`. To get the transpose, use `X.T`. Let's call the result `XTX`.
6. Invert `XTX`.
7. Create an array `y` with values `[1100, 1300, 800, 900, 1000, 1100, 1200]`.
8. Multiply the inverse of `XTX` with the transpose of `X`, and then multiply the result by `y`. Call the result `w`.
9. What's the sum of all the elements of the result?

> **Note**: You just implemented linear regression. We'll talk about it in the next lesson.

- 0.051
- 0.51 <---
- 5.1
- 51


### Answer

```python
import numpy as np

# 1-4. Cars from Asia, only vehicle_weight and model_year, first 7 rows, as a NumPy array
X = asian_cars[['vehicle_weight', 'model_year']].head(7).values

# 5. Matrix-matrix multiplication of X.T and X
XTX = X.T.dot(X)

# 6. Invert XTX
XTX_inv = np.linalg.inv(XTX)

# 7. Array y
y = np.array([1100, 1300, 800, 900, 1000, 1100, 1200])

# 8. Multiply the inverse of XTX by X.T, then multiply the result by y
w = XTX_inv.dot(X.T).dot(y)

# 9. Sum all elements
result = w.sum()
print("Sum of all elements in w:", result)
```

```
Sum of all elements in w: 0.5187709081074008
```
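
The steps above compute `w = (XᵀX)⁻¹ Xᵀ y`, the normal equation for linear regression. As a sanity check, the same weights can be obtained without forming the inverse explicitly. The sketch below assumes a made-up `X` (the rows are hypothetical, not the real dataset values) and compares three ways of solving the system:

```python
import numpy as np

# Made-up feature matrix (7 rows, 2 columns); not the real dataset rows
X = np.array([
    [2700.0, 2016.0],
    [2800.0, 2010.0],
    [3600.0, 2007.0],
    [2200.0, 2011.0],
    [2650.0, 2016.0],
    [2850.0, 2014.0],
    [3750.0, 2019.0],
])
y = np.array([1100, 1300, 800, 900, 1000, 1100, 1200], dtype=float)

# Normal equation, as in the homework steps
w_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# Same linear system, solved without computing the inverse
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver applied to X and y directly
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_normal, w_solve), np.allclose(w_normal, w_lstsq))
print("Sum of weights:", w_normal.sum())
```

Explicit inversion is fine for a 2×2 matrix like this one; `np.linalg.solve` and `np.linalg.lstsq` are generally the more numerically robust choices for larger problems.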
