# 01 - Intro Homework

## Set up the environment

You need to install Python, NumPy, Pandas, Matplotlib and Seaborn. For that, you can use the instructions from 06-environment.md.

## Q1. Pandas version

What's the version of Pandas that you installed?

You can get the version information using the `__version__` field:

```python
pd.__version__
```

In [25]:
import numpy as np
import pandas as pd

In [26]:
pd.__version__

'2.3.1'

## Getting the data

For this homework, we'll use the Car Fuel Efficiency dataset. It is available at the URL used in the command below.

You can do it with wget:

```bash
wget https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv
```

Or just open it with your browser and click "Save as...".

Now read it with Pandas.

## Q2. Records count

How many records are in the dataset?

- 4704
- 8704
- 9704
- 17704

In [27]:
df = pd.read_csv("https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv")
df

Unnamed: 0,engine_displacement,num_cylinders,horsepower,vehicle_weight,acceleration,model_year,origin,fuel_type,drivetrain,num_doors,fuel_efficiency_mpg
0,170,3.0,159.0,3413.433759,17.7,2003,Europe,Gasoline,All-wheel drive,0.0,13.231729
1,130,5.0,97.0,3149.664934,17.8,2007,USA,Gasoline,Front-wheel drive,0.0,13.688217
2,170,,78.0,3079.038997,15.1,2018,Europe,Gasoline,Front-wheel drive,0.0,14.246341
3,220,4.0,,2542.392402,20.2,2009,USA,Diesel,All-wheel drive,2.0,16.912736
4,210,1.0,140.0,3460.870990,14.4,2009,Europe,Gasoline,All-wheel drive,2.0,12.488369
...,...,...,...,...,...,...,...,...,...,...,...
9699,140,5.0,164.0,2981.107371,17.3,2013,Europe,Diesel,Front-wheel drive,,15.101802
9700,180,,154.0,2439.525729,15.0,2004,USA,Gasoline,All-wheel drive,0.0,17.962326
9701,220,2.0,138.0,2583.471318,15.1,2008,USA,Diesel,All-wheel drive,-1.0,17.186587
9702,230,4.0,177.0,2905.527390,19.4,2011,USA,Diesel,Front-wheel drive,1.0,15.331551


In [28]:
df.shape[0]

9704

## Q3. Fuel types

How many fuel types are present in the dataset?

- 1
- 2
- 3
- 4


In [29]:
df.fuel_type.value_counts()

fuel_type
Gasoline    4898
Diesel      4806
Name: count, dtype: int64

In [30]:
df.fuel_type.nunique()

2

## Q4. Missing values

How many columns in the dataset have missing values?

- 0
- 2
- 3
- 4

In [31]:
df.isna().max().sum()

np.int64(4)

In [32]:
df.isnull().max().sum()

np.int64(4)
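
Taking `max()` over booleans works because `True > False`, but `any()` states the intent more directly. A minimal sketch on a toy frame (the values below are made up for illustration and are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; the values are made up.
toy = pd.DataFrame({
    "horsepower": [159.0, np.nan, 78.0],
    "num_doors": [0.0, 2.0, np.nan],
    "model_year": [2003, 2007, 2018],
})

# isna() marks missing cells; any() flags each column that has at least one.
cols_with_na = toy.isna().any()

# Summing the boolean flags counts the columns with missing values.
n_cols_with_na = int(cols_with_na.sum())
na_column_names = list(cols_with_na[cols_with_na].index)

print(n_cols_with_na)
print(na_column_names)
```

Listing the column names alongside the count makes it easier to check which columns are affected.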

## Q5. Max fuel efficiency

What's the maximum fuel efficiency of cars from Asia?

- 13.75
- 23.75
- 33.75
- 43.75


In [33]:
df.fuel_efficiency_mpg[df.origin == "Asia"].max()

np.float64(23.759122836520497)
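
The boolean-mask filter above answers the question for one region. As a rough sketch, `groupby` gives the maximum for every region at once; the toy values here are made up and only the column names follow the dataset:

```python
import pandas as pd

# Toy frame for illustration only; the values are made up.
toy = pd.DataFrame({
    "origin": ["Asia", "Asia", "Europe", "USA"],
    "fuel_efficiency_mpg": [10.0, 20.5, 15.0, 12.0],
})

# Boolean mask: keep only the rows where origin is "Asia", then take the max.
asia_max = toy.loc[toy.origin == "Asia", "fuel_efficiency_mpg"].max()

# groupby: compute the max for each origin in one pass.
max_by_origin = toy.groupby("origin")["fuel_efficiency_mpg"].max()

print(asia_max)
print(max_by_origin)
```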

## Q6. Median value of horsepower

1. Find the median value of the `horsepower` column in the dataset.
2. Next, calculate the most frequent value of the same `horsepower` column.
3. Use the `fillna` method to fill the missing values in the `horsepower` column with the most frequent value from the previous step.
4. Now, calculate the median value of `horsepower` once again.

Has it changed?

- Yes, it increased
- Yes, it decreased
- No

In [34]:
# 1)
df.horsepower.median()

np.float64(149.0)

In [35]:
# 2)
df.horsepower.value_counts()

horsepower
152.0    142
145.0    141
151.0    134
148.0    130
141.0    130
        ... 
40.0       1
57.0       1
245.0      1
252.0      1
61.0       1
Name: count, Length: 192, dtype: int64

In [36]:
most_frequent = df.horsepower.mode()[0]
most_frequent

np.float64(152.0)

In [37]:
# 3)
df.horsepower = df.horsepower.fillna(value=most_frequent)
df.horsepower.isna().sum()

np.int64(0)

In [38]:
df = df.fillna(df.mode().iloc[0])
df

Unnamed: 0,engine_displacement,num_cylinders,horsepower,vehicle_weight,acceleration,model_year,origin,fuel_type,drivetrain,num_doors,fuel_efficiency_mpg
0,170,3.0,159.0,3413.433759,17.7,2003,Europe,Gasoline,All-wheel drive,0.0,13.231729
1,130,5.0,97.0,3149.664934,17.8,2007,USA,Gasoline,Front-wheel drive,0.0,13.688217
2,170,4.0,78.0,3079.038997,15.1,2018,Europe,Gasoline,Front-wheel drive,0.0,14.246341
3,220,4.0,152.0,2542.392402,20.2,2009,USA,Diesel,All-wheel drive,2.0,16.912736
4,210,1.0,140.0,3460.870990,14.4,2009,Europe,Gasoline,All-wheel drive,2.0,12.488369
...,...,...,...,...,...,...,...,...,...,...,...
9699,140,5.0,164.0,2981.107371,17.3,2013,Europe,Diesel,Front-wheel drive,0.0,15.101802
9700,180,4.0,154.0,2439.525729,15.0,2004,USA,Gasoline,All-wheel drive,0.0,17.962326
9701,220,2.0,138.0,2583.471318,15.1,2008,USA,Diesel,All-wheel drive,-1.0,17.186587
9702,230,4.0,177.0,2905.527390,19.4,2011,USA,Diesel,Front-wheel drive,1.0,15.331551


In [39]:
# 4)
df.horsepower.median()

np.float64(152.0)
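
The median rose from 149.0 to 152.0 because every filled value equals the mode (152.0), which sits above the original median, so the middle of the sorted values shifts upward. A minimal sketch of the same effect on a made-up series (values chosen for illustration only):

```python
import numpy as np
import pandas as pd

# Toy series for illustration only; the values are made up.
hp = pd.Series([100.0, 120.0, 150.0, 150.0, np.nan, np.nan])

# median() skips NaN by default, so it uses only the four known values.
median_before = hp.median()

# mode() returns a Series (there may be ties); take the first entry.
mode_value = hp.mode()[0]

# Fill the gaps with the mode and recompute the median.
filled = hp.fillna(mode_value)
median_after = filled.median()

print(median_before, mode_value, median_after)
```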

## Q7. Sum of weights

1. Select all the cars from Asia.
2. Select only the columns `vehicle_weight` and `model_year`.
3. Select the first 7 values.
4. Get the underlying NumPy array. Let's call it `X`.
5. Compute matrix-matrix multiplication between the transpose of `X` and `X`. To get the transpose, use `X.T`. Let's call the result `XTX`.
6. Invert `XTX`.
7. Create an array `y` with values `[1100, 1300, 800, 900, 1000, 1100, 1200]`.
8. Multiply the inverse of `XTX` with the transpose of `X`, and then multiply the result by `y`. Call the result `w`.
9. What's the sum of all the elements of the result?

**_NOTE:_** You just implemented linear regression. We'll talk about it in the next lesson.

- 0.051
- 0.51
- 5.1
- 51


In [40]:
# 1)
asianCars = df[df.origin == "Asia"]
asianCars

Unnamed: 0,engine_displacement,num_cylinders,horsepower,vehicle_weight,acceleration,model_year,origin,fuel_type,drivetrain,num_doors,fuel_efficiency_mpg
8,250,1.0,174.0,2714.219310,10.3,2016,Asia,Diesel,Front-wheel drive,-1.0,16.823554
12,320,5.0,145.0,2783.868974,15.1,2010,Asia,Diesel,All-wheel drive,1.0,16.175820
14,200,6.0,160.0,3582.687368,14.9,2007,Asia,Diesel,All-wheel drive,0.0,11.871091
20,150,3.0,197.0,2231.808142,18.7,2011,Asia,Gasoline,Front-wheel drive,1.0,18.889083
21,160,4.0,133.0,2659.431451,15.6,2016,Asia,Gasoline,Front-wheel drive,-1.0,16.077730
...,...,...,...,...,...,...,...,...,...,...,...
9688,260,4.0,152.0,3948.404625,15.5,2018,Asia,Diesel,All-wheel drive,-1.0,11.054830
9692,180,3.0,188.0,3680.341381,18.0,2016,Asia,Gasoline,Front-wheel drive,1.0,11.711653
9693,280,2.0,148.0,2545.070139,15.6,2012,Asia,Diesel,All-wheel drive,0.0,17.202782
9698,180,1.0,131.0,3107.427820,13.2,2005,Asia,Gasoline,Front-wheel drive,-2.0,13.933716


In [41]:
# 2)
weightAndYear = asianCars[["vehicle_weight", "model_year"]]
weightAndYear

Unnamed: 0,vehicle_weight,model_year
8,2714.219310,2016
12,2783.868974,2010
14,3582.687368,2007
20,2231.808142,2011
21,2659.431451,2016
...,...,...
9688,3948.404625,2018
9692,3680.341381,2016
9693,2545.070139,2012
9698,3107.427820,2005


In [42]:
# 3)
first7 = weightAndYear.head(7)
first7

Unnamed: 0,vehicle_weight,model_year
8,2714.21931,2016
12,2783.868974,2010
14,3582.687368,2007
20,2231.808142,2011
21,2659.431451,2016
34,2844.227534,2014
38,3761.994038,2019


In [43]:
# 4)
X = first7.to_numpy()
X

array([[2714.21930965, 2016.        ],
       [2783.86897424, 2010.        ],
       [3582.68736772, 2007.        ],
       [2231.8081416 , 2011.        ],
       [2659.43145076, 2016.        ],
       [2844.22753389, 2014.        ],
       [3761.99403819, 2019.        ]])

In [44]:
# 5)
XTX = X.T.dot(X)
XTX

array([[62248334.33150762, 41431216.50732678],
       [41431216.50732678, 28373339.        ]])

In [45]:
# 6)
inverse = np.linalg.inv(XTX)
inverse


array([[ 5.71497081e-07, -8.34509443e-07],
       [-8.34509443e-07,  1.25380877e-06]])

In [46]:
# 7)
y = np.array([1100, 1300, 800, 900, 1000, 1100, 1200])
y

array([1100, 1300,  800,  900, 1000, 1100, 1200])

In [47]:
# 8) Multiply the inverse of XTX with the transpose of X, then multiply the result by y
w = inverse.dot(X.T).dot(y)
w

array([0.01386421, 0.5049067 ])

In [48]:
# 9) Sum of all the elements of w
w.sum()

np.float64(0.5187709081074007)
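
Steps 5 to 8 compute `w = (XᵀX)⁻¹ Xᵀ y`. As a sketch, the same `w` can be obtained without forming the inverse by solving the linear system `(XᵀX) w = Xᵀ y` with `np.linalg.solve`. This is an alternative shown for comparison, not the method the homework asks for. `X` below is copied from the printed array above, so its values are rounded to 8 decimals.

```python
import numpy as np

# X copied from the printed output above (rounded to 8 decimals).
X = np.array([
    [2714.21930965, 2016.0],
    [2783.86897424, 2010.0],
    [3582.68736772, 2007.0],
    [2231.8081416, 2011.0],
    [2659.43145076, 2016.0],
    [2844.22753389, 2014.0],
    [3761.99403819, 2019.0],
])
y = np.array([1100, 1300, 800, 900, 1000, 1100, 1200])

# Explicit inverse, following the homework steps.
w_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Alternative: solve (X^T X) w = X^T y directly, without inverting.
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

print(w_inv, w_inv.sum())
print(np.allclose(w_inv, w_solve))
```

Both routes should agree to within floating-point error; solving the system is usually preferred when `XTX` is close to singular.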

## Submit the results

- Submit your results here: https://courses.datatalks.club/ml-zoomcamp-2025/homework/hw01
- If your answer doesn't match any of the options exactly, select the closest one
