# ML Zoomcamp - Homework 01: Introduction to Machine Learning

This notebook contains the solutions for the first homework assignment of the ML Zoomcamp course by DataTalks.Club.

**Dataset**: Car Fuel Efficiency Dataset
**Topics Covered**: Data exploration, basic statistics, linear algebra operations

In [1]:
# Install required packages
!pip install numpy pandas matplotlib seaborn

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns



In [2]:
# Check pandas version
print("Pandas version:", pd.__version__)

Pandas version: 2.2.2


## Question 1: Pandas Version

What's the version of Pandas that you installed?

**Answer**: The installed Pandas version is **2.2.2**, as printed above.

In [3]:
# Download and load the dataset
!wget -q https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv -O car_fuel_efficiency.csv

df = pd.read_csv("car_fuel_efficiency.csv")
print("\nDataset loaded. Shape:", df.shape)


Dataset loaded. Shape: (9704, 11)


In [8]:
# Explore the dataset structure
print("First few rows of the dataset:")
print(df.head())

First few rows of the dataset:
   engine_displacement  num_cylinders  horsepower  vehicle_weight  \
0                  170            3.0       159.0     3413.433759   
1                  130            5.0        97.0     3149.664934   
2                  170            NaN        78.0     3079.038997   
3                  220            4.0         NaN     2542.392402   
4                  210            1.0       140.0     3460.870990   

   acceleration  model_year  origin fuel_type         drivetrain  num_doors  \
0          17.7        2003  Europe  Gasoline    All-wheel drive        0.0   
1          17.8        2007     USA  Gasoline  Front-wheel drive        0.0   
2          15.1        2018  Europe  Gasoline  Front-wheel drive        0.0   
3          20.2        2009     USA    Diesel    All-wheel drive        2.0   
4          14.4        2009  Europe  Gasoline    All-wheel drive        2.0   

   fuel_efficiency_mpg  
0            13.231729  
1            13.688217  
2   

## Question 2: Records Count

How many records are in the dataset?

In [4]:
# Count number of records
print("Number of records:", len(df))

Number of records: 9704


**Answer**: There are **9,704** records in the dataset.

## Question 3: Fuel Types

How many different fuel types are there in the dataset?

In [5]:
# Count unique fuel types
print("Number of unique fuel types:", df['fuel_type'].nunique())
print("Fuel types:", df['fuel_type'].unique())

Number of unique fuel types: 2
Fuel types: ['Gasoline' 'Diesel']


**Answer**: There are **2** different fuel types: Gasoline and Diesel.
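One detail worth keeping in mind is that `nunique()` skips missing values by default, so a column containing `NaN` would not have it counted as a category. The sketch below illustrates this on a small hypothetical Series (the values are made up, not taken from the dataset) and uses `value_counts()` to show how often each category occurs.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df['fuel_type']; values are illustrative only
fuel = pd.Series(["Gasoline", "Diesel", "Gasoline", np.nan], name="fuel_type")

# nunique() ignores NaN unless dropna=False is passed
n_unique = fuel.nunique()
n_unique_with_nan = fuel.nunique(dropna=False)

# value_counts() gives the number of rows per category (NaN excluded by default)
counts = fuel.value_counts()

print("Unique (NaN ignored):", n_unique)
print("Unique (NaN counted):", n_unique_with_nan)
print(counts)
```

If the two counts differ, the column has missing entries that may need handling before they are treated as a category.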

## Question 4: Missing Values

How many columns in the dataset have missing values?

In [6]:
# Find columns with missing values
missing_cols = df.columns[df.isnull().any()].tolist()
print("Columns with missing values:", missing_cols)
print("Count:", len(missing_cols))

Columns with missing values: ['num_cylinders', 'horsepower', 'acceleration', 'num_doors']
Count: 4


**Answer**: **4** columns have missing values: num_cylinders, horsepower, acceleration, and num_doors.
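Beyond knowing which columns have gaps, it is often useful to see how many values are missing in each. A minimal sketch, assuming a small hypothetical DataFrame `toy` in place of the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the car dataset
toy = pd.DataFrame({
    "num_cylinders": [3.0, 5.0, np.nan, 4.0],
    "horsepower": [159.0, 97.0, 78.0, np.nan],
    "model_year": [2003, 2007, 2018, 2009],
})

# Count missing values per column, then keep only columns with at least one
missing_counts = toy.isnull().sum()
cols_with_missing = missing_counts[missing_counts > 0]

print(cols_with_missing)
print("Columns with missing values:", len(cols_with_missing))
```

The same two lines applied to `df` would give per-column counts for the homework data.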

## Question 5: Maximum Fuel Efficiency for Asia Cars

What's the maximum fuel efficiency (mpg) for cars from Asia?

In [9]:
# Filter Asia cars and find max fuel efficiency
asia_cars = df[df['origin'] == 'Asia']
max_eff = asia_cars['fuel_efficiency_mpg'].max()
print("Max fuel efficiency for Asia cars:", max_eff)

Max fuel efficiency for Asia cars: 23.759122836520497


**Answer**: The maximum fuel efficiency for cars from Asia is **23.76 mpg**.
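The filter-then-aggregate pattern above can also be written as a `groupby`, which returns the maximum for every origin at once. This is only a sketch on a hypothetical toy frame; the numbers are illustrative and not from the dataset.

```python
import pandas as pd

# Hypothetical toy data; values are made up for illustration
toy = pd.DataFrame({
    "origin": ["Asia", "Europe", "Asia", "USA"],
    "fuel_efficiency_mpg": [14.2, 13.2, 18.5, 13.7],
})

# Boolean-mask filter, as in the cell above
asia_max = toy.loc[toy["origin"] == "Asia", "fuel_efficiency_mpg"].max()

# Equivalent aggregate for all origins in one call
max_by_origin = toy.groupby("origin")["fuel_efficiency_mpg"].max()

print("Asia max:", asia_max)
print(max_by_origin)
```

Both approaches agree for any single group; `groupby` is convenient when the same statistic is needed for several groups.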

## Question 6: Missing Values Imputation

1. Select all the cars from Asia origin.
2. Fill the missing values in horsepower column with the mode (most frequent value) of that column.
3. Did the median value of horsepower change after filling the missing values?

**Options**:
- Yes
- No

In [10]:
# Calculate median before filling missing values
# NOTE: this uses the full DataFrame `df`, not the Asia-only subset named in step 1
median_hp_before = df['horsepower'].median()

# Find mode of horsepower
mode_hp = df['horsepower'].mode()[0]

# Fill missing values with mode
filled_df = df.copy()
filled_df['horsepower'] = filled_df['horsepower'].fillna(mode_hp)

# Calculate median after filling
median_hp_after = filled_df['horsepower'].median()

print("Median horsepower before:", median_hp_before)
print("Mode horsepower:", mode_hp)
print("Median horsepower after filling:", median_hp_after)
print("Has median changed?", "Yes" if median_hp_before != median_hp_after else "No")

Median horsepower before: 149.0
Mode horsepower: 152.0
Median horsepower after filling: 152.0
Has median changed? Yes


**Answer**: **Yes**, the median horsepower increased from 149.0 to 152.0 after the missing values were filled with the mode (152.0).
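The code cell for this question operates on `df`, while step 1 asks for cars from Asia. A sketch of how the same comparison could be restricted to one origin is shown below; the helper `median_before_after_mode_fill` and the toy data are hypothetical, and the printed numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

def median_before_after_mode_fill(frame, column):
    """Return (median before, mode, median after) for mode imputation of `column`."""
    before = frame[column].median()        # NaN values are skipped by default
    mode_value = frame[column].mode()[0]   # mode() returns a sorted Series; take the first
    after = frame[column].fillna(mode_value).median()
    return before, mode_value, after

# Hypothetical toy data; values are made up for illustration
toy = pd.DataFrame({
    "origin": ["Asia", "Asia", "Asia", "Asia", "Asia", "Asia", "Europe"],
    "horsepower": [100.0, 110.0, 150.0, 150.0, np.nan, np.nan, 300.0],
})

# Restrict to Asia first, then compare medians
asia_only = toy[toy["origin"] == "Asia"]
before, mode_value, after = median_before_after_mode_fill(asia_only, "horsepower")

print("Median before:", before)
print("Mode:", mode_value)
print("Median after:", after)
print("Changed?", "Yes" if before != after else "No")
```

Calling the helper with `asia_cars` instead of `asia_only` would apply it to the homework data.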

## Question 7: Linear Algebra

1. Select all the cars from Asia origin.
2. Select the first 7 rows of that dataset.
3. Select only columns `vehicle_weight` and `model_year`.
4. Make matrix X from that data.
5. Compute matrix-matrix multiplication between the transpose of X and X. To get the transpose, use `X.T`.
6. Compute the inverse of this matrix.
7. Create array y with values `[1100, 1300, 800, 900, 1000, 1100, 1200]`.
8. Multiply the inverse of matrix with the transpose of X, and then multiply the result by y.
9. What's the sum of all the elements of the result?

**Note**: You just implemented linear regression. We'll talk about it in the next lesson.

In [11]:
# Step 1-3: Select Asia cars, first 7 rows, and specific columns
asia_subset = asia_cars[['vehicle_weight', 'model_year']].head(7)
print("Asia subset (first 7 rows):")
print(asia_subset)

# Step 4: Make matrix X
X = asia_subset.values
print("\nMatrix X shape:", X.shape)

# Step 5: Compute X^T * X
XTX = X.T.dot(X)
print("\nX^T * X:")
print(XTX)

# Step 6: Compute inverse
XTX_inv = np.linalg.inv(XTX)
print("\nInverse of X^T * X:")
print(XTX_inv)

# Step 7: Create array y
y = np.array([1100, 1300, 800, 900, 1000, 1100, 1200])
print("\nArray y:", y)

# Step 8: Compute (X^T * X)^-1 * X^T * y
w = XTX_inv.dot(X.T).dot(y)
print("\nResult vector w:", w)

# Step 9: Sum of all elements
print("\nSum of elements in w:", w.sum())

Asia subset (first 7 rows):
     vehicle_weight  model_year
114     2582.268908        2007
217     2868.301989        2008
274     2844.183871        2011
285     3046.174623        2007
300     2804.302133        2009
314     2804.302133        2009
318     2582.268908        2007

Matrix X shape: (7, 2)

X^T * X:
[[  55615276.01171306   14056573.49674436]
 [  14056573.49674436    4017059.        ]]

Inverse of X^T * X:
[[ 9.27296258e-08 -3.24393346e-06]
 [-3.24393346e-06  1.28405627e-04]]

Array y: [1100 1300  800  900 1000 1100 1200]

Result vector w: [  4.59494481 -0.07611356]

Sum of elements in w: 4.518831246211617


**Answer**: The sum of all elements in the result vector is approximately **4.52**.
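The steps above implement the normal equation $w = (X^T X)^{-1} X^T y$. Forming the inverse explicitly follows the assignment, but in practice the same weights are usually obtained with `np.linalg.solve` or `np.linalg.lstsq`, which avoid the explicit inverse. The sketch below compares the three on a hypothetical 7x2 matrix (the feature values are made up; only `y` matches the assignment).

```python
import numpy as np

# Hypothetical 7x2 feature matrix (weight-like and year-like columns)
X = np.array([
    [2500.0, 2007.0],
    [2800.0, 2008.0],
    [2900.0, 2011.0],
    [3000.0, 2005.0],
    [2700.0, 2009.0],
    [2600.0, 2012.0],
    [3100.0, 2006.0],
])
y = np.array([1100, 1300, 800, 900, 1000, 1100, 1200], dtype=float)

# 1) Explicit inverse, as in the assignment
w_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# 2) Solve the linear system (X^T X) w = X^T y without forming the inverse
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

# 3) Least-squares solver applied directly to X and y
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("inv:  ", w_inv)
print("solve:", w_solve)
print("lstsq:", w_lstsq)
print("All agree:", np.allclose(w_inv, w_solve) and np.allclose(w_solve, w_lstsq))
```

For a well-conditioned problem like this toy one the three results coincide up to floating-point error; the solver-based forms tend to be more robust when columns of `X` are nearly collinear.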

---

## Summary

This homework covered basic data exploration and linear algebra operations using pandas and numpy:

1. **Pandas Version**: 2.2.2
2. **Records Count**: 9,704 records
3. **Fuel Types**: 2 types (Gasoline, Diesel)
4. **Missing Values**: 4 columns have missing values
5. **Max Fuel Efficiency (Asia)**: 23.76 mpg
6. **Median Change**: Yes, median changed after imputation
7. **Linear Algebra Result**: Sum ≈ 4.52

The exercise provided hands-on experience with data cleaning, exploration, and basic linear algebra operations that are fundamental to machine learning.