# Machine Learning Zoomcamp - Homework 1

This notebook contains solutions to the homework questions from the ML Zoomcamp course.

## Question 1: Pandas Version

What's the version of Pandas that you installed?

In [1]:
import pandas as pd
print(f"Pandas version: {pd.__version__}")

Pandas version: 2.2.2


## Question 2: Records Count

How many records are in the dataset?

In [2]:
# Load the dataset
url = 'https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv'
df = pd.read_csv(url)

# Get the number of records
num_records = len(df)
print(f"Number of records: {num_records}")
print(f"Dataset shape: {df.shape}")

Number of records: 9704
Dataset shape: (9704, 11)


## Question 3: Fuel Types

How many fuel types are present in the dataset?

In [3]:
# List the column names to locate the fuel type column
print("Column names:")
print(df.columns.tolist())
print()

# Look for fuel-related columns
fuel_columns = [col for col in df.columns if 'fuel' in col.lower()]
print(f"Fuel-related columns: {fuel_columns}")
print()

# Count unique values in the fuel type column.
# The 'fuel' substring also matches 'fuel_efficiency_mpg', so select
# 'fuel_type' by name instead of relying on column order.
if 'fuel_type' in df.columns:
    fuel_col = 'fuel_type'
    unique_fuel_types = df[fuel_col].nunique()
    print(f"Number of unique fuel types: {unique_fuel_types}")
    print(f"Fuel types: {df[fuel_col].unique()}")
else:
    print("No 'fuel_type' column found; inspect the column names printed above.")

Column names:
['engine_displacement', 'num_cylinders', 'horsepower', 'vehicle_weight', 'acceleration', 'model_year', 'origin', 'fuel_type', 'drivetrain', 'num_doors', 'fuel_efficiency_mpg']

Fuel-related columns: ['fuel_type', 'fuel_efficiency_mpg']

Number of unique fuel types: 2
Fuel types: ['Gasoline' 'Diesel']
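
Since the substring match returns both `fuel_type` and `fuel_efficiency_mpg`, it can be safer to address the column by name. The sketch below uses a small made-up DataFrame (`toy` and its values are illustrative, not taken from the dataset) to show `nunique()` next to `value_counts()`, which also reports how many rows each fuel type has.

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are made up for illustration.
toy = pd.DataFrame({
    "fuel_type": ["Gasoline", "Diesel", "Gasoline", "Gasoline"],
    "fuel_efficiency_mpg": [14.2, 16.8, 13.1, 15.0],
})

# Number of distinct fuel types (NaN is ignored by default).
n_types = toy["fuel_type"].nunique()

# Row count per fuel type, sorted from most to least frequent.
counts = toy["fuel_type"].value_counts()

print(n_types)
print(counts)
```

On the real data, the same two calls on `df["fuel_type"]` should reproduce the count shown above.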


## Question 4: Missing Values

How many columns have missing values?

In [4]:
# Check for missing values in each column
missing_values = df.isnull().sum()
print("Missing values per column:")
print(missing_values)
print()

# Count how many columns have missing values
columns_with_missing = (missing_values > 0).sum()
print(f"Number of columns with missing values: {columns_with_missing}")

# Show which columns have missing values
columns_with_missing_names = missing_values[missing_values > 0]
if len(columns_with_missing_names) > 0:
    print(f"Columns with missing values: {list(columns_with_missing_names.index)}")
else:
    print("No columns have missing values")

Missing values per column:
engine_displacement      0
num_cylinders          482
horsepower             708
vehicle_weight           0
acceleration           930
model_year               0
origin                   0
fuel_type                0
drivetrain               0
num_doors              502
fuel_efficiency_mpg      0
dtype: int64

Number of columns with missing values: 4
Columns with missing values: ['num_cylinders', 'horsepower', 'acceleration', 'num_doors']
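
The same count can be obtained a little more directly with `isnull().any()`, which yields one boolean per column. A minimal sketch on a made-up DataFrame (`toy` is hypothetical and only mimics the shape of the problem):

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in two of its three columns (illustrative values only).
toy = pd.DataFrame({
    "horsepower": [150.0, np.nan, 160.0],
    "num_doors": [2.0, 4.0, np.nan],
    "model_year": [2010, 2015, 2020],
})

# True for every column that contains at least one missing value.
has_missing = toy.isnull().any()

# Summing booleans counts the True entries.
n_cols_missing = int(has_missing.sum())

# Names of the affected columns, in their original order.
cols_missing = has_missing[has_missing].index.tolist()

print(n_cols_missing, cols_missing)
```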


## Question 5: Max Fuel Efficiency

What's the maximum fuel efficiency of cars from Asia?

In [5]:
# Filter for Asian cars and find maximum fuel efficiency
# Using the specific column names from the dataset

# Filter for cars from Asia
asia_cars = df[df['origin'] == 'Asia']
print(f"Number of cars from Asia: {len(asia_cars)}")

# Find the maximum fuel efficiency for Asian cars
max_efficiency_asia = asia_cars['fuel_efficiency_mpg'].max()
print(f"Maximum fuel efficiency for Asian cars: {max_efficiency_asia}")

# Show some additional info
print(f"Mean fuel efficiency for Asian cars: {asia_cars['fuel_efficiency_mpg'].mean():.2f}")
print("Top 10 fuel efficiency values for Asian cars:")
print(asia_cars['fuel_efficiency_mpg'].sort_values(ascending=False).head(10))

Number of cars from Asia: 3247
Maximum fuel efficiency for Asian cars: 23.759122836520497
Mean fuel efficiency for Asian cars: 14.97
Top 10 fuel efficiency values for Asian cars:
9387    23.759123
343     23.204566
7739    23.033673
9401    22.919968
5416    22.858156
9654    22.709442
1854    22.592785
7505    22.507648
9340    22.489480
8647    22.487174
Name: fuel_efficiency_mpg, dtype: float64
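
The filter and the aggregation can also be combined in a single `.loc` expression, and `groupby` gives the maximum for every origin at once. The sketch below runs on a made-up DataFrame (`toy` and its numbers are illustrative only); the column names match the dataset.

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are made up for illustration.
toy = pd.DataFrame({
    "origin": ["Asia", "Europe", "Asia", "USA"],
    "fuel_efficiency_mpg": [15.5, 18.0, 21.25, 12.0],
})

# Select rows by boolean mask and the column by name in one step.
max_asia = toy.loc[toy["origin"] == "Asia", "fuel_efficiency_mpg"].max()

# Maximum per origin, handy for comparing regions side by side.
max_by_origin = toy.groupby("origin")["fuel_efficiency_mpg"].max()

print(max_asia)
print(max_by_origin)
```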


## Question 6: Median Horsepower

Find the median horsepower, fill missing values with the most frequent horsepower value, and check if the median changes.

In [6]:
# Question 6: Median horsepower before and after filling missing values

print("Dataset columns:")
print(df.columns.tolist())
print()

# Name of the horsepower column
hp_col = 'horsepower'

# Step 1: Find the original median horsepower
original_median = df[hp_col].median()
print(f"Original median horsepower: {original_median}")

# Check for missing values in horsepower column
missing_hp = df[hp_col].isnull().sum()
print(f"Missing values in {hp_col}: {missing_hp}")

if missing_hp > 0:
    # Step 2: Find the most frequent (mode) horsepower value
    most_frequent_hp = df[hp_col].mode()[0]
    print(f"Most frequent horsepower value: {most_frequent_hp}")
    
    # Step 3: Fill missing values with the most frequent value
    df_filled = df.copy()
    df_filled[hp_col] = df_filled[hp_col].fillna(most_frequent_hp)
    
    # Step 4: Calculate the new median
    new_median = df_filled[hp_col].median()
    print(f"New median horsepower after filling: {new_median}")
    
    # Step 5: Check if median changed
    print(f"Original median: {original_median}")
    print(f"New median: {new_median}")
    
    if new_median > original_median:
        print("Answer: Yes (increased)")
    elif new_median < original_median:
        print("Answer: Yes (decreased)")
    else:
        print("Answer: No")
else:
    print("No missing values found in horsepower column")
    print("Answer: No (no missing values to fill)")

Dataset columns:
['engine_displacement', 'num_cylinders', 'horsepower', 'vehicle_weight', 'acceleration', 'model_year', 'origin', 'fuel_type', 'drivetrain', 'num_doors', 'fuel_efficiency_mpg']

Original median horsepower: 149.0
Missing values in horsepower: 708
Most frequent horsepower value: 152.0
New median horsepower after filling: 152.0
Original median: 149.0
New median: 152.0
Answer: Yes (increased)
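
The median rises because all 708 filled values equal the mode (152.0), which lies above the original median (149.0). That adds observations only to the upper half of the distribution. The sketch below reproduces the effect on a tiny made-up series (`hp` is hypothetical and chosen so that the numbers mirror the ones above).

```python
import numpy as np
import pandas as pd

# Toy horsepower values with three gaps; chosen only for illustration.
hp = pd.Series([100.0, 140.0, 149.0, 152.0, 152.0, np.nan, np.nan, np.nan])

# median() skips NaN, so this is the median of the five observed values.
median_before = hp.median()

# mode() returns a Series because ties are possible; take the first entry.
most_frequent = hp.mode()[0]

# Fill the gaps with the mode and recompute the median over all eight values.
median_after = hp.fillna(most_frequent).median()

print(median_before, most_frequent, median_after)
```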


## Question 7: Sum of Weights

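Select the cars from Asia, keep the `vehicle_weight` and `model_year` columns, and take the first 7 rows as a matrix `X`. Compute `XTX = X.T @ X`, invert it, multiply the inverse by `X.T` and by `y = [1100, 1300, 800, 900, 1000, 1100, 1200]`, and report the sum of the elements of the result.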

In [7]:
import numpy as np

# Question 7: Matrix operations following the exact homework steps

# Step 1: Select all cars from Asia
asia_cars = df[df['origin'] == 'Asia']
print(f"Number of Asian cars: {len(asia_cars)}")

# Step 2: Select only the specified columns
columns_to_select = ["vehicle_weight", "model_year"]
X_subset = asia_cars[columns_to_select]
print(f"Selected columns: {columns_to_select}")

# Step 3: Select the first 7 rows
X_first_7 = X_subset.head(7)
print(f"First 7 rows shape: {X_first_7.shape}")
print("First 7 rows:")
print(X_first_7)

# Step 4: Get the underlying NumPy array (call it X)
X = X_first_7.values
print(f"X shape: {X.shape}")
print("X array:")
print(X)

# Step 5: Compute the matrix product X.T @ X (call the result XTX)
XTX = X.T @ X
print(f"XTX shape: {XTX.shape}")
print("XTX:")
print(XTX)

# Step 6: Invert XTX
XTX_inv = np.linalg.inv(XTX)
print("XTX inverse:")
print(XTX_inv)

# Step 7: Create array y with specified values
y = np.array([1100, 1300, 800, 900, 1000, 1100, 1200])
print(f"y array: {y}")

# Step 8: Compute w = XTX_inv @ X.T @ y (matrix products, not element-wise)
result = XTX_inv @ X.T @ y
print(f"Result shape: {result.shape}")
print("Result (w):")
print(result)

# Step 9: Calculate the sum of all elements in the result
sum_elements = np.sum(result)
print(f"Sum of all elements: {sum_elements}")
print(f"Sum rounded to 3 decimal places: {sum_elements:.3f}")

Number of Asian cars: 3247
Selected columns: ['vehicle_weight', 'model_year']
First 7 rows shape: (7, 2)
First 7 rows:
    vehicle_weight  model_year
8      2714.219310        2016
12     2783.868974        2010
14     3582.687368        2007
20     2231.808142        2011
21     2659.431451        2016
34     2844.227534        2014
38     3761.994038        2019
X shape: (7, 2)
X array:
[[2714.21930965 2016.        ]
 [2783.86897424 2010.        ]
 [3582.68736772 2007.        ]
 [2231.8081416  2011.        ]
 [2659.43145076 2016.        ]
 [2844.22753389 2014.        ]
 [3761.99403819 2019.        ]]
XTX shape: (2, 2)
XTX:
[[62248334.33150762 41431216.5073268 ]
 [41431216.5073268  28373339.        ]]
XTX inverse:
[[ 5.71497081e-07 -8.34509443e-07]
 [-8.34509443e-07  1.25380877e-06]]
y array: [1100 1300  800  900 1000 1100 1200]
Result shape: (2,)
Result (w):
[0.01386421 0.5049067 ]
Sum of all elements: 0.5187709081074007
Sum rounded to 3 decimal places: 0.519
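
The expression `(X.T @ X)^-1 @ X.T @ y` is the normal-equation solution of a least-squares fit without an intercept. The homework asks for the explicit inverse, but the same weights can be obtained without forming it. The sketch below is a cross-check rather than the homework's prescribed method: it copies `X` from the printed array above (rounded to the displayed digits) and compares the explicit inverse with `np.linalg.solve` and `np.linalg.lstsq`.

```python
import numpy as np

# X copied from the printed output above (rounded to the displayed precision).
X = np.array([
    [2714.21930965, 2016.0],
    [2783.86897424, 2010.0],
    [3582.68736772, 2007.0],
    [2231.8081416, 2011.0],
    [2659.43145076, 2016.0],
    [2844.22753389, 2014.0],
    [3761.99403819, 2019.0],
])
y = np.array([1100, 1300, 800, 900, 1000, 1100, 1200], dtype=float)

# Explicit inverse, as in the homework steps.
w_inv = np.linalg.inv(X.T @ X) @ X.T @ y

# Solve the linear system (X.T X) w = X.T y without forming the inverse.
w_solve = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver working directly on X and y.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_inv, w_solve, w_lstsq)
print(f"Sum of weights: {w_solve.sum():.3f}")
```

All three routes should agree to within floating-point error. Solving the system or calling `lstsq` is generally preferred over an explicit inverse when `X.T @ X` is poorly conditioned.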
