
# ML Zoomcamp 2025 — Module 1 Homework (Professional Walkthrough)

This notebook documents a complete, reproducible solution to the Module 1 homework. It includes environment setup, data acquisition, structured analysis for Q1–Q7, and final answers. The goal is to show not only the results, but the reasoning and best practices behind each step.

Author: Your Name
Kernel: Python (mlzoomcamp)



## Environment and Setup
- Virtual environment created with `python -m venv .venv`
- Kernel registered as "Python (mlzoomcamp)"
- Core libraries used: `numpy`, `pandas`, `matplotlib`, `seaborn`

Below we verify versions to ensure reproducibility.


In [None]:

import sys
import numpy as np
import pandas as pd
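import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

# Report the interpreter and the version of every core library listed above.
print('Python version:', sys.version)
print('NumPy version:', np.__version__)
print('Pandas version:', pd.__version__)
print('Matplotlib version:', matplotlib.__version__)
print('Seaborn version:', sns.__version__)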



## Data Acquisition
We use the Car Fuel Efficiency dataset from the course resources. The file is downloaded directly from the official datasets repository and stored locally for repeatable runs.

- Source: `https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv`
- Storage path: `cohorts/2025/01-intro/car_fuel_efficiency.csv`


In [None]:

from pathlib import Path
import urllib.request

DATA_URL = 'https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv'
DATA_PATH = Path('cohorts/2025/01-intro/car_fuel_efficiency.csv')
DATA_PATH.parent.mkdir(parents=True, exist_ok=True)
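# Download only if the file is not already cached, so reruns reuse the local copy.
if not DATA_PATH.exists():
    urllib.request.urlretrieve(DATA_URL, DATA_PATH)
print('Data file:', DATA_PATH.resolve())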



## Load and Inspect Data
We load the dataset into a Pandas DataFrame and review the schema along with the first few records to understand the column names and data types.


In [None]:

df = pd.read_csv(DATA_PATH)
df.head()


In [None]:

df.info()


In [None]:

df.describe(include='all').T



## Q1. Pandas version
We read the installed `pandas` version from `pd.__version__`.


In [None]:

pd.__version__



## Q2. Records count
We obtain the number of rows with `len(df)`.


In [None]:

num_records = len(df)
num_records



## Q3. Fuel types
We count distinct values in the `fuel_type` column using `nunique()`.


In [None]:

num_fuel_types = df['fuel_type'].nunique()
num_fuel_types
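
One detail worth checking before trusting the count: `nunique()` skips missing values by default, so a column with NaNs reports one fewer "value" than `value_counts(dropna=False)` lists. The sketch below illustrates this on a small hypothetical column (the values are made up for illustration and are not taken from the homework data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, not the homework data.
toy_fuel = pd.Series(["Gasoline", "Diesel", "Gasoline", np.nan], name="fuel_type")

toy_distinct = toy_fuel.nunique()                       # NaN excluded by default
toy_distinct_with_nan = toy_fuel.nunique(dropna=False)  # NaN counted as its own value
toy_counts = toy_fuel.value_counts(dropna=False)        # frequency table, NaN included

print(toy_distinct, toy_distinct_with_nan)
print(toy_counts)
```

If the two counts differ on the real data, the column has missing entries and the answer should state which convention it uses.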



## Q4. Missing values
We count how many columns contain at least one missing value.


In [None]:

# any() flags each column that has at least one NaN; sum() counts the flagged columns.
num_missing_columns = int(df.isnull().any().sum())
num_missing_columns
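
To see which columns are affected and by how much, a per-column breakdown can help. The following is a minimal sketch on a hypothetical three-column frame (the values are invented; only the column names echo the dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the homework data.
toy = pd.DataFrame({
    "horsepower": [150.0, np.nan, 120.0],
    "vehicle_weight": [3000.0, 2500.0, np.nan],
    "origin": ["Asia", "Europe", "USA"],
})

toy_missing_per_column = toy.isnull().sum()  # NaN count for each column
toy_columns_with_missing = (
    toy_missing_per_column[toy_missing_per_column > 0].index.tolist()
)
toy_num_missing_columns = int(toy.isnull().any().sum())

print(toy_missing_per_column)
print(toy_columns_with_missing, toy_num_missing_columns)
```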



## Q5. Max fuel efficiency (Asia)
Filter by `origin == 'Asia'` and take the maximum of `fuel_efficiency_mpg`.


In [None]:

# Select rows and column in a single .loc call instead of chained indexing.
max_eff_asia = df.loc[df['origin'] == 'Asia', 'fuel_efficiency_mpg'].max()
max_eff_asia
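
As a sanity check, the same maximum can be read from a grouped summary, which also shows the other regions for comparison. This is a sketch on hypothetical rows: the `'Europe'` and `'USA'` labels and all numbers are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy rows, not the homework data.
toy_cars = pd.DataFrame({
    "origin": ["Asia", "Asia", "Europe", "USA"],
    "fuel_efficiency_mpg": [14.0, 18.5, 16.0, 12.0],
})

# One maximum per region; max() skips NaN by default.
toy_max_by_origin = toy_cars.groupby("origin")["fuel_efficiency_mpg"].max()
toy_max_asia = toy_max_by_origin.loc["Asia"]

print(toy_max_by_origin)
print(toy_max_asia)
```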



## Q6. Median horsepower before/after filling missing values
1. Compute the median of `horsepower`.
2. Compute the mode of `horsepower`.
3. Fill NAs with the mode.
4. Compute the median again and compare.


In [None]:

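# Median ignores NaN, so this is the median of the observed values only.
median_hp_before = df['horsepower'].median()
# mode() returns a Series (ties are possible); take the first entry.
mode_hp = df['horsepower'].mode().iloc[0]
# Fill on a copy of the column so the original DataFrame stays unchanged.
median_hp_after = df['horsepower'].fillna(mode_hp).median()
median_hp_before, mode_hp, median_hp_after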



Interpretation: If the median increased, then the answer is "Yes, it increased"; if it decreased, "Yes, it decreased"; otherwise, "No".


In [None]:

if abs(median_hp_after - median_hp_before) < 1e-12:
    change = 'No'
elif median_hp_after > median_hp_before:
    change = 'Yes, it increased'
else:
    change = 'Yes, it decreased'
change
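
The direction of the change depends on where the mode sits relative to the original median: filling with a value above the median adds mass to the upper half and can pull the median up. The sketch below works through this on a hypothetical seven-element series (made-up values, chosen so that the mode lies above the median).

```python
import numpy as np
import pandas as pd

# Hypothetical toy values, not the homework data.
toy_hp = pd.Series([100.0, 120.0, 140.0, 160.0, 160.0, np.nan, np.nan])

toy_median_before = toy_hp.median()    # middle of the 5 observed values
toy_mode = toy_hp.mode().iloc[0]       # most frequent observed value
toy_median_after = toy_hp.fillna(toy_mode).median()  # middle of all 7 values

print(toy_median_before, toy_mode, toy_median_after)
```

Here the median moves from 140 to 160 because both filled entries land above the old midpoint. With a mode below the median, the same procedure would push it down.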



## Q7. Sum of weights via normal equation
Steps:
- Select Asia cars, take `vehicle_weight` and `model_year`
- Take first 7 rows to form matrix `X`
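- Compute `XTX = X.T @ X` and invert it to get `XTX_inv`
- Define `y` and compute `w = XTX_inv @ X.T @ y`
- Sum the elements of `w`

This is the normal-equation solution of linear regression. No column of ones is added to `X`, so the model has no intercept term.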


In [None]:

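# NumPy is already imported as np in the setup cell.
# First 7 Asia rows of the two feature columns form X (shape 7 x 2).
asia = df[df['origin'] == 'Asia'][['vehicle_weight', 'model_year']].head(7)
X = asia.values
XTX = X.T @ X  # 2 x 2 Gram matrix
XTX_inv = np.linalg.inv(XTX)
y = np.array([1100, 1300, 800, 900, 1000, 1100, 1200])
w = XTX_inv @ X.T @ y  # normal-equation weights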
float(w.sum())
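
The homework asks for the explicit inverse, but the same weights can be cross-checked without forming it. The sketch below compares three routes on a small hypothetical matrix (the numbers are invented and only mimic the scale of weights and model years): explicit inverse, `np.linalg.solve` on the normal equations, and `np.linalg.lstsq` on `X` directly.

```python
import numpy as np

# Hypothetical toy inputs, not the homework data.
X_toy = np.array([
    [2000.0, 2005.0],
    [2500.0, 2010.0],
    [3000.0, 2015.0],
    [3500.0, 2020.0],
])
y_toy = np.array([1100.0, 1300.0, 800.0, 900.0])

# Route 1: explicit inverse, as in the homework steps.
w_inv = np.linalg.inv(X_toy.T @ X_toy) @ X_toy.T @ y_toy
# Route 2: solve (X^T X) w = X^T y without forming the inverse.
w_solve = np.linalg.solve(X_toy.T @ X_toy, X_toy.T @ y_toy)
# Route 3: least-squares solver applied to X directly.
w_lstsq, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)

print(w_inv, w_solve, w_lstsq)
print(float(w_inv.sum()))
```

Agreement between the three results is a quick way to catch a transposition or shape mistake. In general, solving the system is numerically preferable to inverting `X.T @ X` explicitly.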



## Conclusions and Final Answers
- Q1 (Pandas version): environment-dependent; report the value printed by `pd.__version__` (expected ~2.3.x)
- Q2 (Records): 9704
- Q3 (Fuel types): 2
- Q4 (Columns with missing values): 4
- Q5 (Max fuel efficiency in Asia): ~23.75
- Q6 (Median horsepower change after fillna(mode)): Yes, it increased
- Q7 (Sum of w elements): ~0.51

These align with the multiple-choice options in the homework. See the course repository for details and submission instructions.

References:
- Course repo: https://github.com/DataTalksClub/machine-learning-zoomcamp
- Dataset: https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv
