
# ML Zoomcamp 2025 — Module 1 Homework (Professional Walkthrough)

This notebook documents a complete, reproducible solution to the Module 1 homework. It includes environment setup, data acquisition, structured analysis for Q1–Q7, and final answers. The goal is to show not only the results, but the reasoning and best practices behind each step.

Author: Yunus Emre Almaoğlu
Email: emrealmaogluu@gmail.com
Kernel: Python (mlzoomcamp)



## Environment and Setup
- Virtual environment created with `python -m venv .venv`
- Kernel registered as "Python (mlzoomcamp)"
- Core libraries used: `numpy`, `pandas`, `matplotlib`, `seaborn`

The cell below prints the Python, NumPy, and Pandas versions so the run can be reproduced.


In [14]:

import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

print('Python version:', sys.version)
print('NumPy version:', np.__version__)
print('Pandas version:', pd.__version__)


Python version: 3.13.7 (main, Aug 14 2025, 11:12:11) [Clang 17.0.0 (clang-1700.0.13.3)]
NumPy version: 2.3.3
Pandas version: 2.3.3
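The cell above imports `matplotlib` and `seaborn` but does not print their versions. A minimal sketch of one way to record every core library at once, assuming the standard-library `importlib.metadata` module is acceptable; the helper name `installed_versions` is hypothetical and not part of the original notebook:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package_name: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # Package is missing from the active environment.
            found[name] = None
    return found


print(installed_versions(['numpy', 'pandas', 'matplotlib', 'seaborn']))
```

A `None` entry indicates the package is missing from the active kernel, which is a quick way to spot a notebook running in the wrong environment.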



## Data Acquisition
We use the Car Fuel Efficiency dataset from the course resources. The file is downloaded directly from the official datasets repository and stored locally for repeatable runs.

- Source: `https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv`
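- Storage path: `cohorts/2025/01-intro/car_fuel_efficiency.csv`, resolved relative to the notebook's working directory. The run below was started from inside `cohorts/2025/01-intro`, so the printed absolute path contains that folder twice.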


In [15]:

from pathlib import Path
import urllib.request

DATA_URL = 'https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv'
DATA_PATH = Path('cohorts/2025/01-intro/car_fuel_efficiency.csv')
DATA_PATH.parent.mkdir(parents=True, exist_ok=True)
urllib.request.urlretrieve(DATA_URL, DATA_PATH)
print('Downloaded to:', DATA_PATH.resolve())


Downloaded to: /Users/emrealmaoglu/Desktop/MLZoomCamp/machine-learning-zoomcamp/cohorts/2025/01-intro/cohorts/2025/01-intro/car_fuel_efficiency.csv
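The cell above downloads the file on every run. A minimal sketch of a repeat-safe variant, assuming it is acceptable to reuse a file that already exists on disk; the helper name `fetch_once` is hypothetical and not part of the original notebook:

```python
from pathlib import Path
import urllib.request


def fetch_once(url: str, dest: Path) -> bool:
    """Download url to dest unless dest already exists.

    Returns True if a download happened, False if the local file was reused.
    """
    dest = Path(dest)
    if dest.exists():
        return False
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, dest)
    return True
```

With the notebook's names this would be called as `fetch_once(DATA_URL, DATA_PATH)`. Building `DATA_PATH` from an absolute base directory rather than a relative one would also avoid the nested folder seen in the output above.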



## Load and Inspect Data
We load the dataset into a Pandas DataFrame and review the schema along with the first few records to understand the column names and data types.


In [16]:

df = pd.read_csv(DATA_PATH)
df.head()


| | engine_displacement | num_cylinders | horsepower | vehicle_weight | acceleration | model_year | origin | fuel_type | drivetrain | num_doors | fuel_efficiency_mpg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 170 | 3.0 | 159.0 | 3413.433759 | 17.7 | 2003 | Europe | Gasoline | All-wheel drive | 0.0 | 13.231729 |
| 1 | 130 | 5.0 | 97.0 | 3149.664934 | 17.8 | 2007 | USA | Gasoline | Front-wheel drive | 0.0 | 13.688217 |
| 2 | 170 | NaN | 78.0 | 3079.038997 | 15.1 | 2018 | Europe | Gasoline | Front-wheel drive | 0.0 | 14.246341 |
| 3 | 220 | 4.0 | NaN | 2542.392402 | 20.2 | 2009 | USA | Diesel | All-wheel drive | 2.0 | 16.912736 |
| 4 | 210 | 1.0 | 140.0 | 3460.87099 | 14.4 | 2009 | Europe | Gasoline | All-wheel drive | 2.0 | 12.488369 |


In [17]:

df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9704 entries, 0 to 9703
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   engine_displacement  9704 non-null   int64  
 1   num_cylinders        9222 non-null   float64
 2   horsepower           8996 non-null   float64
 3   vehicle_weight       9704 non-null   float64
 4   acceleration         8774 non-null   float64
 5   model_year           9704 non-null   int64  
 6   origin               9704 non-null   object 
 7   fuel_type            9704 non-null   object 
 8   drivetrain           9704 non-null   object 
 9   num_doors            9202 non-null   float64
 10  fuel_efficiency_mpg  9704 non-null   float64
dtypes: float64(6), int64(2), object(3)
memory usage: 834.1+ KB


In [18]:

df.describe(include='all').T


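| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|---|---|
| engine_displacement | 9704.0 | NaN | NaN | NaN | 199.708368 | 49.455319 | 10.0 | 170.0 | 200.0 | 230.0 | 380.0 |
| num_cylinders | 9222.0 | NaN | NaN | NaN | 3.962481 | 1.999323 | 0.0 | 3.0 | 4.0 | 5.0 | 13.0 |
| horsepower | 8996.0 | NaN | NaN | NaN | 149.657292 | 29.879555 | 37.0 | 130.0 | 149.0 | 170.0 | 271.0 |
| vehicle_weight | 9704.0 | NaN | NaN | NaN | 3001.280993 | 497.89486 | 952.681761 | 2666.248985 | 2993.226296 | 3334.957039 | 4739.077089 |
| acceleration | 8774.0 | NaN | NaN | NaN | 15.021928 | 2.510339 | 6.0 | 13.3 | 15.0 | 16.7 | 24.3 |
| model_year | 9704.0 | NaN | NaN | NaN | 2011.484027 | 6.659808 | 2000.0 | 2006.0 | 2012.0 | 2017.0 | 2023.0 |
| origin | 9704.0 | 3.0 | Europe | 3254.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| fuel_type | 9704.0 | 2.0 | Gasoline | 4898.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| drivetrain | 9704.0 | 2.0 | All-wheel drive | 4876.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| num_doors | 9202.0 | NaN | NaN | NaN | -0.006412 | 1.048162 | -4.0 | -1.0 | 0.0 | 1.0 | 4.0 |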



## Q1. Pandas version
We read the installed `pandas` version from `pd.__version__`.


In [19]:

pd.__version__


'2.3.3'


## Q2. Records count
We obtain the number of rows with `len(df)`.


In [20]:

num_records = len(df)
num_records


9704


## Q3. Fuel types
We count distinct values in the `fuel_type` column using `nunique()`.


In [21]:

num_fuel_types = df['fuel_type'].nunique()
num_fuel_types


2
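`nunique()` skips missing values by default, which is harmless here because `fuel_type` has no nulls. A small sketch on a hypothetical toy column (not the homework data) showing how `nunique()` and `value_counts()` treat missing entries:

```python
import pandas as pd

# Hypothetical toy column standing in for df['fuel_type'].
fuel = pd.Series(['Gasoline', 'Diesel', 'Gasoline', None, 'Diesel', 'Gasoline'])

n_unique = fuel.nunique()                      # missing value is not counted
n_unique_with_na = fuel.nunique(dropna=False)  # missing value counts as a category
counts = fuel.value_counts()                   # per-category frequencies, missing excluded

print(n_unique, n_unique_with_na)
print(counts)
```

`value_counts()` is a useful companion check because it also shows which categories exist and how balanced they are.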


## Q4. Missing values
We count how many columns contain at least one missing value.


In [22]:

num_missing_columns = df.columns[df.isnull().any()].shape[0]
num_missing_columns


4
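To see which columns are affected and by how much, a per-column count can sit alongside the single number. A minimal sketch on a hypothetical toy frame (the values are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two incomplete columns.
toy = pd.DataFrame({
    'horsepower': [159.0, np.nan, 78.0],
    'num_doors': [0.0, 2.0, np.nan],
    'origin': ['Europe', 'USA', 'Asia'],
})

missing_per_column = toy.isnull().sum()                      # nulls in each column
columns_with_missing = missing_per_column[missing_per_column > 0]
num_missing_columns = int(toy.isnull().any().sum())          # columns with >= 1 null

print(columns_with_missing)
print(num_missing_columns)
```

Applied to the homework frame, the same pattern should agree with the `Non-Null Count` column printed by `df.info()`.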


## Q5. Max fuel efficiency (Asia)
Filter by `origin == 'Asia'` and take the maximum of `fuel_efficiency_mpg`.


In [23]:

max_eff_asia = df[df['origin'] == 'Asia']['fuel_efficiency_mpg'].max()
max_eff_asia


np.float64(23.759122836520497)
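The same filter can be written with `.loc`, and a `groupby` gives the maximum for every origin in one pass. A brief sketch on hypothetical toy rows (not the homework data):

```python
import pandas as pd

# Hypothetical toy rows; values are illustrative only.
toy = pd.DataFrame({
    'origin': ['Asia', 'Europe', 'Asia', 'USA'],
    'fuel_efficiency_mpg': [14.2, 13.2, 16.9, 12.5],
})

# Boolean mask plus column selection in a single .loc call.
max_asia = toy.loc[toy['origin'] == 'Asia', 'fuel_efficiency_mpg'].max()

# Maximum per origin, useful as a cross-check on the filtered value.
max_by_origin = toy.groupby('origin')['fuel_efficiency_mpg'].max()

print(max_asia)
print(max_by_origin)
```

Using a single `.loc` call avoids chained indexing, which matters more when assigning values than when reading them.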


## Q6. Median horsepower before/after filling missing values
1. Compute the median of `horsepower`.
2. Compute the mode of `horsepower`.
3. Fill NAs with the mode.
4. Compute the median again and compare.


In [24]:

median_hp_before = df['horsepower'].median()
mode_hp = df['horsepower'].mode().iloc[0]
median_hp_after = df.assign(horsepower=df['horsepower'].fillna(mode_hp))['horsepower'].median()
median_hp_before, mode_hp, median_hp_after


(np.float64(149.0), np.float64(152.0), np.float64(152.0))


Interpretation: If the median increased, then the answer is "Yes, it increased"; if it decreased, "Yes, it decreased"; otherwise, "No".


In [25]:

if abs(median_hp_after - median_hp_before) < 1e-12:
    change = 'No'
elif median_hp_after > median_hp_before:
    change = 'Yes, it increased'
else:
    change = 'Yes, it decreased'
change


'Yes, it increased'
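The direction of the shift follows from where the mode sits: every filled value equals the mode, so if the mode is above the old median (152 versus 149 here), the new values all land in the upper half and pull the median up. A toy sketch with hypothetical numbers (not the homework data) that reproduces the effect:

```python
import numpy as np
import pandas as pd

# Hypothetical horsepower column: five known values, three missing.
hp = pd.Series([100.0, 120.0, 130.0, 150.0, 150.0, np.nan, np.nan, np.nan])

median_before = hp.median()      # NaN is skipped: median of the five known values
mode_value = hp.mode().iloc[0]   # most frequent known value
filled = hp.fillna(mode_value)   # every NaN becomes the mode
median_after = filled.median()

print(median_before, mode_value, median_after)
```

Had the mode been below the original median, the same procedure would have moved the median down instead.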


## Q7. Sum of weights via normal equation
Steps:
- Select Asia cars, take `vehicle_weight` and `model_year`
- Take first 7 rows to form matrix `X`
- Compute `XTX = X.T @ X` and invert it to get `XTX_inv`
- Define `y` and compute `w = XTX_inv @ X.T @ y`
- Sum the elements of `w`

This mirrors the linear regression normal equation.
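In matrix notation, the steps listed above correspond to the following, with $X$, $y$, and $w$ matching the variable names in the code:

$$
w = (X^\top X)^{-1} X^\top y, \qquad X \in \mathbb{R}^{7 \times 2},\; y \in \mathbb{R}^{7},\; w \in \mathbb{R}^{2}
$$

This $w$ minimizes $\lVert Xw - y \rVert_2^2$ provided $X^\top X$ is invertible. No column of ones is added to $X$ here, so the fit has no intercept term.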


In [26]:

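# Feature matrix: first 7 Asia rows, columns vehicle_weight and model_year
asia = df[df['origin'] == 'Asia'][['vehicle_weight', 'model_year']].head(7)
X = asia.values
# Normal equation: w = (X^T X)^-1 X^T y
XTX = X.T @ X
XTX_inv = np.linalg.inv(XTX)
y = np.array([1100, 1300, 800, 900, 1000, 1100, 1200])
w = XTX_inv @ X.T @ y
# Sum of the two fitted weights
float(w.sum())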


0.5187709081074006


## Conclusions and Final Answers
- Q1 (Pandas version): 2.3.3
- Q2 (Records): 9704
- Q3 (Fuel types): 2
- Q4 (Columns with missing values): 4
- Q5 (Max fuel efficiency in Asia): 23.759, closest option ~23.75
- Q6 (Median horsepower change after fillna(mode)): Yes, it increased (149.0 to 152.0)
- Q7 (Sum of w elements): 0.519, closest option ~0.51

These align with the multiple-choice options in the homework. See the course repository for details and submission instructions.

References:
- Course repo: https://github.com/DataTalksClub/machine-learning-zoomcamp
- Dataset: https://raw.githubusercontent.com/alexeygrigorev/datasets/master/car_fuel_efficiency.csv
