# Class 2 Notebook - Machine Learning Basics

This notebook is for hands-on exercises in Class 2.

Run the first cell to confirm that your environment works on macOS or Windows.

## Class material

In this class we cover ML data preparation: missing values, categorical encoding, and feature scaling.

Slides: https://docs.google.com/presentation/d/1co_VPwdvYgVmQNQC8GRQ1C2AMpBA_5sl/edit?usp=sharing&ouid=103898867136891335922&rtpof=true&sd=true

## Run in the browser (no local setup)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/adzuci/ai-fundamentals/blob/main/class-2-machine-learning-basics/class-2.ipynb)

In [1]:
# Environment sanity check
import platform

print("Python:", platform.python_version())
print("OS:", platform.system(), platform.release())

# Core data libraries for this class
try:
    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder, StandardScaler

    print("NumPy:", np.__version__)
    print("Pandas:", pd.__version__)
    sample = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})
    print(sample)
except ModuleNotFoundError as exc:
    print("Missing dependency:", exc)
    print("Install with: python -m pip install numpy pandas scikit-learn")
    raise

Python: 3.10.14
OS: Darwin 25.2.0
NumPy: 2.2.6
Pandas: 2.3.3
   x   y
0  1  10
1  2  20
2  3  30


In [3]:
# Concept: create a small dataset (tabular data)
# We use a Python dict to simulate raw data
mydata = {
    "Age": [30, 25, np.nan, 40, 35],
    "Salary": [45000, 40000, 50000, np.nan, 65000],
    "City": ["Mumbai", "Pune", "Mumbai", "Delhi", "Pune"],
    "Purchased": ["Yes", "No", "Yes", "Yes", "No"],
}

In [4]:
# Concept: basic DataFrame operations
# Convert dict to DataFrame and inspect structure
# Keep raw data visible for comparison later
df = pd.DataFrame(mydata)
print("Original DataFrame:")
print(df)
print("\nDataFrame info:")
df.info()  # info() prints its report itself and returns None, so no print() wrapper

In [None]:
# Concept: preview rows
# Quick look at the first few records
df.head()

In [None]:
# Concept: data quality check (missing values)
# This helps decide which columns need cleaning
df.isnull().sum()

In [None]:
# Concept: handle missing values (imputation)
# Replace NaNs with the column mean for numeric features
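# Assign the result back: fillna(..., inplace=True) on a selected column triggers a
# FutureWarning in pandas 2.x and will stop updating df in pandas 3.0
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Salary"] = df["Salary"].fillna(df["Salary"].mean())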

print("DataFrame after handling missing values:")
print(df)
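
The same mean imputation can also be expressed with scikit-learn's `SimpleImputer`, which stores the learned column means so they can be reused on new rows. This is a minimal sketch, assuming the two numeric columns from the dataset above; the names `raw`, `imputer`, and `imputed` are illustrative and not part of the class notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Rebuild the numeric columns so this cell runs on its own
raw = pd.DataFrame({
    "Age": [30, 25, np.nan, 40, 35],
    "Salary": [45000, 40000, 50000, np.nan, 65000],
})

# strategy="mean" replaces each NaN with the mean of its column
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(raw), columns=raw.columns)

print("Learned column means:", imputer.statistics_)
print(imputed)
```

Because the means live in `imputer.statistics_`, calling `imputer.transform(...)` on later data fills gaps with the same values instead of recomputing them.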

In [None]:
# Concept: encode categorical data
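# Convert city names into integer labels (LabelEncoder assigns them in sorted order)
le = LabelEncoder()
df["City"] = le.fit_transform(df["City"])
# .tolist() converts NumPy scalars to plain Python values for a cleaner printout
print("City mapping:", dict(zip(le.classes_.tolist(), le.transform(le.classes_).tolist())))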
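
Integer labels imply an order (here Delhi < Mumbai < Pune) that city names do not actually have. One common alternative is one-hot encoding, which gives each city its own 0/1 column. The sketch below uses `pd.get_dummies` on a standalone copy of the `City` values; `cities` and `one_hot` are illustrative names, and this is an optional variation rather than the approach used in the rest of the notebook.

```python
import pandas as pd

# Standalone copy of the City column from the class dataset
cities = pd.DataFrame({"City": ["Mumbai", "Pune", "Mumbai", "Delhi", "Pune"]})

# One indicator column per city; dtype=int gives 0/1 instead of True/False
one_hot = pd.get_dummies(cities, columns=["City"], dtype=int)

print(one_hot)
```

Each row has exactly one 1, so no city is treated as "larger" than another.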

In [None]:
# Concept: inspect the current DataFrame
df

In [None]:
# Concept: separate features (X) and target (y)
# Encode target for classification; keep features in a matrix X
le_purchased = LabelEncoder()
y = pd.Series(le_purchased.fit_transform(df["Purchased"]), name="Purchased")
X = df[["Age", "Salary", "City"]].copy()

In [None]:
# Encoded target: LabelEncoder sorts the classes, so "No" -> 0 and "Yes" -> 1
y

In [None]:
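# Concept: feature scaling (standardization)
# Create the scaler; it learns each column's mean and standard deviation when fitted
scaler = StandardScaler()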

In [None]:
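# Fit on X and transform it in one step; this scales every column in X,
# including the label-encoded City column
X_scaled = scaler.fit_transform(X)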

In [None]:
# Scaled features (mean=0, unit variance), ready for model training
X_scaled
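
Standardization replaces each value x with z = (x - mean) / std, computed per column. As a rough check of what `StandardScaler` does, the sketch below applies that formula by hand to the imputed `Age` values and compares the result with scikit-learn's output. The names `ages`, `manual`, and `sk_scaled` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age column after mean imputation (the missing value became 32.5)
ages = np.array([[30.0], [25.0], [32.5], [40.0], [35.0]])

# Per-column mean and population standard deviation (ddof=0),
# which is the estimator StandardScaler uses
mean = ages.mean(axis=0)
std = ages.std(axis=0)
manual = (ages - mean) / std

sk_scaled = StandardScaler().fit_transform(ages)

print("Manual and StandardScaler results match:", np.allclose(manual, sk_scaled))
```

Note that pandas' `Series.std()` defaults to the sample standard deviation (`ddof=1`), so a hand calculation done with pandas will differ slightly unless `ddof=0` is passed.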

In [None]:
# Concept: feature scaling (standardization)
# Scale numerical features to mean=0 and std=1
scaler = StandardScaler()
df[["Age", "Salary"]] = scaler.fit_transform(df[["Age", "Salary"]])

print("DataFrame with scaled numerical features:")
print(df)
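
The three steps in this notebook (imputation, categorical encoding, scaling) can also be bundled into one reusable object. The following is an optional sketch, not the method used in class: it assumes scikit-learn's `ColumnTransformer` and `Pipeline`, swaps label encoding for one-hot encoding of `City`, and uses the illustrative names `raw`, `numeric_steps`, `preprocess`, and `features`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Raw feature columns from the class dataset, rebuilt so this cell runs on its own
raw = pd.DataFrame({
    "Age": [30, 25, np.nan, 40, 35],
    "Salary": [45000, 40000, 50000, np.nan, 65000],
    "City": ["Mumbai", "Pune", "Mumbai", "Delhi", "Pune"],
})

# Numeric columns: fill NaNs with the column mean, then standardize
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Apply the numeric steps to Age/Salary and one-hot encode City;
# sparse_threshold=0.0 forces a dense NumPy array as output
preprocess = ColumnTransformer(
    [
        ("num", numeric_steps, ["Age", "Salary"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["City"]),
    ],
    sparse_threshold=0.0,
)

features = preprocess.fit_transform(raw)
print("Feature matrix shape:", features.shape)
```

The result has two scaled numeric columns followed by one indicator column per city, and the fitted `preprocess` object can be applied to new rows with `preprocess.transform(...)`.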