# Class 2 Notebook - Machine Learning Basics

This notebook is for hands-on exercises in Class 2.

Run the first cell to confirm your environment works on both macOS and Windows.

## ML Preprocessing with NumPy, Pandas & Scikit-learn

**NumPy** (`import numpy as np`): array manipulation, `np.nan` for missing values, reshaping/indexing, statistical functions (mean, std, etc.)

**Pandas** (`import pandas as pd`): loading data (`pd.read_csv`, `pd.read_excel`), missing data (`dropna`, `fillna`), exploration (`head`, `describe`, `info`), feature engineering, filtering and selecting data

**sklearn.preprocessing**: `StandardScaler` (standardize to mean=0, std=1), `MinMaxScaler` (scale to [0,1]), `LabelEncoder` (categories → numbers), `OneHotEncoder` (binary columns for categories)

**sklearn.model_selection**: `train_test_split` (split data for training/testing)

**Data cleaning steps**: remove duplicates → handle missing values (drop or impute) → fix data types → remove outliers → encode categoricals → scale numerical features

These tools prepare raw data before feeding it into ML algorithms.
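As a small illustration of the NumPy points above, the sketch below shows how `np.nan` behaves in ordinary and NaN-aware statistics, and how to reshape a 1-D array into the column shape scikit-learn expects. The `ages` values are hypothetical toy data, not part of the class dataset.

```python
import numpy as np

# Hypothetical toy values; np.nan marks a missing entry
ages = np.array([30.0, 25.0, np.nan, 40.0, 35.0])

plain_mean = np.mean(ages)     # NaN propagates through ordinary reductions
nan_mean = np.nanmean(ages)    # NaN-aware variant skips the missing entry
column = ages.reshape(-1, 1)   # 2-D column, shape (n_samples, 1)

print("np.mean:", plain_mean)
print("np.nanmean:", nan_mean)
print("column shape:", column.shape)
```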

---

Slides: https://docs.google.com/presentation/d/1co_VPwdvYgVmQNQC8GRQ1C2AMpBA_5sl/edit?usp=sharing&ouid=103898867136891335922&rtpof=true&sd=true

## Run in the browser (no local setup)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/adzuci/ai-fundamentals/blob/main/class-2-machine-learning-basics/ml-preprocessing.ipynb)

In [40]:
# Environment sanity check
import platform

print("Python:", platform.python_version())
print("OS:", platform.system(), platform.release())

# Core data libraries for this class
try:
    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import LabelEncoder, StandardScaler, MinMaxScaler, OneHotEncoder

    print("NumPy:", np.__version__)
    print("Pandas:", pd.__version__)
    sample = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})
    print(sample)
except ModuleNotFoundError as exc:
    print("Missing dependency:", exc)
    print("Install with: python -m pip install numpy pandas scikit-learn")
    raise

Python: 3.10.14
OS: Darwin 25.2.0
NumPy: 2.2.6
Pandas: 2.3.3
   x   y
0  1  10
1  2  20
2  3  30


### Data cleaning steps (applied in this notebook)

1. **Remove duplicates** — `df.drop_duplicates()`
2. **Handle missing values** — drop or impute (e.g. `fillna` with mean)
3. **Fix data types** — ensure numeric/categorical types are correct
4. **Remove outliers** — optional (e.g. IQR or z-score)
5. **Encode categorical variables** — `LabelEncoder` or `OneHotEncoder`
6. **Scale numerical features** — `StandardScaler` (mean=0, std=1) or `MinMaxScaler` ([0,1])
7. **Split for ML** — `train_test_split` for training vs testing sets
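Steps 1, 3 and 4 are not exercised on the small dataset used below, so here is a minimal sketch of them on a separate toy table. The `raw` DataFrame and its values are assumptions made up for illustration, and the 1.5 × IQR rule is one common convention rather than the only choice.

```python
import pandas as pd

# Hypothetical raw data: one duplicated row, numbers stored as strings,
# and one extreme salary
raw = pd.DataFrame({
    "Age": ["30", "25", "25", "40", "35", "28"],
    "Salary": ["45000", "40000", "40000", "52000", "65000", "900000"],
})

# Step 1: remove exact duplicate rows
clean = raw.drop_duplicates().reset_index(drop=True)

# Step 3: fix data types; errors="coerce" turns unparseable values into NaN
clean["Age"] = pd.to_numeric(clean["Age"], errors="coerce")
clean["Salary"] = pd.to_numeric(clean["Salary"], errors="coerce")

# Step 4: keep rows whose Salary lies within 1.5 * IQR of the quartiles
q1, q3 = clean["Salary"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = clean["Salary"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
no_outliers = clean[in_range]

print(no_outliers)
```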

In [41]:
# Concept: create a small dataset (tabular data)
# We use a Python dict to simulate raw data
mydata = {
    "Age": [30, 25, np.nan, 40, 35],
    "Salary": [45000, 40000, 50000, np.nan, 65000],
    "City": ["Mumbai", "Pune", "Mumbai", "Delhi", "Pune"],
    "Purchased": ["Yes", "No", "Yes", "Yes", "No"],
}

In [None]:
# Concept: basic DataFrame operations
# Convert dict to DataFrame
df = pd.DataFrame(mydata)

In [42]:
print("Original DataFrame:", df, "\nDataFrame info:", sep="\n")
df.info()

Original DataFrame:
    Age   Salary    City Purchased
0  30.0  45000.0  Mumbai       Yes
1  25.0  40000.0    Pune        No
2   NaN  50000.0  Mumbai       Yes
3  40.0      NaN   Delhi       Yes
4  35.0  65000.0    Pune        No

DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Age        4 non-null      float64
 1   Salary     4 non-null      float64
 2   City       5 non-null      object 
 3   Purchased  5 non-null      object 
dtypes: float64(2), object(2)
memory usage: 288.0+ bytes


In [43]:
# Concept: preview rows
# Quick look at the first few records
df.head()

    Age   Salary    City Purchased
0  30.0  45000.0  Mumbai       Yes
1  25.0  40000.0    Pune        No
2   NaN  50000.0  Mumbai       Yes
3  40.0      NaN   Delhi       Yes
4  35.0  65000.0    Pune        No


In [44]:
# Concept: data quality check (missing values)
# This helps decide which columns need cleaning
df.isnull().sum()

Age          1
Salary       1
City         0
Purchased    0
dtype: int64
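Beyond raw counts, it can help to see the share of missing values per column and the affected rows. The sketch below rebuilds the same toy values under a hypothetical name, `df_check`, so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Same toy values as the notebook's dataset, rebuilt so the sketch is standalone
df_check = pd.DataFrame({
    "Age": [30, 25, np.nan, 40, 35],
    "Salary": [45000, 40000, 50000, np.nan, 65000],
    "City": ["Mumbai", "Pune", "Mumbai", "Delhi", "Pune"],
})

# Fraction of missing values per column (mean of a boolean mask)
missing_share = df_check.isna().mean()

# Rows that contain at least one missing value
rows_with_gaps = df_check[df_check.isna().any(axis=1)]

print(missing_share)
print(rows_with_gaps)
```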

In [45]:
# Concept: handle missing values (imputation)
# Replace NaNs with the column mean for numeric features
# Assign the result back: calling fillna(..., inplace=True) on a selected column
# is deprecated and will stop working in pandas 3.0
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Salary"] = df["Salary"].fillna(df["Salary"].mean())

print("DataFrame after handling missing values:")
print(df)

DataFrame after handling missing values:
    Age   Salary    City Purchased
0  30.0  45000.0  Mumbai       Yes
1  25.0  40000.0    Pune        No
2  32.5  50000.0  Mumbai       Yes
3  40.0  50000.0   Delhi       Yes
4  35.0  65000.0    Pune        No
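Mean imputation can also be done with scikit-learn's `SimpleImputer`, which remembers the fitted column means so the same values can be reused on new data later. This is a sketch of an alternative, not what the notebook itself uses; `numeric_raw` is a hypothetical standalone copy of the two numeric columns.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical standalone copy of the numeric columns, with the original gaps
numeric_raw = pd.DataFrame({
    "Age": [30, 25, np.nan, 40, 35],
    "Salary": [45000, 40000, 50000, np.nan, 65000],
})

# strategy="mean" fills each column's NaNs with that column's mean
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(
    imputer.fit_transform(numeric_raw), columns=numeric_raw.columns
)

print("Learned means:", imputer.statistics_)
print(imputed)
```

Switching to `strategy="median"` is a common choice when a column contains outliers.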


In [46]:
# Concept: encode categorical data
# Convert city names into numeric labels
le = LabelEncoder()
df["City"] = le.fit_transform(df["City"])
print("City mapping:", dict(zip(le.classes_, le.transform(le.classes_))))

City mapping: {'Delhi': np.int64(0), 'Mumbai': np.int64(1), 'Pune': np.int64(2)}


In [47]:
# Concept: inspect the current DataFrame
df

    Age   Salary  City Purchased
0  30.0  45000.0     1       Yes
1  25.0  40000.0     2        No
2  32.5  50000.0     1       Yes
3  40.0  50000.0     0       Yes
4  35.0  65000.0     2        No


In [48]:
# Concept: separate features (X) and target (y)
# Encode target for classification; keep features in a matrix X
le_purchased = LabelEncoder()
y = pd.Series(le_purchased.fit_transform(df["Purchased"]), name="Purchased")
X = df[["Age", "Salary", "City"]].copy()

In [49]:
y

0    1
1    0
2    1
3    1
4    0
Name: Purchased, dtype: int64
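`LabelEncoder` sorts the class names, so `No` becomes 0 and `Yes` becomes 1, and the fitted encoder can map predictions back to the original labels. A minimal standalone sketch, where `target_encoder` is a hypothetical name:

```python
from sklearn.preprocessing import LabelEncoder

labels = ["Yes", "No", "Yes", "Yes", "No"]

target_encoder = LabelEncoder()
encoded = target_encoder.fit_transform(labels)

# classes_ holds the sorted labels; the position in this array is the code
print("Classes:", list(target_encoder.classes_))

# inverse_transform maps integer codes back to the original strings
decoded = target_encoder.inverse_transform(encoded)
print("Decoded:", list(decoded))
```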

In [50]:
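# Concept: feature scaling (standardization) of the feature matrix X
scaler = StandardScaler()

In [51]:
# Note: fitting on all rows before the split lets test-set statistics influence the scaling
X_scaled = scaler.fit_transform(X)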

In [52]:
# Scaled features (mean=0, unit variance) — ready for model training
X_scaled

array([[-0.5       , -0.5976143 , -0.26726124],
       [-1.5       , -1.19522861,  1.06904497],
       [ 0.        ,  0.        , -0.26726124],
       [ 1.5       ,  0.        , -1.60356745],
       [ 0.5       ,  1.79284291,  1.06904497]])
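To see where these numbers come from, the sketch below reproduces the Age column by hand with z = (x - mean) / std. `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default. The `age` array is a standalone copy of the imputed Age values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Imputed Age values from the notebook, as a single-column matrix
age = np.array([[30.0], [25.0], [32.5], [40.0], [35.0]])

# Manual z-score: subtract the column mean, divide by the population std
manual = (age - age.mean(axis=0)) / age.std(axis=0)

# Same transformation via scikit-learn
scaled = StandardScaler().fit_transform(age)

print(manual.ravel())
print(np.allclose(manual, scaled))
```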

In [53]:
# Concept: split data for training and testing
# 80% train, 20% test; random_state=42 for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

In [54]:
# Inspect split: training vs test set sizes
print("X_train shape:", X_train.shape, "| y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape, "| y_test shape:", y_test.shape)

X_train shape: (4, 3) | y_train shape: (4,)
X_test shape: (1, 3) | y_test shape: (1,)
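In this notebook the scaler is fitted on all rows before splitting, which is fine for a demo but lets test-set statistics leak into training. A common alternative, sketched here under hypothetical names (`X_raw`, `y_raw`, `scaler_tr`), is to split first and fit the scaler on the training rows only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Standalone copy of the imputed Age and Salary columns and the encoded target
X_raw = np.array([
    [30.0, 45000.0],
    [25.0, 40000.0],
    [32.5, 50000.0],
    [40.0, 50000.0],
    [35.0, 65000.0],
])
y_raw = np.array([1, 0, 1, 1, 0])

# Split first, so the test rows stay unseen during fitting
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42
)

# Learn mean and std from the training rows only, then apply to both sets
scaler_tr = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_tr.transform(X_tr)
X_te_scaled = scaler_tr.transform(X_te)

print("Train means after scaling:", X_tr_scaled.mean(axis=0))
print("Scaled test row:", X_te_scaled)
```

The test row is scaled with the training mean and standard deviation, so its values are not guaranteed to be centred on zero.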


In [55]:
# Concept: feature scaling (standardization) applied to the DataFrame itself
# Scale Age and Salary to mean=0 and std=1; City keeps its integer label
scaler = StandardScaler()
df[["Age", "Salary"]] = scaler.fit_transform(df[["Age", "Salary"]])

print("DataFrame with scaled numerical features:")
print(df)

DataFrame with scaled numerical features:
   Age    Salary  City Purchased
0 -0.5 -0.597614     1       Yes
1 -1.5 -1.195229     2        No
2  0.0  0.000000     1       Yes
3  1.5  0.000000     0       Yes
4  0.5  1.792843     2        No
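For comparison, `MinMaxScaler` from the overview rescales each column to the [0, 1] range with (x - min) / (max - min) instead of centring on the mean. A minimal sketch on a hypothetical standalone copy of the imputed numeric columns:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Standalone copy of the imputed (unscaled) numeric columns
numeric = pd.DataFrame({
    "Age": [30.0, 25.0, 32.5, 40.0, 35.0],
    "Salary": [45000.0, 40000.0, 50000.0, 50000.0, 65000.0],
})

# Each column is mapped so its minimum becomes 0 and its maximum becomes 1
minmax = MinMaxScaler()
numeric_01 = pd.DataFrame(minmax.fit_transform(numeric), columns=numeric.columns)

print(numeric_01)
```

Min-max scaling is sensitive to outliers because a single extreme value sets the range, so standardization is often preferred when outliers remain in the data.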
