# Data Visualization and Scaling

This notebook covers **plotting** (Matplotlib, Pandas) and **preprocessing/scaling** (missing values, encoding, StandardScaler, train/test split). You'll draw scatter plots and histograms, then clean and scale the same data for an ML-style workflow.

## Topics Covered
- Basic plotting: scatter, histogram, pandas `.plot()`
- Data visualization from DataFrames
- Preprocessing: missing values, categorical encoding, feature scaling
- Train/test split for modeling

Slides: https://docs.google.com/presentation/d/1co_VPwdvYgVmQNQC8GRQ1C2AMpBA_5sl/edit?usp=sharing&ouid=103898867136891335922&rtpof=true&sd=true

## Run in the browser (no local setup)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/adzuci/ai-fundamentals/blob/main/class-2-machine-learning-basics/data-visualization.ipynb)

In [1]:
# Concept: environment check and imports
import platform

print("Python:", platform.python_version())
print("OS:", platform.system(), platform.release())

# Core data libraries for this class
try:
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import LabelEncoder, StandardScaler
    from sklearn.model_selection import train_test_split

    print("NumPy:", np.__version__)
    print("Pandas:", pd.__version__)
    sample = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})
    print(sample)
except ModuleNotFoundError as exc:
    print("Missing dependency:", exc)
    print("Install with: python -m pip install numpy pandas scikit-learn matplotlib")
    raise

In [None]:
# Concept: create a small dataset (tabular data)
mydata = {
    "Age": [30, 25, np.nan, 40, 35],
    "Salary": [45000, 40000, 50000, np.nan, 65000],
    "City": ["Mumbai", "Pune", "Mumbai", "Delhi", "Pune"],
    "Purchased": ["Yes", "No", "Yes", "Yes", "No"],
}

In [None]:
# Concept: basic DataFrame operations
df = pd.DataFrame(mydata)
print("Original DataFrame:")
print(df)
print("\nDataFrame info:")
df.info()  # info() prints its summary directly and returns None, so no print() is needed

In [None]:
# Concept: preview rows
df.head()

### Visualization: histograms and scatter

First we visualize the numeric columns; then we'll preprocess and scale.

In [None]:
# Concept: distribution of numeric columns (histogram)
df[["Age", "Salary"]].hist()
plt.tight_layout()
plt.show()

In [None]:
# Concept: scatter plot (matplotlib)
plt.figure(figsize=(8, 6))
plt.scatter(df["Age"], df["Salary"], alpha=0.6)
plt.xlabel("Age")
plt.ylabel("Salary")
plt.title("Age vs Salary")
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Concept: scatter plot (pandas .plot())
df.plot(x="Age", y="Salary", kind="scatter", figsize=(8, 6))
plt.title("Age vs Salary (Pandas)")
plt.grid(True, alpha=0.3)
plt.show()

### Preprocessing and scaling

Handle missing values, encode categories, scale numerics, then split for train/test.

In [None]:
# Concept: data quality check (missing values)
df.isnull().sum()

#### What are histograms good for?

**Histograms** show the **distribution** (spread and frequency) of a single numeric variable. They're useful for:

1. **Understanding data shape**: Is it symmetric (e.g. bell-shaped), skewed left or right, or roughly uniform?
2. **Finding outliers**: Isolated bars far from the bulk of the data, usually short ones at the edges, suggest unusual values.
3. **Checking assumptions**: Some models (e.g. Gaussian naive Bayes) assume roughly normal features; a histogram is a quick visual check, not a formal test.
4. **Comparing before/after**: See what scaling changes. `StandardScaler` shifts and rescales the axis (mean 0, standard deviation 1) but leaves the shape of the distribution unchanged.
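
Point 4 can be checked numerically. The sketch below is a minimal illustration, assuming a made-up `salaries` sample (not this notebook's data): it standardizes a right-skewed series and compares mean, standard deviation, and skewness before and after.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical right-skewed sample (illustrative values, not the notebook's data)
salaries = pd.Series(
    [30_000, 32_000, 35_000, 36_000, 40_000, 45_000, 120_000], name="Salary"
)

# StandardScaler computes z = (x - mean) / std, using the population std (ddof=0)
scaled = pd.Series(
    StandardScaler().fit_transform(salaries.to_frame()).ravel(),
    name="Salary_scaled",
)

print("mean before/after:", salaries.mean(), round(scaled.mean(), 6))
print("std  before/after:", salaries.std(ddof=0), round(scaled.std(ddof=0), 6))
print("skew before/after:", salaries.skew(), scaled.skew())
```

The mean and standard deviation change, but the skewness does not, so the two histograms would look identical apart from their x-axis labels.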

#### How to interpret a histogram:

- **X-axis**: The range of values (e.g., Age from 25 to 40, Salary from 40k to 65k in this dataset)
- **Y-axis**: How many data points fall into each "bin" (range)
- **Peak(s)**: The tallest bar(s) mark the most common value range (the modal bin)
- **Spread**: A wide histogram means high variance; a narrow one means low variance
- **Skewness**:
  - **Right-skewed** (tail to the right): Most values are low, a few high outliers
  - **Left-skewed** (tail to the left): Most values are high, a few low outliers
  - **Symmetric**: Values cluster around the center (like a bell curve)
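
The reading guide above can be checked against raw numbers. This sketch uses a made-up `values` array (an assumption for illustration) and `np.histogram` to print the bin edges and counts a histogram would draw, then compares mean and median as a rough indicator of skew.

```python
import numpy as np

# Hypothetical sample: most values are low, one is high (right-skewed)
values = np.array([21, 22, 22, 23, 24, 25, 26, 28, 35, 60])

# counts[i] is the number of values in [edges[i], edges[i + 1]);
# the last bin also includes its right edge
counts, edges = np.histogram(values, bins=4)

# Text version of the histogram: one row per bin, one '#' per data point
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {'#' * int(count)} ({int(count)})")

# In a right-skewed sample the long tail pulls the mean above the median
print("mean:", values.mean(), "median:", np.median(values))
```

Most points land in the first bin, and the lone value at 60 sits in a bin of its own past an empty one. That gap is the pattern to look for when scanning a histogram for outliers.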

In [None]:
# Concept: handle missing values (imputation)
df["Age"] = df["Age"].fillna(df["Age"].mean())
df["Salary"] = df["Salary"].fillna(df["Salary"].mean())
print("After imputation:")
print(df)

In [None]:
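# Concept: encode categorical data (City)
# Note: LabelEncoder is designed for target labels; on a feature it implies an
# arbitrary order (Delhi < Mumbai < Pune). It is used here for simplicity.
le = LabelEncoder()
df["City"] = le.fit_transform(df["City"])
print("City mapping:", {city: int(code) for city, code in zip(le.classes_, le.transform(le.classes_))})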
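
`LabelEncoder` maps each city to an integer, which implies an order between cities that has no real meaning. As an alternative sketch, not the approach this notebook takes, one-hot encoding gives each city its own 0/1 column. The small `cities` frame is rebuilt from the notebook's values so the block runs on its own.

```python
import pandas as pd

# Same City values as the notebook's dataset, rebuilt so the example is self-contained
cities = pd.DataFrame({"City": ["Mumbai", "Pune", "Mumbai", "Delhi", "Pune"]})

# One 0/1 indicator column per category; no artificial ordering between cities
one_hot = pd.get_dummies(cities, columns=["City"], dtype=int)
print(one_hot)
```

Each row has exactly one `1`. The trade-off is one extra column per category, which can add up for features with many distinct values.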

In [None]:
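# Concept: feature scaling (standardization): z = (x - mean) / std
# Note: fitting on all rows is fine for this demo; in a real workflow, fit the
# scaler on the training split only so test-set statistics do not leak in.
scaler = StandardScaler()
df[["Age", "Salary"]] = scaler.fit_transform(df[["Age", "Salary"]])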
print("DataFrame with scaled features:")
print(df)

In [None]:
# Concept: train/test split
X = df[["Age", "Salary", "City"]].copy()
le_p = LabelEncoder()
y = pd.Series(le_p.fit_transform(df["Purchased"]), name="Purchased")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("X_train shape:", X_train.shape, "| y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape, "| y_test shape:", y_test.shape)
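
The cells above impute and scale using statistics from all rows before splitting. That is fine for a demo, but in a real project it lets test-set information leak into training. A common alternative, sketched here as an assumption and not as this notebook's method, is to split first and let a scikit-learn `Pipeline` learn the imputation means and scaling parameters from the training rows only. The names `raw`, `numeric`, and `preprocess` are illustrative, and `OneHotEncoder` is swapped in for `LabelEncoder` on `City`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Same values as the notebook's dataset, rebuilt so the block is self-contained
raw = pd.DataFrame({
    "Age": [30, 25, np.nan, 40, 35],
    "Salary": [45000, 40000, 50000, np.nan, 65000],
    "City": ["Mumbai", "Pune", "Mumbai", "Delhi", "Pune"],
    "Purchased": ["Yes", "No", "Yes", "Yes", "No"],
})
X = raw[["Age", "Salary", "City"]]
y = (raw["Purchased"] == "Yes").astype(int)

# Split BEFORE computing any statistics
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Numeric columns: fill missing values with the mean, then standardize
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["Age", "Salary"]),
    # handle_unknown="ignore" encodes a city unseen during fit as all zeros
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["City"]),
])

# fit_transform learns means, std, and categories from the training rows only;
# transform reuses those learned values on the test rows
X_train_ready = preprocess.fit_transform(X_train)
X_test_ready = preprocess.transform(X_test)
print(X_train_ready.shape, X_test_ready.shape)
```

With only five rows the split is too small to be meaningful; the point is the order of operations, which carries over unchanged to larger datasets.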

In [None]:
# (Optional) Scatter of scaled features
plt.figure(figsize=(6, 4))
plt.scatter(df["Age"], df["Salary"], alpha=0.6)
plt.xlabel("Age (scaled)")
plt.ylabel("Salary (scaled)")
plt.title("Age vs Salary (after scaling)")
plt.show()

---
## Try it on your own

**Visualization**

1. **Line plot**: Plot `Age` vs `Salary` as a line (`plt.plot(...)` or `df.plot(kind='line', ...)`). Compare with the scatter plot.
2. **Two plots**: Use `plt.subplots(1, 2)` to show a scatter (Age vs Salary) and a histogram of `Salary` side by side.

**Preprocessing**

3. **Missing values**: Add a row with `np.nan` in a numeric column; re-run from DataFrame creation and check `df.isnull().sum()` and imputation.
4. **Encode Purchased**: Use `LabelEncoder` on `Purchased` and show the label mapping.
5. **Train/test size**: Change `test_size=0.2` to `0.4` and compare `X_train`/`X_test` shapes.
6. **MinMaxScaler**: Use `MinMaxScaler()` on `df[['Age','Salary']]` and compare the value range with `StandardScaler`.