# Introduction to Data Visualization (Matplotlib) — Using Your Stress Datasets

This **separate** notebook introduces common chart types and demonstrates each one **with your datasets**, which are expected under `./Data/`.

Charts covered:
- **Histogram** — distribution of one numeric variable  
- **Scatter plot** — relationship between two numeric variables  
- **Bar plot** — comparisons across categories  
- **Violin plot** — distribution shape per category  
- **Box plot** — quartiles, median, outliers per category  
- **Heatmap** — 2‑D matrix visualization (e.g., correlations)

We keep things simple: all plotting uses **Matplotlib only** (pandas and NumPy handle the data), with one chart per figure.


## Helpers (auto‑picking columns)

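The functions below pick default columns for each chart type: the first numeric column supplies values, and the first categorical-like column (text, boolean, or category dtype, or a numeric column with at most 10 distinct values) supplies groups.  
To explore other columns, overwrite the relevant entries of the returned dictionary after reviewing the printout.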


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

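def choose_columns(df: pd.DataFrame):
    """Pick default columns for each chart type from the dtypes of `df`."""
    numeric = list(df.select_dtypes(include=[np.number]).columns)
    cat_like = list(df.select_dtypes(include=['object','bool','category']).columns)
    # Numeric columns with at most 10 distinct values (e.g., 0/1 flags or
    # small ordinal scales) are also treated as categorical-like.
    low_card_num = [c for c in numeric if df[c].nunique(dropna=True) <= 10]
    # dict.fromkeys removes duplicates while preserving column order.
    categorical = list(dict.fromkeys(cat_like + low_card_num))

    # Defaults are the first matching column; None means "not available".
    # Note: if there are no text/bool/category columns and the first numeric
    # column is low-cardinality, the group and value columns are the same.
    picks = {
        "numeric_all": numeric,
        "categorical_all": categorical,
        "hist": numeric[0] if len(numeric) >= 1 else None,
        "scatter_x": numeric[0] if len(numeric) >= 1 else None,
        "scatter_y": numeric[1] if len(numeric) >= 2 else None,
        "bar_cat": categorical[0] if len(categorical) >= 1 else None,
        "bar_val": numeric[0] if len(numeric) >= 1 else None,
        "group": categorical[0] if len(categorical) >= 1 else None,
        "violin_val": numeric[0] if len(numeric) >= 1 else None,
        "box_val": numeric[0] if len(numeric) >= 1 else None,
    }
    return picks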

def describe_choices(name, picks):
    print(f"\nDataset: {name}")
    print("  numeric columns:", picks['numeric_all'])
    print("  categorical-like columns:", picks['categorical_all'])
    print("  histogram:", picks['hist'])
    print("  scatter: x =", picks['scatter_x'], ", y =", picks['scatter_y'])
    print("  bar: category =", picks['bar_cat'], ", value =", picks['bar_val'])
    print("  violin: group =", picks['group'], ", value =", picks['violin_val'])
    print("  box: group =", picks['group'], ", value =", picks['box_val'])
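
The selection rule can be checked on a small table before trusting it on real data. The sketch below repeats the same dtype and cardinality tests on a toy DataFrame and then overrides one pick by hand. The column names (`hours_slept`, `stress_level`, `gender`) are invented for illustration and are not taken from the CSV files.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are invented for this illustration.
toy = pd.DataFrame({
    "hours_slept": [6.5, 7.0, 5.5, 8.0, 6.0, 7.5, 4.5, 9.0, 6.2, 7.1, 5.9, 8.3],
    "stress_level": [0, 1, 2, 0, 1, 2, 2, 0, 1, 1, 2, 0],
    "gender": pd.Series(list("FMFMFMFMFMFM"), dtype="category"),
})

# Same tests as in choose_columns: dtype first, then cardinality.
numeric = list(toy.select_dtypes(include=[np.number]).columns)
text_like = list(toy.select_dtypes(include=["object", "bool", "category"]).columns)
low_card = [c for c in numeric if toy[c].nunique(dropna=True) <= 10]
categorical = list(dict.fromkeys(text_like + low_card))

print("numeric:", numeric)
print("categorical-like:", categorical)

# Start from the automatic defaults, then override one entry by hand.
picks = {"hist": numeric[0], "group": categorical[0]}
picks["group"] = "stress_level"
print("picks after override:", picks)
```

`stress_level` is numeric but has only three distinct values, so it appears in both lists. That overlap is what lets a numeric code act as a grouping variable.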

## Load your datasets (from `./Data/`)

> Make sure both CSV files exist at these paths, relative to this notebook:
> - `./Data/StressLevelDataset.csv`
> - `./Data/Stress_Dataset.csv`


In [None]:
path_A = "./Data/StressLevelDataset.csv"
path_B = "./Data/Stress_Dataset.csv"

df_A = pd.read_csv(path_A)
df_B = pd.read_csv(path_B)

print("StressLevelDataset shape:", df_A.shape)
print("Stress_Dataset shape:", df_B.shape)

picks_A = choose_columns(df_A)
picks_B = choose_columns(df_B)

describe_choices("StressLevelDataset", picks_A)
describe_choices("Stress_Dataset", picks_B)

display(df_A.head())
display(df_B.head())
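
`pd.read_csv` raises `FileNotFoundError` when a path is wrong, which usually means the notebook was started from a different working directory. A quick check before loading gives a clearer message. This is a minimal sketch; `missing_files` is a hypothetical helper, not part of pandas.

```python
from pathlib import Path

def missing_files(paths):
    """Return the paths that do not point to an existing regular file."""
    return [p for p in paths if not Path(p).is_file()]

expected = ["./Data/StressLevelDataset.csv", "./Data/Stress_Dataset.csv"]
missing = missing_files(expected)

if missing:
    print("Missing files (check the working directory):", missing)
    print("Current working directory:", Path.cwd())
else:
    print("Both CSV files found.")
```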

---

# StressLevelDataset: Examples

## Histogram — distribution of one numeric variable (StressLevelDataset)

**Use when:** understanding the distribution (shape, center, spread, skew).

In [None]:
col = picks_A['hist']
if col is not None:
    plt.figure()
    plt.hist(df_A[col].dropna(), bins=30)
    plt.title(f"Histogram — {col}")
    plt.xlabel(col)
    plt.ylabel("Count")
    plt.grid(True, linestyle="--", alpha=0.4)
    plt.show()
else:
    print("No numeric column available for histogram in StressLevelDataset.")
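
The cell above fixes `bins=30`, which can look too coarse or too noisy depending on the data. NumPy offers data-driven rules through `np.histogram_bin_edges`, and `plt.hist` accepts the same rule names in its `bins` argument. The sketch below compares three choices on synthetic data, not on the stress datasets.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(loc=10.0, scale=3.0, size=500)  # synthetic stand-in data

edges_fixed = np.histogram_bin_edges(values, bins=30)
edges_fd = np.histogram_bin_edges(values, bins="fd")            # Freedman-Diaconis rule
edges_sturges = np.histogram_bin_edges(values, bins="sturges")  # Sturges' rule

for label, edges in [("fixed", edges_fixed), ("fd", edges_fd), ("sturges", edges_sturges)]:
    print(f"{label:8s} {len(edges) - 1:3d} bins, width {edges[1] - edges[0]:.2f}")
```

To try a rule in the histogram cell, replace `bins=30` with, for example, `bins="fd"`.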

## Scatter plot — two numeric variables (StressLevelDataset)

**Use when:** exploring relationships/correlation and spotting clusters or outliers.

In [None]:
xcol, ycol = picks_A['scatter_x'], picks_A['scatter_y']
if xcol is not None and ycol is not None:
    plt.figure()
    plt.scatter(df_A[xcol], df_A[ycol])
    plt.title(f"Scatter — {xcol} vs {ycol}")
    plt.xlabel(xcol); plt.ylabel(ycol)
    plt.grid(True, linestyle=":", alpha=0.5)
    plt.show()
else:
    print("Need at least two numeric columns for a scatter plot in StressLevelDataset.")
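
A scatter plot shows the relationship visually; a correlation coefficient puts a number on its linear part. The sketch below computes Pearson's r after dropping rows where either value is missing. `pearson_r` is a hypothetical helper and the data are made up.

```python
import numpy as np
import pandas as pd

def pearson_r(x: pd.Series, y: pd.Series) -> float:
    """Pearson correlation of two columns, using only rows where both are present."""
    pair = pd.concat([x, y], axis=1).dropna()
    return float(np.corrcoef(pair.iloc[:, 0], pair.iloc[:, 1])[0, 1])

# Made-up data with one missing x value.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, np.nan],
    "b": [2.0, 4.1, 5.9, 8.2, 10.0],
})
r = pearson_r(toy["a"], toy["b"])
print(f"Pearson r = {r:.3f}")
```

The value could be appended to the scatter title, for example `plt.title(f"Scatter: {xcol} vs {ycol} (r = {r:.2f})")`. Pearson's r only measures linear association, so a curved pattern can still produce a value near zero.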

## Bar plot — compare categories (StressLevelDataset)

**Use when:** comparing means/totals/counts across categories.

In [None]:
cat, val = picks_A['bar_cat'], picks_A['bar_val']
if cat is not None and val is not None:
    # Rank categories by frequency; NaN is excluded because groupby drops it anyway.
    top_cats = df_A[cat].value_counts().index[:10]
    subset = df_A[df_A[cat].isin(top_cats)]
    means = subset.groupby(cat)[val].mean()
    plt.figure()
    plt.bar(means.index.astype(str), means.values)
    plt.title(f"Bar — mean of {val} by {cat} (top 10 categories)")
    plt.xlabel(cat); plt.ylabel(f"Mean of {val}")
    plt.xticks(rotation=45, ha="right")
    plt.grid(axis="y", linestyle="-.", alpha=0.3)
    plt.tight_layout()
    plt.show()
elif cat is not None:
    counts = df_A[cat].value_counts(dropna=False).head(10)
    plt.figure()
    plt.bar(counts.index.astype(str), counts.values)
    plt.title(f"Bar — counts of {cat} (top 10)")
    plt.xlabel(cat); plt.ylabel("Count")
    plt.xticks(rotation=45, ha="right")
    plt.grid(axis="y", linestyle="-.", alpha=0.3)
    plt.tight_layout()
    plt.show()
else:
    print("No categorical-like column found for a bar plot in StressLevelDataset.")
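
A bar of group means hides how much the values vary within each group. One common addition is an error bar showing the standard error of the mean. The sketch below computes it with a single `groupby(...).agg(...)` call on made-up data; the column names are invented.

```python
import numpy as np
import pandas as pd

# Made-up data; "group" and "score" are invented column names.
toy = pd.DataFrame({
    "group": ["low", "low", "low", "high", "high", "high"],
    "score": [2.0, 4.0, 6.0, 10.0, 12.0, 14.0],
})

stats = toy.groupby("group")["score"].agg(["mean", "std", "count"])
# Standard error of the mean: sample standard deviation divided by sqrt(n).
stats["sem"] = stats["std"] / np.sqrt(stats["count"])
print(stats)
```

Passing `yerr=stats["sem"]` and, for example, `capsize=4` to `plt.bar` draws the error bars on top of the bars.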

## Violin plot — distribution per category (StressLevelDataset)

**Use when:** comparing full distribution shapes across groups; each violin's outline is a kernel density estimate (KDE) of that group's values.

In [None]:
group, val = picks_A['group'], picks_A['violin_val']
if group is not None and val is not None:
    cats = list(pd.Series(df_A[group]).dropna().value_counts().index[:6])
    data = [df_A.loc[df_A[group] == g, val].dropna().values for g in cats]
    if len(cats) >= 2 and all(len(a) > 0 for a in data):
        plt.figure()
        vp = plt.violinplot(data, showmeans=True, showextrema=True, showmedians=True)
        plt.xticks(range(1, len(cats)+1), [str(g) for g in cats], rotation=0)
        plt.title(f"Violin — {val} by {group} (top groups)")
        plt.xlabel(group); plt.ylabel(val)
        plt.grid(True, linestyle="--", alpha=0.3)
        plt.show()
    else:
        print("Not enough non‑empty groups for a violin plot in StressLevelDataset.")
else:
    print("Need one categorical-like group and one numeric value for a violin plot in StressLevelDataset.")
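
The width of a violin at a given height is a kernel density estimate evaluated there. The sketch below builds a one-dimensional Gaussian KDE by hand with Scott's rule for the bandwidth, which, as far as I know, is also the default `bw_method` of `plt.violinplot`. It illustrates the idea only and is not Matplotlib's own implementation; `gaussian_kde_1d` is a hypothetical helper.

```python
import numpy as np

def gaussian_kde_1d(samples, grid):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    samples = np.asarray(samples, dtype=float)
    n = samples.size
    # Scott's rule for 1-D data: bandwidth = sample std * n^(-1/5).
    h = samples.std(ddof=1) * n ** (-1 / 5)
    # One Gaussian bump per sample, averaged over all samples.
    z = (grid[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))

rng = np.random.default_rng(1)
samples = rng.normal(0.0, 1.0, size=200)  # synthetic stand-in data
grid = np.linspace(samples.min() - 3, samples.max() + 3, 400)
density = gaussian_kde_1d(samples, grid)

# A density should integrate to about 1 (simple Riemann sum here).
area = float(density.sum() * (grid[1] - grid[0]))
print(f"area under the estimated density: {area:.3f}")
```

Because the curve is a smoothed estimate, a violin can extend past the observed minimum and maximum and can hide gaps in small groups.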

## Box plot — quartiles, median, outliers (StressLevelDataset)

**Use when:** summarizing distributions compactly across categories.

In [None]:
group, val = picks_A['group'], picks_A['box_val']
if group is not None and val is not None:
    cats = list(pd.Series(df_A[group]).dropna().value_counts().index[:6])
    data = [df_A.loc[df_A[group] == g, val].dropna().values for g in cats]
    if len(cats) >= 2 and all(len(a) > 0 for a in data):
        plt.figure()
        plt.boxplot(data, showmeans=False)  # vertical boxes are the default
        plt.xticks(range(1, len(cats)+1), [str(g) for g in cats], rotation=0)
        plt.title(f"Box — {val} by {group} (top groups)")
        plt.xlabel(group); plt.ylabel(val)
        plt.grid(True, linestyle="--", alpha=0.3)
        plt.show()
    else:
        print("Not enough non‑empty groups for a box plot in StressLevelDataset.")
else:
    print("Need one categorical-like group and one numeric value for a box plot in StressLevelDataset.")
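
The elements of a box plot can be reproduced with a few lines of NumPy, which helps when reading the chart. The sketch below assumes the default whisker rule of `plt.boxplot` (`whis=1.5`): whiskers reach the most extreme data points within 1.5 times the interquartile range of the box, and anything beyond is drawn as an outlier. `box_stats` is a hypothetical helper.

```python
import numpy as np

def box_stats(values, whis=1.5):
    """Quartiles, whisker ends, and outliers under the 1.5 * IQR rule."""
    v = np.asarray(values, dtype=float)
    v = v[~np.isnan(v)]
    q1, median, q3 = np.percentile(v, [25, 50, 75])
    iqr = q3 - q1
    low_fence, high_fence = q1 - whis * iqr, q3 + whis * iqr
    inside = v[(v >= low_fence) & (v <= high_fence)]
    outside = v[(v < low_fence) | (v > high_fence)]
    return {
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        "whisker_low": float(inside.min()),    # lowest point still inside the fence
        "whisker_high": float(inside.max()),   # highest point still inside the fence
        "outliers": [float(x) for x in np.sort(outside)],
    }

stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(stats)
```

Here the value 30 lies above the upper fence of 14.5, so the upper whisker stops at 9 and 30 is reported as an outlier.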

## Heatmap — correlation matrix (StressLevelDataset)

**Use when:** showing pairwise relationships among numeric variables as a matrix of colors.

In [None]:
num_cols = df_A.select_dtypes(include=[np.number]).columns.tolist()
if len(num_cols) >= 2:
    corr = df_A[num_cols].corr()
    plt.figure()
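    # Fix the color scale to [-1, 1] so that 0 sits at the colormap midpoint.
    im = plt.imshow(corr, aspect="auto", vmin=-1, vmax=1, cmap="coolwarm")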
    plt.colorbar(im, label="Correlation")
    plt.xticks(range(len(num_cols)), num_cols, rotation=45, ha="right")
    plt.yticks(range(len(num_cols)), num_cols)
    plt.title("Heatmap — Correlation Matrix")
    plt.tight_layout()
    plt.show()
else:
    print("Need at least two numeric columns to compute a correlation heatmap in StressLevelDataset.")
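
With many columns, the strongest relationships are hard to read off the colors. The sketch below lists the variable pairs with the largest absolute correlation, taking each pair once from the upper triangle of the matrix. `top_correlated_pairs` is a hypothetical helper and the data are made up.

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(df: pd.DataFrame, k: int = 3) -> pd.Series:
    """Return the k column pairs with the largest absolute Pearson correlation."""
    corr = df.select_dtypes(include=[np.number]).corr()
    cols = corr.columns
    # Upper triangle without the diagonal: every pair exactly once.
    i_idx, j_idx = np.triu_indices(len(cols), k=1)
    labels = [f"{cols[i]} ~ {cols[j]}" for i, j in zip(i_idx, j_idx)]
    pairs = pd.Series(corr.to_numpy()[i_idx, j_idx], index=labels).dropna()
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.reindex(order).head(k)

# Made-up data: b is an exact multiple of a, c tends to move the other way.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
})
top = top_correlated_pairs(toy, k=3)
print(top)
```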

---

# Stress_Dataset: Examples

## Histogram — distribution of one numeric variable (Stress_Dataset)

**Use when:** understanding the distribution (shape, center, spread, skew).

In [None]:
col = picks_B['hist']
if col is not None:
    plt.figure()
    plt.hist(df_B[col].dropna(), bins=30)
    plt.title(f"Histogram — {col}")
    plt.xlabel(col)
    plt.ylabel("Count")
    plt.grid(True, linestyle="--", alpha=0.4)
    plt.show()
else:
    print("No numeric column available for histogram in Stress_Dataset.")

## Scatter plot — two numeric variables (Stress_Dataset)

**Use when:** exploring relationships/correlation and spotting clusters or outliers.

In [None]:
xcol, ycol = picks_B['scatter_x'], picks_B['scatter_y']
if xcol is not None and ycol is not None:
    plt.figure()
    plt.scatter(df_B[xcol], df_B[ycol])
    plt.title(f"Scatter — {xcol} vs {ycol}")
    plt.xlabel(xcol); plt.ylabel(ycol)
    plt.grid(True, linestyle=":", alpha=0.5)
    plt.show()
else:
    print("Need at least two numeric columns for a scatter plot in Stress_Dataset.")

## Bar plot — compare categories (Stress_Dataset)

**Use when:** comparing means/totals/counts across categories.

In [None]:
cat, val = picks_B['bar_cat'], picks_B['bar_val']
if cat is not None and val is not None:
    # Rank categories by frequency; NaN is excluded because groupby drops it anyway.
    top_cats = df_B[cat].value_counts().index[:10]
    subset = df_B[df_B[cat].isin(top_cats)]
    means = subset.groupby(cat)[val].mean()
    plt.figure()
    plt.bar(means.index.astype(str), means.values)
    plt.title(f"Bar — mean of {val} by {cat} (top 10 categories)")
    plt.xlabel(cat); plt.ylabel(f"Mean of {val}")
    plt.xticks(rotation=45, ha="right")
    plt.grid(axis="y", linestyle="-.", alpha=0.3)
    plt.tight_layout()
    plt.show()
elif cat is not None:
    counts = df_B[cat].value_counts(dropna=False).head(10)
    plt.figure()
    plt.bar(counts.index.astype(str), counts.values)
    plt.title(f"Bar — counts of {cat} (top 10)")
    plt.xlabel(cat); plt.ylabel("Count")
    plt.xticks(rotation=45, ha="right")
    plt.grid(axis="y", linestyle="-.", alpha=0.3)
    plt.tight_layout()
    plt.show()
else:
    print("No categorical-like column found for a bar plot in Stress_Dataset.")

## Violin plot — distribution per category (Stress_Dataset)

**Use when:** comparing full distribution shapes across groups; each violin's outline is a kernel density estimate (KDE) of that group's values.

In [None]:
group, val = picks_B['group'], picks_B['violin_val']
if group is not None and val is not None:
    cats = list(pd.Series(df_B[group]).dropna().value_counts().index[:6])
    data = [df_B.loc[df_B[group] == g, val].dropna().values for g in cats]
    if len(cats) >= 2 and all(len(a) > 0 for a in data):
        plt.figure()
        vp = plt.violinplot(data, showmeans=True, showextrema=True, showmedians=True)
        plt.xticks(range(1, len(cats)+1), [str(g) for g in cats], rotation=0)
        plt.title(f"Violin — {val} by {group} (top groups)")
        plt.xlabel(group); plt.ylabel(val)
        plt.grid(True, linestyle="--", alpha=0.3)
        plt.show()
    else:
        print("Not enough non‑empty groups for a violin plot in Stress_Dataset.")
else:
    print("Need one categorical-like group and one numeric value for a violin plot in Stress_Dataset.")

## Box plot — quartiles, median, outliers (Stress_Dataset)

**Use when:** summarizing distributions compactly across categories.

In [None]:
group, val = picks_B['group'], picks_B['box_val']
if group is not None and val is not None:
    cats = list(pd.Series(df_B[group]).dropna().value_counts().index[:6])
    data = [df_B.loc[df_B[group] == g, val].dropna().values for g in cats]
    if len(cats) >= 2 and all(len(a) > 0 for a in data):
        plt.figure()
        plt.boxplot(data, showmeans=False)  # vertical boxes are the default
        plt.xticks(range(1, len(cats)+1), [str(g) for g in cats], rotation=0)
        plt.title(f"Box — {val} by {group} (top groups)")
        plt.xlabel(group); plt.ylabel(val)
        plt.grid(True, linestyle="--", alpha=0.3)
        plt.show()
    else:
        print("Not enough non‑empty groups for a box plot in Stress_Dataset.")
else:
    print("Need one categorical-like group and one numeric value for a box plot in Stress_Dataset.")

## Heatmap — correlation matrix (Stress_Dataset)

**Use when:** showing pairwise relationships among numeric variables as a matrix of colors.

In [None]:
num_cols = df_B.select_dtypes(include=[np.number]).columns.tolist()
if len(num_cols) >= 2:
    corr = df_B[num_cols].corr()
    plt.figure()
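    # Fix the color scale to [-1, 1] so that 0 sits at the colormap midpoint.
    im = plt.imshow(corr, aspect="auto", vmin=-1, vmax=1, cmap="coolwarm")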
    plt.colorbar(im, label="Correlation")
    plt.xticks(range(len(num_cols)), num_cols, rotation=45, ha="right")
    plt.yticks(range(len(num_cols)), num_cols)
    plt.title("Heatmap — Correlation Matrix")
    plt.tight_layout()
    plt.show()
else:
    print("Need at least two numeric columns to compute a correlation heatmap in Stress_Dataset.")
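
Every chart in this notebook is written twice, once per dataset. When exploring further, one option is to wrap a chart in a small function and call it for each DataFrame. The sketch below does this for the histogram; `plot_histogram` is a hypothetical helper, and the two toy frames stand in for `df_A` and `df_B`.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_histogram(df, col, dataset_name, bins=30):
    """Draw one histogram on its own figure and return the Axes."""
    fig, ax = plt.subplots()
    ax.hist(df[col].dropna(), bins=bins)
    ax.set_title(f"Histogram: {col} ({dataset_name})")
    ax.set_xlabel(col)
    ax.set_ylabel("Count")
    ax.grid(True, linestyle="--", alpha=0.4)
    return ax

# Hypothetical stand-ins for df_A and df_B.
toy_A = pd.DataFrame({"score": [1, 2, 2, 3, 3, 3, 4]})
toy_B = pd.DataFrame({"score": [5, 5, 6, 7, 7, 8, None]})

datasets = {"toy_A": toy_A, "toy_B": toy_B}
axes = [plot_histogram(df, "score", name) for name, df in datasets.items()]
```

Returning the Axes keeps each chart on its own figure, as elsewhere in the notebook, while still allowing later tweaks such as `axes[0].set_xlim(...)`.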

---
## References (Matplotlib docs)

- Pyplot tutorial — https://matplotlib.org/stable/tutorials/pyplot.html  
- Histogram (`plt.hist`) — https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html  
- Scatter (`plt.scatter`) — https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html  
- Bar (`plt.bar`) — https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.bar.html  
- Violin (`plt.violinplot`) — https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.violinplot.html  
- Box (`plt.boxplot`) — https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.boxplot.html  
- Heatmap (`plt.imshow`) — https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.imshow.html  
