<a href="https://colab.research.google.com/github/asifahsaan/data-preprocessing-beginners/blob/main/notebooks/01_intro_to_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🧾 01 — Introduction to the Dataset

Welcome! In this notebook, we'll get familiar with the dataset we'll be using throughout this preprocessing series.

By the end of this notebook, you'll know:

- What the dataset looks like
- Which columns are numeric vs. categorical
- Where missing values exist
- Basic statistics and distributions


In [None]:
import pandas as pd

# Load the dataset
df = pd.read_csv("../data/sample_data.csv")

# Preview the first few rows
df.head()

In [None]:
# Dataset shape (rows, columns)
df.shape

In [None]:
# Column names, dtypes, non-null counts, and memory usage
df.info()


In [None]:
# Basic statistics for numeric columns
df.describe()

In [None]:
# Summary of all columns (categorical + numeric)
df.describe(include="all")
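`describe()` condenses each column into a few numbers. To see how values are actually distributed, per-category counts and quantiles are a useful complement. The sketch below uses a small made-up DataFrame (the `city` and `income` columns are hypothetical, not taken from `sample_data.csv`), so swap in your own column names.

```python
import pandas as pd

# Hypothetical toy data; column names are illustrative only
toy = pd.DataFrame({
    "city": ["Lahore", "Karachi", "Lahore", None],
    "income": [40000.0, 50000.0, 60000.0, None],
})

# Frequency of each category, keeping missing values as their own bucket
city_counts = toy["city"].value_counts(dropna=False)

# Same counts expressed as a share of all rows
city_share = toy["city"].value_counts(normalize=True, dropna=False).round(2)

# Quartiles of a numeric column (missing values are skipped)
income_quantiles = toy["income"].quantile([0.25, 0.5, 0.75])

print(city_counts)
print(city_share)
print(income_quantiles)
```

Passing `dropna=False` keeps missing entries visible in the counts, which matters in a dataset where some columns are incomplete.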

In [None]:
# Separate column names by data type
# "number" matches every integer and float width (int32, float32, ...), not only the 64-bit ones
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()

print("Numeric Columns:", numeric_cols)
print("Categorical Columns:", categorical_cols)
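The dtype strings passed to `select_dtypes` decide which columns end up in each list, and an exact name such as `"int64"` matches only that exact width. The following sketch shows the difference on a made-up DataFrame. The column names are hypothetical, and using `exclude="number"` for the non-numeric side is an assumption that only suits frames without boolean or datetime columns.

```python
import pandas as pd

# Hypothetical toy data; "age" is deliberately stored as int32
toy = pd.DataFrame({
    "age": pd.Series([25, 32, 47, 51], dtype="int32"),
    "income": [52000.0, None, 61000.5, 48000.0],
    "city": ["Lahore", None, "Karachi", "Lahore"],
    "segment": pd.Categorical(["a", "b", "a", "b"]),
})

# Generic selector: picks up every numeric width
numeric_toy = toy.select_dtypes(include="number").columns.tolist()

# Exact dtype names: the int32 column is silently left out
strict_toy = toy.select_dtypes(include=["int64", "float64"]).columns.tolist()

# Everything that is not numeric (here: text and categorical columns)
non_numeric_toy = toy.select_dtypes(exclude="number").columns.tolist()

print("Generic numeric:", numeric_toy)
print("Exact 64-bit only:", strict_toy)
print("Non-numeric:", non_numeric_toy)
```

A quick sanity check is that the numeric and non-numeric lists together account for every column in the frame.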

In [None]:
# Count missing values per column
df.isnull().sum()

In [None]:
# Percentage of missing values
(df.isnull().sum() / len(df) * 100).round(2)
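Counts and percentages are easier to read side by side. One possible way to combine them is a small summary table limited to columns that actually have gaps. This is a sketch on made-up data: the `toy` frame and the `missing_count` and `missing_pct` labels are illustrative names, not part of the dataset.

```python
import pandas as pd

# Hypothetical toy data with gaps in two columns
toy = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [52000.0, None, 61000.5, 48000.0],
    "city": ["Lahore", None, "Karachi", None],
})

is_missing = toy.isnull()

# The mean of a boolean mask is the fraction of True values
missing_summary = pd.DataFrame({
    "missing_count": is_missing.sum(),
    "missing_pct": (is_missing.mean() * 100).round(2),
})

# Keep only columns with at least one missing value, worst first
missing_summary = (
    missing_summary[missing_summary["missing_count"] > 0]
    .sort_values("missing_count", ascending=False)
)

print(missing_summary)
```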

In [None]:
import matplotlib.pyplot as plt

# Bar chart: missing values
df.isnull().sum().sort_values(ascending=False).plot.bar()
plt.title("Missing Values per Column")
plt.ylabel("Count")
plt.show()

## ✅ Summary

Here's what we learned:

- The dataset contains **X rows** and **Y columns** (the two values reported by `df.shape`)
- There are **Z numeric features** and **W categorical features** (the lengths of `numeric_cols` and `categorical_cols`)
- Several columns contain missing values that need handling

In the **next notebook**, we'll explore how to **handle missing values** using techniques like median imputation and placeholder tokens.
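As a small preview, the two techniques named above could look roughly like this. It is a minimal sketch on made-up data, not the next notebook's actual code. The column names and the `"Unknown"` token are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data with one gap per column
toy = pd.DataFrame({
    "income": [40000.0, None, 60000.0, 50000.0],
    "city": ["Lahore", None, "Karachi", "Lahore"],
})

filled = toy.copy()

# Median imputation: replace missing numbers with the column median
filled["income"] = filled["income"].fillna(filled["income"].median())

# Placeholder token: mark missing categories explicitly
filled["city"] = filled["city"].fillna("Unknown")

print(filled)
```

Working on a copy leaves the original frame untouched, so the before and after states can be compared.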

➡️ **Next Up**: `02_handling_missing_values.ipynb`