# Data Handling and Dataset Splitting

This notebook focuses on how to load data and split it correctly
before any preprocessing or modeling is performed.


## 1. Why Data Handling Matters

Many machine learning failures trace back to incorrect data handling
rather than to model choice.

Improper data handling can lead to:
- Data leakage
- Over-optimistic evaluation
- Models that fail in real-world settings


In [3]:
import pandas as pd
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer(as_frame=True)
X = dataset.data
y = dataset.target

X.head(), y.head()

(   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
 0        17.99         10.38          122.80     1001.0          0.11840   
 1        20.57         17.77          132.90     1326.0          0.08474   
 2        19.69         21.25          130.00     1203.0          0.10960   
 3        11.42         20.38           77.58      386.1          0.14250   
 4        20.29         14.34          135.10     1297.0          0.10030   
 
    mean compactness  mean concavity  mean concave points  mean symmetry  \
 0           0.27760          0.3001              0.14710         0.2419   
 1           0.07864          0.0869              0.07017         0.1812   
 2           0.15990          0.1974              0.12790         0.2069   
 3           0.28390          0.2414              0.10520         0.2597   
 4           0.13280          0.1980              0.10430         0.1809   
 
    mean fractal dimension  ...  worst radius  worst texture  worst perimeter 

## 3. Inputs and Target

- **X** is the feature matrix: one row per sample, one column per feature
- **y** holds the target labels: one label per row of **X**

All preprocessing decisions must be made with this separation in mind.
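
As a quick sanity check on this separation, the sketch below reloads the same dataset and confirms that the features and labels have matching row counts and a shared index. It reuses the names `X` and `y` from the loading cell; the specific checks are only one reasonable choice, not a required step.

```python
from sklearn.datasets import load_breast_cancer

# Reload the dataset so this cell can run on its own.
dataset = load_breast_cancer(as_frame=True)
X = dataset.data      # pandas DataFrame of features
y = dataset.target    # pandas Series of class labels

print(X.shape)                    # (number of samples, number of features)
print(y.shape)                    # one label per sample
print(y.value_counts())           # how many samples fall in each class
print(X.index.equals(y.index))    # True when rows of X and y line up
```

If the row counts or indices disagree, the split in the next section would pair features with the wrong labels, so the mismatch is worth resolving first.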

## 4. Train/Test Split

The dataset must be split into training and test sets
*before* any preprocessing step is fitted.

The test set must remain untouched until the final evaluation.

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

X_train.shape, X_test.shape

((455, 30), (114, 30))

### Why this matters

- `random_state` fixes the shuffle, so the same split is produced on every run
- `stratify=y` keeps the class proportions of `y` approximately equal in the training and test sets
- The test set stands in for unseen data
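
The effect of `stratify=y` can be checked directly. The following sketch repeats the split with the same arguments and compares the class shares in the full data, the training set, and the test set. The `summary` table is a helper introduced here for illustration only.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Reload and split with the same arguments as the cell above.
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Share of each class in the full data and in each split.
summary = pd.DataFrame({
    "full": y.value_counts(normalize=True).sort_index(),
    "train": y_train.value_counts(normalize=True).sort_index(),
    "test": y_test.value_counts(normalize=True).sort_index(),
})
print(summary.round(3))
```

The three columns should agree closely. Without `stratify`, the shares in a small test set can drift away from those of the full dataset.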


## 5. What NOT to Do

The following operations learn from the data, so they must NOT be fitted on the full dataset before splitting:
- Filling missing values using global statistics (for example, the mean of all rows)
- Feature scaling
- Encoding categorical variables
- Feature selection
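
To see why global statistics are a problem, here is a minimal sketch on a toy array. The values are hypothetical and are not taken from the notebook's dataset. It compares a mean computed over all rows with a mean computed on the training rows only.

```python
import numpy as np

# Toy feature column with hypothetical values.
# The last two rows play the role of the test set.
values = np.array([1.0, 2.0, 3.0, 4.0, 100.0, 110.0])
train, test = values[:4], values[4:]

global_mean = values.mean()   # includes the test rows: leaks information
train_mean = train.mean()     # uses the training rows only: safe

print(f"global mean: {global_mean:.2f}")
print(f"train mean:  {train_mean:.2f}")
```

Filling a missing training value with `global_mean` would let the test rows influence the training data, which is exactly what the split is meant to prevent.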


## 6. Data Leakage

Data leakage occurs when information from outside the training set,
such as test-set rows or statistics computed from them,
is used to build the model.

This leads to overly optimistic performance estimates
and poor generalization.

Any operation that learns from the data must be fitted on the training set only
and then applied unchanged to the test set.
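
The fit-on-train rule can be sketched with scikit-learn's `StandardScaler`. This is an illustration on synthetic data, kept separate from the notebook's dataset. The `*_demo` and `Xd_*` names are made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only (not the notebook's dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))
y_demo = np.tile([0, 1], 50)  # balanced synthetic labels

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

scaler = StandardScaler()
scaler.fit(Xd_train)                          # statistics come from training rows only
Xd_train_scaled = scaler.transform(Xd_train)
Xd_test_scaled = scaler.transform(Xd_test)    # reuses the training statistics

print(scaler.mean_)                  # per-feature means learned from the training rows
print(Xd_test_scaled.mean(axis=0))   # close to, but not exactly, zero
```

The scaled test features are not exactly centered at zero, and that is expected: the scaler never saw the test rows. Calling `fit` or `fit_transform` on the test set would reintroduce the leak.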


## 7. Summary

In this notebook, we:
- Loaded a dataset
- Clearly separated features and targets
- Performed a proper train/test split
- Discussed common data handling mistakes

No preprocessing or modeling was applied to the dataset itself; those steps build on the split created here.
