<a href="https://colab.research.google.com/github/asifahsaan/data-preprocessing-beginners/blob/main/01_train_test_split_and_baseline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 01 — Train/Test Split + Baseline Models
In this notebook, we’ll learn how to:
- Split data into training and testing sets
- Train simple baseline models for comparison
- Evaluate them using accuracy for classification tasks

## 1. Load Dataset

In [None]:
from sklearn.datasets import load_breast_cancer
import pandas as pd

# Load sample classification dataset
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

X.shape
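
Before splitting, it can help to look at how the two classes are distributed, since the baseline in section 3 depends on which class is most frequent. The cell below is a minimal sketch of that check; the names `class_counts` and `majority_share` are introduced here for illustration and are not used elsewhere in the notebook.

```python
from sklearn.datasets import load_breast_cancer
import pandas as pd

data = load_breast_cancer()
y = pd.Series(data.target, name="target")

# Count samples per class and replace the integer labels with their names
class_counts = y.value_counts().sort_index()
class_counts.index = [data.target_names[i] for i in class_counts.index]
print(class_counts)

# Share of the most frequent class: the accuracy an "always guess the
# majority class" strategy would reach on the full dataset
majority_share = class_counts.max() / class_counts.sum()
print(f"Majority class share: {majority_share:.3f}")
```

A model is only useful here if it clearly beats this majority share.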

## 2. Train/Test Split

In [None]:
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing; random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Train size: {X_train.shape}, Test size: {X_test.shape}")
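
The split above shuffles rows at random, so the class proportions in the train and test sets can drift slightly from those of the full dataset. One common option is to pass `stratify=y`, which asks `train_test_split` to keep the proportions roughly equal. The sketch below is an optional variation, not the split used in the rest of this notebook; the `_s` suffix on the variable names is only there to keep the two splits apart.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# stratify=y keeps the class proportions approximately equal in both subsets
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The labels are 0/1, so the mean is the share of class 1
print(f"Class 1 share, full data: {y.mean():.3f}")
print(f"Class 1 share, train:     {y_train_s.mean():.3f}")
print(f"Class 1 share, test:      {y_test_s.mean():.3f}")
```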

## 3. DummyClassifier as Baseline

In [None]:
from sklearn.dummy import DummyClassifier

# Most frequent class baseline
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)
y_pred_dummy = dummy.predict(X_test)
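
To see what the `most_frequent` strategy actually does, the following self-contained sketch repeats the split and the baseline, then checks that every prediction is the same class: the majority class of the training labels. The variable `majority_class` is introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)
y_pred_dummy = dummy.predict(X_test)

# The baseline ignores the features and always outputs the majority class of y_train
majority_class = np.bincount(y_train).argmax()
print("Distinct predicted classes:", np.unique(y_pred_dummy))
print("Majority class in training set:", majority_class)
```

Because the features are ignored, the baseline's test accuracy is simply the share of that class in the test set.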

## 4. Logistic Regression Model

In [None]:
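from sklearn.linear_model import LogisticRegression

# max_iter is raised above the default of 100 to give the solver more iterations;
# on these unscaled features it may still emit a ConvergenceWarning
model = LogisticRegression(max_iter=500)
model.fit(X_train, y_train)
y_pred_logreg = model.predict(X_test)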

## 5. Evaluate Models

In [None]:
from sklearn.metrics import accuracy_score

# Accuracy = fraction of test samples whose predicted label matches the true label
acc_dummy = accuracy_score(y_test, y_pred_dummy)
acc_logreg = accuracy_score(y_test, y_pred_logreg)

print(f"DummyClassifier Accuracy: {acc_dummy:.3f}")
print(f"LogisticRegression Accuracy: {acc_logreg:.3f}")
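
Accuracy is the number of correct predictions divided by the total number of predictions. As a small worked example, the sketch below computes it by hand on toy labels (made up for illustration, not taken from the dataset) and compares the result with `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels, invented for illustration only
y_true = np.array([1, 0, 1, 1, 0, 1])
y_pred = np.array([1, 1, 1, 0, 0, 1])

# Element-wise comparison gives True/False; the mean is the fraction of matches
manual_acc = (y_true == y_pred).mean()
sklearn_acc = accuracy_score(y_true, y_pred)

print(f"Manual accuracy:  {manual_acc:.3f}")
print(f"accuracy_score:   {sklearn_acc:.3f}")
```

Four of the six toy predictions match, so both values are 4/6.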

## Summary
- Used `train_test_split` to separate training/testing data
- Trained a `DummyClassifier` as a performance baseline
- Trained a real model (`LogisticRegression`) and compared accuracies

## What’s Next?
In the next notebook, **`02_model_selection_with_crossval.ipynb`**, you’ll learn how to use cross-validation to make more reliable model comparisons.