# Train-Test Split

In machine learning, we split the dataset into **training** and **testing** parts so that model performance can be evaluated on data the model did not see during training.

---

## Why Split the Data?
- **Training Set** → used to fit/train the model  
- **Testing Set** → used to evaluate how well the model generalizes to unseen data  

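Evaluating on held-out data helps detect **overfitting**, where a model performs well on training data but poorly on new/unseen data. The split does not prevent overfitting by itself; it makes the problem visible.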
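As a minimal sketch of why a held-out set matters, the snippet below fits two polynomials to the same training points and compares their errors on training and testing data. The toy data, the every-other-point split, and the helper `mse` are all assumptions made for illustration; they are not part of the original example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: a noisy linear relationship y = 2x + noise
x = np.linspace(0.0, 1.0, 20)
y = 2.0 * x + rng.normal(scale=0.2, size=x.size)

# Simple deterministic split: even-indexed points train, odd-indexed points test
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]


def mse(coeffs, xs, ys):
    """Mean squared error of a polynomial (given by its coefficients) on (xs, ys)."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))


# A straight line versus a degree-9 polynomial that can pass through all 10 training points
simple = np.polyfit(x_train, y_train, deg=1)
flexible = np.polyfit(x_train, y_train, deg=9)

for name, coeffs in [("degree 1", simple), ("degree 9", flexible)]:
    print(name,
          "| train MSE:", mse(coeffs, x_train, y_train),
          "| test MSE:", mse(coeffs, x_test, y_test))
```

The degree-9 fit memorizes the training points, so its training error is close to zero; only the testing error reveals how well each model handles points it has not seen.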

---

## Typical Ratios
- 70% Training, 30% Testing
- 80% Training, 20% Testing
- 60% Training, 40% Testing

---

## Formula
If the dataset contains $N$ samples and a proportion $p$ of them is used for training:

$$
N_{train} = p \times N, \quad N_{test} = (1-p) \times N
$$

where:  
- $N_{train}$ → number of training samples  
- $N_{test}$ → number of testing samples  
- $p$ → proportion for training (e.g., 0.8)  
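The formula can be checked with a few lines of arithmetic. Because $p \times N$ is not always a whole number, the sketch below rounds the training count and assigns the remainder to testing; that rounding rule and the helper name `split_sizes` are assumptions for illustration, and a given library may round differently.

```python
def split_sizes(n_samples, p):
    """Return (n_train, n_test) for a training proportion p.

    Assumption: the training count is rounded to the nearest integer and
    the test set receives whatever is left, so the two always sum to N.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    n_train = round(p * n_samples)
    n_test = n_samples - n_train
    return n_train, n_test


# Try a few dataset sizes and training proportions
for n, p in [(8, 0.75), (10, 0.8), (100, 0.7), (10, 0.6)]:
    n_train, n_test = split_sizes(n, p)
    print(f"N={n}, p={p}: N_train={n_train}, N_test={n_test}")
```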


```python
from sklearn.model_selection import train_test_split
import numpy as np

# Sample dataset
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([10, 20, 30, 40, 50, 60, 70, 80])

# Split into 75% train and 25% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, shuffle=True
)

print("Training Data (X_train):", X_train.flatten())
print("Training Labels (y_train):", y_train)
print("Testing Data (X_test):", X_test.flatten())
print("Testing Labels (y_test):", y_test)
```

Output:

```
Training Data (X_train): [1 8 3 5 4 7]
Training Labels (y_train): [10 80 30 50 40 70]
Testing Data (X_test): [2 6]
Testing Labels (y_test): [20 60]
```
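
To make the shuffle-then-slice idea behind the split above concrete, here is a rough NumPy-only sketch. The function `manual_train_test_split` is a hypothetical helper written for this illustration, not scikit-learn's implementation, and it is not expected to reproduce the same ordering as `train_test_split` even with the same seed.

```python
import numpy as np


def manual_train_test_split(X, y, test_size=0.25, seed=42):
    """Hypothetical helper: shuffle the row indices, then slice off a test portion."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))           # random order of row indices
    n_test = int(np.ceil(test_size * len(X)))   # assumption: round the test count up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


# Same sample dataset as in the example above
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([10, 20, 30, 40, 50, 60, 70, 80])

X_train, X_test, y_train, y_test = manual_train_test_split(X, y)

print("Training Data (X_train):", X_train.flatten())
print("Testing Data (X_test):", X_test.flatten())
```

Indexing `X` and `y` with the same index arrays keeps each feature row paired with its label, which is the property that matters most when writing a split by hand.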
