
# 🎯 Preprocessing for Machine Learning Models

This notebook provides **code templates and checklists** for **preparing datasets for machine learning**. Proper preprocessing ensures models receive clean, well-structured input data.

### 🔹 What’s Covered:
- Handling missing data
- Encoding categorical variables
- Feature scaling & normalization
- Splitting data for training & testing


In [None]:

# Ensure required libraries are installed (uncomment if necessary).
# Note: the PyPI package is named scikit-learn; it is imported as sklearn.
# !pip install pandas numpy scikit-learn



## 🚫 Handling Missing Data

✅ Identify missing values in the dataset.  
✅ Decide whether to **drop** or **impute** missing values.  
✅ Choose the right imputation strategy (mean, median, mode).  


In [None]:

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

# Sample dataset with missing values
df = pd.DataFrame({
    'age': [25, 30, np.nan, 40, 35],
    'salary': [50000, 70000, 90000, np.nan, 65000],
    'city': ['NY', 'LA', 'SF', 'NY', np.nan]
})

# Identify missing values
print(df.isnull().sum())

# Impute missing values for numerical columns (mean strategy)
imputer = SimpleImputer(strategy='mean')
df[['age', 'salary']] = imputer.fit_transform(df[['age', 'salary']])

# Impute missing values for categorical columns (most frequent value)
imputer_cat = SimpleImputer(strategy='most_frequent')
df[['city']] = imputer_cat.fit_transform(df[['city']])

print(df)
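
The checklist mentions choosing between dropping and imputing, and between mean, median, and mode. The sketch below compares the two options on the same toy data. The variable names (`raw`, `missing_share`, `dropped`, `imputed`) are illustrative, and the median strategy is shown only as one alternative to the mean, not as a recommendation.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Same toy data as above, rebuilt so this cell runs on its own
raw = pd.DataFrame({
    'age': [25, 30, np.nan, 40, 35],
    'salary': [50000, 70000, 90000, np.nan, 65000],
    'city': ['NY', 'LA', 'SF', 'NY', np.nan]
})

# Share of missing values per column, to inform the drop-vs-impute decision
missing_share = raw.isnull().mean()

# Option 1: drop every row that has at least one missing value
dropped = raw.dropna()

# Option 2: impute numeric columns with the median,
# which is less sensitive to outliers than the mean
median_imputer = SimpleImputer(strategy='median')
imputed = raw.copy()
imputed[['age', 'salary']] = median_imputer.fit_transform(raw[['age', 'salary']])

print(missing_share)
print(f"Rows kept after dropna: {len(dropped)} of {len(raw)}")
print(imputed)
```

On a dataset this small, dropping rows discards most of the data, which is one reason imputation is often preferred when missing values are spread across many rows.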



## 🔤 Encoding Categorical Variables

✅ Convert categorical features into numerical representations.  
✅ Use **One-Hot Encoding** for non-ordinal categories.  
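✅ Use **Ordinal Encoding** for ordinal categories, with the category order stated explicitly. scikit-learn's `LabelEncoder` is intended for target labels and assigns codes in alphabetical order.  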


In [None]:

from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# One-Hot Encoding
df_encoded = pd.get_dummies(df, columns=['city'], drop_first=True)

# Label Encoding (shown for illustration only: 'city' has no natural order,
# and LabelEncoder assigns integer codes in alphabetical order)
le = LabelEncoder()
df['encoded_city'] = le.fit_transform(df['city'])

print(df_encoded.head())
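
As a complement to `pd.get_dummies`, the sketch below uses scikit-learn's `OneHotEncoder` and `OrdinalEncoder`, which remember the categories seen during `fit` and can be reused on new data. The `shirt_size` column, the `train`/`new` frames, and the category order are hypothetical and exist only for this example.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical data: 'city' is nominal, 'shirt_size' is ordinal
train = pd.DataFrame({
    'city': ['NY', 'LA', 'SF', 'NY'],
    'shirt_size': ['small', 'large', 'medium', 'small']
})
new = pd.DataFrame({
    'city': ['LA', 'Boston'],          # 'Boston' was not seen during fit
    'shirt_size': ['medium', 'large']
})

# One-hot encoding: unseen categories become an all-zero row instead of an error
ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(train[['city']])
city_new = ohe.transform(new[['city']]).toarray()

# Ordinal encoding: the order is given explicitly rather than alphabetically
ord_enc = OrdinalEncoder(categories=[['small', 'medium', 'large']])
ord_enc.fit(train[['shirt_size']])
size_new = ord_enc.transform(new[['shirt_size']])

print(ohe.categories_)
print(city_new)
print(size_new)
```

Fitting the encoder once and reusing it keeps the column layout identical between training and new data, which `pd.get_dummies` does not guarantee on its own.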



## 📏 Feature Scaling & Normalization

✅ Normalize numerical features to ensure comparability.  
✅ Use **Min-Max Scaling** (0-1 range) or **Standardization** (Z-score).  


In [None]:

from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Min-Max Scaling (rescales values to the 0-1 range)
minmax_scaler = MinMaxScaler()
df['salary_scaled'] = minmax_scaler.fit_transform(df[['salary']]).ravel()

# Standardization (Z-score: zero mean, unit variance)
standard_scaler = StandardScaler()
df['salary_standardized'] = standard_scaler.fit_transform(df[['salary']]).ravel()

print(df.head())
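
To make the two transformations concrete: Min-Max Scaling computes `(x - min) / (max - min)`, and Standardization computes `(x - mean) / std`. The sketch below applies both formulas by hand and checks them against scikit-learn. The `salary` array is a stand-in for the imputed salary column, and the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Stand-in for the imputed salary column (one feature, five rows)
salary = np.array([[50000.0], [70000.0], [90000.0], [68750.0], [65000.0]])

# Min-Max Scaling by hand: (x - min) / (max - min)
min_max_manual = (salary - salary.min()) / (salary.max() - salary.min())

# Standardization by hand: (x - mean) / std
# np.std uses the population standard deviation (ddof=0), as StandardScaler does
z_manual = (salary - salary.mean()) / salary.std()

# Both manual results should agree with the scikit-learn transformers
assert np.allclose(min_max_manual, MinMaxScaler().fit_transform(salary))
assert np.allclose(z_manual, StandardScaler().fit_transform(salary))

print(min_max_manual.ravel())
print(z_manual.ravel())
```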



## 📂 Splitting Data for Training & Testing

✅ Ensure a **proper split** between training & testing data.  
✅ Use **stratified sampling** for imbalanced classification problems.  
✅ Avoid **data leakage** when preparing features.  


In [None]:

from sklearn.model_selection import train_test_split

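# Define features (X) and target variable (y).
# Columns derived from the target (the scaled salary columns) are dropped too;
# leaving them in X would leak the target into the features.
X = df.drop(columns=['salary', 'salary_scaled', 'salary_standardized'])
y = df['salary']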

# Split dataset into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training data shape: {X_train.shape}")
print(f"Testing data shape: {X_test.shape}")
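
The split above is a plain random split on a regression target. For the stratified sampling mentioned in the checklist, a minimal sketch follows, assuming a hypothetical imbalanced binary target (`y_demo`, 10% positives) that is not part of the dataset used elsewhere in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced classification data: 90 negatives, 10 positives
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 90 + [1] * 10)

# stratify=y_demo keeps the class proportions the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print(f"Positive rate in train: {y_tr.mean():.2f}")
print(f"Positive rate in test:  {y_te.mean():.2f}")
```

Without `stratify`, a small test set can end up with very few or no examples of the minority class, which makes evaluation metrics unreliable.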



## ✅ Best Practices & Common Pitfalls

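- **Ensure consistent scaling**: Fit the scaler on the training data only, then use that same fitted scaler to transform the test data.  
- **Check for class imbalance**: Inspect the class distribution and pass `stratify=y` to `train_test_split` for imbalanced classification problems.  
- **Avoid data leakage**: Fit imputers, scalers, and encoders on the training split only. The cells above fit them on the full toy dataset for brevity, which should not be done in a real project.  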
- **Use pipelines**: Combine preprocessing steps using `sklearn.pipeline.Pipeline` for cleaner code.  
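
The practices above can be combined in one place. The following is a minimal sketch, not a definitive setup: the toy `data` frame, the step names, and the choice of `LinearRegression` are all assumptions made for illustration. Because the split happens first and the pipeline is fitted on the training split only, the imputers, scaler, and encoder never see the test data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with missing values in both feature columns
data = pd.DataFrame({
    'age': [25, 30, np.nan, 40, 35, 28, 50, 45],
    'city': ['NY', 'LA', 'SF', 'NY', np.nan, 'LA', 'SF', 'NY'],
    'salary': [50000, 70000, 90000, 80000, 65000, 55000, 120000, 95000],
})
X = data[['age', 'city']]
y = data['salary']

# Split first, so every preprocessing step is fitted on training data only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Numeric columns: impute with the median, then standardize
numeric = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

# Categorical columns: impute with the most frequent value, then one-hot encode
categorical = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('encode', OneHotEncoder(handle_unknown='ignore')),
])

preprocess = ColumnTransformer([
    ('num', numeric, ['age']),
    ('cat', categorical, ['city']),
])

model = Pipeline([
    ('preprocess', preprocess),
    ('regressor', LinearRegression()),
])

model.fit(X_train, y_train)            # fits preprocessing and model on train only
predictions = model.predict(X_test)    # reuses the fitted preprocessing on test
print(predictions)
```

The same `model` object can be passed to cross-validation utilities, in which case the preprocessing is refitted inside each fold.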
