# Scikit-Learn Data Preprocessing

**Course:** MLM-101 - Machine Learning Mastery  
**Phase 6:** Scikit-Learn Fundamentals (Lectures 43-46)  
**Topics:** Scaling, Encoding, Train/Test Split, Pipelines

---

## 📚 Learning Objectives

By the end of this notebook, you will be able to:

✅ Scale numerical features (StandardScaler, MinMaxScaler)  
✅ Encode categorical variables (LabelEncoder, OneHotEncoder)  
✅ Split data into train and test sets  
✅ Create preprocessing pipelines  
✅ Handle missing values with imputers  
✅ Prepare data for ML models

---

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

import warnings
warnings.filterwarnings('ignore')

print("✅ Libraries imported successfully!")

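## 1️⃣ Feature Scaling

Many ML algorithms, especially distance-based and gradient-based ones, perform better when features are on similar scales. Tree-based models are largely unaffected by feature scale.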

In [None]:
# Create sample data with different scales
data = pd.DataFrame({
    'age': [25, 30, 35, 40, 45, 50],
    'salary': [30000, 45000, 55000, 70000, 85000, 95000],
    'years_experience': [1, 3, 5, 8, 12, 15]
})

print("Original Data:")
print(data)
print("\nStatistics:")
print(data.describe())

### StandardScaler (Z-score Normalization)

Transforms each feature to have mean 0 and standard deviation 1. StandardScaler uses the population standard deviation (ddof=0).

In [None]:
# StandardScaler
scaler_standard = StandardScaler()
data_standardized = scaler_standard.fit_transform(data)

df_standardized = pd.DataFrame(data_standardized, columns=data.columns)

print("Standardized Data (mean=0, std=1):")
print(df_standardized)
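print("\nStatistics:")
# Note: describe() reports the sample std (ddof=1), so it shows ~1.10 here rather than exactly 1
print(df_standardized.describe())

print("\nFormula: (x - mean) / std")
# StandardScaler divides by the population std (ddof=0); pandas' .std() defaults to ddof=1
print(f"Example for age=25: ({25} - {data['age'].mean():.2f}) / {data['age'].std(ddof=0):.2f} = {df_standardized['age'].iloc[0]:.2f}")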

### MinMaxScaler

Scales each feature to a fixed range, [0, 1] by default.

In [None]:
# MinMaxScaler
scaler_minmax = MinMaxScaler()
data_minmax = scaler_minmax.fit_transform(data)

df_minmax = pd.DataFrame(data_minmax, columns=data.columns)

print("MinMax Scaled Data [0, 1]:")
print(df_minmax)
print("\nStatistics:")
print(df_minmax.describe())

print("\nFormula: (x - min) / (max - min)")
print(f"Example for age=25: ({25} - {data['age'].min()}) / ({data['age'].max()} - {data['age'].min()}) = {df_minmax['age'].iloc[0]:.2f}")

### 🎯 When to Use Which Scaler?

- **StandardScaler**: A common default. Suits scale-sensitive algorithms (regularized Linear Regression, Logistic Regression, SVM, Neural Networks). It does not require normally distributed features, but it is sensitive to outliers
- **MinMaxScaler**: When you need features in a specific range, e.g. for neural networks with sigmoid/tanh activations
- **RobustScaler**: When data has outliers (uses median and IQR instead of mean and std)
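
RobustScaler is listed above but not demonstrated in this notebook. The cell below is a minimal sketch of how it might compare with StandardScaler on a column that contains one extreme value. The salary figures are made up for illustration and are not part of the course data.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical salary column with one extreme outlier (illustrative values)
salaries = np.array(
    [[30000], [45000], [55000], [70000], [85000], [1000000]], dtype=float
)

robust = RobustScaler()      # centers on the median, scales by the IQR
standard = StandardScaler()  # centers on the mean, scales by the std

salaries_robust = robust.fit_transform(salaries)
salaries_standard = standard.fit_transform(salaries)

print("Median used for centering:", robust.center_)
print("IQR used for scaling:", robust.scale_)
print("RobustScaler output:  ", salaries_robust.ravel().round(2))
print("StandardScaler output:", salaries_standard.ravel().round(2))
```

In this sketch the outlier inflates the mean and std, so StandardScaler squeezes the five ordinary salaries close together, while RobustScaler keeps them spread out because the median and IQR barely move.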

---

## 2️⃣ Encoding Categorical Variables

Most scikit-learn estimators expect numeric input, so we need to convert categories to numbers.

### Label Encoding

Converts categories to integers (0, 1, 2, ...)

In [None]:
# Label Encoding (LabelEncoder is designed for target labels; applied to a feature column here for illustration)
df_labels = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'David'],
    'department': ['Engineering', 'Marketing', 'Engineering', 'Sales'],
    'performance': ['Good', 'Excellent', 'Average', 'Good']
})

print("Original Data:")
print(df_labels)

# Encode department
le_dept = LabelEncoder()
df_labels['department_encoded'] = le_dept.fit_transform(df_labels['department'])

print("\nWith Label Encoding:")
print(df_labels[['department', 'department_encoded']])

print("\nMapping:")
for i, label in enumerate(le_dept.classes_):
    print(f"  {label} → {i}")

### One-Hot Encoding

Creates binary columns for each category (better for nominal data)

In [None]:
# One-Hot Encoding
df_onehot = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'New York', 'London']
})

print("Original:")
print(df_onehot)

# Using pandas get_dummies
df_encoded = pd.get_dummies(df_onehot, columns=['city'], prefix='city')

print("\nOne-Hot Encoded:")
print(df_encoded)

# Using sklearn (OneHotEncoder is already imported at the top; sparse_output requires scikit-learn >= 1.2)
ohe = OneHotEncoder(sparse_output=False)
city_encoded = ohe.fit_transform(df_onehot[['city']])

print("\nUsing sklearn OneHotEncoder:")
print(city_encoded)
print(f"Categories: {ohe.categories_}")

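### 🎯 Label Encoding vs One-Hot Encoding

- **Label Encoding**: For target variables. LabelEncoder assigns integers in alphabetical order, so for ordinal features (Low < Medium < High) use OrdinalEncoder with an explicit category order instead
- **One-Hot Encoding**: For nominal data (Red, Blue, Green) where no order exists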
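
The difference matters for ordinal features because LabelEncoder sorts classes alphabetically and ignores their meaning. The sketch below uses a hypothetical `priority` column (not from the course data) to compare it with OrdinalEncoder, which accepts the intended order through its `categories` parameter.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordinal feature (illustrative values)
df_priority = pd.DataFrame({'priority': ['Low', 'High', 'Medium', 'Low', 'High']})

# LabelEncoder sorts classes alphabetically: High=0, Low=1, Medium=2
le = LabelEncoder()
alphabetical_codes = le.fit_transform(df_priority['priority'])

# OrdinalEncoder takes an explicit order per column: Low=0, Medium=1, High=2
oe = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
ordered_codes = oe.fit_transform(df_priority[['priority']]).ravel()

print("LabelEncoder (alphabetical):    ", alphabetical_codes)
print("OrdinalEncoder (explicit order):", ordered_codes)
```

Only the second encoding preserves Low < Medium < High, which is what a model needs if it is going to treat the integers as ordered.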

---

## 3️⃣ Train/Test Split

In [None]:
# Create sample dataset
np.random.seed(42)
X = np.random.rand(100, 4)  # 100 samples, 4 features
y = np.random.randint(0, 2, 100)  # Binary classification

print(f"Total samples: {len(X)}")
print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")

In [None]:
# Split into train and test (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,  # 20% for testing
    random_state=42,  # For reproducibility
    stratify=y  # Maintain class distribution
)

print("Split Results:")
print(f"  Training samples: {len(X_train)} ({len(X_train)/len(X)*100:.0f}%)")
print(f"  Test samples: {len(X_test)} ({len(X_test)/len(X)*100:.0f}%)")

print("\nClass Distribution:")
print(f"  Original: {np.bincount(y)}")
print(f"  Train: {np.bincount(y_train)}")
print(f"  Test: {np.bincount(y_test)}")

### 🎯 Important: Fit on Train, Transform on Test

In [None]:
# CORRECT way to scale
scaler = StandardScaler()

# Fit only on training data
scaler.fit(X_train)

# Transform both train and test using training statistics
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("✅ CORRECT: Fit on train, transform on both")
print(f"Train mean: {X_train_scaled.mean(axis=0)}")
print(f"Test mean: {X_test_scaled.mean(axis=0)}")
print("\n⚠️ Test mean is not exactly 0 because we used training statistics")

# WRONG way (data leakage)
print("\n❌ WRONG: Fitting on test data causes data leakage!")
print("Never do: scaler.fit(X_test) or scaler.fit_transform(X_test)")

---

## 4️⃣ Handling Missing Values

In [None]:
# Create data with missing values
df_missing = pd.DataFrame({
    'age': [25, np.nan, 35, 40, np.nan, 50],
    'salary': [50000, 60000, np.nan, 70000, 80000, np.nan],
    'experience': [2, 5, 7, np.nan, 12, 15]
})

print("Data with missing values:")
print(df_missing)
print("\nMissing count:")
print(df_missing.isnull().sum())

In [None]:
# SimpleImputer with mean strategy
imputer_mean = SimpleImputer(strategy='mean')
df_imputed_mean = pd.DataFrame(
    imputer_mean.fit_transform(df_missing),
    columns=df_missing.columns
)

print("Imputed with mean:")
print(df_imputed_mean)

# SimpleImputer with median strategy
imputer_median = SimpleImputer(strategy='median')
df_imputed_median = pd.DataFrame(
    imputer_median.fit_transform(df_missing),
    columns=df_missing.columns
)

print("\nImputed with median:")
print(df_imputed_median)

# Constant value
imputer_constant = SimpleImputer(strategy='constant', fill_value=0)
df_imputed_constant = pd.DataFrame(
    imputer_constant.fit_transform(df_missing),
    columns=df_missing.columns
)

print("\nImputed with constant (0):")
print(df_imputed_constant)
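
The mean and median strategies above only apply to numeric columns. For categorical columns, one common option is `strategy='most_frequent'`, which the complete pipeline later in this notebook also uses. The cell below is a small sketch on a hypothetical `department` column with illustrative values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with gaps (illustrative values)
df_cat = pd.DataFrame({
    'department': ['Engineering', np.nan, 'Marketing', 'Engineering', np.nan]
})

# most_frequent fills gaps with the mode and also works on string columns
imputer_mode = SimpleImputer(strategy='most_frequent')
df_cat_imputed = pd.DataFrame(
    imputer_mode.fit_transform(df_cat),
    columns=df_cat.columns
)

print("Imputed with most frequent value:")
print(df_cat_imputed)
print("\nLearned fill value:", imputer_mode.statistics_)
```

The learned fill value is stored in `statistics_`, so the same value is reused when the fitted imputer later transforms test data.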

---

## 5️⃣ Preprocessing Pipelines

Combine multiple preprocessing steps into one pipeline.

In [None]:
# Create sample dataset
df = pd.DataFrame({
    'age': [25, 30, np.nan, 40, 45],
    'salary': [50000, 60000, 70000, np.nan, 90000],
    'department': ['Engineering', 'Marketing', 'Engineering', 'Sales', 'Marketing'],
    'purchased': [0, 1, 1, 0, 1]
})

print("Original Dataset:")
print(df)

In [None]:
# Separate features and target
X = df.drop('purchased', axis=1)
y = df['purchased']

# Define numeric and categorical columns
numeric_features = ['age', 'salary']
categorical_features = ['department']

# Create transformers for numeric features
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Create transformers for categorical features
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Combine transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Fit and transform (demo on the full toy dataset; with real data, fit on the training split only)
X_processed = preprocessor.fit_transform(X)

print("\nProcessed Features:")
print(X_processed)
print(f"\nShape: {X.shape} → {X_processed.shape}")
print("(2 numeric + 3 one-hot encoded categorical = 5 features)")

---

## 🎯 Complete ML Preprocessing Pipeline

In [None]:
# Create realistic dataset
np.random.seed(42)
n_samples = 200

df_complete = pd.DataFrame({
    'age': np.random.randint(18, 65, n_samples),
    'income': np.random.randint(20000, 150000, n_samples),
    'credit_score': np.random.randint(300, 850, n_samples),
    'employment': np.random.choice(['Employed', 'Self-Employed', 'Unemployed'], n_samples),
    'education': np.random.choice(['High School', 'Bachelor', 'Master', 'PhD'], n_samples),
    'loan_approved': np.random.randint(0, 2, n_samples)
})

# Add some missing values (replace=False so the sampled rows are distinct and the counts are exact)
df_complete.loc[np.random.choice(df_complete.index, 20, replace=False), 'age'] = np.nan
df_complete.loc[np.random.choice(df_complete.index, 15, replace=False), 'income'] = np.nan
df_complete.loc[np.random.choice(df_complete.index, 10, replace=False), 'employment'] = np.nan

print("Dataset Info:")
print(df_complete.info())
print("\nMissing Values:")
print(df_complete.isnull().sum())
print("\nFirst 5 rows:")
print(df_complete.head())

In [None]:
# Prepare data
X = df_complete.drop('loan_approved', axis=1)
y = df_complete['loan_approved']

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")

# Define feature types
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['employment', 'education']

# Numeric pipeline
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical pipeline
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Complete preprocessor
preprocessor = ColumnTransformer([
    ('num', numeric_pipeline, numeric_features),
    ('cat', categorical_pipeline, categorical_features)
])

# Fit on train, transform both
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

print("\n✅ Preprocessing Complete!")
print(f"\nTrain shape: {X_train.shape} → {X_train_processed.shape}")
print(f"Test shape: {X_test.shape} → {X_test_processed.shape}")
print(f"\nFeatures breakdown:")
print(f"  - Numeric: {len(numeric_features)} features")
print(f"  - Categorical (one-hot): {X_train_processed.shape[1] - len(numeric_features)} features")
print(f"  - Total: {X_train_processed.shape[1]} features")

print("\n🎯 Data is now ready for ML models!")

---

## 🎓 Summary

In this notebook, you learned:

✅ **Feature Scaling**: StandardScaler (mean=0, std=1), MinMaxScaler ([0,1])  
✅ **Encoding**: LabelEncoder for target labels, OneHotEncoder for nominal features  
✅ **Train/Test Split**: Proper data splitting with stratification  
✅ **Missing Values**: SimpleImputer with mean/median/constant/most_frequent strategies  
✅ **Pipelines**: Automated preprocessing with Pipeline and ColumnTransformer  
✅ **Best Practices**: Fit on train, transform on test (avoid data leakage)

### ⚠️ Critical Rules

1. **Always split BEFORE fitting any preprocessing step**
2. **Fit only on training data**
3. **Apply the same fitted preprocessing to the test data**
4. **Never use test statistics (causes data leakage)**
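
One way to follow these rules without tracking them by hand is to put the scaler and the model in a single Pipeline. The sketch below assumes a LogisticRegression classifier and synthetic data, neither of which appears in this notebook; they are stand-ins for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

# Rule 1: split first
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Rules 2-4: the pipeline fits the scaler only on the data passed to fit()
model = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression()),
])

# During cross-validation the scaler is refit on each training fold
cv_scores = cross_val_score(model, X_tr, y_tr, cv=5)
print("CV accuracy per fold:", cv_scores.round(2))

# Final fit on the training split; the test split is only transformed and scored
model.fit(X_tr, y_tr)
print("Test accuracy:", round(model.score(X_te, y_te), 2))
```

Because the scaler lives inside the pipeline, calling `model.score(X_te, y_te)` reuses the training statistics automatically, and the test data never influences the fitted preprocessing.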

### 🚀 Next Steps

Continue to:
- Apply these techniques to real datasets
- Build complete ML models with preprocessing pipelines
- Experiment with different scaling and encoding strategies

---

**Course:** MLM-101 - Machine Learning Mastery  
**Website:** [https://flowdiary.com.ng/course/MLM-101](https://flowdiary.com.ng/course/MLM-101)