# Practical Considerations and Best Practices
## Lecture Notebook for BS Data Science Students

---

## Learning Objectives

By the end of this notebook, you will be able to:
1. Understand data preprocessing techniques and their importance
2. Perform feature engineering and feature selection
3. Handle common pitfalls in machine learning
4. Understand model interpretability and its importance
5. Apply best practices throughout the ML pipeline
6. Recognize and avoid common mistakes

---

## 6.1 Data Preprocessing

### Handling Missing Values

**Missing values** are common in real-world datasets and must be handled appropriately. The approach depends on:
- **Amount of missing data**: Small vs. large proportion
- **Pattern of missingness**: Random vs. systematic (missing completely at random, missing at random, or missing not at random)
- **Type of feature**: Numerical vs. categorical
- **Importance of feature**: Critical vs. auxiliary

#### Strategies for Handling Missing Values

**Deletion**:
- **Listwise Deletion**: Remove rows with any missing values. Simple but can lose significant data.
- **Pairwise Deletion**: Use available data for each analysis. Can lead to inconsistent sample sizes.
- **When to Use**: When only a small share of rows is affected (a common rule of thumb is <5%) and the missingness appears unrelated to the data values.

**Imputation**:
- **Mean/Median Imputation**: Replace missing numerical values with mean or median. Preserves sample size but underestimates variance.
- **Mode Imputation**: For categorical variables, use most frequent category.
- **Forward/Backward Fill**: For time series, use previous/next value.
- **K-NN Imputation**: Use values from k nearest neighbors.
- **Regression Imputation**: Predict missing values using other features.
- **When to Use**: When missing data is substantial and deletion would lose important information.

**Advanced Methods**:
- **Multiple Imputation**: Create multiple imputed datasets and combine results.
- **Model-Based Imputation**: Use sophisticated models to predict missing values.
- **Indicator Variables**: Create binary features indicating missingness (may be informative).

**Best Practices**:
- Always investigate why data is missing (may be informative)
- Consider creating "missing" as a category for categorical variables
- Fit the imputer on training data only, then apply those training statistics to the test set; never compute imputation statistics from test data
- Document imputation strategy for reproducibility
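
The train-only fitting rule can be sketched with the Python standard library alone. The toy `train_ages` and `test_ages` lists and the helper names `fit_median` and `impute` are hypothetical, chosen for illustration; scikit-learn's `SimpleImputer` covers the same ground in practice.

```python
from statistics import median

def fit_median(values):
    """Learn the imputation statistic from the observed (non-None) training values."""
    return median([v for v in values if v is not None])

def impute(values, fill):
    """Fill missing entries with a precomputed statistic and flag where they were."""
    filled = [fill if v is None else v for v in values]
    was_missing = [int(v is None) for v in values]
    return filled, was_missing

# Hypothetical toy data: None marks a missing value.
train_ages = [22, 35, None, 41, 29]
test_ages = [None, 50]

fill_value = fit_median(train_ages)                      # fitted on training data only
train_filled, train_flag = impute(train_ages, fill_value)
test_filled, test_flag = impute(test_ages, fill_value)   # reuses the training statistic

print(fill_value, test_filled, test_flag)
```

The `was_missing` list plays the role of the indicator variable described earlier: it keeps the fact of missingness available to the model even after the gap is filled.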

### Feature Scaling

**Feature scaling** transforms features to similar scales. Many algorithms are sensitive to feature scale:

**Algorithms Requiring Scaling**:
- **Distance-based**: k-NN, SVM, K-Means (distance calculations affected)
- **Gradient-based**: Neural networks, logistic regression with regularization
- **Regularized models**: Ridge, Lasso, Elastic Net

**Algorithms Not Requiring Scaling**:
- **Tree-based**: Decision trees, Random Forest (splits based on thresholds)
- **Naive Bayes**: Based on probabilities, not distances

#### Scaling Methods

**Standardization (Z-score normalization)**:
- Formula: (x - μ) / σ
- Result: Mean = 0, Standard deviation = 1
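- **When to Use**: A reasonable default for most scale-sensitive algorithms; it does not require normally distributed data, although roughly symmetric features benefit most
- **Advantages**: No bounded range, so outliers are not clipped; works well for many algorithms
- **Disadvantages**: μ and σ are themselves pulled by outliers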

**Min-Max Normalization**:
- Formula: (x - min) / (max - min)
- Result: Range [0, 1]
- **When to Use**: When you need bounded range
- **Advantages**: Preserves relationships, intuitive scale
- **Disadvantages**: Sensitive to outliers

**Robust Scaling**:
- Uses median and IQR instead of mean and std
- **When to Use**: When data contains outliers
- **Advantages**: Less affected by outliers

**Best Practices**:
- Fit scaler on training data only, then transform both train and test
- Never fit on test data (data leakage)
- Store scaling parameters for production use
- Consider scaling even for algorithms that don't strictly require it (can help)
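
A minimal sketch of "fit on train, transform both", written in plain Python so every step is visible. The data values and function names are illustrative assumptions, and the population standard deviation (`pstdev`) is one reasonable choice for σ.

```python
from statistics import mean, pstdev

def fit_standardizer(values):
    """Learn the mean and standard deviation from training values."""
    return mean(values), pstdev(values)

def standardize(values, mu, sigma):
    """Apply (x - mu) / sigma using previously learned parameters."""
    return [(v - mu) / sigma for v in values]

def fit_minmax(values):
    """Learn the minimum and maximum from training values."""
    return min(values), max(values)

def minmax(values, lo, hi):
    """Apply (x - min) / (max - min) using previously learned parameters."""
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical toy feature.
train = [10.0, 20.0, 30.0, 40.0]
test = [25.0, 50.0]

mu, sigma = fit_standardizer(train)   # parameters come from training data only
lo, hi = fit_minmax(train)

train_z = standardize(train, mu, sigma)
test_z = standardize(test, mu, sigma)
test_mm = minmax(test, lo, hi)

print(test_z, test_mm)
```

Note that `test_mm` contains a value above 1: a test point outside the training range falls outside [0, 1] when the training minimum and maximum are reused. That is expected behavior, not a bug.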

### Encoding Categorical Variables

Most machine learning algorithms require numerical inputs, so categorical variables must be encoded.

#### Encoding Methods

**One-Hot Encoding**:
- Creates binary columns for each category
- **Advantages**: No ordinal assumption, works for nominal categories
- **Disadvantages**: Creates many columns for high-cardinality features, can cause multicollinearity
- **When to Use**: Nominal categories with few unique values

**Label Encoding**:
- Assigns integer labels to categories (0, 1, 2, ...)
- **Advantages**: Preserves single column, efficient
- **Disadvantages**: Implies ordinality (may mislead algorithms)
- **When to Use**: Ordinal categories or tree-based algorithms (less sensitive)

**Target Encoding (Mean Encoding)**:
- Replaces category with mean target value for that category
- **Advantages**: Can capture predictive power, reduces dimensionality
- **Disadvantages**: Risk of overfitting, requires careful cross-validation
- **When to Use**: High-cardinality categorical features

**Best Practices**:
- Use one-hot encoding for nominal variables with few categories
- Be cautious with high-cardinality features (consider target encoding or grouping)
- For tree-based models, label encoding is often sufficient
- Always handle unseen categories in test data
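
One way to handle unseen categories is to learn the category list from training data and encode anything else as an all-zero row. The sketch below assumes that policy; the color values and helper names are hypothetical. scikit-learn's `OneHotEncoder(handle_unknown="ignore")` behaves the same way.

```python
def fit_categories(values):
    """Record the sorted set of categories seen in training."""
    return sorted(set(values))

def one_hot(values, categories):
    """Encode each value as a 0/1 row; unseen categories become all zeros."""
    return [[int(v == c) for c in categories] for v in values]

train_colors = ["red", "green", "blue", "green"]
test_colors = ["blue", "purple"]   # "purple" never appeared in training

categories = fit_categories(train_colors)       # learned from training data only
encoded_test = one_hot(test_colors, categories)

print(categories)
print(encoded_test)
```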

### Handling Imbalanced Data

**Imbalanced datasets** have unequal class distributions (e.g., 95% class A, 5% class B). This causes problems:
- Models biased toward majority class
- Accuracy misleading (can achieve high accuracy by always predicting majority)
- Minority class predictions poor

#### Strategies

**Resampling**:
- **Oversampling**: Increase minority class samples (e.g., SMOTE - Synthetic Minority Over-sampling Technique)
- **Undersampling**: Reduce majority class samples
- **Combination**: Both oversample minority and undersample majority

**Algorithm-Level**:
- **Class Weights**: Penalize misclassifying minority class more heavily
- **Threshold Tuning**: Adjust decision threshold (lower threshold for minority class)
- **Cost-Sensitive Learning**: Assign different costs to different misclassifications
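
As an illustration of class weights, a common heuristic sets each weight to `n_samples / (n_classes * class_count)`, so rarer classes receive proportionally larger weights. The sketch below assumes that heuristic and uses a hypothetical 95/5 label list.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

labels = ["A"] * 95 + ["B"] * 5
weights = balanced_class_weights(labels)

print(weights)
```

With these weights, one misclassified "B" example costs as much as 19 misclassified "A" examples, which offsets the 95:5 imbalance in the loss.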

**Evaluation Metrics**:
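- Don't rely on accuracy alone with imbalanced data
- Use precision, recall, F1-score, ROC-AUC
- Consider the precision-recall curve instead of the ROC curve when the positive class is rare (ROC can look optimistic because it is dominated by the many true negatives)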

**Best Practices**:
- Always check class distribution first
- Use appropriate metrics (not accuracy)
- Try multiple approaches and compare
- Consider business costs of different error types

---

## 6.2 Feature Engineering

### Creating New Features

**Feature engineering** is the process of creating new features from existing ones. It is often the most impactful step in the pipeline:

**Mathematical Transformations**:
- **Polynomial Features**: x², x³, x₁×x₂ (interactions)
- **Logarithmic**: log(x) for skewed distributions
- **Square Root**: √x for count data
- **Ratios**: x₁/x₂ (e.g., income per person)

**Temporal Features**:
- **Time Components**: hour, day of week, month, season
- **Time Since**: Days since last purchase, account creation
- **Cyclical Encoding**: sin/cos for cyclical patterns (hours, months)
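
Cyclical encoding maps a periodic value onto the unit circle so that the end of a cycle sits next to its start. The sketch below is illustrative; the function name `encode_cyclical` is an assumption.

```python
import math

def encode_cyclical(value, period):
    """Map a periodic value (e.g., hour of day) to (sin, cos) coordinates."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

h0 = encode_cyclical(0, 24)    # midnight
h12 = encode_cyclical(12, 24)  # noon
h23 = encode_cyclical(23, 24)  # 11 PM

# In raw form, 23 and 0 are 23 units apart; on the circle they are neighbors.
print(math.dist(h23, h0), math.dist(h0, h12))
```

Both the sine and cosine columns are needed: either one alone maps two different hours to the same value.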

**Binning**:
- Convert continuous to categorical (e.g., age groups)
- Can capture non-linear relationships
- Can reduce sensitivity to noise and outliers, at the cost of discarding within-bin information

**Domain-Specific**:
- **Business Logic**: Revenue per customer, conversion rates
- **Domain Knowledge**: BMI from height/weight, speed from distance/time

**Best Practices**:
- Start with domain knowledge
- Visualize relationships before creating features
- Avoid creating too many features (curse of dimensionality)
- Validate that new features improve model performance

### Feature Selection

**Feature selection** chooses the most relevant features, improving:
- Model performance (removes noise)
- Training speed (fewer features)
- Interpretability (simpler models)
- Generalization (reduces overfitting)

#### Selection Methods

**Filter Methods**:
- **Correlation**: Remove one feature from each highly correlated pair (they carry largely redundant information)
- **Statistical Tests**: Chi-square, ANOVA F-test
- **Mutual Information**: Measures dependency
- **Advantages**: Fast, independent of model
- **Disadvantages**: May miss feature interactions
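
A correlation filter can be sketched as a greedy pass that keeps a feature only if it is not highly correlated with any feature already kept. The 0.95 threshold, the toy columns, and the helper names are assumptions for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def drop_correlated(features, threshold=0.95):
    """Keep a feature only if |r| with every already-kept feature is below threshold."""
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy columns; height_in is height_cm in different units.
features = {
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [34.0, 21.0, 45.0, 30.0, 27.0],
}
kept = drop_correlated(features)

print(kept)
```

The result depends on column order, since the first feature of a correlated pair is the one retained. Domain knowledge can guide which of the two to keep.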

**Wrapper Methods**:
- **Forward Selection**: Start empty, add best features iteratively
- **Backward Elimination**: Start with all, remove worst iteratively
- **Recursive Feature Elimination**: Remove least important iteratively
- **Advantages**: Considers feature interactions
- **Disadvantages**: Computationally expensive, risk of overfitting

**Embedded Methods**:
- **Lasso Regularization**: Automatically sets some coefficients to zero
- **Tree-Based Importance**: Use feature importance from trees
- **Advantages**: Efficient, model-specific
- **Disadvantages**: Tied to specific algorithm

**Best Practices**:
- Start with filter methods for quick reduction
- Use wrapper/embedded for final selection
- Consider computational cost vs. performance gain
- Validate selection on validation set

### Feature Importance

Understanding which features matter helps:
- **Interpretability**: Explain model decisions
- **Feature Selection**: Identify features to keep/remove
- **Domain Insights**: Understand what drives predictions

**Methods**:
- **Coefficient Magnitude**: For linear models fit on features with comparable scales, larger absolute coefficients = more important
- **Permutation Importance**: Shuffle feature, measure performance drop
- **SHAP Values**: SHapley Additive exPlanations (model-agnostic)
- **Tree Importance**: Built-in for tree-based models
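
Permutation importance can be sketched from scratch: shuffle one column at a time and record how much the error grows. The `predict` function below stands in for an already-fitted model and, like the synthetic data, is purely hypothetical. In practice the measurement is best taken on held-out data.

```python
import random

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Average increase in MSE when each feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the target
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(y, [predict(row) for row in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Synthetic data: the target depends on feature 0 only.
data_rng = random.Random(42)
X = [[data_rng.uniform(0, 10), data_rng.uniform(0, 10)] for _ in range(200)]
y = [3 * x0 + data_rng.gauss(0, 0.1) for x0, _ in X]

def predict(row):
    """Hypothetical fitted model that uses feature 0 and ignores feature 1."""
    return 3 * row[0]

importances = permutation_importance(predict, X, y)
print(importances)
```

Shuffling feature 1 leaves every prediction unchanged, so its importance is exactly zero, while shuffling feature 0 sharply increases the error.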

---

## 6.3 Common Pitfalls and How to Avoid Them

### Data Leakage

**Data leakage** occurs when information from the future or test set leaks into training, creating unrealistically good performance.

**Types**:
- **Target Leakage**: Using features that wouldn't be available at prediction time
- **Train-Test Contamination**: Using test data during training (e.g., scaling, imputation)
- **Temporal Leakage**: Using future information to predict past

**How to Avoid**:
- **Temporal Ordering**: For time series, always train on past, test on future
- **Proper Splitting**: Split data before any preprocessing
- **Pipeline Approach**: Fit transformers on training data only
- **Domain Knowledge**: Question whether features would be available in production
- **Sanity Checks**: If performance seems too good, investigate leakage
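
Temporal ordering and split-before-preprocessing can be combined in a few lines. The sketch assumes each record carries a date; the `rows` data and the `temporal_split` helper are hypothetical.

```python
from datetime import date
from statistics import mean

def temporal_split(rows, n_test):
    """Sort by date and hold out the most recent n_test rows for testing."""
    ordered = sorted(rows, key=lambda r: r["day"])
    return ordered[:-n_test], ordered[-n_test:]

# Hypothetical monthly sales records, deliberately out of order.
rows = [
    {"day": date(2024, 3, 1), "sales": 130.0},
    {"day": date(2024, 1, 1), "sales": 100.0},
    {"day": date(2024, 5, 1), "sales": 170.0},
    {"day": date(2024, 2, 1), "sales": 110.0},
    {"day": date(2024, 4, 1), "sales": 150.0},
]

# Split first, then compute any preprocessing statistic from the training part only.
train_rows, test_rows = temporal_split(rows, n_test=2)
train_mean = mean([r["sales"] for r in train_rows])
centered_test = [r["sales"] - train_mean for r in test_rows]

print(train_mean, centered_test)
```

Computing `train_mean` over all five rows instead would let the two most recent months influence how earlier data is transformed, which is a small but real instance of leakage.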

**Red Flags**:
- Performance much better than expected
- Features with suspiciously high correlation with target
- Test performance better than validation performance
- Features that wouldn't exist at prediction time

### Overfitting and Underfitting

**Overfitting**: Model learns training data too well, including noise.

**Signs**:
- Training performance much better than validation/test
- Model too complex for data size
- Performance degrades on new data

**Solutions**:
- **Regularization**: Add penalties for complexity (L1, L2)
- **Reduce Model Complexity**: Fewer features, simpler models
- **More Data**: Collect more training examples
- **Cross-Validation**: Use to detect overfitting
- **Early Stopping**: Stop training before overfitting
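
The train/validation gap can be observed with a tiny k-nearest-neighbors classifier written from scratch. The synthetic data (one feature, 20% flipped labels) and all function names are assumptions made for this sketch.

```python
import random

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (labels are 0/1)."""
    neighbors = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in neighbors)
    return int(votes * 2 > k)

def accuracy(train, data, k):
    """Fraction of points in data classified correctly using train as the model."""
    return sum(knn_predict(train, x, k) == y for x, y in data) / len(data)

def make_data(n, rng, noise=0.2):
    """True rule is x > 5; a fraction of labels is flipped to simulate noise."""
    data = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        y = int(x > 5)
        if rng.random() < noise:
            y = 1 - y
        data.append((x, y))
    return data

rng = random.Random(0)
train, val = make_data(100, rng), make_data(100, rng)

# k=1 memorizes every training point, noise included; k=15 smooths over it.
results = {k: (accuracy(train, train, k), accuracy(train, val, k)) for k in (1, 15)}
for k, (train_acc, val_acc) in results.items():
    print(f"k={k}: train={train_acc:.2f}, validation={val_acc:.2f}")
```

With k=1, each training point is its own nearest neighbor, so training accuracy is perfect by construction. The validation score is the number that shows how well the model generalizes.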

**Underfitting**: Model too simple to capture patterns.

**Signs**:
- Poor performance on both training and test
- Model too simple
- High bias, low variance

**Solutions**:
- **Increase Model Complexity**: More features, deeper trees, more parameters
- **Feature Engineering**: Create better features
- **Reduce Regularization**: Lower regularization strength
- **Longer Training**: For iterative algorithms, train longer

### Ignoring Class Imbalance

**Problem**: Using accuracy with imbalanced data gives misleading results.

**Example**: 95% class A, 5% class B. Model predicting always A gets 95% accuracy but fails completely on class B.

**Solutions**:
- **Appropriate Metrics**: Use precision, recall, F1, ROC-AUC
- **Resampling**: Balance classes through oversampling/undersampling
- **Class Weights**: Penalize misclassifying minority class
- **Threshold Tuning**: Adjust decision threshold based on business needs
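
The 95/5 example can be checked numerically. The sketch below computes accuracy, precision, recall, and F1 for a classifier that always predicts A, treating B as the positive class. Returning 0.0 when a ratio is undefined is a convention assumed here.

```python
def evaluate(y_true, y_pred, positive):
    """Accuracy, precision, recall, and F1 with respect to one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0   # undefined -> 0.0 by convention
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = ["A"] * 95 + ["B"] * 5
always_a = ["A"] * 100          # a "model" that ignores its input

metrics = evaluate(y_true, always_a, positive="B")
print(metrics)
```

Accuracy is 0.95, yet recall and F1 for class B are both 0, which is the failure that accuracy hides.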

### Not Validating Properly

**Common Mistakes**:
- **Testing on Training Data**: Evaluating on same data used for training
- **Multiple Testing**: Testing multiple models on test set, choosing best (overfitting to test set)
- **Improper Splits**: Not respecting temporal order, data leakage
- **No Validation Set**: Using test set for hyperparameter tuning

**Best Practices**:
- **Three-Way Split**: Train, validation, test sets
- **Use Test Set Once**: Only for final evaluation
- **Cross-Validation**: For model selection and hyperparameter tuning
- **Temporal Awareness**: Respect time order for time series
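
Cross-validation rests on partitioning the data into k folds, each serving once as the validation set. The index generator below is a minimal sketch without shuffling. For non-temporal data the indices would typically be shuffled first, and for time series a forward-chaining scheme is needed instead.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validation:", val_idx)
```

Every sample appears in exactly one validation fold, so each contributes once to the averaged validation score. The test set stays outside this loop entirely.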

### Other Common Mistakes

**Ignoring Data Quality**:
- Not checking for outliers, errors, inconsistencies
- Assuming data is clean
- **Solution**: Always explore and clean data first

**Feature Scaling Issues**:
- Scaling test data using test statistics (should use training statistics)
- Not scaling when required
- **Solution**: Fit scalers on training data, transform both train and test

**Categorical Encoding Mistakes**:
- Using label encoding for nominal variables (implies order)
- Not handling unseen categories in test data
- **Solution**: Use appropriate encoding, handle test-time scenarios

**Hyperparameter Tuning Mistakes**:
- Tuning on test set
- Not using cross-validation
- Overfitting to validation set
- **Solution**: Separate validation set, use cross-validation, keep test set separate

---

## 6.4 Model Interpretability

### Why Interpretability Matters

**Interpretability** is the ability to understand and explain how a model makes predictions.

**Importance**:
- **Trust**: Stakeholders need to trust model decisions
- **Regulation**: Many industries require explainable models (finance, healthcare)
- **Debugging**: Understanding failures helps improve models
- **Fairness**: Detecting and preventing bias
- **Insights**: Understanding what drives predictions provides business value

**When Interpretability is Critical**:
- **High-Stakes Decisions**: Medical diagnosis, loan approval, hiring
- **Regulated Industries**: Finance, healthcare, legal
- **Stakeholder Requirements**: Business users need explanations
- **Bias Detection**: Ensuring fair, unbiased predictions

### Interpretable Models

**Naturally Interpretable**:
- **Linear/Logistic Regression**: Coefficients show the direction and size of each feature's effect (comparable across features only when the features share a scale)
- **Decision Trees**: Visual, rule-based explanations
- **Rule-Based Models**: IF-THEN rules

**Trade-offs**:
- Often simpler models
- May sacrifice some performance for interpretability
- Sufficient for many problems

### Model-Agnostic Interpretability Methods

**SHAP (SHapley Additive exPlanations)**:
- Explains individual predictions
- Shows contribution of each feature
- Works with any model
- Based on game theory (Shapley values)
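
The game-theoretic idea can be illustrated by computing exact Shapley values for a tiny model. This simplified sketch is not the SHAP library's algorithm: it assumes a single baseline input to represent "feature absent" and enumerates every feature ordering, which is feasible only for a handful of features. The `predict` function and the inputs are hypothetical.

```python
from itertools import permutations

def shapley_values(predict, x, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    n = len(x)
    contributions = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)       # start with every feature "absent"
        prev = predict(current)
        for j in order:
            current[j] = x[j]          # switch feature j to its actual value
            new = predict(current)
            contributions[j] += new - prev
            prev = new
    return [c / len(orderings) for c in contributions]

def predict(row):
    """Hypothetical model with one additive term and one interaction term."""
    return 2 * row[0] + row[1] * row[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(predict, x, baseline)

print(phi, sum(phi), predict(x) - predict(baseline))
```

The contributions sum to the difference between the prediction and the baseline prediction, and the interaction term is split evenly between the two features involved. Because the number of orderings grows factorially, practical tools rely on approximations.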

**LIME (Local Interpretable Model-agnostic Explanations)**:
- Approximates complex model locally with simple model
- Explains individual predictions
- Easy to understand
- Explanations can change with the sampling and neighborhood settings, so they may not always be faithful to the model

**Partial Dependence Plots**:
- Show effect of feature on prediction
- Marginal effect of a feature, averaged over the observed values of the other features
- Visual and intuitive
- Two-feature plots can show interactions; one-feature plots can hide them

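
A partial dependence curve can be computed by fixing the feature of interest at each grid value for every row and averaging the model's predictions. The `predict` function and the small dataset below are hypothetical stand-ins for a fitted model and real data.

```python
def partial_dependence(predict, X, feature, grid):
    """Average prediction when `feature` is set to each grid value for all rows."""
    curve = []
    for value in grid:
        preds = [predict(row[:feature] + [value] + row[feature + 1:]) for row in X]
        curve.append(sum(preds) / len(preds))
    return curve

def predict(row):
    """Hypothetical fitted model."""
    return 2 * row[0] + row[1]

X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
curve = partial_dependence(predict, X, feature=0, grid=[0.0, 1.0, 2.0])

print(curve)
```

For this linear model the curve rises by 2 per unit of feature 0, matching its coefficient, and is offset by the average contribution of feature 1.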
**Feature Importance**:
- Overall importance of each feature
- Available for many models
- Simple to understand
- May miss interactions

**Best Practices**:
- Start with interpretable models when possible
- Use model-agnostic methods for complex models
- Explain both individual predictions and overall model behavior
- Consider audience when choosing explanation method
- Validate explanations make sense (domain knowledge)

---

## Summary and Key Takeaways

1. **Data Preprocessing** (handling missing values, scaling, encoding) is crucial for model performance.

2. **Feature Engineering** can significantly improve models - domain knowledge is key.

3. **Feature Selection** reduces overfitting and improves interpretability.

4. **Data Leakage** is a critical issue - always validate that features would be available at prediction time.

5. **Overfitting and Underfitting** must be balanced through regularization, complexity control, and validation.

6. **Class Imbalance** requires appropriate metrics and handling strategies.

7. **Proper Validation** requires careful data splitting and respecting temporal order.

8. **Interpretability** is essential for trust, regulation, and debugging - choose methods appropriate for your context.

9. **Best Practices** throughout the pipeline prevent common mistakes and improve results.

10. **Domain Knowledge** should guide preprocessing, feature engineering, and interpretation.

---

## Further Reading

- Molnar, C. (2020). *Interpretable Machine Learning*. https://christophm.github.io/interpretable-ml-book/
- Scikit-learn Preprocessing Documentation: https://scikit-learn.org/stable/modules/preprocessing.html
- Scikit-learn Feature Selection Documentation: https://scikit-learn.org/stable/modules/feature_selection.html

---

## Practice Exercises

1. Why is it important to fit scalers on training data only? What happens if you fit on the entire dataset?

2. Explain the difference between one-hot encoding and label encoding. When would you use each?

3. What is data leakage? Give an example and explain how to prevent it.

4. How would you handle a classification problem with 99% class A and 1% class B?

5. Why is model interpretability important? When might you choose a less interpretable model?

6. Describe a scenario where feature engineering would significantly improve model performance.

7. What are the signs of overfitting? How would you address it?

8. Why shouldn't you use the test set for hyperparameter tuning?
