# ML Assignment – 1  
**Submitter Name:** Aasif Majeed  
**Date:** 24 May 2024  

This notebook contains **all questions (1–80)** from the ML Assignment PDF and provides clear, exam-ready answers.  
Where helpful, short **Python demo code** is included for key concepts (splits, scaling, SMOTE, interpolation, PCA, encoding, VIF, RFE).


---
## 0) Python Setup (for demo code)
> The answers are mostly theoretical. The code below is only for *demonstration*.
---


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(42)
pd.set_option("display.max_columns", 100)
pd.set_option("display.width", 140)


---
## Demo Dataset (Synthetic) for code examples
We create a small synthetic dataset to demonstrate:
- train/validation/test split
- scaling and PCA
- encoding
- imbalance handling (SMOTE)
- outlier detection
---


In [2]:
from sklearn.datasets import make_classification, make_regression

# Classification dataset with imbalance (for SMOTE + metrics)
Xc, yc = make_classification(
    n_samples=2000, n_features=12, n_informative=6, n_redundant=2,
    weights=[0.92, 0.08], flip_y=0.01, random_state=42
)

# Regression dataset (for scaling + PCA + feature selection demos)
Xr, yr = make_regression(
    n_samples=800, n_features=10, n_informative=6, noise=15.0, random_state=42
)

Xc.shape, np.bincount(yc)


((2000, 12), array([1833,  167]))

---
# Answers (1–80)
---

## Q1) Define Artificial Intelligence (AI).

**Artificial Intelligence (AI)** is a broad field of computer science focused on building systems that can perform tasks that normally require human intelligence, such as:
- perception (vision/speech),
- reasoning and decision-making,
- learning from data/experience,
- language understanding and generation,
- planning and acting in environments.

In short: **AI aims to create “intelligent behavior” in machines.**

## Q2) Explain the differences between Artificial Intelligence (AI), Machine Learning (ML), Deep Learning (DL), and Data Science (DS).

- **AI (Artificial Intelligence):** Umbrella field—any approach that makes machines behave intelligently (rule-based, search, ML, etc.).  
- **ML (Machine Learning):** Subset of AI where models *learn patterns from data* instead of being explicitly programmed.  
- **DL (Deep Learning):** Subset of ML using multi-layer neural networks; excellent for images, audio, and text at large scale.  
- **DS (Data Science):** A broader discipline focused on extracting insight/value from data using statistics, ML, visualization, and domain knowledge (often includes building dashboards, experiments, business decisions, etc.).

**Relationship:** AI ⊃ ML ⊃ DL, while DS overlaps with ML/DL but also includes analytics, experimentation, and reporting.

## Q3) How does AI differ from traditional software development?

Traditional software uses **explicit rules**: “if X then do Y”.  
AI systems often use **models** that learn behavior from data or experience.

Key differences:
- **Rule-based vs data-driven:** Traditional = fixed logic; AI/ML = learns patterns from data.
- **Deterministic vs probabilistic:** Traditional outputs are predictable; AI outputs may be probabilistic.
- **Maintenance:** Traditional changes need code updates; AI changes may need re-training / new data.
- **Testing:** Traditional tests logic branches; AI tests generalization with validation/test data.
- **Performance:** Traditional improves via better code; AI improves via better data, features, model, training.

## Q4) Provide examples of AI, ML, DL, and DS applications.

**Examples**
- **AI:** chess/Go engine, voice assistants, self-driving planning, expert systems.
- **ML:** spam detection, house price prediction, credit scoring, recommendation systems.
- **DL:** image classification, speech-to-text, large language models, medical image segmentation.
- **DS:** customer churn analysis, A/B testing, KPI dashboards, forecasting demand, root-cause analysis.

## Q5) Discuss the importance of AI, ML, DL, and DS in today's world.

Why these fields matter today:
- **Automation & productivity:** automating repetitive tasks (customer support, document processing).
- **Better decisions:** predictive analytics (risk, churn, demand forecasting).
- **Personalization:** recommendations in e-commerce/streaming.
- **Healthcare:** diagnosis assistance, drug discovery, monitoring.
- **Safety & security:** anomaly detection, fraud detection, cybersecurity.
- **Research acceleration:** faster discovery in engineering, materials, climate, etc.

## Q6) What is Supervised Learning?

**Supervised Learning** is an ML setting where we learn a mapping **X → y** using labeled examples:
- Inputs/features: **X**
- Target/label: **y**

Goal: learn a model that predicts **y** for unseen **X**.

Two main types:
- **Classification:** y is categorical (spam/not spam)
- **Regression:** y is continuous (price, temperature)

## Q7) Provide examples of Supervised Learning algorithms.

Examples of supervised algorithms:
- **Linear Regression**, **Ridge/Lasso** (regression)
- **Logistic Regression** (classification)
- **Decision Trees**, **Random Forest**
- **Gradient Boosting** (XGBoost/LightGBM/CatBoost)
- **Support Vector Machines (SVM)**
- **k-Nearest Neighbors (kNN)**
- **Neural Networks**

## Q8) Explain the process of Supervised Learning.

Typical supervised learning workflow:
1. **Collect labeled data** (X, y)
2. **Clean/prepare data** (missing values, encoding, scaling)
3. **Split data** into train/validation/test
4. **Train** model on training set
5. **Tune** hyperparameters using validation (or cross-validation)
6. **Evaluate** on test set (final unbiased estimate)
7. **Deploy** and **monitor** (data drift, performance decay)


## Q9) What are the characteristics of Unsupervised Learning?

**Unsupervised Learning** uses **unlabeled** data (only X). Common characteristics:
- Goal is to **discover structure/patterns** in data
- Outputs are not “correct labels” but groupings/representations
- Often used for **clustering**, **dimensionality reduction**, **density estimation**, **anomaly detection**
- Evaluation is harder (no true labels); uses internal metrics or downstream task performance

## Q10) Give examples of Unsupervised Learning algorithms.

Examples of unsupervised algorithms:
- **K-Means**, **Hierarchical Clustering**, **DBSCAN**
- **Gaussian Mixture Models (GMM)**
- **PCA**, **t-SNE**, **UMAP** (dimensionality reduction/visualization)
- **Association rules** (Apriori)
- **Autoencoders** (representation learning)

## Q11) Describe Semi-Supervised Learning and its significance.

**Semi-Supervised Learning** combines a small labeled dataset with a large unlabeled dataset.

Why important:
- Labels can be expensive (medical data, manual annotation)
- Unlabeled data can improve generalization

Examples:
- pseudo-labeling, consistency regularization, graph-based methods

## Q12) Explain Reinforcement Learning and its applications.

**Reinforcement Learning (RL)** is learning by interaction:
- an **agent** takes actions in an **environment**
- receives **reward** and new **state**
- objective: maximize long-term cumulative reward.

Applications:
- games (Atari, chess/Go)
- robotics control
- recommendation policies / ads bidding
- scheduling, resource allocation, operations research

## Q13) How does Reinforcement Learning differ from Supervised and Unsupervised Learning?

Differences:
- **Supervised:** learn from labeled (X, y) pairs; feedback is direct and immediate.
- **Unsupervised:** learn patterns/structure from X only; no labels.
- **RL:** learn by trial-and-error with **rewards**, where feedback can be delayed; must explore/exploit and handle sequential decisions.

## Q14) What is the purpose of the Train-Test-Validation split in machine learning?

Train/Test/Validation split is used to:
- **Train**: fit model parameters
- **Validation**: tune hyperparameters, select model, early stopping
- **Test**: final unbiased estimate of generalization performance

Purpose: avoid **overfitting** to the same data used for development.

## Q15) Explain the significance of the training set.

The training set:
- is the data the model learns from (fits weights/parameters)
- must be representative of real-world data
- quality and quantity of training data strongly affect generalization

Bad training data (biased/noisy) → bad model, regardless of algorithm.

## Q16) How do you determine the size of the training, testing, and validation sets?

Split sizes depend on dataset size, model complexity, and the variance of the evaluation metric.

Common choices:
- **70/15/15**, **80/10/10**, or **80/20** (no separate validation set when tuning with CV)

Rules of thumb:
- If data is large → a smaller test fraction is OK
- If data is small → use **k-fold cross-validation**
- For time-series → use time-based split (no shuffling)
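
As a rough illustration of the k-fold idea mentioned above, here is a minimal NumPy sketch. The helper name `kfold_indices` and the toy sizes are assumptions made for this example; it is not stratified and is not suitable for time series.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)      # shuffle once
    folds = np.array_split(idx, k)        # k nearly equal parts
    for i in range(k):
        val_idx = folds[i]                # fold i is held out
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

splits = list(kfold_indices(n_samples=10, k=5))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))   # 8 training rows, 2 validation rows per fold
```

Every row is used for validation exactly once, which is why k-fold gives a more stable estimate than a single small hold-out set.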

## Q17) What are the consequences of improper Train-Test-Validation splits?

Improper splits can cause:
- **data leakage** (train sees info from test/validation)
- overly optimistic results (model seems better than it is)
- poor real-world performance after deployment
- unstable evaluation due to too-small test set
- wrong model selection/hyperparameter tuning

## Q18) Discuss the trade-offs in selecting appropriate split ratios.

Trade-offs:
- More training data → better learning, but less reliable evaluation (small test)
- Larger test/validation → more reliable estimate, but less training data
- For high-stakes tasks → prefer larger test set or repeated CV
- For imbalanced/time-series/grouped data → use stratified/time-based/group splits

## Q19) Define model performance in machine learning.

**Model performance** = how well a model meets objectives on unseen data, such as:
- predictive accuracy / error
- calibration (probability quality)
- robustness (noise/outliers)
- fairness and stability
- latency/memory constraints (for deployment)

## Q20) How do you measure the performance of a machine learning model?

Measurement depends on task:

**Classification metrics**
- Accuracy, Precision, Recall, F1-score
- ROC-AUC, PR-AUC (for imbalance)
- Confusion matrix

**Regression metrics**
- MAE, MSE, RMSE
- R²

Also evaluate:
- Cross-validation scores
- Learning curves
- Business metrics (cost, profit)
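
The metrics listed above can be computed by hand, which helps to see what each one measures. The sketch below uses small hypothetical label and prediction arrays invented for illustration.

```python
import numpy as np

# --- Classification (hypothetical labels and predictions) ---
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# --- Regression (hypothetical targets and predictions) ---
r_true = np.array([3.0, 5.0, 2.0, 7.0])
r_pred = np.array([2.5, 5.0, 4.0, 8.0])

mae = np.mean(np.abs(r_true - r_pred))
mse = np.mean((r_true - r_pred) ** 2)
rmse = np.sqrt(mse)
r2 = 1 - np.sum((r_true - r_pred) ** 2) / np.sum((r_true - r_true.mean()) ** 2)

print(accuracy, precision, recall, f1)
print(mae, mse, rmse, r2)
```

Here accuracy is 0.7 even though only half of the positives are found (recall 0.5), which is why accuracy alone can mislead.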

## Q21) What is overfitting and why is it problematic?

**Overfitting** happens when a model learns noise and training-specific patterns, performing well on training data but poorly on unseen data.

Why problematic:
- poor generalization
- unreliable predictions in real world
- often caused by high model complexity, small data, leakage, or too much training

## Q22) Provide techniques to address overfitting.

Techniques to reduce overfitting:
- More data / data augmentation
- Regularization (L1/L2, dropout)
- Simpler model (reduce complexity)
- Early stopping
- Cross-validation
- Feature selection / dimensionality reduction
- Ensembling (bagging), pruning trees
- Proper train/val/test separation (avoid leakage)

## Q23) Explain underfitting and its implications.

**Underfitting** occurs when a model is too simple to capture the underlying pattern.
Symptoms:
- high error on training data and test data
Implications:
- model cannot learn important structure → poor predictions everywhere

## Q24) How can you prevent underfitting in machine learning models?

Prevent underfitting by:
- using a more expressive model (nonlinear, deeper, more features)
- reducing regularization if too strong
- training longer / better optimization
- adding useful features (feature engineering)
- reducing noise via better preprocessing

## Q25) Discuss the balance between bias and variance in model performance.

**Bias–Variance trade-off**
- **Bias:** error from oversimplified assumptions → underfitting
- **Variance:** error from sensitivity to training data → overfitting

Goal: choose model/regularization that balances both, minimizing test error.
Tools:
- learning curves
- cross-validation
- regularization and model complexity control

## Q26) What are the common techniques to handle missing data?

Common ways to handle missing data:
- **Remove** rows/columns (if few missing and safe)
- **Simple imputation:** mean/median (numeric), mode (categorical)
- **Advanced imputation:** KNN imputer, iterative/multivariate imputation
- **Model-based handling:** some models handle missing values natively (e.g., XGBoost)
- **Add indicator features** (missingness flag)

## Q27) Explain the implications of ignoring missing data.

Ignoring missing data can:
- reduce dataset size (if rows dropped automatically)
- bias results if missingness is not random
- break algorithms (many models cannot handle NaNs)
- distort feature distributions → wrong relationships and wrong predictions

## Q28) Discuss the pros and cons of imputation methods.

**Imputation pros**
- keeps more data (better statistical power)
- allows models requiring complete data to run
- can reduce bias vs dropping rows (depending on missing mechanism)

**Imputation cons**
- can introduce bias if imputation is unrealistic
- underestimates uncertainty (treats filled values as true)
- complex methods may be slow and may leak information if fit on full dataset (should fit only on training data)
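
To make the leakage point concrete, the sketch below learns column means from the training rows only and reuses them for the test rows. The tiny arrays are hypothetical.

```python
import numpy as np

# Hypothetical data with missing values (NaN).
X_train = np.array([[1.0, 10.0],
                    [np.nan, 20.0],
                    [3.0, np.nan]])
X_test = np.array([[np.nan, np.nan]])

# "Fit": compute the statistics on the training set only.
train_means = np.nanmean(X_train, axis=0)

# "Transform": fill NaNs in both sets with the training means.
X_train_imp = np.where(np.isnan(X_train), train_means, X_train)
X_test_imp = np.where(np.isnan(X_test), train_means, X_test)

print(train_means)
print(X_test_imp)
```

The test set never contributes to the fill values, so the evaluation stays honest.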

## Q29) How does missing data affect model performance?

Missing data affects performance by:
- reducing effective training size
- increasing noise/bias in estimated relationships
- causing models to learn wrong patterns if missingness is correlated with target
- producing unpredictable inference if missing patterns differ between train and test (missingness drift)

## Q30) Define imbalanced data in the context of machine learning.

**Imbalanced data** means class frequencies are very unequal (e.g., 95% class 0, 5% class 1).
Common in fraud detection, rare disease diagnosis, defect detection.

## Q31) Discuss the challenges posed by imbalanced data.

Challenges of imbalanced data:
- accuracy becomes misleading (predict majority class and get high accuracy)
- minority class recall/precision often poor
- decision boundary biased toward majority class
- model may not learn minority patterns due to few examples

## Q32) What techniques can be used to address imbalanced data?

Techniques for imbalance:
- **Resampling:** oversampling minority, undersampling majority
- **SMOTE/ADASYN** synthetic sampling
- **Class weights / cost-sensitive learning**
- **Threshold tuning**
- **Use better metrics:** PR-AUC, F1, recall, balanced accuracy
- **Ensembles:** Balanced Random Forest, EasyEnsemble

## Q33) Explain the process of up-sampling and down-sampling.

**Up-sampling (oversampling):**
- increase minority examples by duplicating or synthesizing new ones

**Down-sampling (undersampling):**
- reduce majority examples by removing samples

Workflow:
1. Split train/test first (avoid leakage)
2. Apply resampling on training set only
3. Train model, evaluate on untouched test set
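
A minimal sketch of random up-sampling and down-sampling, applied to a training set only. The data and class sizes (90 vs 10) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical imbalanced training set: 90 majority rows, 10 minority rows.
X_train = rng.normal(size=(100, 3))
y_train = np.array([0] * 90 + [1] * 10)

maj_idx = np.flatnonzero(y_train == 0)
min_idx = np.flatnonzero(y_train == 1)

# Up-sampling: draw minority rows with replacement until the classes match.
extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
up_idx = np.concatenate([maj_idx, min_idx, extra])
X_up, y_up = X_train[up_idx], y_train[up_idx]

# Down-sampling: keep a random subset of majority rows (no replacement).
keep = rng.choice(maj_idx, size=len(min_idx), replace=False)
down_idx = np.concatenate([keep, min_idx])
X_down, y_down = X_train[down_idx], y_train[down_idx]

print(np.bincount(y_up), np.bincount(y_down))
```

Up-sampling keeps all 90 majority rows and repeats minority rows; down-sampling discards 80 majority rows.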

## Q34) When would you use up-sampling versus down-sampling?

Use **up-sampling** when:
- dataset is small and you cannot afford removing majority data
- minority class is very rare and important

Use **down-sampling** when:
- dataset is huge (majority class has many redundant samples)
- training time is an issue
- you can remove some majority examples without losing important information

## Q35) What is SMOTE and how does it work?

**SMOTE (Synthetic Minority Over-sampling Technique)** creates synthetic minority points by:
1. for each minority sample, find k nearest minority neighbors
2. pick a neighbor and create a synthetic point along the line segment between them

This increases minority diversity without exact duplication.
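
The two steps above can be sketched in plain NumPy. This is a simplified illustration, not the reference implementation from any library; the function name `smote_sketch` and its parameters are assumptions for this example.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows from the minority samples X_min."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    k = min(k, n - 1)  # cannot have more neighbours than other points

    # Step 1: k nearest minority neighbours of every minority sample.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)               # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Step 2: pick a base sample, one of its neighbours, and a random
    # position on the line segment between them.
    base = rng.integers(0, n, size=n_new)
    pick = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                  # values in [0, 1)
    return X_min[base] + gap * (X_min[pick] - X_min[base])

# Hypothetical minority class with 15 samples and 2 features.
X_min = np.random.default_rng(1).normal(size=(15, 2))
synth = smote_sketch(X_min, n_new=40, k=5)
print(synth.shape)
```

Because each synthetic point is a convex combination of two real minority points, it always lies inside the minority class's bounding box.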

## Q36) Explain the role of SMOTE in handling imbalanced data.

Role of SMOTE:
- balances training data so classifier learns minority patterns better
- often improves recall/F1 for minority class
- should be applied **only on training split** to avoid leakage

## Q37) Discuss the advantages and limitations of SMOTE.

**Advantages**
- reduces overfitting compared to simple duplication
- improves minority decision region learning

**Limitations**
- can create overlapping classes (more false positives)
- sensitive to noise/outliers (may generate bad synthetic points)
- does not respect complex manifolds; may be unrealistic for some domains

## Q38) Provide examples of scenarios where SMOTE is beneficial.

SMOTE is useful when:
- minority class has too few samples (fraud, rare events)
- classes are reasonably separable but minority boundary is under-learned
Examples:
- credit card fraud detection
- manufacturing defect detection
- medical diagnosis with rare positives

## Q39) Define data interpolation and its purpose.

**Data interpolation** estimates missing values between known data points.
Purpose:
- fill missing values in time series or continuous measurements
- create smoother signals (e.g., sensor data)
- align data to common sampling rate

## Q40) What are the common methods of data interpolation?

Common interpolation methods:
- **Linear interpolation**
- **Polynomial interpolation**
- **Spline interpolation** (cubic splines)
- **Nearest neighbor interpolation**
- **Time-series methods** (forward/backward fill; not true interpolation but often used)
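
A short sketch contrasting linear interpolation with forward fill on a hypothetical series. The forward-fill trick shown here assumes the first value is observed.

```python
import numpy as np

# Hypothetical evenly spaced series with gaps (NaN).
t = np.arange(6, dtype=float)
y = np.array([10.0, np.nan, np.nan, 16.0, np.nan, 20.0])
known = ~np.isnan(y)

# Linear interpolation: straight line between neighbouring known points.
y_linear = np.interp(t, t[known], y[known])

# Forward fill: carry the last observed value forward (uses past values only).
last_seen = np.maximum.accumulate(np.where(known, np.arange(len(y)), 0))
y_ffill = y[last_seen]

print(y_linear)  # [10. 12. 14. 16. 18. 20.]
print(y_ffill)   # [10. 10. 10. 16. 16. 20.]
```

Linear interpolation uses the next known point, so it looks ahead in time; forward fill does not.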

## Q41) Discuss the implications of using data interpolation in machine learning.

Implications of interpolation:
- can improve models if missingness is small and data is smooth
- can introduce bias (fabricated values) if behavior is nonlinear or abrupt
- may leak future information in time series if not careful (e.g., using future points to fill past)
Best practice:
- in time series, fill each gap using only values available at that time (no look-ahead), and interpolate after the train/test split so test data does not influence training values

## Q42) What are outliers in a dataset?

**Outliers** are data points that are unusually far from the majority of observations.
They may occur due to:
- measurement errors
- rare events
- data entry issues
- true but extreme behavior

## Q43) Explain the impact of outliers on machine learning models.

Impact of outliers:
- can distort mean/variance and correlation
- can heavily affect models that minimize squared error or rely on distances (linear regression, kNN, SVM)
- can cause unstable decision boundaries
- may reduce model accuracy and generalization if they are noise

## Q44) Discuss techniques for identifying outliers.

Outlier detection techniques:
- **Z-score** / standard deviation rule
- **IQR method** (boxplot fences)
- **Isolation Forest**
- **Local Outlier Factor (LOF)**
- **DBSCAN** (points in low-density region)
- visualization: scatter plots, box plots
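
The first two rules can be sketched in a few lines of NumPy. The sample values and the thresholds (|z| > 3, 1.5 × IQR) are conventional choices used here for illustration.

```python
import numpy as np

# Hypothetical measurements with one extreme value.
x = np.array([10.0, 11.0, 12.0, 11.0, 10.0, 12.0,
              11.0, 13.0, 12.0, 11.0, 10.0, 60.0])

# Z-score rule: flag points more than 3 standard deviations from the mean.
z = (x - x.mean()) / x.std()
z_mask = np.abs(z) > 3

# IQR rule: flag points outside the boxplot fences.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_mask = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

print(x[z_mask])    # [60.]
print(x[iqr_mask])  # [60.]
```

Note that the outlier itself inflates the mean and standard deviation used by the z-score rule, so the IQR rule tends to be more robust on small samples.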

## Q45) How can outliers be handled in a dataset?

Handling outliers:
- verify and correct data entry errors
- remove outliers (only if they are incorrect/noise)
- cap/winsorize (clip to percentile bounds)
- transform features (log/Box-Cox)
- use robust models/metrics (Huber loss, median-based methods)
- separate rare-event class if outliers represent a real phenomenon

## Q46) Compare and contrast Filter, Wrapper, and Embedded methods for feature selection.

**Filter methods:** select features using statistics independent of model (fast).
- e.g., correlation, chi-square, mutual information.

**Wrapper methods:** use a model to evaluate feature subsets (more accurate but expensive).
- e.g., forward selection, backward elimination, recursive feature elimination (RFE).

**Embedded methods:** feature selection happens inside model training.
- e.g., Lasso (L1), tree-based feature importance.

## Q47) Provide examples of algorithms associated with each method.

Examples:
- **Filter:** Pearson correlation filter, Chi-square test, ANOVA F-test, mutual information.
- **Wrapper:** RFE with Logistic Regression/SVM, forward/backward selection using validation score.
- **Embedded:** Lasso/ElasticNet, Decision Trees/Random Forest feature importance, Gradient Boosting importance.

## Q48) Discuss the advantages and disadvantages of each feature selection method.

Pros/cons:
- **Filter**
  - ✅ fast, simple, works well for initial screening
  - ❌ may ignore feature interactions with the model
- **Wrapper**
  - ✅ can capture interactions, often better subset quality
  - ❌ computationally expensive; risk of overfitting to validation
- **Embedded**
  - ✅ good balance (selection during training), efficient
  - ❌ selection depends on model assumptions (e.g., Lasso for linear relations)

## Q49) Explain the concept of feature scaling.

**Feature scaling** transforms numeric features to comparable ranges.
Why needed:
- distance-based models (kNN, K-means) depend on scale
- gradient-based optimization (neural nets) converges faster
- regularization depends on feature magnitude
Tree-based models (Decision Trees/Random Forest) split on thresholds, so scaling is usually unnecessary for them, though it may still be kept for consistency in a shared pipeline.

## Q50) Describe the process of standardization.

**Standardization (z-score scaling):**
\[ z = \frac{x - \mu}{\sigma} \]
- mean becomes 0
- standard deviation becomes 1
Useful for many ML models (SVM, Logistic Regression, PCA).
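
A minimal sketch of standardization in which the mean and standard deviation are learned from the training data and then reused for new data. The arrays are hypothetical.

```python
import numpy as np

# Hypothetical features on very different scales.
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
X_test = np.array([[2.5, 250.0], [5.0, 500.0]])

mu = X_train.mean(axis=0)      # per-feature mean from training data
sigma = X_train.std(axis=0)    # per-feature population std (ddof=0)

Z_train = (X_train - mu) / sigma
Z_test = (X_test - mu) / sigma  # reuse training statistics, do not refit

print(Z_train.mean(axis=0), Z_train.std(axis=0))
print(Z_test)
```

After scaling, both columns of `Z_train` have mean 0 and standard deviation 1, so neither dominates distance or gradient computations.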

## Q51) How does mean normalization differ from standardization?

**Mean normalization**:
\[ x' = \frac{x - \mu}{\max(x)-\min(x)} \]
i.e., center on the mean and divide by the range, so values fall within [-1, 1].

**Standardization**:
\[ x' = \frac{x - \mu}{\sigma} \]

Difference:
- mean normalization uses **range**; standardization uses **std dev**
- standardization is more common for ML pipelines and PCA

## Q52) Discuss the advantages and disadvantages of Min-Max scaling.

**Min-Max scaling**:
\[ x' = \frac{x - x_{min}}{x_{max} - x_{min}} \] (maps to [0,1])

✅ Pros:
- preserves original shape of distribution
- good for bounded features and neural nets

❌ Cons:
- very sensitive to outliers (they set min/max)
- if new data has values beyond training min/max, scaling can go outside [0,1]
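
Both drawbacks are easy to reproduce. In this sketch the value 100 plays the role of an outlier, and the new values 0 and 150 are hypothetical points outside the training range.

```python
import numpy as np

x_train = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # 100 is an outlier
x_min, x_max = x_train.min(), x_train.max()

scaled = (x_train - x_min) / (x_max - x_min)
print(scaled)        # the four ordinary values are squeezed close to 0

# New data outside the training min/max leaves the [0, 1] interval.
x_new = np.array([0.0, 150.0])
scaled_new = (x_new - x_min) / (x_max - x_min)
print(scaled_new)
```

A common remedy is to clip scaled values to [0, 1] or to cap outliers before scaling.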

## Q53) What is the purpose of unit vector scaling?

**Unit vector scaling (normalization)** scales each sample (row) to have unit norm:
\[ x' = \frac{x}{\|x\|} \]

Purpose:
- useful when direction matters more than magnitude
- common in text features (TF-IDF), cosine similarity, nearest-neighbor retrieval
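
A small sketch of row-wise L2 normalization on hypothetical vectors. After scaling, a dot product between rows equals their cosine similarity.

```python
import numpy as np

# Rows 0 and 1 point in the same direction but differ in magnitude.
X = np.array([[3.0, 4.0],
              [6.0, 8.0],
              [1.0, 0.0]])

norms = np.linalg.norm(X, axis=1, keepdims=True)  # L2 norm of each row
X_unit = X / norms                                 # every row now has norm 1

cos_sim = X_unit @ X_unit.T                        # pairwise cosine similarity
print(X_unit)
print(cos_sim)
```

Rows 0 and 1 become identical after scaling, which shows that only direction is kept.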

## Q54) Define Principal Component Analysis (PCA).

**Principal Component Analysis (PCA)** is a dimensionality reduction technique that transforms features into a new set of orthogonal axes (principal components) that capture maximum variance in the data.

## Q55) Explain the steps involved in PCA.

Steps in PCA:
1. center the data (required) and usually standardize it, so that features on large scales do not dominate
2. compute covariance matrix (or SVD directly)
3. compute eigenvalues/eigenvectors (or singular vectors)
4. sort components by eigenvalues (variance explained)
5. choose top k components
6. project data onto these components
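
The six steps can be followed directly with NumPy. This is a sketch on hypothetical data (three features, two of them strongly correlated) using the eigen-decomposition route.

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([base,
               2 * base + 0.1 * rng.normal(size=(200, 1)),  # nearly redundant
               rng.normal(size=(200, 1))])

# 1. Standardize each feature.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the standardized data.
C = np.cov(Z, rowvar=False)

# 3. Eigen-decomposition (eigh is meant for symmetric matrices).
eigvals, eigvecs = np.linalg.eigh(C)

# 4. Sort components by eigenvalue, largest first.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
explained = eigvals / eigvals.sum()   # fraction of variance per component

# 5-6. Keep the top k components and project the data onto them.
k = 2
X_pca = Z @ eigvecs[:, :k]

print(explained)
print(X_pca.shape)
```

The variance of each projected column equals the corresponding eigenvalue, and the projected columns are uncorrelated.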

## Q56) Discuss the significance of eigenvalues and eigenvectors in PCA.

- **Eigenvectors** define the direction of principal components (new axes).
- **Eigenvalues** tell how much variance is captured along each eigenvector.
Bigger eigenvalue ⇒ more important component.

## Q57) How does PCA help in dimensionality reduction?

PCA reduces dimensionality by:
- projecting data onto a smaller number of components that capture most variance
Benefits:
- less overfitting (fewer features)
- faster training/inference
- noise reduction
- visualization in 2D/3D

## Q58) Define data encoding and its importance in machine learning.

**Data encoding** converts categorical variables to numeric form so ML algorithms can use them.
Importance:
- most ML models require numeric input
- correct encoding prevents introducing false order relationships and improves performance

## Q59) Explain Nominal Encoding and provide an example.

**Nominal encoding** applies to categories with **no natural order** (e.g., color).
Common approach: one-hot encoding.

Example:
- Color ∈ {Red, Blue, Green}
- One-hot → Red=[1,0,0], Blue=[0,1,0], Green=[0,0,1]

## Q60) Discuss the process of One Hot Encoding.

**One-Hot Encoding** creates binary indicator columns for each category.
Steps:
1. find unique categories in a feature
2. create one column per category
3. put 1 where the row has that category, else 0
Often drop one column to avoid dummy variable trap in linear regression (multicollinearity).
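
A minimal NumPy sketch of the three steps, using the colour example. Note that `np.unique` sorts categories alphabetically, so the column order here is Blue, Green, Red.

```python
import numpy as np

colors = np.array(["Red", "Blue", "Green", "Blue"])

# Step 1: unique categories and the integer code of each row.
categories, codes = np.unique(colors, return_inverse=True)

# Steps 2-3: one column per category, 1 where the row has that category.
one_hot = np.eye(len(categories), dtype=int)[codes]

# Optional: drop the first column to avoid the dummy variable trap.
one_hot_dropped = one_hot[:, 1:]

print(categories)   # ['Blue' 'Green' 'Red']
print(one_hot)
```

Each row contains exactly one 1; after dropping the first column, the dropped category is represented by an all-zero row.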

## Q61) How do you handle multiple categories in One Hot Encoding?

Handling many categories:
- **Top-K + 'Other'**: keep most frequent K categories, rest → Other
- **Hashing trick** (feature hashing)
- **Target/mean encoding** (careful with leakage)
- **Embeddings** (deep learning)
- Reduce cardinality via grouping (domain knowledge)

## Q62) Explain Mean Encoding and its advantages.

**Mean (Target) Encoding** replaces each category with the mean of the target for that category.
Advantages:
- handles high-cardinality categories with a single numeric column
- often improves performance in tree/linear models
Risks:
- leakage and overfitting if computed on full dataset
Best practice:
- compute the encoding out-of-fold (CV folds) on the training data only, add smoothing toward the global mean, and apply the learned mapping unchanged to validation/test data
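
A small sketch of mean encoding with smoothing. The data, the smoothing strength `m`, and the fallback for unseen categories are assumptions chosen for illustration; out-of-fold computation is omitted to keep the example short.

```python
import numpy as np

# Hypothetical training data: a categorical feature and a binary target.
city = np.array(["A", "A", "A", "B", "B", "C"])
y = np.array([1, 1, 0, 0, 0, 1])

global_mean = y.mean()
m = 2.0   # smoothing strength (hypothetical value)

encoding = {}
for c in np.unique(city):
    rows = y[city == c]
    # Shrink the category mean toward the global mean;
    # categories with few rows are shrunk the most.
    encoding[str(c)] = (rows.sum() + m * global_mean) / (len(rows) + m)

# Apply the mapping; unseen categories fall back to the global mean.
new_values = ["A", "D"]
encoded_new = [encoding.get(v, global_mean) for v in new_values]

print(encoding)
print(encoded_new)
```

Category C has a raw mean of 1.0 from a single row, but smoothing pulls it down to about 0.67, which limits overfitting to rare categories.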

## Q63) Provide examples of Ordinal Encoding and Label Encoding.

- **Ordinal encoding:** categories mapped to integers in natural order.
  Example: size {Small, Medium, Large} → {0,1,2}
- **Label encoding:** assigns integer IDs to categories (often for target labels in classification).
  Example: {cat, dog, fish} → {0,1,2}
Caution: label encoding for input features can create false ordering unless the feature is truly ordinal.

## Q64) What is Target Guided Ordinal Encoding and how is it used?

**Target Guided Ordinal Encoding** assigns order based on target statistics.
Example:
- categories are ordered by mean target value and then encoded 0,1,2...
Useful when categories have meaningful relationship to target.
Must be done carefully with cross-validation to prevent leakage.

## Q65) Define covariance and its significance in statistics.

**Covariance** measures how two variables vary together:
- positive covariance: both increase together
- negative covariance: one increases while the other decreases
Magnitude depends on units, so covariance is not normalized.
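
The unit dependence is easy to check numerically. The heights and weights below are hypothetical.

```python
import numpy as np

height_m = np.array([1.60, 1.70, 1.75, 1.80, 1.90])
weight_kg = np.array([55.0, 65.0, 70.0, 80.0, 90.0])
height_cm = height_m * 100   # same data, different unit

cov_m = np.cov(height_m, weight_kg)[0, 1]    # sample covariance (divides by n - 1)
cov_cm = np.cov(height_cm, weight_kg)[0, 1]

corr_m = np.corrcoef(height_m, weight_kg)[0, 1]
corr_cm = np.corrcoef(height_cm, weight_kg)[0, 1]

print(cov_m, cov_cm)     # covariance grows 100x with the unit change
print(corr_m, corr_cm)   # correlation is unchanged
```

This is why correlation, which divides covariance by both standard deviations, is preferred for comparing relationships across features.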

## Q66) Explain the process of correlation check.

Correlation check process:
1. select numeric features
2. compute correlation matrix (Pearson for linear, Spearman for monotonic)
3. visualize via heatmap (or print matrix)
4. detect multicollinearity (very high correlations)
5. decide: drop/merge features, use PCA, or regularize model

## Q67) What is the Pearson Correlation Coefficient?

**Pearson correlation coefficient (r)** measures linear relationship between two variables:
- ranges from -1 to +1
- +1 perfect positive linear
- -1 perfect negative linear
- 0 no linear relationship (may still have nonlinear relationship)

## Q68) How does Spearman's Rank Correlation differ from Pearson's Correlation?

**Spearman correlation**:
- based on **ranks** (monotonic relationship)
- less sensitive to outliers and non-normal distributions
**Pearson**:
- uses raw values (linear relationship)
- more sensitive to outliers
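
A sketch comparing the two coefficients on a monotonic but nonlinear relationship. The helper functions are written for this example, and the ranking helper assumes there are no tied values.

```python
import numpy as np

def pearson(a, b):
    """Pearson r: covariance divided by the product of standard deviations."""
    a, b = a - a.mean(), b - b.mean()
    return float((a @ b) / np.sqrt((a @ a) * (b @ b)))

def ranks(a):
    """Rank of each value (0 = smallest); assumes no ties."""
    return np.argsort(np.argsort(a)).astype(float)

def spearman(a, b):
    """Spearman rho: Pearson r computed on the ranks."""
    return pearson(ranks(a), ranks(b))

x = np.arange(1.0, 11.0)
y = np.exp(x)            # always increasing, but far from a straight line

print(pearson(x, y))     # clearly below 1
print(spearman(x, y))    # 1.0, because the ordering is perfectly preserved
```

Spearman only looks at ordering, so any strictly increasing relationship scores 1, while Pearson penalizes the curvature.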

## Q69) Discuss the importance of Variance Inflation Factor (VIF) in feature selection.

**VIF (Variance Inflation Factor)** detects multicollinearity among features.
- High VIF means a feature is highly explained by other features (redundant).
Why important:
- multicollinearity inflates variance of coefficient estimates in linear models
- makes interpretation unstable (coefficients can flip signs)
Common thresholds: VIF > 5 or 10 indicates problematic multicollinearity.
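
VIF for feature j is 1 / (1 - R²_j), where R²_j comes from regressing feature j on all other features. The sketch below implements this with least squares; the function name `vif` and the synthetic data are assumptions for illustration.

```python
import numpy as np

def vif(X):
    """Return the VIF of every column of X."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress column j on the remaining columns plus an intercept.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)               # independent feature
c = a + 0.1 * rng.normal(size=500)     # nearly a copy of a

v = vif(np.column_stack([a, b, c]))
print(v)
```

The nearly duplicated features `a` and `c` get very large VIFs, while the independent feature `b` stays close to 1.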

## Q70) Define feature selection and its purpose.

**Feature selection** is choosing a subset of relevant features for training.
Purpose:
- reduce overfitting
- improve generalization
- reduce training time
- improve interpretability
- remove redundant/noisy variables

## Q71) Explain the process of Recursive Feature Elimination.

**Recursive Feature Elimination (RFE)**:
1. train a model
2. rank features by importance (weights/coefs)
3. remove the least important features
4. repeat until desired number of features remains
Often combined with cross-validation (RFECV).
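
The loop above can be sketched with ordinary least squares as the model. The function name `rfe_sketch` is hypothetical, one feature is removed per round, and the features are assumed to be on a common scale so that coefficient sizes are comparable.

```python
import numpy as np

def rfe_sketch(X, y, n_keep):
    """Drop the feature with the smallest |coefficient| until n_keep remain."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        # 1. Train a model on the remaining features.
        coef, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        # 2-3. Rank by absolute coefficient and remove the weakest feature.
        weakest = int(np.argmin(np.abs(coef)))
        remaining.pop(weakest)
    return remaining

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))   # synthetic, zero-mean, unit-scale features
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=300)

kept = rfe_sketch(X, y, n_keep=2)
print(kept)
```

Only features 0 and 3 drive the target in this synthetic setup, so they are the ones expected to survive elimination.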

## Q72) How does Backward Elimination work?

**Backward elimination** (wrapper method):
1. start with all features
2. train model (often linear regression)
3. remove the least significant feature (highest p-value), or the feature whose removal hurts the validation score least
4. repeat until all remaining features are significant or performance stops improving

## Q73) Discuss the advantages and limitations of Forward Elimination.

**Forward selection** (the usual name for the procedure the question calls "forward elimination"):
- start with zero features
- add the feature that improves performance most at each step

✅ Advantages:
- cheaper than testing all subsets
- can work well when few features matter

❌ Limitations:
- greedy (may miss optimal subset)
- can be slow when many features
- results depend on evaluation metric and split
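
A minimal sketch of greedy forward selection, scored by validation MSE of a least-squares model. The helper names, the synthetic data, and the stopping rule are assumptions made for this example.

```python
import numpy as np

def val_mse(X_tr, y_tr, X_va, y_va, cols):
    """Fit least squares (with intercept) on cols; return validation MSE."""
    A_tr = np.column_stack([np.ones(len(X_tr)), X_tr[:, cols]])
    A_va = np.column_stack([np.ones(len(X_va)), X_va[:, cols]])
    coef, *_ = np.linalg.lstsq(A_tr, y_tr, rcond=None)
    return float(np.mean((y_va - A_va @ coef) ** 2))

def forward_selection(X_tr, y_tr, X_va, y_va, max_features):
    selected = []
    best = float(np.mean((y_va - y_tr.mean()) ** 2))   # baseline: predict the mean
    while len(selected) < max_features:
        candidates = [j for j in range(X_tr.shape[1]) if j not in selected]
        scores = {j: val_mse(X_tr, y_tr, X_va, y_va, selected + [j])
                  for j in candidates}
        j_best = min(scores, key=scores.get)           # greedy choice
        if scores[j_best] >= best:                     # stop if nothing helps
            break
        selected.append(j_best)
        best = scores[j_best]
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = 4.0 * X[:, 1] + 2.0 * X[:, 4] + 0.5 * rng.normal(size=400)
X_tr, X_va, y_tr, y_va = X[:300], X[300:], y[:300], y[300:]

selected = forward_selection(X_tr, y_tr, X_va, y_va, max_features=2)
print(selected)
```

Features are returned in the order they were added, so the strongest single predictor comes first.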

## Q74) What is feature engineering and why is it important?

**Feature engineering** is creating/transforming features to make patterns easier for a model to learn.
Why important:
- boosts performance even with simple models
- captures domain knowledge
- can reduce noise and improve generalization

## Q75) Discuss the steps involved in feature engineering.

Steps in feature engineering:
1. understand data + domain + target
2. clean data (missing, outliers)
3. transform variables (log, scaling)
4. encode categoricals
5. create interactions / aggregates / time features
6. validate with cross-validation
7. monitor for leakage and stability

## Q76) Provide examples of feature engineering techniques.

Examples:
- log transform for skewed variables
- binning (age groups)
- interaction features (A*B, ratios)
- time-based features (day of week, rolling mean)
- text features (TF-IDF)
- polynomial features
- group aggregates (mean per user/category)

## Q77) How does feature selection differ from feature engineering?

Difference:
- **Feature engineering:** create or transform features (new information).
- **Feature selection:** choose which features to keep (reduce dimensionality).
Often used together: engineer useful features, then select the best subset.

## Q78) Explain the importance of feature selection in machine learning pipelines.

Importance of feature selection in pipelines:
- reduces dimensionality and computation
- improves generalization by removing noise
- helps interpretability for stakeholders
- prevents multicollinearity issues in linear models
- can improve stability and deployment speed

## Q79) Discuss the impact of feature selection on model performance.

Impact on performance:
- can increase accuracy by removing irrelevant features
- reduces overfitting risk
- may decrease performance if important features are removed incorrectly
Therefore, selection should be validated with cross-validation and stable metrics.

## Q80) How do you determine which features to include in a machine-learning model?

How to decide which features to include:
- start with domain knowledge + data understanding
- remove leakage features (future info)
- handle missing/outliers and encode categoricals properly
- check correlation and VIF (multicollinearity)
- use feature importance (tree models), coefficients (linear models), SHAP
- perform selection with CV (RFE, embedded methods)
- compare models with/without features using validation metrics
- keep features that improve performance and are stable/available at inference time

---
# Optional Demo Code (supports multiple questions)
These code cells demonstrate key ML preprocessing and evaluation concepts mentioned in the answers.

---


## Demo 1) Train–Validation–Test Split (supports Q14–Q18)

We split data in three parts:
- training for fitting
- validation for tuning
- test for final evaluation


In [3]:
from sklearn.model_selection import train_test_split

# Step 1: train+val vs test
X_trainval, X_test, y_trainval, y_test = train_test_split(
    Xc, yc, test_size=0.20, stratify=yc, random_state=42
)

# Step 2: train vs val
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42
)  # 0.25 of 0.80 = 0.20 => 60/20/20 split

print("Train:", X_train.shape, "Val:", X_val.shape, "Test:", X_test.shape)
print("Class counts (train):", np.bincount(y_train))
print("Class counts (val)  :", np.bincount(y_val))
print("Class counts (test) :", np.bincount(y_test))


Train: (1200, 12) Val: (400, 12) Test: (400, 12)
Class counts (train): [1099  101]
Class counts (val)  : [367  33]
Class counts (test) : [367  33]


## Demo 2) Overfitting vs Underfitting (supports Q21–Q25)

We train decision trees with different depths and compare train vs validation accuracy.
- Very small depth → underfit
- Very large depth → overfit


In [4]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

depths = [1, 2, 3, 5, 10, None]
rows = []

for d in depths:
    clf = DecisionTreeClassifier(max_depth=d, random_state=42)
    clf.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, clf.predict(X_train))
    val_acc = accuracy_score(y_val, clf.predict(X_val))
    rows.append({"max_depth": d, "train_acc": train_acc, "val_acc": val_acc})

pd.DataFrame(rows)


| | max_depth | train_acc | val_acc |
|---|---|---|---|
| 0 | 1.0 | 0.930833 | 0.91 |
| 1 | 2.0 | 0.936667 | 0.925 |
| 2 | 3.0 | 0.954167 | 0.9375 |
| 3 | 5.0 | 0.971667 | 0.9475 |
| 4 | 10.0 | 0.995833 | 0.925 |
| 5 | NaN | 1.0 | 0.9225 |


## Demo 3) Missing Data Handling (supports Q26–Q29)

We create missing values artificially and show:
- mean imputation
- median imputation
- adding a missingness indicator


In [5]:
from sklearn.impute import SimpleImputer

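# Copy first: a DataFrame built from a NumPy array can share its memory,
# so writing NaN into Xr_df would otherwise also write NaN into Xr
Xr_df = pd.DataFrame(Xr.copy(), columns=[f"f{i}" for i in range(Xr.shape[1])])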

# Create missingness in two columns
mask = np.random.rand(*Xr_df[["f0","f1"]].shape) < 0.10
Xr_df.loc[mask[:,0], "f0"] = np.nan
Xr_df.loc[mask[:,1], "f1"] = np.nan

print("Missing values count:")
print(Xr_df[["f0","f1"]].isna().sum())

# Mean imputation
imp_mean = SimpleImputer(strategy="mean")
X_imp_mean = imp_mean.fit_transform(Xr_df)

# Add missing indicators
X_ind = Xr_df.copy()
X_ind["f0_missing"] = X_ind["f0"].isna().astype(int)
X_ind["f1_missing"] = X_ind["f1"].isna().astype(int)

X_imp_with_ind = SimpleImputer(strategy="median").fit_transform(X_ind)

print("Imputed shapes:", X_imp_mean.shape, X_imp_with_ind.shape)


Missing values count:
f0    96
f1    75
dtype: int64
Imputed shapes: (800, 10) (800, 12)
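
In a real workflow the imputer would usually be fitted on the training rows only. The sketch below shows one way to do that with a pipeline, using the `add_indicator=True` option of `SimpleImputer` to append the missingness flags. The synthetic data, coefficients, and 10% missing rate are assumptions.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_m = rng.normal(size=(200, 3))
y_m = X_m @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)
X_m[rng.random(X_m.shape) < 0.10] = np.nan  # knock out roughly 10% of the entries

X_tr, X_te, y_tr, y_te = train_test_split(X_m, y_m, test_size=0.25, random_state=0)

# Medians are learned from X_tr only, then reused unchanged on X_te
model = make_pipeline(
    SimpleImputer(strategy="median", add_indicator=True),
    LinearRegression(),
)
model.fit(X_tr, y_tr)

print("Medians learned from training rows:", model.named_steps["simpleimputer"].statistics_)
print("Test R^2:", round(model.score(X_te, y_te), 3))
```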


## Demo 4) Imbalanced Data + SMOTE (supports Q30–Q38)

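We compare the class distribution of the training set before and after SMOTE. SMOTE is applied to the training split only, so the validation and test sets keep the original class balance.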

> If `imblearn` is not installed, run: `pip install imbalanced-learn`


In [6]:
from collections import Counter

print("Before SMOTE (train):", Counter(y_train))

try:
    from imblearn.over_sampling import SMOTE
    sm = SMOTE(random_state=42, k_neighbors=5)
    X_train_sm, y_train_sm = sm.fit_resample(X_train, y_train)
    print("After SMOTE (train): ", Counter(y_train_sm))
except ImportError as e:  # only a missing package should skip the demo
    print("SMOTE demo skipped:", e)
    print("Install with: pip install imbalanced-learn")


Before SMOTE (train): Counter({0: 1099, 1: 101})
After SMOTE (train):  Counter({0: 1099, 1: 1099})
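
To make the idea behind SMOTE concrete, here is a simplified sketch of its core step: a synthetic point is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. This is not the `imblearn` implementation; `smote_like_sample` and the toy minority points are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X_min = rng.normal(loc=2.0, scale=0.5, size=(20, 2))  # hypothetical minority-class points

k = 5
# k + 1 neighbours because each point is returned as its own nearest neighbour
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
_, idx = nn.kneighbors(X_min)

def smote_like_sample(n_new):
    """Create n_new synthetic points by interpolating towards a random neighbour."""
    base = rng.integers(0, len(X_min), size=n_new)           # random minority samples
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]    # skip column 0 (the point itself)
    gap = rng.random((n_new, 1))                             # position along the segment, in [0, 1)
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

X_new = smote_like_sample(30)
print("Synthetic points:", X_new.shape)
```

Since every synthetic point lies between two real minority points, SMOTE fills in the minority region and does not duplicate rows.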


## Demo 5) Data Interpolation (supports Q39–Q41)

We create a small time series with missing values and fill them using linear and spline interpolation.


In [7]:
ts = pd.Series([10, np.nan, 14, np.nan, np.nan, 25, 28, np.nan, 35])
print("Original:")
display(ts)

print("Linear interpolation:")
display(ts.interpolate(method="linear"))

print("Spline interpolation (order=2) (requires scipy):")
try:
    display(ts.interpolate(method="spline", order=2))
except Exception as e:
    print("Spline interpolation skipped:", e)


Original:


0    10.0
1     NaN
2    14.0
3     NaN
4     NaN
5    25.0
6    28.0
7     NaN
8    35.0
dtype: float64

Linear interpolation:


0    10.000000
1    12.000000
2    14.000000
3    17.666667
4    21.333333
5    25.000000
6    28.000000
7    31.500000
8    35.000000
dtype: float64

Spline interpolation (order=2) (requires scipy):


0    10.000000
1    12.176871
2    14.000000
3    17.928571
4    21.059524
5    25.000000
6    28.000000
7    31.472789
8    35.000000
dtype: float64
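
The linear values can be checked by hand. Positions 3 and 4 lie between the known points (2, 14) and (5, 25), so each step adds (25 - 14) / (5 - 2) = 11/3. A short sketch of that arithmetic, cross-checked with `np.interp`:

```python
import numpy as np

x_known = np.array([2, 5])         # positions of the surrounding known values
y_known = np.array([14.0, 25.0])   # the known values themselves
x_gap = np.array([3, 4])           # positions to fill

slope = (y_known[1] - y_known[0]) / (x_known[1] - x_known[0])  # 11/3 per step
filled = y_known[0] + slope * (x_gap - x_known[0])

print(filled)                                # [17.66666667 21.33333333]
print(np.interp(x_gap, x_known, y_known))    # same values from NumPy
```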

## Demo 6) Outlier Detection (supports Q42–Q45)

We detect outliers using:
- IQR rule
- Z-score rule


In [11]:
x = pd.Series(np.concatenate([np.random.normal(0, 1, 300), np.array([8, 9, -7])]))

# IQR method
q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5*iqr, q3 + 1.5*iqr
out_iqr = x[(x < lower) | (x > upper)]

# Z-score method
z = (x - x.mean()) / x.std(ddof=1)
out_z = x[np.abs(z) > 3]

print("IQR outliers:")
print(out_iqr.values)

print("\nZ-score outliers:")
print(out_z.values)


IQR outliers:
[-2.92135048  3.19310757 -2.70323229  8.          9.         -7.        ]

Z-score outliers:
[ 8.  9. -7.]
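
The IQR rule flagged three ordinary-looking points that the Z-score rule did not. A rough explanation, assuming perfectly normal data, is that the IQR fences sit at about 2.7 standard deviations while the Z-score cut-off is 3. The sketch below computes both tail probabilities with the standard library.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# IQR fences for a standard normal distribution
q1_n, q3_n = std_normal.inv_cdf(0.25), std_normal.inv_cdf(0.75)
upper_fence = q3_n + 1.5 * (q3_n - q1_n)

# Two-sided probability of falling outside each threshold
p_iqr = 2 * (1 - std_normal.cdf(upper_fence))
p_z3 = 2 * (1 - std_normal.cdf(3.0))

print(f"IQR upper fence: {upper_fence:.3f} standard deviations")
print(f"Share flagged by IQR rule: {p_iqr:.4f}")
print(f"Share flagged by |z| > 3 : {p_z3:.4f}")
```

With 300 normal samples this works out to roughly two expected IQR flags against fewer than one for the Z-score rule. In the demo the injected extremes also inflate the mean and standard deviation used for the Z-scores, which makes that rule more conservative still.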


## Demo 7) Feature Scaling (supports Q49–Q53)

We compare:
- Standardization
- MinMax scaling
- Unit vector scaling (normalization, applied to each row rather than each column)


In [12]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, Normalizer

# Guard against missing values: keep complete rows only (Normalizer raises on NaN)
X3 = Xr[:, :3]
X_small = X3[~np.isnan(X3).any(axis=1)][:8]  # small slice for display
print("Original:")
print(np.round(X_small, 3))

std = StandardScaler().fit_transform(X_small)
mm = MinMaxScaler().fit_transform(X_small)
uv = Normalizer(norm="l2").fit_transform(X_small)

print("\nStandardized:")
print(np.round(std, 3))

print("\nMinMax scaled:")
print(np.round(mm, 3))

print("\nUnit vector scaled (row-wise):")
print(np.round(uv, 3))

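
When scaling is part of a model workflow, the scaler is normally fitted on the training split and then reused on the other splits. A minimal sketch, assuming synthetic data and a 75/25 split:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_s = rng.normal(loc=50, scale=10, size=(200, 3))  # synthetic features (assumption)

X_tr, X_te = train_test_split(X_s, test_size=0.25, random_state=0)

scaler = StandardScaler().fit(X_tr)   # mean and std come from the training rows only
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)       # test rows reuse the training statistics

print("Train means after scaling:", np.round(X_tr_s.mean(axis=0), 3))
print("Test means after scaling :", np.round(X_te_s.mean(axis=0), 3))
```

The test means are close to zero but not exactly zero, which is expected: the test set was scaled with statistics it did not contribute to.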

## Demo 8) PCA (supports Q54–Q57)

We standardize data, run PCA, and show explained variance ratio.


In [None]:
from sklearn.decomposition import PCA

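# Guard against missing values: keep complete rows only (PCA raises on NaN)
X_std = StandardScaler().fit_transform(Xr[~np.isnan(Xr).any(axis=1)])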

pca = PCA(n_components=5, random_state=42)
Xp = pca.fit_transform(X_std)

print("Explained variance ratio (first 5 PCs):")
print(np.round(pca.explained_variance_ratio_, 4))
print("Cumulative variance:")
print(np.round(np.cumsum(pca.explained_variance_ratio_), 4))
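
Instead of fixing the number of components, scikit-learn's `PCA` also accepts a float between 0 and 1 as `n_components` and keeps just enough components to reach that share of variance. The sketch below assumes a 95% target and a synthetic low-rank dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# effective_rank makes the 10 features correlated (assumed setup)
X_p, _ = make_regression(
    n_samples=300, n_features=10, n_informative=6, effective_rank=4, random_state=0
)
X_p_std = StandardScaler().fit_transform(X_p)

# A float n_components is read as a target share of explained variance
pca95 = PCA(n_components=0.95, svd_solver="full").fit(X_p_std)

print("Components kept:", pca95.n_components_)
print("Cumulative variance:", np.round(np.cumsum(pca95.explained_variance_ratio_), 4))
```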


## Demo 9) Encoding (supports Q58–Q64)

We demonstrate:
- One-hot encoding
- Label encoding (for target labels)
- Ordinal encoding
- Mean/target encoding (simple illustration)


In [14]:
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, OrdinalEncoder

df_cat = pd.DataFrame({
    "color": ["red","blue","green","blue","red","green"],
    "size": ["small","medium","large","small","large","medium"],
    "target": [1, 0, 1, 0, 1, 0]
})

# One-hot for nominal 'color'
ohe = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
X_color_ohe = ohe.fit_transform(df_cat[["color"]])
print("One-hot feature names:", ohe.get_feature_names_out(["color"]))
print(X_color_ohe)

# Ordinal for ordered 'size'
ord_enc = OrdinalEncoder(categories=[["small","medium","large"]])
X_size_ord = ord_enc.fit_transform(df_cat[["size"]])
print("\nOrdinal encoding (size):")
print(X_size_ord.ravel())

# Label encoding (typically for y, not X)
le = LabelEncoder()
y_le = le.fit_transform(["cat","dog","dog","fish"])
print("\nLabel encoding example:", y_le, "classes:", le.classes_)

# Mean/Target encoding (simple, MUST be CV-safe in real pipelines)
mean_map = df_cat.groupby("color")["target"].mean().to_dict()
df_cat["color_mean_enc"] = df_cat["color"].map(mean_map)
print("\nMean encoding map:", mean_map)
display(df_cat)


One-hot feature names: ['color_blue' 'color_green' 'color_red']
[[0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]
 [0. 1. 0.]]

Ordinal encoding (size):
[0. 1. 2. 0. 2. 1.]

Label encoding example: [0 1 1 2] classes: ['cat' 'dog' 'fish']

Mean encoding map: {'blue': 0.0, 'green': 0.5, 'red': 1.0}


| | color | size | target | color_mean_enc |
|---|---|---|---|---|
| 0 | red | small | 1 | 1.0 |
| 1 | blue | medium | 0 | 0.0 |
| 2 | green | large | 1 | 0.5 |
| 3 | blue | small | 0 | 0.0 |
| 4 | red | large | 1 | 1.0 |
| 5 | green | medium | 0 | 0.5 |
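
The mean encoding above uses each row's own target, which leaks label information. One CV-safe variant is out-of-fold encoding, where every row is encoded with means computed from the other folds. The sketch below is one simple way to do it; the toy frame, the fold count, and the fallback to the fold's overall mean are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

df_te = pd.DataFrame({
    "color": ["red", "blue", "green", "blue", "red", "green", "red", "blue"],
    "target": [1, 0, 1, 0, 1, 0, 0, 1],
})
df_te["color_oof_enc"] = np.nan

kf = KFold(n_splits=4, shuffle=True, random_state=0)
for fit_idx, enc_idx in kf.split(df_te):
    fit_part = df_te.iloc[fit_idx]
    fold_means = fit_part.groupby("color")["target"].mean()
    fallback = fit_part["target"].mean()  # used when a category is absent from the fit folds
    encoded = df_te.iloc[enc_idx]["color"].map(fold_means).fillna(fallback)
    df_te.loc[df_te.index[enc_idx], "color_oof_enc"] = encoded.values

print(df_te)
```

No row's encoding depends on its own target value, so the encoded column can be used for training without that direct leak.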


## Demo 10) VIF and RFE (supports Q69–Q72)

- VIF checks multicollinearity among features (commonly in linear models).
- RFE selects features by recursively removing the least important ones.

> If `statsmodels` is not installed: `pip install statsmodels`


In [13]:
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

# RFE with linear regression on the regression dataset
# Guard against missing values: keep complete rows only (LinearRegression raises on NaN)
complete = ~np.isnan(Xr).any(axis=1)

model = LinearRegression()
rfe = RFE(model, n_features_to_select=5)
rfe.fit(Xr[complete], yr[complete])

selected = np.where(rfe.support_)[0]
ranking = rfe.ranking_
print("Selected feature indices:", selected)
print("Ranking (1=selected):", ranking)


In [None]:
# VIF demo (requires statsmodels)
try:
    import statsmodels.api as sm
    from statsmodels.stats.outliers_influence import variance_inflation_factor

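    # Drop incomplete rows first: the regressions behind VIF cannot handle NaN
    X_vif = pd.DataFrame(Xr[:, :6], columns=[f"f{i}" for i in range(6)]).dropna()
    X_vif = sm.add_constant(X_vif)

    vif = []
    for i in range(X_vif.shape[1]):
        vif.append({"feature": X_vif.columns[i], "VIF": variance_inflation_factor(X_vif.values, i)})
    # Explicit display: a bare expression inside try/except is not rendered by the notebook
    display(pd.DataFrame(vif))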
except Exception as e:
    print("VIF demo skipped:", e)
    print("Install with: pip install statsmodels")
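
VIF can also be computed without `statsmodels`. For feature j, regress it on the remaining features and take VIF = 1 / (1 - R²). The sketch below is a hand-rolled illustration; `vif_manual` and the three toy columns are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)                 # independent of a
c = a + 0.1 * rng.normal(size=500)       # nearly a copy of a
X_v = np.column_stack([a, b, c])

def vif_manual(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing column j on the other columns."""
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)

vifs = [vif_manual(X_v, j) for j in range(X_v.shape[1])]
for name, value in zip("abc", vifs):
    print(f"VIF({name}) = {value:.1f}")
```

Columns `a` and `c` get large VIFs because each is almost fully explained by the other, while `b` stays close to 1.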
