# ML Assignment – 2  
**Submitter Name:** Aasif Majeed  
**Date:** 24 May 2024  

This notebook answers **all questions (1–80)** from the uploaded assignment PDF.  
Each question is written with its number, followed by a clear explanation.  
A few **optional Python demo cells** are included at the end to support key concepts.


---
## 0) Imports (for optional demo code)
---


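```python
# Shared imports for the optional demo cells
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(42)  # fixed seed so random demos are reproducible
pd.set_option("display.max_columns", 100)  # show up to 100 DataFrame columns
pd.set_option("display.width", 140)  # wider console output for DataFrames
```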


---
# Answers (1–80)
---

## Q1) What is regression analysis?

Regression analysis is a **supervised learning** (and classical statistics) method that models the relationship between:
- **Inputs / predictors**: \(X = (x_1, x_2, \dots, x_p)\)
- **Output / target**: \(y\) (continuous)

**Core goals**
1. **Prediction**: estimate \(y\) for new/unseen \(X\).
2. **Explanation**: quantify how each predictor is associated with changes in \(y\) (e.g., “holding other variables constant”).

**General form**
\[
y = f(X) + \epsilon
\]
where \(f\) is the learned function and \(\epsilon\) represents noise/unobserved factors.

**Examples**
- Predict house price from size, location features, and age.
- Predict energy consumption from temperature and occupancy.


## Q2) Explain the difference between linear and nonlinear regression.

**Linear regression** assumes the model is linear in its parameters:
\[
y \approx \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p
\]
Even if features are transformed (e.g., \(\log x\)), it is still “linear regression” as long as it is linear in \(\beta\).

**Nonlinear regression** means the model is nonlinear in its parameters, so it cannot be written as a weighted sum of fixed features, for example:
- Exponential: \(y = a e^{bx}\)
- Power law: \(y = a x^b\)
- Saturation curves, logistic growth, etc.

**Practical difference**
- Linear models are easier to interpret and optimize (ordinary least squares has a closed-form solution).
- Nonlinear models can fit complex patterns but may require iterative optimization and can overfit more easily without controls.


## Q3) What is the difference between simple linear regression and multiple linear regression?

**Simple linear regression** uses exactly **one predictor**:
\[
y = \beta_0 + \beta_1 x + \epsilon
\]
It captures a straight-line relationship between one input and the target.

**Multiple linear regression** uses **two or more predictors**:
\[
y = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p + \epsilon
\]
It models the combined effect of several variables (and can include interaction terms like \(x_1x_2\)).

**Why multiple regression matters**
- Real problems often have multiple drivers.
- It lets you estimate the effect of one variable **while controlling for others**.
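
As a minimal sketch of multiple linear regression, the snippet below fits \(\beta_0, \beta_1, \beta_2\) by least squares with NumPy. The synthetic data and the "true" coefficients (2, 3, -1) are assumptions made up for illustration; they are not part of the assignment.

```python
import numpy as np

# Hypothetical synthetic data: y = 2 + 3*x1 - 1*x2 + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Prepend a column of ones so the first coefficient acts as the intercept beta_0
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of X_design @ beta ≈ y
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta_hat)  # should be close to [2, 3, -1]
```

Each fitted coefficient estimates the effect of its predictor with the other predictor held fixed, which is the "controlling for others" idea above.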


## Q4) How is the performance of a regression model typically evaluated?

Regression performance is usually evaluated on **unseen data** (validation/test) using error metrics and diagnostics.

**Common metrics**
- **MAE**: \(\frac{1}{n}\sum|y-\hat{y}|\) (less sensitive to outliers than MSE, easy to interpret).
- **MSE**: \(\frac{1}{n}\sum(y-\hat{y})^2\) (penalizes large errors).
- **RMSE**: \(\sqrt{\text{MSE}}\) (same units as \(y\)).
- **R²**: \(1 - \frac{\sum (y-\hat{y})^2}{\sum (y-\bar{y})^2}\) (variance explained).
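
The four metrics above can be computed directly from their definitions. This is a small sketch with NumPy; the helper name `regression_metrics` and the toy numbers are illustrative assumptions.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 computed from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mae = np.mean(np.abs(residuals))
    mse = np.mean(residuals ** 2)
    rmse = np.sqrt(mse)
    ss_res = np.sum(residuals ** 2)                  # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Toy example (made-up values): MAE = 0.5, MSE = 0.375, R2 ≈ 0.925
metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.0, 7.5, 10.0])
print(metrics)
```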

**Diagnostics**
- Residual plots: check nonlinearity, heteroscedasticity, outliers.
- Learning curves / CV: check generalization stability.

**Best practice**
- Do cross-validation for stable estimates, especially with smaller datasets.


## Q5) What is overfitting in the context of regression models?

**Overfitting** in regression happens when the model fits the **noise** in the training data instead of the true underlying pattern.

**Symptoms**
- Very low training error, but noticeably higher validation/test error.
- Residuals look “too good” on train but unstable on new data.

**Common causes**
- Model too complex (high-degree polynomial, too many parameters).
- Too many features relative to data size.
- Data leakage (test info accidentally used in training).
- Training too long without regularization or early stopping (in iterative models).

**Fixes**
- Use regularization (Ridge/Lasso), reduce complexity, do feature selection, collect more data, and tune using cross-validation.


## Q6) What is logistic regression used for?

**Logistic regression** is primarily used for **classification**, especially:
- **Binary classification**: \(y \in \{0,1\}\)
- Also extensions: multinomial logistic regression for multiple classes.

It models the probability of the positive class:
\[
p = P(y=1|x)
\]
and then predicts class labels using a threshold (commonly 0.5).

**Applications**
- spam vs not spam
- disease present vs absent
- churn vs no churn


## Q7) How does logistic regression differ from linear regression?

**Linear regression**
- Target is continuous (\(y \in \mathbb{R}\)).
- Output can be any real number.
- Often optimized with least squares (MSE).

**Logistic regression**
- Target is categorical (typically 0/1).
- Output is a probability \(p \in (0,1)\) via the sigmoid.
- Optimized with **log-loss / cross-entropy** (maximum likelihood).

**Important note**
Despite the name, logistic regression is a **classification** model.


## Q8) Explain the concept of odds ratio in logistic regression.

In logistic regression, we often work with **odds** and **log-odds**:

- Probability: \(p = P(y=1|x)\)
- **Odds**: \(\frac{p}{1-p}\)
- **Log-odds (logit)**:
\[
\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta^T x
\]

**Odds ratio (OR)**
For a 1-unit increase in feature \(x_j\), holding the other features fixed, the odds multiply by:
\[
OR = e^{\beta_j}
\]
Interpretation:
- \(OR>1\): odds increase
- \(OR<1\): odds decrease
- \(OR=1\): no change
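
A quick numeric check of \(OR = e^{\beta_j}\): the sketch below uses hypothetical coefficients for a one-feature model, increases \(x\) by one unit, and compares the ratio of the odds with \(e^{\beta_1}\).

```python
import math

def sigmoid(z):
    """Logistic function: maps a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Convert a probability into odds p / (1 - p)."""
    return p / (1.0 - p)

# Hypothetical fitted coefficients (illustration only)
beta_0, beta_1 = -1.0, 0.7

p_before = sigmoid(beta_0 + beta_1 * 2.0)  # P(y=1 | x=2)
p_after = sigmoid(beta_0 + beta_1 * 3.0)   # P(y=1 | x=3), one unit higher

odds_ratio = odds(p_after) / odds(p_before)
print(odds_ratio, math.exp(beta_1))  # the two values agree up to rounding
```

The probabilities themselves change by different amounts depending on where \(x\) starts, but the odds are always multiplied by the same factor.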


## Q9) What is the sigmoid function in logistic regression?

The **sigmoid** (logistic) function converts any real number into a probability in (0,1):
\[
\sigma(z) = \frac{1}{1+e^{-z}}
\]
where \(z = \beta_0 + \beta^T x\).

**Why it matters**
- It “squashes” outputs to be valid probabilities.
- It creates an S-shaped curve: small changes in \(z\) near 0 can change the probability a lot; far from 0, the probability saturates near 0 or 1.
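
To make the saturation behaviour concrete, here is a small vectorized sketch of the sigmoid. The input values are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))

z = np.array([-6.0, -0.5, 0.0, 0.5, 6.0])
p = sigmoid(z)
print(p)  # near 0 for z = -6, exactly 0.5 at z = 0, near 1 for z = 6

# The same step of 0.5 in z matters much more near 0 than far from 0
step_near_zero = sigmoid(0.5) - sigmoid(0.0)
step_far_out = sigmoid(6.5) - sigmoid(6.0)
print(step_near_zero, step_far_out)
```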


## Q10) How is the performance of a logistic regression model evaluated?

Logistic regression performance is evaluated using **classification metrics**, ideally on validation/test data.

**Core tools**
- **Confusion matrix**: TP, FP, TN, FN
- **Accuracy**: may be misleading if data is imbalanced
- **Precision**: TP/(TP+FP)
- **Recall**: TP/(TP+FN)
- **F1-score**: harmonic mean of precision and recall
- **ROC-AUC**: ranking quality across thresholds
- **PR-AUC**: often better for imbalanced datasets
- **Log-loss (cross-entropy)**: evaluates probability quality
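
The threshold-based metrics above follow directly from the confusion-matrix counts. The sketch below computes them by hand with NumPy; the function name and the toy labels are assumptions for illustration.

```python
import numpy as np

def binary_classification_metrics(y_true, y_pred):
    """Compute confusion-matrix counts and derived metrics for 0/1 labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is never predicted/present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn,
            "accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Toy labels (made up): 2 true positives, 1 false negative, 1 false positive
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
clf_metrics = binary_classification_metrics(y_true, y_pred)
print(clf_metrics)
```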

**Extra**
- Threshold tuning and calibration checks are common when probabilities matter.


## Q11) What is a decision tree?

A **decision tree** is a model that makes predictions by repeatedly applying **if–else** rules on features.

- In **classification**, each leaf predicts a class (often majority class) or class probabilities.
- In **regression**, each leaf predicts a numeric value (often mean of y in that leaf).

Trees are popular because they:
- handle nonlinear relationships,
- work with mixed feature types,
- are interpretable in small-to-medium depth.


## Q12) How does a decision tree make predictions?

A decision tree predicts by **traversing** from the root to a leaf:

1. Start at root node.
2. Apply the split rule (example: `feature_j <= threshold`).
3. Depending on the rule outcome, go left or right.
4. Repeat until reaching a leaf node.
5. Output the leaf’s prediction (class label/probability or mean value).

The training algorithm chooses splits that best reduce impurity (classification) or reduce variance/error (regression).
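
The traversal steps above can be sketched with a hand-written tree stored as nested dictionaries. The tree structure, thresholds, and class names here are hypothetical; a real library stores a fitted tree in its own internal format.

```python
def predict_one(node, x):
    """Walk from the root to a leaf and return the leaf's prediction."""
    # Leaves carry a "value"; internal nodes carry a split rule
    while "value" not in node:
        if x[node["feature"]] <= node["threshold"]:
            node = node["left"]   # rule satisfied: go left
        else:
            node = node["right"]  # rule not satisfied: go right
    return node["value"]

# Hypothetical tree: split on feature 0, then on feature 1
tree = {
    "feature": 0, "threshold": 2.5,
    "left": {"value": "A"},
    "right": {
        "feature": 1, "threshold": 1.0,
        "left": {"value": "B"},
        "right": {"value": "C"},
    },
}

print(predict_one(tree, [1.0, 5.0]))  # A
print(predict_one(tree, [3.0, 0.5]))  # B
print(predict_one(tree, [3.0, 2.0]))  # C
```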


## Q13) What is entropy in the context of decision trees?

**Entropy** measures class impurity/uncertainty at a node in a classification tree:

\[
H(S) = -\sum_{k=1}^{K} p_k \log_2(p_k)
\]
where \(p_k\) is the fraction of samples in class \(k\).

- \(H=0\) when the node is **pure** (all samples in one class).
- Entropy is higher when classes are mixed.

Trees select splits using **information gain** (entropy reduction):
\[
IG = H(parent) - \sum_{child} \frac{n_{child}}{n_{parent}} H(child)
\]
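
As a short sketch, entropy and information gain can be computed from label arrays as follows. The function names and the toy labels are illustrative assumptions.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of the class distribution in `labels`."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the children."""
    n_parent = len(parent)
    weighted = sum(len(c) / n_parent * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])  # perfectly mixed: H = 1 bit

# A perfect split yields pure children, so the gain equals the parent entropy
ig_perfect = information_gain(parent, [parent[:4], parent[4:]])

# A weaker split leaves both children mixed, so the gain is smaller
ig_weak = information_gain(parent, [np.array([0, 0, 0, 1]),
                                    np.array([0, 1, 1, 1])])
print(ig_perfect, ig_weak)
```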


## Q14) What is pruning in decision trees?

**Pruning** reduces a decision tree’s complexity to improve generalization and prevent overfitting.

**Pre-pruning (early stopping)**
- stop splitting early using constraints like:
  - `max_depth`
  - `min_samples_split`
  - `min_samples_leaf`

**Post-pruning**
- build a large tree, then remove weak branches using criteria like cost-complexity pruning.

**Effect**
- simpler tree, lower variance, often better test performance.


## Q15) How do decision trees handle missing values?

Decision trees can handle missing values in several ways (depends on implementation/library):

1. **Imputation before training** (most common): fill missing with mean/median/mode.
2. **Treat missing as its own category** (categorical variables).
3. **Surrogate splits**: if best split feature is missing, use another correlated feature’s split.
4. Some boosted-tree libraries learn a default direction for missing values at each split.

**Best practice**
- Use a pipeline: fit imputer on training data only, then transform validation/test.


## Q16) What is a support vector machine (SVM)?

A **Support Vector Machine (SVM)** is a supervised model that finds a decision boundary with maximum margin.

For binary classification, it tries to find a hyperplane:
\[
w^T x + b = 0
\]
that separates classes while maximizing the margin (distance from the hyperplane to nearest points).

Variants:
- **SVC**: classification
- **SVR**: regression
- Supports **kernels** for nonlinear decision boundaries.


## Q17) Explain the concept of margin in SVM.

The **margin** is the distance between the separating hyperplane and the nearest points from each class.

For a linear SVM:
- The hyperplane is \(w^T x + b = 0\).
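- The margin is proportional to \(1/\|w\|\) (with the usual scaling, the full width between the two margin boundaries is \(2/\|w\|\)).

SVM maximizes this margin because a larger margin often leads to better generalization (the boundary is less sensitive to small perturbations).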

In soft-margin SVM, we allow some violations using slack variables (controlled by \(C\)).
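
A small numeric sketch of the geometry, assuming the canonical scaling in which the closest points satisfy \(|w^T x + b| = 1\). The values of `w`, `b`, and the sample point are made up for illustration.

```python
import numpy as np

# Hypothetical hyperplane w^T x + b = 0
w = np.array([3.0, 4.0])
b = -5.0
w_norm = np.linalg.norm(w)  # ||w|| = 5

def signed_distance(x):
    """Signed Euclidean distance from point x to the hyperplane."""
    return (w @ np.asarray(x, dtype=float) + b) / w_norm

# Under canonical scaling the two margin boundaries are w^T x + b = +1 and -1
margin_width = 2.0 / w_norm

x_on_margin = np.array([1.0, 0.75])  # w^T x + b = 3 + 3 - 5 = 1
print(signed_distance(x_on_margin), margin_width / 2)  # both are 1/||w||
```

Shrinking \(\|w\|\) therefore widens the margin, which is why the SVM objective minimizes \(\|w\|^2\).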


## Q18) What are support vectors in SVM?

**Support vectors** are the training data points that lie closest to the decision boundary (or on/inside the margin).

Why they matter:
- They **completely determine** the SVM boundary.
- If you remove points that are not support vectors and retrain with the same settings, the boundary stays the same.
- They represent the “hardest” points to classify.


## Q19) How does SVM handle non-linearly separable data?

SVM handles nonlinearly separable data in two main ways:

1. **Soft margin** (slack variables): allow some points to be misclassified or inside the margin.
   - Controlled by parameter \(C\) (higher \(C\) → less tolerance for margin violations, with a higher risk of overfitting).
2. **Kernel trick**: implicitly maps inputs to a higher-dimensional space where a linear separator becomes possible.
   - Common kernels: **RBF (Gaussian)**, polynomial, sigmoid.

This allows SVM to model nonlinear boundaries while keeping computation manageable.


## Q20) What are the advantages of SVM over other classification algorithms?

**Advantages of SVM**
- Strong performance in **high-dimensional** spaces.
- Effective with clear margins between classes.
- Kernel trick enables flexible nonlinear classification.
- Uses only support vectors to define boundary (compact boundary representation).

**Limitations**
- Can be slower for very large datasets.
- Sensitive to feature scaling (standardization recommended).
- Requires tuning (\(C\), kernel parameters like \(\gamma\)).
- Not inherently probabilistic; probabilities need a calibration step such as Platt scaling (which scikit-learn applies when `probability=True` is set on `SVC`).


## Q21) What is the Naïve Bayes algorithm?

**Naïve Bayes (NB)** is a probabilistic classifier based on Bayes’ theorem:
\[
P(y|x) \propto P(y)\,P(x|y)
\]
and the **naïve assumption**:
\[
P(x|y) = \prod_{j=1}^{p} P(x_j|y)
\]

So the prediction rule is:
\[
\hat{y} = \arg\max_y \left[\log P(y) + \sum_j \log P(x_j|y)\right]
\]

NB is fast, simple, and works especially well for text classification.


## Q22) Why is it called 'Naïve' Bayes?

It is called **“naïve”** because it assumes that all features are **conditionally independent** given the class label \(y\).

Example: In spam detection, NB treats words as independent given spam/not spam.  
This is not strictly true in language, but the assumption often still produces good performance because:
- many features contribute additive evidence,
- errors may cancel out,
- it generalizes well in sparse high-dimensional spaces (like bag-of-words).


## Q23) How does Naïve Bayes handle continuous and categorical features?

Naïve Bayes uses different likelihood models depending on feature type:

**1) Continuous features**
- **Gaussian NB** assumes each feature follows a normal distribution within each class:
  \[
  x_j|y \sim \mathcal{N}(\mu_{jy}, \sigma_{jy}^2)
  \]

**2) Categorical / count features**
- **Multinomial NB**: word counts / frequencies (text).
- **Bernoulli NB**: binary features (word present/absent).
- **Categorical NB**: general categorical features with probability tables.

Smoothing (e.g., Laplace) is often used for categorical/count features.


## Q24) Explain the concept of prior and posterior probabilities in Naïve Bayes.

**Prior probability**: \(P(y)\)  
- The probability of class \(y\) before seeing any features.
- Example: if 70% of emails are not spam, \(P(y=\text{not spam})=0.7\).

**Likelihood**: \(P(x|y)\)  
- Probability of observing features \(x\) given class \(y\).

**Posterior probability**: \(P(y|x)\)  
- Updated probability after seeing features:
\[
P(y|x) = \frac{P(x|y)\,P(y)}{P(x)}
\]
In classification, we compare posteriors across classes and pick the largest.
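
A worked example of the update from prior to posterior. The priors follow the 70% / 30% example above; the likelihood values for a single observed word are hypothetical numbers chosen for illustration.

```python
# Priors from the example above: 70% of emails are not spam
prior = {"spam": 0.3, "not spam": 0.7}

# Hypothetical likelihoods P(x | y) of observing the word "free" in each class
likelihood = {"spam": 0.8, "not spam": 0.1}

# Evidence P(x) = sum over classes of P(x | y) * P(y)
evidence = sum(likelihood[c] * prior[c] for c in prior)

# Bayes' rule: P(y | x) = P(x | y) * P(y) / P(x)
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}
predicted = max(posterior, key=posterior.get)

print(posterior)  # spam ≈ 0.774, not spam ≈ 0.226
print(predicted)  # spam
```

Although "not spam" has the larger prior, the observed word shifts the posterior toward "spam".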


## Q25) What is Laplace smoothing and why is it used in Naïve Bayes?

**Laplace smoothing** (add-one smoothing) prevents zero probabilities when a category/word is unseen in training for a class.

Without smoothing:
- If \(P(x_j|y)=0\) for any feature, the entire product becomes 0, so that class is ruled out no matter how strongly the other features support it.

With Laplace smoothing:
\[
P(w|c) = \frac{\mathrm{count}(w,c)+\alpha}{\sum_{w'} \mathrm{count}(w',c) + \alpha V}
\]
- \(V\) = vocabulary size (or number of categories)
- \(\alpha=1\) gives classic Laplace smoothing

This improves robustness, especially in sparse text data.
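
The smoothing formula can be sketched for a single class as follows. The word counts are made up, and `smoothed_probs` is an illustrative helper rather than a library function.

```python
import numpy as np

def smoothed_probs(counts, alpha=1.0):
    """Add-alpha estimate of P(w | c) from per-word counts in one class."""
    counts = np.asarray(counts, dtype=float)
    vocab_size = counts.size  # V in the formula above
    return (counts + alpha) / (counts.sum() + alpha * vocab_size)

# Hypothetical counts for a 3-word vocabulary; the second word is unseen
counts = [3, 0, 1]

unsmoothed = np.asarray(counts) / np.sum(counts)  # contains an exact zero
smoothed = smoothed_probs(counts, alpha=1.0)      # (4, 1, 2) / 7

print(unsmoothed)
print(smoothed)
```

The smoothed estimates still sum to 1, and the unseen word gets a small non-zero probability instead of vetoing the class.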


## Q26) Can Naïve Bayes be used for regression tasks?

In standard ML practice, Naïve Bayes is mainly a **classification** algorithm.

- Some research variants exist for regression, but they are not commonly used in mainstream pipelines.
- For regression tasks, common choices are:
  - Linear regression / Ridge / Lasso
  - SVR
  - Random Forest Regressor
  - Gradient Boosting Regressor (XGBoost/LightGBM)

So the typical answer: **NB is not a standard regression model**.


## Q27) How do you handle missing values in Naïve Bayes?

Handling missing values for Naïve Bayes is usually done via **preprocessing**, because most NB implementations expect complete input.

Common strategies:
- **Impute numeric** features with mean/median.
- **Impute categorical** with mode or treat missing as its own category like `"Unknown"`.
- Add **missingness indicator features** (0/1 flags) to capture informative missingness.

Important: fit imputation on **training data only** to avoid leakage.


## Q28) What are some common applications of Naïve Bayes?

Naïve Bayes is widely used in problems where features are many and sparse, especially text.

Common applications:
- **Spam detection**
- **Sentiment analysis**
- **Document/topic classification**
- **Language detection**
- **Medical diagnosis baselines** (quick baseline)
- **Real-time classification** where speed matters

It is often used as a strong baseline because it trains and predicts very fast.


## Q29) Explain the concept of feature independence assumption in Naïve Bayes.

The key assumption in Naïve Bayes is **conditional independence**:
\[
P(x_1, x_2, \dots, x_p \mid y) = \prod_{j=1}^{p} P(x_j \mid y)
\]
Meaning: once we know the class \(y\), knowing \(x_1\) gives no extra information about \(x_2\), etc.

Why it helps:
- reduces the number of parameters to estimate,
- makes learning and inference fast and stable in high dimensions.

Why it can fail:
- when strong dependencies between features carry important information.


## Q30) How does Naïve Bayes handle categorical features with a large number of categories?

Large-category features (high cardinality) can be challenging because:
- probability tables become large,
- many categories are rare → unreliable estimates.

Ways NB handles it:
- **Smoothing** (Laplace/add-α) to avoid zero probabilities and reduce overconfidence.
- **Grouping rare categories** into an `"Other"` bucket.
- **Feature hashing** to reduce dimensionality (common in text pipelines).
- Prefer count-based representations (Multinomial NB) for text-like features.

Goal: keep estimates stable and avoid huge sparse matrices.


## Q31) What is the curse of dimensionality, and how does it affect machine learning algorithms?

The **curse of dimensionality** describes how data becomes sparse and harder to model as the number of features grows.

What happens:
- volume of space grows exponentially → need much more data to cover it
- distances become less informative (kNN, k-means suffer)
- models overfit more easily (many degrees of freedom)

How to mitigate:
- feature selection (remove irrelevant/redundant features)
- dimensionality reduction (PCA/UMAP)
- regularization (Ridge/Lasso)
- get more data or better features


## Q32) Explain the bias-variance tradeoff and its implications for machine learning models.

**Bias–variance tradeoff** explains why making a model more complex is not always better.

- **Bias**: error from overly simple assumptions (underfitting).  
  High bias → poor training and test performance.
- **Variance**: error from sensitivity to training data (overfitting).  
  High variance → excellent training but poor test performance.

As complexity increases:
- bias ↓
- variance ↑

We choose complexity/regularization to minimize **validation error**, and keep the test set for the final performance estimate.


## Q33) What is cross-validation, and why is it used?

**Cross-validation (CV)** evaluates a model by splitting data into multiple train/validation folds.

In **k-fold CV**:
1. Split dataset into k roughly equal folds.
2. Train on k−1 folds, validate on the remaining fold.
3. Repeat k times and average performance.
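
The three steps above can be sketched as an index generator with NumPy. This is a simplified illustration (the helper name and seed are assumptions); in practice a library splitter would normally be used.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    folds = np.array_split(indices, k)  # fold sizes differ by at most one
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

splits = list(k_fold_indices(n_samples=10, k=3))
for train_idx, val_idx in splits:
    print(len(train_idx), len(val_idx))
```

Each sample appears in exactly one validation fold, so every data point is used for both training and validation across the k rounds.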

Why used:
- more reliable estimate than a single split
- better use of limited data
- standard method for hyperparameter tuning and model selection


## Q34) Explain the difference between parametric and non-parametric machine learning algorithms.

**Parametric algorithms** assume a fixed functional form and learn a finite set of parameters.
- Examples: linear regression, logistic regression, (Gaussian) Naïve Bayes.

**Non-parametric algorithms** do not assume a fixed form; model complexity can grow with data.
- Examples: kNN, decision trees, random forests, kernel density methods.

Tradeoff:
- parametric: simpler, faster, may underfit
- non-parametric: flexible, can overfit, often needs more data


## Q35) What is feature scaling, and why is it important in machine learning?

**Feature scaling** transforms features to comparable ranges (e.g., standardization or min-max scaling).

Why important:
- **Distance-based models** (kNN, k-means) depend heavily on scales.
- **Gradient-based models** (logistic regression, neural nets) converge faster with scaled inputs.
- **SVM** is sensitive to scaling, especially with RBF kernel.
- **PCA** requires scaling to avoid domination by large-scale features.

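Tree-based models are largely insensitive to monotonic feature scaling because splits depend only on the ordering of values; scaling still matters when trees share a pipeline with scale-sensitive steps.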
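
A minimal sketch of standardization and min-max scaling with NumPy. The arrays are toy values, and the scaling statistics are computed on the training data only and then reused for the test data.

```python
import numpy as np

# Toy data: the second feature is on a much larger scale than the first
X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_test = np.array([[2.0, 400.0]])

# Standardization: subtract the training mean, divide by the training std
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_train_std = (X_train - mean) / std
X_test_std = (X_test - mean) / std

# Min-max scaling: map the training range of each feature to [0, 1]
col_min = X_train.min(axis=0)
col_range = X_train.max(axis=0) - col_min
X_train_mm = (X_train - col_min) / col_range
X_test_mm = (X_test - col_min) / col_range  # may fall outside [0, 1]

print(X_train_std)
print(X_test_mm)  # [[0.5, 1.5]]
```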


## Q36) What is regularization, and why is it used in machine learning?

**Regularization** adds a penalty term to the loss function to discourage overly complex models and prevent overfitting.

Common types:
- **L2 (Ridge)**: \(\lambda \sum \beta_j^2\)  
  shrinks coefficients smoothly; good for multicollinearity.
- **L1 (Lasso)**: \(\lambda \sum |\beta_j|\)  
  encourages sparsity → performs feature selection.

Benefits:
- improves generalization
- stabilizes coefficients
- reduces variance


## Q37) Explain the concept of ensemble learning and give an example.

**Ensemble learning** combines multiple models to produce a stronger final predictor.

Why it works:
- multiple learners reduce variance and/or bias
- errors can cancel out

Examples:
- **Random Forest** (bagging): many trees trained on bootstrapped samples.
- **Gradient Boosting** (boosting): sequentially adds models to correct errors.
- **Voting classifier**: majority vote/averaging across different models.


## Q38) What is the difference between bagging and boosting?

**Bagging (Bootstrap Aggregating)**
- Train many models in parallel on different bootstrapped datasets.
- Combine predictions by averaging/voting.
- Main effect: reduces **variance**.
- Example: Random Forest.

**Boosting**
- Train models sequentially; each new model focuses on previous errors.
- Combine via weighted sum.
- Main effect: reduces **bias** (often improves accuracy).
- Examples: AdaBoost, Gradient Boosting, XGBoost.


## Q39) What is the difference between a generative model and a discriminative model?

**Generative models** learn how data is generated by modeling \(P(x,y)\) or \(P(x|y)\) and \(P(y)\).
- Example: Naïve Bayes (models \(P(x|y)\) and \(P(y)\)).
- Can generate samples of \(x\) given \(y\) (in principle).

**Discriminative models** directly model \(P(y|x)\) or learn the decision boundary.
- Examples: logistic regression, SVM, neural networks.
- Often achieve higher predictive accuracy when enough labeled data exists.


## Q40) Explain the concept of batch gradient descent and stochastic gradient descent.

**Batch Gradient Descent**
- Computes gradient using **all training samples** each step.
- Stable updates, but expensive for large datasets.

**Stochastic Gradient Descent (SGD)**
- Computes gradient using **one sample** at a time.
- Very fast per step, but noisy updates (can bounce around).

**Mini-batch GD**
- Uses small batches (e.g., 32–256).
- Most common in practice: balances stability and speed.
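
The three variants differ only in how many samples feed each gradient step. The sketch below fits a linear model by minimizing MSE; the data are synthetic and noise-free, and the learning rate and epoch count are arbitrary assumptions chosen so that all three runs converge.

```python
import numpy as np

def fit_linear_gd(X, y, batch_size, lr=0.1, epochs=200, seed=0):
    """Minimize mean squared error with (mini-)batch gradient descent."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    beta = np.zeros(n_features)
    for _ in range(epochs):
        order = rng.permutation(n_samples)  # reshuffle every epoch
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            error = X[idx] @ beta - y[idx]
            grad = 2.0 * X[idx].T @ error / len(idx)  # gradient of batch MSE
            beta -= lr * grad
    return beta

# Synthetic noise-free data: y = 1 + 2x
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=100)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

beta_batch = fit_linear_gd(X, y, batch_size=len(X))  # one step per epoch
beta_sgd = fit_linear_gd(X, y, batch_size=1)         # one sample per step
beta_mini = fit_linear_gd(X, y, batch_size=16)       # small batches
print(beta_batch, beta_sgd, beta_mini)  # each should be close to [1, 2]
```

With noisy targets, the single-sample variant would keep fluctuating around the solution unless the learning rate is decayed.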


## Q41) What is the K-nearest neighbors (KNN) algorithm, and how does it work?

**k-Nearest Neighbors (kNN)** is a lazy, non-parametric method.

How it works:
1. Choose \(k\).
2. For a query point, compute distance to training points (Euclidean, cosine, etc.).
3. Pick the \(k\) nearest points.
4. Predict:
   - **classification:** majority vote (optionally distance-weighted)
   - **regression:** mean/weighted mean of neighbors’ targets

It relies on the idea that “similar points have similar labels.”
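
The steps above can be sketched as a brute-force kNN classifier with Euclidean distance and an unweighted majority vote. The function name and the toy points are assumptions for illustration.

```python
from collections import Counter

import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    distances = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean
    nearest = np.argsort(distances)[:k]                    # k smallest
    votes = Counter(int(y_train[i]) for i in nearest)
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([0.5, 0.5])))  # 0
print(knn_predict(X_train, y_train, np.array([5.5, 5.5])))  # 1
```

There is no training step: all the work happens at prediction time, which is why kNN is called a lazy learner.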


## Q42) What are the disadvantages of the K-nearest neighbors algorithm?

Main disadvantages of kNN:
- **Slow prediction**: needs distances to many points (unless using indexing structures).
- **Memory-heavy**: stores the full training set.
- **Sensitive to scaling**: unscaled features distort distances.
- **Curse of dimensionality**: distances become less meaningful in high dimensions.
- Performance depends heavily on \(k\) choice and distance metric.


## Q43) Explain the concept of one-hot encoding and its use in machine learning.

**One-hot encoding** converts a categorical feature into multiple binary columns.

Example: Color ∈ {Red, Blue, Green}
- Red → [1,0,0]
- Blue → [0,1,0]
- Green → [0,0,1]

Why used:
- avoids imposing a fake ordinal relationship (unlike label encoding).
- enables linear models and many ML algorithms to process categorical inputs.

Caution:
- for high-cardinality categories, one-hot can create too many columns.
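
The colour example above can be reproduced with a broadcasted comparison in NumPy. The column order (Red, Blue, Green) is fixed by hand here to match the example; library encoders typically choose their own ordering.

```python
import numpy as np

colors = np.array(["Red", "Blue", "Green", "Blue"])
categories = np.array(["Red", "Blue", "Green"])  # one output column each

# Compare every value with every category: shape (n_samples, n_categories)
one_hot = (colors[:, None] == categories[None, :]).astype(int)
print(one_hot)
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]
#  [0 1 0]]
```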


## Q44) What is feature selection, and why is it important in machine learning?

**Feature selection** is choosing the most relevant subset of input features for a model.

Why important:
- reduces overfitting (removes noise)
- improves generalization
- reduces training time and storage
- improves interpretability
- reduces multicollinearity for linear models

Approaches:
- filter methods (correlation, chi-square)
- wrapper methods (RFE, forward/backward selection)
- embedded methods (Lasso, tree-based importances)


## Q45) Explain the concept of cross-entropy loss and its use in classification tasks.

**Cross-entropy loss** measures how well predicted probabilities match true labels.
For binary classification:
\[
L = -\left[y\log(p) + (1-y)\log(1-p)\right]
\]
- If the model is confidently wrong (p close to 0 when y=1), loss becomes very large.
- Encourages well-calibrated probabilities.

Used in:
- logistic regression
- neural networks for classification
- many probabilistic classifiers
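
A short sketch of the binary cross-entropy formula, showing how a confident wrong prediction is penalized far more than a confident correct one. The clipping constant `eps` is an assumption added to avoid `log(0)`.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and probabilities."""
    y = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    return float(np.mean(-(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))))

loss_confident_right = binary_cross_entropy([1], [0.9])  # -log(0.9) ≈ 0.105
loss_confident_wrong = binary_cross_entropy([1], [0.1])  # -log(0.1) ≈ 2.303
print(loss_confident_right, loss_confident_wrong)
```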


## Q46) What is the difference between batch learning and online learning?

**Batch learning**
- Train once on a fixed dataset.
- To update with new data, you often retrain from scratch (or periodically).

**Online learning**
- Update model incrementally as new data arrives (streaming).
- Useful for large datasets, real-time systems, or when data distribution changes (concept drift).

Examples:
- online SGD, incremental Naïve Bayes, streaming recommender systems.


## Q47) Explain the concept of grid search and its use in hyperparameter tuning.

**Grid search** tries all combinations of a specified set of hyperparameters.

Example: for SVM (RBF), tune \(C\) and \(\gamma\).
- Use validation set or **cross-validation** to evaluate each combination.
- Choose the one with best CV score.

Pros:
- simple and exhaustive

Cons:
- can be expensive when there are many hyperparameters or wide ranges (the number of combinations multiplies with each added parameter)

Alternatives: random search, Bayesian optimization.
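
A minimal grid-search sketch, using a single hold-out validation split instead of full cross-validation to keep it short. The model (ridge-penalized polynomial regression), the synthetic data, and the grid values are all assumptions chosen for illustration.

```python
import itertools

import numpy as np

# Synthetic data: a noisy sine curve, split into train and validation parts
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=120)
y = np.sin(3.0 * x) + rng.normal(scale=0.1, size=120)
x_train, y_train, x_val, y_val = x[:80], y[:80], x[80:], y[80:]

def fit_ridge_poly(x, y, degree, lam):
    """Closed-form ridge fit on polynomial features (intercept penalized too)."""
    X = np.vander(x, degree + 1)
    A = X.T @ X + lam * np.eye(degree + 1)
    return np.linalg.solve(A, X.T @ y)

def mse(beta, x, y, degree):
    """Mean squared error of the polynomial model on (x, y)."""
    return float(np.mean((np.vander(x, degree + 1) @ beta - y) ** 2))

grid = {"degree": [1, 3, 5], "lam": [0.001, 0.1, 10.0]}
results = {}
# Try every combination in the grid and record its validation error
for degree, lam in itertools.product(grid["degree"], grid["lam"]):
    beta = fit_ridge_poly(x_train, y_train, degree, lam)
    results[(degree, lam)] = mse(beta, x_val, y_val, degree)

best_params = min(results, key=results.get)
print(best_params, results[best_params])
```

With 3 values per hyperparameter the grid already has 9 fits; adding a third hyperparameter with 3 values would make it 27.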


## Q48) What are the advantages and disadvantages of decision trees?

**Decision tree advantages**
- interpretable (especially shallow trees)
- handles nonlinearities and interactions
- minimal preprocessing (no scaling needed)
- works with numeric + categorical (with proper handling)

**Disadvantages**
- prone to overfitting (high variance)
- unstable: small data changes can change structure
- greedy split decisions may miss global optimum
- can be biased toward features with many possible splits

Fixes: pruning, ensembles (Random Forest, Gradient Boosting).


## Q49) What is the difference between L1 and L2 regularization?

**L1 regularization (Lasso)**
- Penalty: \(\lambda \sum_j |\beta_j|\)
- Encourages sparsity → some coefficients become exactly 0 (feature selection).

**L2 regularization (Ridge)**
- Penalty: \(\lambda \sum_j \beta_j^2\)
- Shrinks coefficients smoothly, rarely to exactly 0.
- Works well when features are correlated (multicollinearity).

Elastic Net combines both.
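
The different shrinkage behaviour is easiest to see in a simplified one-coefficient setting. The sketch below assumes orthonormal features, so each coefficient can be treated separately, and uses the objectives \(\tfrac{1}{2}(\beta - \beta^{OLS})^2 + \lambda|\beta|\) for L1 and \(\tfrac{1}{2}(\beta - \beta^{OLS})^2 + \lambda\beta^2\) for L2. The coefficient values are made up.

```python
import numpy as np

def lasso_shrink(b_ols, lam):
    """L1 solution under the stated assumptions: soft-thresholding by lam."""
    return np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam, 0.0)

def ridge_shrink(b_ols, lam):
    """L2 solution under the stated assumptions: proportional shrinkage."""
    return b_ols / (1.0 + 2.0 * lam)

b_ols = np.array([3.0, 0.4, -0.2])  # hypothetical unpenalized coefficients
lam = 0.5

b_lasso = lasso_shrink(b_ols, lam)  # small coefficients become exactly 0
b_ridge = ridge_shrink(b_ols, lam)  # every coefficient is halved, none is 0
print(b_lasso)
print(b_ridge)
```

L1 subtracts a fixed amount and clips at zero, which produces sparsity; L2 rescales, so coefficients get smaller but stay non-zero.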


## Q50) What are some common preprocessing techniques used in machine learning?

Common preprocessing techniques include:
- handling missing values (imputation + indicators)
- encoding categorical variables (one-hot, ordinal, target encoding)
- scaling/normalization (standardization, min-max)
- outlier handling (IQR, z-score, robust scaling)
- train/validation/test split (avoid leakage)
- feature selection or dimensionality reduction (PCA)
- class imbalance handling (resampling, SMOTE, class weights)
- text preprocessing (tokenization, TF-IDF)


## Q51) What is the difference between a parametric and non-parametric algorithm? Give examples of each.

A **parametric** algorithm has a fixed number of parameters regardless of dataset size.
- Examples: **linear regression**, **logistic regression**, **Naïve Bayes** (with distribution assumptions).

A **non-parametric** algorithm adapts complexity with data size or structure.
- Examples: **kNN**, **decision trees**, **random forests**.

Key difference: parametric models are simpler and need fewer samples, while non-parametric models are flexible but may require more data and careful regularization/tuning.


## Q52) Explain the bias-variance tradeoff and how it relates to model complexity.

Bias–variance tradeoff relates directly to **model complexity**:
- As complexity increases, **training error decreases**.
- Bias tends to decrease, but variance tends to increase.

Practical implication:
- Very simple model → underfits (high bias)
- Very complex model → overfits (high variance)

We use validation curves, learning curves, and cross-validation to find the best complexity/regularization level.


## Q53) What are the advantages and disadvantages of using ensemble methods like random forests?

**Random Forest (an ensemble of decision trees)**

Advantages:
- strong performance on many tabular tasks
- reduces overfitting compared to a single tree (bagging reduces variance)
- handles nonlinearities and interactions well
- provides feature importance estimates
- robust to outliers and monotonic transforms

Disadvantages:
- less interpretable than a single tree
- can be heavy (many trees) for memory and deployment
- may underperform boosted methods on some tasks without tuning


## Q54) Explain the difference between bagging and boosting.

**Bagging (Bootstrap Aggregating)**
- Train many models in parallel on different bootstrapped datasets.
- Combine predictions by averaging/voting.
- Main effect: reduces **variance**.
- Example: Random Forest.

**Boosting**
- Train models sequentially; each new model focuses on previous errors.
- Combine via weighted sum.
- Main effect: reduces **bias** (often improves accuracy).
- Examples: AdaBoost, Gradient Boosting, XGBoost.


## Q55) What is the purpose of hyperparameter tuning in machine learning?

Hyperparameter tuning finds the best “settings” that are not learned directly from the data (unlike model weights).
Examples:
- SVM: \(C\), kernel choice, \(\gamma\)
- Tree: max_depth, min_samples_leaf
- kNN: \(k\)
- Boosting: learning rate, number of estimators

Purpose:
- improve generalization
- avoid underfitting/overfitting
- produce stable, reproducible performance using validation/CV


## Q56) What is the difference between regularization and feature selection?

**Regularization** controls model complexity by adding a penalty to the loss (it keeps the features in the model but shrinks their coefficients; an L1 penalty can shrink some to exactly zero).
- Example: Ridge and Lasso penalties.

**Feature selection** reduces complexity by **removing features** entirely.
- Example: RFE, selecting top features by mutual information.

They can be used together:
- engineer features → select useful ones → regularize model for stability.


## Q57) How does the Lasso (L1) regularization differ from Ridge (L2) regularization?

**Lasso (L1)**:
- adds \(\lambda \sum |\beta_j|\)
- can force some coefficients to exactly zero → automatic feature selection
- useful when you believe only a few features matter

**Ridge (L2)**:
- adds \(\lambda \sum \beta_j^2\)
- shrinks coefficients toward zero but rarely makes them exactly zero
- very good when features are correlated and you want stable estimates

**Choice** depends on sparsity expectations and multicollinearity.


## Q58) Explain the concept of cross-validation and why it is used.

**Cross-validation (CV)** evaluates a model by splitting data into multiple train/validation folds.

In **k-fold CV**:
1. Split dataset into k roughly equal folds.
2. Train on k−1 folds, validate on the remaining fold.
3. Repeat k times and average performance.

Why used:
- more reliable estimate than a single split
- better use of limited data
- standard method for hyperparameter tuning and model selection


## Q59) What are some common evaluation metrics used for regression tasks?

Regression performance is usually evaluated on **unseen data** (validation/test) using error metrics and diagnostics.

**Common metrics**
- **MAE**: \(\frac{1}{n}\sum|y-\hat{y}|\) (less sensitive to outliers than MSE, easy to interpret).
- **MSE**: \(\frac{1}{n}\sum(y-\hat{y})^2\) (penalizes large errors).
- **RMSE**: \(\sqrt{\text{MSE}}\) (same units as \(y\)).
- **R²**: \(1 - \frac{\sum (y-\hat{y})^2}{\sum (y-\bar{y})^2}\) (variance explained).

**Diagnostics**
- Residual plots: check nonlinearity, heteroscedasticity, outliers.
- Learning curves / CV: check generalization stability.

**Best practice**
- Do cross-validation for stable estimates, especially with smaller datasets.


## Q60) How does the K-nearest neighbors (KNN) algorithm make predictions?

**k-Nearest Neighbors (kNN)** is a lazy, non-parametric method.

How it works:
1. Choose \(k\).
2. For a query point, compute distance to training points (Euclidean, cosine, etc.).
3. Pick the \(k\) nearest points.
4. Predict:
   - **classification:** majority vote (optionally distance-weighted)
   - **regression:** mean/weighted mean of neighbors’ targets

It relies on the idea that “similar points have similar labels.”
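
The four steps can be written in a few lines. The following is a brute-force sketch assuming Euclidean distance and unweighted neighbors; `knn_predict` and the toy arrays are hypothetical names and data, not part of the assignment.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Brute-force kNN: majority vote for classification, mean for regression."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # distance to every training point
    nearest = np.argsort(dists)[:k]                    # indices of the k closest points
    if task == "classification":
        return Counter(y_train[nearest].tolist()).most_common(1)[0][0]
    return float(np.mean(y_train[nearest]))

# Toy data: two well-separated clusters.
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]])
labels = np.array([0, 0, 0, 1, 1, 1])
targets = np.array([1.0, 1.2, 0.8, 10.0, 9.5, 10.5])

pred_class = knn_predict(X_toy, labels, np.array([0.3, 0.3]), k=3)
pred_value = knn_predict(X_toy, targets, np.array([4.9, 5.0]), k=3, task="regression")
print(pred_class, pred_value)  # 0 10.0
```

There is no training step beyond storing the data, which is why kNN is called "lazy": all the work happens at query time.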


## Q61) What is the curse of dimensionality, and how does it affect machine learning algorithms?

The **curse of dimensionality** describes how data becomes sparse and harder to model as the number of features grows.

What happens:
- volume of space grows exponentially → need much more data to cover it
- distances become less informative (kNN, k-means suffer)
- models overfit more easily (many degrees of freedom)

How to mitigate:
- feature selection (remove irrelevant/redundant features)
- dimensionality reduction (PCA/UMAP)
- regularization (Ridge/Lasso)
- get more data or better features
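
The claim that distances become less informative can be checked empirically. The sketch below assumes points drawn uniformly from the unit hypercube and measures the relative contrast (farthest minus nearest distance, divided by nearest); the function name and the choice of 500 points are illustrative assumptions.

```python
import numpy as np

def relative_contrast(n_points, dim, seed=0):
    """(max - min) / min of distances from one query to uniform random points."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, dim))
    query = rng.random(dim)
    d = np.linalg.norm(points - query, axis=1)
    return float((d.max() - d.min()) / d.min())

contrasts = {dim: relative_contrast(500, dim) for dim in (2, 10, 100, 1000)}
for dim, value in contrasts.items():
    print(f"dim={dim:5d}  relative contrast={value:.3f}")
```

As the dimension grows, the nearest and farthest points end up at almost the same distance, so "nearest neighbor" carries much less information than it does in two dimensions.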


## Q62) What is feature scaling, and why is it important in machine learning?

**Feature scaling** transforms features to comparable ranges (e.g., standardization or min-max scaling).

Why important:
- **Distance-based models** (kNN, k-means) depend heavily on scales.
- **Gradient-based models** (logistic regression, neural nets) converge faster with scaled inputs.
- **SVM** is sensitive to scaling, especially with RBF kernel.
- **PCA** requires scaling to avoid domination by large-scale features.

Tree-based models are less sensitive, but scaling still helps in mixed pipelines.
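
A minimal standardization sketch, assuming the statistics are learned from the training data only and then reused for new data. The helper names and the two-feature toy matrix are hypothetical.

```python
import numpy as np

def fit_standardizer(X_train):
    """Learn per-feature mean and standard deviation from the training data."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0.0] = 1.0  # constant features: avoid division by zero
    return mean, std

def standardize(X, mean, std):
    """Apply (x - mean) / std using statistics learned elsewhere."""
    return (X - mean) / std

# Two features on very different scales.
X_train_raw = np.array([[1.0, 1000.0], [2.0, 3000.0], [3.0, 5000.0]])
X_new_raw = np.array([[2.0, 3000.0], [4.0, 7000.0]])

mean, std = fit_standardizer(X_train_raw)
X_train_std = standardize(X_train_raw, mean, std)
X_new_std = standardize(X_new_raw, mean, std)  # reuse the training statistics
print(X_train_std)
print(X_new_std)
```

After scaling, both columns contribute equally to a Euclidean distance, whereas in the raw data the second column would dominate it almost entirely.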


## Q63) How does the Naïve Bayes algorithm handle categorical features?

Naïve Bayes handles categorical features by estimating class-conditional probabilities for each category.

Example:
- Feature = Color ∈ {Red, Blue, Green}
- For each class \(y\), estimate \(P(Color=Red|y)\), \(P(Color=Blue|y)\), etc.

To avoid zero probabilities for rare/unseen categories, apply **smoothing** (Laplace/add-α).
For text features (many categories = words), Multinomial/Bernoulli NB is commonly used.


## Q64) Explain the concept of prior and posterior probabilities in Naïve Bayes.

**Prior probability**: \(P(y)\)  
- The probability of class \(y\) before seeing any features.
- Example: if 70% emails are not spam, \(P(y=\text{not spam})=0.7\).

**Likelihood**: \(P(x|y)\)  
- Probability of observing features \(x\) given class \(y\).

**Posterior probability**: \(P(y|x)\)  
- Updated probability after seeing features:
\[
P(y|x) = \frac{P(x|y)\,P(y)}{P(x)}
\]
In classification, we compare posteriors across classes and pick the largest.
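
A worked example of the update, reusing the 0.7 prior for "not spam" from above. The likelihood values are hypothetical numbers chosen for illustration (say, the probability that an email contains the word "free" in each class).

```python
# Prior P(y), taken from the example above.
priors = {"spam": 0.3, "not spam": 0.7}

# Hypothetical likelihoods P(x | y) for one observed feature x.
likelihoods = {"spam": 0.60, "not spam": 0.05}

# Bayes' rule: posterior is proportional to likelihood * prior, normalised by P(x).
joint = {c: likelihoods[c] * priors[c] for c in priors}
evidence = sum(joint.values())  # P(x) = sum over classes of P(x|y) P(y)
posterior = {c: joint[c] / evidence for c in joint}

print(posterior)  # spam ≈ 0.837, not spam ≈ 0.163
```

Although spam is the less likely class a priori (0.3), the feature is twelve times more probable under spam, so the posterior flips in its favor.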


## Q65) What is Laplace smoothing, and why is it used in Naïve Bayes?

**Laplace smoothing** (add-one smoothing) prevents zero probabilities when a category/word is unseen in training for a class.

Without smoothing:
- If \(P(x_j|y)=0\) for any feature, the entire product becomes 0 → that class is ruled out no matter how strongly the other features support it.

With Laplace smoothing:
\[
P(w|c) = \frac{count(w,c)+\alpha}{\sum_w count(w,c) + \alpha V}
\]
- \(V\) = vocabulary size (or number of categories)
- \(\alpha=1\) gives classic Laplace smoothing

This improves robustness, especially in sparse text data.
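
The formula can be applied directly to token counts. This is a small sketch with a hypothetical four-word vocabulary and a handful of tokens for one class; `smoothed_probs` is an illustrative helper name.

```python
from collections import Counter

def smoothed_probs(tokens, vocabulary, alpha=1.0):
    """Estimate P(w | c) = (count(w, c) + alpha) / (total count + alpha * V)."""
    counts = Counter(tokens)
    total = sum(counts[w] for w in vocabulary)
    denom = total + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / denom for w in vocabulary}

vocab = ["free", "win", "meeting", "report"]
spam_tokens = ["free", "win", "free", "free", "win"]  # tokens seen in class "spam"

unsmoothed = smoothed_probs(spam_tokens, vocab, alpha=0.0)
laplace = smoothed_probs(spam_tokens, vocab, alpha=1.0)
print(unsmoothed)  # "meeting" and "report" get probability 0
print(laplace)     # free 4/9, win 3/9, meeting 1/9, report 1/9
```

With smoothing, the unseen words receive a small non-zero probability and the estimates still sum to 1.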


## Q66) Can Naïve Bayes handle continuous features?

Yes. **Gaussian Naïve Bayes** is designed for continuous (real-valued) features.
It assumes each feature is normally distributed within each class:
\[
x_j|y \sim \mathcal{N}(\mu_{jy}, \sigma_{jy}^2)
\]
It estimates \(\mu\) and \(\sigma^2\) from the training data per class and uses them to compute likelihoods.
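
A compact NumPy sketch of this estimate-then-score procedure follows. It is not scikit-learn's `GaussianNB`; the function names, the small variance floor, and the toy data are assumptions made for the example.

```python
import numpy as np

def fit_gnb(X, y, var_floor=1e-9):
    """Per class: feature means, feature variances (plus a small floor), and prior."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[int(c)] = (Xc.mean(axis=0), Xc.var(axis=0) + var_floor, len(Xc) / len(X))
    return params

def predict_gnb(params, x):
    """Pick the class with the largest log prior + sum of Gaussian log-likelihoods."""
    scores = {}
    for c, (mu, var, prior) in params.items():
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        scores[c] = np.log(prior) + log_lik
    return max(scores, key=scores.get)

# Toy data: two classes with clearly different feature means.
X_gnb = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
                  [4.0, 6.0], [4.2, 5.8], [3.8, 6.2]])
y_gnb = np.array([0, 0, 0, 1, 1, 1])

gnb_params = fit_gnb(X_gnb, y_gnb)
print(predict_gnb(gnb_params, np.array([1.1, 2.1])))  # 0
print(predict_gnb(gnb_params, np.array([4.1, 6.1])))  # 1
```

Scoring is done with log-likelihoods rather than raw densities so that products of many small values do not underflow.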


## Q67) What are the assumptions of the Naïve Bayes algorithm?

Key assumptions of Naïve Bayes:
1. **Conditional independence**: features are independent given the class.
2. **Correct likelihood model** for features:
   - Gaussian for continuous (GNB)
   - Multinomial/Bernoulli/Categorical for discrete/text
3. **i.i.d. samples**: training examples are independent and identically distributed.

When these assumptions are approximately reasonable, NB can be very effective.


## Q68) How does Naïve Bayes handle missing values?

Most NB implementations cannot accept NaNs directly, so missing values are handled by preprocessing:
- impute numeric features (mean/median)
- impute categorical features (mode) or introduce a category like “Unknown”
- add missingness indicators if missingness is informative

Always fit imputers on the training set only, then apply the same transformation to validation/test.


## Q69) What are some common applications of Naïve Bayes?

Naïve Bayes is widely used in problems where features are many and sparse, especially text.

Common applications:
- **Spam detection**
- **Sentiment analysis**
- **Document/topic classification**
- **Language detection**
- **Medical diagnosis baselines** (quick baseline)
- **Real-time classification** where speed matters

It is often used as a strong baseline because it trains and predicts very fast.


## Q70) Explain the difference between generative and discriminative models.

**Generative models** learn how data is generated by modeling \(P(x,y)\) or \(P(x|y)\) and \(P(y)\).
- Example: Naïve Bayes (models \(P(x|y)\) and \(P(y)\)).
- Can generate samples of \(x\) given \(y\) (in principle).

**Discriminative models** directly model \(P(y|x)\) or learn the decision boundary.
- Examples: logistic regression, SVM, neural networks.
- Often achieve higher predictive accuracy when enough labeled data exists.


## Q71) How does the decision boundary of a Naïve Bayes classifier look like for binary classification tasks?

The Naïve Bayes decision boundary depends on the NB variant:

- For **Multinomial/Bernoulli NB**, the log-posterior is a linear function of the feature vector (e.g., word counts or binary indicators), so the boundary is **linear** in that feature space.
- For **Gaussian NB**, if each feature has the **same variance in both classes**, the boundary is **linear** (similar to LDA with a diagonal covariance). If the variances differ between classes, the boundary is **quadratic** (curved).

So: linear in many cases, quadratic in general for GNB with unequal variances.
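
A short derivation for the Gaussian case, writing \(\pi_y = P(y)\) as shorthand (a notation assumed here for brevity). The log-odds between classes 1 and 0 are:
\[
\log\frac{P(y=1|x)}{P(y=0|x)} = \log\frac{\pi_1}{\pi_0} + \sum_j\left[\frac{1}{2}\log\frac{\sigma_{j0}^2}{\sigma_{j1}^2} - \frac{(x_j-\mu_{j1})^2}{2\sigma_{j1}^2} + \frac{(x_j-\mu_{j0})^2}{2\sigma_{j0}^2}\right]
\]
The boundary is where this expression equals 0. If \(\sigma_{j1}^2=\sigma_{j0}^2=\sigma_j^2\) for every feature, the \(x_j^2\) terms cancel and the log-odds reduce to
\[
\log\frac{\pi_1}{\pi_0} + \sum_j\left[\frac{\mu_{j1}-\mu_{j0}}{\sigma_j^2}\,x_j - \frac{\mu_{j1}^2-\mu_{j0}^2}{2\sigma_j^2}\right]
\]
which is linear in \(x\). Otherwise each feature keeps a term \(\frac{1}{2}\left(\frac{1}{\sigma_{j0}^2}-\frac{1}{\sigma_{j1}^2}\right)x_j^2\), which makes the boundary quadratic.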


## Q72) What is the difference between multinomial Naïve Bayes and Gaussian Naïve Bayes?

**Multinomial Naïve Bayes**
- Designed for **count data**: word counts, term frequencies.
- Likelihood is multinomial.
- Common in text classification (spam, sentiment).

**Gaussian Naïve Bayes**
- Designed for **continuous** features.
- Assumes normal distribution per feature per class.
- Used for numeric datasets where Gaussian assumption is acceptable.

Choose based on data type and representation.


## Q73) How does Naïve Bayes handle numerical instability issues?

Naïve Bayes can face numerical underflow because it multiplies many small probabilities:
\[
P(y|x) \propto P(y)\prod_j P(x_j|y)
\]
To fix this, implementations use **log-space**:
\[
\log P(y|x) = \log P(y) + \sum_j \log P(x_j|y) + C
\]
Here \(C = -\log P(x)\) is the same for every class, so it can be ignored when comparing classes.
Additionally, smoothing ensures probabilities are never exactly zero, avoiding \(\log(0)\).
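
The underflow is easy to reproduce. The sketch below assumes 500 features that each contribute a likelihood of 0.001, and uses the log-sum-exp trick to turn per-class log scores back into normalized posteriors; `log_posteriors` and the example scores are hypothetical.

```python
import numpy as np

# 500 small probabilities: the direct product underflows in float64.
probs = np.full(500, 1e-3)
naive_product = float(np.prod(probs))    # 0.0 (true value is 1e-1500)
log_sum = float(np.sum(np.log(probs)))   # about -3453.9, perfectly representable

def log_posteriors(log_joint):
    """Normalise per-class log P(y) + sum_j log P(x_j|y) with log-sum-exp."""
    log_joint = np.asarray(log_joint, dtype=float)
    m = log_joint.max()  # subtract the max before exponentiating for stability
    return log_joint - (m + np.log(np.sum(np.exp(log_joint - m))))

# Two hypothetical per-class log scores that would both underflow if exponentiated.
post = np.exp(log_posteriors([-3453.9, -3456.2]))
print(naive_product, log_sum)
print(post)  # roughly [0.909, 0.091]
```

Only the differences between class scores matter, which is why subtracting the maximum before exponentiating loses nothing.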


## Q74) What is the Laplacian correction, and when is it used in Naïve Bayes?

**Laplacian correction** is another name for **Laplace (add-one) smoothing**.
It is used when estimating probabilities from counts for categorical/text features to:
- avoid zero probabilities for unseen events
- improve generalization on sparse datasets

General add-α version:
\[
P = \frac{count+\alpha}{N+\alpha V}
\]


## Q75) Can Naïve Bayes be used for regression tasks?

In general coursework and common ML practice, Naïve Bayes is not used for regression.
Regression requires predicting continuous values, and NB is built around computing class posteriors.

For regression, typical models include:
- linear regression / Ridge / Lasso
- SVR
- tree ensembles (Random Forest, Gradient Boosting)
- neural networks


## Q76) Explain the concept of conditional independence assumption in Naïve Bayes.

**Conditional independence** in Naïve Bayes means:
Given the class \(y\), the features do not depend on each other:
\[
P(x_1,\dots,x_p|y) = \prod_{j=1}^p P(x_j|y)
\]
This assumption dramatically simplifies learning:
- Instead of estimating a full joint distribution over features, NB estimates each feature distribution separately per class.
Even when not perfectly true, NB often performs well, especially with many weakly informative features.


## Q77) How does Naïve Bayes handle categorical features with a large number of categories?

For categorical features with many categories:
- probability estimates become unreliable for rare categories
- one-hot can create very sparse/high-dimensional features

Common handling:
- **Laplace/add-α smoothing**
- **group rare categories** into “Other”
- **feature hashing** (especially in text pipelines)
- use text-style representations (counts/TF-IDF) if categories behave like tokens

Goal: stabilize probabilities and reduce dimensionality.
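
Grouping rare categories is the simplest of these to sketch. The example assumes a frequency cutoff of 2 and a hypothetical city column; categories never seen in training are mapped to "Other" as well.

```python
from collections import Counter

def fit_frequent_categories(values, min_count=2):
    """Return the set of categories that appear at least min_count times."""
    counts = Counter(values)
    return {v for v, c in counts.items() if c >= min_count}

def group_rare(values, keep, other="Other"):
    """Replace every category outside `keep` with a single catch-all label."""
    return [v if v in keep else other for v in values]

train_cities = ["Delhi", "Delhi", "Mumbai", "Pune", "Delhi", "Mumbai", "Goa"]
keep = fit_frequent_categories(train_cities, min_count=2)

print(group_rare(train_cities, keep))
# ['Delhi', 'Delhi', 'Mumbai', 'Other', 'Delhi', 'Mumbai', 'Other']
print(group_rare(["Mumbai", "Jaipur"], keep))  # unseen category also becomes 'Other'
# ['Mumbai', 'Other']
```

The set of frequent categories is learned from the training data only and then reused, so validation/test rows are mapped consistently.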


## Q78) What are some drawbacks of the Naïve Bayes algorithm?

Drawbacks of Naïve Bayes:
- independence assumption often violated → can limit accuracy
- probability estimates can be poorly calibrated (overconfident)
- Gaussian NB can fail when features are not close to normal per class
- cannot capture complex feature interactions without feature engineering
- sensitive to representation (e.g., for text: stopwords, tokenization choices matter)

Still, NB is a strong, fast baseline in many classification tasks.


## Q79) Explain the concept of smoothing in Naïve Bayes.

**Smoothing** in Naïve Bayes adjusts probability estimates to avoid zeros and reduce overfitting to sparse counts.

Why needed:
- In sparse data, some events may not appear in training but can appear in test.
- Without smoothing, unseen events have probability 0 → kills the entire posterior.

Common smoothing:
- Laplace (add-one): \(\alpha=1\)
- Add-α smoothing: \(\alpha\) tuned (e.g., 0.1, 0.5, 1.0) for best performance.


## Q80) How does Naïve Bayes handle imbalanced datasets?

Naïve Bayes can be affected by imbalanced classes because the **prior** \(P(y)\) may dominate decisions.

How NB deals with imbalance:
- **Class priors**: it naturally uses \(P(y)\), reflecting imbalance.
- **Threshold tuning**: instead of 0.5, choose a threshold that improves recall/precision.
- **Evaluation**: use F1, recall, PR-AUC (not only accuracy).

Additional strategies (outside NB itself):
- resampling training data (oversample/undersample, SMOTE)
- cost-sensitive learning in other models
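
Threshold tuning can be illustrated with a handful of validation examples. The predicted probabilities below are hypothetical values standing in for a classifier's \(P(y=1|x)\) output, and `precision_recall_f1` is an illustrative helper.

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Imbalanced validation set: 7 negatives, 3 positives (hypothetical scores).
y_val = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
p_val = np.array([0.05, 0.10, 0.10, 0.20, 0.25, 0.30, 0.45, 0.35, 0.40, 0.80])

results = {}
for threshold in (0.5, 0.3):
    y_hat = (p_val >= threshold).astype(int)
    results[threshold] = precision_recall_f1(y_val, y_hat)
    print(threshold, results[threshold])
# threshold 0.5: precision 1.0, recall 1/3, F1 0.5
# threshold 0.3: precision 0.6, recall 1.0, F1 0.75
```

Lowering the threshold trades some precision for recall on the minority class. The threshold should be chosen on validation data, not on the test set.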


---
# Optional Demo Code (supports multiple questions)
These cells are optional and demonstrate regression metrics, logistic regression evaluation,
Naïve Bayes smoothing, and cross-validation/grid search with an SVM.
---


In [2]:
# If scikit-learn is available, these demos will run.
# If not, you can install it locally with: pip install scikit-learn


## Demo A) Regression metrics (MAE, RMSE, R²)

In [3]:
try:
    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

    X, y = make_regression(n_samples=600, n_features=6, noise=15.0, random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

    lr = LinearRegression().fit(X_train, y_train)
    pred = lr.predict(X_test)

    print("MAE :", mean_absolute_error(y_test, pred))
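    # np.sqrt(MSE) works across scikit-learn versions; the `squared=False` argument
    # was deprecated and later removed from mean_squared_error.
    print("RMSE:", np.sqrt(mean_squared_error(y_test, pred)))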
    print("R^2 :", r2_score(y_test, pred))
except Exception as e:
    print("Demo skipped:", e)


MAE : 10.70647762693496
RMSE: 13.746574623072956
R^2 : 0.987065779228942




## Demo B) Logistic regression evaluation

In [4]:
try:
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import Pipeline
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score

    Xc, yc = make_classification(n_samples=2000, n_features=10, n_informative=6,
                                 weights=[0.85, 0.15], random_state=42)

    X_train, X_test, y_train, y_test = train_test_split(Xc, yc, test_size=0.25, stratify=yc, random_state=42)

    clf = Pipeline([("scaler", StandardScaler()),
                    ("lr", LogisticRegression(max_iter=2000))])
    clf.fit(X_train, y_train)

    proba = clf.predict_proba(X_test)[:, 1]
    pred = (proba >= 0.5).astype(int)

    print("Confusion matrix:\n", confusion_matrix(y_test, pred))
    print("\nClassification report:\n", classification_report(y_test, pred))
    print("ROC-AUC:", roc_auc_score(y_test, proba))
except Exception as e:
    print("Demo skipped:", e)


Confusion matrix:
 [[412  12]
 [ 26  50]]

Classification report:
               precision    recall  f1-score   support

           0       0.94      0.97      0.96       424
           1       0.81      0.66      0.72        76

    accuracy                           0.92       500
   macro avg       0.87      0.81      0.84       500
weighted avg       0.92      0.92      0.92       500

ROC-AUC: 0.920990566037736


## Demo C) Naïve Bayes smoothing effect (alpha)

In [5]:
try:
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import f1_score
    from collections import Counter

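    rng = np.random.default_rng(42)
    X_nb = rng.poisson(lam=2.0, size=(1500, 25))  # count-like features
    # Note: the labels are drawn independently of X_nb, so there is no signal to learn.
    # The F1 of 0.0 printed below reflects that, not an effect of smoothing.
    y_nb = rng.choice([0,1], size=1500, p=[0.8, 0.2])

    X_train, X_test, y_train, y_test = train_test_split(X_nb, y_nb, test_size=0.25, stratify=y_nb, random_state=42)
    print("Class balance (train):", Counter(y_train))

    # alpha=0.0 disables smoothing; depending on the scikit-learn version this may
    # emit a warning or be clipped to a tiny positive value.
    nb0 = MultinomialNB(alpha=0.0).fit(X_train, y_train)
    nb1 = MultinomialNB(alpha=1.0).fit(X_train, y_train)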

    pred0 = nb0.predict(X_test)
    pred1 = nb1.predict(X_test)

    print("F1 (alpha=0):", f1_score(y_test, pred0))
    print("F1 (alpha=1):", f1_score(y_test, pred1))
except Exception as e:
    print("Demo skipped:", e)


Class balance (train): Counter({0: 889, 1: 236})
F1 (alpha=0): 0.0
F1 (alpha=1): 0.0


## Demo D) Cross-validation + Grid Search (SVM)

In [6]:
try:
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import Pipeline
    from sklearn.datasets import make_classification
    from sklearn.metrics import accuracy_score

    Xc, yc = make_classification(n_samples=1500, n_features=12, n_informative=7, weights=[0.8,0.2], random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(Xc, yc, test_size=0.25, stratify=yc, random_state=42)

    pipe = Pipeline([("scaler", StandardScaler()),
                     ("svm", SVC(kernel="rbf"))])

    grid = GridSearchCV(pipe, {
        "svm__C": [0.5, 1, 5, 10],
        "svm__gamma": ["scale", 0.1, 0.01]
    }, cv=5, scoring="f1", n_jobs=-1)

    grid.fit(X_train, y_train)
    print("Best params:", grid.best_params_)
    print("Best CV F1:", grid.best_score_)
    print("Test accuracy:", accuracy_score(y_test, grid.best_estimator_.predict(X_test)))
except Exception as e:
    print("Demo skipped:", e)


Best params: {'svm__C': 5, 'svm__gamma': 'scale'}
Best CV F1: 0.8662329437781132
Test accuracy: 0.9386666666666666
