# Theoretical

##
### 1. What is a Support Vector Machine (SVM) ?

A **Support Vector Machine (SVM)** is a **supervised machine learning algorithm** used for **classification and regression tasks**. It works by finding the **optimal hyperplane** that best separates data points of different classes in an N-dimensional space.

* The **support vectors** are the data points closest to the decision boundary; they define the position of the hyperplane.
* SVM aims to **maximize the margin**, i.e., the distance between the hyperplane and the nearest data points from each class.
* It can also handle **non-linear data** using **kernel functions** (e.g., polynomial, RBF).


##
### 2. What is the difference between Hard Margin and Soft Margin SVM ?

**Hard Margin SVM:**

* Assumes that data is **perfectly linearly separable**.
* Finds a hyperplane that separates all points **without any misclassification**.
* Very **sensitive to noise and outliers**: a single mislabeled or overlapping point can make separation impossible.

**Soft Margin SVM:**

* Allows some **misclassifications** by introducing a **slack variable (ξ)**.
* Balances between maximizing the margin and minimizing classification errors.
* More **robust** and works well with **real-world, noisy data**.


##
### 3. What is the mathematical intuition behind SVM ?

The **mathematical intuition** behind SVM is to find a **hyperplane** that best separates two classes by **maximizing the margin**, the distance between the hyperplane and the nearest data points (support vectors).

For a dataset with features $X$ and labels $y_i \in \{-1, +1\}$, every training point must satisfy:

$$
y_i (w \cdot x_i + b) \ge 1
$$

SVM minimizes the cost function:

$$
\min_{w,b} \ \frac{1}{2} \|w\|^2
$$

subject to the above constraint.

* $w$ → weight vector defining the hyperplane
* $b$ → bias term
* Minimizing $\frac{1}{2}\|w\|^2$ is equivalent to maximizing the margin width $\frac{2}{\|w\|}$


👉 In soft margin SVM, a penalty term $C \sum_i \xi_i$ is added to the objective to allow some misclassifications while keeping the margin as wide as possible.
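The constraint and the margin formula can be checked on a small example. The sketch below uses hypothetical 2-D points and a hand-picked hyperplane (`w`, `b` are assumptions for illustration, not the output of a solver); it evaluates $y_i (w \cdot x_i + b)$ for each point and the margin width $2 / \|w\|$.

```python
import math

def functional_margins(w, b, X, y):
    """Return y_i * (w . x_i + b) for every sample."""
    return [
        yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        for xi, yi in zip(X, y)
    ]

def margin_width(w):
    """Width of the margin, 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wj * wj for wj in w))

# Hypothetical toy data: two points per class in 2-D.
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
y = [1, 1, -1, -1]

# Hand-picked hyperplane, assumed for illustration.
w, b = (0.5, 0.5), -1.0

margins = functional_margins(w, b, X, y)
feasible = all(m >= 1.0 for m in margins)  # hard-margin constraint holds?
width = margin_width(w)

print(margins, feasible, round(width, 4))
```

Points whose value is exactly 1 lie on the margin boundary; here those are `(2, 2)` and `(0, 0)`, and the margin width equals the distance between them.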

##
### 4. What is the role of Lagrange Multipliers in SVM ?

**Lagrange Multipliers** are used in SVM to transform the **constrained optimization problem** (finding the optimal hyperplane with maximum margin) into an **unconstrained one**, making it easier to solve mathematically.

They help incorporate the constraints:

$$
y_i (w \cdot x_i + b) \ge 1
$$

into the objective function through a **Lagrangian formulation**:

$$
L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right]
$$

where $\alpha_i \ge 0$ are the Lagrange multipliers.

* Non-zero $\alpha_i$ correspond to **support vectors**.
* This formulation leads to the **dual problem**, which allows SVM to efficiently handle **non-linear data** using **kernel tricks**.
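Setting the derivative of the Lagrangian with respect to $w$ to zero gives $w = \sum_i \alpha_i y_i x_i$, so the hyperplane can be rebuilt from the multipliers alone. The sketch below illustrates this on a hypothetical three-point dataset; the `alphas` are supplied by hand (they solve the dual for this tiny case) rather than computed by an optimizer.

```python
def weights_from_alphas(alphas, X, y):
    """w = sum_i alpha_i * y_i * x_i (stationarity of the Lagrangian in w)."""
    dim = len(X[0])
    return [
        sum(a * yi * xi[j] for a, xi, yi in zip(alphas, X, y))
        for j in range(dim)
    ]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical data: the third point lies far from the boundary.
X = [(1.0, 1.0), (-1.0, -1.0), (3.0, 3.0)]
y = [1, -1, 1]
alphas = [0.25, 0.25, 0.0]  # assumed dual solution for this toy problem

w = weights_from_alphas(alphas, X, y)

# Any support vector satisfies y_i (w . x_i + b) = 1, so b = y_i - w . x_i.
b = y[0] - dot(w, X[0])

support_idx = [i for i, a in enumerate(alphas) if a > 0]
margins = [yi * (dot(w, xi) + b) for xi, yi in zip(X, y)]

print(w, b, support_idx, margins)
```

Only the two points with non-zero multipliers contribute to `w`; the third point could be removed without changing the result.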


##
### 5.  What are Support Vectors in SVM ?

**Support Vectors** are the **data points that lie closest to the decision boundary (hyperplane)** in an SVM model.

* They are the **critical elements** of the training set that directly influence the position and orientation of the hyperplane.
* Only these points define the margin: removing one would **change the decision boundary**, whereas removing any other point would not.
* Support vectors are identified by **non-zero Lagrange multipliers (αᵢ)** in the optimization process.

👉 They ensure the SVM achieves **maximum margin separation** between classes.


##
### 6. What is a Support Vector Classifier (SVC) ?

A **Support Vector Classifier (SVC)** is the **classification implementation of SVM** used to separate data into distinct categories. It finds the **optimal hyperplane** that maximizes the margin between different classes.

* It supports both **linear and non-linear classification** using **kernel functions** (e.g., linear, polynomial, RBF).
* The **regularization parameter (C)** controls the trade-off between maximizing margin and minimizing misclassification.
* SVC is robust to high-dimensional data and effective when the **number of features exceeds the number of samples**.


##
### 7. What is a Support Vector Regressor (SVR) ?

A **Support Vector Regressor (SVR)** is the **regression version of SVM**, designed to predict continuous values rather than class labels.

* It tries to fit a function within a **margin of tolerance (ε)** around the actual data points.
* Errors within this margin are **ignored**, while points outside are **penalized** using a regularization parameter $C$.
* SVR aims to find a **flat and simple function** that approximates data well, making it robust to **outliers** and effective for **non-linear regression** using kernel functions.
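The tolerance margin corresponds to the ε-insensitive loss, $\max(0, |y - \hat{y}| - \varepsilon)$. A minimal sketch, with hypothetical targets and predictions chosen for illustration:

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample loss: zero inside the epsilon tube, linear outside it."""
    return [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]

# Hypothetical targets and predictions.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.25, 2.0, 2.0]

losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.5)
print(losses)
```

The first prediction is off by 0.25, which is inside the tube of width 0.5, so it contributes no loss; only the last prediction, off by 1.0, is penalized.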


##
### 8. What is the Kernel Trick in SVM ?

The **Kernel Trick** allows SVM to handle **non-linearly separable data** by implicitly mapping input features into a **higher-dimensional space** where a linear separator can be found, **without explicitly computing the transformation**.

Common kernel functions:

* **Linear Kernel:** $K(x, x') = x \cdot x'$
* **Polynomial Kernel:** $K(x, x') = (x \cdot x' + 1)^d$
* **RBF (Gaussian) Kernel:** $K(x, x') = e^{-\gamma \|x - x'\|^2}$
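The "implicit mapping" claim can be verified numerically for the degree-2 polynomial kernel on 2-D inputs. In the sketch below, `phi` is one explicit feature map whose dot product reproduces $(x \cdot x' + 1)^2$; the function names and sample vectors are illustrative assumptions.

```python
import math

def poly_kernel(x, z, degree=2):
    """Polynomial kernel (x . z + 1)^degree, computed in input space."""
    return (sum(a * b for a, b in zip(x, z)) + 1.0) ** degree

def phi(x):
    """Explicit 6-D feature map for 2-D input matching (x . z + 1)^2."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)

# Hypothetical sample vectors.
x, z = (1.0, 2.0), (3.0, -1.0)

k_implicit = poly_kernel(x, z)                            # 2-D arithmetic only
k_explicit = sum(a * b for a, b in zip(phi(x), phi(z)))   # dot product in 6-D

print(k_implicit, round(k_explicit, 10))
```

Both values agree (up to floating-point rounding), but the kernel never builds the 6-D vectors. For the RBF kernel the corresponding feature space is infinite-dimensional, so only the implicit route is practical.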

👉 This trick enables SVM to model **complex, non-linear relationships** efficiently while keeping computation feasible.


##
### 9.  Compare Linear Kernel, Polynomial Kernel, and RBF Kernel.

| **Kernel Type**           | **Mathematical Form**                 | **Use Case**                               | **Advantages**                              | **Limitations**                                 |
| ------------------------- | ------------------------------------- | ------------------------------------------ | ------------------------------------------- | ----------------------------------------------- |
| **Linear Kernel**         | $K(x, x') = x \cdot x'$               | When data is **linearly separable**        | Simple, fast, easy to interpret             | Fails with non-linear data                      |
| **Polynomial Kernel**     | $K(x, x') = (x \cdot x' + 1)^d$       | When data has **moderate non-linearity**   | Captures interactions and curved boundaries | Sensitive to degree $d$, can overfit            |
| **RBF (Gaussian) Kernel** | $K(x, x') = e^{-\gamma \lVert x - x' \rVert^2}$ | For **highly non-linear** and complex data | Powerful and flexible for most cases        | Requires careful tuning of $\gamma$ and $C$     |

👉 In practice, **RBF Kernel** is the most widely used due to its ability to model complex, non-linear relationships effectively.


##
### 10. What is the effect of the C parameter in SVM ?

The **C parameter** in SVM is a **regularization constant** that controls the trade-off between **maximizing the margin** and **minimizing classification errors**.

* **High C value:**

  * The model prioritizes **correctly classifying all training points**, allowing **smaller margins**.
  * Can lead to **overfitting** (less generalization).

* **Low C value:**

  * Allows **more misclassifications** but aims for a **wider margin**.
  * Produces a **simpler, more generalizable model**.
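The trade-off can be seen by evaluating the soft-margin objective $\frac{1}{2}\|w\|^2 + C \sum_i \max(0, 1 - y_i (w \cdot x_i + b))$ for two candidate hyperplanes. This is a sketch under stated assumptions: the 1-D data and both candidates (`wide`, `tight`) are hypothetical and hand-picked, and no optimizer is run.

```python
def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum of slacks max(0, 1 - y_i (w . x_i + b))."""
    reg = 0.5 * sum(wj * wj for wj in w)
    slack = sum(
        max(0.0, 1.0 - yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b))
        for xi, yi in zip(X, y)
    )
    return reg + C * slack

# Hypothetical 1-D data; the last point (x = 1, label -1) sits near the positive class.
X = [(2.0,), (3.0,), (-2.0,), (-3.0,), (1.0,)]
y = [1, 1, -1, -1, -1]

wide = ((0.5,), 0.0)    # wide margin, but violates the constraint for the last point
tight = ((2.0,), -3.0)  # narrow margin that satisfies every constraint

for C in (0.1, 10.0):
    obj_wide = soft_margin_objective(*wide, X, y, C)
    obj_tight = soft_margin_objective(*tight, X, y, C)
    print(C, "wide" if obj_wide < obj_tight else "tight")
```

With `C = 0.1` the wide-margin candidate has the lower objective despite its violation; with `C = 10.0` the penalty dominates and the tight candidate wins.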

👉 In short, **C regulates model flexibility**: large C fits data tightly, small C keeps it smoother and more robust to noise.


##
### 11. What is the role of the Gamma parameter in RBF Kernel SVM ?

The **Gamma (γ)** parameter in an **RBF Kernel SVM** defines how far the influence of a single training point reaches; it controls the **curvature** of the decision boundary.

* **High γ (large value):**

  * Each point has **short-range influence**, leading to **tight, complex boundaries**.
  * Model fits training data closely → risk of **overfitting**.

* **Low γ (small value):**

  * Points have **wider influence**, creating **smoother, simpler boundaries**.
  * May lead to **underfitting** if too small.
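The "reach" of a training point can be made concrete by evaluating the RBF kernel between two fixed points for several values of γ. The points and γ values below are hypothetical and chosen only to show the trend.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF kernel exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Two hypothetical points at squared distance 2.
x, z = (0.0, 0.0), (1.0, 1.0)

similarities = {gamma: rbf_kernel(x, z, gamma) for gamma in (0.01, 1.0, 100.0)}
for gamma, k in similarities.items():
    print(f"gamma={gamma}: K={k:.3g}")
```

With small γ the kernel value stays close to 1, so distant points still influence each other; with large γ it collapses toward 0, so each point affects only its immediate neighborhood.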

👉 Thus, **Gamma controls the decision region smoothness**: higher values capture fine detail, while lower values give smoother boundaries that tend to generalize better, up to the point of underfitting.


##
### 12. What is the Naïve Bayes classifier, and why is it called "Naïve" ?

**Naïve Bayes** is a **probabilistic classification algorithm** based on **Bayes’ Theorem**, which predicts the probability of a class given the input features. It assumes that all features are **independent of each other**, given the class label.

$$
P(C|X) = \frac{P(X|C) \, P(C)}{P(X)}
$$

It’s called **“Naïve”** because of this **strong independence assumption**: in reality, features are often correlated, but the model still performs remarkably well in many tasks like **spam filtering, sentiment analysis, and text classification**.


##
### 13. What is Bayes’ Theorem ?

**Bayes’ Theorem** is a fundamental concept in probability theory that describes how to **update the probability of a hypothesis** based on new evidence.

$$
P(A|B) = \frac{P(B|A) \, P(A)}{P(B)}
$$

Where:

* $P(A|B)$: Posterior probability, the probability of event A given B.
* $P(B|A)$: Likelihood, the probability of observing B given A is true.
* $P(A)$: Prior probability of A.
* $P(B)$: Evidence or normalization factor.
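A short numeric example shows how the four terms combine. The probabilities below are made-up values for a hypothetical spam filter (A = "message is spam", B = "message contains a given word"); the evidence $P(B)$ is expanded with the law of total probability.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """P(A|B) via Bayes' theorem, with P(B) from the law of total probability."""
    evidence = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    return p_b_given_a * prior_a / evidence

# Hypothetical values: 20% of mail is spam; the word appears in 60% of spam
# and in 5% of non-spam messages.
p_spam_given_word = posterior(prior_a=0.2, p_b_given_a=0.6, p_b_given_not_a=0.05)
print(round(p_spam_given_word, 4))
```

Observing the word raises the probability of spam from the prior of 0.2 to a posterior of 0.75.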

👉 It allows combining **prior knowledge** with **new data**, forming the foundation of **Bayesian inference** and models like **Naïve Bayes**.


##
### 14. Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes.

| **Type**                    | **Used For**    | **Feature Type**                                     | **Probability Model**                                                                         | **Example Use Case**                        |
| --------------------------- | --------------- | ---------------------------------------------------- | --------------------------------------------------------------------------------------------- | ------------------------------------------- |
| **Gaussian Naïve Bayes**    | Continuous data | Features follow a **normal (Gaussian)** distribution | $P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x_i - \mu)^2}{2\sigma^2}}$          | Iris classification, medical data           |
| **Multinomial Naïve Bayes** | Count data      | Features represent **frequency counts**              | Uses multinomial distribution                                                                 | Text classification, spam detection         |
| **Bernoulli Naïve Bayes**   | Binary data     | Features are **0/1** (present or absent)             | Uses Bernoulli distribution                                                                   | Sentiment analysis, document classification |

👉 In short:

* **Gaussian** → continuous numeric features,
* **Multinomial** → word count or frequency data,
* **Bernoulli** → binary or Boolean features.


##
### 15. When should you use Gaussian Naïve Bayes over other variants ?

You should use **Gaussian Naïve Bayes** when the **features are continuous** and approximately follow a **normal (Gaussian) distribution**.

✅ **Best suited for:**

* Numerical datasets (e.g., height, weight, age, temperature).
* Problems where feature values are **real numbers** rather than counts or binary indicators.
* Tasks like **medical diagnosis, sensor data analysis, or image recognition**, where continuous inputs are common.
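For continuous features like these, Gaussian Naïve Bayes reduces to estimating a mean and variance per feature and class, then summing log-densities. The sketch below is a minimal from-scratch version for illustration, not a library implementation; the data, the class labels, and the small `1e-9` variance floor are assumptions.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate (prior, feature means, feature variances) for each class."""
    by_class = defaultdict(list)
    for xi, yi in zip(X, y):
        by_class[yi].append(xi)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # Small floor on the variance (an assumption) avoids division by zero.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (n / len(X), means, variances)
    return model

def log_gaussian(x, mean, var):
    """Log of the normal density N(x; mean, var)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def predict(model, x):
    """Pick the class with the highest log prior + summed log likelihoods."""
    scores = {
        label: math.log(prior)
        + sum(log_gaussian(v, m, s) for v, m, s in zip(x, means, variances))
        for label, (prior, means, variances) in model.items()
    }
    return max(scores, key=scores.get)

# Hypothetical (height_cm, weight_kg) samples for two made-up classes.
X = [(180.0, 80.0), (175.0, 75.0), (185.0, 90.0),
     (160.0, 55.0), (165.0, 60.0), (155.0, 50.0)]
y = ["A", "A", "A", "B", "B", "B"]

model = fit_gaussian_nb(X, y)
pred_tall = predict(model, (178.0, 82.0))
pred_short = predict(model, (158.0, 52.0))
print(pred_tall, pred_short)
```

Working in log space avoids numerical underflow when many per-feature densities are multiplied together.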

👉 Use **Multinomial NB** for text/count data and **Bernoulli NB** for binary features instead.


##
### 16. What are the key assumptions made by Naïve Bayes ?

The **key assumptions** of the Naïve Bayes algorithm are:

1. **Feature Independence:**
   All features are assumed to be **independent** of each other given the class label.

2. **Equal Importance of Features:**
   Each feature contributes **equally and independently** to the outcome.

3. **Conditional Probability Validity:**
   The probability of each feature given the class ($P(X_i \mid Y)$) can be estimated reliably from the training data.

4. **Distributional Assumption (Variant-specific):**

   * **Gaussian NB:** Features follow a **normal distribution**.
   * **Multinomial NB:** Features represent **counts/frequencies**.
   * **Bernoulli NB:** Features are **binary (0/1)**.

👉 These assumptions simplify computation and make Naïve Bayes **fast, efficient, and effective**, even when the independence assumption is not fully true.


##
### 17. What are the advantages and disadvantages of Naïve Bayes ?

**Advantages:**

* ✅ **Fast and efficient**: simple to implement and computationally inexpensive.
* ✅ **Performs well on small datasets** and in **high-dimensional spaces** (e.g., text data).
* ✅ Works surprisingly well even when the **independence assumption** is partially violated.
* ✅ Requires **less training data** and handles **multiclass classification** effectively.

**Disadvantages:**

* ❌ Assumes **feature independence**, which rarely holds true in real data.
* ❌ Performs poorly when features are **highly correlated**.
* ❌ Struggles with **zero probabilities** (if a feature value never appears in training data).
* ❌ Continuous data must fit the **assumed distribution** (e.g., Gaussian).


##
### 18. Why is Naïve Bayes a good choice for text classification ?

**Naïve Bayes** is an excellent choice for **text classification** because:

* ✅ The **feature independence** assumption is a workable approximation for bag-of-words features, where each word (token) is treated as independent given the class.
* ✅ Handles **high-dimensional sparse data** efficiently, which is common in text (e.g., a vocabulary of thousands of words, most of them absent from any single document).
* ✅ **Fast training and prediction**, even on large corpora.
* ✅ Works well with **word count or frequency data**, especially with **Multinomial Naïve Bayes**.
* ✅ Performs robustly in **spam detection, sentiment analysis, and topic classification** with minimal preprocessing.
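The points above can be illustrated with a tiny bag-of-words classifier. This is a minimal from-scratch sketch of Multinomial Naïve Bayes with add-one smoothing, assuming whitespace tokenization and a four-document hypothetical corpus; it is not a production implementation.

```python
import math
from collections import Counter

def train_multinomial_nb(docs, labels):
    """Collect the vocabulary, per-class document counts, and per-class word counts."""
    vocab = {w for d in docs for w in d.split()}
    class_docs = Counter(labels)
    word_counts = {c: Counter() for c in class_docs}
    for d, c in zip(docs, labels):
        word_counts[c].update(d.split())
    return vocab, class_docs, word_counts

def predict_multinomial_nb(model, doc):
    """Return the class with the highest log prior + summed log word likelihoods."""
    vocab, class_docs, word_counts = model
    n_docs = sum(class_docs.values())
    scores = {}
    for c in class_docs:
        total = sum(word_counts[c].values())
        score = math.log(class_docs[c] / n_docs)
        for w in doc.split():
            if w in vocab:  # words never seen in training are skipped
                score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

# Hypothetical training corpus.
docs = ["win free prize now", "free money win",
        "meeting schedule today", "project meeting notes"]
labels = ["spam", "spam", "ham", "ham"]

model = train_multinomial_nb(docs, labels)
pred_spam = predict_multinomial_nb(model, "free prize")
pred_ham = predict_multinomial_nb(model, "meeting today")
print(pred_spam, pred_ham)
```

Training is a single counting pass over the corpus, which is why the method scales well to large vocabularies.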


##
### 19. Compare SVM and Naïve Bayes for classification tasks.

| **Aspect**               | **Support Vector Machine (SVM)**                   | **Naïve Bayes (NB)**                              |
| ------------------------ | -------------------------------------------------- | ------------------------------------------------- |
| **Model Type**           | Discriminative (finds boundary between classes)    | Generative (models class probabilities)           |
| **Data Assumption**      | No specific distribution assumption                | Assumes feature independence                      |
| **Computation**          | Computationally heavier, slower on large data      | Very fast and efficient                           |
| **Performance**          | High accuracy on complex and high-dimensional data | Performs well on simple or text-based data        |
| **Interpretability**     | Harder to interpret                                | Easy to interpret with clear probabilistic output |
| **Best For**             | Non-linear, complex decision boundaries            | Text, spam, or sentiment classification           |
| **Sensitivity to Noise** | Robust with proper tuning (C, γ)                   | Sensitive to correlated or noisy features         |

👉 In summary, **SVM** is more powerful for **complex boundaries**, while **Naïve Bayes** is preferred for **speed, simplicity, and text data**.


##
### 20. How does Laplace Smoothing help in Naïve Bayes ?

**Laplace Smoothing** (also called **Additive Smoothing**) prevents zero probabilities in Naïve Bayes when a word or feature doesn’t appear in the training data for a given class.

It modifies the probability formula as:

$$
P(x_i|y) = \frac{\text{count}(x_i, y) + 1}{\text{count}(y) + n}
$$

where:

* $\text{count}(x_i, y)$ = frequency of feature $x_i$ in class $y$
* $\text{count}(y)$ = total count of all features in class $y$
* $n$ = total number of unique features

✅ **Benefits:**

* Avoids **zero probability errors** that would make the entire class probability zero.
* Ensures **better generalization** for unseen data during prediction.
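The effect of the formula is easy to see on a small count table. The sketch below uses hypothetical word counts for a single class; the `alpha` parameter generalizes the "+1" (with `alpha=1.0` matching the formula above) and is an addition for illustration.

```python
def smoothed_prob(feature_count, class_total, n_features, alpha=1.0):
    """Additive smoothing: (count + alpha) / (class_total + alpha * n_features)."""
    return (feature_count + alpha) / (class_total + alpha * n_features)

# Hypothetical word counts for one class over a 4-word vocabulary.
counts = {"free": 3, "win": 2, "hello": 0, "meeting": 0}
total = sum(counts.values())

unsmoothed = {w: c / total for w, c in counts.items()}
smoothed = {w: smoothed_prob(c, total, len(counts)) for w, c in counts.items()}

print(unsmoothed["hello"], round(smoothed["hello"], 4))
print(round(sum(smoothed.values()), 10))
```

Without smoothing, `"hello"` has probability 0 and would zero out the whole product for this class; with smoothing it gets 1/9, and the smoothed probabilities still sum to 1.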

👉 Especially useful in **text classification**, where new or rare words often appear in test data.
