1.  What is Information Gain, and how is it used in Decision Trees?

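**ANSWER:** Information Gain measures how much a feature reduces uncertainty about the target variable. When a Decision Tree is built, the Information Gain of each candidate feature is computed at a node, and the feature with the highest gain is chosen for the split.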

It is based on Entropy, which measures impurity (randomness).

- Entropy formula:

$$Entropy(S) = - \sum p_i \log_2(p_i)$$

Where $p_i$ is the proportion of class $i$ in dataset $S$.

- Information Gain formula:

$$IG(S, A) = Entropy(S) - \sum_{v} \frac{|S_v|}{|S|} \, Entropy(S_v)$$

Where:

- $S$ = parent dataset

- $A$ = feature

- $S_v$ = subset of $S$ in which feature $A$ takes the value $v$
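
As a rough illustration, Information Gain can be computed as the parent entropy minus the size-weighted average entropy of the subsets $S_v$. The sketch below uses only the standard library; the helper names `entropy` and `information_gain` and the toy labels are assumptions made for this example, not part of any library.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels).values()
    return -sum((n / total) * math.log2(n / total) for n in counts)


def information_gain(parent, subsets):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    total = len(parent)
    weighted = sum(len(s) / total * entropy(s) for s in subsets)
    return entropy(parent) - weighted


# Hypothetical toy data: 4 "yes" and 4 "no" labels, split by an invented feature.
parent = ["yes"] * 4 + ["no"] * 4
left = ["yes"] * 3 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 3

print("Parent entropy:", entropy(parent))
print("Information gain of the split:", information_gain(parent, [left, right]))
```

An evenly mixed parent has an entropy of 1 bit. A split that leaves both children partly mixed, as here, recovers only a fraction of that, while a split into two pure children would recover all of it.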

2. What is the difference between Gini Impurity and Entropy?

**ANSWER:** Gini Impurity and Entropy are both used in Decision Trees to measure how mixed the classes in a node are. They differ as follows:

- Gini Impurity vs Entropy:

1. Definition:

Gini Impurity

- Measures the probability of incorrectly classifying a randomly chosen element if it were labeled at random according to the class distribution in the node.

- Formula:

$$Gini = 1 - \sum p_i^2$$

- Where $p_i$ = probability of class $i$.

Entropy

- Measures the amount of randomness or disorder in the dataset.

- Formula:

$$Entropy = - \sum p_i \log_2(p_i)$$

2. Range:

| Metric      | Minimum | Maximum             | Meaning of Maximum       |
| ----------- | ------- | ------------------- | ------------------------ |
| **Gini**    | 0       | 0.5 (for 2 classes) | Classes are evenly split |
| **Entropy** | 0       | 1 (for 2 classes)   | Classes are evenly split |

3. Interpretation:

Gini Impurity

i. Faster to compute.

ii. More sensitive to node purity.

iii. Tends to favor splits that isolate the most frequent class in its own partition.

Entropy

i. More computationally expensive due to logarithm.

ii. Values increase more smoothly.

iii. Used in the ID3 and C4.5 algorithms.
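
To make the two-class ranges concrete, the following sketch evaluates both measures for a few class proportions. The function names `gini_impurity` and `shannon_entropy` are chosen for this example only.

```python
import math


def gini_impurity(probs):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in probs)


def shannon_entropy(probs):
    """Entropy in bits; classes with zero probability contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)


# Two-class nodes, from pure (p = 0.0) to evenly split (p = 0.5).
for p in (0.0, 0.1, 0.3, 0.5):
    probs = (p, 1.0 - p)
    print(f"p={p:.1f}  gini={gini_impurity(probs):.3f}  "
          f"entropy={shannon_entropy(probs):.3f}")
```

Both measures are 0 for a pure node and peak at the even split, where Gini reaches 0.5 and Entropy reaches 1.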

3. What is Pre-Pruning in Decision Trees?

**ANSWER:** Pre-pruning (also called early stopping) limits the growth of a Decision Tree during training by applying stopping conditions before a split is made.

It avoids splits that do not significantly improve model performance, which keeps the tree from becoming too large and overfitting the data.

A fully grown tree:

- becomes too complex

- learns noise from training data

- causes overfitting

- reduces generalization on test data

Pre-pruning keeps the tree simple and effective.

Common Pre-Pruning Techniques:

You can stop the tree from splitting based on:

1. Maximum Depth:

Stop splitting once the tree reaches a certain depth.

2. Minimum Samples Split:

Do not split a node if it contains fewer samples than required.

3. Minimum Samples per Leaf:

Each leaf must contain at least a set number of samples.

4. Minimum Impurity Decrease:

Only split if impurity decreases by a sufficient amount.
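
In scikit-learn, these four conditions correspond to the `max_depth`, `min_samples_split`, `min_samples_leaf`, and `min_impurity_decrease` parameters of `DecisionTreeClassifier`. The sketch below compares an unrestricted tree with a pre-pruned one on the Iris dataset. The threshold values are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Unrestricted tree: grows until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruned tree: the thresholds below are illustrative assumptions.
pruned_tree = DecisionTreeClassifier(
    max_depth=3,                 # 1. maximum depth
    min_samples_split=10,        # 2. minimum samples needed to split a node
    min_samples_leaf=5,          # 3. minimum samples in each leaf
    min_impurity_decrease=0.01,  # 4. minimum impurity decrease per split
    random_state=42,
).fit(X_train, y_train)

for name, tree in [("full", full_tree), ("pre-pruned", pruned_tree)]:
    print(f"{name}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
          f"test accuracy={tree.score(X_test, y_test):.3f}")
```

In practice these thresholds are usually tuned with cross-validation rather than set by hand.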

4. Write a Python program to train a Decision Tree Classifier using Gini Impurity as the criterion and print the feature importances (practical).

**ANSWER:**

# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load dataset
data = load_iris()
X = data.data
y = data.target
feature_names = data.feature_names

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Create Decision Tree Classifier using Gini impurity
clf = DecisionTreeClassifier(criterion='gini', random_state=42)

# Train the model
clf.fit(X_train, y_train)

# Print accuracy
accuracy = clf.score(X_test, y_test)
print("Model Accuracy:", accuracy)

# Print feature importances
print("\nFeature Importances:")
for name, importance in zip(feature_names, clf.feature_importances_):
    print(f"{name}: {importance:.4f}")


Model Accuracy: 1.0

Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0191
petal length (cm): 0.8933
petal width (cm): 0.0876


5. What is a Support Vector Machine (SVM)?

**ANSWER:** A Support Vector Machine finds the best possible boundary (called a hyperplane) that separates data into different classes with the maximum margin.

1. Hyperplane:

A line (in 2D), a plane (in 3D), or the equivalent flat boundary in higher dimensions that separates classes.

2. Margin:

The distance between the hyperplane and the nearest data points from each class.
SVM chooses the hyperplane with the maximum margin, which tends to generalize better to unseen data.

3. Support Vectors:

The important data points that lie closest to the hyperplane.
They “support” or define the decision boundary.

4. Kernel Trick:

Allows SVM to classify data that is not linearly separable by implicitly mapping it into a higher-dimensional space.
Common kernels:

- Linear

- Polynomial

- RBF (Radial Basis Function)

- Sigmoid
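
The hyperplane, margin, and support vectors can be inspected on a fitted model. The sketch below fits scikit-learn's `SVC` with a linear kernel on six made-up 2D points. The data and the large `C` value (used here to approximate a hard margin) are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy points in 2D.
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin SVM on separable data.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]       # normal vector of the hyperplane w . x + b = 0
b = clf.intercept_[0]  # offset of the hyperplane

# Distance from the hyperplane to the nearest point of either class.
margin = 1.0 / np.linalg.norm(w)

print("Support vectors:")
print(clf.support_vectors_)
print("w =", w, " b =", b)
print("Margin (hyperplane to nearest point):", margin)
```

Only the points closest to the boundary appear in `support_vectors_`. Moving any of the other points slightly would leave the fitted hyperplane unchanged.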

6. What is the Kernel Trick in SVM?

**ANSWER:** The Kernel Trick is a mathematical technique that lets a Support Vector Machine (SVM) separate data that is not linearly separable in its original space by implicitly mapping it to a higher-dimensional feature space, without ever computing that transformation.

Instead of transforming the data explicitly (which is expensive or sometimes impossible), the kernel trick uses a kernel function to compute the dot product in the higher-dimensional space directly from the original inputs.

This lets SVM draw non-linear decision boundaries efficiently.

- Why the Kernel Trick is needed:

Some datasets cannot be separated by a straight line (linear boundary).
Example: XOR pattern, spirals, curved shapes, etc.

Mapping to a higher dimension can make data linearly separable.

But explicitly computing this mapping can be:

I. computationally costly

II. impossible in practice, because the feature space is sometimes infinite-dimensional (as with the RBF kernel)

III. mathematically complex

A kernel function $K(x,x')$ satisfies:

$$K(x, x') = \phi(x) \cdot \phi(x')$$

Where:

- $\phi(x)$ = implicit transformation to a higher dimension

- You never compute $\phi(x)$

- You only compute $K(x,x')$
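
The identity $K(x, x') = \phi(x) \cdot \phi(x')$ can be checked numerically for a simple case. The sketch below assumes the degree-2 polynomial kernel $K(x, x') = (x \cdot x')^2$ on 2D inputs, whose explicit feature map is $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$. The function names and the sample vectors are chosen for this example.

```python
import numpy as np


def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel in 2D."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])


def poly_kernel(x, z):
    """Kernel value computed in the original 2D space: (x . z)^2."""
    return float(np.dot(x, z)) ** 2


x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = float(np.dot(phi(x), phi(z)))  # dot product in the 3D feature space
implicit = poly_kernel(x, z)              # same quantity, never leaving 2D

print("Explicit mapping:", explicit)
print("Kernel function: ", implicit)
print("Equal up to rounding:", np.isclose(explicit, implicit))
```

Both routes give the same value (about 121 for these vectors), but the kernel route never builds the 3D vectors. For kernels such as RBF, the kernel function is the only practical route.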

7. Write a Python program to train two SVM classifiers with Linear and RBF
kernels on the Wine dataset, then compare their accuracies.

**ANSWER:**

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load the Wine dataset
wine = datasets.load_wine()
X = wine.data
y = wine.target

# Standardize features (important for SVM)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42
)

# Train SVM with Linear kernel
svm_linear = SVC(kernel='linear')
svm_linear.fit(X_train, y_train)
linear_pred = svm_linear.predict(X_test)
linear_acc = accuracy_score(y_test, linear_pred)

# Train SVM with RBF kernel
svm_rbf = SVC(kernel='rbf')
svm_rbf.fit(X_train, y_train)
rbf_pred = svm_rbf.predict(X_test)
rbf_acc = accuracy_score(y_test, rbf_pred)

# Print accuracy comparison
print("Accuracy of SVM with Linear Kernel:", linear_acc)
print("Accuracy of SVM with RBF Kernel:", rbf_acc)

if rbf_acc > linear_acc:
    print("\nRBF kernel performed better.")
elif rbf_acc < linear_acc:
    print("\nLinear kernel performed better.")
else:
    print("\nBoth kernels performed equally well.")


Accuracy of SVM with Linear Kernel: 0.9814814814814815
Accuracy of SVM with RBF Kernel: 0.9814814814814815

Both kernels performed equally well.


8. What is the Naïve Bayes classifier, and why is it called "Naïve"?

**ANSWER:** The Naïve Bayes classifier is a simple yet powerful probabilistic machine learning algorithm based on Bayes’ Theorem. It is often used for classification tasks such as spam detection, sentiment analysis, and document classification.

The Naïve Bayes classifier predicts the class of a data point using:

$$P(\text{Class} \mid \text{Features}) =
\frac{P(\text{Features} \mid \text{Class}) \cdot P(\text{Class})}
{P(\text{Features})}$$

It calculates the probability of each class given the input features and chooses the class with the highest probability.

In simple words:

- It learns how likely each class is (prior probability)

- It learns how each feature behaves within each class (likelihood)

- It combines these probabilities using Bayes’ theorem and predicts the most probable class

It is called Naïve because it makes a very strong and unrealistic assumption:

- All features are independent of each other given the class.

This means Naïve Bayes assumes:

i. Every feature contributes independently to the outcome

ii. Within a class, the value of one feature says nothing about the value of any other feature

For example, when classifying emails:

i. The words “free,” “money,” and “win” are clearly related

ii. But Naïve Bayes treats them as independent features

This assumption is almost never true in real-world data, hence the name: “Naïve” refers to the overly simple assumption of feature independence.

Yet, despite this unrealistic assumption:

- Naïve Bayes works very well in practice

- It is fast, efficient, and surprisingly accurate for many applications

- It performs especially well for text classification and high-dimensional data.
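
Under the independence assumption, $P(\text{Features} \mid \text{Class})$ factorizes into a product of per-feature likelihoods, so the posterior can be computed by hand. The sketch below does this for a two-class spam example. Every probability in it is invented for illustration, and for simplicity it scores only the words that are present in the message.

```python
# All probabilities below are made up for illustration.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.30, "money": 0.20, "win": 0.10},
    "ham": {"free": 0.05, "money": 0.02, "win": 0.01},
}


def posterior(words):
    """Posterior over classes, treating words as independent given the class."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for word in words:
            score *= likelihoods[label][word]  # naive product of likelihoods
        scores[label] = score
    evidence = sum(scores.values())  # P(Features), the normalizing constant
    return {label: score / evidence for label, score in scores.items()}


result = posterior(["free", "money"])
print(result)
print("Predicted class:", max(result, key=result.get))
```

The prior favors "ham", but the two word likelihoods are much higher under "spam", so the posterior shifts strongly toward "spam".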

9. Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes.

**ANSWER:**

| **Aspect**               | **Gaussian Naïve Bayes**                         | **Multinomial Naïve Bayes**             | **Bernoulli Naïve Bayes**                   |
| ------------------------ | ------------------------------------------------ | --------------------------------------- | ------------------------------------------- |
| **Type of Data Used**    | Continuous numerical data                        | Discrete counts                         | Binary data (0/1)                           |
| **Feature Examples**     | Height, weight, temperature                      | Word counts, term frequencies           | Word present/not present                    |
| **Assumed Distribution** | Gaussian (Normal distribution)                   | Multinomial distribution                | Bernoulli (binary) distribution             |
| **Input Value Type**     | Real numbers (can be negative)                   | Non-negative integers                   | 0 or 1                                      |
| **Common Use Cases**     | Sensor data, continuous measurements, image data | Text classification using word counts   | Text classification using binary features   |
| **NLP Usage**            | Less common                                      | Very common                             | Common for binary text models               |
| **When to Use**          | When features are continuous                     | When features represent counts          | When features represent presence/absence    |
| **Output**               | Probability based on mean and variance           | Probability based on counts of features | Probability based on binary feature matches |
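
The three variants are available in scikit-learn as `GaussianNB`, `MultinomialNB`, and `BernoulliNB`. The sketch below fits each one on a tiny made-up dataset of the matching feature type. The values are assumptions chosen only to show which input each variant expects.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])

# Continuous measurements (hypothetical heights in cm and weights in kg).
X_continuous = np.array([[150.0, 50.0], [155.0, 54.0],
                         [180.0, 82.0], [185.0, 90.0]])
# Word counts per document (hypothetical two-word vocabulary).
X_counts = np.array([[3, 0], [2, 0], [0, 4], [0, 3]])
# Presence/absence of the same two words.
X_binary = (X_counts > 0).astype(int)

models = {
    "GaussianNB": (GaussianNB(), X_continuous),
    "MultinomialNB": (MultinomialNB(), X_counts),
    "BernoulliNB": (BernoulliNB(), X_binary),
}

for name, (model, X) in models.items():
    model.fit(X, y)
    print(f"{name}: predictions on training data = {model.predict(X)}")
```

On real data, the choice usually follows from the feature type: continuous values, counts, or binary indicators.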


10. Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer dataset and evaluate accuracy.

**ANSWER:**

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split into train & test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create and train Gaussian Naive Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Predict
y_pred = gnb.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)

print("Gaussian Naive Bayes Accuracy:", accuracy)


Gaussian Naive Bayes Accuracy: 0.9736842105263158
