### Question 1: **What is Information Gain, and how is it used in Decision Trees?**


Information Gain measures the reduction in entropy (uncertainty) about the target label after splitting a dataset on a feature.  
Formally, the Information Gain from splitting a dataset S on feature A is:  
\(IG(S, A) = \mathrm{Entropy}(S) - \sum_{v} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)\), where \(S_v\) is the subset of S in which A takes the value v.  
Decision tree algorithms such as ID3 use Information Gain to choose the split at each node: the feature with the largest IG is selected because it reduces class impurity the most and so produces the purest child nodes. C4.5 uses a normalized variant, the gain ratio, to reduce the bias of plain IG toward features with many distinct values.
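
As a rough illustration of the formula above, the following sketch computes Information Gain for categorical features with NumPy. The helper names (`entropy`, `information_gain`) and the toy `outlook`/`windy` data are hypothetical, made up for this example; they are not part of any library.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    # sum p * log2(1/p) is the same quantity as -sum p * log2(p)
    return float(np.sum(p * np.log2(1.0 / p)))

def information_gain(feature, labels):
    """Parent entropy minus the size-weighted entropy of the child subsets."""
    feature = np.asarray(feature)
    labels = np.asarray(labels)
    weighted_child_entropy = 0.0
    for value in np.unique(feature):
        mask = feature == value
        # mask.mean() is |S_v| / |S|
        weighted_child_entropy += mask.mean() * entropy(labels[mask])
    return entropy(labels) - weighted_child_entropy

# Hypothetical toy data: 'outlook' separates the classes, 'windy' does not.
labels = np.array(["yes", "yes", "no", "no"])
outlook = np.array(["sunny", "sunny", "rainy", "rainy"])
windy = np.array(["t", "f", "t", "f"])

ig_outlook = information_gain(outlook, labels)
ig_windy = information_gain(windy, labels)
print("IG(outlook):", ig_outlook)
print("IG(windy):  ", ig_windy)
```

Here `outlook` yields the full 1 bit of gain because both children are pure, while `windy` yields 0 because each child is as mixed as the parent, so a tree builder would split on `outlook`.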


### Question 2: **What is the difference between Gini Impurity and Entropy?**


- **Entropy (Information Gain)**:  
  - Formula: \(H(S) = -\sum_{i} p_i \log_2 p_i\).  
  - Measures the expected information (uncertainty) in the class distribution, in bits: 0 for a pure node and \(\log_2 k\) for k equally likely classes. Used by ID3 and C4.5.  
- **Gini Impurity**:  
  - Formula: \(G(S) = 1 - \sum_{i} p_i^2\).  
  - Measures the probability of misclassifying a randomly chosen sample if it were labeled at random according to the node's class distribution. Used by CART.  

**Practical differences:**  
- Both criteria rank candidate splits similarly in most cases; Gini is slightly cheaper to compute (no logarithm) and is the default in CART-style implementations such as scikit-learn.  
- Entropy is grounded in information theory and uses a different scale: for two classes it peaks at 1 bit, whereas Gini peaks at 0.5.  
- The choice rarely changes the final tree much; benchmark both on your dataset if it matters.
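
To make the two formulas concrete, here is a small sketch that evaluates both measures on a few two-class distributions. The function names `gini` and `entropy` are illustrative helpers written for this example, not library calls.

```python
import numpy as np

def gini(p):
    """Gini impurity: 1 - sum(p_i^2)."""
    p = np.asarray(p, dtype=float)
    return float(1.0 - np.sum(p ** 2))

def entropy(p):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(np.sum(p * np.log2(1.0 / p)))

# From maximally mixed to completely pure.
for dist in ([0.5, 0.5], [0.9, 0.1], [1.0, 0.0]):
    print(dist, "gini =", round(gini(dist), 4), "entropy =", round(entropy(dist), 4))
```

Both measures are largest for the evenly mixed node and drop to zero for the pure node; they differ in scale and curvature rather than in which nodes they consider purer.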


### Question 3: **What is Pre-Pruning in Decision Trees?**


Pre-pruning (early stopping) halts tree growth early to prevent overfitting. Instead of fully growing the tree and pruning later, pre-pruning sets stopping rules while building the tree: e.g., minimum samples required to split, maximum depth, minimum impurity decrease, or minimum samples per leaf. These constraints prevent overly specific branches, improving generalization and reducing tree complexity, at the expense of possibly underfitting if too strict.
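
In scikit-learn, these stopping rules correspond to constructor parameters of `DecisionTreeClassifier`. The sketch below compares an unconstrained tree with a pre-pruned one on Iris; the specific threshold values are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# No stopping rules: the tree grows until its leaves are pure.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruned: growth stops as soon as any constraint would be violated.
pruned_tree = DecisionTreeClassifier(
    max_depth=3,                 # maximum depth
    min_samples_split=10,        # minimum samples required to split a node
    min_samples_leaf=5,          # minimum samples per leaf
    min_impurity_decrease=0.01,  # minimum impurity decrease for a split
    random_state=42,
).fit(X_train, y_train)

for name, tree in [("unconstrained", full_tree), ("pre-pruned", pruned_tree)]:
    print(
        f"{name}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
        f"train acc={tree.score(X_train, y_train):.3f}, "
        f"test acc={tree.score(X_test, y_test):.3f}"
    )
```

Comparing train and test accuracy for the two trees is a quick way to see whether the constraints are trading a little training fit for a simpler model, or are strict enough to underfit.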


### Question 4: **Write a Python program to train a Decision Tree Classifier using Gini Impurity as the criterion and print the feature importances**

In [1]:
# Q4: Decision Tree Classifier using Gini and printing feature importances
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import pandas as pd

iris = load_iris()  # load the dataset once and reuse it
feature_names = iris.feature_names
dfX = pd.DataFrame(iris.data, columns=feature_names)
y = iris.target
X_train, X_test, y_train, y_test = train_test_split(dfX, y, test_size=0.25, random_state=42)

clf = DecisionTreeClassifier(criterion='gini', random_state=42)
clf.fit(X_train, y_train)

importances = clf.feature_importances_
fi = pd.Series(importances, index=feature_names).sort_values(ascending=False)
print("Feature importances (Decision Tree, criterion='gini'):")
print(fi)
print('\nTrain accuracy:', clf.score(X_train, y_train))
print('Test accuracy:', clf.score(X_test, y_test))


Feature importances (Decision Tree, criterion='gini'):
petal length (cm)    0.899746
petal width (cm)     0.082378
sepal width (cm)     0.017876
sepal length (cm)    0.000000
dtype: float64

Train accuracy: 1.0
Test accuracy: 1.0


### Question 5: **What is a Support Vector Machine (SVM)?**


A Support Vector Machine (SVM) is a supervised learning algorithm for classification (and, in its regression variant, for regression). It finds the hyperplane that separates the classes with the largest margin, i.e., the largest distance between the hyperplane and the nearest training points of each class, which are called support vectors. SVMs remain effective in high-dimensional spaces and work best when the classes are separable with a clear margin. For data that is not linearly separable, SVMs use kernel functions to implicitly map the data into a higher-dimensional space where a linear separator may exist.
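
The margin and support vectors can be inspected directly on a tiny example. This is a minimal sketch on four hand-picked 2-D points (hypothetical data); a large `C` is used so that the soft-margin solver behaves approximately like a hard-margin SVM.

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable classes lying on the line x1 = x2.
X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel="linear", C=1000.0).fit(X, y)

w = clf.coef_[0]                             # normal vector of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)       # distance between the two margin lines

print("support vectors:\n", clf.support_vectors_)
print("margin width:", margin_width)
```

Only the two innermost points, (1, 1) and (3, 3), should end up as support vectors, and the margin width should be close to the distance between them; the outer points could be removed without changing the decision boundary.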


### Question 6: **What is the Kernel Trick in SVM?**


The Kernel Trick allows SVMs to compute dot products in a high-dimensional (possibly infinite-dimensional) feature space without explicitly mapping data to that space. Instead, a kernel function K(x, x') computes \(\phi(x)\cdot\phi(x')\) directly. Common kernels: linear, polynomial, RBF (Gaussian), sigmoid. This enables SVMs to learn non-linear decision boundaries efficiently.
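
A quick numerical check can make this tangible. The sketch below uses the homogeneous degree-2 polynomial kernel \(K(x, z) = (x \cdot z)^2\) on 2-D inputs, for which an explicit feature map \(\phi\) is small enough to write out. The function names `phi` and `poly_kernel` are hypothetical helpers for this example.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2-D input: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2, computed in the input space."""
    return float(np.dot(x, z) ** 2)

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = float(np.dot(phi(x), phi(z)))  # map to 3-D first, then take the dot product
implicit = poly_kernel(x, z)              # never leaves the 2-D input space

print("explicit:", explicit)
print("implicit:", implicit)
```

Both routes give the same value, but the kernel only needs a 2-D dot product. For the RBF kernel the corresponding feature space is infinite-dimensional, so the implicit route is the only practical one.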


### Question 7: **Write a Python program to train two SVM classifiers with Linear and RBF kernels on the Wine dataset, then compare their accuracies.**

In [2]:
# Q7: Compare SVM with linear and RBF kernels on Wine dataset
from sklearn.datasets import load_wine
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

data = load_wine()
X = data.data
y = data.target

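# Standardize features for SVM (SVMs are sensitive to feature scale).
# Note: the scaler is fit on the full dataset before splitting, so test-set
# statistics leak into preprocessing; fit on the training split only for a stricter evaluation.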
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25, random_state=42, stratify=y)

svm_linear = SVC(kernel='linear', random_state=42)
svm_rbf = SVC(kernel='rbf', random_state=42)

svm_linear.fit(X_train, y_train)
svm_rbf.fit(X_train, y_train)

pred_lin = svm_linear.predict(X_test)
pred_rbf = svm_rbf.predict(X_test)

acc_lin = accuracy_score(y_test, pred_lin)
acc_rbf = accuracy_score(y_test, pred_rbf)

print(f"Linear SVM accuracy: {acc_lin:.4f}")
print(f"RBF SVM accuracy:    {acc_rbf:.4f}")


Linear SVM accuracy: 0.9556
RBF SVM accuracy:    0.9778
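
The cell above fits `StandardScaler` on the full dataset before splitting, and it compares the kernels on a single train/test split. One possible alternative, sketched below, wraps the scaler and the classifier in a `Pipeline` so that scaling is re-fit on each training fold, and uses 5-fold cross-validation for the comparison. The fold count and the default SVM hyperparameters are assumptions for illustration, and the scores will differ from the single-split numbers above.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

scores = {}
for kernel in ("linear", "rbf"):
    # The scaler is fit only on the training part of each fold.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    scores[kernel] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{kernel:>6} kernel, mean 5-fold accuracy: {scores[kernel]:.4f}")
```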


### Question 8: **What is the Naïve Bayes classifier, and why is it called "Naïve"?**


Naïve Bayes is a family of probabilistic classifiers based on Bayes' theorem. For a feature vector \(x = (x_1, \dots, x_n)\), the posterior is modeled as:  
\(P(y \mid x) \propto P(y) \prod_i P(x_i \mid y)\).  
It is called *naïve* because this factorization assumes the features are conditionally independent given the class label (i.e., within each class, knowing one feature tells you nothing about another). Despite this strong and often unrealistic assumption, Naïve Bayes works well in many practical tasks such as text classification.
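
The arithmetic behind the formula can be traced by hand. In the sketch below, all priors and likelihoods are hypothetical numbers chosen only to illustrate the calculation for a two-class, two-word spam example.

```python
# Hypothetical probabilities, chosen only to illustrate the arithmetic.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {  # P(word present | class)
    "spam": {"free": 0.8, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.5},
}
observed = ["free", "meeting"]  # words present in the message

# Unnormalized posterior: P(y) * product of P(x_i | y).
unnormalized = {}
for label, prior in priors.items():
    score = prior
    for word in observed:
        score *= likelihoods[label][word]
    unnormalized[label] = score

# Normalize so the posteriors sum to 1.
total = sum(unnormalized.values())
posterior = {label: score / total for label, score in unnormalized.items()}

print("unnormalized:", unnormalized)
print("posterior:   ", posterior)
```

The classifier predicts the class with the largest posterior. Because only the ranking matters, real implementations usually skip the normalization and work with sums of log-probabilities to avoid numerical underflow.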


### Question 9: **Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve Bayes, and Bernoulli Naïve Bayes**


- **Gaussian Naïve Bayes (GaussianNB):** Assumes continuous features follow a Gaussian (normal) distribution. Used for continuous-valued inputs (e.g., sensor data).  
- **Multinomial Naïve Bayes:** Designed for count data (e.g., word counts in documents). Models P(x_i|y) with a multinomial distribution; works well with raw term counts or term frequencies.  
- **Bernoulli Naïve Bayes:** Designed for binary features (e.g., word presence/absence). Models each feature as a Bernoulli (0/1) variable.  

**Which to choose:** depends on data type: continuous -> Gaussian, counts -> Multinomial, binary indicators -> Bernoulli.
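
The mapping from data type to variant can be sketched with scikit-learn's three estimators. The data below is synthetic and generated only for this illustration (Gaussian features, Poisson counts, and binary indicators derived by thresholding the counts); the distribution parameters and the threshold are arbitrary assumptions.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)  # 50 samples per class

# Continuous features: class 1 is shifted by 2 in every dimension.
X_continuous = rng.normal(loc=2.0 * y[:, None], scale=1.0, size=(100, 3))

# Count features: each class has a different high-rate column.
rates = np.where(y[:, None] == 1, [5.0, 1.0, 1.0], [1.0, 1.0, 5.0])
X_counts = rng.poisson(lam=rates)

# Binary indicators: whether each count exceeds 2.
X_binary = (X_counts > 2).astype(int)

models = {
    "GaussianNB on continuous features": (GaussianNB(), X_continuous),
    "MultinomialNB on counts": (MultinomialNB(), X_counts),
    "BernoulliNB on binary indicators": (BernoulliNB(), X_binary),
}

accuracies = {}
for name, (model, X) in models.items():
    accuracies[name] = model.fit(X, y).score(X, y)
    print(f"{name}: training accuracy = {accuracies[name]:.3f}")
```

Note that `MultinomialNB` expects non-negative feature values, so it cannot be applied to standardized features, which is one practical reason to match the variant to the data type.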


### Question 10: **Breast Cancer Dataset - Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer dataset and evaluate accuracy.**

In [3]:
# Q10: Train Gaussian Naive Bayes on Breast Cancer dataset and evaluate
from sklearn.datasets import load_breast_cancer
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report

data = load_breast_cancer()
X = data.data
y = data.target

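# Standardize features. GaussianNB fits a separate mean and variance per feature
# and class, so scaling has little effect on its predictions; it is kept here for
# consistency with the SVM example. Note: the scaler is fit before the split,
# which leaks test-set statistics into preprocessing.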
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25, random_state=42, stratify=y)

gnb = GaussianNB()
gnb.fit(X_train, y_train)

pred = gnb.predict(X_test)
acc = accuracy_score(y_test, pred)
print(f"GaussianNB accuracy on Breast Cancer test set: {acc:.4f}\n")
print('Classification report:\n', classification_report(y_test, pred, target_names=data.target_names))


GaussianNB accuracy on Breast Cancer test set: 0.9371

Classification report:
               precision    recall  f1-score   support

   malignant       0.94      0.89      0.91        53
      benign       0.94      0.97      0.95        90

    accuracy                           0.94       143
   macro avg       0.94      0.93      0.93       143
weighted avg       0.94      0.94      0.94       143

