Q1: What is Information Gain, and how is it used in Decision Trees?
- Information Gain measures the reduction in entropy (uncertainty about the class label) obtained by splitting the dataset on a given attribute. It is computed as the entropy of the parent node minus the size-weighted average entropy of the child nodes. In decision trees, the attribute with the highest information gain is chosen for the split at each node, because it produces the purest child nodes.
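
The calculation can be sketched in a few lines of Python. This is a minimal illustration on made-up toy data, not how scikit-learn implements it internally; the helper names `entropy` and `information_gain` and the arrays `perfect` and `useless` are hypothetical.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    # -sum(p * log2(p)), written as sum(p * log2(1/p))
    return float(np.sum(p * np.log2(1.0 / p)))

def information_gain(labels, feature_values):
    """Parent entropy minus the size-weighted entropy of the child nodes
    obtained by grouping samples on each distinct feature value."""
    labels = np.asarray(labels)
    feature_values = np.asarray(feature_values)
    weighted_child_entropy = 0.0
    for value in np.unique(feature_values):
        mask = feature_values == value
        weighted_child_entropy += mask.mean() * entropy(labels[mask])
    return entropy(labels) - weighted_child_entropy

# Hypothetical toy data: 8 samples with a balanced binary target
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])
perfect = np.array(["a", "a", "a", "a", "b", "b", "b", "b"])  # separates the classes fully
useless = np.array(["a", "b", "a", "b", "a", "b", "a", "b"])  # each group stays 50/50

gain_perfect = information_gain(y, perfect)
gain_useless = information_gain(y, useless)
print("Gain (perfect split):", gain_perfect)
print("Gain (useless split):", gain_useless)
```

A tree builder would prefer `perfect` over `useless` here, since the first attribute removes all uncertainty about the label and the second removes none.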

Q2: What is the difference between Gini Impurity and Entropy?
-  Gini Impurity is the probability of misclassifying a randomly chosen element if it is labeled at random according to the class distribution in the node. For class proportions p_i it equals 1 - sum(p_i^2).

     Entropy measures the average amount of information (uncertainty) in the node's class distribution. It equals -sum(p_i * log2(p_i)).

     Both are zero for a pure node and largest when the classes are evenly mixed, and in practice they often produce similar trees. Gini is slightly cheaper to compute because it avoids the logarithm, so it is a common choice when speed matters. Entropy is the measure behind Information Gain and has a direct information-theoretic interpretation.
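
To make the comparison concrete, the sketch below evaluates both measures on a few hand-picked binary class distributions. The function names `gini` and `entropy` are illustrative helpers, not library calls.

```python
import numpy as np

def gini(p):
    """Gini impurity of a class-probability vector: 1 - sum(p_i^2)."""
    p = np.asarray(p, dtype=float)
    return float(1.0 - np.sum(p ** 2))

def entropy(p):
    """Shannon entropy in bits: -sum(p_i * log2(p_i))."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # treat 0 * log2(0) as 0
    # written as sum(p * log2(1/p)) to avoid printing a negative zero
    return float(np.sum(p * np.log2(1.0 / p)))

# Pure node, mostly pure node, evenly mixed node
for dist in ([1.0, 0.0], [0.9, 0.1], [0.5, 0.5]):
    print(dist, "gini =", round(gini(dist), 3), "entropy =", round(entropy(dist), 3))
```

For two classes, Gini peaks at 0.5 and entropy at 1 bit, both at the 50/50 mix, which is why the two criteria tend to rank candidate splits similarly.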

Q3: What is Pre-Pruning in Decision Trees?
-   Pre-pruning is a technique that stops the tree's growth early by setting conditions (like maximum depth, minimum samples per leaf, or minimum samples per split) to prevent overfitting. It keeps the model simpler and more generalizable by avoiding the creation of very specific rules from the training data.
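
In scikit-learn these stopping conditions are constructor parameters of `DecisionTreeClassifier`. The sketch below compares an unconstrained tree with a pre-pruned one on the Iris data. The specific limits (`max_depth=2`, `min_samples_leaf=5`, `min_samples_split=10`) and the `random_state` values are arbitrary choices for illustration, not tuned settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Unconstrained tree: keeps splitting until the leaves are pure
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruned tree: growth stops as soon as any limit is reached
pruned_tree = DecisionTreeClassifier(
    max_depth=2,           # at most two levels of splits
    min_samples_leaf=5,    # every leaf must keep at least 5 samples
    min_samples_split=10,  # a node needs at least 10 samples to be split
    random_state=42,
).fit(X_train, y_train)

for name, tree in [("full", full_tree), ("pre-pruned", pruned_tree)]:
    print(name, "depth:", tree.get_depth(), "leaves:", tree.get_n_leaves(),
          "test accuracy:", tree.score(X_test, y_test))
```

In practice the limits would usually be chosen by cross-validation rather than fixed by hand.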

Q4: Write a Python program to train a Decision Tree Classifier using Gini
Impurity as the criterion and print the feature importances (practical).


In [9]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Load data
iris = load_iris()
X, y = iris.data, iris.target

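# Train Decision Tree with Gini
# (no random_state is set, so ties between equally good splits may be
# broken differently from run to run and the importances can vary slightly)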
clf = DecisionTreeClassifier(criterion='gini')
clf.fit(X, y)

# Print feature importances
print("Feature importances:", clf.feature_importances_)


Feature importances: [0.02666667 0.         0.55072262 0.42261071]
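
The four values follow the order of `iris.feature_names` (sepal length, sepal width, petal length, petal width), so the two petal measurements account for almost all of the importance in the run above. A possible way to make that readable is to pair each value with its name and sort. In this sketch `random_state=0` is an added assumption for repeatability, so the numbers it prints need not match the output above exactly.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
# random_state=0 is an assumption added here, not part of the original cell
clf = DecisionTreeClassifier(criterion="gini", random_state=0)
clf.fit(iris.data, iris.target)

# Pair each importance with its feature name, largest first
ranked = sorted(zip(iris.feature_names, clf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name:<20} {importance:.3f}")
```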


Q5: What is a Support Vector Machine (SVM)?
-   A Support Vector Machine is a supervised learning algorithm used for classification and regression. For classification, it finds the hyperplane that best separates the classes by maximizing the margin, that is, the distance between the hyperplane and the closest training points of each class (the support vectors).
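
The idea can be seen on a tiny example. The sketch below fits a linear `SVC` to six hand-made 2-D points (hypothetical data) and prints which of them end up as support vectors. Using a large `C` to approximate a hard margin is an assumption of this illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two linearly separable clusters in 2-D
X = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
              [4.0, 4.0], [4.0, 5.0], [5.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard-margin SVM on separable data
clf = SVC(kernel="linear", C=1e3).fit(X, y)

w = clf.coef_[0]        # normal vector of the separating hyperplane
b = clf.intercept_[0]   # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)  # distance between the two margin boundaries

print("Support vectors:\n", clf.support_vectors_)
print("Hyperplane: w =", w, "b =", b)
print("Margin width:", margin_width)
```

Only the points closest to the boundary are reported in `support_vectors_`. The remaining points could be moved or removed without changing the fitted hyperplane, as long as they stay outside the margin.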

Q6: What is the Kernel Trick in SVM?
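-   The Kernel Trick allows an SVM to learn nonlinear decision boundaries by implicitly mapping input features into a higher-dimensional space. A kernel function computes the inner product between two points in that space directly from the original features, so the mapped coordinates never have to be computed. This enables the classifier to solve problems that are not linearly separable in the original feature space.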
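
A small numeric check makes the "implicit mapping" concrete. For the homogeneous degree-2 polynomial kernel k(a, b) = (a . b)^2 on 2-D inputs, one explicit feature map is phi(x) = [x1^2, sqrt(2)*x1*x2, x2^2]. The sketch below, with hypothetical helper names `phi` and `poly_kernel`, shows that both routes give the same inner product.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector."""
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(a, b):
    """Homogeneous degree-2 polynomial kernel: (a . b)^2."""
    return np.dot(a, b) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(z))  # inner product in the 3-D mapped space
implicit = poly_kernel(x, z)       # same value, computed from the 2-D vectors

print("Explicit mapping:", explicit)
print("Kernel function: ", implicit)
```

For the RBF kernel the corresponding feature space is infinite-dimensional, so the explicit route is not available at all and the kernel function is the only practical way to work in that space.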


Q7: Write a Python program to train two SVM classifiers with Linear and RBF
kernels on the Wine dataset, then compare their accuracies.


In [7]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load data
wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(wine.data, wine.target, random_state=42)

# Linear kernel SVM
svc_linear = SVC(kernel='linear')
svc_linear.fit(X_train, y_train)
linear_acc = accuracy_score(y_test, svc_linear.predict(X_test))

# RBF kernel SVM
# (features are left unscaled here; the RBF kernel is sensitive to feature scale)
svc_rbf = SVC(kernel='rbf')
svc_rbf.fit(X_train, y_train)
rbf_acc = accuracy_score(y_test, svc_rbf.predict(X_test))

print("Linear Kernel Accuracy:", linear_acc)
print("RBF Kernel Accuracy:", rbf_acc)


Linear Kernel Accuracy: 0.9777777777777777
RBF Kernel Accuracy: 0.7111111111111111
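
The RBF kernel depends on distances between samples, and the Wine features live on very different scales (proline is in the hundreds to thousands while several others are near 1). A likely reason for the lower RBF accuracy above is therefore the missing feature scaling rather than the kernel itself. The sketch below repeats the comparison with a `StandardScaler` in a pipeline. The scaling step is an addition to the original program, and no particular accuracy is claimed here.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, random_state=42
)

scaled_acc = {}
for kernel in ("linear", "rbf"):
    # The scaler is fitted on the training split only, then applied to the test split
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    model.fit(X_train, y_train)
    scaled_acc[kernel] = model.score(X_test, y_test)
    print(f"{kernel} kernel accuracy (scaled features):", scaled_acc[kernel])
```

Putting the scaler inside the pipeline keeps test-set statistics out of the fitting step, which makes the comparison between the two kernels fairer.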


Q8: What is the Naïve Bayes classifier, and why is it called "Naïve"?
-   Naïve Bayes is a probabilistic classifier based on Bayes’ Theorem. It is called “naïve” because it assumes that all features are conditionally independent of each other given the class, an assumption that is rarely true in real data.
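
The independence assumption is what lets the class score be written as the prior times a product of per-feature likelihoods. The sketch below works through that product by hand for a toy spam filter. All probabilities, the class names, and the function name `posterior` are invented purely for illustration.

```python
# Hypothetical numbers, chosen only for illustration
prior = {"spam": 0.4, "ham": 0.6}
# P(word present | class) for two words
likelihood = {
    "spam": {"free": 0.8, "meeting": 0.1},
    "ham":  {"free": 0.1, "meeting": 0.5},
}

def posterior(words_present):
    """Posterior over classes for a message containing exactly the given words.
    Per-feature likelihoods are multiplied as if the features were
    independent given the class (the "naive" assumption)."""
    scores = {}
    for label in prior:
        score = prior[label]
        for word, p in likelihood[label].items():
            score *= p if word in words_present else (1.0 - p)
        scores[label] = score
    total = sum(scores.values())  # normalize so the posteriors sum to 1
    return {label: score / total for label, score in scores.items()}

result = posterior({"free"})
print(result)
```

Here the message contains "free" but not "meeting", so the spam score is 0.4 * 0.8 * 0.9 and the ham score is 0.6 * 0.1 * 0.5 before normalization.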

Q9: Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve
Bayes, and Bernoulli Naïve Bayes.
-   Gaussian Naïve Bayes: Used for continuous data and assumes features follow a normal distribution.

-   Multinomial Naïve Bayes: Works best with discrete counts, e.g., word counts in text classification.

-   Bernoulli Naïve Bayes: Designed for binary/boolean features (present/absent).
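
As a rough illustration of matching each variant to its data type, the sketch below generates three synthetic datasets (continuous, counts, binary) and fits the corresponding scikit-learn class to each. The data-generating choices are arbitrary assumptions, and the reported numbers are training accuracies only, not a benchmark.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.default_rng(0)
y = np.array([0] * 50 + [1] * 50)

# Continuous features: class 1 is shifted by +2 in both dimensions
X_cont = rng.normal(loc=0.0, scale=1.0, size=(100, 2)) + 2.0 * y[:, None]
# Count features: class 1 uses "word" 0 more often, class 0 uses "word" 1
rates = np.where(y[:, None] == 1, [5.0, 1.0], [1.0, 5.0])
X_counts = rng.poisson(rates)
# Binary features: whether each "word" appears more than twice
X_binary = (X_counts > 2).astype(int)

models = {
    "GaussianNB on continuous data": (GaussianNB(), X_cont),
    "MultinomialNB on counts": (MultinomialNB(), X_counts),
    "BernoulliNB on binary data": (BernoulliNB(), X_binary),
}
train_acc = {}
for name, (model, X) in models.items():
    train_acc[name] = model.fit(X, y).score(X, y)
    print(name, "training accuracy:", train_acc[name])
```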


Q10: Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer
dataset and evaluate accuracy.
Hint: Use GaussianNB() from sklearn.naive_bayes and the Breast Cancer dataset from
sklearn.datasets.


In [6]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Load data
data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=42)

# Train Gaussian Naive Bayes
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


Accuracy: 0.958041958041958
