# Supervised Classification: Decision Trees, SVM, and Naive Bayes

1. What is Information Gain, and how is it used in Decision Trees?
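   - Information Gain (IG) measures how much a split on a given feature reduces entropy, that is, the uncertainty about the class label. A Decision Tree uses it to decide which feature to split on at each node. For a dataset S and a feature A, IG(S, A) = Entropy(S) - Σ_v (|S_v| / |S|) · Entropy(S_v), where S_v is the subset of S in which A takes the value v.

      It is used in Decision Trees as follows:

        1. Compute the entropy of the whole dataset.

        2. For each feature:

             * Split the data based on that feature.

             * Compute the entropy of each subset.

             * Compute the Information Gain as the parent entropy minus the size-weighted average of the subset entropies.

        3. Select the feature with the maximum Information Gain as the root node.

        4. Repeat this process for each branch until:

             * pure nodes are obtained, or

             * no features are left.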
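
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration for categorical features only; the helper names (`entropy`, `information_gain`) and the toy weather data are assumptions made for the example, not part of any library.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Parent entropy minus the size-weighted entropy of the subsets
    obtained by splitting on each distinct value of the feature."""
    total = len(labels)
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    weighted = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted


# Hypothetical toy dataset: two candidate features and one target.
outlook = ["sunny", "sunny", "rain", "rain", "overcast", "overcast"]
windy = [True, False, True, False, True, False]
play = ["no", "no", "yes", "yes", "yes", "yes"]

ig_outlook = information_gain(outlook, play)
ig_windy = information_gain(windy, play)

print(f"IG(outlook) = {ig_outlook:.4f}")
print(f"IG(windy)   = {ig_windy:.4f}")

# The feature with the larger gain would be chosen for the split.
best = "outlook" if ig_outlook > ig_windy else "windy"
print("Best feature to split on:", best)
```

In this toy data every `outlook` subset is pure, so splitting on it removes all uncertainty, while `windy` leaves the class mix unchanged and gains nothing.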

2. What is the difference between Gini Impurity and Entropy?
   - Gini Impurity and Entropy are both measures used in decision tree algorithms to quantify the impurity of the class labels in a node. For class proportions p_i, Gini Impurity is 1 - Σ p_i² and Entropy is -Σ p_i log2(p_i). Both are zero for a pure node and largest when the classes are evenly mixed. The goal is to find splits that reduce impurity the most, leading to more homogeneous child nodes.

        DIFFERENCES:
        1. Strengths: Entropy has a direct information-theoretic interpretation (bits of uncertainty) and is the basis of Information Gain. It is often said to produce slightly more balanced trees. Gini Impurity avoids the logarithm, so it is slightly cheaper to compute, and it often performs similarly to Entropy in practice.

        2. Weaknesses: Entropy is slightly more expensive to compute because of the logarithm. Gini Impurity tends to isolate the most frequent class in its own branch, so it may not always yield the most balanced tree. Both criteria tend to favor features with many unique values, because such features can split the data into many small, nearly pure subsets.

        3. Use cases:
              * If computational speed is a primary concern, Gini Impurity might be slightly preferred due to its simpler calculation.
              * If you are building a CART model, Gini Impurity is the default choice. It is also the default `criterion` of scikit-learn's `DecisionTreeClassifier`.
              * In most practical scenarios, the choice between Gini Impurity and Entropy makes little difference to the final performance of the decision tree. Both measures aim to achieve the same goal: maximizing homogeneity in child nodes.
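
To make the comparison concrete, the sketch below evaluates both measures on a few two-class distributions. The function names `gini_impurity` and `entropy_bits` are illustrative choices for this example only.

```python
import math


def gini_impurity(probs):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in probs)


def entropy_bits(probs):
    """Shannon entropy in bits; classes with zero probability contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Compare the two measures as a binary node goes from evenly mixed to pure.
for p in (0.5, 0.7, 0.9, 1.0):
    dist = (p, 1.0 - p)
    print(f"p={p:.1f}  gini={gini_impurity(dist):.4f}  entropy={entropy_bits(dist):.4f}")
```

Both measures peak at the evenly mixed node and fall to zero at the pure node. They differ mainly in scale: the maximum is 0.5 for Gini and 1 bit for Entropy in the two-class case. This is one reason they usually rank candidate splits similarly.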

3. What is Pre-Pruning in Decision Trees?
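   - Pre-pruning (also called early stopping) is a technique used during the construction of a decision tree to prevent overfitting. Instead of growing the full tree and then pruning it back, as post-pruning does, pre-pruning stops growth early: a node is not split further once a stopping condition is met, such as a maximum depth, a minimum number of samples required to split a node or to form a leaf, or a minimum impurity decrease.

      In practice, both pre-pruning and post-pruning are valuable techniques for controlling the complexity of decision trees and improving their generalization performance. The main risk of pre-pruning is stopping too early and underfitting.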
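
In scikit-learn, pre-pruning is expressed through constructor parameters of `DecisionTreeClassifier` such as `max_depth` and `min_samples_leaf`. The sketch below compares an unconstrained tree with a pre-pruned one on Iris; the specific limits (`max_depth=2`, `min_samples_leaf=5`) are arbitrary values chosen for illustration, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Unconstrained tree: grows until every leaf is pure.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruned tree: growth stops at depth 2 or when a leaf would
# contain fewer than 5 training samples.
pruned_tree = DecisionTreeClassifier(
    max_depth=2, min_samples_leaf=5, random_state=42
).fit(X_train, y_train)

for name, tree in (("full", full_tree), ("pre-pruned", pruned_tree)):
    print(
        f"{name:>10}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
        f"test accuracy={tree.score(X_test, y_test):.4f}"
    )
```

Suitable limits are usually chosen by cross-validation rather than fixed by hand.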

4. Write a Python program to train a Decision Tree Classifier using Gini
Impurity as the criterion and print the feature importances (practical).


In [1]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import pandas as pd

In [2]:
iris = load_iris()
X = iris.data
y = iris.target
feature_names = iris.feature_names

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Dataset loaded and split successfully.")

Dataset loaded and split successfully.


In [3]:
dt_classifier = DecisionTreeClassifier(criterion='gini', random_state=42)
dt_classifier.fit(X_train, y_train)

print("Decision Tree Classifier trained using Gini Impurity.")

Decision Tree Classifier trained using Gini Impurity.


In [4]:
feature_importances = dt_classifier.feature_importances_

# Wrap the importances in a pandas Series indexed by feature name for readable output
importances = pd.Series(feature_importances, index=feature_names)

print("Feature Importances (Gini Impurity):")
display(importances.sort_values(ascending=False))

Feature Importances (Gini Impurity):

petal length (cm)    0.906143
petal width (cm)     0.077186
sepal width (cm)     0.01667
sepal length (cm)    0.0


5. What is a Support Vector Machine (SVM)?
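   - A Support Vector Machine (SVM) is a supervised machine learning algorithm capable of performing linear or non-linear classification, regression, and outlier detection. For classification, the fundamental idea is to find the hyperplane that separates the classes with the largest possible margin, where the margin is the distance from the hyperplane to the nearest training points of each class. Those nearest points are called support vectors, and they alone determine the decision boundary.

      Advantages of SVMs:

        * Effective in high-dimensional spaces, including cases where the number of features exceeds the number of samples.
        * Memory efficient, because the decision function depends only on the support vectors.
        * Versatile, because different kernel functions can be chosen for the decision function.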
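
As a small illustration of the memory-efficiency point, the sketch below fits a linear `SVC` on synthetic two-class data and counts how many training points end up as support vectors. The dataset parameters passed to `make_blobs` are arbitrary assumptions for the example.

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic, roughly separable 2-D data (parameters chosen only for illustration).
X, y = make_blobs(n_samples=100, centers=2, cluster_std=0.60, random_state=0)

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

n_support = clf.support_vectors_.shape[0]
print(f"Training points:  {len(X)}")
print(f"Support vectors:  {n_support}")

# For a linear kernel the separating hyperplane is w . x + b = 0.
print("w =", clf.coef_[0], " b =", clf.intercept_[0])
```

Only the support vectors are needed to evaluate the decision function, so the fitted model typically stores a subset of the training set.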

6. What is the Kernel Trick in SVM?
   - The Kernel Trick extends Support Vector Machines (SVMs) to data that is not linearly separable in its original feature space. A kernel function K(x, z) returns the inner product of x and z in a higher-dimensional feature space without ever computing the coordinates of that space explicitly.
     Because the SVM optimization problem depends on the data only through such inner products, replacing them with a kernel lets the SVM fit a linear separator in the higher-dimensional space, where the classes may become linearly separable. That separator corresponds to a non-linear decision boundary in the original space. Common kernels include the linear, polynomial, RBF (Gaussian), and sigmoid kernels.
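
A small numeric check can make the idea less abstract. For 2-D inputs, the degree-2 polynomial kernel K(x, z) = (x · z)² equals an ordinary dot product after the explicit mapping φ(x) = (x1², √2·x1·x2, x2²). The sketch below verifies this for one pair of points; the function names `phi` and `poly2_kernel` are illustrative, and the input vectors are arbitrary.

```python
import numpy as np


def phi(v):
    """Explicit feature map of the homogeneous degree-2 polynomial kernel in 2-D."""
    x1, x2 = v
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])


def poly2_kernel(a, b):
    """Kernel value computed directly in the original 2-D space."""
    return float(np.dot(a, b)) ** 2


x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = float(np.dot(phi(x), phi(z)))  # inner product in the 3-D feature space
implicit = poly2_kernel(x, z)             # same quantity without building phi

print("explicit:", explicit)
print("implicit:", implicit)
print("equal:", np.isclose(explicit, implicit))
```

For the RBF kernel the corresponding feature space is infinite-dimensional, so the explicit route is not available at all, while the kernel value is still cheap to compute.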

7. Write a Python program to train two SVM classifiers with Linear and RBF
kernels on the Wine dataset, then compare their accuracies.
   

In [5]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

In [6]:
wine = load_wine()
X = wine.data
y = wine.target
feature_names = wine.feature_names

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("Wine dataset loaded and split successfully.")

Wine dataset loaded and split successfully.


In [7]:
# Train SVM with Linear Kernel
svm_linear = SVC(kernel='linear', random_state=42)
svm_linear.fit(X_train, y_train)

# Train SVM with RBF Kernel
svm_rbf = SVC(kernel='rbf', random_state=42)
svm_rbf.fit(X_train, y_train)

print("Both Linear and RBF SVM classifiers trained successfully.")

Both Linear and RBF SVM classifiers trained successfully.


In [8]:
# Make predictions
y_pred_linear = svm_linear.predict(X_test)
y_pred_rbf = svm_rbf.predict(X_test)

# Calculate accuracies
accuracy_linear = accuracy_score(y_test, y_pred_linear)
accuracy_rbf = accuracy_score(y_test, y_pred_rbf)

print(f"Accuracy of SVM with Linear Kernel: {accuracy_linear:.4f}")
print(f"Accuracy of SVM with RBF Kernel: {accuracy_rbf:.4f}")

if accuracy_linear > accuracy_rbf:
    print("\nThe Linear Kernel performed better on this dataset.")
elif accuracy_rbf > accuracy_linear:
    print("\nThe RBF Kernel performed better on this dataset.")
else:
    print("\nBoth Kernels performed equally well on this dataset.")

Accuracy of SVM with Linear Kernel: 0.9815
Accuracy of SVM with RBF Kernel: 0.7593

The Linear Kernel performed better on this dataset.
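
The RBF accuracy reported above is noticeably lower than the linear one. A likely contributor is that the Wine features are on very different scales and the models were trained on unscaled data. The RBF kernel depends on distances between samples, so features with large ranges can dominate it. The sketch below is an addition to the exercise, not part of the original solution. It repeats the comparison with each `SVC` wrapped in a pipeline that standardizes the features first. No results are quoted here; run it to see how the gap changes.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

scaled_accuracy = {}
for kernel in ("linear", "rbf"):
    # The scaler is fitted on the training split only, then applied to the test split.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    model.fit(X_train, y_train)
    scaled_accuracy[kernel] = accuracy_score(y_test, model.predict(X_test))
    print(f"Accuracy with scaling, {kernel} kernel: {scaled_accuracy[kernel]:.4f}")
```

Putting the scaler inside the pipeline keeps test-set statistics out of training, which matters if the comparison is later moved into cross-validation.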


8. What is the Naïve Bayes classifier, and why is it called "Naïve"?
   - Naïve Bayes is a family of simple probabilistic classifiers that apply Bayes' theorem to estimate the probability of each class given the observed features, and then predict the most probable class.
      It is called "Naïve" because of the strong independence assumption it makes: all features are assumed to be conditionally independent of each other given the class, so the joint likelihood factorizes as P(x1, ..., xn | y) = P(x1 | y) · ... · P(xn | y). This assumption rarely holds exactly in real data, yet the classifier often works well in practice, for example in text classification.
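
The factorized likelihood can be worked through by hand. In the sketch below, all probabilities are hypothetical numbers invented for a two-class spam example. The posterior for each class is the prior times the product of per-feature likelihoods, normalized so that the posteriors sum to 1.

```python
# Hypothetical class priors and per-word likelihoods P(word present | class).
prior = {"spam": 0.4, "ham": 0.6}
likelihood = {
    "spam": {"free": 0.7, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.5},
}


def posterior(words):
    """Posterior over classes for a message containing the given words,
    treating the words as independent given the class (the naive assumption)."""
    scores = {}
    for cls, p_cls in prior.items():
        score = p_cls
        for word in words:
            score *= likelihood[cls][word]
        scores[cls] = score
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}


post = posterior(["free", "meeting"])
for cls, p in post.items():
    print(f"P({cls} | free, meeting) = {p:.4f}")
```

Only words that are present are considered here, which keeps the arithmetic short. A Bernoulli-style model would also multiply in the probability of each absent word.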
     

9. Explain the differences between Gaussian Naïve Bayes, Multinomial Naïve
Bayes, and Bernoulli Naïve Bayes.
   - While all Naïve Bayes classifiers share the core "naïve" assumption of conditional feature independence, they differ in the distribution they assume for each feature given the class.
      1. Gaussian Naïve Bayes: Assumes that each continuous feature follows a Gaussian (normal) distribution within each class. This means that when you plot the values of a continuous feature for each class, they should roughly resemble a bell curve.

         Data Type: Best suited for continuous numerical features (e.g., height, weight, temperature).

         It estimates the mean and variance of each feature for each class during training.
      2. Multinomial Naïve Bayes: Assumes that the feature counts within each class follow a multinomial distribution. It models the probability of observing a given count for each feature, given the class.

         Data Type: Primarily used for discrete features that represent counts, such as word counts in text documents.

         It estimates the probability of each feature (e.g., a word) from its relative frequency in documents of a particular class, usually with smoothing so that unseen words do not get zero probability.
      3. Bernoulli Naïve Bayes: Assumes that each feature is binary and follows a Bernoulli distribution within each class. This means each feature indicates the presence or absence of a particular event or characteristic.

         Data Type: Works with binary features (0 or 1, true or false, present or absent).

         Unlike Multinomial Naïve Bayes, it explicitly penalizes the absence of a feature that is indicative of a class. If a word is present in a document, it contributes to the probability. If it is absent, the probability of its absence also enters the calculation.
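
The three variants map onto three scikit-learn classes, each expecting a different kind of input. The sketch below fits each one on a tiny hand-made dataset of the matching type. The data values are invented for illustration and are far too small for any real conclusion.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])

# Continuous measurements -> GaussianNB
X_continuous = np.array([[1.2, 3.1], [0.9, 2.8], [4.0, 0.5], [4.3, 0.7]])

# Non-negative counts (e.g. word counts) -> MultinomialNB
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])

# Presence/absence indicators derived from the counts -> BernoulliNB
X_binary = (X_counts > 0).astype(int)

fitted = {
    "GaussianNB": (GaussianNB().fit(X_continuous, y), X_continuous),
    "MultinomialNB": (MultinomialNB().fit(X_counts, y), X_counts),
    "BernoulliNB": (BernoulliNB().fit(X_binary, y), X_binary),
}

predictions = {}
for name, (model, X) in fitted.items():
    predictions[name] = model.predict(X)
    print(f"{name:>13}: predictions on training data = {predictions[name]}")

# GaussianNB stores the per-class feature means it estimated during training.
print("Per-class means:\n", fitted["GaussianNB"][0].theta_)
```

The choice between the variants follows from the feature type rather than from the task: the same word-count matrix can feed `MultinomialNB` directly or `BernoulliNB` after binarization.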


10. Breast Cancer Dataset
Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer
dataset and evaluate accuracy.
Hint: Use GaussianNB() from sklearn.naive_bayes and the Breast Cancer dataset from
sklearn.datasets.


In [9]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

In [10]:
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print("Breast Cancer dataset loaded and split successfully.")

Breast Cancer dataset loaded and split successfully.


In [11]:
gnb = GaussianNB()
gnb.fit(X_train, y_train)

print("Gaussian Naïve Bayes classifier trained successfully.")

Gaussian Naïve Bayes classifier trained successfully.


In [12]:
y_pred = gnb.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy of Gaussian Naïve Bayes on Breast Cancer dataset: {accuracy:.4f}")

Accuracy of Gaussian Naïve Bayes on Breast Cancer dataset: 0.9415
