1: What is Information Gain, and how is it used in Decision Trees?

ans-

    Information Gain measures how much splitting on a feature reduces
    uncertainty (entropy) in a dataset. It is the entropy of the parent node
    minus the size-weighted average entropy of the child nodes. In decision
    trees, it is used to select the feature to split on at each node: the
    feature with the highest information gain is chosen, because it produces
    the purest (most homogeneous) child nodes. This process is repeated
    recursively to build the tree, effectively creating a series of if-then
    rules for prediction.
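
The calculation can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular library implements it; the helper names `entropy` and `information_gain` and the toy "yes"/"no" labels are assumptions made for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Hypothetical toy node: 5 "yes" and 5 "no" labels.
parent = ["yes"] * 5 + ["no"] * 5

# Split A separates the classes perfectly; split B leaves both children mixed.
split_a = [["yes"] * 5, ["no"] * 5]
split_b = [["yes"] * 3 + ["no"] * 2, ["yes"] * 2 + ["no"] * 3]

gain_a = information_gain(parent, split_a)
gain_b = information_gain(parent, split_b)
print(f"Gain of split A: {gain_a:.4f}")
print(f"Gain of split B: {gain_b:.4f}")
```

A tree builder using this criterion would pick split A, since it has the larger gain.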


2: What is the difference between Gini Impurity and Entropy?

ans-  


    Gini Impurity and Entropy are both measures of a node's impurity in a decision
    tree, but they differ in their calculation and range. Gini Impurity is
    1 minus the sum of the squared class probabilities; it is cheaper to
    compute and, for a two-class problem, ranges from 0 to 0.5. Entropy uses
    logarithms, is slightly more expensive to compute, and ranges from 0 to 1
    for two classes. With K classes the maximum values are 1 - 1/K for Gini
    and log2(K) for Entropy. Both are used to find the best split by
    minimizing impurity, and the resulting trees are usually very similar, so
    Gini is often preferred for large datasets because of its speed.
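
The two measures can be compared directly on a few class distributions. The sketch below assumes the standard definitions (Gini as 1 minus the sum of squared probabilities, entropy in bits); the function names and the example distributions are illustrative and not taken from any dataset.

```python
import math

def gini(probs):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    return 1.0 - sum(p * p for p in probs)

def entropy(probs):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Illustrative class distributions at a node.
distributions = {
    "pure node": [1.0, 0.0],
    "balanced, 2 classes": [0.5, 0.5],
    "balanced, 3 classes": [1 / 3, 1 / 3, 1 / 3],
}

for name, probs in distributions.items():
    print(f"{name}: gini={gini(probs):.4f}, entropy={entropy(probs):.4f}")
```

Both measures are zero for a pure node and largest for a balanced one; they differ only in scale and in how sharply they penalize mixed nodes.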

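3: What is Pre-Pruning in Decision Trees?

ans-

    Pre-pruning (also called early stopping) is a technique that halts the
    growth of a decision tree during its construction, for example by
    limiting the maximum depth or requiring a minimum number of samples per
    split or leaf. This prevents the tree from becoming too complex and
    overfitting the training data.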
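
In scikit-learn, pre-pruning is expressed through constructor parameters such as `max_depth` and `min_samples_leaf`. The sketch below compares an unrestricted tree with a pre-pruned one on the Iris dataset; the specific limits (depth 3, 5 samples per leaf) are illustrative assumptions, not tuned values.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Unrestricted tree: grows until leaves are pure or cannot be split further.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# Pre-pruned tree: growth stops early. These limits are illustrative, not tuned.
pruned_tree = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=5, random_state=42
).fit(X_train, y_train)

for name, tree in [("full", full_tree), ("pre-pruned", pruned_tree)]:
    print(
        f"{name}: depth={tree.get_depth()}, leaves={tree.get_n_leaves()}, "
        f"test accuracy={tree.score(X_test, y_test):.4f}"
    )
```

In practice the limits would usually be chosen by cross-validation rather than fixed by hand.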

4: Write a Python program to train a Decision Tree Classifier using Gini
Impurity as the criterion and print the feature importances.

In [2]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris # Example dataset

# Load an example dataset (Iris dataset)
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the Decision Tree Classifier with Gini impurity as the criterion
# criterion='gini' is the default, but explicitly setting it emphasizes the choice.
dt_classifier = DecisionTreeClassifier(criterion='gini', random_state=42)

# Train the Decision Tree Classifier
dt_classifier.fit(X_train, y_train)

# Print the feature importances
print("Feature Importances:")
for feature, importance in zip(X.columns, dt_classifier.feature_importances_):
    print(f"{feature}: {importance:.4f}")



Feature Importances:
sepal length (cm): 0.0000
sepal width (cm): 0.0191
petal length (cm): 0.8933
petal width (cm): 0.0876


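5: What is a Support Vector Machine (SVM)?

ans-

    A Support Vector Machine (SVM) is a supervised machine learning algorithm
    used for classification and regression tasks. For classification, it
    finds the hyperplane that separates the classes with the largest margin.
    The training points closest to that hyperplane are called support
    vectors, and they alone determine its position.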
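
To make the hyperplane idea concrete, the sketch below fits a linear SVM on a tiny hand-made 2D dataset and inspects the learned weights and support vectors. The six data points are hypothetical and chosen only so that the two classes are linearly separable.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two linearly separable clusters in 2D.
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# The decision boundary is w . x + b = 0.
w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("Support vectors:")
print(clf.support_vectors_)
print("Prediction for [2, 2]:", clf.predict([[2.0, 2.0]])[0])
```

Only a subset of the training points appears in `support_vectors_`; the remaining points could be removed without changing the fitted boundary.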

 6: What is the Kernel Trick in SVM?

ans-



    The Kernel Trick is a method used in Support Vector Machines (SVM) to
    classify data that is not linearly separable without explicitly mapping
    it to a higher-dimensional space. It works by using a kernel function to
    compute the dot product between data points as if they were in that
    higher-dimensional space, which is computationally more efficient than
    actually performing the transformation.
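
The equivalence can be checked numerically for a simple case. The sketch below assumes a degree-2 homogeneous polynomial kernel, k(a, b) = (a . b)^2, and the matching explicit feature map for 2D inputs; both the kernel choice and the helper names `phi` and `poly_kernel` are illustrative.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2D vector (chosen for illustration)."""
    x1, x2 = v
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

def poly_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: k(a, b) = (a . b)^2."""
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

explicit = np.dot(phi(a), phi(b))  # dot product computed in the 3D feature space
implicit = poly_kernel(a, b)       # same value, computed in the original 2D space

print(f"Explicit mapping: {explicit:.4f}")
print(f"Kernel function:  {implicit:.4f}")
```

Both lines print the same number. For kernels such as RBF the corresponding feature space is infinite-dimensional, so the explicit route is not available at all and the kernel function is the only practical option.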

 7: Write a Python program to train two SVM classifiers with Linear and RBF
kernels on the Wine dataset, then compare their accuracies.

In [3]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Load the Wine dataset
wine = load_wine()
X = wine.data
y = wine.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train an SVM classifier with a Linear kernel
svm_linear = SVC(kernel='linear', random_state=42)
svm_linear.fit(X_train, y_train)

# Make predictions and calculate accuracy for the Linear SVM
y_pred_linear = svm_linear.predict(X_test)
accuracy_linear = accuracy_score(y_test, y_pred_linear)
print(f"Accuracy of SVM with Linear Kernel: {accuracy_linear:.4f}")

# Train an SVM classifier with an RBF kernel
svm_rbf = SVC(kernel='rbf', random_state=42)
svm_rbf.fit(X_train, y_train)

# Make predictions and calculate accuracy for the RBF SVM
y_pred_rbf = svm_rbf.predict(X_test)
accuracy_rbf = accuracy_score(y_test, y_pred_rbf)
print(f"Accuracy of SVM with RBF Kernel: {accuracy_rbf:.4f}")

# Compare the accuracies
if accuracy_linear > accuracy_rbf:
    print("Linear Kernel performed better.")
elif accuracy_rbf > accuracy_linear:
    print("RBF Kernel performed better.")
else:
    print("Both kernels performed equally well.")

Accuracy of SVM with Linear Kernel: 0.9815
Accuracy of SVM with RBF Kernel: 0.7593
Linear Kernel performed better.


 8: What is the Naive Bayes classifier, and why is it called "Naive"?

 ans-


    The Naive Bayes classifier is a simple probabilistic classification
    algorithm that uses Bayes' Theorem to estimate the probability of each
    class for a given data point and predicts the most probable one. It is
    called "naive" because it makes a strong, often unrealistic, assumption
    that all features are conditionally independent of each other given the
    class.
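
A hand-computed example shows where the independence assumption enters. The priors and per-word likelihoods below are hypothetical numbers invented for a toy spam filter, not estimates from real data.

```python
# Hypothetical spam-filter probabilities, chosen only for illustration.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.8, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.5},
}

def posterior(words):
    """Apply Bayes' Theorem, multiplying per-word likelihoods as if independent."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for word in words:
            score *= likelihoods[label][word]  # the "naive" independence step
        scores[label] = score
    total = sum(scores.values())  # normalize so the posteriors sum to 1
    return {label: score / total for label, score in scores.items()}

result = posterior(["free", "meeting"])
print(result)
```

Multiplying the per-word likelihoods is only valid if the words are independent given the class, which is exactly the assumption the name refers to.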

 9: Explain the differences between Gaussian Naive Bayes, Multinomial Naive
Bayes, and Bernoulli Naive Bayes

ans-

**Gaussian Naive Bayes**

    handles continuous features by assuming that, within each class, every
    feature follows a normal distribution (bell curve).

**Multinomial Naive Bayes**

    works with discrete data, specifically counts or frequencies (how many
    times something appears), such as word counts in text classification.

**Bernoulli Naive Bayes**

    is for binary data, where a feature is either present (1) or absent (0).
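
The three variants map onto three scikit-learn classes, each suited to a different kind of feature. The sketch below fits each one on a tiny hand-made dataset of the matching type; all arrays are hypothetical and exist only to show which input goes with which model.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])  # hypothetical labels shared by all three toy datasets

# Continuous measurements -> GaussianNB
X_continuous = np.array([[1.2, 0.5], [0.9, 0.7], [3.1, 2.8], [2.9, 3.2]])
# Non-negative counts (e.g. word counts) -> MultinomialNB
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
# Binary presence/absence indicators -> BernoulliNB
X_binary = (X_counts > 0).astype(int)

models = {
    "GaussianNB": (GaussianNB(), X_continuous),
    "MultinomialNB": (MultinomialNB(), X_counts),
    "BernoulliNB": (BernoulliNB(), X_binary),
}

predictions = {}
for name, (model, X) in models.items():
    model.fit(X, y)
    predictions[name] = model.predict(X)  # predictions on the training data itself
    print(f"{name}: {predictions[name]}")
```

Note that `MultinomialNB` requires non-negative feature values, which is why it is paired with counts rather than arbitrary real numbers.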

10: Write a Python program to train a Gaussian Naïve Bayes classifier on the Breast Cancer
dataset and evaluate accuracy.


In [8]:
# Import necessary libraries
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer

# 1. Load the Breast Cancer dataset
# The dataset is included in scikit-learn for convenience
print("Loading the Breast Cancer dataset...")
cancer = load_breast_cancer()
X = cancer.data
y = cancer.target
print(f"Dataset loaded. X shape: {X.shape}, y shape: {y.shape}\n")

# 2. Split the dataset into training and testing sets
# We use 70% for training and 30% for testing
# A random state is set for reproducibility
print("Splitting data into training (70%) and testing (30%) sets...")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(f"Training set size: {X_train.shape[0]} samples")
print(f"Testing set size: {X_test.shape[0]} samples\n")

# 3. Initialize the Gaussian Naive Bayes classifier
# GaussianNB is suitable for continuous data, like the features in this dataset
print("Initializing Gaussian Naive Bayes classifier...")
gnb = GaussianNB()

# 4. Train the classifier on the training data
print("Training the classifier...")
gnb.fit(X_train, y_train)
print("Training complete.\n")

# 5. Make predictions on the testing data
print("Making predictions on the test set...")
y_pred = gnb.predict(X_test)
print("Predictions complete.\n")

# 6. Evaluate the accuracy of the model
# Compare the predicted labels (y_pred) with the actual labels (y_test)
accuracy = accuracy_score(y_test, y_pred)

print("-" * 40)
print(f"Model Accuracy on Test Set: {accuracy * 100:.2f}%")
print("-" * 40)

# Optional: Display a few actual vs. predicted results
print("\nSample Actual vs. Predicted Labels:")
for i in range(10):
    print(f"  Actual: {y_test[i]}, Predicted: {y_pred[i]}")



Loading the Breast Cancer dataset...
Dataset loaded. X shape: (569, 30), y shape: (569,)

Splitting data into training (70%) and testing (30%) sets...
Training set size: 398 samples
Testing set size: 171 samples

Initializing Gaussian Naive Bayes classifier...
Training the classifier...
Training complete.

Making predictions on the test set...
Predictions complete.

----------------------------------------
Model Accuracy on Test Set: 94.15%
----------------------------------------

Sample Actual vs. Predicted Labels:
  Actual: 1, Predicted: 1
  Actual: 0, Predicted: 0
  Actual: 0, Predicted: 0
  Actual: 1, Predicted: 1
  Actual: 1, Predicted: 1
  Actual: 0, Predicted: 0
  Actual: 0, Predicted: 0
  Actual: 0, Predicted: 0
  Actual: 1, Predicted: 1
  Actual: 1, Predicted: 1
