Naive Bayes Classifier Algorithm with Iris Dataset

In [7]:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target

In [8]:
iris.data.shape

(150, 4)

iris.data: 
   It contains the features or independent variables of the dataset. The dataset has 4 features: sepal length, sepal width, petal length, and petal width. These features are represented as a NumPy array with shape (150, 4).

iris.target: 
   It contains the target or dependent variable of the dataset. In this case, it represents the class labels for each sample in the dataset. There are 3 classes in the Iris dataset: Iris Setosa, Iris Versicolor, and Iris Virginica. The target variable is represented as a NumPy array with shape (150,).
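The description above can be checked directly. The following is a minimal sketch that prints the array shapes, the feature names, and the number of samples per class; the variable name class_counts is only illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

print(X.shape, y.shape)        # shapes of the feature matrix and the label vector
print(iris.feature_names)      # names of the four feature columns

# Count how many samples carry each class label (0, 1, 2)
class_counts = np.bincount(y)
for name, count in zip(iris.target_names, class_counts):
    print(name, count)
```

The dataset is balanced, so each of the three classes should show the same count.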

Split the Data

In [9]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

The code above uses the train_test_split function from scikit-learn's **model_selection** module to split the Iris dataset into **training and testing sets**.

**The train_test_split function takes two arrays (X and y in this case) as input and returns four arrays: X_train, X_test, y_train, and y_test.**

The test_size parameter specifies the proportion of the dataset to place in the testing set. Here it is set to 0.2, so 20% of the samples (30 of 150) are held out for **testing** and the remaining 80% (120 samples) are used for training. Because no random_state is passed, the split is different on every run, and so are the scores reported below.

X_train and y_train contain the training data, **which is used to train the Naive Bayes classifier**. X_test and y_test contain the testing data, which is used to evaluate the classifier. Holding out a test set matters because it measures performance on unseen data, which is how overfitting is detected.
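As a sketch of a more reproducible split, the example below fixes random_state (the value 42 is arbitrary) and passes stratify=y so that each class keeps the same proportion in both sets. The stratify argument is an addition for illustration and is not used in the cell above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# random_state makes the split repeatable; stratify=y preserves class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)   # sizes of the training and testing sets
print(np.bincount(y_test))           # samples per class in the test set
```

With stratification, the 30 test samples are divided evenly among the three classes.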

Train the Model:
 Now we’ll train the Naive Bayes classifier. We’ll use the GaussianNB class from scikit-learn, which implements the Gaussian Naive Bayes algorithm.

In [10]:
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
model = GaussianNB()
model.fit(X_train, y_train)


The code above creates an instance of scikit-learn’s GaussianNB class, one of several Naive Bayes classifiers in the sklearn.naive_bayes module.

**GaussianNB is the Naive Bayes variant for continuous features**: it **assumes that, within each class, every feature follows a Gaussian (normal) distribution**.

The input data (X_train) holds the four Iris measurements, which are continuous values, so GaussianNB is an appropriate choice for this task.

The model.fit(X_train, y_train) line is then used to train the Naive Bayes classifier on the training data (X_train and y_train). The fit method fits the model to the training data by calculating the mean and variance of each feature for each class in the dataset. This information is then used to make predictions on new data.
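To make the fitting step concrete, the sketch below recomputes the per-class feature means with NumPy and compares them with the fitted model's theta_ attribute. For simplicity it fits on the full dataset rather than on X_train, which is an assumption of this example only. The fitted variances are also stored on the model, with a small smoothing term added, so they are computed by hand here but not compared exactly.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
model = GaussianNB().fit(X, y)

# Per-class mean and variance of every feature, computed by hand
manual_means = np.array([X[y == c].mean(axis=0) for c in model.classes_])
manual_vars = np.array([X[y == c].var(axis=0) for c in model.classes_])

print(np.allclose(model.theta_, manual_means))  # do the learned means match?
print(manual_vars.shape)                        # one variance per class and feature
print(model.class_prior_)                       # class frequencies seen during fit
```

At prediction time these per-class statistics define one Gaussian per feature and class, and the class priors weight the resulting likelihoods.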

After training, the model object represents a trained Naive Bayes classifier that can be used to make predictions on new data.

Evaluate the Model

After training the model, we’ll evaluate its performance on the test set. We’ll use the accuracy_score function from scikit-learn to compute the accuracy of the model.

In [11]:
from sklearn.metrics import accuracy_score
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)

Accuracy: 1.0


The above code snippet is using scikit-learn’s accuracy_score function from the metrics module to calculate the accuracy of the trained Naive Bayes classifier.

model.predict(X_test) is used to make predictions on the testing data (X_test) using the trained Naive Bayes classifier (model). The predicted labels are stored in the y_pred variable.

accuracy_score(y_test, y_pred) calculates the accuracy of the classifier by comparing the predicted labels (y_pred) with the true labels (y_test) of the testing data. The accuracy_score function returns the fraction of correctly classified samples.

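The resulting accuracy is then printed with the print function. It indicates how well the Naive Bayes classifier performs on the testing data. Keep in mind that the test set here has only 30 samples, so a single misclassification changes the score by about 3.3 percentage points, and a different random split can give a different result.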
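The "fraction of correctly classified samples" can be reproduced by hand. The sketch below compares a manual computation with accuracy_score and also prints a confusion matrix, which shows which classes are mixed up. The fixed random_state and the confusion matrix are additions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
y_pred = GaussianNB().fit(X_train, y_train).predict(X_test)

# Accuracy is the mean of the element-wise "prediction equals truth" comparison
manual_accuracy = np.mean(y_pred == y_test)
library_accuracy = accuracy_score(y_test, y_pred)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)

print(manual_accuracy, library_accuracy)
print(cm)
```

The diagonal of the confusion matrix counts the correct predictions, so its sum divided by the total equals the accuracy.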

Make Predictions

In [12]:
# Finally, use the trained model's predict method to classify a new sample.

sample = [[5.1, 3.5, 1.4, 0.2]]
prediction = model.predict(sample)
print('Prediction:', prediction)

Prediction: [0]


The above code snippet is using the trained Naive Bayes classifier (model) to make a prediction on a new sample of data.

The new sample is represented as a 2D list containing the values of the 4 features of the Iris dataset — [[5.1, 3.5, 1.4, 0.2]].

model.predict(sample) is used to predict the class label of the new sample using the trained Naive Bayes classifier (model). The predicted label is stored in the prediction variable.

Finally, the predicted label is printed to the console using the print function. This line of code gives an example of how to use the trained Naive Bayes classifier to make predictions on new, unseen data.

The output Prediction: [0] means that the trained Naive Bayes classifier assigned the new sample to class 0. The three Iris classes (setosa, versicolor, and virginica) are encoded as the labels 0, 1, and 2, respectively, so the predicted class is setosa.

It’s important to note that this prediction comes from the per-class feature statistics the classifier estimated from the training set. How reliable it is depends on the quality and representativeness of the training data and on how well the Naive Bayes assumptions hold, chiefly that the features are conditionally independent given the class.
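A single label hides how confident the model is. As a minimal sketch, predict_proba returns the estimated probability of each class for the same sample, and iris.target_names turns the numeric label into a name. The model here is fitted on the full dataset for brevity, which is an assumption of this example.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
model = GaussianNB().fit(iris.data, iris.target)

sample = [[5.1, 3.5, 1.4, 0.2]]
probabilities = model.predict_proba(sample)[0]   # one probability per class
predicted_label = model.predict(sample)[0]       # label with the highest probability

for name, p in zip(iris.target_names, probabilities):
    print(f"{name}: {p:.4f}")
print("Predicted class:", iris.target_names[predicted_label])
```

Naive Bayes probabilities are often pushed toward 0 or 1 by the independence assumption, so they are better read as a ranking than as calibrated confidence.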

In [13]:
print(iris.feature_names)
print(iris.target_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
['setosa' 'versicolor' 'virginica']


In [14]:
sample = [[.1, 3.95, 1.4, 0.2]]
prediction = model.predict(sample)
print('Prediction:', prediction)
print('Predicted class:', iris.target_names[prediction[0]].upper())


Prediction: [0]
Predicted class: SETOSA


Check Accuracy 

In [15]:
# Initialize Naive Bayes classifiers
gaussian_nb = GaussianNB()
multinomial_nb = MultinomialNB()
bernoulli_nb = BernoulliNB()

# Train classifiers
gaussian_nb.fit(X_train, y_train)
multinomial_nb.fit(X_train, y_train)
bernoulli_nb.fit(X_train, y_train)

In [16]:
# Make predictions
gaussian_pred = gaussian_nb.predict(X_test)
multinomial_pred = multinomial_nb.predict(X_test)
bernoulli_pred = bernoulli_nb.predict(X_test)


In [17]:
# Evaluate accuracy
gaussian_accuracy = accuracy_score(y_test, gaussian_pred)
multinomial_accuracy = accuracy_score(y_test, multinomial_pred)
bernoulli_accuracy = accuracy_score(y_test, bernoulli_pred)

print("Accuracy of Gaussian Naive Bayes:", gaussian_accuracy)
print("Accuracy of Multinomial Naive Bayes:", multinomial_accuracy)
print("Accuracy of Bernoulli Naive Bayes:", bernoulli_accuracy)

Accuracy of Gaussian Naive Bayes: 1.0
Accuracy of Multinomial Naive Bayes: 0.9333333333333333
Accuracy of Bernoulli Naive Bayes: 0.26666666666666666
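The low BernoulliNB score has a simple cause. BernoulliNB expects binary features and by default binarizes its input at a threshold of 0.0. Every Iris measurement is positive, so all features become 1 and every sample looks identical to the model. The sketch below illustrates this under a fixed split (random_state=42 is an arbitrary choice).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Every measurement is > 0, so the default binarize=0.0 maps all features to 1
all_positive = bool((X > 0).all())

bernoulli_pred = BernoulliNB().fit(X_train, y_train).predict(X_test)

print(all_positive)
print(np.unique(bernoulli_pred))   # distinct labels predicted for the test set
```

Because all binarized samples are the same, the model predicts one class for the entire test set, and its accuracy is just the share of that class among the test labels. MultinomialNB, for its part, is designed for count-like features, so its reasonable score here should not be read as a general recommendation for continuous data.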


For hyperparameter tuning, you can use GridSearchCV or RandomizedSearchCV from scikit-learn's model_selection module.

In [18]:
from sklearn.model_selection import GridSearchCV

# Hyperparameter grid for GaussianNB: left empty, so only the default settings are cross-validated
gaussian_params = {}

# Define hyperparameters grid for MultinomialNB
multinomial_params = {'alpha': [0.1, 0.5, 1.0]}

# Define hyperparameters grid for BernoulliNB
bernoulli_params = {'alpha': [0.1, 0.5, 1.0]}

# Perform grid search for each classifier
gaussian_grid_search = GridSearchCV(gaussian_nb, gaussian_params, cv=5)
multinomial_grid_search = GridSearchCV(multinomial_nb, multinomial_params, cv=5)
bernoulli_grid_search = GridSearchCV(bernoulli_nb, bernoulli_params, cv=5)

# Fit grid search models
gaussian_grid_search.fit(X_train, y_train)
multinomial_grid_search.fit(X_train, y_train)
bernoulli_grid_search.fit(X_train, y_train)

# Get best parameters and best score for each classifier
print("Best parameters for Gaussian Naive Bayes:", gaussian_grid_search.best_params_)
print("Best score for Gaussian Naive Bayes:", gaussian_grid_search.best_score_)

print("Best parameters for Multinomial Naive Bayes:", multinomial_grid_search.best_params_)
print("Best score for Multinomial Naive Bayes:", multinomial_grid_search.best_score_)

print("Best parameters for Bernoulli Naive Bayes:", bernoulli_grid_search.best_params_)
print("Best score for Bernoulli Naive Bayes:", bernoulli_grid_search.best_score_)

Best parameters for Gaussian Naive Bayes: {}
Best score for Gaussian Naive Bayes: 0.9416666666666668
Best parameters for Multinomial Naive Bayes: {'alpha': 0.1}
Best score for Multinomial Naive Bayes: 0.95
Best parameters for Bernoulli Naive Bayes: {'alpha': 0.1}
Best score for Bernoulli Naive Bayes: 0.3333333333333333
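The empty grid means nothing was actually tuned for GaussianNB. As a sketch of what a non-empty grid could look like, the example below searches over var_smoothing, the GaussianNB parameter that adds a fraction of the largest feature variance to all variances. The range of candidate values is an assumption chosen for illustration, not a recommendation.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Ten candidate values spaced logarithmically between 1e-12 and 1e-3
param_grid = {"var_smoothing": np.logspace(-12, -3, 10)}

search = GridSearchCV(GaussianNB(), param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```

On a small, well-separated dataset such as Iris, the candidates may score almost identically, in which case the default value is a reasonable choice.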


Accuracy and Hyperparameter Tuning Using User-Defined Functions

In [19]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score

def fit_and_compare_classifiers(X_train, X_test, y_train, y_test):
    classifiers = {
        "GaussianNB": GaussianNB(),
        "MultinomialNB": MultinomialNB(),
        "BernoulliNB": BernoulliNB()
    }
    
    results = {}
    
    for name, clf in classifiers.items():
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        results[name] = accuracy
    
    return results

def tweak_hyperparameters_and_compare(X_train, X_test, y_train, y_test):
    # Map each name to its estimator class and parameter grid (avoids eval)
    search_spaces = {
        "GaussianNB": (GaussianNB, {}),
        "MultinomialNB": (MultinomialNB, {'alpha': [0.1, 0.5, 1.0]}),
        "BernoulliNB": (BernoulliNB, {'alpha': [0.1, 0.5, 1.0]})
    }
    
    results = {}
    
    for name, (clf_class, params) in search_spaces.items():
        grid_search = GridSearchCV(clf_class(), params, cv=5)
        # Cross-validation runs on the training data only; X_test and y_test are not used here
        grid_search.fit(X_train, y_train)
        best_params = grid_search.best_params_
        best_score = grid_search.best_score_
        results[name] = {"best_params": best_params, "best_score": best_score}
    
    return results

# Load Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Task 1: Fit and compare all versions of Naive Bayes classifiers
results1 = fit_and_compare_classifiers(X_train, X_test, y_train, y_test)
print("Task 1 - Results:")
print(results1)

# Task 2: Tweak hyperparameters and compare the results
results2 = tweak_hyperparameters_and_compare(X_train, X_test, y_train, y_test)
print("\nTask 2 - Results:")
print(results2)


Task 1 - Results:
{'GaussianNB': 1.0, 'MultinomialNB': 0.9, 'BernoulliNB': 0.3}

Task 2 - Results:
{'GaussianNB': {'best_params': {}, 'best_score': 0.9416666666666668}, 'MultinomialNB': {'best_params': {'alpha': 0.1}, 'best_score': 0.9333333333333332}, 'BernoulliNB': {'best_params': {'alpha': 0.1}, 'best_score': 0.3333333333333333}}
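The two result sets measure different things: Task 1 reports accuracy on the held-out test set, while best_score_ in Task 2 is the mean cross-validated accuracy on the training data. That is why, for example, GaussianNB shows 1.0 in one and about 0.94 in the other. The sketch below reports both numbers for one tuned model; the choice of MultinomialNB is only illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

search = GridSearchCV(MultinomialNB(), {"alpha": [0.1, 0.5, 1.0]}, cv=5)
search.fit(X_train, y_train)

# Mean accuracy across the 5 validation folds of the training set
cv_accuracy = search.best_score_

# Accuracy of the refitted best model on the untouched test set
test_accuracy = search.score(X_test, y_test)

print("Best parameters:", search.best_params_)
print("Cross-validated accuracy:", cv_accuracy)
print("Test accuracy:", test_accuracy)
```

By default GridSearchCV refits the best configuration on the whole training set, so search.score evaluates that refitted model.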
