
In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

In [None]:
%cd "/content/gdrive/My Drive/Colab Notebooks/semester3/MLBD"

In [None]:
import numpy as np
import pandas as pd
from keras.utils import to_categorical  # convert to one-hot encoding (np_utils is gone in newer Keras)
import matplotlib.pyplot as plt

# python magic function
%matplotlib inline

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import warnings
warnings.filterwarnings("ignore")

In [None]:
df_train = pd.read_csv('/content/gdrive/My Drive/Colab Notebooks/semester3/MLBD/train.csv')
df_train.head(5)

In [None]:
df_train.shape

In [None]:
df_train = df_train.iloc[0:5000]
df_train.shape

In [None]:
df_test = pd.read_csv('/content/gdrive/My Drive/Colab Notebooks/semester3/MLBD/test.csv')
df_test.head(5)

In [None]:
df_test = df_test.iloc[0:3000]
df_test.shape

In [None]:
Y_train = df_train["label"]

# Drop 'label' column
X_train = df_train.drop(labels = ["label"],axis = 1) 

X_test = df_test

# Plot numbers

In [None]:
Y_train = to_categorical(Y_train, num_classes = 10)
Y_train[8]

In [None]:
# create figure with 3x3 subplots using matplotlib.pyplot
fig, axs = plt.subplots(3, 3, figsize = (12, 12))
plt.gray()

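# loop through subplots and add mnist images
for i, ax in enumerate(axs.flat):
    # each row holds 784 flattened pixels; reshape to 28x28 for display
    ax.matshow(X_train.iloc[i].to_numpy().reshape(28, 28))
    ax.axis('off')
    # Y_train is one-hot encoded, so recover the digit with argmax
    ax.set_title('Number {}'.format(np.argmax(Y_train[i])))

# display the figure
plt.show()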

# BFR implementation

The following steps implement a BFR (Bradley-Fayyad-Reina) style clustering algorithm on the MNIST dataset and use it to make predictions on the test dataset:

Load the MNIST dataset using any suitable library like TensorFlow or PyTorch. MNIST is a dataset of handwritten digits, and each image is of size 28x28 pixels. The dataset has 60,000 training images and 10,000 test images.

Preprocess the data by scaling the pixel values to a range between 0 and 1, and flattening the images to a one-dimensional array of size 784. You can also reduce the dimensionality of the data using techniques like PCA or t-SNE if needed.
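The preprocessing step can be sketched on synthetic data as follows. This is a minimal illustration, not the notebook's exact pipeline: the random `images` array is a stand-in for MNIST, and the choice of 50 components is an assumption borrowed from the PCA call used later in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for MNIST: 200 images of 28x28 pixels with values in 0..255
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(200, 28, 28))

# Flatten each image to a 784-dimensional vector and scale to [0, 1]
flat = images.reshape(len(images), -1) / 255.0

# Project onto the first 50 principal components
pca_demo = PCA(n_components=50)
reduced = pca_demo.fit_transform(flat)

print(flat.shape, reduced.shape)
```

PCA is the more convenient choice here because the fitted projection can be reused on the test set with `transform`; scikit-learn's `TSNE` only offers `fit_transform`, so it cannot project unseen test images.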

Implement the BFR clustering algorithm. Here is a brief overview of the algorithm:

1. Initialize the algorithm with a set of random centroids.
2. Assign each data point to the nearest centroid.
3. Compute the quality of the clustering using a cost function such as the sum of squared errors (SSE); a lower cost means a tighter clustering.
4. Where the quality is poor (the cost exceeds a threshold), split the affected clusters into sub-clusters using an algorithm such as k-means or hierarchical clustering.
5. Where clusters are finer than needed, merge them into larger clusters using an algorithm such as agglomerative clustering.
6. Repeat the above steps until convergence or until a maximum number of iterations is reached.

The code in this notebook uses cluster size (`split_size` and `merge_size`) rather than a cost threshold to decide when to split and merge.
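The assignment and cost steps above can be sketched with plain NumPy. The helper name `assign_and_sse` and the toy points are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def assign_and_sse(X, centroids):
    """Return nearest-centroid indices and the sum of squared errors."""
    # Pairwise squared Euclidean distances, shape (n_points, n_centroids)
    sq_dist = ((X[:, np.newaxis, :] - centroids[np.newaxis, :, :]) ** 2).sum(axis=2)
    assignments = sq_dist.argmin(axis=1)
    # Add up each point's squared distance to its own centroid
    sse = sq_dist[np.arange(len(X)), assignments].sum()
    return assignments, sse

# Two tight groups of two points each
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 12.0]])
centroids_demo = np.array([[0.0, 0.5], [10.0, 11.0]])

assignments, sse = assign_and_sse(X_demo, centroids_demo)
print(assignments, sse)  # [0 0 1 1] 2.5
```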

Train the BFR algorithm on the MNIST training dataset.

Predict the labels of the test dataset by assigning each test image to the nearest centroid and using the label of the centroid as the predicted label for the test image.
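Clustering alone yields cluster indices, not digits, so each centroid needs a label before it can be used for prediction. One common option, shown here as an assumption rather than the notebook's method, is a majority vote over the training points in each cluster. The helper names `label_centroids` and `predict_labels` are hypothetical.

```python
import numpy as np

def label_centroids(assignments, y, n_clusters):
    """Give each cluster the most common true label among its training members."""
    centroid_labels = np.full(n_clusters, -1)  # -1 marks an empty cluster
    for j in range(n_clusters):
        members = y[assignments == j]
        if len(members) > 0:
            centroid_labels[j] = np.bincount(members).argmax()
    return centroid_labels

def predict_labels(X, centroids, centroid_labels):
    """Assign each row of X to its nearest centroid and return that centroid's label."""
    distances = np.linalg.norm(X[:, np.newaxis, :] - centroids, axis=2)
    return centroid_labels[distances.argmin(axis=1)]

# Toy example: two clusters whose training members mostly carry labels 3 and 7
train_assignments = np.array([0, 0, 0, 1, 1])
y_toy = np.array([3, 3, 7, 7, 7])
centroids_toy = np.array([[0.0, 0.0], [5.0, 5.0]])

centroid_labels = label_centroids(train_assignments, y_toy, n_clusters=2)
pred = predict_labels(np.array([[0.2, -0.1], [4.0, 6.0]]), centroids_toy, centroid_labels)
print(centroid_labels, pred)  # [3 7] [3 7]
```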

Evaluate the accuracy of the predictions using metrics like accuracy, precision, recall, and F1-score.
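These metrics are available in scikit-learn. The sketch below uses made-up label lists purely for illustration; macro averaging is one reasonable choice for a ten-class problem, not the only one.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Hypothetical true and predicted labels for six samples
y_true = [0, 1, 1, 2, 2, 2]
y_hat = [0, 1, 2, 2, 2, 1]

accuracy = accuracy_score(y_true, y_hat)
# Macro averaging weights every class equally, regardless of its frequency
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_hat, average="macro", zero_division=0
)
print(accuracy, precision, recall, f1)
```

These scores need true labels. If the test file has no label column (as `X_test = df_test` suggests), you could hold out part of the training data for evaluation instead.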

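Note that a full BFR implementation can be complex, and there are many variations and parameters that can affect the results. As it is usually described, BFR processes the data in chunks and keeps only per-cluster summary statistics, so the split-and-merge loop in this notebook is best read as a simplified variant. As far as we are aware, scikit-learn and TensorFlow do not ship a ready-made BFR class; scikit-learn's `MiniBatchKMeans` and `Birch` are related options for large datasets that make it easy to tune parameters and evaluate performance.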
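For reference, BFR as it is commonly presented summarizes each cluster by a point count, a per-dimension sum, and a per-dimension sum of squares. It then compares new points to a cluster with a variance-normalized (Mahalanobis-style) distance. The class below is a minimal sketch of that bookkeeping; `ClusterSummary` and its method names are hypothetical and are not used elsewhere in this notebook.

```python
import numpy as np

class ClusterSummary:
    """Hypothetical helper: a cluster reduced to N, SUM and SUMSQ per dimension."""

    def __init__(self, dim):
        self.n = 0
        self.sum = np.zeros(dim)
        self.sumsq = np.zeros(dim)

    def add(self, points):
        # Fold one or more points into the running statistics
        points = np.atleast_2d(points)
        self.n += len(points)
        self.sum += points.sum(axis=0)
        self.sumsq += (points ** 2).sum(axis=0)

    def centroid(self):
        return self.sum / self.n

    def variance(self):
        # Per-dimension variance: E[x^2] - (E[x])^2
        return self.sumsq / self.n - self.centroid() ** 2

    def mahalanobis(self, x, eps=1e-12):
        # Normalized distance, assuming axis-aligned clusters; eps avoids division by zero
        return np.sqrt((((x - self.centroid()) ** 2) / (self.variance() + eps)).sum())

summary = ClusterSummary(dim=2)
summary.add(np.array([[0.0, 0.0], [2.0, 4.0]]))
summary.add(np.array([[4.0, 8.0]]))  # statistics update without storing the points
print(summary.centroid(), summary.variance())
```

Because only three small arrays are kept per cluster, points can be discarded once they have been folded in, which is what lets this style of algorithm handle data that does not fit in memory.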


In [None]:
from sklearn.datasets import fetch_openml
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import accuracy_score
import numpy as np

X = X_train / 255.0  # Scale pixel values to [0, 1]
# Y_train is one-hot encoded, so recover the integer digit labels
y = np.argmax(Y_train, axis=1)

# Reduce dimensionality using PCA
pca = PCA(n_components=50)
X = pca.fit_transform(X)

# Define a simplified, BFR-style split-and-merge clustering algorithm
class BFR:
    def __init__(self, k_init, split_size, merge_size, max_iter):
        self.k_init = k_init          # initial number of clusters
        self.split_size = split_size  # clusters with more points than this are split
        self.merge_size = merge_size  # clusters with fewer points than this are merged
        self.max_iter = max_iter

    @staticmethod
    def _assign(X, centroids):
        # Index of the nearest centroid (Euclidean distance) for each row of X
        distances = np.linalg.norm(X[:, np.newaxis, :] - centroids, axis=2)
        return np.argmin(distances, axis=1)

    def fit(self, X):
        X = np.asarray(X, dtype=float)

        # Initialize centroids using KMeans
        kmeans = KMeans(n_clusters=self.k_init, n_init=1)
        kmeans.fit(X)
        centroids = kmeans.cluster_centers_

        # Run the split/merge loop
        for _ in range(self.max_iter):
            # Assign points to nearest centroid
            labels = self._assign(X, centroids)

            # Split large clusters: replace each one's centroid with two sub-centroids
            split_centroids = []
            for j in range(len(centroids)):
                members = X[labels == j]
                if len(members) > self.split_size:
                    sub_kmeans = KMeans(n_clusters=2, n_init=1)
                    sub_kmeans.fit(members)
                    split_centroids.extend(sub_kmeans.cluster_centers_)
                else:
                    split_centroids.append(centroids[j])
            centroids = np.array(split_centroids)
            labels = self._assign(X, centroids)

            # Merge small clusters into the cluster with the nearest other centroid
            for j in range(len(centroids)):
                if 0 < np.sum(labels == j) < self.merge_size:
                    gaps = np.linalg.norm(centroids - centroids[j], axis=1)
                    gaps[j] = np.inf  # a cluster cannot merge with itself
                    labels[labels == j] = np.argmin(gaps)

            # Recompute centroids, dropping clusters that ended up empty
            centroids = np.array([X[labels == j].mean(axis=0) for j in np.unique(labels)])

        # Save centroids and the final cluster index of each training point
        self.centroids = centroids
        self.labels = self._assign(X, centroids)

    def predict(self, X):
        # Assign points to nearest centroid and return the cluster indices
        return self._assign(np.asarray(X, dtype=float), self.centroids)

# Train BFR model on MNIST dataset
bfr = BFR(k_init=50, split_size=50, merge_size=10, max_iter=10)
bfr.fit(X)

In [None]:
# Predict labels of test dataset
# (apply the same [0, 1] scaling that was used before fitting the PCA)
y_pred = bfr.predict(pca.transform(df_test / 255.0))