In [11]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

This cell imports the required libraries: NumPy, pandas, and Matplotlib.

In [12]:
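# K-Means Clustering Algorithm
def kmeans_clustering(dataset, k, max_iterations):
    # Use every column as a feature (load_dataset already removed the labels)
    features = np.asarray(dataset, dtype=float)

    # Initialize centroids by sampling k distinct data points at random
    centroids = features[np.random.choice(features.shape[0], k, replace=False)]

    # Cluster index assigned to each data point
    clusters = np.zeros(features.shape[0], dtype=int)

    for _ in range(max_iterations):
        # Assign each data point to the nearest centroid (Euclidean distance)
        distances = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        clusters = np.argmin(distances, axis=1)

        # Update centroids; an empty cluster keeps its previous centroid
        new_centroids = centroids.copy()
        for cluster in range(k):
            cluster_points = features[clusters == cluster]
            if len(cluster_points) > 0:
                new_centroids[cluster] = np.mean(cluster_points, axis=0)

        # Stop early once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return clusters, centroids


The kmeans_clustering function implements the K-Means Clustering algorithm. It takes the dataset, the desired number of clusters (k), and the maximum number of iterations as input. It initializes the centroids by sampling k distinct data points at random, assigns each data point to its nearest centroid by Euclidean distance, and recomputes each centroid as the mean of its assigned points. The loop repeats until the centroids stop moving or the maximum number of iterations is reached. A cluster that ends up with no points keeps its previous centroid. The function returns the cluster index of every data point together with the final centroids.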
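
To make the assignment and update steps described above concrete, here is a minimal sketch of a single K-Means iteration on a tiny made-up dataset. The points and the starting centroids are assumptions chosen for illustration only; they are not taken from the Iris data.

```python
import numpy as np

# Toy data (made up for illustration): two well-separated groups of 2-D points
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.2],
                   [5.0, 5.0], [5.2, 4.8], [4.8, 5.2]])

# Hypothetical starting centroids, one roughly near each group
centroids = np.array([[0.0, 0.0], [6.0, 6.0]])

# Assignment step: distance from every point to every centroid, shape (6, 2)
distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
labels = np.argmin(distances, axis=1)

# Update step: each centroid moves to the mean of the points assigned to it
new_centroids = np.array([points[labels == c].mean(axis=0)
                          for c in range(len(centroids))])

print(labels)         # [0 0 0 1 1 1]
print(new_centroids)  # approximately [[1. 1.] [5. 5.]]
```

After one iteration the centroids already sit at the centre of each group, so a second iteration would leave them unchanged. That is the situation a convergence check is meant to detect.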

In [13]:
# Principal Component Analysis (PCA) Algorithm
def pca(dataset, n_components):
    # Use every column as a feature (load_dataset already removed the labels)
    features = np.asarray(dataset, dtype=float)

    # Standardize the features to zero mean and unit variance
    standardized_features = (features - np.mean(features, axis=0)) / np.std(features, axis=0)

    # Calculate the covariance matrix
    covariance_matrix = np.cov(standardized_features.T)

    # Calculate the eigenvalues and eigenvectors; eigh is designed for
    # symmetric matrices and always returns real values
    eigenvalues, eigenvectors = np.linalg.eigh(covariance_matrix)

    # Sort eigenvalues and corresponding eigenvectors in descending order
    sorted_indices = np.argsort(eigenvalues)[::-1]
    sorted_eigenvalues = eigenvalues[sorted_indices]
    sorted_eigenvectors = eigenvectors[:, sorted_indices]

    # Select the top n_components eigenvectors
    selected_eigenvectors = sorted_eigenvectors[:, :n_components]

    # Project the standardized features onto the selected eigenvectors
    transformed_features = np.dot(standardized_features, selected_eigenvectors)

    return transformed_features, selected_eigenvectors, sorted_eigenvalues


The pca function implements the Principal Component Analysis algorithm. It takes the dataset and the desired number of principal components (n_components) as input. It standardizes the features, calculates the covariance matrix, computes the eigenvalues and eigenvectors, and selects the top n_components eigenvectors. It then projects the standardized features onto the selected eigenvectors to obtain the transformed features.
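
The sorted eigenvalues can also be used to judge how many components are worth keeping. The sketch below is an illustration under stated assumptions, not part of the notebook: it builds a small synthetic dataset (made up here, with two strongly correlated features) and computes the fraction of total variance captured by each principal component. The name explained_variance_ratio is hypothetical.

```python
import numpy as np

# Synthetic data (an assumption for illustration): 200 samples, 3 features,
# where the second feature is almost an exact multiple of the first
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([x,
                        2 * x + 0.1 * rng.normal(size=200),
                        rng.normal(size=200)])

# Same preprocessing as described above: standardize, then take the covariance
standardized = (data - data.mean(axis=0)) / data.std(axis=0)
covariance = np.cov(standardized.T)

# eigh handles symmetric matrices and returns real eigenvalues in ascending order
eigenvalues, _ = np.linalg.eigh(covariance)
eigenvalues = eigenvalues[::-1]  # reorder to descending

# Fraction of the total variance captured by each principal component
explained_variance_ratio = eigenvalues / eigenvalues.sum()
print(explained_variance_ratio)
```

Because two of the three features carry nearly the same information, the first component should account for well over half of the variance and the last for almost none.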

In [14]:
# Load the dataset and remove the "Species" column
def load_dataset(dataset_path):
    df = pd.read_csv(dataset_path)
    species = df["Species"]
    dataset = df.drop("Species", axis=1).to_numpy()
    return dataset, species


The load_dataset function loads the dataset from a given path and removes the "Species" column. It returns the modified dataset and the species labels separately.
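
As a quick check of what load_dataset returns, the sketch below feeds it a three-row CSV held in memory instead of a file on disk. This works because pd.read_csv accepts file-like objects as well as paths. The feature column names are assumptions for illustration; only "Species" is taken from the notebook.

```python
import io
import pandas as pd

def load_dataset(dataset_path):
    # Same logic as the notebook cell above
    df = pd.read_csv(dataset_path)
    species = df["Species"]
    dataset = df.drop("Species", axis=1).to_numpy()
    return dataset, species

# A few Iris-style rows; the feature column names are hypothetical
csv_text = """SepalLength,SepalWidth,PetalLength,PetalWidth,Species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

dataset, species = load_dataset(io.StringIO(csv_text))
print(dataset.shape)     # (3, 4)
print(species.tolist())  # ['setosa', 'versicolor', 'virginica']
```

The returned array holds only the numeric feature columns, so the label column is no longer present when the array is passed on to the clustering and PCA functions.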

In [15]:
# Visualize the clusters
def visualize_clusters(dataset, clusters, centroids):
    plt.scatter(dataset[:, 0], dataset[:, 1], c=clusters)
    plt.scatter(centroids[:, 0], centroids[:, 1], marker='X', color='red')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title('K-Means Clustering')
    plt.show()

The visualize_clusters function visualizes the clusters obtained from K-Means Clustering. It plots the first two feature columns of the data points, colored by assigned cluster, and marks the centroids with red X markers.

In [16]:
# Visualize the PCA results
def visualize_pca(dataset, transformed_features):
    plt.scatter(transformed_features[:, 0], transformed_features[:, 1])
    plt.xlabel('Principal Component 1')
    plt.ylabel('Principal Component 2')
    plt.title('PCA')
    plt.show()

The visualize_pca function visualizes the results of PCA. It plots the transformed features on a scatter plot using the first two principal components.

In [17]:
# Main function
def main():
    # Dataset path
    dataset_path = "iris_dataset.csv"

    # Load the dataset
    dataset, species = load_dataset(dataset_path)

    # Perform K-Means Clustering
    k = 3
    max_iterations = 100
    clusters, centroids = kmeans_clustering(dataset, k, max_iterations)

    # Visualize the clusters
    visualize_clusters(dataset, clusters, centroids)

    # Perform PCA
    n_components = 2
    transformed_features, eigenvectors, eigenvalues = pca(dataset, n_components)

    # Visualize the PCA results
    visualize_pca(dataset, transformed_features)

The main function serves as the entry point of the code. It loads the dataset, performs K-Means Clustering, visualizes the clusters, performs PCA, and visualizes the PCA results.