In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

- `import numpy as np`: This line imports the NumPy library, which provides support for efficient numerical operations in Python.
- `import pandas as pd`: This line imports the Pandas library, which offers high-performance data manipulation and analysis tools for structured data.
- `import matplotlib.pyplot as plt`: This line imports the `pyplot` module from the Matplotlib library, which provides a MATLAB-like interface for creating visualizations in Python.

In [2]:
def preprocess_data(df):
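    # Convert categorical variables into numerical (one-hot) columns
    df = pd.get_dummies(df, dtype=float)
    
    # Standardize every column; a constant column has std 0, so divide by 1 instead
    std = df.std().replace(0, 1)
    df = (df - df.mean()) / std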
    
    return df

Processing the Data (preprocess_data): The preprocess_data function takes a DataFrame (df) as input and returns a preprocessed copy. It first converts categorical variables into numerical columns using one-hot encoding through pd.get_dummies. It then standardizes every column of the result, including the new dummy columns, by subtracting the column mean and dividing by the column standard deviation (z-score normalization). Because all columns are transformed, a class-label column should be separated from the features before calling the function; otherwise the labels are encoded and scaled along with them.
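
As a quick check of the steps described above, the sketch below applies the same one-hot encoding and z-score normalization to a small made-up DataFrame. The column names (`height`, `color`) and their values are illustrative assumptions, not data from this notebook, and `dtype=float` is passed to `pd.get_dummies` so that the dummy columns are numeric regardless of the pandas version.

```python
import numpy as np
import pandas as pd

# Toy data: the column names and values are made up for illustration.
toy = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0],
    "color": ["red", "blue", "red", "blue"],
})

# One-hot encode the categorical column; dtype=float keeps the result numeric.
encoded = pd.get_dummies(toy, dtype=float)

# Z-score every column: subtract the column mean, divide by the sample std.
normalized = (encoded - encoded.mean()) / encoded.std()

print(list(normalized.columns))
print(normalized)
```

After this transformation each column has mean 0 and sample standard deviation 1, which keeps features on different scales from dominating a distance-based method such as k-nearest neighbors.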

In [3]:
def train_test_split(df, test_size=0.2):
    # Shuffle the DataFrame
    df = df.sample(frac=1).reset_index(drop=True)
    
    # Split the DataFrame into training and testing sets
    split_index = int((1 - test_size) * len(df))
    train_data = df[:split_index]
    test_data = df[split_index:]
    
    return train_data, test_data

Train-Test Split (train_test_split): The train_test_split function takes a DataFrame (df) and a test size ratio (test_size, 0.2 by default) as inputs. It shuffles the rows with df.sample(frac=1) and renumbers them with reset_index(drop=True). It then computes the split index as int((1 - test_size) * len(df)): rows before that index form the training set and the remaining rows form the testing set. The function returns the two sets as separate DataFrames. Because no random seed is set, each call produces a different split.
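
The split logic can be tried on a toy DataFrame as sketched below. The function name `split_with_seed` and its `seed` parameter are hypothetical additions for illustration: they pass `random_state` to `DataFrame.sample` so that the shuffle is repeatable, which the notebook's own function does not do.

```python
import pandas as pd

def split_with_seed(df, test_size=0.2, seed=0):
    """Hypothetical variant of train_test_split with a fixed shuffle seed."""
    # Shuffle all rows reproducibly, then renumber them 0..n-1.
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    # Rows before split_index go to training, the rest to testing.
    split_index = int((1 - test_size) * len(shuffled))
    return shuffled[:split_index], shuffled[split_index:]

toy = pd.DataFrame({"x": range(8), "y": [0, 1] * 4})
train_data, test_data = split_with_seed(toy, test_size=0.25, seed=42)
print(len(train_data), len(test_data))
```

Note that the test set keeps index labels starting at the split index; calling `reset_index(drop=True)` on it may be convenient if it is later indexed by position.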

In [4]:
def calculate_distance(instance1, instance2):
    # Euclidean distance
    return np.sqrt(np.sum((instance1 - instance2) ** 2))

Calculating Distance (calculate_distance): The calculate_distance function computes the Euclidean distance between two instances (instance1 and instance2). It uses NumPy's np.sqrt and np.sum functions to calculate the square root of the sum of squared differences between the corresponding feature values of the two instances.
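
A small worked example may help here. The sketch below repeats the same formula under the hypothetical name `euclidean`, so that the block runs on its own, and compares it with `np.linalg.norm`, which computes the same quantity for a difference vector. The two points are made up for illustration.

```python
import numpy as np

def euclidean(a, b):
    # Same formula as calculate_distance, repeated so the block runs alone.
    return np.sqrt(np.sum((a - b) ** 2))

p = np.array([0.0, 0.0])
q = np.array([3.0, 4.0])

# The points span a 3-4-5 right triangle, so both calls give 5.0.
print(euclidean(p, q))
print(np.linalg.norm(p - q))
```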

In [5]:
def k_nearest_neighbors(x_train, y_train, x_test, k):
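    # Work on NumPy arrays so that iteration yields rows and y_train[i] is positional
    x_train, y_train, x_test = np.asarray(x_train), np.asarray(y_train), np.asarray(x_test)
    y_pred = []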
    
    for test_instance in x_test:
        distances = []
        
        for i, train_instance in enumerate(x_train):
            distance = calculate_distance(train_instance, test_instance)
            distances.append((distance, y_train[i]))
        
        # Sort the distances in ascending order
        distances.sort(key=lambda x: x[0])
        
        # Get the k nearest neighbors
        neighbors = distances[:k]
        
        # Count the votes for each class label
        votes = {}
        for neighbor in neighbors:
            label = neighbor[1]
            if label in votes:
                votes[label] += 1
            else:
                votes[label] = 1
        
        # Predict the class label with maximum votes
        predicted_label = max(votes, key=votes.get)
        y_pred.append(predicted_label)
    
    return y_pred

K-Nearest Neighbors Algorithm (k_nearest_neighbors): The k_nearest_neighbors function performs k-nearest neighbors classification. It takes the training features (x_train), training labels (y_train), test features (x_test), and the value of k as inputs; x_train and x_test are treated as 2-D arrays with one instance per row. For each test instance, it calculates the distance to every training instance using calculate_distance and stores each distance in a list together with the corresponding label. It then sorts the list in ascending order of distance and keeps the first k entries as the nearest neighbors. The function counts the votes for each class label among those k neighbors and predicts the label with the most votes; when several labels tie, max returns the one that was inserted into the votes dictionary first, which is the tied label with the closest neighbor. The predicted labels for all test instances are collected in the y_pred list, which the function returns. Each prediction compares the test instance with every training instance, so the running time grows with the product of the two set sizes.
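
The same procedure can be written without the inner Python loop. The sketch below is one possible vectorized counterpart, not the notebook's implementation: the name `knn_predict`, the default `k=3`, and the toy data are assumptions for illustration. It computes all pairwise distances with NumPy broadcasting and uses `collections.Counter` for the majority vote.

```python
from collections import Counter

import numpy as np

def knn_predict(x_train, y_train, x_test, k=3):
    """Hypothetical vectorized counterpart of k_nearest_neighbors."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train)
    x_test = np.asarray(x_test, dtype=float)

    # Pairwise Euclidean distances, shape (n_test, n_train), via broadcasting.
    diffs = x_test[:, None, :] - x_train[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))

    # Indices of the k closest training rows for each test row.
    nearest = np.argsort(distances, axis=1, kind="stable")[:, :k]

    # Majority vote; on a tie, most_common returns the label seen first.
    return [Counter(y_train[row].tolist()).most_common(1)[0][0] for row in nearest]

x_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array(["a", "a", "b", "b"])
x_test = np.array([[0.2, 0.1], [4.8, 5.1]])
print(knn_predict(x_train, y_train, x_test, k=3))
```

Broadcasting builds an intermediate array of shape (n_test, n_train, n_features), so this trades memory for speed and may not suit large datasets.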

In [6]:
def compute_class_probabilities(x_train, y_train, x_test):
    classes = np.unique(y_train)
    class_probabilities = {}
    
    for class_label in classes:
        class_instances = x_train[y_train == class_label]
        class_probability = len(class_instances) / len(x_train)
        
        feature_probabilities = {}
        for feature_index in range(x_train.shape[1]):
            feature_values = class_instances[:, feature_index]
            feature_probabilities[feature_index] = {
                'mean': np.mean(feature_values),
                'std': np.std(feature_values)
            }
        
        class_probabilities[class_label] = {
            'probability': class_probability,
            'feature_probabilities': feature_probabilities
        }
    
    return class_probabilities

Computing Class Probabilities (compute_class_probabilities): The compute_class_probabilities function estimates the parameters needed by the Gaussian Naive Bayes algorithm. Its signature accepts the training features (x_train), training labels (y_train), and test features (x_test), but x_test is never used inside the function. It first identifies the unique class labels in the training data using np.unique. Then, for each class label, it selects the instances belonging to that class with a boolean mask and calculates the class probability (the prior) as the ratio of the number of instances in that class to the total number of training instances. It also computes the mean and standard deviation of each feature within the class instances; np.std returns the population standard deviation (ddof=0) by default. Although they are stored under the key 'feature_probabilities', these values are Gaussian parameters rather than probabilities. The priors and per-feature parameters are stored in a dictionary keyed by class label and returned. The boolean mask and column slicing require x_train and y_train to be NumPy arrays.
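
To make the estimated quantities concrete, the sketch below computes the same prior, mean, and standard deviation on a tiny made-up dataset. The data values and the flattened `params` layout are assumptions for illustration and differ from the nested dictionary the notebook's function returns.

```python
import numpy as np

# Toy data: two features, two classes, two instances per class.
x_train = np.array([[1.0, 10.0], [3.0, 14.0], [6.0, 20.0], [8.0, 24.0]])
y_train = np.array([0, 0, 1, 1])

params = {}
for label in np.unique(y_train):
    rows = x_train[y_train == label]          # boolean mask selects one class
    params[int(label)] = {
        "prior": len(rows) / len(x_train),    # class frequency
        "mean": rows.mean(axis=0),            # per-feature mean
        "std": rows.std(axis=0),              # per-feature population std (ddof=0)
    }

for label, p in params.items():
    print(label, p["prior"], p["mean"], p["std"])
```

For class 0 the feature values are (1, 3) and (10, 14), giving means 2 and 12 and population standard deviations 1 and 2; class 1 has means 7 and 22 with the same spreads.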

In [7]:
def naive_bayes(x_train, y_train, x_test):
    y_pred = []
    
    class_probabilities = compute_class_probabilities(x_train, y_train, x_test)
    
    for test_instance in x_test:
        instance_probabilities = {}
        
        for class_label, class_data in class_probabilities.items():
            class_probability = class_data['probability']
            feature_probabilities = class_data['feature_probabilities']
            
            instance_probability = class_probability
            for feature_index, feature_value in enumerate(test_instance):
                mean = feature_probabilities[feature_index]['mean']
                std = feature_probabilities[feature_index]['std']
                likelihood = (1 / (np.sqrt(2 * np.pi) * std)) * np.exp(-((feature_value - mean) ** 2) / (2 * std ** 2))
                instance_probability *= likelihood
            
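            instance_probabilities[class_label] = instance_probability
        
        # Predict the class label with the highest (unnormalized) posterior score
        predicted_label = max(instance_probabilities, key=instance_probabilities.get)
        y_pred.append(predicted_label)
    
    return y_pred

Naive Bayes Algorithm (naive_bayes): The naive_bayes function applies the Gaussian Naive Bayes algorithm to predict the class labels for the test instances. It takes the training features (x_train), training labels (y_train), and test features (x_test) as inputs. It first calls the compute_class_probabilities function to obtain each class's prior probability and per-feature mean and standard deviation. Then, for each test instance, it iterates through the class labels, starts from the class prior, and multiplies it by the Gaussian likelihood of every feature value given that class. The resulting unnormalized posterior scores are stored in the instance_probabilities dictionary. The class with the highest score is appended to y_pred, which is returned once all test instances have been processed. The likelihood divides by std, so a feature that is constant within a class (std equal to 0) makes the score undefined.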
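
Multiplying many small likelihoods can underflow to zero when there are many features, so a common alternative is to add log-likelihoods instead. The sketch below illustrates that idea under stated assumptions: the function names, the flat `priors`/`means`/`stds` dictionaries, the toy parameter values, and the `eps` term added to the variance are all hypothetical and are not part of the notebook's code.

```python
import numpy as np

def gaussian_log_likelihood(x, mean, std, eps=1e-9):
    """Log of the Gaussian density; eps (an assumption) guards against std == 0."""
    var = std ** 2 + eps
    return -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)

def predict_log_space(priors, means, stds, x_test):
    """Hypothetical log-space scoring: log prior plus summed log-likelihoods."""
    labels = list(priors)
    y_pred = []
    for instance in np.asarray(x_test, dtype=float):
        scores = [
            np.log(priors[c])
            + gaussian_log_likelihood(instance, means[c], stds[c]).sum()
            for c in labels
        ]
        # The class with the largest log score is the prediction.
        y_pred.append(labels[int(np.argmax(scores))])
    return y_pred

# Toy parameters, made up for illustration.
priors = {0: 0.5, 1: 0.5}
means = {0: np.array([2.0, 12.0]), 1: np.array([7.0, 22.0])}
stds = {0: np.array([1.0, 2.0]), 1: np.array([1.0, 2.0])}
print(predict_log_space(priors, means, stds, [[2.5, 11.0], [6.5, 23.0]]))
```

Because the logarithm is monotonic, the class with the largest log score is the same class that would win under the product form whenever the product does not underflow.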