<a href="https://colab.research.google.com/github/ehuseynov/ITU-BLG454-HW1/blob/main/454HW1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Name Surname: Emil Huseynov

Student No: 150210906

---

Libraries to be used

---



In [1]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.decomposition import PCA as SkPCA  # aliased so the custom PCA class defined below does not shadow it
from sklearn.metrics import accuracy_score, f1_score, root_mean_squared_error
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
import numpy as np
from matplotlib import pyplot as plt

# Build your own code-base (30 points)

Implement the methods provided and compare your implementations with the Sklearn library.

---

K-Nearest Neighbour (5 points)

In [8]:
class KNN:
    def __init__(self, train_data, train_label, k=3):
        self.k = k
        self.train_data = train_data
        self.train_label = train_label

    def predict(self, test_data):
        predictions = []

        for test_point in test_data:
            # Calculate the Euclidean distance between the test point and each training data point
            distances = []
            for i in range(len(self.train_data)):
                distance = np.sqrt(np.sum((self.train_data[i] - test_point) ** 2))
                distances.append((distance, self.train_label[i]))

            # Sort distances and get the k nearest labels
            distances.sort(key=lambda x: x[0])
            k_nearest_labels = [label for _, label in distances[:self.k]]

            # Manually count occurrences to find the most common label
            label_counts = {}
            for label in k_nearest_labels:
                if label in label_counts:
                    label_counts[label] += 1
                else:
                    label_counts[label] = 1

            # Determine the label with the highest count
            most_common_label = max(label_counts, key=label_counts.get)
            predictions.append(most_common_label)

        return predictions



| Feature               | My KNN Implementation            | Sklearn KNeighborsClassifier           |
|-----------------------|--------------------------------------|----------------------------------------|
| **Distance Metric**   | Only Euclidean                      | Multiple (Euclidean, Manhattan, etc.)  |
| **Voting**            | Majority voting only                | Supports weighted voting               |
| **Ease of Use**       | Manual setup, basic Python          | Easy-to-use API with `.fit()` & `.predict()` |
| **Flexibility**       | Limited options, no customization   | Customizable (metrics, weights)        |
| **Performance**       | Slower, unoptimized                 | Fast, optimized with data structures like KDTree |
| **Error Handling**    | Minimal error checks                | Robust input validation                |
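
The performance row above can be illustrated with a vectorized variant of the same brute-force search. This is only a sketch: the function name `knn_predict_vectorized` is hypothetical, and its tie-breaking (smallest label wins) may differ from the loop-based class above.

```python
import numpy as np

def knn_predict_vectorized(train_data, train_label, test_data, k=3):
    """Brute-force k-NN with majority voting, using NumPy instead of Python loops."""
    train_data = np.asarray(train_data, dtype=float)
    train_label = np.asarray(train_label)
    test_data = np.asarray(test_data, dtype=float)

    # Squared Euclidean distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b,
    # shape (n_test, n_train). The square root is skipped because it does not
    # change which neighbours are nearest.
    train_sq = np.sum(train_data ** 2, axis=1)
    test_sq = np.sum(test_data ** 2, axis=1)
    sq_dist = test_sq[:, None] + train_sq[None, :] - 2.0 * test_data @ train_data.T

    # Indices of the k smallest distances per test point (unordered within the k).
    nearest = np.argpartition(sq_dist, k - 1, axis=1)[:, :k]

    predictions = []
    for row in train_label[nearest]:
        labels, counts = np.unique(row, return_counts=True)
        # On a tie, argmax returns the first maximum, i.e. the smallest label.
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)


# Toy data: two well-separated clusters.
toy_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
toy_labels = [0, 0, 0, 1, 1, 1]
toy_pred = knn_predict_vectorized(toy_train, toy_labels, [[0.2, 0.1], [5.5, 5.5]], k=3)
```

The distance matrix is computed for all test points at once, which trades memory for speed; a tree-based index such as the one Sklearn can use avoids computing every pairwise distance.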

Gaussian Naive Bayes (5 points)

In [9]:
class GNB:
    def __init__(self, train_data, train_label):
        self.train_data = np.array(train_data)
        self.train_label = np.array(train_label)
        self.classes = np.unique(self.train_label)
        self.means = {}
        self.variances = {}
        self.priors = {}

    def gaussian_probability(self, x, mean, variance):
        # Calculate the Gaussian probability density function
        exponent = np.exp(-((x - mean) ** 2) / (2 * variance))
        return (1 / np.sqrt(2 * np.pi * variance)) * exponent

    def fit(self):
        # Calculate mean, variance, and prior for each class
        for c in self.classes:
            # Filter data by class
            class_data = self.train_data[self.train_label == c]
            # Calculate mean and variance for each feature in the class
            self.means[c] = np.mean(class_data, axis=0)
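            # A small constant (assumed value 1e-9) keeps zero-variance features,
            # such as always-blank pixels, from causing a division by zero
            self.variances[c] = np.var(class_data, axis=0) + 1e-9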
            # Calculate prior probability for the class
            self.priors[c] = class_data.shape[0] / self.train_data.shape[0]

    def predict(self, test_data):
        test_data = np.array(test_data)
        predictions = []

        for x in test_data:
            class_probabilities = {}
            for c in self.classes:
                # Start with the prior probability for the class
                class_prob = np.log(self.priors[c])

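                # Add the log-density of each feature; computing it directly in log space
                # avoids taking the log of a density that has underflowed to zero
                for i in range(len(x)):
                    mean = self.means[c][i]
                    variance = self.variances[c][i]
                    class_prob += -0.5 * np.log(2 * np.pi * variance) - ((x[i] - mean) ** 2) / (2 * variance)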

                class_probabilities[c] = class_prob

            # Choose the class with the highest probability
            predictions.append(max(class_probabilities, key=class_probabilities.get))

        return predictions



| Feature               | My GNB Implementation           | Sklearn GaussianNB                     |
|-----------------------|-------------------------------------|----------------------------------------|
| **Distribution**      | Gaussian (Normal) only             | Gaussian (Normal) only                 |
| **Parameter Calculation** | Manual mean, variance, and prior probability calculation | Automatic mean, variance, and prior calculation |
| **Ease of Use**       | Requires manual setup              | Simple API with `.fit()` & `.predict()` |
| **Flexibility**       | Limited, supports only Gaussian distribution | Can integrate with Sklearn pipeline and additional options |
| **Performance**       | Slower on larger datasets, unoptimized | Faster, relies on vectorized array operations |
| **Error Handling**    | Basic, minimal error handling      | Robust input validation and error handling |
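
To make the performance and robustness rows concrete, here is a minimal vectorized sketch that works entirely in log space. The function names `gnb_fit` and `gnb_predict` and the smoothing constant `eps=1e-9` are assumptions for illustration; Sklearn's `GaussianNB` exposes a `var_smoothing` parameter for a similar purpose, but scales it differently.

```python
import numpy as np

def gnb_fit(X, y, eps=1e-9):
    """Estimate per-class means, variances and log-priors."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # eps (assumed value) prevents division by zero for constant features.
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + eps
    log_priors = np.log(np.array([np.mean(y == c) for c in classes]))
    return classes, means, variances, log_priors

def gnb_predict(X, classes, means, variances, log_priors):
    """Pick the class with the highest log-posterior for each row of X."""
    X = np.asarray(X, dtype=float)
    # diff has shape (n_samples, n_classes, n_features).
    diff = X[:, None, :] - means[None, :, :]
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None, :, :]
                      + diff ** 2 / variances[None, :, :])
    scores = log_priors[None, :] + log_lik.sum(axis=2)
    return classes[np.argmax(scores, axis=1)]


# Toy data: the second feature is constant, so its variance is zero before smoothing.
toy_X = [[1.0, 0.0], [1.2, 0.0], [0.8, 0.0], [5.0, 0.0], [5.2, 0.0], [4.8, 0.0]]
toy_y = [0, 0, 0, 1, 1, 1]
params = gnb_fit(toy_X, toy_y)
toy_pred = gnb_predict([[1.1, 0.0], [4.9, 0.0]], *params)
```

Summing log-densities instead of multiplying densities keeps the scores finite even with many features, which matters for the 64-pixel digits data used later.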

Principal Component Analysis (5 points)

In [4]:
class PCA:
  def __init__(self, data, n_components):
    self.data = data
    self.n_components = n_components

  def fit(self):
    # implement here
    pass

  def transform(self,x):
    # implement here
    pass
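
One possible way to fill in the template above is sketched below, keeping the same constructor and method signatures. The class name `PCASketch`, the use of SVD on the centered data, and the extra `inverse_transform` helper (useful for reconstruction error) are assumptions, not part of the assignment text.

```python
import numpy as np

class PCASketch:
    def __init__(self, data, n_components):
        self.data = np.asarray(data, dtype=float)
        self.n_components = n_components

    def fit(self):
        # Center the data; PCA directions are defined relative to the mean.
        self.mean_ = self.data.mean(axis=0)
        centered = self.data - self.mean_
        # Rows of vt are the principal directions, ordered by decreasing singular value.
        _, s, vt = np.linalg.svd(centered, full_matrices=False)
        self.components_ = vt[:self.n_components]
        self.explained_variance_ = s[:self.n_components] ** 2 / (len(self.data) - 1)
        return self

    def transform(self, x):
        # Project centered points onto the retained directions.
        return (np.asarray(x, dtype=float) - self.mean_) @ self.components_.T

    def inverse_transform(self, z):
        # Hypothetical helper: map projections back to the original feature space.
        return np.asarray(z, dtype=float) @ self.components_ + self.mean_


# Points on a straight line are captured exactly by one component.
line = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
pca = PCASketch(line, n_components=1).fit()
projected = pca.transform(line)
reconstructed = pca.inverse_transform(projected)
```

The sign of each component is arbitrary, so projections may be mirrored relative to another implementation while the reconstruction stays the same.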

Metrics (2+3 points)



In [5]:
def rmse(y_true, y_pred):
  # implement here
  pass

def accuracyNf1_score(y_true, y_pred):
  # implement here
  pass
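
The two metric stubs could be filled in along these lines. This is a sketch under assumptions: the names `rmse_sketch` and `accuracy_and_macro_f1` are hypothetical, and macro averaging of the per-class F1 scores is an assumed choice, since the assignment does not say which average to use.

```python
import numpy as np

def rmse_sketch(y_true, y_pred):
    """Root mean squared error over all elements."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def accuracy_and_macro_f1(y_true, y_pred):
    """Return (accuracy, macro-averaged F1)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    accuracy = float(np.mean(y_true == y_pred))

    f1_scores = []
    for c in np.unique(np.concatenate([y_true, y_pred])):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        f1_scores.append(2 * tp / denom if denom > 0 else 0.0)
    return accuracy, float(np.mean(f1_scores))


acc, macro_f1 = accuracy_and_macro_f1([0, 0, 1, 1], [0, 1, 1, 1])
err = rmse_sketch([1, 2, 3], [1, 2, 5])
```

In the small example, class 0 has F1 = 2/3 and class 1 has F1 = 4/5, so the macro average is their mean.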


Visualization Tools (5 points)

In [6]:
### fill here ###

Validate your implementation using libraries (5 points)

(you can generate synthetic data using numpy or import another toy dataset from sklearn)


In [7]:
### fill here ###

# Experiments (45 points)

Use Sklearn classes

---

Dataset preparation

In [None]:
# Load dataset and split to train and test set
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, test_size=0.2, random_state=454)

# Calculate the frequency of each class in the training and test sets
unique_classes_test, class_counts_test = np.unique(y_test, return_counts=True)
unique_classes_train, class_counts_train = np.unique(y_train, return_counts=True)

for cls_train, count_train, cls_test, count_test in zip(unique_classes_train, class_counts_train, unique_classes_test, class_counts_test):
  print(f"Class {cls_train}: {count_train} train  {count_test} test {count_train/count_test} ratio")

Class 0: 144 train  34 test 4.235294117647059 ratio
Class 1: 137 train  45 test 3.0444444444444443 ratio
Class 2: 150 train  27 test 5.555555555555555 ratio
Class 3: 145 train  38 test 3.8157894736842106 ratio
Class 4: 145 train  36 test 4.027777777777778 ratio
Class 5: 144 train  38 test 3.789473684210526 ratio
Class 6: 146 train  35 test 4.171428571428572 ratio
Class 7: 142 train  37 test 3.8378378378378377 ratio
Class 8: 143 train  31 test 4.612903225806452 ratio
Class 9: 141 train  39 test 3.6153846153846154 ratio


Apply classification methods using the dataset directly (10 points)
(e.g., hyperparameter trials)

In [None]:
### fill here ###
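
As one example of a parameter trial, the sketch below sweeps `n_neighbors` for `KNeighborsClassifier` on the same split as above. The grid of `k` values and the macro-averaged F1 are assumptions. Comparing settings on the test set is done here only for brevity; a separate validation split or cross-validation would give a less optimistic estimate.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=454
)

# Map each k (assumed grid) to its (accuracy, macro F1) on the held-out data.
knn_results = {}
for k in (1, 3, 5, 7):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    y_pred = model.predict(X_test)
    knn_results[k] = (
        accuracy_score(y_test, y_pred),
        f1_score(y_test, y_pred, average="macro"),
    )

for k, (acc, f1) in knn_results.items():
    print(f"k={k}: accuracy={acc:.3f}  macro F1={f1:.3f}")
```

The same loop structure applies to `LogisticRegression` or `GaussianNB` by swapping the estimator and the parameter being varied.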

Apply PCA and find the optimal number of components, using the best reconstruction (RMSE) as the objective (10 points)

In [None]:
### fill here ###
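
A possible starting point for this step is to sweep the number of components and record the reconstruction RMSE on the test set. The component grid and the choice of `svd_solver="full"` are assumptions. Because reconstruction error can only decrease as components are added, "optimal" needs an extra criterion, such as an elbow in the curve or an error threshold; that choice is left open here.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA as SkPCA  # alias avoids clashing with a custom PCA class
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, _, _ = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=454
)

component_grid = [2, 5, 10, 20, 30, 40, 64]  # assumed grid; 64 is the full dimension
reconstruction_rmse = []
for n in component_grid:
    # The exact solver keeps the fitted subspaces nested across different n.
    pca = SkPCA(n_components=n, svd_solver="full").fit(X_train)
    X_rec = pca.inverse_transform(pca.transform(X_test))
    reconstruction_rmse.append(float(np.sqrt(np.mean((X_test - X_rec) ** 2))))

for n, err in zip(component_grid, reconstruction_rmse):
    print(f"{n:2d} components: RMSE = {err:.4f}")
```

Plotting `reconstruction_rmse` against `component_grid` with matplotlib would make the elbow, if any, easier to spot.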

Apply PCA class-wise and merge the transformed features (10 points)

In [None]:
### fill here ###
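
One reading of "class-wise PCA" is to fit a separate PCA on the training samples of each digit, project every sample with all ten models, and concatenate the results. The sketch below follows that reading with 3 components per class, giving 3 x 10 = 30 features; the helper name `classwise_transform` is hypothetical.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA as SkPCA  # alias avoids clashing with a custom PCA class
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=454
)

# One PCA per class, fitted only on that class's training samples.
class_pcas = [
    SkPCA(n_components=3, svd_solver="full").fit(X_train[y_train == c])
    for c in np.unique(y_train)
]

def classwise_transform(X):
    """Project X with every class-specific PCA and stack the outputs column-wise."""
    return np.hstack([pca.transform(X) for pca in class_pcas])

X_train_cw = classwise_transform(X_train)
X_test_cw = classwise_transform(X_test)
print(X_train_cw.shape, X_test_cw.shape)
```

Only training labels are used to fit the per-class models; test samples are transformed without reference to their labels, so no label information leaks into the test features.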

Apply classification methods on the transformed features (PCA outputs), both normal and class-wise (15 points)
(normal PCA dimension 30, class-wise PCA dimension 3x10)

In [None]:
### fill here ###