**Name : Archana Kalburgi**

**CWID : 10469491**

**Solution to Question 1**

How I selected the principal components:

- Calculate the covariance matrix of the mean-centered data.

- Compute the eigenvalues and eigenvectors of the covariance matrix.

- The eigenvectors are orthogonal to each other, and each one represents a principal axis.

- A higher eigenvalue corresponds to higher variability: each eigenvalue equals the variance of the data along its eigenvector.

- Therefore, the principal axis with the highest eigenvalue is the axis that captures the most variability in the data.

- Sort the eigenvalues in descending order, keeping each one paired with its eigenvector.

- Arranging the eigenvectors in descending order of eigenvalue automatically arranges the principal components in descending order of the variability they capture.

- Hence the first column of the rearranged eigenvector matrix is the principal component that captures the highest variability.
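The link between eigenvalues and variability described in the steps above can be checked numerically. The following is a minimal sketch on synthetic data (the random data, the seed, and all variable names are assumptions for illustration, not part of the assignment dataset): after projecting the centered data onto the sorted eigenvectors, the variance of each projected column should match the corresponding eigenvalue.

```python
import numpy as np

# Synthetic, correlated 3-feature data (illustrative only; not the diabetes dataset).
rng = np.random.default_rng(0)
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 2.0, 0.0],
                   [0.5, 0.5, 0.5]])
data = rng.normal(size=(500, 3)) @ mixing

# Center the data and compute the covariance matrix.
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(cov)

# Sort the eigenvalues (and their eigenvectors) in descending order.
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]

# Project onto the principal axes; each column is one principal component.
projected = centered @ eigvecs

# Sample variance along each principal axis (ddof=1 matches np.cov's default).
projected_var = projected.var(axis=0, ddof=1)

print("eigenvalues:       ", eigvals)
print("projected variance:", projected_var)
```

The two printed arrays should agree up to floating-point error, which is why sorting by eigenvalue orders the components by captured variability.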

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

df = pd.read_csv("/content/drive/MyDrive/ML_Assignments/ML_Assign_test/pima-indians-diabetes.csv")

X = df.iloc[:, 0:8].to_numpy()

X_meaned = X - np.mean(X, axis=0)
covariance = np.cov(X_meaned, rowvar=False)
evalue, evector = np.linalg.eigh(covariance)

sorted_idx = np.argsort(evalue)[::-1]
sorted_evalues = evalue[sorted_idx]
sorted_evectors = evector[:, sorted_idx]

pca = sorted_evectors[:, 0:3]

# project the centered data onto the top 3 principal components
X_reduced = np.dot(X_meaned, pca)

target = df.iloc[:, 8]

ddf = pd.DataFrame(X_reduced, columns = ['PC1', 'PC2', 'PC3'])

df_with_class = pd.concat([ddf, pd.DataFrame(target)], axis= 1)
df_with_class

,PC1,PC2,PC3,Class variable
0,75.714655,-35.950783,7.260789,1
1,82.358268,28.908213,5.496671,0
2,74.630643,-67.906496,-19.461808,1
3,-11.077423,34.898486,0.053018,0
4,-89.743788,-2.746937,-25.212859,1
...,...,...,...,...
763,-99.237881,25.080927,19.534825,0
764,78.641239,-7.688010,4.137227,0
765,-32.113198,3.376665,1.587864,0
766,80.214494,-14.186020,-12.351264,1
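
The projection above keeps three components. One common way to judge how much variability a given number of components retains is the cumulative explained-variance ratio computed from the sorted eigenvalues. A minimal sketch with hypothetical eigenvalues (the numbers below and the helper `components_needed` are made up for illustration and are not taken from the diabetes data):

```python
import numpy as np

# Hypothetical eigenvalues, already sorted in descending order (illustrative values only).
sorted_eigenvalues = np.array([50.0, 30.0, 15.0, 4.0, 1.0])

# Share of the total variance captured by each component, and the running total.
explained_ratio = sorted_eigenvalues / sorted_eigenvalues.sum()
cumulative_ratio = np.cumsum(explained_ratio)

def components_needed(cumulative, threshold):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    return int(np.searchsorted(cumulative, threshold) + 1)

k = components_needed(cumulative_ratio, 0.90)
print("cumulative explained variance:", cumulative_ratio)
print("components needed for 90%:", k)
```

With the real data, `sorted_evalues` from the cell above would take the place of the hypothetical array.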


In [None]:
# Train a classifier using MLE after the data have been projected.

def noramal_eq(xi, mu, sigma_inv, scalar):
  pp = (-1/2)*np.dot(np.matmul(xi - mu, sigma_inv), xi - mu)
  return scalar * (np.e**pp)

# calculating mean, sigma, sigma_inverse, scalar
def components(x):
  mu = np.mean(x, axis=0)
  sigma = np.cov(x, rowvar=False)
  sigma_inv = np.linalg.inv(sigma)
  scalar = 1/np.sqrt(((2*np.pi)**x.shape[1])*np.linalg.det(sigma))
  return (mu, sigma_inv, scalar)

# computing the likelihood of every sample (row) of x
def likelihood(x, mu, sigma_inv, scalar):
  return [noramal_eq(xi, mu, sigma_inv, scalar) for xi in x]

# computing the accuracy
def predit(train_x, train_y, test_x, test_y):
  # fit one Gaussian per class using only the training samples of that class
  mu0, sigma_inv0, scalar0 = components(train_x[train_y == 0])
  mu1, sigma_inv1, scalar1 = components(train_x[train_y == 1])

  # compute train accuracy (on train data)
  l0 = likelihood(train_x, mu0, sigma_inv0, scalar0)
  l1 = likelihood(train_x, mu1, sigma_inv1, scalar1)

  predicted_y = np.array([1 if ll1 > ll0 else 0 for (ll0, ll1) in zip(l0, l1)])
  train_n_correct = np.sum(predicted_y == train_y)

  # compute test accuracy (on test data)
  tl_0 = likelihood(test_x, mu0, sigma_inv0, scalar0)
  tl_1 = likelihood(test_x, mu1, sigma_inv1, scalar1)

  predited_test_y = np.array([1 if ll1 > ll0 else 0 for (ll0, ll1) in zip(tl_0, tl_1)])
  test_n_correct = np.sum(predited_test_y == test_y)

  return (train_n_correct / train_y.shape[0], test_n_correct / test_y.shape[0])
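
The classifier above compares raw Gaussian densities. The same comparison can be made with log-densities, which avoids numerical underflow when the densities are very small. The following is a minimal sketch on synthetic two-class data (the helper names `fit_gaussian` and `log_density` and the generated data are assumptions for illustration); like the code above, it compares class-conditional likelihoods only and does not use class priors.

```python
import numpy as np

def fit_gaussian(x):
    """Estimate the mean and covariance of a multivariate Gaussian from the rows of x."""
    mu = x.mean(axis=0)
    # bias=True divides by N, which is the maximum-likelihood covariance estimate.
    sigma = np.cov(x, rowvar=False, bias=True)
    return mu, sigma

def log_density(x, mu, sigma):
    """Log of the multivariate normal density for every row of x."""
    d = x.shape[1]
    diff = x - mu
    # Solve sigma @ z = diff.T instead of forming the explicit inverse.
    solved = np.linalg.solve(sigma, diff.T).T
    mahalanobis = np.sum(diff * solved, axis=1)
    _, logdet = np.linalg.slogdet(sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + mahalanobis)

# Synthetic, well-separated classes (illustrative only).
rng = np.random.default_rng(42)
class0 = rng.normal(loc=[0.0, 0.0, 0.0], scale=1.0, size=(200, 3))
class1 = rng.normal(loc=[4.0, 4.0, 4.0], scale=1.0, size=(200, 3))
features = np.vstack([class0, class1])
labels = np.array([0] * 200 + [1] * 200)

# Fit one Gaussian per class.
mu0, sigma0 = fit_gaussian(features[labels == 0])
mu1, sigma1 = fit_gaussian(features[labels == 1])

# Predict class 1 wherever its log-likelihood is larger.
ll0 = log_density(features, mu0, sigma0)
ll1 = log_density(features, mu1, sigma1)
predicted = (ll1 > ll0).astype(int)
accuracy = np.mean(predicted == labels)
print(f"Accuracy on the synthetic data: {accuracy:.3f}")
```

Because the logarithm is monotonic, comparing log-likelihoods gives the same decisions as comparing the densities themselves whenever the densities do not underflow.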

In [None]:
train_10scores = []
test_10scores = []
for i in range(1,11):
    print("----------------------------------------")
    (x_train, x_test, y_train, y_test) = train_test_split(df_with_class, df_with_class.iloc[:,3], train_size=0.5)
    train_score, test_score = predit(x_train.iloc[:,0:3].to_numpy(), y_train.to_numpy(), x_test.iloc[:,0:3].to_numpy(), y_test.to_numpy())
    train_10scores.append(train_score)
    test_10scores.append(test_score)
    print(f"Train Score is {train_score}")
    print(f"Test Score is {test_score}") 

# printing out the mean test accuracy over all 10 runs
print("-------------------------------------------------")
print("\n")
# print(f"Mean of the accuracies for train data = {np.mean(train_10scores)}")
print(f"Average classification accuracy over 10 runs = {round(np.mean(test_10scores),4)}") 
print("\n")
print("Three principal components selected are:")
print("\n")
print(pca.transpose()) 

----------------------------------------
Train Score is 0.6744791666666666
Test Score is 0.6276041666666666
----------------------------------------
Train Score is 0.671875
Test Score is 0.6302083333333334
----------------------------------------
Train Score is 0.65625
Test Score is 0.6458333333333334
----------------------------------------
Train Score is 0.6380208333333334
Test Score is 0.6640625
----------------------------------------
Train Score is 0.640625
Test Score is 0.6614583333333334
----------------------------------------
Train Score is 0.6432291666666666
Test Score is 0.6588541666666666
----------------------------------------
Train Score is 0.6536458333333334
Test Score is 0.6484375
----------------------------------------
Train Score is 0.6796875
Test Score is 0.6223958333333334
----------------------------------------
Train Score is 0.6510416666666666
Test Score is 0.6510416666666666
----------------------------------------
Train Score is 0.6614583333333334
Test Score 