Alleyah Pauline C. Manalili - CS 3101 Prefinal Exam

<b>Parsing Sample Dataset</b>

In [2]:
# Open the ARFF file
with open('2017 Q1.arff', 'r') as file:
    data_lines = file.readlines()  # Read all lines

# Initialize variables
data = []
attributes = []
data_started = False

# Parse the lines
for line in data_lines:
    line = line.strip()  # Remove leading/trailing whitespace

    if not line or line.startswith('%'):
        continue  # Skip empty lines or comments

    if not data_started:
        if line.lower().startswith('@attribute'):
            # Extract attribute names
            attribute_name = line.split()[1]
            attributes.append(attribute_name)
        elif line.lower().startswith('@data'):
            data_started = True
    else:
        # Start parsing data
        instance_values = line.split(',')
        data.append(instance_values)

# Display parsed data and attributes
print("Attributes:", attributes)
print("Data:")
for instance in data:
    print(instance)


Attributes: ['Num', 'Country', 'X1', 'X2', 'X3', 'X4', 'X5', 'X6', 'X7', 'X8', 'X9', 'X10', 'X11', 'X12', 'X13', 'X14', 'X15', 'X16', 'X17', 'X18', 'X19', 'X20', 'X21', 'X22', 'X23', 'X24', 'X25', 'X26', 'X27', 'X28', 'X29', 'X30', 'X31', 'X32', 'X33', 'X34', 'X35', 'X36', 'X37', 'X38', 'X39', 'X40', 'X41', 'X42', 'X43', 'X44', 'X45', 'X46', 'X47', 'X48', 'X49', 'X50', 'X51', 'X52', 'X53', 'X54', 'X55', 'X56', 'X57', 'X58', 'X59', 'X60', 'X61', 'X62', 'X63', 'X64', 'X65', 'X66', 'X67', 'X68', 'X69', 'X70', 'X71', 'X72', 'X73', 'X74', 'X75', 'X76', 'X77', 'X78', 'X79', 'X80', 'X81', 'X82', 'S']
Data:
['10', 'Hungary', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', '0', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 'm', 

<b>Using PCA and SVD with Libraries</b>

<i>Attempt 1: PCA and SVD with Linear Interpolation and Z-Score Standardization</i>

In [1]:
import pandas as pd
import numpy as np
from sklearn.decomposition import PCA

# Open the ARFF file
with open('2017 Q1.arff', 'r') as file:
    lines = file.readlines()

# Initialize empty lists for data and attribute names
data = []
attributes = []

# Parse the ARFF file
data_section = False
for line in lines:
    line = line.strip()
    if not line or line.startswith('%'):
        continue
    if line.lower().startswith('@attribute'):
        attribute_name = line.split()[1]
        attributes.append(attribute_name)
    elif line.lower().startswith('@data'):
        data_section = True
    elif data_section:
        data.append(line.split(','))

# Create a DataFrame
df = pd.DataFrame(data, columns=attributes)

# Select columns X1 to X82
X = df.loc[:, 'X1':'X82']

# Replace 'm' with NaN
X = X.replace('m', np.nan)

# Convert to numeric type
X = X.astype(float)

# Perform linear interpolation
X.interpolate(method='linear', axis=0, inplace=True)

# Fill NaN values using forward and backward filling along columns
X = X.ffill(axis=1).bfill(axis=1)

# Check for NaN values after filling
nan_columns = X.columns[X.isnull().any()]
if len(nan_columns) > 0:
    print("Columns with NaN values after filling:")
    print(nan_columns)
else:
    print("No NaN values found after filling\n")

print('Original Dataframe shape:', X.shape)

# Standardization
X_mean = X.mean()
X_std = X.std()
Z = (X - X_mean) / X_std

# Check for infinite or NaN values in the data
if not np.isfinite(Z).all().all():
    # Handle infinite or NaN values if found
    Z = Z.replace([np.inf, -np.inf], np.nan)
    Z.fillna(0, inplace=True)  # Replace NaN with 0 (or any value suitable for your analysis)

# Check for NaN values after handling
nan_columns = Z.columns[Z.isnull().any()]
if len(nan_columns) > 0:
    print("\nColumns with NaN values after handling infinities:")
    print(nan_columns)
else:
    print("\nNo NaN values found after handling infinities\n")

# Calculate covariance after handling infinities/NaNs
c = Z.cov()

# Calculating eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(c)

# Sorting eigenvalues and corresponding eigenvectors
idx = eigenvalues.argsort()[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

print('Eigen Values (sorted):\n', eigenvalues)
print('\nEigen Values Shape:', eigenvalues.shape)
print('Eigen Vector Shape:', eigenvectors.shape)
print('\n')

explained_var = np.cumsum(eigenvalues) / np.sum(eigenvalues)
explained_var

n_components = np.argmax(explained_var >= 0.50) + 1
n_components

# PCA component or unit matrix
u = eigenvectors[:, :n_components]
pca_component = pd.DataFrame(u,
                             index=X.columns,
                             columns=['PC{}'.format(i + 1) for i in range(n_components)]
                             )

# Matrix multiplication or dot Product
Z_pca = Z @ pca_component

# Print the Principal Component values
print(Z_pca)

# components
pca = PCA()
pca.fit(Z)

# Access the principal components (eigenvectors)
pca_components = pca.components_
pca.components_ 

Z_pca_sklearn = pca.transform(Z)  # PCA transformation using scikit-learn

# Now perform SVD on the PCA-transformed data
U_pca, singular_values_pca, Vt_pca = np.linalg.svd(Z_pca_sklearn)

# Print the matrices obtained from SVD
print("\nU matrix:\n", U_pca)
print("\nSigma matrix:\n", singular_values_pca)
print("\nV^T matrix:\n", Vt_pca)

No NaN values found after filling

Original Dataframe shape: (450, 82)

No NaN values found after handling infinities

Eigen Values (sorted):
 [ 1.39511691e+01  1.06821010e+01  9.07374338e+00  5.78738997e+00
  4.97906424e+00  3.45981196e+00  2.99293153e+00  2.94146233e+00
  2.55138965e+00  2.37437747e+00  2.23600235e+00  1.99401684e+00
  1.76551045e+00  1.68638659e+00  1.42889662e+00  1.35327902e+00
  1.25148714e+00  1.15010541e+00  1.03929347e+00  9.82537956e-01
  9.60825285e-01  9.10575030e-01  8.33172094e-01  7.62114067e-01
  7.03168994e-01  6.35353085e-01  5.46821574e-01  4.63491014e-01
  2.68594574e-01  2.64134971e-01  1.89728987e-01  1.59975068e-01
  1.26322968e-01  8.78613374e-02  7.82161737e-02  7.62994072e-02
  4.79674913e-02  4.25034757e-02  3.40355837e-02  2.55035734e-02
  1.68512398e-02  1.33445086e-02  1.18797072e-02  1.04175553e-02
  8.94865879e-03  8.33202695e-03  7.71003237e-03  6.62371699e-03
  5.47338627e-03  3.23882332e-03  2.52816851e-03  1.86751919e-03
  1.27014461

<i>Version 2: PCA for comparison without Linear Interpolation and Z-Score Standardization, I set 'm' or the missing values as -1</i>

In [11]:
import pandas as pd
import numpy as np

# Open the ARFF file
with open('2017 Q1.arff', 'r') as file:
    lines = file.readlines()

# Initialize empty lists for data and attribute names
data = []
attributes = []

# Parse the ARFF file
data_section = False
for line in lines:
    line = line.strip()
    if not line or line.startswith('%'):
        continue
    if line.lower().startswith('@attribute'):
        attribute_name = line.split()[1]
        attributes.append(attribute_name)
    elif line.lower().startswith('@data'):
        data_section = True
    elif data_section:
        data.append(line.split(','))

# Create a DataFrame
df = pd.DataFrame(data, columns=attributes)

# Select columns X1 to X82
X = df.loc[:, 'X1':'X82']

# Replace 'm' with -1
X = X.replace('m', -1)

# checking shape
print('Original Dataframe shape:', X.shape)

# Convert to numeric type (if necessary)
X = X.astype(float)  # Convert to float type if the columns are not already numeric

# Mean
X_mean = X.mean()

# Standard deviation
X_std = X.std()

# Standardization
Z = (X - X_mean) / X_std

# covariance
c = Z.cov()

# Calculating eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(c)

# Sorting eigenvalues and corresponding eigenvectors
idx = eigenvalues.argsort()[::-1]
eigenvalues = eigenvalues[idx]
eigenvectors = eigenvectors[:, idx]

print('Eigen values (sorted):\n', eigenvalues)
print('Eigen values Shape (sorted):', eigenvalues.shape)
print('Eigen Vector Shape (sorted):', eigenvectors.shape)

explained_var = np.cumsum(eigenvalues) / np.sum(eigenvalues)
explained_var

n_components = np.argmax(explained_var >= 0.50) + 1
n_components

# PCA component or unit matrix
u = eigenvectors[:, :n_components]
pca_component = pd.DataFrame(u,
                             index=X.columns,
                             columns=['PC{}'.format(i + 1) for i in range(n_components)]
                             )

# Matrix multiplication or dot Product
Z_pca = Z @ pca_component

# Print the Principal Component values
print(Z_pca)

# components
pca.components_ 


Original Dataframe shape: (450, 82)
Eigen values (sorted):
 [ 1.87087347e+01  1.36621859e+01  6.51016493e+00  5.21050635e+00
  4.96836870e+00  4.33741803e+00  2.80035032e+00  2.62128173e+00
  2.09675319e+00  1.99118400e+00  1.93105442e+00  1.67887285e+00
  1.51835948e+00  1.28753161e+00  1.18420355e+00  1.07380791e+00
  1.01553065e+00  9.88433043e-01  9.49218359e-01  9.22188356e-01
  8.68511139e-01  8.36908656e-01  6.57799449e-01  6.30763394e-01
  5.58296559e-01  5.22357489e-01  3.93698538e-01  3.14915048e-01
  2.89557157e-01  2.27345393e-01  1.84628683e-01  1.66672190e-01
  1.53424053e-01  1.17261254e-01  9.69312239e-02  8.03424705e-02
  7.19107113e-02  6.53931245e-02  5.47997397e-02  5.13015442e-02
  4.39476693e-02  3.80068747e-02  1.81139776e-02  1.60985388e-02
  1.59016403e-02  1.28827683e-02  1.17638360e-02  1.03043007e-02
  8.10626580e-03  7.29556067e-03  4.87773519e-03  4.54423202e-03
  3.21997104e-03  1.99207807e-03  1.04620251e-03  8.99743707e-04
  5.57328009e-04  5.30518584e-

array([[ 8.34827174e-03,  2.66769444e-01,  2.63365103e-01, ...,
         1.14322976e-02,  1.22879641e-02,  1.19028959e-02],
       [-2.73275011e-02, -2.71668171e-03, -7.49806264e-03, ...,
         2.89661609e-01,  2.73378529e-01,  2.88176443e-01],
       [-2.83111140e-01,  8.04731417e-03, -4.54675630e-02, ...,
        -1.03007608e-02, -6.32124084e-03, -9.86204747e-03],
       ...,
       [-0.00000000e+00,  1.74896768e-10, -3.49628905e-15, ...,
        -4.46030362e-01, -4.20121866e-17, -5.76442474e-17],
       [-0.00000000e+00,  3.82139000e-12, -1.37701182e-16, ...,
         6.76221522e-03, -1.94257777e-17, -2.54999903e-17],
       [ 0.00000000e+00,  5.68399988e-10,  4.28551083e-14, ...,
         6.83869765e-01,  2.93997734e-17, -1.03678111e-16]])

<b>Using PCA and SVD without libraries</b>

<i>Attempt 1: PCA with Linear Interpolation and Z-Score Standardization but it takes forever for the kernel to execute</i>

In [2]:
# def dot_product(vec1, vec2):
#     return sum(x * y for x, y in zip(vec1, vec2))

# def transpose(matrix):
#     return [list(row) for row in zip(*matrix)]

# def matrix_multiply(A, B):
#     return [[dot_product(row, col) for col in transpose(B)] for row in A]

# def subtract_mean(data):
      #Subtracts mean from each feature in the dataset.
#     num_rows = len(data)
#     num_cols = len(data[0])
#     means = [sum(data[i][j] for i in range(num_rows)) / num_rows for j in range(num_cols)]
#     return [[data[i][j] - means[j] for j in range(num_cols)] for i in range(num_rows)]

# def covariance_matrix(data):
#     num_rows = len(data)
#     num_cols = len(data[0])
#     cov_matrix = [[sum(data[i][k] * data[j][k] for k in range(num_cols)) / (num_rows - 1)
#                    for j in range(num_cols)] for i in range(num_cols)]
#     return cov_matrix

# def power_iteration(matrix, num_iterations=1000):
      #Power iteration to find dominant eigenvector and eigenvalue.
#     n, d = len(matrix), len(matrix[0])
#     vec = [1] * d
#     for _ in range(num_iterations):
#         # Multiply matrix by vector
#         vec = [sum(matrix[i][j] * vec[j] for j in range(d)) for i in range(n)]
#         norm = sum(x**2 for x in vec) ** 0.5
#         vec = [x / norm for x in vec]
#     eigenvalue = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(d)) for i in range(n))
#     return eigenvalue, vec

# def deflate(matrix, eigenvalue, eigenvector):
      #Deflate the matrix to find next eigenvector and eigenvalue.
#     n = len(matrix)
#     w = [[eigenvalue * eigenvector[i] * eigenvector[j] for j in range(n)] for i in range(n)]
#     return [[matrix[i][j] - w[i][j] for j in range(n)] for i in range(n)]

# # Linear Interpolation Function
# def linear_interpolate(data):
#     num_rows = len(data)
#     num_cols = len(data[0])
    
#     for col_index in range(num_cols):
#         column = [row[col_index] for row in data]
#         indices_with_data = [i for i, val in enumerate(column) if val is not None]

#         if len(indices_with_data) > 1:
#             for i, val in enumerate(column):
#                 if val is None:
#                     left_index, right_index = None, None
#                     for idx in indices_with_data:
#                         if idx < i:
#                             left_index = idx
#                         elif idx > i:
#                             right_index = idx
#                             break

#                     if left_index is not None and right_index is not None:
#                         left_val = column[left_index]
#                         right_val = column[right_index]
#                         left_weight = (right_index - i) / (right_index - left_index)
#                         right_weight = (i - left_index) / (right_index - left_index)

#                         interpolated_val = left_val * left_weight + right_val * right_weight
#                         data[i][col_index] = interpolated_val

# # Z-Score Standardization
# def z_score_standardization(data):
#     num_rows = len(data)
#     num_cols = len(data[0])
    
#     means = [sum(row) / num_rows for row in zip(*data)]
#     std_devs = [((sum((val - means[col_index]) ** 2 for val in col if val is not None) / num_rows) ** 0.5)
#                 for col_index, col in enumerate(zip(*data))]
    
#     for i in range(num_rows):
#         for j in range(num_cols):
#             if data[i][j] is not None:
#                 data[i][j] = (data[i][j] - means[j]) / std_devs[j]

# # Read data from ARFF file
# with open('2017 Q1.arff', 'r') as file:
#     lines = file.readlines()

# # Initialize empty lists for data and attribute names
# data = []
# attributes = []

# # Parse the ARFF file
# data_section = False
# for line in lines:
#     line = line.strip()
#     if not line or line.startswith('%'):
#         continue
#     if line.lower().startswith('@attribute'):
#         attribute_name = line.split()[1]
#         attributes.append(attribute_name)
#     elif line.lower().startswith('@data'):
#         data_section = True
#     elif data_section:
#         data.append(line.split(','))

# # Find the indices of 'X1' to 'X82' in the attributes list
# start_index = attributes.index('X1')
# end_index = attributes.index('X82') + 1

# # Create a list of lists containing 'm' replaced with None for columns X1 to X82
# data_replaced = []
# for row in data:
#     replaced_row = [float(value) if value != 'm' else None for value in row[start_index:end_index]]
#     data_replaced.append(replaced_row)

# # Perform Linear Interpolation on the original data
# while any(None in row for row in data_replaced):
#     linear_interpolate(data_replaced)

# # Calculate mean for each column
# num_rows = len(data_replaced)
# num_columns = len(data_replaced[0])
# column_sums = [0] * num_columns
# num_valid_values = [0] * num_columns

# for i in range(num_rows):
#     for j in range(num_columns):
#         if data_replaced[i][j] is not None:
#             column_sums[j] += data_replaced[i][j]
#             num_valid_values[j] += 1

# means = [column_sum / num_valid_values[j] for j, column_sum in enumerate(column_sums)]

# # Subtract mean from the data
# centered_data = [
#     [
#         data_replaced[i][j] - means[j] if data_replaced[i][j] is not None else None
#         for j in range(num_columns)
#     ]
#     for i in range(num_rows)
# ]

# # Apply Z-Score Standardization
# std_devs = [
#     (
#         sum(
#             (val - means[col_index]) ** 2
#             for val in col
#             if val is not None
#         ) / num_valid_values[col_index]
#     ) ** 0.5
#     for col_index, col in enumerate(zip(*data_replaced))
# ]

# for i in range(num_rows):
#     for j in range(num_columns):
#         if centered_data[i][j] is not None:
#             centered_data[i][j] /= std_devs[j]

# # Compute covariance matrix using optimized code
# cov_matrix_python = covariance_matrix(centered_data)

# # Extract eigenvalues and eigenvectors using power iteration and matrix deflation
# eigenvalues_python, eigenvectors_python = [], []
# for _ in range(num_columns):
#     eigenvalue, eigenvector = power_iteration(cov_matrix_python)
#     eigenvalues_python.append(eigenvalue)
#     eigenvectors_python.append(eigenvector)
#     cov_matrix_python = deflate(cov_matrix_python, eigenvalue, eigenvector)

# # Sort eigenvalues and corresponding eigenvectors
# idx_python = sorted(range(len(eigenvalues_python)), key=lambda i: eigenvalues_python[i], reverse=True)
# eigenvalues_python = [eigenvalues_python[i] for i in idx_python]
# eigenvectors_python = [eigenvectors_python[i] for i in idx_python]

# # Printing original dataframe shape
# print(f"Original Dataframe shape: ({num_rows}, {num_columns})")

# # Print the sorted eigenvalues
# print("Eigen values (sorted):")
# print(eigenvalues_python)

# # Print the shape of sorted eigenvalues
# print(f"Eigen values Shape (sorted): ({len(eigenvalues_python)},)")

# # Print the shape of sorted eigenvectors
# print(f"Eigen Vector Shape (sorted): ({len(eigenvectors_python)}, {len(eigenvectors_python[0])})")

# # Print the first 4 principal components
# print("Principal Components (sorted):")
# for i in range(4):
#     print(f"PC{i+1}: {eigenvectors_python[i]}")

# # Project the data onto the first 4 principal components
# Z_pca_python = matrix_multiply(centered_data, eigenvectors_python[:4])

# # Print the projected data (first 5 rows)
# print("Projected Data (first 5 rows):")
# for row in Z_pca_python[:5]:
#     print(row)


<i>Attempt 2: PCA and SVD Since I could not perform Linear Interpolation and Z-Score Standardization, I set 'm' or the missing values as -1</i>

In [10]:
def dot_product(vec1, vec2):
    return sum(x * y for x, y in zip(vec1, vec2))

def transpose(matrix):
    return [list(row) for row in zip(*matrix)]

def matrix_multiply(A, B):
    return [[dot_product(row, col) for col in transpose(B)] for row in A]

def subtract_mean(data):
    #Subtracts mean from each feature in the dataset.
    num_rows = len(data)
    num_cols = len(data[0])
    means = [sum(data[i][j] for i in range(num_rows)) / num_rows for j in range(num_cols)]
    return [[data[i][j] - means[j] for j in range(num_cols)] for i in range(num_rows)]

def covariance_matrix(data):
    num_rows = len(data)
    num_cols = len(data[0])
    cov_matrix = [[sum(data[i][k] * data[j][k] for k in range(num_cols)) / (num_rows - 1)
                   for j in range(num_cols)] for i in range(num_cols)]
    return cov_matrix

def power_iteration(matrix, num_iterations=1000):
    #Power iteration to find dominant eigenvector and eigenvalue.
    n, d = len(matrix), len(matrix[0])
    vec = [1] * d
    for _ in range(num_iterations):
        # Multiply matrix by vector
        vec = [sum(matrix[i][j] * vec[j] for j in range(d)) for i in range(n)]
        norm = sum(x**2 for x in vec) ** 0.5
        vec = [x / norm for x in vec]
    eigenvalue = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(d)) for i in range(n))
    return eigenvalue, vec

def deflate(matrix, eigenvalue, eigenvector):
    #Deflate the matrix to find next eigenvector and eigenvalue.
    n = len(matrix)
    w = [[eigenvalue * eigenvector[i] * eigenvector[j] for j in range(n)] for i in range(n)]
    return [[matrix[i][j] - w[i][j] for j in range(n)] for i in range(n)]

def svd(matrix, epsilon=1e-10, max_iter=100):
    # Transpose of matrix
    def transpose(matrix):
        return [[matrix[j][i] for j in range(len(matrix))] for i in range(len(matrix[0]))]

    # Matrix multiplication
    def matrix_multiply(matrix1, matrix2):
        return [[sum(a * b for a, b in zip(row, col)) for col in transpose(matrix2)] for row in matrix1]

    # Matrix transpose multiplication
    def transpose_multiply(matrix1, matrix2):
        return matrix_multiply(transpose(matrix1), matrix2)

    # Initialize matrices
    U = [[0 for _ in range(len(matrix))] for _ in range(len(matrix))]
    Sigma = [[0 for _ in range(len(matrix[0]))] for _ in range(len(matrix))]

    # Initialize U as an identity matrix
    for i in range(len(U)):
        U[i][i] = 1

    # Initialize V as matrix transpose
    V = transpose(matrix)

    # Perform power iteration for U
    for _ in range(max_iter):
        U_new = matrix_multiply(matrix, V)

        # Normalize U_new
        norm = sum([U_new[i][i] ** 2 for i in range(len(U_new))]) ** 0.5
        U_new = [[U_new[i][j] / norm for j in range(len(U_new[0]))] for i in range(len(U_new))]

        # Calculate convergence
        diff = [[U_new[i][j] - U[i][j] for j in range(len(U[0]))] for i in range(len(U))]
        error = sum([sum([d ** 2 for d in row]) for row in diff])

        # Update U and break if convergence achieved
        U = U_new
        if error < epsilon:
            break

        V_new = transpose_multiply(matrix, U)

        # Normalize V_new
        norm = sum([V_new[i][i] ** 2 for i in range(len(V_new))]) ** 0.5
        V_new = [[V_new[i][j] / norm for j in range(len(V_new[0]))] for i in range(len(V_new))]

        # Calculate convergence
        diff = [[V_new[i][j] - V[i][j] for j in range(len(V[0]))] for i in range(len(V))]
        error = sum([sum([d ** 2 for d in row]) for row in diff])

        # Update V and break if convergence achieved
        V = V_new
        if error < epsilon:
            break

    # Calculate Sigma
    singular_values = [sum(U[i][k] * matrix[k][j] for k in range(len(matrix))) for i in range(len(matrix)) for j in range(len(matrix[0]))]
    Sigma = [[singular_values[i] if i == j else 0 for j in range(len(matrix[0]))] for i in range(len(matrix))]

    return U, Sigma, V

# Read data from ARFF file
with open('2017 Q1.arff', 'r') as file:
    lines = file.readlines()

# Initialize empty lists for data and attribute names
data = []
attributes = []

# Parse the ARFF file
data_section = False
for line in lines:
    line = line.strip()
    if not line or line.startswith('%'):
        continue
    if line.lower().startswith('@attribute'):
        attribute_name = line.split()[1]
        attributes.append(attribute_name)
    elif line.lower().startswith('@data'):
        data_section = True
    elif data_section:
        data.append(line.split(','))

# Find the indices of 'X1' to 'X82' in the attributes list
start_index = attributes.index('X1')
end_index = attributes.index('X82') + 1

# Create a list of lists containing 'm' replaced with -1 for columns X1 to X82
data_replaced = []
for row in data:
    replaced_row = [float(value) if value != 'm' else -1 for value in row[start_index:end_index]]
    data_replaced.append(replaced_row)

# Calculate mean for each column
num_rows = len(data_replaced)
num_columns = len(data_replaced[0])
means = [sum(col) / num_rows for col in zip(*data_replaced)]

# Subtract mean from the data
centered_data = [[data_replaced[i][j] - means[j] for j in range(num_columns)] for i in range(num_rows)]

# Compute covariance matrix using pure Python functions
cov_matrix_python = covariance_matrix(centered_data)

# Extract eigenvalues and eigenvectors using power iteration and matrix deflation
eigenvalues_python, eigenvectors_python = [], []
for _ in range(num_columns):
    eigenvalue, eigenvector = power_iteration(cov_matrix_python)
    eigenvalues_python.append(eigenvalue)
    eigenvectors_python.append(eigenvector)
    cov_matrix_python = deflate(cov_matrix_python, eigenvalue, eigenvector)

# Sort eigenvalues and corresponding eigenvectors
idx_python = sorted(range(len(eigenvalues_python)), key=lambda i: eigenvalues_python[i], reverse=True)
eigenvalues_python = [eigenvalues_python[i] for i in idx_python]
eigenvectors_python = [eigenvectors_python[i] for i in idx_python]

print("Original Dataframe shape: (" + str(num_rows) + ", " + str(num_columns) + ")")

# Print the sorted eigenvalues
print("Eigen Values (sorted):\n", eigenvalues_python)

print("Eigen Values Shape: (" + str(len(eigenvalues_python)) + ",)")
print("Eigen Vector Shape: (" + str(len(eigenvectors_python)) + ", " + str(len(eigenvectors_python[0])) + ")")

# Print the first 5 principal components
print("Principal Components (sorted):")

def print_pc_row(index, values):
    if isinstance(index, int):
        index_str = str(index)
    else:
        index_str = index
    pc_values = " ".join("{:.6f}".format(value) if isinstance(value, float) else value for value in values)
    print("{:<8s}{}".format(index_str, pc_values))

print("{:<8s}{:<10s}{:<10s}{:<10s}{:<10s}{:<10s}".format("Index", "PC1", "PC2", "PC3", "PC4", "PC5"))

num_rows_to_display = 5  # Number of rows to display

for i in range(num_rows_to_display):
    print_pc_row(i, eigenvectors_python[i][:5])

print("...")

# Compute SVD 
U_svd, Sigma_svd, Vt_svd = svd(centered_data)

def print_partial_matrix(matrix, rows=5, cols=5):
    for i in range(min(rows, len(matrix))):
        row_values = matrix[i][:cols]
        print(" ".join("{:.6f}".format(value) if isinstance(value, float) else str(value) for value in row_values))
    if len(matrix) > rows:
        print("...")

print("\nU matrix:")
print_partial_matrix(U_svd)

print("\nSigma matrix:")
print_partial_matrix(Sigma_svd)

print("\nV^T matrix:")
print_partial_matrix(Vt_svd)

Original Dataframe shape: (450, 82)
Eigen Values (sorted):
 [3.85798259379908e+16, 2831065752259.357, 5340703449.384116, 3731452625.19838, 49992157.27451604, 25473200.388354164, 237122.5931884624, 3551.0341445680406, 2934.935505821258, 672.7002573849039, 571.0514661149846, 203.86513700578698, 174.15013298007938, 59.702227682231324, 3.731832629323931, 1.492323675822077, 1.414882614594198, 0.879260983255742, 0.7482398785300962, 0.5674413515140343, 0.4237466029721288, 0.25373643088181724, 0.21985093450510382, 0.17443551445561561, 0.11185284661487263, 0.08525623385346831, 0.04093431961924817, 0.03804600729518891, 0.029178509195898903, 0.026242599764341606, 0.01943118483552194, 0.01726069075283837, 0.013342209008591803, 0.01260755816542752, 0.010643333831488855, 0.007594498510652607, 0.006513159611489723, 0.005876045561281467, 0.003497205155536013, 0.0032755881298789278, 0.0022036881671380778, 0.002001866844016312, 0.001805751530617628, 0.0013573456724996955, 0.0011670484912635565, 0.000951

<b>Conclusion</b>

I used only one file from the zipped dataset since my Jupyter Notebook had execution struggles. For a fair comparison, I will compare the PCA of the codes where I replaced 'm' or the missing values as -1. I noticed both eigenvalue outputs seem different. It may be due to a calculation error on the pure Python side. As for the PCA itself, they are both close to 0. Now, for the PCA and SVD of the one with a library, linear interpolation, and z-score standardization, its eigenvalues and SVD matrices seemed appropriate. Also, the PCA values are close to 0. Be warned the values for the SVD matrices without a library are most likely wrong due to the strange repeating values. Using pure Python without libraries and imports can be increasingly complex and computationally extensive. It is due to performing functions manually. Ultimately, this would result in a severely longer execution time. Completing this code was a struggle due to a lack of sources online for pure Python. Most recommend using libraries in the present. 