**Question 1**

a. Load the data from the featured gene_expression_matrix.txt file (numpy.loadtxt would be helpful). After loading, you should end up with a 62 x 2000 matrix that includes floating point numbers. Print the first three columns of your matrix.

In [1]:
import numpy as np

# Load the data from file
file_path = 'gene_expression_matrix.txt'
gene_expression_data = np.loadtxt(file_path)

# Check the shape of the matrix (62 x 2000)
matrix_shape = gene_expression_data.shape

# Print the first three columns of the matrix
first_three_columns = gene_expression_data[:, :3]

matrix_shape, first_three_columns

((62, 2000),
 array([[ 8589.4163,  5468.2409,  4263.4075],
        [ 3825.705 ,  6970.3614,  5369.9688],
        [ 3230.3287,  3694.45  ,  3400.74  ],
        [ 7126.5988,  3779.0682,  3705.5537],
        [ 9330.6787,  7017.2295,  4723.7825],
        [14876.407 ,  3201.9045,  2327.6263],
        [ 4469.09  ,  5167.0568,  4773.68  ],
        [ 4913.7988,  5215.0477,  4288.6162],
        [ 7144.4062,  2071.4023,  1619.2762],
        [ 5382.3938,  3848.4432,  3372.4887],
        [ 7434.8213,  6471.2114,  5029.6175],
        [ 4214.9   ,  2213.3568,  1611.5188],
        [ 8865.4587,  5447.1864,  4887.0575],
        [ 5934.8888,  3744.9886,  3528.8337],
        [ 5821.6175,  3748.2477,  3439.9538],
        [ 9767.0275,  9785.775 ,  8605.0438],
        [13324.729 ,  9505.0341,  7740.9875],
        [12977.712 ,  7565.6159,  5735.2   ],
        [ 8753.2388,  8978.1341,  7777.8412],
        [ 5012.02  ,  1383.4886,  1269.6487],
        [ 6904.8012,  2260.7773,  1987.0012],
        [ 8347.9838, 

b. Standardize the data to be used in the k-means clustering algorithm, i.e. for each of the 2000 dimensions, subtract the mean of this dimension and divide by its standard deviation for all observations. Print the first three columns of your standardized matrix.

In [2]:
from sklearn.preprocessing import StandardScaler

# Standardize the data
scaler = StandardScaler()
gene_expression_standardized = scaler.fit_transform(gene_expression_data)

# Print the first three columns of the standardized matrix
standardized_first_three_columns = gene_expression_standardized[:, :3]

standardized_first_three_columns

array([[ 0.51292947,  0.23088092,  0.09353633],
       [-1.03981707,  0.92273048,  0.70714741],
       [-1.23388184, -0.58609512, -0.3848306 ],
       [ 0.03611955, -0.5471215 , -0.21580511],
       [ 0.75454626,  0.94431708,  0.34882377],
       [ 2.56219368, -0.81295267, -0.97989429],
       [-0.83010373,  0.09216096,  0.37649296],
       [-0.68514951,  0.1142647 ,  0.10751508],
       [ 0.04192392, -1.33364156, -1.37268904],
       [-0.53240951, -0.51516863, -0.40049653],
       [ 0.13658563,  0.69283101,  0.51841559],
       [-0.91295775, -1.26825988, -1.37699068],
       [ 0.60290635,  0.2211836 ,  0.43936313],
       [-0.35232204, -0.56281795, -0.3138    ],
       [-0.38924318, -0.56131687, -0.36308574],
       [ 0.89677551,  2.2194591 ,  2.50106326],
       [ 2.0564194 ,  2.09015492,  2.02192621],
       [ 1.94330813,  1.19689394,  0.90967565],
       [ 0.56632792,  1.8474743 ,  2.04236235],
       [-0.653134  , -1.65048218, -1.56656471],
       [-0.03617607, -1.24641886, -1.168
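As a cross-check on the scaling step, the same standardization can be written directly in NumPy. This is a minimal sketch on synthetic data: the array `demo` and the helper name `standardize_columns` are illustrative and not part of the assignment. It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses by default.

```python
import numpy as np

def standardize_columns(matrix):
    """Subtract each column's mean and divide by its population std (ddof=0)."""
    mean = matrix.mean(axis=0)
    std = matrix.std(axis=0)  # NumPy's default is ddof=0
    std[std == 0] = 1.0       # constant columns become all zeros instead of dividing by zero
    return (matrix - mean) / std

# Synthetic stand-in for the 62 x 2000 expression matrix (assumption: values are illustrative)
rng = np.random.default_rng(0)
demo = rng.normal(loc=5000.0, scale=2000.0, size=(62, 5))
demo_standardized = standardize_columns(demo)

# Each column should now have mean ~0 and standard deviation ~1
print(demo_standardized.mean(axis=0).round(6))
print(demo_standardized.std(axis=0).round(6))
```

If the columns of the real matrix are passed through this helper, the result should agree with the `StandardScaler` output up to floating-point rounding.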

c. Implement the k-means clustering algorithm, and use it on your data to obtain a clustering of the 62 patients into 2 clusters (you can use 0 and 1 for cluster labels). Use cosine similarity as your distance/similarity metric.

In [3]:
from numpy.linalg import norm

def cosine_similarity(a, b):
    # Compute the cosine similarity between two vectors
    return np.dot(a, b) / (norm(a) * norm(b))

def k_means_cosine(data, k=2, max_iters=100):
    # Randomly initialize centroids
    np.random.seed(0)  # fixed seed: reproducible, but every call starts from the same centroids
    centroids = data[np.random.choice(data.shape[0], k, replace=False)]

    for _ in range(max_iters):
        # Assignment step: Assign points to the nearest centroid
        clusters = np.array([np.argmax([cosine_similarity(x, centroid) for centroid in centroids]) for x in data])

        # Update step: Update centroids to be the mean of points in the cluster
        new_centroids = np.array([data[clusters == j].mean(axis=0) for j in range(k)])

        # Check for convergence (if centroids do not change)
        if np.all(centroids == new_centroids):
            break

        centroids = new_centroids

    return clusters

# Apply k-means clustering to the standardized data
clusters = k_means_cosine(gene_expression_standardized)

clusters

array([1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0,
       1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1,
       0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0])

d. Run the k-means algorithm 5 times on the given data, each time starting from different initial centroids. Calculate and report the squared error distortion for each of the 5 clustering solutions obtained.

In [4]:
# Calculate the squared error distortion for the given clustering solution.
def squared_error_distortion(data, clusters, centroids):
    distortion = 0
    for i, x in enumerate(data):
        centroid = centroids[clusters[i]]
        distance = 1 - cosine_similarity(x, centroid)
        distortion += distance ** 2
    return distortion

# Run k-means 5 times and calculate the squared error distortion for each run
distortions = []
for _ in range(5):
    clusters = k_means_cosine(gene_expression_standardized)
    centroids = [gene_expression_standardized[clusters == j].mean(axis=0) for j in range(2)]
    distortion = squared_error_distortion(gene_expression_standardized, clusters, centroids)
    distortions.append(distortion)

distortions

[16.992952193800757,
 16.992952193800757,
 16.992952193800757,
 16.992952193800757,
 16.992952193800757]
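Because `k_means_cosine` calls `np.random.seed(0)` internally, all five runs above start from the same centroids, which is consistent with the identical distortions. A hedged sketch of how each run could draw its own initial centroids follows. It runs on synthetic data, and `k_means_cosine_seeded`, `cosine_sim_matrix`, `cosine_distortion`, and the `seed` parameter are illustrative names rather than part of the notebook.

```python
import numpy as np

def cosine_sim_matrix(data, centroids):
    """Cosine similarity between every row of `data` and every centroid."""
    data_unit = data / np.linalg.norm(data, axis=1, keepdims=True)
    cent_unit = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    return data_unit @ cent_unit.T

def k_means_cosine_seeded(data, k=2, max_iters=100, seed=None):
    """k-means with cosine similarity; `seed` controls the initial centroids."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(data.shape[0], size=k, replace=False)]
    for _ in range(max_iters):
        # Assignment step: most similar centroid for each point
        clusters = np.argmax(cosine_sim_matrix(data, centroids), axis=1)
        # Update step: cluster means (an empty cluster keeps its old centroid)
        new_centroids = np.array([
            data[clusters == j].mean(axis=0) if np.any(clusters == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    return clusters, centroids

def cosine_distortion(data, clusters, centroids):
    """Sum of squared cosine distances (1 - similarity) to the assigned centroid."""
    sims = cosine_sim_matrix(data, centroids)[np.arange(len(data)), clusters]
    return float(np.sum((1.0 - sims) ** 2))

# Synthetic stand-in for the standardized matrix (assumption: illustrative only)
demo_data = np.random.default_rng(42).normal(size=(62, 50))
results = {seed: k_means_cosine_seeded(demo_data, seed=seed) for seed in range(5)}
distortions_by_seed = {seed: cosine_distortion(demo_data, *res) for seed, res in results.items()}
print(distortions_by_seed)
```

Passing a different `seed` per run keeps each run reproducible while still varying the starting centroids, so the five distortions can actually differ.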

e. For the clustering solution with the lowest squared error distortion, calculate the percentage of cancer patients that ended up in each of the two clusters, and print these percentages (calculate these percentages without including the 62nd patient). Also, print the cluster to which your algorithm assigned the 62nd patient.

In [5]:
# All five runs gave identical results, so use the last clustering
# The first 40 rows are cancer patients and the next 21 rows are healthy patients
num_cancer_patients = 40
num_healthy_patients = 21

# Exclude the 62nd patient for the initial analysis
clusters_excluding_last = clusters[:-1]

# Calculate the percentage of cancer patients in each cluster
cancer_in_cluster_0 = np.sum(clusters_excluding_last[:num_cancer_patients] == 0) / num_cancer_patients * 100
cancer_in_cluster_1 = np.sum(clusters_excluding_last[:num_cancer_patients] == 1) / num_cancer_patients * 100

# Cluster assignment of the 62nd patient
cluster_62nd_patient = clusters[-1]

cancer_in_cluster_0, cancer_in_cluster_1, cluster_62nd_patient

(40.0, 60.0, 0)

I will try a different initialization for the k-means algorithm and repeat the steps. Hopefully this will affect the results positively, i.e. distinguish cancer patients from healthy people better.

In [6]:
def k_means_plus_plus_initialization(data, k):
    # Initialize centroids using the k-means++ method.
    np.random.seed(0)  # fixed seed: every call picks the same initial centroids
    centroids = [data[np.random.choice(data.shape[0])]]

    for _ in range(1, k):
        distances = np.array([min([np.square(1 - cosine_similarity(x, centroid)) for centroid in centroids]) for x in data])
        probabilities = distances / distances.sum()
        new_centroid = data[np.random.choice(data.shape[0], p=probabilities)]
        centroids.append(new_centroid)

    return np.array(centroids)

def k_means_cosine_plus_plus(data, k=2, max_iters=100):
    # K-means clustering using cosine similarity and k-means++ initialization.
    # Initialize centroids using k-means++
    centroids = k_means_plus_plus_initialization(data, k)

    for _ in range(max_iters):
        # Assignment step
        clusters = np.array([np.argmax([cosine_similarity(x, centroid) for centroid in centroids]) for x in data])

        # Update step
        new_centroids = np.array([data[clusters == j].mean(axis=0) for j in range(k)])

        # Check for convergence
        if np.all(centroids == new_centroids):
            break

        centroids = new_centroids

    return clusters

# Run k-means with k-means++ initialization and calculate distortions
distortions_plus_plus = []
for _ in range(5):
    clusters_plus_plus = k_means_cosine_plus_plus(gene_expression_standardized)
    centroids_plus_plus = [gene_expression_standardized[clusters_plus_plus == j].mean(axis=0) for j in range(2)]
    distortion_plus_plus = squared_error_distortion(gene_expression_standardized, clusters_plus_plus, centroids_plus_plus)
    distortions_plus_plus.append(distortion_plus_plus)

distortions_plus_plus

[17.00800257040696,
 17.00800257040696,
 17.00800257040696,
 17.00800257040696,
 17.00800257040696]

In [7]:
# Exclude the 62nd patient for the initial analysis
clusters_plus_plus_excluding_last = clusters_plus_plus[:-1]

# Calculate the percentage of cancer patients in each cluster
cancer_in_cluster_0_plus_plus = np.sum(clusters_plus_plus_excluding_last[:num_cancer_patients] == 0) / num_cancer_patients * 100
cancer_in_cluster_1_plus_plus = np.sum(clusters_plus_plus_excluding_last[:num_cancer_patients] == 1) / num_cancer_patients * 100

# Cluster assignment of the 62nd patient
cluster_62nd_patient_plus_plus = clusters_plus_plus[-1]

cancer_in_cluster_0_plus_plus, cancer_in_cluster_1_plus_plus, cluster_62nd_patient_plus_plus

(37.5, 62.5, 0)

f. Based on your results in (e), are there any significant differences between the clusters in terms of what percentage of them are cancer patients? Can you reach a reliable conclusion as to whether or not the 62nd patient is a cancer or non-cancer sample based on this clustering? Write and discuss your conclusion briefly.

-- Even with the second implementation of k-means clustering (k-means++ initialization), the cancer patients and healthy individuals were not clearly separated. We could tentatively say that the 62nd patient is healthy because they were assigned to cluster 0, but this is not reliable: there is no clear separation between cancer and non-cancer samples in the clusters, and even cluster 0 contains a substantial share of the cancer patients (37.5%).
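The question in (f) asks what percentage of each cluster is cancer patients, which is a different ratio from the share of cancer patients landing in each cluster reported in (e). The sketch below computes it for the labels printed in part (c) (random initialization), assuming the row order used above: first 40 rows cancer, next 21 healthy, 62nd patient excluded. The helper name `cancer_share_per_cluster` is illustrative.

```python
import numpy as np

# Cluster labels copied from the output of part (c)
labels = np.array([
    1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0,
    1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1,
    0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0,
])

def cancer_share_per_cluster(labels, n_cancer=40, n_healthy=21, k=2):
    """Fraction of each cluster's labeled members that are cancer patients."""
    known = labels[:n_cancer + n_healthy]          # drop the unlabeled 62nd patient
    is_cancer = np.arange(len(known)) < n_cancer   # assumption: cancer rows come first
    shares = {}
    for j in range(k):
        members = known == j
        size = int(members.sum())
        shares[j] = float(is_cancer[members].sum()) / size if size else float("nan")
    return shares

shares = cancer_share_per_cluster(labels)
baseline = 40 / 61  # overall share of cancer patients among the labeled samples
print(shares, baseline)
```

On these labels both clusters come out at roughly two-thirds cancer patients, close to the overall 40/61 share, which supports the conclusion that the clustering does not separate the two groups.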

**Question 2**

a. Create an example string of length 100 of nucleotides and find the k-mers of that string. Calculate the k-mers for each of the k values 3,4,5,20,30,50.

In [8]:
import random

# Generate a random nucleotide string of a given length
def generate_nucleotide_string(length):
    nucleotides = ['A', 'C', 'G', 'T']
    return ''.join(random.choices(nucleotides, k=length))

# Find k-mers of a string
def find_kmers(string, k):
    return [string[i:i+k] for i in range(len(string) - k + 1)]

# Generate a nucleotide string of length 100
nucleotide_string = generate_nucleotide_string(100)

# Calculate k-mers for k = 3, 4, 5, 20, 30, 50
k_values = [3, 4, 5, 20, 30, 50]
kmers = {k: find_kmers(nucleotide_string, k) for k in k_values}

# Print the original nucleotide string
nucleotide_string, kmers

('GAATAAGTGGAAACAGAGCTGTACCCATGATCGGCCTGACCACGCACGAATCCTAGTAATGTTCCGTTGGATCAGTTGCCACGATAAGCTCTCTTTCGAC',
 {3: ['GAA',
   'AAT',
   'ATA',
   'TAA',
   'AAG',
   'AGT',
   'GTG',
   'TGG',
   'GGA',
   'GAA',
   'AAA',
   'AAC',
   'ACA',
   'CAG',
   'AGA',
   'GAG',
   'AGC',
   'GCT',
   'CTG',
   'TGT',
   'GTA',
   'TAC',
   'ACC',
   'CCC',
   'CCA',
   'CAT',
   'ATG',
   'TGA',
   'GAT',
   'ATC',
   'TCG',
   'CGG',
   'GGC',
   'GCC',
   'CCT',
   'CTG',
   'TGA',
   'GAC',
   'ACC',
   'CCA',
   'CAC',
   'ACG',
   'CGC',
   'GCA',
   'CAC',
   'ACG',
   'CGA',
   'GAA',
   'AAT',
   'ATC',
   'TCC',
   'CCT',
   'CTA',
   'TAG',
   'AGT',
   'GTA',
   'TAA',
   'AAT',
   'ATG',
   'TGT',
   'GTT',
   'TTC',
   'TCC',
   'CCG',
   'CGT',
   'GTT',
   'TTG',
   'TGG',
   'GGA',
   'GAT',
   'ATC',
   'TCA',
   'CAG',
   'AGT',
   'GTT',
   'TTG',
   'TGC',
   'GCC',
   'CCA',
   'CAC',
   'ACG',
   'CGA',
   'GAT',
   'ATA',
   'TAA',
   'AAG',
   'AGC',
   'GCT',
   'CTC',
  

b. Implement the below function to reconstruct your original string using the k-mers you generated in step a.
def reconstruct_string_from_kmers(kmers):
    # TODO
    return reconstructed_string

In [9]:
# Assuming the k-mers are in order (sequential)
# Assuming each k-mer overlaps with its predecessor by k-1 characters.

def reconstruct_string_from_kmers(kmers):
    if not kmers:
        return ""

    # Start with the first k-mer
    reconstructed_string = kmers[0]

    # Iterate over the k-mers and add the last character from each k-mer
    for kmer in kmers[1:]:
        reconstructed_string += kmer[-1]

    return reconstructed_string

# Reconstruct the string for each k value and compare with the original
reconstructed_strings = {k: reconstruct_string_from_kmers(kmers[k]) for k in k_values}

# Compare reconstructed strings with the original
reconstruction_success = {k: (reconstructed_strings[k] == nucleotide_string) for k in k_values}
reconstruction_success

{3: True, 4: True, 5: True, 20: True, 30: True, 50: True}

c. Explain your implementation in a paragraph. Which approach and algorithms did you use and why did your approach work?

-- The implementation above is deliberately simple. It works only because the k-mers are kept in their original order and each one overlaps its predecessor by k-1 characters, so appending the last character of every subsequent k-mer rebuilds the string. If the k-mers are not in order, the problem becomes harder to solve and implement from scratch. The general approach is:

1- Build a De Bruijn graph
2- Find an Eulerian path

I will use an external library (networkx) for the De Bruijn graph.

In [10]:
import networkx as nx

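def create_de_bruijn_graph(kmers):
    # Nodes are (k-1)-mers; each k-mer adds an edge from its prefix to its suffix.
    # Note: nx.DiGraph stores each distinct edge once, so repeated k-mers collapse into one edge.
    G = nx.DiGraph()
    for kmer in kmers:
        left_part = kmer[:-1]
        right_part = kmer[1:]
        G.add_edge(left_part, right_part)
    return G

def find_eulerian_path(G):
    # nx.eulerian_path yields (u, v) edges; only the source node u of each edge is kept,
    # so the final node of the path is not part of the returned list.
    return [u for u, v in nx.eulerian_path(G)]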

def reconstruct_string_from_path(path):
    reconstructed_string = path[0]
    for node in path[1:]:
        reconstructed_string += node[-1]
    return reconstructed_string

# Choose k-mer length
chosen_k = 30
kmers_for_reconstruction = kmers[chosen_k]
print(kmers_for_reconstruction)

# Create De Bruijn graph from the chosen k-mers
G = create_de_bruijn_graph(kmers_for_reconstruction)

# Attempt to find an Eulerian path and reconstruct the string
try:
    path = find_eulerian_path(G)
    reconstructed_string = reconstruct_string_from_path(path)
    print("Reconstructed String:", reconstructed_string)
except nx.NetworkXError as e:
    print("Error in finding Eulerian path:", e)

['GAATAAGTGGAAACAGAGCTGTACCCATGA', 'AATAAGTGGAAACAGAGCTGTACCCATGAT', 'ATAAGTGGAAACAGAGCTGTACCCATGATC', 'TAAGTGGAAACAGAGCTGTACCCATGATCG', 'AAGTGGAAACAGAGCTGTACCCATGATCGG', 'AGTGGAAACAGAGCTGTACCCATGATCGGC', 'GTGGAAACAGAGCTGTACCCATGATCGGCC', 'TGGAAACAGAGCTGTACCCATGATCGGCCT', 'GGAAACAGAGCTGTACCCATGATCGGCCTG', 'GAAACAGAGCTGTACCCATGATCGGCCTGA', 'AAACAGAGCTGTACCCATGATCGGCCTGAC', 'AACAGAGCTGTACCCATGATCGGCCTGACC', 'ACAGAGCTGTACCCATGATCGGCCTGACCA', 'CAGAGCTGTACCCATGATCGGCCTGACCAC', 'AGAGCTGTACCCATGATCGGCCTGACCACG', 'GAGCTGTACCCATGATCGGCCTGACCACGC', 'AGCTGTACCCATGATCGGCCTGACCACGCA', 'GCTGTACCCATGATCGGCCTGACCACGCAC', 'CTGTACCCATGATCGGCCTGACCACGCACG', 'TGTACCCATGATCGGCCTGACCACGCACGA', 'GTACCCATGATCGGCCTGACCACGCACGAA', 'TACCCATGATCGGCCTGACCACGCACGAAT', 'ACCCATGATCGGCCTGACCACGCACGAATC', 'CCCATGATCGGCCTGACCACGCACGAATCC', 'CCATGATCGGCCTGACCACGCACGAATCCT', 'CATGATCGGCCTGACCACGCACGAATCCTA', 'ATGATCGGCCTGACCACGCACGAATCCTAG', 'TGATCGGCCTGACCACGCACGAATCCTAGT', 'GATCGGCCTGACCACGCACGAATCCTAGTA', 'ATCGGCCTGACC

d. Does the reconstructed string always match your original one? If not, why?

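-- No. The algorithm could not always reconstruct the string: it did not work with 3-mers or 4-mers, and even when it ran, the result did not match the original exactly. Two properties of the implementation above likely explain this. First, short k-mers tend to repeat within a 100-nucleotide string, and nx.DiGraph keeps only one edge per distinct k-mer, so repeated edges are lost and an Eulerian path may no longer exist; even with all edges kept, repeats allow several Eulerian paths, each spelling a different string. Second, find_eulerian_path keeps only the source node of each edge, so the last node of the path is dropped and the reconstructed string comes out one character short.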
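One way to handle unordered k-mers is to keep repeated k-mers as parallel edges and to include the final node of the Eulerian path. The sketch below is a hedged, dependency-free illustration that uses Hierholzer's algorithm instead of the networkx call from the cell above. The function names are hypothetical, and it assumes k >= 2 and that the k-mers all come from one string.

```python
import random
from collections import Counter, defaultdict

def find_kmers(string, k):
    return [string[i:i + k] for i in range(len(string) - k + 1)]

def reconstruct_from_unordered_kmers(kmers):
    """Spell a string from an Eulerian path in the De Bruijn multigraph of `kmers`."""
    if not kmers:
        return ""
    graph = defaultdict(list)   # lists keep repeated k-mers as parallel edges
    balance = Counter()         # out-degree minus in-degree per node
    for kmer in kmers:
        left, right = kmer[:-1], kmer[1:]
        graph[left].append(right)
        balance[left] += 1
        balance[right] -= 1
    # Start at the node with one extra outgoing edge, if there is one
    start = next((node for node, b in balance.items() if b == 1), kmers[0][:-1])
    # Hierholzer's algorithm, iterative version
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    if len(path) != len(kmers) + 1:
        raise ValueError("k-mers do not form a single Eulerian path")
    # First node in full, then the last character of every following node
    return path[0] + "".join(node[-1] for node in path[1:])

# Unique (k-1)-mers: the path is unique, so the original string is recovered
original = "ACGTTGCA"
shuffled = find_kmers(original, 4)
random.Random(0).shuffle(shuffled)
rebuilt = reconstruct_from_unordered_kmers(shuffled)

# Short k-mers with repeats: the k-mer multiset is preserved, the string itself may differ
rand = random.Random(1)
long_string = "".join(rand.choices("ACGT", k=100))
rebuilt_long = reconstruct_from_unordered_kmers(find_kmers(long_string, 3))
```

With repeated (k-1)-mers the returned string is only guaranteed to contain the same k-mers as the input, not to equal the original, which is why larger k values reconstruct more reliably.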