# Clustering on DeepFold Embeddings
Haerang Lee

I'm going to take Skyler's DeepFold embeddings. Those files are in `embeddings/DeepFold` in the GCS bucket.

Then I'll run some clustering models on top of them.

**Silly question**: If I want to put this notebook under a directory, how do I access `utils` in the parent directory? Right now I just put my notebook in the home dir.
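
One possible answer, as a minimal sketch: if the notebook lived in a subdirectory, the parent directory could be added to `sys.path` before importing `utils`. This assumes Jupyter's working directory is the notebook's own folder and that `utils` sits exactly one level above it; neither is confirmed for this repo.

```python
import os
import sys

# Assumption: the current working directory is a subfolder of the repo,
# and the `utils` package lives one level above it.
parent_dir = os.path.abspath(os.path.join(os.getcwd(), os.pardir))

# Put the parent directory on the import path once.
if parent_dir not in sys.path:
    sys.path.insert(0, parent_dir)

# After this, the import used below should resolve
# (left commented out so the sketch runs anywhere):
# from utils import gcs_utils as gcs
```

Installing the repo as a package (for example with `pip install -e .`) is another common option, but it needs a packaging file that may not exist here.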

In [1]:
from google.cloud import storage
import argparse
import gzip
import os
import sys
import time
from multiprocessing import Pool

import numpy as np
import pandas as pd
from tqdm.auto import tqdm

from utils import gcs_utils as gcs

In [2]:
# Get all the keys from gcs
allkeys = gcs.list_keys()

In [3]:
# What's in here?
allkeys[0:10]

['/annotations/blast_annotations.csv',
 'UP000005640_9606_HUMAN.tar',
 'UP000005640_9606_HUMAN/cif/AF-A0A024R1R8-F1-model_v1.cif.gz',
 'UP000005640_9606_HUMAN/cif/AF-A0A024RBG1-F1-model_v1.cif.gz',
 'UP000005640_9606_HUMAN/cif/AF-A0A024RCN7-F1-model_v1.cif.gz',
 'UP000005640_9606_HUMAN/cif/AF-A0A075B6H5-F1-model_v1.cif.gz',
 'UP000005640_9606_HUMAN/cif/AF-A0A075B6H7-F1-model_v1.cif.gz',
 'UP000005640_9606_HUMAN/cif/AF-A0A075B6H8-F1-model_v1.cif.gz',
 'UP000005640_9606_HUMAN/cif/AF-A0A075B6H9-F1-model_v1.cif.gz',
 'UP000005640_9606_HUMAN/cif/AF-A0A075B6I0-F1-model_v1.cif.gz']

In [4]:
# How many files are there?
len(allkeys)

46812

In [5]:
# I just want the DeepFold embedding files
for k in allkeys:
    if "embed" in k:
        print(k)

embeddings/
embeddings/DeepFold/
embeddings/DeepFold/embeddings_0.csv
embeddings/DeepFold/embeddings_1.csv
embeddings/DeepFold/embeddings_10.csv
embeddings/DeepFold/embeddings_11.csv
embeddings/DeepFold/embeddings_12.csv
embeddings/DeepFold/embeddings_13.csv
embeddings/DeepFold/embeddings_14.csv
embeddings/DeepFold/embeddings_15.csv
embeddings/DeepFold/embeddings_16.csv
embeddings/DeepFold/embeddings_17.csv
embeddings/DeepFold/embeddings_18.csv
embeddings/DeepFold/embeddings_19.csv
embeddings/DeepFold/embeddings_2.csv
embeddings/DeepFold/embeddings_20.csv
embeddings/DeepFold/embeddings_21.csv
embeddings/DeepFold/embeddings_22.csv
embeddings/DeepFold/embeddings_23.csv
embeddings/DeepFold/embeddings_3.csv
embeddings/DeepFold/embeddings_4.csv
embeddings/DeepFold/embeddings_5.csv
embeddings/DeepFold/embeddings_6.csv
embeddings/DeepFold/embeddings_7.csv
embeddings/DeepFold/embeddings_8.csv
embeddings/DeepFold/embeddings_9.csv


In [6]:
prefix = 'embeddings/DeepFold'
keys = gcs.list_file_paths(prefix)

There are 24 files in the embeddings folder, each containing 1,000 proteins (except the last one). Here are the file names.

In [7]:
keys

['gs://capstone-fall21-protein/embeddings/DeepFold/',
 'gs://capstone-fall21-protein/embeddings/DeepFold/embeddings_0.csv',
 'gs://capstone-fall21-protein/embeddings/DeepFold/embeddings_1.csv',
 'gs://capstone-fall21-protein/embeddings/DeepFold/embeddings_10.csv',
 'gs://capstone-fall21-protein/embeddings/DeepFold/embeddings_11.csv',
 'gs://capstone-fall21-protein/embeddings/DeepFold/embeddings_12.csv',
 'gs://capstone-fall21-protein/embeddings/DeepFold/embeddings_13.csv',
 'gs://capstone-fall21-protein/embeddings/DeepFold/embeddings_14.csv',
 'gs://capstone-fall21-protein/embeddings/DeepFold/embeddings_15.csv',
 'gs://capstone-fall21-protein/embeddings/DeepFold/embeddings_16.csv',
 'gs://capstone-fall21-protein/embeddings/DeepFold/embeddings_17.csv',
 'gs://capstone-fall21-protein/embeddings/DeepFold/embeddings_18.csv',
 'gs://capstone-fall21-protein/embeddings/DeepFold/embeddings_19.csv',
 'gs://capstone-fall21-protein/embeddings/DeepFold/embeddings_2.csv',
 'gs://capstone-fall21-pro

Each of those embedding files contains 1,000 proteins (except the last one).
After splitting a file on commas, the first three elements are not data, just `'', '0', '1\n0'`. The remaining 2,000 elements are pairs of protein name and its embedding (2\*1000).

In [8]:
keys[1]

'gs://capstone-fall21-protein/embeddings/DeepFold/embeddings_0.csv'

In [65]:
key = gcs.uri_to_bucket_and_key(keys[22])[1]
key

'embeddings/DeepFold/embeddings_7.csv'
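
The listing is sorted as text, so `embeddings_10.csv` comes before `embeddings_2.csv`, and index 22 happens to be file 7. A small sketch of selecting files by their number instead of their position follows. The list `example_keys` and the helper `file_number` are hypothetical stand-ins, not part of `gcs_utils`.

```python
import re

# Hypothetical, shortened stand-in for the `keys` list shown above.
example_keys = [
    'gs://capstone-fall21-protein/embeddings/DeepFold/',
    'gs://capstone-fall21-protein/embeddings/DeepFold/embeddings_0.csv',
    'gs://capstone-fall21-protein/embeddings/DeepFold/embeddings_1.csv',
    'gs://capstone-fall21-protein/embeddings/DeepFold/embeddings_10.csv',
    'gs://capstone-fall21-protein/embeddings/DeepFold/embeddings_2.csv',
]


def file_number(uri):
    """Return N from '.../embeddings_N.csv', or None for any other entry."""
    match = re.search(r'embeddings_(\d+)\.csv$', uri)
    return int(match.group(1)) if match else None


# Drop the folder placeholder and sort the real files numerically.
csv_keys = sorted(
    (k for k in example_keys if file_number(k) is not None),
    key=file_number,
)

# Look a file up by its number rather than by list position.
by_number = {file_number(k): k for k in csv_keys}
```

With this, `by_number[2]` returns the URI ending in `embeddings_2.csv` regardless of how the listing was ordered.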

## Download and Parse DeepFold Embeddings

In [68]:
# Let me download one and play with it.
df_emb = gcs.download_text(key)

In [69]:
# Decode it then split it into a list.
df_emb_decode = df_emb.decode('utf-8').split(",")

In [70]:
# 2003 items, where the first three are just metadata or empty strings
len(df_emb_decode)

2003

In [71]:
# Item index 3 is where the real data starts. That's the protein name.
df_emb_decode[0:4]

['', '0', '1\n0', 'AF-P52758-F1-model_v1']
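
The leading `'', '0', '1\n0'` suggests the file is an ordinary CSV with an index column and a header row of `,0,1`. Under that assumption, a CSV parser could do the splitting instead of `split(",")`. The sketch below uses a hypothetical three-line sample in place of the real download; the protein names and values are made up for illustration.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical miniature of one embeddings file: an index column, a name
# column "0", and a quoted vector column "1" whose text contains a newline.
# The second row has no vector at all.
sample_bytes = (
    b',0,1\n'
    b'0,AF-EXAMPLE1-F1-model_v1,"[0.5 0.\n 0.25]"\n'
    b'1,AF-EXAMPLE2-F1-model_v1,\n'
)

# The parser respects the quotes, so embedded newlines and the row index
# do not leak into neighbouring fields.
df = pd.read_csv(io.BytesIO(sample_bytes), index_col=0)
df.columns = ['protein', 'embedding_text']


def text_to_vector(text):
    """Turn '[0.5 0. 0.25]' into a float array; missing text gives an empty array."""
    if pd.isna(text):
        return np.array([])
    return np.array(text.strip('[]').split(), dtype=float)


vectors = df['embedding_text'].map(text_to_vector)
```

Whether the real files parse this cleanly has not been checked here; the manual approach in the following cells makes no such assumption.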

In [75]:
# ... Followed by the 398-length vector that represents a DeepFold embedding
df_emb_decode[4]

'"[0.02314364 0.         0.         0.         0.         0.\n 0.21306136 0.         0.         0.         0.         0.\n 0.00894155 0.14864615 0.         0.16724743 0.         0.\n 0.         0.         0.14527881 0.         0.05551712 0.01512884\n 0.         0.         0.         0.03125526 0.         0.\n 0.08913685 0.         0.         0.         0.         0.\n 0.13471818 0.         0.         0.         0.         0.11604603\n 0.         0.         0.0529888  0.00097862 0.         0.\n 0.         0.         0.00811869 0.         0.         0.02698607\n 0.         0.         0.         0.         0.07556766 0.\n 0.18254162 0.00297375 0.         0.         0.         0.\n 0.         0.         0.02795849 0.         0.         0.06582197\n 0.         0.         0.04518154 0.         0.         0.\n 0.         0.         0.07439934 0.         0.1683139  0.\n 0.         0.         0.01760291 0.         0.07878704 0.01621049\n 0.         0.12183473 0.         0.02944913 0.         0.

In [91]:
# Let's figure out how to parse the DeepFold embedding.
# There's a lot of funny stuff in here.
# First, get rid of the number after the last newline (it looks like the next row's index) and just keep the vector

sample_emb = df_emb_decode[6].rsplit('\n', 1)[0]

In [165]:
# Now get rid of the double quotes and brackets to just get the values of the array

sample_emb_np = np.array(sample_emb[2:-2].split())
sample_emb_np

array(['0.', '0.', '0.', '0.', '0.05980834', '0.', '0.11338624',
       '0.03886417', '0.02524441', '0.', '0.', '0.', '0.', '0.05950396',
       '0.01365902', '0.12355073', '0.00876983', '0.01822875',
       '0.09609001', '0.04257268', '0.04186481', '0.02684545',
       '0.03763113', '0.', '0.03659998', '0.', '0.', '0.', '0.059562',
       '0.', '0.06115272', '0.1620137', '0.03983198', '0.01684552',
       '0.02932842', '0.00660839', '0.00770707', '0.04277336',
       '0.01949005', '0.', '0.', '0.', '0.', '0.04254051', '0.07754262',
       '0.02618456', '0.', '0.', '0.', '0.', '0.', '0.', '0.01862345',
       '0.', '0.04642207', '0.', '0.', '0.07284059', '0.06610846',
       '0.12522222', '0.', '0.', '0.', '0.03716445', '0.', '0.',
       '0.04749022', '0.', '0.03763493', '0.0703754', '0.', '0.',
       '0.02273948', '0.', '0.00482177', '0.03592151', '0.', '0.', '0.',
       '0.', '0.00829109', '0.02022448', '0.12619579', '0.', '0.', '0.',
       '0.06418524', '0.', '0.', '0.09890323',

In [166]:
# Size is as expected 

sample_emb_np.shape

(398,)

In [125]:
# Items 0-2 in the embedding file are not relevant.
# Take the rest and convert into a numpy array:
#   first column is the protein name
#   second column is the DeepFold vector (size 398)

np_emb = np.array(df_emb_decode[3:]).reshape(1000, 2)

'AF-P52758-F1-model_v1'

In [233]:
# Let's take those columns separately to clean them up

X = np_emb[:, 1].reshape(1000, 1)
protein = np_emb[:, 0].reshape(1000, 1)

In [236]:
X.shape

(1000, 1)

## Parsing function

In [234]:
def parse_deepfold_embedding(unparsed):
    """Given a single DeepFold vector in text format,
    drop the metadata and convert it into a numpy array."""

    # Drop the trailing number after the last newline, then the surrounding '"[' and ']"'
    vector_text = unparsed[0].rsplit('\n', 1)[0][2:-2]
    return np.array([float(n) for n in vector_text.split()])
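
To make the slicing in the function easier to follow, here is the same sequence of steps on a hypothetical toy field (three values instead of 398), followed by a field that holds no vector. The strings are made up to mimic the raw format seen above.

```python
import numpy as np

# Hypothetical toy field: quoted, bracketed vector, then a newline and a number.
raw_field = '"[0.5 0.\n 0.25]"\n7'

# Step 1: cut off everything after the last newline (the trailing number).
without_index = raw_field.rsplit('\n', 1)[0]

# Step 2: drop the leading '"[' and the trailing ']"'.
inner = without_index[2:-2]

# Step 3: split on whitespace (spaces and newlines) and convert to floats.
vector = np.array([float(n) for n in inner.split()])

# A field with no vector, only the newline and number, yields an empty array.
empty_field = '\n60'
empty_vector = np.array(
    [float(n) for n in empty_field.rsplit('\n', 1)[0][2:-2].split()]
)
```

The empty result in the second case means a length check on the parsed arrays is enough to detect proteins without an embedding.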

In [301]:
# Map the custom parse function. 
# The result is in a list.

parsed_list = list(map(parse_deepfold_embedding, X))

# Proteins that are missing DeepFold embeddings

In [302]:
# Unfortunately, not every protein has a 398-D embedding.
# Let's find the indices of those that don't.

missing = [i for i, vec in enumerate(parsed_list) if len(vec) != 398]

In [295]:
# These proteins are missing DeepFold embeddings

np_emb[missing, 0]

array(['AF-P53803-F1-model_v1', 'AF-P56378-F1-model_v1',
       'AF-P56381-F1-model_v1', 'AF-P58511-F1-model_v1',
       'AF-P62273-F1-model_v1', 'AF-P62328-F1-model_v1',
       'AF-P62891-F1-model_v1', 'AF-P62945-F1-model_v1',
       'AF-P63313-F1-model_v1', 'AF-P80294-F1-model_v1',
       'AF-P80297-F1-model_v1', 'AF-P84101-F1-model_v1',
       'AF-Q00LT1-F1-model_v1'], dtype='<U6076')

In [296]:
[len(embedding) for embedding in np_emb[missing, 1]]

[3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4]

In [297]:
np_emb[missing, 1]

array(['\n60', '\n220', '\n221', '\n340', '\n551', '\n563', '\n600',
       '\n608', '\n648', '\n781', '\n782', '\n834', '\n987'],
      dtype='<U6076')

In [305]:
len(missing)

13

In [303]:
# Let's generate a clean dataset by eliminating the proteins with missing embeddings 
# We end up with 987 out of 1,000 proteins. 

X_clean = np.stack([x for i,x in enumerate(parsed_list) if i not in missing])
X_clean.shape

(987, 398)

In [299]:
protein_clean = np.stack([x for i, x in enumerate(protein) if i not in missing])
protein_clean.shape

(987, 1)
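
The two filtering steps above must drop exactly the same rows. One way to guarantee that, sketched here on hypothetical toy data, is to build a single boolean mask and apply it to both the vectors and the names. All `toy_*` names and `expected_dim` are illustrative only.

```python
import numpy as np

# Hypothetical toy stand-ins: three parsed vectors, the middle one empty (missing).
toy_parsed = [np.array([0.1, 0.2]), np.array([]), np.array([0.3, 0.4])]
toy_names = np.array(['AF-A', 'AF-B', 'AF-C'])
expected_dim = 2  # would be 398 for the real DeepFold vectors

# One mask drives both selections, so rows and names cannot drift apart.
has_embedding = np.array([len(v) == expected_dim for v in toy_parsed])

toy_X = np.stack([v for v, keep in zip(toy_parsed, has_embedding) if keep])
toy_kept_names = toy_names[has_embedding]
```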

# Clustering

In [316]:
from sklearn.cluster import DBSCAN
clustering = DBSCAN(eps=0.5, min_samples=5).fit(X_clean)

In [317]:
print(np.asarray(np.unique(clustering.labels_, return_counts=True)).T)

[[ -1 552]
 [  0   5]
 [  1 257]
 [  2  21]
 [  3  33]
 [  4   5]
 [  5  40]
 [  6  26]
 [  7  20]
 [  8   8]
 [  9   8]
 [ 10  12]]
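
Label `-1` is DBSCAN's noise label, so 552 of the 987 proteins were not assigned to any cluster with `eps=0.5`. A commonly used heuristic for picking `eps` is to look at each point's distance to its k-th nearest neighbor, with k tied to `min_samples`. The sketch below is not part of the original analysis; it runs on random stand-in data (`X_demo`) rather than `X_clean`, and the percentiles are arbitrary starting points.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical random stand-in for X_clean (which is 987 x 398 in the notebook).
rng = np.random.default_rng(0)
X_demo = rng.random((200, 8))

min_samples = 5

# Querying the training points themselves returns each point as its own
# first neighbor (distance ~0). DBSCAN also counts the point itself toward
# min_samples, so the last column is the smallest eps at which that point
# would become a core point.
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)
k_distances = np.sort(distances[:, -1])

# Percentiles of the sorted curve give candidate eps values to try;
# plotting k_distances and looking for an elbow is the usual next step.
candidate_eps = np.percentile(k_distances, [50, 75, 90])
```

On the real data, `X_demo` would be replaced by `X_clean`, and each candidate would be passed to `DBSCAN(eps=..., min_samples=5)` to see how the noise count changes.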
