# Evaluate cluster outcome
Haerang Lee

Across two embeddings (DeepFold and SeqVec) and two algorithms (DBSCAN, HDBSCAN), we saw similar issues arise: a vast majority of the proteins (50-90%) did not fall into a cluster. We explored a few different hypotheses to explain the phenomenon.

1. Maybe the dimensionality is too high and our algorithms can't handle. **Tried to remediate it** by reducing dimensions, but that only worsened the problem. 

2. Maybe 398D was enough to predict a small number of functions, but might be too sparse to capture enough information for our big data and task. **Potential solution: We could try to increase dimension by re-doing this with higher D.** To do that, we'd need to re-do the protein function prediction training and just re-train the top layer. 

3. Given that both DeepFold and SeqVec returned similar results, this may just be how the proteins are and not an issue with an embedding or the model. **Evaluate the quality of clusters through BLAST and functional similarity.**

So now, the focus has come down again to the evalution. 
The plan is to evaluate the clusters from three angles.
- AA sequential sim: blast 
- functional sim: 
    * Could we just use DeepFold second half to predict function? Looks like we can't. Their code just shows function labels as a number between 0 and 255, so it's not helpful. 
    * Could use gene ontology molecular functions (GO MF) from GSEA 
- structural sim: TM score, RMSD


In [1]:
from google.cloud import storage
import argparse
import gzip
import os
import sys
import time
from multiprocessing import Pool

import numpy as np
import pandas as pd
from tqdm.auto import tqdm

from utils import gcs_utils as gcs
from utils import model_and_evaluate_cluster as ev
import urllib.parse
import urllib.request

import io 

import importlib

In [4]:
# Get all the keys from gcs
allkeys = gcs.list_keys()

# Cluster model
Let's do SeqVec + HDBSCAN

In [88]:
prefix = 'embeddings/SeqVec/'
keys = gcs.list_file_paths(prefix)[1:]
# This is the list of file paths in gcs 
len(keys)

1

In [89]:
seqvec_df = gcs.download_pkl(gcs.uri_to_bucket_and_key(keys[0])[1])

In [11]:
# Convert pd Series into np arr
seqvec_np = np.stack( seqvec_df["seqvec"]).astype(np.float64)

In [216]:
seqvec_df[seqvec_df["seqvec"].str.len()==3]

Unnamed: 0,pdbx_db_accession,db_code,db_name,protein_id,pdbx_seq_one_letter_code,protein_filename,seqvec


In [12]:
seqvec_np.shape

(20504, 1024)

In [13]:
import hdbscan

In [140]:
clusterer = hdbscan.HDBSCAN(algorithm='generic', alpha=1.0, approx_min_span_tree=True,
    gen_min_span_tree=False, leaf_size=40, 
    metric='cosine', min_cluster_size=5, min_samples=1, p=None)
clusterer.fit(seqvec_np)
clusterer.labels_.max()

756

In [141]:
np.unique(clusterer.labels_, return_counts=True)[1][0:10]

array([11579,    10,    24,     6,     5,     6,    53,     6,     6,
          17])

In [212]:
np.sort(np.unique(clusterer.labels_, return_counts=True)[1])[-20:]

array([   40,    41,    43,    47,    47,    49,    50,    53,    53,
          57,    58,    65,    67,    69,    89,    99,   129,   195,
         416, 11579])

In [142]:
np.unique(clusterer.labels_, return_counts=True)[1].min()

5

In [16]:
protein_id_full= seqvec_df["protein_id"]
protein_id_full[0:10]

0    A0A024R1R8
1    A0A024RBG1
2    A0A024RCN7
3    A0A075B6H5
4    A0A075B6H7
5    A0A075B6H8
6    A0A075B6H9
7    A0A075B6I0
8    A0A075B6I1
9    A0A075B6I3
Name: protein_id, dtype: object

In [143]:
clusters = pd.DataFrame({'protein': protein_id_full, 
                         'cluster_label': clusterer.labels_})
clusters[0:10]

Unnamed: 0,protein,cluster_label
0,A0A024R1R8,-1
1,A0A024RBG1,83
2,A0A024RCN7,-1
3,A0A075B6H5,-1
4,A0A075B6H7,105
5,A0A075B6H8,-1
6,A0A075B6H9,97
7,A0A075B6I0,100
8,A0A075B6I1,-1
9,A0A075B6I3,97


In [269]:
importlib.reload(ev)
# seqvec_df[seqvec_df["seqvec"].str.len()==3]

<module 'utils.model_and_evaluate_cluster' from '/Users/haeranglee/Documents/pss/utils/model_and_evaluate_cluster.py'>

In [222]:
sil_sc, db_sc = ev.silhouette_n_davies(seqvec_np, clusterer.labels_)

 | Silhouette -0.1411 | DB sc 1.5924 | noise 11579 | k 758 | max clus size 416


In [226]:
# Try without noise
ev.silhouette_n_davies(seqvec_np[clusterer.labels_!=-1],
                       clusterer.labels_[clusterer.labels_!=-1])

 | Silhouette 0.2347 | DB sc 1.3479 | noise 0 | k 757 | max clus size 416


(0.23466609163568503, 1.347855372900973)

In [236]:
# Danie's request - why does 579 have so many proteins yet such high TM score
clusters[clusters.cluster_label==579]

Unnamed: 0,protein,cluster_label
105,A0A096LPK9,579
233,A0A0G2JMP0,579
241,A0A0G2JNH3,579
296,A0A0X1KG70,579
297,A0A126GWB0,579
...,...,...
20091,Q9Y3N9,579
20138,Q9Y4A9,579
20216,Q9Y585,579
20295,Q9Y5P0,579


In [244]:
clusters[0:100]

Unnamed: 0,protein,cluster_label
0,A0A024R1R8,-1
1,A0A024RBG1,83
2,A0A024RCN7,-1
3,A0A075B6H5,-1
4,A0A075B6H7,105
...,...,...
95,A0A096LNP1,329
96,A0A096LNT9,-1
97,A0A096LNW5,360
98,A0A096LP01,-1


In [257]:
clusters[clusters.cluster_label==578][100:120]

Unnamed: 0,protein,cluster_label
5949,P46094,578
5962,P46663,578
5983,P47211,578
5990,P47775,578
6007,P47898,578
6008,P47900,578
6009,P47901,578
6022,P48039,578
6035,P48145,578
6036,P48146,578


**Try older model**

See how silhouette and davids scores are

In [228]:
clusterer2 = hdbscan.HDBSCAN(algorithm='generic', alpha=1.0, approx_min_span_tree=True,
    gen_min_span_tree=False, leaf_size=40, 
    metric='cosine', min_cluster_size=5, min_samples=5, p=None)
clusterer2.fit(seqvec_np)
clusterer2.labels_.max()

print("Scores over all data")
ev.silhouette_n_davies(seqvec_np, clusterer2.labels_)
print("Scores without noise")
ev.silhouette_n_davies(seqvec_np[clusterer2.labels_!=-1],
                       clusterer2.labels_[clusterer2.labels_!=-1])

Scores over all data
 | Silhouette -0.1757 | DB sc 1.401 | noise 14433 | k 344 | max clus size 401
Scores without noise
 | Silhouette 0.3358 | DB sc 1.0604 | noise 0 | k 343 | max clus size 401


(0.33577963376676456, 1.0604359945846213)

In [230]:
# min_cluster_size =2 

clusterer2 = hdbscan.HDBSCAN(algorithm='generic', alpha=1.0, approx_min_span_tree=True,
    gen_min_span_tree=False, leaf_size=40, 
    metric='cosine', min_cluster_size=2, min_samples=5, p=None)
clusterer2.fit(seqvec_np)
clusterer2.labels_.max()

print("Scores over all data")
ev.silhouette_n_davies(seqvec_np, clusterer2.labels_)
print("Scores without noise")
ev.silhouette_n_davies(seqvec_np[clusterer2.labels_!=-1],
                       clusterer2.labels_[clusterer2.labels_!=-1])

Scores over all data
 | Silhouette -0.2133 | DB sc 1.476 | noise 14040 | k 616 | max clus size 401
Scores without noise
 | Silhouette 0.2985 | DB sc 1.115 | noise 0 | k 615 | max clus size 401


(0.29854277019559855, 1.1149654692123605)

In [245]:
# min samples = 1, leaf size = 60

clusterer2 = hdbscan.HDBSCAN(algorithm='generic', alpha=1.0, approx_min_span_tree=True,
    gen_min_span_tree=False, leaf_size=60, 
    metric='manhattan', min_cluster_size=5, min_samples=1, p=None)
clusterer2.fit(seqvec_np)
clusterer2.labels_.max()

print("Scores over all data")
ev.silhouette_n_davies(seqvec_np, clusterer2.labels_)
print("Scores without noise")
ev.silhouette_n_davies(seqvec_np[clusterer2.labels_!=-1],
                       clusterer2.labels_[clusterer2.labels_!=-1])

Scores over all data
 | Silhouette -0.1752 | DB sc 1.5795 | noise 12557 | k 701 | max clus size 546
Scores without noise
 | Silhouette 0.2443 | DB sc 1.3066 | noise 0 | k 700 | max clus size 546


(0.24434287366736993, 1.306627814100805)

In [247]:
# min samples = 1, leaf size = 20

clusterer2 = hdbscan.HDBSCAN(algorithm='generic', alpha=1.0, approx_min_span_tree=True,
    gen_min_span_tree=False, leaf_size=20, 
    metric='manhattan', min_cluster_size=5, min_samples=1, p=None)
clusterer2.fit(seqvec_np)
clusterer2.labels_.max()

print("Scores over all data")
ev.silhouette_n_davies(seqvec_np, clusterer2.labels_)
print("Scores without noise")
ev.silhouette_n_davies(seqvec_np[clusterer2.labels_!=-1],
                       clusterer2.labels_[clusterer2.labels_!=-1])

Scores over all data
 | Silhouette -0.1752 | DB sc 1.5795 | noise 12557 | k 701 | max clus size 546
Scores without noise
 | Silhouette 0.2443 | DB sc 1.3066 | noise 0 | k 700 | max clus size 546


(0.24434287366736993, 1.306627814100805)

In [246]:
# leaf size = 60 

clusterer2 = hdbscan.HDBSCAN(algorithm='generic', alpha=1.0, approx_min_span_tree=True,
    gen_min_span_tree=False, leaf_size=60, 
    metric='manhattan', min_cluster_size=5, min_samples=5, p=None)
clusterer2.fit(seqvec_np)
clusterer2.labels_.max()

print("Scores over all data")
ev.silhouette_n_davies(seqvec_np, clusterer2.labels_)
print("Scores without noise")
ev.silhouette_n_davies(seqvec_np[clusterer2.labels_!=-1],
                       clusterer2.labels_[clusterer2.labels_!=-1])

Scores over all data
 | Silhouette -0.2012 | DB sc 1.3971 | noise 15329 | k 274 | max clus size 401
Scores without noise
 | Silhouette 0.3605 | DB sc 0.9981 | noise 0 | k 273 | max clus size 401


(0.36047182355517876, 0.9981394913017515)

(1, 400)

# Data for Linda and Danie
XYZ coordinates for viz, plus all pair combinations within clusters for Danie

### 3D embedding of seqvec

In [144]:
# Reduce to 3D space

from sklearn.manifold import Isomap

print("X_full orig shape", seqvec_np.shape)

embedding = Isomap(n_components=3)
seqvec_np_3d = embedding.fit_transform(seqvec_np)
print("X_transformed new shape", seqvec_np_3d.shape)

X_full orig shape (20504, 1024)
X_transformed new shape (20504, 3)


In [147]:
pd.DataFrame(seqvec_np_3d, columns=['x', 'y', 'z']).head()

Unnamed: 0,x,y,z
0,-5.638844,1.582528,-5.042023
1,-1.77737,2.982981,-0.836539
2,1.113939,2.127296,1.51276
3,1.022581,9.176342,5.943778
4,0.701044,12.698147,8.236269


In [148]:
clusters_with_3d = pd.concat([clusters, pd.DataFrame(seqvec_np_3d, columns=['x', 'y', 'z'])], 
                             axis=1)
clusters_with_3d[0:10]

Unnamed: 0,protein,cluster_label,x,y,z
0,A0A024R1R8,-1,-5.638844,1.582528,-5.042023
1,A0A024RBG1,83,-1.77737,2.982981,-0.836539
2,A0A024RCN7,-1,1.113939,2.127296,1.51276
3,A0A075B6H5,-1,1.022581,9.176342,5.943778
4,A0A075B6H7,105,0.701044,12.698147,8.236269
5,A0A075B6H8,-1,0.885868,10.113062,6.64485
6,A0A075B6H9,97,1.9761,9.575064,7.970276
7,A0A075B6I0,100,1.811466,12.191325,9.289609
8,A0A075B6I1,-1,3.259366,10.761558,8.696108
9,A0A075B6I3,97,1.235302,10.945654,9.556887


In [149]:
clusters_with_3d.shape

(20504, 5)

In [168]:
clusters_with_3d.to_parquet(path="proteins_with_clusters_n_3d_embed.parquet")

In [169]:
gcs.upload_blob('proteins_with_clusters_n_3d_embed.parquet', 'samples/proteins_with_clusters_n_3d_embed.parquet')

In [170]:
test = gcs.download_parquet('samples/proteins_with_clusters_n_3d_embed.parquet')
test[0:10]

Unnamed: 0,protein,cluster_label,x,y,z
0,A0A024R1R8,-1,-5.638844,1.582528,-5.042023
1,A0A024RBG1,83,-1.77737,2.982981,-0.836539
2,A0A024RCN7,-1,1.113939,2.127296,1.51276
3,A0A075B6H5,-1,1.022581,9.176342,5.943778
4,A0A075B6H7,105,0.701044,12.698147,8.236269
5,A0A075B6H8,-1,0.885868,10.113062,6.64485
6,A0A075B6H9,97,1.9761,9.575064,7.970276
7,A0A075B6I0,100,1.811466,12.191325,9.289609
8,A0A075B6I1,-1,3.259366,10.761558,8.696108
9,A0A075B6I3,97,1.235302,10.945654,9.556887


### All protein combos

In [153]:
all_protein_combos_per_cluster = ev.find_all_protein_combos_per_cluster(clusters, exclude_unclustered=True)
all_protein_combos_per_cluster.shape

(371866, 3)

In [155]:
all_protein_combos_per_cluster[0:10]

Unnamed: 0,query_protein,target_protein,cluster
1,O96009,P00797,0
2,O96009,P07339,0
3,O96009,P0DJD7,0
4,O96009,P0DJD8,0
5,O96009,P0DJD9,0
6,O96009,P14091,0
7,O96009,P20142,0
8,O96009,P56817,0
9,O96009,Q9Y5Z0,0
10,P00797,O96009,0


In [171]:
all_protein_combos_per_cluster.to_parquet("all_protein_combos_per_cluster.parquet")
gcs.upload_blob('all_protein_combos_per_cluster.parquet', 'samples/all_protein_combos_per_cluster.parquet')

### Protein pairs with blast scores

In [165]:
proteins_w_blast = ev.join_blast(clusters, pairwise_metrics)
proteins_w_blast.shape

(588988, 5)

In [166]:
proteins_w_blast[0:10]

Unnamed: 0,query_protein,target_protein,cluster,bitscore,evalue
0,O96009,P00797,0,312.0,2.77e-103
1,O96009,P07339,0,390.0,1.15e-133
2,O96009,P0DJD7,0,290.0,3.77e-95
3,O96009,P0DJD8,0,291.0,2.61e-95
4,O96009,P0DJD9,0,292.0,9.25e-96
5,O96009,P14091,0,331.0,8.4e-111
6,O96009,P20142,0,260.0,2.26e-83
7,O96009,P56817,0,65.5,2.16e-11
8,O96009,Q9Y5Z0,0,79.7,6.28e-16
9,P00797,O96009,0,310.0,1.6e-102


In [172]:
proteins_w_blast.to_parquet("proteins_w_blast.parquet")
gcs.upload_blob('proteins_w_blast.parquet', 'samples/proteins_w_blast.parquet')

### Avg blast by cluster

In [162]:
blast_stats_by_cluster = ev.cluster_blast ("SeqVec", "HDBSCAN", proteins_w_blast)
blast_stats_by_cluster.shape

(757, 8)

In [173]:
blast_stats_by_cluster.to_parquet("blast_stats_by_cluster.parquet")
gcs.upload_blob('blast_stats_by_cluster.parquet', 'samples/blast_stats_by_cluster.parquet')

In [164]:
blast_stats_by_cluster[0:10]

Unnamed: 0,embedding,model,cluster,bitscore_mean,bitscore_std_dev,evalue_mean,evalue_std_dev,ratio_pairs_wo_blast
0,SeqVec,HDBSCAN,0,277.063333,181.083222,5.233507e-13,3.503063e-12,0.0
1,SeqVec,HDBSCAN,1,22.803306,2.053531,2.85,2.656713,0.780797
2,SeqVec,HDBSCAN,2,94.733708,31.729463,0.1697885,0.9590162,0.10101
3,SeqVec,HDBSCAN,3,230.0,90.651215,5.850000000000001e-22,1.083684e-21,0.6
4,SeqVec,HDBSCAN,4,104.054545,60.223953,0.01123546,0.04695454,0.47619
5,SeqVec,HDBSCAN,5,63.90532,49.455172,0.268949,1.09795,0.021094
6,SeqVec,HDBSCAN,6,183.2,192.977766,0.04333403,0.1034258,0.6
7,SeqVec,HDBSCAN,7,900.561538,663.085294,7.107692e-21,3.246821e-20,0.0
8,SeqVec,HDBSCAN,8,403.009524,364.067838,0.06513084,0.6818389,0.0
9,SeqVec,HDBSCAN,9,481.367754,548.443086,4.708859e-08,4.289834e-07,0.0


### All metrics by cluster

In [201]:
# Parquet
cluster_stats2.to_parquet("cluster_stats2.parquet")
gcs.upload_blob('cluster_stats2.parquet', 'samples/cluster_stats2.parquet')

In [203]:
cluster_stats2.shape

(758, 16)

In [204]:
cluster_stats2[0:10]

Unnamed: 0_level_0,D_perc,protein_confidence,count,mean_seq_len,min_seq_len,median_seq_len,max_seq_len,std_seq_len,seq_len_arr,embedding,model,bitscore_mean,bitscore_std_dev,evalue_mean,evalue_std_dev,ratio_pairs_wo_blast
cluster_label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1
-1,0.26126,72.521775,11579,640.978668,16,379.0,50384,1865.771342,"[64, 75, 130, 117, 120, 117, 118, 115, 119, 11...",,,,,,,
0,0.083798,87.255972,10,420.5,388,401.0,518,48.413152,"[420, 406, 412, 388, 388, 388, 396, 388, 501, ...",SeqVec,HDBSCAN,277.063333,181.083222,5.233507e-13,3.503063e-12,0.0
1,0.003968,83.361675,24,18.125,16,17.0,24,2.383138,"[16, 18, 20, 20, 18, 16, 17, 24, 16, 23, 16, 1...",SeqVec,HDBSCAN,22.803306,2.053531,2.85,2.656713,0.780797
2,0.811621,43.5035,6,6215.666667,313,1587.5,20861,8544.603903,"[20861, 2391, 784, 12450, 313, 495]",SeqVec,HDBSCAN,93.320213,31.847542,0.1607582,0.933666,0.096154
3,0.019166,88.869283,5,175.6,166,175.0,192,10.784248,"[166, 179, 166, 175, 192]",SeqVec,HDBSCAN,230.0,90.651215,5.850000000000001e-22,1.083684e-21,0.6
4,0.436205,55.182522,6,287.833333,134,309.5,416,115.520417,"[166, 416, 392, 310, 309, 134]",SeqVec,HDBSCAN,104.054545,60.223953,0.01123546,0.04695454,0.47619
5,0.072103,81.133274,53,339.773585,274,312.0,678,80.551857,"[301, 307, 678, 325, 323, 298, 298, 298, 332, ...",SeqVec,HDBSCAN,63.894564,49.450326,0.2690957,1.097832,0.021087
6,0.022885,93.866406,6,318.166667,277,323.0,359,32.257816,"[339, 288, 308, 338, 277, 359]",SeqVec,HDBSCAN,183.2,192.977766,0.04333403,0.1034258,0.6
7,0.754423,43.239589,6,1091.333333,928,1135.5,1161,88.691976,"[1138, 1133, 1138, 1161, 928, 1050]",SeqVec,HDBSCAN,900.561538,663.085294,7.107692e-21,3.246821e-20,0.0
8,0.122072,78.943548,17,1233.352941,1047,1196.0,1499,122.631532,"[1251, 1147, 1499, 1300, 1047, 1461, 1134, 120...",SeqVec,HDBSCAN,403.009524,364.067838,0.06513084,0.6818389,0.0


# Cluster Model Evaluation Function

Putting some thoughts down on what the evaluation function should look like. 

`evaluate_cluster`
1. **Overall cluster info:** 
    1. Shows a list of proteins in a cluster
    1. range and average of AA seq length. 
    1. **New vs. Known:** Show what % of the cluster was novel structure from AlphaFold vs. already in PDB. 
    1. **Confidence level:** Show average confidence level of the proteins in this cluster. 
2. **Sequential similarity:** Find mean and std dev of BLAST bitscore, BLAST evalue, and other metrics for each cluster. (When calculating BLAST bitscore, substitute 0 for when there's no value.) 
    1. Shows a histogram of BLAST bitscore, BLAST evalue, and other metrics
1. **Functional Similarity:** Something like % of functional agreement within a cluster. (TBD on whether to run an algorithm to predict function.)
1. **Structural Similarity:** RMSD, TM-Align. Aggregation methodology TBD. 

`evaluate_model`
1. Show the spread of average BLAST score across all clusters
2. Show the number of clusters

# Overall cluster info

## Sequence stats per protein - all data

Sequence-level stats saved at `capstone-fall21-protein/structure_files/sequences/sequences.parquet`

In [177]:
clusters.shape

(20504, 2)

In [178]:
cluster_stats = ev.sequence_stats(clusters)

In [179]:
cluster_stats[0:3]

Unnamed: 0,cluster_label,count,mean_seq_len,min_seq_len,median_seq_len,max_seq_len,std_seq_len,seq_len_arr
0,-1,11579,640.978668,16,379.0,50384,1865.771342,"[64, 75, 130, 117, 120, 117, 118, 115, 119, 11..."
1,0,10,420.5,388,401.0,518,48.413152,"[420, 406, 412, 388, 388, 388, 396, 388, 501, ..."
2,1,24,18.125,16,17.0,24,2.383138,"[16, 18, 20, 20, 18, 16, 17, 24, 16, 23, 16, 1..."


In [180]:
cluster_stats.shape

(758, 8)

### Exploration used to create module. Skip if using module.

In [1413]:
sequences = gcs.download_parquet("structure_files/sequences/sequences.parquet")

In [1414]:
sequences.columns

Index(['pdbx_db_accession', 'db_code', 'db_name', 'protein_id',
       'pdbx_seq_one_letter_code', 'protein_filename'],
      dtype='object')

In [1127]:
sequences[0:10]

Unnamed: 0,pdbx_db_accession,db_code,db_name,protein_id,pdbx_seq_one_letter_code,protein_filename
0,A0A024R1R8,A0A024R1R8_HUMAN,UNP,A0A024R1R8,MSSHEGGKKKALKQPKKQAKEMDEEEKAFKQKQKEEQKKLEVLKAK...,AF-A0A024R1R8-F1-model_v1
1,A0A024RBG1,NUD4B_HUMAN,UNP,A0A024RBG1,MMKFKPNQTRTYDREGFKKRAACLCFRSEQEDEVLLVSSSRYPDQW...,AF-A0A024RBG1-F1-model_v1
2,A0A024RCN7,A0A024RCN7_HUMAN,UNP,A0A024RCN7,MERSFVWLSCLDSDSCNLTFRLGEVESHACSPSLLWNLLTQYLPPG...,AF-A0A024RCN7-F1-model_v1
3,A0A075B6H5,A0A075B6H5_HUMAN,UNP,A0A075B6H5,METVVTTLPREGGVGPSRKMLLLLLLLGPGSGLSAVVSQHPSRVIC...,AF-A0A075B6H5-F1-model_v1
4,A0A075B6H7,KV37_HUMAN,UNP,A0A075B6H7,MEAPAQLLFLLLLWLPDTTREIVMTQSPPTLSLSPGERVTLSCRAS...,AF-A0A075B6H7-F1-model_v1
5,A0A075B6H8,KVD42_HUMAN,UNP,A0A075B6H8,MDMRVPAQLLGLLLLWLPGVRFDIQMTQSPSFLSASVGDRVSIICW...,AF-A0A075B6H8-F1-model_v1
6,A0A075B6H9,LV469_HUMAN,UNP,A0A075B6H9,MAWTPLLFLTLLLHCTGSLSQLVLTQSPSASASLGASVKLTCTLSS...,AF-A0A075B6H9-F1-model_v1
7,A0A075B6I0,LV861_HUMAN,UNP,A0A075B6I0,MSVPTMAWMMLLLGLLAYGSGVDSQTVVTQEPSFSVSPGGTVTLTC...,AF-A0A075B6I0-F1-model_v1
8,A0A075B6I1,LV460_HUMAN,UNP,A0A075B6I1,MAWTPLLLLFPLLLHCTGSLSQPVLTQSSSASASLGSSVKLTCTLS...,AF-A0A075B6I1-F1-model_v1
9,A0A075B6I3,LVK55_HUMAN,UNP,A0A075B6I3,MALTPLLLLLLSHCTGSLSRPVLTQPPSLSASPGATARLPCTLSSD...,AF-A0A075B6I3-F1-model_v1


In [1128]:
sequences["db_name"].unique()

array(['UNP'], dtype=object)

In [1129]:
sequences.shape

(20504, 6)

In [1132]:
# Add sequence length. 
sequences["seq_len"] = sequences["pdbx_seq_one_letter_code"].str.len()

In [1133]:
sequences["seq_len"].describe()

count     20504.000000
mean        724.261559
std        2714.713843
min          16.000000
25%         248.000000
50%         412.000000
75%         668.000000
max      232350.000000
Name: seq_len, dtype: float64

### Join protein-level information into clusters table

In [1395]:
update_clusters = clusters.set_index(["protein"])\
    .join(sequences.set_index(["protein_id"]),
          how='outer'
         ).reset_index()[['protein', 'cluster_label', 'db_code', 'db_name','pdbx_seq_one_letter_code', 'protein_filename', 'seq_len']]

# update_clusters=update_clusters.rename(columns={'index': "protein"})

In [1396]:
update_clusters.iloc[0:2]

Unnamed: 0,protein,cluster_label,db_code,db_name,pdbx_seq_one_letter_code,protein_filename,seq_len
0,A0A024R1R8,-1,A0A024R1R8_HUMAN,UNP,MSSHEGGKKKALKQPKKQAKEMDEEEKAFKQKQKEEQKKLEVLKAK...,AF-A0A024R1R8-F1-model_v1,64
1,A0A024RBG1,58,NUD4B_HUMAN,UNP,MMKFKPNQTRTYDREGFKKRAACLCFRSEQEDEVLLVSSSRYPDQW...,AF-A0A024RBG1-F1-model_v1,181


In [1397]:
update_clusters.shape

(20504, 7)

In [1398]:
update_clusters.columns

Index(['protein', 'cluster_label', 'db_code', 'db_name',
       'pdbx_seq_one_letter_code', 'protein_filename', 'seq_len'],
      dtype='object')

In [1399]:
cluster_stats = pd.pivot_table(update_clusters, values="seq_len",
                             index="cluster_label",
                             aggfunc={"seq_len": [len, np.mean, np.std, np.min, np.median, np.max,
                                                 lambda x: list(x)]}).reset_index()

cluster_stats[0:3]

Unnamed: 0,cluster_label,<lambda_0>,amax,amin,len,mean,median,std
0,-1,"[64, 75, 130, 117, 119, 122, 120, 123, 117, 10...",50384,16,14433,689.668122,394.0,2035.484446
1,0,"[420, 406, 412, 388, 388, 388, 396, 388, 501, ...",518,388,10,420.5,401.0,48.413152
2,1,"[16, 18, 20, 20, 18, 16, 17, 24, 16, 23, 16, 1...",24,16,24,18.125,17.0,2.383138
3,2,"[301, 307, 678, 325, 323, 298, 298, 298, 332, ...",678,274,53,339.773585,312.0,80.551857
4,3,"[1138, 1133, 1138, 1161, 928, 1984, 1050]",1984,928,7,1218.857143,1138.0,346.974749
5,4,"[1251, 1147, 1300, 1047, 1461, 1134, 1209, 119...",1461,1047,15,1216.666667,1192.0,108.770838
6,5,"[1001, 946, 1023, 1013, 1042, 1220, 1035, 1241...",1243,919,15,1066.0,1029.0,108.403347
7,6,"[515, 446, 508, 497, 506, 512, 508, 521, 516, ...",544,340,60,503.35,503.0,27.851194
8,7,"[173, 160, 173, 175, 205, 182, 170, 196]",205,160,8,179.25,174.0,14.655569
9,8,"[349, 375, 375, 375, 380, 374, 368, 386, 357, ...",419,329,18,368.777778,374.5,24.08943


In [1400]:
cluster_stats = cluster_stats.rename(columns={'<lambda_0>': "seq_len_arr", 
                              'amax': "max_seq_len", 
                              'amin': "min_seq_len", 
                              'mean': "mean_seq_len", 
                              'median': "median_seq_len", 
                              'len': "count",
                              'std': "std_seq_len"})[['cluster_label', "count", 'mean_seq_len', 'min_seq_len',  'median_seq_len', 'max_seq_len', 
       'std_seq_len','seq_len_arr']]

cluster_stats[0:3]

Unnamed: 0,cluster_label,count,mean_seq_len,min_seq_len,median_seq_len,max_seq_len,std_seq_len,seq_len_arr
0,-1,14433,689.668122,16,394.0,50384,2035.484446,"[64, 75, 130, 117, 119, 122, 120, 123, 117, 10..."
1,0,10,420.5,388,401.0,518,48.413152,"[420, 406, 412, 388, 388, 388, 396, 388, 501, ..."
2,1,24,18.125,16,17.0,24,2.383138,"[16, 18, 20, 20, 18, 16, 17, 24, 16, 23, 16, 1..."
3,2,53,339.773585,274,312.0,678,80.551857,"[301, 307, 678, 325, 323, 298, 298, 298, 332, ..."
4,3,7,1218.857143,928,1138.0,1984,346.974749,"[1138, 1133, 1138, 1161, 928, 1984, 1050]"
...,...,...,...,...,...,...,...,...
339,338,18,2421.111111,1052,1863.0,13930,2905.122107,"[13930, 1068, 1863, 2156, 1516, 2430, 1909, 10..."
340,339,12,821.583333,379,740.0,1868,456.083417,"[379, 1496, 1868, 783, 535, 963, 793, 495, 697..."
341,340,13,1037.846154,779,977.0,1287,193.07548,"[779, 1173, 1287, 1274, 1253, 804, 974, 1278, ..."
342,341,8,2029.125,1830,2071.0,2186,127.75919,"[2073, 1966, 2099, 1830, 2186, 2140, 2069, 1870]"


## Confidence level per protein - exploration

See if we can grab the confidence for amino acid level. Currently, the raw data is at the atom level.

In [213]:
importlib.reload(ev)

<module 'utils.evaluate_cluster' from '/Users/haeranglee/Documents/pss/utils/evaluate_cluster.py'>

In [183]:
cluster_conf = ev.protein_confidence_agg(clusters)

gs://capstone-fall21-protein/structure_files/atom_sites/atom_sites_part_00.parquet
gs://capstone-fall21-protein/structure_files/atom_sites/atom_sites_part_01.parquet
gs://capstone-fall21-protein/structure_files/atom_sites/atom_sites_part_02.parquet
gs://capstone-fall21-protein/structure_files/atom_sites/atom_sites_part_03.parquet
gs://capstone-fall21-protein/structure_files/atom_sites/atom_sites_part_04.parquet
gs://capstone-fall21-protein/structure_files/atom_sites/atom_sites_part_05.parquet
gs://capstone-fall21-protein/structure_files/atom_sites/atom_sites_part_06.parquet
gs://capstone-fall21-protein/structure_files/atom_sites/atom_sites_part_07.parquet
gs://capstone-fall21-protein/structure_files/atom_sites/atom_sites_part_08.parquet
gs://capstone-fall21-protein/structure_files/atom_sites/atom_sites_part_09.parquet
gs://capstone-fall21-protein/structure_files/atom_sites/atom_sites_part_10.parquet
gs://capstone-fall21-protein/structure_files/atom_sites/atom_sites_part_11.parquet
gs:/

In [184]:
cluster_conf.shape

(758, 2)

In [185]:
cluster_stats2 = ev.merge_cluster_stats(cluster_conf, cluster_stats.set_index("cluster_label"))

### Exploration to create module. Ignore if using module.

In [26]:
prefix = 'structure_files/atom_sites'
keys = gcs.list_file_paths(prefix)
keys

['gs://capstone-fall21-protein/structure_files/atom_sites/atom_sites_part_00.parquet',
 'gs://capstone-fall21-protein/structure_files/atom_sites/atom_sites_part_01.parquet',
 'gs://capstone-fall21-protein/structure_files/atom_sites/atom_sites_part_02.parquet',
 'gs://capstone-fall21-protein/structure_files/atom_sites/atom_sites_part_03.parquet',
 'gs://capstone-fall21-protein/structure_files/atom_sites/atom_sites_part_04.parquet',
 'gs://capstone-fall21-protein/structure_files/atom_sites/atom_sites_part_05.parquet',
 'gs://capstone-fall21-protein/structure_files/atom_sites/atom_sites_part_06.parquet',
 'gs://capstone-fall21-protein/structure_files/atom_sites/atom_sites_part_07.parquet',
 'gs://capstone-fall21-protein/structure_files/atom_sites/atom_sites_part_08.parquet',
 'gs://capstone-fall21-protein/structure_files/atom_sites/atom_sites_part_09.parquet',
 'gs://capstone-fall21-protein/structure_files/atom_sites/atom_sites_part_10.parquet',
 'gs://capstone-fall21-protein/structure_fi

In [1210]:
asp_00 = gcs.download_parquet("structure_files/atom_sites/atom_sites_part_00.parquet")

In [1237]:
asp_00.shape

(2806019, 28)

In [1211]:
asp_00[0:10]

Unnamed: 0,group_PDB,id,type_symbol,label_atom_id,label_alt_id,label_comp_id,label_asym_id,label_entity_id,label_seq_id,pdbx_PDB_ins_code,...,auth_asym_id,auth_atom_id,pdbx_PDB_model_num,pdbx_sifts_xref_db_acc,pdbx_sifts_xref_db_name,pdbx_sifts_xref_db_num,pdbx_sifts_xref_db_res,protein_filename,protein_id,confidence_pLDDT
0,ATOM,1,N,N,,MET,A,1,1,,...,A,N,1,A0A024R1R8,UNP,1,M,AF-A0A024R1R8-F1-model_v1,A0A024R1R8,59.87
1,ATOM,2,C,CA,,MET,A,1,1,,...,A,CA,1,A0A024R1R8,UNP,1,M,AF-A0A024R1R8-F1-model_v1,A0A024R1R8,59.87
2,ATOM,3,C,C,,MET,A,1,1,,...,A,C,1,A0A024R1R8,UNP,1,M,AF-A0A024R1R8-F1-model_v1,A0A024R1R8,59.87
3,ATOM,4,C,CB,,MET,A,1,1,,...,A,CB,1,A0A024R1R8,UNP,1,M,AF-A0A024R1R8-F1-model_v1,A0A024R1R8,59.87
4,ATOM,5,O,O,,MET,A,1,1,,...,A,O,1,A0A024R1R8,UNP,1,M,AF-A0A024R1R8-F1-model_v1,A0A024R1R8,59.87
5,ATOM,6,C,CG,,MET,A,1,1,,...,A,CG,1,A0A024R1R8,UNP,1,M,AF-A0A024R1R8-F1-model_v1,A0A024R1R8,59.87
6,ATOM,7,S,SD,,MET,A,1,1,,...,A,SD,1,A0A024R1R8,UNP,1,M,AF-A0A024R1R8-F1-model_v1,A0A024R1R8,59.87
7,ATOM,8,C,CE,,MET,A,1,1,,...,A,CE,1,A0A024R1R8,UNP,1,M,AF-A0A024R1R8-F1-model_v1,A0A024R1R8,59.87
8,ATOM,9,N,N,,SER,A,1,2,,...,A,N,1,A0A024R1R8,UNP,2,S,AF-A0A024R1R8-F1-model_v1,A0A024R1R8,55.91
9,ATOM,10,C,CA,,SER,A,1,2,,...,A,CA,1,A0A024R1R8,UNP,2,S,AF-A0A024R1R8-F1-model_v1,A0A024R1R8,55.91


In [1212]:
asp_00.columns

Index(['group_PDB', 'id', 'type_symbol', 'label_atom_id', 'label_alt_id',
       'label_comp_id', 'label_asym_id', 'label_entity_id', 'label_seq_id',
       'pdbx_PDB_ins_code', 'Cartn_x', 'Cartn_y', 'Cartn_z', 'occupancy',
       'B_iso_or_equiv', 'pdbx_formal_charge', 'auth_seq_id', 'auth_comp_id',
       'auth_asym_id', 'auth_atom_id', 'pdbx_PDB_model_num',
       'pdbx_sifts_xref_db_acc', 'pdbx_sifts_xref_db_name',
       'pdbx_sifts_xref_db_num', 'pdbx_sifts_xref_db_res', 'protein_filename',
       'protein_id', 'confidence_pLDDT'],
      dtype='object')

In [1214]:
asp_00.iloc[0]

group_PDB                                       ATOM
id                                                 1
type_symbol                                        N
label_atom_id                                      N
label_alt_id                                        
label_comp_id                                    MET
label_asym_id                                      A
label_entity_id                                    1
label_seq_id                                       1
pdbx_PDB_ins_code                               None
Cartn_x                                      -52.339
Cartn_y                                       -6.285
Cartn_z                                       37.051
occupancy                                        1.0
B_iso_or_equiv                                 59.87
pdbx_formal_charge                              None
auth_seq_id                                        1
auth_comp_id                                     MET
auth_asym_id                                  

In [1215]:
asp_00.pdbx_sifts_xref_db_res.unique()

array(['M', 'S', 'H', 'E', 'G', 'K', 'A', 'L', 'Q', 'P', 'D', 'F', 'V',
       'T', 'I', 'N', 'R', 'Y', 'C', 'W'], dtype=object)

In [1261]:
# Take unique rows of amino acid and its confidence 
asp_00 = asp_00[["protein_id", "label_seq_id", "pdbx_sifts_xref_db_res", "confidence_pLDDT"]
               ].drop_duplicates()

asp_00.shape

(360111, 4)

In [1262]:
asp_00[0:5]

Unnamed: 0,protein_id,label_seq_id,pdbx_sifts_xref_db_res,confidence_pLDDT
0,A0A024R1R8,1,M,59.87
8,A0A024R1R8,2,S,55.91
14,A0A024R1R8,3,S,62.74
20,A0A024R1R8,4,H,59.86
30,A0A024R1R8,5,E,61.56


In [1267]:
asp_00["label_seq_id"] = asp_00["label_seq_id"].astype('float64')

In [1276]:
asp_00["confidence_pLDDT"] = asp_00["confidence_pLDDT"].astype('float64')

In [1269]:
pvt = pd.pivot_table(asp_00,
               index=["protein_id"],
               values="label_seq_id",
               aggfunc=[len, np.max])
pvt.columns

MultiIndex([( 'len', 'label_seq_id'),
            ('amax', 'label_seq_id')],
           )

In [1273]:
pvt[pvt[( 'len', 'label_seq_id')]!=pvt[('amax', 'label_seq_id')]]

Unnamed: 0_level_0,len,amax
Unnamed: 0_level_1,label_seq_id,label_seq_id
protein_id,Unnamed: 1_level_2,Unnamed: 2_level_2
A0A087WUL8,19438,1400.0
A2VEC9,27948,1400.0
A4UGR9,15371,1400.0


Remember to deduplicate across the files!

In [1277]:
np.mean(asp_00["confidence_pLDDT"])
#                aggfunc=[np.min, np.max, np.mean, np.median, np.std])

67.57131823243411

In [1278]:
np.min(asp_00["confidence_pLDDT"])

16.23

In [1247]:
np.max(asp_00["confidence_pLDDT"])

98.96

In [1248]:
np.std(asp_00["confidence_pLDDT"])

24.749614549471183

In [1249]:
np.median(asp_00["confidence_pLDDT"])

72.7

**Interpreting confidence**

From the [AlphaFold FAQ](https://alphafold.ebi.ac.uk/faq)

> AlphaFold produces a per-residue estimate of its confidence on a scale from 0 - 100 . This confidence measure is called pLDDT and corresponds to the model’s predicted score on the lDDT-Cα metric. It is stored in the B-factor fields of the mmCIF and PDB files available for download (although unlike a B-factor, higher pLDDT is better). pLDDT is also used to colour-code the residues of the model in the 3D structure viewer. The following rules of thumb provide guidance on the expected reliability of a given region:
>
> *    Regions with pLDDT > 90 are expected to be modelled to high accuracy. These should be suitable for any application that benefits from high accuracy (e.g. characterising binding sites).
> *    Regions with pLDDT between 70 and 90 are expected to be modelled well (a generally good backbone prediction).
> *    Regions with pLDDT between 50 and 70 are low confidence and should be treated with caution.
> *    The 3D coordinates of regions with pLDDT < 50 often have a ribbon-like appearance and should not be interpreted. We show in our paper that pLDDT < 50 is a reasonably strong predictor of disorder, i.e. it suggests such a region is either unstructured in physiological conditions or only structured as part of a complex.
> *    Structured domains with many inter-residue contacts are likely to be more reliable than extended linkers or isolated long helices.
> *    Unphysical bond lengths and clashes do not usually appear in confident regions. Any part of a structure with several of these should be disregarded.
>
>Note that the PDB and mmCIF files contain coordinates for all regions, regardless of their pLDDT score. It is up to the user to interpret the model judiciously, in accordance with the guidance above.

**Proposal**: Since pLDDT < 50 is a "reasonably strong predictor of disorder", let's use that as a metric to identify disorder. We'll still save the full confidence level and category (high accuracy, modelled well, low confidence, do not interpret) for our visualization etc purposes.
Can I call this `high`, `medium`, `low`, and `DNI` (do not interpret)? 

In [1280]:
asp_00["confidence"] = pd.cut(asp_00["confidence_pLDDT"], 
                                   [0, 50, 70, 90, 100], 
                                   labels=['D', 'L', 'M', 'H'],
                                   right=False)

In [1281]:
asp_00["confidence"].unique()

['L', 'M', 'H', 'D']
Categories (4, object): ['D' < 'L' < 'M' < 'H']

In [1282]:
asp_00[0:5]

Unnamed: 0,protein_id,label_seq_id,pdbx_sifts_xref_db_res,confidence_pLDDT,confidence
0,A0A024R1R8,1.0,M,59.87,L
8,A0A024R1R8,2.0,S,55.91,L
14,A0A024R1R8,3.0,S,62.74,L
20,A0A024R1R8,4.0,H,59.86,L
30,A0A024R1R8,5.0,E,61.56,L


In [1290]:
asp_00_pivot = pd.pivot_table(asp_00,
                              index=["protein_id"],
                              columns=["confidence"],
                              values=["label_seq_id"],
                              aggfunc=[len]
                             )

In [1291]:
asp_00_pivot[0:10].fillna(0)

Unnamed: 0_level_0,len,len,len,len
Unnamed: 0_level_1,label_seq_id,label_seq_id,label_seq_id,label_seq_id
confidence,D,L,M,H
protein_id,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3
A0A024R1R8,0.0,36.0,18.0,10.0
A0A024RBG1,30.0,12.0,10.0,129.0
A0A024RCN7,27.0,48.0,0.0,0.0
A0A075B6H5,6.0,28.0,4.0,92.0
A0A075B6H7,1.0,19.0,5.0,91.0
A0A075B6H8,1.0,21.0,3.0,92.0
A0A075B6H9,2.0,21.0,5.0,91.0
A0A075B6I0,3.0,16.0,14.0,89.0
A0A075B6I1,1.0,20.0,4.0,95.0
A0A075B6I3,1.0,23.0,7.0,92.0


## Show confidence level per protein - all data

In [1350]:
def protein_confidence_agg(n=None):
    """Downloads protein files, then summarizes amino acid sequence-level confidence information."""
    
    prefix = 'structure_files/atom_sites'
    keys = gcs.list_file_paths(prefix)
    
    if not n:
        n = len(keys)

    # Download, dedupe, and add 
    asp = pd.DataFrame(columns=["protein_id", "label_seq_id", "pdbx_sifts_xref_db_res", "confidence_pLDDT"])

    for key in keys[:n]:
        print(key)
        # Download file 
        asp_i = gcs.download_parquet(gcs.uri_to_bucket_and_key(key)[1])

        # Dedupicate it at the amino acid level 
        asp_i = asp_i[["protein_id", "label_seq_id", "pdbx_sifts_xref_db_res", "confidence_pLDDT"]
                   ].drop_duplicates()
        # Convert data type 
        asp_i["label_seq_id"] = asp_i["label_seq_id"].astype('float64')
        asp_i["confidence_pLDDT"] = asp_i["confidence_pLDDT"].astype('float64')

        # Add to the all dataset
        asp = asp.append(asp_i)

    # Deduplicate asp because there may be data present across different files 
    asp = asp[["protein_id", "label_seq_id", "pdbx_sifts_xref_db_res", "confidence_pLDDT"]
                   ].drop_duplicates()
    
    # Find avg confidence per protein
    avg_conf_protein = pd.pivot_table(asp,
                                      index="protein_id",
                                      values="confidence_pLDDT",
                                      aggfunc=np.mean
                                     ).reset_index()
    avg_conf_protein.columns=["protein", "protein_confidence"]

    # Add confidence category for amino acid
    asp["confidence"] = pd.cut(asp["confidence_pLDDT"], 
                                       [0, 50, 70, 90, 100], 
                                       labels=['D', 'L', 'M', 'H'],
                                       right=False)

    # Show count of amino acids under each confidence category
    asp_pivot = pd.pivot_table(asp,
                              index=["protein_id"],
                              columns=["confidence"],
                              values=["label_seq_id"],
                              aggfunc=[len]
                             ).fillna(0).reset_index()

    asp_pivot.columns=["protein", "D", "L", "M", "H"]
    
    asp = asp_pivot.set_index("protein").join(avg_conf_protein.set_index("protein"), how="outer").reset_index()
    
    asp["D_perc"] = asp["D"]/(asp["D"] + asp["L"] + asp["M"] + asp["H"])
    
    # Return df with the conf data for amino acids and protein level 
    return asp

In [1354]:
asp  = protein_confidence_agg()

gs://capstone-fall21-protein/structure_files/atom_sites/atom_sites_part_00.parquet
gs://capstone-fall21-protein/structure_files/atom_sites/atom_sites_part_01.parquet
gs://capstone-fall21-protein/structure_files/atom_sites/atom_sites_part_02.parquet
gs://capstone-fall21-protein/structure_files/atom_sites/atom_sites_part_03.parquet
gs://capstone-fall21-protein/structure_files/atom_sites/atom_sites_part_04.parquet
gs://capstone-fall21-protein/structure_files/atom_sites/atom_sites_part_05.parquet
gs://capstone-fall21-protein/structure_files/atom_sites/atom_sites_part_06.parquet
gs://capstone-fall21-protein/structure_files/atom_sites/atom_sites_part_07.parquet
gs://capstone-fall21-protein/structure_files/atom_sites/atom_sites_part_08.parquet
gs://capstone-fall21-protein/structure_files/atom_sites/atom_sites_part_09.parquet
gs://capstone-fall21-protein/structure_files/atom_sites/atom_sites_part_10.parquet
gs://capstone-fall21-protein/structure_files/atom_sites/atom_sites_part_11.parquet
gs:/

Protein confidence exists in another file. SEe if it's the same?

In [1359]:
asp.shape

(20504, 7)

In [1360]:
asp[0:10]

Unnamed: 0,protein,D,L,M,H,protein_confidence,D_perc
0,A0A024R1R8,0.0,36.0,18.0,10.0,72.220313,0.0
1,A0A024RBG1,30.0,12.0,10.0,129.0,83.989116,0.165746
2,A0A024RCN7,27.0,48.0,0.0,0.0,54.187467,0.36
3,A0A075B6H5,6.0,28.0,4.0,92.0,85.970923,0.046154
4,A0A075B6H7,1.0,19.0,5.0,91.0,90.421034,0.008621
5,A0A075B6H8,1.0,21.0,3.0,92.0,91.096667,0.008547
6,A0A075B6H9,2.0,21.0,5.0,91.0,89.549328,0.016807
7,A0A075B6I0,3.0,16.0,14.0,89.0,89.309672,0.02459
8,A0A075B6I1,1.0,20.0,4.0,95.0,89.48325,0.008333
9,A0A075B6I3,1.0,23.0,7.0,92.0,89.388374,0.00813


In [1403]:
asp.columns

Index(['protein', 'D', 'L', 'M', 'H', 'protein_confidence', 'D_perc',
       'cluster_label'],
      dtype='object')

In [1404]:
clusters.columns

Index(['protein', 'cluster_label'], dtype='object')

In [1405]:
asp = asp.set_index(["protein", "cluster_label"])\
            .join(clusters.set_index(["protein", "cluster_label"]),
                  how='outer'
                 ).reset_index()

In [1406]:
asp[0:10]

Unnamed: 0,protein,cluster_label,D,L,M,H,protein_confidence,D_perc
0,A0A024R1R8,-1,0.0,36.0,18.0,10.0,72.220313,0.0
1,A0A024RBG1,58,30.0,12.0,10.0,129.0,83.989116,0.165746
2,A0A024RCN7,-1,27.0,48.0,0.0,0.0,54.187467,0.36
3,A0A075B6H5,-1,6.0,28.0,4.0,92.0,85.970923,0.046154
4,A0A075B6H7,50,1.0,19.0,5.0,91.0,90.421034,0.008621
5,A0A075B6H7,51,,,,,,
6,A0A075B6H8,-1,1.0,21.0,3.0,92.0,91.096667,0.008547
7,A0A075B6H9,-1,2.0,21.0,5.0,91.0,89.549328,0.016807
8,A0A075B6I0,-1,3.0,16.0,14.0,89.0,89.309672,0.02459
9,A0A075B6I1,-1,1.0,20.0,4.0,95.0,89.48325,0.008333


In [1407]:
cluster_conf = pd.pivot_table(asp, index="cluster_label",
              values=["protein_confidence", "D_perc"],
              aggfunc=np.mean)

cluster_conf[0:10]

Unnamed: 0_level_0,D_perc,protein_confidence
cluster_label,Unnamed: 1_level_1,Unnamed: 2_level_1
-1,0.25279,73.247063
0,0.083798,87.255972
1,0.003968,83.361675
2,0.072103,81.133274
3,0.754423,43.239589
4,0.113511,79.536018
5,0.07394,83.335785
6,0.02151,92.003835
7,0.17063,73.211089
8,0.032757,95.000711


In [1408]:
cluster_stats = cluster_stats.set_index("cluster_label").join(cluster_conf)

In [1409]:
cluster_stats[0:5]

Unnamed: 0_level_0,count,mean_seq_len,min_seq_len,median_seq_len,max_seq_len,std_seq_len,seq_len_arr,D_perc,protein_confidence
cluster_label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
-1,14433,689.668122,16,394.0,50384,2035.484446,"[64, 75, 130, 117, 119, 122, 120, 123, 117, 10...",0.25279,73.247063
0,10,420.5,388,401.0,518,48.413152,"[420, 406, 412, 388, 388, 388, 396, 388, 501, ...",0.083798,87.255972
1,24,18.125,16,17.0,24,2.383138,"[16, 18, 20, 20, 18, 16, 17, 24, 16, 23, 16, 1...",0.003968,83.361675
2,53,339.773585,274,312.0,678,80.551857,"[301, 307, 678, 325, 323, 298, 298, 298, 332, ...",0.072103,81.133274
3,7,1218.857143,928,1138.0,1984,346.974749,"[1138, 1133, 1138, 1161, 928, 1984, 1050]",0.754423,43.239589


In [1420]:
cluster_stats[cluster_stats["D_perc"]>0.2527].shape#[0:10]

(76, 9)

In [1411]:
cluster_stats[cluster_stats["std_seq_len"]>2036]

Unnamed: 0_level_0,count,mean_seq_len,min_seq_len,median_seq_len,max_seq_len,std_seq_len,seq_len_arr,D_perc,protein_confidence
cluster_label,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
29,18,7490.888889,261,566.5,93707,21982.210727,"[558, 400, 310, 385, 720, 15323, 310, 575, 243...",0.546674,51.187152
61,19,1499.526316,270,569.0,16771,3721.44102,"[465, 345, 597, 661, 381, 1231, 392, 830, 610,...",0.094613,81.887347
63,15,21999.4,13892,23627.0,25046,3080.253185,"[16707, 25046, 13892, 23671, 22307, 20916, 250...",0.047611,91.80581
77,11,7546.272727,860,1613.0,25055,10705.746122,"[1905, 1615, 1613, 860, 1207, 873, 25055, 2374...",0.133493,76.747863
101,8,16111.375,832,18860.0,26581,10233.85964,"[832, 23788, 26581, 15371, 23757, 859, 15354, ...",0.279917,73.737141
118,12,6161.833333,1172,1792.0,18095,6835.487159,"[1761, 18095, 1786, 1609, 13922, 13875, 1798, ...",0.163346,82.384657
141,13,18894.615385,238,1251.0,232350,64137.332261,"[238, 354, 1685, 1465, 1141, 477, 1141, 1274, ...",0.049166,92.189357
153,16,3111.625,439,1086.0,12512,4663.458171,"[443, 575, 703, 12471, 12512, 1184, 493, 1721,...",0.369979,68.083939
176,17,14894.588235,894,2472.0,53197,18631.086351,"[2390, 911, 2419, 2137, 18085, 894, 16633, 236...",0.069324,91.867803
194,9,6324.555556,415,1015.0,18023,8135.981765,"[1013, 18023, 986, 994, 16687, 16764, 1024, 41...",0.127711,78.484717


### Observation summary

* The unclustered proteins (label `-1`) have an average "Do not use" confidence level presence of 25.2790%. There are 75 other clusters in the toatl of 339 with a D percentage higher than this. For some of these with a really high percentage probably have less credibility. 
* 27 clusters have 50% or more amino acids that fall under the D category, or "Do not interpret". 27 out of 344 isn't too many, but these clusters cannot be trusted.
* The unclustered proteins also had a really big std dev of sequence length. There are only 15 clusters that have a greater std dev of seq length. 
* Some clusters consist of bigger proteins than others. That's a promising sign that our cluster model is picking up on the size

In [1417]:
cluster_stats.shape

(344, 9)

In [1418]:
cluster_stats.columns

Index(['count', 'mean_seq_len', 'min_seq_len', 'median_seq_len', 'max_seq_len',
       'std_seq_len', 'seq_len_arr', 'D_perc', 'protein_confidence'],
      dtype='object')

In [1419]:
# Save as CSV
# Upload to google docs 
cluster_stats[['count', 'mean_seq_len', 'min_seq_len', 'median_seq_len', 'max_seq_len',
       'std_seq_len', 'D_perc', 'protein_confidence']
             ].to_csv('cluster_stats_hdbscan_seqvec.csv', sep='\t')

# run BLAST

BLAST gives us a sequential similarity metric. It consists of a bitscore and an evalue.

In [112]:
# importlib.reload(ev)

<module 'utils.evaluate_cluster' from '/Users/haeranglee/Documents/pss/utils/evaluate_cluster.py'>

In [191]:
clusters_w_blast = ev.join_blast(clusters)

In [192]:
clusters_w_blast[0:3]

Unnamed: 0,query_protein,target_protein,cluster,bitscore,evalue
0,O96009,P00797,0,312.0,2.77e-103
1,O96009,P07339,0,390.0,1.15e-133
2,O96009,P0DJD7,0,290.0,3.77e-95


In [193]:
clusters_w_blast.shape

(766147, 5)

In [194]:
stats_by_cluster = ev.cluster_blast(embedding_name="SeqVec", 
                                    model_name="HDBSCAN", 
                                    clusters_w_blast=clusters_w_blast)


In [195]:
stats_by_cluster[0:10]

Unnamed: 0,embedding,model,cluster,bitscore_mean,bitscore_std_dev,evalue_mean,evalue_std_dev,ratio_pairs_wo_blast
0,SeqVec,HDBSCAN,0,277.063333,181.083222,5.233507e-13,3.503063e-12,0.0
1,SeqVec,HDBSCAN,1,22.803306,2.053531,2.85,2.656713,0.780797
2,SeqVec,HDBSCAN,2,93.320213,31.847542,0.1607582,0.933666,0.096154
3,SeqVec,HDBSCAN,3,230.0,90.651215,5.850000000000001e-22,1.083684e-21,0.6
4,SeqVec,HDBSCAN,4,104.054545,60.223953,0.01123546,0.04695454,0.47619
5,SeqVec,HDBSCAN,5,63.894564,49.450326,0.2690957,1.097832,0.021087
6,SeqVec,HDBSCAN,6,183.2,192.977766,0.04333403,0.1034258,0.6
7,SeqVec,HDBSCAN,7,900.561538,663.085294,7.107692e-21,3.246821e-20,0.0
8,SeqVec,HDBSCAN,8,403.009524,364.067838,0.06513084,0.6818389,0.0
9,SeqVec,HDBSCAN,9,481.367754,548.443086,4.708859e-08,4.289834e-07,0.0


In [196]:
stats_by_cluster.to_csv("blast_stats_by_cluster.csv")

In [197]:
cluster_conf.index

Int64Index([ -1,   0,   1,   2,   3,   4,   5,   6,   7,   8,
            ...
            747, 748, 749, 750, 751, 752, 753, 754, 755, 756],
           dtype='int64', name='cluster_label', length=758)

In [198]:
cluster_stats2 = ev.merge_cluster_stats(cluster_conf, cluster_stats.set_index("cluster_label"))
cluster_stats2 = ev.merge_cluster_stats(cluster_stats2, stats_by_cluster.set_index("cluster"))

In [199]:
cluster_stats2.columns

Index(['D_perc', 'protein_confidence', 'count', 'mean_seq_len', 'min_seq_len',
       'median_seq_len', 'max_seq_len', 'std_seq_len', 'seq_len_arr',
       'embedding', 'model', 'bitscore_mean', 'bitscore_std_dev',
       'evalue_mean', 'evalue_std_dev', 'ratio_pairs_wo_blast'],
      dtype='object')

In [200]:
cluster_stats2[['D_perc', 'protein_confidence', 'count', 'mean_seq_len', 'min_seq_len',
       'median_seq_len', 'max_seq_len', 'std_seq_len', 
                #'seq_len_arr',
                
       'embedding', 'model', 'bitscore_mean', 'bitscore_std_dev',
       'evalue_mean', 'evalue_std_dev', 'ratio_pairs_wo_blast']].to_csv("cluster_stats_with_blast.csv", sep='\t')

### Exploration: Ignore if using utils

In [39]:
import io 
# Imported pairwise BLAST metrics Danie calculated
pairwise_metrics = pd.read_csv(io.StringIO(gcs.download_text('annotations/blast_annotations.csv')))


In [40]:

pairwise_metrics.shape

(7791727, 4)

In [41]:
pairwise_metrics[0:10]

Unnamed: 0,query_protein,target_protein,bitscore,evalue
0,A0A024R1R8,Q9Y2S6,57.4,3.77e-13
1,A0A024R1R8,P98177,31.2,0.03
2,A0A024R1R8,Q9H7D0,30.4,0.045
3,A0A024R1R8,Q96CB9,29.3,0.15
4,A0A024R1R8,Q9UPA5,26.2,1.7
5,A0A024R1R8,Q9UPA5,26.2,1.7
6,A0A024R1R8,Q9UPA5,26.2,1.7
7,A0A024R1R8,Q9UPA5,26.2,1.7
8,A0A024R1R8,Q9UPA5,26.2,1.7
9,A0A024R1R8,Q8N2Y8,25.0,3.9


**Question out to Danie** why are there duplicates

In [42]:
pairwise_metrics = pairwise_metrics.drop_duplicates()
pairwise_metrics.shape

(6268501, 4)

**consider** whether we should take out unclustered proteins `-1` from the blast lookup, espcially when that group is so big

In [44]:
# # copied from Danie's code
import itertools

# This took two hours! 

# Find all combinations of proteins WITHIN clusters
all_combos = pd.DataFrame()

for clust in sorted(clusters.cluster_label.unique()[1:]):
    print(clust)
    clust_combos = pd.DataFrame(itertools.product(clusters[clusters.cluster_label == clust].protein, repeat=2),
                                columns=['query_protein', 'target_protein'])
    clust_combos = clust_combos[clust_combos.query_protein != clust_combos.target_protein]
    clust_combos['cluster'] = clust
    all_combos = all_combos.append(clust_combos)
    

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
27

In [45]:
all_combos[0:10] 

Unnamed: 0,query_protein,target_protein,cluster
1,O96009,P00797,0
2,O96009,P07339,0
3,O96009,P0DJD7,0
4,O96009,P0DJD8,0
5,O96009,P0DJD9,0
6,O96009,P14091,0
7,O96009,P20142,0
8,O96009,P56817,0
9,O96009,Q9Y5Z0,0
10,P00797,O96009,0


In [47]:
all_combos.shape

(467302, 3)

In [62]:
all_combos[["query_protein", "target_protein"]].dropna().shape

(467302, 2)

In [71]:
all_combos[(all_combos.query_protein=="A0A087WUV0") &
(all_combos.target_protein=="A0A1W2PQL4")]

Unnamed: 0,query_protein,target_protein,cluster
1,A0A087WUV0,A0A1W2PQL4,321


In [72]:
pairwise_metrics[(pairwise_metrics.query_protein=="A0A087WUV0") &
(pairwise_metrics.target_protein=="A0A1W2PQL4")]

Unnamed: 0,query_protein,target_protein,bitscore,evalue
20208,A0A087WUV0,A0A1W2PQL4,267.0,1.09e-84
20209,A0A087WUV0,A0A1W2PQL4,258.0,4.27e-81
20210,A0A087WUV0,A0A1W2PQL4,252.0,7.03e-79
20211,A0A087WUV0,A0A1W2PQL4,228.0,8.43e-70


In [63]:
pairwise_metrics[["query_protein", "target_protein"]].dropna().shape

(6268501, 2)

In [64]:
pairwise_metrics.dropna().shape

(6268501, 4)

In [54]:
# LEFT JOIN with pairwise metrics
blast_result = all_combos.set_index(['query_protein','target_protein']).\
         join(pairwise_metrics.set_index(['query_protein','target_protein']), 
              on=['query_protein', 'target_protein'], 
              how='left'
             ).reset_index()
blast_result[0:10]

Unnamed: 0,query_protein,target_protein,cluster,bitscore,evalue
0,O96009,P00797,0,312.0,2.77e-103
1,O96009,P07339,0,390.0,1.15e-133
2,O96009,P0DJD7,0,290.0,3.77e-95
3,O96009,P0DJD8,0,291.0,2.61e-95
4,O96009,P0DJD9,0,292.0,9.25e-96
5,O96009,P14091,0,331.0,8.4e-111
6,O96009,P20142,0,260.0,2.26e-83
7,O96009,P56817,0,65.5,2.16e-11
8,O96009,Q9Y5Z0,0,79.7,6.28e-16
9,P00797,O96009,0,310.0,1.6e-102


(warning: If cluster -1 is too big and included, this will take way too long)

In [None]:
result=0

In [56]:
blast_result.shape

(1398897, 5)

In [57]:
blast_result.dropna().shape

(1353846, 5)

In [66]:
test = blast_result.pivot_table(index=['query_protein', 'target_protein'],
                        aggfunc=len).reset_index()

test[0:10]

Unnamed: 0,query_protein,target_protein,bitscore,cluster,evalue
0,A0A024RBG1,O95989,1,1,1
1,A0A024RBG1,P0C024,1,1,1
2,A0A024RBG1,P36639,1,1,1
3,A0A024RBG1,P50583,1,1,1
4,A0A024RBG1,Q8NFP7,1,1,1
5,A0A024RBG1,Q96G61,1,1,1
6,A0A024RBG1,Q9NV35,1,1,1
7,A0A024RBG1,Q9NZJ9,1,1,1
8,A0A024RBG1,Q9UKK9,1,1,1
9,A0A075B6H7,A0A075B6S4,1,1,1


**Not sure why there are duplicates in `pairwise_metrics`. Let me take it out for now.**

In [68]:
test[test["bitscore"]>1]

Unnamed: 0,query_protein,target_protein,bitscore,cluster,evalue
637,A0A087WUV0,A0A1W2PQL4,4,4,4
638,A0A087WUV0,A2RRD8,5,5,5
639,A0A087WUV0,A6NDX5,7,7,7
640,A0A087WUV0,A6NHJ4,5,5,5
641,A0A087WUV0,A6NK53,3,3,3
...,...,...,...,...,...
467188,Q9Y6Y0,Q9P2J3,3,3,3
467189,Q9Y6Y0,Q9P2N7,3,3,3
467190,Q9Y6Y0,Q9UH77,3,3,3
467191,Q9Y6Y0,Q9Y2M5,2,2,2


In [70]:
blast_result[(blast_result.query_protein=="A0A087WUV0") &
(blast_result.target_protein=="A0A1W2PQL4")]

Unnamed: 0,query_protein,target_protein,cluster,bitscore,evalue
464379,A0A087WUV0,A0A1W2PQL4,321,267.0,1.09e-84
464380,A0A087WUV0,A0A1W2PQL4,321,258.0,4.27e-81
464381,A0A087WUV0,A0A1W2PQL4,321,252.0,7.03e-79
464382,A0A087WUV0,A0A1W2PQL4,321,228.0,8.43e-70


In [78]:
blast_result = blast_result.pivot_table(index=["query_protein", "target_protein", 'cluster'],
                        values=["bitscore", "evalue"],
                        aggfunc = np.mean).reset_index()

In [73]:
pairwise_metrics.query_protein.unique().shape

(20461,)

In [79]:
blast_result["evalue"].max()

10.0

**Fill NA for blast**

In [80]:
fill_vals = {"bitscore": 0, "evalue": 10}
blast_result = blast_result.fillna(0)[0:10]

In [81]:
clusters[0:10]

Unnamed: 0,protein,cluster_label
0,A0A024R1R8,-1
1,A0A024RBG1,58
2,A0A024RCN7,-1
3,A0A075B6H5,-1
4,A0A075B6H7,51
5,A0A075B6H8,-1
6,A0A075B6H9,-1
7,A0A075B6I0,-1
8,A0A075B6I1,-1
9,A0A075B6I3,-1


In [None]:
'DeepFold', 'DBSCAN'

In [1009]:
stats_by_cluster.sort_values(by="bitscore_mean", ascending=False)[0:20]

Unnamed: 0,embedding,model,cluster,bitscore_mean,bitscore_std_dev,evalue_mean,evalue_std_dev,ratio_pairs_wo_blast
251,DeepFold,DBSCAN,783,532.5,14.849242,0.0,0.0,0.997151
590,DeepFold,DBSCAN,471,425.0,0.0,3.0050000000000003e-144,7.071068e-147,0.984848
26,DeepFold,DBSCAN,841,416.819623,502.348866,0.03774785,0.453166,0.997656
51,DeepFold,DBSCAN,552,363.96,217.598612,0.0002800599,0.0009797781,0.997869
277,DeepFold,DBSCAN,143,318.5,4.949747,4.988645e-107,7.0447e-107,0.977778
128,DeepFold,DBSCAN,104,314.428571,106.017125,0.06428571,0.2928593,0.933333
508,DeepFold,DBSCAN,69,302.07,209.517276,8.86e-15,2.234075e-14,0.991342
2,DeepFold,DBSCAN,8,301.0,0.0,4.77e-102,1.555635e-103,0.989011
28,DeepFold,DBSCAN,736,295.0,0.0,9.185e-100,2.12132e-102,0.987179
180,DeepFold,DBSCAN,300,246.3125,106.554445,4.4125e-19,8.888907999999999e-19,0.980952


In [1050]:
cluster_sizes = pd.DataFrame(clusters['cluster_label'].value_counts(),
                ).reset_index()
cluster_sizes.columns=["cluster_label", "num_proteins"]

stats_by_cluster2 = stats_by_cluster.set_index(['cluster']).\
         join(cluster_sizes.set_index(['cluster_label']), 
              on=['cluster'], 
              how='left'
             ).reset_index()

In [1055]:
stats_by_cluster2.sort_values(by="bitscore_mean", ascending=False)[0:30]

Unnamed: 0,cluster,embedding,model,bitscore_mean,bitscore_std_dev,evalue_mean,evalue_std_dev,ratio_pairs_wo_blast,num_proteins
251,783,DeepFold,DBSCAN,532.5,14.849242,0.0,0.0,0.997151,27
590,471,DeepFold,DBSCAN,425.0,0.0,3.0050000000000003e-144,7.071068e-147,0.984848,12
26,841,DeepFold,DBSCAN,416.819623,502.348866,0.03774785,0.453166,0.997656,69
51,552,DeepFold,DBSCAN,363.96,217.598612,0.0002800599,0.0009797781,0.997869,69
277,143,DeepFold,DBSCAN,318.5,4.949747,4.988645e-107,7.0447e-107,0.977778,10
128,104,DeepFold,DBSCAN,314.428571,106.017125,0.06428571,0.2928593,0.933333,10
508,69,DeepFold,DBSCAN,302.07,209.517276,8.86e-15,2.234075e-14,0.991342,22
2,8,DeepFold,DBSCAN,301.0,0.0,4.77e-102,1.555635e-103,0.989011,14
28,736,DeepFold,DBSCAN,295.0,0.0,9.185e-100,2.12132e-102,0.987179,13
180,300,DeepFold,DBSCAN,246.3125,106.554445,4.4125e-19,8.888907999999999e-19,0.980952,15


In [1069]:
result[result.cluster.isin([783, 471, 552])].sort_values(by="cluster", ascending=False).\
to_csv('clusters_with_few_matches.csv', index=False)

In [1071]:
result[result.cluster.isin([783, 471, 552])]["cluster"].value_counts()

552    4707
783     702
471     132
Name: cluster, dtype: int64

In [1072]:
gcs.upload_blob('clusters_with_few_matches.csv', 'samples/clusters_with_few_matches.csv')

In [None]:
from sklearn.manifold import Isomap

embedding = Isomap(n_components=100)
print("X_full orig shape", X_full.shape)
X_transformed = embedding.fit_transform(X_full)
print("X_transformed new shape", X_transformed.shape)

# Protein Functions

See if we can get protein functions from GSEA and map that to our data. Then, this helps evaluate if our clustering is somewhat reasonable. 

Source: Ontology gene sets section of http://www.gsea-msigdb.org/gsea/downloads.jsp 

More info on Gene Ontology: http://geneontology.org/docs/ontology-documentation/
1. **Molecular function**: describe activities that occur at the molecular level, such as “catalysis” or “transport”. GO molecular function terms represent activities rather than the entities (molecules or complexes) that perform the actions
1. **Cellular component**: locations relative to cellular structures in which a gene product performs a function, either cellular compartments (e.g., mitochondrion), or stable macromolecular complexes of which they are parts (e.g., the ribosome)
1. **Biological process**: The larger processes, or ‘biological programs’ accomplished by multiple molecular activities. Examples of broad biological process terms are DNA repair or signal transduction.

Below, I start with the molecular function. It seems like a reasonable place to start because GO MF represents activities. For CC, I assume a single complex can have many different proteins with varying shapes. Same for BP. 

In [790]:
# gcs.upload_blob('c5.go.mf.v7.4.symbols.gmt', 'c5.go.mf.v7.4.symbols.gmt')

In [None]:
# Get all the keys from gcs
allkeys = gcs.list_keys()

[k for k in allkeys if 'symbols' in k]

In [13]:
# GO molecular functions as Gene Symbols
gomf = gcs.download_text('c5.go.mf.v7.4.symbols.gmt')
gomf_dict = {e.split("\t")[0]: e.split("\t")[2:] for e in gomf.split("\n")}
len(gomf_dict)

1709

In [298]:
for k, v in gomf_dict.items():
    print(k, len(v))

GOMF_SINGLE_STRANDED_DNA_ENDODEOXYRIBONUCLEASE_ACTIVITY 10
GOMF_MANNOSYLTRANSFERASE_ACTIVITY 25
GOMF_PEPTIDYLTRANSFERASE_ACTIVITY 5
GOMF_TRNA_BINDING 67
GOMF_L_ORNITHINE_TRANSMEMBRANE_TRANSPORTER_ACTIVITY 5
GOMF_SULFUR_AMINO_ACID_TRANSMEMBRANE_TRANSPORTER_ACTIVITY 7
GOMF_SUCCINATE_DEHYDROGENASE_ACTIVITY 5
GOMF_MICROFILAMENT_MOTOR_ACTIVITY 29
GOMF_SNARE_BINDING 114
GOMF_RDNA_BINDING 8
GOMF_DNA_SECONDARY_STRUCTURE_BINDING 36
GOMF_3_KETO_STEROL_REDUCTASE_ACTIVITY 5
GOMF_PEROXISOME_TARGETING_SEQUENCE_BINDING 6
GOMF_MAGNESIUM_ION_BINDING 216
GOMF_ENDOPOLYPHOSPHATASE_ACTIVITY 5
GOMF_RNA_CAP_BINDING 19
GOMF_RNA_7_METHYLGUANOSINE_CAP_BINDING 12
GOMF_FOUR_WAY_JUNCTION_DNA_BINDING 17
GOMF_Y_FORM_DNA_BINDING 5
GOMF_BUBBLE_DNA_BINDING 8
GOMF_ADENYL_NUCLEOTIDE_EXCHANGE_FACTOR_ACTIVITY 12
GOMF_INOSITOL_HEXAKISPHOSPHATE_KINASE_ACTIVITY 8
GOMF_INOSITOL_HEXAKISPHOSPHATE_5_KINASE_ACTIVITY 5
GOMF_TRANSLATION_REPRESSOR_ACTIVITY_MRNA_REGULATORY_ELEMENT_BINDING 15
GOMF_RNA_POLYMERASE_II_CORE_PROMOTER_SEQUEN

GOMF_INTRAMOLECULAR_TRANSFERASE_ACTIVITY_PHOSPHOTRANSFERASES 10
GOMF_LIGASE_ACTIVITY 163
GOMF_LIGASE_ACTIVITY_FORMING_CARBON_OXYGEN_BONDS 41
GOMF_LIGASE_ACTIVITY_FORMING_CARBON_SULFUR_BONDS 41
GOMF_ACID_THIOL_LIGASE_ACTIVITY 31
GOMF_LIGASE_ACTIVITY_FORMING_CARBON_NITROGEN_BONDS 47
GOMF_ACID_AMINO_ACID_LIGASE_ACTIVITY 21
GOMF_CARBON_NITROGEN_LIGASE_ACTIVITY_WITH_GLUTAMINE_AS_AMIDO_N_DONOR 10
GOMF_LIGASE_ACTIVITY_FORMING_CARBON_CARBON_BONDS 7
GOMF_LIGASE_ACTIVITY_FORMING_PHOSPHORIC_ESTER_BONDS 5
GOMF_ATPASE_ACTIVITY 479
GOMF_ENDODEOXYRIBONUCLEASE_ACTIVITY_PRODUCING_5_PHOSPHOMONOESTERS 17
GOMF_ENDODEOXYRIBONUCLEASE_ACTIVITY_PRODUCING_3_PHOSPHOMONOESTERS 10
GOMF_ENDORIBONUCLEASE_ACTIVITY_PRODUCING_5_PHOSPHOMONOESTERS 28
GOMF_ENDORIBONUCLEASE_ACTIVITY_PRODUCING_3_PHOSPHOMONOESTERS 8
GOMF_ENDONUCLEASE_ACTIVITY_ACTIVE_WITH_EITHER_RIBO_OR_DEOXYRIBONUCLEIC_ACIDS_AND_PRODUCING_5_PHOSPHOMONOESTERS 41
GOMF_ENDONUCLEASE_ACTIVITY_ACTIVE_WITH_EITHER_RIBO_OR_DEOXYRIBONUCLEIC_ACIDS_AND_PRODUCING_3_PHOS

GOMF_CYSTEINE_TYPE_ENDOPEPTIDASE_ACTIVITY_INVOLVED_IN_APOPTOTIC_SIGNALING_PATHWAY 10
GOMF_CYSTEINE_TYPE_ENDOPEPTIDASE_ACTIVITY_INVOLVED_IN_EXECUTION_PHASE_OF_APOPTOSIS 9
GOMF_7SK_SNRNA_BINDING 8
GOMF_MDM2_MDM4_FAMILY_PROTEIN_BINDING 8
GOMF_CYCLIN_DEPENDENT_PROTEIN_KINASE_ACTIVITY 29
GOMF_STRUCTURAL_MOLECULE_ACTIVITY_CONFERRING_ELASTICITY 11
GOMF_CULLIN_FAMILY_PROTEIN_BINDING 25
GOMF_ANNEALING_ACTIVITY 13
GOMF_CALCITONIN_FAMILY_RECEPTOR_ACTIVITY 6
GOMF_STAT_FAMILY_PROTEIN_BINDING 9
GOMF_DISORDERED_DOMAIN_SPECIFIC_BINDING 34
GOMF_RNA_POLYMERASE_ACTIVITY 43
GOMF_G_RICH_STRAND_TELOMERIC_DNA_BINDING 10
GOMF_LIGAND_ACTIVATED_TRANSCRIPTION_FACTOR_ACTIVITY 52
GOMF_CELL_ADHESION_MEDIATOR_ACTIVITY 58
GOMF_CELL_CELL_ADHESION_MEDIATOR_ACTIVITY 49
GOMF_CELL_MATRIX_ADHESION_MEDIATOR_ACTIVITY 7
GOMF_COLLAGEN_BINDING_INVOLVED_IN_CELL_MATRIX_ADHESION 5
GOMF_CADHERIN_BINDING_INVOLVED_IN_CELL_CELL_ADHESION 18
GOMF_NITRITE_REDUCTASE_ACTIVITY 6
GOMF_BMP_RECEPTOR_ACTIVITY 6
GOMF_SEQUENCE_SPECIFIC_SINGLE_STR

In [830]:
# Each molecular function has a list of proteins attached to it.
# The format of the protein ID is a little different from that of AlphaFold.

for protein in gomf_dict["GOMF_INOSITOL_HEXAKISPHOSPHATE_5_KINASE_ACTIVITY"]:
    print(protein, protein in protein_id_full)

IP6K3 False
PPIP5K2 False
IP6K2 False
PPIP5K1 False
IP6K1 False


In [262]:
# Let's translate these gene names into a protein ID format we recognize.

' '.join(gomf_dict["GOMF_INOSITOL_HEXAKISPHOSPHATE_5_KINASE_ACTIVITY"])

'IP6K3 PPIP5K2 IP6K2 PPIP5K1 IP6K1'

In [4]:
# How to convert gene names to uniprot protein ID 

# https://www.uniprot.org/help/api_idmapping


url = 'https://www.uniprot.org/uploadlists/'

params = {
'from': 'GENENAME',
'to': 'SWISSPROT',
'format': 'tab',
'query': 'IP6K3 PPIP5K2 IP6K2 PPIP5K1 IP6K1'
}

data = urllib.parse.urlencode(params)
data = data.encode('utf-8')
req = urllib.request.Request(url, data)
with urllib.request.urlopen(req) as f:
    response = f.read()
print(response.decode('utf-8'))

From	To
IP6K3	Q8BWD2
IP6K3	Q96PC2
PPIP5K2	O43314
PPIP5K2	Q5REW0
PPIP5K2	Q5XHF8
PPIP5K2	Q6ZQB6
IP6K2	Q80V72
IP6K2	Q95221
IP6K2	Q9R0U1
IP6K2	Q9UHH9
PPIP5K1	A2ARP1
PPIP5K1	A7Z050
PPIP5K1	P0C644
PPIP5K1	Q5RDF1
PPIP5K1	Q6PFW1
IP6K1	Q6PD10
IP6K1	Q92551
IP6K1	Q9ESM0



In [5]:
params = {
'from': 'GENENAME',
'to': 'ID',
'format': 'tab',
'query': 'IP6K3 PPIP5K2 IP6K2 PPIP5K1 IP6K1'
}

data = urllib.parse.urlencode(params)
data = data.encode('utf-8')
req = urllib.request.Request(url, data)
with urllib.request.urlopen(req) as f:
    response = f.read()
print(response.decode('utf-8'))

From	To
IP6K3	A0A096NHK4_PAPAN
IP6K3	A0A0B2UTA3_TOXCA
IP6K3	A0A0C9QMZ6_9HYME
IP6K3	A0A0D9RBD7_CHLSB
IP6K3	A0A0U1RRE0_HUMAN
IP6K3	A0A0U1RRR7_RAT
IP6K3	A0A151NY10_ALLMI
IP6K3	A0A1C7NKM9_9FUNG
IP6K3	A0A1D5QSL2_MACMU
IP6K3	A0A1S3A0M5_ERIEU
IP6K3	A0A1S3GNI8_DIPOR
IP6K3	A0A1U7RDJ3_MESAU
IP6K3	A0A1U7RWI7_ALLSI
IP6K3	A0A1U7S3X0_ALLSI
IP6K3	A0A1U7UIA9_CARSF
IP6K3	A0A1V4KWU2_PATFA
IP6K3	A0A218UDF5_9PASE
IP6K3	A0A250Y049_CASCN
IP6K3	A0A286ZTI7_PIG
IP6K3	A0A287AEV2_PIG
IP6K3	A0A2I0LNL4_COLLI
IP6K3	A0A2I2ULA1_FELCA
IP6K3	A0A2J8NYQ2_PANTR
IP6K3	A0A2J8Y371_PONAB
IP6K3	A0A2K5EZR2_AOTNA
IP6K3	A0A2K5MMR7_CERAT
IP6K3	A0A2K5SF28_CEBIM
IP6K3	A0A2K5TQF7_MACFA
IP6K3	A0A2K5ZRW7_MANLE
IP6K3	A0A2K6D3M1_MACNE
IP6K3	A0A2K6FZP8_PROCO
IP6K3	A0A2K6KCM7_RHIBE
IP6K3	A0A2K6QQ55_RHIRO
IP6K3	A0A2K6UWD2_SAIBB
IP6K3	A0A2R9ALD0_PANPA
IP6K3	A0A2U3V6B6_TURTR
IP6K3	A0A2U3WGD8_ODORO
IP6K3	A0A2U3Z5X8_LEPWE
IP6K3	A0A2Y9GSM1_NEOSC
IP6K3	A0A2Y9NQD3_DELLE
IP6K3	A0A2Y9SGR8_PHYMC
IP6K3	A0A340XLA2_LIPVE
IP6K3	A0A341D7S8_NEOAA
IP6K3	A0A

Each gene name translates to multiple proteins. Seems to be a one-to-many relationship. Let me look up one of these.

```
PPIP5K2	O43314
PPIP5K2	Q5REW0
PPIP5K2	Q5XHF8
PPIP5K2	Q6ZQB6
```

`PPIP5K2`: https://www.genecards.org/cgi-bin/carddisp.pl?gene=PPIP5K2

`O43314` searching this on PDB returned 20 structure results. how? [PDB](https://www.rcsb.org/search?request=%7B%22query%22%3A%7B%22type%22%3A%22group%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22group%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22group%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22parameters%22%3A%7B%22attribute%22%3A%22rcsb_polymer_entity_container_identifiers.reference_sequence_identifiers.database_accession%22%2C%22operator%22%3A%22in%22%2C%22value%22%3A%5B%22O43314%22%5D%7D%7D%5D%2C%22logical_operator%22%3A%22and%22%7D%5D%2C%22logical_operator%22%3A%22and%22%2C%22label%22%3A%22text%22%7D%5D%2C%22logical_operator%22%3A%22and%22%7D%2C%22return_type%22%3A%22entry%22%2C%22request_info%22%3A%7B%22query_id%22%3A%22302776a49406a53b591a570891ae4fd5%22%7D%2C%22request_options%22%3A%7B%22pager%22%3A%7B%22start%22%3A0%2C%22rows%22%3A25%7D%2C%22scoring_strategy%22%3A%22combined%22%2C%22sort%22%3A%5B%7B%22sort_by%22%3A%22score%22%2C%22direction%22%3A%22desc%22%7D%5D%7D%7D)

In [1202]:
'O43314' in protein_id_full.values

True

That's in our AlphaFold human proteome dataset and is a human protein. Found on UniProt.

In [1204]:
'Q5REW0' in protein_id_full.values

False

This turns out to be a protein the Sumatran orangutan.

In [1205]:
'Q5XHF8' in protein_id_full.values

False

This turns out to be a protein in the African clawed frog.

In [7]:
gomf_list = list(gomf_dict.keys())
gomf_list[0:10]

['GOMF_SINGLE_STRANDED_DNA_ENDODEOXYRIBONUCLEASE_ACTIVITY',
 'GOMF_MANNOSYLTRANSFERASE_ACTIVITY',
 'GOMF_PEPTIDYLTRANSFERASE_ACTIVITY',
 'GOMF_TRNA_BINDING',
 'GOMF_L_ORNITHINE_TRANSMEMBRANE_TRANSPORTER_ACTIVITY',
 'GOMF_SULFUR_AMINO_ACID_TRANSMEMBRANE_TRANSPORTER_ACTIVITY',
 'GOMF_SUCCINATE_DEHYDROGENASE_ACTIVITY',
 'GOMF_MICROFILAMENT_MOTOR_ACTIVITY',
 'GOMF_SNARE_BINDING',
 'GOMF_RDNA_BINDING']

# GOMF processing

1. Downloaded protein ID asociated with gene names under each GOMF
2. Joined it to SeqVec dataset using protein ID. Note this ID is not the uniprot ID.

In [None]:
#herehere
# Downloaded protein ID associated with gene name in GOMF 


url = 'https://www.uniprot.org/uploadlists/'
gomf_proteins = pd.DataFrame()
gomf_fail=[]
gomf_success=[]

for k in gomf_list:
    v_full = gomf_dict[k]
    print("function number:", len(gomf_success))
    
    i = 0 
    while i < len(v_full):
        v = v_full[i:i+10]

        q = ' '.join(v)
        print(k, i)

        params = {  'from': 'GENENAME',
                    'to': 'ID',
                    'format': 'tab',
                    'query': q
                    }

        for attempt in range(10):
            try:
                data = urllib.parse.urlencode(params)
                data = data.encode('utf-8')
                req = urllib.request.Request(url, data)
                
                with urllib.request.urlopen(req) as f:
                    response = f.read()
                
                gomf_proteins_subset = pd.read_csv(io.BytesIO(response), sep='\t')
                gomf_proteins_subset["GOMF"] = k
                gomf_proteins_subset[["protein_id", "Species"]] = gomf_proteins_subset["To"].str.split('_', expand=True)

                # only append human 
                gomf_proteins = gomf_proteins.append(gomf_proteins_subset[gomf_proteins_subset.Species=="HUMAN"])
                gomf_proteins = gomf_proteins.append(gomf_proteins_subset)

            except:
                print("attempt {attempt} failed for {k} from position {i}".format(attempt=attempt, k=k, i=i))
            else:
                # If successful, move on. 
                break
        else:
            # failed all attempts
            gomf_fail.append((v, q, i))
                
        gomf_success.append(v)
        
        i+=10
        
    

In [382]:
gomf_proteins.shape

(992981, 5)

In [384]:
gomf_proteins.to_parquet("gomf_proteins.parquet")

In [379]:
len(gomf_fail)

65

In [381]:
gomf_fail[0]

('MIRLET7F1 MIRLET7G MIRLET7I MIR100 MIR101-1 MIR103A1 MIR105-1 MIR106A MIR106B MIR107',
 760)

In [380]:
len(gomf_proteins.GOMF.unique())

1707

In [386]:
gomf_fail_2 = []

for failed in gomf_fail:
    v, i = failed
    
    q = ' '.join(v)
    print(k, i)

    params = {  'from': 'GENENAME',
                'to': 'ID',   # SWISSPROT?
                'format': 'tab',
                'query': q
                }

    for attempt in range(10):
        try:
            data = urllib.parse.urlencode(params)
            data = data.encode('utf-8')
            req = urllib.request.Request(url, data)

            with urllib.request.urlopen(req) as f:
                response = f.read()

            gomf_proteins_subset = pd.read_csv(io.BytesIO(response), sep='\t')
            gomf_proteins_subset["GOMF"] = k
            gomf_proteins_subset[["protein_id", "Species"]] = gomf_proteins_subset["To"].str.split('_', expand=True)

            # only append human 
            gomf_proteins = gomf_proteins.append(gomf_proteins_subset[gomf_proteins_subset.Species=="HUMAN"])

        except:
            print("attempt {attempt} failed for {k} from position {i}".format(attempt=attempt, k=k, i=i))
        else:
            # If successful, move on. 
            break
    else:
        # failed all attempts
        gomf_fail_2.append((q, i))

        

MIRLET7F1 MIRLET7G MIRLET7I MIR100 MIR101-1 MIR103A1 MIR105-1 MIR106A MIR106B MIR107 760
MIRLET7F1 MIRLET7G MIRLET7I MIR100 MIR101-1 MIR103A1 MIR105-1 MIR106A MIR106B MIR107 780
MIRLET7F1 MIRLET7G MIRLET7I MIR100 MIR101-1 MIR103A1 MIR105-1 MIR106A MIR106B MIR107 790
MIRLET7F1 MIRLET7G MIRLET7I MIR100 MIR101-1 MIR103A1 MIR105-1 MIR106A MIR106B MIR107 800
MIRLET7F1 MIRLET7G MIRLET7I MIR100 MIR101-1 MIR103A1 MIR105-1 MIR106A MIR106B MIR107 810
MIRLET7F1 MIRLET7G MIRLET7I MIR100 MIR101-1 MIR103A1 MIR105-1 MIR106A MIR106B MIR107 820
MIRLET7F1 MIRLET7G MIRLET7I MIR100 MIR101-1 MIR103A1 MIR105-1 MIR106A MIR106B MIR107 830
MIRLET7F1 MIRLET7G MIRLET7I MIR100 MIR101-1 MIR103A1 MIR105-1 MIR106A MIR106B MIR107 840
MIRLET7F1 MIRLET7G MIRLET7I MIR100 MIR101-1 MIR103A1 MIR105-1 MIR106A MIR106B MIR107 850
MIRLET7F1 MIRLET7G MIRLET7I MIR100 MIR101-1 MIR103A1 MIR105-1 MIR106A MIR106B MIR107 860
MIRLET7F1 MIRLET7G MIRLET7I MIR100 MIR101-1 MIR103A1 MIR105-1 MIR106A MIR106B MIR107 870
MIRLET7F1 MIRLET7G MI

In [387]:
len(gomf_fail_2)

0

In [388]:
gomf_proteins.shape

(993016, 5)

In [389]:
gomf_proteins.to_parquet("gomf_proteins.parquet")

In [390]:
gomf_proteins[0:5]

Unnamed: 0,From,To,GOMF,protein_id,Species
477,RAD50,A0A494BZW0_HUMAN,GOMF_SINGLE_STRANDED_DNA_ENDODEOXYRIBONUCLEASE...,A0A494BZW0,HUMAN
478,RAD50,A0A494BZX5_HUMAN,GOMF_SINGLE_STRANDED_DNA_ENDODEOXYRIBONUCLEASE...,A0A494BZX5,HUMAN
479,RAD50,A0A494BZX8_HUMAN,GOMF_SINGLE_STRANDED_DNA_ENDODEOXYRIBONUCLEASE...,A0A494BZX8,HUMAN
480,RAD50,A0A494C0Y7_HUMAN,GOMF_SINGLE_STRANDED_DNA_ENDODEOXYRIBONUCLEASE...,A0A494C0Y7,HUMAN
481,RAD50,A0A494C122_HUMAN,GOMF_SINGLE_STRANDED_DNA_ENDODEOXYRIBONUCLEASE...,A0A494C122,HUMAN


In [392]:
gomf_proteins.GOMF.unique().size

1708

In [393]:
gomf_proteins.protein_id.unique().size

124645

### Join data

In [103]:
gomf_to_seqvec =seqvec_df.merge(gomf_proteins,
               how='inner',
               left_on='db_code',
               right_on='To')[['db_code', 'protein_id_x', 'GOMF']]
gomf_to_seqvec

Unnamed: 0,db_code,protein_id_x,GOMF
0,NUD4B_HUMAN,A0A024RBG1,GOMF_ENDOPOLYPHOSPHATASE_ACTIVITY
1,NUD4B_HUMAN,A0A024RBG1,GOMF_RNA_BINDING
2,NUD4B_HUMAN,A0A024RBG1,GOMF_NUCLEOTIDE_DIPHOSPHATASE_ACTIVITY
3,NUD4B_HUMAN,A0A024RBG1,GOMF_HYDROLASE_ACTIVITY_ACTING_ON_ACID_ANHYDRIDES
4,NUD4B_HUMAN,A0A024RBG1,GOMF_M7G_5_PPPN_DIPHOSPHATASE_ACTIVITY
...,...,...,...
105341,A3LT2_HUMAN,U3KPV4,GOMF_UDP_GLYCOSYLTRANSFERASE_ACTIVITY
105342,A3LT2_HUMAN,U3KPV4,GOMF_GALACTOSYLTRANSFERASE_ACTIVITY
105343,A3LT2_HUMAN,U3KPV4,GOMF_TRANSFERASE_ACTIVITY_TRANSFERRING_GLYCOSY...
105344,A3LT2_HUMAN,U3KPV4,GOMF_TRANSFERASE_ACTIVITY_TRANSFERRING_HEXOSYL...


In [115]:
gomf_to_seqvec.shape

(105346, 3)

In [116]:
gomf_to_seqvec.drop_duplicates().shape

(104271, 3)

In [117]:
gomf_to_seqvec = gomf_to_seqvec.drop_duplicates()

In [105]:
gomf_to_seqvec.columns=["db_code", "protein_id", "gomf"]

In [118]:
gomf_to_seqvec.to_parquet("functions/gomf_to_protein.parquet")

In [119]:
gomf_to_seqvec.pivot_table(index="protein_id",
                           values="gomf",
                          aggfunc=len).sort_values(by="gomf")

Unnamed: 0_level_0,gomf
protein_id,Unnamed: 1_level_1
Q587J8,1
Q9H4I0,1
P01721,1
P01718,1
P01717,1
...,...
Q99497,47
P27695,47
Q9BZL6,47
P04637,47


So a protein can have up to 49 functions, wow. 

In [141]:
gomf_to_seqvec[gomf_to_seqvec.protein_id=='P37231'].head()

Unnamed: 0,db_code,protein_id,gomf
33402,PPARG_HUMAN,P37231,GOMF_CIS_REGULATORY_REGION_SEQUENCE_SPECIFIC_D...
33403,PPARG_HUMAN,P37231,GOMF_RNA_POLYMERASE_II_REPRESSING_TRANSCRIPTIO...
33404,PPARG_HUMAN,P37231,GOMF_DNA_BINDING_TRANSCRIPTION_REPRESSOR_ACTIVITY
33405,PPARG_HUMAN,P37231,GOMF_CHROMATIN_BINDING
33406,PPARG_HUMAN,P37231,GOMF_DNA_BINDING_TRANSCRIPTION_FACTOR_ACTIVITY


In [125]:
"PPARG" in gomf_dict["GOMF_PROSTANOID_RECEPTOR_ACTIVITY"]

True

In [126]:
"PPARG" in gomf_dict["GOMF_E_BOX_BINDING"]

True

In [127]:
"PPARG" in gomf_dict["GOMF_ALPHA_ACTININ_BINDING"]

True

### Try mapping it to cluster results

In [136]:
protein_id_seqvec = np.array(seqvec_df["protein_id"])

In [131]:
import hdbscan 
X_seqvec = np.stack( seqvec_df["seqvec"]).astype(np.float64)

b2_model = hdbscan.HDBSCAN(algorithm='generic', 
                                     alpha= 1.0, 
                                     approx_min_span_tree=True,
                                     gen_min_span_tree=False, 
                                     leaf_size=40, 
                                     metric='cosine', 
                                     min_cluster_size= 5, 
                                     min_samples= 1, 
                                     p=None)
b2_model.fit(X_seqvec)



HDBSCAN(algorithm='generic', metric='cosine', min_samples=1)

In [134]:
cluster_label = b2_model.labels_

In [135]:
cluster_label

array([-1, 83, -1, ..., -1, -1, -1])

In [140]:
clu_df = pd.DataFrame()
clu_df["protein_id"] = protein_id_seqvec
clu_df["cluster"] = cluster_label
clu_df.head()

Unnamed: 0,protein_id,cluster
0,A0A024R1R8,-1
1,A0A024RBG1,83
2,A0A024RCN7,-1
3,A0A075B6H5,-1
4,A0A075B6H7,105


In [143]:
clu_df_gomf = clu_df.merge(gomf_to_seqvec,
             how='left',
             left_on='protein_id',
             right_on='protein_id'
            )
clu_df_gomf.head()

Unnamed: 0,protein_id,cluster,db_code,gomf
0,A0A024R1R8,-1,,
1,A0A024RBG1,83,NUD4B_HUMAN,GOMF_ENDOPOLYPHOSPHATASE_ACTIVITY
2,A0A024RBG1,83,NUD4B_HUMAN,GOMF_RNA_BINDING
3,A0A024RBG1,83,NUD4B_HUMAN,GOMF_NUCLEOTIDE_DIPHOSPHATASE_ACTIVITY
4,A0A024RBG1,83,NUD4B_HUMAN,GOMF_HYDROLASE_ACTIVITY_ACTING_ON_ACID_ANHYDRIDES


In [150]:
clu_df_gomf.pivot_table(index=['cluster'],
                        values=["gomf"],
                        aggfunc=[pd.Series.nunique, len]
                       )

Unnamed: 0_level_0,nunique,len
Unnamed: 0_level_1,gomf,gomf
cluster,Unnamed: 1_level_2,Unnamed: 2_level_2
-1,1563,54185
0,17,44
1,0,24
2,6,15
3,3,14
...,...,...
752,39,109
753,36,63
754,18,58
755,160,513


In [159]:
clu_df_gomf_pivot = clu_df_gomf[clu_df_gomf.cluster>=0].pivot_table(index=['cluster', 'gomf'],
                        aggfunc=[len]
                       ).reset_index().sort_values(('len','protein_id'))

In [160]:
clu_df_gomf_pivot[-20:]

Unnamed: 0_level_0,cluster,gomf,len,len
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,db_code,protein_id
16177,724,GOMF_TRANSCRIPTION_REGULATOR_ACTIVITY,52,52
2733,138,GOMF_OXIDOREDUCTASE_ACTIVITY,55,55
15163,667,GOMF_PROTEIN_SERINE_KINASE_ACTIVITY,58,58
7200,349,GOMF_PEPTIDASE_ACTIVITY,63,63
9852,455,GOMF_TRANSPORTER_ACTIVITY,64,64
7188,349,GOMF_ENDOPEPTIDASE_ACTIVITY,65,65
7208,349,GOMF_SERINE_HYDROLASE_ACTIVITY,65,65
15164,667,GOMF_PROTEIN_SERINE_THREONINE_KINASE_ACTIVITY,88,88
15171,667,GOMF_RIBONUCLEOTIDE_BINDING,88,88
15061,667,GOMF_ADENYL_NUCLEOTIDE_BINDING,88,88


In [161]:
clu_df_gomf_pivot[clu_df_gomf_pivot.cluster==579]

Unnamed: 0_level_0,cluster,gomf,len,len
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,db_code,protein_id
12977,579,GOMF_COPPER_ION_BINDING,1,1
12980,579,GOMF_IDENTICAL_PROTEIN_BINDING,1,1
12986,579,GOMF_STEROID_HORMONE_RECEPTOR_ACTIVITY,1,1
12987,579,GOMF_TRANSITION_METAL_ION_BINDING,1,1
12982,579,GOMF_NEUROTRANSMITTER_RECEPTOR_ACTIVITY,12,12
12985,579,GOMF_SEROTONIN_RECEPTOR_ACTIVITY,12,12
12978,579,GOMF_G_PROTEIN_COUPLED_AMINE_RECEPTOR_ACTIVITY,12,12
12983,579,GOMF_ODORANT_BINDING,114,114
12984,579,GOMF_OLFACTORY_RECEPTOR_ACTIVITY,409,409
12979,579,GOMF_G_PROTEIN_COUPLED_RECEPTOR_ACTIVITY,409,409


In [167]:
clu_df_gomf_pivot[(clu_df_gomf_pivot.cluster!=579) & 
                  (clu_df_gomf_pivot.gomf=='GOMF_OLFACTORY_RECEPTOR_ACTIVITY')]

Unnamed: 0_level_0,cluster,gomf,len,len
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,db_code,protein_id
12506,565,GOMF_OLFACTORY_RECEPTOR_ACTIVITY,1,1
12774,576,GOMF_OLFACTORY_RECEPTOR_ACTIVITY,8,8


In [169]:
clu_df_gomf_pivot[clu_df_gomf_pivot.cluster==576]

Unnamed: 0_level_0,cluster,gomf,len,len
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,db_code,protein_id
12774,576,GOMF_OLFACTORY_RECEPTOR_ACTIVITY,8,8
12773,576,GOMF_MOLECULAR_TRANSDUCER_ACTIVITY,8,8
12772,576,GOMF_G_PROTEIN_COUPLED_RECEPTOR_ACTIVITY,8,8


In [170]:
clu_df_gomf_pivot[clu_df_gomf_pivot.cluster==565]

Unnamed: 0_level_0,cluster,gomf,len,len
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,db_code,protein_id
12504,565,GOMF_N_FORMYL_PEPTIDE_RECEPTOR_ACTIVITY,1,1
12505,565,GOMF_ODORANT_BINDING,1,1
12506,565,GOMF_OLFACTORY_RECEPTOR_ACTIVITY,1,1
12507,565,GOMF_PEPTIDE_RECEPTOR_ACTIVITY,1,1
12508,565,GOMF_PROSTANOID_RECEPTOR_ACTIVITY,1,1
12493,565,GOMF_COMPLEMENT_RECEPTOR_ACTIVITY,1,1
12494,565,GOMF_ENZYME_REGULATOR_ACTIVITY,1,1
12495,565,GOMF_GUANYL_NUCLEOTIDE_EXCHANGE_FACTOR_ACTIVITY,1,1
12496,565,GOMF_G_PROTEIN_COUPLED_PURINERGIC_NUCLEOTIDE_R...,1,1
12498,565,GOMF_ICOSANOID_RECEPTOR_ACTIVITY,1,1


Summarize by cluster:
- total num proteins
- proteins with gomf 
- dictionary ordered by descending count: {gomf: count, gomf: count}
- top gomf count / proteins with gomf = % of proteins with top gomf 
- second to top gomf count / proteins with gomf = % of proteins with top gomf 
- third to top gomf count / proteins with gomf = % of proteins with top gomf 

# STRING

https://string-db.org/cgi/download?sessionId=%24input-%3E%7BsessionId%7D&species_text=Homo+sapiens
list of STRING proteins incl. their display names and descriptions

In [421]:
with gzip.open('9606.protein.info.v11.5.txt.gz', mode="rt") as f:
    stringdb_protein_info = f.read()
    
stringdb_protein_info = pd.read_csv (io.StringIO(stringdb_protein_info), sep='\t')

In [423]:
with gzip.open('9606.protein.aliases.v11.5.txt.gz', mode="rt") as f:
    stringdb_protein_aliases = f.read()
    
stringdb_protein_aliases = pd.read_csv (io.StringIO(stringdb_protein_aliases), sep='\t')

In [438]:
stringdb_protein_aliases[stringdb_protein_aliases.source=='Ensembl_UniProt'].shape

(124372, 3)

In [440]:
stringdb_protein_aliases[stringdb_protein_aliases.source=='Ensembl_UniProt'].alias.drop_duplicates().shape

(124372,)

In [434]:
stringdb_protein_info[0:20]

Unnamed: 0,#string_protein_id,preferred_name,protein_size,annotation
0,9606.ENSP00000000233,ARF5,180,ADP-ribosylation factor 5; GTP-binding protein...
1,9606.ENSP00000000412,M6PR,277,Cation-dependent mannose-6-phosphate receptor;...
2,9606.ENSP00000001008,FKBP4,459,Peptidyl-prolyl cis-trans isomerase FKBP4; Imm...
3,9606.ENSP00000001146,CYP26B1,512,Cytochrome P450 26B1; Involved in the metaboli...
4,9606.ENSP00000002125,NDUFAF7,441,"Protein arginine methyltransferase NDUFAF7, mi..."
5,9606.ENSP00000002165,FUCA2,467,Plasma alpha-L-fucosidase; Alpha-L-fucosidase ...
6,9606.ENSP00000002596,HS3ST1,307,Heparan sulfate glucosamine 3-O-sulfotransfera...
7,9606.ENSP00000002829,SEMA3F,785,Semaphorin-3F; May play a role in cell motilit...
8,9606.ENSP00000003084,CFTR,1480,Cystic fibrosis transmembrane conductance regu...
9,9606.ENSP00000003100,CYP51A1,509,Lanosterol 14-alpha demethylase; Catalyzes C14...


In [441]:
stringdb_tgt = stringdb_protein_info.merge(stringdb_protein_aliases[stringdb_protein_aliases.source=='Ensembl_UniProt'],
                           how='left',
                           left_on='preferred_name',
                           right_on='alias')

In [443]:
stringdb_tgt[stringdb_tgt.alias.isna()].shape

(1824, 7)

In [446]:
in_alpha = seqvec_df.merge(stringdb_tgt,
               how='left',
               left_on='protein_id',
               right_on='alias')

In [447]:
in_alpha[0:10]

Unnamed: 0,pdbx_db_accession,db_code,db_name,protein_id,pdbx_seq_one_letter_code,protein_filename,seqvec,#string_protein_id_x,preferred_name,protein_size,annotation,#string_protein_id_y,alias,source
0,A0A024R1R8,A0A024R1R8_HUMAN,UNP,A0A024R1R8,MSSHEGGKKKALKQPKKQAKEMDEEEKAFKQKQKEEQKKLEVLKAK...,AF-A0A024R1R8-F1-model_v1,"[-0.0093454495, 0.032306828, -0.18556054, -0.0...",,,,,,,
1,A0A024RBG1,NUD4B_HUMAN,UNP,A0A024RBG1,MMKFKPNQTRTYDREGFKKRAACLCFRSEQEDEVLLVSSSRYPDQW...,AF-A0A024RBG1-F1-model_v1,"[0.021576513, -0.014022889, -0.06683804, -0.03...",,,,,,,
2,A0A024RCN7,A0A024RCN7_HUMAN,UNP,A0A024RCN7,MERSFVWLSCLDSDSCNLTFRLGEVESHACSPSLLWNLLTQYLPPG...,AF-A0A024RCN7-F1-model_v1,"[0.01512731, 0.021352034, -0.034456804, -0.036...",,,,,,,
3,A0A075B6H5,A0A075B6H5_HUMAN,UNP,A0A075B6H5,METVVTTLPREGGVGPSRKMLLLLLLLGPGSGLSAVVSQHPSRVIC...,AF-A0A075B6H5-F1-model_v1,"[0.12550753, -0.06984283, -0.22499037, -0.1400...",,,,,,,
4,A0A075B6H7,KV37_HUMAN,UNP,A0A075B6H7,MEAPAQLLFLLLLWLPDTTREIVMTQSPPTLSLSPGERVTLSCRAS...,AF-A0A075B6H7-F1-model_v1,"[0.075385824, -0.119837634, -0.2562411, -0.088...",,,,,,,
5,A0A075B6H8,KVD42_HUMAN,UNP,A0A075B6H8,MDMRVPAQLLGLLLLWLPGVRFDIQMTQSPSFLSASVGDRVSIICW...,AF-A0A075B6H8-F1-model_v1,"[0.090758204, -0.093061805, -0.23045431, -0.11...",,,,,,,
6,A0A075B6H9,LV469_HUMAN,UNP,A0A075B6H9,MAWTPLLFLTLLLHCTGSLSQLVLTQSPSASASLGASVKLTCTLSS...,AF-A0A075B6H9-F1-model_v1,"[0.09729957, -0.063693315, -0.28057155, -0.011...",,,,,,,
7,A0A075B6I0,LV861_HUMAN,UNP,A0A075B6I0,MSVPTMAWMMLLLGLLAYGSGVDSQTVVTQEPSFSVSPGGTVTLTC...,AF-A0A075B6I0-F1-model_v1,"[0.010603256, -0.17456998, -0.26808688, -0.023...",,,,,,,
8,A0A075B6I1,LV460_HUMAN,UNP,A0A075B6I1,MAWTPLLLLFPLLLHCTGSLSQPVLTQSSSASASLGSSVKLTCTLS...,AF-A0A075B6I1-F1-model_v1,"[0.065141894, -0.06317002, -0.31697446, -0.005...",,,,,,,
9,A0A075B6I3,LVK55_HUMAN,UNP,A0A075B6I3,MALTPLLLLLLSHCTGSLSRPVLTQPPSLSASPGATARLPCTLSSD...,AF-A0A075B6I3-F1-model_v1,"[0.100569814, -0.05662869, -0.18032314, -0.072...",,,,,,,


In [448]:
in_alpha[in_alpha.preferred_name.isna()].shape

(20504, 14)

# Protein classification

1. [family classification](https://www.ebi.ac.uk/training/online/courses/protein-classification-intro-ebi-resources/protein-classification/what-are-protein-families/) e.g., G protein-coupled receptors > rhodopsin-like GPCRs > opsins > blue-sensitive opsins

from interpro can I download:
* protein ID that maps to alphafold
* family
* GO MF 

Interpro is amazing, but I can only feed 30 at a time into the API. And what the hell is the Interpro dataset's key? I can't download the 50GB stuff. Can I??

In [453]:
seqvec_df.pdbx_seq_one_letter_code[0]

'MSSHEGGKKKALKQPKKQAKEMDEEEKAFKQKQKEEQKKLEVLKAKVVGKGPLATGGIKKSGKK'

In [457]:
# Mappings of InterPro entries to Gene Ontology (GO) terms.
with open('interpro/interpro2go', mode="rt") as f:
    interpro2go  = f.read()

In [460]:
print(interpro2go[0:1000])

!date: 2021/05/26 21:33:13
!Mapping of InterPro entries to GO
!
InterPro:IPR000003 Retinoid X receptor/HNF4 > GO:DNA binding ; GO:0003677
InterPro:IPR000003 Retinoid X receptor/HNF4 > GO:steroid hormone receptor activity ; GO:0003707
InterPro:IPR000003 Retinoid X receptor/HNF4 > GO:zinc ion binding ; GO:0008270
InterPro:IPR000003 Retinoid X receptor/HNF4 > GO:regulation of transcription, DNA-templated ; GO:0006355
InterPro:IPR000003 Retinoid X receptor/HNF4 > GO:nucleus ; GO:0005634
InterPro:IPR000006 Metallothionein, vertebrate > GO:metal ion binding ; GO:0046872
InterPro:IPR000009 Protein phosphatase 2A regulatory subunit PR55 > GO:protein phosphatase regulator activity ; GO:0019888
InterPro:IPR000009 Protein phosphatase 2A regulatory subunit PR55 > GO:protein phosphatase type 2A complex ; GO:0000159
InterPro:IPR000010 Cystatin domain > GO:cysteine-type endopeptidase inhibitor activity ; GO:0004869
InterPro:IPR000011 Ubiquitin/SUMO-activating enzyme E1 > GO:ubiquitin-like modifier ac

In [465]:
# TSV file listing basic InterPro entry information - the accessions, types and names.
entrylist = pd.read_csv('interpro/entry.list', sep='\t')

In [467]:
entrylist[0:10]

Unnamed: 0,ENTRY_AC,ENTRY_TYPE,ENTRY_NAME
0,IPR000126,Active_site,"Serine proteases, V8 family, serine active site"
1,IPR000138,Active_site,"Hydroxymethylglutaryl-CoA lyase, active site"
2,IPR000169,Active_site,"Cysteine peptidase, cysteine active site"
3,IPR000180,Active_site,"Membrane dipeptidase, active site"
4,IPR000189,Active_site,"Prokaryotic transglycosylase, active site"
5,IPR000590,Active_site,"Hydroxymethylglutaryl-coenzyme A synthase, act..."
6,IPR001252,Active_site,"Malate dehydrogenase, active site"
7,IPR001345,Active_site,"Phosphoglycerate/bisphosphoglycerate mutase, a..."
8,IPR001497,Active_site,Methylated-DNA-[protein]-cysteine S-methyltran...
9,IPR001555,Active_site,"Phosphoribosylglycinamide formyltransferase, a..."


In [468]:
entrylist.ENTRY_TYPE.unique()

array(['Active_site', 'Binding_site', 'Conserved_site', 'Domain',
       'Family', 'Homologous_superfamily', 'PTM', 'Repeat'], dtype=object)

In [469]:
entrylist.shape

(38913, 3)

In [472]:
entrylist[entrylist.ENTRY_TYPE=='Homologous_superfamily'].shape

(3356, 3)

In [473]:
entrylist[entrylist.ENTRY_TYPE=='Homologous_superfamily']

Unnamed: 0,ENTRY_AC,ENTRY_TYPE,ENTRY_NAME
35219,IPR000336,Homologous_superfamily,"Flavivirus/Alphavirus glycoprotein, immunoglob..."
35220,IPR000415,Homologous_superfamily,Nitroreductase-like
35221,IPR001227,Homologous_superfamily,Acyl transferase domain superfamily
35222,IPR001920,Homologous_superfamily,Asp/Glu racemase
35223,IPR002869,Homologous_superfamily,"Pyruvate-flavodoxin oxidoreductase, central do..."
...,...,...,...
38570,IPR044944,Homologous_superfamily,"Nitric oxide synthase, domain 3 superfamily"
38571,IPR044945,Homologous_superfamily,"FokI, recognition domain, subdomain 1 and 2"
38572,IPR044946,Homologous_superfamily,Type I restriction modification DNA specificit...
38573,IPR044947,Homologous_superfamily,"Bacteriophage T4, Gp32, single-stranded DNA-bi..."


# Interpro API

https://github.com/ProteinsWebTeam/interpro7-api/tree/master/docs

| Type | 	Description | 	Source |
| ---	 | --- | 	--- |
| Entry | 	Predicted functional and structural domains on proteins |	InterPro, CATH-Gene3D, CDD, HAMAP, PANTHER, Pfam, PIRSF, PRINTS, PROSITE Patterns, PROSITE Profiles, SMART, SFLD, SUPERFAMILY, TIGRFAMs |
| Protein | 	Protein sequence 	UniProtKB (reviewed and unreviewed) |
| Structure | 	Macromolecular structures involving proteins 	|PDB |
| Set | 	Sets describing relationships between entries 	Pfam, |CDD |
| Taxonomy | 	Taxonomic information about proteins 	|UniProtKB |
| Proteome | 	Collections of proteins defined from whole genome sequencing of isolate organisms 	|UniProtKB |


# PFAM
https://www.ebi.ac.uk/about/news/announcements/Pfam-protein-structures

# UniProt
https://ftp.uniprot.org/pub/databases/uniprot/current%5Frelease/knowledgebase/idmapping/by_organism/

data dictionary: https://ftp.uniprot.org/pub/databases/uniprot/knowledgebase/idmapping/README

In [8]:
import io

In [61]:
with gzip.open('HUMAN_9606_idmapping.dat.gz', mode="rt") as f:
    uniprot_human = pd.read_csv (io.StringIO(f.read()), sep='\t', 
                                 names=["UniProtKB-AC", "id_type", 'id'])


In [76]:
uniprot_human.head()

Unnamed: 0,UniProtKB-AC,id_type,id
0,P31946,UniProtKB-ID,1433B_HUMAN
1,P31946,Gene_Name,YWHAB
2,P31946,GI,4507949
3,P31946,GI,377656702
4,P31946,GI,67464628


In [78]:
# gene names
uniprot_gn = uniprot_human[uniprot_human["id_type"] =='Gene_Name']
uniprot_gn.head()

Unnamed: 0,UniProtKB-AC,id_type,id
1,P31946,Gene_Name,YWHAB
110,P62258,Gene_Name,YWHAE
227,Q04917,Gene_Name,YWHAH
305,P61981,Gene_Name,YWHAG
424,P31947,Gene_Name,SFN


In [83]:
uniprot_gn[uniprot_gn["id"]=='RAD50'].head()

Unnamed: 0,UniProtKB-AC,id_type,id
1173675,Q92878,Gene_Name,RAD50
3993560,A0A494C0Y7,Gene_Name,RAD50
4046287,A0A494BZX8,Gene_Name,RAD50
4049144,E9PM98,Gene_Name,RAD50
4050066,C9JNH8,Gene_Name,RAD50


In [None]:
gomf_gn = pd.DataFrame(columns=['gomf', 'gene_name'])

for k, vs in gomf_dict.items():
    for v in vs:
        gomf_gn = gomf_gn.append([k, v])

In [79]:
gomf_gn.head()

Unnamed: 0,gene_name,gomf
477,RAD50,GOMF_SINGLE_STRANDED_DNA_ENDODEOXYRIBONUCLEASE...
1705,ENDOG,GOMF_SINGLE_STRANDED_DNA_ENDODEOXYRIBONUCLEASE...
1719,ERCC1,GOMF_SINGLE_STRANDED_DNA_ENDODEOXYRIBONUCLEASE...
2225,ERCC4,GOMF_SINGLE_STRANDED_DNA_ENDODEOXYRIBONUCLEASE...
3925,MRE11,GOMF_SINGLE_STRANDED_DNA_ENDODEOXYRIBONUCLEASE...


In [72]:
gomf_gn.to_parquet('functions/gomf_gn.parquet')
# gomf_gn = pd.read_parquet("functions/gomf_gn.parquet")

In [85]:
gomf_gn2 = gomf_gn.merge(uniprot_gn,
              how='left',
              left_on='gene_name',
             right_on='id')[["gene_name", "gomf", "UniProtKB-AC"]]
gomf_gn2.head()

Unnamed: 0,gene_name,gomf,UniProtKB-AC
0,RAD50,GOMF_SINGLE_STRANDED_DNA_ENDODEOXYRIBONUCLEASE...,Q92878
1,RAD50,GOMF_SINGLE_STRANDED_DNA_ENDODEOXYRIBONUCLEASE...,A0A494C0Y7
2,RAD50,GOMF_SINGLE_STRANDED_DNA_ENDODEOXYRIBONUCLEASE...,A0A494BZX8
3,RAD50,GOMF_SINGLE_STRANDED_DNA_ENDODEOXYRIBONUCLEASE...,E9PM98
4,RAD50,GOMF_SINGLE_STRANDED_DNA_ENDODEOXYRIBONUCLEASE...,C9JNH8


In [90]:
seqvec_df.head()

Unnamed: 0,pdbx_db_accession,db_code,db_name,protein_id,pdbx_seq_one_letter_code,protein_filename,seqvec
0,A0A024R1R8,A0A024R1R8_HUMAN,UNP,A0A024R1R8,MSSHEGGKKKALKQPKKQAKEMDEEEKAFKQKQKEEQKKLEVLKAK...,AF-A0A024R1R8-F1-model_v1,"[-0.0093454495, 0.032306828, -0.18556054, -0.0..."
1,A0A024RBG1,NUD4B_HUMAN,UNP,A0A024RBG1,MMKFKPNQTRTYDREGFKKRAACLCFRSEQEDEVLLVSSSRYPDQW...,AF-A0A024RBG1-F1-model_v1,"[0.021576513, -0.014022889, -0.06683804, -0.03..."
2,A0A024RCN7,A0A024RCN7_HUMAN,UNP,A0A024RCN7,MERSFVWLSCLDSDSCNLTFRLGEVESHACSPSLLWNLLTQYLPPG...,AF-A0A024RCN7-F1-model_v1,"[0.01512731, 0.021352034, -0.034456804, -0.036..."
3,A0A075B6H5,A0A075B6H5_HUMAN,UNP,A0A075B6H5,METVVTTLPREGGVGPSRKMLLLLLLLGPGSGLSAVVSQHPSRVIC...,AF-A0A075B6H5-F1-model_v1,"[0.12550753, -0.06984283, -0.22499037, -0.1400..."
4,A0A075B6H7,KV37_HUMAN,UNP,A0A075B6H7,MEAPAQLLFLLLLWLPDTTREIVMTQSPPTLSLSPGERVTLSCRAS...,AF-A0A075B6H7-F1-model_v1,"[0.075385824, -0.119837634, -0.2562411, -0.088..."


In [92]:
gomf_proteins = pd.read_parquet("gomf_proteins.parquet")

In [99]:
gomf_proteins[gomf_proteins.To=='KV37_HUMAN']

Unnamed: 0,From,To,GOMF,protein_id,Species


In [101]:
seqvec_df.merge(gomf_proteins,
               how='inner',
               left_on='db_code',
               right_on='To')[['db_code', 'protein_id_x', 'GOMF']]

Unnamed: 0,db_code,protein_id_x,GOMF
0,NUD4B_HUMAN,A0A024RBG1,GOMF_ENDOPOLYPHOSPHATASE_ACTIVITY
1,NUD4B_HUMAN,A0A024RBG1,GOMF_RNA_BINDING
2,NUD4B_HUMAN,A0A024RBG1,GOMF_NUCLEOTIDE_DIPHOSPHATASE_ACTIVITY
3,NUD4B_HUMAN,A0A024RBG1,GOMF_HYDROLASE_ACTIVITY_ACTING_ON_ACID_ANHYDRIDES
4,NUD4B_HUMAN,A0A024RBG1,GOMF_M7G_5_PPPN_DIPHOSPHATASE_ACTIVITY
...,...,...,...
105341,A3LT2_HUMAN,U3KPV4,GOMF_UDP_GLYCOSYLTRANSFERASE_ACTIVITY
105342,A3LT2_HUMAN,U3KPV4,GOMF_GALACTOSYLTRANSFERASE_ACTIVITY
105343,A3LT2_HUMAN,U3KPV4,GOMF_TRANSFERASE_ACTIVITY_TRANSFERRING_GLYCOSY...
105344,A3LT2_HUMAN,U3KPV4,GOMF_TRANSFERASE_ACTIVITY_TRANSFERRING_HEXOSYL...


# GO Hierarchy
* Gene ontology hierarchy from Nucleic Acids Research 2013, 41(Database issue):D536-44. 
* MSigDB_GO_Terms.txt from https://groups.google.com/g/gsea-help/c/fQcIuZViokg/m/dOH0AbKWAwAJ 

In [173]:
with gzip.open('functions/Domain2GO.sql.gz', mode="rt") as f:
    data = f.read()

In [175]:
data.split('\n')[87]

'/*!40000 ALTER TABLE `GO_hie` DISABLE KEYS */;'

In [176]:
data.split('\n')[103]

'/*!40000 ALTER TABLE `GO_hie` ENABLE KEYS */;'

In [192]:
def convert(in_str):
    result = []
    current_tuple = []
    for token in in_str.split(","):
        number = token.replace("(","").replace(")", "")
        current_tuple.append(number)
        if ")" in token:
            result.append(tuple(current_tuple))
            current_tuple = []
    return result

go_hier = pd.DataFrame(columns=['parent', 'child', 'distance'], dtype=str)

for row in data.split('\n')[88:103]:
    df = pd.DataFrame(convert(row[28:-1]), columns=['parent', 'child', 'distance'], dtype=str)
    go_hier = go_hier.append(df)


In [193]:
go_hier.shape

(719628, 3)

In [194]:
go_hier[0:5]

Unnamed: 0,parent,child,distance
0,2,6264,2
1,2,32042,1
2,2,32043,2
3,2,33955,1
4,2,43504,2
