# DSC 190 Protein Topological Data Analysis Data Preparation

Download some protein structures from the Protein Data Bank (PDB). Compute suitable persistence diagrams (as topological summaries) for these atomic structures, then compare or cluster them with a clustering method. You can try different distance metrics between persistence diagrams.

Alternatively, you can focus on a few molecules and provide a detailed topological analysis. Structures for some SARS-CoV-2 spike proteins and antibodies appear to be available at [http://pdb101.rcsb.org/motm/256](http://pdb101.rcsb.org/motm/256). You can study their topological profiles.
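
As a rough illustration of the persistence diagrams mentioned above, the sketch below computes only the 0-dimensional (connected component) part of a Vietoris-Rips persistence diagram for a small point cloud, using union-find over sorted pairwise distances. The function name `h0_persistence` and the three sample points are hypothetical, and death times are reported at the full edge length (some conventions use half of it). A dedicated TDA library would normally be used for real structures and for higher-dimensional features.

```python
import itertools
import math


def h0_persistence(points):
    """Return the finite H0 death times of a Vietoris-Rips filtration.

    Every point is born at scale 0 as its own component. Two components
    merge (one of them dies) at the length of the shortest edge joining them.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, shortest first.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(range(n), 2)
    )

    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            deaths.append(length)
    return deaths  # n - 1 values; one component never dies


# Hypothetical toy coordinates standing in for atom positions.
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
diagram = [(0.0, death) for death in h0_persistence(points)]
print(diagram)
```

Comparing two proteins then amounts to comparing two such lists of (birth, death) pairs under a chosen diagram distance.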

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import sys
import re
import warnings
import pandas as pd
import numpy as np
import nglview as nv
from Bio.PDB import PDBParser, parse_pdb_header
from Bio.PDB.MMCIFParser import MMCIFParser
from Bio.PDB.PDBExceptions import PDBConstructionWarning



In [3]:
# add project directory to path
sys.path.append(os.path.dirname(os.getcwd()))
from src.scripts import scrape_pdb
from src.scripts import clean_pdb

#hide discontinuous chain warnings
warnings.simplefilter('ignore', PDBConstructionWarning)

## Downloading Data

Start by downloading a working set of files. I build a JSON file to keep track of the proteins, their PDB IDs, and each protein's category.

Categories include 
- "molecules of life"
    - molecules that perform basic functions required for human life
- Immune System
- Virus Proteins
    - HIV/AIDS proteins
    - Coronavirus Proteins
    - Zika Virus Proteins
- Diabetes
    - insulin
    - dipeptidyl peptidase 4 (degrades the incretin hormones that signal the pancreas to secrete insulin)
    - glucagon
    - glucose receptors, etc
- Cancer
    - proteins that may be used to attack cancer cells
    - proteins involved in cellular growth
    - proteins involved in normal cellular death
- Toxins and Poisons
- Antibiotics
    - proteins used in antibiotics
- Antibiotic Resistance
    - proteins used by antibiotic resistant bacteria to break down antibiotics
- Neural Proteins
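
A minimal sketch of the JSON bookkeeping described above, assuming one entry per protein with a list of download links and a list of categories (the shape printed by `scrape_pdb.get_protein_links` further down). The file name `protein_index.json` is hypothetical, and a temporary directory is used so the example does not touch the project tree.

```python
import json
import tempfile
from pathlib import Path

# One entry, shortened from the scraper output shown later in this notebook.
protein_index = {
    "ABO Blood Type Glycosyltransferases": {
        "links": [
            "https://files.rcsb.org/download/3I0G.pdb",
            "https://files.rcsb.org/download/1LU1.pdb",
        ],
        "category": ["You and Your Health"],
    }
}

with tempfile.TemporaryDirectory() as tmp:
    index_path = Path(tmp) / "protein_index.json"  # hypothetical file name
    index_path.write_text(json.dumps(protein_index, indent=2))

    # Reading the file back gives the same nested dict.
    reloaded = json.loads(index_path.read_text())

print(sorted(reloaded))
```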


In [138]:
url='https://pdb101.rcsb.org/motm/motm-by-category'
protein_dict=scrape_pdb.get_protein_links(url)
protein_dict

100%|██████████| 4/4 [01:21<00:00, 20.39s/it]


{'ABO Blood Type Glycosyltransferases': {'links': ['https://files.rcsb.org/download/3I0G.pdb',
   'https://files.rcsb.org/download/1LU1.pdb',
   'https://files.rcsb.org/download/2J1U.pdb',
   'https://files.rcsb.org/download/2OBS.pdb',
   'https://files.rcsb.org/download/1LZI.pdb'],
  'category': ['You and Your Health']},
 'Amyloids': {'links': ['https://files.rcsb.org/download/2M4J.pdb',
   'https://files.rcsb.org/download/3NHC.pdb',
   'https://files.rcsb.org/download/2KJ3.pdb',
   'https://files.rcsb.org/download/3ZPK.pdb',
   'https://files.rcsb.org/download/2LMN.pdb',
   'https://files.rcsb.org/download/2LMP.pdb'],
  'category': ['You and Your Health']},
 'Apoptosomes': {'links': ['https://files.rcsb.org/download/3J2T.pdb',
   'https://files.rcsb.org/download/2P1H.pdb',
   'https://files.rcsb.org/download/1Z6T.pdb',
   'https://files.rcsb.org/download/3SFZ.pdb',
   'https://files.rcsb.org/download/3IZ8.pdb',
   'https://files.rcsb.org/download/3LQQ.pdb'],
  'category': ['You and Y

In [100]:
url='https://pdb101.rcsb.org/motm/motm-by-category'
protein_dict=scrape_pdb.get_protein_links(url)
protein_dict

100%|██████████| 33/33 [00:00<00:00, 1467.96it/s]


{'ABO Blood Type Glycosyltransferases': {'link': 'https://pdb101.rcsb.org/motm/156',
  'category': ['You and Your Health', 'Enzymes']},
 'Amyloids': {'link': 'https://pdb101.rcsb.org/motm/189',
  'category': ['You and Your Health', 'Integrative/Hybrid Methods']},
 'Apoptosomes': {'link': 'https://pdb101.rcsb.org/motm/177',
  'category': ['You and Your Health', 'Cancer']},
 'Beta-secretase': {'link': 'https://pdb101.rcsb.org/motm/115',
  'category': ['You and Your Health', 'Drug Action', 'Enzymes']},
 'Crystallins': {'link': 'https://pdb101.rcsb.org/motm/127',
  'category': ['You and Your Health', 'Biomolecules']},
 'DNA Methyltransferases': {'link': 'https://pdb101.rcsb.org/motm/139',
  'category': ['You and Your Health', 'Protein Synthesis']},
 'Fetal Hemoglobin': {'link': 'https://pdb101.rcsb.org/motm/257',
  'category': ['You and Your Health', 'Transport']},
 'Glucansucrase': {'link': 'https://pdb101.rcsb.org/motm/138',
  'category': ['You and Your Health', 'Enzymes']},
 'Hypoxanthi

In [87]:
abo=protein_dict[list(protein_dict.keys())[0]]
abo

{'link': 'https://pdb101.rcsb.org/motm/156',
 'category': ['You and Your Health', 'Enzymes']}

https://files.rcsb.org/download/3I0G.pdb

In [113]:
links

b'<!DOCTYPE html>

In [114]:
scrape_pdb.find_links(abo["link"])

['https//files.rcsb.org/download/3I0G.pdb',
 'https//files.rcsb.org/download/1LU1.pdb',
 'https//files.rcsb.org/download/2J1U.pdb',
 'https//files.rcsb.org/download/2OBS.pdb',
 'https//files.rcsb.org/download/1LZI.pdb']

In [112]:
links=scrape_pdb.find_links(abo["link"])

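files=links.find_all("div", id="DownloadFilesButton")[0]
# keep only hrefs ending in .pdb (raw string so \. reaches the regex engine unchanged)
download_link=[f["href"] for f in files.find_all("a") if re.match(r"^.*\.pdb$", f["href"])][0]
# the href starts with "//", so the scheme needs its colon: "https:" rather than "https"
download_link="https:"+download_link
download_link

'https://files.rcsb.org/download/3I0G.pdb'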
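
An alternative to prefixing the scheme by hand is the standard library's `urllib.parse.urljoin`, which resolves both protocol-relative hrefs (`//host/path`) and site-relative hrefs (`/path`) against the page they came from. This sketch assumes the download href is protocol-relative, which is what the concatenated output above suggests. It is not how `scrape_pdb` is implemented.

```python
from urllib.parse import urljoin

page_url = "https://pdb101.rcsb.org/motm/156"

# Assumed shape of the href found on the page (protocol-relative).
href = "//files.rcsb.org/download/3I0G.pdb"
download_link = urljoin(page_url, href)

# Site-relative links resolve against the host of the base URL.
category_url = "https://pdb101.rcsb.org/motm/motm-by-category"
motm_link = urljoin(category_url, "/motm/156")

print(download_link)
print(motm_link)
```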

In [99]:
abo_html=scrape_pdb.get_html(abo['link'])
[i.find_all("a")[0]["href"].strip("\\\'") for i in abo_html.find_all("span", {"class":"rcsb_id_tag"})]

['http://www.rcsb.org/pdb/explore/explore.do?structureId=3i0g',
 'http://www.rcsb.org/pdb/explore/explore.do?structureId=1lu1',
 'http://www.rcsb.org/pdb/explore/explore.do?structureId=2j1u',
 'http://www.rcsb.org/pdb/explore/explore.do?structureId=2obs',
 'http://www.rcsb.org/pdb/explore/explore.do?structureId=1lzi']

In [11]:
pdb_html=scrape_pdb.get_html(url)
print(pdb_html.prettify())

b'
<!DOCTYPE html>
<html>
 <head>
  <title>
   PDB-101: Molecule of the Month By Category
  </title>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1" name="viewport"/>
  <meta content="PDB101: Molecule of the Month By Category" property="og:title"/>
  <meta content="PDB-101: Educational portal of RCSB PDB" property="og:description"/>
  <meta content="RCSB: PDB-101" property="og:site_name"/>
  <meta content="PDB-101: Educational portal of RCSB PDB" name="description"/>
  <!-- Assocaite with our Google Analytics & Search Console-->
  <meta content="A8M31jAX8SUgQYzbnF5r-wgykna2i5Hp4J9fziVD9Sg" name="google-site-verification"/>
  <link href="https://cdn.rcsb.org/pdb101/common/jquery-ui-1.11.4/jquery-ui.min.css" rel="stylesheet"/>
  <link href="https://cdn.rcsb.org/javascript/bootstrap/latest/css/bootstrap.min.css" rel="stylesheet"/>
  <link href="https://cdn.rcsb.org/javascript/ekko-lightbox/ekko-lightb

In [26]:
search_obj=re.compile(r"/motm/\d{3}")
search_obj.match('/motm/228')

<re.Match object; span=(0, 9), match='/motm/228'>
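
One caveat with this pattern: `\d{3}` requires exactly three digits at that position, so one- and two-digit article IDs such as `/motm/54` or `/motm/13` (both of which appear in the category HTML below) do not match. A hedged alternative is `\d+` with an end anchor, sketched here on a few hrefs taken from this notebook.

```python
import re

three_digit = re.compile(r"/motm/\d{3}")  # pattern used in the cell above
any_id = re.compile(r"/motm/\d+$")        # assumption: IDs are one or more digits

hrefs = ["/motm/228", "/motm/54", "/motm/13"]

matched_three = [h for h in hrefs if three_digit.match(h)]
matched_any = [h for h in hrefs if any_id.match(h)]

print(matched_three)
print(matched_any)
```

Whether the shorter IDs should be included depends on which proteins the working set is meant to cover.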

In [50]:
category_names=[category.text for category in pdb_html.find_all("a", {"class":"no-underline"})]
category_html=pdb_html.find_all(id=re.compile(r'subcategory_\d*'))
categories=dict(zip(category_names, category_html))
categories

{'You and Your Health': <div id="subcategory_1" style="display:none;"><div class="row grid-view"><div class="col-xs-6 col-sm-4 col-md-4 col-lg-3 grid-cell"><a href="/motm/156"><img class="icon" src="https://cdn.rcsb.org/pdb101/motm/images/tn/156-ABOBloodTypeGlycosyltransferases_3i0g_composite-tn.png"/></a><div><a href="/motm/156">ABO Blood Type Glycosyltransferases</a></div></div><div class="col-xs-6 col-sm-4 col-md-4 col-lg-3 grid-cell"><a href="/motm/54"><img class="icon" src="https://cdn.rcsb.org/pdb101/motm/images/tn/54-1acj-tn.png"/></a><div><a href="/motm/54">Acetylcholinesterase</a></div></div><div class="col-xs-6 col-sm-4 col-md-4 col-lg-3 grid-cell"><a href="/motm/13"><img class="icon" src="https://cdn.rcsb.org/pdb101/motm/images/tn/13-1htb-tn.png"/></a><div><a href="/motm/13">Alcohol Dehydrogenase</a></div></div><div class="col-xs-6 col-sm-4 col-md-4 col-lg-3 grid-cell"><a href="/motm/79"><img class="icon" src="https://cdn.rcsb.org/pdb101/motm/images/tn/79-app-composite-tn.pn

In [73]:
test_category=categories["You and Your Health"]
test_proteins=test_category.find_all(href=re.compile(r'/motm/\d{3}'))[1::2]
test_protein_names=[tp.text for tp in test_proteins]
test_protein_links=[site_url+tp["href"] for tp in test_proteins]
test_proteins=dict(zip(test_protein_names, test_protein_links))
test_proteins

{'ABO Blood Type Glycosyltransferases': 'https://pdb101.rcsb.org/motm/156',
 'Amyloids': 'https://pdb101.rcsb.org/motm/189',
 'Apoptosomes': 'https://pdb101.rcsb.org/motm/177',
 'Beta-secretase': 'https://pdb101.rcsb.org/motm/115',
 'Crystallins': 'https://pdb101.rcsb.org/motm/127',
 'DNA Methyltransferases': 'https://pdb101.rcsb.org/motm/139',
 'Fetal Hemoglobin': 'https://pdb101.rcsb.org/motm/257',
 'Glucansucrase': 'https://pdb101.rcsb.org/motm/138',
 'Hypoxanthine-guanine phosphoribosyltransferase (HGPRT)': 'https://pdb101.rcsb.org/motm/151',
 'Hypoxia-Inducible Factors': 'https://pdb101.rcsb.org/motm/240',
 'Leptin': 'https://pdb101.rcsb.org/motm/149',
 'Opioid Receptors': 'https://pdb101.rcsb.org/motm/217',
 'Phospholipase A2': 'https://pdb101.rcsb.org/motm/239',
 'Piezo1 Mechanosensitive Channel': 'https://pdb101.rcsb.org/motm/223',
 'Prions': 'https://pdb101.rcsb.org/motm/101',
 'Proton-Gated Urea Channel': 'https://pdb101.rcsb.org/motm/158',
 'S-Nitrosylated Hemoglobin': 'http
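
The `[1::2]` slice above relies on every protein contributing exactly two anchors in a fixed order (thumbnail first, name second). A sketch of a less order-dependent approach is to keep only anchors that contain visible text. The version below uses the standard library's `html.parser` rather than the notebook's parsing helper, and the class name `NamedLinkParser` and the shortened `src` values are hypothetical.

```python
from html.parser import HTMLParser


class NamedLinkParser(HTMLParser):
    """Collect {link text: href} for anchors that contain visible text."""

    def __init__(self):
        super().__init__()
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text fragments seen inside that <a>
        self.links = {}

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            name = "".join(self._text).strip()
            if name:  # image-only anchors have no text and are skipped
                self.links[name] = self._href
            self._href = None


# Markup shaped like the category HTML above, with image paths shortened.
snippet = (
    '<a href="/motm/156"><img class="icon" src="x.png"/></a>'
    '<div><a href="/motm/156">ABO Blood Type Glycosyltransferases</a></div>'
    '<a href="/motm/54"><img class="icon" src="y.png"/></a>'
    '<div><a href="/motm/54">Acetylcholinesterase</a></div>'
)

link_parser = NamedLinkParser()
link_parser.feed(snippet)
print(link_parser.links)
```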

In [69]:
site_url+'/motm/156'

'https://pdb101.rcsb.org/motm/156'

In [71]:
site_url="/".join(url.split("/")[:-2])


In [30]:
category_html=pdb_html.find_all(href=re.compile(r'/motm/\d{3}'))
category_html

[<a href="/motm/156"><img class="icon" src="https://cdn.rcsb.org/pdb101/motm/images/tn/156-ABOBloodTypeGlycosyltransferases_3i0g_composite-tn.png"/></a>,
 <a href="/motm/156">ABO Blood Type Glycosyltransferases</a>,
 <a href="/motm/189"><img class="icon" src="https://cdn.rcsb.org/pdb101/motm/images/tn/189-Amyloids_2m4j-tn.png"/></a>,
 <a href="/motm/189">Amyloids</a>,
 <a href="/motm/177"><img class="icon" src="https://cdn.rcsb.org/pdb101/motm/images/tn/177-Apoptosomes_human_apoptosome-tn.png"/></a>,
 <a href="/motm/177">Apoptosomes</a>,
 <a href="/motm/115"><img class="icon" src="https://cdn.rcsb.org/pdb101/motm/images/tn/115-1sgz_1py1-tn.png"/></a>,
 <a href="/motm/115">Beta-secretase</a>,
 <a href="/motm/127"><img class="icon" src="https://cdn.rcsb.org/pdb101/motm/images/tn/127-mom127_crystallins2-tn.png"/></a>,
 <a href="/motm/127">Crystallins</a>,
 <a href="/motm/139"><img class="icon" src="https://cdn.rcsb.org/pdb101/motm/images/tn/139-DNAMethylases_DNMT-tn.png"/></a>,
 <a href="

## Cleaning Data

### Example file to test cleaning

Extract the atoms of each protein structure into a dataframe, then save one csv file per structure.

In [173]:
%%timeit

fn='../data/ABO-Blood-Type-Glycosyltransferases/1LU1.pdb'

parser=PDBParser()
structure=parser.get_structure('1LU1', fn)

# walk the structure hierarchy (model > chain > residue > atom), one row per atom
structures=[]
for model in structure.get_models():
    for chain in model.get_chains():
        for residue in chain.get_residues():
            for atom in residue.get_atoms():
                # full id: (structure id, model id, chain id, residue id, atom id)
                structure_id, model_id, chain_id, residue_id, atom_id=atom.get_full_id()
                # residue_id is (hetero flag, sequence number, insertion code)
                # atom_id is (atom name, alternate location)
                data=[structure_id, model_id, chain_id, residue_id[1], atom_id[0]]
                data=data+list(atom.get_coord())
                structures.append(data)

structure_cols=[
    'protein_id',
    'model_id',
    'chain_id',
    'residue_id',
    # 'residue_name',
    'atom_name',
    'atom_coord_x',
    'atom_coord_y',
    'atom_coord_z'
]

pd.DataFrame(
    data=structures,
    columns=structure_cols
)

22.9 ms ± 345 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
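
For intuition about what each extracted row corresponds to in the raw file, the sketch below reads the same fields directly from one fixed-width `ATOM` record using the column positions of the PDB file format. The sample record is assembled by hand: the coordinates are those of the first atom in the table further down, while the residue name `ALA` is a placeholder and not taken from 1LU1. `PDBParser` handles far more than this (multiple models, alternate locations, hetero residues), so this is only an illustration, not a replacement.

```python
# One ATOM record, built field by field so the fixed columns are visible.
sample_record = (
    "ATOM  "      # columns 1-6: record name
    "    1"       # columns 7-11: atom serial number
    " "           # column 12: unused
    " N  "        # columns 13-16: atom name
    " "           # column 17: alternate location indicator
    "ALA"         # columns 18-20: residue name (placeholder)
    " "           # column 21: unused
    "A"           # column 22: chain identifier
    "   1"        # columns 23-26: residue sequence number
    " "           # column 27: insertion code
    "   "         # columns 28-30: unused
    "  50.942"    # columns 31-38: x coordinate
    "  31.718"    # columns 39-46: y coordinate
    " 123.911"    # columns 47-54: z coordinate
)


def parse_atom_record(line):
    """Return (chain_id, residue_id, atom_name, x, y, z) from an ATOM line."""
    return (
        line[21],
        int(line[22:26]),
        line[12:16].strip(),
        float(line[30:38]),
        float(line[38:46]),
        float(line[46:54]),
    )


row = parse_atom_record(sample_record)
print(row)
```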


### Clean all files

1. Find every protein file under `../data/raw`
2. Extract each file to a dataframe
3. Save the result as csv
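
The first and last steps can be sketched with `pathlib` alone: find every `.pdb` file under a source folder and work out where its csv would go, keeping the protein's folder name. The helper `plan_csv_paths` is hypothetical and is not the implementation of `clean_pdb.extract_pdb_files`. The example runs against a temporary directory that mimics the `raw/<protein>/<id>.pdb` layout used in this project.

```python
import tempfile
from pathlib import Path


def plan_csv_paths(src, dst):
    """Map each .pdb file under src to a .csv path under dst.

    The name of the file's parent folder (the protein) is preserved.
    """
    src, dst = Path(src), Path(dst)
    plan = {}
    for pdb_path in sorted(src.rglob("*.pdb")):
        protein_dir = pdb_path.parent.name
        plan[pdb_path] = dst / protein_dir / pdb_path.with_suffix(".csv").name
    return plan


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny stand-in for ../data/raw with one empty structure file.
    raw = Path(tmp) / "raw" / "ABO-Blood-Type-Glycosyltransferases"
    raw.mkdir(parents=True)
    (raw / "1R4U.pdb").write_text("")

    plan = plan_csv_paths(Path(tmp) / "raw", Path(tmp) / "cleaned")
    targets = [p.relative_to(tmp).as_posix() for p in plan.values()]

print(targets)
```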

In [250]:
src='../data/raw'
dst='../data/cleaned'

df=clean_pdb.extract_pdb_files(src, dst)
df

  0%|          | 0/476 [00:00<?, ?it/s]

Completed data cleaning in 283 seconds. 
Saved data to ../data/pdb-data.csv


| | protein_id | model_id | chain_id | residue_id | atom_name | atom_coord_x | atom_coord_y | atom_coord_z |
|---|---|---|---|---|---|---|---|---|
| 0 | 1LU1 | 0 | A | 1 | N | 50.942 | 31.718 | 123.911 |
| 1 | 1LU1 | 0 | A | 1 | CA | 49.554 | 31.356 | 123.485 |
| 2 | 1LU1 | 0 | A | 1 | C | 48.946 | 32.417 | 122.588 |
| 3 | 1LU1 | 0 | A | 1 | O | 49.163 | 33.608 | 122.787 |
| 4 | 1LU1 | 0 | A | 1 | CB | 48.672 | 31.155 | 124.69 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 5852950 | 1R4U | 0 | A | 1203 | O | 10.863 | 56.331 | 53.053 |
| 5852951 | 1R4U | 0 | A | 1204 | O | 43.752 | 29.63 | 49.61 |
| 5852952 | 1R4U | 0 | A | 1205 | O | 42.176 | 27.879 | 50.263 |
| 5852953 | 1R4U | 0 | A | 1206 | O | 22.972 | 69.07 | 21.465 |


In [252]:
dst='../data/cleaned'
fn='/home/apfriend/ucsd/CURRENT/dsc190/project/data/raw/ABO-Blood-Type-Glycosyltransferases/1R4U.pdb'
os.path.basename(os.path.dirname(fn))

'ABO-Blood-Type-Glycosyltransferases'

In [253]:
os.path.basename(fn).replace('.pdb','.csv')

'1R4U.csv'

In [242]:
df=clean_pdb.extract_pdb_files(src)
df

  0%|          | 0/10 [00:00<?, ?it/s]

  0%|          | 0/1921 [00:00<?, ?it/s]

  0%|          | 0/2610 [00:00<?, ?it/s]

  0%|          | 0/2647 [00:00<?, ?it/s]

  0%|          | 0/2629 [00:00<?, ?it/s]

  0%|          | 0/2616 [00:00<?, ?it/s]

  0%|          | 0/512 [00:00<?, ?it/s]

  0%|          | 0/5883 [00:00<?, ?it/s]

  0%|          | 0/2430 [00:00<?, ?it/s]

  0%|          | 0/12621 [00:00<?, ?it/s]

  0%|          | 0/92388 [00:00<?, ?it/s]

Completed data cleaning in 2 seconds


| | protein_id | model_id | chain_id | residue_id | atom_name | atom_coord_x | atom_coord_y | atom_coord_z |
|---|---|---|---|---|---|---|---|---|
| 0 | 1LU1 | 0 | A | 1 | N | 50.942 | 31.718 | 123.911 |
| 1 | 1LU1 | 0 | A | 1 | CA | 49.554 | 31.356 | 123.485 |
| 2 | 1LU1 | 0 | A | 1 | C | 48.946 | 32.417 | 122.588 |
| 3 | 1LU1 | 0 | A | 1 | O | 49.163 | 33.608 | 122.787 |
| 4 | 1LU1 | 0 | A | 1 | CB | 48.672 | 31.155 | 124.69 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 126252 | 1VSZ | 0 | V | 2014 | CA | -74.575 | 184.24 | -261.312 |
| 126253 | 1VSZ | 0 | V | 2015 | CA | -72.879 | 184.2 | -257.947 |
| 126254 | 1VSZ | 0 | V | 2016 | CA | -70.939 | 181.357 | -259.472 |
| 126255 | 1VSZ | 0 | V | 2017 | CA | -70.585 | 183.722 | -262.457 |


In [249]:
fp='/home/apfriend/ucsd/CURRENT/dsc190/project/data/raw/Adenovirus/3IYN.pdb'
parser=PDBParser()
structure=parser.get_structure('3IYN', fp)
clean_pdb.get_protein_data(structure)

array([['3IYN', '0', 'A', ..., '103.751', '21.714', '331.425'],
       ['3IYN', '0', 'A', ..., '103.818', '22.336', '330.109'],
       ['3IYN', '0', 'A', ..., '104.855', '23.454', '330.079'],
       ...,
       ['3IYN', '0', 'T', ..., '86.846', '40.081', '404.699'],
       ['3IYN', '0', 'T', ..., '86.081', '39.297', '405.301'],
       ['3IYN', '0', 'T', ..., '85.613', '42.156', '404.01']], dtype='<U7')