# Argininosuccinate Synthetase (ASS1 or ARGSS) Case Study

Last edited on 2/28/2016 by Laurence. Removed the function writing a FASTA file from Nathan's spreadsheet so all FASTA files are called by the refseq ID from Anna's BioEntrez section. Also modified protein label on PV display.

Last edited by Anna. Added Gene ID--> Accession number

Key Information: 
    * Uniprot ID: P00966
    * Entrez ID:445
    * Perturbation: Low levels of ARGSS expression observed (Bowles, Int J Cancer 2008)
    

   

## BioEntrez

In [1]:
from Bio import Entrez
entrez_email = raw_input("What is an e-mail you would like to provide to query the Entrez database?   :")
Entrez.email = entrez_email
gene_entrez_id = raw_input('Enter Gene Entrez ID:')
handle = Entrez.efetch(db="gene", id=gene_entrez_id ,rettype="fasta", retmode="xml")
records=Entrez.read(handle)
#records[0].keys()
gene_entrez_name = records[0]["Entrezgene_gene"]["Gene-ref"]["Gene-ref_locus"]
gene_entrez_location = records[0]["Entrezgene_gene"]["Gene-ref"]["Gene-ref_maploc"]
gene_entrez_syn = records[0]["Entrezgene_gene"]["Gene-ref"]["Gene-ref_syn"]
gene_entrez_summary=records[0]["Entrezgene_summary"]
print "\nName: %s\n\nLocation: %s\n\nSynonyms: %s \n\nSummary: %s" %(gene_entrez_name, gene_entrez_location, gene_entrez_syn, gene_entrez_summary)

What is an e-mail you would like to provide to query the Entrez database?   :lac018@ucsd.edu
Enter Gene Entrez ID:445

Name: ASS1

Location: 9q34.1

Synonyms: ['ASS', 'CTLN1'] 

Summary: The protein encoded by this gene catalyzes the penultimate step of the arginine biosynthetic pathway. There are approximately 10 to 14 copies of this gene including the pseudogenes scattered across the human genome, among which the one located on chromosome 9 appears to be the only functional gene for argininosuccinate synthetase. Mutations in the chromosome 9 copy of this gene cause citrullinemia. Two transcript variants encoding the same protein have been found for this gene. [provided by RefSeq, Aug 2012]


#### Entrez Gene ID to Accession Numbers

In [2]:
import mygene
mg = mygene.MyGeneInfo()
mg.getgene(gene_entrez_id, 'name,symbol,refseq')

{u'_id': u'445',
 u'name': u'argininosuccinate synthase 1',
 u'refseq': {u'genomic': [u'NC_000009', u'NC_018920', u'NG_011542'],
  u'protein': [u'NP_000041',
   u'NP_446464',
   u'XP_005272257',
   u'XP_011517007',
   u'XP_016870218'],
  u'rna': [u'NM_000050',
   u'NM_054012',
   u'XM_005272200',
   u'XM_011518705',
   u'XM_017014729']},
 u'symbol': u'ASS1'}

In [3]:
import pandas as pd
GP = pd.read_csv('DF_GEMPRO.csv', index_col=0)
# forcing gene IDs to be read as strings
GP['m_gene_original'] = GP['m_gene_original'].astype(str)
GP['m_gene_entrez'] = GP['m_gene_entrez'].astype(str)
GP['m_gene_isoform'] = GP['m_gene_isoform'].astype(str)
GEM_PRO_available_refseq = GP[GP.m_gene_entrez == gene_entrez_id]['u_refseq']
print "These are the refseq IDs compatible with our workflow"
pd.DataFrame(GEM_PRO_available_refseq)

These are the refseq IDs compatible with our workflow


Unnamed: 0,u_refseq
3049,NP_446464
3050,


In [4]:
acc = raw_input('Enter the refseq ID (u_refseq) of the sequence you would like to download:   ')
from Bio import SeqIO
from Bio import Entrez
Entrez.email = entrez_email
temp = Entrez.efetch(db="nucleotide",rettype="gb",id=acc)
out = open(acc+".faa",'w')
gbseq = SeqIO.read(temp, "genbank")
SeqIO.write(gbseq,out,"fasta")
temp.close()
out.close()
print(gbseq)

Enter the refseq ID (u_refseq) of the sequence you would like to download:   NP_446464
ID: NP_446464.1
Name: NP_446464
Description: argininosuccinate synthase [Homo sapiens].
Number of features: 10
/comment=REVIEWED REFSEQ: This record has been curated by NCBI staff. The
reference sequence was derived from DB496935.1 and BC013224.1.
Summary: The protein encoded by this gene catalyzes the penultimate
step of the arginine biosynthetic pathway. There are approximately
10 to 14 copies of this gene including the pseudogenes scattered
across the human genome, among which the one located on chromosome
9 appears to be the only functional gene for argininosuccinate
synthetase. Mutations in the chromosome 9 copy of this gene cause
citrullinemia. Two transcript variants encoding the same protein
have been found for this gene. [provided by RefSeq, Aug 2012].
Transcript Variant: This variant (2) lacks an exon in the 5' UTR,
compared to variant 1. Variants 1 and 2 encode the same protein.
Publicatio

## Visualizing the protein and its mutation

In [5]:
def get_pdb_seq(structure):
    '''
    Takes in a Biopython structure object and returns a list of the structure's sequences
    :param structure: Biopython structure object
    :return: Dictionary of sequence strings with chain IDs as the key
    '''
    
    structure_seqs = {}
    
    # loop over each chain of the PDB
    for chain in structure[0]:
        
        chain_it = iter(chain) 
        
        chain_seq = ''
        tracker = 0
        
        # loop over the residues
        for res in chain.get_residues():
            # NOTE: you can get the residue number too
            res_num = res.id[1]
            
            # double check if the residue name is a standard residue
            # if it is not a standard residue (ie. selenomethionine),
            # it will be filled in with an X on the next iteration)
            if Polypeptide.is_aa(res, standard=True):
                full_id = res.get_full_id()
                end_tracker = full_id[3][1]
                i_code = full_id[3][2]
                aa = Polypeptide.three_to_one(res.get_resname())
                
                # tracker to fill in X's
                if end_tracker != (tracker + 1):# and first == False:
                    if i_code != ' ':
                        chain_seq += aa
                        tracker = end_tracker + 1
                        continue
                    else:
                        chain_seq += 'X'*(end_tracker - tracker - 1)
                        
                chain_seq += aa
                tracker = end_tracker
                
            else:
                continue

        structure_seqs[chain.get_id()] = chain_seq

    return structure_seqs

import os.path
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC

def write_fasta_file(sequence, fileout):
    '''
    This writes a fasta file for a SeqRecord object. It also checks if the file exists already and returns the filename.
    
    Input: sequence - Biopython SeqRecord object, identification - ID of the sequence.
    Output: Filename of fasta file
    '''
    
    outfile = "%s" % fileout
    if os.path.isfile(outfile):
        print 'FASTA file already exists %s' % outfile
        return outfile
    else:
        SeqIO.write(sequence, outfile, "fasta")
        return outfile
    
import os.path
from Bio.Emboss.Applications import NeedleCommandline

def run_alignment(fasta1_id, fasta1, fasta2_id, fasta2):
    '''
    Runs the needle alignment program and writes the result to a file. Returns the filename. Standard gap inputs are used.
    
    Input:  fasta1 - fasta file name ("reference" sequence)
            fasta2 - fasta file name (what you're interested in aligning)
    Output: alignment_file - file name of alignment
    '''

    alignment_file = "%s_%s_align.txt" % (fasta1_id, fasta2_id)
    
    if os.path.isfile(alignment_file):
        print 'Alignment %s file already exists' % alignment_file
        return alignment_file

    else:
        print '**RUNNING ALIGNMENT FOR %s AND %s**' % (fasta1_id, fasta2_id)
        needle_cline = NeedleCommandline(asequence=fasta1, bsequence=fasta2, gapopen=10, gapextend=0.5, outfile=alignment_file)
        stdout, stderr = needle_cline()
        return alignment_file

    
import numpy as np
from Bio import AlignIO
from collections import defaultdict

def get_alignment_allpos_df(alignment_file, a_seq_id=None, b_seq_id=None):
    '''
    Takes in a needle alignment file and returns a pandas dataframe of the results
    Input: alignment_file - the path to the alignment file, 
            a_seq_id - optional ID of the reference sequence, 
            b_seq_id - optional ID of the second sequence
    Output: alignment_df - a pandas dataframe of the alignment results
    '''
    alignments = list(AlignIO.parse(alignment_file, "emboss"))

    appender = defaultdict(dict)
    idx = 0
    for alignment in alignments:
    #         if not switch:
        if not a_seq_id:
            a_seq_id = list(alignment)[0].id
        a_seq = str(list(alignment)[0].seq)
        if not b_seq_id:
            b_seq_id = list(alignment)[1].id
        b_seq = str(list(alignment)[1].seq)

        a_idx = 1
        b_idx = 1

        for i, (a,b) in enumerate(zip(a_seq,b_seq)):
            if a == b and a != '-' and b != '-':
                aa_flag = 'match'
            if a != b and a == '-' and b != '-':
                aa_flag = 'insertion'
            if a != b and a != '-' and b == '-':
                aa_flag = 'deletion'
            if a != b and a != '-' and b == 'X':
                aa_flag = 'unresolved'
            if a != b and b != '-' and a == 'X':
                aa_flag = 'unresolved'
            elif a != b and a != '-' and b != '-':
                aa_flag = 'mutation'
                
            appender[idx]['Uniprot_ID'] = a_seq_id
            appender[idx]['Structure'] = b_seq_id
            appender[idx]['type'] = aa_flag
            
            if aa_flag == 'match' or aa_flag == 'unresolved' or aa_flag == 'mutation':
                appender[idx]['Uniprot_sequence'] = a
                appender[idx]['Uniprot_sequence_position'] = a_idx
                appender[idx]['PDB_sequence'] = b
                appender[idx]['PDB_sequence_position'] = b_idx
                a_idx += 1
                b_idx += 1

            if aa_flag == 'deletion':
                appender[idx]['Uniprot_sequence'] = a
                appender[idx]['Uniprot_sequence_position'] = a_idx
                a_idx += 1

            if aa_flag == 'insertion':
                appender[idx]['PDB_sequence'] = b
                appender[idx]['PDB_sequence_position'] = b_idx
                b_idx += 1
            
            idx += 1

    alignment_df = pd.DataFrame.from_dict(appender, orient='index')
    alignment_df = alignment_df[['Uniprot_ID', 'Structure', 'type', 'Uniprot_sequence', 'Uniprot_sequence_position', 'PDB_sequence', 'PDB_sequence_position']].fillna(value=np.nan)
    
    return alignment_df

def get_pdb_seq2(structure):
    '''
    Takes in a Biopython structure object and returns a list of the structure's sequences
    :param structure: Biopython structure object
    :return: Dictionary of sequence strings with chain IDs as the key
    '''
    
    structure_seqs = {}
    
    # loop over each chain of the PDB
    for chain in structure[0]:
        
        chain_it = iter(chain) 
        
        chain_seq = []
        tracker = 0
        
        # loop over the residues
        for res in chain.get_residues():
            # NOTE: you can get the residue number too
            res_num = res.id[1]
            
            # double check if the residue name is a standard residue
            # if it is not a standard residue (ie. selenomethionine),
            # it will be filled in with an X on the next iteration)
            # TODO: except when it's at the beginning or end...
            if Polypeptide.is_aa(res, standard=True):
                full_id = res.get_full_id()
                end_tracker = full_id[3][1]
                i_code = full_id[3][2]
                aa = Polypeptide.three_to_one(res.get_resname())
                
                # tracker to fill in X's
                if end_tracker != (tracker + 1):
                    if i_code != ' ':
                        chain_seq.append((aa,end_tracker))
                        tracker = end_tracker + 1
                        continue
                    else:
                        xes = 'X'*(end_tracker - tracker - 1)
                        for x in xes:
                            chain_seq.append((x,end_tracker))
                        
                chain_seq.append((aa,end_tracker))
                tracker = end_tracker
                
            else:
                continue

        structure_seqs[chain.get_id()] = chain_seq

    return structure_seqs

class PDBViewer_options(object):
    '''
    Contributed by: Ali Ebrahim
    '''
    
    def __init__(self, f):
        self.pdb = open(f).read()

    def _repr_html_(self):
        div_id = str(uuid.uuid4())
        
        return """<div id="%s" style="width: 800px; height: 600px">
    <div>
    
        <!--testing static label-->
        <style>
            .static-label {
                position: absolute;
                background: #0000;
                text-align: right;
                z-index: 1;
                font-weight: bold;
                width: 800px;
            }
        </style>
        
        <script>
            require.config({
                paths: {
                    "pv": "//biasmv.github.io/pv/js/pv.min"
                }
            });
            
            require(["pv"], function(pv) {
                pdb = "%s";
                
                <!--append the static label to the parent-->
                var parent = document.getElementById('%s');
                var staticLabel = document.createElement('div');
                staticLabel.innerHTML = '%s';
                staticLabel.className = 'static-label';
                parent.appendChild(staticLabel);
                
                <!--load the structure-->
                structure = pv.io.pdb(pdb);
                
                // select a chain to display and see if user wants to only display one
                %s
                
                <!--choose atom to label-->
                var carbonAlpha = structure.atom("A.6.CA");
                
                // choose a ligand to color (later on), if want to see on both chains remove cname
                %s var residues = structure.select({%s %s});
                
                viewer = pv.Viewer(parent, {
                    width: '800',
                    height: '600',
                    antialias: true,
                    outline: true,
                    quality: 'medium',
                    style: 'hemilight',
                    background: 'white',
                    animateTime: 500,
                    selectionColor: '#f00'
                });
            
                
                <!--misc viewer functions-->
                viewer.fitParent();
                
                // add cartoon visualization
                viewer.cartoon('molecule', %s);
                
                // color the selected residues in red, and display as red lines
                %s viewer.spheres('residues', residues,  { color: pv.color.uniform('red') });
                
                // center on the structure
                viewer.centerOn(%s);
                
                <!--atom label options-->
                var options = {
                 fontSize : 26, fontColor: '#f22', backgroundAlpha : 0.4
                };
                
                <!--display the label-->
                viewer.label('label', carbonAlpha.qualifiedName(), carbonAlpha.pos(), options);
                
                <!--not sure how the auto zoom works-->
                viewer.autoZoom();
            });
        </script>
        """ % (div_id, self.pdb.replace("\n", "\\n"), div_id, protein_name, cnames_pv_var, is_res, chain_iso, rnums_script, structure_var, is_res, structure_var)

In [6]:
GP = pd.read_csv('DF_GEMPRO.csv', index_col=0)
# forcing gene IDs to be read as strings
GP['m_gene_original'] = GP['m_gene_original'].astype(str)
GP['m_gene_entrez'] = GP['m_gene_entrez'].astype(str)
GP['m_gene_isoform'] = GP['m_gene_isoform'].astype(str)
GP[GP.u_refseq == acc]

Unnamed: 0,m_gene_original,m_gene_entrez,m_gene_isoform,u_uniprot_acc,u_isoform_id,u_refseq,u_ensp,u_seq_len,u_seq,u_reviewed,...,ssb_p_aln_coverage,ssb_p_percent_seq_ident,ssb_p_no_deletions_in_pdb,ssb_p_aln_coverage_sim,ssb_si_score,ssb_rez_score,ssb_raw_score,ssb_above_cutoffs,ssb_rank,ssb_best_file
3049,445.1,445,1,P00966,P00966-1,NP_446464,ENSP00000253004,412.0,MSSKGSVVLAYSGGLDTSCILVWLKEQGYDVIAYLANIGQKEDFEE...,True,...,402.0,0.975728,True,402.0,1.565458,1.066667,2.632125,True,1.0,2nz2.pdb


In [7]:
# This extracts all chains present
#print "These chains are present in the pdb structure: %s" %(chain_strings)
#pdb_chain_choose = raw_input("Which chain are you interested in?   ")
chains_avail = GP[GP.u_refseq == acc].p_chains
chains_present = ""
for a in chains_avail:
    chains_present = a

# This automatically displays/chooses which chain to align as it is the "best"; a string of A,B,C is returned
best_pdb_chain = GP[GP.u_refseq == acc].p_chain_uniprot_map.values[0][2]

# load Biopython PDB packages

# PDBList to download PDBs
from Bio.PDB.PDBList import PDBList
pdbl = PDBList()

# PDBParser to load and work with files
from Bio.PDB.PDBParser import PDBParser
parser = PDBParser()

import urllib2
import uuid

pdb_name = raw_input("What is the pdb ID?   ")

# download pdb
pdb_file_path = pdbl.retrieve_pdb_file(pdb_name)

# we can put a raw input to name the structure as well
protein_name = raw_input("What is the name of the protein of interest?   ")
pdb_structure = parser.get_structure(protein_name, pdb_file_path)

# get the ligands within this file for display
# from: http://stackoverflow.com/questions/25718201/remove-heteroatoms-from-pdb
ligands = []

for residue in pdb_structure.get_residues():
    tags = residue.get_full_id()
    # tags contains a tuple with (Structure ID, Model ID, Chain ID, (Residue ID))
    # Residue ID is a tuple with (*Hetero Field*, Residue ID, Insertion Code)

    # Thus you're interested in the Hetero Field, that is empty if the residue
    # is not a hetero atom or have some flag if it is (W for waters, H, etc.)
    if tags[3][0] != " " and tags[3][0] != "W":
        ligands.append(tags[3][0].split('_')[1].strip())
    else:
        continue
        
#print(ligands)

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC
from Bio.PDB import Polypeptide

# represented in a single string
pdb_sequence = get_pdb_seq(pdb_structure)
string_pdb_seq = pdb_sequence[best_pdb_chain]

# represented in a single string
pdb_sequence = get_pdb_seq(pdb_structure)
string_pdb_seq = pdb_sequence[best_pdb_chain]

# outputs a fasta file format
faa_out1 = '> '
faa_out2 = '%s pdb sequence fasta' %(pdb_name)
faa_out3 = '\n%s' %(string_pdb_seq)
faa_out = faa_out1 + faa_out2 + faa_out3

file = open(faa_out2 + '.faa', "w")
# note the fasta file name is named faa_out2
file.write(faa_out)
file.close()

What is the pdb ID?   2nz2
Structure exists: '/Users/LAURENCE/sd_venv/pipeline_1/nz/pdb2nz2.ent' 
What is the name of the protein of interest?   ARGSS


### Runs sequence alignment of refseq and pdb amino acid sequences

In [8]:
# this gets the current directory

import os 
current_fp = os.getcwd()

# this instructs where to look for the corresponding sequence files

SEQUENCE_FILES = current_fp
REFSEQ_FILES = current_fp
PDB_SEQ_FILES = current_fp

# 1. get the refseq sequence file
seq_id = acc
seq_fasta = os.path.join(REFSEQ_FILES, acc + '.faa')

if os.path.exists(seq_fasta):
    print('found refseq fasta file {}'.format(seq_fasta))
    
# 2. get the pdb sequence file
pdb_id = GP[GP.u_refseq == acc].ssb_best_file.values[0].strip('.pdb')
pdb_fasta = os.path.join(PDB_SEQ_FILES, faa_out2 + '.faa')

if os.path.exists(pdb_fasta):
    print('found pdb fasta file {}'.format(pdb_fasta))
    
# 3. run the alignment using the function above
os.chdir('/tmp/')
alignment_filename = run_alignment(seq_id, seq_fasta, pdb_id, pdb_fasta)

found refseq fasta file /Users/LAURENCE/sd_venv/pipeline_1/NP_446464.faa
found pdb fasta file /Users/LAURENCE/sd_venv/pipeline_1/2nz2 pdb sequence fasta.faa
Alignment NP_446464_2nz2_align.txt file already exists


In [9]:
!cat $alignment_filename

########################################
# Program: needle
# Rundate: Wed 13 Jul 2016 11:54:32
# Commandline: needle
#    -outfile NP_446464_2nz2_align.txt
#    -asequence "/Users/LAURENCE/Desktop/Senior Design/NP_446464.faa"
#    -bsequence "/Users/LAURENCE/Desktop/Senior Design/Untitled Folder/2nz2 pdb sequence fasta.faa"
#    -gapopen 10
#    -gapextend 0.5
# Align_format: srspair
# Report_file: NP_446464_2nz2_align.txt
########################################

#
# Aligned_sequences: 2
# 1: NP_446464.1
# 2: 2nz2
# Matrix: EBLOSUM62
# Gap_penalty: 10.0
# Extend_penalty: 0.5
#
# Length: 501
# Identity:     402/501 (80.2%)
# Similarity:   402/501 (80.2%)
# Gaps:          89/501 (17.8%)
# Score: 2100.0
# 
#

NP_446464.1        1 MSSKGSVVLAYSGGLDTSCILVWLKEQGYDVIAYLANIGQKEDFEEARKK     50
                     ...|||||||||||||||||||||||||||||||||||||||||||||||
2nz2               1 XXXKGSVVLAYSGGLDTSCILVWLKEQGYDVIAYLANIGQKEDFEEARKK     50

NP_446464.1       5

In [10]:
structure = parser.get_structure('someprotein', pdb_file_path)
my_structure_sequence = get_pdb_seq2(structure)
from Bio.PDB import Selection
get_alignment_allpos_df(alignment_filename).head(30) #How many rows does the user want to see?

Unnamed: 0,Uniprot_ID,Structure,type,Uniprot_sequence,Uniprot_sequence_position,PDB_sequence,PDB_sequence_position
0,NP_446464.1,2nz2,mutation,M,1.0,X,1
1,NP_446464.1,2nz2,mutation,S,2.0,X,2
2,NP_446464.1,2nz2,mutation,S,3.0,X,3
3,NP_446464.1,2nz2,match,K,4.0,K,4
4,NP_446464.1,2nz2,match,G,5.0,G,5
5,NP_446464.1,2nz2,match,S,6.0,S,6
6,NP_446464.1,2nz2,match,V,7.0,V,7
7,NP_446464.1,2nz2,match,V,8.0,V,8
8,NP_446464.1,2nz2,match,L,9.0,L,9
9,NP_446464.1,2nz2,match,A,10.0,A,10


In [11]:
my_mutation_resnum = int(raw_input("What is the corresponding mutation on the PDB structure?   "))
# let's get the info from the structure
my_mutation_residue = structure[0][best_pdb_chain][my_mutation_resnum]
#print my_mutation_residue
# we can use the Selection class to select all atoms of this residue
# 'A' here stands for ATOM (http://biopython.org/DIST/docs/api/Bio.PDB.Selection-module.html)
atom_list = Selection.unfold_entities(my_mutation_residue, 'A')

# then you can format this information for PV:
#for a in atom_list:
    #print('{}.{}.{}').format('A',my_mutation_resnum,a.id)

What is the corresponding mutation on the PDB structure?   10


In [22]:
print "These are the chains present in the structure:   " + chains_present
chain_display = raw_input("Would you like to display all chains (answer with yes or no)?   ")
if chain_display.upper() == 'YES':
    cnames_pv_var = ''
    structure_var = 'structure'
    chain_iso = ''
elif chain_display.upper() == 'NO':
    cnames = raw_input("Type in the chain you would like to display:   ")
    cnames_pv = "cname: '" + cnames + "'"
    cnames_pv_var = "var chain = structure.select({cname: '" + cnames + "'})"
    structure_var = 'chain'
else:
    print " "
bind_site_avail = raw_input("Is there a binding or active site you would like to display (answer with yes or no)?   ")

if bind_site_avail.upper() == 'YES':
    chain_res_disp_choice = raw_input("Would you like to display the site on all chains (answer with yes or no)?   ")
    print "These are the chains present in the structure:   " + chains_present
    if chain_res_disp_choice.upper() == 'YES':
        chain_iso = ''
        is_res = ''
    elif chain_res_disp_choice.upper() == 'NO':
        chain_res_disp_chain_choice = raw_input("Which chain would you like to display the site on?   ")
        is_res = ''
        chain_iso = ''
    else:
        print ""
elif bind_site_avail.upper() == 'NO':
    site_start = 0
    site_end = 0
    rnums_script = ''
    is_res = '//'
else:
    print ""
    

These are the chains present in the structure:   ['A']
Would you like to display all chains (answer with yes or no)?   no
Type in the chain you would like to display:   A
Is there a binding or active site you would like to display (answer with yes or no)?   yes
Would you like to display the site on all chains (answer with yes or no)?   no
These are the chains present in the structure:   ['A']
Which chain would you like to display the site on?   A


### Insert workaround here to display this image

In [13]:
from IPython.display import Image
binding_url = ('http://www.rcsb.org/pdb/explore/remediatedChain.do?structureId=2NZ2&params.annotationsStr=Site%20Record,DSSP&chainId=A')
Image(url = binding_url)

In [28]:
binding_site_list = raw_input("List the sequences of the active sites (ie 6, 7, 120-134):   ")
binding_site_list_strip = binding_site_list.replace(' ','')
mylist = binding_site_list_strip.split(',')
indexx = 0
numberz = ''
for number in mylist:
    if '-' not in mylist[indexx]:
        numberz += ', ' + mylist[indexx]
    if '-' in mylist[indexx]:
        indexxx = 0
        for everydigit in mylist[indexx]:
            if everydigit == '-':
                hyphen_pos = indexxx
            indexxx = indexxx + 1
        rnums_seq_script = mylist[indexx][0:hyphen_pos]
        num_counter = mylist[indexx][0:hyphen_pos]
        numberz += ', '+ num_counter
        for y in range(int(mylist[indexx][(hyphen_pos + 1):]) - int(mylist[indexx][0:hyphen_pos])):
            num_counter = str(int(num_counter) + 1)
            numberz += ', ' + str(num_counter)
    indexx = indexx + 1
tidied_output = numberz[2:]
rnums_script = "rnums : [" + tidied_output + "]"
print rnums_script

List the sequences of the active sites (ie 6, 7, 120-134):   300-315
rnums : [300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315]


In [29]:
PDBViewer_options(pdb_file_path)

## Is there an associated Mutation to be studied?
#### User provides own mutation. For instance, we can analyze a mutation obtained from COSMIC:

Mutation S6F http://cancer.sanger.ac.uk/cosmic/gene/samples?coords=AA%3AAA&src=gene&end=413&mut=substitution_missense&ln=ASS1&all_data=&id=60232&seqlen=413&start=1

Two mutation assessors can be used to assess the effect of the mutation:
    * PROVEAN http://provean.jcvi.org/seq_submit.php
    * SIFT http://sift.bii.a-star.edu.sg/www/SIFT_seq_submit2.html

In [30]:
!cat $acc".faa"

>NP_446464.1 argininosuccinate synthase [Homo sapiens].
MSSKGSVVLAYSGGLDTSCILVWLKEQGYDVIAYLANIGQKEDFEEARKKALKLGAKKVF
IEDVSREFVEEFIWPAIQSSALYEDRYLLGTSLARPCIARKQVEIAQREGAKYVSHGATG
KGNDQVRFELSCYSLAPQIKVIAPWRMPEFYNRFKGRNDLMEYAKQHGIPIPVTPKNPWS
MDENLMHISYEAGILENPKNQAPPGLYTKTQDPAKAPNTPDILEIEFKKGVPVKVTNVKD
GTTHQTSLELFMYLNEVAGKHGVGRIDIVENRFIGMKSRGIYETPAGTILYHAHLDIEAF
TMDREVRKIKQGLGLKFAELVYTGFWHSPECEFVRHCIAKSQERVEGKVQVSVLKGQVYI
LGRESPLSLYNEELVSMNVQGDYEPTDATGFININSLRLKEYHRLQSKVTAK


In [36]:
assessor_choice = raw_input("Which mutation assessor would you like to use (answer with PROVEAN or SIFT)?   ")
if assessor_choice.upper() == 'SIFT':
    mut_loc_SAA = raw_input("Which amino acid position is it located in? ")
    mut_init = raw_input("Which amino acid was it initially? ")
    mut_out = raw_input("Which amino acid was it changed to? ")
    answer = "this is your input for Step 2: %s%s%s" %(mut_init, mut_loc_SAA, mut_out)
elif assessor_choice.upper() == 'PROVEAN':
    mut_type = raw_input("What type of mutation is it (Single Amino Acid [SAA], Deletion [Del], Insertion [In], or Indel)?   ")
    if mut_type.upper() == 'SAA':
        mut_loc_SAA = raw_input("Which amino acid position is it located in? ")
        mut_init = raw_input("Which amino acid was it initially? ")
        mut_out = raw_input("Which amino acid was it changed to? ")
        answer = "this is your input for Step 2: %s%s%s" %(mut_init, mut_loc_SAA, mut_out)
    elif mut_type.upper() == 'DEL':
        mut_loc = raw_input("Which amino acid position was deleted? ")
        mut_init = raw_input("Which amino acid was it initially? ")
        answer = "this is your input for Step 2: %s%sdel" %(mut_init, mut_loc)
    elif mut_type.upper() == 'IN':
        mut_insert = raw_input("Which amino acid(s) does the insertion consist of? ")
        mut_aa1 = raw_input("Which amino acid was it inserted behind of? ")
        mut_aa1_pos =raw_input("What AA position was it inserted in? ")
        mut_aa2 = raw_input("Which amino acid is the insertion in front of? ")
        mut_aa2_pos =raw_input("What AA position was it inserted behind of? ")
        answer = "this is your input for Step 2: %s%s_%s%sins%s" %(mut_aa1, mut_aa1_pos, mut_aa2, mut_aa2_pos, mut_insert)
    elif mut_type.upper() == 'INDEL':
        mut_num_t = raw_input("Is there only one deletion or a range of deletions (answer with 'one' or 'range')   ")
        if mut_num_t.upper() == 'ONE':
            mut_del1 = raw_input("What is the AA that is being deleted?   ")
            mut_del1_pos = raw_input("What position is that AA in?   ")
            mut_indel = raw_input("What AAs are being inserted (can insert more than one, ie MKSS)   ")
            answer = "this is your input for Step 2: %s%sdelins%s" %(mut_del1, mut_del1_pos, mut_indel)
        elif mut_num_t.upper() == 'RANGE':
            mut_del1 = raw_input("What is the first AA that is being deleted?   ")
            mut_del1_pos = raw_input("What position is that AA in?   ")
            mut_del2 = raw_input("What is the last AA that is being deleted?   ")
            mut_del2_pos = raw_input("What position is that AA in?   ")
            mut_indel = raw_input("What AAs are being inserted (can insert more than one, ie MKSS)   ")
            answer = "This is your input for Step 2: %s%s_%s%sdelins%s" %(mut_del1, mut_del1_pos, mut_del2, mut_del2_pos, mut_indel)
        else:
            print "tbd"	
    else:
        redo = raw_input("please input SAA, Del, In, or Indel")
else:
    print "Please redo"
print '\n\n\nFirst open the mutation assessor website of choice'
print "copy and paste the above amino acid sequence corresponding to the refseq ID into the browser"
print "%s" %(answer)

Which mutation assessor would you like to use (answer with PROVEAN or SIFT)?   PROVEAN
What type of mutation is it (Single Amino Acid [SAA], Deletion [Del], Insertion [In], or Indel)?   Del
Which amino acid position was deleted? 15
Which amino acid was it initially? V



First open the mutation assessor website of choice
copy and paste the above amino acid sequence corresponding to the refseq ID into the browser
this is your input for Step 2: V15del


## Cobrapy and ESCHER Map

In [2]:
import os 
current_fp = os.getcwd()

In [4]:
import escher
import escher.urls
import cobra
import cobra.test
import json
from IPython.display import HTML
from cobra import Model, Reaction, Metabolite
from cobra.flux_analysis.parsimonious import optimize_minimal_flux

b = escher.Builder(map_json = (current_fp + '/RECON1.Central.json'))
b.display_in_notebook(scroll_behavior = 'pan')

  warn("Install lxml for faster SBML I/O")
  warn("cobra.io.sbml requires libsbml")


In [7]:
model = cobra.io.load_matlab_model(current_fp + '/RECON1', 'RECON1')


### Run this to setup RECON1 properly

In [8]:
#creat ethanalamine demand
#metabolite is etha_c

DM_etha_c = Reaction('DM_etha_c')
DM_etha_c.name = DM_etha_c

etha_c = model.metabolites.get_by_id('etha_c')

DM_etha_c.add_metabolites({etha_c: -1})
model.add_reaction(DM_etha_c)

#create a transporter for this

Tyr_ggnt = Reaction('Tyr_ggnt')
Tyr_ggnt.name = Tyr_ggnt

Tyr_ggn_e = model.metabolites.get_by_id('Tyr_ggn_e')
Tyr_ggn_c = model.metabolites.get_by_id('Tyr_ggn_c')

Tyr_ggnt.add_metabolites({Tyr_ggn_e: -1, Tyr_ggn_c: 1})
model.add_reaction(Tyr_ggnt)

DM_etha_c.lower_bound = -1
Tyr_ggnt.lower_bound = -1000

except_EX_names = '''DM_etha_c, EX_peplys_e, EX_Tyr_ggn_e, EX_arg__L_e, EX_asn__L_e, EX_asp__L_e, EX_chol_e, EX_cl_e, EX_glc__D_e, EX_gln__L_e, EX_gly_e, EX_h_e, EX_h2o_e, EX_ile__L_e, EX_k_e, EX_leu__L_e, EX_lys__L_e, EX_na1_e, EX_nh4_e, EX_o2_e, EX_phe__L_e, EX_pi_e, EX_pro__L_e, EX_ser__L_e, EX_thr__L_e, EX_trp__L_e, EX_tyr__L_e, EX_val__L_e'''
except_EX_names_split = except_EX_names.split(', ')
print except_EX_names_split

# Changing all lower bounds to 0 except the given reactions above

for a in model.reactions[1188:1591]:
    a.lower_bound = 0

model.reactions.get_by_id('EX_yvite_e').lower_bound = 0
model.reactions.get_by_id('EX_10fthf5glu_e').lower_bound = 0
    
for b in model.reactions[1188:1591]:
    for c in range(len(except_EX_names_split)):  
        if b.id  == except_EX_names_split[c]:
            b.lower_bound = -1
            
biomass_NCI60 = Reaction('biomass_NCI60')
biomass_NCI60.name = biomass_NCI60

ala_L_c = model.metabolites.get_by_id('ala__L_c')
arg_L_c = model.metabolites.get_by_id('arg__L_c')
asn_L_c = model.metabolites.get_by_id('asn__L_c')
asp_L_c = model.metabolites.get_by_id('asp__L_c')
atp_c = model.metabolites.get_by_id('atp_c')
clpn_hs_c = model.metabolites.get_by_id('clpn_hs_c')
ctp_c = model.metabolites.get_by_id('ctp_c')
dag_hs_c = model.metabolites.get_by_id('dag_hs_c')
datp_c = model.metabolites.get_by_id('datp_c')
dctp_c = model.metabolites.get_by_id('dctp_c')
dgtp_c = model.metabolites.get_by_id('dgtp_c')
dttp_c = model.metabolites.get_by_id('dttp_c')
gln_L_c = model.metabolites.get_by_id('gln__L_c')
glu_L_c = model.metabolites.get_by_id('glu__L_c')
gly_c = model.metabolites.get_by_id('gly_c')
glygn2_c = model.metabolites.get_by_id('glygn2_c')
gtp_c = model.metabolites.get_by_id('gtp_c')
h2o_c = model.metabolites.get_by_id('h2o_c')
hdca_c = model.metabolites.get_by_id('hdca_c')
hdcea_c = model.metabolites.get_by_id('hdcea_c')
ile_L_c = model.metabolites.get_by_id('ile__L_c')
leu_L_c = model.metabolites.get_by_id('leu__L_c')
lpchol_hs_c = model.metabolites.get_by_id('lpchol_hs_c')
lys_L_c = model.metabolites.get_by_id('lys__L_c')
mag_hs_c = model.metabolites.get_by_id('mag_hs_c')
ocdca_c = model.metabolites.get_by_id('ocdca_c')
ocdcea_c = model.metabolites.get_by_id('ocdcea_c')
pa_hs_c = model.metabolites.get_by_id('pa_hs_c')
pail_hs_c = model.metabolites.get_by_id('pail_hs_c')
pchol_hs_c = model.metabolites.get_by_id('pchol_hs_c')
pe_hs_c = model.metabolites.get_by_id('pe_hs_c')
phe_L_c = model.metabolites.get_by_id('phe__L_c')
ps_hs_c = model.metabolites.get_by_id('ps_hs_c')
ser_L_c = model.metabolites.get_by_id('ser__L_c')
sphmyln_hs_c = model.metabolites.get_by_id('sphmyln_hs_c')
thr_L_c = model.metabolites.get_by_id('thr__L_c')
trp_L_c = model.metabolites.get_by_id('trp__L_c')
tyr_L_c = model.metabolites.get_by_id('tyr__L_c')
utp_c = model.metabolites.get_by_id('utp_c')
val_L_c = model.metabolites.get_by_id('val__L_c')
pro_L_m = model.metabolites.get_by_id('pro__L_m')
chsterol_r = model.metabolites.get_by_id('chsterol_r')
xolest_hs_r = model.metabolites.get_by_id('xolest_hs_r')
adp_c = model.metabolites.get_by_id('adp_c')
h_c = model.metabolites.get_by_id('h_c')
pi_c = model.metabolites.get_by_id('pi_c')

#remove ocdca and ocdcea
# added xolest value to cholesterol in biomass and remove xolest

biomass_NCI60.add_metabolites({ala_L_c: -0.587929, arg_L_c: -0.380280, asn_L_c: -0.323313, asp_L_c: -0.261396, 
                               atp_c: -35.033540, clpn_hs_c: -0.000624, ctp_c: -0.033435, dag_hs_c: -0.001032, 
                               datp_c: -0.014557, dctp_c: -0.009770, dgtp_c: -0.009748, dttp_c: -0.014546,
                               gln_L_c: -0.319051, glu_L_c: -0.387401, gly_c: -0.504294, glygn2_c: -0.034479,
                               gtp_c: -0.055967, h2o_c: -35.000000, hdca_c: -0.008293, hdcea_c: -0.003315, 
                               ile_L_c: -0.319813, leu_L_c: -0.548692, lpchol_hs_c: -0.002470, lys_L_c: -0.552717, 
                               mag_hs_c: -0.001456, pa_hs_c: -0.010645,
                               pail_hs_c: -0.005016, pchol_hs_c: -0.022878, pe_hs_c: -0.018211, phe_L_c: -0.170743,
                               ps_hs_c: -0.006808, ser_L_c: -0.385852, sphmyln_hs_c: -0.010215, thr_L_c: -0.378004, 
                               trp_L_c: -0.039847, tyr_L_c: -0.150141, utp_c: -0.063323, val_L_c: -0.385554,
                               pro_L_m: -0.237850, chsterol_r: -0.054102,  adp_c: 35.000000, 
                               h_c: 35.000000, pi_c: 35.000000})

model.add_reaction(biomass_NCI60)
my_objective = model.reactions.get_by_id('biomass_NCI60')
model.change_objective(my_objective)

# Adding DM_atp_c reaction

DM_atp_c = Reaction('DM_atp_c')
DM_atp_c.name = DM_atp_c

atp_c = model.metabolites.get_by_id('atp_c')
adp_c = model.metabolites.get_by_id('adp_c')
h2o_c = model.metabolites.get_by_id('h2o_c')
h_c = model.metabolites.get_by_id('h_c')
pi_c = model.metabolites.get_by_id('pi_c')


DM_atp_c.add_metabolites({atp_c: -1, h2o_c: -1, adp_c: 1, pi_c: 1, h_c: 1})
model.add_reaction(DM_atp_c)

# Setting lower bounds for DM_atp_c as per Dan

model.reactions.get_by_id('DM_atp_c').lower_bound = 0.9*7.9939

model.reactions.get_by_id('EX_h_e').lower_bound = -1000
model.reactions.get_by_id('EX_o2_e').lower_bound = -1000
model.reactions.get_by_id('EX_fe2_e').lower_bound = -1000
model.reactions.get_by_id('EX_nh4_e').lower_bound = -1000

model.reactions.get_by_id('EX_glc__D_e').lower_bound = -0.4882

# constrained as per calculate fluxes 
model.reactions.get_by_id('biomass_NCI60').lower_bound = 0.0177
model.reactions.get_by_id('biomass_NCI60').upper_bound = 0.0177

model.reactions.get_by_id('EX_pi_e').lower_bound = -1000

# 6/16/16 Changing amino acid uptake constraints

model.reactions.get_by_id('EX_arg__L_e').lower_bound = -0.0022

model.reactions.get_by_id('EX_asn__L_e').lower_bound = -0.0058
model.reactions.get_by_id('EX_asp__L_e').lower_bound = -0.0081
model.reactions.get_by_id('EX_chol_e').lower_bound = -0.00062386
model.reactions.get_by_id('EX_gln__L_e').lower_bound = -0.1202
model.reactions.get_by_id('EX_gly_e').lower_bound = -0.0019
model.reactions.get_by_id('EX_ile__L_e').lower_bound = -0.0081
model.reactions.get_by_id('EX_leu__L_e').lower_bound = -0.0110
model.reactions.get_by_id('EX_lys__L_e').lower_bound = -0.0111
model.reactions.get_by_id('EX_phe__L_e').lower_bound = -0.0040

model.reactions.get_by_id('EX_pro__L_e').lower_bound = -0.0045

model.reactions.get_by_id('EX_ser__L_e').lower_bound = -0.0167
model.reactions.get_by_id('EX_trp__L_e').lower_bound = -0.0013
model.reactions.get_by_id('EX_tyr__L_e').lower_bound = -0.0068
model.reactions.get_by_id('EX_val__L_e').lower_bound = -0.0079

#6/24/16 lowerbound changes

model.reactions.get_by_id('FTHFLi').lower_bound = 0
model.reactions.get_by_id('FTHFLmi').lower_bound = 0

model.reactions.get_by_id('FUM').lower_bound = 0
model.reactions.get_by_id('FUM').upper_bound = 0

model.reactions.get_by_id('BILGLCURte').lower_bound = 0
model.reactions.get_by_id('BILDGLCURte').lower_bound = 0

h_i = Metabolite('h_i', formula='h', name='h_i', compartment='i')
model.add_metabolites(h_i)


# delete reaction
model.remove_reactions('ATPS4m')

# re-add the reaction while replacing h_c with h_i
ATPS4m = Reaction('ATPS4m')
ATPS4m.name = ATPS4m

adp_m = model.metabolites.get_by_id('adp_m')
h_i = model.metabolites.get_by_id('h_i')
pi_m = model.metabolites.get_by_id('pi_m')
h2o_m = model.metabolites.get_by_id('h2o_m')
h_m = model.metabolites.get_by_id('h_m')
atp_m = model.metabolites.get_by_id('atp_m')

ATPS4m.add_metabolites({adp_m: -1, h_i: -4, pi_m: -1, h2o_m: 1, h_m: 3, atp_m: 1})
model.add_reaction(ATPS4m)

model.remove_reactions('CYOOm3')

CYOOm2 = Reaction('CYOOm2')
CYOOm2.name = CYOOm2

h_m = model.metabolites.get_by_id('h_m')
o2_m = model.metabolites.get_by_id('o2_m')
focytC_m = model.metabolites.get_by_id('focytC_m')
h_i = model.metabolites.get_by_id('h_i')
h2o_m = model.metabolites.get_by_id('h2o_m')
ficytC_m = model.metabolites.get_by_id('ficytC_m')


CYOOm2.add_metabolites({h_m: -8, o2_m: -1, focytC_m: -4, h_i: 4, h2o_m: 2, ficytC_m: 4})
model.add_reaction(CYOOm2)

# delete reaction
model.remove_reactions('NADH2_u10m')

# re-add the reaction while replacing h_c with h_i
NADH2_u10m = Reaction('NADH2_u10m')
NADH2_u10m.name = NADH2_u10m

q10_m = model.metabolites.get_by_id('q10_m')
h_m = model.metabolites.get_by_id('h_m')
nadh_m = model.metabolites.get_by_id('nadh_m')
nad_m = model.metabolites.get_by_id('nad_m')
h_i = model.metabolites.get_by_id('h_i')
q10h2_m = model.metabolites.get_by_id('q10h2_m')



NADH2_u10m.add_metabolites({q10_m: -1, h_m: -5, nadh_m: -1, nad_m: 1, h_i: 4, q10h2_m: 1})
model.add_reaction(NADH2_u10m)

# delete reaction
model.remove_reactions('CYOR_u10m')

# re-add the reaction while replacing h_c with h_i
CYOR_u10m = Reaction('CYOR_u10m')
CYOR_u10m.name = NADH2_u10m

ficytC_m = model.metabolites.get_by_id('ficytC_m')
h_m = model.metabolites.get_by_id('h_m')
q10h2_m = model.metabolites.get_by_id('q10h2_m')
q10_m = model.metabolites.get_by_id('q10_m')
focytC_m = model.metabolites.get_by_id('focytC_m')
h_i = model.metabolites.get_by_id('h_i')


CYOR_u10m.add_metabolites({ficytC_m: -2, h_m: -2, q10h2_m: -1, q10_m: 1, focytC_m: 2, h_i: 4})
model.add_reaction(CYOR_u10m)

['DM_etha_c', 'EX_peplys_e', 'EX_Tyr_ggn_e', 'EX_arg__L_e', 'EX_asn__L_e', 'EX_asp__L_e', 'EX_chol_e', 'EX_cl_e', 'EX_glc__D_e', 'EX_gln__L_e', 'EX_gly_e', 'EX_h_e', 'EX_h2o_e', 'EX_ile__L_e', 'EX_k_e', 'EX_leu__L_e', 'EX_lys__L_e', 'EX_na1_e', 'EX_nh4_e', 'EX_o2_e', 'EX_phe__L_e', 'EX_pi_e', 'EX_pro__L_e', 'EX_ser__L_e', 'EX_thr__L_e', 'EX_trp__L_e', 'EX_tyr__L_e', 'EX_val__L_e']




In [9]:
p_solution = cobra.flux_analysis.parsimonious.optimize_minimal_flux(model)
print('Growth rate: %.5f' % p_solution.f)

Growth rate: 0.01770


In [10]:
b = escher.Builder(map_json = (current_fp + '/RECON1.Central.json'),
                   reaction_data=p_solution.x_dict,
                   # color and size according to the absolute value
                   reaction_styles=['color', 'size', 'abs', 'text'],
                   # change the default colors
                   reaction_scale=[{'type': 'min', 'color': 'red', 'size': 4},
                                   {'type': 'mean', 'color': 'green', 'size': 20},
                                   {'type': 'max', 'color': 'blue', 'size': 40}],
                   # only show the primary metabolites
                   hide_secondary_metabolites=True)
b.display_in_notebook(scroll_behavior='pan')

#### Normalizing reaction fluxes for better ESCHER visualization

In [11]:
### Normalizing the values in solution.x_dict
import math
normalized_x_dict = {}
for a, b in p_solution.x_dict.iteritems():
    if b == 0:
        p_solution.x_dict[a]==0
    elif b != 0:
        p_solution.x_dict[a] = math.log10(abs(b)*1000000)
    else:
        print "error"

In [12]:
### Remapping
b = escher.Builder(map_json = (current_fp + '/RECON1.Central.json'),
                   reaction_data=p_solution.x_dict,
                   # color and size according to the absolute value
                   reaction_styles=['color', 'size', 'abs', 'text'],
                   # change the default colors
                   reaction_scale=[{'type': 'min', 'color': 'red', 'size': 4},
                                   {'type': 'mean', 'color': 'green', 'size': 20},
                                   {'type': 'max', 'color': 'blue', 'size': 40}],
                   # only show the primary metabolites
                   hide_secondary_metabolites=True)
b.display_in_notebook(scroll_behavior='pan')

### Deleting ARGSS (SIFT/PROVEAN also indicates mutation is deleterious -- treat as gene ko)

In [13]:
model_modified = model.copy()
# for example, delete a reaction
model_modified.reactions.get_by_id('ARGSS').lower_bound = 0
model_modified.reactions.get_by_id('ARGSS').upper_bound = 0



### Finding the flux of EX_arg

1) Set EX_arg as objective 2) Set its lower bound to -1000 in the KO case and then maximize the flux 3) the flux value should be the value to constrain it to

In [14]:
my_objective = model_modified.reactions.get_by_id('EX_arg__L_e')
model_modified.change_objective(my_objective)

model_modified.reactions.get_by_id('EX_arg__L_e').lower_bound = -1000

arg_flux_solution = model_modified.optimize()
model_modified.reactions.get_by_id('EX_arg__L_e').lower_bound = arg_flux_solution.f

### Newly optimized growth rate with gene knockout

In [15]:
my_objective = model_modified.reactions.get_by_id('biomass_NCI60')
model_modified.change_objective(my_objective)

p_solution_m = model_modified.optimize()
p_solution_m

<Solution 0.02 at 0x109aa0990>

In [16]:
p_solution_m = model_modified.optimize()
print('Growth rate: %.5f' % p_solution_m.f)

Growth rate: 0.01770
