<a href="https://colab.research.google.com/github/alibekk93/IDP_analysis/blob/RAPID/getting_proteomes.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting UniProt proteomes for Tempura species

## Setup

In [49]:
import pandas as pd
import numpy as np
import requests
from tqdm import tqdm

Loading Tempura dataset

In [140]:
# tempura = pd.read_csv('/content/200617_TEMPURA.csv')
tempura = pd.read_csv('/content/tempura_bacteria_uniprot.csv', index_col=0)

Only keeping bacteria with available assembly or accession numbers

In [141]:
tempura = tempura[tempura['superkingdom']=='Bacteria']
tempura.dropna(subset='assembly_or_accession', inplace=True)
tempura.reset_index(drop=True, inplace=True)

## Getting UniProt IDs

Tempura has NCBI taxonomy IDs, but we need UniProt proteome IDs. We can get them by searching the UniProt REST API.

In [50]:
uniprot_jsons = []
failures = []

# loop through the taxonomy IDs and retrieve proteome data
for tax_id in tqdm(tempura['taxonomy_id']):
  # define the UniProt API URL
  url = f'https://rest.uniprot.org/proteomes/stream?format=json&query=%28%28taxonomy_id%3A{tax_id}%29%29'
  # send an HTTP GET request to the UniProt API
  response = requests.get(url)
  # Check if the request was successful
  if response.status_code == 200:
    # save JSON
    uniprot_jsons.append(response.json())
  else:
    failures.append(tax_id)
    uniprot_jsons.append({})

100%|██████████| 893/893 [08:24<00:00,  1.77it/s]
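The query string in the URL above is percent-encoded by hand (`%28` is `(`, `%3A` is `:`, `%29` is `)`). As a minimal sketch, the same URL can be built from the readable query with the standard library. The helper name `build_proteome_query_url` is hypothetical and not part of the original notebook.

```python
from urllib.parse import urlencode

UNIPROT_PROTEOMES_STREAM = 'https://rest.uniprot.org/proteomes/stream'

def build_proteome_query_url(tax_id):
  # readable form of the query used in the loop above
  params = {'format': 'json', 'query': f'((taxonomy_id:{tax_id}))'}
  # urlencode percent-encodes the parentheses and the colon
  return UNIPROT_PROTEOMES_STREAM + '?' + urlencode(params)

print(build_proteome_query_url(562))
```

Passing the same `params` dictionary to `requests.get(url, params=params)` should give an equivalent request without building the string manually.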


In many cases we get more than one search result for one taxonomy ID. We need to check each search result and keep only the UniProt ID whose taxonomy ID matches the one in Tempura.

In [91]:
# initiate empty list to save UniProt IDs
uniprot_ids = []

# iterate through JSON results
for i, jsn in enumerate(uniprot_jsons):
  # get results (failed requests were stored as {} and have no 'results' key)
  results = jsn.get('results', [])
  # get candidate UniProt IDs and corresponding taxonomy IDs
  u_ids = [r['id'] for r in results]
  t_ids = [r['taxonomy']['taxonId'] for r in results]
  # make a dictionary of candidate IDs
  results_dict = {k:v for k, v in zip(t_ids, u_ids)}
  # get actual taxonomy ID
  taxonomy_id = tempura.loc[i, 'taxonomy_id']
  # save the matching UniProt ID, or None if no candidate matches
  uniprot_ids.append(results_dict.get(taxonomy_id))
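The matching step can also be written as a small function that is easy to check on one response. This is only a sketch: the function name `pick_proteome_id` is hypothetical, the second entry in the example response is made up for illustration, and unlike the dictionary approach above it returns the first matching result rather than the last.

```python
def pick_proteome_id(response_json, taxonomy_id):
  # failed requests were stored as {} so 'results' may be missing
  for result in response_json.get('results', []):
    if result['taxonomy']['taxonId'] == taxonomy_id:
      return result['id']
  # no search result has the requested taxonomy ID
  return None

# trimmed response in the shape returned by the proteomes endpoint
example = {'results': [
  # hypothetical entry for a different taxon, for illustration only
  {'id': 'UP000000000', 'taxonomy': {'taxonId': 1}},
  {'id': 'UP000070587', 'taxonomy': {'taxonId': 1609559}},
]}

print(pick_proteome_id(example, 1609559))
print(pick_proteome_id({}, 1609559))
```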

We can now drop any Tempura rows with no available UniProt IDs and store the result

In [103]:
tempura['uniprot_id'] = uniprot_ids

In [115]:
tempura.dropna(subset='uniprot_id', inplace=True)

In [117]:
tempura.reset_index(drop=True, inplace=True)

In [137]:
tempura.to_csv('tempura_bacteria_uniprot.csv')

## Downloading UniProt proteomes

In [143]:
!mkdir proteomes

In [146]:
tempura.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 680 entries, 0 to 679
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   genus_and_species      680 non-null    object 
 1   taxonomy_id            680 non-null    int64  
 2   strain                 680 non-null    object 
 3   superkingdom           680 non-null    object 
 4   phylum                 680 non-null    object 
 5   class                  678 non-null    object 
 6   order                  673 non-null    object 
 7   family                 662 non-null    object 
 8   genus                  675 non-null    object 
 9   assembly_or_accession  680 non-null    object 
 10  Genome_GC              651 non-null    float64
 11  Genome_size            680 non-null    float64
 12  16S_accssion           680 non-null    object 
 13  16S_GC                 680 non-null    float64
 14  Tmin                   680 non-null    float64
 15  Topt_a

In [165]:
failures = []

for i in tqdm(tempura.index):
  # make file path and get UniProt ID
  species = tempura.loc[i, 'genus_and_species'].replace(' ', '_')
  fasta_file_path = f'/content/proteomes/{species}.fasta'
  id = tempura.loc[i, 'uniprot_id']
  # define the UniProt API URL to retrieve FASTA data
  url = f'https://rest.uniprot.org/uniprotkb/stream?format=fasta&query=%28%28proteome%3A{id}%29%29'
  # send an HTTP GET request to the UniProt API to get FASTA data
  response = requests.get(url)
  # check if the request was successful
  if response.status_code == 200:
    # save the FASTA data to a file
    with open(fasta_file_path, 'w') as fasta_file:
      fasta_file.write(response.text)
  else:
    failures.append(id)

100%|██████████| 680/680 [56:33<00:00,  4.99s/it]


In [166]:
failures

[]

In [None]:
!zip -r /content/proteomes.zip /content/proteomes -i '*.fasta'
from google.colab import files
files.download('/content/proteomes.zip')

## Filter to remove empty / short FASTA files

[771875, 93466]
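The filtering code for this step is not shown. One possible sketch, under stated assumptions, counts the header lines (those starting with `>`) in each FASTA file and lists the files that fall below a threshold. The helper names and the `min_sequences` threshold are hypothetical, not taken from the notebook.

```python
from pathlib import Path

def count_fasta_records(path):
  # every FASTA record starts with a header line beginning with '>'
  with open(path) as fasta_file:
    return sum(1 for line in fasta_file if line.startswith('>'))

def find_short_fastas(directory, min_sequences=1):
  # return names of FASTA files with fewer than min_sequences records
  short = []
  for path in sorted(Path(directory).glob('*.fasta')):
    if count_fasta_records(path) < min_sequences:
      short.append(path.name)
  return short
```

Calling `find_short_fastas('/content/proteomes')` with the default threshold would list only files that contain no sequences at all; a higher threshold would also catch very small proteomes.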

In [51]:
from Bio import Entrez, SeqIO

# NCBI asks for a contact email with Entrez requests
Entrez.email = 'arkug@uottawa.ca'

failures = []
species = []

for taxonomy_id in tempura.loc[10:30, 'taxonomy_id']:
  try:
    # look up the scientific name for this taxonomy ID
    handle = Entrez.efetch(db="taxonomy", id=taxonomy_id, retmode="xml")
    record = Entrez.read(handle)
    organism_name = record[0]["ScientificName"].replace(' ', '_')
    species.append(organism_name)
    handle.close()
    # find protein records for this organism
    handle = Entrez.esearch(db='protein', term=f'txid{taxonomy_id}[Organism]', retmax=100000)
    record = Entrez.read(handle)
    handle.close()
    id_list = record['IdList']
    # download the sequences and save them as FASTA
    handle = Entrez.efetch(db='protein', id=id_list, rettype='fasta', retmode='text')
    records = SeqIO.parse(handle, 'fasta')
    SeqIO.write(records, f'{organism_name}.fasta', 'fasta')
    handle.close()
  except Exception:
    # record the taxonomy ID so the failure can be inspected later
    failures.append(taxonomy_id)

In [67]:
taxonomy_id = '1609559'
url = f'https://rest.uniprot.org/proteomes/search?compressed=true&format=list&query=%28%28taxonomy_id%3A{taxonomy_id}%29%29&size=5'
response = requests.get(url)

In [70]:
response.content

b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x00\x0b\r0\x00\x02s\x03S\x0bs.\x00\x90\xf8\xde\xe9\x0c\x00\x00\x00'
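The bytes above start with the gzip magic number `\x1f\x8b`, which is what `compressed=true` in the URL asks for. A minimal sketch of turning such a body into a list of proteome IDs with the standard library follows; the helper name `decode_compressed_list` is hypothetical, and `raw` is copied from the output above.

```python
import gzip

def decode_compressed_list(raw):
  # the body is a gzip file holding one proteome ID per line
  text = gzip.decompress(raw).decode('utf-8')
  return [line for line in text.splitlines() if line]

# response.content as printed above
raw = b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x00\x0b\r0\x00\x02s\x03S\x0bs.\x00\x90\xf8\xde\xe9\x0c\x00\x00\x00'
proteome_ids = decode_compressed_list(raw)
print(proteome_ids)
```

Dropping `compressed=true` from the query should return the same list as plain text, which avoids the decompression step for small results like this one.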

In [68]:
with open(f"{taxonomy_id}.txt", "wb") as f:
    f.write(response.content)

In [18]:
import requests

# List of taxonomy IDs
taxonomy_ids = [562, 1609559]  # Add your taxonomy IDs here

# Initialize an empty DataFrame to store the proteome data
proteome_data = pd.DataFrame(columns=['Entry', 'Entry name', 'Protein names', 'Gene names', 'Organism'])

# Loop through the taxonomy IDs and retrieve proteome data
for tax_id in taxonomy_ids:
    # Define the UniProt API URL
    url = f'https://rest.uniprot.org/proteomes/stream?format=json&query=%28%28taxonomy_id%3A{tax_id}%29%29'

    # Send an HTTP GET request to the UniProt API
    response = requests.get(url)

#     # Check if the request was successful
#     if response.status_code == 200:
#         # Parse the response text and append it to the DataFrame
#         data = response.text.strip().split('\n')
#         data = [line.split('\t') for line in data]
#         proteome_data = proteome_data.append(pd.DataFrame(data, columns=proteome_data.columns))
#     else:
#         print(f"Failed to retrieve data for taxonomy ID {tax_id}")

# # Save the proteome data to a CSV file
# proteome_data.to_csv('proteome_data.csv', index=False)


In [30]:
json = response.json()

In [34]:
json['results'][0]['id']

'UP000070587'

In [37]:
json['results'][0]['taxonomy']['scientificName']

'Pyrococcus kukulkanii'

In [42]:
json['results'][0]

{'id': 'UP000070587',
 'taxonomy': {'scientificName': 'Pyrococcus kukulkanii', 'taxonId': 1609559},
 'modified': '2022-06-23',
 'proteomeType': 'Other proteome',
 'strain': 'NCB100',
 'components': [{'name': 'Chromosome',
   'description': 'Pyrococcus sp. NCB100',
   'proteinCount': 1997,
   'genomeAnnotation': {'source': 'ENA/EMBL'},
   'proteomeCrossReferences': [{'database': 'GenomeAccession',
     'id': 'CP010835'},
    {'database': 'Biosample', 'id': 'SAMN03323778'}]}],
 'citations': [{'id': '27189596',
   'citationType': 'journal article',
   'authors': ['Callac N.',
    'Oger P.',
    'Lesongeur F.',
    'Rattray J.E.',
    'Vannier P.',
    'Michoud G.',
    'Beauverger M.',
    'Gayet N.',
    'Rouxel O.',
    'Jebbar M.',
    'Godfroy A.'],
   'citationCrossReferences': [{'database': 'PubMed', 'id': '27189596'},
    {'database': 'DOI', 'id': '10.1099/ijsem.0.001160'}],
   'title': 'Pyrococcus kukulkanii sp. nov., a hyperthermophilic, piezophilic archaeon isolated from a deep-

In [52]:
failures

[2265, 1104324, 985052, 229980, 2180, 1273541, 94694, 2271, 187880, 187878]

In [50]:
!zip -r /content/fasta_files.zip /content -i '*.fasta'
from google.colab import files
files.download('/content/fasta_files.zip')

  adding: content/Pyrobaculum_aerophilum.fasta (deflated 53%)
  adding: content/Pyrococcus_furiosus.fasta (deflated 54%)
  adding: content/Pyrolobus_fumarii.fasta (deflated 48%)
  adding: content/Pyrococcus_yayanosii.fasta (deflated 48%)
  adding: content/Pyrobaculum_islandicum.fasta (deflated 50%)
  adding: content/Pyrococcus_abyssi.fasta (deflated 51%)
  adding: content/Hyperthermus_butylicus.fasta (deflated 49%)
  adding: content/Pyrococcus_horikoshii.fasta (deflated 59%)
  adding: content/Pyrococcus_kukulkanii.fasta (deflated 48%)
  adding: content/Escherichia_coli.fasta (deflated 45%)
  adding: content/Thermoproteus_uzoniensis.fasta (deflated 50%)
  adding: content/Methanopyrus_kandleri.fasta (deflated 52%)

