## Cell Collective loader

This notebook downloads all models uploaded to the [Cell Collective](https://research.cellcollective.org/) and saves them locally as [SBML](https://en.wikipedia.org/wiki/SBML) and [BooleanNet](https://github.com/ialbert/booleannet) files. It also collects the molecular species of every model into a table (a pandas dataframe), together with the model each species belongs to, a link to that model, and the UniProt and NCBI Gene annotations.

You will need the following packages:

* Colomoto Jupyter [https://github.com/colomoto/colomoto-jupyter]
* GINsim python [https://github.com/GINsim/GINsim-python]
* Pandas [https://pandas.pydata.org/getting_started.html]

In [None]:
import cellcollective #https://github.com/colomoto/colomoto-jupyter
import biolqm #https://github.com/GINsim/GINsim-python
import requests
import json
from urllib.request import urlretrieve
import glob
import pandas as pd

In [None]:
#a simple function creating a permanent local file from the retrieved model file
def download_local(url, path, model_id, suffix='sbml'):
    filename = path+str(model_id)+'.'+suffix
    filename, _ = urlretrieve(url, filename=filename)
    return filename
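
Note that `download_local` builds the target filename by plain string concatenation, so `path` has to end with a path separator (as `'sbmls/'` does below). As a minimal sketch of an alternative, the hypothetical helper `build_model_path` (not part of the notebook) uses `os.path.join`, which gives the same result with or without the trailing separator:

```python
import os

def build_model_path(folder, model_id, suffix="sbml"):
    """Build '<folder>/<model_id>.<suffix>' regardless of a trailing separator."""
    return os.path.join(folder, "{}.{}".format(model_id, suffix))

# 1234 is a made-up model id used only for illustration
with_sep = build_model_path("sbmls/", 1234)
without_sep = build_model_path("sbmls", 1234)
print(with_sep, without_sep)
```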

Output folders

In [None]:
sbmls_path='sbmls/'
boolean_models_path='boolean_models/'

#create the output folders if they don't exist yet
import os
os.makedirs(sbmls_path, exist_ok=True)
os.makedirs(boolean_models_path, exist_ok=True)


Getting the model ids from the Cell Collective website

In [None]:
headers = {
  "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36"
}

url = "https://research.cellcollective.org/api/model"

r = requests.get(url, headers=headers)
data = r.json()
#map each model id to its name; entries without a 'model' key are skipped
model_name_dict={}
for entry in data['data']:
    if 'model' in entry:
        model_name_dict[entry['model']['id']]=entry['model']['name']
        

In [None]:
#print(model_name_dict)
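
The id-to-name extraction can be tried offline on a small stand-in for the API response. The sketch below assumes only the payload shape that the loop above relies on (a `data` list whose entries may carry a `model` object with `id` and `name`). The sample ids and names are made up, and `extract_model_names` is a hypothetical helper, not part of the notebook.

```python
# Hypothetical payload mimicking the shape the notebook's loop expects:
# {"data": [{"model": {"id": ..., "name": ...}}, ...]}
sample_payload = {
    "data": [
        {"model": {"id": 101, "name": "Example model A"}},
        {"other": {"id": 999}},  # entry without a "model" key is skipped
        {"model": {"id": 202, "name": "Example model B"}},
    ]
}

def extract_model_names(payload):
    """Map model id -> model name for every entry that has a 'model' key."""
    return {
        entry["model"]["id"]: entry["model"]["name"]
        for entry in payload.get("data", [])
        if "model" in entry
    }

sample_names = extract_model_names(sample_payload)
print(sample_names)
```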

This is a manually curated list of model ids that are skipped during download and excluded from all subsequent analysis.

In [None]:
exception_list=[126843,3511,118235,15088,36604]

The download script checks the sbml folder for models that were already downloaded and skips them (newly uploaded models are still downloaded). Set this to True to ignore the contents of the sbml folder and download everything again.

In [None]:
download_all_again=False
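
The selection of models to download combines the exception list with the files already on disk. The following is a minimal sketch of that decision only, assuming files are named `<model_id>.sbml` as in this notebook. The helper names `ids_already_downloaded` and `ids_to_download` and the sample ids are hypothetical. In the notebook itself, models that are already on disk are not downloaded again but are still loaded for the species table.

```python
import tempfile
from pathlib import Path

def ids_already_downloaded(folder, suffix=".sbml"):
    """Return the integer model ids of files named '<id>.sbml' in folder."""
    ids = set()
    for path in Path(folder).glob("*" + suffix):
        if path.stem.isdigit():  # ignore files not named after a numeric id
            ids.add(int(path.stem))
    return ids

def ids_to_download(all_ids, skip_ids, folder, download_all_again=False):
    """Filter out excluded ids and, unless forced, ids already on disk."""
    on_disk = set() if download_all_again else ids_already_downloaded(folder)
    return [i for i in all_ids if i not in skip_ids and i not in on_disk]

# Usage with made-up ids in a throwaway folder
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "11.sbml").write_text("<sbml/>")
    pending = ids_to_download([11, 22, 33], skip_ids=[33], folder=tmp)
    forced = ids_to_download([11, 22, 33], skip_ids=[33], folder=tmp,
                             download_all_again=True)
print(pending, forced)
```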

In [None]:

species_dict={}
df=pd.DataFrame()

for model_id in model_name_dict:
    print('Checking',model_id)
    if model_id in exception_list:
        print('Model id is in exception list')
        continue
    if not download_all_again:
        #ids of the models already present in the sbml folder (files are named <model_id>.sbml)
        downloaded_model_paths=glob.glob(sbmls_path+'*.sbml')
        downloaded_models=[int(os.path.splitext(os.path.basename(i))[0]) for i in downloaded_model_paths]
    else:
        downloaded_models=[]

    if model_id not in downloaded_models:
        url='https://research.cellcollective.org/api/model/%d/export/version/1?type=SBML'%model_id
        try:
            sbml = cellcollective.load(url)
        except Exception as e:
            print(model_id, str(e))
            continue
        model_name=sbml.dom.getElementsByTagName('model')[0].getAttribute('name')
        print(model_name)

        #I download the file again to a permanent local path because the colomoto biolqm.load does not work with the temporary download initiated by the cellcollective script.
        filename = download_local(url,sbmls_path,model_id)
        sbml.localfile=filename
        #save to boolean net
        lqm = cellcollective.to_biolqm(sbml)
        biolqm.save(lqm, "%s%d_%s.booleannet"%(boolean_models_path,model_id,model_name.replace(' ','_')), "booleannet")

    sbml = cellcollective.load(sbmls_path+str(model_id)+'.sbml')
    for s in sbml.species:
        #look up the UniProt annotation once; it is None when the species has none
        uniprot_entry=sbml.species_uniprotkb(s)
        uniprot=uniprot_entry.data if uniprot_entry is not None else None
            
        row={'species':s.strip(),
             'model_id':model_id,
             'model_name':model_name_dict[model_id],
             'uniprot_info':uniprot,
             'ncbi_gene_info':sbml.species_ncbi_gene(s),
             'link_to_model':'https://research.cellcollective.org/?dashboard=true#module/%d:1/'%model_id}
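        #DataFrame.append was removed in pandas 2.0, so the row is added with pd.concat
        df=pd.concat([df,pd.DataFrame([row])],ignore_index=True)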
        if s in species_dict:
            species_dict[s].append(model_id)
        else:
            species_dict[s]=[model_id]
    

In [None]:
#a dictionary mapping molecular species to model_ids of models containing them
print(species_dict)
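
One use of this dictionary is finding species that occur in more than one model. The sketch below assumes only the structure built above (species name mapped to a list of model ids). The helper `shared_species` and the example data are hypothetical and not taken from Cell Collective.

```python
def shared_species(species_to_models, min_models=2):
    """Return species present in at least min_models distinct models."""
    return {
        species: sorted(set(model_ids))
        for species, model_ids in species_to_models.items()
        if len(set(model_ids)) >= min_models
    }

# Made-up example data with the same shape as species_dict
example_species_dict = {
    "SpeciesA": [101, 202],
    "SpeciesB": [101],
    "SpeciesC": [303, 202, 101],
}
example_shared = shared_species(example_species_dict)
print(example_shared)
```

Note that species names are compared as exact strings here, so the same molecule spelled differently in two models is counted as two species.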

Several Python interfaces exist for programmatically querying the NCBI Gene and UniProt databases:
- using NCBI Gene ID: https://github.com/biocommons/eutils
- using UniProt ID: https://github.com/jdrudolph/uniprot

In [None]:
df = df.reindex(['species','model_id','model_name','link_to_model','uniprot_info','ncbi_gene_info'], axis=1)

In [None]:
df = df.sort_values('species')
df = df.reset_index(drop=True)
df.to_excel('cell_collective_species_data.xlsx')

In [None]:
df