# Introduction

I was asked to help find the age metadata associated with our ENCODE publication set https://www.encodeproject.org/publication-data/ENCSR574CRQ/

In this notebook I'll demonstrate two ways of retrieving that information.

First we make sure the needed dependencies are installed.
(Recent versions of pip try harder not to conflict with system-installed packages, so I needed --break-system-packages to get the install to work. Older versions of pip and Python probably won't need that flag.)

In [1]:
import sys

!{sys.executable} -m pip install --user --no-deps --break-system-packages encoded_client pandas tqdm



Next we import the dependencies we need.

In [2]:
from encoded_client.encoded import ENCODED
from tqdm import tqdm
import pandas

Pandas can read URLs directly, and the ENCODE publication set has a file that lists all the files and the experiments they come from. Unfortunately it lacks much of the biosample-level metadata, but given the File dataset column we can go and retrieve more information.

In [3]:
file_metadata = pandas.read_csv("https://www.encodeproject.org/documents/ab75e52f-64d9-4c39-aea0-15372479049d/@@download/attachment/ENCSR574CRQ_metadata.tsv", sep="\t")
file_metadata


Unnamed: 0,File accession,File dataset,File type,File format,File output type,Assay term name,Biosample term id,Biosample term name,Biosample type,File target,...,Project,Lab,md5sum,dbxrefs,File download URL,Assembly,File status,Derived from,S3 URL,Size
0,ENCFF920CNZ,ENCSR304RDL,fastq,fastq,reads,RNA-seq,UBERON:0001890,forebrain,tissue,,...,ENCODE,"Barbara Wold, Caltech",23d4d1fcdbfe3e477896f77e5208b3c0,SRA:SRR4421602,/files/ENCFF920CNZ/@@download/ENCFF920CNZ.fast...,,released,,https://encode-public.s3.amazonaws.com/2016/07...,2221837463
1,ENCFF320FJX,ENCSR304RDL,fastq,fastq,reads,RNA-seq,UBERON:0001890,forebrain,tissue,,...,ENCODE,"Barbara Wold, Caltech",7d3a3b102641187fee176b020d27223c,SRA:SRR4421603,/files/ENCFF320FJX/@@download/ENCFF320FJX.fast...,,released,,https://encode-public.s3.amazonaws.com/2016/07...,2193429461
2,ENCFF528EVC,ENCSR304RDL,fastq,fastq,reads,RNA-seq,UBERON:0001890,forebrain,tissue,,...,ENCODE,"Barbara Wold, Caltech",edf87120238bbfbf117482ae6172c4a4,SRA:SRR4421600,/files/ENCFF528EVC/@@download/ENCFF528EVC.fast...,,released,,https://encode-public.s3.amazonaws.com/2016/07...,1969887894
3,ENCFF663SNC,ENCSR304RDL,fastq,fastq,reads,RNA-seq,UBERON:0001890,forebrain,tissue,,...,ENCODE,"Barbara Wold, Caltech",882e38113300a00cc2e7e3f0fdcf24a8,SRA:SRR4421601,/files/ENCFF663SNC/@@download/ENCFF663SNC.fast...,,released,,https://encode-public.s3.amazonaws.com/2016/07...,1989101437
4,ENCFF133MSL,ENCSR304RDL,bam,bam,alignments,RNA-seq,UBERON:0001890,forebrain,tissue,,...,ENCODE,ENCODE Processing Pipeline,bfda8d47bf8ac1e72746e14a7765d597,,/files/ENCFF133MSL/@@download/ENCFF133MSL.bam,mm10,released,"ENCFF320FJX, ENCFF920CNZ, ENCFF533JRE",https://encode-public.s3.amazonaws.com/2016/08...,5174602843
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1194,ENCFF584AIQ,ENCSR739PEB,bigWig,bigWig,signal of all reads,RNA-seq,UBERON:0002369,adrenal gland,tissue,,...,ENCODE,ENCODE Processing Pipeline,bcb98dcbc0caa48642592b9f91705bde,,/files/ENCFF584AIQ/@@download/ENCFF584AIQ.bigWig,mm10,released,ENCFF752FZT,https://encode-public.s3.amazonaws.com/2016/08...,187310904
1195,ENCFF465YOS,ENCSR739PEB,tsv,tsv,gene quantifications,RNA-seq,UBERON:0002369,adrenal gland,tissue,,...,ENCODE,ENCODE Processing Pipeline,9277ac793ae1285e953f409dfd1f87d4,,/files/ENCFF465YOS/@@download/ENCFF465YOS.tsv,mm10,released,"ENCFF717NFD, ENCFF077IZN",https://encode-public.s3.amazonaws.com/2016/08...,8644586
1196,ENCFF396NRQ,ENCSR739PEB,tsv,tsv,transcript quantifications,RNA-seq,UBERON:0002369,adrenal gland,tissue,,...,ENCODE,ENCODE Processing Pipeline,a9c42437793f4ad5221ddc2d0f2e93d8,,/files/ENCFF396NRQ/@@download/ENCFF396NRQ.tsv,mm10,released,"ENCFF717NFD, ENCFF077IZN",https://encode-public.s3.amazonaws.com/2016/08...,16414585
1197,ENCFF415XCZ,ENCSR739PEB,tsv,tsv,transcript quantifications,RNA-seq,UBERON:0002369,adrenal gland,tissue,,...,ENCODE,ENCODE Processing Pipeline,5a89b79b33780ee293ad00a676725a35,,/files/ENCFF415XCZ/@@download/ENCFF415XCZ.tsv,mm10,released,"ENCFF717NFD, ENCFF522TMY",https://encode-public.s3.amazonaws.com/2016/08...,16393312


Because many files are associated with an experiment, we need to remove duplicate experiment accessions from the File dataset column.

In [4]:
experiments = set(file_metadata["File dataset"])
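
As a small illustration of that deduplication step, here is a sketch on a made-up three-row table (the `toy_files` frame is hypothetical, though the accessions are taken from the listing above). Because a set has no guaranteed order, sorting it is one way to get a reproducible list.

```python
import pandas

# Hypothetical stand-in for file_metadata: several files per experiment.
toy_files = pandas.DataFrame({
    "File accession": ["ENCFF920CNZ", "ENCFF320FJX", "ENCFF584AIQ"],
    "File dataset": ["ENCSR304RDL", "ENCSR304RDL", "ENCSR739PEB"],
})

# A set drops the repeated accessions; sorting makes the order reproducible.
toy_experiments = sorted(set(toy_files["File dataset"]))

# Count how many files belong to each experiment.
files_per_experiment = toy_files["File dataset"].value_counts()

print(toy_experiments)
print(files_per_experiment)
```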

# Web method

The ENCODE portal has a report view that returns a TSV file, which you can customize with more metadata. Unfortunately, I don't know of a direct way to go from a publication set to the report.

So here we first construct a search url from all the experiment accession ids.

In [5]:
base = "https://www.encodeproject.org/search/?type=Experiment&accession="
url = base + "&accession=".join(experiments)

print(url)

https://www.encodeproject.org/search/?type=Experiment&accession=ENCSR343YLB&accession=ENCSR362AIZ&accession=ENCSR968QHO&accession=ENCSR750YSX&accession=ENCSR285WZV&accession=ENCSR541XZK&accession=ENCSR420QTO&accession=ENCSR173PJN&accession=ENCSR448MXQ&accession=ENCSR636CWO&accession=ENCSR185LWM&accession=ENCSR331XCE&accession=ENCSR508GWZ&accession=ENCSR096STK&accession=ENCSR337FYI&accession=ENCSR848GST&accession=ENCSR537GNQ&accession=ENCSR307BCA&accession=ENCSR823VEE&accession=ENCSR039ADS&accession=ENCSR943LKA&accession=ENCSR848HOX&accession=ENCSR216NEG&accession=ENCSR691OPQ&accession=ENCSR115TWD&accession=ENCSR304RDL&accession=ENCSR559TRB&accession=ENCSR764OPZ&accession=ENCSR792RJV&accession=ENCSR020DGG&accession=ENCSR928OXI&accession=ENCSR921PRX&accession=ENCSR526SEX&accession=ENCSR370SFB&accession=ENCSR284YKY&accession=ENCSR719NAJ&accession=ENCSR557RMA&accession=ENCSR017JEG&accession=ENCSR438XCG&accession=ENCSR150CUE&accession=ENCSR178GUS&accession=ENCSR049UJU&accession=ENCSR466KZY&

From there, on the Experiment search page, you can click on "Report", then on "Columns" to select the columns of interest if you'd like to change them, and finally on "Download TSV" to retrieve a table with more of the experiment metadata.
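
The search address can also be assembled with the standard library's `urlencode`, which handles escaping and repeated keys. The sketch below is only an illustration: the helper name `build_experiment_url` is made up, and the `/report.tsv` path is my assumption about the address behind the TSV download, so check it against what the portal actually links to.

```python
from urllib.parse import urlencode


def build_experiment_url(accessions, path="/search/"):
    """Build an ENCODE query URL for a collection of experiment accessions."""
    # A list of (key, value) pairs lets the "accession" key repeat.
    params = [("type", "Experiment")]
    params += [("accession", accession) for accession in sorted(accessions)]
    return "https://www.encodeproject.org" + path + "?" + urlencode(params)


search_url = build_experiment_url({"ENCSR304RDL", "ENCSR739PEB"})

# Assumption: /report.tsv is the address used by the TSV download.
report_url = build_experiment_url({"ENCSR304RDL", "ENCSR739PEB"}, path="/report.tsv")

print(search_url)
print(report_url)
```

If the assumed report address is right, the result could be handed to `pandas.read_csv` with `sep="\t"`, but look at the first lines of the download for any header comment before parsing it.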

# Programmatic method

encoded_client is my utility that provides some wrappers for accessing the ENCODE JSON API. There's also a similar package called encode_utils.

The actual request is https://www.encodeproject.org/experiments/ENCSR727FHP/?format=json (iterating over all the different experiment accession IDs from the file metadata).
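
For comparison, that request can be sketched with only the standard library. This is not how encoded_client works internally; the function names are made up for illustration, and `fetch_experiment` needs network access, so it is defined here but not called.

```python
import json
from urllib.request import Request, urlopen


def experiment_json_url(accession):
    """Return the JSON address for one experiment accession."""
    return f"https://www.encodeproject.org/experiments/{accession}/?format=json"


def fetch_experiment(accession):
    """Download and decode one experiment object (needs network access)."""
    request = Request(
        experiment_json_url(accession),
        headers={"Accept": "application/json"},
    )
    with urlopen(request) as response:
        return json.load(response)


print(experiment_json_url("ENCSR727FHP"))
```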

In [6]:
server = ENCODED("www.encodeproject.org")

This chunk of code requests the experiment JSON objects, which are pretty detailed: they include information about the experiment and the replicate objects, which link to the library and biosample objects that make up the experiment.

The different objects hold different kinds of information, and in this case, since we're looking for age metadata, the fields we want are on the biosample objects.

In [7]:
experiment_metadata = []
for accession in tqdm(experiments):
    experiment = server.get_json(accession)
    for replicate in experiment["replicates"]:
        library = replicate["library"]
        biosample = library["biosample"]
        biosample_ontology = biosample["biosample_ontology"]
        experiment_metadata.append({
            "experiment": experiment["accession"], 
            "biosample": biosample["accession"], 
            "biosample_term_name": biosample_ontology["term_name"], 
            "mouse_life_stage": biosample["mouse_life_stage"], 
            "age": biosample['model_organism_age'], 
            "age_units": biosample['model_organism_age_units'],
        })

experiment_metadata = pandas.DataFrame(experiment_metadata)

100%|█████████████████████████████████████████████████| 78/78 [00:58<00:00,  1.34it/s]
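
The loop above indexes each field directly, so a biosample missing one of the keys would raise a `KeyError`. As a hedged alternative, here is a sketch that uses `dict.get` and fills gaps with `None`. The function name `flatten_replicates` and the `toy_experiment` dictionary are made up; the dictionary only mimics the nesting used above, and its second replicate is deliberately stripped of age fields to show the fallback.

```python
def flatten_replicates(experiment):
    """Return one row per replicate, using None for missing biosample fields."""
    rows = []
    for replicate in experiment.get("replicates", []):
        biosample = replicate.get("library", {}).get("biosample", {})
        ontology = biosample.get("biosample_ontology", {})
        rows.append({
            "experiment": experiment.get("accession"),
            "biosample": biosample.get("accession"),
            "biosample_term_name": ontology.get("term_name"),
            "mouse_life_stage": biosample.get("mouse_life_stage"),
            "age": biosample.get("model_organism_age"),
            "age_units": biosample.get("model_organism_age_units"),
        })
    return rows


# Hand-written stand-in for a server response, not real portal output.
# The second replicate has its age fields removed on purpose.
toy_experiment = {
    "accession": "ENCSR343YLB",
    "replicates": [
        {"library": {"biosample": {
            "accession": "ENCBS825LGT",
            "biosample_ontology": {"term_name": "midbrain"},
            "mouse_life_stage": "embryonic",
            "model_organism_age": "14.5",
            "model_organism_age_units": "day",
        }}},
        {"library": {"biosample": {"accession": "ENCBS849BXE"}}},
    ],
}

rows = flatten_replicates(toy_experiment)
print(rows)
```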


In [8]:
experiment_metadata

Unnamed: 0,experiment,biosample,biosample_term_name,mouse_life_stage,age,age_units
0,ENCSR343YLB,ENCBS825LGT,midbrain,embryonic,14.5,day
1,ENCSR343YLB,ENCBS849BXE,midbrain,embryonic,14.5,day
2,ENCSR362AIZ,ENCBS804WMS,forebrain,postnatal,0,day
3,ENCSR362AIZ,ENCBS033OJJ,forebrain,postnatal,0,day
4,ENCSR968QHO,ENCBS844FSC,limb,embryonic,10.5,day
...,...,...,...,...,...,...
151,ENCSR579FCW,ENCBS173QYQ,spleen,postnatal,0,day
152,ENCSR932TRU,ENCBS110AZU,intestine,embryonic,14.5,day
153,ENCSR932TRU,ENCBS475HQA,intestine,embryonic,14.5,day
154,ENCSR946HWC,ENCBS186LJI,skeletal muscle tissue,postnatal,0,day


In [9]:
experiment_metadata.to_csv("ENCSR574CRQ_biosample.tsv", sep="\t", index=False)
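
One possible next step is attaching the age columns back onto the per-file table. The sketch below uses two small made-up frames in place of `file_metadata` and `experiment_metadata`; the age values in it are hypothetical, not looked up from the portal. Since each experiment usually has more than one replicate, the age rows are deduplicated first so that the merge does not multiply the file rows.

```python
import pandas

# Hypothetical stand-ins for file_metadata and experiment_metadata.
toy_files = pandas.DataFrame({
    "File accession": ["ENCFF920CNZ", "ENCFF320FJX"],
    "File dataset": ["ENCSR304RDL", "ENCSR304RDL"],
})
toy_biosamples = pandas.DataFrame({
    "experiment": ["ENCSR304RDL", "ENCSR304RDL"],
    "biosample": ["replicate-1", "replicate-2"],  # placeholder identifiers
    "age": ["14.5", "14.5"],                      # hypothetical values
    "age_units": ["day", "day"],
})

# Keep one age row per experiment so files are not duplicated by the merge.
ages = toy_biosamples[["experiment", "age", "age_units"]].drop_duplicates()

annotated = toy_files.merge(
    ages, left_on="File dataset", right_on="experiment", how="left"
)

# Convert the age strings to numbers so they can be sorted or filtered.
annotated["age"] = pandas.to_numeric(annotated["age"])
print(annotated)
```

If replicates of one experiment ever disagree on age, `drop_duplicates` would keep both rows and the merge would repeat the files, which is worth checking for before relying on the result.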