Submitting various things for end of grant.

In [1]:
import numpy
import pandas
import sys


Install my ENCODE DCC support library. It looks for authentication tokens in one of two places: the environment variables DCC_API_KEY and DCC_SECRET_KEY, or a .netrc file entry of the form

machine www.encodeproject.org login DCC_API_KEY password DCC_SECRET_KEY

Alternatively, set `server.auth = (DCC_API_KEY, DCC_SECRET_KEY)` after creating the object.
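
The lookup order described above can be sketched with the standard library alone. This is a minimal illustration, not the library's actual implementation: the helper name `load_dcc_credentials` and its parameters are hypothetical, and it assumes the environment variables take precedence over the .netrc entry.

```python
import netrc
import os


def load_dcc_credentials(host="www.encodeproject.org", netrc_path=None):
    """Return (api_key, secret_key) for host, or None if nothing is configured.

    Hypothetical helper: checks the environment first, then a .netrc entry.
    """
    api_key = os.environ.get("DCC_API_KEY")
    secret_key = os.environ.get("DCC_SECRET_KEY")
    if api_key and secret_key:
        return (api_key, secret_key)

    try:
        # netrc_path=None means the default ~/.netrc file.
        entry = netrc.netrc(netrc_path).authenticators(host)
    except (OSError, netrc.NetrcParseError):
        # No readable .netrc file, so there is nothing to fall back on.
        return None
    if entry is None:
        return None
    login, _account, password = entry
    return (login, password)
```

If the helper returns a tuple, it could be assigned with `server.auth = load_dcc_credentials()` once the `ENCODED` object exists.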

In [2]:
!$sys.executable -m pip install --user encoded_client



In [3]:
from encoded_client.encoded import ENCODED

In [4]:
# live server & control file

server = ENCODED('www.encodeproject.org')
spreadsheet_name = "https://docs.google.com/spreadsheets/d/1sRCE-Cydci0xkg3fTpGJWdRgKEjVCvpX/export?format=xlsx"


In [5]:
datasets = pandas.read_excel(spreadsheet_name, sheet_name="datasets")

print("How many records are there total", datasets.shape[0])

How many records are there total 478


In [6]:
print("How many records lack an analyst", datasets[pandas.isnull(datasets["primary analysist"])].shape[0])

How many records lack an analyst 103


While trying to understand this, let's see what files are available for a processed scATAC-seq experiment.

In [7]:
experiment = server.get_json("ENCSR237BTV")
for f in experiment["files"]:
    print(experiment["accession"], experiment["assay_term_name"], f["accession"], f["output_type"])

ENCSR237BTV single-nucleus ATAC-seq ENCFF011AHT reads
ENCSR237BTV single-nucleus ATAC-seq ENCFF321JCZ reads
ENCSR237BTV single-nucleus ATAC-seq ENCFF102ITI reads
ENCSR237BTV single-nucleus ATAC-seq ENCFF946ZML reads
ENCSR237BTV single-nucleus ATAC-seq ENCFF180LFF reads
ENCSR237BTV single-nucleus ATAC-seq ENCFF868YVX reads
ENCSR237BTV single-nucleus ATAC-seq ENCFF763QQR reads
ENCSR237BTV single-nucleus ATAC-seq ENCFF314WHS reads
ENCSR237BTV single-nucleus ATAC-seq ENCFF978HAU index reads
ENCSR237BTV single-nucleus ATAC-seq ENCFF326KKX index reads
ENCSR237BTV single-nucleus ATAC-seq ENCFF923ONK index reads
ENCSR237BTV single-nucleus ATAC-seq ENCFF428KHS index reads
ENCSR237BTV single-nucleus ATAC-seq ENCFF449WUA filtered reads
ENCSR237BTV single-nucleus ATAC-seq ENCFF381INN filtered reads
ENCSR237BTV single-nucleus ATAC-seq ENCFF853WXT unfiltered alignments
ENCSR237BTV single-nucleus ATAC-seq ENCFF163UJY alignments
ENCSR237BTV single-nucleus ATAC-seq ENCFF231IRS fragments
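
A listing like this is easier to scan when the output types are tallied. The sketch below does that with `collections.Counter`; the `experiment` dictionary is a hand-built stand-in that copies only a few of the files printed above, so the counts are illustrative rather than the real totals.

```python
from collections import Counter

# Stand-in for server.get_json("ENCSR237BTV"); only a subset of files is copied.
experiment = {
    "accession": "ENCSR237BTV",
    "assay_term_name": "single-nucleus ATAC-seq",
    "files": [
        {"accession": "ENCFF011AHT", "output_type": "reads"},
        {"accession": "ENCFF321JCZ", "output_type": "reads"},
        {"accession": "ENCFF978HAU", "output_type": "index reads"},
        {"accession": "ENCFF163UJY", "output_type": "alignments"},
        {"accession": "ENCFF231IRS", "output_type": "fragments"},
    ],
}

# Count how many files carry each output type, most common first.
output_type_counts = Counter(f["output_type"] for f in experiment["files"])
for output_type, count in output_type_counts.most_common():
    print(output_type, count)
```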


With that we can write a function to determine if a particular experiment has a file that indicates it was processed.

In [8]:
def is_processed(accession):
    """Report whether an experiment has the final output file for its assay."""
    # Output type that marks the end of processing for each supported assay.
    final_file = {
        "single-nucleus ATAC-seq": "fragments",
        "single-cell RNA sequencing assay": "sparse gene count matrix of all reads",
    }
    experiment = server.get_json(accession)
    # Collect the distinct output types attached to this experiment.
    output_types = {f["output_type"] for f in experiment["files"]}

    status = False
    target_output_type = final_file.get(experiment["assay_term_name"])
    if target_output_type is None:
        print("Unsupported assay type {}".format(experiment["assay_term_name"]))
        print("Seen {}".format(output_types))
    elif target_output_type in output_types:
        status = True

    return {
        "is_processed": status,
        "status": experiment["status"],
        "assay_term_name": experiment["assay_term_name"],
        "default_analysis": experiment["default_analysis"],
        "biosample_summary": experiment["biosample_summary"],
    }
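
Because `is_processed` calls the server directly, it is awkward to check without network access. One option, sketched here under the assumption that the experiment JSON has the `assay_term_name` and `files` keys used above, is to split the decision out into a pure function. The names `FINAL_OUTPUT_TYPE` and `has_final_output` are hypothetical, and unlike `is_processed` this version returns None rather than False for an unsupported assay.

```python
# Final output type per assay, copied from is_processed above.
FINAL_OUTPUT_TYPE = {
    "single-nucleus ATAC-seq": "fragments",
    "single-cell RNA sequencing assay": "sparse gene count matrix of all reads",
}


def has_final_output(experiment):
    """Return True/False for supported assays, None for unsupported ones."""
    target = FINAL_OUTPUT_TYPE.get(experiment["assay_term_name"])
    if target is None:
        return None
    return any(f["output_type"] == target for f in experiment["files"])
```

It can then be exercised with a small hand-written dictionary instead of a live query.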

Query the ENCODE DCC portal to see whether the experiments lacking an analyst have been processed.

In [9]:
processed = {}
for i, row in datasets[pandas.isnull(datasets["primary analysist"])].iterrows():
    processed[row[" ID"]] = is_processed(row[" ID"])
    
processed = pandas.DataFrame(processed).T
processed.head()

Unnamed: 0,is_processed,status,assay_term_name,default_analysis,biosample_summary
ENCSR237BTV,True,released,single-nucleus ATAC-seq,/analyses/ENCAN277WEW/,Homo sapiens A549 nuclear fraction
ENCSR624OTN,True,released,single-nucleus ATAC-seq,/analyses/ENCAN615PWI/,Homo sapiens A549 nuclear fraction
ENCSR773HYY,True,released,single-nucleus ATAC-seq,/analyses/ENCAN738VCU/,Homo sapiens A673 nuclear fraction
ENCSR719QGN,True,released,single-nucleus ATAC-seq,/analyses/ENCAN692ASW/,Homo sapiens activated B cell nuclear fraction...
ENCSR675DJV,True,released,single-nucleus ATAC-seq,/analyses/ENCAN520KVN/,"Homo sapiens activated naive CD4-positive, alp..."
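
A loop over a hundred or so accessions can stop partway through if a single request raises an error. The sketch below is one way to keep going and record failures separately; `check_all` and its `check` parameter are hypothetical names, and `check` stands for any function that takes an accession, such as `is_processed`.

```python
def check_all(accessions, check):
    """Apply check to each accession, collecting results and failures."""
    results = {}
    failures = {}
    for accession in accessions:
        try:
            results[accession] = check(accession)
        except Exception as error:  # keep going; report the problem later
            failures[accession] = repr(error)
    return results, failures
```

The `results` dictionary has the same shape as `processed` before it is turned into a DataFrame, so `pandas.DataFrame(results).T` should work on it unchanged.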


Report any experiment accessions that lack processed data.

In [10]:
processed[processed["is_processed"] == False]

Unnamed: 0,is_processed,status,assay_term_name,default_analysis,biosample_summary


Double check that all queried experiments have processed data.

In [11]:
numpy.all(processed["is_processed"])

True

List the queried experiments that have not been released yet.

In [12]:
processed[processed["status"] != "released"]

Unnamed: 0,is_processed,status,assay_term_name,default_analysis,biosample_summary
ENCSR076QTT,True,in progress,single-nucleus ATAC-seq,/analyses/ENCAN399JOZ/,Homo sapiens HCT116 genetically modified (inse...
ENCSR840IWM,True,in progress,single-nucleus ATAC-seq,/analyses/ENCAN544EUT/,Homo sapiens HCT116 genetically modified (inse...
ENCSR672VKC,True,in progress,single-nucleus ATAC-seq,/analyses/ENCAN754EDL/,Homo sapiens HCT116 genetically modified (inse...
ENCSR587XSM,True,in progress,single-nucleus ATAC-seq,/analyses/ENCAN836EBA/,Homo sapiens HCT116 genetically modified (inse...
ENCSR694GIR,True,in progress,single-nucleus ATAC-seq,/analyses/ENCAN288TVZ/,Homo sapiens HCT116 genetically modified (inse...
ENCSR270OJH,True,in progress,single-nucleus ATAC-seq,/analyses/ENCAN443NEB/,Homo sapiens HCT116 genetically modified (inse...
