# Introduction

Brian asked for a protocol attached to the LRGASP experiments to be replaced. His email follows.

Sorry that this change on the ENCODE Portal is coming so last minute, but I am hoping that it won't be too much of a headache.
We need to swap out the PDF protocol titled "Long Read wet lab protocol v3" that is currently attached to the LRGASP
experiments below on the Portal and replace it with the protocol attached to this email. This applies to the LRGASP experiments only: no other long read experiments yet, as I am still checking with Heidi and Ali.

Note: when doing this, please leave the other two protocols that are currently attached on the Portal
(FASTQs and the "protocol to add 5' cap ...") attached to the experiments.
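
The swap requested above amounts to replacing one entry in an object's `documents` list while leaving the other entries in place. Here is a minimal sketch of that list manipulation, assuming documents are referenced by `@id` paths as they are elsewhere in this notebook. `swap_document` is a hypothetical helper, the identifiers are placeholders rather than real Portal documents, and submitting the result to the Portal is not shown.

```python
def swap_document(documents, old_id, new_id):
    """Return a copy of documents with old_id replaced by new_id.

    Other entries keep their order. If old_id is absent the list is
    returned unchanged, so objects without the old protocol are untouched.
    """
    if old_id not in documents:
        return list(documents)
    swapped = []
    for d in documents:
        candidate = new_id if d == old_id else d
        # Avoid a duplicate if new_id was already attached.
        if candidate not in swapped:
            swapped.append(candidate)
    return swapped


# Placeholder identifiers, for illustration only.
current = ["/documents/old-protocol/", "/documents/capping/", "/documents/fastq/"]
updated = swap_document(current, "/documents/old-protocol/", "/documents/new-protocol/")
print(updated)
```

Returning a new list rather than editing in place keeps the original value available for comparison before anything is submitted.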



In [1]:
import pandas
from pathlib import Path
from io import StringIO
import os
import sys

# encoded_client is used from a local checkout, so add it to the import path.
EC = str(Path("~/proj/encoded_client").expanduser())
if EC not in sys.path:
    sys.path.append(EC)
from encoded_client.encoded import ENCODED, HTTPError

In [2]:
# Work against the test server; the production host is kept commented out.
#server = ENCODED("www.encodeproject.org")
server = ENCODED("test.encodedcc.org")

In [3]:
def update_documents(seen_documents, experiment_id, obj):
    """Record experiment_id against every document attached to obj."""
    for d in obj.get("documents", []):
        seen_documents.setdefault(d, set()).add(experiment_id)

def get_all_documents(query):
    """Map each document @id to the set of experiments that reference it.

    Documents are collected from the experiment itself, from each
    replicate's library, and from the library's biosample when present.
    """
    seen_documents = {}
    for row in query["@graph"]:
        experiment_id = row["@id"]
        # Fetch the experiment object by its @id.
        experiment = server.get_json(experiment_id)
        update_documents(seen_documents, experiment_id, experiment)
        for rep in experiment["replicates"]:
            library = rep["library"]
            update_documents(seen_documents, experiment_id, library)
            if "biosample" in library:
                biosample = library["biosample"]
                update_documents(seen_documents, experiment_id, biosample)
    return seen_documents

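def describe_documents(seen_documents):
    """Build a table with one row per document: link, usage count, filename, description."""
    records = []
    for d in seen_documents:
        document = server.get_json(d)
        records.append({
            # The link always points at the production Portal.
            "id": "https://www.encodeproject.org{}".format(d),
            # Number of experiments that reference this document.
            "used": len(seen_documents[d]),
            "filename": document["attachment"]["download"],
            "description": document["description"],
        })

    return pandas.DataFrame(records)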


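def build_experiment_query(table):
    """Return a search URL that selects every accession listed in table.

    The host is hard-coded to the production Portal.
    """
    base_url = "https://www.encodeproject.org/search/?type=Experiment&"
    values = table["Experiment Accession"].values
    return base_url + "&".join(["accession={}".format(x) for x in values])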
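
The helpers above build an inverted index from document `@id` to the set of experiments that reference it, looking at the experiment, library, and biosample levels. The sketch below reproduces that bookkeeping on hand-written dictionaries so it can run without a server. The experiment and document identifiers are made up, `index_documents` is a hypothetical stand-in for `get_all_documents`, and the nested layout (`replicates`, then `library`, then `biosample`) is assumed to mirror what the notebook code expects.

```python
def update_documents(seen_documents, experiment_id, obj):
    """Record experiment_id against every document attached to obj."""
    for d in obj.get("documents", []):
        seen_documents.setdefault(d, set()).add(experiment_id)


def index_documents(experiments):
    """Offline variant of get_all_documents that takes experiment dicts directly."""
    seen = {}
    for experiment in experiments:
        experiment_id = experiment["@id"]
        update_documents(seen, experiment_id, experiment)
        for rep in experiment.get("replicates", []):
            library = rep.get("library", {})
            update_documents(seen, experiment_id, library)
            update_documents(seen, experiment_id, library.get("biosample", {}))
    return seen


# Made-up objects, for illustration only.
mock = [
    {
        "@id": "/experiments/EXP1/",
        "documents": ["/documents/pipeline/"],
        "replicates": [{
            "library": {
                "documents": ["/documents/library-prep/"],
                "biosample": {"documents": ["/documents/growth/"]},
            },
        }],
    },
    {
        "@id": "/experiments/EXP2/",
        "replicates": [{"library": {"documents": ["/documents/library-prep/"]}}],
    },
]

index = index_documents(mock)
for doc, experiments in sorted(index.items()):
    print(doc, sorted(experiments))
```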


In [4]:
mortazavi = pandas.read_csv(StringIO("""Lab	species	sample name	Experiment Accession
Mortazavi	Homo sapiens	endodermal cell	ENCSR127HKN
Mortazavi	Homo sapiens	H1 ES	ENCSR271KEJ
Mortazavi	Mus musculus	F121-9	ENCSR172GXL
Mortazavi	Homo sapiens	WTC11	ENCSR507JOF
Mortazavi	Homo sapiens	technical sample	ENCSR731MFY"""), sep="\t")
mortazavi

Unnamed: 0,Lab,species,sample name,Experiment Accession
0,Mortazavi,Homo sapiens,endodermal cell,ENCSR127HKN
1,Mortazavi,Homo sapiens,H1 ES,ENCSR271KEJ
2,Mortazavi,Mus musculus,F121-9,ENCSR172GXL
3,Mortazavi,Homo sapiens,WTC11,ENCSR507JOF
4,Mortazavi,Homo sapiens,technical sample,ENCSR731MFY


In [5]:
build_experiment_query(mortazavi)

'https://www.encodeproject.org/search/?type=Experiment&accession=ENCSR127HKN&accession=ENCSR271KEJ&accession=ENCSR172GXL&accession=ENCSR507JOF&accession=ENCSR731MFY'
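
The same kind of query string can be produced with the standard library's `urllib.parse.urlencode`, which also percent-escapes values. Here is a small sketch, assuming a plain list of accessions rather than a DataFrame column. `build_query` and its `host` parameter are hypothetical names, not part of the notebook.

```python
from urllib.parse import urlencode


def build_query(accessions, host="www.encodeproject.org"):
    """Build an Experiment search URL with one accession parameter per entry."""
    params = [("type", "Experiment")] + [("accession", a) for a in accessions]
    return "https://{}/search/?{}".format(host, urlencode(params))


url = build_query(["ENCSR127HKN", "ENCSR271KEJ"])
print(url)
```

Taking the host as a parameter makes it explicit whether a query targets the production or the test server.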

In [6]:
mortazavi_seen = get_all_documents(server.get_json(build_experiment_query(mortazavi)))
mortazavi_docs = describe_documents(mortazavi_seen)

mortazavi_docs

Unnamed: 0,id,used,filename,description
0,https://www.encodeproject.org/documents/81af56...,5,ENCODE Long Read RNA-Seq Analysis Pipeline v3 ...,This document describes 1) the steps used to g...
1,https://www.encodeproject.org/documents/e90954...,5,ENCODE_Protocol_Spikeins_capping_v1.pdf,Protocol to add 5’ cap structures to exogenous...
2,https://www.encodeproject.org/documents/35c950...,5,Long read cDNA prep with Maxima H no exo SIRV4...,Library protocol for LRGASP long read RNA-seq.


In [7]:
wold = pandas.read_csv(StringIO("""Lab	species	sample name	Experiment Accession
Wold	Homo sapiens	endodermal cell	ENCSR266XAJ
Wold	Homo sapiens	H1 ES	ENCSR588EJX
Wold	Mus musculus	F121-9	ENCSR982PLD
Wold	Homo sapiens	WTC11	ENCSR673UKZ
Wold	Homo sapiens	technical sample	ENCSR154RVC"""), sep="\t")
wold

Unnamed: 0,Lab,species,sample name,Experiment Accession
0,Wold,Homo sapiens,endodermal cell,ENCSR266XAJ
1,Wold,Homo sapiens,H1 ES,ENCSR588EJX
2,Wold,Mus musculus,F121-9,ENCSR982PLD
3,Wold,Homo sapiens,WTC11,ENCSR673UKZ
4,Wold,Homo sapiens,technical sample,ENCSR154RVC


In [8]:
wold["Experiment Accession"].values

array(['ENCSR266XAJ', 'ENCSR588EJX', 'ENCSR982PLD', 'ENCSR673UKZ',
       'ENCSR154RVC'], dtype=object)

In [9]:
build_experiment_query(wold)

'https://www.encodeproject.org/search/?type=Experiment&accession=ENCSR266XAJ&accession=ENCSR588EJX&accession=ENCSR982PLD&accession=ENCSR673UKZ&accession=ENCSR154RVC'

In [10]:

wold_seen = get_all_documents(server.get_json(build_experiment_query(wold)))
wold_docs = describe_documents(wold_seen)

wold_docs

Unnamed: 0,id,used,filename,description
0,https://www.encodeproject.org/documents/334823...,5,Norgen_Animal-Tissue-RNA-Purification-Kit-Inse...,Norgen Animal Tissue RNA purification kit user...
1,https://www.encodeproject.org/documents/e90954...,5,ENCODE_Protocol_Spikeins_capping_v1.pdf,Protocol to add 5’ cap structures to exogenous...
2,https://www.encodeproject.org/documents/35c950...,5,Long read cDNA prep with Maxima H no exo SIRV4...,Library protocol for LRGASP long read RNA-seq.
3,https://www.encodeproject.org/documents/6936da...,5,nextera-dna-flex-library-prep-reference-guide-...,Library protocol for LRGASP total RNA-seq.


In [11]:
j = pandas.read_csv(
    "https://www.encodeproject.org/report.tsv?type=Experiment&accession=ENCSR127HKN&accession=ENCSR271KEJ&accession=ENCSR172GXL&accession=ENCSR507JOF&accession=ENCSR731MFY&accession=ENCSR266XAJ&accession=ENCSR588EJX&accession=ENCSR982PLD&accession=ENCSR673UKZ&accession=ENCSR154RVC&field=%40id&field=replicates.library.accession&field=replicates.library.documents&assay_title=total+RNA-seq", 
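    sep="\t",
    # Both fields hold comma-separated values; split them into lists.
    converters={
        "replicates.library.accession": lambda x: x.split(","),
        "replicates.library.documents": lambda x: x.split(","),
    },
    # The column header is on the second line of the report.
    skiprows=1)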

# Map each library document to the set of experiments that use it.
j_seen = {}
for _, row in j.iterrows():
    for rep in row["replicates.library.documents"]:
        j_seen.setdefault(rep, set()).add(row["ID"])
        
j_docs = describe_documents(j_seen)
j_docs

Unnamed: 0,id,used,filename,description
0,https://www.encodeproject.org/documents/334823...,5,Norgen_Animal-Tissue-RNA-Purification-Kit-Inse...,Norgen Animal Tissue RNA purification kit user...
1,https://www.encodeproject.org/documents/e90954...,5,ENCODE_Protocol_Spikeins_capping_v1.pdf,Protocol to add 5’ cap structures to exogenous...


In [12]:
j_docs["filename"].values

array(['Norgen_Animal-Tissue-RNA-Purification-Kit-Insert-PI25700-8-M14.pdf',
       'ENCODE_Protocol_Spikeins_capping_v1.pdf'], dtype=object)
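
To pick a particular document out of a listing like this, the records can be filtered on filename. Here is a sketch on plain dictionaries rather than a DataFrame. `find_documents` is a hypothetical helper, and the two records pair the document paths and full filenames shown in the outputs of this section.

```python
def find_documents(records, text):
    """Return the ids of records whose filename contains text, ignoring case."""
    needle = text.lower()
    return [r["id"] for r in records if needle in r["filename"].lower()]


records = [
    {
        "id": "/documents/33482357-bcb8-43d0-ac76-d58ff785b710/",
        "filename": "Norgen_Animal-Tissue-RNA-Purification-Kit-Insert-PI25700-8-M14.pdf",
    },
    {
        "id": "/documents/e909542d-44c0-4bee-9aac-4d41a0b768db/",
        "filename": "ENCODE_Protocol_Spikeins_capping_v1.pdf",
    },
]

matches = find_documents(records, "spikeins")
print(matches)
```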

In [13]:
j_seen

{'/documents/33482357-bcb8-43d0-ac76-d58ff785b710/': {'/experiments/ENCSR154RVC/',
  '/experiments/ENCSR266XAJ/',
  '/experiments/ENCSR588EJX/',
  '/experiments/ENCSR673UKZ/',
  '/experiments/ENCSR982PLD/'},
 '/documents/e909542d-44c0-4bee-9aac-4d41a0b768db/': {'/experiments/ENCSR154RVC/',
  '/experiments/ENCSR266XAJ/',
  '/experiments/ENCSR588EJX/',
  '/experiments/ENCSR673UKZ/',
  '/experiments/ENCSR982PLD/'}}
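
A mapping like `j_seen` makes it easy to check that a document is attached to every experiment in a group, which is a useful check before and after changing attachments. The sketch below assumes the mapping has the document-to-experiment-set shape shown above. `missing_experiments` is a hypothetical helper, `/documents/example/` is a placeholder, and one experiment is deliberately left out of the mapping so the check has something to report.

```python
def missing_experiments(seen, document_id, expected):
    """List the expected experiments that do not reference document_id."""
    attached = seen.get(document_id, set())
    return sorted(set(expected) - attached)


expected = [
    "/experiments/ENCSR154RVC/",
    "/experiments/ENCSR266XAJ/",
    "/experiments/ENCSR588EJX/",
]

# Illustrative mapping with a placeholder document id; the third
# experiment is omitted on purpose.
seen = {
    "/documents/example/": {
        "/experiments/ENCSR154RVC/",
        "/experiments/ENCSR266XAJ/",
    },
}

missing = missing_experiments(seen, "/documents/example/", expected)
print(missing)
```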

In [15]:
fixes = {
    "mortizavi": "https://test.encodedcc.org/search/?type=Experiment&accession=ENCSR127HKN&accession=ENCSR271KEJ&accession=ENCSR172GXL&accession=ENCSR507JOF&accession=ENCSR731MFY",
    "wold": "https://test.encodedcc.org/search/?type=Experiment&accession=ENCSR266XAJ&accession=ENCSR588EJX&accession=ENCSR982PLD&accession=ENCSR673UKZ&accession=ENCSR154RVC",
    "hl60": "https://test.encodedcc.org/search/?type=Experiment&replicates.library.accession=ENCLB045HKM&replicates.library.accession=ENCLB463QEM&replicates.library.accession=ENCLB268CWN&replicates.library.accession=ENCLB909APR&replicates.library.accession=ENCLB529IWA&replicates.library.accession=ENCLB054OEH&replicates.library.accession=ENCLB559VQN&replicates.library.accession=ENCLB101LYE&replicates.library.accession=ENCLB455DNY&replicates.library.accession=ENCLB493RCQ&replicates.library.accession=ENCLB622SOZ&replicates.library.accession=ENCLB825EPI&replicates.library.accession=ENCLB611DPF&replicates.library.accession=ENCLB605OTC",
}


seen = {k: get_all_documents(server.get_json(fixes[k])) for k in fixes}
docs = {k: describe_documents(seen[k]) for k in seen}

for k in docs:
    print(k)
    print(docs[k])


mortizavi
                                                  id  used  \
0  https://www.encodeproject.org/documents/81af56...     5   
1  https://www.encodeproject.org/documents/e90954...     5   
2  https://www.encodeproject.org/documents/35c950...     5   

                                            filename  \
0  ENCODE Long Read RNA-Seq Analysis Pipeline v3 ...   
1            ENCODE_Protocol_Spikeins_capping_v1.pdf   
2  Long read cDNA prep with Maxima H no exo SIRV4...   

                                         description  
0  This document describes 1) the steps used to g...  
1  Protocol to add 5’ cap structures to exogenous...  
2     Library protocol for LRGASP long read RNA-seq.  
wold
                                                  id  used  \
0  https://www.encodeproject.org/documents/334823...     5   
1  https://www.encodeproject.org/documents/e90954...     5   
2  https://www.encodeproject.org/documents/35c950...     5   
3  https://www.encodeproject.org/documents/

In [17]:
docs["mortizavi"]

Unnamed: 0,id,used,filename,description
0,https://www.encodeproject.org/documents/81af56...,5,ENCODE Long Read RNA-Seq Analysis Pipeline v3 ...,This document describes 1) the steps used to g...
1,https://www.encodeproject.org/documents/e90954...,5,ENCODE_Protocol_Spikeins_capping_v1.pdf,Protocol to add 5’ cap structures to exogenous...
2,https://www.encodeproject.org/documents/35c950...,5,Long read cDNA prep with Maxima H no exo SIRV4...,Library protocol for LRGASP long read RNA-seq.


In [18]:
docs["wold"]

Unnamed: 0,id,used,filename,description
0,https://www.encodeproject.org/documents/334823...,5,Norgen_Animal-Tissue-RNA-Purification-Kit-Inse...,Norgen Animal Tissue RNA purification kit user...
1,https://www.encodeproject.org/documents/e90954...,5,ENCODE_Protocol_Spikeins_capping_v1.pdf,Protocol to add 5’ cap structures to exogenous...
2,https://www.encodeproject.org/documents/35c950...,5,Long read cDNA prep with Maxima H no exo SIRV4...,Library protocol for LRGASP long read RNA-seq.
3,https://www.encodeproject.org/documents/6936da...,5,nextera-dna-flex-library-prep-reference-guide-...,Library protocol for LRGASP total RNA-seq.


In [19]:
docs["hl60"]

Unnamed: 0,id,used,filename,description
0,https://www.encodeproject.org/documents/6d583a...,7,ENCODE Long Read RNA-Seq Analysis Pipeline v3....,"""This document describes 1) the steps used to ..."
1,https://www.encodeproject.org/documents/77db75...,7,ENCODE_protocol_2020pacbio_Final.pdf,This protocol describes an optimized method fo...
2,https://www.encodeproject.org/documents/ae021f...,7,Macrophage_differentiation_and_M1_M2_activatio...,HL-60 M0/M1/M2 differentiation protocol
