# Thousands of seqspecs

- github https://github.com/detrout/igvf-seq-spec-demo/
- Presentation https://woldlab.caltech.edu/~diane/igvf-seq-spec-demo/submitting-seqspecs.slides.html#/
- Notebook https://woldlab.caltech.edu/~diane/igvf-seq-spec-demo/submitting-seqspecs.html#

# Outline

This is a simplified account of how I figured out how to submit seqspec files to the IGVF DACC.

- [Python environment setup](#Setup)
- [Jinja Templating](#Jinja)
- [Seqspec template function](#Template)
- [Find datasets missing seqspecs](#Find-datasets-missing-seqspecs)
- [Generating a seqspec](#Generating-a-seqspec)
- [Seqspec submission functions](#Seqspec-Submission-functions)
- [Create seqspec objects for remaining fastqs](#Create-seqspec-objects-for-remaining-fastqs)

# Why seqspec?

In ENCODE there were some experiments where the details of the barcode structure were never provided, so when the person who did the experiment moved on there was no way to reprocess those samples.

And thus seqspec.

However, no one wants to generate thousands of seqspecs by hand, so I started working on automating seqspec generation.

Seqspec needs a fairly significant amount of information, which can come from a local LIMS or spreadsheets. To make a more general example, I decided to try to generate seqspec files from information posted on the portal.

For me, this required that my measurement sets have the following properties.

- protocols
- platform id
- sequencing kit information
- submitted fastqs

With that information, plus

- a templating engine
- the IGVF portal API
- and a few hard coded variables

I was able to create valid seqspec files and attach them to the measurement set and to the fastqs associated with a single sequencing run.

One measurement set will have several fastqs attached to it, organized by Illumina read type and sequencing run. The number of different read types will vary depending on the assay.

<table>
    <thead>
        <tr><td>sequencing_run</td><td><b>R1</b></td><td><b>R2</b></td><td><b>I1</b></td><td><b>[R3/I2]...</b></td></tr>
    </thead>
    <tbody>
        <tr>
            <td><b>1</b></td>
            <td>run1/Sublibrary_7_S7_L001_R1_001.fastq.gz</td>
            <td>run1/Sublibrary_7_S7_L001_R2_001.fastq.gz</td>
            <td>run1/Sublibrary_7_S7_L001_I1_001.fastq.gz</td>
            <td>...</td>
        </tr>
        <tr>
            <td><b>2</b></td>
            <td>run1/Sublibrary_7_S7_L002_R1_001.fastq.gz</td>
            <td>run1/Sublibrary_7_S7_L002_R2_001.fastq.gz</td>
            <td>run1/Sublibrary_7_S7_L002_I1_001.fastq.gz</td>
            <td>...</td>
        </tr>
        <tr>
            <td><b>3</b></td>
            <td>run2/Sublibrary_7_S7_L002_R1_001.fastq.gz</td>
            <td>run2/Sublibrary_7_S7_L002_R2_001.fastq.gz</td>
            <td>run2/Sublibrary_7_S7_L002_I1_001.fastq.gz</td>
            <td>...</td>
        </tr>
    </tbody>
</table>

You'll need a seqspec for each group of files that should be processed together as a single set.

The rest of this notebook is about grouping those runs and generating a seqspec for each one.

# Setup

We will need to import a variety of standard Python components, boto3, and Jinja.

Navigate down for the detailed code blocks

In [1]:
import enum
import gzip
import hashlib
from io import StringIO, BytesIO
import json
from jsonschema import Draft4Validator
import logging
import numpy
import os
from pathlib import Path
import pandas
import requests
import sys
from urllib.parse import urlparse, urljoin
import yaml

In [2]:
try:
    import boto3
except ImportError:
    !{sys.executable} -m pip install --user boto3
    import boto3
    
from botocore.exceptions import ClientError
    

In [3]:
try:
    from jinja2 import Environment
except ImportError:
    !{sys.executable} -m pip install --user jinja2
    from jinja2 import Environment

from jinja2 import FileSystemLoader, select_autoescape, Undefined, StrictUndefined, make_logging_undefined

logger = logging.getLogger(__name__)
LoggingUndefined = make_logging_undefined(
    logger=logger,
    base=Undefined
)

env = Environment(
    loader=FileSystemLoader("templates"),
    autoescape=select_autoescape(),
    undefined=LoggingUndefined,
)

# Import seqspec validator

To make sure our seqspecs are valid before submitting, I imported the parts of seqspec needed to run the checks from code instead of using the command line interface.

I have the repository checked out into ~/proj/seqspec. This block should either import it for me, or install it if someone else runs it.

See below for details

In [4]:
try:
    import seqspec
except ImportError:
    # change ~/proj/seqspec to wherever you have seqspec cloned.
    seqspec_root = Path("~/proj/seqspec").expanduser()
    if seqspec_root.exists() and str(seqspec_root) not in sys.path:
        sys.path.append(str(seqspec_root))
    else:
        # Once seqspec is updated
        #!{sys.executable} -m pip install --user seqspec
        # Currently the IGVF pipelines need a development branch
        # On Linux systems with newer versions of Python you might need --break-system-packages, or to
        # use a virtualenv or conda. See https://peps.python.org/pep-0668/ for discussion.
        !{sys.executable} -m pip install --user git+https://github.com/pachterlab/seqspec.git@e1ae2eb7e3fe18a3e1b73c4f97dc803d68e76a69
    import seqspec

Import pieces of seqspec that we need for this notebook.

In [5]:
from seqspec.Assay import Assay
from seqspec.Region import Region
from seqspec.Region import Onlist
from seqspec.utils import load_spec_stream
from seqspec.seqspec_index import run_index, get_index
from seqspec.seqspec_print import run_print_sequence_spec, run_print_library_tree, run_print_library_png
from seqspec.seqspec_onlist import run_list_onlists, run_onlist_read, run_find_by_type

## Utilities for seqspec validation

Some helper functions because the seqspec library wasn't really written for this use case.

In [6]:
def seqspec_validate(schema, spec):
    """Validate a yaml object against a json schema
    """
    validator = Draft4Validator(schema)

    for idx, error in enumerate(validator.iter_errors(spec), 1):
        print(f"[{idx}] {error.message}")

def load_spec(filename):
    with open(filename, "rt") as instream:
        data = yaml.load(instream, Loader=yaml.Loader)
        for r in data.assay_spec:
            r.set_parent_id(None)
    return data

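# Locate the schema relative to the imported package; seqspec_root is only
# defined when the plain "import seqspec" above failed.
schema_path = Path(seqspec.__file__).parent / "schema" / "seqspec.schema.json"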
with open(schema_path, "rt") as instream:
    seqspec_schema = json.load(instream)        

# Import portal API

I have my own API client, <a href="https://github.com/detrout/encoded_client">encoded_client</a>, for interacting with the IGVF database server (which is very much like the old ENCODE database server). It is similar in purpose to <a href="https://igvf-utils.readthedocs.io/">igvf-utils</a>.

See below for details

In [7]:
try:
    from encoded_client import encoded
except ImportError:
    encoded_root = Path("~/proj/encoded_client").expanduser()
    if encoded_root.exists() and str(encoded_root) not in sys.path:
        sys.path.append(str(encoded_root))
    else:
        !{sys.executable} -m pip install --user encoded_client
        
    from encoded_client import encoded

from encoded_client.encoded import filter_aws_credentials
from encoded_client.submission import parse_s3_url

encoded_client will pull submitter credentials from either DCC_API_KEY and DCC_SECRET_KEY or from a .netrc file loaded from your home directory.

A .netrc file is a plain text file with records of the following format (replace the {DCC_API_KEY} and {DCC_SECRET_KEY} strings with your specific values):

<pre>machine api.sandbox.igvf.org login {DCC_API_KEY} password {DCC_SECRET_KEY}</pre>

Or use api.data.igvf.org as the machine name.

(it's also possible to list the fields on separate lines, but I think it's easier to read when they're on one line)

Alternatively, after creating the server object, set:

<pre>server.username = "{DCC_API_KEY}"
server.password = "{DCC_SECRET_KEY}"</pre>
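To confirm that a .netrc record parses the way you expect, the standard library `netrc` module can read the same format. This is only a sketch: the key and secret strings are placeholders, the file is written to a temporary directory rather than your home directory, and it says nothing about how encoded_client itself reads the file.

```python
import netrc
import tempfile
from pathlib import Path

# Placeholder credentials for illustration only; substitute your own values.
record = "machine api.sandbox.igvf.org login EXAMPLE_KEY password EXAMPLE_SECRET\n"

with tempfile.TemporaryDirectory() as tmpdir:
    netrc_path = Path(tmpdir) / "netrc"
    netrc_path.write_text(record)
    # Passing an explicit path avoids touching the real ~/.netrc.
    parsed = netrc.netrc(str(netrc_path))
    login, _account, password = parsed.authenticators("api.sandbox.igvf.org")

print(login, password)
```

If `authenticators()` returns `None` for your machine name, the record was not recognized for that host.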



# Variables

We need to specify which server we're using and our award and submitting lab ids.

In [8]:
#server_name = "api.data.igvf.org"
server_name = "api.sandbox.igvf.org"
award = "/awards/HG012077/"
lab = "/labs/ali-mortazavi/"

server = encoded.ENCODED(server_name)
igvf_validator = encoded.DCCValidator(server)

# check below for a dictionary to help cache portal objects

Simple object to cache object lookups

In [9]:
class CachedTerms:
    def __init__(self, server):
        self.server = server
        self._cache = {}
        
    def __getitem__(self, key):
        if key in self._cache:
            return self._cache[key]
        
        obj = self.server.get_json(key)
        if obj is not None:
            self._cache[key] = obj
            return obj
        
server_cache = CachedTerms(server)
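The effect of the cache can be checked without a network connection by substituting a stand-in server. `FakeServer` and the `/platform-terms/EXAMPLE/` path below are hypothetical and only mimic the `get_json` call used above; the class is repeated so the sketch runs on its own.

```python
class CachedTerms:
    """Same lookup-and-remember logic as the class above."""
    def __init__(self, server):
        self.server = server
        self._cache = {}

    def __getitem__(self, key):
        if key in self._cache:
            return self._cache[key]
        obj = self.server.get_json(key)
        if obj is not None:
            self._cache[key] = obj
        return obj


class FakeServer:
    """Hypothetical stand-in that counts how often it is queried."""
    def __init__(self):
        self.calls = 0

    def get_json(self, key):
        self.calls += 1
        return {"@id": key, "term_name": "example platform"}


fake = FakeServer()
cache = CachedTerms(fake)
first = cache["/platform-terms/EXAMPLE/"]
second = cache["/platform-terms/EXAMPLE/"]  # served from the cache
print(fake.calls)
```

Only the first lookup reaches the server; the second returns the stored object.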

# Jinja

[Jinja](https://jinja.palletsprojects.com/) is a fairly popular templating language for Python.

It supports conditionals, loops, variable substitutions, and even function calls. However, for seqspec we just need to be able to substitute in variables we collect from elsewhere.

Here's an example of one of the template read blocks showing some of the variables that will get replaced.

<pre>
- !Read
  read_id: {{ read1_accession }}.fastq.gz
  name: Read 1 fastq for {{ read1_accession }}
  modality: rna
  primer_id: truseq_read1
  min_len: {{ read1_min_length }}
  max_len: {{ read1_max_length }}
  strand: pos
</pre>
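As a rough illustration of the substitution step, the read block above can be rendered from an in-memory string. This sketch makes a few assumptions that differ from the notebook: it uses `StrictUndefined` (so a missing variable raises instead of being logged), it loads the template with `from_string` rather than from the `templates` directory, and the accession and read lengths are made-up values.

```python
from jinja2 import Environment, StrictUndefined
from jinja2.exceptions import UndefinedError

READ_TEMPLATE = """\
- !Read
  read_id: {{ read1_accession }}.fastq.gz
  name: Read 1 fastq for {{ read1_accession }}
  modality: rna
  primer_id: truseq_read1
  min_len: {{ read1_min_length }}
  max_len: {{ read1_max_length }}
  strand: pos
"""

# A separate environment so the notebook's own "env" is left untouched.
sketch_env = Environment(undefined=StrictUndefined)
template = sketch_env.from_string(READ_TEMPLATE)

# Made-up values standing in for what the portal would provide.
context = {
    "read1_accession": "EXAMPLE0001",
    "read1_min_length": 58,
    "read1_max_length": 58,
}
rendered = template.render(context)
print(rendered)

# With StrictUndefined, leaving out a variable is an error, not a blank.
try:
    template.render({"read1_accession": "EXAMPLE0001"})
    missing_raised = False
except UndefinedError:
    missing_raised = True
```

Failing loudly on a missing variable is a design choice; the logging variant used in this notebook renders the template anyway and records the problem in the log.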

# Filter ProtocolsIO urls

The ProtocolsIO URLs are a bit long, so add a mapping to a short name.

In [10]:
class ProtocolsIO(enum.StrEnum):
    splitseq_100k_v2 = "https://www.protocols.io/view/evercode-wt-v2-2-1-eq2lyj9relx9/v1"
    splitseq_1M_v2 = "https://www.protocols.io/view/evercode-wt-mega-v2-2-1-8epv5xxrng1b/v1"
    ont_library_prep = "https://www.protocols.io/view/ont-library-prep-for-split-seq-cdna-eq2lyj1xmlx9/v1"
    splitseq_single_index = "https://www.protocols.io/view/evercode-single-index-pcr-5jyl82k9rl2w/v1"
    splitseq_dual_index = "https://www.protocols.io/view/evercode-dual-index-pcr-yxmvmeqe5g3p/v1"
    

def get_protocol_used_for_index(protocols):
    # Python 3.12 enums support "in" directly
    known = [ProtocolsIO(p) for p in protocols if p in ProtocolsIO.__members__.values()]

    if len(known) == 0:
        # report the original list; the filtered one is empty at this point
        raise ValueError(f"Unable to find information by {protocols}")

    return known
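A version-independent way to do the same filtering is to attempt the conversion and skip values that fail. The sketch below is an illustration only: `ExampleProtocols` copies two URLs from the mapping above and uses a `str` mixin in place of `enum.StrEnum` so it also runs on Python versions before 3.11, and both `known_protocols` and the example.org URL are hypothetical.

```python
import enum


class ExampleProtocols(str, enum.Enum):
    # Two of the URLs from the mapping above.
    splitseq_100k_v2 = "https://www.protocols.io/view/evercode-wt-v2-2-1-eq2lyj9relx9/v1"
    splitseq_single_index = "https://www.protocols.io/view/evercode-single-index-pcr-5jyl82k9rl2w/v1"


def known_protocols(urls):
    """Return the enum members for recognized URLs, skipping the rest."""
    found = []
    for url in urls:
        try:
            found.append(ExampleProtocols(url))
        except ValueError:
            # Not one of the protocols this sketch knows about.
            continue
    return found


urls = [
    "https://www.protocols.io/view/evercode-wt-v2-2-1-eq2lyj9relx9/v1",
    "https://example.org/some-other-protocol",  # hypothetical, unrecognized
]
matched = known_protocols(urls)
print(matched)
```

Because the members are also strings, the later `ProtocolsIO.x in protocols` style checks work against either the raw URL list or the filtered list.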

# Template variables

First build up the lists of barcode onlists needed for this protocol; their names will be passed to the template. Combinatorial barcoding schemes like Parse or SHARE-seq need to specify several barcode files.

In [11]:
def get_barcodes_from_protocols(protocols):
    if ProtocolsIO.splitseq_100k_v2 in protocols:
        return {
            # onlist1_n96_v4
            "barcode_1_url": "https://data.igvf.org/tabular-files/IGVFFI0924TKJO/@@download/IGVFFI0924TKJO.txt.gz",
            "barcode_1_location": "remote",
            "barcode_1_md5": "6d5016e63f121b6a64fb3907dd83f358",
            "barcode_2_url": "https://data.igvf.org/tabular-files/IGVFFI1138MCVX/@@download/IGVFFI1138MCVX.txt.gz",
            "barcode_2_location": "remote",
            "barcode_2_md5": "1452e8ef104e6edf686fab8956172072",
            "barcode_3_url": "https://data.igvf.org/tabular-files/IGVFFI1138MCVX/@@download/IGVFFI1138MCVX.txt.gz",
            "barcode_3_location": "remote",
            "barcode_3_md5": "1452e8ef104e6edf686fab8956172072",
        }
    elif ProtocolsIO.splitseq_1M_v2 in protocols:
        return {
            # need to add barcode urls
            #onlist1_n192_v4
            "barcode_1_location": "remote",        
            "barcode_1_url": "https://data.igvf.org/tabular-files/IGVFFI2591OFQO/@@download/IGVFFI2591OFQO.txt.gz",
            "barcode_1_md5": "5c3b70034e9cef5de735dc9d4f3fdbde",
            "barcode_2_location": "remote",        
            "barcode_2_url": "https://data.igvf.org/tabular-files/IGVFFI1138MCVX/@@download/IGVFFI1138MCVX.txt.gz",
            "barcode_2_md5": "1452e8ef104e6edf686fab8956172072",
            "barcode_3_location": "remote",
            "barcode_3_url": "https://data.igvf.org/tabular-files/IGVFFI1138MCVX/@@download/IGVFFI1138MCVX.txt.gz",
            "barcode_3_md5": "1452e8ef104e6edf686fab8956172072",
        }
    else:
        raise ValueError("Unrecognized barcode protocol")

Information about library kits, organized by protocol and, for us, by single versus dual Illumina index.

In [12]:
def get_library_kit_from_protocols(protocols):
    if ProtocolsIO.ont_library_prep in protocols:
        return {
            "library_protocol": "Any",
            "library_kit": "cDNA Exome Capture v1.0.1",
            "sequence_kit": "ONT Ligation Sequencing Kit V14",
        }

    if ProtocolsIO.splitseq_100k_v2 in protocols:
        parse_kit_name = "Evercode WT v2.0.1"
    elif ProtocolsIO.splitseq_1M_v2 in protocols:
        parse_kit_name = "Evercode WT Mega v2.0.1"
    else:
        raise ValueError("Missing splitseq protocol")
    if ProtocolsIO.splitseq_single_index in protocols:
        illumina_kit_name = "single index"
    elif ProtocolsIO.splitseq_dual_index in protocols:
        illumina_kit_name = "dual index"
    else:
        raise ValueError("Missing illumina protocol")
    context = {
        "library_protocol": "Any",
        "library_kit": f"{parse_kit_name} {illumina_kit_name}"
    }
    return context

Use the protocols to select which seqspec template to use.

In [13]:
def get_templates_by_protocols(protocols):
    if ProtocolsIO.splitseq_single_index in protocols:
        return "parse-wt-mega-v2-single-index-libspec-1.yaml.j2"
    elif ProtocolsIO.splitseq_dual_index in protocols:
        return "parse-wt-mega-v2-dual-index-libspec-1.yaml.j2"
    elif ProtocolsIO.ont_library_prep in protocols:
        return "parse-wt-mega-v2-nanopore.yaml.j2"
    else:
        raise ValueError("Unrecognized protocols {}".format(protocols))
    

Search for measurement sets from our lab that are missing seqspecs.

In [14]:
query = f"/search/?type=MeasurementSet&lab.title=Ali+Mortazavi%2C+UCI"\
         "&audit.NOT_COMPLIANT.category=missing+sequence+specification"

graph = server.get_json(query, limit=2000)
to_process = [x["@id"] for x in graph["@graph"]]
# some test server datasets were set up in ways incompatible with this notebook.
if server.server == "api.sandbox.igvf.org":
    problems = set(["/measurement-sets/TSTDS36584014/", "/measurement-sets/TSTDS51545328/", "/measurement-sets/TSTDS95340953/", "/measurement-sets/TSTDS43282126/", "/measurement-sets/TSTDS76216718/"])
    to_process = [x for x in to_process if x not in problems]
print(len(to_process))
to_process

20


['/measurement-sets/TSTDS25106370/',
 '/measurement-sets/TSTDS32005063/',
 '/measurement-sets/TSTDS30294230/',
 '/measurement-sets/TSTDS95237342/',
 '/measurement-sets/TSTDS48877173/',
 '/measurement-sets/TSTDS69666634/',
 '/measurement-sets/TSTDS95760802/',
 '/measurement-sets/TSTDS07432728/',
 '/measurement-sets/TSTDS34582101/',
 '/measurement-sets/TSTDS72923185/',
 '/measurement-sets/TSTDS84503921/',
 '/measurement-sets/TSTDS12663199/',
 '/measurement-sets/TSTDS90515305/',
 '/measurement-sets/TSTDS70002954/',
 '/measurement-sets/TSTDS06772346/',
 '/measurement-sets/TSTDS02882566/',
 '/measurement-sets/TSTDS74497326/',
 '/measurement-sets/TSTDS10802686/',
 '/measurement-sets/TSTDS48294221/',
 '/measurement-sets/TSTDS09179588/']

Just as an aside, searches and end-points like https://api.sandbox.igvf.org/measurement-sets/ return a JSON-LD collection.

<pre>
{
  "@id": "/search/?type=Measuremen…le=Ali+Mortazavi%2C+UCI",
  "@graph": [ ... ],
  "@type": ["Search"],
  "total": 27
}
</pre>

For many searches the objects in the @graph may only be a subset of all the attributes, so I frequently do a search and then request the fully detailed object.

<code format="python">    for row in response["@graph"]:
       detail = server.get_json(row["@id"])
       ... do stuff
</code>

In [33]:
#measurement_test_id = "/measurement-sets/TSTDS32005063/"
measurement_test_id = "/measurement-sets/TSTDS12663199/"  #Nanopore

## Processing measurement sets

The measurement_set lists all the files attached to it in a single ["files"] list. However we need to process seqspecs by sequencing_run, so we'll need to find the fastqs and group them by sequencing run.

A feature I should still implement: if seqspec configuration files are already attached to the measurement set, the code should detect that and warn about it.

Build up a data structure of reads organized by sequencing run.

In [34]:
def get_sequence_files(measurement_set):
    paired_files = {}
    for file in measurement_set["files"]:        
        sequence = server.get_json(file["@id"])
        file_format = sequence["file_format"]
        if file_format in ("fastq",):
            sequence_run = sequence["sequencing_run"]
            illumina_read_type = sequence.get("illumina_read_type")
            paired_files.setdefault(sequence_run, {})[illumina_read_type] = sequence
        
    return paired_files

measurement = server.get_json(measurement_test_id)
protocols = measurement["protocols"]
run_files = get_sequence_files(measurement)
for run in sorted(run_files):
    for read in sorted(run_files[run]):
        print(run, read, run_files[run][read]["submitted_file_name"])

1 None igvf_b01/nanopore/igvfb01_13H_lig-ss_1.fastq.gz
2 None igvf_b01/nanopore/igvfb01_13H_lig-ss_2.fastq.gz
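The grouping itself can be tried offline on plain dictionaries. In this sketch `group_by_run` is a hypothetical variant of the function above that takes already-fetched records instead of calling the server, and the records are invented to have the same keys as the portal's file objects.

```python
def group_by_run(sequence_records):
    """Group fastq records by sequencing_run, then by illumina_read_type."""
    grouped = {}
    for record in sequence_records:
        if record.get("file_format") != "fastq":
            continue
        run = record["sequencing_run"]
        # Nanopore records have no illumina_read_type, so the key is None.
        read_type = record.get("illumina_read_type")
        grouped.setdefault(run, {})[read_type] = record
    return grouped


# Invented records for illustration; only the keys mirror the portal.
records = [
    {"@id": "/sequence-files/EXAMPLE1/", "file_format": "fastq",
     "sequencing_run": 1, "illumina_read_type": "R1"},
    {"@id": "/sequence-files/EXAMPLE2/", "file_format": "fastq",
     "sequencing_run": 1, "illumina_read_type": "R2"},
    {"@id": "/sequence-files/EXAMPLE3/", "file_format": "fastq",
     "sequencing_run": 2},  # no read type, as with the Nanopore files
    {"@id": "/configuration-files/EXAMPLE4/", "file_format": "yaml"},
]
grouped = group_by_run(records)
for run in sorted(grouped):
    print(run, list(grouped[run]))
```

Each value of `grouped` is one row of the table near the top of this notebook, which is the unit a single seqspec describes.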


# Generating a seqspec

Introducing functions to build the seqspec from information in the notebook and what's been posted to the portal.

I made sure to include the sequencing platform, sequencing_kit, and protocols in my measurement set objects so I could reuse them when generating seqspecs.

# Rendering logic

- First find protocol based variables and templates
- Then merge values from a dictionary indexed by illumina_read_type that represents one sequencing_run row generated by get_sequence_files()

<pre>
{"R1": {"@id": ... },
 "R2": {"@id": ....}}
</pre>

- render the template to a string
- load the string into the seqspec library
- call update_spec to fix the joined regions
- validate the seqspec
- if validation passes return the seqspec text using to_YAML()

Broken into smaller functions for discussion.

We need to define the variables that will be passed to the Jinja templating engine.

In [35]:
def generate_illumina_seqspec_context(protocols, run_files):
    platform = run_files["R1"]["sequencing_platform"]    
    context = {
        "read1_accession": run_files["R1"]["accession"],
        "read1_url": server.prepare_url(run_files["R1"]["href"]),
        "read1_min_length": run_files["R1"]["minimum_read_length"],
        "read1_max_length": run_files["R1"]["maximum_read_length"],

        "read2_accession": run_files["R2"]["accession"],
        "read2_url": server.prepare_url(run_files["R2"]["href"]),
        "read2_min_length": run_files["R2"]["minimum_read_length"],
        "read2_max_length": run_files["R2"]["maximum_read_length"],

        "sequence_kit": run_files["R1"]["sequencing_kit"],
        "sequence_protocol": server_cache[platform]["term_name"],
    }

    context.update(get_barcodes_from_protocols(protocols))
    context.update(get_library_kit_from_protocols(protocols))
    
    return context

In [36]:
def generate_nanopore_seqspec_context(protocols, run_files):
    platform = run_files[None]["sequencing_platform"]    
    context = {
        "read1_accession": run_files[None]["accession"],
        "read1_url": server.prepare_url(run_files[None]["href"]),
        "read1_min_length": run_files[None]["minimum_read_length"],
        "read1_max_length": run_files[None]["maximum_read_length"],

        "sequence_kit": run_files[None]["sequencing_kit"],
        "sequence_protocol": server_cache[platform]["term_name"],
    }

    context.update(get_barcodes_from_protocols(protocols))
    context.update(get_library_kit_from_protocols(protocols))
    
    return context

In [37]:
def generate_seqspec_context(protocols, run_files):
    if ProtocolsIO.splitseq_single_index in protocols:
        return generate_illumina_seqspec_context(protocols, run_files)
    elif ProtocolsIO.splitseq_dual_index in protocols:
        return generate_illumina_seqspec_context(protocols, run_files)
    elif ProtocolsIO.ont_library_prep in protocols:
        return generate_nanopore_seqspec_context(protocols, run_files)
    else:
        raise ValueError("Unknown protocol {}".format(protocols))

With a filled-in context dictionary we can look up our template, render it, and test that it's valid.

In [38]:
def generate_seqspec_for_run(protocols, run_files):
    protocol = get_protocol_used_for_index(protocols)
    template_name = get_templates_by_protocols(protocols)

    context = generate_seqspec_context(protocol, run_files)
    template = env.get_template(template_name)
    example_yaml = template.render(context)
    
    # validate the generated seqspec file.
    example_spec = load_spec_stream(StringIO(example_yaml))
    example_spec.update_spec()
    seqspec_validate(seqspec_schema, example_spec.to_dict())

    return example_spec.to_YAML()

And an example of generate_seqspec_for_run being called

In [39]:
print(generate_seqspec_for_run(protocols, run_files[1]))



[1] 36090 is greater than the maximum of 2048
[2] 35871 is greater than the maximum of 2048
!Assay
seqspec_version: 0.1.1
assay_id: Evercode-WT-mega-nanpore-v2
doi: https://docs.google.com/presentation/d/17yKh6xE5b9Mo4DaXx5uPvFZOIHOW0kbK-QsU2ECwx8c/edit#slide=id.g29abb1440dc_0_500
date: 13 August 2024
name: Parse Evercode mega WT v2 using nanopore
description: split-pool ligation-based ONT transcriptome sequencing
modalities:
- rna
lib_struct: ''
library_protocol: Any
library_kit: cDNA Exome Capture v1.0.1
sequence_protocol: ONT GridION X5
sequence_kit: ONT Ligation Sequencing Kit V14
sequence_spec:
- !Read
  read_id: TSTFI39351339.fastq.gz
  name: Fastq for TSTFI39351339
  modality: rna
  primer_id: ont-top
  min_len: 11
  max_len: 36025
  strand: pos
library_spec:
- !Region
  region_id: rna
  region_type: rna
  name: rna
  sequence_type: joined
  sequence: TTTTTTTTCCTGTACTTCGTTCAGTTACGTATTGCTAAGCAGTGGTATCAACGCAGAGTGAATGGGXXXXXXXXXXXXXXXXXNNNNNNNNGTGGCCGATGTTTCGCATCGGCGTACGACTNNNNNNNN

# Test seqspec

Check that the barcode locations are correct for your tool.

```
seqspec index -m rna -r TSTFI09417350.fastq.gz,TSTFI18957175.fastq.gz -t kb TSTDS32005063.yaml
1,10,18,1,48,56,1,78,86:1,0,10:0,0,140
```
Check that the generated barcode list is correct.

```
seqspec onlist -m rna -f multi -r barcode -s region-type TSTDS32005063.yaml
/woldlab/loxcyc/home/diane/proj/igvf-seq-spec-demo/onlist_joined.txt
diane@trog:~/proj/seqspec$ head onlist_joined.txt 
AACGTGAT AACGTGAT CATTCCTA
AAACATCG AAACATCG CTTCATCA
ATGCCTAA ATGCCTAA CCTATATC
```


# Seqspec Submission functions

Post a seqspec object to the portal and capture the upload credentials from the response.

In [23]:
def create_seqspec_metadata_object(seqspec_metadata):
    """Post a seq spec metadata object to the portal
    """
    response = server.post_json("configuration_file", seqspec_metadata)
    if response["status"] == "success":
        graph = response["@graph"]
        if len(graph) != 1:
            print("Strange number of result objects {}".format(len(graph)))        

        print("Upload of {} succeeded". format(graph[0]["@id"]))
        seqspec_metadata.update({
            "@id": graph[0]["@id"],
            "accession": graph[0]["accession"],
            "uuid": graph[0]["uuid"],
        })
    else:
        print(filter_aws_credentials(response))
        raise RuntimeError("Unable to create metadata object")
        
    return graph[0].get("upload_credentials")

Refactor credential refresh logic for shorter upload function

In [24]:
def refresh_credentials(credentials, seqspec_metadata):
    if credentials is None:
        print("retrieving new credentials")
        response = server.post_json("{}/@@upload".format(seqspec_metadata["@id"]), {})
        print(filter_aws_credentials(response))
        if response["status"] == "success":
            assert len(response["@graph"]) == 1, "upload_seqspec_file Unexpected graph length {}".format(len(response["@graph"]))
            graph = response["@graph"]
            credentials = graph[0]["upload_credentials"]
            return credentials
        else:
            raise ValueError("Unable to get credentials for {}".format(seqspec_metadata["@id"]))
    else:
        return credentials

Using the credentials, the seqspec_metadata, and the seqspec text, post the seqspec to the portal. 

In [25]:
def upload_seqspec_file(credentials, seqspec_metadata, seqspec_contents):        
    """upload the seqspec contents as a file to s3"""
    if not isinstance(seqspec_contents, bytes):
        raise ValueError("seqspec_contents needs to be a gzipped byte array")

    credentials = refresh_credentials(credentials, seqspec_metadata)

    s3_client = boto3.client(
        's3', 
        aws_access_key_id=credentials["access_key"], 
        aws_secret_access_key=credentials["secret_key"], 
        aws_session_token=credentials["session_token"])

    bucket, target = parse_s3_url(credentials["upload_url"])
    s3_client.upload_fileobj(
        BytesIO(seqspec_contents),
        bucket,
        target)
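For readers without encoded_client at hand, splitting an `s3://` URL into the bucket and key that `upload_fileobj` expects can be sketched with the standard library. `split_s3_url` and the example URL are hypothetical; this is not the implementation of `parse_s3_url`, only an illustration of the kind of split it is used for here.

```python
from urllib.parse import urlparse


def split_s3_url(url):
    """Split s3://bucket/key into (bucket, key). Hypothetical helper."""
    parsed = urlparse(url)
    if parsed.scheme != "s3":
        raise ValueError(f"Expected an s3:// URL, got {url!r}")
    # The bucket is the host part; the key is the path without the leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_url("s3://example-bucket/2024/08/13/example.yaml.gz")
print(bucket, key)
```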

A really irritating thing about gzip is that it includes the current time in the compression header, which means that by default each new compression run will produce a different md5sum.

But I want to use the md5sum to see if I've already posted the file...

The Python gzip utilities let you set the time, so you can force a reproducible one. This code sets the gzip creation time to 1970 Jan 1 00:00 UTC (UNIX time 0).

`gzip.compress(buffer, mtime=0)`

Or for the gzip shell command

`gzip -n/--no-name filename`
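A small check of that behavior, using only the standard library. The text being compressed is a made-up one-line stand-in for a seqspec file.

```python
import gzip
import hashlib

# Made-up stand-in for the rendered seqspec text.
text = "seqspec_version: 0.1.1\n".encode("utf-8")

# With a fixed mtime the compressed bytes, and so the md5sum, repeat.
first = gzip.compress(text, mtime=0)
second = gzip.compress(text, mtime=0)
md5_first = hashlib.md5(first).hexdigest()
md5_second = hashlib.md5(second).hexdigest()

# Different timestamps in the header give different compressed bytes,
# even though the payload is identical.
early = gzip.compress(text, mtime=1)
late = gzip.compress(text, mtime=2)

print(md5_first == md5_second, early == late)
```

Both variants decompress to the same text; only the header differs.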

Construct seqspec object, see if it's already posted (by md5sum), and if not create the object and post the seqspec.

- build list of file_ids this seqspec is for
- compress file
- calculate md5 of gzipped file
- create seqspec_object
- get_or_update_seqspec
  - search for the seqspec object; if it is found, do the following.
  - upload the gzipped seqspec contents if they are missing
  - update seqspec_object with @id, accession, uuid
- if md5 was not found, 
  - create the metadata object
  - upload compressed contents
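The decision flow in the list above can be sketched against an in-memory dictionary. Everything here is a simplification under stated assumptions: `register` is a hypothetical helper, the `portal` dictionary stands in for the portal's md5 lookup, and no metadata object or S3 upload is involved.

```python
import gzip
import hashlib


def register(portal, seqspec_text, dry_run=True):
    """Return (action, md5) for one seqspec; portal is a dict keyed by md5."""
    # Reproducible compression so the md5 identifies the content.
    payload = gzip.compress(seqspec_text.encode("utf-8"), mtime=0)
    md5 = hashlib.md5(payload).hexdigest()
    if md5 in portal:
        return "found", md5
    if dry_run:
        return "would create", md5
    portal[md5] = payload
    return "created", md5


portal = {}  # hypothetical stand-in for the portal's md5 lookup
text = "seqspec_version: 0.1.1\n"
actions = [
    register(portal, text, dry_run=True)[0],
    register(portal, text, dry_run=False)[0],
    register(portal, text, dry_run=False)[0],
]
print(actions)  # prints ['would create', 'created', 'found']
```

Running the same text twice without `dry_run` creates the entry once and finds it the second time, which is the property the md5 lookup is there to provide.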

In [26]:
def get_seqspec_of(sequencing_run_files):
    # Collect the @ids of the fastqs in one sequencing run; these are the
    # files the seqspec configuration file will be attached to.
    seqspec_of = []
    for read in sequencing_run_files:
        seqspec_of.append(sequencing_run_files[read]["@id"])

    return seqspec_of

In [27]:
def create_seqspec_configuration_metadata(file_set, md5, seqspec_of):
    # Construct the configuration_file object for the DACC portal
    seqspec_metadata = {
        "award": award,
        "lab": lab,
        "md5sum": md5.hexdigest(),
        "file_format": "yaml",
        "file_set": file_set,
        "content_type": "seqspec",
        "seqspec_of": seqspec_of,
    }
    # Make sure the configuration_file passes the DACCs schema
    igvf_validator.validate(seqspec_metadata, "configuration_file")
    return seqspec_metadata

In [28]:
def get_or_update_seqspec(seqspec_metadata, seqspec_gzip, md5, dry_run=True):
    response = server.get_json("md5:{}".format(seqspec_metadata["md5sum"]))
    accession = response["accession"]
    uuid = response["uuid"]
    print("found object {} by {}. {}".format(
        response["@id"], seqspec_metadata["md5sum"], response["status"]))

    for k in seqspec_metadata:
        if seqspec_metadata[k] != response.get(k):
            print("{} {} differ".format(seqspec_metadata[k], response.get(k)))

    seqspec_metadata.update({
        "@id": response["@id"],
    })

    if response["status"] == "in progress":
        if dry_run:
            print("Would upload file contents {}".format(md5.hexdigest()))
        else:
            upload_seqspec_file(None, seqspec_metadata, seqspec_gzip)

    seqspec_metadata["accession"] = accession
    seqspec_metadata["uuid"] = uuid
    return seqspec_metadata

In [29]:
def create_and_upload_seqspec(seqspec_metadata, seqspec_data, md5, dry_run=True):
    if dry_run:
        print("Would create and upload {} for {}".format(
            md5.hexdigest(), seqspec_metadata["seqspec_of"]))
        for server_keys in ["@id", "accession", "uuid"]:
            seqspec_metadata[server_keys] = "would upload"
    else:
        credentials = create_seqspec_metadata_object(seqspec_metadata)
        upload_seqspec_file(credentials, seqspec_metadata, seqspec_data)  
    return seqspec_metadata

In [30]:
def register_seqspec(file_set, seqspec, sequencing_run_files, dry_run=True):
    """Create the seqspec objects and attach them to the fastqs
    
    Parameters:
    - file_set: id the seqspec should be attached to
    - seqspec: a formatted seqspec file
    - sequencing_run_files: set of file objects associated with one sequencing run
    - dry_run: flag for if we should actually post data
    """
    seqspec_of = get_seqspec_of(sequencing_run_files)
    # reproducibly compress seqspec text
    seqspec_gzip = gzip.compress(seqspec.encode("utf-8"), mtime=0)
    md5 = hashlib.md5(seqspec_gzip)
    seqspec_metadata = create_seqspec_configuration_metadata(
        file_set, md5, seqspec_of)
    # Search the portal for the md5sum of our seqspec file to see if it has already been posted
    try:
        seqspec_metadata = get_or_update_seqspec(
            seqspec_metadata, seqspec_gzip, md5, dry_run=dry_run)
    except encoded.HTTPError as err:
        # If the file has not been submitted yet, create it (or, in dry_run mode,
        # just report what would be created)
        if err.response.status_code == 404:
            seqspec_metadata = create_and_upload_seqspec(
                seqspec_metadata, seqspec_gzip, md5, dry_run=dry_run)
        else:
            print("Other HTTPError error {}".format(err.response.status_code))
    return seqspec_metadata

# Create seqspec objects for remaining fastqs

Now that we have a way to list all of the fastq sets, and a function to post everything to the portal, let's loop through all of our measurement_sets and post the seqspec configuration files for all the fastqs. (Change the to_process variable to be whatever you need updated.)

In [31]:
# note for the tests we limit this to just one measurement set
# instead of all pending measurement sets
#to_process = [measurement_test_id]

seen_md5s = set()
generated_specs = {}
submitted_log = []
for measurement_id in to_process:
    print("processing", measurement_id)
    measurement = server.get_json(measurement_id)
    protocols = measurement["protocols"]
    run_files = get_sequence_files(measurement)
    
    for run_number in sorted(run_files):
        seqspec = generate_seqspec_for_run(protocols, run_files[run_number])
        generated_specs.setdefault(measurement_id, {}).setdefault(run_number, []).append(seqspec)
        current_md5 = hashlib.md5(seqspec.encode("utf-8")).hexdigest()
        assert current_md5 not in seen_md5s, "we generated the same seqspec file somehow"
        seen_md5s.add(current_md5)
        submitted_log.append(
            register_seqspec(
                measurement_id, seqspec, run_files[run_number], dry_run=True))
    

processing /measurement-sets/TSTDS25106370/


Error http status: 404 for https://api.sandbox.igvf.org/md5:e25906467a62eed63095b9922460cf56


Would create and upload e25906467a62eed63095b9922460cf56 for ['/sequence-files/TSTFI69373920/', '/sequence-files/TSTFI06343326/']


Error http status: 404 for https://api.sandbox.igvf.org/md5:77ab9bb6579a4fd1c57dffea8982bd96


Would create and upload 77ab9bb6579a4fd1c57dffea8982bd96 for ['/sequence-files/TSTFI50888747/', '/sequence-files/TSTFI62937837/']
processing /measurement-sets/TSTDS32005063/
found object /configuration-files/TSTFI93269197/ by c3efb1fcbf386f5e85100d965eb2efc7. in progress
/awards/HG012077/ {'component': 'mapping', '@id': '/awards/HG012077/'} differ
/labs/ali-mortazavi/ {'@id': '/labs/ali-mortazavi/', 'title': 'Ali Mortazavi, UCI'} differ
Would upload file contents c3efb1fcbf386f5e85100d965eb2efc7
found object /configuration-files/TSTFI78410911/ by 8dd07dfbb88549d261cfe20e18c49d43. in progress
/awards/HG012077/ {'component': 'mapping', '@id': '/awards/HG012077/'} differ
/labs/ali-mortazavi/ {'@id': '/labs/ali-mortazavi/', 'title': 'Ali Mortazavi, UCI'} differ
Would upload file contents 8dd07dfbb88549d261cfe20e18c49d43
processing /measurement-sets/TSTDS30294230/


Error http status: 404 for https://api.sandbox.igvf.org/md5:7d1df0bebee74103926e378ec5f0db41


Would create and upload 7d1df0bebee74103926e378ec5f0db41 for ['/sequence-files/TSTFI13739934/', '/sequence-files/TSTFI64867362/']


Error http status: 404 for https://api.sandbox.igvf.org/md5:99e1100be873c64a5d97cd74cc171890


Would create and upload 99e1100be873c64a5d97cd74cc171890 for ['/sequence-files/TSTFI14413084/', '/sequence-files/TSTFI81570304/']
processing /measurement-sets/TSTDS95237342/
found object /configuration-files/TSTFI72232735/ by 2cbdb926ccc80cb5231ceeafe251f164. in progress
/awards/HG012077/ {'component': 'mapping', '@id': '/awards/HG012077/'} differ
/labs/ali-mortazavi/ {'@id': '/labs/ali-mortazavi/', 'title': 'Ali Mortazavi, UCI'} differ
Would upload file contents 2cbdb926ccc80cb5231ceeafe251f164
found object /sequence-files/TSTFI93762898/ by d9001d493d57f2227863996fce6319d4. in progress
/awards/HG012077/ {'component': 'mapping', '@id': '/awards/HG012077/'} differ
/labs/ali-mortazavi/ {'@id': '/labs/ali-mortazavi/', 'title': 'Ali Mortazavi, UCI'} differ
yaml fastq differ
seqspec reads differ
['/sequence-files/TSTFI85205262/', '/sequence-files/TSTFI60184104/'] None differ
Would upload file contents d9001d493d57f2227863996fce6319d4
processing /measurement-sets/TSTDS48877173/


Error http status: 404 for https://api.sandbox.igvf.org/md5:016373d235aca782f436f48a4d9562fe


Would create and upload 016373d235aca782f436f48a4d9562fe for ['/sequence-files/TSTFI31335076/', '/sequence-files/TSTFI43642657/']


Error http status: 404 for https://api.sandbox.igvf.org/md5:1c123dd1ee5788454596f79ba463e388


Would create and upload 1c123dd1ee5788454596f79ba463e388 for ['/sequence-files/TSTFI67467743/', '/sequence-files/TSTFI65524681/']


Error http status: 404 for https://api.sandbox.igvf.org/md5:75e0191557eb0e1e57ceef580e7d33ea


Would create and upload 75e0191557eb0e1e57ceef580e7d33ea for ['/sequence-files/TSTFI28262146/', '/sequence-files/TSTFI56488795/']
processing /measurement-sets/TSTDS69666634/


Error http status: 404 for https://api.sandbox.igvf.org/md5:322ab2a0ed638b6fb1d385d8cf1138b6


Would create and upload 322ab2a0ed638b6fb1d385d8cf1138b6 for ['/sequence-files/TSTFI67193680/', '/sequence-files/TSTFI02934563/']


Error http status: 404 for https://api.sandbox.igvf.org/md5:4c5b74a2e2564f3d9883bb206d4f7a0a


Would create and upload 4c5b74a2e2564f3d9883bb206d4f7a0a for ['/sequence-files/TSTFI85755043/', '/sequence-files/TSTFI64550424/']
processing /measurement-sets/TSTDS95760802/


Error http status: 404 for https://api.sandbox.igvf.org/md5:25b76ac9e67b8ff459c68fd935fab19f


Would create and upload 25b76ac9e67b8ff459c68fd935fab19f for ['/sequence-files/TSTFI12304250/', '/sequence-files/TSTFI73994529/']


Error http status: 404 for https://api.sandbox.igvf.org/md5:94879770ab818902c10db9a02cefcd90


Would create and upload 94879770ab818902c10db9a02cefcd90 for ['/sequence-files/TSTFI97464264/', '/sequence-files/TSTFI24604241/']
processing /measurement-sets/TSTDS07432728/


Error http status: 404 for https://api.sandbox.igvf.org/md5:b3c623e995449f63027d14ec27f8bfea


Would create and upload b3c623e995449f63027d14ec27f8bfea for ['/sequence-files/TSTFI02165763/', '/sequence-files/TSTFI20418101/']


Error http status: 404 for https://api.sandbox.igvf.org/md5:ac13dcb366e0b4d2b31ac1bc321bb4d5


Would create and upload ac13dcb366e0b4d2b31ac1bc321bb4d5 for ['/sequence-files/TSTFI91347763/', '/sequence-files/TSTFI55411036/']
processing /measurement-sets/TSTDS34582101/


Error http status: 404 for https://api.sandbox.igvf.org/md5:fb93c450b4f9546fde70f55c52ce9011


Would create and upload fb93c450b4f9546fde70f55c52ce9011 for ['/sequence-files/TSTFI61612395/', '/sequence-files/TSTFI25832476/']


Error http status: 404 for https://api.sandbox.igvf.org/md5:d20e418fa9daf45635898d9e34206936


Would create and upload d20e418fa9daf45635898d9e34206936 for ['/sequence-files/TSTFI85921201/', '/sequence-files/TSTFI76281026/']
processing /measurement-sets/TSTDS72923185/


Error http status: 404 for https://api.sandbox.igvf.org/md5:40694f05d6019b9951e09fddf127dd55


Would create and upload 40694f05d6019b9951e09fddf127dd55 for ['/sequence-files/TSTFI29302178/', '/sequence-files/TSTFI86035976/']


Error http status: 404 for https://api.sandbox.igvf.org/md5:0b82abe99eadc299db87027cbe9e6513


Would create and upload 0b82abe99eadc299db87027cbe9e6513 for ['/sequence-files/TSTFI13094701/', '/sequence-files/TSTFI31391738/']
processing /measurement-sets/TSTDS84503921/


Error http status: 404 for https://api.sandbox.igvf.org/md5:293db154dbdfef08cf5514877dcd7893


Would create and upload 293db154dbdfef08cf5514877dcd7893 for ['/sequence-files/TSTFI79419353/', '/sequence-files/TSTFI77049227/']


Error http status: 404 for https://api.sandbox.igvf.org/md5:ee3e30844f36cc0b691e22cae88d1a3f


Would create and upload ee3e30844f36cc0b691e22cae88d1a3f for ['/sequence-files/TSTFI87250677/', '/sequence-files/TSTFI74696208/']
processing /measurement-sets/TSTDS12663199/


Error http status: 404 for https://api.sandbox.igvf.org/md5:c5f9f70b7e1fb0dfcfb705fde9da03e6


[1] 36090 is greater than the maximum of 2048
[2] 35871 is greater than the maximum of 2048
Would create and upload c5f9f70b7e1fb0dfcfb705fde9da03e6 for ['/sequence-files/TSTFI39351339/']


Error http status: 404 for https://api.sandbox.igvf.org/md5:2160d20322e28ade237fc74169134d5c


[1] 68623 is greater than the maximum of 2048
[2] 68404 is greater than the maximum of 2048
Would create and upload 2160d20322e28ade237fc74169134d5c for ['/sequence-files/TSTFI83879286/']
processing /measurement-sets/TSTDS90515305/


Error http status: 404 for https://api.sandbox.igvf.org/md5:5a1241bb33ee5b5b6422ba569536448c


Would create and upload 5a1241bb33ee5b5b6422ba569536448c for ['/sequence-files/TSTFI50727311/', '/sequence-files/TSTFI51428917/']


Error http status: 404 for https://api.sandbox.igvf.org/md5:0f897e41e4b5f8b6012fd75b59688bf5


Would create and upload 0f897e41e4b5f8b6012fd75b59688bf5 for ['/sequence-files/TSTFI77839459/', '/sequence-files/TSTFI05454359/']
processing /measurement-sets/TSTDS70002954/


Error http status: 404 for https://api.sandbox.igvf.org/md5:b760b935e4d7f4973e9e2cb5c9876208


Would create and upload b760b935e4d7f4973e9e2cb5c9876208 for ['/sequence-files/TSTFI50590157/', '/sequence-files/TSTFI86167097/']


Error http status: 404 for https://api.sandbox.igvf.org/md5:8b7d0fe07321a42dbf8485451d329854


Would create and upload 8b7d0fe07321a42dbf8485451d329854 for ['/sequence-files/TSTFI95797070/', '/sequence-files/TSTFI88258911/']
processing /measurement-sets/TSTDS06772346/


Error http status: 404 for https://api.sandbox.igvf.org/md5:95592278bb9ba44a27ebdc103b7f8fc4


Would create and upload 95592278bb9ba44a27ebdc103b7f8fc4 for ['/sequence-files/TSTFI98557046/', '/sequence-files/TSTFI65221755/']


Error http status: 404 for https://api.sandbox.igvf.org/md5:0bcf50d43f6b1021f3e4a873cd66b7a9


Would create and upload 0bcf50d43f6b1021f3e4a873cd66b7a9 for ['/sequence-files/TSTFI36328700/', '/sequence-files/TSTFI97039425/']


Error http status: 404 for https://api.sandbox.igvf.org/md5:c476b16187948016697f1af3a3809fdd


Would create and upload c476b16187948016697f1af3a3809fdd for ['/sequence-files/TSTFI22707531/', '/sequence-files/TSTFI80176314/']
processing /measurement-sets/TSTDS02882566/


Error http status: 404 for https://api.sandbox.igvf.org/md5:6569a02de9a22af9e25365c7fb8867ec


Would create and upload 6569a02de9a22af9e25365c7fb8867ec for ['/sequence-files/TSTFI08350418/', '/sequence-files/TSTFI98735915/']


Error http status: 404 for https://api.sandbox.igvf.org/md5:c8c2279b8c1c1521dba0956653f91aae


Would create and upload c8c2279b8c1c1521dba0956653f91aae for ['/sequence-files/TSTFI97424054/', '/sequence-files/TSTFI66828776/']
processing /measurement-sets/TSTDS74497326/


Error http status: 404 for https://api.sandbox.igvf.org/md5:ff5b82f45c1529cd6ef4b915e5c38fbe


Would create and upload ff5b82f45c1529cd6ef4b915e5c38fbe for ['/sequence-files/TSTFI98410226/', '/sequence-files/TSTFI19155564/']


Error http status: 404 for https://api.sandbox.igvf.org/md5:4600d21c2e151c9dec94c214138a2a6a


Would create and upload 4600d21c2e151c9dec94c214138a2a6a for ['/sequence-files/TSTFI71787912/', '/sequence-files/TSTFI20299554/']
processing /measurement-sets/TSTDS10802686/


Error http status: 404 for https://api.sandbox.igvf.org/md5:8fb152e8661bb52cfa1ac6cd775bd482


Would create and upload 8fb152e8661bb52cfa1ac6cd775bd482 for ['/sequence-files/TSTFI82139201/', '/sequence-files/TSTFI74890649/']


Error http status: 404 for https://api.sandbox.igvf.org/md5:5317fee00618903e857d277295aeff17


Would create and upload 5317fee00618903e857d277295aeff17 for ['/sequence-files/TSTFI18564333/', '/sequence-files/TSTFI81418517/']
processing /measurement-sets/TSTDS48294221/


Error http status: 404 for https://api.sandbox.igvf.org/md5:e15c551f6a952eb83a68e76ebe5f1046


Would create and upload e15c551f6a952eb83a68e76ebe5f1046 for ['/sequence-files/TSTFI64736825/', '/sequence-files/TSTFI23342133/']


Error http status: 404 for https://api.sandbox.igvf.org/md5:e514a442d79e24097fffc1ec01345f49


Would create and upload e514a442d79e24097fffc1ec01345f49 for ['/sequence-files/TSTFI24016690/', '/sequence-files/TSTFI50184298/']
processing /measurement-sets/TSTDS09179588/


Error http status: 404 for https://api.sandbox.igvf.org/md5:d6d391515e22fbe6b916c37eb4a15f21


Would create and upload d6d391515e22fbe6b916c37eb4a15f21 for ['/sequence-files/TSTFI26976387/', '/sequence-files/TSTFI99353688/']


Error http status: 404 for https://api.sandbox.igvf.org/md5:f87ddeda99aa978666dee6fc2fafa7f0


Would create and upload f87ddeda99aa978666dee6fc2fafa7f0 for ['/sequence-files/TSTFI94782122/', '/sequence-files/TSTFI47948343/']
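
The loop above asserts that no two generated seqspecs share an md5, which matters because the portal lookups in the log are keyed by md5 (the /md5: URLs). As a minimal sketch, assuming you would rather collect collisions than stop at the first one, the same check can be written as a standalone helper. The function name find_duplicate_texts and the example strings are hypothetical and are not part of the notebook.

```python
import hashlib


def find_duplicate_texts(named_texts):
    """Return (name, first_name, md5) for each text whose md5 was already seen."""
    first_seen = {}
    duplicates = []
    for name, text in named_texts:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in first_seen:
            duplicates.append((name, first_seen[digest], digest))
        else:
            first_seen[digest] = name
    return duplicates


# Hypothetical stand-ins for generated seqspec YAML strings.
example_specs = [
    ("run 1", "name: run-1\n"),
    ("run 2", "name: run-2\n"),
    ("run 3", "name: run-1\n"),  # deliberately identical to run 1
]

duplicates = find_duplicate_texts(example_specs)
for name, first_name, digest in duplicates:
    print(f"{name} repeats {first_name} ({digest})")
```

Collecting collisions this way lets a long batch finish and report every repeated file at once, at the cost of not failing fast the way the assert does.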

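Since the loop ran with dry_run=True, submitted_log mixes placeholder records with records for configuration files already on the portal. A rough sketch of tallying the two per measurement set follows, assuming each log entry is a dict with accession and file_set keys and that dry-run entries carry the accession "would upload", as the table in this section suggests. The helper tally_dry_run and the trimmed example records are hypothetical.

```python
from collections import Counter

# Hypothetical records shaped like rows of the table in this section;
# real entries returned by register_seqspec carry more fields.
example_log = [
    {"accession": "would upload", "file_set": "/measurement-sets/TSTDS25106370/"},
    {"accession": "would upload", "file_set": "/measurement-sets/TSTDS25106370/"},
    {"accession": "TSTFI93269197", "file_set": "/measurement-sets/TSTDS32005063/"},
    {"accession": "TSTFI78410911", "file_set": "/measurement-sets/TSTDS32005063/"},
]


def tally_dry_run(log, placeholder="would upload"):
    """Count pending versus already registered seqspecs per measurement set."""
    counts = {}
    for record in log:
        state = "pending" if record["accession"] == placeholder else "registered"
        counts.setdefault(record["file_set"], Counter())[state] += 1
    return counts


summary = tally_dry_run(example_log)
for file_set, states in summary.items():
    print(file_set, dict(states))
```

A tally like this is a quick sanity check before switching dry_run off, since it shows which measurement sets would receive new configuration files.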

Convert the log of results from creating these seqspecs into a pandas DataFrame, which makes it easier to view and save as a table.

In [32]:
pandas.DataFrame(submitted_log)[["accession", "uuid", "file_set", "content_type", "file_format", "md5sum", "award", "lab"]]


,accession,uuid,file_set,content_type,file_format,md5sum,award,lab
0,would upload,would upload,/measurement-sets/TSTDS25106370/,seqspec,yaml,e25906467a62eed63095b9922460cf56,/awards/HG012077/,/labs/ali-mortazavi/
1,would upload,would upload,/measurement-sets/TSTDS25106370/,seqspec,yaml,77ab9bb6579a4fd1c57dffea8982bd96,/awards/HG012077/,/labs/ali-mortazavi/
2,TSTFI93269197,8b1b68be-a34e-423d-b34c-96674598d289,/measurement-sets/TSTDS32005063/,seqspec,yaml,c3efb1fcbf386f5e85100d965eb2efc7,/awards/HG012077/,/labs/ali-mortazavi/
3,TSTFI78410911,f533c6b2-e94a-4e91-bbef-e6bc0ee914be,/measurement-sets/TSTDS32005063/,seqspec,yaml,8dd07dfbb88549d261cfe20e18c49d43,/awards/HG012077/,/labs/ali-mortazavi/
4,would upload,would upload,/measurement-sets/TSTDS30294230/,seqspec,yaml,7d1df0bebee74103926e378ec5f0db41,/awards/HG012077/,/labs/ali-mortazavi/
5,would upload,would upload,/measurement-sets/TSTDS30294230/,seqspec,yaml,99e1100be873c64a5d97cd74cc171890,/awards/HG012077/,/labs/ali-mortazavi/
6,TSTFI72232735,26869245-c662-43b1-97b8-fcd9d8717d0a,/measurement-sets/TSTDS95237342/,seqspec,yaml,2cbdb926ccc80cb5231ceeafe251f164,/awards/HG012077/,/labs/ali-mortazavi/
7,TSTFI93762898,9bbdc837-cc61-4edc-8ff1-10f3a67d7d6b,/measurement-sets/TSTDS95237342/,seqspec,yaml,d9001d493d57f2227863996fce6319d4,/awards/HG012077/,/labs/ali-mortazavi/
8,would upload,would upload,/measurement-sets/TSTDS48877173/,seqspec,yaml,016373d235aca782f436f48a4d9562fe,/awards/HG012077/,/labs/ali-mortazavi/
9,would upload,would upload,/measurement-sets/TSTDS48877173/,seqspec,yaml,1c123dd1ee5788454596f79ba463e388,/awards/HG012077/,/labs/ali-mortazavi/
