# Thousands of seqspecs

- GitHub https://github.com/detrout/igvf-seq-spec-demo/
- Presentation https://woldlab.caltech.edu/~diane/igvf-seq-spec-demo/submitting-seqspecs.slides.html#/
- Notebook https://woldlab.caltech.edu/~diane/igvf-seq-spec-demo/submitting-seqspecs.html#

# Outline

This is a simplified account of how I worked out how to submit seqspec files to the IGVF DACC.

- [Python environment setup](#Setup)
- [Jinja Templating](#Jinja)
- [Seqspec template function](#Template)
- [Find datasets missing seqspecs](#Find-datasets-missing-seqspecs)
- [Generating a seqspec](#Generating-a-seqspec)
- [Seqspec submission functions](#Seqspec-Submission-functions)
- [Create seqspec objects for remaining fastqs](#Create-seqspec-objects-for-remaining-fastqs)

# Why seqspec?

In ENCODE there were some experiments where the details of the barcode structure were never provided, so when the person who did the experiment moved on there was no way to reprocess those samples.

And thus seqspec.

However, no one wants to generate thousands of seqspecs by hand, so I started working on automating seqspec generation.

Seqspec needs a fairly significant amount of information, which can come from a local LIMS or spreadsheets. To make a more general example, I decided to try to generate seqspec files from information posted on the portal.

For me, this required that my measurement sets have the following properties.

- protocols
- platform id
- sequencing kit information
- submitted fastqs

With that information,

- a templating engine
- the IGVF portal API
- and a few hard coded variables

I was able to create valid seqspec files and attach them to the measurement set and the fastqs associated with a single sequencing run.

One measurement set will have several fastqs attached to it, organized by Illumina read type and sequencing run. The number of different read types will vary depending on the assay.

<table>
    <thead>
        <tr><td>sequencing_run</td><td><b>R1</b></td><td><b>R2</b></td><td><b>I1</b></td><td><b>[R3/I2]...</b></td></tr>
    </thead>
    <tbody>
        <tr>
            <td><b>1</b></td>
            <td>run1/Sublibrary_7_S7_L001_R1_001.fastq.gz</td>
            <td>run1/Sublibrary_7_S7_L001_R2_001.fastq.gz</td>
            <td>run1/Sublibrary_7_S7_L001_I1_001.fastq.gz</td>
            <td>...</td>
        </tr>
        <tr>
            <td><b>2</b></td>
            <td>run1/Sublibrary_7_S7_L002_R1_001.fastq.gz</td>
            <td>run1/Sublibrary_7_S7_L002_R2_001.fastq.gz</td>
            <td>run1/Sublibrary_7_S7_L002_I1_001.fastq.gz</td>
            <td>...</td>
        </tr>
        <tr>
            <td><b>3</b></td>
            <td>run2/Sublibrary_7_S7_L002_R1_001.fastq.gz</td>
            <td>run2/Sublibrary_7_S7_L002_R2_001.fastq.gz</td>
            <td>run2/Sublibrary_7_S7_L002_I1_001.fastq.gz</td>
            <td>...</td>
        </tr>
    </tbody>
</table>

You'll need a seqspec for each group of files that should be treated as a single set.

The rest of this notebook is about grouping those runs and generating a seqspec for each one.

# Setup

We will need to import a variety of standard Python components, boto3, and Jinja.

Navigate down for the detailed code blocks

In [None]:
import enum
import gzip
import hashlib
from io import StringIO, BytesIO
import json
from jsonschema import Draft4Validator
import logging
from matplotlib.pyplot import show
import numpy
import os
from pathlib import Path
import pandas
import requests
import sys
from urllib.parse import urlparse, urljoin
import yaml

In [None]:
try:
    import boto3
except ImportError:
    !{sys.executable} -m pip install --user boto3
    import boto3
    
from botocore.exceptions import ClientError
    

In [None]:
try:
    from jinja2 import Environment
except ImportError:
    !{sys.executable} -m pip install --user jinja2
    from jinja2 import Environment

from jinja2 import FileSystemLoader, select_autoescape, Undefined, StrictUndefined, make_logging_undefined

logger = logging.getLogger(__name__)
LoggingUndefined = make_logging_undefined(
    logger=logger,
    base=Undefined
)

env = Environment(
    loader=FileSystemLoader("templates"),
    autoescape=select_autoescape(),
    undefined=LoggingUndefined,
)

# Import seqspec validator

To make sure our seqspecs are valid before submitting, I imported the parts of seqspec needed to validate them from code instead of using the command line interface.

I have the repository checked out into ~/proj/seqspec. This block should either import it for me, or install it if someone else runs it.

See below for details

In [None]:
try:
    import seqspec
except ImportError:
    # change ~/proj/seqspec to wherever you have seqspec cloned to.
    seqspec_root = Path("~/proj/seqspec").expanduser()
    if seqspec_root.exists() and str(seqspec_root) not in sys.path:
        sys.path.append(str(seqspec_root))
    else:
        # Once seqspec is updated
        #!{sys.executable} -m pip install --user seqspec
        # Currently the IGVF pipelines need a development branch.
        # On Linux systems with newer versions of Python you might need --break-system-packages,
        # or to use a virtualenv or conda. See https://peps.python.org/pep-0668/ for discussion.
        !{sys.executable} -m pip install --user \
            git+https://github.com/pachterlab/seqspec.git@c9520b49232ec9a488a32ac9aba4099878241fad
    import seqspec

Import pieces of seqspec that we need for this notebook.

In [None]:
from seqspec.Assay import Assay
from seqspec.Region import Region
from seqspec.Region import Onlist
from seqspec.utils import load_spec_stream
from seqspec.seqspec_index import run_index, get_index
from seqspec.seqspec_print import print_library_ascii, print_seqspec_png
from seqspec.seqspec_onlist import run_onlist_region, run_onlist_read

In [None]:
from seqspecgen.igvf import (
    get_barcode_info,
)
from seqspecgen.protocols import (
    filter_protocols_for_lookups, 
    get_library_kit_from_protocols, 
    get_template_from_protocols,
    ProtocolsIO,
)
from seqspecgen.util import seqspec_validate, generate_seqspec_tool_index

## Utilities for seqspec validation

Some helper functions because the seqspec library wasn't really written for this use case.

# Import portal API

I have my own API client, <a href="https://github.com/detrout/encoded_client">encoded_client</a>, for interacting with the IGVF database server (which is very much like the old ENCODE database server). It is similar in purpose to <a href="https://igvf-utils.readthedocs.io/">igvf-utils</a>.

See below for details

In [None]:
try:
    from encoded_client import encoded
except ImportError:
    encoded_root = Path("~/proj/encoded_client").expanduser()
    if encoded_root.exists() and str(encoded_root) not in sys.path:
        sys.path.append(str(encoded_root))
    else:
        !{sys.executable} -m pip install --user encoded_client
        
    from encoded_client import encoded

from encoded_client.encoded import filter_aws_credentials
from encoded_client.submission import parse_s3_url

encoded_client will pull submitter credentials either from DCC_API_KEY and DCC_SECRET_KEY or from a .netrc file loaded from your home directory.

A .netrc file is a plain text file with records of the following format (replace the {DCC_API_KEY} and {DCC_SECRET_KEY} strings with your specific values):

<pre>machine api.sandbox.igvf.org login {DCC_API_KEY} password {DCC_SECRET_KEY}</pre>

Use api.data.igvf.org as the machine name for the production server.

(It's also possible to list the fields on separate lines, but I think it's easier to read when they're on one line.)

Alternatively, after creating the server object, set:

<pre>server.username = "{DCC_API_KEY}"
server.password = "{DCC_SECRET_KEY}"</pre>
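
As a rough illustration of the record format above, the standard-library `netrc` module can parse such a file. This sketch only shows how the fields are laid out; it does not show how encoded_client reads the file internally, and the key values are hypothetical placeholders.

```python
import netrc
import os
import tempfile

# Hypothetical placeholder values; real keys come from your portal profile.
example_record = (
    "machine api.sandbox.igvf.org login EXAMPLEKEY password examplesecret\n"
)

# Write the record to a temporary file so the real ~/.netrc is untouched.
with tempfile.NamedTemporaryFile("w", suffix=".netrc", delete=False) as handle:
    handle.write(example_record)
    netrc_path = handle.name

try:
    parsed = netrc.netrc(netrc_path)
    # authenticators() returns (login, account, password) for a known host
    # and None when the host has no entry.
    login, _account, password = parsed.authenticators("api.sandbox.igvf.org")
    missing = parsed.authenticators("api.data.igvf.org")
finally:
    os.remove(netrc_path)

print(login, password, missing)
```

A record is matched by machine name, so the sandbox and production servers each need their own line.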



# Variables

We need to specify which server we're using and our award and submitting lab ids.

In [None]:
server_name = "api.data.igvf.org"
#server_name = "api.sandbox.igvf.org"
award = "/awards/HG012077/"
lab = "/labs/ali-mortazavi/"

server = encoded.ENCODED(server_name)
igvf_validator = encoded.DCCValidator(server)

# see below for a dictionary to help cache portal objects

Simple object to cache object lookups

In [None]:
class CachedTerms:
    def __init__(self, server):
        self.server = server
        self._cache = {}
        
    def __getitem__(self, key):
        if key in self._cache:
            return self._cache[key]
        
        obj = self.server.get_json(key)
        if obj is not None:
            self._cache[key] = obj
            return obj
        
server_cache = CachedTerms(server)
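
To see what the cache buys us, here is a small self-contained sketch. The class is repeated so the block runs on its own, and `FakeServer` is a hypothetical stand-in for the portal client that simply counts requests.

```python
class CachedTerms:
    """Same caching idea as above, repeated so this sketch runs on its own."""

    def __init__(self, server):
        self.server = server
        self._cache = {}

    def __getitem__(self, key):
        if key in self._cache:
            return self._cache[key]
        obj = self.server.get_json(key)
        if obj is not None:
            self._cache[key] = obj
        return obj


class FakeServer:
    """Hypothetical stand-in for the portal client; counts requests."""

    def __init__(self):
        self.requests = 0

    def get_json(self, path):
        self.requests += 1
        return {"@id": path, "term_name": "example platform"}


fake_server = FakeServer()
demo_cache = CachedTerms(fake_server)
first = demo_cache["/platform-terms/EXAMPLE/"]
second = demo_cache["/platform-terms/EXAMPLE/"]
# Only the first lookup reaches the server; the second comes from the cache.
print(fake_server.requests)
```

This matters when thousands of fastqs all point at the same handful of platform terms.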

# Jinja

[Jinja](https://jinja.palletsprojects.com/) is a fairly popular templating language for Python.

It supports conditionals, loops, variable substitutions, and even function calls. However, for seqspec we just need to be able to substitute in variables we collect from elsewhere.

Here's an example of one of the template read blocks showing some of the variables that will get replaced.

<pre>
- !Read
  read_id: {{ R1_file_id }}.fastq.gz
  name: Read 1 fastq for {{ R1_file_id }}
  modality: rna
  primer_id: truseq_read1
  min_len: {{ R1_min_length }}
  max_len: {{ R1_max_length }}
  strand: pos
</pre>

# Filter ProtocolsIO urls

The ProtocolsIO URLs are a bit long, so add a mapping to a short name.

# Template variables

First build up the barcode onlists needed for this protocol; their names will be passed to the template. Combinatorial barcoding schemes like Parse or SHARE-seq need to specify several barcode files.

In [None]:
class BarcodesFromProtocols:
    def __init__(self, server):
        self.onlist_n96_v4 = {}
        self.onlist_n96_v4.update(get_barcode_info(server, "barcode_1", "IGVFFI0924TKJO"))
        self.onlist_n96_v4.update(get_barcode_info(server, "barcode_2", "IGVFFI1138MCVX"))
        self.onlist_n96_v4.update(get_barcode_info(server, "barcode_3", "IGVFFI1138MCVX"))
        self.onlist_n192_v4 = {}
        self.onlist_n192_v4.update(get_barcode_info(server, "barcode_1", "IGVFFI2591OFQO"))
        self.onlist_n192_v4.update(get_barcode_info(server, "barcode_2", "IGVFFI1138MCVX"))
        self.onlist_n192_v4.update(get_barcode_info(server, "barcode_3", "IGVFFI1138MCVX"))

    def __call__(self, protocols):
        if ProtocolsIO.splitseq_100k_v2 in protocols:
            return self.onlist_n96_v4
        elif ProtocolsIO.splitseq_1M_v2 in protocols:
            return self.onlist_n192_v4
        else:
            raise ValueError("Unrecognized barcode protocol")
        
get_barcodes_from_protocols = BarcodesFromProtocols(server)

get_barcodes_from_protocols([ProtocolsIO.splitseq_1M_v2])

Information about library kits, organized by protocol and, for us, single versus dual Illumina index.

`get_library_kit_from_protocols(protocols)` returns a dictionary of library_protocol, library_kit, and sequence_kit settings.

The protocol is also used to select which seqspec template to use.

`get_template_from_protocols(protocols)` returns the name of the template file as a string.

Search for measurement sets from our lab that are missing seqspecs.

In [None]:
# measurement sets missing seqspecs
query = "/search/?type=MeasurementSet&lab.title=Ali+Mortazavi%2C+UCI"\
        "&audit.NOT_COMPLIANT.category=missing+sequence+specification"
# all Parse SPLiT-seq measurement sets; this assignment overrides the query above
query = "/search/?type=MeasurementSet&lab.title=Ali+Mortazavi%2C+UCI&preferred_assay_title=Parse+SPLiT-seq"

graph = server.get_json(query, limit=2000)
to_process = [x["@id"] for x in graph["@graph"]]
# some test server datasets were set up in ways incompatible with this notebook.
if server.server == "api.sandbox.igvf.org":
    problems = set(["/measurement-sets/TSTDS36584014/", "/measurement-sets/TSTDS51545328/", "/measurement-sets/TSTDS95340953/", "/measurement-sets/TSTDS43282126/", "/measurement-sets/TSTDS76216718/"])
    to_process = [x for x in to_process if x not in problems]
print(len(to_process))
to_process

As an aside, searches and end-points like https://api.sandbox.igvf.org/measurement-sets/ return a JSON-LD collection.

<pre>
{
  "@id": "/search/?type=Measuremen…le=Ali+Mortazavi%2C+UCI",
  "@graph": [ ... ],
  "@type": ["Search"],
  "total": 27
}
</pre>

For many searches the objects in the @graph may contain only a subset of all the attributes, so I frequently do a search and then request the fully detailed object.

<code format="python">    for row in response["@graph"]:
       detail = server.get_json(row["@id"])
       ...  # do stuff
</code>
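
A minimal sketch of that search-then-fetch pattern follows. `FakeSearchServer` and `iter_detailed` are hypothetical names used only for illustration; the only portal behavior assumed is a `get_json(path)` call like the one used throughout this notebook.

```python
class FakeSearchServer:
    """Hypothetical stand-in for the portal client used in this notebook."""

    def __init__(self, objects):
        self.objects = objects

    def get_json(self, path):
        if path.startswith("/search/"):
            # Search rows carry only a few attributes per object.
            rows = [{"@id": key} for key in sorted(self.objects)]
            return {"@graph": rows, "@type": ["Search"], "total": len(rows)}
        return self.objects[path]


def iter_detailed(server, query):
    """Yield the fully detailed object for each row of a search."""
    response = server.get_json(query)
    for row in response["@graph"]:
        yield server.get_json(row["@id"])


demo_server = FakeSearchServer({
    "/measurement-sets/EXAMPLE1/": {
        "@id": "/measurement-sets/EXAMPLE1/", "protocols": ["p1"]},
    "/measurement-sets/EXAMPLE2/": {
        "@id": "/measurement-sets/EXAMPLE2/", "protocols": ["p2"]},
})
details = list(iter_detailed(demo_server, "/search/?type=MeasurementSet"))
print([detail["@id"] for detail in details])
```

The cost is one extra request per search row, which is where a cache becomes useful.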

In [None]:
#measurement_test_id = "/measurement-sets/IGVFDS5205JRNX/"  # production B01_13A_illumina parse-wt-v2-single-index-libspec-3.yaml.j2
#measurement_test_id = "/measurement-sets/IGVFDS8381SGFM/"  # 014_13A parse-wt-v2-dual-index-libspec-3.yaml.j2
#measurement_test_id = "/measurement-sets/IGVFDS1479KDWW/"  # 015_67C parse-wt-mega-v2-dual-index-libspec-3.yaml.j2
#measurement_test_id = "/measurement-sets/IGVFDS4191KCRM/"  # 011_67H parse-wt-mega-v2-single-index-libspec-3.yaml.j2
measurement_test_id = "/measurement-sets/IGVFDS6433BHLG/"   # B01_003_25B uci-share-seq-rna.yaml.j2
#measurement_test_id = "/measurement-sets/IGVFDS5820QLWL/"   # B01_003_25B uci-share-seq-atac.yaml.j2

## Processing measurement sets

The measurement_set lists all the files attached to it in a single ["files"] list. However, we need to process seqspecs by sequencing_run, so we'll need to find the fastqs and group them by sequencing run.

A feature I should still implement: if there are already seqspec configuration files attached to the measurement set, the code should detect that and warn about it.

Build up a data structure of reads organized by sequencing run.

In [None]:
def get_sequence_files(measurement_set):
    paired_files = {}
    for file in measurement_set["files"]:        
        sequence = server.get_json(file["@id"])
        file_format = sequence["file_format"]
        if file_format in ("fastq",):
            sequence_run = sequence["sequencing_run"]
            illumina_read_type = sequence.get("illumina_read_type")
            paired_files.setdefault(sequence_run, {})[illumina_read_type] = sequence
        
    return paired_files

measurement = server.get_json(measurement_test_id)
protocols = measurement["protocols"]
run_files = get_sequence_files(measurement)
for run in sorted(run_files):
    for read in sorted(run_files[run]):
        print(run, read, run_files[run][read]["submitted_file_name"])
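
To make the resulting structure easier to picture without a portal connection, here is the same grouping applied to a few made-up records. The accessions and the `group_fastqs_by_run` name are hypothetical; only the field names come from the function above.

```python
def group_fastqs_by_run(file_records):
    """Group fastq records as {sequencing_run: {illumina_read_type: record}}."""
    grouped = {}
    for record in file_records:
        if record.get("file_format") != "fastq":
            continue  # skip configuration files and anything else
        run = record["sequencing_run"]
        read_type = record.get("illumina_read_type")
        grouped.setdefault(run, {})[read_type] = record
    return grouped


# Made-up records standing in for what the portal would return.
example_records = [
    {"accession": "EXAMPLE_A", "file_format": "fastq",
     "sequencing_run": 1, "illumina_read_type": "R1"},
    {"accession": "EXAMPLE_B", "file_format": "fastq",
     "sequencing_run": 1, "illumina_read_type": "R2"},
    {"accession": "EXAMPLE_C", "file_format": "fastq",
     "sequencing_run": 2, "illumina_read_type": "R1"},
    {"accession": "EXAMPLE_D", "file_format": "fastq",
     "sequencing_run": 2, "illumina_read_type": "R2"},
    {"accession": "EXAMPLE_E", "file_format": "yaml"},
]

grouped = group_fastqs_by_run(example_records)
for run in sorted(grouped):
    print(run, sorted(grouped[run]))
```

Each inner dictionary is one row of the table near the top of this notebook, and each row gets its own seqspec.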

# Generating a seqspec

Introducing functions to build the seqspec from information in the notebook and what's been posted to the portal.

I made sure to include the sequencing platform, sequencing_kit, and protocols in my measurement set objects so I could reuse them when generating seqspecs

# Rendering logic

- First find protocol based variables and templates
- Then merge values from a dictionary indexed by illumina_read_type that represents one sequencing_run row generated by get_sequence_files()

<pre>
{"R1": {"@id": ... },
 "R2": {"@id": ....}}
</pre>

- render the template to a string
- load the string into the seqspec library
- call update_spec to fix the joined regions
- validate the seqspec
- if validation passes return the seqspec text using to_YAML()

Broken into smaller functions for discussion.

We need to define the variables that will be passed to the Jinja templating engine.

In [None]:
def generate_illumina_seqspec_context(protocols, run_files):
    platform = run_files["R1"]["sequencing_platform"]
    platform_id = platform["@id"]
    platform = server.get_json(platform_id)
    platform_term_id = platform["term_id"]
    platform_term_name = platform["term_name"]
    sequence_protocol = f"{platform_term_name} ({platform_term_id})"
    context = {
        #"R1_accession": run_files["R1"]["accession"],
        #"R1_url": server.prepare_url(run_files["R1"]["href"]),
        #"R1_min_length": run_files["R1"]["minimum_read_length"],
        #"R1_max_length": run_files["R1"]["maximum_read_length"],

        #"R2_accession": run_files["R2"]["accession"],
        #"R2_url": server.prepare_url(run_files["R2"]["href"]),
        #"R2_min_length": run_files["R2"]["minimum_read_length"],
        #"R2_max_length": run_files["R2"]["maximum_read_length"],

        "sequence_kit": run_files["R1"]["sequencing_kit"],
        "sequence_protocol": sequence_protocol,
    }
    for key in ["R1", "R2", "R3", "I1", "I2"]:
        if key in run_files:
            row = run_files[key]
            context[f"{key}_file_id"] = row["accession"]
            context[f"{key}_file_name"] = Path(row["href"]).name
            context[f"{key}_file_type"] = row["file_format"]
            context[f"{key}_file_size"] = row["file_size"]
            context[f"{key}_url"] = server.prepare_url(row["href"])
            context[f"{key}_min_length"] = row["minimum_read_length"]
            context[f"{key}_max_length"] = row["maximum_read_length"]
            context[f"{key}_md5sum"] = row["md5sum"]

    #context.update(get_barcodes_from_protocols(protocols))
    context.update(get_library_kit_from_protocols(protocols))
    
    return context

In [None]:
generate_illumina_seqspec_context(protocols, run_files[1])
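
The per-read part of that context can be tried offline. In this sketch the records, accessions, and href layout are made up, and `read_context` is a hypothetical helper; it only illustrates how each read type becomes a prefix on the template variable names.

```python
from pathlib import PurePosixPath


def read_context(read_type, record):
    """Build the template variables for one read, prefixed by its read type."""
    return {
        f"{read_type}_file_id": record["accession"],
        f"{read_type}_file_name": PurePosixPath(record["href"]).name,
        f"{read_type}_min_length": record["minimum_read_length"],
        f"{read_type}_max_length": record["maximum_read_length"],
    }


# Made-up records for a single sequencing run with only R1 and R2.
example_run = {
    "R1": {"accession": "EXAMPLEFI0001",
           "href": "/sequence-files/EXAMPLEFI0001/@@download/EXAMPLEFI0001.fastq.gz",
           "minimum_read_length": 140, "maximum_read_length": 140},
    "R2": {"accession": "EXAMPLEFI0002",
           "href": "/sequence-files/EXAMPLEFI0002/@@download/EXAMPLEFI0002.fastq.gz",
           "minimum_read_length": 86, "maximum_read_length": 86},
}

example_context = {}
for read_type in ["R1", "R2", "R3", "I1", "I2"]:
    if read_type in example_run:  # read types absent from the run are skipped
        example_context.update(read_context(read_type, example_run[read_type]))

print(sorted(example_context))
```

A template that refers to a variable such as `R3_file_id` for a run without an R3 read would hit an undefined value, which is what the logging Undefined class configured earlier is meant to report.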

In [None]:
def generate_nanopore_seqspec_context(protocols, run_files):
    platform = run_files[None]["sequencing_platform"]    
    platform_id = platform["@id"]
    context = {
        #"read1_accession": run_files[None]["accession"],
        #"read1_url": server.prepare_url(run_files[None]["href"]),
        #"read1_min_length": run_files[None]["minimum_read_length"],
        #"read1_max_length": run_files[None]["maximum_read_length"],

        "sequence_kit": run_files[None]["sequencing_kit"],
        "sequence_protocol": server_cache[platform_id]["term_name"],
    }

    for key in ["R1", "R2", "R3", "I1", "I2"]:
        if key in run_files:
            row = run_files[key]
            context[f"{key}_file_id"] = row["accession"]
            context[f"{key}_file_name"] = Path(row["href"]).name
            context[f"{key}_file_type"] = row["file_format"]
            context[f"{key}_file_size"] = row["file_size"]
            context[f"{key}_url"] = server.prepare_url(row["href"])
            context[f"{key}_min_length"] = row["minimum_read_length"]
            context[f"{key}_max_length"] = row["maximum_read_length"]
            context[f"{key}_md5sum"] = row["md5sum"]
           
    context.update(get_barcodes_from_protocols(protocols))
    context.update(get_library_kit_from_protocols(protocols))
    
    return context

In [None]:
def generate_seqspec_context(protocols, run_files):
    if ProtocolsIO.splitseq_single_index in protocols:
        return generate_illumina_seqspec_context(protocols, run_files)
    elif ProtocolsIO.splitseq_dual_index in protocols:
        return generate_illumina_seqspec_context(protocols, run_files)
    elif ProtocolsIO.ont_library_prep in protocols:
        return generate_nanopore_seqspec_context(protocols, run_files)
    else:
        raise ValueError("Unknown seqspec context protocol {}".format(protocols))

With a filled-in context dictionary we can look up our template, render it, and test that it's valid.

In [None]:
def generate_seqspec_for_run(protocols, run_files, verbose=False):
    protocol = filter_protocols_for_lookups(protocols)
    template_name = get_template_from_protocols(protocols)
    print(template_name)

    context = generate_seqspec_context(protocol, run_files)
    template = env.get_template(template_name)
    example_yaml = template.render(context)
    
    # validate the generated seqspec file.
    example_spec = load_spec_stream(StringIO(example_yaml))
    example_spec.update_spec()
    seqspec_validate(example_spec.to_dict())
    
    if verbose:
        # lets print the settings
        #print_library_ascii, print_seqspec_png
        print("tree")
        print(print_library_ascii(example_spec))
        print("tool settings:", generate_seqspec_tool_index(example_spec, run_files))
        onlist_files = run_onlist_read(example_spec, "rna", run_files["R2"]["accession"])
        print("onlist", [onlist.filename for onlist in onlist_files])
        show(print_seqspec_png(example_spec))
        print()

    return example_spec.to_YAML()

And an example of generate_seqspec_for_run being called

In [None]:
print(generate_seqspec_for_run(protocols, run_files[1], verbose=True))

# Test seqspec

Check barcode locations for your tool.

```
seqspec index -m rna -r TSTFI09417350.fastq.gz,TSTFI18957175.fastq.gz -t kb TSTDS32005063.yaml
1,10,18,1,48,56,1,78,86:1,0,10:0,0,140
```
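
To make that index string easier to check by eye, here is a small sketch that splits it apart. It assumes the string follows the kallisto custom technology layout as I understand it: three colon-separated fields for barcode, UMI, and cDNA, each a list of file,start,stop triples. `parse_kb_index` is a hypothetical helper, not part of seqspec.

```python
def parse_kb_index(index):
    """Split a barcode:umi:cdna index string into (file, start, stop) triples."""
    parsed = {}
    for name, field in zip(("barcode", "umi", "cdna"), index.split(":")):
        numbers = [int(value) for value in field.split(",")]
        parsed[name] = [
            tuple(numbers[i:i + 3]) for i in range(0, len(numbers), 3)
        ]
    return parsed


layout = parse_kb_index("1,10,18,1,48,56,1,78,86:1,0,10:0,0,140")
for name, triples in layout.items():
    print(name, triples)
```

Read this way, the example has three 8 base barcodes and a 10 base UMI in the second file, and 140 bases of cDNA in the first file.
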
Check that the generated barcode list is correct.

```
seqspec onlist -m rna -f multi -r barcode -s region-type TSTDS32005063.yaml
/woldlab/loxcyc/home/diane/proj/igvf-seq-spec-demo/onlist_joined.txt
diane@trog:~/proj/seqspec$ head onlist_joined.txt 
AACGTGAT AACGTGAT CATTCCTA
AAACATCG AAACATCG CTTCATCA
ATGCCTAA ATGCCTAA CCTATATC
```


# Seqspec Submission functions

Post a seqspec object to the portal and capture the upload credentials from the response

In [None]:
def create_seqspec_metadata_object(seqspec_metadata):
    """Post a seq spec metadata object to the portal
    """
    response = server.post_json("configuration_file", seqspec_metadata)
    if response["status"] == "success":
        graph = response["@graph"]
        if len(graph) != 1:
            print("Strange number of result objects {}".format(len(graph)))        

        print("Upload of {} succeeded". format(graph[0]["@id"]))
        seqspec_metadata.update({
            "@id": graph[0]["@id"],
            "accession": graph[0]["accession"],
            "uuid": graph[0]["uuid"],
        })
    else:
        print(filter_aws_credentials(response))
        raise RuntimeError("Unable to create metadata object")
        
    return graph[0].get("upload_credentials")

The credential refresh logic is split out to keep the upload function shorter.

In [None]:
def refresh_credentials(credentials, seqspec_metadata):
    if credentials is None:
        print("retrieving new credentials")
        response = server.post_json("{}/@@upload".format(seqspec_metadata["@id"]), {})
        print(filter_aws_credentials(response))
        if response["status"] == "success":
            assert len(response["@graph"]) == 1, "upload_seqspec_file Unexpected graph length {}".format(len(response["@graph"]))
            graph = response["@graph"]
            credentials = graph[0]["upload_credentials"]
            return credentials
        else:
            raise ValueError("Unable to get credentials for {}".format(seqspec_metadata["@id"]))
    else:
        return credentials

Using the credentials, the seqspec_metadata, and the seqspec text, post the seqspec to the portal. 

In [None]:
def upload_seqspec_file(credentials, seqspec_metadata, seqspec_contents):        
    """upload the seqspec contents as a file to s3"""
    if not isinstance(seqspec_contents, bytes):
        raise ValueError("seqspec_contents needs to be a gzipped byte array")

    credentials = refresh_credentials(credentials, seqspec_metadata)

    s3_client = boto3.client(
        's3', 
        aws_access_key_id=credentials["access_key"], 
        aws_secret_access_key=credentials["secret_key"], 
        aws_session_token=credentials["session_token"])

    bucket, target = parse_s3_url(credentials["upload_url"])
    s3_client.upload_fileobj(
        BytesIO(seqspec_contents),
        bucket,
        target)

A really irritating thing about gzip is that it includes the current time in the compression header, which means that by default each new compression run will get a new md5sum.

But I want to use the md5sum to see if I've already posted the file...

The Python gzip utilities let you set the time, so you can force a reproducible time. This code sets the gzip creation time to 1970 Jan 1 00:00 UTC (UNIX time 0).

`gzip.compress(buffer, mtime=0)`

Or for the gzip shell command

`gzip -n/--no-name filename`
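
A quick way to convince yourself of this is to compress the same bytes a few times and compare checksums. The text being compressed is a made-up placeholder.

```python
import gzip
import hashlib

seqspec_text = "example seqspec contents\n".encode("utf-8")

# Two runs with a fixed timestamp, one with a different timestamp.
fixed_a = gzip.compress(seqspec_text, mtime=0)
fixed_b = gzip.compress(seqspec_text, mtime=0)
stamped = gzip.compress(seqspec_text, mtime=1700000000)

md5_a = hashlib.md5(fixed_a).hexdigest()
md5_b = hashlib.md5(fixed_b).hexdigest()
md5_stamped = hashlib.md5(stamped).hexdigest()

# Bytes 4-7 of a gzip header hold the modification time (little endian).
header_mtime = int.from_bytes(fixed_a[4:8], "little")

print(md5_a == md5_b, md5_a == md5_stamped, header_mtime)
```

The decompressed contents are identical in all three cases; only the header timestamp changes the checksum.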

Construct seqspec object, see if it's already posted (by md5sum), and if not create the object and post the seqspec.

- build list of file_ids this seqspec is for
- compress file
- calculate md5 of gzipped file
- create seqspec_object
- get_or_update_seqspec
  - search for the seqspec object; if it is found, do the following:
  - upload gzipped seqspec contents if they are missing
  - update seqspec_object with @id, accession, uuid
- if md5 was not found,
  - create the metadata object
  - upload compressed contents
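
The steps above can be sketched against an in-memory dictionary standing in for the portal. Everything here is hypothetical scaffolding (`plan_registration`, the `has_contents` flag, the dictionary keyed by md5sum); it only illustrates the decision flow, not the real portal calls defined below.

```python
import gzip
import hashlib


def plan_registration(portal, seqspec_text, dry_run=True):
    """Return (action, md5sum) for one seqspec against an in-memory portal."""
    payload = gzip.compress(seqspec_text.encode("utf-8"), mtime=0)
    md5sum = hashlib.md5(payload).hexdigest()
    record = portal.get(md5sum)
    if record is None:
        action = "create and upload"
    elif not record["has_contents"]:
        action = "upload contents"
    else:
        action = "already posted"
    # In dry_run mode we only report what would happen.
    if not dry_run and action != "already posted":
        portal[md5sum] = {"has_contents": True}
    return action, md5sum


demo_portal = {}
text = "example: seqspec\n"
first_action, demo_md5 = plan_registration(demo_portal, text, dry_run=True)
second_action, _ = plan_registration(demo_portal, text, dry_run=False)
third_action, _ = plan_registration(demo_portal, text, dry_run=False)
print(first_action, "|", second_action, "|", third_action)
```

Because the compression is reproducible, running the same seqspec through a second time finds the existing record instead of creating a duplicate.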

In [None]:
def get_seqspec_of(sequencing_run_files):
    # Once the seqspec configuration file and metadata have been created and uploaded,
    # attach the configuration file to its fastqs.
    seqspec_of = []
    for read in sequencing_run_files:
        # collect the @id of each fastq in this sequencing run
        seqspec_of.append(sequencing_run_files[read]["@id"])

    return seqspec_of

In [None]:
def create_seqspec_configuration_metadata(file_set, md5, seqspec_of):
    # Construct the configuration_file object for the DACC portal
    seqspec_metadata = {
        "award": award,
        "lab": lab,
        "md5sum": md5.hexdigest(),
        "file_format": "yaml",
        "file_set": file_set,
        "content_type": "seqspec",
        "seqspec_of": seqspec_of,
    }
    # Make sure the configuration_file passes the DACCs schema
    igvf_validator.validate(seqspec_metadata, "configuration_file")
    return seqspec_metadata

In [None]:
def get_or_update_seqspec(seqspec_metadata, seqspec_gzip, md5, dry_run=True):
    response = server.get_json("md5:{}".format(seqspec_metadata["md5sum"]))
    accession = response["accession"]
    uuid = response["uuid"]
    print("found object {} by {}. {}".format(
        response["@id"], seqspec_metadata["md5sum"], response["status"]))

    for k in seqspec_metadata:
        if seqspec_metadata[k] != response.get(k):
            print("{} {} differ".format(seqspec_metadata[k], response.get(k)))

    seqspec_metadata.update({
        "@id": response["@id"],
    })

    if response["status"] == "in progress":
        if dry_run:
            print("Would upload file contents {}".format(md5.hexdigest()))
        else:
            upload_seqspec_file(None, seqspec_metadata, seqspec_gzip)

    seqspec_metadata["accession"] = accession
    seqspec_metadata["uuid"] = uuid
    return seqspec_metadata

In [None]:
def create_and_upload_seqspec(seqspec_metadata, seqspec_data, md5, dry_run=True):
    if dry_run:
        print("Would create and upload {} for {}".format(
            md5.hexdigest(), seqspec_metadata["seqspec_of"]))
        for server_keys in ["@id", "accession", "uuid"]:
            seqspec_metadata[server_keys] = "would upload"
    else:
        credentials = create_seqspec_metadata_object(seqspec_metadata)
        upload_seqspec_file(credentials, seqspec_metadata, seqspec_data)  
    return seqspec_metadata

In [None]:
def register_seqspec(file_set, seqspec, sequencing_run_files, dry_run=True):
    """Create the seqspec objects and attach them to the fastqs
    
    Parameters:
    - file_set: id the seqspec should be attached to
    - seqspec a formatted seqspec file
    - set of file objects associated with one sequencing run
    - dry_run flag for if we should actually post data    
    """
    seqspec_of = get_seqspec_of(sequencing_run_files)
    # reproducibly compress seqspec text
    seqspec_gzip = gzip.compress(seqspec.encode("utf-8"), mtime=0)
    md5 = hashlib.md5(seqspec_gzip)
    seqspec_metadata = create_seqspec_configuration_metadata(
        file_set, md5, seqspec_of)
    # Search the portal for the md5sum of our seqspec file to see if it has already been posted
    try:
        seqspec_metadata = get_or_update_seqspec(
            seqspec_metadata, seqspec_gzip, md5, dry_run=dry_run)
    except encoded.HTTPError as err:
        # If the file has not been submitted, and we're not in dry_run mode 
        # lets submit it
        if err.response.status_code == 404:
            seqspec_metadata = create_and_upload_seqspec(
                seqspec_metadata, seqspec_gzip, md5, dry_run=dry_run)
        else:
            print("Other HTTPError error {}".format(err.response.status_code))
    return seqspec_metadata

# Create seqspec objects for remaining fastqs

Now that we have a way to list all of the fastq sets, and a function to post everything to the portal, let's loop through all of our measurement_sets and post the seqspec configuration files for all the fastqs. (Change the to_process variable to be whatever you need updated.)

In [None]:
known = set(to_process)

In [None]:
# note for the tests we limit this to just one measurement set
# instead of all pending measurement sets
#to_process = [measurement_test_id]

#to_process = """""".split()
#to_process = ["/measurement-sets/{}/".format(x) for x in to_process]
print(f"Remaining {len(known)}-{len(to_process)}={len(known)-len(to_process)}")
assert set(to_process).issubset(known)

seen_md5s = set()
generated_specs = {}
submitted_log = []
for measurement_id in to_process:
    print("processing", measurement_id)
    measurement = server.get_json(measurement_id)
    protocols = measurement["protocols"]
    #if ProtocolsIO.splitseq_dual_index in protocols:
    #    print(f"Skipping {measurement_id} don't trust dual index illumina seqspec yet.")
    #    continue
    run_files = get_sequence_files(measurement)
    
    for run_number in sorted(run_files):
        seqspec = generate_seqspec_for_run(protocols, run_files[run_number])
        generated_specs.setdefault(measurement_id, {}).setdefault(run_number, []).append(seqspec)
        current_md5 = hashlib.md5(seqspec.encode("utf-8")).hexdigest()
        assert current_md5 not in seen_md5s, "we generated the same seqspec file somehow"
        seen_md5s.add(current_md5)
        metadata = register_seqspec(measurement_id, seqspec, run_files[run_number], dry_run=True)
        metadata["template:skip"] = get_template_from_protocols(protocols)
        submitted_log.append(metadata)
    

In [None]:
%debug

Convert the results of creating these seqspecs into a pandas data frame to make it easier to save a table.

In [None]:
uploaded = pandas.DataFrame(submitted_log)[["accession", "uuid", "file_set", "content_type", "file_format", "md5sum", "template:skip", "award", "lab"]]
uploaded


In [None]:
set(uploaded["template:skip"])

In [None]:
uploaded.to_csv("/dev/shm/configuration_files.csv", index=False)