# Prepare GISAID accessions for batch download

This study uses over 20,000 HA sequences from GISAID. Although every accession is listed in the tip attributes tables for the natural populations, downloading the sequences is still a non-trivial process: users must search for accessions manually through the GISAID web interface and download the sequences in batches. Because the GISAID search field has a maximum length of 1000 characters, it is not possible to search for all 20,000+ accessions at once.

This notebook prepares a CSV file (importable into Excel) that splits all 20,000+ accessions into batches, each of which forms a search string of no more than 1000 characters. Each batch is annotated with a numeric id and the number of sequences it should return, so users can track which batches they have downloaded and confirm that each download contains the correct number of sequences.
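
As a rough guide to the batch sizes to expect, `n` accessions of `k` characters each, joined by single spaces, occupy `n * k + (n - 1)` characters. The sketch below assumes fixed-width accessions, which is a simplification because real accessions vary in length. The helper name `max_accessions_per_batch` is hypothetical, and it applies the same strict less-than comparison that the batching loop in this notebook uses.

```python
def max_accessions_per_batch(accession_length, max_length=1000):
    """Return the largest n such that n fixed-width accessions joined
    by single spaces are strictly shorter than max_length characters.

    This is an illustrative helper, not part of the original notebook.
    """
    # n accessions plus (n - 1) separators must satisfy
    #   n * k + (n - 1) < max_length
    # which is equivalent to n * (k + 1) <= max_length.
    return max_length // (accession_length + 1)


# Ten-character accessions such as "EPI1000654":
ten_char_capacity = max_accessions_per_batch(10)

# Nine-character accessions:
nine_char_capacity = max_accessions_per_batch(9)
```

Under these assumptions, ten-character accessions fit 90 to a batch, so a batch with more accessions must contain at least one shorter accession.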

In [1]:
import pandas as pd

In [2]:
# Define the maximum length of the GISAID search field.
max_length = 1000

In [3]:
# Load tip attributes for both validation and test datasets.
tip_attributes_file = "../results/builds/natural/natural_sample_1_with_90_vpm_sliding/tip_attributes_with_weighted_distances.tsv"
test_tip_attributes_file = "../results/builds/natural/natural_sample_1_with_90_vpm_sliding_test_tree/tip_attributes_with_weighted_distances.tsv"

df = pd.read_table(tip_attributes_file)
test_df = pd.read_table(test_tip_attributes_file)

In [4]:
# Collect all distinct accessions across both datasets.
accessions = sorted(
    set(df["accession"]) | set(test_df["accession"])
)

In [5]:
len(accessions)

20944

In [6]:
# Collect accessions into search strings that can be copied
# and pasted into the GISAID search field without exceeding
# the maximum field length.
batches = []
current_batch = []

for accession in accessions:
    # Append the accession if the resulting search string, including
    # the separating space, stays below the maximum length allowed.
    if len(" ".join(current_batch)) + len(accession) + 1 < max_length:
        current_batch.append(accession)
    else:
        # Otherwise, adding this accession would reach or exceed the
        # maximum length, so store the current batch and start a new
        # one with the current accession.
        batches.append(current_batch)
        current_batch = [accession]

# Append the final batch.
if len(current_batch) > 0:
    batches.append(current_batch)
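
To see how this greedy batching behaves, the same logic can be wrapped in a function and run on toy input. This is a minimal sketch: the function name `batch_accessions` and the toy accessions are hypothetical, and the sketch adds one guard that the cell above does not have, so that an accession longer than the limit does not produce an empty batch.

```python
def batch_accessions(accessions, max_length):
    """Greedily group accessions so that each space-joined batch
    stays strictly below max_length characters."""
    batches = []
    current_batch = []

    for accession in accessions:
        if len(" ".join(current_batch)) + len(accession) + 1 < max_length:
            current_batch.append(accession)
        else:
            # Guard added for this sketch: skip storing an empty batch.
            if current_batch:
                batches.append(current_batch)
            current_batch = [accession]

    # Store the final, possibly partial, batch.
    if current_batch:
        batches.append(current_batch)

    return batches


# Toy accessions of four characters each with a ten-character limit:
# two accessions plus one space use nine characters, a third would not fit.
toy_batches = batch_accessions(["EPI1", "EPI2", "EPI3", "EPI4", "EPI5"], 10)
```

With this input, the accessions are grouped in pairs and the last accession ends up in a batch of its own.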

In [7]:
# Create strings for each list of accessions in a batch.
batch_strings = [
    " ".join(batch)
    for batch in batches
]

In [8]:
# Count the number of accessions per batch as a quality control check for the user.
sequences_per_batch = [
    len(batch)
    for batch in batches
]

In [9]:
batch_df = pd.DataFrame({
    "batch": list(range(len(batch_strings))),
    "sequences_in_batch": sequences_per_batch,
    "accessions": batch_strings,
})

In [10]:
batch_df.head()

   batch  sequences_in_batch                                         accessions
0      0                  90  EPI1000654 EPI1000752 EPI1001376 EPI1001778 EP...
1      1                  90  EPI1016405 EPI1016421 EPI1016437 EPI1016445 EP...
2      2                  90  EPI1021970 EPI1021973 EPI1021998 EPI1022031 EP...
3      3                  91  EPI1026974 EPI1026982 EPI1026990 EPI1026998 EP...
4      4                  90  EPI1035183 EPI1035199 EPI1035207 EPI1035215 EP...


In [11]:
# Confirm that we got all of the accessions in this data frame.
assert batch_df["sequences_in_batch"].sum() == len(accessions)
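
The assertion above checks only the total count. Two further checks may be useful: that no search string exceeds the field limit, and that splitting the strings recovers exactly the input accessions. The sketch below is one possible way to do this; the function name `validate_batches` is hypothetical and the example uses toy data rather than the real accessions.

```python
def validate_batches(batch_strings, accessions, max_length):
    """Return a list of human-readable problems; an empty list
    means that all checks passed."""
    problems = []

    # Every search string must fit in the search field.
    for i, batch_string in enumerate(batch_strings):
        if len(batch_string) > max_length:
            problems.append(f"batch {i} has {len(batch_string)} characters")

    # Splitting the strings must recover the original accessions,
    # with none missing and none duplicated.
    recovered = [
        accession
        for batch_string in batch_strings
        for accession in batch_string.split(" ")
    ]
    if sorted(recovered) != sorted(accessions):
        problems.append("batched accessions differ from the input accessions")

    return problems


toy_accessions = ["EPI1", "EPI2", "EPI3"]
ok_problems = validate_batches(["EPI1 EPI2", "EPI3"], toy_accessions, 10)
bad_problems = validate_batches(["EPI1 EPI2 EPI3"], toy_accessions, 10)
```

In the notebook itself, a call along the lines of `validate_batches(batch_strings, accessions, max_length)` would be expected to return an empty list.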

In [12]:
# Save batches to a CSV file.
batch_df.to_csv(
    "../data/gisaid_batches.csv",
    index=False
)
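
After downloading a batch, the number of sequences received can be compared against the `sequences_in_batch` column. The following sketch counts FASTA records by their header lines. It assumes the download is saved in FASTA format; the function name `count_fasta_records` and the in-memory example are hypothetical stand-ins for a real downloaded file.

```python
import io


def count_fasta_records(handle):
    """Count sequence records in an open FASTA file by counting
    the header lines, which start with '>'."""
    return sum(1 for line in handle if line.startswith(">"))


# Toy in-memory FASTA standing in for one downloaded batch.
example_fasta = io.StringIO(">seq1\nACGT\n>seq2\nACGT\nTTGA\n")

# In practice, this value would come from the sequences_in_batch column.
expected_sequences = 2
observed_sequences = count_fasta_records(example_fasta)
batch_is_complete = observed_sequences == expected_sequences
```

For a file on disk, the same function can be given the handle returned by `open`.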