---
kernelspec:
    name: python3
    display_name: python3
---

# Retrieving ASFV Genome Data

:::{admonition} Objective
Fetch all available African swine fever virus (ASFV) genome assemblies from the NCBI database.
:::

```{code-cell} python
:tags: remove-input
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
```

## Querying the NCBI Assembly Database

[](#fetch-assemblies) queries the NCBI Assembly database for all available ASFV genome assemblies. The `efetch` command returns the assembly metadata as XML document summaries, which are saved to `assemblies.tsv` (the contents are XML despite the `.tsv` extension).

:::{code-block} bash
:label: fetch-assemblies
:caption: Retrieve ASFV assemblies from NCBI.
esearch -db assembly -query "African swine fever virus" \
    | efetch -format docsum > ./assets/flatfiles/assemblies.tsv
:::

## Parsing Metadata

[](#extract-fields) parses the XML file, extracts the fields listed below, and saves the output to a CSV file:

- AssemblyAccession
- AssemblyName
- AssemblyStatus
- AsmReleaseDate_GenBank


:::{code-block} bash
:label: extract-fields
:caption: Extract fields from XML data.

cat ./assets/flatfiles/assemblies.tsv \
    | xtract -pattern DocumentSummary -sep "," -element AssemblyAccession,AssemblyName,AssemblyStatus,AsmReleaseDate_GenBank \
    | align-columns > ./assets/flatfiles/assemblies.csv
:::
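
For readers who prefer to stay in Python, the same field extraction can be sketched with the standard library. This is only a minimal sketch: it assumes each record is wrapped in a `DocumentSummary` element that contains the four fields, and the sample record and the helper name `extract_fields` are hypothetical placeholders rather than real NCBI output.

```python
import xml.etree.ElementTree as ET

FIELDS = [
    "AssemblyAccession",
    "AssemblyName",
    "AssemblyStatus",
    "AsmReleaseDate_GenBank",
]

# Hypothetical, heavily trimmed docsum record (placeholder values, not real NCBI data)
SAMPLE_XML = """<DocumentSummarySet>
  <DocumentSummary>
    <AssemblyAccession>GCA_000000000.1</AssemblyAccession>
    <AssemblyName>ASM_example</AssemblyName>
    <AssemblyStatus>Complete Genome</AssemblyStatus>
    <AsmReleaseDate_GenBank>2020/01/01 00:00</AsmReleaseDate_GenBank>
  </DocumentSummary>
</DocumentSummarySet>"""


def extract_fields(xml_text, fields=FIELDS):
    """Return one list of field values per DocumentSummary element."""
    root = ET.fromstring(xml_text)
    rows = []
    for summary in root.iter("DocumentSummary"):
        # findtext returns None for a missing element; fall back to an empty string
        rows.append([(summary.findtext(f".//{name}") or "").strip() for name in fields])
    return rows


rows = extract_fields(SAMPLE_XML)
for row in rows:
    print(",".join(row))
```

With real data, the text passed to `extract_fields` would be the contents of `./assets/flatfiles/assemblies.tsv`, provided that file parses as a single XML document.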

We can view the first 10 rows by invoking the `head` command.

:::{code-cell} python
:label: view-assemblies

!head -n 10 ./assets/flatfiles/assemblies.csv
:::

## Filtering for Complete Assemblies

Only complete assemblies will be used to construct the pangenome graph.

```{code-block} bash
:label: filter-complete-genomes
:caption: Filter for accessions with complete genome assemblies.

OUT=./assets/flatfiles/complete_assemblies.csv
echo "accession,assembly_id,status,date_published" > "${OUT}"
# Keep only rows with status "Complete Genome"; all four columns are kept to match the header
grep "Complete Genome" ./assets/flatfiles/assemblies.csv >> "${OUT}"
```
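
The filtering step can also be sketched in Python with the `csv` module, which compares the status column exactly instead of matching the phrase anywhere in the line. This is a minimal sketch under assumptions: the rows follow the column order accession, name, status, date, and the sample rows and the helper name `filter_complete` are hypothetical.

```python
import csv
import io

HEADER = ["accession", "assembly_id", "status", "date_published"]


def filter_complete(lines, status="Complete Genome"):
    """Keep only rows whose third column equals the given status."""
    rows = []
    for row in csv.reader(lines):
        # Compare the status column exactly rather than searching the whole line
        if len(row) >= 3 and row[2].strip() == status:
            rows.append([cell.strip() for cell in row])
    return rows


# Hypothetical rows in the same column order as assemblies.csv (placeholder values)
sample = io.StringIO(
    "GCA_000000001.1,ASM_a,Complete Genome,2020/01/01 00:00\n"
    "GCA_000000002.1,ASM_b,Scaffold,2021/06/15 00:00\n"
    "GCA_000000003.1,ASM_c,Complete Genome,2022/03/09 00:00\n"
)
complete = filter_complete(sample)

# Write the header followed by the filtered rows, mirroring complete_assemblies.csv
out = io.StringIO()
writer = csv.writer(out, lineterminator="\n")
writer.writerow(HEADER)
writer.writerows(complete)
print(out.getvalue(), end="")
```

To run this on the real data, an open file handle for `./assets/flatfiles/assemblies.csv` could replace the `io.StringIO` sample.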

```{code-block} bash
:label: count-genomes
:caption: Count the number of assembled genomes per year.

cat ./assets/flatfiles/assemblies.csv \
    | awk -F , '{print $4}' | awk -F / '{print $1}' \
    | sort -n -r | uniq -c | head -n -1 \
    | awk '{$1=$1;print}' | tr " " "," > ./assets/flatfiles/counts.csv
```

```{code-cell} python
:tags: remove-input
:label: genome-counts

genome_counts = pd.read_csv("./assets/flatfiles/counts.csv", header=None, names=["count", "year"]).sort_values(by="year", ascending=True)

fig, ax = plt.subplots(figsize=(10,6))
ax.bar(x=genome_counts["year"], height=genome_counts["count"])
ax.set_xlabel("Year")
ax.set_ylabel("Genome count")
ax.set_title("No. of Assembled ASFV Genomes Per Year") 

for rect in ax.patches:
    y_value = rect.get_height()
    x_value = rect.get_x() + rect.get_width() / 2
    space = 1
    label = "{:.0f}".format(y_value)
    ax.annotate(label, (x_value, y_value), xytext=(0, space), textcoords="offset points", ha='center', va='bottom')

plt.show()
```

Running [](#count-genomes) counts the ASFV assemblies released in each year. At the time of writing, a total of {eval}`genome_counts['count'].sum()` ASFV genomes are available, 347 of which are complete assemblies.
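
The per-year aggregation can also be sketched in pure Python. This is an illustrative sketch, assuming release dates are formatted like `YYYY/MM/DD hh:mm` (as the `awk -F /` step implies); the sample dates and the helper name `count_by_year` are hypothetical.

```python
from collections import Counter


def count_by_year(dates):
    """Count entries per year for dates formatted like 'YYYY/MM/DD hh:mm'."""
    # Take the text before the first slash and skip blank entries
    years = (d.split("/")[0].strip() for d in dates if d.strip())
    # Ignore anything that is not a plain number (for example a stray header)
    return Counter(int(y) for y in years if y.isdigit())


# Hypothetical release dates, including one blank entry that is skipped
sample_dates = ["2020/01/01 00:00", "2020/05/12 00:00", "2022/03/09 00:00", ""]
counts = count_by_year(sample_dates)
for year in sorted(counts):
    print(year, counts[year])
```

A `Counter` keeps the result keyed by year, so it can be turned into the same two-column table that `counts.csv` holds.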

## Downloading Genomes

Extract the first column of the filtered CSV file (`complete_assemblies.csv`) to get a list of accessions. As shown in [](#download-all-genomes), `datasets` is then used to request these assemblies from the NCBI server.

```{code-block} bash
:label: download-all-genomes
:caption: Request all assemblies from the NCBI server using `datasets`.

# Create a directory for genome assemblies
mkdir -p assets/genomes

# Extract all assembly accessions and remove header
tail -n +2 assets/flatfiles/complete_assemblies.csv \
    | awk -F , '{ print $1 }' > assets/flatfiles/accessions.txt

# Download all assemblies from NCBI
datasets download genome accession --inputfile assets/flatfiles/accessions.txt \
    --include genome --filename assets/genomes.zip
``` 

All downloaded genomes are stored in the zip archive `assets/genomes.zip`, which holds about 20 MB of FASTA files. [](#concat-all-fasta) unzips the archive and merges all FASTA files into a single, large FASTA file.

```{code-block} bash
:label: concat-all-fasta
:caption: Merge all FASTA files using `cat`.
# Unzip downloaded directory
unzip assets/genomes.zip -d assets/genomes

# Merge all FASTA files into a single file
cat assets/genomes/ncbi_dataset/data/**/*.fna > assets/genomes.fasta
```

As a sanity check, confirm that there are 347 header lines in `genomes.fasta`:

```{code-cell} python
!cat assets/genomes.fasta | grep ">" | wc -l
```
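
The same check can be sketched in Python so that the header count is compared against the accession list instead of a hard-coded number. This is a minimal sketch: the in-memory FASTA, the accession list, and the helper name `count_fasta_headers` are hypothetical, and it assumes each assembly contributes exactly one sequence record.

```python
import io


def count_fasta_headers(handle):
    """Count lines that start with '>' (one per FASTA record)."""
    return sum(1 for line in handle if line.startswith(">"))


# Hypothetical two-record FASTA used only to illustrate the check
sample_fasta = io.StringIO(">seq1 example\nACGT\nACGT\n>seq2 example\nGGCC\n")
sample_accessions = ["GCA_000000001.1", "GCA_000000002.1"]

n_headers = count_fasta_headers(sample_fasta)
print(n_headers, "headers for", len(sample_accessions), "accessions")
```

With the real files, the handle would come from opening `assets/genomes.fasta`, and the accession list from the lines of `assets/flatfiles/accessions.txt`. Unlike `grep ">"`, this sketch only counts lines that begin with `>`.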