A pipeline for analyzing non-synonymous mutations within the SARS-CoV-2 genome (i.e., hCov-19 or the virus causing COVID-19) relative to global location and temporal periods since the initial emergence in late 2019.
Uses sequence metadata as submitted to the GISAID EpiCoV(tm) project.
Sequence acknowledgements and additional information can be found at https://github.com/gcalab/files/tree/master/F1000Research.
The default data directory is ./data, local to the project.
To use a different data directory, specify the data directory in .env file.
See .env_template for an example. This file may be copied and renamed to .env.
Specify the GISAID sequence metadata version as the variable METADATA_VERSION) in.env.
The environment variable METADATA_VERSION is typically the GISAID download date,
but may be any string that matches the metadata file
The instructions assume the metadata_tsv_{METADATA_VERSION}.tar.xz archive from GISAID is uncompressed.
This should be placed ./{DATA_DIR}/external.
If the archive has already been decompressed, rename the resulting metadata.tsv file
to metadata_{METADATA_VERSION}.tsv and place in the ./{DATA_DIR}/raw directory
Create a virtualenv using your favorite python version manager.
Then,
pip install -r requirements.pywithin the main project directory,
pip install -e .Then make the covmut_seasonal/main.py executable
chmod +x ./covmut_seasonal/main.py`NOTE:
Tar and LZMA (*.xz) must be installed on the system in order to extract the file.
CoVMutSeasonal
├─ LICENSE MIT License
├─ covmut_seasonal Source Code Directory (i.e., ./src)
│ ├─ __init__.py
│ ├─ extract.py Extract from Source Data (`./{DATA_DIR}/external`)
│ ├─ settings.py General Project Code Settings
│ ├─ transform.py Dataset Transformations
│ └─ visualizations.py Figures, Tables, etc.
│
├─ data Optional (see "Setup")
│ ├─ external Source Data
│ ├─ interim Intermediate "Cache" (e.g., Preprocessed Data)
│ ├─ processed "Production Ready" Data for Analysis
│ └─ raw Raw Data "Working" Data
│
├─ notebooks Jupyter Notebooks
├─ reports Analyses and Documents
│ └─ figures Visualizations
│
├─ .env Environment Variables ("Dot-Env" )
├─ requirements.txt Required Python Packages
└─ setup.py Python Setup file
Adapted from "Data Science 'cookiecutter' Template".
-
Set
run_latest.shas an executable.chmod +x ./run_latest.sh
-
Run.
./run_latest.sh
If the metadata file is already decompressed and in {DATA_DIR}/raw/ comment out the first line.
Note: Refer to help documentation for an overview of optional arguments.
./covmut_seasonal/main.py --help-
Manually download
metadata_tsv_{year}_{month}_{day}.tar.xzfrom GISAID; store in./{DATA_DIR}/external -
Extract
metadata.tsvfile to./data/rawdirectory../covmut_seasonal/main.py extract metadata
This step can be skipped if the extracted file already exists as
./{DATA_DIR}/raw/metadata_{METADATA_VERSION}.tsv -
Transform Country Latitudes into climate zones.
./covmut_seasonal/main.py transform geoclimate
-
Transform extracted metadata into usable format.
./covmut_seasonal/main.py transform metadata
-
Get Substitution Counts
./covmut_seasonal/main.py count substitutions
-
Get Sequence Counts
./covmut_seasonal/main.py count sequences
-
Extract AA Substitutions as Sequence Features creating feature maps
./covmut_seasonal/main.py extract features
-
Count AA Substitutions per geo-temporal interval
./covmut_seasonal/main.py count geotemporal
-
Visualize geo-temporal Incidence of AA Substitutions
./covmut_seasonal/main.py visualize
Data files are written to the directory ./{DATA_DIR}/processed/{METADATA_VERSION}.
Figures are written to the directory ./reports/figures/{METADATA_VERSION}.
| Original Field | Expanded Fields | New Fieldname | DataType | Purpose |
|-------------------|-------------------|-------------------|-------------|-----------------|
| Accession ID | | accession_id | str | Index |
| Virus name | | sample_name | str | Validation |
| | Virus Code | sample_virus | | Extraneous Info |
| | Location Code | sample_location | | Extraneous Info |
| | Sample Code | sample_code | | Extraneous Info |
| | Year Code | sample_year | | Extraneous Info |
| AA Substitution | | aa_substitutions | set(str, …) | Required |
| Collection date | | collection_date | str | Validation |
| | Collection Period | collection_period | int | Required |
| | Collection Year | collection_year | int | Validation |
| | Collection Month | collection_month | int | Validation |
| | Collection Day | collection_day | int | Validation |
| Submission date | | submission_date | str | Validation |
| | Submission Period | submission_period | int | Required |
| | Submission Year | submission_year | int | Validation |
| | Submission Month | submission_month | int | Validation |
| | Submission Day | submission_day | int | Validation |
| Location | | location | str | Validation |
| | Continent | continent | str | Required |
| | Country | country | str | Required |
| | Region | region | str | Required |
| | Sub-Region | subregion | str | Required |
| Host | | host | str | Filter; Removed |
| Is complete? | | is_complete | bool | Filter; Removed |
| Is high coverage? | | is_high_coverage | bool | Filter; Removed |
| Clade | | clade | str | Extraneous Info |
| Variant | | variant | str | Extraneous Info |