<a href="https://colab.research.google.com/github/ZooPhy/zoophy-utils/blob/master/notebooks/SARS_CoV_2_GISAID_to_ZooPhy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Format GISAID SARS-CoV-2 FASTA files for Bayesian discrete virus phylogeography in the ZooPhy pipeline

This notebook converts [GISAID](https://www.gisaid.org/) FASTA files that contain viral sequences for SARS-CoV-2 (a.k.a. COVID19, novel coronavirus, nCoV 2019, hCoV-19) to the FASTA format accepted by the [ZooPhy](https://zodo.asu.edu/zoophy/) pipeline for Discrete Virus Phylogeography.  

You can run this Python notebook in Google Colab (click the *Colab* badge above) or in your own Jupyter instance.
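The upload and download cells in this notebook rely on ```google.colab```, which can only be imported inside Colab. The following is a minimal sketch of one way to obtain the two input files when running in a plain Jupyter instance instead. The helpers ```running_in_colab``` and ```find_inputs``` are hypothetical names introduced here, not part of the notebook, and the sketch assumes both input files have already been copied into the working directory. Note that ```clear_file_system()``` in the upload cell deletes files in the working directory, so a local run is best done in a scratch directory.

```python
import os

def running_in_colab():
    """Return True when the google.colab package can be imported."""
    try:
        import google.colab  # noqa: F401  (only available inside Colab)
        return True
    except ImportError:
        return False

def find_inputs(directory="."):
    """Hypothetical helper: return the first .zip and .xls file names found in a directory."""
    names = sorted(os.listdir(directory))
    zipf = next((n for n in names if n.lower().endswith(".zip")), None)
    xlsf = next((n for n in names if n.lower().endswith(".xls")), None)
    return zipf, xlsf

if not running_in_colab():
    # Outside Colab there is no upload widget, so look in the working directory
    zipf, xlsf = find_inputs()
    print("zip file:", zipf, "| xls file:", xlsf)
```

Outside Colab, the final ```files.download(OUTPUT_FILE)``` call would also need to be skipped; the output file is simply left in the working directory.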

## Requirements
This notebook requires two inputs:

1. ```FASTA files``` (GISAID sequence files in FASTA format compressed into a .zip file)
  * To download GISAID FASTA files, go to [gisaid.org](https://www.gisaid.org/) and sign in (register first if you do not have an account).
  * Click the ```EpiCoV``` tab to explore the novel coronavirus sequences uploaded by other researchers.
  * Filter the entries for your study (e.g. *Location*, *Sequence Length*, *Collection Date*), then open each sequence record and click ```Download FASTA```. There does not appear to be an option to download multiple files in a single action.
  * Compress the downloaded files into a single *.zip* file.

2. ```Acknowledgement Table``` for the submissions. This is a file called *gisaid_acknowledge_table.xls* that is available for download below the table of search results. It contains details about the sequences uploaded to GISAID. The notebook extracts the *Collection Date* and *Geographic Location* from this table.
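
Compressing the downloaded FASTA files (input 1) can be done with any archive tool. As one option, here is a minimal Python sketch using the standard ```zipfile``` module. The helper name ```bundle_fasta``` is hypothetical, and the sketch assumes the downloaded files carry a ```.fasta``` extension and sit together in one folder.

```python
import zipfile
from pathlib import Path

def bundle_fasta(src_dir, zip_path):
    """Hypothetical helper: add every *.fasta file in src_dir to a flat .zip archive."""
    fasta_files = sorted(Path(src_dir).glob("*.fasta"))
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in fasta_files:
            # arcname drops the folder part so the archive contains no subdirectories
            zf.write(path, arcname=path.name)
    # return the names that were archived, for a quick visual check
    return [p.name for p in fasta_files]
```

For example, ```bundle_fasta("gisaid_downloads", "gisaid_fasta.zip")``` would archive the files in a folder named *gisaid_downloads*; both names are placeholders for your own paths.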



# Upload required files

Run the following cell (click the play button) to import the required libraries and define simple utility functions. The cell also displays a button for uploading the ```.zip``` file and the ```.xls``` table. Select both files in this step.

In [0]:
import os
import zipfile
import pandas as pd
from os import listdir, remove, removedirs, makedirs
from os.path import join, exists, isdir, isfile
from google.colab import files
import shutil

def show_files():
    print("Current Directory Contents:", [x for x in os.listdir() if x[0] not in ['.', '_']])

def clear_file_system():
    # delete everything in the working directory except hidden/private entries
    # (names starting with '.' or '_'), matching the filter used in show_files()
    for x in os.listdir():
        if x[0] in ['.', '_']:
            continue
        if isfile(x):
            remove(x)
        else:
            shutil.rmtree(x, ignore_errors=True)

# remove unnecessary files and upload GISAID files
clear_file_system()
print("Upload the .zip file containing fasta files and .xls file containing the Acknowledgement table")
uploaded = files.upload()
# pick the first uploaded file with each extension (None if it is missing),
# so that zipf and xlsf are always defined for the next cell
zipf = next((x for x in uploaded if x.lower().endswith('.zip')), None)
xlsf = next((x for x in uploaded if x.lower().endswith('.xls')), None)
if zipf and xlsf:
    show_files()
    print("\n\nzip file selected: '"+zipf+"'")
    print("xls file selected: '"+xlsf+"'")
else:
    print("\n\nPlease select both .zip and .xls files.")

Running the following cell extracts the FASTA files from the .zip file into a new directory and loads the metadata from the .xls file.

In [0]:
ZIP_DIR = 'fasta_files'
# empty fasta directory if exists
if exists(ZIP_DIR):
    shutil.rmtree(ZIP_DIR, ignore_errors=True)
# create directory
makedirs(ZIP_DIR)
# unzip the zipfile
if zipf and xlsf:
    try:
        with zipfile.ZipFile(zipf, 'r') as zip_ref:
            filelist = [x for x in zip_ref.namelist() if x.strip() and x.strip()[0] not in ["_", "."] and x.strip().lower().endswith('fasta')]
            print("Extracting from", zipf, filelist)
            for fl in filelist:
                print("Extracting", fl)
                zip_ref.extract(fl, path=ZIP_DIR)
                if "/" in fl:
                    # flatten: move the file into ZIP_DIR, then drop the emptied subdirectory
                    src = join(ZIP_DIR, fl)
                    print("Moving", src)
                    shutil.move(src, join(ZIP_DIR, os.path.basename(fl)))
                    shutil.rmtree(join(ZIP_DIR, fl.split("/")[0]), ignore_errors=True)
    except Exception as e:
        print("Exception", e, " while extracting file", zipf)
    try:
        xdf = pd.read_excel(xlsf, header=2, index_col=0)
        print("Loaded Acknowledgement Table with ", list(xdf.columns))
    except Exception as e:
        print("Exception", e, " while processing ", xlsf)
else:
    print("It seems as though the zip and xls files don't exist in Runtime. Please run previous cells again.")


# Process and download FASTA file

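This step rewrites each sequence header as ```>accession|location|collection date```, writes all sequences to *zoophy.fasta*, and downloads the file. Sequences without a complete collection date (year, month, and day) are skipped.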
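
The header rewrite can be illustrated in isolation. The sketch below mirrors the logic of the cell that follows: the location is split on ```/``` and reversed so the most specific place comes first, and a ```YYYY-MM-DD``` date becomes ```DD-Mon-YYYY```. The function name ```to_zoophy_header``` is hypothetical, and the accession, location, and date in the example call are made-up placeholder values, not real GISAID records.

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def to_zoophy_header(accession, location, collection_date):
    """Build '>accession|location|DD-Mon-YYYY'; return None for incomplete dates."""
    parts = collection_date.strip().split("-")
    if len(parts) != 3:
        # dates such as '2020' or '2020-01' lack the day and cannot be converted
        return None
    year, month, day = parts
    # reverse 'Continent / Country / Region' into 'Region, Country, Continent'
    place = ", ".join(reversed([p.strip() for p in location.split("/")]))
    date = "-".join([day, MONTHS[int(month) - 1], year])
    return ">" + "|".join([accession, place, date])

# Placeholder values for illustration only
print(to_zoophy_header("EPI_ISL_000000", "North America / USA / Arizona", "2020-03-05"))
# prints: >EPI_ISL_000000|Arizona, USA, North America|05-Mar-2020
```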

In [0]:
OUTPUT_FILE = 'zoophy.fasta'
MONTH_DICT = {"01":"Jan","02":"Feb","03":"Mar","04":"Apr","05":"May","06":"Jun","07":"Jul","08":"Aug","09":"Sep","10":"Oct","11":"Nov","12":"Dec"}
count_written = 0
with open(OUTPUT_FILE, "w") as ofile:
    for fastafile in listdir(ZIP_DIR):
        try:
            defn_added = False
            with open(join(ZIP_DIR, fastafile)) as ifile:
                for line in ifile:
                    if line[0] == ">":
                        # new record: skip its sequence lines unless its header gets written
                        defn_added = False
                        accession = line.split("|")[-1].strip()
                        row = xdf.loc[accession]
                        location = ", ".join(reversed([x.strip() for x in row['Location'].split("/")]))
                        # skip records without a full YYYY-MM-DD collection date
                        if len(row['Collection date'].split("-")) < 3:
                            continue
                        year, month, day = row['Collection date'].split("-")
                        month = MONTH_DICT[month]
                        collection_date = "-".join([day, month, year])
                        defn_line = ">"+"|".join([accession, location, collection_date])
                        print(defn_line.strip(), file=ofile)
                        count_written += 1
                        defn_added = True
                    else:
                        if defn_added:
                            print(line.strip(), file=ofile)
            print("", file=ofile)
        except Exception as e:
            print("Exception", e, " while processing fasta file", fastafile, " Hence, skipping.")

print("Formatted ", count_written, "/", len(listdir(ZIP_DIR)), ". Skipped others due to missing data or formatting errors.")
# !grep EPI zoophy.fasta

files.download(OUTPUT_FILE)