# Download Files From GenBank

## Read file with biosample numbers

This project includes three sequencing runs. The 2016 run is contained in BioProject PRJNA398660, while the 2017 and 2019 samples are both contained in BioProject PRJNA699561 (but came from two different Illumina runs). We will process each run separately and then merge the results.

The accession numbers of the BioSamples used in the study are in the file `SraRunTable.txt` in the `maps` directory.

For each sample, the file lists the SRA run accession, the BioSample accession, the sample name that we will use in the analyses, and the group that the sample belongs to.
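
As a rough illustration of that layout, the sketch below parses a few made-up lines in the same four-column, whitespace-separated form. The accession numbers, sample names, header text, and the helper name `parse_run_table` are hypothetical; only the column order and the `data1`/`data2` group labels are taken from the code in this notebook.

```python
import io
from collections import defaultdict

def parse_run_table(handle):
    """Return {group: {sample_name: run_accession}} from a run table."""
    groups = defaultdict(dict)
    next(handle, None)  # skip the header line
    for line in handle:
        fields = line.split()
        if len(fields) != 4:
            continue  # ignore blank or malformed lines
        run, biosample, name, group = fields
        groups[group][name] = run
    return dict(groups)

# Made-up example content; the real accessions live in maps/SraRunTable.txt
example = io.StringIO(
    "Run BioSample SampleName Group\n"
    "SRR0000001 SAMN00000001 sampleA data1\n"
    "SRR0000002 SAMN00000002 sampleB data2\n"
)
table = parse_run_table(example)
print(table)
```

Unlike the fixed three-dictionary approach, this variant accepts any number of group labels, which may or may not be desirable for this project.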

To download the sequences we use `fastq-dump` from NCBI's SRA Toolkit (`sra-tools`), which must be installed first:

```
conda install -c bioconda sra-tools
```
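
Before starting a long download, it can help to confirm that `fastq-dump` is actually visible from the notebook's environment. The following is a minimal sketch using the standard library; the helper name `find_tool` is an assumption for illustration.

```python
import shutil

def find_tool(name="fastq-dump"):
    """Return the full path to an executable on PATH, or None if it is missing."""
    return shutil.which(name)

path = find_tool()
if path is None:
    print("fastq-dump not found; install sra-tools or activate the right conda environment")
else:
    print("Using fastq-dump at", path)
```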

This notebook downloads all the files, renames them, gzips them, and creates the manifest files used in the QIIME 2 analysis in the next notebook.

In [None]:
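# Map sample name -> SRA run accession, one dictionary per group
group1 = {}
group2 = {}
group3 = {}

with open("maps/SraRunTable.txt", 'r') as filein:
    filein.readline()  # skip the header line
    for line in filein:
        # Columns: run accession, BioSample accession, sample name, group
        run, sample, name, group = line.strip().split()
        if group == 'data1':
            group1[name] = run
        elif group == 'data2':
            group2[name] = run
        elif group == 'data3':
            group3[name] = run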

In [None]:
import os

def download(accessions, location):
    """Download, rename, and gzip the paired FASTQ files of each sample.

    Returns the five exit codes of the last sample processed, or an
    error message string as soon as any command fails.
    """
    # Create the output folder once, before any download starts
    os.makedirs(location, exist_ok=True)
    status = ()  # stays empty if there are no accessions to process
    for name in accessions:
        acc = accessions[name]
        print("Downloading: {}".format(name))
        s1 = os.system("fastq-dump -O {folder} --split-files {acc}".format(folder=location, acc=acc))
        # Rename the files from run accession to sample name
        s2 = os.system("mv {folder}/{acc}_2.fastq {folder}/{name}_2.fastq".format(folder=location, acc=acc, name=name))
        s3 = os.system("mv {folder}/{acc}_1.fastq {folder}/{name}_1.fastq".format(folder=location, acc=acc, name=name))
        s4 = os.system("gzip {folder}/{name}_1.fastq".format(folder=location, name=name))
        s5 = os.system("gzip {folder}/{name}_2.fastq".format(folder=location, name=name))
        status = (s1, s2, s3, s4, s5)
        if any(code != 0 for code in status):
            return "Error in trial: {} {} {} {} {} {}\t{}\n".format(s1, s2, s3, s4, s5, name, acc)
    return status
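
The download step above passes formatted strings to the shell, which can misbehave if a sample name contains spaces or shell metacharacters. One possible alternative, sketched here under that assumption, builds each command as an argument list suitable for `subprocess.run`. The helpers `build_commands` and `run_commands` are hypothetical and the accession is made up; the demo only prints the commands and does not execute them.

```python
import subprocess

def build_commands(acc, name, folder):
    """Return the download, rename, and gzip commands for one sample as argument lists."""
    cmds = [["fastq-dump", "-O", folder, "--split-files", acc]]
    for read in ("1", "2"):
        src = "{}/{}_{}.fastq".format(folder, acc, read)
        dst = "{}/{}_{}.fastq".format(folder, name, read)
        cmds.append(["mv", src, dst])
        cmds.append(["gzip", dst])
    return cmds

def run_commands(cmds):
    """Run commands in order; check=True raises CalledProcessError at the first failure."""
    for cmd in cmds:
        subprocess.run(cmd, check=True)

# Dry run: show the commands for a made-up accession without executing anything
cmds = build_commands("SRR0000001", "sampleA", "data")
for cmd in cmds:
    print(" ".join(cmd))
```

Stopping at the first failing command avoids renaming or compressing files that were never downloaded.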

In [None]:
download(group1, "data")

In [None]:
download(group2, "data")

In [None]:
download(group3, "data")

In [None]:
import os

def manifest(accessions, location, file):
    """Write a manifest listing the forward-read (_1) file of each sample."""
    here = os.getcwd()  # same result as the IPython-only `!pwd`
    with open(file, 'w') as fileout:
        fileout.write("sample-id\tabsolute-filepath\n")
        for name in accessions:
            path = "{}/{}/{}_1.fastq.gz".format(here, location, name)
            print(path)
            fileout.write("{sample}\t{path}\n".format(sample=name, path=path))
    return 0
    

In [None]:
manifest(group1, "data", "maps/manifest1.txt")
manifest(group2, "data", "maps/manifest2.txt")
manifest(group3, "data", "maps/manifest3.txt")
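
Since the manifests are written from the sample names rather than from the files on disk, a quick check that every listed path exists can catch failed downloads early. The sketch below is one way to do that; the helper name `missing_files` is hypothetical, and the demo uses a temporary directory with made-up sample names rather than the real `maps/manifest1.txt`.

```python
import csv
import os
import tempfile

def missing_files(manifest_path):
    """Return the sample IDs whose listed file does not exist on disk."""
    missing = []
    with open(manifest_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            if not os.path.exists(row["absolute-filepath"]):
                missing.append(row["sample-id"])
    return missing

# Self-contained demo: one file present, one absent
with tempfile.TemporaryDirectory() as tmp:
    present = os.path.join(tmp, "sampleA_1.fastq.gz")
    open(present, "w").close()
    demo = os.path.join(tmp, "manifest.txt")
    with open(demo, "w") as out:
        out.write("sample-id\tabsolute-filepath\n")
        out.write("sampleA\t{}\n".format(present))
        out.write("sampleB\t{}\n".format(os.path.join(tmp, "sampleB_1.fastq.gz")))
    result = missing_files(demo)
    print(result)
```

In the notebook itself, the same function could be called on each of the three manifest files after they are written.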