# Downloading data from Bonnal et al.

### Purpose of the script
I need to download 63 `fastq` datasets. I don't want to download them in a browser.
1. I don't have the storage on my laptop
2. I need them on the FARM
3. Downloading from a browser is not recommended
4. I don't want to click a button 63 times


### Starting point
There is a file provided which contains the links to all samples. It looks like this: (this is a single line)

In [None]:
SQ_0047	ERS403382	cell	Istituto Nazionale Genetica Molecolare Romeo ed Enrica Invernizzi	
Homo sapiens	fresh specimen	wild type genotype	blood	primary cell	differentiated	
CD4 Th1	CD4+ CXCR3+	P-MTAB-37618	P-MTAB-37619	P-MTAB-37620	SQ_0047	polyA RNA	
PAIRED	TRANSCRIPTOMIC	ncRNA-Seq	RANDOM	364	5'-3'-3'-5'	P-MTAB-37621	
Istituto Nazionale Genetica Molecolare Romeo ed Enrica Invernizzi, Milan, Italy	SQ_0047	sequencing assay	
ERX397847	SQ_0047_R1.fastq.gz	SQ_0047_R1.fastq.gz	ERR431577	
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR431/ERR431577/ERR431577_1.fastq.gz	
    1938ff117e3b4c01435d73b17c1aa957	102	101	CD4 Th1

It's as easy as writing a script that uses `curl` to download the data sequentially (or in parallel) and put it in the right place.

Thing is, I'm already tired of looking at all of that junk. I'm going to make the table a lot smaller so it's easier to play with:

`cat E-MTAB-2319.sdrf.txt | sed 's/ /_/g' | awk '{print $31 "\t" $11 "\t" $12 "\t" $29 "\t" $32}' > downLoadLinks.txt`

That is way better. Now the file looks like this:
 

In [None]:
ERR431622	CD4_Th1 	CD4+_CXCR3+	SQ_0007_R1.fastq.gz 	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR431/ERR431622/ERR431622_1.fastq.gz
ERR431597	CD4_Th17	CD4+_CCR6+_CD161+_CXCR3-	SQ_0011_R1.fastq.gz	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR431/ERR431597/ERR431597_1.fastq.gz

### File storage
My files are located on the farm in `/lustre/scratch117/cellgen/teamtrynka/Brie`. I am going to create a folder called `data/bonnal` to store the reads in there. 

## curl
I like curl. I can easily give the file a name and fill in the `url`. I'd need a loop to feed it the required name and `url`, but that's easy enough.
The curl command would go like this:

In [None]:
curl -o <filename> <ftp link>
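
For large files over a flaky connection, a slightly more defensive call might help. This is only a sketch: `--retry` and `-C -` are standard `curl` options, but the `download_one` function name and the `DRY_RUN` switch are my own illustrative additions, not part of the script in these notes.

```shell
#!/bin/bash
# Sketch only: download_one and DRY_RUN are illustrative names.

download_one() {
    local filename="$1"   # name to save the file under
    local url="$2"        # ftp link taken from the table

    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the command instead of running it
        echo "curl --retry 3 -C - -o $filename $url"
    else
        # --retry 3 : retry a few times on transient errors
        # -C -      : resume a partially downloaded file if one exists
        curl --retry 3 -C - -o "$filename" "$url"
    fi
}
```

Calling `DRY_RUN=1 download_one <filename> <ftp link>` prints the command without touching the network, which is handy for checking names before a real run.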

##### Loop though
How do I get the computer to read through the table line by line, feeding one column to the filename and another to the ftp link? Hm.

Ah yes. `while` loops. 

### `while` loops
A `while` loop reads through a file line by line. With `read` you can assign each column to a variable (here, single letters), and then use those variables to do what you want.

I made a tiny file:
`head -n5 downLoadLinks.txt | awk '{print $1 "\t" $2 "\t" $5}' > shortTable.txt` which produced this: 

In [None]:
Comment[ENA_RUN]        Characteristics[cell_type]      Comment[FASTQ_URI]
ERR431622	CD4_Th1     ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR431/ERR431622/ERR431622_1.fastq.gz
ERR431597	CD4_Th17    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR431/ERR431597/ERR431597_1.fastq.gz
ERR431628	CD4_Th2     ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR431/ERR431628/ERR431628_1.fastq.gz
ERR431579	CD4_Th2     ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR431/ERR431579/ERR431579_1.fastq.gz

This code will happily read the file:


In [None]:
#!/bin/bash

cd /Users/bh10/Documents/Rotation1/Data/RNASeq
# Read the three columns of each line into a, b and c
while read -r a b c; do
    filename="${b}_${a}"

    echo "Filename: ${filename}, linke:${c}"
done < shortTable.txt

# Resulting in this: 
Filename: Characteristics[cell_type]_Comment[ENA_RUN], linke:Comment[FASTQ_URI]
Filename: CD4_Th1_ERR431622, linke:ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR431/ERR431622/ERR431622_1.fastq.gz
Filename: CD4_Th17_ERR431597, linke:ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR431/ERR431597/ERR431597_1.fastq.gz
Filename: CD4_Th2_ERR431628, linke:ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR431/ERR431628/ERR431628_1.fastq.gz
Filename: CD4_Th2_ERR431579, linke:ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR431/ERR431579/ERR431579_1.fastq.gz

Should probably remove that top line though..
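
One way to drop that header row without editing the file is to consume it with a single `read` before the loop starts. A minimal sketch, assuming the input has the same three columns as `shortTable.txt` and exactly one header line; the `print_names` function name is hypothetical:

```shell
#!/bin/bash
# Sketch: print_names is an illustrative name. The input is assumed to be a
# three-column table (run, cell type, url) with one header line.

print_names() {
    local table="$1"
    local header a b c filename
    {
        read -r header            # consume and discard the header line
        while read -r a b c; do
            filename="${b}_${a}"
            echo "Filename: ${filename}, link:${c}"
        done
    } < "$table"
}
```

Running `print_names shortTable.txt` should then print one line per sample and nothing for the header.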

## Translating code back to the original file
The columns containing the information I need are: 31, 11 and 32. Let's just check:

In [None]:
head E-MTAB-2319.sdrf.txt  | tail -n +2 | sed 's/ /_/g' | awk '{print $31 "\t" $11 "\t" $32}'

## PRODUCES: 
ERR431622	CD4_Th1 	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR431/ERR431622/ERR431622_1.fastq.gz
ERR431597	CD4_Th17	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR431/ERR431597/ERR431597_1.fastq.gz
ERR431628	CD4_Th2 	ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR431/ERR431628/ERR431628_1.fastq.gz

Note I included a `tail` command, which removes the first line. But yes it works. So I can now write the proper `while` loop.
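
Rather than trusting hard-coded column numbers, a column's position can also be looked up from the header row. A sketch, assuming the table is tab-separated and has already had its spaces replaced with underscores, so the header names match the ones seen in the short table (`Comment[ENA_RUN]` and so on). `col_index` is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: col_index is an illustrative helper, not part of the original script.

col_index() {
    local name="$1" table="$2"
    # Read only the header line, split it on tabs, and print the position
    # of the first field that matches the requested name exactly.
    head -n 1 "$table" | awk -F'\t' -v name="$name" '{
        for (i = 1; i <= NF; i++) if ($i == name) { print i; exit }
    }'
}
```

For example, `col_index 'Comment[FASTQ_URI]' table.txt` prints the column number of the link column, or nothing if the name is not found. One caveat: splitting strictly on tabs counts empty fields, so the number can differ from what plain `awk` reports if any row has empty cells.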

## Writing proper loop
Okay, so note that the column numbers go up to 32. There is no way I am going to write out a list of 32 variable names just so I can pick the wrong one. Instead, I am going to include a line in the script that makes a three-column file first, then use that file to pull columns a, b and c.

### Minor thing
Something important is specifying R1 or R2, which this code does not do. I thought the bit that made each name unique was the ERR number, but I can also use the SQ file name in column 29 (which contains the read number too), and that's all good.

In [None]:
## New output: 
Filename: CD4_Treg_SQ_0023_R1.fastq.gz, linke:ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR431/ERR431609/ERR431609_1.fastq.gz
Filename: CD4_Th1_SQ_0046_R1.fastq.gz, linke:ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR431/ERR431584/ERR431584_1.fastq.gz
Filename: CD4_Th1_SQ_0047_R1.fastq.gz, linke:ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR431/ERR431577/ERR431577_1.fastq.gz

Sweet. 

### Full code

In [None]:
#!/bin/bash

dataTable="E-MTAB-2319.sdrf.txt"
tempFile="tempTinyDownloadFile.txt"

cd /Users/bh10/Documents/Rotation1/Data/RNASeq


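# Drop the header row, replace spaces with underscores, keep columns 11, 29 and 32
tail -n +2 "${dataTable}" | sed 's/ /_/g' | awk '{print $11 "\t" $29 "\t" $32}' > "${tempFile}"

# Build each filename from cell type (a) and read file name (b); c is the link
while read -r a b c; do
    filename="${a}_${b}"

    echo "Filename: ${filename}, link:${c}" >> testereno.txt
done < "${tempFile}"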

echo "fin"

## Decision time

When I run this on the FARM, do I want to do it all at once, or sequentially?
I think I shouldn't submit 126 jobs to the FARM yet.
I don't want to make anyone mad at me. 
I might split them into a few batches. And do it that way. 
I will review how to submit as an array on a day where I can submit before 4:30pm.

Okay batches.

### Editing the code for batches. 

Because I want to split the table into batches, I have to remove that `tail` command, or I will lose the first line of every batch.

I divided the 126 read files into 6 groups of 21: `batchA.txt` to `batchF.txt`.
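
The split itself can be done with `split`, which cuts a file into pieces of a fixed number of lines. A sketch, assuming the header row has already been removed from the table. The `make_batches` function name is hypothetical, and `split` names its pieces with suffixes `aa`, `ab`, ... rather than `batchA.txt`, so the files would need renaming to match.

```shell
#!/bin/bash
# Sketch: make_batches is an illustrative name.

make_batches() {
    local table="$1"            # table with one read file per line, no header
    local lines_per_batch="$2"  # e.g. 21
    local prefix="$3"           # output prefix, e.g. batch_
    # Writes <prefix>aa, <prefix>ab, ... each with lines_per_batch lines
    split -l "$lines_per_batch" "$table" "$prefix"
}
```

With a 126-line table, `make_batches <table> 21 batch_` would produce six files, `batch_aa` through `batch_af`.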

The only two variables are then the data table and the temp table: 


In [None]:
#!/bin/bash

dataTable="batchA.txt"
tempFile="tempA.txt"

cd /Users/bh10/Documents/Rotation1/Data/RNASeq

cat ${dataTable} |  sed 's/ /_/g' | awk '{print $11 "\t" $29 "\t" $32}' > ${tempFile}

# Print (not run) one curl command per row, to check before downloading
while read -r a b c; do
    filename="${a}_${b}"

    echo curl -o "./${filename}" "$c"
done < "${tempFile}"

echo "fin"
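
To avoid editing the two variables by hand for each batch, the batch file could be passed as an argument instead. A sketch under that assumption: the `list_downloads` function name is hypothetical, it pipes straight into the loop instead of writing a temp file, and like the script above it only prints the `curl` commands rather than running them.

```shell
#!/bin/bash
# Sketch: list_downloads is an illustrative name. Column numbers 11, 29 and 32
# are the ones used in the script above.

list_downloads() {
    local dataTable="$1"
    local a b c
    # Replace spaces with underscores, keep cell type, read file name and link,
    # then print one curl command per row.
    sed 's/ /_/g' "$dataTable" |
        awk '{print $11 "\t" $29 "\t" $32}' |
        while read -r a b c; do
            echo curl -o "./${a}_${b}" "$c"
        done
}
```

A loop such as `for batch in batchA.txt batchB.txt; do list_downloads "$batch"; done` would then cover several batches in one go.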

Now to run on the FARM. 
