# Looping with Globs
It can be tedious and error-prone to individually specify the files we want to loop over.  Bash globs, a pattern-matching syntax that is similar in spirit to regular expressions but simpler, allow us to automatically select groups of files to loop over.

## Shell Variables
Load the shell variables used in this notebook and create the output directories.

In [1]:
source bioinf_intro_config.sh
mkdir -p $TRIMMED $STAR_OUT

## Globs?
Previously we specified the files we wanted to loop over explicitly:

In [2]:
for FASTQ in 21_2019_P_M1_S21_L002_R1 21_2019_P_M1_S21_L001_R1
    do
        echo "RUNNING FASTQ: ${FASTQ}"
    done

RUNNING FASTQ: 21_2019_P_M1_S21_L002_R1
RUNNING FASTQ: 21_2019_P_M1_S21_L001_R1


This is appropriate in some situations, but when there is a group of files that we want to work with, we can often find a simple way to list these files without specifying them one-by-one.  For example, since the reads for each sample are split across four lanes, we might want to analyze the data from all four lanes at once.  There are several ways that we can specify the FASTQs from all four lanes of sample 21_2019:

1. The easiest is to use the `*` wildcard, which matches any number of any characters (including zero), so we can match the FASTQs from all four lanes like this: `$RAW_FASTQS/21_2019*`
2. `*` is easy and useful, but often it is better to be more specific, otherwise we might match something unintentionally. Since the only difference in the names between the lanes is the "L001", "L002", "L003", and "L004", we can narrow our match using `?`, which matches any single character: `21_2019_P_M1_S21_L00?_R1_001.fastq.gz`
3. The best approach is to specify exactly what characters we are allowing in that variable position.  Square brackets allow you to list specific characters that can match `[1234]` or a range `[1-4]`: `21_2019_P_M1_S21_L00[1-4]_R1_001.fastq.gz`

Globs can be confusing.  Keep in mind that here we are using the globs to search through all the file names in a directory and list the files with a name that matches a specific pattern.

In [3]:
echo $RAW_FASTQS/21_2019*

/data/hts_2019_data/hts2019_pilot_rawdata/21_2019_P_M1_S21_L001_R1_001.fastq.gz /data/hts_2019_data/hts2019_pilot_rawdata/21_2019_P_M1_S21_L002_R1_001.fastq.gz /data/hts_2019_data/hts2019_pilot_rawdata/21_2019_P_M1_S21_L003_R1_001.fastq.gz /data/hts_2019_data/hts2019_pilot_rawdata/21_2019_P_M1_S21_L004_R1_001.fastq.gz


In [4]:
echo $RAW_FASTQS/21_2019_P_M1_S21_L00?_R1_001.fastq.gz

/data/hts_2019_data/hts2019_pilot_rawdata/21_2019_P_M1_S21_L001_R1_001.fastq.gz /data/hts_2019_data/hts2019_pilot_rawdata/21_2019_P_M1_S21_L002_R1_001.fastq.gz /data/hts_2019_data/hts2019_pilot_rawdata/21_2019_P_M1_S21_L003_R1_001.fastq.gz /data/hts_2019_data/hts2019_pilot_rawdata/21_2019_P_M1_S21_L004_R1_001.fastq.gz


In [5]:
echo $RAW_FASTQS/21_2019_P_M1_S21_L00[1234]_R1_001.fastq.gz

/data/hts_2019_data/hts2019_pilot_rawdata/21_2019_P_M1_S21_L001_R1_001.fastq.gz /data/hts_2019_data/hts2019_pilot_rawdata/21_2019_P_M1_S21_L002_R1_001.fastq.gz /data/hts_2019_data/hts2019_pilot_rawdata/21_2019_P_M1_S21_L003_R1_001.fastq.gz /data/hts_2019_data/hts2019_pilot_rawdata/21_2019_P_M1_S21_L004_R1_001.fastq.gz


In [6]:
echo $RAW_FASTQS/21_2019_P_M1_S21_L00[1-4]_R1_001.fastq.gz

/data/hts_2019_data/hts2019_pilot_rawdata/21_2019_P_M1_S21_L001_R1_001.fastq.gz /data/hts_2019_data/hts2019_pilot_rawdata/21_2019_P_M1_S21_L002_R1_001.fastq.gz /data/hts_2019_data/hts2019_pilot_rawdata/21_2019_P_M1_S21_L003_R1_001.fastq.gz /data/hts_2019_data/hts2019_pilot_rawdata/21_2019_P_M1_S21_L004_R1_001.fastq.gz
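
To see why a more specific glob can be safer, here is a minimal sketch that compares the three kinds of pattern on a throwaway directory of empty files. Everything in it is made up for illustration and is not part of the course data: the `DEMO_DIR` directory, the `count_matches` helper, and the two decoy files (`L00X` and `21_2019_notes.txt`).

```bash
# Hypothetical scratch directory with empty stand-in files (not the course data)
DEMO_DIR="$(mktemp -d)"
for LANE in 1 2 3 4
    do
        touch "$DEMO_DIR/21_2019_P_M1_S21_L00${LANE}_R1_001.fastq.gz"
    done
touch "$DEMO_DIR/21_2019_P_M1_S21_L00X_R1_001.fastq.gz"   # decoy: not a real lane
touch "$DEMO_DIR/21_2019_notes.txt"                       # decoy: not a FASTQ

# Hypothetical helper: the shell expands the glob before calling the function,
# so the number of arguments is the number of matching files
count_matches() {
    echo "$#"
}

STAR_COUNT="$(count_matches "$DEMO_DIR"/21_2019*)"
QMARK_COUNT="$(count_matches "$DEMO_DIR"/21_2019_P_M1_S21_L00?_R1_001.fastq.gz)"
RANGE_COUNT="$(count_matches "$DEMO_DIR"/21_2019_P_M1_S21_L00[1-4]_R1_001.fastq.gz)"

echo "* matches:     ${STAR_COUNT} files"
echo "? matches:     ${QMARK_COUNT} files"
echo "[1-4] matches: ${RANGE_COUNT} files"
```

With these stand-in files, `*` picks up both decoys (6 files), `?` still picks up the `L00X` decoy (5 files), and only `[1-4]` matches exactly the four lanes.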


## More Complex Globs

We can combine multiple wildcards in a glob to match a more complex set of filenames.  For example, we could match the "2019" samples 10 through 19, lanes 1 through 4, with the following glob:

In [7]:
ls $RAW_FASTQS/1?_2019_*_L00[1-4]_R1_001.fastq.gz

[0m[01;31m/data/hts_2019_data/hts2019_pilot_rawdata/10_2019_P_M1_S10_L001_R1_001.fastq.gz[0m
[01;31m/data/hts_2019_data/hts2019_pilot_rawdata/10_2019_P_M1_S10_L002_R1_001.fastq.gz[0m
[01;31m/data/hts_2019_data/hts2019_pilot_rawdata/10_2019_P_M1_S10_L003_R1_001.fastq.gz[0m
[01;31m/data/hts_2019_data/hts2019_pilot_rawdata/10_2019_P_M1_S10_L004_R1_001.fastq.gz[0m
[01;31m/data/hts_2019_data/hts2019_pilot_rawdata/11_2019_P_M1_S11_L001_R1_001.fastq.gz[0m
[01;31m/data/hts_2019_data/hts2019_pilot_rawdata/11_2019_P_M1_S11_L002_R1_001.fastq.gz[0m
[01;31m/data/hts_2019_data/hts2019_pilot_rawdata/11_2019_P_M1_S11_L003_R1_001.fastq.gz[0m
[01;31m/data/hts_2019_data/hts2019_pilot_rawdata/11_2019_P_M1_S11_L004_R1_001.fastq.gz[0m
[01;31m/data/hts_2019_data/hts2019_pilot_rawdata/12_2019_P_M1_S12_L001_R1_001.fastq.gz[0m
[01;31m/data/hts_2019_data/hts2019_pilot_rawdata/12_2019_P_M1_S12_L002_R1_001.fastq.gz[0m
[01;31m/data/hts_2019_data/hts2019_pilot_rawdata/12_2019_P_M1_S12_L003_R1_0

## Looping over a glob
Now we can combine `for` loops with globs to loop over all the lanes from library 21_2019:

In [8]:
for FASTQ in $RAW_FASTQS/21_2019_P_M1_S21_L00[1-4]_R1_001.fastq.gz
    do
        echo "RUNNING FASTQ: ${FASTQ}"
    done

RUNNING FASTQ: /data/hts_2019_data/hts2019_pilot_rawdata/21_2019_P_M1_S21_L001_R1_001.fastq.gz
RUNNING FASTQ: /data/hts_2019_data/hts2019_pilot_rawdata/21_2019_P_M1_S21_L002_R1_001.fastq.gz
RUNNING FASTQ: /data/hts_2019_data/hts2019_pilot_rawdata/21_2019_P_M1_S21_L003_R1_001.fastq.gz
RUNNING FASTQ: /data/hts_2019_data/hts2019_pilot_rawdata/21_2019_P_M1_S21_L004_R1_001.fastq.gz
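
One thing to watch for when looping over a glob: if the pattern matches no files, bash by default leaves it unexpanded, so the loop runs once with the literal pattern as its value. The following is a minimal sketch of one way to guard against that, not the only one. The empty `EMPTY_DIR` directory and the `COUNT` variable are made up for the example.

```bash
# Hypothetical empty scratch directory, so the glob below matches nothing
EMPTY_DIR="$(mktemp -d)"

COUNT=0
for FASTQ in "$EMPTY_DIR"/*_R1_001.fastq.gz
    do
        # Skip anything that is not an existing file (e.g. the unexpanded pattern)
        if [ ! -e "$FASTQ" ]; then
            echo "No match, skipping: ${FASTQ}"
            continue
        fi
        COUNT=$((COUNT + 1))
        echo "RUNNING FASTQ: ${FASTQ}"
    done
echo "Files processed: ${COUNT}"
```

An alternative in bash is `shopt -s nullglob`, which makes an unmatched glob expand to nothing so that the loop body never runs. The explicit `-e` test has the advantage of not changing shell behavior for the rest of the notebook.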


## String Manipulation

But we still have a bit of work to do.  Previously, when we manually specified each FASTQ, we looped over a *substring* of the full path and then added the prefix and suffix, for example: `$TRIMMED/${FASTQ}_001.trim.fastq.gz`.  Now the glob gives us the full path, so we need to manipulate it to generate output file names that differ from the input and to put the output in different directories.  We can do all this with the `basename` command (a standard Unix utility, not a bash function). The simple form of `basename` removes the whole directory portion of a path and returns just the filename:

In [9]:
for FASTQ in $RAW_FASTQS/21_2019_P_M1_S21_L00[1-2]_R1_001.fastq.gz
    do
        echo "FULL PATH IS: ${FASTQ}"
        echo "basename gives us: $(basename ${FASTQ})"
        echo ""
    done

FULL PATH IS: /data/hts_2019_data/hts2019_pilot_rawdata/21_2019_P_M1_S21_L001_R1_001.fastq.gz
basename gives us: 21_2019_P_M1_S21_L001_R1_001.fastq.gz

FULL PATH IS: /data/hts_2019_data/hts2019_pilot_rawdata/21_2019_P_M1_S21_L002_R1_001.fastq.gz
basename gives us: 21_2019_P_M1_S21_L002_R1_001.fastq.gz



If you give `basename` a string as a second argument, it will strip that string from the end of the path (if it is there):

In [10]:
for FASTQ in $RAW_FASTQS/21_2019_P_M1_S21_L00[1-2]_R1_001.fastq.gz
    do
        echo "FULL PATH IS: ${FASTQ}"
        echo "basename gives us: $(basename ${FASTQ} '_001.fastq.gz')"
        echo ""
    done

FULL PATH IS: /data/hts_2019_data/hts2019_pilot_rawdata/21_2019_P_M1_S21_L001_R1_001.fastq.gz
basename gives us: 21_2019_P_M1_S21_L001_R1

FULL PATH IS: /data/hts_2019_data/hts2019_pilot_rawdata/21_2019_P_M1_S21_L002_R1_001.fastq.gz
basename gives us: 21_2019_P_M1_S21_L002_R1



> Note that if the string you give is not a suffix of the path, nothing is stripped from the *end* of the path, but the directory prefix will still be removed:

In [11]:
for FASTQ in $RAW_FASTQS/21_2019_P_M1_S21_L00[1-2]_R1_001.fastq.gz
    do
        echo "FULL PATH IS: ${FASTQ}"
        echo "with '_001.fastq.gz': $(basename ${FASTQ} '_001.fastq.gz')"
        echo "with 'fastq':         $(basename ${FASTQ} 'fastq')"
        echo "with 'foo':           $(basename ${FASTQ} 'foo')"
        echo ""
    done

FULL PATH IS: /data/hts_2019_data/hts2019_pilot_rawdata/21_2019_P_M1_S21_L001_R1_001.fastq.gz
with '_001.fastq.gz': 21_2019_P_M1_S21_L001_R1
with 'fastq':         21_2019_P_M1_S21_L001_R1_001.fastq.gz
with 'foo':           21_2019_P_M1_S21_L001_R1_001.fastq.gz

FULL PATH IS: /data/hts_2019_data/hts2019_pilot_rawdata/21_2019_P_M1_S21_L002_R1_001.fastq.gz
with '_001.fastq.gz': 21_2019_P_M1_S21_L002_R1
with 'fastq':         21_2019_P_M1_S21_L002_R1_001.fastq.gz
with 'foo':           21_2019_P_M1_S21_L002_R1_001.fastq.gz



And we can assign the results of `basename` to a variable for later use:

In [12]:
for FASTQ in $RAW_FASTQS/21_2019_P_M1_S21_L00[1-2]_R1_001.fastq.gz
    do
        echo "FULL PATH IS: ${FASTQ}"
        FASTQ_BASE="$(basename ${FASTQ} '_001.fastq.gz')"
        echo "basename gives us: $FASTQ_BASE"
        echo "OUTPUT PATH: ${TRIMMED}/${FASTQ_BASE}_001.trim.fastq.gz"
    done

FULL PATH IS: /data/hts_2019_data/hts2019_pilot_rawdata/21_2019_P_M1_S21_L001_R1_001.fastq.gz
basename gives us: 21_2019_P_M1_S21_L001_R1
OUTPUT PATH: /home/jovyan/work/scratch/bioinf_intro/trimmed_fastqs/21_2019_P_M1_S21_L001_R1_001.trim.fastq.gz
FULL PATH IS: /data/hts_2019_data/hts2019_pilot_rawdata/21_2019_P_M1_S21_L002_R1_001.fastq.gz
basename gives us: 21_2019_P_M1_S21_L002_R1
OUTPUT PATH: /home/jovyan/work/scratch/bioinf_intro/trimmed_fastqs/21_2019_P_M1_S21_L002_R1_001.trim.fastq.gz
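
The same steps can be wrapped in a small function. This is only a sketch under stated assumptions: the function name `trimmed_path` and the example directories are hypothetical. Because `basename` works on the path string alone, the files do not need to exist.

```bash
# Hypothetical helper: build the trimmed-output path for one raw FASTQ path
#   $1 = full path to the raw FASTQ
#   $2 = output directory
trimmed_path() {
    local fastq="$1"
    local out_dir="$2"
    local base
    # Strip the directory and the '_001.fastq.gz' suffix
    base="$(basename "$fastq" '_001.fastq.gz')"
    echo "${out_dir}/${base}_001.trim.fastq.gz"
}

# Example paths are made up for illustration
RESULT="$(trimmed_path '/some/raw/21_2019_P_M1_S21_L001_R1_001.fastq.gz' '/some/trimmed')"
echo "$RESULT"
```

For the made-up input above this prints `/some/trimmed/21_2019_P_M1_S21_L001_R1_001.trim.fastq.gz`. Quoting `"$fastq"` inside the function keeps the path intact even if it contains spaces.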


With globs and `basename` in our toolbox, we are ready to **conquer the world** or at least run multiple FASTQs through our pipeline, without breaking a sweat!

## A Globby Pipeline

In [13]:
for FASTQ in $RAW_FASTQS/21_2019_P_M1_S21_L00[1-2]_R1_001.fastq.gz
    do
        # Strip the directory and the '_001.fastq.gz' suffix to get a name for the output files
        FASTQ_BASE="$(basename "${FASTQ}" '_001.fastq.gz')"

        # Trim adapters and low-quality bases; the glob already gives us the full input path
        echo "---------------- TRIMMING: $FASTQ_BASE ----------------"
        fastq-mcf \
            $MYINFO/neb_e7600_adapters.fasta \
            "${FASTQ}" \
            -q 20 -x 0.5 \
            -o $TRIMMED/${FASTQ_BASE}_001.trim.fastq.gz

        # Map the trimmed reads to the genome and count reads per gene
        echo "---------------- MAPPING: $FASTQ_BASE ----------------"
        STAR \
            --runMode alignReads \
            --twopassMode None \
            --genomeDir $GENOME_DIR \
            --readFilesIn $TRIMMED/${FASTQ_BASE}_001.trim.fastq.gz \
            --readFilesCommand gunzip -c \
            --outFileNamePrefix ${STAR_OUT}/${FASTQ_BASE}_ \
            --quantMode GeneCounts \
            --outSAMtype BAM SortedByCoordinate \
            --runThreadN 2
    done

---------------- TRIMMING: 21_2019_P_M1_S21_L001_R1 ----------------
Command Line: /home/jovyan/work/scratch/bioinf_intro/myinfo/neb_e7600_adapters.fasta /data/hts_2019_data/hts2019_pilot_rawdata/21_2019_P_M1_S21_L001_R1_001.fastq.gz -q 20 -x 0.5 -o /home/jovyan/work/scratch/bioinf_intro/trimmed_fastqs/21_2019_P_M1_S21_L001_R1_001.trim.fastq.gz
Scale used: 2.2
Phred: 33
Threshold used: 751 out of 300000
Adapter Adapter (AGATCGGAAGAGCACACGTCTGAACTCCAGTCA): counted 2504 at the 'end' of '/data/hts_2019_data/hts2019_pilot_rawdata/21_2019_P_M1_S21_L001_R1_001.fastq.gz', clip set to 6
Files: 1
Total reads: 2479050
Too short after clip: 1519
Clipped 'end' reads: Count: 46013, Mean: 15.47, Sd: 8.24
Trimmed 299018 reads by an average of 1.71 bases on quality < 20
---------------- MAPPING: 21_2019_P_M1_S21_L001_R1 ----------------
Jun 26 16:30:37 ..... started STAR run
Jun 26 16:30:37 ..... loading genome
Jun 26 16:30:38 ..... started mapping
Jun 26 16:32:13 ..... started sorting BAM
Jun 26 16:3

### And let's check the result

In [14]:
ls ${STAR_OUT}

21_2019_P_M1_S21_L001_R1_Aligned.sortedByCoord.out.bam
21_2019_P_M1_S21_L001_R1_Log.final.out
21_2019_P_M1_S21_L001_R1_Log.out
21_2019_P_M1_S21_L001_R1_Log.progress.out
21_2019_P_M1_S21_L001_R1_ReadsPerGene.out.tab
21_2019_P_M1_S21_L001_R1_short_introns_Aligned.sortedByCoord.out.bam
21_2019_P_M1_S21_L001_R1_short_introns_Aligned.sortedByCoord.out.bam.bai
21_2019_P_M1_S21_L001_R1_short_introns_Log.final.out
21_2019_P_M1_S21_L001_R1_short_introns_Log.out
21_2019_P_M1_S21_L001_R1_short_introns_Log.progress.out
21_2019_P_M1_S21_L001_R1_short_introns_ReadsPerGene.out.tab
21_2019_P_M1_S21_L001_R1_short_introns_SJ.out.tab
21_2019_P_M1_S21_L001_R1_SJ.out.tab
21_2019_P_M1_S21_L002_R1_Aligned.out.bam
21_2019_P_M1_S21_L002_R1_Aligned.sortedByCoord.out.bam
21_2019_P_M1_S21_L002_R1_Log.final.out
21_2019_P_M1_S21_L002_R1_Log.out
21_2019_P_M1_S21_L002_R1_Log.progress.out
21_2019_P_M1_S21_L002_R1_ReadsPerGene.out.tab
21_2019_P_M1_S21_L002_R1_short_introns_Aligned.sortedByCoord.out.bam
21_2019_P_M1_S21

In [15]:
head ${STAR_OUT}/21_2019_P_M1_S21_L00?_R1_ReadsPerGene.out.tab

==> /home/jovyan/work/scratch/bioinf_intro/star_out/21_2019_P_M1_S21_L001_R1_ReadsPerGene.out.tab <==
N_unmapped	47582	47582	47582
N_multimapping	35895	35895	35895
N_noFeature	12476	2181458	18846
N_ambiguous	206438	870	288
CNAG_04548	0	0	0
CNAG_07303	0	0	0
CNAG_07304	10	0	10
CNAG_00001	0	0	0
CNAG_07305	2	0	2
CNAG_00002	43	0	43

==> /home/jovyan/work/scratch/bioinf_intro/star_out/21_2019_P_M1_S21_L002_R1_ReadsPerGene.out.tab <==
N_unmapped	46060	46060	46060
N_multimapping	35006	35006	35006
N_noFeature	12466	2145291	18783
N_ambiguous	203327	820	316
CNAG_04548	0	0	0
CNAG_07303	0	0	0
CNAG_07304	6	0	6
CNAG_00001	0	0	0
CNAG_07305	1	0	1
CNAG_00002	51	0	51

==> /home/jovyan/work/scratch/bioinf_intro/star_out/21_2019_P_M1_S21_L003_R1_ReadsPerGene.out.tab <==
N_unmapped	47625	47625	47625
N_multimapping	36124	36124	36124
N_noFeature	12566	2207478	19048
N_ambiguous	208095	881	289
CNAG_04548	0	0	0
CNAG_07303	0	0	0
CNAG_07304	16	0	16
CNAG_00001	0	0	0
CNAG_07305	0	0	0
CNAG_00002	53	0	53


## Glob References
- [Globbing Section](http://tldp.org/LDP/abs/html/globbingref.html) from [The Advanced Bash-Scripting Guide](http://tldp.org/LDP/abs/html/index.html)
- [Glob](https://en.wikipedia.org/wiki/Glob_%28programming%29) article on Wikipedia