# Run all scripts
#### This is a master list of commands and a detailed walkthrough from raw data to graphs for the captive ape microbiome paper.



In [2]:
#set working directory, it should have 3 major folders: data, results, and scripts
%pwd
%cd /Volumes/AHN/captive_ape_microbiome/

/Volumes/AHN/captive_ape_microbiome
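
Before running the later cells, it can help to confirm that the working directory really contains the three expected folders. The snippet below is a minimal sketch of such a check; the function name `missing_project_dirs` is hypothetical and not part of the project's scripts.

```python
from pathlib import Path

# The three top-level folders the working directory is expected to contain.
REQUIRED_DIRS = ("data", "results", "scripts")


def missing_project_dirs(root="."):
    """Return the required top-level folders that are absent under root."""
    root = Path(root)
    return [name for name in REQUIRED_DIRS if not (root / name).is_dir()]


missing = missing_project_dirs(".")
if missing:
    print("Missing folders:", ", ".join(missing))
else:
    print("Project layout looks complete.")
```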


### Demultiplex fastqs
DADA2 requires demultiplexed fastqs.
Two sequencing runs from Moeller et al. 2014 and one sequencing run from this study (Project Chimps samples) need to be demultiplexed. Moeller et al. 2014 has two fastqs: one containing gorilla and bonobo samples (GorBon), the other containing all chimpanzee samples (Chimp). I used two demultiplexing programs because demultiplex 1.0.1 handles barcode sequences in fastq headers (GorBon fastq), whereas barcode_splitter takes a separate barcode fastq file (Chimp fastq and captive ape fastq).
#### Inputs: Multiplexed Fastq, Barcode Fastq (not needed for the GorBon fastq because its barcodes are in the read fastq headers), metadata file linking barcodes to sample names
#### Outputs: Demultiplexed Fastqs
### Required programs: 
#### demultiplex 1.0.1 (https://pypi.org/project/demultiplex/) 
Handles barcode sequences in fastq headers -> GorBon fastq
#### repair.sh from bbtools (https://github.com/BioInfoTools/BBMap/blob/master/sh/repair.sh) 
The Chimp read fastq and barcode fastq headers aren't paired perfectly, so the demultiplexing program quits as soon as it reaches a mismatch. repair.sh re-pairs the two files before demultiplexing.
#### barcode_splitter (https://pypi.org/project/barcode-splitter/) 
Takes a separate barcode fastq file -> Chimp fastq & captive ape fastq
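
To make the separate-barcode-file case concrete, here is a minimal in-memory sketch of demultiplexing. It is not how barcode_splitter is implemented; the helper names `parse_fastq` and `demultiplex` are hypothetical, the reads are toy data, and only the two barcode-to-sample pairs are taken from the chimp run. It also shows why out-of-sync headers stop the process.

```python
from collections import defaultdict


def parse_fastq(lines):
    """Yield (header, sequence, quality) from an iterable of FASTQ lines."""
    it = iter(lines)
    for header in it:
        seq = next(it)
        next(it)  # skip the '+' separator line
        qual = next(it)
        yield header.strip(), seq.strip(), qual.strip()


def demultiplex(read_lines, barcode_lines, barcode_to_sample):
    """Group reads by sample using a parallel barcode FASTQ.

    Record i of the barcode file must describe record i of the read file.
    A header mismatch raises an error (the case that re-pairing addresses).
    """
    bins = defaultdict(list)
    pairs = zip(parse_fastq(read_lines), parse_fastq(barcode_lines))
    for (read_hdr, seq, qual), (bc_hdr, barcode, _) in pairs:
        if read_hdr.split()[0] != bc_hdr.split()[0]:
            raise ValueError(f"headers out of sync: {read_hdr} vs {bc_hdr}")
        sample = barcode_to_sample.get(barcode, "unmatched")
        bins[sample].append((read_hdr, seq, qual))
    return dict(bins)


# Toy input: three reads and their index reads (sequences are made up).
reads = ["@r1", "ACGT", "+", "IIII",
         "@r2", "TTGA", "+", "IIII",
         "@r3", "GGCC", "+", "IIII"]
barcodes = ["@r1", "ACCGATAATTCC", "+", "IIIIIIIIIIII",
            "@r2", "CAAGTGAGAGAG", "+", "IIIIIIIIIIII",
            "@r3", "NNNNNNNNNNNN", "+", "IIIIIIIIIIII"]
barcode_to_sample = {
    "ACCGATAATTCC": "wd.chi.GM.1.16s",
    "CAAGTGAGAGAG": "wd.chi.GM.2.16s",
}

bins = demultiplex(reads, barcodes, barcode_to_sample)
for sample, records in bins.items():
    print(sample, len(records))
```

Reads whose barcode is not in the metadata end up in an `unmatched` bin rather than being dropped silently, which makes it easy to see how much of a run was not assigned.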


In [16]:
!scripts/demultiplex/V4_demultiplex.sh

demultiplexing moeller chimp dataset
Sample	Barcode1	Count	Percent	Files
wd.chi.GM.1.16s	ACCGATAATTCC	209500	0.73%	results/16s_moeller_wild/chimp/demultiplexed_fastq/wd.chi.GM.1.16s-read-*.fastq
wd.chi.GM.2.16s	CAAGTGAGAGAG	193311	0.67%	results/16s_moeller_wild/chimp/demultiplexed_fastq/wd.chi.GM.2.16s-read-*.fastq
wd.chi.GM.3.16s	GTCTTCGTCGCT	146312	0.51%	results/16s_moeller_wild/chimp/demultiplexed_fastq/wd.chi.GM.3.16s-read-*.fastq
wd.chi.GM.4.16s	GGAAAGTCGAAG	163057	0.57%	results/16s_moeller_wild/chimp/demultiplexed_fastq/wd.chi.GM.4.16s-read-*.fastq
wd.chi.GM.5.16s	TCCACAGGAGTT	262959	0.91%	results/16s_moeller_wild/chimp/demultiplexed_fastq/wd.chi.GM.5.16s-read-*.fastq
wd.chi.GM.6.16s	GTTACTGGGTGG	189795	0.66%	results/16s_moeller_wild/chimp/demultiplexed_fastq/wd.chi.GM.6.16s-read-*.fastq
wd.chi.GM.7.16s	GAAGGAAGCAGG	165853	0.58%	results/16s_moeller_wild/chimp/demultiplexed_fastq/wd.chi.GM.7.16s-read-*.fastq
wd.chi.GM.8.16s	TGAGTGGAATGA	474422	1.65%	results/16s_moeller_wild/chimp/

scripts/demultiplex/V4_demultiplex.sh: line 35: ches: command not found


### Remove primer and adapter sequences from select fastqs
#### Required programs: cutadapt
Some dataset fastqs have already had primers removed, others haven't, and, more confusingly, one had only the forward primer trimmed. The GorBon and Chimp fastqs from Moeller et al. 2014 have already been merged, so the cutadapt parameters are different: the reverse primer is given as its reverse complement and trimmed from the 3' end.
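
Because the primers contain degenerate bases, the reverse complement has to respect IUPAC ambiguity codes (for example, V complements to B). The following is a small sketch of that conversion, using the reverse primer that appears in the cutadapt log; the function name `reverse_complement` is illustrative, and how the result is passed to cutadapt in `V4_cutadapt.sh` is not shown here.

```python
# Complement table covering the standard IUPAC nucleotide codes.
IUPAC_COMPLEMENT = str.maketrans(
    "ACGTRYSWKMBDHVN",
    "TGCAYRSWMKVHDBN",
)


def reverse_complement(seq):
    """Return the reverse complement of a (possibly degenerate) DNA sequence."""
    return seq.upper().translate(IUPAC_COMPLEMENT)[::-1]


reverse_primer = "GGACTACNVGGGTWTCTAAT"  # reverse primer from the cutadapt log
print(reverse_complement(reverse_primer))  # ATTAGAWACCCBNGTAGTCC
```

In a merged read the reverse primer sits at the 3' end in this reverse-complemented orientation, which is why it is trimmed as a 3' adapter (cutadapt's `-a` option) rather than with `-G`.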


In [4]:
!scripts/demultiplex/V4_cutadapt.sh

This is cutadapt 2.5 with Python 3.7.4
Command line parameters: -g GTGCCAGCCGCCGCGGTAA -G GGACTACNVGGGTWTCTAAT -o test1.fastq -p test2.fastq data/16s_clayton_captive/raw_fastq/CZ-MB26_S26_L001_R1_001.fastq.gz data/16s_clayton_captive/raw_fastq/CZ-MB26_S26_L001_R2_001.fastq.gz
Processing reads on 1 core in paired-end mode ...
[ 8=---------] 00:00:03        84,584 reads  @     43.9 µs/read;   1.37 M reads/minute
Finished in 3.78 s (45 us/read; 1.34 M reads/minute).

=== Summary ===

Total read pairs processed:             84,584
  Read 1 with adapter:                      36 (0.0%)
  Read 2 with adapter:                      27 (0.0%)
Pairs written (passing filters):        84,584 (100.0%)

Total basepairs processed:    50,919,568 bp
  Read 1:    25,459,784 bp
  Read 2:    25,459,784 bp
Total written (filtered):     50,919,361 bp (100.0%)
  Read 1:    25,459,674 bp
  Read 2:    25,459,687 bp

=== First read: Adapter 1 ===

Sequence: GTGCCAGCCGCCGCGGTAA; Type: regular 5'; Length: 19; Trim

This is cutadapt 2.5 with Python 3.7.4
Command line parameters: -g GTGCCAGCCGCCGCGGTAA -G GGACTACNVGGGTWTCTAAT -o test1.fastq -p test2.fastq data/16s_goodrich/raw_fastq/ERR1382855_1.fastq.gz data/16s_goodrich/raw_fastq/ERR1382855_2.fastq.gz
Processing reads on 1 core in paired-end mode ...
[8=----------] 00:00:02        50,945 reads  @     45.7 µs/read;   1.31 M reads/minute
Finished in 2.39 s (47 us/read; 1.28 M reads/minute).

=== Summary ===

Total read pairs processed:             50,945
  Read 1 with adapter:                      63 (0.1%)
  Read 2 with adapter:                      55 (0.1%)
Pairs written (passing filters):        50,945 (100.0%)

Total basepairs processed:    25,574,390 bp
  Read 1:    12,787,195 bp
  Read 2:    12,787,195 bp
Total written (filtered):     25,573,999 bp (100.0%)
  Read 1:    12,786,989 bp
  Read 2:    12,787,010 bp

=== First read: Adapter 1 ===

Sequence: GTGCCAGCCGCCGCGGTAA; Type: regular 5'; Length: 19; Trimmed: 63 times.

No. of allowed error

[8=----------] 00:00:01        37,574 reads  @     36.4 µs/read;   1.65 M reads/minute
Orang114
[8=----------] 00:00:01        27,164 reads  @     58.6 µs/read;   1.02 M reads/minute
Orang115
[8=----------] 00:00:02        35,151 reads  @     63.2 µs/read;   0.95 M reads/minute
Orang116
[8=----------] 00:00:02        28,864 reads  @     73.8 µs/read;   0.81 M reads/minute
Orang117
[8=----------] 00:00:01        27,178 reads  @     47.3 µs/read;   1.27 M reads/minute
Orang118
[8=----------] 00:00:01        27,152 reads  @     62.1 µs/read;   0.97 M reads/minute
Orang119
[8<----------] 00:00:00        16,040 reads  @     49.5 µs/read;   1.21 M reads/minute
Orang120
[8=----------] 00:00:01        37,033 reads  @     36.3 µs/read;   1.65 M reads/minute
Orang121
[8=----------] 00:00:01        21,635 reads  @     83.1 µs/read;   0.72 M reads/minute
Orang122
[8=----------] 00:00:02        31,646 reads  @     80.1 µs/read;   0.75 M reads/minute
Orang123
scripts/demultiplex/V4_cutadapt.sh: line

[8<----------] 00:00:00         7,980 reads  @     46.0 µs/read;   1.31 M reads/minute
CS.026
[8<----------] 00:00:00        17,318 reads  @     41.7 µs/read;   1.44 M reads/minute
CS.027
[8<----------] 00:00:00        16,320 reads  @     43.9 µs/read;   1.37 M reads/minute
CS.028
[8<----------] 00:00:00        20,819 reads  @     42.8 µs/read;   1.40 M reads/minute
CS.029
[8<----------] 00:00:00        13,047 reads  @     42.1 µs/read;   1.43 M reads/minute
CS.030
[8<----------] 00:00:00        18,891 reads  @     41.1 µs/read;   1.46 M reads/minute
CS.031
[8<----------] 00:00:00        16,701 reads  @     41.3 µs/read;   1.45 M reads/minute
CS.032
[8<----------] 00:00:00        15,167 reads  @     44.2 µs/read;   1.36 M reads/minute
CS.033
[8<----------] 00:00:00        14,140 reads  @     59.9 µs/read;   1.00 M reads/minute
CS.035
[8<----------] 00:00:00        19,099 reads  @     49.6 µs/read;   1.21 M reads/minute
CS.036
[8<----------] 00:00:00        15,685 reads  @     60.6 µs/r

[8<----------] 00:00:00        14,086 reads  @     41.3 µs/read;   1.45 M reads/minute
CS.122
[8<----------] 00:00:00        17,728 reads  @     46.3 µs/read;   1.29 M reads/minute
CS.123
[8<----------] 00:00:00        17,043 reads  @     42.0 µs/read;   1.43 M reads/minute
CS.124
[8<----------] 00:00:01        24,434 reads  @     41.5 µs/read;   1.45 M reads/minute
CS.125
[8<----------] 00:00:00        19,074 reads  @     43.1 µs/read;   1.39 M reads/minute
CS.126
[8=----------] 00:00:01        18,778 reads  @     75.9 µs/read;   0.79 M reads/minute
CS.127
[8<----------] 00:00:00         4,252 reads  @     51.1 µs/read;   1.17 M reads/minute
CS.128
[8<----------] 00:00:00        10,129 reads  @     42.8 µs/read;   1.40 M reads/minute
CS.129
[8<----------] 00:00:00        18,118 reads  @     43.8 µs/read;   1.37 M reads/minute
CS.130
[8<----------] 00:00:00        15,957 reads  @     42.0 µs/read;   1.43 M reads/minute
CS.131
[8<----------] 00:00:00        17,361 reads  @     42.7 µs/r

[8<----------] 00:00:00         5,884 reads  @     42.0 µs/read;   1.43 M reads/minute
CS.222
[8<----------] 00:00:00        12,232 reads  @     43.0 µs/read;   1.39 M reads/minute
CS.223
[8<----------] 00:00:00         2,507 reads  @     47.9 µs/read;   1.25 M reads/minute
CS.224
[8<----------] 00:00:00        15,171 reads  @     43.3 µs/read;   1.39 M reads/minute
CS.227
[8<----------] 00:00:00        16,497 reads  @     45.0 µs/read;   1.33 M reads/minute
CS.228
[8<----------] 00:00:00        17,583 reads  @     46.6 µs/read;   1.29 M reads/minute
CS.229
[8<----------] 00:00:00         4,920 reads  @     51.1 µs/read;   1.17 M reads/minute
CS.230
[8<----------] 00:00:00        18,713 reads  @     45.5 µs/read;   1.32 M reads/minute
CS.231
[8<----------] 00:00:00        16,230 reads  @     43.1 µs/read;   1.39 M reads/minute
CS.232
[8<----------] 00:00:00         9,392 reads  @     45.8 µs/read;   1.31 M reads/minute
CS.233
[8<----------] 00:00:01        25,554 reads  @     45.0 µs/r

[8<----------] 00:00:01        23,355 reads  @     44.5 µs/read;   1.35 M reads/minute
CS.323
[8<----------] 00:00:00         7,743 reads  @     42.1 µs/read;   1.43 M reads/minute
CS.325
[8<----------] 00:00:00        13,961 reads  @     42.9 µs/read;   1.40 M reads/minute
CS.326
[8<----------] 00:00:00         2,179 reads  @     46.4 µs/read;   1.29 M reads/minute
CS.327
[8<----------] 00:00:00         3,085 reads  @     44.1 µs/read;   1.36 M reads/minute
CS.328
[8<----------] 00:00:00         6,553 reads  @     43.9 µs/read;   1.37 M reads/minute
CS.329
[8<----------] 00:00:00        16,284 reads  @     44.9 µs/read;   1.34 M reads/minute
CS.330
[8<----------] 00:00:00        11,153 reads  @     45.5 µs/read;   1.32 M reads/minute
CS.331
[8<----------] 00:00:00         6,834 reads  @     43.4 µs/read;   1.38 M reads/minute
CS.332
[8<----------] 00:00:00         6,646 reads  @     50.3 µs/read;   1.19 M reads/minute
CS.333
[8<----------] 00:00:00         6,086 reads  @     44.8 µs/r

L.004.M5
[8<----------] 00:00:00         6,338 reads  @     43.6 µs/read;   1.37 M reads/minute
L.004.M6
[8<----------] 00:00:00         6,420 reads  @     42.8 µs/read;   1.40 M reads/minute
L.005.M1
[8<----------] 00:00:00         8,156 reads  @     43.9 µs/read;   1.37 M reads/minute
L.005.M2
[8<----------] 00:00:00         5,218 reads  @     43.9 µs/read;   1.37 M reads/minute
L.005.M3
[8<----------] 00:00:00         6,773 reads  @     44.2 µs/read;   1.36 M reads/minute
L.005.M4
[8<----------] 00:00:00         2,954 reads  @     43.7 µs/read;   1.37 M reads/minute
L.005.M5
[8<----------] 00:00:00         4,379 reads  @     42.3 µs/read;   1.42 M reads/minute
L.005.M6
[8<----------] 00:00:00         3,396 reads  @     43.0 µs/read;   1.40 M reads/minute
L.006.M1
[8<----------] 00:00:00         4,720 reads  @     45.3 µs/read;   1.32 M reads/minute
L.006.M2
[8<----------] 00:00:00         4,719 reads  @     45.1 µs/read;   1.33 M reads/minute
L.006.M3
[8<----------] 00:00:00        

L.036.M4
[8<----------] 00:00:00         3,751 reads  @     44.5 µs/read;   1.35 M reads/minute
L.036.M5
[8<----------] 00:00:00        15,973 reads  @     45.7 µs/read;   1.31 M reads/minute
L.036.M6
[8<----------] 00:00:00         1,860 reads  @     50.8 µs/read;   1.18 M reads/minute
T.CS.001
[8<----------] 00:00:00         5,141 reads  @     45.1 µs/read;   1.33 M reads/minute
T.CS.002
[8<----------] 00:00:00         7,181 reads  @     44.1 µs/read;   1.36 M reads/minute
T.CS.003
[8<----------] 00:00:00         1,258 reads  @     51.9 µs/read;   1.16 M reads/minute
T.CS.004
[8<----------] 00:00:00        13,682 reads  @     43.5 µs/read;   1.38 M reads/minute
T.CS.005
[8<----------] 00:00:00         4,626 reads  @     44.9 µs/read;   1.34 M reads/minute
T.CS.007
[8<----------] 00:00:00         5,327 reads  @     41.9 µs/read;   1.43 M reads/minute
T.CS.008
[8<----------] 00:00:00         6,624 reads  @     41.9 µs/read;   1.43 M reads/minute
T.CS.009
[8<----------] 00:00:00        

[8<----------] 00:00:00         5,302 reads  @     41.8 µs/read;   1.43 M reads/minute
T.CS.089
[8<----------] 00:00:00         1,430 reads  @     47.4 µs/read;   1.27 M reads/minute
T.CS.090
[8<----------] 00:00:00        21,843 reads  @     41.6 µs/read;   1.44 M reads/minute
T.CS.091
[8<----------] 00:00:00         5,349 reads  @     46.2 µs/read;   1.30 M reads/minute
T.CS.092
[8<----------] 00:00:00         4,525 reads  @     45.0 µs/read;   1.33 M reads/minute
T.CS.093
[8<----------] 00:00:00         4,713 reads  @     42.7 µs/read;   1.40 M reads/minute
T.CS.094
[8<----------] 00:00:00         4,872 reads  @     45.3 µs/read;   1.32 M reads/minute
T.CS.095
[8<----------] 00:00:00         1,352 reads  @     52.3 µs/read;   1.15 M reads/minute
T.CS.096
[8<----------] 00:00:00         3,798 reads  @     44.3 µs/read;   1.35 M reads/minute
T.CS.097
[8<----------] 00:00:00         4,643 reads  @     43.0 µs/read;   1.39 M reads/minute
T.CS.098
[8<----------] 00:00:00         1,835 re

T.CS.180
[8<----------] 00:00:00            17 reads  @    548.8 µs/read;   0.11 M reads/minute
TFS.004
[8<----------] 00:00:00           105 reads  @    119.5 µs/read;   0.50 M reads/minute
TFS.005
[8<----------] 00:00:00            30 reads  @    289.9 µs/read;   0.21 M reads/minute
TFS.006
[8<----------] 00:00:00        15,505 reads  @     42.0 µs/read;   1.43 M reads/minute
TFS.020
[8<----------] 00:00:00            41 reads  @    268.5 µs/read;   0.22 M reads/minute
TFS.021
[8<----------] 00:00:00         7,615 reads  @     43.4 µs/read;   1.38 M reads/minute
TFS.022
[8<----------] 00:00:00         3,775 reads  @     55.6 µs/read;   1.08 M reads/minute
TFS.023
[8<----------] 00:00:00        17,107 reads  @     41.4 µs/read;   1.45 M reads/minute
TFS.024
[8<----------] 00:00:00           203 reads  @     80.3 µs/read;   0.75 M reads/minute
TFS.025
[8<----------] 00:00:00        14,462 reads  @     43.2 µs/read;   1.39 M reads/minute
TFS.026
[8<----------] 00:00:00           258 rea

### Quality filtering with DADA2

#### Required programs: DADA2 package

#### Input directory, sample names
The input folder from which the fastqs are sourced varies across datasets because the preprocessing steps vary: some datasets required demultiplexing, some required primer trimming, some required both, and others required neither and are sourced directly from the data folder. The DADA2 workflow also needs a pattern for recognizing the forward and reverse fastqs and a delimiter for splitting the sample names out of the filenames; both vary across datasets.
#### Trimming lengths
Datasets also vary in quality along the length of a read. Most datasets use truncation lengths of 145 bp (forward) and 145 bp (reverse), but Raymann et al. had poor-quality reverse reads, so its truncation lengths were set to 200 bp (forward) and 100 bp (reverse).
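
The effect of a truncation length can be sketched in a few lines. In DADA2, `truncLen` cuts each read to a fixed length and discards reads shorter than that length; the Python below only mimics that behavior for illustration and is not the DADA2 code. The dictionary keys and the function name `truncate_pair` are hypothetical, while the lengths are the ones stated above.

```python
# Truncation lengths (forward, reverse) in bp; the keys are illustrative labels.
TRUNC_LEN = {
    "default": (145, 145),
    "raymann": (200, 100),
}


def truncate_pair(fwd, rev, trunc_f, trunc_r):
    """Truncate a read pair to fixed lengths.

    Returns None when either read is shorter than its truncation length,
    mirroring how such reads are discarded rather than kept short.
    """
    if len(fwd) < trunc_f or len(rev) < trunc_r:
        return None
    return fwd[:trunc_f], rev[:trunc_r]


# Toy reads: a 250 bp forward read and a 120 bp reverse read.
fwd_read, rev_read = "A" * 250, "C" * 120

kept = truncate_pair(fwd_read, rev_read, *TRUNC_LEN["raymann"])
print(len(kept[0]), len(kept[1]))  # 200 100

# With the default lengths the 120 bp reverse read is too short, so the pair is dropped.
print(truncate_pair(fwd_read, rev_read, *TRUNC_LEN["default"]))  # None
```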
#### Single vs Paired reads
Almost all of the datasets have paired-end reads and were processed using the paired-read DADA2 script. The exception is Moeller et al. 2014, whose reads were already merged and were therefore processed using the single-read DADA2 script. The parameters and processing steps used in the single-read and paired-read scripts are otherwise identical.
#### Script setup
Required scripts: DADA2_V4, DADA2_single, DADA2_paired
Because of these differences there are three scripts in this folder. DADA2_V4 does all of the initial steps (defining indir, outdir, sample names, and read truncation) before funneling data into either the DADA2_single or the DADA2_paired script, where reads are processed uniformly.
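
The division of labor between the three scripts can be pictured as a per-dataset configuration followed by a dispatch on read type. The sketch below is written in Python purely for illustration (the actual scripts are R). The class `DatasetConfig`, the dataset keys, and the file patterns and delimiters are assumptions chosen to match the filenames seen in the cutadapt log; the Moeller pattern and delimiter in particular are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetConfig:
    indir: str        # folder the fastqs are read from
    fwd_pattern: str  # filename suffix identifying forward (or merged) reads
    delimiter: str    # text on which the filename is split to get the sample name
    paired: bool      # False when reads were merged before processing


# Illustrative entries only; patterns and delimiters are assumptions.
DATASETS = {
    "clayton_captive": DatasetConfig(
        "data/16s_clayton_captive/raw_fastq", "_R1_001.fastq.gz", "_", True),
    "goodrich": DatasetConfig(
        "data/16s_goodrich/raw_fastq", "_1.fastq.gz", "_", True),
    "moeller_chimp": DatasetConfig(
        "results/16s_moeller_wild/chimp/cutadapt_fastq", ".fastq.gz", ".fastq", False),
}


def sample_name(filename, delimiter):
    """Take the part of the filename before the first delimiter."""
    return filename.split(delimiter)[0]


def pick_script(cfg):
    """Choose the downstream script based on whether reads are paired."""
    return "DADA2_paired" if cfg.paired else "DADA2_single"


for name, cfg in DATASETS.items():
    print(name, "->", pick_script(cfg))

print(sample_name("CZ-MB26_S26_L001_R1_001.fastq.gz", "_"))  # CZ-MB26
```

Keeping the dataset-specific details in one table means the single-read and paired-read steps never need to know where a dataset came from, which is the point of funneling everything through one setup script.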

In [5]:
!find results/16s_moeller_wild/chimp/cutadapt_fastq -name '*.gz' -size -30k -delete #removes empty or near-empty fastqs (under 30 KiB)
# ran in DADA2_V4.R in Rstudio

Loading required package: Rcpp
Error: package or namespace load failed for ‘Rcpp’ in library.dynam(lib, package, package.lib):
 shared object ‘Rcpp.dylib’ not found
Error: package ‘Rcpp’ could not be loaded
Execution halted


### Merge DADA2 to Phyloseq