# **TITLE**
### Author : Andrea Grecu

## **Background | Pūtake**

Orr et al. (2020) described a phylogenomic method to detect somatic mutations within a phenotypically mosaic plant individual and to estimate the somatic mutation rate of that individual. Both the bioinformatic workflow (*aka. pipeline*) and the input data are open access and can be found through the original journal publication linked below.

[Orr et al., ORIGINAL!](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7126060/)

The original pipeline created by Orr et al. (2020) was published in a GitHub repository and uses many bioinformatics tools, which are cumulatively written in 5+ programming languages. The suggested way to replicate this pipeline is to use the makefiles provided.

### Āheitanga

***While the original pipeline is open access, it is not particularly user friendly.***

Computational methods within bioinformatics often fall short in reproducibility due to insufficient or inaccessible documentation (Birmingham, 2017). It is imperative to the progression of bioinformatics research that an effort is made to produce interactive methods which are accessible to a wider audience, including those with limited computer literacy.

> ### Thus, the purpose of this notebook is to provide an easy-to-follow replication of the original pipeline created by Orr et al. (2020) to detect somatic mutations.
> The notebook provides a *computational narrative* incorporating inputs, outputs and easy-to-understand explanations of each step.
> By detailing the assumptions, logic, necessary input data and expected output, this notebook should enable this pipeline to be applied to a new set of collected data.


## **Pipeline Technological Limitations | Tepenga**

Because this pipeline processes many (24+) sets of whole-genome reads, it requires substantial computational capacity, which restricts the hardware it can be successfully run on.

While the scripts can be adapted to the RAM and thread count of your device, it is not recommended to run this pipeline on a device with capacity much lower than the default (*it will likely take a very long time to run and use much of your device's memory*).

> ### Default Capacity
> **RAM = 64 GB**
>
> **CPU = 20 Threads**
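
As a rough pre-flight check, the sketch below compares the current machine against the default capacity listed above. The function names `detect_capacity` and `check_capacity` are illustrative and not part of the original pipeline, and the RAM lookup through `os.sysconf` is an assumption that only holds on Linux-like systems (elsewhere the RAM check is skipped).

```python
import os

DEFAULT_RAM_GB = 64    # default RAM stated in this notebook
DEFAULT_THREADS = 20   # default CPU threads stated in this notebook


def detect_capacity():
    """Return (ram_gb, threads); ram_gb is None if it cannot be determined."""
    threads = os.cpu_count() or 1
    try:
        # Assumption: these sysconf names are available (Linux-like systems).
        ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
        ram_gb = ram_bytes / 1024**3
    except (AttributeError, ValueError, OSError):
        ram_gb = None
    return ram_gb, threads


def check_capacity(ram_gb, threads,
                   need_ram_gb=DEFAULT_RAM_GB, need_threads=DEFAULT_THREADS):
    """Return a list of warnings; an empty list means the defaults are met."""
    warnings = []
    if ram_gb is not None and ram_gb < need_ram_gb:
        warnings.append(f"RAM is {ram_gb:.1f} GB, below the default {need_ram_gb} GB")
    if threads < need_threads:
        warnings.append(f"{threads} threads available, below the default {need_threads}")
    return warnings


ram_gb, threads = detect_capacity()
for line in check_capacity(ram_gb, threads) or ["Capacity meets the defaults."]:
    print(line)
```

Note that the operating system usually reports slightly less memory than the nominal size of the installed RAM, so a small shortfall on a 64 GB machine is not a cause for concern.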




## **Required Software Installation |  Tāuta**

Prior to running this pipeline, the appropriate software must be installed.
This can be done directly within this notebook by running the cells below.

#### Bioconda 
Most of the packages used further along in this pipeline are provided via the **Bioconda** channel of the conda package manager.

More information about Bioconda can be found [here](https://bioconda.github.io/)

***To set up the Bioconda channels run the code cell below.***

In [5]:
import sys 
!conda config --add channels defaults
!conda config --add channels bioconda
!conda config --add channels conda-forge
!conda config --set channel_priority strict
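
To confirm the channel setup took effect, the sketch below reads back the configured channels with `conda config --show channels` and compares them with the order expected after the three `--add` commands above (each `--add` places its channel at the top of the list). The helper names `parse_channels` and `current_channels` are illustrative, and the parser assumes conda prints the channels as a simple `- name` list.

```python
import subprocess

# Order expected after adding defaults, then bioconda, then conda-forge.
EXPECTED = ["conda-forge", "bioconda", "defaults"]


def parse_channels(text):
    """Extract channel names from `conda config --show channels` output."""
    return [line.strip()[2:].strip()
            for line in text.splitlines()
            if line.strip().startswith("- ")]


def current_channels():
    """Ask conda for the configured channels (requires conda on PATH)."""
    result = subprocess.run(["conda", "config", "--show", "channels"],
                            capture_output=True, text=True, check=True)
    return parse_channels(result.stdout)
```

In a notebook cell, `current_channels() == EXPECTED` should evaluate to `True` once the configuration cell has been run, provided no other channels were configured beforehand.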



#### Khmer

The khmer software allows for nucleotide sequence analysis, and is necessary for step (* COME BACK) -> uses the clean reads shell (Crusoe et al., 2015). 

More information on Khmer can be found [here](https://github.com/dib-lab/khmer)

***To install Khmer run the code cell below.***

In [6]:
import sys
!conda install --yes --prefix {sys.prefix} khmer

Collecting package metadata (current_repodata.json): done
Solving environment: done


  current version: 22.11.0
  latest version: 22.11.1

Please update conda by running

    $ conda update -n base -c defaults conda

Or to minimize the number of packages updated during conda update use

     conda install conda=22.11.1



# All requested packages already installed.



#### Rcorrector

The Rcorrector software allows for the error correction of Illumina RNA-seq reads, and is necessary for cleaning the reads (STEP 1*) (Song & Florea, 2015).

More information on Rcorrector can be found [here](https://github.com/mourisl/Rcorrector)

***To install Rcorrector run the code cell below.***

In [40]:
import sys
!conda install --yes --prefix {sys.prefix} rcorrector



Retrieving notices: ...working... done
Collecting package metadata (current_repodata.json): done
Solving environment: done





# All requested packages already installed.



#### NextGenMap

The NextGenMap software allows for short read mapping with a high sensitivity threshold, and is necessary for step ?* (Sedlazeck et al., 2013).

More information on NextGenMap can be found [here](https://github.com/Cibiv/NextGenMap/wiki)

***To install NextGenMap run the code cell below.***

In [2]:
import sys
!conda install --yes --prefix {sys.prefix} nextgenmap

Collecting package metadata (current_repodata.json): done
Solving environment: done





# All requested packages already installed.



#### GNU Parallel

GNU Parallel is a free shell tool used to execute jobs in parallel, and is necessary for step ?* (Tange, 2018).

More information on GNU Parallel can be found [here](https://www.gnu.org/software/parallel/)

***To install GNU Parallel run the code cell below.***

In [4]:
import sys
!conda install --yes --prefix {sys.prefix} parallel

Collecting package metadata (current_repodata.json): done
Solving environment: done





# All requested packages already installed.



#### Samtools

The Samtools software enables manipulation of next-generation sequencing data, and is necessary for step ?* (Danecek et al., 2021).

More information on Samtools can be found [here](https://github.com/samtools/samtools)

***To install Samtools run the code cell below.***

In [11]:
import sys
!conda install --yes --prefix {sys.prefix} samtools

Collecting package metadata (current_repodata.json): done
Solving environment: done





# All requested packages already installed.



#### BCFtools

The BCFtools software provides utilities for variant calling and for manipulating VCF and BCF files. It is built on HTSlib and developed alongside Samtools, both of which are installed in the adjacent cells (Danecek et al., 2021).

More information on BCFtools can be found [here](https://github.com/samtools/bcftools)

***To install BCFtools run the code cell below.***

In [None]:
import sys
!conda install --yes --prefix {sys.prefix} bcftools

Collecting package metadata (current_repodata.json): done
Solving environment: done





# All requested packages already installed.



#### HTSlib

HTSlib is a C library for reading and writing high-throughput sequencing data formats, and is used within Samtools and BCFtools (Bonfield et al., 2021).

More information on HTSlib can be found [here](https://github.com/samtools/htslib)

***To install HTSlib run the code cell below.***

In [13]:
import sys
!conda install --yes --prefix {sys.prefix} htslib

Collecting package metadata (current_repodata.json): done
Solving environment: done





# All requested packages already installed.



#### GATK

GATK is a **G**enome **A**nalysis **T**ool**K**it designed to identify variants in genomes and is used throughout the pipeline (O’Connor & Van der Auwera, 2020).

More information on GATK can be found [here](https://gatk.broadinstitute.org/hc/en-us)

***To install GATK run the code cell below.***

In [14]:
import sys
!conda install --yes --prefix {sys.prefix} gatk

Collecting package metadata (current_repodata.json): done
Solving environment: done





# All requested packages already installed.



#### RAxML

RAxML is a **R**andomized **Ax**elerated **M**aximum **L**ikelihood algorithm enabling maximum likelihood phylogenetic tree searches and is used in step ?* (Stamatakis, 2006).

More information on RAxML can be found [here](https://cme.h-its.org/exelixis/web/software/raxml/index.html)

***To install RAxML run the code cell below.***

In [16]:
import sys
!conda install --yes --prefix {sys.prefix} raxml

Collecting package metadata (current_repodata.json): done
Solving environment: done





# All requested packages already installed.



#### Bedtools

The bedtools software encompasses many algorithmic programs used for genome analysis and is used in step ?* (Quinlan & Hall, 2010).

More information on bedtools can be found [here](https://bedtools.readthedocs.io/en/latest/)

***To install Bedtools run the code cell below.***

In [17]:
import sys
!conda install --yes --prefix {sys.prefix} bedtools

Collecting package metadata (current_repodata.json): done
Solving environment: done





# All requested packages already installed.



#### UCSC LiftOver

LiftOver is a tool provided within the UCSC Genome Browser for converting genomic coordinates between assemblies, so that genetic analyses can be collated on the same build or version (Hinrichs et al., 2006).

More information on LiftOver can be found [here](http://hgdownload.cse.ucsc.edu/admin/exe/)

***To install LiftOver run the code cell below.***

In [18]:
import sys
!conda install --yes --prefix {sys.prefix} ucsc-liftover

Collecting package metadata (current_repodata.json): done
Solving environment: done





# All requested packages already installed.



#### VCFtools

The VCFtools program package enables operations such as filtering and categorising variants of VCF (**V**ariant **C**all **F**ormat) files and is used in step ?* (Danecek et al., 2011).

More information on VCFtools can be found [here](https://vcftools.github.io/)

***To install VCFtools run the code cell below.***

In [19]:
import sys
!conda install --yes --prefix {sys.prefix} vcftools

Collecting package metadata (current_repodata.json): done
Solving environment: done





# All requested packages already installed.
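
Once the install cells above have run, a quick check that the tools are reachable can save time later. The sketch below looks each executable up on `PATH` with `shutil.which`. The executable names are assumptions about what each conda package provides (for example `ngm` for NextGenMap and `tabix` for HTSlib); GATK, RAxML and khmer are left out because their executable names vary between versions, so add the names that apply to your installation.

```python
import shutil

# Assumed executable names; adjust them to match your installation.
ASSUMED_EXECUTABLES = [
    "rcorrector", "jellyfish", "ngm", "parallel", "samtools",
    "bcftools", "tabix", "bedtools", "liftOver", "vcftools",
]


def missing_tools(names):
    """Return the names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


missing = missing_tools(ASSUMED_EXECUTABLES)
print("Missing tools:", ", ".join(missing) if missing else "none")
```

Any tool reported as missing can be reinstalled by rerunning its install cell above.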



## **Data Input Requirements | Tāuru Raraunga**

### Data Collection 

Orr et al. (2020) sampled a phenotypically mosaic *Eucalyptus melliodora* individual (aka. a Yellow Box tree) in which one branch was known to express resistance to defoliation by Christmas beetles (genus *Anoplognathus*). From each of 8 distinct branches, 3 replicate samples were taken from the leaf tip in order to sequence the full genome.

Data collected for use within this pipeline is not restricted in the number of samples, but rather in the number of replicates per sample:

> #### **!!! TIP !!!**
>*There must be at least 2 replicates per sample and an equal number of replicates across all samples. Each sample should be fully sequenced via Illumina to produce a genome.
>The replicates of one sample should be distinguished by a letter suffix, formatted as follows:*
> **Sample1a, Sample1b, Sample1c etc.**
>
>*The raw data input should be paired sequencing reads for each sample in FASTQ format. The suffix of the paired files per replicate should be* **"R1.fastq"** *and* **"R2.fastq"**.
>
>> Raw data should be deposited in the raw folder (/rep_files/data/raw/), in a new folder "my_data". You will see the raw data used by Orr et al. (2020) in the raw folder, which will be used by default.
>> Each **replicate** of one **sample** should have two files in the following format: ***"Sample1a_R1.fastq"*** and ***"Sample1a_R2.fastq"***.
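
The naming rules in the tip above can be checked before running anything. The sketch below is one possible check, assuming file names of the form `<sample><replicate letter>_R1.fastq` / `_R2.fastq`; the function name `check_raw_names` and the regular expression are illustrative and not part of the original pipeline.

```python
import re
from collections import defaultdict

# Assumed naming scheme: <sample><replicate letter>_R1.fastq / _R2.fastq
NAME_RE = re.compile(r"^(?P<sample>.+?)(?P<rep>[a-z])_(?P<read>R[12])\.fastq$")


def check_raw_names(filenames):
    """Return a list of problems found in a list of raw read file names."""
    problems = []
    reads = defaultdict(set)        # (sample, replicate) -> {"R1", "R2"}
    for name in filenames:
        match = NAME_RE.match(name)
        if match is None:
            problems.append(f"{name}: does not match <sample><rep>_R1.fastq / _R2.fastq")
            continue
        reads[(match["sample"], match["rep"])].add(match["read"])

    replicates = defaultdict(set)   # sample -> {replicate letters}
    for (sample, rep), found in sorted(reads.items()):
        replicates[sample].add(rep)
        if found != {"R1", "R2"}:
            problems.append(f"{sample}{rep}: missing one file of the R1/R2 pair")

    counts = {sample: len(reps) for sample, reps in replicates.items()}
    for sample, n in sorted(counts.items()):
        if n < 2:
            problems.append(f"{sample}: only {n} replicate(s); at least 2 are required")
    if len(set(counts.values())) > 1:
        problems.append(f"unequal replicate counts across samples: {counts}")
    return problems


example = ["Sample1a_R1.fastq", "Sample1a_R2.fastq",
           "Sample1b_R1.fastq", "Sample1b_R2.fastq"]
print(check_raw_names(example) or "All file names look consistent.")
```

To check your own data, pass the file names of your raw folder instead, for example `check_raw_names(os.listdir("../data/raw/my_data/"))` after `import os`.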

As always, collecting and using data in a manner that is respectful and responsible towards communities (across all taxa) should take precedence.

### Pseudogenome

There was no high-quality reference genome available for *E. melliodora*, thus a pseudoreference genome was created using the reference genome of the closely related *Eucalyptus grandis*, aka. Rose Gum tree (Bartholomé et al., 2014).
> #### **!!! TIP !!!**
>*If replicating with your own data, use the high-quality reference genome of your exact taxon if available, OR otherwise that of the most closely related taxon.*
>
>*The genome should be in FASTA format (.fa) within a new folder inside the data folder labelled "my_ref" (rep_files/data/my_ref/), and the file itself called "ref.fa".*
>
>The *E. grandis* genome found in the "e_grandis" folder (/rep_files/data/e_grandis/) will be used by default otherwise.
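
A minimal sanity check for a user-supplied reference is sketched below: it only confirms that the first non-empty line of the file is a FASTA header (a line starting with `>`). The helper name `looks_like_fasta` is illustrative, and passing this check does not guarantee that the rest of the file is valid.

```python
from pathlib import Path


def looks_like_fasta(path):
    """True if the first non-empty line starts with '>' (a FASTA header)."""
    with Path(path).open() as handle:
        for line in handle:
            if line.strip():
                return line.startswith(">")
    return False  # empty file
```

From the analysis folder, the reference described in the tip above could be checked with `looks_like_fasta("../data/my_ref/ref.fa")`.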


# **Analysis Makefile**
The first makefile that Orr et al. (2020) suggest executing is found within the analysis folder of the repository files. This makefile uses 4 different scripts from the scripts folder, and can be broken down into **? main steps**.
> *Below are descriptions of each stage, covering inputs, outputs, set variables and any potential errors.*

### **Step One | Read Correction**
The first step of the pipeline is to *correct* the raw reads.

This step uses the algorithms provided by the **[Rcorrector](https://www.researchgate.net/publication/283260409_Rcorrector_efficient_and_accurate_error_correction_for_Illumina_RNA-seq_reads)** software to determine trusted kmers (using a de Bruijn graph), which will be further used in the next step to correct random sequencing errors, producing sliced reads.
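
To give an intuition for what a "trusted kmer" is, the toy sketch below counts kmers across a handful of reads and keeps those seen at least twice. It is not Rcorrector's actual algorithm, only an illustration of the idea that kmers created by a random sequencing error tend to be rare. The minimum count of 2 and the use of canonical kmers (a kmer and its reverse complement counted together) are assumptions chosen to mirror the `-L 2` and `-C` options visible in the jellyfish commands of the run log further down; the function names are hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical_kmers(read, k):
    """Yield each kmer of a read in canonical form (the smaller of the
    kmer and its reverse complement)."""
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        yield min(kmer, reverse_complement(kmer))


def trusted_kmers(reads, k, min_count=2):
    """Return the canonical kmers seen at least min_count times."""
    counts = Counter(km for read in reads for km in canonical_kmers(read, k))
    return {km for km, n in counts.items() if n >= min_count}


# Two identical reads plus one read carrying a single-base error.
reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
print(sorted(trusted_kmers(reads, k=4)))
```

The kmers that overlap the erroneous base in the third read occur only once, so they are not trusted. The real pipeline works with much longer kmers (k = 32 in the run log) across millions of reads.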

#### **Input Data**
> Raw reads with the suffix "R1.fastq" within the directory specified via the **READSFOLDER** variable are utilized. *I.e. the first of the paired reads for one replicate*. By default this is **"../data/raw/"**.
> 
> If using your own raw reads, you may change the directory utilising the code in the stage one cell below when calling the makefile or by overwriting the makefile itself.
>
>> **IF YOU ARE USING YOUR OWN RAW READS AND REFERENCE GENOME UTILIZE THE CODE UNDER RUNNING MAKEFILE!!!**

#### **Scripts Used**
> Lines 100-103 of the script clean_reads.sh execute this step. 
>
> The clean_reads.sh script is called in line 25 of the Makefile
> (this uses the directories set via **SCRIPTDIR** and **CLEANREADS**).

#### **Output Produced**
> The corrected reads will be outputted in a folder named "corrected" in the "cleaned_reads" folder specified by **CLEANFOLDER**. 
> 
> The code in the stage one cell can be used to check if the corrected files have been successfully produced in the right folder.

In [None]:
# STEP ONE CODE

# Use own raw data 
#!make READSFOLDER=../data/raw/my_data/

# Check Corrected Reads 
#!ls [path]/rep_files/analysis/cleaned_reads/

### **Step Two | Khmer Count Graph**
The second step is to store the trusted kmers in a khmer count graph.

In [2]:
# Change Directory
import os
os.getcwd() #run this line first to get your own analysis folder file path and replace below
os.chdir('/home/agre945/anaconda3/envs/studyenv/GitHub/SRS-AG/rep_files/analysis')
os.getcwd()# run to check working directory correct

'/home/agre945/anaconda3/envs/studyenv/GitHub/SRS-AG/rep_files/analysis'

In [None]:
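# Add the scripts folder to PATH. A `!export` runs in its own subshell and does
# not persist to later commands, so the variable is set through os.environ instead.
import os
os.environ["PATH"] += os.pathsep + "/home/agre945/anaconda3/envs/studyenv/GitHub/SRS-AG/rep_files/scripts"
# Run the makefile
!make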

../scripts/clean_reads.sh -i ../data/raw/
Put the kmers into bloom filter
jellyfish bc -m 32 -s 100000000 -C -t 18 -o tmp_f95afb0c8b8a1bf93582a1633e3d352f.bc ../data/raw/SRR9650834_RRR1.fastq ../data/raw/SRR9650834_RRR2.fastq 
Count the kmers in the bloom filter
jellyfish count -m 32 -s 100000 -C -t 18 --bc tmp_f95afb0c8b8a1bf93582a1633e3d352f.bc -o tmp_f95afb0c8b8a1bf93582a1633e3d352f.mer_counts ../data/raw/SRR9650834_RRR1.fastq ../data/raw/SRR9650834_RRR2.fastq 
Dump the kmers
jellyfish dump -L 2 tmp_f95afb0c8b8a1bf93582a1633e3d352f.mer_counts > tmp_f95afb0c8b8a1bf93582a1633e3d352f.jf_dump
Error correction
/home/agre945/anaconda3/envs/studyenv/bin/rcorrector -k 32 -t 18 -od ./cleaned_reads/corrected/  -p ../data/raw/SRR9650834_RRR1.fastq ../data/raw/SRR9650834_RRR2.fastq -c tmp_f95afb0c8b8a1bf93582a1633e3d352f.jf_dump
Stored 692403282 kmers
Weak kmer threshold rate: 0.062266 (estimated from 0.950/1 of the chosen kmers)
Bad quality threshold is '#'
Processed 117106412 reads
	Corrected