# Real Long Read Data #
For this experiment, the following real _E.coli_ genomic data were used:
#### Reference genome ####
* [_Escherichia coli_ str. K-12 substr. MG1655, complete genome](http://www.ncbi.nlm.nih.gov/nuccore/NC_000913)

#### PacBio Long Reads ####
* [_E.coli_ K12 MG1655 CLR Reads from Pacific Biosciences DevNet](https://github.com/PacificBiosciences/DevNet/wiki/E%20coli%20K12%20MG1655%20Hybrid%20Assembly)
* [_E.coli_ K12 MG1655 CCS Reads from Pacific Biosciences DevNet](https://github.com/PacificBiosciences/DevNet/wiki/E%20coli%20K12%20MG1655%20Hybrid%20Assembly)
* [Pacific Biosciences Model Organism Genome Sequencing of _E.coli_ P5C3 - SRR1284073](http://www.ncbi.nlm.nih.gov/sra/?term=SRR1284073)

## Analyses ##
Length distribution histograms and length-identity scatterplots were created for each dataset, and the mean identity of each dataset was computed.
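
As an illustration of how a length distribution can be tabulated, the following is a minimal Python sketch. It assumes a FASTQ file with exactly four lines per record (no wrapped sequences); the function names and the bin width are hypothetical and are not taken from the scripts used for the plots in this document.

```python
import io
from collections import Counter


def read_lengths(fastq_handle):
    """Yield the sequence length of each record in a 4-line-per-record FASTQ stream."""
    for i, line in enumerate(fastq_handle):
        if i % 4 == 1:  # the sequence is the second line of every record
            yield len(line.strip())


def length_histogram(lengths, bin_width=1000):
    """Count reads per length bin; each key is the lower bound of its bin."""
    return Counter((n // bin_width) * bin_width for n in lengths)


# Toy input; a real file would be opened with open("reads.fastq") instead.
toy = io.StringIO("@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACG\n+\nIII\n")
lengths = list(read_lengths(toy))
hist = length_histogram(lengths, bin_width=5)
print(sorted(hist.items()))
```

The resulting bin counts can be passed to any plotting library to draw the histogram.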

To construct the length-identity scatterplots, the reads were mapped onto the reference genome using BLASR, and the resulting SAM files were converted to [Multiple Alignment Format](https://genome.ucsc.edu/FAQ/FAQformat.html#format5) for analysis.

#### Definitions ####
Given an alignment between reference sequence A and read B, the _identity_ of B is defined as the number of positions in the alignment such that the bases of A and B are equal.
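
The definition above can be sketched in Python as follows. This is a minimal illustration that assumes the two aligned rows (as they would appear in a MAF block) have equal length and use `-` for gaps; the function name `identity` and the toy sequences are hypothetical.

```python
def identity(ref_aln, read_aln, gap="-"):
    """Count alignment columns in which reference and read carry the same base."""
    if len(ref_aln) != len(read_aln):
        raise ValueError("aligned sequences must have equal length")
    return sum(
        1
        for a, b in zip(ref_aln, read_aln)
        if a != gap and a.upper() == b.upper()  # ignore case, never count gaps
    )


# Toy alignment: one insertion and one deletion in the read.
ref_aln = "ACGT-ACGTA"
read_aln = "ACGTTAC-TA"
matches = identity(ref_aln, read_aln)
print(matches)  # 8 of the 10 columns carry equal bases
```

Under this definition the identity is a count, so it grows with read length; this is why the mean identities reported below are in the hundreds or thousands rather than between 0 and 1.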

### Mean Identities ###

|  Dataset   | Mean Identity |
| :-------:  | :-----------: |
| CLR        | 1969          |
| CCS        | 692           |
| SRR1284073 | 5325          |

### Length Distributions ###

#### CLR ####
![CLR Length Distribution](ecoli_real/length_distributions/PacBio_10kb_CLR_length_dist.png)

#### CCS ####
![CCS Length Distribution](ecoli_real/length_distributions/PacBio_2kb_CCS_500bp_length_dist.png)

#### SRR1284073 ####
![SRR1284073 Length Distribution](ecoli_real/length_distributions/SRR1284073_length_dist.png)

### Length-Identity Scatterplots ###

#### CLR ####
![CLR Length-Identity Scatterplots](ecoli_real/length_identity_scatters/PacBio_10kb_CLR_length_identity_scatter.png)

#### CCS ####
![CCS Length-Identity Scatterplots](ecoli_real/length_identity_scatters/PacBio_2kb_CCS_length_identity_scatter.png)

#### SRR1284073 ####
![SRR1284073 Length-Identity Scatterplots](ecoli_real/length_identity_scatters/SRR1284073_length_identity_scatter.png)

# Simulated Data #
PacBio long reads were simulated using SimLoRD. Two types of reads were simulated: CLR reads (i.e. the maximum number of passes over each read was set to 1) and CCS reads (i.e. the maximum number of passes over each read was set to 40). For each type of read, long-read datasets with coverages of 10, 20, 50 and 75 were simulated.

## Simulation of Short Reads ##
Short reads were simulated with ART using the following command:

```
art-illumina -p -i $ref -l 100 -f $cov -m 300 -s 25 -o $outputPrefix
```

The parameters used are explained below:
* `-p` indicates that paired-end reads were simulated.
* `-i $ref` indicates the reference genome from which the reads were simulated.
* `-l 100` indicates that the length of the simulated reads was 100 bp.
* `-f $cov` indicates that the fold of read coverage was set to `$cov`. `$cov` was set to 50, 100 and 200.
* `-m 300` indicates that the mean fragment size of the paired-end reads was set to 300 bp.
* `-s 25` indicates that the standard deviation of the fragment size was set to 25 bp.
* `-o $outputPrefix` indicates the prefix of the output files.

## Simulation of Long Reads ##
The simulation of long reads was done in two steps:
1. Given the lengths of the reads in a real PacBio FASTQ file, the number of reads required for a given coverage was computed.
2. Long reads were simulated using SimLoRD, with read lengths sampled from the same real PacBio FASTQ file.
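
The first step can be sketched as follows. This is a minimal Python illustration, assuming that the required number of reads is the target coverage times the genome length divided by the mean sampled read length; the exact calculation used for this experiment is not recorded here. The function name, the toy read lengths, and the genome length (that of NC_000913.3) are assumptions; substitute the values of the reference and FASTQ file actually used.

```python
import math


def reads_for_coverage(read_lengths, genome_length, coverage):
    """Estimate how many reads are needed to reach a fold coverage,
    given a sample of read lengths to draw from."""
    mean_length = sum(read_lengths) / len(read_lengths)
    return math.ceil(coverage * genome_length / mean_length)


GENOME_LENGTH = 4_641_652       # assumption: length of NC_000913.3 in bp
toy_lengths = [2000, 4000, 6000]  # hypothetical lengths; mean is 4000 bp

n_reads = {
    cov: reads_for_coverage(toy_lengths, GENOME_LENGTH, cov)
    for cov in (10, 20, 50, 75)
}
print(n_reads)
```

The value computed for each coverage is what would be passed to SimLoRD as `$reads`.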

The following SimLoRD command was used:

```
$simlord -n $reads -sf $fastq -rr $ref $outputPrefix -mp $passes
```

The parameters used are explained below:
* `-n $reads` sets the number of reads to simulate.
* `-sf $fastq` indicates that the read lengths were sampled from a real PacBio FASTQ file. In this case, the _E.coli_ PacBio file SRR1284073 was used.
* `-rr $ref` indicates the reference genome from which the reads were simulated.
* `$outputPrefix` is the prefix of the output files.
* `-mp $passes` sets the maximum number of passes allowed for a given read. `$passes` was set to 1 to simulate CLR data and to 40 (the SimLoRD default) to simulate CCS data.
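
For scripting several runs, the command can be assembled programmatically. The following Python sketch only builds and prints the command line; it does not run SimLoRD. The executable name `simlord`, the file names, the output prefixes and the read count are hypothetical placeholders, while the flags are the ones from the command above.

```python
import shlex


def simlord_command(n_reads, fastq, reference, output_prefix, max_passes,
                    executable="simlord"):
    """Assemble the argument list for one SimLoRD run, mirroring the command above."""
    return [
        executable,
        "-n", str(n_reads),       # number of reads to simulate
        "-sf", fastq,             # sample read lengths from this FASTQ file
        "-rr", reference,         # reference genome to read from
        output_prefix,
        "-mp", str(max_passes),   # 1 for CLR, 40 for CCS
    ]


# Placeholder inputs, for illustration only.
clr = simlord_command(11605, "SRR1284073.fastq", "ecoli.fasta", "clr-long-d10", 1)
ccs = simlord_command(11605, "SRR1284073.fastq", "ecoli.fasta", "ccs-long-d10", 40)
print(shlex.join(clr))
print(shlex.join(ccs))
```

The returned list can be handed to `subprocess.run` once the paths point at real files.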

## Analyses of Simulated Datasets ##
Length distribution histograms, length-identity scatterplots and length-error rate scatterplots were created for each type of dataset (i.e. CLR and CCS). The mean identities and total error rates were also collected.

To construct the length-identity scatterplots, the reads were mapped onto the reference genome using BLASR, and the resulting SAM files were converted to [Multiple Alignment Format](https://genome.ucsc.edu/FAQ/FAQformat.html#format5) for analysis.

To construct the length-error rate scatterplots, the SAM files written by SimLoRD were converted to [Multiple Alignment Format](https://genome.ucsc.edu/FAQ/FAQformat.html#format5) for further analysis.

#### Definitions ####
Given an alignment between reference sequence A and read B, the _identity_ of B is defined as the number of positions in the alignment such that the bases of A and B are equal.

Given an alignment between reference sequence A and read B, the _error rate_ of B is the number of positions in the alignment such that the bases of A and B are not equal.

Given a set of SimLoRD reads S, the _total error rate_ of S is defined as the total number of erroneous bases in S divided by the total number of bases in S. This number is given a priori in the SimLoRD FASTQ files.
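
The two error measures defined above can be sketched in Python as follows. This is a minimal illustration: the per-read function counts mismatching alignment columns (gaps included), and the set-level function divides total erroneous bases by total bases. The function names and the toy values are hypothetical, and parsing the error counts out of the SimLoRD output is deliberately left out.

```python
def error_count(ref_aln, read_aln):
    """Count alignment columns in which reference and read differ (gaps included)."""
    if len(ref_aln) != len(read_aln):
        raise ValueError("aligned sequences must have equal length")
    return sum(1 for a, b in zip(ref_aln, read_aln) if a.upper() != b.upper())


def total_error_rate(per_read):
    """per_read: iterable of (erroneous_bases, read_length) pairs for a read set."""
    per_read = list(per_read)
    errors = sum(e for e, _ in per_read)
    bases = sum(n for _, n in per_read)
    return errors / bases


# Toy alignment with one insertion and one deletion in the read.
errors = error_count("ACGT-ACGTA", "ACGTTAC-TA")

# Toy read set: 2 errors in a 10-base read, 1 error in another 10-base read.
rate = total_error_rate([(2, 10), (1, 10)])
print(errors, rate)
```

Note that the total error rate weights every base equally, so long reads contribute more to it than short reads.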

## CLR Analyses ##

### Mean Identities ###

|  Coverage   | Mean Identity |
| :--------:  | :-----------: |
| 10          | 2884          |
| 20          | 2873          |
| 50          | 2897          |
| 75          | 2879          |

### Error Rate ###

| Coverage | Error Rate |
| :------: | :--------: |
| 10       | 0.165      |
| 20       | 0.165      |
| 50       | 0.165      |
| 75       | 0.165      |

### Length Distributions ###

#### Coverage 10 ####
![Coverage 10 Length Distribution](ecoli_clr/length_distributions/long-d10.png)

#### Coverage 20 ####
![Coverage 20 Length Distribution](ecoli_clr/length_distributions/long-d20.png)

#### Coverage 50 ####
![Coverage 50 Length Distribution](ecoli_clr/length_distributions/long-d50.png)

#### Coverage 75 ####
![Coverage 75 Length Distribution](ecoli_clr/length_distributions/long-d75.png)

### Length - Identity Scatterplots ###

#### Coverage 10 ####
![Coverage 10 Length Identity Scatterplot](ecoli_clr/length_identity_scatters/ecoli-long-d10_length_identity_scatter.png)

#### Coverage 20 ####
![Coverage 20 Length Identity Scatterplot](ecoli_clr/length_identity_scatters/ecoli-long-d20_length_identity_scatter.png)

#### Coverage 50 ####
![Coverage 50 Length Identity Scatterplot](ecoli_clr/length_identity_scatters/ecoli-long-d50_length_identity_scatter.png)

#### Coverage 75 ####
![Coverage 75 Length Identity Scatterplot](ecoli_clr/length_identity_scatters/ecoli-long-d75_length_identity_scatter.png)

### Length - Error Rate Scatterplots ###

#### Coverage 10 ####
![Coverage 10 Length - Error Rate Scatter](ecoli_clr/error_rate_scatter/ecoli-long-d10_error_rate_scatter.png)

#### Coverage 20 ####
![Coverage 20 Length - Error Rate Scatter](ecoli_clr/error_rate_scatter/ecoli-long-d20_error_rate_scatter.png)

#### Coverage 50 ####
![Coverage 50 Length - Error Rate Scatter](ecoli_clr/error_rate_scatter/ecoli-long-d50_error_rate_scatter.png)

#### Coverage 75 ####
![Coverage 75 Length - Error Rate Scatter](ecoli_clr/error_rate_scatter/ecoli-long-d75_error_rate_scatter.png)

## CCS Analyses ##

### Mean Identities ###

|  Coverage   | Mean Identity |
| :--------:  | :-----------: |
| 10          | 2838          |
| 20          | 2843          |
| 50          | 2791          |
| 75          | 2821          |

### Error Rate ###

| Coverage | Error Rate |
| :------: | :--------: |
| 10       | 0.096      |
| 20       | 0.096      |
| 50       | 0.095      |
| 75       | 0.096      |

### Length Distributions ###

#### Coverage 10 ####
![Coverage 10 Length Distribution](ecoli_ccs/length_distributions/long-d10.png)

#### Coverage 20 ####
![Coverage 20 Length Distribution](ecoli_ccs/length_distributions/long-d20.png)

#### Coverage 50 ####
![Coverage 50 Length Distribution](ecoli_ccs/length_distributions/long-d50.png)

#### Coverage 75 ####
![Coverage 75 Length Distribution](ecoli_ccs/length_distributions/long-d75.png)

### Length - Error Rate Scatterplots ###

#### Coverage 10 ####
![Coverage 10 Length - Error Rate Scatter](ecoli_ccs/error_rate_scatter/ecoli_ccs-long-d10_error_rate_scatter.png)

#### Coverage 20 ####
![Coverage 20 Length - Error Rate Scatter](ecoli_ccs/error_rate_scatter/ecoli_ccs-long-d20_error_rate_scatter.png)

#### Coverage 50 ####
![Coverage 50 Length - Error Rate Scatter](ecoli_ccs/error_rate_scatter/ecoli_ccs-long-d50_error_rate_scatter.png)

#### Coverage 75 ####
![Coverage 75 Length - Error Rate Scatter](ecoli_ccs/error_rate_scatter/ecoli_ccs-long-d75_error_rate_scatter.png)

### Length - Identity Scatterplots ###

#### Coverage 10 ####
![Coverage 10 Length Identity Scatterplot](ecoli_ccs/length_identity_scatters/ecoli-long-d10_length_identity_scatter.png)

#### Coverage 20 ####
![Coverage 20 Length Identity Scatterplot](ecoli_ccs/length_identity_scatters/ecoli-long-d20_length_identity_scatter.png)

#### Coverage 50 ####
![Coverage 50 Length Identity Scatterplot](ecoli_ccs/length_identity_scatters/ecoli-long-d50_length_identity_scatter.png)

#### Coverage 75 ####
![Coverage 75 Length Identity Scatterplot](ecoli_ccs/length_identity_scatters/ecoli-long-d75_length_identity_scatter.png)