## Simulating MinION reads with NanoSim

[NanoSim](https://github.com/bcgsc/NanoSim) is a fast and scalable read simulator that captures the technology-specific features of ONT data, and allows for adjustments upon improvement of nanopore sequencing technology.

NanoSim proceeds in two steps. In the first step (read characterization), NanoSim performs a comprehensive alignment-based analysis and generates a set of read profiles. In the second step (simulation), it uses the read features captured in those profiles, such as the base-calling errors of ONT reads and the length distribution, to produce in silico reads for a given reference genome. Although ONT chemistry is still evolving, the statistical models that NanoSim uses are theoretically valid for any version of it. At the moment, it has been tested with 1D reads using the R9 chemistry and with 2D reads using the R7, R7.3 and R9 chemistries. NanoSim can simulate either circular or linear genomes and is capable of simulating reads from genomes with multiple chromosomes. Note that the runtime and memory requirements of NanoSim depend heavily on the number of simulated reads and the length of the reference sequence.
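
To make the circular versus linear distinction concrete, the toy sketch below draws a single error-free read from a reference string. It is not NanoSim's code: the function name `sample_read` and its parameters are hypothetical, and it ignores error models and length distributions entirely. It only shows that a read on a circular genome may wrap past the end of the sequence, while a read on a linear genome must fit inside it.

```python
import random

def sample_read(reference, read_length, circular=False, rng=random):
    """Toy illustration (not NanoSim's implementation): draw one error-free read."""
    genome_length = len(reference)
    if read_length > genome_length:
        raise ValueError("read_length exceeds the reference length")
    if circular:
        # Any start position is valid; a read running past the end wraps around
        start = rng.randrange(genome_length)
        end = start + read_length
        if end <= genome_length:
            return reference[start:end]
        return reference[start:] + reference[:end - genome_length]
    # Linear genome: the read must fit entirely inside the reference
    start = rng.randrange(genome_length - read_length + 1)
    return reference[start:start + read_length]
```

Passing a seeded `random.Random` instance as `rng` makes the draw reproducible.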

NanoDJ uses [NanoSim-H](https://github.com/karel-brinda/NanoSim-H), a fork of the NanoSim project, which lets users move directly to the simulation step using precomputed error profiles instead of spending time on the read characterization step. This alternative may be attractive for users without access to ONT read-level data.

In [None]:
!nanosim-h -h

In [None]:
!nanosim-h -n 200 data/sample/reference/NZ_CP010867.1_Ref.fasta
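
A quick sanity check after the run is to count the records in the output FASTA and compare the result with the value passed to `-n`. The sketch below assumes the reads were written to `simulated.fa` (the file name loaded later in this notebook); the helper `count_fasta_records` is hypothetical and simply counts header lines.

```python
def count_fasta_records(path):
    """Count FASTA records by counting header lines (those starting with '>')."""
    count = 0
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                count += 1
    return count

# Example usage (assumes the simulator wrote its reads to simulated.fa):
# print(count_fasta_records("simulated.fa"))
```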

Using the QualityControl.ipynb notebook, the user can generate plots and tables (read length, GC content, etc.) from the simulated reads:

In [None]:
%load_ext rpy2.ipython

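#Biopython module imports for loading sequence files
from Bio import SeqIO
try:
    from Bio.SeqUtils import GC
except ImportError:
    #Newer Biopython releases replace GC with gc_fraction, which returns a fraction instead of a percentage
    from Bio.SeqUtils import gc_fraction
    def GC(seq):
        return 100 * gc_fraction(seq)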

#Numpy and pandas to manage our data
import numpy as np
import pandas as pd
import matplotlib.mlab as mlab
import matplotlib.pyplot as plt
import matplotlib

#We will use regular expressions to extract the read id
import re

#Function to build a pandas dataframe from a FASTA or FASTQ file
def read_sequence_file(filename, format):
    rows = []
    columns = ['id', 'sequence', 'read_length', 'gc_content']
    if(format == 'fastq'):
        columns.append('avg_quality')
    #Building each row with the read and its features
    for seq_record in SeqIO.parse(filename, format):
        if(format == 'fastq'):
            quality_score = np.mean(seq_record.letter_annotations["phred_quality"])
        read_id = seq_record.description
        if(format == 'fastq'):
            rows.append([read_id, seq_record.seq, len(seq_record.seq), GC(seq_record.seq), quality_score])
        else:
            rows.append([read_id, seq_record.seq, len(seq_record.seq), GC(seq_record.seq)])  
        
    df = pd.DataFrame(rows, columns=columns)
    return df

# ----- Place the path to your reads file here ----- #
seq_df = read_sequence_file('simulated.fa', 'fasta')

#Setting the plot size
matplotlib.rcParams['figure.figsize'] = (9, 8)

#We can customize the plot using a pre-defined style. Here we use 'seaborn-notebook' (newer matplotlib releases name it 'seaborn-v0_8-notebook'); 'ggplot' would emulate the style of ggplot, a popular R visualization package
plt.style.use('seaborn-notebook')

#In the plt.hist parameters we can set the number of bins and the bar color, using names such as 'blue', 'red', 'yellow'... or an RGB code.
#CHANGE 'read_length' TO 'gc_content' TO PLOT GC CONTENT INSTEAD OF READ LENGTH
plt.hist(seq_df['read_length'], bins=50, color='#ef8a62')
plt.title('Read length histogram')
plt.xlabel('Read length in bp')
plt.ylabel('Count')

plt.show()
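
Alongside the histogram, a small summary table of the read lengths can be useful. The sketch below is not taken from QualityControl.ipynb; the helper name `length_summary` is hypothetical. It computes the read count, total bases, mean length, longest read and N50 (the length of the read at which the cumulative sum of lengths, sorted from longest to shortest, reaches half of the total bases).

```python
def length_summary(lengths):
    """Summarize read lengths: count, total bases, mean, longest read and N50."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no read lengths given")
    total = sum(ordered)
    running = 0
    n50 = ordered[-1]
    for length in ordered:
        running += length
        # N50: first length at which the running sum covers half of all bases
        if running * 2 >= total:
            n50 = length
            break
    return {
        "reads": len(ordered),
        "total_bases": total,
        "mean_length": total / len(ordered),
        "longest": ordered[0],
        "n50": n50,
    }

# Example usage with the dataframe built above:
# print(length_summary(seq_df['read_length']))
```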

## NanoSim regular version usage

In the regular version of NanoSim, the two stages need to be executed separately. For the read characterization step, NanoSim takes a training read set in FASTA format and uses minimap2 or LAST as the aligner to characterize base-calling errors. An alignment file in SAM/MAF format can be provided instead of the reads. A set of read profiles is then generated and used as input to the simulation step.
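
To give a rough idea of what characterizing errors from an alignment involves, the sketch below tallies the operations in a SAM CIGAR string. It is only an illustration under simplifying assumptions, not NanoSim's analysis, which is considerably more elaborate; the helper name `tally_cigar` is hypothetical. Mismatches can only be told apart from matches when the aligner emits the extended `=`/`X` operations instead of `M`.

```python
import re
from collections import Counter

# One CIGAR element: a length followed by a SAM operation code
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def tally_cigar(cigar):
    """Sum the lengths of each CIGAR operation (e.g. '=' match, 'X' mismatch,
    'I' insertion, 'D' deletion) in one alignment."""
    counts = Counter()
    for length, op in CIGAR_OP.findall(cigar):
        counts[op] += int(length)
    return counts
```

Summing these tallies over all alignments would give crude per-base mismatch, insertion and deletion rates for a training read set.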

Ready-to-use read profiles can be downloaded from [Supplementary data of NanoSim](ftp://ftp.bcgsc.ca/supplementary/NanoSim/). These profiles, available since the latest NanoSim release, allow users to move directly to the simulation step.

In [None]:
!read_analysis.py -h

In [None]:
!read_analysis.py -i data/sample/reads.fastq -r data/sample/reference/NZ_CP010867.1_Ref.fasta -o sample

The simulation stage takes the read profiles and a reference as input. Its output is a FASTA file with the simulated reads.

In [None]:
!simulator.py -h

In [None]:
!simulator.py circular -r data/sample/reference/NZ_CP010867.1_Ref.fasta -c sample

### Reference:

[1] Yang C., Chu J., Warren R.L., Birol I. NanoSim: nanopore sequence read simulator based on statistical characterization. GigaScience, Volume 6, Issue 4, 1 April 2017, gix010. DOI: https://doi.org/10.1093/gigascience/gix010
