# Creating The Dataset

We use the filtering and data-processing code in filters.ipynb and dataprocessing.ipynb to generate a balanced dataset on which to train our neural network. Our database is a single .h5 file called "NanoporeFiltered.h5". Inside it:
- the segments of data are stored under the dataset "NoisySignals"
- the optimal parameters for the timestepped low-pass filter are stored under the dataset "LowPass"
- the optimal parameters for the timestepped high-pass filter are stored under the dataset "HighPass"
- the optimal parameters for the timestepped band-pass filter are stored under the dataset "BandPass"
- the optimal parameters for the Butterworth low-pass filter are stored under the dataset "ButterworthLowPass"
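
To make the layout above concrete, here is a minimal sketch for checking that a file contains the five datasets listed. The dataset names come from the list; the helper names `missing_datasets` and `summarize_h5` are hypothetical, and the sketch assumes the datasets sit at the top level of the file, which this notebook does not state explicitly.

```python
# Dataset names taken from the list above; everything else is illustrative.
EXPECTED_DATASETS = (
    "NoisySignals",
    "LowPass",
    "HighPass",
    "BandPass",
    "ButterworthLowPass",
)


def missing_datasets(present_keys):
    """Return the expected dataset names that are absent from present_keys."""
    present = set(present_keys)
    return [name for name in EXPECTED_DATASETS if name not in present]


def summarize_h5(h5_path):
    """Print name, shape, and dtype of each top-level dataset in the file.

    Assumes the datasets are stored at the top level of the file.
    Returns the list of expected names that were not found.
    """
    import h5py  # imported lazily so missing_datasets works without h5py

    with h5py.File(h5_path, "r") as h5file:
        for name, item in h5file.items():
            if isinstance(item, h5py.Dataset):
                print(name, item.shape, item.dtype)
        return missing_datasets(h5file.keys())
```

Calling `summarize_h5` on the path to "NanoporeFiltered.h5" would print one line per dataset and return an empty list if all five names are present.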

Because the code is object-oriented, we can keep appending genomic data from various sources to the dataset.

## Modules/Libraries
All the code used here was written in filters.ipynb and dataprocessing.ipynb. We import those notebooks and their dependencies to keep building our database.

In [1]:
# Use filtration code developed in filters.ipynb and data processing tools in dataprocessing.ipynb to build dataset
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import solve_ivp
from scipy import signal
from scipy.optimize import curve_fit
import math
from typing import Optional
import os
import h5py
import random

import import_ipynb
from filters import Filters, TimesteppedFilters, ButterworthFilters
from dataprocessing import genData, dataTreatment, writeH5

importing Jupyter notebook from filters.ipynb
importing Jupyter notebook from dataprocessing.ipynb


## Paths
The user provides the path to DeepSimulator, the name of the sequence of interest, the path to the fasta file containing the sequence, and the path used for the database.

In [2]:
PATH_TO_DEEPSIM = '/Users/aaronphilip/ScienceFair/projects/NanoporeSequencingFiltering/DeepSimulator'
SEQUENCE = 'artificial_human_chr22'
PATH_TO_FASTA = os.path.join(PATH_TO_DEEPSIM, 'example/%s.fasta' % SEQUENCE)
PATH_TO_H5 = '/Users/aaronphilip/ScienceFair/projects/NanoporeSequencingFiltering/database'
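
Since every later cell depends on these paths, it may help to check them before generating any data. The following is a minimal sketch using only the standard library; the function name `find_missing_paths` is hypothetical and not part of filters.ipynb or dataprocessing.ipynb.

```python
import os


def find_missing_paths(named_paths):
    """Return the names whose paths do not exist on disk.

    named_paths maps a label (for example "PATH_TO_FASTA") to a path string.
    """
    return [name for name, path in named_paths.items()
            if not os.path.exists(path)]


# Example use with the variables defined in the cell above:
# missing = find_missing_paths({
#     "PATH_TO_DEEPSIM": PATH_TO_DEEPSIM,
#     "PATH_TO_FASTA": PATH_TO_FASTA,
#     "PATH_TO_H5": PATH_TO_H5,
# })
# if missing:
#     raise FileNotFoundError("Missing paths: " + ", ".join(missing))
```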

## Data Processing and H5 Writing
We recommend not using the 'Run All' command, since this may overwrite the database. Instead, after a group of data has been processed, change the H5NAME in that cell to some other name (we used 'test') so that accidentally re-running all cells does not rewrite the database. In the cells below, we outline exactly how the user can append to their database using samples with various noise levels. Note that the genData object doesn't need to be created every time; a single object can be reused for multiple consecutive .noisyRead() calls to save some computation time. However, we initially built the database in 500-datapoint increments.
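
The renaming practice described above can also be expressed as a small guard. This is a sketch under our own assumptions, not part of dataprocessing.ipynb: `resolve_h5_name`, `PROTECTED_NAMES`, and the `allow_protected` flag are hypothetical names introduced for illustration.

```python
# The real database name from this notebook; writes to it must be opted into.
PROTECTED_NAMES = {"NanoporeFiltered"}


def resolve_h5_name(requested_name, allow_protected=False, fallback="test"):
    """Return requested_name, or fallback if the name is protected.

    A protected name is only returned when allow_protected is True, so a
    stray 'Run All' writes to the fallback file instead of the database.
    """
    if requested_name in PROTECTED_NAMES and not allow_protected:
        return fallback
    return requested_name


# Example use in a cell: flip the flag only for the one run that should
# really append to the database.
# H5NAME = resolve_h5_name("NanoporeFiltered", allow_protected=False)
```

With this pattern, the cell text does not need to be edited after each run; only the flag changes.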

In [6]:
# 500 segments of data with default noise on Human Chr. 22
H5NAME = 'test'
humanChr22 = genData(PATH_TO_DEEPSIM, SEQUENCE, PATH_TO_FASTA)
truthRead = humanChr22.truthRead()
noisyRead = humanChr22.noisyRead()

treated = dataTreatment(noisyRead, truthRead, 500)
noisyData, labels, errors = treated.optParams()
h5obj = writeH5(PATH_TO_H5, noisyData, labels, H5NAME)
h5obj.initH5()
h5obj.appendH5()

In [3]:
# 500 segments of data from Human Chr. 22 with 90.3% typical accuracy w/o filtration
H5NAME = 'test'
humanChr22 = genData(PATH_TO_DEEPSIM, SEQUENCE, PATH_TO_FASTA)
truthRead = humanChr22.truthRead()
noisyRead = humanChr22.noisyRead(e=1.5, f=950, s=0)

treated = dataTreatment(noisyRead, truthRead, 500)
noisyData, labels, errors = treated.optParams()
h5obj = writeH5(PATH_TO_H5, noisyData, labels, H5NAME)
h5obj.appendH5()

Offset: 37.0 , Range: 1368.36 , Digitisation: 8192.0
Offset: 37.0 , Range: 1368.36 , Digitisation: 8192.0


In [3]:
# 500 segments of data from Human Chr. 22 with 90.9% typical accuracy w/o filtration
H5NAME = 'NanoporeFiltered'
humanChr22 = genData(PATH_TO_DEEPSIM, SEQUENCE, PATH_TO_FASTA)
truthRead = humanChr22.truthRead()
noisyRead = humanChr22.noisyRead(e=0, f=950, s=2)

treated = dataTreatment(noisyRead, truthRead, 500)
noisyData, labels, errors = treated.optParams()
h5obj = writeH5(PATH_TO_H5, noisyData, labels, H5NAME)
h5obj.appendH5()

Offset: 37.0 , Range: 1368.36 , Digitisation: 8192.0
Offset: 37.0 , Range: 1368.36 , Digitisation: 8192.0


In [None]:
# 500 segments of data from Human Chr. 22 with 91% typical accuracy w/o filtration
H5NAME = 'NanoporeFiltered'
noisyRead = humanChr22.noisyRead(e=0, f=850, s=0)

treated = dataTreatment(noisyRead, truthRead, 500)
noisyData, labels, errors = treated.optParams()
h5obj = writeH5(PATH_TO_H5, noisyData, labels, H5NAME)
h5obj.appendH5()

In [None]:
# 500 segments of data from Human Chr. 22 with 94% typical accuracy w/o filtration
# not run yet
H5NAME = 'NanoporeFiltered'
noisyRead = humanChr22.noisyRead(e=0.5, f=950, s=0)

treated = dataTreatment(noisyRead, truthRead, 500)
noisyData, labels, errors = treated.optParams()
h5obj = writeH5(PATH_TO_H5, noisyData, labels, H5NAME)
h5obj.appendH5()
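
The cells above repeat the same steps with different noise parameters. As a rough sketch of how one genData object could serve several .noisyRead() calls in a row, the loop below takes the data source and the processing step as arguments. The noise settings are copied from the cells above; `append_batches` and the `process_and_append` callback are hypothetical names, and the first cell (default noise plus initH5()) is deliberately left out.

```python
# Keyword arguments passed to noisyRead in the cells above.
NOISE_SETTINGS = [
    {"e": 1.5, "f": 950, "s": 0},
    {"e": 0, "f": 950, "s": 2},
    {"e": 0, "f": 850, "s": 0},
    {"e": 0.5, "f": 950, "s": 0},
]


def append_batches(data_source, noise_settings, process_and_append):
    """Generate one noisy read per setting from a single data source.

    data_source must offer truthRead() and noisyRead(**kwargs), as the
    genData object does in this notebook. process_and_append is a
    hypothetical callback that receives (noisy_read, truth_read).
    Returns the number of batches processed.
    """
    truth_read = data_source.truthRead()  # computed once and reused
    count = 0
    for settings in noise_settings:
        noisy_read = data_source.noisyRead(**settings)
        process_and_append(noisy_read, truth_read)
        count += 1
    return count


# A callback mirroring the cells above might look like this:
# def process_and_append(noisy_read, truth_read):
#     treated = dataTreatment(noisy_read, truth_read, 500)
#     noisyData, labels, errors = treated.optParams()
#     writeH5(PATH_TO_H5, noisyData, labels, H5NAME).appendH5()
#
# append_batches(humanChr22, NOISE_SETTINGS, process_and_append)
```

Passing the processing step in as a callback keeps the loop independent of the notebook classes, so it can be tried with a stand-in object before touching the database.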