# Creating The Dataset

We use the filtering and data-processing code from filters.ipynb and dataprocessing.ipynb to generate a balanced dataset on which to train our neural network. The database is a single .h5 file called "NanoporeFiltered.h5". Inside it:
- the segments of data are stored under the dataset "NoisySignals"
- the optimal parameters for the timestepped low-pass filter are stored under the dataset "LowPass"
- the optimal parameters for the timestepped high-pass filter are stored under the dataset "HighPass"
- the optimal parameters for the timestepped band-pass filter are stored under the dataset "BandPass"
- the optimal parameters for the Butterworth low-pass filter are stored under the dataset "ButterworthLowPass"

Because the code is object-oriented, we can keep appending genomic data from a variety of sources to the dataset.
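
As a rough illustration of the layout described above, the sketch below builds a small stand-in file with the same five dataset names and appends rows to one of them using resizable h5py datasets. It is not the implementation in dataprocessing.ipynb: the helper names `init_db` and `append_rows`, the file name `demo.h5`, and every column width in `WIDTHS` are assumptions made for the example.

```python
import os
import tempfile

import h5py
import numpy as np

# Dataset names come from the list above. The column widths are hypothetical
# placeholders, not the real segment length or parameter counts.
WIDTHS = {
    "NoisySignals": 500,
    "LowPass": 1,
    "HighPass": 1,
    "BandPass": 2,
    "ButterworthLowPass": 2,
}


def init_db(path, widths):
    """Create (or overwrite) a file with one empty, resizable dataset per name."""
    with h5py.File(path, "w") as f:
        for name, width in widths.items():
            f.create_dataset(
                name,
                shape=(0, width),
                maxshape=(None, width),  # unlimited rows, fixed width
                dtype="f8",
                chunks=True,
            )


def append_rows(path, name, rows):
    """Append rows to an existing dataset and return its new row count."""
    rows = np.asarray(rows, dtype="f8")
    with h5py.File(path, "a") as f:
        ds = f[name]
        old = ds.shape[0]
        ds.resize(old + rows.shape[0], axis=0)  # grow along the row axis
        ds[old:] = rows
        return ds.shape[0]


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.h5")
    init_db(demo_path, WIDTHS)
    append_rows(demo_path, "NoisySignals", np.zeros((3, 500)))
    total = append_rows(demo_path, "NoisySignals", np.ones((2, 500)))
    with h5py.File(demo_path, "r") as f:
        names = sorted(f.keys())
```

Opening the file in mode `"w"` truncates it, while mode `"a"` keeps existing data, which is why initialisation and appending are kept as separate steps in this sketch.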

## Modules/Libraries
All of the code used here was already written in filters.ipynb and dataprocessing.ipynb. We simply import those notebooks and their dependencies to keep building the database.

In [1]:
# Use filtration code developed in filters.ipynb and data processing tools in dataprocessing.ipynb to build dataset
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import solve_ivp
from scipy import signal
from scipy.optimize import curve_fit
import math
from typing import Optional
import os
import h5py
import random

import import_ipynb
from filters import Filters, TimesteppedFilters, ButterworthFilters
from dataprocessing import genData, dataTreatment, writeH5

importing Jupyter notebook from filters.ipynb
importing Jupyter notebook from dataprocessing.ipynb


## Paths
The user provides the path to DeepSimulator, the name of the sequence of interest, the path to the FASTA file containing that sequence, and the path to the database file.

In [2]:
PATH_TO_DEEPSIM = '/Users/aaronphilip/ScienceFair/projects/NanoporeSequencingFiltering/DeepSimulator'
SEQUENCE = 'artificial_human_chr22'
PATH_TO_FASTA = os.path.join(PATH_TO_DEEPSIM, 'example/%s.fasta' % SEQUENCE)
PATH_TO_H5 = '/Users/aaronphilip/ScienceFair/projects/NanoporeSequencingFiltering/database'
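
Since every later cell depends on these paths, it can help to check them before generating data. The following is a minimal sketch, assuming a hypothetical helper `missing_paths` and a throwaway directory tree in place of the real DeepSimulator installation; none of these names come from the notebooks.

```python
import os
import tempfile


def missing_paths(paths):
    """Return the labels of all entries whose path does not exist on disk."""
    return [label for label, p in paths.items() if not os.path.exists(p)]


with tempfile.TemporaryDirectory() as root:
    sequence = "demo_sequence"  # hypothetical sequence name
    os.makedirs(os.path.join(root, "example"))

    # Same layout as above: <DeepSimulator>/example/<sequence>.fasta
    fasta = os.path.join(root, "example", "%s.fasta" % sequence)
    with open(fasta, "w") as fh:
        fh.write(">demo\nACGT\n")

    report = missing_paths({
        "deepsim": root,
        "fasta": fasta,
        "h5": os.path.join(root, "database"),  # deliberately not created
    })
```

An empty list means every path resolves; here only the `"h5"` entry is reported because the demo never creates it.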

In [3]:
humanChr22 = genData(PATH_TO_DEEPSIM, SEQUENCE, PATH_TO_FASTA)
truthRead = humanChr22.truthRead()
noisyRead = humanChr22.noisyRead()

Offset: 37.0 , Range: 1368.36 , Digitisation: 8192.0
Offset: 37.0 , Range: 1368.36 , Digitisation: 8192.0


## Data Processing and H5 Writing
Note that the user should **never** use "Run All" in this notebook, because doing so overwrites the database. Instead, work cell by cell to generate results for new data and append them to the database. Every cell except the one below can be run multiple times. Only run the .initH5() command when creating the database file for the first time or when starting over.

In [6]:
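# Find the optimal filter parameters for the noisy read against the truth read
treated = dataTreatment(noisyRead, truthRead, 500)
noisyData, labels, errors = treated.optParams()
h5obj = writeH5(PATH_TO_H5, noisyData, labels)
# WARNING: initH5() overwrites the database; run it only on first creation or when starting over
h5obj.initH5()
# Append the newly processed data to the database
h5obj.appendH5()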
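
One way to make the cell above safer against an accidental rerun is to initialise only when the database file does not exist yet. The sketch below shows that idea with a hypothetical helper `init_if_absent` and a plain text file standing in for the database; it does not reflect how `writeH5` is implemented.

```python
import os
import tempfile


def init_if_absent(db_file, init_fn):
    """Call init_fn only when db_file is missing. Return True if it was called."""
    if os.path.exists(db_file):
        return False
    init_fn()
    return True


with tempfile.TemporaryDirectory() as tmp:
    db_file = os.path.join(tmp, "demo.h5")  # stand-in file, not real HDF5

    def fake_init():
        # Stand-in for an initialiser that truncates the file.
        with open(db_file, "w") as fh:
            fh.write("header\n")

    first = init_if_absent(db_file, fake_init)   # file is created
    with open(db_file, "a") as fh:
        fh.write("appended row\n")
    second = init_if_absent(db_file, fake_init)  # skipped, data survives

    with open(db_file) as fh:
        contents = fh.read()
```

In the notebook this might be called as `init_if_absent(db_file, h5obj.initH5)`, where `db_file` would have to be the full path of the file that `writeH5` creates; that path is an assumption here, since the notebook does not show it.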