In [4]:
cd ..

/Users/adrienne/Projects/fastBio


In [5]:
import fastBio

# Steps to training a model

In fast.ai, there are three basic steps to training a deep learning model: 

1) Define your **transforms** (for sequences/text, this means defining the **tokenizer** and **vocabulary** you will use for tokenization and numericalization)

2) Create a **DataBunch** (which wraps up a PyTorch Dataset and DataLoader into one)

3) Create a **Learner** with your specified **model config**

and train!

If fastai v1 is new to you, I recommend taking a look at their very extensive [documentation](https://fastai1.fast.ai/), [forum](https://forums.fast.ai/), and [online course](https://course19.fast.ai/). Note fastBio uses fastai v1, which isn't compatible with the new fastai v2.

Biological sequence data needs some special treatment compared to text (kmer-based tokenization; handling sequence file types like fasta/fastq), so while we can use much of the built-in fast.ai *text* functionality, fastBio provides some helper functions and classes to deal with the quirks of biological data.

# create tokenizer and vocabulary for transforming seq data

In [36]:
#define a tokenizer with the correct kmer size and stride for your data

from fastBio.transform import BioTokenizer, BioVocab
tok = BioTokenizer(ksize=1, stride=1)
tok

BioTokenizer with the following special tokens:
 - xxunk
 - xxpad
 - xxbos
 - xxeos

The kmer size is how many nucleotides constitute a 'word' in the sequence, and the stride is the number of nucleotides to move forward from the start of one token to the start of the next.

So for a sequence: `ACGGCGCTC`

a kmer size of 3 and stride of 1 would result in the tokenized sequence: `['ACG','CGG','GGC','GCG','CGC','GCT','CTC']`

whereas a kmer size of 3 and stride of 3 would result in: `['ACG','GCG','CTC']`
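
To make the kmer size/stride relationship concrete, here is a minimal pure-Python sketch of kmer tokenization. This is not BioTokenizer's implementation; `kmer_tokenize` is a hypothetical helper, and I'm assuming that a trailing partial kmer is simply dropped.

```python
def kmer_tokenize(seq, ksize=1, stride=1):
    """Split seq into kmers of length ksize, moving stride bases between token starts."""
    # Only full-length kmers are kept; a trailing partial kmer is dropped (assumption).
    return [seq[i:i + ksize] for i in range(0, len(seq) - ksize + 1, stride)]

print(kmer_tokenize("ACGGCGCTC", ksize=3, stride=1))
print(kmer_tokenize("ACGGCGCTC", ksize=3, stride=3))
```

With stride equal to ksize the tokens don't overlap; with stride smaller than ksize, neighboring tokens share nucleotides.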

## create vocab from scratch

In [34]:
model_voc = BioVocab.create_from_ksize(ksize=1)
print(model_voc.itos)
model_voc.stoi

['xxunk', 'xxpad', 'xxbos', 'xxeos', 'T', 'C', 'A', 'G']


defaultdict(int,
            {'xxunk': 0,
             'xxpad': 1,
             'xxbos': 2,
             'xxeos': 3,
             'T': 4,
             'C': 5,
             'A': 6,
             'G': 7})

Above I created a vocabulary using a kmer size of 1 (so just the nucleotides A, C, T, G), but you can use larger kmer sizes as well:

In [39]:
model_voc = BioVocab.create_from_ksize(ksize=2)
print(model_voc.itos)
model_voc.stoi

['xxunk', 'xxpad', 'xxbos', 'xxeos', 'TA', 'TT', 'AG', 'CG', 'TC', 'CT', 'CA', 'AT', 'CC', 'GG', 'GA', 'AA', 'GT', 'AC', 'GC', 'TG']


defaultdict(int,
            {'xxunk': 0,
             'xxpad': 1,
             'xxbos': 2,
             'xxeos': 3,
             'TA': 4,
             'TT': 5,
             'AG': 6,
             'CG': 7,
             'TC': 8,
             'CT': 9,
             'CA': 10,
             'AT': 11,
             'CC': 12,
             'GG': 13,
             'GA': 14,
             'AA': 15,
             'GT': 16,
             'AC': 17,
             'GC': 18,
             'TG': 19})
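
If you're curious what a kmer vocabulary like this contains, the sketch below builds one by hand: the four special tokens followed by every possible kmer over A, C, G, T. `build_kmer_vocab` is a hypothetical helper, not part of fastBio, and it orders kmers alphabetically, which is not the order BioVocab produced above.

```python
from itertools import product

SPECIAL_TOKENS = ['xxunk', 'xxpad', 'xxbos', 'xxeos']

def build_kmer_vocab(ksize, alphabet="ACGT"):
    """Return (itos, stoi) for all kmers of length ksize, special tokens first."""
    kmers = [''.join(p) for p in product(alphabet, repeat=ksize)]
    itos = SPECIAL_TOKENS + kmers
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

itos, stoi = build_kmer_vocab(2)
print(len(itos))  # 4 special tokens + 4**2 kmers
print(itos)
```

The vocabulary grows as 4**ksize, so larger kmer sizes get big quickly.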

## Or download the predefined LookingGlass vocabulary

For training the LookingGlass model, I used ksize=1 and stride=1. If you're using a pretrained LookingGlass-based model, make sure your vocabulary is in the same order, so that your data is numericalized the same way as the data the LookingGlass weights were trained on.

Alternatively, you can simply download the LookingGlass vocabulary for this purpose:

In [15]:
#or download from pretrained vocab used in LookingGlass

#you might need this if you hit SSL certificate errors (as I did); note that it disables certificate verification
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

import urllib.request
urllib.request.urlretrieve("https://github.com/ahoarfrost/LookingGlass/releases/download/v1.0/ngs_vocab_k1_withspecial.npy", "ngs_vocab_k1_withspecial.npy")


('ngs_vocab_k1_withspecial.npy', <http.client.HTTPMessage at 0x7f99355f4b70>)

In [17]:
import numpy as np

voc = np.load('ngs_vocab_k1_withspecial.npy')
model_voc = BioVocab(voc)

In [21]:
print(model_voc.itos)
model_voc.stoi

['xxunk' 'xxpad' 'xxbos' 'xxeos' 'G' 'A' 'C' 'T']


defaultdict(int,
            {'xxunk': 0,
             'xxpad': 1,
             'xxbos': 2,
             'xxeos': 3,
             'G': 4,
             'A': 5,
             'C': 6,
             'T': 7})

Notice that the order of the nucleotides in this vocabulary differs from the one we generated from scratch; if you're using pretrained LookingGlass-based models, make sure you also use the LookingGlass vocab described here.
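
To see why the order matters, here is a small sketch that numericalizes the same sequence with the two vocabularies printed above. `numericalize` is a hypothetical stand-in for what a vocab does, not fastBio's code.

```python
# The two itos lists are copied from the outputs shown earlier in this notebook.
scratch_itos = ['xxunk', 'xxpad', 'xxbos', 'xxeos', 'T', 'C', 'A', 'G']
lg_itos = ['xxunk', 'xxpad', 'xxbos', 'xxeos', 'G', 'A', 'C', 'T']

def numericalize(tokens, itos):
    """Map tokens to integer ids; unknown tokens map to 0 (xxunk)."""
    stoi = {tok: i for i, tok in enumerate(itos)}
    return [stoi.get(tok, 0) for tok in tokens]

tokens = list("ACGT")
print(numericalize(tokens, scratch_itos))  # [6, 5, 7, 4]
print(numericalize(tokens, lg_itos))       # [5, 6, 4, 7]
```

The same nucleotides get different ids, so pretrained weights would see different inputs than they were trained on if the vocabularies don't match.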

# create a databunch 

You can create a databunch using the **BioLMDataBunch** (for language modeling) or **BioClasDataBunch** (for classification). You can do this from raw sequence files (fasta/fastq) or from csv files, using one of:

* from_folder
* from_seqfile
* from_df
* from_multiple_csv

You will probably want to create a **BioLMDataBunch** from_folder (which will include all sequences from a folder containing multiple fasta/fastq files), or from_seqfile (all sequences from a single fasta or fastq file). 

For a **BioClasDataBunch**, I find it easiest in practice to convert sequence files like fasta/fastq to csv files with the label in a column and the sequence in another column, and use from_df or from_multiple_csv, rather than use from_seqfile or from_folder. Alternatively, you *can* use the **BioTextList** class to go straight from sequence files. 
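
As a rough illustration of the fasta-to-csv conversion described above, here is a sketch using only the standard library. The helper names, the `label` and `seq` column names, and the example label are all assumptions for illustration; use whatever column names you plan to pass to from_df or from_multiple_csv.

```python
import csv
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from an open fasta file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, ''.join(chunks)

def fasta_to_csv(fasta_handle, csv_handle, label):
    """Write one csv row per sequence, all with the same label; return the row count."""
    writer = csv.writer(csv_handle)
    writer.writerow(['label', 'seq'])  # hypothetical column names
    n = 0
    for _, seq in read_fasta(fasta_handle):
        writer.writerow([label, seq])
        n += 1
    return n

# In-memory demo; with real files you would pass open() handles instead.
fasta = io.StringIO(">read1\nACGT\nGGCA\n>read2\nTTAGC\n")
out = io.StringIO()
n_rows = fasta_to_csv(fasta, out, label="oxidoreductase")
print(out.getvalue())
```

This assumes one label per input file, which fits the case where each fasta file holds sequences from a single class.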

You can create a custom databunch, a la the fast.ai data block API, using the **BioTextList** class, which provides a few extra specialized labeling functions etc. If you *must* use sequence files for classification, for example, you can provide a fairly complicated regex-based function to use fastai's label_from_func, or create a BioTextList.from_folder and use label_from_fname or label_from_header in the BioTextList class to extract labels from a filename or fasta header, for instance.
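
For the regex-based route, a labeling function might look something like the sketch below. The header layout (`label=...` somewhere in the fasta header) is purely hypothetical, as is the `extract_label` name; what your function actually receives depends on the items in your list, so check that before wiring it into label_from_func.

```python
import re

# Hypothetical header layout: ">read_id label=SOMETHING other=fields"
LABEL_PATTERN = re.compile(r'label=(\S+)')

def extract_label(header, default=None):
    """Return the value following 'label=' in a header, or default if absent."""
    match = LABEL_PATTERN.search(header)
    return match.group(1) if match else default

print(extract_label(">read42 label=frame1 len=150"))
print(extract_label(">read43 len=150"))
```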

## BioLMDataBunch example

Here we'll download some toy metagenomes (a small subset of sequences from 6 marine metagenomes from the [TARA project](https://www.ebi.ac.uk/ena/browser/view/PRJEB402)), split them into 'train' and 'valid' folders, and create a BioLMDataBunch:
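
The train/valid split itself isn't shown in this notebook, so here is one way it could be sketched once the files are downloaded. I'm assuming a file-level split (whole files go to either train or valid); `split_train_valid`, the 20% validation fraction, and the demo file names are all hypothetical.

```python
import random
import shutil
import tempfile
from pathlib import Path

def split_train_valid(src_dir, dest_dir, valid_frac=0.2, seed=42):
    """Copy files from src_dir into dest_dir/train and dest_dir/valid."""
    src_dir, dest_dir = Path(src_dir), Path(dest_dir)
    files = sorted(p for p in src_dir.iterdir() if p.is_file())
    random.Random(seed).shuffle(files)  # seeded, so the split is reproducible
    # Keep at least one validation file whenever there are any files at all.
    n_valid = max(1, int(len(files) * valid_frac)) if files else 0
    splits = {'valid': files[:n_valid], 'train': files[n_valid:]}
    for name, members in splits.items():
        (dest_dir / name).mkdir(parents=True, exist_ok=True)
        for f in members:
            shutil.copy(f, dest_dir / name / f.name)
    return {name: [f.name for f in members] for name, members in splits.items()}

# Demo on throwaway files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / 'raw'
    raw.mkdir()
    for i in range(6):
        (raw / f'metagenome_{i}.fastq').write_text('@r\nACGT\n+\nIIII\n')
    result = split_train_valid(raw, Path(tmp) / 'data', valid_frac=0.2)
    print(len(result['train']), 'train files,', len(result['valid']), 'valid files')
```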

In [None]:
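#these are 1000 random sequences from 6 marine metagenomes from the TARA project
#NOTE: `url` is not defined in this notebook; set it to the download location of the toy data,
#and note that urlretrieve expects a destination file name rather than a folder
urllib.request.urlretrieve(url, "/data/train/")

from pathlib import Path

data_path = Path('/data/')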
train_path = Path('/data/train/')
valid_path = Path('/data/valid/')
data_outfile = '/data/metagenome_LMbunch.pkl'

#define your batch size, ksize, and bptt
bs=512 
bptt=100
ksize=1

max_seqs=None #None or int to optionally limit the number of sequences read from each file in training
val_max_seqs=None #same for valid set
skiprows = 0 #0 or int to optionally skip X sequences in the beginning of the file before reading into the databunch
val_skiprows = 0 #same for valid set
#these will default to the parameters chosen here, we don't technically need to pass them

#using tok and model_voc defined above; BioLMDataBunch must also be imported from fastBio

#create new training chunk 
print('creating databunch')
data = BioLMDataBunch.from_folder(path=data_path,
                                  train=train_path, valid=valid_path, ksize=ksize,
                                  tokenizer=tok, vocab=model_voc,
                                  max_seqs_per_file=max_seqs, val_maxseqs=val_max_seqs,
                                  skiprows=skiprows, val_skiprows=val_skiprows,
                                  bs=bs, bptt=bptt)
print('there are',len(data.items),'items in itemlist, and',len(data.valid_ds.items),'items in data.valid_ds')
print('databunch preview:')
print(data)
#you can save your databunch to file like so:
data.save(data_outfile)

## BioClasDataBunch example

# create a learner and train

In [None]:
#pretrained 

## using a pretrained model 

In [None]:
#create LookingGlass() model from databunch defined above

In [None]:
#create LookingGlassClassifier() model from classifier databunch defined above

In [None]:
#pretrained model options:
#    'LookingGlass'
#    'LookingGlass_enc'
#    'FunctionalClassifier_enc'
#    'FunctionalClassifier'
#    'OptimalTempClassifier'
#    'OxidoreductaseClassifier'
#    'ReadingFrameClassifier'

## from scratch

In [28]:
#you can now create any model config you want, plug it into a fastai learner along with your databunch, and train
#lm
#classifier

## I don't want to deal with all the databunch/training stuff. What if I really just want to make some predictions on some data with a pretrained model? 

You can do that! LookingGlass and associated transfer learning tasks

In [None]:
#use the exported pretrained models with load_learner (like in InterpretLG)