See NLP_vectors_1 for more on the thought process and on how the corpus was built from the vectors available via NCBI. This notebook was split off because collecting the vector sequences via the API takes ~3 hrs.

In [9]:
import pandas as pd
import numpy as np
import csv

from Bio import Entrez
from time import sleep

import matplotlib.pyplot as plt
import seaborn as sns

import gensim
import pickle
from itertools import product

In [2]:
corpus = pd.read_csv('corpus.csv')
corpus = list(corpus['0'])

In [3]:
len(corpus)

20384

To train W2V I have to 'tokenize' the corpus, that is, divide the sequences into 'words' (k-mers). I had to decide what k-mer length I wanted, or rather what length was computationally feasible. Each sequence is split into k-mers once per reading frame, so the number of 'sequences' after tokenization is [n * k-mer length], where n is the number of sequences in the corpus.

I initially tried 20 bp k-mers.

In [4]:
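tokenized_corpus = []
kmer_len = 20

for ind, sequence in enumerate(corpus):
    # Progress uses the zero-based index, so the last line reads just under 100%.
    print('\r' + 'Tokenizing sequence: ' + str(ind+1) + ' of ' + str(len(corpus)) +
          ', or ' + str(round((ind/len(corpus))*100, 4)) + '% done', end='')
    # One pass per reading frame: offset j shifts where the k-mer boundaries fall.
    for j in range(kmer_len):
        counter = 0
        try:
            tokenized_sequence = []
            # Step through the sequence in non-overlapping windows of kmer_len.
            # For j > 0 the final slice can run past the end and be shorter than kmer_len.
            while counter+kmer_len < len(sequence):
                tokenized_sequence.append(str(sequence[j+counter:j+counter+kmer_len]))
                counter += kmer_len
            tokenized_corpus.append(tokenized_sequence)
        except Exception:
            # Skip entries that cannot be tokenized (for example a non-string value),
            # without swallowing KeyboardInterrupt the way a bare except would.
            pass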

Tokenizing sequence: 20384 of 20384, or 99.9951% done
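
To make the reading-frame idea concrete, here is a minimal sketch on a made-up 12 bp sequence with 3 bp k-mers. `tokenize_frames` is a hypothetical helper, not something used in this notebook, and unlike the cell above it keeps only full-length k-mers. Each offset produces its own token list, which is why the tokenized corpus grows by a factor of the k-mer length.

```python
def tokenize_frames(sequence, k):
    """Return one token list per reading frame (offsets 0..k-1).

    Hypothetical helper for illustration; keeps only full-length k-mers.
    """
    frames = []
    for offset in range(k):
        # Non-overlapping k-mers starting at this offset.
        tokens = [sequence[i:i + k]
                  for i in range(offset, len(sequence) - k + 1, k)]
        frames.append(tokens)
    return frames


toy = "ATGCGTACCGGA"  # made-up 12 bp sequence
frames = tokenize_frames(toy, 3)
for offset, tokens in enumerate(frames):
    print(offset, tokens)
# 0 ['ATG', 'CGT', 'ACC', 'GGA']
# 1 ['TGC', 'GTA', 'CCG']
# 2 ['GCG', 'TAC', 'CGG']
```

With the real corpus, 20384 sequences at 20 frames each would give up to 407,680 token lists, assuming no sequence is skipped.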

#### Writing the file takes a long time and the resulting file is 22.6 GB. Skip this cell unless you want to create a huge file.

In [None]:
#with open("tokenized_corpus.csv", "w", newline="") as f:
#    writer = csv.writer(f)
#    writer.writerows(tokenized_corpus)
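
One possible way to avoid both the huge file and the in-memory list is to generate the token lists on the fly. gensim's `Word2Vec` accepts any restartable iterable of token lists, so something like the sketch below could be passed in place of `tokenized_corpus`. `FrameKmerCorpus` is a hypothetical class, it keeps only full-length k-mers, and it has not been tried on this corpus.

```python
class FrameKmerCorpus:
    """Restartable iterable yielding one k-mer token list per sequence and frame.

    Hypothetical class for illustration; not part of the notebook.
    """

    def __init__(self, sequences, k):
        self.sequences = sequences
        self.k = k

    def __iter__(self):
        for sequence in self.sequences:
            if not isinstance(sequence, str):
                continue  # skip missing or non-string entries
            for offset in range(self.k):
                # Full-length, non-overlapping k-mers for this reading frame.
                yield [sequence[i:i + self.k]
                       for i in range(offset, len(sequence) - self.k + 1, self.k)]


toy_corpus = FrameKmerCorpus(["ATGCGTAC", "GGATCC"], k=4)
first_pass = list(toy_corpus)
second_pass = list(toy_corpus)  # iterating again restarts from the beginning
print(first_pass[0])
```

Because `__iter__` starts over on every call, the object can be iterated more than once, which training over several epochs requires. The trade-off is that tokenization is repeated on every pass.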

#### This is where the model is trained. Parameters can be tuned. Initially used 10-dim vectors and a window size of 5 (the gensim default). A wide window might account for the restriction sites in each plasmid sequence (are they always in the same order?). Training uses the skip-gram algorithm (sg=1). Maybe this also needs a larger vector size?

***LONG TIME TRAINING D:***

Source: https://arxiv.org/pdf/1301.3781.pdf

In [5]:
# gensim < 4.0 API; in gensim >= 4.0 the `size` argument is named `vector_size`.
model = gensim.models.Word2Vec(tokenized_corpus, size=10, window=5, min_count=90, sg=1)
with open("w5_model.p", "wb") as f:
    pickle.dump(model, f)
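
As a rough illustration of what the window means for skip-gram, the sketch below lists the (center, context) pairs that a given window produces for one token list. `skipgram_pairs` is a hypothetical helper and a simplification: it ignores extras such as subsampling and negative sampling that real implementations add on top.

```python
def skipgram_pairs(tokens, window):
    """List (center, context) pairs within `window` positions of each token.

    Hypothetical helper for illustration; not how gensim is called.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


toy_tokens = ["k1", "k2", "k3", "k4"]  # placeholders standing in for k-mers
pairs = skipgram_pairs(toy_tokens, window=1)
print(pairs)
# [('k1', 'k2'), ('k2', 'k1'), ('k2', 'k3'), ('k3', 'k2'), ('k3', 'k4'), ('k4', 'k3')]
```

With non-overlapping 20 bp k-mers and a window of 5, each center k-mer is paired with up to 5 neighbours on either side, which corresponds to up to 100 bp of flanking sequence in each direction.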

#### Some info about the model that was trained above:

In [12]:
# number of k-mers in the model vocabulary (those occurring at least min_count=90 times)
model_vectors = model.wv
len(model_vectors.vectors)

182427
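
The count above covers only k-mers that passed the `min_count=90` filter, since gensim drops tokens whose total frequency is below `min_count`. A minimal sketch of that filtering on toy data, using a hypothetical `vocabulary` helper rather than gensim itself:

```python
from collections import Counter


def vocabulary(token_lists, min_count):
    """Return the set of tokens whose total count is at least min_count.

    Hypothetical helper mirroring the effect of Word2Vec's min_count filter.
    """
    counts = Counter(token for tokens in token_lists for token in tokens)
    return {token for token, n in counts.items() if n >= min_count}


toy_lists = [["AAA", "CCC", "AAA"], ["AAA", "GGG"], ["CCC", "TTT"]]
vocab = vocabulary(toy_lists, min_count=2)
print(sorted(vocab))
# ['AAA', 'CCC']
```

Running the same kind of count over the tokenized corpus with different thresholds would show how strongly the vocabulary size depends on `min_count`.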