### Word2Vec to classify metagenome samples by vector presence:

Apply the Word2Vec method that was previously used for microbiome samples. Build a corpus from vector sequences taken from either Addgene or NCBI. Test on one metagenomic sample (desert) to see whether clusters form based on the presence of vectors.

Start with a k-mer length of 6, since RIs seem to be 6 bp. At this time, I do not think that taking the MCS into account would make a difference (because of the nature of Word2Vec, n-grams, etc.). How would W2V behave, and what would it tell us, if the k-mer length were extended to encompass all or most of the MCS? Think more about this...
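A minimal sketch of how a sequence could be turned into 6-mer "words" for Word2Vec. The helper name `kmer_tokens`, the `step` parameter, and the toy sequence are illustrative assumptions, not part of the pipeline in this notebook.

```python
def kmer_tokens(sequence, k=6, step=1):
    """Split a nucleotide sequence into overlapping k-mers (the 'words')."""
    sequence = sequence.upper()
    # slide a window of width k along the sequence, moving `step` bases at a time
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, step)]

# hypothetical toy sequence, only for illustration
toy = "ATGGAATTCGC"
tokens = kmer_tokens(toy, k=6)
print(tokens)
```

Each plasmid would then contribute one "sentence" (its list of k-mers) to the training corpus. Using `step=k` instead would give non-overlapping words, which is a different design choice worth comparing.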

**To try:**

See whether a longer k-mer length gives better results (if computationally feasible).

Expand to more than one sample.
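On the question of whether longer k-mers are computationally possible: the number of distinct k-mers over A, C, G, T is at most 4^k, so the potential vocabulary grows quickly with k. The sketch below only does this arithmetic; the function name `max_vocab_size` is hypothetical, and the real vocabulary will be smaller because not every k-mer occurs in the corpus.

```python
def max_vocab_size(k, alphabet_size=4):
    """Upper bound on the number of distinct k-mers over a DNA alphabet."""
    return alphabet_size ** k

# compare the starting length (6) with a few longer candidates
sizes = {k: max_vocab_size(k) for k in (6, 8, 10, 12)}
for k, size in sizes.items():
    print(f"k={k}: up to {size:,} distinct k-mers")
```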

In [6]:
import pandas as pd
import numpy as np
import csv

from Bio import Entrez
from time import sleep

import matplotlib.pyplot as plt
import seaborn as sns

import gensim
import pickle

Data were extracted from the Genome database on NCBI. Only plasmid sequences were selected (all that were available), which gave 20384 plasmids.

In [2]:
vector_metadata = pd.read_csv('all_ncbi.csv')

Strip the accession numbers from the 'Replicons' column. If more than one accession number is listed, take only the first.

In [3]:
# keep everything after the ':' in each 'Replicons' entry
accessions = [replicon.partition(':')[2] for replicon in vector_metadata['Replicons']]

# if several accessions are listed (separated by '/'), keep only the first
accessions = [accession.partition('/')[0] for accession in accessions]
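To make the two `partition` steps concrete, here is a small sketch on made-up entries. The `Replicons` strings below are hypothetical and only assume the `name:accession/accession` shape implied by the parsing above; the helper name `first_accession` is also an assumption.

```python
def first_accession(replicon):
    """Return the first accession that follows the ':' in a Replicons entry."""
    after_colon = replicon.partition(':')[2]   # drop the replicon name
    return after_colon.partition('/')[0]       # keep only the first accession

# hypothetical entries, not taken from all_ncbi.csv
samples = [
    "plasmid pA:NC_000001.1/CP000001.1",
    "plasmid pB:NC_000002.1",
]
parsed = [first_accession(s) for s in samples]
print(parsed)
```

Because `str.partition` returns the whole string when the separator is absent, entries with a single accession pass through the second step unchanged.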

Fetch the plasmid sequence for each accession number and add it to the corpus for model training.

In [4]:
Entrez.email = 'camelliahilker@gmail.com'
corpus = []

def fetch_sequence(accession, max_tries=3):
    """Fetch one FASTA record from NCBI and return its sequence as a single string."""
    for attempt in range(1, max_tries + 1):
        try:
            handle = Entrez.efetch(db="nucleotide", id=accession, rettype="fasta", retmode="text")
            # drop the '>' header line and join every sequence line
            # (calling handle.readline() inside `for line in handle` skips every other line)
            sequence = ''.join(line.strip() for line in handle if not line.startswith('>'))
            handle.close()
            return sequence
        except Exception:
            if attempt == max_tries:
                raise
            # pause so that the API doesn't time out: 5 s, then 10 s
            sleep(5 * attempt)

for ind, accession in enumerate(accessions):
    print(f'\rFetching {ind + 1} of {len(accessions)}, '
          f'or {round((ind / len(accessions)) * 100, 4)}% done', end='')
    corpus.append(fetch_sequence(accession))

Fetching 20383 of 20384, or 99.9951% done
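A minimal sketch of how the FASTA text returned for one accession can be collapsed into a single sequence string, useful as a sanity check that no sequence lines are lost. It uses an in-memory `io.StringIO` handle with a made-up record instead of a live NCBI request; the helper name `read_fasta_sequence` and the record contents are hypothetical.

```python
import io

def read_fasta_sequence(handle):
    """Join all non-header lines of a single-record FASTA handle."""
    return ''.join(line.strip() for line in handle if not line.startswith('>'))

# hypothetical FASTA record standing in for an Entrez.efetch response
fake_handle = io.StringIO(">NC_000001.1 hypothetical plasmid\nATGC\nGGTA\nCC\n")
sequence = read_fasta_sequence(fake_handle)
print(sequence, len(sequence))
```

Comparing `len(sequence)` against the length reported by NCBI for a few accessions would be one way to confirm that the corpus holds complete sequences.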

In [10]:
corpus_csv = pd.DataFrame(corpus)
corpus_csv.to_csv('corpus.csv', index=False)