#### Running model on one sample from the Desert dataset. Sample "mgs283966" obtained from MG-RAST. Smallest sample in dataset: ~1.2E9 bp, ~7.2E6 sequences

Citation:

Finkel OM, Delmont TO, Post AF, Belkin S. Metagenomic Signatures of Bacterial Adaptation to Life in the Phyllosphere of a Salt-Secreting Desert Tree. Appl Environ Microbiol. 2016 Apr 18;82(9):2854-61.

#### When I did this, there were no hits to the plasmids, I decided to try another sample from the Urban dataset and from NCBI SRA. Hsu_etal_2016_SRA

In [1]:
import pandas as pd
import numpy as np
import csv

from Bio import Entrez
from time import sleep

import matplotlib.pyplot as plt
import seaborn as sns

import gensim
import pickle

import os
import gzip

In [2]:
model = pickle.load(open("w5_model.p", "rb"))
model_vectors = model.wv

In [3]:
#processed to only include sequences, used the '5. Host DNA contamination removal' version of the sample from MG-RAST

desert_sample = open('mgm4620735.3.299.screen.passed.fna', 'r').readlines()
desert_sample = [seq.rstrip().upper() for seq in desert_sample[1::2]]

In [13]:
len(desert_sample)

5670854

Here is where I attempt to iterate over all of the files in the hsu_urban directory. Also need to be unzipped and the files are large, kept read 1 and 2

In [4]:
# urban_sample = []
# directory = os.fsencode('hsu_urban')

# for file in os.listdir(directory):
#     sample = []
#     with gzip.open(file,'rb') as f:        
#         for line in f:
#             sample.append(line.decode("utf-8"))
#             sample = [i.rstrip() for i in sample[1::4]]
#         f.close()
#     urban_sample += sample

#### This process is similar to how the model was trained, but to save memory only k-mers that have a match in the model are kept.

In [5]:
sample = desert_sample
vectorized_sample = []
kmer_len = 20

for ind,sequence in enumerate(sample):
    print('\r' + 'Tokenizing sequence: ' + str(ind+1) + ' of ' + str(len(sample)) + 
                  ', or ' + str(round((ind/len(sample))*100, 4)) + '% done', end='')
    for j in range(kmer_len):
        counter = 0
        num_matches = 0
        try:
            vectorized_sequence = []
            while counter+kmer_len < len(sequence):
                try:
                    model_vectors.similar_by_word(sequence[j+counter:j+counter+kmer_len])
                    if num_matches == 0:
                        vectorized_sequence = model_vectors[sequence[j+counter:j+counter+kmer_len]]
                    else:
                        vectorized_sequence = np.add(vectorized_sequence, model_vectors[sequence[j+counter:j+counter+kmer_len]])
                except:
                    pass
                counter += kmer_len
                num_matches += 1
            vectorized_sample.append(vectorized_sequence/num_matches)
        except:
            pass

Tokenizing sequence: 1 of 5670854, or 0.0% doneTokenizing sequence: 2 of 5670854, or 0.0% doneTokenizing sequence: 3 of 5670854, or 0.0% doneTokenizing sequence: 4 of 5670854, or 0.0001% doneTokenizing sequence: 5 of 5670854, or 0.0001% doneTokenizing sequence: 6 of 5670854, or 0.0001% doneTokenizing sequence: 7 of 5670854, or 0.0001% doneTokenizing sequence: 8 of 5670854, or 0.0001% doneTokenizing sequence: 9 of 5670854, or 0.0001% doneTokenizing sequence: 10 of 5670854, or 0.0002% doneTokenizing sequence: 11 of 5670854, or 0.0002% doneTokenizing sequence: 12 of 5670854, or 0.0002% doneTokenizing sequence: 13 of 5670854, or 0.0002% doneTokenizing sequence: 14 of 5670854, or 0.0002% doneTokenizing sequence: 15 of 5670854, or 0.0002% doneTokenizing sequence: 16 of 5670854, or 0.0003% doneTokenizing sequence: 17 of 5670854, or 0.0003% doneTokenizing sequence: 18 of 5670854, or 0.0003% doneTokenizing sequence: 19 of 5670854, or 0.0003% doneTokenizing sequence: 20 of 56

Tokenizing sequence: 220 of 5670854, or 0.0039% doneTokenizing sequence: 221 of 5670854, or 0.0039% doneTokenizing sequence: 222 of 5670854, or 0.0039% doneTokenizing sequence: 223 of 5670854, or 0.0039% doneTokenizing sequence: 224 of 5670854, or 0.0039% doneTokenizing sequence: 225 of 5670854, or 0.004% doneTokenizing sequence: 226 of 5670854, or 0.004% doneTokenizing sequence: 227 of 5670854, or 0.004% doneTokenizing sequence: 228 of 5670854, or 0.004% doneTokenizing sequence: 229 of 5670854, or 0.004% doneTokenizing sequence: 230 of 5670854, or 0.004% doneTokenizing sequence: 231 of 5670854, or 0.0041% doneTokenizing sequence: 232 of 5670854, or 0.0041% doneTokenizing sequence: 233 of 5670854, or 0.0041% doneTokenizing sequence: 234 of 5670854, or 0.0041% doneTokenizing sequence: 235 of 5670854, or 0.0041% doneTokenizing sequence: 236 of 5670854, or 0.0041% doneTokenizing sequence: 237 of 5670854, or 0.0042% doneTokenizing sequence: 238 of 5670854, or 0.0042% don

Tokenizing sequence: 5220826 of 5670854, or 92.0642% done

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



Tokenizing sequence: 5366477 of 5670854, or 94.6326% doneTokenizing sequence: 5366478 of 5670854, or 94.6326% doneTokenizing sequence: 5366479 of 5670854, or 94.6326% doneTokenizing sequence: 5366480 of 5670854, or 94.6326% doneTokenizing sequence: 5366481 of 5670854, or 94.6327% doneTokenizing sequence: 5366482 of 5670854, or 94.6327% doneTokenizing sequence: 5366483 of 5670854, or 94.6327% doneTokenizing sequence: 5366484 of 5670854, or 94.6327% doneTokenizing sequence: 5366485 of 5670854, or 94.6327% doneTokenizing sequence: 5366486 of 5670854, or 94.6327% doneTokenizing sequence: 5366487 of 5670854, or 94.6328% doneTokenizing sequence: 5366488 of 5670854, or 94.6328% doneTokenizing sequence: 5366489 of 5670854, or 94.6328% doneTokenizing sequence: 5366490 of 5670854, or 94.6328% doneTokenizing sequence: 5366491 of 5670854, or 94.6328% doneTokenizing sequence: 5366492 of 5670854, or 94.6329% doneTokenizing sequence: 5366493 of 5670854, or 94.6329% doneTokenizing se

Tokenizing sequence: 5367141 of 5670854, or 94.6443% doneTokenizing sequence: 5367142 of 5670854, or 94.6443% doneTokenizing sequence: 5367143 of 5670854, or 94.6443% doneTokenizing sequence: 5367144 of 5670854, or 94.6444% doneTokenizing sequence: 5367145 of 5670854, or 94.6444% doneTokenizing sequence: 5367146 of 5670854, or 94.6444% doneTokenizing sequence: 5367147 of 5670854, or 94.6444% doneTokenizing sequence: 5367148 of 5670854, or 94.6444% doneTokenizing sequence: 5367149 of 5670854, or 94.6444% doneTokenizing sequence: 5367150 of 5670854, or 94.6445% doneTokenizing sequence: 5367151 of 5670854, or 94.6445% doneTokenizing sequence: 5367152 of 5670854, or 94.6445% doneTokenizing sequence: 5367153 of 5670854, or 94.6445% doneTokenizing sequence: 5367154 of 5670854, or 94.6445% doneTokenizing sequence: 5367155 of 5670854, or 94.6445% doneTokenizing sequence: 5367156 of 5670854, or 94.6446% doneTokenizing sequence: 5367157 of 5670854, or 94.6446% doneTokenizing se

Tokenizing sequence: 5367416 of 5670854, or 94.6491% doneTokenizing sequence: 5367417 of 5670854, or 94.6492% doneTokenizing sequence: 5367418 of 5670854, or 94.6492% doneTokenizing sequence: 5367419 of 5670854, or 94.6492% doneTokenizing sequence: 5367420 of 5670854, or 94.6492% doneTokenizing sequence: 5367421 of 5670854, or 94.6492% doneTokenizing sequence: 5367422 of 5670854, or 94.6493% doneTokenizing sequence: 5367423 of 5670854, or 94.6493% doneTokenizing sequence: 5367424 of 5670854, or 94.6493% doneTokenizing sequence: 5367425 of 5670854, or 94.6493% doneTokenizing sequence: 5367426 of 5670854, or 94.6493% doneTokenizing sequence: 5367427 of 5670854, or 94.6493% doneTokenizing sequence: 5367428 of 5670854, or 94.6494% doneTokenizing sequence: 5367429 of 5670854, or 94.6494% doneTokenizing sequence: 5367430 of 5670854, or 94.6494% doneTokenizing sequence: 5367431 of 5670854, or 94.6494% doneTokenizing sequence: 5367432 of 5670854, or 94.6494% doneTokenizing se

Tokenizing sequence: 5367564 of 5670854, or 94.6518% doneTokenizing sequence: 5367565 of 5670854, or 94.6518% doneTokenizing sequence: 5367566 of 5670854, or 94.6518% doneTokenizing sequence: 5367567 of 5670854, or 94.6518% doneTokenizing sequence: 5367568 of 5670854, or 94.6518% doneTokenizing sequence: 5367569 of 5670854, or 94.6518% doneTokenizing sequence: 5367570 of 5670854, or 94.6519% doneTokenizing sequence: 5367571 of 5670854, or 94.6519% doneTokenizing sequence: 5367572 of 5670854, or 94.6519% doneTokenizing sequence: 5367573 of 5670854, or 94.6519% doneTokenizing sequence: 5367574 of 5670854, or 94.6519% doneTokenizing sequence: 5367575 of 5670854, or 94.652% doneTokenizing sequence: 5367576 of 5670854, or 94.652% doneTokenizing sequence: 5367577 of 5670854, or 94.652% doneTokenizing sequence: 5367578 of 5670854, or 94.652% doneTokenizing sequence: 5367579 of 5670854, or 94.652% doneTokenizing sequence: 5367580 of 5670854, or 94.652% doneTokenizing sequence

Tokenizing sequence: 5670854 of 5670854, or 100.0% donene

In [11]:
desert_df = pd.DataFrame(vectorized_sample)
desert_df.to_csv("desert_sample.csv")

In [12]:
len(desert_df)

2663