## Goal of the notebook: Identifying deleted regions

#### Approach:

I go through the output files produced by running breseq on the WGS data. Any region of missing coverage spanning at least 1 kb is treated as a deletion: I extract its genomic coordinates and find out which genes and pseudogenes lie within it.

#### Output: 

Two text files written with NumPy (one for genes, one for pseudogenes) indicating which of them are lost in each population analysed in this experiment

In [1]:
import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt
import pathlib
import os

In [2]:
#current working directory
cwd = os.getcwd()
print(cwd)

/Users/anuraglimdi/Desktop/TnSeq_Paper/LTEE-TnSeq_Paper/Analysis/Part_2_WGS_analysis


In [3]:
#use pathlib.Path to get the parent directories; the goal is to navigate to the directories
#holding the metadata and the breseq output data
path = pathlib.Path(cwd)
print(path.parents[1]) #this should be the base directory of the GitHub repository; the exact path will differ
#for each user

/Users/anuraglimdi/Desktop/TnSeq_Paper/LTEE-TnSeq_Paper


In [4]:
metadata_path = str(path.parents[1])+'/Metadata/'
data_path = str(path.parents[1])+'/Data/WGS_Data/Breseq_output/'
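
The same paths can also be assembled with pathlib's `/` operator, which removes the need for manual separators. This is only a minimal sketch: `repo_root` is a hypothetical stand-in for `path.parents[1]`, and the names `metadata_dir`, `breseq_dir` and `gd_file` are illustrative and not used elsewhere in this notebook.

```python
import pathlib

# Hypothetical stand-in for path.parents[1]; substitute your own repository root.
repo_root = pathlib.Path("/path/to/LTEE-TnSeq_Paper")

# The / operator joins path components with the correct separator for the OS.
metadata_dir = repo_root / "Metadata"
breseq_dir = repo_root / "Data" / "WGS_Data" / "Breseq_output"

# Individual files are addressed the same way, so no trailing slash is needed.
gd_file = breseq_dir / "REL606_output.gd"
print(gd_file.name)
```

Path objects can be passed directly to `open`, `pd.read_csv` and `np.loadtxt`, so converting back to strings is optional.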

In [5]:
libraries = ['REL606', 'REL607', 'REL11330', 'REL11333', 'REL11364', 'REL11336', 'REL11339', 'REL11389', 'REL11392', 'REL11342', 'REL11345', 'REL11348', 'REL11367', 'REL11370']

#input directory with all the coverage data
directory = data_path

In [6]:
#reading the tab-separated metadata table into a pandas DataFrame
all_data = pd.read_csv(metadata_path+"all_metadata_REL606.txt", sep="\t")

In [7]:
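#column 0 holds the gene names
names = all_data.iloc[:,0]

#columns 3 and 4 hold the start and end coordinates of each gene
gene_start = all_data.iloc[:,3]
gene_end = all_data.iloc[:,4]

#one row per gene: [start, end]
locations = np.transpose(np.vstack([gene_start,gene_end]))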
# locations = np.loadtxt('/Users/anuraglimdi/Desktop/TnSeq_LTEE/ReferenceGenome/pseudogenes_locations_REL606.txt')

In [21]:
#metadata_path already ends with a slash, so none is needed before the filename
locations_pseudogenes = np.loadtxt(metadata_path+'pseudogenes_locations_REL606.txt')

In [33]:
exclude_genes = np.zeros([len(libraries), len(names)])
exclude_pseudogenes = np.zeros([len(libraries), locations_pseudogenes.shape[0]])

In [34]:
locations_pseudogenes.shape[0]

134

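## Notation for the output array

A 1 in the exclude_genes array indicates that the gene is missing in that genetic background and should be excluded from downstream analysis. Rows correspond to libraries and columns to genes; exclude_pseudogenes follows the same convention.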
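
To make the convention concrete, here is a toy sketch with two libraries and four genes. All names and values below (`toy_libraries`, `toy_names`, `toy_exclude`) are made up for illustration and do not come from the real data.

```python
import numpy as np

# Toy data (hypothetical): 2 libraries x 4 genes, same layout as exclude_genes.
toy_libraries = ["anc", "evo"]
toy_names = np.array(["geneA", "geneB", "geneC", "geneD"])
toy_exclude = np.zeros([len(toy_libraries), len(toy_names)])

# Mark geneB and geneD as deleted in the second library.
toy_exclude[1, [1, 3]] = 1

# Rows index libraries, columns index genes; 1 means "exclude downstream".
for row, lib in zip(toy_exclude, toy_libraries):
    lost = toy_names[row == 1]
    print(lib, int(row.sum()), list(lost))
```

Indexing the name array with `row == 1` is one way to turn a row of the mask back into a list of gene names.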

### First: identify all genes to be excluded

In [35]:
for k in range(0, len(libraries)):
    filename = directory+libraries[k]+'_output.gd'
    with open(filename) as in_handle:
        data = in_handle.read().splitlines()
    start = []
    end = []
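    #extract the missing-coverage (MC) entries; match on the first tab-separated field so that
    #lines merely containing the letters 'MC' elsewhere are not picked up
    mc = [line for line in data if line.split('\t')[0] == 'MC']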
    for entry in mc:
        line = entry.split('\t')
        span = int(line[5]) - int(line[4])
        if span >= 1000:
            start.append(line[4])
            end.append(line[5])
    #now that we have the start and end coordinates, find out which genes lie inside these
    #deleted regions
    #first, convert start and end to numpy arrays
    start = np.array(start)
    end = np.array(end)
    #removing any duplicates and sorting the starts and ends; the coordinates are still strings here,
    #so np.unique orders them lexicographically (visible in the printed output below), and
    #unique_start[j] stays paired with unique_end[j] only if both arrays sort into the same order
    unique_start = np.unique(start).astype('int')
    unique_end = np.unique(end).astype('int')
    #now find out which genes lie in the deleted regions and exclude them from the analysis
    for i in range(0, locations.shape[0]):
        #if either the start or end of the gene falls in the deleted regions, exclude gene from analysis
        for j in range(0, len(unique_end)):
            if (locations[i, 0] > unique_start[j] and locations[i, 0] < unique_end[j]) or (locations[i, 1] > unique_start[j] and locations[i, 1] < unique_end[j]):
                exclude_genes[k, i] = 1
    print(libraries[k])
    print(sum(exclude_genes[k,:]))
    print(unique_start)
    print(unique_end)
    

REL606
0.0
[]
[]
REL607
0.0
[]
[]
REL11330
43.0
[1269579 1607969 2031699 3893550  546979]
[1270657 1616678 2054931 3901928  555923]
REL11333
103.0
[1600434 1607979 2032390 2100308 2123623 3893620 4146323 4547207  495498
  589607]
[1602329 1616687 2055657 2122453 2143763 3899898 4151504 4551411  499062
  619816]
REL11364
271.0
[1451973 1607969 1729055 2032660 2086597 2877316 3015775 3192748 3549960
 3679141 3893601 4289283 4446817 4521618 4565815  546981  572833  588545]
[1462317 1616661 1731500 2055947 2122425 2910191 3035120 3198581 3553484
 3680779 3901455 4333227 4451689 4561282 4588155  572549  588494  619836]
REL11336
147.0
[1607959 2032119 2125712  227115 2647595 2882770 3023968 3351608 3697046
 3893617 3903505 4015657 4146298 4187474 4521606  546972]
[1616682 2055373 2143775  231866 2652672 2883913 3063025 3354229 3699379
 3901404 3908686 4019107 4148275 4192780 4537722  619825]
REL11339
63.0
[1607984 2032788 2100305 3893617  546995]
[1616669 2056031 2122445 3900622  555904]
REL
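
The loop above sorts the start and end coordinates independently. A possible alternative, sketched here under assumptions, keeps each start and end together as an integer pair so the pairing cannot drift. The helper `read_large_deletions` and the `example` lines are hypothetical; the lines are made up and only mimic the column positions the loop reads (type code in field 0, start in field 4, end in field 5).

```python
import numpy as np

def read_large_deletions(gd_lines, min_span=1000):
    """Return an (n, 2) integer array of [start, end] for MC entries >= min_span."""
    intervals = set()
    for line in gd_lines:
        fields = line.split("\t")
        # Only keep entries whose type code (first field) is MC.
        if fields[0] != "MC":
            continue
        start, end = int(fields[4]), int(fields[5])
        if end - start >= min_span:
            intervals.add((start, end))  # the set drops exact duplicates
    # Sorting tuples keeps each start with its own end, ordered numerically.
    return np.array(sorted(intervals), dtype=int).reshape(-1, 2)

# Made-up lines in the same column layout that the loop above reads.
example = [
    "#=GENOME_DIFF\t1.0",
    "MC\t10\t.\tREL606\t546979\t555923\t0\t0",
    "MC\t11\t.\tREL606\t1000\t1500\t0\t0",
    "MC\t12\t.\tREL606\t1269579\t1270657\t0\t0",
    "RA\t13\t.\tREL606\t42\t0\tA\tG",
]
print(read_large_deletions(example))
```

With integer pairs the intervals come out in genomic order, unlike the lexicographic order seen in the printed output above, and an input without qualifying entries yields an empty array of shape (0, 2).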

### Next: identify all pseudogenes to be excluded

In [36]:
for k in range(0, len(libraries)):
    filename = directory+libraries[k]+'_output.gd'
    with open(filename) as in_handle:
        data = in_handle.read().splitlines()
    start = []
    end = []
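    #extract the missing-coverage (MC) entries; match on the first tab-separated field so that
    #lines merely containing the letters 'MC' elsewhere are not picked up
    mc = [line for line in data if line.split('\t')[0] == 'MC']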
    for entry in mc:
        line = entry.split('\t')
        span = int(line[5]) - int(line[4])
        if span >= 1000:
            start.append(line[4])
            end.append(line[5])
    #now that we have the start and end coordinates, find out which pseudogenes lie inside these
    #deleted regions
    #first, convert start and end to numpy arrays
    start = np.array(start)
    end = np.array(end)
    #removing any duplicates and sorting the starts and ends; the coordinates are still strings here,
    #so np.unique orders them lexicographically (visible in the printed output below), and
    #unique_start[j] stays paired with unique_end[j] only if both arrays sort into the same order
    unique_start = np.unique(start).astype('int')
    unique_end = np.unique(end).astype('int')
    #now find out which pseudogenes lie in the deleted regions and exclude them from the analysis
    for i in range(0, locations_pseudogenes.shape[0]):
        #if either the start or end of the pseudogene falls in the deleted regions, exclude it from analysis
        for j in range(0, len(unique_end)):
            if (locations_pseudogenes[i, 0] > unique_start[j] and locations_pseudogenes[i, 0] < unique_end[j]) or (locations_pseudogenes[i, 1] > unique_start[j] and locations_pseudogenes[i, 1] < unique_end[j]):
                exclude_pseudogenes[k, i] = 1
    print(libraries[k])
    print(sum(exclude_pseudogenes[k, :]))
    print(unique_start)
    print(unique_end)

REL606
0.0
[]
[]
REL607
0.0
[]
[]
REL11330
4.0
[1269579 1607969 2031699 3893550  546979]
[1270657 1616678 2054931 3901928  555923]
REL11333
11.0
[1600434 1607979 2032390 2100308 2123623 3893620 4146323 4547207  495498
  589607]
[1602329 1616687 2055657 2122453 2143763 3899898 4151504 4551411  499062
  619816]
REL11364
21.0
[1451973 1607969 1729055 2032660 2086597 2877316 3015775 3192748 3549960
 3679141 3893601 4289283 4446817 4521618 4565815  546981  572833  588545]
[1462317 1616661 1731500 2055947 2122425 2910191 3035120 3198581 3553484
 3680779 3901455 4333227 4451689 4561282 4588155  572549  588494  619836]
REL11336
13.0
[1607959 2032119 2125712  227115 2647595 2882770 3023968 3351608 3697046
 3893617 3903505 4015657 4146298 4187474 4521606  546972]
[1616682 2055373 2143775  231866 2652672 2883913 3063025 3354229 3699379
 3901404 3908686 4019107 4148275 4192780 4537722  619825]
REL11339
8.0
[1607984 2032788 2100305 3893617  546995]
[1616669 2056031 2122445 3900622  555904]
REL11389
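
The gene and pseudogene loops apply the same test, so it could be factored into one helper. The sketch below is a hypothetical vectorized version (`flag_features` is not part of this notebook) that reproduces the criterion used above: a feature is flagged when its start or its end lies strictly inside a deleted region. The toy coordinates are made up.

```python
import numpy as np

def flag_features(feature_locs, deletions):
    """Return a 0/1 vector: 1 if a feature's start or end lies strictly inside any deletion."""
    feature_locs = np.asarray(feature_locs)
    deletions = np.asarray(deletions).reshape(-1, 2)
    del_start, del_end = deletions[:, 0], deletions[:, 1]
    # Broadcast features (rows) against deletions (columns).
    start_inside = (feature_locs[:, [0]] > del_start) & (feature_locs[:, [0]] < del_end)
    end_inside = (feature_locs[:, [1]] > del_start) & (feature_locs[:, [1]] < del_end)
    return (start_inside | end_inside).any(axis=1).astype(float)

# Toy coordinates (hypothetical), with one deletion spanning 1000-3000.
toy_deletions = np.array([[1000, 3000]])
toy_features = np.array([
    [100, 900],     # entirely upstream
    [900, 1200],    # end falls inside the deletion
    [1500, 2500],   # entirely inside
    [2900, 3500],   # start falls inside
    [500, 4000],    # spans the whole deletion; neither endpoint is inside
])
print(flag_features(toy_features, toy_deletions))
```

Note that the last toy feature is not flagged, because neither of its endpoints lies inside the deletion. The loops above behave the same way, so a feature that fully contains a deletion is kept under this criterion.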

### Saving the output data

There is no need to write the output to file, since it already exists in the GitHub repository. If you modify the analysis above and want to overwrite the existing files, uncomment the code in the following cells.

Alternatively, save the output of the modified analysis to a path of your choice.

In [14]:
#saving the list of deleted pseudogenes
# np.savetxt("excluded_pseudogenes_REL606_k12annotated.txt", exclude_pseudogenes)

In [23]:
#saving the list of deleted genes
#np.savetxt("excluded_genes_REL606_k12annotated.txt", exclude_genes)
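
If you do save a modified analysis elsewhere, a round trip like the following can confirm the file reads back as expected. This is a sketch with made-up values: `toy_exclude`, `out_dir` and the filename `excluded_genes_example.txt` are all hypothetical, and a temporary directory stands in for a path of your choice.

```python
import pathlib
import tempfile
import numpy as np

# Toy 0/1 matrix standing in for exclude_genes (hypothetical values).
toy_exclude = np.array([[0.0, 1.0, 0.0], [1.0, 1.0, 0.0]])

# Write to a temporary directory here; replace with a path of your choice.
out_dir = pathlib.Path(tempfile.mkdtemp())
out_file = out_dir / "excluded_genes_example.txt"

# fmt="%d" stores the flags as integers instead of the default float notation.
np.savetxt(out_file, toy_exclude, fmt="%d")

# Reload and confirm that the round trip preserved the matrix.
reloaded = np.loadtxt(out_file)
print(np.array_equal(reloaded, toy_exclude))
```

The commented-out calls above use the default format of `np.savetxt`; the `fmt="%d"` argument here is an optional choice that keeps the files compact.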