

All Asgard genomes were downloaded and stored in the directory "ncbi-genomes-2021-03-21". They were then concatenated into a single GenBank flat file (gbff) called "asgard_assemblies.gbff" with a simple bash command: cat *.gbff > asgard_assemblies.gbff
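
The same concatenation can also be done from Python, which may be convenient when the notebook is rerun on another machine. The following is only a sketch: the helper name `concatenate_gbff` and its parameters are my own and are not used anywhere else in this notebook. It skips the output file itself, so rerunning it should not append the combined file to its own contents.

```python
import glob
import os
import shutil


def concatenate_gbff(directory, output_name="asgard_assemblies.gbff"):
    """Concatenate every .gbff file in `directory` into `output_name`.

    Hypothetical helper, not part of the original analysis.
    Returns the list of input files that were concatenated, in order.
    """
    output_path = os.path.join(directory, output_name)
    # Sort for a reproducible order, and leave out the output file itself.
    inputs = [
        path
        for path in sorted(glob.glob(os.path.join(directory, "*.gbff")))
        if os.path.abspath(path) != os.path.abspath(output_path)
    ]
    with open(output_path, "wb") as combined:
        for path in inputs:
            with open(path, "rb") as single:
                shutil.copyfileobj(single, combined)  # Stream the file; no need to load it in memory.
    return inputs
```

For example, `concatenate_gbff("ncbi-genomes-2021-03-21")` would write the combined file into the download directory.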

In [1]:
from Bio import SeqIO # To parse the Genbank files (gbff) with SeqIO.parse().
import re
import sys

In [2]:
with open("crispr_hits_protein_id.txt", "r") as hits_file:
    # One protein ID per line. strip() removes the trailing \n, and blank lines are skipped.
    # A set makes the membership tests in the loops below fast.
    unique_hits_protein_id = {line.strip() for line in hits_file if line.strip()}

**Examples of Genbank file parsing**

In [3]:
# Example of how to parse a Genbank file.

# gb_file = "GCA_001940645.1_ASM194064v1_genomic.gbff"
# for gb_record in SeqIO.parse(open(gb_file, "r"), "genbank"):
#     print ("Name {name}, features {features} ".format(name = gb_record.name, features = len(gb_record.features)))
#     print (gb_record.seq)   

In [17]:
# Example of parsing features in Genbank records.

gb_file = "ncbi-genomes-2021-03-21/GCA_001940645.1_ASM194064v1_genomic.gbff"
gb_records = SeqIO.parse(open(gb_file,"r"), "genbank")
gb_records_list = list(gb_records)
first_record = gb_records_list[0]
first_record_features = first_record.features
first_record_features[4].qualifiers["protein_id"]
print(first_record_features[2])

type: CDS
location: [<0:289](-)
qualifiers:
    Key: codon_start, Value: ['1']
    Key: inference, Value: ['ab initio prediction:Prodigal:2.6']
    Key: locus_tag, Value: ['HeimC3_00010']
    Key: note, Value: ['HeimC3_00010 c1_1; verified contig']
    Key: product, Value: ['hypothetical protein']
    Key: protein_id, Value: ['OLS28042.1']
    Key: transl_table, Value: ['11']
    Key: translation, Value: ['MTVTTYLEKQNQNPITKIPLTENLVLQLSHEDLNCPFCDIIPVAVTTWGSYWTRAGEIRRYYCYHCKKAFNPAKVPYVYERMSKVIFELGKAVIKN']



In [24]:
# Examples to parse features and obtain sequences.
print(first_record.seq[:289].reverse_complement().translate())
print(first_record_features[4].extract(first_record.seq))
print(first_record.seq[20:200])
print(first_record.seq[20:200].complement())
print(first_record.seq[20:200].reverse_complement())
print(first_record_features[4].qualifiers["protein_id"])
print(first_record_features[4].qualifiers["product"])
print(first_record_features[4].location)
print(first_record_features[4].location.strand) # Result: 1 if it's the + strand, -1 if it's the - strand.
print(first_record_features[4].location.start)

MTVTTYLEKQNQNPITKIPLTENLVLQLSHEDLNCPFCDIIPVAVTTWGSYWTRAGEIRRYYCYHCKKAFNPAKVPYVYERMSKVIFELGKAVIKN
ATGACTATTTTATTCTATGATTCTAAATCTAATAATTCTATTAATGCAAGTTATTTTGGACTATCTTTAGGTCTTTCAATATTGTTATTAAACAACTACTTGAATAACATTCCTATCAATGAATCGCCTATTGATATTGCGACAAAATTATTCTCGATAACATTCTTATCATCAATTTTGTCAAGTTTCTTCATAATAGTTAGATTAGACAAATTTCTTATACCGAAATTTGTAAAATATTTAAAAAAATTTGAAAATAAAGAATGGATGAGCTCTAATGAAGTTTTACTGACACCATATTTAAAGGAATTTTATAACACAATCACTTCAACTATCATAATAGTGATTCTTTTTATAATTCTAACTTTAGCGAATTTCGTATCAGAAGATTCTTTATTTTTTGTAGAAAAAGAAATTACTTTAGCTATTTTGACTATAGCAATATTTTTATTACTGATAGCTCTTTTTTACCGAATAAAGAATGATAATTATAAATTAGATTACGTTGCGATCTTCTATCGACTAAAAAAAGATCCACACATTGATAAAGACCATAAGGACGATGATGAGTCCGAAAGTCTCGGTAACAGATTAATCGAAATTAATTTTGATTTAATGAACAACAACTGGTCTGGTGTAAAGAATAATATAGAAATAATAAAAAGTAAGGCTAATAGTTTATTTTCTGATTGTTTTTTCCAATCAGATAGAAAAGCTTTTTATATTTTAGAATTTGAAAAATTGAAGAATATATCAGAACAAGATGACATACTTTTAAAGGAATATTTTATAACAATTAACAATCCATTAAATGGAAAAGTAAATGGATCTTTGAATACAAGATATCATCATCTAAAGATCCATGGACTTTTTTATGTTTATGAAGAGGGTTCTGATTTATTA



# Printing features that are in the region around each CRISPR-hit

For each CRISPR hit, the features around it (usually gene products) are printed. Up to 7 features are printed per hit:
the hit itself, up to 3 features on the 5' side, and up to 3 features on the 3' side. Fewer are printed when the hit lies near the end of a contig.

First, I create a full-feature version, which prints all the information in each feature: protein ID, protein sequence, product, locus tag, etc. This version is meant to be human-readable, for manually exploring the features around a particular hit.

Then, I create a product version. The "product" qualifier of a feature holds the automatically predicted function of that gene product. Examples are "putative ski2-type helicase" or "CRISPR-associated protein, Csh2 family". In the product version of the hit neighborhood, I print only the product of each feature around each hit.
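
The neighbor selection can be illustrated on its own, without any GenBank file. The sketch below is an assumption-laden stand-in: `neighbor_indices` is a hypothetical helper, and a plain list of feature type strings replaces `record.features`. It shows how inspecting 7 feature slots on each side yields about 3 CDS neighbors, because every protein contributes both a gene feature and a CDS feature, and how the window is clamped at the ends of a record.

```python
SKIPPED_TYPES = {"gene", "assembly_gap", "source"}


def neighbor_indices(feature_types, hit_index, span=7):
    """Return (upstream, downstream) indices of the features next to a hit.

    `feature_types` is a list of feature type strings in record order,
    standing in for [f.type for f in record.features]. `span` is the number
    of feature slots inspected on each side of the hit.
    """
    first = max(0, hit_index - span)                      # Clamp at the start of the record.
    last = min(len(feature_types) - 1, hit_index + span)  # Clamp at the end of the record.
    upstream = [i for i in range(first, hit_index)
                if feature_types[i] not in SKIPPED_TYPES]
    downstream = [i for i in range(hit_index + 1, last + 1)
                  if feature_types[i] not in SKIPPED_TYPES]
    return upstream, downstream


# A toy contig: one source feature followed by five gene/CDS pairs.
types = ["source"] + ["gene", "CDS"] * 5
upstream, downstream = neighbor_indices(types, hit_index=6)
print(upstream, downstream)  # [2, 4] [8, 10]
```

In this toy contig the hit at index 6 has only two CDS features on each side, so two are returned per side instead of three.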

In [None]:
# Full-feature neighborhood.

skipped_types = ["gene", "assembly_gap", "source"]
counter = 0

with open("ncbi-genomes-2021-03-21/asgard_assemblies.gbff", "r") as gbff, \
        open("neighborhood_hits_full.txt", "w") as neighborhood_hits:
    for record in SeqIO.parse(gbff, "genbank"):
        n_features = len(record.features)
        for index, feature in enumerate(record.features):
            # .get() avoids a KeyError for CDS features that have no protein_id qualifier.
            if feature.type != "CDS" or feature.qualifiers.get("protein_id", [None])[0] not in unique_hits_protein_id:
                continue
            counter += 1
            # print(..., file = ...) writes to the file without redirecting sys.stdout.
            print("Hit Number: ", counter, "Feature index: ", index, "Number of features: ", n_features, file = neighborhood_hits)

            # Every protein has a gene feature and a CDS feature. Therefore, a span of 7 features
            # on each side of the hit covers roughly the 3 neighboring proteins.
            # Gene features are skipped because each one duplicates its CDS feature;
            # assembly gaps and the sample source are skipped because they are not of interest.
            print("Neighbors 5'", file = neighborhood_hits)
            for i in range(max(0, index - 7), index): # Fewer than 7 if the hit is near the start of the record.
                if record.features[i].type not in skipped_types:
                    print(record.features[i], file = neighborhood_hits)
            print("HIT", file = neighborhood_hits)
            print(feature, file = neighborhood_hits)
            print("Neighbors 3'", file = neighborhood_hits)
            # The upper bound is clamped to the number of features, so the index never runs past the end of the record.
            for i in range(index + 1, min(n_features, index + 8)):
                if record.features[i].type not in skipped_types:
                    print(record.features[i], file = neighborhood_hits)

In [40]:
# Product version of hit neighborhood.

skipped_types = ["gene", "assembly_gap", "source"]

with open("ncbi-genomes-2021-03-21/asgard_assemblies.gbff", "r") as gbff, \
        open("neighborhood_hits_products.txt", "w") as neighborhood_hits:
    for record in SeqIO.parse(gbff, "genbank"):
        n_features = len(record.features)
        for index, feature in enumerate(record.features):
            # .get() avoids a KeyError for CDS features that have no protein_id qualifier.
            if feature.type != "CDS" or feature.qualifiers.get("protein_id", [None])[0] not in unique_hits_protein_id:
                continue
            # The hit and up to 7 features on each side, in record order, clamped to the ends of the record.
            for i in range(max(0, index - 7), min(n_features, index + 8)):
                neighbor = record.features[i]
                # Skip uninformative feature types and features without a product qualifier.
                if neighbor.type not in skipped_types and "product" in neighbor.qualifiers:
                    neighborhood_hits.write(neighbor.qualifiers["product"][0] + "\n")


In [38]:
# Annotated product version: same as above, with a "New hit" line marking the start of each neighborhood.

skipped_types = ["gene", "assembly_gap", "source"]

with open("ncbi-genomes-2021-03-21/asgard_assemblies.gbff", "r") as gbff, \
        open("neighborhood_hits_products_annotated.txt", "w") as neighborhood_hits:
    for record in SeqIO.parse(gbff, "genbank"):
        n_features = len(record.features)
        for index, feature in enumerate(record.features):
            # .get() avoids a KeyError for CDS features that have no protein_id qualifier.
            if feature.type != "CDS" or feature.qualifiers.get("protein_id", [None])[0] not in unique_hits_protein_id:
                continue
            neighborhood_hits.write("New hit \n")
            # The hit and up to 7 features on each side, in record order, clamped to the ends of the record.
            for i in range(max(0, index - 7), min(n_features, index + 8)):
                neighbor = record.features[i]
                # Skip uninformative feature types and features without a product qualifier.
                if neighbor.type in skipped_types or "product" not in neighbor.qualifiers:
                    continue
                product = neighbor.qualifiers["product"][0]
                neighborhood_hits.write(product + "\n")
                if i < index:
                    print(product + "\n") # Echo the 5' neighbors to the notebook output.

hypothetical protein

hypothetical protein

hypothetical protein

hypothetical protein

hypothetical protein

hypothetical protein

Tryptophan synthase beta chain 1

Isoleucine--tRNA ligase

Sulfide dehydrogenase [flavocytochrome c] flavoprotein chain precursor

hypothetical protein

hypothetical protein

hypothetical protein

hypothetical protein

Photosystem I assembly protein Ycf3

hypothetical protein

2,3-bisphosphoglycerate-dependent phosphoglycerate mutase

Calcium-transporting ATPase 1

hypothetical protein

hypothetical protein

hypothetical protein

hypothetical protein

hypothetical protein

hypothetical protein

hypothetical protein

N-(5'-phosphoribosyl)anthranilate isomerase

Pyruvoyl-dependent arginine decarboxylase

hypothetical protein

hypothetical protein

Protoheme IX farnesyltransferase

Monomeric sarcosine oxidase

hypothetical protein

hypothetical protein

7, 8-dihydropterin-6-methyl-4-(beta-D-ribofuranosyl)- aminobenzene-5'-phosphate synthase

Exosome complex c

methylmalonyl-CoA carboxyltransferase

hypothetical protein

hypothetical protein

hypothetical protein

hypothetical protein

hypothetical protein

hypothetical protein

hypothetical protein

hypothetical protein

hypothetical protein

hypothetical protein

type I-A CRISPR-associated protein Cas4/Csa1

hypothetical protein

type I-A CRISPR-associated protein Cas4/Csa1

hypothetical protein

type I-A CRISPR-associated protein Cas4/Csa1

hypothetical protein

type I-D CRISPR-associated helicase Cas3'

hypothetical protein

type I-D CRISPR-associated helicase Cas3'

type I-D CRISPR-associated protein Cas5/Csc1

hypothetical protein

hypothetical protein

GTP-binding protein

electron transfer flavoprotein subunit beta

electron transfer flavoprotein subunit alpha

DUF1931 domain-containing protein

hypothetical protein

hypothetical protein

hypothetical protein

hypothetical protein

hypothetical protein

hypothetical protein

phenylalanine--tRNA ligase subunit alpha

peptide chain rele

hypothetical protein

TIGR00341 family protein

hypothetical protein

MarR family transcriptional regulator

MBL fold metallo-hydrolase

GNAT family N-acetyltransferase

TraB family protein

hydrogenase nickel incorporation protein HypB

hypothetical protein

hypothetical protein

hypothetical protein

GTP-binding protein EngB

aldehyde ferredoxin oxidoreductase

deoxyribose-phosphate aldolase

hypothetical protein

replication factor C small subunit

hypothetical protein

minichromosome maintenance protein MCM

hypothetical protein

hypothetical protein

hypothetical protein

anaerobic ribonucleoside-triphosphate reductase

glutamine--fructose-6-phosphate transaminase (isomerizing)

CBS domain-containing protein

rhodanese-like domain-containing protein

hypothetical protein

hypothetical protein

adenylosuccinate lyase

hypothetical protein

DegV family EDD domain-containing protein

ribose 5-phosphate isomerase A

tRNA-intron lyase

elongation factor 1-alpha

hypothetical protein

h

hypothetical protein

hypothetical protein

hypothetical protein

Replication factor C small subunit

tRNA pseudouridine(38-40) synthase TruA

hypothetical protein

bifunctional 5,10-methylene-tetrahydrofolate dehydrogenase/5,10-methylene-tetrahydrofolate cyclohydrolase

hypothetical protein

hypothetical protein

Replication factor C small subunit

hypothetical protein

hypothetical protein

hypothetical protein

hypothetical protein

hypothetical protein

hypothetical protein

hypothetical protein

hypothetical protein

hypothetical protein

hypothetical protein

hypothetical protein

hypothetical protein

hypothetical protein

hypothetical protein

GTP-binding protein

hypothetical protein

GTP-binding protein

hypothetical protein

hypothetical protein

hypothetical protein

hypothetical protein

ATP-dependent helicase HepA

N-6 DNA Methylase

Enamine/imine deaminase

hypothetical protein

fructose-1-P/6-phosphogluconate phosphatase

DNA-directed RNA polymerase subunit K

hypotheti

# Printing the DNA sequence around each hit

CRISPR arrays are expected to lie close to CRISPR-Cas systems, so the region around each hit was analyzed: the hit itself plus up to 10000 bp on each side (about 20000 bp of flanking sequence, less when the hit is near the end of a contig). These windows are stored in the dnaseq_hits.fasta file, with the GenBank protein ID of the hit as the header of each sequence.
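
The window arithmetic can be checked on a toy sequence before running it on the real assemblies. This is a minimal sketch under my own assumptions: plain Python strings stand in for Biopython `Seq` objects, and `flanking_window`, `complement` and `reverse_complement` are hypothetical helpers written only for this illustration. It also shows the difference between the complement (the opposite strand read 3'->5') and the reverse complement (the opposite strand read 5'->3').

```python
def flanking_window(sequence, start, end, flank=10000):
    """Return the hit [start:end) plus up to `flank` bases on each side."""
    window_start = max(0, start - flank)          # Do not run past the left end.
    window_end = min(len(sequence), end + flank)  # Do not run past the right end.
    return sequence[window_start:window_end]


COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(sequence):
    """Opposite strand, in the same left-to-right order (3'->5')."""
    return sequence.translate(COMPLEMENT)


def reverse_complement(sequence):
    """Opposite strand, read 5'->3'."""
    return complement(sequence)[::-1]


contig = "AACCGGTTAC"
window = flanking_window(contig, start=4, end=6, flank=3)
print(window)                      # ACCGGTTA
print(complement(window))          # TGGCCAAT
print(reverse_complement(window))  # TAACCGGT
```

With `flank=5` and a hit at positions 1 to 3, the left side of the window is cut short at position 0, which is the same clamping that applies to hits near the end of a contig.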

In [34]:
# Just printing sequences to test the code. Remove in the end.

with open("ncbi-genomes-2021-03-21/asgard_assemblies.gbff", "r") as gbff:
    for record in SeqIO.parse(gbff, "genbank"):
        for feature in record.features:
            # .get() avoids a KeyError for CDS features that have no protein_id qualifier.
            if feature.type != "CDS" or feature.qualifiers.get("protein_id", [None])[0] not in unique_hits_protein_id:
                continue
            start = int(feature.location.start)
            end = int(feature.location.end)
            print(">", feature.qualifiers["protein_id"][0])
            print(feature.location.strand)
            # The hit plus up to 10000 bp on each side, clamped to the ends of the record.
            window_10000 = record.seq[max(0, start - 10000): min(len(record.seq), end + 10000)]
            if feature.location.strand == 1:
                print(window_10000)
            elif feature.location.strand == -1:
                # complement() gives the opposite strand read 3'->5' (equivalent to reverse_complement()[::-1]).
                print(window_10000.complement())
            print(len(window_10000))
            print(feature.extract(record.seq))

> OLS28053.1
1
TATTCTTAATTACAGCTTTACCTAATTCAAAAATAACTTTACTCATCCGCTCATAAACATAAGGGACTTTAGCTGGATTAAAAGCTTTTTTACAATGATAACAATAATAACGACGGATTTCACCTGCTCGTGTCCAATACGAACCCCAAGTAGTTACAGCAACAGGTATAATATCACAAAAAGGACAGTTTAAATCTTCATGTGAGAGCTGAAGAACTAAATTTTCTGTTAATGGAATTTTAGTAATGGGATTTTGATTTTGCTTTTCTAAATAAGTAGTAACAGTCATAATGACTAAAAGAGCTAAGCAATGAGTATATAAATATTAGTATGACCAACGTATAGCAGAGGAATTATTATGTCAAAGTTTTAAAAAATTTAAGAACAAGATCCAACACTATATATGAATTCAAAAAAAGTTTAAGAATCCTACAATTTTCAGGTTATCTCTGTTTTTTGTTAAAAGAGATAAATGATATTTTTCAGAAGTCCAATATATTTTGTCTATCAATGAAAAATAGAGAATTAAAGGGATTTCTATGACTATTTTATTCTATGATTCTAAATCTAATAATTCTATTAATGCAAGTTATTTTGGACTATCTTTAGGTCTTTCAATATTGTTATTAAACAACTACTTGAATAACATTCCTATCAATGAATCGCCTATTGATATTGCGACAAAATTATTCTCGATAACATTCTTATCATCAATTTTGTCAAGTTTCTTCATAATAGTTAGATTAGACAAATTTCTTATACCGAAATTTGTAAAATATTTAAAAAAATTTGAAAATAAAGAATGGATGAGCTCTAATGAAGTTTTACTGACACCATATTTAAAGGAATTTTATAACACAATCACTTCAACTATCATAATAGTGATTCTTTTTATAATTCTAACTTTAGCGAATTTCGTATCAGAAGATTCTTTATTTTTTGTAGAAAAAGAAATTACTTTAGCTATTTTGACTATAGCAATATT

> OLS22936.1
1
GACTCATCTGACCCTAAAAAACCTTGTGAAAATGTAAGTATAAGAATTAGTACAACAAAAGTAATGATAGGTGTAGCTAATAACTCCCATTTATCAAATTTTTCTTTAAAAGAAGTCATACCAAAACGTATCCTGATTCATGGAGTATAATTTTCTATTTATTAACCAAATTTAAATAATCGTATAAATCACATTTTACTTGATTAATGGAAGAATTTGAGTGAATTTCTCTATGTACAATAAAATACAGAACTATTTTATGCTTTTCAGACCTTGGCAATATTATAAAAATGTCCTAATATTTTTTGGAATAGTTTTTTCAGGAAGTATGCTCGATTTCACATTATACATGCCTTTAGTACTCGGATTTATCTCGCTTTGCCTTATATCTTCAGTTAATTATATAATAAATGATTTAAGGGATAGAGAGCATGATAAACTTCACCCCGAGAAAAAAAATCGTCCAATCGCTTCAGGAAATGTATCAATACTTGAAGCTTATACTTTATTATTGATATTTGGTTTTTTAAGCATAATCATTGCATCTTTAATCCCCATGGAAACAGAAAACAGATATTATTTGATTGGGATTTTAGTCTTAATTTTTACAACAAGCCAGCTATATACTTACTATTTCAAGCATAAAGCTCTATTTGATGTAACTTTTATTGCATTAAATTATATTTGGAGAGCTATTGCGGGAGTCATTATCATTGATGTTTACTTGTCTCCTTGGCTTTTTGCTCTTGGATTCTTATTTGCTTTATTTCTAGCATTAGCTAAGAGAAGAGGAGATTTGATTCTACTAGGCGATGAAGCAAAAAAACATCGGAAGGTATTTGAAGAGTATAATCTTCCATTATTAGACCAATTCATAACAATAGTAATAGGATCCATGATTGTATCTTGGGCTATTTATATAATTGAAGCTCCTTTTCGTACCAATGTCCCTGTCACGTTCACTAATGAAAATCTAGCATTATTA

> OLS16665.1
1
TTGATACTGATTGCAACCAGCCGTTGAAAGATTTCTGCAATTTCTTCGAACCTTTCAGTCAGACTTTCTACCAATAATCCTCGTGATTCTAATCCCTCTTTCATCATAATCACTAGACTTGCTAATCGAGCATAAACTTCTGGATATGGCTCAACATATCCTTTTAATAATGGTGGAACACTTGTGATAGCACTGTAAGGTTGTATAATTTGTTGAATAATTTGTTGATGTTTCGTTATCTGTCATTTGTAAATTATGTGATTCTATTTGTAATTGGAATATTCTGTTTAGTATCATTGTTTCTTCCTCTGAAAAGATCTTTAATAAAGGAGACAAAATCATATAGAAAGAATATTTAGTATCAAGACTAATTGCGAGCTTCTTCAGGTAGAAGACCAAAATTTACAACAGATATTTCTGGAGAAATAACTCTAAATCCATTTTTCTAATCAATCATCTACTACAAATTTCGTAGAACGAGAATCTTTATAACATAAAGTTAGAGTTTAATTTAAGAATTATGAAAATTCTCATGTTTTTCTTATTCTTGATGTCTATCTTCCTTATCTATAATCCTCAATCACTAAAATCATATCAGGAGTCTGAATTCATACCATTTAATTTACAAGCTTGTGAAAATCAACTCCTAACATTATTTTTATCTGAAACTGATGAACAAATAACACTGGAGGTCCAATTAACACTAGTAAATAGTTCTTTCTATTTGAACTATTTTGAACCATTGGGTAACCAAAAATTCAATAAAAAATCAATAGAACTGAATTCCGGTGATTATTTCAATCGAACAATCATAACAAACAATGTTTATCTAATATTAGTAAATAATTCAAATTTTTTTGCTAATTACACAGCTGAGGGGTCTTATCGTGTCCATGATGACTTATACAAATATGACAGATCTTGGAAACCTTTTAATTTAACGATCTGTAAACCGACATCTATTAATATCTTTAAAACTGAAGGG

> OLS12690.1
1
GATTAATTTAACGGGTTATTACCTTTCAGAAAAAATATTAATTTTCCACTATTTGTTTTCATATGGGTTATTTTTCCAATATTCAAGCGCGGCAGTCGTCCTTTTACCGCCGCCTTTGGCAAGAATTGGGAGAAGTAAAAGTATACGACACGCATGAGCACCTCCCACCTGAAGAGTATTTGTGGCGAACCCCGGCAGGGAAAGATTTCGATCGTTTGCCTGTCTGGAAAGTATTTGAAACTTCATTCATCTATGGCCCTTGGCAAAATGATGGGGGATATACGAAATGGGCAGAAATTATTAACCGTCAGCGTGGTACAGGTTATCTTAAATCTCTCCTCTGGGCGTTTGAGGAACTATTTGATATGGAGCCTCCAATAACAGCAAATTACTTGGAAGAGCTTGAAGCGATCCTCAACCAGGCATATTCAGGTGACGCCCCAATAAACGACCGACTCCGACATGTCGTACAGGATCACATGCATGCTGAAACCGTGATTTTAAATCTCCCCTTTTTAGACCAACATCTAAAGCTCCCTCAACCAACGTTTCAAGCAGCGATTGGACTTTCTAGCCTTTGTTCGGGAGCGAAAGTCCCAAATAACACAGAAAATCTCGAAGTAAACATCCCATACTACTTTGCAGTCCGAAAGATGGGAAAAAATCTTGCAGACATTCGGACTTTAGACGACTATTTAGACATAATTGATCATCTGATCGAGTATGTGGGAAAAATGAAGTATATCTGCTCAAAGTTCAGATTTGCCTGTGATCGGTCACTTATTTTCCCCAAACCTGAAGAGGATATATCTATGATTCGTATCTTGTTTAATAAATCCAAAATTAACGATAGAGAATTAACACAATTCAGTAATTTTATTATGCACTATTTTCTCGATCAAAGTGTTAAAAAATGGAACCGCCCCGTCCAATTTCACACGGGTTTTGGGCGAATACCTGATGGTGGAAGCAATGGAGTTAACTT

> OLS17413.1
1
CCTGATGTTTCAGTCAAATAAATATACCTCCGGGGAAGAGGTTAATCCTATAAAGAATTTAATCAAACCTAATTAAAAATTTATCATAACAACTATGTATTTATTTAAGAACAGCTTATTAAACTTGAGCTTAGCCGTCGCGTGAAGTAATTTTTGAAGCTCGAGTATTTAGCTATTTTTTATAAGAGATCGTATCTTATTATTTAATAGATTTAAAACTTTTAGAGGGATTTATATTGCTATGTGAGATATGCGGGAAAAACGCGCCTAAAACATTTTATGTGGAATTAGACGGAGCGGTTTTGTCAGTTTGCGAAGAGTGTTCAAAATATGGTAATATAGTGGAAGAAAATGCTTTGCGAGAGGTTAAACCTAAGAGAAAAATAGAGGTTACTAAGAGTGAGGTGAAAGTTAAATCAGATGAGGTTGTGGAGAAAGTTCTCGACCTTAAACCGGATTATTATAAGATTATAAAAACGGAGAGAGAGAAAAGGAAGCTGACCCAGGAAGAGTTAGCTAAACTACTCAATGAGAAACCTTCGGTTATCAGCAAAATAGAGACCGGGCGGTTTGAACCTGAAGAGAATTTGATTCGAAAATTGGAAAAATTTTTTAATGTGAGATTAGTTGAGGTGACGGAGCTGCCGGCTTCTAAACCAGTTTTCAAAAGCGGCGAGTTAACGCTAGGTGAAGTTGTAAACCTTAAGAAGAAAAAATCATAGACGGATTTTACTGAAATCTTTTCAAAACTTTTTTACCTTTTTCCTGAGGTTCAACTACTAGCCTAGCATCGGGGAAAGATATATCAAGCTCTCTCCCCTCATAGTATTCGTGTAAGAAAATAGCTAATGCAGCAACCTCACTATGAGGTTGAGAGCCTATGCTTATATTATAATCCGCTAAATCATATACTTCTCTTGGAACCTTAGCCCCTCCGACAATTAACAAAATATCTTTAAAATTCTTTAACTCGGAGATTTTTTGC

> OLS29121.1
-1
GATAACATGTTCTTAAACGTCTTTTGTAGTACCTTCTATTTAAATAACCCTTATTTACATTTTGGACATCACGATTTATGAAGGGAGGGTCCGCTCGTACGCTCTTAAATGTGAGACTTTACCTTTTAAAACTTCAAAGGTGCCTACGTTGTAACTTTCGTAAGTGCTATTAAGTAAAACGTGGACTATTATACCGTTCATTTCGAGGCATGCAGTAGCGATAACCGCTTAACCTTCTTCCATAATCTCAAGATCGAGTAGAACAACCTTAAAGACGAGTCGGATTTCATCCTTACTTTTACTTTAATTTTAGAGTTTTTAATTGTCTTCTAGCTAGACAGATATTTAAATGAGGTCGCATCTTAATTTTGTTCTTTAAAAGATCAAAATTTCAAGCGATACTTTTGGGATGTTATATAGGTATAGTTCTTTCTTAATTAATTACAATGTAGTTAATAAAGTAGAATTAAGAGATGGCGGTGTAATAAAAGATTTAGACTTTAATCTCAGTAAAACAGATTTGTGGTAAAAGATTATGATGCTACCATTATTTATATAAAGAACTGGGAATTAACAGCGTATGGAAGTAGTTAAAGCCAAAACGATATTAATCCGTAATTCAGATAATAAAGTAAAAATCTTTCTCAGAAAAGTAGATCATTAACTTATTAAAAAAAGACATGCCGGGACCATTCGACGAATTGCTCAAGAGCTTGAGAAGGTGATATAAGGTTACCTTTAAGTACTTATAGTATTGAAAGGAGTCTTACTTTTCATACTACTTGAGCAAGTTTTTAGTATAAAACATTTAGTAAATTTTTGGTATGCCCACGTGGAAAATGAGACGGCTGAGGTTTACCAAAACGTTGAGGTTAATGTCTATAAAGATATGAAGAAAACCAATTAGAGTTTGGTATTATACGGTTTATATTACTTTAACCACAATTCCTTTTTTGCCGACGATCAAGTTTTAATAGTATGTAA

> OLS23386.1
1
GAAGCTTACAGGACGCATTAGAATTAACATCTAAGTATCTTAATCTTGCTTCGATATTTATCTATATAGTAGCAGGTGTAATAATCTTTGTTACTATGAATCGGTATGTAAACGAGCAAAAAAGAGTGATAGGTTCGTTATATGCATTTGGAGTCAAGAAAAAGGAAATTATTTATTCATTTTTTTTCAGAATACTAATATTAAGTTTAATTTCAAGTGTGTTTGGGATTTTACTCGGAAGATATCTTCTGAAATTGCTTGTTTCAGGATTAGGCAATAAATGGGGATTAATTTCGGATGAACCGATAATTTCAAACGAATCAGTGATCATTGCATTGGGAAGCTCAATTGTGATAGCATATAGTTTCACCTATTTAGCTCTCTGGAATTTAATTAAACTAACCCCTTATGAAGCAATGAGAGGAAAAACATCTGAACTAAAGAGTAGTGGATTTTTTTTTAATGCAGCAAACATTATTCCATTTAGATTATTCCGAGCTGCTGCTAAAAATCTGACACGGAATAGAACCCGTACTATATTAACCGTACTCGCCTTTTCAATGGCATTAACTTTTTCAGGATCTCTGATGTATACGCATGATAGTTTGGGTTCCACAGTAGATGATTATTATGATCGTCGATTAAATTTTGATCTTGAAATAACACTAGGTAATGATAACGTAAATAATCAAACAGTTATGCAAAAAATATTAAATTTAGATACTAATAATGATTTAAAACCGGATATCCGGTTTTATGAGCCTTCTTTAGTAACATTTACACCCTTTACAGAAACTCCCGATAAATTAGTTGTCCTATCTGCCCTCAAACGAGATACCCAGATGTTCGATTTCAGTAATAGTACCTTTTCTGAAGGAAGAATGTTCATGTTTAACTCAAGTGAAGTTGTTATTTCAAGATATGTTGCAGGCACACTTGGCTTGAAAATTGGTGAAACTTACGCATTGGACTTTTTAACTAGATC

> OLS32580.1
-1
CGATATGTATACAGGACGCACACCTTGTTCTTTCTTTCCTCTCTAAAGAAGACAAAGTCAACTTGAAAAACCTTCAAAAGGACCAGGATTTCTTAGGAATTTTGTTCGGAGTTAACTACATTAGTTTCATAGACCTAACAAAAACATTTTACTAGGACGTATTTGACGAGAAGCGTAGTAATCTGGATCTATCCTAAAACTTTATTTTTTAGATAGGGAAAATTCTATCTTAAACTTTTAGATAGACGTACAGAAGATTATCGAATCGTATTAGAAATGGAAGTAAACTATTGTCTCCTTATGGTAACTAAACGAATTAGGATTAACAAGAGAATAGGAGTAGAGACCAAAATAAGACCTTGAGAACTTAACGTGACTTCTACTTGTAGTTACATACTCATAGTAACTTAAGTCCTTCTTAGAGTTACAATACTTACTTTAGTTGAAAAAGATGAAAGAATTATACATCTTCACGATAACCAAGAGGCAAAAACTCTTAATAAAGTGGTATCAGACTTCCTTGTCTAGATCTTTTGTCTCTGGATTCAAATTTAGAGAAGTTTTCCATCAAAGATATTTCATAGTTATACCCTAGTAAAGAACCTTGTATACATTTGAGGTGTGAAAGGCGAACTTAGTCTAATAGTAGGAAAACCATTTATATGGGACTATAGTTTAGAAGATAGAGAAAACAACACTTAACTTGAAACAAGAGTAGAAGGTACTTGGCAGGACTAGATTGAGGGAATCTTCGCTAATTGAGAAACTATAAATGAAGATAGACTCACAAACTATTTGGACAATCCAAACGATAGACTAGCTAGCAGTTACCATCCAAACAACTATCATAGGATTTTTAATGTAGATTGGTACTAAGAGAGACTCAAAGTTTAAAAGACCAAACTTATGGAAATTGTACGAATCTGTGACCGTAGGTTTAATAGACTAAACTTTGACTAAAACTAACTTGTTAAAGTACTCTTA

KeyboardInterrupt: 

In [5]:
with open("ncbi-genomes-2021-03-21/asgard_assemblies.gbff", "r") as gbff, \
        open("dnaseq_hits.fasta", "w") as dnaseq_hits:
    for record in SeqIO.parse(gbff, "genbank"):
        for feature in record.features:
            # .get() avoids a KeyError for CDS features that have no protein_id qualifier.
            if feature.type != "CDS" or feature.qualifiers.get("protein_id", [None])[0] not in unique_hits_protein_id:
                continue
            start = int(feature.location.start)
            end = int(feature.location.end)
            # window_10000 is the hit plus up to 10000 bp on each side, clamped to the ends of the record.
            window_10000 = record.seq[max(0, start - 10000): min(len(record.seq), end + 10000)]
            if feature.location.strand == -1:
                # complement() gives the opposite strand read 3'->5' (equivalent to reverse_complement()[::-1]).
                # Use reverse_complement() instead if the window should read 5'->3' along the strand of the hit.
                window_10000 = window_10000.complement()
            dnaseq_hits.write(">" + feature.qualifiers["protein_id"][0] + "\n")
            dnaseq_hits.write(str(window_10000) + "\n")

# Diagrams

In [72]:
from Bio.Graphics import GenomeDiagram
from Bio.SeqFeature import SeqFeature, FeatureLocation
import json

In [24]:
random_assembly = SeqIO.parse(open("ncbi-genomes-2021-03-21/GCA_001940645.1_ASM194064v1_genomic.gbff", "r"), "genbank")
for record in random_assembly:
    print(record)

ID: MDVS01000001.1
Name: MDVS01000001
Description: Candidatus Heimdallarchaeota archaeon LC_3 HeimC3_contig000001, whole genome shotgun sequence
Database cross-references: BioProject:PRJNA319486, BioSample:SAMN04924817
Number of features: 225
/molecule_type=DNA
/topology=linear
/data_file_division=ENV
/date=10-JAN-2017
/accessions=['MDVS01000001', 'MDVS01000000']
/sequence_version=1
/keywords=['WGS']
/source=Candidatus Heimdallarchaeota archaeon LC_3 (marine sediment metagenome)
/organism=Candidatus Heimdallarchaeota archaeon LC_3
/taxonomy=['Archaea', 'Asgard group', 'Candidatus Heimdallarchaeota']
/references=[Reference(title='Asgard archaea illuminate the origin of eukaryotic cellular complexity', ...), Reference(title='Direct Submission', ...)]
/comment=Assembly extracted from an MDA-amplified Loki's Castle sample in
two rounds of binning: PhymmBL and ESOM. Assemblies were further
cleaned using unamplified coverage, and collapsing of overlapping
contigs with Minimus.
Annotated usin

ID: MDVS01000048.1
Name: MDVS01000048
Description: Candidatus Heimdallarchaeota archaeon LC_3 HeimC3_contig000048, whole genome shotgun sequence
Database cross-references: BioProject:PRJNA319486, BioSample:SAMN04924817
Number of features: 241
/molecule_type=DNA
/topology=linear
/data_file_division=ENV
/date=10-JAN-2017
/accessions=['MDVS01000048', 'MDVS01000000']
/sequence_version=1
/keywords=['WGS']
/source=Candidatus Heimdallarchaeota archaeon LC_3 (marine sediment metagenome)
/organism=Candidatus Heimdallarchaeota archaeon LC_3
/taxonomy=['Archaea', 'Asgard group', 'Candidatus Heimdallarchaeota']
/references=[Reference(title='Asgard archaea illuminate the origin of eukaryotic cellular complexity', ...), Reference(title='Direct Submission', ...)]
/comment=Assembly extracted from an MDA-amplified Loki's Castle sample in
two rounds of binning: PhymmBL and ESOM. Assemblies were further
cleaned using unamplified coverage, and collapsing of overlapping
contigs with Minimus.
Annotated usin

In [26]:
random_assembly = SeqIO.parse(open("ncbi-genomes-2021-03-21/GCA_001940645.1_ASM194064v1_genomic.gbff", "r"), "genbank")
for record in random_assembly:
    for feature in record.features:
        print(feature)

type: source
location: [0:125086](+)
qualifiers:
    Key: altitude, Value: ['-3283 m']
    Key: collection_date, Value: ['2010']
    Key: country, Value: ['Arctic Ocean: Gakkel Ridge']
    Key: db_xref, Value: ['taxon:1841598']
    Key: environmental_sample, Value: ['']
    Key: isolate, Value: ['LC_3']
    Key: isolation_source, Value: ["LokiAmp MDA-amplified sample from Loki's castle hydrothermal vent sediment"]
    Key: lat_lon, Value: ['73.763167 N 8.463999999999942 E']
    Key: metagenome_source, Value: ['marine sediment metagenome']
    Key: mol_type, Value: ['genomic DNA']
    Key: note, Value: ['metagenomic']
    Key: organism, Value: ['Candidatus Heimdallarchaeota archaeon LC_3']
    Key: submitter_seqid, Value: ['HeimC3_contig000001']

type: gene
location: [<0:289](-)
qualifiers:
    Key: locus_tag, Value: ['HeimC3_00010']

type: CDS
location: [<0:289](-)
qualifiers:
    Key: codon_start, Value: ['1']
    Key: inference, Value: ['ab initio prediction:Prodigal:2.6']
    Key: l

type: gene
location: [143142:143916](-)
qualifiers:
    Key: locus_tag, Value: ['HeimC3_12280']

type: CDS
location: [143142:143916](-)
qualifiers:
    Key: codon_start, Value: ['1']
    Key: inference, Value: ['ab initio prediction:Prodigal:2.6']
    Key: locus_tag, Value: ['HeimC3_12280']
    Key: note, Value: ['HeimC3_12280 c20_132; verified contig; IPR019151; arCOG00347']
    Key: product, Value: ['hypothetical protein']
    Key: protein_id, Value: ['OLS25954.1']
    Key: transl_table, Value: ['11']
    Key: translation, Value: ['MEQSLANIEQSLAQCICHTTAFGDSGKTIVVGFPGFGLVGTIAAKYIIKSLDLEVVGYLRSPLIPPLAVFLDGILAYPYRIYGDLSGNQDIIVLIGESPAPPQAYYFLANAVLDWGTKFGHAEEVICLDGFSDQGEPKNDVYLVAEPDVKGKMDQYNLPKPQTGYIGGLSGAILNESIIREIDGYALLVSTTSHYPDPNGAGHLIETINKIKNLNIDTKSLFDDGEKIKQTMQDFANRTRQLADQDTQSDYKSSLYL']

type: gene
location: [144054:144975](+)
qualifiers:
    Key: gene, Value: ['speB_1']
    Key: locus_tag, Value: ['HeimC3_12290']

type: CDS
location: [144054:144975](+)
qualifiers:
    Key: EC_number


type: source
location: [0:16253](+)
qualifiers:
    Key: altitude, Value: ['-3283 m']
    Key: collection_date, Value: ['2010']
    Key: country, Value: ['Arctic Ocean: Gakkel Ridge']
    Key: db_xref, Value: ['taxon:1841598']
    Key: environmental_sample, Value: ['']
    Key: isolate, Value: ['LC_3']
    Key: isolation_source, Value: ["LokiAmp MDA-amplified sample from Loki's castle hydrothermal vent sediment"]
    Key: lat_lon, Value: ['73.763167 N 8.463999999999942 E']
    Key: metagenome_source, Value: ['marine sediment metagenome']
    Key: mol_type, Value: ['genomic DNA']
    Key: note, Value: ['metagenomic']
    Key: organism, Value: ['Candidatus Heimdallarchaeota archaeon LC_3']
    Key: submitter_seqid, Value: ['HeimC3_contig000156']

type: gene
location: [<0:1593](-)
qualifiers:
    Key: gene, Value: ['cheA_1']
    Key: locus_tag, Value: ['HeimC3_55280']

type: CDS
location: [<0:1593](-)
qualifiers:
    Key: EC_number, Value: ['2.7.13.3']
    Key: codon_start, Value: ['1']
 

In [37]:
diagram = GenomeDiagram.Diagram("Diagram")
track = diagram.new_track(1, name = "Features")
feature_set = track.new_set()

for feature in first_record.features:
    feature_set.add_feature(feature, label = True, sigil = "ARROW")

diagram.draw(format = "linear")
diagram.write("random_diagram.pdf", "PDF")

In [71]:
diagram = GenomeDiagram.Diagram("Diagram")
track = diagram.new_track(1, name = "Features")
feature_set = track.new_set()

with open("ncbi-genomes-2021-03-21/GCA_001940655.1_ASM194065v1_genomic.gbff", "r") as ols40:
    for record in SeqIO.parse(ols40, "genbank"):
        if record.id == "MBAA01000186.1":
            for feature in record.features:
                if feature.type == "CDS":
                    # Label each arrow with the protein ID and the predicted product, separated by a space.
                    label = feature.qualifiers["protein_id"][0] + " " + feature.qualifiers["product"][0]
                    feature_set.add_feature(feature, label = True, sigil = "ARROW", name = label, color = "purple", label_size = 20)
                    print(feature.qualifiers["product"] + feature.qualifiers["protein_id"])

# feature_set.add_feature(asf, label = True, sigil = "ARROW", name = "asf", color = "green",label_size = 20)
newfeat = SeqFeature(FeatureLocation(100, 1000, strand = 1))
feature_set.add_feature(newfeat, label = True, color = "red")
diagram.draw(format = "linear", fragments = 2)
diagram.write("random_diagram.pdf", "PDF")

['hypothetical protein', 'OLS12938.1']
['hypothetical protein', 'OLS12939.1']
['hypothetical protein', 'OLS12940.1']
['hypothetical protein', 'OLS12941.1']
['hypothetical protein', 'OLS12942.1']
['putative CRISPR-associated helicase', 'OLS12943.1']
['hypothetical protein', 'OLS12944.1']
['hypothetical protein', 'OLS12945.1']
['hypothetical protein', 'OLS12946.1']
['hypothetical protein', 'OLS12947.1']
['hypothetical protein', 'OLS12948.1']
['CRISPR-associated protein, Cse3 family', 'OLS12949.1']
['CRISPR-associated protein Cas1', 'OLS12950.1']
['putative CRISPR-associated protein', 'OLS12951.1']
['hypothetical protein', 'OLS12952.1']
['hypothetical protein', 'OLS12953.1']
['hypothetical protein', 'OLS12954.1']
['Uncharacterized protein', 'OLS12955.1']
['hypothetical protein', 'OLS12956.1']
['hypothetical protein', 'OLS12957.1']
['putative deoxyribonuclease', 'OLS12958.1']
['hypothetical protein', 'OLS12959.1']


In [None]:
diagram = GenomeDiagram.Diagram("Diagram")
track = diagram.new_track(1, name = "Features")
feature_set = track.new_set()
newfeat = SeqFeature(FeatureLocation(9101, 9886, strand = 1))
feature_set.add_feature(newfeat, label = True, color = "red")
newfeat = SeqFeature(FeatureLocation(10001, 11500, strand = 1))
feature_set.add_feature(newfeat, label = True, color = "red")


In [74]:
ls dnaseq_crisprcasfinder_online_results/

CasClusters.fasta*         CRISPRCasSummary.tsv*      rawCas.fna*
Cas_Clusters_Summary.tsv*  CrisprResult.csv*          rawCRISPRs.fna*
Complete.fasta*            CRTFilteredContigs.fasta*  RepeatLib.fasta*
Completion.tsv*            metadata.json*             result.json*
CRISPRCasSummary.csv*      progress.log*              stat.tsv*


In [89]:
with open("dnaseq_crisprcasfinder_online_results/result.json") as json_file:
    data = json.load(json_file)
    print(data["Sequences"][1]["Cas"])

[{'End': 16213, 'Genes': [{'End': 16213, 'Orientation': '+', 'Start': 15113, 'Sub_type': 'Cas1_0_IE'}, {'End': 13771, 'Orientation': '+', 'Start': 13043, 'Sub_type': 'Cas5_0_IE'}, {'End': 14574, 'Orientation': '+', 'Start': 13771, 'Sub_type': 'Cas6_0_IE'}, {'End': 13029, 'Orientation': '+', 'Start': 11791, 'Sub_type': 'Cas7_0_IE'}, {'End': 11676, 'Orientation': '+', 'Start': 10891, 'Sub_type': 'Cse2_0_IE'}], 'Start': 10891, 'Type': 'General-Class1'}]
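
The genes in this output are not listed in positional order, so it may help to sort them before drawing or comparing them with the GenBank annotation. The sketch below is self-contained: the dictionary is copied from the output above, and `genes_in_order` is a hypothetical helper of my own. I am assuming that "Orientation" only takes the values "+" and "-", and I make no assumption about whether the coordinates are 0-based or 1-based.

```python
# One Cas cluster, copied from the result.json output shown above.
cas_cluster = {
    "Start": 10891,
    "End": 16213,
    "Type": "General-Class1",
    "Genes": [
        {"Start": 15113, "End": 16213, "Orientation": "+", "Sub_type": "Cas1_0_IE"},
        {"Start": 13043, "End": 13771, "Orientation": "+", "Sub_type": "Cas5_0_IE"},
        {"Start": 13771, "End": 14574, "Orientation": "+", "Sub_type": "Cas6_0_IE"},
        {"Start": 11791, "End": 13029, "Orientation": "+", "Sub_type": "Cas7_0_IE"},
        {"Start": 10891, "End": 11676, "Orientation": "+", "Sub_type": "Cse2_0_IE"},
    ],
}


def genes_in_order(cluster):
    """Return (sub_type, start, end, strand) tuples sorted by start position."""
    genes = sorted(cluster["Genes"], key=lambda gene: gene["Start"])
    return [
        # Assumption: any orientation other than "+" is treated as the minus strand.
        (gene["Sub_type"], gene["Start"], gene["End"], 1 if gene["Orientation"] == "+" else -1)
        for gene in genes
    ]


for sub_type, start, end, strand in genes_in_order(cas_cluster):
    print(sub_type, start, end, strand)
```

Each tuple could then be turned into a `SeqFeature(FeatureLocation(start, end, strand = strand))` and added to a feature set, in the same way as the manually created features in the diagram cells above.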
