This notebook troubleshoots plastome annotations using NCBI-provided software pipelines. It follows a suggestion from the NCBI team, which rejected the annotation twice after upload.

In [9]:
import os

# Defining variables for files
template = 'plastomes/final/template.sbt'
gb = 'plastomes/final/Crepis_callicephala.gb'
fasta = 'plastomes/in/Crepis_callicephala.fasta'

# command itself:
# table2asn -t template.sbt -i sequence.fsa

In [2]:
# check for fasta

from Bio import SeqIO, SeqRecord

try:
    with open(fasta, "r") as fasta_file:
        print(f"File {fasta} does exist.")
except FileNotFoundError:
    print(f"File '{fasta}' not found.\nConverting genbank file '{gb}' to FASTA...")
    try:
        first_record = next(SeqIO.parse(gb, "genbank"))
        SeqIO.write(first_record, fasta, "fasta")
        print(f"Successfully created '{fasta}'.")
    except FileNotFoundError:
        print(f"Error: The source GenBank file '{gb}' was not found.")
    except StopIteration:
        print(f"Error: The GenBank file '{gb}' is empty.")

File plastomes/in/Crepis_callicephala.fasta does exist.


## Run `table2asn` program

The NCBI site gives some recommendations that are not covered in the program's README.
`table2asn` will recognize files with **the same basename** as the input sequence file. Sequences that are part of a plasmid, or an organellar chromosome, or specific nuclear chromosomes need to have that information included in the fasta definition line, in these formats:

- [location=mitochondrion]
- [location=chloroplast]

Sequences that are a complete circular chromosome or plasmid need to have the circular topology and the completeness included.

- [topology=circular] [completeness=complete]
- [topology=circular] gap at end, not circularized
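
The modifiers listed above go on the FASTA definition line, after the sequence ID. A minimal sketch of how such a line could be assembled, assuming a single-record FASTA; `with_modifiers` is a hypothetical helper written for this notebook, not part of `table2asn` or Biopython:

```python
def with_modifiers(defline: str, modifiers: dict[str, str]) -> str:
    """Return a FASTA definition line with [key=value] modifiers appended.

    Only the sequence ID (first token after '>') is kept from the old line;
    modifiers are written in the order given.
    """
    if not defline.startswith(">"):
        raise ValueError("not a FASTA definition line")
    fields = defline[1:].split()
    if not fields:
        raise ValueError("definition line has no sequence ID")
    seq_id = fields[0]
    tags = " ".join(f"[{key}={value}]" for key, value in modifiers.items())
    return f">{seq_id} {tags}".rstrip()


new_defline = with_modifiers(
    ">Crepis_callicephala",
    {
        "organism": "Crepis callicephala",
        "location": "chloroplast",
        "topology": "circular",
        "completeness": "complete",
    },
)
print(new_defline)
```

The helper drops any free-text description after the ID, so it should only be applied to a copy of the sequence file, never to the original.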




In [3]:
# preparing directory
# all the files should be stored at the same directory
import os

project_dir = 'plastomes/table2asn'
tbl_file = "plastomes/final/gb2sequin_out/c_callicephala_final/05/Crepis_callicephala___149980_bp____DNA.tbl"
sqn_file = "plastomes/final/gb2sequin_out/c_callicephala_final/05/Crepis_callicephala___149980_bp____DNA(1).sqn"

SHORT_NAMES = {
    "Crepis_callicephala": "cc",
    "Crepis_purpurea": "cp",
}

def create_symlinks(project_dir: str, sourcefile: str):
    """
    Create a symlink to `sourcefile` in `project_dir/<short label>/`.

    The link is named `<species>.<ext>` so that its basename matches
    the sequence ID in the FASTA header.
    """
    basename = os.path.basename(sourcefile)
    ext = basename.rsplit(".", maxsplit=1)[-1]

    # handle sqn and tbl names: drop the "___<length>_bp____DNA" suffix
    basename = basename.split("___")[0]
    species = basename.split(".")[0]

    label = SHORT_NAMES[species]
    abs_target_dir = os.path.abspath(os.path.join(project_dir, label))
    # filename should match sequence id in FASTA header
    target_filename = os.path.join(abs_target_dir, f"{species}.{ext}")
    # use an absolute source path so the link resolves from the target directory
    abs_sourcefile = os.path.abspath(sourcefile)

    # os.path.exists() returns a bool and never raises, so test it explicitly
    if not os.path.exists(sourcefile):
        print(f"Source file '{sourcefile}' does not exist.")
        return

    try:
        os.symlink(abs_sourcefile, target_filename)
        print(f"Symlink '{target_filename}' for '{abs_sourcefile}' was successfully created.")
    except FileExistsError:
        print(f"Symlink for '{sourcefile}' is already exist.")
    except PermissionError:
        print("You have no permissions for this action.")


print("Creating symbolic links:")
# create_symlinks(project_dir, gb) #  this is output file
# create_symlinks(project_dir, fasta) # the header line should be modified
create_symlinks(project_dir, tbl_file)
#create_symlinks(project_dir, sqn_file) #  this is output file


Creating symbolic links:
Symlink for 'plastomes/final/gb2sequin_out/c_callicephala_final/05/Crepis_callicephala___149980_bp____DNA.tbl' is already exist.


The header line of the FASTA file should follow the definition line conventions. To change the header while keeping the original files intact, the file is copied rather than symlinked.

In [13]:
import shutil

destination_directory = f"{project_dir}/cc"
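# splitext avoids nesting double quotes inside the f-string (a syntax error before Python 3.12)
dest_file = f"{destination_directory}/{os.path.splitext(os.path.basename(fasta))[0]}.fsa"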
print("destination:", dest_file)

if not os.path.exists(dest_file):
    try:
        shutil.copy(fasta, dest_file)
        print(f"File '{fasta}' copied to '{destination_directory}' successfully.")
    except FileNotFoundError:
        print("Source file not found.")
    except Exception as e:
        print(f"An error occurred: {e}")

destination: plastomes/table2asn/cc/Crepis_callicephala.fsa
File 'plastomes/in/Crepis_callicephala.fasta' copied to 'plastomes/table2asn/cc' successfully.


In [14]:
# testing table2asn
import shlex
import subprocess
import sys


# Configuration
CONDA_ENV_NAME = "table2asn-env"  # conda environment that provides table2asn
INPUT_SUBDIR = "cc"

location = "[location=chloroplast]"
topology = "[topology=circular]"
completeness = "[completeness=complete]"
organism = "[organism=Crepis callicephala]"
code = "[gcode=11]"
pcode = "[gpcode=11]"

src_qualifiers = f"{organism} {location} {topology} {completeness} {code}"
print(src_qualifiers)

input_dir = f"{project_dir}/{INPUT_SUBDIR}"
# Check inputs
print(f"Checking input directory...\n  {input_dir} is present: {os.path.isdir(input_dir)}")
print(f"Checking input files...\n  {template} is present: {os.path.isfile(template)}")
for i in os.listdir(input_dir):
    file = os.path.join(input_dir, i)
    # os.path.exists() is False for a broken symlink
    present_flag = os.path.exists(file)
    print(f"  {i} is present: {present_flag}")
    if not present_flag:
        print(f"    relative path: {file}")


# table2asn -indir {project_dir}/cc -t {template} -j {comment}

cmd = [
    "conda", "run", "-n", CONDA_ENV_NAME,
    "table2asn",
    "-indir", os.path.abspath(input_dir),
    "-t", os.path.abspath(template),
    # list arguments bypass the shell, so the value must not carry its own quotes
    "-j", src_qualifiers,
    "-V", "vb",
    "-verbose",
]

cmd_version = [
    "conda", "run", "-n", CONDA_ENV_NAME,
    "table2asn",
    "-version",
]

# shlex.join shows the command with shell-style quoting
print(f"Running: {shlex.join(cmd)}")
#print(f"Running: {shlex.join(cmd_version)}")

try:
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    print("STDOUT:", result.stdout, flush=True)
    print("STDERR:", result.stderr, flush=True)
    print("`table2asn` completed successfully!", flush=True)
except subprocess.CalledProcessError as e:
    print(f"`table2asn` failed with exit code {e.returncode}")
    print("STDERR:", e.stderr, flush=True)
    sys.exit(e.returncode)

[organism=Crepis callicephala] [location=chloroplast] [topology=circular] [completeness=complete] [gcode=11]
Checking input directory...
  plastomes/table2asn/cc is present: True
Checking input files...
  plastomes/final/template.sbt is present: True
  Crepis_callicephala.tbl is present: True
  Crepis_callicephala.fsa is present: True
Running: conda run -n table2asn-env table2asn -indir /home/asan/BIO_GENOMICS/notebooks/plastome_postprocessing/plastomes/table2asn/cc -t /home/asan/BIO_GENOMICS/notebooks/plastome_postprocessing/plastomes/final/template.sbt -j '[organism=Crepis callicephala] [location=chloroplast] [topology=circular] [completeness=complete] [gcode=11]' -V vb -verbose
STDOUT: 
STDERR: This copy of table2asn is more than 1 year old. Please download the current version if it is newer.
Will be using one threads
Recognized annotation format: five-column feature table
Falling back on built-in data for popular organisms.


`table2asn` completed successfully!


`table2asn` did not work as expected: the working directory does not contain any output files.
The message "This copy of table2asn is more than 1 year old. Please download the current version if it is newer." appeared in STDERR, so launching the binary executable from a fresh release might be a solution.

## Solution
The problem was that the FASTA file had been copied with the `.fasta` file extension. After changing it to `.fsa`, `table2asn` ran correctly.
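
A pre-flight check can catch this before the run. The sketch below assumes, as the behavior above suggests, that in `-indir` mode only files with the `.fsa` suffix are treated as sequence files; `check_input_dir` is a hypothetical helper, and the demonstration uses empty placeholder files in a temporary directory rather than the real data:

```python
import os
import tempfile


def check_input_dir(input_dir: str, suffix: str = ".fsa") -> dict[str, bool]:
    """Map each sequence basename to whether a matching .tbl file exists.

    Only files ending in `suffix` count as sequence files; anything else
    (for example a `.fasta` copy) is ignored.
    """
    names = sorted(os.listdir(input_dir))
    report = {}
    for name in names:
        base, ext = os.path.splitext(name)
        if ext == suffix:
            report[base] = f"{base}.tbl" in names
    return report


# Demonstration on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("Crepis_callicephala.fasta", "Crepis_callicephala.tbl"):
        open(os.path.join(tmp, name), "w").close()
    before = check_input_dir(tmp)  # no .fsa file yet
    os.rename(
        os.path.join(tmp, "Crepis_callicephala.fasta"),
        os.path.join(tmp, "Crepis_callicephala.fsa"),
    )
    after = check_input_dir(tmp)

print("before rename:", before)
print("after rename:", after)
```

An empty report means no sequence file would be picked up, which matches a run that exits cleanly but writes no output.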

In [6]:
table2asn = "bin/table2asn.linux64"

import subprocess
import sys

cmd_version = [
    table2asn,
    "-version",
]

try:
    result = subprocess.run(cmd_version, check=True, capture_output=True, text=True)
    print("STDOUT:", result.stdout, flush=True)
    print("STDERR:", result.stderr, flush=True)
    print("`table2asn` completed successfully!", flush=True)
except subprocess.CalledProcessError as e:
    print(f"`table2asn` failed with exit code {e.returncode}")
    sys.exit(e.returncode)



STDOUT: table2asn: 1.29.324

STDERR: 
`table2asn` completed successfully!


In [7]:
import os

cmd = [
    table2asn,
    "-indir", str(os.path.abspath(input_dir)),
    "-t", str(os.path.abspath(template)),
    "-V", "vb",
    "-verbose",
]

print(f"Running: {' '.join(cmd)}")

try:
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    print("STDOUT:", result.stdout, flush=True)
    print("STDERR:", result.stderr, flush=True)
    print("`table2asn` completed successfully!", flush=True)
except subprocess.CalledProcessError as e:
    print(f"`table2asn` failed with exit code {e.returncode}")
    sys.exit(e.returncode)

Running: bin/table2asn.linux64 -indir /home/asan/BIO_GENOMICS/notebooks/plastome_postprocessing/plastomes/table2asn/cc -t /home/asan/BIO_GENOMICS/notebooks/plastome_postprocessing/plastomes/final/template.sbt -V vb -verbose
STDOUT: 
STDERR: 
`table2asn` completed successfully!


## Summarizing results
There are 4 errors of two types in the Crepis callicephala genome. The reference genome annotations were validated via GB2Sequin and compared to the C. callicephala plastid genome.
### Errors

| species | accession | SEQ_INST.BadProteinStart | SEQ_FEAT.StartCodon |
| ------- | --------- | ------------------------ | ------------------- |
| Crepis callicephala | - | psbL, ndhD | "photosystem II subunit L", "NADH dehydrogenase subunit D" |
| Lactuca sativa | NC_007578 | psbL, ndhD | - |
| Lactuca sativa cv. Ramosa | PP999684 | psbL | - |
| Nicotiana tabacum | NC_001879 | psbL | - |
| Arabidopsis thaliana | NC_000932 | ndhD | - |
| Oryza sativa | NC_031333 | rpl2, rpl2 | - |

At a minimum, each fatal error should be checked manually. One possible solution for the 'illegal start codon' error type is to define an exception in the annotation.
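
Before defining an exception, it helps to see which codon the CDS actually starts with. The sketch below is a manual check under stated assumptions: the CDS is a single interval, coordinates are 1-based and inclusive, and the allowed set is the list of initiation codons of NCBI translation table 11. `start_codon` and `is_legal_start` are hypothetical helpers, and the sequence is a toy example, not plastome data:

```python
# Initiation codons of NCBI translation table 11
# (bacterial, archaeal and plant plastid code).
TABLE_11_STARTS = {"TTG", "CTG", "ATT", "ATC", "ATA", "ATG", "GTG"}

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def start_codon(genome: str, start: int, end: int, strand: int = 1) -> str:
    """Return the first codon of a single-interval CDS.

    `start` and `end` are 1-based inclusive coordinates;
    `strand` is 1 for the forward strand and -1 for the complement.
    """
    region = genome[start - 1:end]
    if strand == -1:
        # reverse complement, so the codon is read 5' to 3'
        region = region.translate(_COMPLEMENT)[::-1]
    return region[:3].upper()


def is_legal_start(codon: str) -> bool:
    """True if the codon is an initiation codon in translation table 11."""
    return codon.upper() in TABLE_11_STARTS


# Toy sequence with one forward and one complement-strand CDS.
toy = "CCATGAAATAGCCCCTATTTCGTGG"
forward = start_codon(toy, 3, 11)
reverse = start_codon(toy, 15, 23, strand=-1)
print(forward, is_legal_start(forward))
print(reverse, is_legal_start(reverse))
```

A codon outside the set, such as the `ACG` on the complement strand here, is the kind of case that needs manual review before an exception is added.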

### Warnings
Warnings in the Crepis callicephala genome validation report:

| Warning | Feature | Feature_description | Location |
| - | - | - | - |
| SEQ_FEAT.CDSgeneRange | CDS | ribosomal protein S12 <133> | rps12:[lcl\|Crepis_callicephala:c67482-67369, 136002-136794] |
| SEQ_FEAT.CDSgeneRange | CDS | Ycf15 protein <158> | ycf15:lcl\|Crepis_callicephala:c96959-96768 |
| SEQ_FEAT.CDSgeneRange | CDS | ribosomal protein S7 <159> | rps7:lcl\|Crepis_callicephala:c94947-94480 |
| SEQ_FEAT.CDSgeneRange | CDS | NADH dehydrogenase subunit B <160> | ndhB:lcl\|Crepis_callicephala:c94189-91988 |
| SEQ_FEAT.CDSgeneRange | CDS | Ycf2 protein <161> | ycf2:lcl\|Crepis_callicephala:84073-90927 |
| SEQ_FEAT.CDSgeneRange | CDS | ribosomal protein L23 <162> | rpl23:lcl\|Crepis_callicephala:c83722-83441 |
| SEQ_FEAT.CDSgeneRange | CDS | ribosomal protein L2 <163> | rpl2:lcl\|Crepis_callicephala:c83422-81933 |
| SEQ_FEAT.GeneXrefWithoutGene | tRNA | - | lcl\|Crepis_callicephala:c7316-7244 |
| SEQ_FEAT.MissingTrnaAA | tRNA | - | lcl\|Crepis_callicephala:c7316-7244 |
| SEQ_FEAT.GeneXrefWithoutGene | tRNA | - | lcl\|Crepis_callicephala:c8577-8490 |
| SEQ_FEAT.MissingTrnaAA | tRNA | - | lcl\|Crepis_callicephala:c8577-8490 |
| SEQ_FEAT.GeneXrefWithoutGene | tRNA | - | lcl\|Crepis_callicephala:c11817-11734 |
| SEQ_FEAT.MissingTrnaAA | tRNA | - | lcl\|Crepis_callicephala:c11817-11734 |
| SEQ_FEAT.GeneXrefWithoutGene | tRNA | - | lcl\|Crepis_callicephala:c11992-11921 |
| SEQ_FEAT.MissingTrnaAA | tRNA | - | lcl\|Crepis_callicephala:c11992-11921 |
| SEQ_FEAT.GeneXrefWithoutGene | tRNA | - | lcl\|Crepis_callicephala:c29889-29818 |
| SEQ_FEAT.MissingTrnaAA | tRNA | - | lcl\|Crepis_callicephala:c29889-29818 |
| SEQ_FEAT.MissingTrnaAA | tRNA | - | lcl\|Crepis_callicephala:c35029-34947 |
| SEQ_FEAT.GeneXrefWithoutGene | tRNA | Gly | lcl\|Crepis_callicephala:35855-35925 |
| SEQ_FEAT.MissingTrnaAA | tRNA | - | lcl\|Crepis_callicephala:44778-44864 |
| SEQ_FEAT.GeneXrefWithoutGene | tRNA | Met | lcl\|Crepis_callicephala:c83961-83888 |
| SEQ_FEAT.MissingTrnaAA | tRNA | - | lcl\|Crepis_callicephala:110967-111046 |
| SEQ_FEAT.MissingTrnaAA | tRNA | - | (lcl\|Crepis_callicephala:c132067-132025, c131268-131234) |
| SEQ_FEAT.GeneXrefWithoutGene | tRNA | Met | lcl\|Crepis_callicephala:147834-147908 |
| SEQ_FEAT.GeneXrefWithoutGene | exon | /number=2 | lcl\|Crepis_callicephala:c1794-1759 |
| SEQ_FEAT.GeneXrefWithoutGene | intron | /number=1 | lcl\|Crepis_callicephala:c4320-1795 |
| SEQ_FEAT.GeneXrefWithoutGene | exon | /number=1 | lcl\|Crepis_callicephala:c4358-4321 |
| SEQ_FEAT.GeneXrefWithoutGene | exon | /number=2 | lcl\|Crepis_callicephala:47244-47293 |
| SEQ_FEAT.GeneXrefWithoutGene | exon | /number=1 | lcl\|Crepis_callicephala:c51727-51690 |
| SEQ_FEAT.GeneXrefWithoutGene | exon | /number=1 | lcl\|Crepis_callicephala:99728-99770 |
| SEQ_FEAT.GeneXrefWithoutGene | intron | /number=1 | lcl\|Crepis_callicephala:99771-100526 |
| SEQ_FEAT.GeneXrefWithoutGene | exon | /number=2 | lcl\|Crepis_callicephala:100527-100561 |
| SEQ_FEAT.GeneXrefStrandProblem | CDS | ribosomal protein S12 <133> | (lcl\|Crepis_callicephala:c67482-67369, c95793-95563, c95027-95001) |
| SEQ_FEAT.GeneXrefStrandProblem | CDS | Ycf15 protein <158> | lcl\|Crepis_callicephala:134836-135027 |
| SEQ_FEAT.GeneXrefStrandProblem | CDS | ribosomal protein S7 <159> | lcl\|Crepis_callicephala:136848-137315 |
| SEQ_FEAT.GeneXrefStrandProblem | CDS | NADH dehydrogenase subunit B <160> | (lcl\|Crepis_callicephala:137606-138382, 139052-139807) |
| SEQ_FEAT.GeneXrefStrandProblem | CDS | Ycf2 protein <161> | lcl\|Crepis_callicephala:c147722-140868 |
| SEQ_FEAT.GeneXrefStrandProblem | CDS | ribosomal protein L23 <162> | lcl\|Crepis_callicephala:148073-148354 |
| SEQ_FEAT.GeneXrefStrandProblem | CDS | ribosomal protein L2 <163> | (lcl\|Crepis_callicephala:148373-148763, 149429-149862) |
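
With this many rows, a per-type count is easier to read than the full table. A minimal sketch, assuming the rows are available as Markdown table lines with `|` inside a cell escaped as `\|`, as in the table above; `count_by_code` is a hypothetical helper, and the three sample rows are copied from that table:

```python
import re
from collections import Counter


def count_by_code(rows: list[str]) -> Counter:
    """Count Markdown table rows by the value in their first column."""
    counts = Counter()
    for row in rows:
        # Split on pipes that are not escaped as "\|".
        cells = [cell.strip() for cell in re.split(r"(?<!\\)\|", row)]
        # A row "| a | b |" splits into ["", "a", "b", ""].
        if len(cells) > 2 and cells[1]:
            counts[cells[1]] += 1
    return counts


# Three rows copied from the table above.
rows = [
    r"| SEQ_FEAT.CDSgeneRange | CDS | Ycf15 protein <158> | ycf15:lcl\|Crepis_callicephala:c96959-96768 |",
    r"| SEQ_FEAT.GeneXrefWithoutGene | tRNA | - | lcl\|Crepis_callicephala:c7316-7244 |",
    r"| SEQ_FEAT.MissingTrnaAA | tRNA | - | lcl\|Crepis_callicephala:c7316-7244 |",
]
counts = count_by_code(rows)
for code, n in counts.most_common():
    print(f"{code}: {n}")
```

Only data rows should be passed in; the header and separator rows would otherwise be counted as codes of their own.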

### Notes
For C. callicephala, lines of type 'Note' are absent from the table2asn validation output.

In [None]:
# parsing text file with validation warnings to print it in structured format.

file_2_parse = 'plastomes/table2asn/cc/Crepis_callicephala.val'

with open(file_2_parse, "r") as handle:
    for line in handle:
        warning = "-"
        feature = "-"
        gene = "-"
        location = "-"
        if "Warning:" in line:
            line = line.split(" ", maxsplit=3)
            warning = line[2].strip("[]")
            feature = line[-1].rsplit('FEATURE: ')[-1].split(": ")[0]
            gene = line[-1].rsplit('FEATURE: ')[-1].split("[")[0].split(": ")[-1]
            location = line[-1].split("[", maxsplit=1)[1].split("] ")[0]
            # sanitizing values
            variables = [warning, feature, gene, location]
            for i, value in enumerate(variables):
                value = value.strip(" ")
                if not value or value.isspace():
                    variables[i] = "-"
                elif "|" in value:
                    # "\\|" is a literal backslash plus pipe; "\|" is an invalid escape sequence
                    variables[i] = value.replace("|", "\\|")
                else:
                    variables[i] = value
            warning, feature, gene, location = variables
            # printing in md-compatible format
            print(f"| {warning} | {feature} | {gene} | {location} |")

| SEQ_FEAT.CDSgeneRange | CDS | ribosomal protein S12 <133> | rps12:[lcl\|Crepis_callicephala:c67482-67369, 136002-136794] |
| SEQ_FEAT.CDSgeneRange | CDS | Ycf15 protein <158> | ycf15:lcl\|Crepis_callicephala:c96959-96768 |
| SEQ_FEAT.CDSgeneRange | CDS | ribosomal protein S7 <159> | rps7:lcl\|Crepis_callicephala:c94947-94480 |
| SEQ_FEAT.CDSgeneRange | CDS | NADH dehydrogenase subunit B <160> | ndhB:lcl\|Crepis_callicephala:c94189-91988 |
| SEQ_FEAT.CDSgeneRange | CDS | Ycf2 protein <161> | ycf2:lcl\|Crepis_callicephala:84073-90927 |
| SEQ_FEAT.CDSgeneRange | CDS | ribosomal protein L23 <162> | rpl23:lcl\|Crepis_callicephala:c83722-83441 |
| SEQ_FEAT.CDSgeneRange | CDS | ribosomal protein L2 <163> | rpl2:lcl\|Crepis_callicephala:c83422-81933 |
| SEQ_FEAT.GeneXrefWithoutGene | tRNA | - | lcl\|Crepis_callicephala:c7316-7244 |
| SEQ_FEAT.MissingTrnaAA | tRNA | - | lcl\|Crepis_callicephala:c7316-7244 |
| SEQ_FEAT.GeneXrefWithoutGene | tRNA | - | lcl\|Crepis_callicephala:c8577-8490 |
...


## Checking tRNA data
### Re-assessment using ARAGORN v1.2.41
The command to use:
```shell
aragorn -v -e -gcbact -c -i -o $output $fasta
```
`-gcbact`   Use Bacterial/Plant chloroplast genetic code.

`-c`        Assume that each sequence has a circular topology. Search wraps around each end.

`-d`        Double. Search both strands of each sequence. Default setting

`-i`        Search for tRNA genes with introns in anticodon loop with maximum length 3000 bases. Minimum intron length is 0 bases.

`-e`        Print out score for each reported gene.

`-v`        Verbose. Prints out information during search to STDERR.

`-o <outfile>`    Print output to <outfile>. If <outfile> already exists, it is overwritten. By default all output goes to stdout.
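
The options above can be assembled into an argument list for `subprocess`. A minimal sketch that uses only the flags described in this section; `build_aragorn_cmd` is a hypothetical helper, and it assumes `aragorn` is directly on the PATH rather than inside a conda environment:

```python
import shlex


def build_aragorn_cmd(sequence_file, output_file, circular=True, introns=True):
    """Assemble an ARAGORN command line from the options described above."""
    cmd = ["aragorn", "-v", "-e", "-gcbact"]
    if circular:
        cmd.append("-c")  # search wraps around the sequence ends
    if introns:
        cmd.append("-i")  # allow introns in the anticodon loop
    cmd += ["-o", output_file, sequence_file]
    return cmd


aragorn_example = build_aragorn_cmd(
    "plastomes/table2asn/cc/Crepis_callicephala.fsa",
    "plastomes/table2asn/cc/Crepis_callicephala.aragorn.trnas.txt",
)
print(shlex.join(aragorn_example))
```

Building the list in one place keeps the printed command and the executed command identical.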

In [None]:
import subprocess
import sys

fasta = "plastomes/table2asn/cc/Crepis_callicephala.fsa"
gb = "plastomes/table2asn/cc/Crepis_callicephala.gbf"
aragorn_out = "plastomes/table2asn/cc/Crepis_callicephala.aragorn.trnas.txt"

aragorn_cmd = [
    "conda", "run", "-n", "aragorn-env",
    "aragorn",
    "-v", "-e",
    "-gcbact", 
    "-c", "-i", 
    "-o", aragorn_out, 
    gb,
]

try:
    # run the ARAGORN command defined above, not the earlier table2asn `cmd`
    result = subprocess.run(aragorn_cmd, check=True, capture_output=True, text=True)
    print("STDOUT:", result.stdout, flush=True)
    print("STDERR:", result.stderr, flush=True)
    print("`ARAGORN` run completed successfully!", flush=True)
except subprocess.CalledProcessError as e:
    print(f"`ARAGORN` failed with exit code {e.returncode}")
    sys.exit(e.returncode)