# Converting tRNAscan-SE output to GFF3 format

tRNAscan-SE produces multiple output files:

- `trnascan.bed`: a BED file with the coordinates of the tRNAs
- `trnascan.fasta`: a FASTA file with the sequences of the tRNAs (and their coordinates in the headers)
- `trnascan.stats`: a human-readable file with summary statistics from the program output
- `trnascan.out`: a human-readable table with per-tRNA metadata

For the resulting GFF3 file, we would like to keep all predicted tRNAs that

1. are not pseudogenes
2. don't have introns
3. don't overlap with any protein-coding genes

Points 1 and 2 can be checked in the `trnascan.out` file (via the note column and the intron
coordinates, respectively), while point 3 requires a separate GFF3 file with the protein-coding genes.

In [1]:
import pandas as pd

In [2]:
scan_loc = "/Volumes/scratch/pycnogonum/genome/draft/trnascan/"
out = scan_loc + 'trnascan.out'

# trnascan.out starts with three header lines, so skip them and name the columns by hand
trna = pd.read_csv(out, sep='\t', header=None, skiprows=3)
trna.columns = ['seqname', 'tRNA', 'start', 'end', 'type', 'anti', 'intron_start', 'intron_end', 'inf score', 'origin', 'note']

# build an ID that matches the name column of trnascan.bed, so the two tables can be joined later
trna['id'] = trna['seqname'].str.strip() + '.tRNA' + trna['tRNA'].astype(str) + '-' + trna['type'] + trna['anti']
trna.set_index('id', inplace=True)
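
As a quick illustration of the ID construction, here is a minimal sketch on two made-up rows. The sequence names and values are hypothetical; only the string recipe is taken from the cell above. The point is that the resulting ID has the same shape as the name column that is later cut from the BED file.

```python
import pandas as pd

# Toy rows standing in for two lines of trnascan.out (all values are invented).
toy = pd.DataFrame({
    "seqname": ["scaffold_1 ", "scaffold_2 "],  # trailing spaces, hence the .str.strip()
    "tRNA": [1, 3],
    "type": ["Gly", "Ala"],
    "anti": ["GCC", "AGC"],
})

# Same recipe as above: <seqname>.tRNA<number>-<type><anticodon>
toy["id"] = (
    toy["seqname"].str.strip()
    + ".tRNA" + toy["tRNA"].astype(str)
    + "-" + toy["type"] + toy["anti"]
)

print(toy["id"].tolist())
# ['scaffold_1.tRNA1-GlyGCC', 'scaffold_2.tRNA3-AlaAGC']
```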

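For point 3 we need to check whether the tRNAs overlap any protein-coding genes. We can use
`bedtools intersect` with the draft GFF3 file, keeping the tRNAs that either overlap nothing (`-v`)
or overlap a feature on the opposite strand (`-S`):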

```bash
$ bedtools intersect -wao -v -a trnascan.bed -b ../annot_merge/merged_sorted.gff3 > no_overlap.bed
$ bedtools intersect -wao -S -a trnascan.bed -b ../annot_merge/merged_sorted.gff3 | grep -e "-1" -v > overlap_otherstrand.bed
$ cat overlap_otherstrand.bed no_overlap.bed > keep.bed
$ cut -f4,6 keep.bed | sort -u > keep_unique.txt # name and strand columns; cut splits on tabs by default
```
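
To make the selection rule explicit, the following is a rough pure-Python sketch of what the two `bedtools intersect` calls select between them. It is not a replacement for bedtools. The `Feature` type, the helper names, and the coordinates are all hypothetical, and the sketch assumes every feature carries a `+` or `-` strand.

```python
from typing import NamedTuple

class Feature(NamedTuple):
    seqid: str
    start: int   # 0-based, half-open, as in BED
    end: int
    strand: str  # assumed to be "+" or "-"

def overlaps(a, b):
    """True if both features sit on the same sequence and share at least one base."""
    return a.seqid == b.seqid and a.start < b.end and b.start < a.end

def keep_trna(trna, annotations):
    """Mimic the union of the `-v` and `-S` results."""
    hits = [f for f in annotations if overlaps(trna, f)]
    if not hits:
        return True  # no overlap at all: the `-v` case
    # at least one overlapping feature on the other strand: the `-S` case
    return any(f.strand != trna.strand for f in hits)

genes = [
    Feature("scaffold_1", 100, 500, "+"),
    Feature("scaffold_1", 900, 1200, "-"),
]
trnas = {
    "intergenic": Feature("scaffold_1", 600, 672, "+"),
    "same_strand": Feature("scaffold_1", 150, 222, "+"),
    "opposite_strand": Feature("scaffold_1", 950, 1022, "+"),
}

kept = [name for name, t in trnas.items() if keep_trna(t, genes)]
print(kept)  # ['intergenic', 'opposite_strand']
```

Note that under this rule a tRNA overlapping one feature on each strand is still kept, because it appears in the `-S` output.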

In [3]:
trna_ids = pd.read_table(scan_loc + "keep_unique.txt", header=None, index_col=0)
trna_ids.columns = ['strand']

In [4]:
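# inner join: only tRNAs that passed the bedtools overlap filter survive, and gain a strand column
keep = trna.join(trna_ids, how='inner')
# point 1: an empty note column means the prediction is not flagged (e.g. as a pseudogene)
not_pseudo = keep['note'].fillna("") == ""
# point 2: tRNAs without an intron have both intron coordinates set to 0
not_intron = (keep['intron_start'] == 0) & (keep['intron_end'] == 0)

gff = keep[not_pseudo & not_intron]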

In [5]:
def gff_string(index, name, row):
    """Return two GFF3 lines (a gene and its tRNA) for one row of the filtered table."""
    seqid = row['seqname'].strip()
    source = "tRNAscan-SE"
    # minus-strand tRNAs are reported with start > end, but GFF3 requires start <= end
    start = min(row['start'], row['end'])
    stop = max(row['start'], row['end'])
    score = "."
    strand = row['strand']
    phase = "."
    # the name has the form tRNA:<type>-<anticodon>, e.g. tRNA:Gly-GCC
    gene_attributes = f"ID=tRNA{index+1};gene_id={name};name=tRNA:{row['type']}-{row['anti']};score={row['inf score']}"
    gene_line = f"{seqid}\t{source}\tgene\t{start}\t{stop}\t{score}\t{strand}\t{phase}\t{gene_attributes};\n"
    trna_attributes = f"ID=tRNAscan{index+1};name=tRNA:{row['type']}-{row['anti']}"
    trna_line = f"{seqid}\t{source}\ttRNA\t{start}\t{stop}\t{score}\t{strand}\t{phase}\t{trna_attributes};\n"
    return gene_line + trna_line

In [6]:
with open(scan_loc + 'trnascan.gff3', 'w') as f:
    for i, (name, row) in enumerate(gff.iterrows()):
        f.write(gff_string(i, name, row))
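
After writing the file, it may be worth reading it back for a basic sanity check. The sketch below only tests a few structural rules of GFF3 (nine tab-separated columns, 1-based `start <= end`, a valid strand symbol); it is not a full validator. `check_gff3` is a hypothetical helper, and the demo lines use invented values shaped like the output of `gff_string`.

```python
import os
import tempfile

VALID_STRANDS = {"+", "-", ".", "?"}

def check_gff3(path):
    """Return the number of feature lines; raise ValueError on the first malformed one."""
    n_features = 0
    with open(path) as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines, directives and comments
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 9:
                raise ValueError(f"line {line_no}: expected 9 columns, found {len(fields)}")
            start, end = int(fields[3]), int(fields[4])
            if start < 1 or start > end:
                raise ValueError(f"line {line_no}: bad coordinates {start}-{end}")
            if fields[6] not in VALID_STRANDS:
                raise ValueError(f"line {line_no}: bad strand {fields[6]!r}")
            n_features += 1
    return n_features

# Demo on one gene/tRNA pair with made-up values.
demo = (
    "scaffold_1\ttRNAscan-SE\tgene\t601\t672\t.\t+\t.\t"
    "ID=tRNA1;gene_id=scaffold_1.tRNA1-GlyGCC;name=tRNA:Gly-GCC;score=68.2;\n"
    "scaffold_1\ttRNAscan-SE\ttRNA\t601\t672\t.\t+\t.\t"
    "ID=tRNAscan1;name=tRNA:Gly-GCC;\n"
)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.gff3")
    with open(demo_path, "w") as f:
        f.write(demo)
    n_features = check_gff3(demo_path)

print(n_features)  # 2
```

On the real output, the call would be `check_gff3(scan_loc + 'trnascan.gff3')`, and the count should equal twice the number of rows in `gff`.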