Analysis of protein coding exons in single molecule assemblies

sm_assemblies

Analysis of protein coding exons in single molecule assemblies

dependencies

clone this repo

git clone https://github.com/WatsonLab/sm_assemblies.git
cd sm_assemblies

download genomes

/bin/bash scripts/download.sh
gunzip *.fasta.gz
mkdir genomes
mv *.fasta genomes

run

snakemake --use-conda

parse splign output

The Perl script scripts/alnparse.pl summarises the splign output.

The pipeline stores splign results in one file per query/subject pair. For example, "ENST00000052569.10.OCVW01001666.1.aln" contains all of the splign hits from ENST00000052569.10 (query) against OCVW01001666.1 (subject).
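The naming scheme can be taken apart again with standard shell tools. The following is a minimal sketch, not part of the pipeline: `split_aln_name` is a hypothetical helper, and it assumes that both the query and the subject ID have the form NAME.VERSION with no other dots, as in the example above.

```shell
# Hypothetical helper: split "<query>.<subject>.aln" into its two IDs.
# Assumption: each ID is NAME.VERSION and contains exactly one dot.
split_aln_name() {
    base=$(basename "$1" .aln)                           # drop directory and .aln suffix
    query=$(printf '%s\n' "$base" | cut -d. -f1-2)       # first two dot-separated fields
    subject=$(printf '%s\n' "$base" | cut -d. -f3-)      # everything after the query
    printf '%s\t%s\n' "$query" "$subject"
}

# Prints the query and subject IDs separated by a tab.
split_aln_name ENST00000052569.10.OCVW01001666.1.aln
```

IDs with a different number of dots would need a different split rule.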

To summarise the best hit from a single alignment file, run the script like this:

perl scripts/alnparse.pl ENST00000052569.10.OCVW01001666.1.aln

Often, however, we want to consider the hits against multiple subjects in order to find the best overall hit. In this case, concatenate the alignment files with bash process substitution:

perl scripts/alnparse.pl <(cat ENST00000052569.10.*.aln)

This will find the best hit from alignments of ENST00000052569.10 against all subjects it has been aligned against.
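To repeat this for every query in a directory of alignment files, a loop along the following lines may help. This is a sketch under assumptions: `list_queries` and `summarise_all` are hypothetical helper names, the directory holding the .aln files is passed in by the caller, and query IDs are assumed to have the form NAME.VERSION. It writes the concatenated alignments to a temporary file rather than using `<(...)`, so it does not depend on bash.

```shell
# List the unique query IDs among the .aln files in a directory.
# Assumption: file names look like QUERY.VERSION.SUBJECT.VERSION.aln.
list_queries() {
    for f in "$1"/*.aln; do
        [ -e "$f" ] || continue               # skip the unexpanded glob if no files match
        basename "$f" .aln | cut -d. -f1-2    # keep only the query part of the name
    done | sort -u
}

# Run alnparse.pl once per query, over all of that query's subjects.
summarise_all() {
    dir=$1
    tmp=$(mktemp)
    list_queries "$dir" | while IFS= read -r q; do
        cat "$dir/$q".*.aln > "$tmp"                      # all subjects for this query
        perl scripts/alnparse.pl "$tmp" < /dev/null       # keep perl away from the loop's stdin
    done
    rm -f "$tmp"
}
```

Called from the repository root as `summarise_all path/to/aln_dir > best_hits.tsv`, this would collect one summary per query; both the directory and the output file name here are placeholders.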

The output is tab-delimited, with the following fields:

  • query name
  • hit name
  • query start
  • length of alignment
  • number of mismatch events
  • number of bases in mismatch events
  • number of insertion events
  • number of bases in insertion events
  • number of deletion events
  • number of bases in deletion events
  • protein sequence of aligned bases
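
Because the output is plain tab-delimited text, it can be post-processed with awk. The sketch below assumes the eleven fields appear in the order listed above; `edited_bases` is a hypothetical helper, and the input row is made up purely for illustration.

```shell
# Assumption: columns are in the order listed above (1 = query name ... 11 = protein).
# Prints query, hit, alignment length, and the total number of bases in
# mismatch, insertion and deletion events (columns 6 + 8 + 10).
edited_bases() {
    awk -F'\t' 'BEGIN { OFS = "\t" } { print $1, $2, $4, $6 + $8 + $10 }'
}

# Made-up example row, not real alnparse.pl output.
printf 'ENST00000052569.10\tOCVW01001666.1\t1\t300\t2\t2\t1\t3\t0\t0\tMKV\n' | edited_bases
```

In practice the function would read the script's output directly, for example `perl scripts/alnparse.pl ENST00000052569.10.OCVW01001666.1.aln | edited_bases`.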