# Comparison of DevBrain IsoSeq vs CHESS Transcriptomes


## Download CHESS v3.0.1 Datasets
### Primary Transcriptome
* 168451 transcripts

In [1]:
cd /u/project/gandalm/gandalm/jupyter/240529_IsoSeq_vs_CHESS
# Create the working directories if they do not already exist
mkdir -p data results

In [2]:
wget -P ./data https://github.com/chess-genome/chess/releases/download/v.3.0.1/chess3.0.1.gtf.gz
gunzip data/chess3.0.1.gtf.gz

# Count transcript features
awk '$3 == "transcript"' data/chess3.0.1.gtf | wc -l

168451
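
The count above relies on awk's default whitespace splitting, which works as long as the first three GTF columns contain no spaces. A slightly more defensive variant is sketched below: it splits on tabs only and skips comment lines. The helper name `count_transcripts` and the toy records are illustrative assumptions, not part of the original analysis.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count `transcript` feature lines in a GTF file.
count_transcripts() {
    # -F'\t' splits on tabs only; lines starting with '#' are ignored.
    awk -F'\t' '!/^#/ && $3 == "transcript" { n++ } END { print n + 0 }' "$1"
}

# Toy GTF with made-up records so the sketch runs on its own.
toy_gtf="$(mktemp)"
{
    printf '# toy annotation\n'
    printf 'chr1\ttoy\ttranscript\t1\t300\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'
    printf 'chr1\ttoy\texon\t1\t100\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'
    printf 'chr1\ttoy\ttranscript\t500\t800\t.\t+\t.\tgene_id "g2"; transcript_id "t2";\n'
    printf 'chr1\ttoy\texon\t500\t800\t.\t+\t.\tgene_id "g2"; transcript_id "t2";\n'
} > "$toy_gtf"

count_transcripts "$toy_gtf"
```

On the real data the call would be `count_transcripts data/chess3.0.1.gtf`.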


### Assembled Transcriptome 
* 987244 transcripts

In [4]:
#wget -P ./data/ https://github.com/chess-genome/chess/releases/download/v.3.0.1/assembled.gtf.gz
#gunzip data/assembled.gtf.gz
awk '$3 == "transcript"' data/assembled.gtf | wc -l


987244


## Download IsoSeq Data from Developing Human Brain
### hg19 version:
* 214516 transcripts (total)

In [5]:
#wget -P ./data https://github.com/gandallab/Dev_Brain_IsoSeq/raw/main/data/cp_vz_0.75_min_7_recovery_talon.gtf.gz
#gunzip data/cp_vz_0.75_min_7_recovery_talon.gtf.gz
awk '$3 == "transcript"' data/cp_vz_0.75_min_7_recovery_talon.gtf | wc -l


214516


### CrossMap to hg38
* Install CrossMap and download the hg19-to-hg38 liftOver chain file

In [None]:
# mamba install CrossMap
wget -P ./data/ https://hgdownload.soe.ucsc.edu/goldenPath/hg19/liftOver/hg19ToHg38.over.chain.gz

In [6]:
# Lift the hg19 annotation over to hg38
CrossMap gff data/hg19ToHg38.over.chain.gz data/cp_vz_0.75_min_7_recovery_talon.gtf data/devBrainIsoSeq_hg38.gtf

# Count transcript features after liftover
awk '$3 == "transcript"' data/devBrainIsoSeq_hg38.gtf | wc -l

# Keep every line tagged NOVEL, then count its transcript features
grep NOVEL data/devBrainIsoSeq_hg38.gtf > data/devBrainIsoSeq_hg38_novel.gtf
awk '$3 == "transcript"' data/devBrainIsoSeq_hg38_novel.gtf | wc -l

2024-05-29 12:04:13 [INFO]  Read the chain file "data/hg19ToHg38.over.chain.gz" 
204191
141975


* `devBrainIsoSeq_hg38.gtf` has 204191 `transcript` feature lines after liftover (down from 214516 in the hg19 file)
* `devBrainIsoSeq_hg38_novel.gtf` has 141975 `transcript` feature lines tagged NOVEL
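
These `transcript`-line counts (204191 and 141975) are lower than the numbers of query transfrags that gffcompare reports for the same two files further down (214394 and 150038). One possible explanation, which is an assumption and has not been verified here, is that some transcripts keep their exon lines after liftover but lose their `transcript` line, while gffcompare groups features by `transcript_id`. The sketch below shows how the two ways of counting can be compared; the helper names and the toy records are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for comparing two ways of counting transcripts in a GTF.

# Count lines whose feature type (column 3) is "transcript".
count_transcript_lines() {
    awk -F'\t' '!/^#/ && $3 == "transcript" { n++ } END { print n + 0 }' "$1"
}

# Count distinct transcript_id values across all feature types.
count_transcript_ids() {
    awk -F'\t' '!/^#/ {
        if (match($9, /transcript_id "[^"]+"/)) {
            ids[substr($9, RSTART, RLENGTH)] = 1
        }
    } END { n = 0; for (k in ids) n++; print n + 0 }' "$1"
}

# Toy GTF (made-up): t1 has a transcript line, t2 only has exon lines.
toy_gtf="$(mktemp)"
{
    printf 'chr1\ttoy\ttranscript\t1\t300\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'
    printf 'chr1\ttoy\texon\t1\t100\t.\t+\t.\tgene_id "g1"; transcript_id "t1";\n'
    printf 'chr1\ttoy\texon\t500\t600\t.\t+\t.\tgene_id "g2"; transcript_id "t2";\n'
    printf 'chr1\ttoy\texon\t700\t800\t.\t+\t.\tgene_id "g2"; transcript_id "t2";\n'
} > "$toy_gtf"

echo "transcript lines: $(count_transcript_lines "$toy_gtf")"
echo "distinct IDs:     $(count_transcript_ids "$toy_gtf")"
```

Running both helpers on `data/devBrainIsoSeq_hg38.gtf` would show whether the distinct-ID count agrees with the 214394 transfrags loaded by gffcompare.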


## Compare Transcriptome Annotations with gffcompare

In [7]:
gffcompare --version

gffcompare v0.12.6


### 1. IsoSeq vs CHESS Primary Transcripts

In [9]:
gffcompare data/devBrainIsoSeq_hg38.gtf -r data/chess3.0.1.gtf -o results/IsoSeq_vs_Primary
cat results/IsoSeq_vs_Primary.stats

  168451 reference transcripts loaded.
  291 duplicate reference transcripts discarded.
  214394 query transfrags loaded.
# gffcompare v0.12.6 | Command line was:
#gffcompare data/devBrainIsoSeq_hg38.gtf -r data/chess3.0.1.gtf -o results/IsoSeq_vs_Primary
#

#= Summary for dataset: data/devBrainIsoSeq_hg38.gtf 
#     Query mRNAs :  214394 in   23376 loci  (206255 multi-exon transcripts)
#            (13767 multi-transcript loci, ~9.2 transcripts per locus)
# Reference mRNAs :  168160 in   61304 loci  (141572 multi-exon)
# Super-loci w/ reference transcripts:    19733
#-----------------| Sensitivity | Precision  |
        Base level:    40.9     |    62.8    |
        Exon level:    48.8     |    62.2    |
      Intron level:    53.6     |    79.5    |
Intron chain level:    31.6     |    21.7    |
  Transcript level:    28.4     |    22.3    |
       Locus level:    28.9     |    73.7    |

     Matching intron chains:   44805
       Matching transcripts:   47720

* 47720 / 214394 IsoSeq transcripts (22.3%) have a matching transcript in the CHESS Primary GTF
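
The fraction quoted above can be read directly from the `.stats` file rather than computed by hand. The sketch below is one way to do that, assuming the layout of the `Query mRNAs` and `Matching transcripts` lines shown in the output above; `match_fraction` is a hypothetical helper name, and the toy file simply reuses three lines from that output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report matched / total query transcripts from a
# gffcompare .stats file (line layout assumed from the output shown above).
match_fraction() {
    LC_ALL=C awk '
        /Query mRNAs/          { query = $5 }     # "# Query mRNAs : 214394 in ..."
        /Matching transcripts/ { matched = $NF }  # "Matching transcripts: 47720"
        END {
            if (query > 0)
                printf "%d / %d (%.1f%%)\n", matched, query, 100 * matched / query
        }
    ' "$1"
}

# Toy .stats file containing only the lines the helper needs.
toy_stats="$(mktemp)"
cat > "$toy_stats" <<'EOF'
#     Query mRNAs :  214394 in   23376 loci  (206255 multi-exon transcripts)
     Matching intron chains:   44805
       Matching transcripts:   47720
EOF

match_fraction "$toy_stats"
```

The same helper could be pointed at `results/IsoSeq_vs_Primary.stats` or any of the other `.stats` files in this notebook.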


### 2. IsoSeq vs CHESS Assembled Transcripts

In [10]:
gffcompare -r data/assembled.gtf data/devBrainIsoSeq_hg38.gtf -o results/IsoSeq_vs_Assembly
cat results/IsoSeq_vs_Assembly.stats

  987244 reference transcripts loaded.
  115 duplicate reference transcripts discarded.
  214394 query transfrags loaded.
# gffcompare v0.12.6 | Command line was:
#gffcompare -r data/assembled.gtf data/devBrainIsoSeq_hg38.gtf -o results/IsoSeq_vs_Assembly
#

#= Summary for dataset: data/devBrainIsoSeq_hg38.gtf 
#     Query mRNAs :  214394 in   23376 loci  (206255 multi-exon transcripts)
#            (13767 multi-transcript loci, ~9.2 transcripts per locus)
# Reference mRNAs :  987129 in  168026 loci  (882562 multi-exon)
# Super-loci w/ reference transcripts:    16125
#-----------------| Sensitivity | Precision  |
        Base level:    12.8     |    77.8    |
        Exon level:    14.2     |    62.5    |
      Intron level:    24.1     |    84.4    |
Intron chain level:     6.5     |    27.8    |
  Transcript level:     5.9     |    27.0    |
       Locus level:     8.8     |    69.4    |

     Matching intron chains:   57263
       Matching transcripts:   57925

* 57925 / 214394 IsoSeq transcripts (27.0%) have a matching transcript in the CHESS Assembled GTF (57263 share an identical intron chain)


### 3. IsoSeq vs (Primary + Assembled) Transcripts

In [11]:
gffcompare data/assembled.gtf data/chess3.0.1.gtf -o results/Primary_and_Assembled
gffcompare -r results/Primary_and_Assembled.combined.gtf data/devBrainIsoSeq_hg38.gtf -o results/IsoSeq_vs_Primary_and_Assembled -T
cat results/IsoSeq_vs_Primary_and_Assembled.stats

Loading query file #1: data/assembled.gtf
  987244 query transfrags loaded.
  868 duplicate query transfrags discarded.
Loading query file #2: data/chess3.0.1.gtf
  168451 query transfrags loaded.
  372 duplicate query transfrags discarded.
  1053854 reference transcripts loaded.
  214394 query transfrags loaded.
# gffcompare v0.12.6 | Command line was:
#gffcompare -r results/Primary_and_Assembled.combined.gtf data/devBrainIsoSeq_hg38.gtf -o results/IsoSeq_vs_Primary_and_Assembled -T
#

#= Summary for dataset: data/devBrainIsoSeq_hg38.gtf 
#     Query mRNAs :  214394 in   23376 loci  (206255 multi-exon transcripts)
#            (13767 multi-transcript loci, ~9.2 transcripts per locus)
# Reference mRNAs : 1053854 in  186194 loci  (924838 multi-exon)
# Super-loci w/ reference transcripts:    18217
#-----------------| Sensitivity | Precision  |
        Base level:    12.4     |    79.7    |
        Exon level:    14.3     |    64.7    |
      Intron level:    23.2     |    85.7    |

* 66325 / 214394 IsoSeq transcripts (30.9%) match the combined CHESS Primary + Assembled transcriptomes
* 148069 / 214394 IsoSeq transcripts (69.1%) are NOT found in the CHESS Primary or Assembled GTFs
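
Beyond the summary counts, gffcompare writes a per-transcript `.tmap` table in which each query transcript receives a class code (`=` denotes an exact intron-chain match to a reference transcript). The sketch below tallies those codes. It assumes a tab-separated file with a header row and the class code in the third column; the helper name `tally_class_codes` and the toy rows are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: tally class codes in a gffcompare .tmap table.
# Assumes a header row and the class code in column 3.
tally_class_codes() {
    awk -F'\t' 'NR > 1 { n[$3]++ } END { for (c in n) print c "\t" n[c] }' "$1" \
        | LC_ALL=C sort
}

# Toy .tmap (made-up rows); spaces are converted to tabs when writing the file.
toy_tmap="$(mktemp)"
tr ' ' '\t' > "$toy_tmap" <<'EOF'
ref_gene_id ref_id class_code qry_gene_id qry_id num_exons FPKM TPM cov len major_iso_id ref_match_len
g1 r1 = q1 t1 3 0 0 0 900 t1 900
g1 r2 = q1 t2 4 0 0 0 1200 t1 1200
g2 r3 j q2 t3 5 0 0 0 1500 t3 1400
- - u q3 t4 2 0 0 0 700 t4 -
EOF

tally_class_codes "$toy_tmap"
```

A tally like this separates transcripts that are entirely unmatched from those that partially overlap a reference transcript, which the single "not found" number above does not distinguish.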

### 4. NOVEL_only IsoSeq vs (Primary + Assembled) Transcripts


In [12]:
gffcompare -r results/Primary_and_Assembled.combined.gtf data/devBrainIsoSeq_hg38_novel.gtf -o results/IsoSeqNovel_vs_Primary_and_Assembled
cat results/IsoSeqNovel_vs_Primary_and_Assembled.stats

  1053854 reference transcripts loaded.
  150038 query transfrags loaded.
# gffcompare v0.12.6 | Command line was:
#gffcompare -r results/Primary_and_Assembled.combined.gtf data/devBrainIsoSeq_hg38_novel.gtf -o results/IsoSeqNovel_vs_Primary_and_Assembled
#

#= Summary for dataset: data/devBrainIsoSeq_hg38_novel.gtf 
#     Query mRNAs :  150038 in   13162 loci  (149132 multi-exon transcripts)
#            (10128 multi-transcript loci, ~11.4 transcripts per locus)
# Reference mRNAs : 1053854 in  186194 loci  (924838 multi-exon)
# Super-loci w/ reference transcripts:    11107
#-----------------| Sensitivity | Precision  |
        Base level:     7.9     |    76.6    |
        Exon level:    10.1     |    65.3    |
      Intron level:    17.5     |    83.8    |
Intron chain level:     2.8     |    17.4    |
  Transcript level:     2.5     |    17.4    |
       Locus level:     4.4     |    66.2    |

     Matching intron chains:   25897
       Matching transcripts:   26084

* 26084 / 150038 NOVEL IsoSeq transcripts (17.4%) are found in the combined CHESS Primary + Assembled set

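### 5. NOVEL_only IsoSeq vs (Primary + Assembled) Transcripts with Relaxed Matching
* `-e 1000`: maximum distance allowed from the free ends of terminal exons of reference transcripts when assessing exon accuracy (gffcompare default: 100)
* `-d 1000`: maximum distance for grouping transcript start sites (gffcompare default: 100)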

In [13]:
gffcompare data/devBrainIsoSeq_hg38_novel.gtf -T -e 1000 -d 1000 -r results/Primary_and_Assembled.combined.gtf -o results/IsoSeqNovel_vs_Primary_and_Assembled_relaxed
cat results/IsoSeqNovel_vs_Primary_and_Assembled_relaxed.stats

  1053854 reference transcripts loaded.
  150038 query transfrags loaded.
# gffcompare v0.12.6 | Command line was:
#gffcompare data/devBrainIsoSeq_hg38_novel.gtf -T -e 1000 -d 1000 -r results/Primary_and_Assembled.combined.gtf -o results/IsoSeqNovel_vs_Primary_and_Assembled_relaxed
#

#= Summary for dataset: data/devBrainIsoSeq_hg38_novel.gtf 
#     Query mRNAs :  150038 in   13162 loci  (149132 multi-exon transcripts)
#            (10128 multi-transcript loci, ~11.4 transcripts per locus)
# Reference mRNAs : 1053854 in  186194 loci  (924838 multi-exon)
# Super-loci w/ reference transcripts:    11107
#-----------------| Sensitivity | Precision  |
        Base level:     7.9     |    76.6    |
        Exon level:    12.7     |    69.8    |
      Intron level:    17.5     |    83.8    |
Intron chain level:     2.8     |    17.4    |
  Transcript level:     2.5     |    17.4    |
       Locus level:     4.4     |    66.2    |

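     Matching intron chains:   25897

* Relaxing both tolerances to 1000 bp changes only the exon-level values (sensitivity 10.1 to 12.7, precision 65.3 to 69.8); the intron chain and transcript-level values are the same as in section 4 (2.8 / 17.4 and 2.5 / 17.4)
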
## CONCLUSIONS
#### 124k of 150k (>82%) NOVEL IsoSeq transcripts identified in the developing human brain are not present in the CHESS Primary (168k) or Assembled (987k) transcriptomes
#### 148k of 214k (>69%) of all IsoSeq transcripts identified in the developing human brain are not present in the CHESS Primary (168k) or Assembled (987k) transcriptomes
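
The two headline figures follow from the query totals and matching-transcript counts reported in sections 3 and 4. The arithmetic is sketched below; `not_found` is a hypothetical helper name, and the inputs are the counts quoted in those sections.

```shell
#!/usr/bin/env bash
# Hypothetical helper: number and percentage of query transcripts without a
# reference match. $1 = total query transcripts, $2 = matched transcripts.
not_found() {
    LC_ALL=C awk -v total="$1" -v hit="$2" 'BEGIN {
        miss = total - hit
        printf "%d / %d not matched (%.1f%%)\n", miss, total, 100 * miss / total
    }'
}

not_found 150038 26084   # NOVEL-only IsoSeq vs Primary + Assembled (section 4)
not_found 214394 66325   # all IsoSeq vs Primary + Assembled (section 3)
```

This gives 123954 of 150038 (82.6%) for the NOVEL-only set and 148069 of 214394 (69.1%) for the full IsoSeq set.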

In [None]:
gzip data/*.gtf
gzip results/*.gtf