### **Fusion Transcript Data Wrangling**

#### Concatenating and Filtering Raw Fusion Transcript Output from Arriba and FusionCatcher

This notebook documents the semi-automated steps used to further process the raw output files from the Arriba and FusionCatcher fusion transcript callers.

1. Run the `wrangle-ft-tsv.py` script to generate a fusion transcript list from the Arriba or FusionCatcher output files. The script takes, as its mandatory first argument, the path to the directory that holds the sample-specific fusion call output files from Arriba or FusionCatcher. The second argument is the string that identifies the tool (for instance, `arr`, the prefix of Arriba fusion transcript call output files).

For example:
> ``` wrangle-ft-tsv.py data/FTmyBRCAs_raw/Arriba arr ```
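
As a rough illustration of how a prefix argument such as `arr` could select one tool's files, the sketch below lists the files in a directory whose names start with a given string. The helper name `find_tool_outputs` is hypothetical and is not taken from `wrangle-ft-tsv.py`, whose internals are not shown in this notebook.

```python
from pathlib import Path


def find_tool_outputs(input_dir, tool_prefix):
    """Return the files in input_dir whose names start with tool_prefix, sorted by path.

    Hypothetical helper for illustration only; the real script may select
    its input files differently.
    """
    return sorted(
        path
        for path in Path(input_dir).iterdir()
        if path.is_file() and path.name.startswith(tool_prefix)
    )
```

Called as `find_tool_outputs("data/FTmyBRCAs_raw/Arriba", "arr")`, this would return the paths of the files in that directory whose names begin with `arr`.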

In [1]:
import polars as pl

In [9]:
# load up Arriba and FusionCatcher merged dataframes lazily
arriba_mdf = pl.scan_parquet('data/Arriba-fusiontranscript-raw-list.parquet')
fc_mdf = pl.scan_parquet('data/FusionCatcher-fusiontranscript-raw-list.parquet')

In [7]:
arriba_mdf.collect()

fusionTranscriptID,fusionGeneID,breakpointPair,strand1,strand2,site1,site2,type,confidence,sampleID,toolID
cat,cat,cat,cat,cat,cat,cat,cat,cat,i64,cat
"""TRMT11::SMG6__6:125986622-17:2…","""TRMT11::SMG6""","""6:125986622-17:2244719""","""+""","""-""","""CDS/splice-site""","""CDS/splice-site""","""translocation""","""high""",1,"""Arriba"""
"""STAG3::MEF2C-AS1__7:100189570-…","""STAG3::MEF2C-AS1""","""7:100189570-5:88919251""","""+""","""-""","""CDS""","""intron""","""translocation/5'-5'""","""low""",1,"""Arriba"""
"""MAPK13::C1QL1__6:36132629-17:4…","""MAPK13::C1QL1""","""6:36132629-17:44965446""","""+""","""+""","""CDS""","""intron""","""translocation/5'-5'""","""low""",1,"""Arriba"""
"""STX16::NPEPL1__20:58673711-20:…","""STX16::NPEPL1""","""20:58673711-20:58691724""","""+""","""+""","""CDS/splice-site""","""5'UTR/splice-site""","""deletion/read-through""","""low""",1,"""Arriba"""
"""MAPK13::NMT1__6:36132629-17:44…","""MAPK13::NMT1""","""6:36132629-17:44965446""","""+""","""+""","""CDS""","""intron""","""translocation""","""low""",1,"""Arriba"""
…,…,…,…,…,…,…,…,…,…,…
"""DENND5B::AC087311.1(22711),SYT…","""DENND5B::AC087311.1(22711),SYT…","""12:31479608-12:33016465""","""-""","""+""","""CDS""","""intergenic""","""inversion""","""low""",992,"""Arriba"""
"""LINC01145::AC245100.2__1:14520…","""LINC01145::AC245100.2""","""1:145201150-1:148436753""","""-""","""-""","""exon""","""exon""","""duplication/5'-5'""","""low""",992,"""Arriba"""
"""NET1::RNF169__10:5412820-11:74…","""NET1::RNF169""","""10:5412820-11:74834676""","""+""","""+""","""CDS/splice-site""","""CDS/splice-site""","""translocation""","""low""",992,"""Arriba"""
"""MAN2C1::SIN3A__15:75366522-15:…","""MAN2C1::SIN3A""","""15:75366522-15:75375872""","""-""","""-""","""CDS/splice-site""","""CDS/splice-site""","""duplication""","""low""",992,"""Arriba"""


In [11]:
print(fc_mdf.collect())

shape: (31_364, 11)
┌─────────────┬─────────────┬────────────┬─────────┬───┬──────┬────────────┬──────────┬────────────┐
│ fusionTrans ┆ fusionGeneI ┆ breakpoint ┆ strand1 ┆ … ┆ type ┆ confidence ┆ sampleID ┆ toolID     │
│ criptID     ┆ D           ┆ Pair       ┆ ---     ┆   ┆ ---  ┆ ---        ┆ ---      ┆ ---        │
│ ---         ┆ ---         ┆ ---        ┆ cat     ┆   ┆ cat  ┆ cat        ┆ i64      ┆ cat        │
│ cat         ┆ cat         ┆ cat        ┆         ┆   ┆      ┆            ┆          ┆            │
╞═════════════╪═════════════╪════════════╪═════════╪═══╪══════╪════════════╪══════════╪════════════╡
│ SIDT2::TAGL ┆ SIDT2::TAGL ┆ 11:1171959 ┆ +       ┆ … ┆ .    ┆ .          ┆ 2        ┆ FusionCatc │
│ N__11:11719 ┆ N           ┆ 15-11:1172 ┆         ┆   ┆      ┆            ┆          ┆ her        │
│ 5915-11:…   ┆             ┆ 03002      ┆         ┆   ┆      ┆            ┆          ┆            │
│ AZGP1::GJC3 ┆ AZGP1::GJC3 ┆ 7:99971746 ┆ -       ┆ … ┆ .    ┆ .      

Now we can merge the two dataframes into one masterFrame with Polars' `concat`. Vertical concatenation is the default: the rows of the second dataframe are appended below those of the first, which requires both dataframes to share exactly the same columns.

In [12]:
joined_df = pl.concat(
    [
        arriba_mdf.collect(),
        fc_mdf.collect()
    ]
)

joined_df



fusionTranscriptID,fusionGeneID,breakpointPair,strand1,strand2,site1,site2,type,confidence,sampleID,toolID
cat,cat,cat,cat,cat,cat,cat,cat,cat,i64,cat
"""TRMT11::SMG6__6:125986622-17:2…","""TRMT11::SMG6""","""6:125986622-17:2244719""","""+""","""-""","""CDS/splice-site""","""CDS/splice-site""","""translocation""","""high""",1,"""Arriba"""
"""STAG3::MEF2C-AS1__7:100189570-…","""STAG3::MEF2C-AS1""","""7:100189570-5:88919251""","""+""","""-""","""CDS""","""intron""","""translocation/5'-5'""","""low""",1,"""Arriba"""
"""MAPK13::C1QL1__6:36132629-17:4…","""MAPK13::C1QL1""","""6:36132629-17:44965446""","""+""","""+""","""CDS""","""intron""","""translocation/5'-5'""","""low""",1,"""Arriba"""
"""STX16::NPEPL1__20:58673711-20:…","""STX16::NPEPL1""","""20:58673711-20:58691724""","""+""","""+""","""CDS/splice-site""","""5'UTR/splice-site""","""deletion/read-through""","""low""",1,"""Arriba"""
"""MAPK13::NMT1__6:36132629-17:44…","""MAPK13::NMT1""","""6:36132629-17:44965446""","""+""","""+""","""CDS""","""intron""","""translocation""","""low""",1,"""Arriba"""
…,…,…,…,…,…,…,…,…,…,…
"""CTBS::GNG5__1:84563257-1:84501…","""CTBS::GNG5""","""1:84563257-1:84501970""","""-""","""-""","""in-frame""",""".""",""".""",""".""",991,"""FusionCatcher"""
"""MRPS30-DT::LINC02224__5:448086…","""MRPS30-DT::LINC02224""","""5:44808642-5:44658557""","""-""","""-""","""exonic(no-known-CDS)""","""exonic(no-known-CDS)""",""".""",""".""",991,"""FusionCatcher"""
"""NBEA::CR382287.1__13:35070852-…","""NBEA::CR382287.1""","""13:35070852-21:10127330""","""+""","""+""","""CDS(truncated)""","""exonic(no-known-CDS)""",""".""",""".""",991,"""FusionCatcher"""
"""HACL1::COLQ__3:15563358-3:1548…","""HACL1::COLQ""","""3:15563358-3:15489637""","""-""","""-""","""out-of-frame""",""".""",""".""",""".""",991,"""FusionCatcher"""


Now sort the rows by `sampleID` in ascending order. Note that `sort` returns a new dataframe and leaves `joined_df` itself unchanged.

In [13]:
joined_df.sort("sampleID")

fusionTranscriptID,fusionGeneID,breakpointPair,strand1,strand2,site1,site2,type,confidence,sampleID,toolID
cat,cat,cat,cat,cat,cat,cat,cat,cat,i64,cat
"""TRMT11::SMG6__6:125986622-17:2…","""TRMT11::SMG6""","""6:125986622-17:2244719""","""+""","""-""","""CDS/splice-site""","""CDS/splice-site""","""translocation""","""high""",1,"""Arriba"""
"""STAG3::MEF2C-AS1__7:100189570-…","""STAG3::MEF2C-AS1""","""7:100189570-5:88919251""","""+""","""-""","""CDS""","""intron""","""translocation/5'-5'""","""low""",1,"""Arriba"""
"""MAPK13::C1QL1__6:36132629-17:4…","""MAPK13::C1QL1""","""6:36132629-17:44965446""","""+""","""+""","""CDS""","""intron""","""translocation/5'-5'""","""low""",1,"""Arriba"""
"""STX16::NPEPL1__20:58673711-20:…","""STX16::NPEPL1""","""20:58673711-20:58691724""","""+""","""+""","""CDS/splice-site""","""5'UTR/splice-site""","""deletion/read-through""","""low""",1,"""Arriba"""
"""MAPK13::NMT1__6:36132629-17:44…","""MAPK13::NMT1""","""6:36132629-17:44965446""","""+""","""+""","""CDS""","""intron""","""translocation""","""low""",1,"""Arriba"""
…,…,…,…,…,…,…,…,…,…,…
"""LINC01145::AC245100.2__1:14520…","""LINC01145::AC245100.2""","""1:145201150-1:148436753""","""-""","""-""","""exon""","""exon""","""duplication/5'-5'""","""low""",992,"""Arriba"""
"""NET1::RNF169__10:5412820-11:74…","""NET1::RNF169""","""10:5412820-11:74834676""","""+""","""+""","""CDS/splice-site""","""CDS/splice-site""","""translocation""","""low""",992,"""Arriba"""
"""MAN2C1::SIN3A__15:75366522-15:…","""MAN2C1::SIN3A""","""15:75366522-15:75375872""","""-""","""-""","""CDS/splice-site""","""CDS/splice-site""","""duplication""","""low""",992,"""Arriba"""
"""LINC02224::MRPS30-DT__5:446584…","""LINC02224::MRPS30-DT""","""5:44658462-5:44777328""","""-""","""-""","""exon/splice-site""","""exon/splice-site""","""duplication""","""low""",992,"""Arriba"""


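Count how many rows carry each `fusionTranscriptID` across both callers and all samples, most frequent first.

In [16]: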
all_ft_counts = joined_df.select(pl.col("fusionTranscriptID").value_counts(sort=True))
all_ft_counts.unnest("fusionTranscriptID")

fusionTranscriptID,count
cat,u32
"""CTBS::GNG5__1:84563257-1:84501…",661
"""AZGP1::GJC3__7:99971746-7:9992…",608
"""NPEPPS::TBC1D3__17:47592545-17…",608
"""TMED7::TICAM2__5:115616318-5:1…",428
"""SIDT2::TAGLN__11:117195915-11:…",412
…,…
"""AL021546.1::DYNLL1__12:1204571…",1
"""CNOT1::C16ORF78__16:58629728-1…",1
"""RPS19::AC067930.9__19:41872769…",1
"""GTPBP3::AC097717.1__19:1733959…",1


Repeat the count at the gene level using `fusionGeneID`, which ignores the breakpoint pair.

In [18]:
genelevel_ft_counts = joined_df.select(pl.col("fusionGeneID").value_counts(sort=True))
genelevel_ft_counts.unnest("fusionGeneID")

fusionGeneID,count
cat,u32
"""TVP23C::CDRT4""",1448
"""RBM14::RBM4""",835
"""AZGP1::GJC3""",830
"""SMG1::NPIPB5""",765
"""CTBS::GNG5""",663
…,…
"""AL021546.1::DYNLL1""",1
"""CNOT1::C16ORF78""",1
"""RPS19::AC067930.9""",1
"""GTPBP3::AC097717.1""",1
