# check-oncogenes
Reviewers pointed out that some of the genes listed as oncogenes by AmpliconClassifier (AC) are not actually oncogenes. This notebook:
- Checks the AC list against the COSMIC CGC and ONGene databases it is supposed to come from;
- Asks which of the unsupported genes are included in my data.
## Results

I count 81 genes in my data that AC annotates as oncogenes but that are not supported as oncogenes by COSMIC CGC or ONGene.
Looking at this list, I see three main sources for these annotations:
- COSMIC. The majority are annotated as TSG or fusion in COSMIC and appear nowhere else.
- hg38_alias. No idea where these come from. Evidence for these seems weak or mixed:
	- CCN3 (TSG), CCN4 (oncogene), KMT2C (TSG), LHFPL6 (no evidence), PWWP3A (no evidence), WDCP (no evidence)
- COSMIC and another source, either "Frankell2019" or "Paulson2022".
	- Paulson2022 seems to refer to this paper on Barrett's esophagus: https://doi.org/10.1038/s41467-022-29767-7
		- Only one gene falls in this category, CDK12 (TSG), which is amplified in one "unknown" amp in our data.
	- Frankell2019 could be a paper on esophageal adenocarcinoma: https://doi.org/10.1038/s41588-018-0331-5
		- ARID2 (TSG), AXIN1 (TSG, with some evidence that certain splice variants are oncogenic), SMARCA4 (TSG), STK11 (TSG).

## Discussion
None of these genes except CCN4 should be reported as oncogenes, and only CCN4 is recurrently amplified.
Open question: should CCN4 be included as an annotated oncogene or as a new 'putative' oncogene?

In [None]:
import pandas as pd
from pathlib import Path
import warnings

import sys
sys.path.append('../src')
from data_imports import *

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)

In [None]:
def read_ac_table(file='../data/oncogenes/AmpliconClassifier_v1.3.1/combined_oncogene_list.txt'):
    return pd.read_csv(file,sep='\t',index_col=0,names=['gene','source'])
def read_ongene_table(file='../data/oncogenes/ongene_human.txt'):
    return pd.read_csv(file,sep='\t',index_col=1)
def read_cosmic_table(file='../data/oncogenes/Cosmic_CancerGeneCensus_Tsv_v101_GRCh38/Cosmic_CancerGeneCensus_v101_GRCh38.tsv'):
    return pd.read_csv(file,sep='\t',index_col=0)
def get_cosmic_oncogenes(df=None):
    if df is None:
        df = read_cosmic_table()
    # na=False treats genes with a missing ROLE_IN_CANCER as non-oncogenes
    df = df[df.ROLE_IN_CANCER.str.contains('oncogene', na=False)]
    return set(df.index)
def get_unsupported_oncogenes(subset_my_data=True):
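    '''
    Get all genes which are included in the AC oncogene list but are not in ONGene
    and are not annotated as oncogenes in COSMIC CGC.
    If subset_my_data is True, keep only genes flagged as canonical oncogenes in my dataset.
    '''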
    oncogenes = set(read_ac_table().index)
    oncogenes -= set(read_ongene_table().index)
    oncogenes -= set(get_cosmic_oncogenes())
    if subset_my_data:
        my_data = import_genes()
        my_data = set(my_data[my_data.is_canonical_oncogene].gene)
        oncogenes = oncogenes & my_data
    return oncogenes
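
The filtering logic above can be checked on toy data without the real database files. The sketch below is only an illustration: the gene symbols (`GENE_A` to `GENE_E`), the `source` labels, and the in-memory tables are placeholders I made up, not real database contents. It assumes that `ROLE_IN_CANCER` holds free-text roles such as "oncogene, fusion" and may be missing for some genes.

```python
import pandas as pd

# Toy stand-ins for the AC, ONGene and COSMIC tables (placeholder values only).
ac = pd.DataFrame(
    {"source": ["ONGene", "COSMIC", "COSMIC", "hg38_alias"]},
    index=pd.Index(["GENE_A", "GENE_B", "GENE_C", "GENE_D"], name="gene"),
)
ongene_symbols = {"GENE_A"}
cosmic = pd.DataFrame(
    {"ROLE_IN_CANCER": ["oncogene, fusion", "TSG", None]},
    index=pd.Index(["GENE_B", "GENE_C", "GENE_E"], name="GENE_SYMBOL"),
)

# Keep only COSMIC rows whose role mentions "oncogene"; na=False drops missing roles.
is_oncogene = cosmic["ROLE_IN_CANCER"].str.contains("oncogene", na=False)
cosmic_oncogenes = set(cosmic.index[is_oncogene])

# AC genes supported by neither ONGene nor a COSMIC oncogene annotation.
unsupported = set(ac.index) - ongene_symbols - cosmic_oncogenes
print(sorted(unsupported))
```

Here `GENE_C` stays in the result even though it appears in COSMIC, because its role is TSG rather than oncogene. That is the situation most of the flagged genes are in.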

In [None]:
misses = get_unsupported_oncogenes(subset_my_data = False)
print(len(misses))

In [None]:
ac_table = read_ac_table()

In [None]:
ac_table[ac_table.index.isin(misses)]

In [None]:
cosmic_tbl = read_cosmic_table()
cosmic_tbl[cosmic_tbl.index.isin(misses)]
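
Rather than reading the two tables by eye, the unsupported genes could be tallied by AC source and COSMIC role. This is a minimal sketch on toy data, assuming the AC `source` column holds one label per gene (the real file may combine several sources in one field). The gene symbols and the "not in COSMIC" label are placeholders introduced here.

```python
import pandas as pd

# Toy tables; values are placeholders, not real annotations.
ac = pd.DataFrame(
    {"source": ["COSMIC", "COSMIC", "hg38_alias"]},
    index=pd.Index(["GENE_C", "GENE_F", "GENE_D"], name="gene"),
)
cosmic = pd.DataFrame(
    {"ROLE_IN_CANCER": ["TSG", "fusion"]},
    index=pd.Index(["GENE_C", "GENE_F"], name="GENE_SYMBOL"),
)
misses = {"GENE_C", "GENE_D", "GENE_F"}

# Left-join COSMIC roles onto the unsupported AC genes (matching on the index).
summary = ac.loc[ac.index.isin(misses)].join(cosmic["ROLE_IN_CANCER"])
# Genes absent from COSMIC get an explicit label instead of NaN.
summary["ROLE_IN_CANCER"] = summary["ROLE_IN_CANCER"].fillna("not in COSMIC")

# Count genes for each (AC source, COSMIC role) pair.
counts = pd.crosstab(summary["source"], summary["ROLE_IN_CANCER"])
print(counts)
```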

In [None]:
# Write a blacklist, one gene per line, sorted so the file is identical between runs
misses = get_unsupported_oncogenes(subset_my_data=True)
blacklist_file = '../data/oncogenes/oncogene_blacklist.txt'
with open(blacklist_file, "w") as f:
    f.write("\n".join(sorted(misses)))
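
A quick round-trip check can confirm that a blacklist file reloads to the same set of genes. The sketch below writes to a temporary directory instead of the real path and uses placeholder gene symbols. Sorting before writing and ending the file with a newline are choices made here for illustration.

```python
import tempfile
from pathlib import Path

misses = {"GENE_D", "GENE_C"}  # placeholder symbols

with tempfile.TemporaryDirectory() as tmp:
    blacklist_file = Path(tmp) / "oncogene_blacklist.txt"
    # Sort before writing, since set iteration order is arbitrary.
    blacklist_file.write_text("\n".join(sorted(misses)) + "\n")
    # Read back one symbol per line, ignoring blank lines.
    reloaded = {
        line.strip()
        for line in blacklist_file.read_text().splitlines()
        if line.strip()
    }

print(reloaded == misses)
```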