# EssenCDNA Analysis Notebook (DepMap 24Q2)

Adrian Layer (alayer@ucsd.edu); Anthony Vasquez (pavasquez@ucsd.edu); Yasmin A. Jaber (yjaber@ucsd.edu); Omar Halawa (ohalawa@ucsd.edu)

##### This notebook uses the repository code to <u>analyze the correlation between CRISPR gene essentiality (dependency) and the presence (or absence) of ecDNA</u> in DepMap 24Q2.

##### By changing the file paths under "Input Data" below, you can run the same analysis on your own data.

# Imports:

In [1]:
# Importing functions from utils.py supporting file
from utils import *

import matplotlib
import matplotlib.pyplot as plt
print(f"matplotlib: {matplotlib.__version__}")

import pandas as pd
print(f"pandas: {pd.__version__}")

import seaborn as sns
print(f"seaborn: {sns.__version__}")

import scipy
from scipy import stats
print(f"scipy: {scipy.__version__}")

matplotlib: 3.8.4
pandas: 2.2.2
seaborn: 0.13.2
scipy: 1.13.1


# Input Data:

### Change the following to match your respective data inputs:

In [2]:
# DepMap standard-format CRISPR essentiality data file
DEPMAP_CRISPR_ESEN = "../data/DepMap 24Q2/CRISPRGeneEffect.csv"

# DepMap metadata file
DEPMAP_METDATA = "../data/DepMap 24Q2/Model.csv"

# Amplicon Architect standard-format aggregated results file
AA_RESULTS = "../data/aggregated_results.csv"

# Literature-derived dataset of the genes most differentially expressed (up- or downregulated) with ecDNA
# See the cell below for the source
LITERATURE_ECDNA_TARGET_GENES = "../data/ecDNA Target genes.csv"

Lin Miin S., Jo Se-Young, Luebeck Jens, Chang Howard Y., Wu Sihan, Mischel Paul S., Bafna Vineet (2023) <b>Transcriptional immune suppression and upregulation of double stranded DNA damage and repair repertoires in ecDNA-containing tumors</b> <i>eLife</i> 12:RP88895

https://doi.org/10.7554/eLife.88895.2

# Processing Data:

### DepMap CRISPR essentiality data (cell-lines along rows, genes along columns):

##### For why this analysis uses CRISPRGeneEffect.csv rather than CRISPRGeneDependency, see [this DepMap forum thread](https://forum.depmap.org/t/crisprgeneeffect-vs-crisprgenedependency/2333).

In [3]:
esen_df = read_esen(DEPMAP_CRISPR_ESEN)
esen_df

Unnamed: 0,A1BG,A1CF,A2M,A2ML1,A3GALT2,A4GALT,A4GNT,AAAS,AACS,AADAC,...,ZWILCH,ZWINT,ZXDA,ZXDB,ZXDC,ZYG11A,ZYG11B,ZYX,ZZEF1,ZZZ3
ACH-000001,-0.134132,0.029103,0.016454,-0.137540,-0.047273,0.181367,-0.082437,-0.059023,0.194592,0.035473,...,-0.123528,0.085140,0.181954,0.239474,0.172965,-0.230327,0.055657,0.044296,0.107361,-0.410449
ACH-000004,-0.001436,-0.080068,-0.125263,-0.027607,-0.053838,-0.151272,0.240094,-0.038922,0.186438,0.160221,...,-0.186899,-0.359257,0.202271,0.057740,0.089295,0.086703,-0.304930,0.086858,0.254538,-0.087671
ACH-000005,-0.144940,0.026541,0.160605,0.088015,-0.202605,-0.243420,0.133726,-0.034895,-0.126105,0.036030,...,-0.309668,-0.344502,-0.056160,-0.092447,-0.015550,-0.170380,-0.080934,-0.059685,0.030254,-0.145055
ACH-000007,-0.053334,-0.120420,0.047978,0.086984,-0.018987,-0.017309,-0.000041,-0.158419,-0.169559,0.201305,...,-0.323038,-0.387265,-0.013816,0.183228,0.038424,-0.051728,-0.383499,-0.012801,-0.294771,-0.431575
ACH-000009,-0.027684,-0.144202,0.052846,0.073833,0.038823,-0.108149,0.010811,-0.088600,0.032194,0.114270,...,-0.253057,-0.159965,-0.025342,0.191500,-0.071632,-0.077843,-0.525599,0.093219,-0.029515,-0.255204
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ACH-002922,-0.057426,0.002047,-0.040007,0.231209,-0.222651,-0.071413,0.047366,-0.466785,0.004091,0.162563,...,-0.181586,-0.647909,-0.016507,0.127280,-0.224160,-0.070399,-0.264229,-0.083760,-0.071101,-0.310070
ACH-002925,-0.158397,-0.036294,0.094842,-0.043724,-0.111456,0.083429,0.116124,-0.140481,-0.027413,0.092128,...,-0.272191,-0.776332,-0.036467,0.109655,-0.215080,-0.004304,-0.247449,-0.222342,-0.079772,-0.196377
ACH-002926,-0.104912,-0.098047,0.037632,0.093602,-0.154288,-0.081111,-0.071124,-0.186031,-0.046870,0.147896,...,-0.428666,-0.621266,0.014040,0.300715,-0.149142,-0.020698,-0.245670,0.065053,-0.159881,-0.388352
ACH-002928,-0.210751,-0.083444,-0.135336,-0.056582,-0.237325,0.000792,0.096235,0.061767,-0.212580,0.120619,...,-0.147648,-0.349586,0.203556,0.038069,0.181405,0.048288,0.165457,-0.227231,-0.139241,-0.205339
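
The raw DepMap file labels each gene column as `SYMBOL (EntrezID)`, while the table above shows bare symbols, so `read_esen` presumably normalizes the headers. The following is a minimal sketch of that step under this assumption; `strip_entrez_suffix` is a hypothetical helper and not the function in `utils.py`.

```python
import io
import pandas as pd

def strip_entrez_suffix(columns):
    """Turn 'A1BG (1)' into 'A1BG'; names without a suffix are left unchanged."""
    return columns.str.replace(r"\s*\(\d+\)$", "", regex=True)

# Tiny stand-in for CRISPRGeneEffect.csv, using two values per row from the table above
raw_csv = io.StringIO(
    ",A1BG (1),A1CF (29974)\n"
    "ACH-000001,-0.134132,0.029103\n"
    "ACH-000004,-0.001436,-0.080068\n"
)
demo_df = pd.read_csv(raw_csv, index_col=0)  # cell lines along rows
demo_df.columns = strip_entrez_suffix(demo_df.columns)
print(list(demo_df.columns))
```

Keeping only the symbol lets gene lists from other sources index the matrix directly by column name.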


### DepMap Metadata:

In [4]:
meta_df = read_metadata(DEPMAP_METDATA)
meta_df

ModelID,PatientID,CellLineName,StrippedCellLineName,DepmapModelType,OncotreeLineage,OncotreePrimaryDisease,OncotreeSubtype,OncotreeCode,LegacyMolecularSubtype,LegacySubSubtype,...,EngineeredModel,TissueOrigin,ModelDerivationMaterial,PublicComments,CCLEName,HCMIID,WTSIMasterCellID,SangerModelID,COSMICID,DateSharedIndbGaP
ACH-000001,PT-gj46wT,NIH:OVCAR-3,NIHOVCAR3,HGSOC,Ovary/Fallopian Tube,Ovarian Epithelial Tumor,High-Grade Serous Ovarian Cancer,HGSOC,,high_grade_serous,...,,,,,NIHOVCAR3_OVARY,,2201.0,SIDM00105,905933.0,
ACH-000002,PT-5qa3uk,HL-60,HL60,AML,Myeloid,Acute Myeloid Leukemia,Acute Myeloid Leukemia,AML,,M3,...,,,,,HL60_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE,,55.0,SIDM00829,905938.0,
ACH-000003,PT-puKIyc,CACO2,CACO2,COAD,Bowel,Colorectal Adenocarcinoma,Colon Adenocarcinoma,COAD,,,...,,,,,CACO2_LARGE_INTESTINE,,,SIDM00891,,
ACH-000004,PT-q4K2cp,HEL,HEL,AML,Myeloid,Acute Myeloid Leukemia,Acute Myeloid Leukemia,AML,,M6,...,,,,,HEL_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE,,783.0,SIDM00594,907053.0,
ACH-000005,PT-q4K2cp,HEL 92.1.7,HEL9217,AML,Myeloid,Acute Myeloid Leukemia,Acute Myeloid Leukemia,AML,,M6,...,,,,,HEL9217_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE,,,SIDM00593,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ACH-003161,PT-or1hkT,ABM-T9430,ABMT9430,ZIMMPSC,Pancreas,Non-Cancerous,Immortalized Pancreatic Stromal Cells,,,,...,,,,,,,,,,
ACH-003181,PT-W75e4m,NRH-LMS1,NRHLMS1,LMS,Soft Tissue,Leiomyosarcoma,Leiomyosarcoma,LMS,,,...,,,,,NRH-LMS1,,,,,
ACH-003183,PT-BqidXH,NRH-MFS3,NRHMFS3,MFS,Soft Tissue,Myxofibrosarcoma,Myxofibrosarcoma,MFS,,,...,,,,,NRH-MFS3,,,,,
ACH-003184,PT-21NMVa,NRH-LMS2,NRHLMS2,LMS,Soft Tissue,Leiomyosarcoma,Leiomyosarcoma,LMS,,,...,,,,,NRH-LMS2,,,,,


### Amplicon Architect Results:

In [5]:
aa_df = read_aa(AA_RESULTS, "Classification")
aa_df

Classification,Unnamed: 0,Sample name,AA amplicon number,Feature ID,Location,Oncogenes,Complexity score,Captured interval length,Feature median copy number,Feature maximum copy number,...,Tissue of origin,Sample type,Feature BED file,CNV BED file,AA PNG file,AA PDF file,AA summary file,Run metadata JSON,Sample metadata JSON,All genes
ecDNA,0,5637_URINARY_TRACT,1.0,5637_URINARY_TRACT_amplicon1_ecDNA_1,['chr6:20350615-22839372'],"['E2F3', 'SOX4']",1.223167,2488757.0,29.757287,67.419329,...,urinary tract,cell line,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/CCL...,/Users/jluebeck/Desktop/research/CCLE/hg38/CCL...,/Users/jluebeck/Desktop/research/CCLE/hg38/CCL...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,"['CASC15', 'CDKAL1', 'E2F3', 'HDGFL1', 'NBAT1'..."
ecDNA,1,5637_URINARY_TRACT,1.0,5637_URINARY_TRACT_amplicon1_ecDNA_2,"['chr3:9099507-9519221', 'chr3:9521655-1018449...","['FANCD2', 'PPARG', 'RAF1', 'SRGAP3', 'VHL']",0.709581,4901014.0,13.175019,17.283490,...,urinary tract,cell line,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/CCL...,/Users/jluebeck/Desktop/research/CCLE/hg38/CCL...,/Users/jluebeck/Desktop/research/CCLE/hg38/CCL...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,"['ARPC4', 'ARPC4-TTLL3', 'ATG7', 'ATP2B2', 'AT..."
Complex non-cyclic,2,59M_OVARY,1.0,59M_OVARY_amplicon1_Complex non-cyclic_1,"['chr5:51400000-52863895', 'chr9:104220123-104...",[],0.935892,3998398.0,5.026093,5.787943,...,ovary,cell line,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/CCL...,/Users/jluebeck/Desktop/research/CCLE/hg38/CCL...,/Users/jluebeck/Desktop/research/CCLE/hg38/CCL...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,"['ABCA1', 'DNAJC25', 'DNAJC25-GNG10', 'ECPAS',..."
ecDNA,3,59M_OVARY,2.0,59M_OVARY_amplicon2_ecDNA_2,"['chr8:124484622-125582885', 'chr8:125588889-1...",['TRIB1'],0.366705,1463324.0,10.047201,10.047201,...,ovary,cell line,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/CCL...,/Users/jluebeck/Desktop/research/CCLE/hg38/CCL...,/Users/jluebeck/Desktop/research/CCLE/hg38/CCL...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,"['MTSS1', 'NDUFB9', 'NSMCE2', 'SQLE', 'TATDN1'..."
ecDNA,4,59M_OVARY,2.0,59M_OVARY_amplicon2_ecDNA_1,"['chr8:52712046-56014576', 'chr8:65533018-6557...",['TCEA1'],1.074874,3343884.0,15.129375,24.854069,...,ovary,cell line,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/CCL...,/Users/jluebeck/Desktop/research/CCLE/hg38/CCL...,/Users/jluebeck/Desktop/research/CCLE/hg38/CCL...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,"['ATP6V1H', 'LYN', 'LYPLA1', 'MRPL15', 'NPBWR1..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
,1235,A2780_OVARY,,A2780_OVARY_NA,[],[],,,,,...,ovary,cell line,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,,,/Users/jluebeck/Desktop/research/CCLE/hg38/CCL...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,[]
,1236,ISHIKAWAHERAKLIO02ER_ENDOMETRIUM,,ISHIKAWAHERAKLIO02ER_ENDOMETRIUM_NA,[],[],,,,,...,endometrium,cell line,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,,,/Users/jluebeck/Desktop/research/CCLE/hg38/CCL...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,[]
,1237,HPAC_PANCREAS,,HPAC_PANCREAS_NA,[],[],,,,,...,pancreas,cell line,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,,,/Users/jluebeck/Desktop/research/CCLE/hg38/CCL...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,[]
,1238,SW948_LARGE_INTESTINE,,SW948_LARGE_INTESTINE_NA,[],[],,,,,...,large intestine,cell line,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,,,/Users/jluebeck/Desktop/research/CCLE/hg38/CCL...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,[]


# Generating Gene Lists:

##### Using our two datasets (<u>one from literature</u>, and the other being <u>from Amplicon Architect data</u>), we can generate two independent gene lists to utilize for our analyses.

### Literature Gene List:

In [6]:
# read_literature_gene_list is tailored to this particular dataset and may not generalize to other files.
# Passing in the filename, column of gene names, and column of ecDNA regulation direction:
literature_gene_dict = read_literature_gene_list(LITERATURE_ECDNA_TARGET_GENES, "#Gene|GeneId", "direction(for_geneset_enrichment)")
literature_gene_list = list(literature_gene_dict.keys())
literature_gene_dict

{'RAE1': 'UP',
 'FBXO17': 'UP',
 'YTHDF1': 'UP',
 'DDX27': 'UP',
 'SEC61G': 'UP',
 'CHMP7': 'DOWN',
 'XPO7': 'DOWN',
 'INTS9': 'DOWN',
 'GJC1': 'UP',
 'METTL1': 'UP',
 'TACR1': 'DOWN',
 'FAM72B': 'UP',
 'NUP107': 'UP',
 'KIAA1967': 'DOWN',
 'C6orf145': 'UP',
 'EGFR': 'UP',
 'C20orf20': 'UP',
 'TPX2': 'UP',
 'KIF2C': 'UP',
 'HOXA4': 'UP',
 'CDK4': 'UP',
 'PCM1': 'DOWN',
 'AURKA': 'UP',
 'HOXA2': 'UP',
 'C8orf41': 'DOWN',
 'TSFM': 'UP',
 'FAM72D': 'UP',
 'HOXA5': 'UP',
 'PSMA7': 'UP',
 'MCPH1': 'DOWN',
 'CCDC25': 'DOWN',
 'INTS10': 'DOWN',
 'HOXA7': 'UP',
 'SPAG5': 'UP',
 'CNOT7': 'DOWN',
 'BUB1': 'UP',
 'HOXA3': 'UP',
 'TRIM35': 'DOWN',
 'PPP2R2A': 'DOWN',
 'WRN': nan,
 'HOXA6': 'UP',
 'VPS37A': 'DOWN',
 'TRIP13': 'UP',
 'ZNF395': 'DOWN',
 'FUT10': 'DOWN',
 'KIF4A': 'UP',
 'FDFT1': 'DOWN',
 'NCAPH': 'UP',
 'GSG2': 'UP',
 'CDCA5': 'UP',
 'MAK16': nan,
 'ERI1': nan,
 'RAD54L': 'UP',
 'FBXO25': 'DOWN',
 'PINX1': nan,
 'RAD51': 'UP',
 'MCM10': 'UP',
 'PLK1': 'UP',
 'CDCA8': 'UP',
 'ORC1L': 

### Amplicon Architect Gene List:

In [7]:
aa_gene_dict = ecdna_freq_genes(aa_df, "Sample name", "All genes", "Oncogenes")
aa_oncogene_dict = ecdna_freq_genes(aa_df, "Sample name", "All genes", "Oncogenes", True)
aa_gene_dict

{'CASC11': [26,
  {'AU565_BREAST',
   'CORL23_LUNG',
   'CORL279_LUNG',
   'CORL311_LUNG',
   'DMS273_LUNG',
   'EFM192A_BREAST',
   'FUOV1_OVARY',
   'GCIY_STOMACH',
   'HCC1419_BREAST',
   'HCC44_LUNG',
   'HGC27_STOMACH',
   'KYSE450_OESOPHAGUS',
   'MSTO211H_PLEURA',
   'NCIH1792_LUNG',
   'NCIH2122_LUNG',
   'NCIH2170_LUNG',
   'NCIH2171_LUNG',
   'NCIH23_LUNG',
   'NCIH446_LUNG',
   'NCIH460_LUNG',
   'NCIH510_LUNG',
   'NCIH524_LUNG',
   'NCIH82_LUNG',
   'SCLC21H_LUNG',
   'SNU16_STOMACH',
   'SW480_LARGE_INTESTINE'}],
 'MYC': [26,
  {'AU565_BREAST',
   'CORL23_LUNG',
   'CORL279_LUNG',
   'CORL311_LUNG',
   'DMS273_LUNG',
   'EFM192A_BREAST',
   'FUOV1_OVARY',
   'GCIY_STOMACH',
   'HCC1419_BREAST',
   'HCC44_LUNG',
   'HGC27_STOMACH',
   'KYSE450_OESOPHAGUS',
   'MSTO211H_PLEURA',
   'NCIH1792_LUNG',
   'NCIH2122_LUNG',
   'NCIH2170_LUNG',
   'NCIH2171_LUNG',
   'NCIH23_LUNG',
   'NCIH446_LUNG',
   'NCIH460_LUNG',
   'NCIH510_LUNG',
   'NCIH524_LUNG',
   'NCIH82_LUNG',
   'SC

##### Top 400 Most Frequent Genes in ecDNA (genes with >= 3 ecDNA amplicons): 

In [8]:
aa_gene_list = list(aa_gene_dict.keys())[0:400]
aa_gene_list

['CASC11',
 'MYC',
 'PVT1',
 'TMEM75',
 'CASC21',
 'CASC8',
 'CCAT2',
 'POU5F1B',
 'CASC19',
 'CCAT1',
 'PCAT2',
 'PRNCR1',
 'ERBB2',
 'PGAP3',
 'GRB7',
 'IKZF3',
 'MIEN1',
 'MYCL',
 'TRIT1',
 'PCAT1',
 'PNMT',
 'TCAP',
 'BMP8B',
 'HPCAL4',
 'NT5C1A',
 'OXCT2',
 'PPIE',
 'NEUROD2',
 'PPP1R1B',
 'STARD3',
 'MYEOV',
 'SMIM38',
 'HEYL',
 'MFSD2A',
 'ANO1',
 'CCND1',
 'FGF19',
 'FGF3',
 'FGF4',
 'LTO1',
 'TPCN2',
 'CAP1',
 'PABPC4',
 'PABPC4-AS1',
 'PPIEL',
 'RLF',
 'SNORA55',
 'ZPBP2',
 'CTTN',
 'FADD',
 'MRGPRD',
 'MRGPRF',
 'MRGPRF-AS1',
 'PPFIA1',
 'COL9A2',
 'OXCT2P1',
 'PPT1',
 'CDK12',
 'FBXL20',
 'MED1',
 'ADGRA2',
 'ADRB3',
 'ASH2L',
 'BRF2',
 'EIF4EBP1',
 'GOT1L1',
 'PLPBP',
 'RAB11FIP1',
 'STAR',
 'LRATD2',
 'ZDHHC11B',
 'TMCO2',
 'ZMPSTE24',
 'ERLIN2',
 'ZNF703',
 'BAG4',
 'C8orf86',
 'DDHD2',
 'FGFR1',
 'LETM2',
 'LSM1',
 'NSD3',
 'PLPP5',
 'IGHMBP2',
 'MRPL21',
 'BMP8A',
 'SMAP2',
 'TPPP',
 'NDUFB9',
 'TATDN1',
 'FAM91A1',
 'FER1L6',
 'FER1L6-AS1',
 'FER1L6-AS2',
 'TBC1D31',


In [9]:
aa_gene_dict["CASC11"]

[26,
 {'AU565_BREAST',
  'CORL23_LUNG',
  'CORL279_LUNG',
  'CORL311_LUNG',
  'DMS273_LUNG',
  'EFM192A_BREAST',
  'FUOV1_OVARY',
  'GCIY_STOMACH',
  'HCC1419_BREAST',
  'HCC44_LUNG',
  'HGC27_STOMACH',
  'KYSE450_OESOPHAGUS',
  'MSTO211H_PLEURA',
  'NCIH1792_LUNG',
  'NCIH2122_LUNG',
  'NCIH2170_LUNG',
  'NCIH2171_LUNG',
  'NCIH23_LUNG',
  'NCIH446_LUNG',
  'NCIH460_LUNG',
  'NCIH510_LUNG',
  'NCIH524_LUNG',
  'NCIH82_LUNG',
  'SCLC21H_LUNG',
  'SNU16_STOMACH',
  'SW480_LARGE_INTESTINE'}]

In [10]:
aa_gene_dict["CDC6"]

[3, {'NCIH2170_LUNG', 'UACC812_BREAST', 'UACC893_BREAST'}]
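
The `[count, {samples}]` entries above suggest that each gene is tallied by the number of distinct samples in which it appears on an ecDNA amplicon, with genes ordered from most to least frequent. A minimal sketch of such a tally follows. It assumes the "All genes" column holds stringified Python lists, as displayed in the AA table; `count_ecdna_genes` and the demo rows are hypothetical and not the repository's `ecdna_freq_genes`.

```python
import ast
import pandas as pd

def count_ecdna_genes(df, sample_col="Sample name", genes_col="All genes"):
    """Map gene -> [n_samples, {samples}] over rows classified as ecDNA, most frequent first."""
    samples_by_gene = {}
    for _, row in df[df.index == "ecDNA"].iterrows():
        for gene in ast.literal_eval(row[genes_col]):  # parse "['A', 'B']" into a list
            samples_by_gene.setdefault(gene, set()).add(row[sample_col])
    # Sort by descending sample count, breaking ties alphabetically (an assumption)
    ranked = sorted(samples_by_gene.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    return {gene: [len(samples), samples] for gene, samples in ranked}

# Made-up rows indexed by classification, mirroring the layout of aa_df
demo = pd.DataFrame(
    {
        "Sample name": ["S1", "S1", "S2", "S3"],
        "All genes": ["['MYC', 'PVT1']", "['MYC']", "['MYC', 'CCND1']", "['ERBB2']"],
    },
    index=pd.Index(["ecDNA", "ecDNA", "ecDNA", "Complex non-cyclic"], name="Classification"),
)
freq = count_ecdna_genes(demo)
top_genes = list(freq)[:2]  # dicts preserve insertion order, so slicing the keys gives a top-N
print(top_genes)
```

Because the result is already sorted, taking the first N keys yields a top-N list in the same way as the `[0:400]` slice above.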

### Amplicon Architect Oncogene List:

##### Top 20 Most Frequent Oncogenes in ecDNA: 

In [11]:
aa_oncogene_list = list(aa_oncogene_dict.keys())[0:20]
aa_oncogene_list

['MYC',
 'PVT1',
 'ERBB2',
 'CCND1',
 'FGF3',
 'FGF4',
 'CTTN',
 'CDK12',
 'BRF2',
 'ZNF703',
 'DDHD2',
 'FGFR1',
 'LSM1',
 'KRAS',
 'TRIB1',
 'ZNF217',
 'CSF3',
 'THRA',
 'BMP7',
 'CYP24A1']

In [12]:
aa_oncogene_dict["MYC"]

[26,
 {'AU565_BREAST',
  'CORL23_LUNG',
  'CORL279_LUNG',
  'CORL311_LUNG',
  'DMS273_LUNG',
  'EFM192A_BREAST',
  'FUOV1_OVARY',
  'GCIY_STOMACH',
  'HCC1419_BREAST',
  'HCC44_LUNG',
  'HGC27_STOMACH',
  'KYSE450_OESOPHAGUS',
  'MSTO211H_PLEURA',
  'NCIH1792_LUNG',
  'NCIH2122_LUNG',
  'NCIH2170_LUNG',
  'NCIH2171_LUNG',
  'NCIH23_LUNG',
  'NCIH446_LUNG',
  'NCIH460_LUNG',
  'NCIH510_LUNG',
  'NCIH524_LUNG',
  'NCIH82_LUNG',
  'SCLC21H_LUNG',
  'SNU16_STOMACH',
  'SW480_LARGE_INTESTINE'}]

In [13]:
aa_oncogene_dict["ERBB2"]

[13,
 {'BT474_BREAST',
  'EFM192A_BREAST',
  'HCC1419_BREAST',
  'HCC1569_BREAST',
  'HCC202_BREAST',
  'KYSE410_OESOPHAGUS',
  'MFE280_ENDOMETRIUM',
  'MKN7_STOMACH',
  'NCIH2170_LUNG',
  'OE19_OESOPHAGUS',
  'SKOV3_OVARY',
  'UACC812_BREAST',
  'UACC893_BREAST'}]

In [14]:
aa_oncogene_dict["BMP7"]

[3, {'BT474_BREAST', 'MCF7_BREAST', 'UACC893_BREAST'}]

# Calculating Essentiality Across Cell-Lines:

##### Genes are categorized by their median gene-effect score across cell-lines:
- <u>Non-essential</u> (greater than 0): genes where the knockout <b>does NOT affect</b> the cancer cell-line's <b>survival</b>
- <u>Selectively-essential</u> (between -1 and 0): genes where the knockout <b>affects the survival of most</b> cancer cell-lines
- <u>Common-essential</u> (less than -1): genes where the knockout <b>affects the survival of almost all</b> cancer cell-lines 
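
The thresholds above can be sketched as a small function. This is an illustration rather than the code in `utils.py`: `categorize_median` is a hypothetical name, `GENE_X` is a made-up gene, and placing medians of exactly 0 or -1 in the selectively-essential group is an assumption, since the list does not say where the boundaries fall.

```python
import pandas as pd

def categorize_median(median_effect):
    """Label a gene from its median gene-effect score using the thresholds listed above."""
    if median_effect > 0:
        return "Non-essential"
    if median_effect < -1:
        return "Common-essential"
    # Assumption: the boundary values 0 and -1 themselves fall in this group
    return "Selectively-essential"

# PLK1 and HOXA4 medians are taken from this notebook's output; GENE_X is hypothetical
medians = pd.Series({"PLK1": -2.672847, "HOXA4": 0.246487, "GENE_X": -0.4})
categories = medians.map(categorize_median)
print(categories)
```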

### With Literature Gene List:

##### The following is sorted by median value (most essential):

In [15]:
literature_avg_esen = avg_esen(literature_gene_list, esen_df)
literature_avg_esen.sort_values(by="Median")

Unnamed: 0,Gene,Mean,Median,Standard Deviation,Category
559,SNRPA1,-2.848278,-2.863018,0.418750,Common-essential
50,PLK1,-2.656581,-2.672847,0.472205,Common-essential
463,SNRPD1,-2.591417,-2.626916,0.430849,Common-essential
159,RPL3,-2.520890,-2.513504,0.336514,Common-essential
339,EEF2,-2.326072,-2.424840,0.511596,Common-essential
...,...,...,...,...,...
58,ERICH1,0.210255,0.209041,0.103781,Non-essential
176,BIN3,0.219264,0.219210,0.111792,Non-essential
15,HOXA4,0.238866,0.246487,0.122658,Non-essential
366,TLR3,0.249413,0.248230,0.096221,Non-essential
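
A table with the Gene, Mean, Median and Standard Deviation columns shown above could be produced roughly as follows. This is a sketch under assumptions: `summarize_genes` is a hypothetical stand-in for `avg_esen`, it silently skips symbols that are absent from the essentiality matrix, and it uses pandas' default sample standard deviation (`ddof=1`), which may differ from the repository's choice.

```python
import pandas as pd

def summarize_genes(gene_list, effect_df):
    """Per-gene mean, median and standard deviation across cell lines (rows)."""
    present = [g for g in gene_list if g in effect_df.columns]  # drop symbols not in the matrix
    stats_df = effect_df[present].agg(["mean", "median", "std"]).T
    stats_df.columns = ["Mean", "Median", "Standard Deviation"]
    return stats_df.rename_axis("Gene").reset_index()

# Three cell lines and two genes copied from the essentiality table earlier in the notebook
effect_demo = pd.DataFrame(
    {"A1BG": [-0.134132, -0.001436, -0.144940], "A1CF": [0.029103, -0.080068, 0.026541]},
    index=["ACH-000001", "ACH-000004", "ACH-000005"],
)
summary = summarize_genes(["A1BG", "NOT_IN_MATRIX", "A1CF"], effect_demo)
print(summary.sort_values(by="Median"))
```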


### With Amplicon Architect Most Frequent Genes List:

##### The following is sorted by median value (most essential):

In [16]:
aa_avg_esen = avg_esen(aa_gene_list, esen_df)
aa_avg_esen.sort_values(by="Median")

Unnamed: 0,Gene,Mean,Median,Standard Deviation,Category
246,PSMA6,-2.799824,-2.822722,0.412962,Common-essential
186,PSMB3,-2.690280,-2.698995,0.391548,Common-essential
187,RPL23,-2.657507,-2.680870,0.401600,Common-essential
229,PSMB4,-2.249115,-2.241404,0.358707,Common-essential
83,RPL19,-2.261818,-2.238109,0.338898,Common-essential
...,...,...,...,...,...
283,EDN3,0.196774,0.195733,0.113218,Non-essential
199,SERPINH1,0.203149,0.196871,0.093530,Non-essential
189,ADAM18,0.210093,0.214992,0.136893,Non-essential
269,CLPTM1L,0.233281,0.232301,0.107643,Non-essential


# Splitting AA Cell-Lines (ecDNA +/-):

In [17]:
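# Build the CCLE-name -> Achilles ID mapping once instead of on every loop iteration
ccle_to_id = ccle_to_achilles(meta_df, "CCLEName")

# Note: despite the "_genes" suffix, both sets hold cell-line (Achilles) IDs
ec_aa_plus_genes = set()
all_aa_genes = set()

# aa_df is indexed by classification, so `index` is each amplicon's class label
for index, row in aa_df.iterrows():
    achilles_id = ccle_to_id[row["Sample name"]]
    all_aa_genes.add(achilles_id)
    if index == "ecDNA":
        ec_aa_plus_genes.add(achilles_id)

print("Number of ecDNA+ cell-lines in AA data:", len(ec_aa_plus_genes))
print("Total number of cell-lines in AA data:", len(all_aa_genes))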

Number of ecDNA+ cell-lines in AA data: 166
Total number of cell-lines in AA data: 329
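
The same split can be written without an explicit loop. The sketch below is illustrative only: the cell-line names and IDs are made up, and the name-to-ID dictionary stands in for whatever `ccle_to_achilles` returns. A cell line counts as ecDNA+ if any of its amplicons is classified as ecDNA, matching the loop above.

```python
import pandas as pd

# Hypothetical miniature metadata table: ModelID index, CCLEName column
meta_demo = pd.DataFrame(
    {"CCLEName": ["LINE_A_TISSUE", "LINE_B_TISSUE"]},
    index=pd.Index(["ACH-X00001", "ACH-X00002"], name="ModelID"),
)
# Hypothetical AA table indexed by classification (None marks "no amplicon found")
aa_demo = pd.DataFrame(
    {"Sample name": ["LINE_A_TISSUE", "LINE_A_TISSUE", "LINE_B_TISSUE", "LINE_C_TISSUE"]},
    index=pd.Index(["ecDNA", "Complex non-cyclic", None, "ecDNA"], name="Classification"),
)

name_to_id = dict(zip(meta_demo["CCLEName"], meta_demo.index))
model_ids = aa_demo["Sample name"].map(name_to_id)  # NaN where a name has no metadata entry

all_lines = set(model_ids.dropna())
ecdna_plus = set(model_ids[aa_demo.index == "ecDNA"].dropna())
ecdna_minus = all_lines - ecdna_plus
print(len(ecdna_plus), len(ecdna_minus))
```

Unlike the dictionary lookup in the loop, `Series.map` yields NaN for a sample name missing from the metadata instead of raising `KeyError`, so such samples are dropped rather than stopping the run.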


# Generating Plots and Statistics:

## Individual Gene Plots: 

### AA Gene List:

In [18]:
aa_genes_plot_df = plot_genes(aa_gene_list, esen_df, ec_aa_plus_genes, "aa")

Number of ecDNA+ cell lines: 121
Number of ecDNA- cell lines: 1029


In [19]:
aa_genes_plot_df[aa_genes_plot_df['p-value'] < 0.05].sort_values(by="p-value")

Unnamed: 0,Gene,u-stat,p-value
246,PSMA6,47493.0,1.9e-05
309,MSL1,75337.0,0.000153
264,C19orf12,75066.0,0.00021
248,SNX6,73958.0,0.000708
37,MRGPRF,51623.0,0.002096
313,C1orf109,72806.0,0.002265
249,SPTSSA,72456.0,0.003159
17,PPP1R1B,52116.0,0.00335
231,RFX5,52167.0,0.003513
233,ZNF687,52504.0,0.004782
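
The `u-stat` and `p-value` columns are consistent with a Mann-Whitney U test comparing each gene's effect scores between ecDNA+ and ecDNA- cell lines. The following sketch shows that comparison on synthetic data. It assumes a two-sided test, and `compare_groups` is a hypothetical stand-in for the statistics computed inside `plot_genes`, not the repository implementation.

```python
import numpy as np
import pandas as pd
from scipy import stats

def compare_groups(genes, effect_df, plus_ids):
    """Two-sided Mann-Whitney U test per gene: ecDNA+ rows versus all remaining rows."""
    plus_mask = effect_df.index.isin(plus_ids)
    rows = []
    for gene in genes:
        plus = effect_df.loc[plus_mask, gene].dropna()
        minus = effect_df.loc[~plus_mask, gene].dropna()
        u_stat, p_value = stats.mannwhitneyu(plus, minus, alternative="two-sided")
        rows.append({"Gene": gene, "u-stat": u_stat, "p-value": p_value})
    return pd.DataFrame(rows)

# Synthetic matrix: 20 "ecDNA+" and 40 "ecDNA-" lines; only SHIFTED differs between groups
rng = np.random.default_rng(0)
ids = [f"LINE-{i:03d}" for i in range(60)]
demo = pd.DataFrame(
    {
        "SHIFTED": np.concatenate([rng.normal(-1.0, 0.2, 20), rng.normal(0.0, 0.2, 40)]),
        "SAME": rng.normal(0.0, 0.2, 60),
    },
    index=ids,
)
result = compare_groups(["SHIFTED", "SAME"], demo, set(ids[:20]))
print(result[result["p-value"] < 0.05].sort_values(by="p-value"))
```

Only cell lines present in the essentiality matrix enter either group, which would explain why the ecDNA+ count reported by `plot_genes` is smaller than the count in the AA data.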


### Literature Gene List:

In [22]:
literature_genes_plot_df = plot_genes(literature_gene_list, esen_df, ec_aa_plus_genes, "literature")

Number of ecDNA+ cell lines: 121
Number of ecDNA- cell lines: 1029


In [23]:
literature_genes_plot_df[literature_genes_plot_df['p-value'] < 0.05].sort_values(by="p-value")

Unnamed: 0,Gene,u-stat,p-value
504,CIB1,75578.0,0.000116
457,MSL1,75337.0,0.000153
6,XPO7,50476.0,0.000654
404,RACGAP1,50527.0,0.000690
71,MDM2,73595.0,0.001033
...,...,...,...
400,PTCD2,55379.0,0.046655
277,DIDO1,55379.0,0.046655
492,SCNN1A,69108.0,0.047362
453,TNFRSF10B,55406.0,0.047524


## Individual Cell-Line Plots: 

### AA Gene List:

### Literature Gene List:

## Grouped Cell-Line Plots: 

### AA Gene List:

### Literature Gene List: