# EssenCDNA Analysis Notebook (DepMap 24Q2)

Adrian Layer (alayer@ucsd.edu); Anthony Vasquez (pavasquez@ucsd.edu); Yasmin A. Jaber (yjaber@ucsd.edu); Omar Halawa (ohalawa@ucsd.edu)

##### This notebook uses the repository code to <u>analyze the correlation between CRISPR gene essentiality (dependency) and the presence (or absence) of ecDNA</u> in DepMap 24Q2.

##### By changing the "Input Data" file paths below, you can run the same analysis on your own data.

# Imports:

In [1]:
# Importing functions from utils.py supporting file
import sys
sys.path.append('src/')
from utils import *

import matplotlib
import matplotlib.pyplot as plt
print(f"matplotlib: {matplotlib.__version__}")

import pandas as pd
print(f"pandas: {pd.__version__}")

import seaborn as sns
print(f"seaborn: {sns.__version__}")

import numpy as np
print(f"numpy: {np.__version__}")

import scipy
from scipy import stats
print(f"scipy: {scipy.__version__}")

matplotlib: 3.8.4
pandas: 2.2.2
seaborn: 0.13.2
numpy: 1.26.4
scipy: 1.13.1


# Input Data:

### Change the following to match your respective data inputs:

In [2]:
# DepMap standard-format CRISPR essentiality data file
DEPMAP_CRISPR_ESEN = "data/DepMap 24Q2/CRISPRGeneEffect.csv"

# DepMap metadata file
DEPMAP_METDATA = "data/DepMap 24Q2/Model.csv"

# Amplicon Architect standard-format aggregated results file
AA_RESULTS = "data/AA/aggregated_results.csv"

# Literature-derived list of the genes most differentially expressed (up- or downregulated) in ecDNA-containing tumors
# See cell below for source
LITERATURE_ECDNA_TARGET_GENES = "data/Literature/ecDNA Target genes.csv"

Lin Miin S., Jo Se-Young, Luebeck Jens, Chang Howard Y., Wu Sihan, Mischel Paul S., Bafna Vineet (2023) <b>Transcriptional immune suppression and upregulation of double stranded DNA damage and repair repertoires in ecDNA-containing tumors</b> <i>eLife</i> 12:RP88895

https://doi.org/10.7554/eLife.88895.2

# Processing Data:

### DepMap CRISPR essentiality data (cell-lines along rows, genes along columns):

##### For why this notebook uses the CRISPRGeneEffect.csv file rather than CRISPRGeneDependency, see [this DepMap forum thread.](https://forum.depmap.org/t/crisprgeneeffect-vs-crisprgenedependency/2333)

In [3]:
esen_df = read_esen(DEPMAP_CRISPR_ESEN)
esen_df.head()

Unnamed: 0,A1BG,A1CF,A2M,A2ML1,A3GALT2,A4GALT,A4GNT,AAAS,AACS,AADAC,...,ZWILCH,ZWINT,ZXDA,ZXDB,ZXDC,ZYG11A,ZYG11B,ZYX,ZZEF1,ZZZ3
ACH-000001,-0.134132,0.029103,0.016454,-0.13754,-0.047273,0.181367,-0.082437,-0.059023,0.194592,0.035473,...,-0.123528,0.08514,0.181954,0.239474,0.172965,-0.230327,0.055657,0.044296,0.107361,-0.410449
ACH-000004,-0.001436,-0.080068,-0.125263,-0.027607,-0.053838,-0.151272,0.240094,-0.038922,0.186438,0.160221,...,-0.186899,-0.359257,0.202271,0.05774,0.089295,0.086703,-0.30493,0.086858,0.254538,-0.087671
ACH-000005,-0.14494,0.026541,0.160605,0.088015,-0.202605,-0.24342,0.133726,-0.034895,-0.126105,0.03603,...,-0.309668,-0.344502,-0.05616,-0.092447,-0.01555,-0.17038,-0.080934,-0.059685,0.030254,-0.145055
ACH-000007,-0.053334,-0.12042,0.047978,0.086984,-0.018987,-0.017309,-4.1e-05,-0.158419,-0.169559,0.201305,...,-0.323038,-0.387265,-0.013816,0.183228,0.038424,-0.051728,-0.383499,-0.012801,-0.294771,-0.431575
ACH-000009,-0.027684,-0.144202,0.052846,0.073833,0.038823,-0.108149,0.010811,-0.0886,0.032194,0.11427,...,-0.253057,-0.159965,-0.025342,0.1915,-0.071632,-0.077843,-0.525599,0.093219,-0.029515,-0.255204
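
As a rough illustration of the layout described above (cell lines along rows, genes along columns), the sketch below loads a DepMap-style gene-effect table with pandas. It is not the repository's `read_esen`; `load_gene_effect` is a hypothetical helper, the demo values are made up, and the assumption that raw column headers look like `SYMBOL (EntrezID)` should be checked against your own file.

```python
import io
import re

import pandas as pd


def load_gene_effect(source):
    """Hypothetical loader: rows are cell lines (ModelID), columns are genes."""
    df = pd.read_csv(source, index_col=0)
    # Assumption: raw headers look like "A1BG (1)"; keep only the gene symbol.
    df.columns = [re.sub(r"\s*\(\d+\)$", "", col) for col in df.columns]
    return df


# Tiny in-memory stand-in for CRISPRGeneEffect.csv (values are made up).
demo_csv = io.StringIO(
    ",A1BG (1),A1CF (29974)\n"
    "ACH-000001,-0.13,0.03\n"
    "ACH-000004,0.00,-0.08\n"
)
demo_esen_df = load_gene_effect(demo_csv)
print(demo_esen_df)
```

Stripping the identifier suffix lets later cells index columns by plain gene symbol, as in the table above.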


### DepMap Metadata:

In [4]:
meta_df = read_metadata(DEPMAP_METDATA)
meta_df.head()

ModelID,PatientID,CellLineName,StrippedCellLineName,DepmapModelType,OncotreeLineage,OncotreePrimaryDisease,OncotreeSubtype,OncotreeCode,LegacyMolecularSubtype,LegacySubSubtype,...,EngineeredModel,TissueOrigin,ModelDerivationMaterial,PublicComments,CCLEName,HCMIID,WTSIMasterCellID,SangerModelID,COSMICID,DateSharedIndbGaP
ACH-000001,PT-gj46wT,NIH:OVCAR-3,NIHOVCAR3,HGSOC,Ovary/Fallopian Tube,Ovarian Epithelial Tumor,High-Grade Serous Ovarian Cancer,HGSOC,,high_grade_serous,...,,,,,NIHOVCAR3_OVARY,,2201.0,SIDM00105,905933.0,
ACH-000002,PT-5qa3uk,HL-60,HL60,AML,Myeloid,Acute Myeloid Leukemia,Acute Myeloid Leukemia,AML,,M3,...,,,,,HL60_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE,,55.0,SIDM00829,905938.0,
ACH-000003,PT-puKIyc,CACO2,CACO2,COAD,Bowel,Colorectal Adenocarcinoma,Colon Adenocarcinoma,COAD,,,...,,,,,CACO2_LARGE_INTESTINE,,,SIDM00891,,
ACH-000004,PT-q4K2cp,HEL,HEL,AML,Myeloid,Acute Myeloid Leukemia,Acute Myeloid Leukemia,AML,,M6,...,,,,,HEL_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE,,783.0,SIDM00594,907053.0,
ACH-000005,PT-q4K2cp,HEL 92.1.7,HEL9217,AML,Myeloid,Acute Myeloid Leukemia,Acute Myeloid Leukemia,AML,,M6,...,,,,,HEL9217_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE,,,SIDM00593,,


### Amplicon Architect Results:

In [5]:
aa_df = read_aa(AA_RESULTS, "Classification")
aa_df.head()

Classification,Unnamed: 0,Sample name,AA amplicon number,Feature ID,Location,Oncogenes,Complexity score,Captured interval length,Feature median copy number,Feature maximum copy number,...,Tissue of origin,Sample type,Feature BED file,CNV BED file,AA PNG file,AA PDF file,AA summary file,Run metadata JSON,Sample metadata JSON,All genes
ecDNA,0,5637_URINARY_TRACT,1.0,5637_URINARY_TRACT_amplicon1_ecDNA_1,['chr6:20350615-22839372'],"['E2F3', 'SOX4']",1.223167,2488757.0,29.757287,67.419329,...,urinary tract,cell line,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/CCL...,/Users/jluebeck/Desktop/research/CCLE/hg38/CCL...,/Users/jluebeck/Desktop/research/CCLE/hg38/CCL...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,"['CASC15', 'CDKAL1', 'E2F3', 'HDGFL1', 'NBAT1'..."
ecDNA,1,5637_URINARY_TRACT,1.0,5637_URINARY_TRACT_amplicon1_ecDNA_2,"['chr3:9099507-9519221', 'chr3:9521655-1018449...","['FANCD2', 'PPARG', 'RAF1', 'SRGAP3', 'VHL']",0.709581,4901014.0,13.175019,17.28349,...,urinary tract,cell line,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/CCL...,/Users/jluebeck/Desktop/research/CCLE/hg38/CCL...,/Users/jluebeck/Desktop/research/CCLE/hg38/CCL...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,"['ARPC4', 'ARPC4-TTLL3', 'ATG7', 'ATP2B2', 'AT..."
Complex non-cyclic,2,59M_OVARY,1.0,59M_OVARY_amplicon1_Complex non-cyclic_1,"['chr5:51400000-52863895', 'chr9:104220123-104...",[],0.935892,3998398.0,5.026093,5.787943,...,ovary,cell line,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/CCL...,/Users/jluebeck/Desktop/research/CCLE/hg38/CCL...,/Users/jluebeck/Desktop/research/CCLE/hg38/CCL...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,"['ABCA1', 'DNAJC25', 'DNAJC25-GNG10', 'ECPAS',..."
ecDNA,3,59M_OVARY,2.0,59M_OVARY_amplicon2_ecDNA_2,"['chr8:124484622-125582885', 'chr8:125588889-1...",['TRIB1'],0.366705,1463324.0,10.047201,10.047201,...,ovary,cell line,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/CCL...,/Users/jluebeck/Desktop/research/CCLE/hg38/CCL...,/Users/jluebeck/Desktop/research/CCLE/hg38/CCL...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,"['MTSS1', 'NDUFB9', 'NSMCE2', 'SQLE', 'TATDN1'..."
ecDNA,4,59M_OVARY,2.0,59M_OVARY_amplicon2_ecDNA_1,"['chr8:52712046-56014576', 'chr8:65533018-6557...",['TCEA1'],1.074874,3343884.0,15.129375,24.854069,...,ovary,cell line,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/CCL...,/Users/jluebeck/Desktop/research/CCLE/hg38/CCL...,/Users/jluebeck/Desktop/research/CCLE/hg38/CCL...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,/Users/jluebeck/Desktop/research/CCLE/hg38/ccl...,"['ATP6V1H', 'LYN', 'LYPLA1', 'MRPL15', 'NPBWR1..."


# Generating Gene Lists:

##### Using our two datasets (<u>one from literature</u>, and the other being <u>from Amplicon Architect data</u>), we can generate two independent gene lists to utilize for our analyses.

### Literature Gene List:

In [6]:
# This helper is tailored to this particular literature dataset and may not generalize to other files.
# Arguments: the filename, the gene-name column, and the ecDNA regulation-direction column.
literature_gene_dict = read_literature_gene_list(LITERATURE_ECDNA_TARGET_GENES, "#Gene|GeneId", "direction(for_geneset_enrichment)")
literature_gene_list = list(literature_gene_dict.keys())
literature_gene_dict

{'RAE1': 'UP',
 'FBXO17': 'UP',
 'YTHDF1': 'UP',
 'DDX27': 'UP',
 'SEC61G': 'UP',
 'CHMP7': 'DOWN',
 'XPO7': 'DOWN',
 'INTS9': 'DOWN',
 'GJC1': 'UP',
 'METTL1': 'UP',
 'TACR1': 'DOWN',
 'FAM72B': 'UP',
 'NUP107': 'UP',
 'KIAA1967': 'DOWN',
 'C6orf145': 'UP',
 'EGFR': 'UP',
 'C20orf20': 'UP',
 'TPX2': 'UP',
 'KIF2C': 'UP',
 'HOXA4': 'UP',
 'CDK4': 'UP',
 'PCM1': 'DOWN',
 'AURKA': 'UP',
 'HOXA2': 'UP',
 'C8orf41': 'DOWN',
 'TSFM': 'UP',
 'FAM72D': 'UP',
 'HOXA5': 'UP',
 'PSMA7': 'UP',
 'MCPH1': 'DOWN',
 'CCDC25': 'DOWN',
 'INTS10': 'DOWN',
 'HOXA7': 'UP',
 'SPAG5': 'UP',
 'CNOT7': 'DOWN',
 'BUB1': 'UP',
 'HOXA3': 'UP',
 'TRIM35': 'DOWN',
 'PPP2R2A': 'DOWN',
 'WRN': nan,
 'HOXA6': 'UP',
 'VPS37A': 'DOWN',
 'TRIP13': 'UP',
 'ZNF395': 'DOWN',
 'FUT10': 'DOWN',
 'KIF4A': 'UP',
 'FDFT1': 'DOWN',
 'NCAPH': 'UP',
 'GSG2': 'UP',
 'CDCA5': 'UP',
 'MAK16': nan,
 'ERI1': nan,
 'RAD54L': 'UP',
 'FBXO25': 'DOWN',
 'PINX1': nan,
 'RAD51': 'UP',
 'MCM10': 'UP',
 'PLK1': 'UP',
 'CDCA8': 'UP',
 'ORC1L': 

### Amplicon Architect Gene List:

In [7]:
aa_gene_dict = ecdna_freq_genes(aa_df, "Sample name", "All genes", "Oncogenes", "Feature median copy number")
aa_oncogene_dict = ecdna_freq_genes(aa_df, "Sample name", "All genes", "Oncogenes", "Feature median copy number", True)
aa_gene_dict

{'CASC11': [1459.6616842484777,
  {'AU565_BREAST',
   'CORL23_LUNG',
   'CORL279_LUNG',
   'CORL311_LUNG',
   'DMS273_LUNG',
   'EFM192A_BREAST',
   'FUOV1_OVARY',
   'GCIY_STOMACH',
   'HCC1419_BREAST',
   'HCC44_LUNG',
   'HGC27_STOMACH',
   'KYSE450_OESOPHAGUS',
   'MSTO211H_PLEURA',
   'NCIH1792_LUNG',
   'NCIH2122_LUNG',
   'NCIH2170_LUNG',
   'NCIH2171_LUNG',
   'NCIH23_LUNG',
   'NCIH446_LUNG',
   'NCIH460_LUNG',
   'NCIH510_LUNG',
   'NCIH524_LUNG',
   'NCIH82_LUNG',
   'SCLC21H_LUNG',
   'SNU16_STOMACH',
   'SW480_LARGE_INTESTINE'}],
 'MYC': [1459.6616842484777,
  {'AU565_BREAST',
   'CORL23_LUNG',
   'CORL279_LUNG',
   'CORL311_LUNG',
   'DMS273_LUNG',
   'EFM192A_BREAST',
   'FUOV1_OVARY',
   'GCIY_STOMACH',
   'HCC1419_BREAST',
   'HCC44_LUNG',
   'HGC27_STOMACH',
   'KYSE450_OESOPHAGUS',
   'MSTO211H_PLEURA',
   'NCIH1792_LUNG',
   'NCIH2122_LUNG',
   'NCIH2170_LUNG',
   'NCIH2171_LUNG',
   'NCIH23_LUNG',
   'NCIH446_LUNG',
   'NCIH460_LUNG',
   'NCIH510_LUNG',
   'NCIH524

##### Top 400 Most Frequent Genes in ecDNA:

In [8]:
aa_gene_list = list(aa_gene_dict.keys())[0:400]
aa_gene_list

['CASC11',
 'MYC',
 'PVT1',
 'CASC8',
 'POU5F1B',
 'CCAT2',
 'TMEM75',
 'CASC21',
 'ERBB2',
 'PGAP3',
 'GRB7',
 'IKZF3',
 'MIEN1',
 'PCAT1',
 'MYCL',
 'TRIT1',
 'PNMT',
 'TCAP',
 'BMP8B',
 'HPCAL4',
 'NT5C1A',
 'OXCT2',
 'PPIE',
 'ZPBP2',
 'CASC19',
 'CCAT1',
 'HEYL',
 'PCAT2',
 'PRNCR1',
 'MFSD2A',
 'RLF',
 'CDC6',
 'RAPGEFL1',
 'RARA',
 'WIPF2',
 'CAP1',
 'PABPC4',
 'PABPC4-AS1',
 'PPIEL',
 'SNORA55',
 'OXCT2P1',
 'LRATD2',
 'COL9A2',
 'BMP8A',
 'TTLL10',
 'SMAP2',
 'ZFP69',
 'ZFP69B',
 'ZNF684',
 'PPT1',
 'NEUROD2',
 'PPP1R1B',
 'STARD3',
 'TMCO2',
 'ZMPSTE24',
 'DDX1',
 'GACAT3',
 'MYCN',
 'MYCNOS',
 'MYCNUT',
 'FAM91A1',
 'FER1L6',
 'FER1L6-AS1',
 'FER1L6-AS2',
 'NFYC',
 'NFYC-AS1',
 'RIMS3',
 'CDK12',
 'FBXL20',
 'MED1',
 'KIAA0754',
 'APIP',
 'CD44',
 'FJX1',
 'LDLRAD3',
 'PAMR1',
 'PDHX',
 'SLC1A2',
 'SNORD164',
 'TRIM44',
 'GJA9',
 'GJA9-MYCBP',
 'FGFR2',
 'PLPP4',
 'RPL21',
 'WDR11',
 'WDR11-AS1',
 'MYEOV',
 'SMIM38',
 'ANO1',
 'CASC1',
 'ETFRF1',
 'KRAS',
 'LMNTD1',
 'EXO5',

In [9]:
aa_gene_dict["CASC11"]

[1459.6616842484777,
 {'AU565_BREAST',
  'CORL23_LUNG',
  'CORL279_LUNG',
  'CORL311_LUNG',
  'DMS273_LUNG',
  'EFM192A_BREAST',
  'FUOV1_OVARY',
  'GCIY_STOMACH',
  'HCC1419_BREAST',
  'HCC44_LUNG',
  'HGC27_STOMACH',
  'KYSE450_OESOPHAGUS',
  'MSTO211H_PLEURA',
  'NCIH1792_LUNG',
  'NCIH2122_LUNG',
  'NCIH2170_LUNG',
  'NCIH2171_LUNG',
  'NCIH23_LUNG',
  'NCIH446_LUNG',
  'NCIH460_LUNG',
  'NCIH510_LUNG',
  'NCIH524_LUNG',
  'NCIH82_LUNG',
  'SCLC21H_LUNG',
  'SNU16_STOMACH',
  'SW480_LARGE_INTESTINE'}]

In [10]:
aa_gene_dict["CFTR"]

[38.10721405691402, {'HS746T_STOMACH', 'MKN45_STOMACH'}]
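
The outputs above show that each gene maps to a two-element list: a number and the set of samples carrying that gene on ecDNA. How `ecdna_freq_genes` computes the number and orders the genes is not shown here, so the following is only a sketch under stated assumptions: it sums a copy-number value over ecDNA features and ranks genes by that sum. All rows and names in it are made up.

```python
import ast
from collections import defaultdict

# Made-up AA-style rows: (classification, sample, "All genes" string, copy number).
demo_rows = [
    ("ecDNA", "SAMPLE_A", "['GENE1', 'GENE2']", 30.0),
    ("ecDNA", "SAMPLE_B", "['GENE1']", 12.5),
    ("Complex non-cyclic", "SAMPLE_C", "['GENE2']", 5.0),
]

totals = defaultdict(float)  # gene -> summed copy number over ecDNA features
samples = defaultdict(set)   # gene -> samples carrying it on ecDNA

for classification, sample, genes_str, copy_number in demo_rows:
    if classification != "ecDNA":
        continue  # only ecDNA features contribute
    for gene in ast.literal_eval(genes_str):  # column stores list-like strings
        totals[gene] += copy_number
        samples[gene].add(sample)

# Assumption: rank genes by summed copy number, highest first.
demo_gene_dict = {
    gene: [totals[gene], samples[gene]]
    for gene in sorted(totals, key=totals.get, reverse=True)
}
print(demo_gene_dict)
```

Because Python dictionaries preserve insertion order, slicing `list(demo_gene_dict.keys())` returns the top-ranked genes, which mirrors how the notebook takes the first 400 keys.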

### Amplicon Architect Oncogene List:

##### Top 20 Most Frequent Oncogenes in ecDNA: 

In [11]:
aa_oncogene_list = list(aa_oncogene_dict.keys())[0:20]
aa_oncogene_list

['MYC',
 'PVT1',
 'ERBB2',
 'CDC6',
 'RARA',
 'MYCN',
 'CDK12',
 'FGFR2',
 'KRAS',
 'EGFR',
 'CCND1',
 'FGF3',
 'FGF4',
 'CTTN',
 'TRIB2',
 'BRF2',
 'CSF3',
 'THRA',
 'CCNE1',
 'ZNF217']

In [12]:
aa_oncogene_dict["MYC"]

[1459.6616842484777,
 {'AU565_BREAST',
  'CORL23_LUNG',
  'CORL279_LUNG',
  'CORL311_LUNG',
  'DMS273_LUNG',
  'EFM192A_BREAST',
  'FUOV1_OVARY',
  'GCIY_STOMACH',
  'HCC1419_BREAST',
  'HCC44_LUNG',
  'HGC27_STOMACH',
  'KYSE450_OESOPHAGUS',
  'MSTO211H_PLEURA',
  'NCIH1792_LUNG',
  'NCIH2122_LUNG',
  'NCIH2170_LUNG',
  'NCIH2171_LUNG',
  'NCIH23_LUNG',
  'NCIH446_LUNG',
  'NCIH460_LUNG',
  'NCIH510_LUNG',
  'NCIH524_LUNG',
  'NCIH82_LUNG',
  'SCLC21H_LUNG',
  'SNU16_STOMACH',
  'SW480_LARGE_INTESTINE'}]

In [13]:
aa_oncogene_dict["ERBB2"]

[493.95832185281716,
 {'BT474_BREAST',
  'EFM192A_BREAST',
  'HCC1419_BREAST',
  'HCC1569_BREAST',
  'HCC202_BREAST',
  'KYSE410_OESOPHAGUS',
  'MFE280_ENDOMETRIUM',
  'MKN7_STOMACH',
  'NCIH2170_LUNG',
  'OE19_OESOPHAGUS',
  'SKOV3_OVARY',
  'UACC812_BREAST',
  'UACC893_BREAST'}]

In [14]:
aa_oncogene_dict["EGFR"]

[93.32824189532512, {'HCC827_LUNG', 'KYSE520_OESOPHAGUS', 'NCIH3255_LUNG'}]

In [15]:
aa_oncogene_dict["KRAS"]

[95.90981469728438,
 {'CORL23_LUNG',
  'DMS53_LUNG',
  'KE39_STOMACH',
  'KLE_ENDOMETRIUM',
  'MKN1_STOMACH'}]

# Splitting AA Cell-Lines (ecDNA +/-)

##### We split the AA data rows into cell-lines corresponding to ecDNA presence (+) or absence (-).

In [16]:
ec_aa_plus_ids = set()
all_aa_ids = set()
id_dict = ccle_to_achilles(meta_df, "CCLEName")

for index, row in aa_df.iterrows():
    achilles_id = id_dict[row["Sample name"]]
    all_aa_ids.add(achilles_id)
    if index == "ecDNA":
        ec_aa_plus_ids.add(achilles_id)

print("Number of ecDNA+ cell-lines in AA data:", len(ec_aa_plus_ids))
print("Total number of cell-lines in AA data:", len(all_aa_ids))

Number of ecDNA+ cell-lines in AA data: 166
Total number of cell-lines in AA data: 329
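
The split above depends on translating AA sample names (CCLE names) into DepMap ModelIDs. The sketch below shows that idea with plain dictionaries and sets. It is not the repository's `ccle_to_achilles`; every identifier and row is a made-up placeholder, and skipping samples with no metadata match is an assumption (the loop above would raise a `KeyError` instead).

```python
# Made-up metadata: ModelID -> CCLE name (real values come from Model.csv).
demo_meta = {
    "ACH-DEMO1": "LINE_A_OVARY",
    "ACH-DEMO2": "LINE_B_LUNG",
    "ACH-DEMO3": "LINE_C_BREAST",
}
# Invert to CCLE name -> ModelID, the direction needed for AA sample names.
demo_id_dict = {ccle: model_id for model_id, ccle in demo_meta.items()}

# Made-up AA rows as (classification, sample name) pairs.
demo_aa_rows = [
    ("ecDNA", "LINE_A_OVARY"),
    ("Complex non-cyclic", "LINE_A_OVARY"),
    ("Complex non-cyclic", "LINE_B_LUNG"),
    ("ecDNA", "NOT_IN_METADATA"),
]

demo_all_ids, demo_plus_ids = set(), set()
for classification, sample in demo_aa_rows:
    model_id = demo_id_dict.get(sample)
    if model_id is None:  # assumption: skip samples without a metadata match
        continue
    demo_all_ids.add(model_id)
    if classification == "ecDNA":
        demo_plus_ids.add(model_id)

# A cell line is ecDNA- only if none of its features is classified as ecDNA.
demo_minus_ids = demo_all_ids - demo_plus_ids
print(sorted(demo_plus_ids), sorted(demo_minus_ids))
```

Using sets means a cell line with several amplicons is counted once, and one ecDNA feature is enough to place it in the ecDNA+ group.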


# Trimming Essentiality Data:

##### The essentiality data <b>must</b> be trimmed to the cell-lines in the AA run, since those are the only cell-lines whose ecDNA status (present or absent) is known.

In [17]:
trimmed_esen_df = esen_df[esen_df.index.isin(all_aa_ids)]

In [18]:
trimmed_esen_df.head()

Unnamed: 0,A1BG,A1CF,A2M,A2ML1,A3GALT2,A4GALT,A4GNT,AAAS,AACS,AADAC,...,ZWILCH,ZWINT,ZXDA,ZXDB,ZXDC,ZYG11A,ZYG11B,ZYX,ZZEF1,ZZZ3
ACH-000001,-0.134132,0.029103,0.016454,-0.13754,-0.047273,0.181367,-0.082437,-0.059023,0.194592,0.035473,...,-0.123528,0.08514,0.181954,0.239474,0.172965,-0.230327,0.055657,0.044296,0.107361,-0.410449
ACH-000005,-0.14494,0.026541,0.160605,0.088015,-0.202605,-0.24342,0.133726,-0.034895,-0.126105,0.03603,...,-0.309668,-0.344502,-0.05616,-0.092447,-0.01555,-0.17038,-0.080934,-0.059685,0.030254,-0.145055
ACH-000007,-0.053334,-0.12042,0.047978,0.086984,-0.018987,-0.017309,-4.1e-05,-0.158419,-0.169559,0.201305,...,-0.323038,-0.387265,-0.013816,0.183228,0.038424,-0.051728,-0.383499,-0.012801,-0.294771,-0.431575
ACH-000009,-0.027684,-0.144202,0.052846,0.073833,0.038823,-0.108149,0.010811,-0.0886,0.032194,0.11427,...,-0.253057,-0.159965,-0.025342,0.1915,-0.071632,-0.077843,-0.525599,0.093219,-0.029515,-0.255204
ACH-000012,-0.139877,-0.002344,0.011211,0.048813,0.04977,-0.179609,0.113692,-0.039238,-0.069528,0.235269,...,-0.524291,-0.623566,0.08967,-0.002416,-0.081465,0.061536,-0.060163,-0.091458,-0.157398,-0.233239


# Calculating Essentiality Across Cell-Lines:

##### Genes are categorized by their median gene-effect value across cell-lines:
- <u>Non-essential</u> (greater than 0): genes where the knockout <b>does NOT affect</b> the cancer cell-line's <b>survival</b>
- <u>Selectively-essential</u> (between -1 and 0): genes where the knockout <b>affects the survival of most</b> cancer cell-lines
- <u>Common-essential</u> (less than -1): genes where the knockout <b>affects the survival of almost all</b> cancer cell-lines 
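
The thresholds above can be sketched as a small function. This is an illustration, not the repository's `avg_esen`; `categorize_gene` is a hypothetical name, the scores are made up, and the handling of medians exactly equal to 0 or -1 is an assumption, since the list does not say which side they fall on.

```python
from statistics import median


def categorize_gene(scores):
    """Label a gene from its gene-effect scores across cell lines."""
    med = median(scores)
    if med > 0:
        return "Non-essential"
    if med < -1:
        return "Common-essential"
    # Assumption: medians exactly equal to 0 or -1 fall in the middle category.
    return "Selectively-essential"


# Made-up scores for three hypothetical genes across four cell lines.
demo_scores = {
    "GENE_A": [0.10, 0.05, 0.20, -0.02],
    "GENE_B": [-0.40, -0.10, -0.70, 0.05],
    "GENE_C": [-2.90, -2.50, -3.10, -2.70],
}
demo_categories = {gene: categorize_gene(vals) for gene, vals in demo_scores.items()}
print(demo_categories)
```

Using the median rather than the mean keeps a few extreme cell lines from moving a gene across a threshold.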

### With Literature Gene List:

##### The following is sorted by median value (most essential):

In [19]:
literature_avg_esen = avg_esen(literature_gene_list, trimmed_esen_df)
literature_avg_esen.sort_values(by="Median").head()

Unnamed: 0,Gene,Mean,Median,Standard Deviation,Category
559,SNRPA1,-2.90544,-2.911526,0.41676,Common-essential
50,PLK1,-2.724873,-2.711673,0.479285,Common-essential
463,SNRPD1,-2.661759,-2.673431,0.456231,Common-essential
159,RPL3,-2.534814,-2.50625,0.331999,Common-essential
339,EEF2,-2.318079,-2.429119,0.537582,Common-essential


### With Amplicon Architect Most Frequent Genes List:

##### The following is sorted by median value (most essential):

In [20]:
aa_avg_esen = avg_esen(aa_gene_list, trimmed_esen_df)
aa_avg_esen.sort_values(by="Median").head()

Unnamed: 0,Gene,Mean,Median,Standard Deviation,Category
60,RPL21,-2.709954,-2.80794,0.494277,Common-essential
293,PSMB3,-2.722501,-2.723508,0.408371,Common-essential
294,RPL23,-2.66305,-2.70147,0.422596,Common-essential
229,PSMB4,-2.32561,-2.304537,0.363837,Common-essential
133,RPL19,-2.281648,-2.255677,0.351545,Common-essential


# Generating Plots and Statistics:

## Individual Gene Plots: 

### AA Gene List:

In [21]:
aa_genes_plot_df = plot_genes(aa_gene_list, trimmed_esen_df, ec_aa_plus_ids, "aa")

Number of ecDNA+ cell lines: 121
Number of ecDNA- cell lines: 124


In [22]:
aa_genes_plot_df[aa_genes_plot_df['p-value'] < 0.5]

Unnamed: 0,Gene,u-stat,p-value
116,CITED4,5564.0,0.000477
160,MSL1,9379.0,0.000716
117,C19orf12,9368.0,0.000769
302,TACC1,9368.0,0.000769
119,URI1,5637.0,0.000774
...,...,...,...
289,CWC25,7105.0,0.474653
89,CTTN,7117.0,0.488127
279,B4GALNT2,7119.0,0.490392
180,DISC1,7120.0,0.491527
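
The `u-stat` and `p-value` columns suggest that each gene is tested with a Mann-Whitney U test comparing ecDNA+ and ecDNA- cell lines; that reading is an assumption about `plot_genes`, whose source is not shown here. A minimal sketch of such a per-gene comparison, using made-up scores and a two-sided alternative (also an assumption):

```python
from scipy import stats

# Made-up gene-effect scores for one gene in two groups of cell lines.
ecdna_plus_scores = [-1.2, -0.9, -1.5, -1.1, -0.8, -1.3]
ecdna_minus_scores = [-0.3, -0.1, -0.5, 0.0, -0.2, -0.4]

# Two-sided test of whether the two score distributions differ.
u_stat, p_value = stats.mannwhitneyu(
    ecdna_plus_scores, ecdna_minus_scores, alternative="two-sided"
)
print(f"u-stat: {u_stat}, p-value: {p_value:.6f}")
```

The test is rank-based, so it does not assume normally distributed gene-effect scores.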


### Literature Gene List:

In [23]:
literature_genes_plot_df = plot_genes(literature_gene_list, trimmed_esen_df, ec_aa_plus_ids, "literature")

Number of ecDNA+ cell lines: 121
Number of ecDNA- cell lines: 124


In [24]:
literature_genes_plot_df[literature_genes_plot_df['p-value'] < 0.05]

Unnamed: 0,Gene,u-stat,p-value
347,TMEM199,9622.0,0.000133
536,LIPG,9434.0,0.000496
457,MSL1,9379.0,0.000716
222,ATP6V1B2,9325.0,0.001016
254,DAPL1,9274.0,0.001402
387,CENPO,9269.0,0.001447
432,FZD7,5752.0,0.001608
33,WRN,9184.0,0.00243
71,MDM2,9170.0,0.002641
349,KCNMA1,5836.0,0.002673


## Individual Gene Plot (cell-lines with MYC ecDNA vs all else):

### Obtaining cell-lines that carry MYC on their ecDNA amplicons:

In [25]:
# Map each MYC-ecDNA CCLE name to its DepMap ModelID; the set comprehension removes duplicates
ec_aa_MYC_pos_ids = list({id_dict[ccle] for ccle in aa_gene_dict["MYC"][1]})

In [26]:
len(ec_aa_MYC_pos_ids)

26

### Generating plot:

In [27]:
aa_genes_plot_df = plot_genes(["MYC"], trimmed_esen_df, ec_aa_MYC_pos_ids, "MYC_RUN")
aa_genes_plot_df

Number of ecDNA+ cell lines: 21
Number of ecDNA- cell lines: 224


Unnamed: 0,Gene,u-stat,p-value
0,MYC,1619.0,0.018333


## Individual Cell-Line Plots: 

### Grouping Genes by Essentiality Categories:

### Literature Gene List:

#### Extracting the common-essential, selectively-essential, and non-essential genes:

In [28]:
common_esen_genes_list = list(literature_avg_esen[literature_avg_esen['Category'] == 'Common-essential']['Gene'])
selec_esen_genes_list = list(literature_avg_esen[literature_avg_esen['Category'] == 'Selectively-essential']['Gene'])
non_esen_genes_list = list(literature_avg_esen[literature_avg_esen['Category'] == 'Non-essential']['Gene'])

In [29]:
common_esen_genes_dict = {key: literature_gene_dict[key] for key in common_esen_genes_list if key in literature_gene_dict}
selec_esen_genes_dict = {key: literature_gene_dict[key] for key in selec_esen_genes_list if key in literature_gene_dict}
non_esen_genes_dict = {key: literature_gene_dict[key] for key in non_esen_genes_list if key in literature_gene_dict}

In [30]:
len(common_esen_genes_dict)

55

In [31]:
len(selec_esen_genes_dict)

333

In [32]:
len(non_esen_genes_dict)

172

#### Common-Essential Genes Plot:

In [33]:
CE_cell_line_df = cell_line_plot(common_esen_genes_dict, trimmed_esen_df, all_aa_ids, ec_aa_plus_ids, "CE_cell_line")

In [34]:
CE_cell_line_df[CE_cell_line_df['p-value'] < 0.05]

Unnamed: 0,Cell-Line,H-stat,p-value
242,ACH-000783 (ecDNA present),6.930234,0.031269
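
The `H-stat` column suggests a Kruskal-Wallis H-test per cell line. The reported p-values are consistent with a chi-square distribution with 2 degrees of freedom (for example, exp(-6.930234 / 2) ≈ 0.0313), which points to three groups being compared. What those groups are is not shown here; the sketch below assumes they are genes labelled UP, DOWN, and unlabelled in the literature list, and uses made-up scores.

```python
from scipy import stats

# Made-up gene-effect scores within one cell line, grouped by the
# literature direction label of each gene (assumed grouping).
up_genes = [-1.4, -1.1, -1.6, -1.3, -1.2]
down_genes = [-0.2, -0.4, -0.1, -0.3, -0.5]
unlabelled_genes = [-0.7, -0.8, -0.6, -0.9, -1.0]

# Rank-based test of whether at least one group differs from the others.
h_stat, p_value = stats.kruskal(up_genes, down_genes, unlabelled_genes)
print(f"H-stat: {h_stat:.3f}, p-value: {p_value:.6f}")
```

A significant result only says that some group differs; it does not say which one, so a follow-up pairwise comparison would be needed for that.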


#### Selectively-Essential Genes Plot:

In [35]:
SE_cell_line_df = cell_line_plot(selec_esen_genes_dict, trimmed_esen_df, all_aa_ids, ec_aa_plus_ids, "SE_cell_line")

In [36]:
SE_cell_line_df[SE_cell_line_df['p-value'] < 0.05]

Unnamed: 0,Cell-Line,H-stat,p-value
80,ACH-000595 (ecDNA absent),51.613126,6.199434e-12
203,ACH-000738 (ecDNA present),41.599279,9.264702e-10
69,ACH-000900 (ecDNA present),40.819003,1.368567e-09
21,ACH-000948 (ecDNA absent),40.539909,1.573514e-09
75,ACH-000756 (ecDNA present),40.263163,1.807029e-09
...,...,...,...
30,ACH-001129 (ecDNA absent),8.843757,1.201165e-02
78,ACH-000985 (ecDNA absent),7.908638,1.917172e-02
163,ACH-000139 (ecDNA absent),7.114331,2.851955e-02
17,ACH-000311 (ecDNA present),6.831990,3.284371e-02


#### Non-Essential Genes Plot:

In [37]:
NE_cell_line_df = cell_line_plot(non_esen_genes_dict, trimmed_esen_df, all_aa_ids, ec_aa_plus_ids, "NE_cell_line")

In [38]:
NE_cell_line_df[NE_cell_line_df['p-value'] < 0.05]

Unnamed: 0,Cell-Line,H-stat,p-value
121,ACH-000713 (ecDNA present),10.504596,0.005235
65,ACH-000943 (ecDNA absent),10.281577,0.005853
180,ACH-000486 (ecDNA absent),9.218197,0.009961
68,ACH-000939 (ecDNA absent),9.154954,0.010281
177,ACH-000986 (ecDNA present),8.559433,0.013847
176,ACH-000894 (ecDNA present),7.513751,0.023357
27,ACH-000343 (ecDNA absent),6.889644,0.03191
109,ACH-000219 (ecDNA absent),6.814067,0.033139
78,ACH-000985 (ecDNA absent),6.526741,0.038259


## Grouped Cell-Line Plots: 

### Calculating mean essentiality values across <b>all</b> cell-lines for each category:

In [39]:
meta_df_reset = meta_df.reset_index()
oncotree_lineage_dict = group_cell_lines(meta_df_reset, "ModelID", "OncotreeLineage")

In [40]:
oncotree_lineage_dict

{'Adrenal Gland': ['ACH-001401', 'ACH-001598'],
 'Ampulla of Vater': ['ACH-000182', 'ACH-000377', 'ACH-001862', 'ACH-002023'],
 'Biliary Tract': ['ACH-000141',
  'ACH-000209',
  'ACH-000268',
  'ACH-000461',
  'ACH-000808',
  'ACH-000976',
  'ACH-001494',
  'ACH-001536',
  'ACH-001537',
  'ACH-001538',
  'ACH-001607',
  'ACH-001619',
  'ACH-001673',
  'ACH-001834',
  'ACH-001835',
  'ACH-001836',
  'ACH-001838',
  'ACH-001839',
  'ACH-001841',
  'ACH-001842',
  'ACH-001843',
  'ACH-001844',
  'ACH-001845',
  'ACH-001846',
  'ACH-001847',
  'ACH-001848',
  'ACH-001849',
  'ACH-001850',
  'ACH-001852',
  'ACH-001854',
  'ACH-001855',
  'ACH-001856',
  'ACH-001857',
  'ACH-001858',
  'ACH-001861',
  'ACH-001863',
  'ACH-001864',
  'ACH-001959',
  'ACH-001960',
  'ACH-001961',
  'ACH-001997',
  'ACH-002237',
  'ACH-002312',
  'ACH-002647'],
 'Bladder/Urinary Tract': ['ACH-000011',
  'ACH-000018',
  'ACH-000026',
  'ACH-000127',
  'ACH-000142',
  'ACH-000242',
  'ACH-000384',
  'ACH-000396'

In [41]:
categories = oncotree_lineage_dict.keys()
avg_trimmed_esen_df = pd.DataFrame(index=categories, columns=literature_gene_list)

In [42]:
# Calculate the average essentiality scores for each category
for category, cell_lines in oncotree_lineage_dict.items():
    # Filter out cell lines not present in trimmed_esen_df
    valid_cell_lines = [cell_line for cell_line in cell_lines if cell_line in trimmed_esen_df.index]
    if valid_cell_lines:  # Only calculate if there are valid cell lines
        avg_trimmed_esen_df.loc[category] = trimmed_esen_df.loc[valid_cell_lines].mean()

In [43]:
avg_trimmed_esen_df.dropna(how='all', inplace=True)

Renaming with "-" rather than "/" to avoid errors:

In [44]:
avg_trimmed_esen_df.index = avg_trimmed_esen_df.index.str.replace('/', '-')

In [45]:
avg_trimmed_esen_df

Unnamed: 0,RAE1,FBXO17,YTHDF1,DDX27,SEC61G,CHMP7,XPO7,INTS9,GJC1,METTL1,...,CD48,RBMXL1,SNRPG,GMIP,DEK,MITD1,SLAMF7,LY9,C10orf125,SNRPA1
Bladder-Urinary Tract,-1.385351,-0.109342,0.237001,-1.080879,-1.190801,-0.634477,-0.123798,-1.592092,-0.067727,-0.618874,...,-0.07569,-0.704007,-2.17096,0.06253,0.112852,-0.11852,0.06375,-0.056838,,-2.551288
Bone,-1.557482,0.056929,0.004858,-0.996598,-1.022146,-0.724917,-0.30894,-1.985228,-0.102306,-0.931291,...,-0.149628,-0.664241,-2.041113,0.239531,0.121346,-0.243796,-0.012276,-0.098541,,-2.448317
Bowel,-1.478825,0.026605,0.182104,-1.298502,-1.10381,-0.454969,-0.088147,-1.615264,-0.092421,-0.476383,...,-0.106845,-0.635035,-2.14239,0.006667,0.121557,-0.051641,-0.043331,-0.111743,,-2.891235
Breast,-1.357247,0.008662,0.183749,-0.985533,-1.131014,-0.695307,-0.059311,-1.739938,-0.109713,-0.279289,...,-0.064765,-0.868978,-2.174167,0.025725,0.124396,-0.097753,-0.060342,-0.138997,,-3.00402
CNS-Brain,-1.5105,-0.103304,0.122558,-1.130269,-1.169407,-0.734394,-0.061533,-1.736273,-0.17996,-0.381729,...,-0.065231,-0.643529,-2.076679,0.028034,0.123782,-0.10399,0.02236,-0.128562,,-2.938365
Esophagus-Stomach,-1.461781,-0.051865,0.185681,-1.085251,-1.188414,-0.53965,-0.051487,-1.67677,-0.132192,-0.516337,...,-0.08242,-0.737726,-2.345009,0.03179,0.064371,-0.09314,-0.007267,-0.157906,,-2.942543
Head and Neck,-1.372551,-0.041591,0.164947,-1.035485,-1.274613,-0.788483,-0.098242,-1.852179,-0.187559,-0.486829,...,-0.134098,-0.793694,-2.008479,0.048077,0.072998,-0.099401,-0.039333,-0.063858,,-3.014571
Kidney,-1.460705,-0.014413,0.191184,-1.160363,-1.17639,-0.648485,0.095923,-1.724723,-0.087908,-0.519891,...,-0.085785,-0.809827,-2.224849,0.035601,0.02152,-0.077623,-0.031579,-0.160188,,-2.911835
Liver,-1.346214,-0.088742,0.180911,-1.054046,-1.276581,-0.580684,-0.02627,-1.905448,-0.118465,-0.3902,...,-0.05274,-0.824127,-2.088846,0.029885,0.116234,-0.072034,-0.046456,-0.188249,,-2.99713
Lung,-1.341175,-0.07513,0.183699,-1.06693,-1.203197,-0.666819,-0.056639,-1.722237,-0.166471,-0.392899,...,-0.095072,-0.792325,-2.045227,0.010372,0.114477,-0.094676,-0.051848,-0.123242,,-2.833861


### Calculating mean essentiality values across <b>ecDNA-present</b> cell-lines for each category:

In [46]:
ecdna_present_lineage_dict = {}

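# Names chosen to avoid shadowing the built-ins `list` and `id`
for lineage, cell_lines in oncotree_lineage_dict.items():
    ecdna_present_lineage_dict[lineage] = [cell_id for cell_id in cell_lines if cell_id in ec_aa_plus_ids]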

In [47]:
ecdna_present_lineage_dict

{'Adrenal Gland': [],
 'Ampulla of Vater': [],
 'Biliary Tract': [],
 'Bladder/Urinary Tract': ['ACH-000473',
  'ACH-000724',
  'ACH-000839',
  'ACH-000905'],
 'Bone': [],
 'Bowel': ['ACH-000253', 'ACH-000360', 'ACH-000842', 'ACH-000986'],
 'Breast': ['ACH-000019',
  'ACH-000044',
  'ACH-000097',
  'ACH-000111',
  'ACH-000117',
  'ACH-000147',
  'ACH-000196',
  'ACH-000212',
  'ACH-000223',
  'ACH-000248',
  'ACH-000276',
  'ACH-000277',
  'ACH-000330',
  'ACH-000349',
  'ACH-000536',
  'ACH-000554',
  'ACH-000568',
  'ACH-000621',
  'ACH-000643',
  'ACH-000691',
  'ACH-000725',
  'ACH-000759',
  'ACH-000783',
  'ACH-000818',
  'ACH-000857',
  'ACH-000859',
  'ACH-000910',
  'ACH-000927',
  'ACH-000930',
  'ACH-000934'],
 'CNS/Brain': ['ACH-000232',
  'ACH-000376',
  'ACH-000738',
  'ACH-000756',
  'ACH-000883'],
 'Cervix': [],
 'Embryonal': [],
 'Esophagus/Stomach': ['ACH-000047',
  'ACH-000144',
  'ACH-000351',
  'ACH-000356',
  'ACH-000408',
  'ACH-000488',
  'ACH-000507',
  'ACH-00

In [48]:
ec_dna_present_categories = ecdna_present_lineage_dict.keys()
avg_ecdna_present_trimmed_esen_df = pd.DataFrame(index=ec_dna_present_categories, columns=literature_gene_list)

In [49]:
# Calculate the average essentiality scores for each category
for category, cell_lines in ecdna_present_lineage_dict.items():
    # Filter out cell lines not present in trimmed_esen_df
    valid_ecdna_present_cell_lines = [cell_line for cell_line in cell_lines if cell_line in trimmed_esen_df.index]
    if valid_ecdna_present_cell_lines:  # Only calculate if there are valid cell lines
        avg_ecdna_present_trimmed_esen_df.loc[category] = trimmed_esen_df.loc[valid_ecdna_present_cell_lines].mean()

In [50]:
avg_ecdna_present_trimmed_esen_df.dropna(how='all', inplace=True)
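
The loop above fills one row per lineage and then drops the lineages that had no valid cell lines. As a rough sketch of the same idea, the per-lineage means can also be computed with a single `groupby`. The names `toy_esen_df` and `toy_lineage_dict` and all values below are hypothetical stand-ins for `trimmed_esen_df` and `ecdna_present_lineage_dict`, and the sketch assumes each cell line belongs to exactly one lineage.

```python
import pandas as pd

# Hypothetical toy data standing in for trimmed_esen_df (cell lines x genes)
toy_esen_df = pd.DataFrame(
    {"GENE_A": [-1.0, -2.0, 0.5], "GENE_B": [0.0, 0.2, -0.4]},
    index=["ACH-1", "ACH-2", "ACH-3"],
)

# Hypothetical stand-in for ecdna_present_lineage_dict;
# "ACH-9" has no essentiality data and "Bone" has no cell lines at all
toy_lineage_dict = {"Breast": ["ACH-1", "ACH-2", "ACH-9"], "Bone": [], "Lung": ["ACH-3"]}

# Map each cell line to its lineage, skipping cell lines without essentiality data
cell_line_to_lineage = pd.Series(
    {
        cell_line: lineage
        for lineage, cell_lines in toy_lineage_dict.items()
        for cell_line in cell_lines
        if cell_line in toy_esen_df.index
    },
    name="lineage",
)

# Group the rows by lineage and average each gene column
toy_avg_df = toy_esen_df.loc[cell_line_to_lineage.index].groupby(cell_line_to_lineage).mean()
print(toy_avg_df)
```

Lineages without any valid cell line never enter the grouping, so no separate `dropna` step is needed in this variant. The result also has a float dtype, whereas a frame created empty and filled row by row holds generic objects.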

Renaming with "-" rather than "/" to avoid errors:

In [51]:
avg_ecdna_present_trimmed_esen_df.index = avg_ecdna_present_trimmed_esen_df.index.str.replace('/', '-')

Adding an "ecDNA-present" tag to each category:

In [52]:
avg_ecdna_present_trimmed_esen_df.index = [str(index) + " (ecDNA-present)" for index in avg_ecdna_present_trimmed_esen_df.index]
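
The two relabelling steps (replacing "/" and appending a tag) are applied to both the ecDNA-present and the ecDNA-absent tables. A small helper could do both in one place. This is only a sketch; `tag_lineage_index` is a hypothetical name that does not exist in `utils.py`, and the toy frame is made up.

```python
import pandas as pd

def tag_lineage_index(df, tag):
    """Return a copy of df whose index has '/' replaced by '-' and ' (tag)' appended."""
    out = df.copy()
    # regex=False treats "/" as a literal character
    out.index = out.index.str.replace("/", "-", regex=False) + f" ({tag})"
    return out

# Hypothetical toy frame with one lineage name containing "/"
toy = pd.DataFrame({"GENE_A": [1.0, 2.0]}, index=["CNS/Brain", "Lung"])
tagged = tag_lineage_index(toy, "ecDNA-present")
print(tagged.index.tolist())
```

Returning a copy leaves the input frame untouched, so re-running a cell does not append the tag twice.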

In [53]:
avg_ecdna_present_trimmed_esen_df

Unnamed: 0,RAE1,FBXO17,YTHDF1,DDX27,SEC61G,CHMP7,XPO7,INTS9,GJC1,METTL1,...,CD48,RBMXL1,SNRPG,GMIP,DEK,MITD1,SLAMF7,LY9,C10orf125,SNRPA1
Bladder-Urinary Tract (ecDNA-present),-1.435676,-0.122214,0.291812,-0.943421,-1.26507,-0.552624,-0.110608,-1.590697,0.00665,-0.532538,...,-0.072827,-0.753185,-2.251058,0.079474,0.069233,-0.12556,0.10197,-0.018174,,-2.442733
Bowel (ecDNA-present),-1.594722,-0.037046,0.192765,-1.114558,-1.218088,-0.491755,-0.120123,-1.682627,-0.142233,-0.361603,...,-0.10836,-0.520982,-2.299181,-0.095669,0.045609,-0.039505,-0.016605,-0.057773,,-2.495908
Breast (ecDNA-present),-1.342546,0.013041,0.19115,-0.98242,-1.14297,-0.747882,-0.099879,-1.759594,-0.083409,-0.234939,...,-0.023669,-0.855086,-2.233241,0.015364,0.103057,-0.120455,-0.056963,-0.15912,,-3.039817
CNS-Brain (ecDNA-present),-1.340171,-0.090806,0.10218,-1.150205,-1.086505,-0.803394,0.016277,-1.503501,-0.156247,-0.47458,...,-0.035369,-0.617705,-2.206709,-0.001816,0.120474,-0.07546,0.009886,-0.146377,,-2.653801
Esophagus-Stomach (ecDNA-present),-1.50832,-0.040027,0.188233,-1.044836,-1.163893,-0.543904,-0.072742,-1.733175,-0.129732,-0.510876,...,-0.066614,-0.731809,-2.230609,0.041506,0.064018,-0.066949,-0.020541,-0.142646,,-2.96949
Head and Neck (ecDNA-present),-1.231776,0.018166,0.170376,-0.675031,-1.295484,-0.758057,-0.243278,-2.022795,-0.42261,-0.702315,...,0.032978,-0.452239,-1.553252,0.043443,-0.129803,0.030778,0.111898,-0.074504,,-3.149165
Kidney (ecDNA-present),-1.488507,-0.056142,0.194117,-1.092193,-1.296737,-0.582387,-0.037756,-2.064431,-0.042499,-0.496695,...,-0.169142,-0.821899,-2.174614,-0.06455,0.086211,-0.149975,-0.073592,-0.170988,,-3.061135
Liver (ecDNA-present),-1.311143,-0.153977,0.233477,-1.073833,-1.347051,-0.552038,-0.093409,-2.078646,-0.189642,-0.376786,...,-0.06265,-0.783201,-1.963516,0.000161,0.108069,-0.12229,-0.030042,-0.179248,,-2.954553
Lung (ecDNA-present),-1.334061,-0.085929,0.187267,-1.044751,-1.187783,-0.644205,-0.067149,-1.732749,-0.162725,-0.340142,...,-0.101798,-0.774632,-2.07825,0.010642,0.125405,-0.082617,-0.054034,-0.110666,,-2.883159
Lymphoid (ecDNA-present),-1.346705,-0.121636,0.249524,-0.898885,-1.193571,-0.310961,0.016591,-1.982902,-0.070356,-0.284596,...,-0.071388,-0.713319,-2.120497,0.014497,0.173001,-0.138473,0.022989,-0.21606,,-2.618652


### Calculating mean essentiality values across <b>ecDNA-absent</b> cell-lines for each category:

In [54]:
ecdna_absent_lineage_dict = {}

# For each lineage, keep only the cell-line IDs NOT found in ec_aa_plus_ids
for key, lst in oncotree_lineage_dict.items():
    ecdna_absent_lineage_dict[key] = [cell_id for cell_id in lst if cell_id not in ec_aa_plus_ids]

In [55]:
ecdna_absent_lineage_dict

{'Adrenal Gland': ['ACH-001401', 'ACH-001598'],
 'Ampulla of Vater': ['ACH-000182', 'ACH-000377', 'ACH-001862', 'ACH-002023'],
 'Biliary Tract': ['ACH-000141',
  'ACH-000209',
  'ACH-000268',
  'ACH-000461',
  'ACH-000808',
  'ACH-000976',
  'ACH-001494',
  'ACH-001536',
  'ACH-001537',
  'ACH-001538',
  'ACH-001607',
  'ACH-001619',
  'ACH-001673',
  'ACH-001834',
  'ACH-001835',
  'ACH-001836',
  'ACH-001838',
  'ACH-001839',
  'ACH-001841',
  'ACH-001842',
  'ACH-001843',
  'ACH-001844',
  'ACH-001845',
  'ACH-001846',
  'ACH-001847',
  'ACH-001848',
  'ACH-001849',
  'ACH-001850',
  'ACH-001852',
  'ACH-001854',
  'ACH-001855',
  'ACH-001856',
  'ACH-001857',
  'ACH-001858',
  'ACH-001861',
  'ACH-001863',
  'ACH-001864',
  'ACH-001959',
  'ACH-001960',
  'ACH-001961',
  'ACH-001997',
  'ACH-002237',
  'ACH-002312',
  'ACH-002647'],
 'Bladder/Urinary Tract': ['ACH-000011',
  'ACH-000018',
  'ACH-000026',
  'ACH-000127',
  'ACH-000142',
  'ACH-000242',
  'ACH-000384',
  'ACH-000396'
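
The ecDNA-present and ecDNA-absent dictionaries are complementary splits of the same lineage dictionary, so they can be built in one pass. The sketch below uses hypothetical toy IDs and assumes the ecDNA-positive IDs can be held in a `set` (the actual type of `ec_aa_plus_ids` is not shown in this section); a set makes each membership test constant-time.

```python
# Hypothetical stand-ins for oncotree_lineage_dict and ec_aa_plus_ids
toy_lineage_dict = {"Breast": ["ACH-1", "ACH-2"], "Bone": ["ACH-3"]}
toy_ecdna_plus_ids = {"ACH-2"}  # assumed to be a set for fast lookups

toy_present, toy_absent = {}, {}
for lineage, cell_ids in toy_lineage_dict.items():
    # Every cell line lands in exactly one of the two dictionaries
    toy_present[lineage] = [c for c in cell_ids if c in toy_ecdna_plus_ids]
    toy_absent[lineage] = [c for c in cell_ids if c not in toy_ecdna_plus_ids]

print(toy_present)
print(toy_absent)
```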

In [56]:
ec_dna_absent_categories = ecdna_absent_lineage_dict.keys()
avg_ecdna_absent_trimmed_esen_df = pd.DataFrame(index=ec_dna_absent_categories, columns=literature_gene_list)

In [57]:
# Calculate the average essentiality scores for each category
for category, cell_lines in ecdna_absent_lineage_dict.items():
    # Filter out cell lines not present in trimmed_esen_df
    valid_ecdna_absent_cell_lines = [cell_line for cell_line in cell_lines if cell_line in trimmed_esen_df.index]
    if valid_ecdna_absent_cell_lines:  # Only calculate if there are valid cell lines
        avg_ecdna_absent_trimmed_esen_df.loc[category] = trimmed_esen_df.loc[valid_ecdna_absent_cell_lines].mean()

In [58]:
avg_ecdna_absent_trimmed_esen_df.dropna(how='all', inplace=True)

Renaming with "-" rather than "/" to avoid errors:

In [59]:
avg_ecdna_absent_trimmed_esen_df.index = avg_ecdna_absent_trimmed_esen_df.index.str.replace('/', '-')

Adding an "ecDNA-absent" tag to each category:

In [60]:
avg_ecdna_absent_trimmed_esen_df.index = [str(index) + " (ecDNA-absent)" for index in avg_ecdna_absent_trimmed_esen_df.index]

In [61]:
avg_ecdna_absent_trimmed_esen_df

Unnamed: 0,RAE1,FBXO17,YTHDF1,DDX27,SEC61G,CHMP7,XPO7,INTS9,GJC1,METTL1,...,CD48,RBMXL1,SNRPG,GMIP,DEK,MITD1,SLAMF7,LY9,C10orf125,SNRPA1
Bladder-Urinary Tract (ecDNA-absent),-1.318251,-0.092179,0.163919,-1.264157,-1.091776,-0.743615,-0.141383,-1.593953,-0.166896,-0.733989,...,-0.079506,-0.638436,-2.064163,0.039938,0.171011,-0.109133,0.01279,-0.108389,,-2.696028
Bone (ecDNA-absent),-1.557482,0.056929,0.004858,-0.996598,-1.022146,-0.724917,-0.30894,-1.985228,-0.102306,-0.931291,...,-0.149628,-0.664241,-2.041113,0.239531,0.121346,-0.243796,-0.012276,-0.098541,,-2.448317
Bowel (ecDNA-absent),-1.462269,0.035698,0.180581,-1.324779,-1.087485,-0.449714,-0.083579,-1.605641,-0.085305,-0.49278,...,-0.106628,-0.651329,-2.113883,0.021286,0.132406,-0.053375,-0.047149,-0.119453,,-2.94771
Breast (ecDNA-absent),-1.407231,-0.006225,0.158584,-0.996117,-1.090367,-0.516553,0.078618,-1.67311,-0.199145,-0.43008,...,-0.204493,-0.916208,-1.952636,0.060952,0.196948,-0.020565,-0.071833,-0.070579,,-2.882308
CNS-Brain (ecDNA-absent),-1.578632,-0.108303,0.130709,-1.122295,-1.202568,-0.706795,-0.092658,-1.829382,-0.189446,-0.344589,...,-0.077176,-0.653858,-2.024668,0.039974,0.125105,-0.115402,0.02735,-0.121437,,-3.05219
Esophagus-Stomach (ecDNA-absent),-1.351251,-0.07998,0.17962,-1.181236,-1.246651,-0.529545,-0.001007,-1.542806,-0.138035,-0.529308,...,-0.11996,-0.751778,-2.602409,0.008714,0.065212,-0.155344,0.024259,-0.19415,,-2.878544
Head and Neck (ecDNA-absent),-1.396014,-0.05155,0.164043,-1.095561,-1.271134,-0.793555,-0.074069,-1.823743,-0.148384,-0.450915,...,-0.161944,-0.850603,-2.08435,0.048849,0.106799,-0.121098,-0.064538,-0.062084,,-2.992139
Kidney (ecDNA-absent),-1.449584,0.002279,0.190011,-1.187631,-1.128251,-0.674924,0.149395,-1.58884,-0.106072,-0.52917,...,-0.052442,-0.804998,-2.244943,0.075661,-0.004356,-0.048682,-0.014774,-0.155868,,-2.852115
Liver (ecDNA-absent),-1.369595,-0.045252,0.145867,-1.040854,-1.229601,-0.599781,0.01849,-1.789982,-0.071014,-0.399142,...,-0.046134,-0.85141,-2.1724,0.049701,0.121678,-0.038531,-0.057398,-0.19425,,-3.025515
Lung (ecDNA-absent),-1.357774,-0.049934,0.175372,-1.11868,-1.239163,-0.719583,-0.032114,-1.697709,-0.17521,-0.516001,...,-0.079378,-0.833608,-1.973679,0.009743,0.088979,-0.122815,-0.046748,-0.152588,,-2.718834


### Literature Gene List (All Cell-Lines):

#### Common-Essential Genes Plot:

In [78]:
CE_grouped_df = grouped_cell_line_plot(common_esen_genes_dict, avg_trimmed_esen_df, avg_trimmed_esen_df.index, "CE_grouped_all_cell_lines")

  sns.stripplot(x='Cell_Type', y='ecDNA_Score', data=df, jitter=True, color='black', edgecolor='gray', alpha=0.1)

In [79]:
CE_grouped_df[CE_grouped_df['p-value'] < 0.05]

Unnamed: 0,Cell-Line,H-stat,p-value
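
The `H-stat` and `p-value` columns come from `grouped_cell_line_plot` in `utils.py`, which is not shown here. The reported pairs are consistent with a Kruskal-Wallis H test across three gene groups, since each p-value in the tables below equals exp(-H/2), the chi-square survival function with 2 degrees of freedom. The following is a minimal sketch under that assumption, using made-up scores rather than DepMap data.

```python
import math
from scipy import stats

# Hypothetical essentiality scores for three gene groups within one lineage
group_a = [-1.2, -0.8, -1.5, -0.9, -1.1]
group_b = [-0.1, 0.2, -0.3, 0.0, 0.1]
group_c = [-0.5, -0.6, -0.4, -0.7, -0.2]

# Kruskal-Wallis H test: a rank-based comparison of the three groups
h_stat, p_value = stats.kruskal(group_a, group_b, group_c)

# With three groups the reference distribution is chi-square with 2 degrees
# of freedom, whose survival function is exp(-H / 2)
print(h_stat, p_value, math.exp(-h_stat / 2))
```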


#### Selectively-Essential Genes Plot:

In [80]:
SE_grouped_df = grouped_cell_line_plot(selec_esen_genes_dict, avg_trimmed_esen_df, avg_trimmed_esen_df.index, "SE_grouped_all_cell_lines")

  sns.stripplot(x='Cell_Type', y='ecDNA_Score', data=df, jitter=True, color='black', edgecolor='gray', alpha=0.1)

In [81]:
SE_grouped_df[SE_grouped_df['p-value'] < 0.05]

Unnamed: 0,Cell-Line,H-stat,p-value
20,Uterus,52.762279,3.489929e-12
4,CNS-Brain,50.691666,9.827535e-12
6,Head and Neck,48.247153,3.33629e-11
5,Esophagus-Stomach,46.392351,8.433909e-11
7,Kidney,43.819938,3.052259e-10
17,Skin,41.0868,1.197057e-09
0,Bladder-Urinary Tract,41.00874,1.244702e-09
8,Liver,40.804011,1.378864e-09
9,Lung,38.946509,3.490382e-09
12,Ovary-Fallopian Tube,38.857785,3.648708e-09
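
Each of these tables holds one test per lineage, so the `p-value < 0.05` filter is applied to many tests at once. One optional way to account for that, which is not part of the original analysis, is a Benjamini-Hochberg adjustment. The sketch below implements it with NumPy only; the function name and the example p-values are hypothetical.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by n / i
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted

# Hypothetical p-values, one per lineage
toy_p_values = [0.01, 0.04, 0.03, 0.5]
toy_adjusted = benjamini_hochberg(toy_p_values)
print(toy_adjusted)
```

The adjusted values could be stored in an extra column and filtered in the same way as `p-value`.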


#### Non-Essential Genes Plot:

In [82]:
NE_grouped_df = grouped_cell_line_plot(non_esen_genes_dict, avg_trimmed_esen_df, avg_trimmed_esen_df.index, "NE_grouped_all_cell_lines")

  sns.stripplot(x='Cell_Type', y='ecDNA_Score', data=df, jitter=True, color='black', edgecolor='gray', alpha=0.1)

In [83]:
NE_grouped_df[NE_grouped_df['p-value'] < 0.05]

Unnamed: 0,Cell-Line,H-stat,p-value


### Literature Gene List (ecDNA-present Cell Lines):

#### Common-Essential Genes Plot:

In [84]:
CE_ecdna_present_grouped_df = grouped_cell_line_plot(common_esen_genes_dict, avg_ecdna_present_trimmed_esen_df, avg_ecdna_present_trimmed_esen_df.index, "CE_grouped_ecDNA_present_cell_lines")

  sns.stripplot(x='Cell_Type', y='ecDNA_Score', data=df, jitter=True, color='black', edgecolor='gray', alpha=0.1)

In [85]:
CE_ecdna_present_grouped_df[CE_ecdna_present_grouped_df['p-value'] < 0.05]

Unnamed: 0,Cell-Line,H-stat,p-value


In [111]:
CE_ecdna_present_grouped_df

Unnamed: 0,Cell-Line,H-stat,p-value
1,Bowel (ecDNA-present),2.320034,0.313481
2,Breast (ecDNA-present),2.067996,0.355582
9,Lymphoid (ecDNA-present),1.618546,0.445182
14,Pleura (ecDNA-present),1.473485,0.478671
15,Prostate (ecDNA-present),1.415146,0.492839
10,Myeloid (ecDNA-present),1.337048,0.512465
17,Thyroid (ecDNA-present),1.097882,0.577561
0,Bladder-Urinary Tract (ecDNA-present),1.092568,0.579098
13,Peripheral Nervous System (ecDNA-present),0.949907,0.621914
18,Uterus (ecDNA-present),0.843043,0.656048


#### Selectively-Essential Genes Plot:

In [86]:
SE_ecdna_present_grouped_df = grouped_cell_line_plot(selec_esen_genes_dict, avg_ecdna_present_trimmed_esen_df, avg_ecdna_present_trimmed_esen_df.index, "SE_grouped_ecDNA_present_cell_lines")

  sns.stripplot(x='Cell_Type', y='ecDNA_Score', data=df, jitter=True, color='black', edgecolor='gray', alpha=0.1)

In [87]:
SE_ecdna_present_grouped_df[SE_ecdna_present_grouped_df['p-value'] < 0.05]

Unnamed: 0,Cell-Line,H-stat,p-value
16,Skin (ecDNA-present),53.385127,2.556036e-12
3,CNS-Brain (ecDNA-present),51.376667,6.97748e-12
6,Kidney (ecDNA-present),44.8753,1.800746e-10
4,Esophagus-Stomach (ecDNA-present),44.00968,2.775999e-10
7,Liver (ecDNA-present),40.876941,1.32949e-09
11,Ovary-Fallopian Tube (ecDNA-present),39.776378,2.304991e-09
2,Breast (ecDNA-present),38.991091,3.413439e-09
18,Uterus (ecDNA-present),38.761213,3.829213e-09
8,Lung (ecDNA-present),37.836541,6.079944e-09
5,Head and Neck (ecDNA-present),32.711469,7.884875e-08


In [112]:
SE_ecdna_present_grouped_df

Unnamed: 0,Cell-Line,H-stat,p-value
16,Skin (ecDNA-present),53.385127,2.556036e-12
3,CNS-Brain (ecDNA-present),51.376667,6.97748e-12
6,Kidney (ecDNA-present),44.8753,1.800746e-10
4,Esophagus-Stomach (ecDNA-present),44.00968,2.775999e-10
7,Liver (ecDNA-present),40.876941,1.32949e-09
11,Ovary-Fallopian Tube (ecDNA-present),39.776378,2.304991e-09
2,Breast (ecDNA-present),38.991091,3.413439e-09
18,Uterus (ecDNA-present),38.761213,3.829213e-09
8,Lung (ecDNA-present),37.836541,6.079944e-09
5,Head and Neck (ecDNA-present),32.711469,7.884875e-08


#### Non-Essential Genes Plot:

In [88]:
NE_ecdna_present_grouped_df = grouped_cell_line_plot(non_esen_genes_dict, avg_ecdna_present_trimmed_esen_df, avg_ecdna_present_trimmed_esen_df.index, "NE_grouped_ecDNA_present_cell_lines")

  sns.stripplot(x='Cell_Type', y='ecDNA_Score', data=df, jitter=True, color='black', edgecolor='gray', alpha=0.1)

In [89]:
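# Note: this filters NE_grouped_df (all cell lines), not the
# NE_ecdna_present_grouped_df computed in the cell above
NE_grouped_df[NE_grouped_df['p-value'] < 0.05]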

Unnamed: 0,Cell-Line,H-stat,p-value


In [113]:
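# Note: this displays NE_grouped_df (all cell lines); the ecDNA-present
# results are in NE_ecdna_present_grouped_df
NE_grouped_df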

Unnamed: 0,Cell-Line,H-stat,p-value
13,Pancreas,5.738033,0.056755
18,Soft Tissue,4.939584,0.084602
2,Bowel,4.017731,0.134141
3,Breast,3.118024,0.210344
9,Lung,2.013987,0.365316
16,Prostate,1.832198,0.400077
17,Skin,1.800219,0.406525
6,Head and Neck,1.095159,0.578348
12,Ovary-Fallopian Tube,1.062437,0.587888
11,Myeloid,0.904818,0.636094


### Literature Gene List (ecDNA-absent Cell Lines):

#### Common-Essential Genes Plot:

In [90]:
CE_ecdna_absent_grouped_df = grouped_cell_line_plot(common_esen_genes_dict, avg_ecdna_absent_trimmed_esen_df, avg_ecdna_absent_trimmed_esen_df.index, "CE_grouped_ecDNA_absent_cell_lines")

  sns.stripplot(x='Cell_Type', y='ecDNA_Score', data=df, jitter=True, color='black', edgecolor='gray', alpha=0.1)

In [91]:
CE_ecdna_absent_grouped_df[CE_ecdna_absent_grouped_df['p-value'] < 0.05]

Unnamed: 0,Cell-Line,H-stat,p-value


In [114]:
CE_ecdna_absent_grouped_df

Unnamed: 0,Cell-Line,H-stat,p-value
1,Bone (ecDNA-absent),2.373073,0.305277
15,Pleura (ecDNA-absent),1.621985,0.444417
4,CNS-Brain (ecDNA-absent),1.357328,0.507294
14,Peripheral Nervous System (ecDNA-absent),1.266048,0.530984
10,Lymphoid (ecDNA-absent),1.236364,0.538923
18,Soft Tissue (ecDNA-absent),1.053803,0.590431
16,Prostate (ecDNA-absent),0.947705,0.622599
11,Myeloid (ecDNA-absent),0.917305,0.632135
12,Ovary-Fallopian Tube (ecDNA-absent),0.664193,0.717418
0,Bladder-Urinary Tract (ecDNA-absent),0.588407,0.745125


#### Selectively-Essential Genes Plot:

In [92]:
SE_ecdna_absent_grouped_df = grouped_cell_line_plot(selec_esen_genes_dict, avg_ecdna_absent_trimmed_esen_df, avg_ecdna_absent_trimmed_esen_df.index, "SE_grouped_ecDNA_absent_cell_lines")

  sns.stripplot(x='Cell_Type', y='ecDNA_Score', data=df, jitter=True, color='black', edgecolor='gray', alpha=0.1)

In [94]:
SE_ecdna_absent_grouped_df[SE_ecdna_absent_grouped_df['p-value'] < 0.05]

Unnamed: 0,Cell-Line,H-stat,p-value
19,Uterus (ecDNA-absent),50.827219,9.18353e-12
4,CNS-Brain (ecDNA-absent),47.686684,4.415381e-11
6,Head and Neck (ecDNA-absent),45.474787,1.334368e-10
0,Bladder-Urinary Tract (ecDNA-absent),44.917568,1.763088e-10
5,Esophagus-Stomach (ecDNA-absent),43.862295,2.988297e-10
9,Lung (ecDNA-absent),39.869477,2.200154e-09
1,Bone (ecDNA-absent),37.863845,5.997502e-09
7,Kidney (ecDNA-absent),37.536917,7.062556e-09
8,Liver (ecDNA-absent),36.290846,1.31687e-08
17,Skin (ecDNA-absent),35.897974,1.602707e-08


In [115]:
SE_ecdna_absent_grouped_df

Unnamed: 0,Cell-Line,H-stat,p-value
19,Uterus (ecDNA-absent),50.827219,9.18353e-12
4,CNS-Brain (ecDNA-absent),47.686684,4.415381e-11
6,Head and Neck (ecDNA-absent),45.474787,1.334368e-10
0,Bladder-Urinary Tract (ecDNA-absent),44.917568,1.763088e-10
5,Esophagus-Stomach (ecDNA-absent),43.862295,2.988297e-10
9,Lung (ecDNA-absent),39.869477,2.200154e-09
1,Bone (ecDNA-absent),37.863845,5.997502e-09
7,Kidney (ecDNA-absent),37.536917,7.062556e-09
8,Liver (ecDNA-absent),36.290846,1.31687e-08
17,Skin (ecDNA-absent),35.897974,1.602707e-08


#### Non-Essential Genes Plot:

In [95]:
NE_ecdna_absent_grouped_df = grouped_cell_line_plot(non_esen_genes_dict, avg_ecdna_absent_trimmed_esen_df, avg_ecdna_absent_trimmed_esen_df.index, "NE_grouped_ecDNA_absent_cell_lines")

  sns.stripplot(x='Cell_Type', y='ecDNA_Score', data=df, jitter=True, color='black', edgecolor='gray', alpha=0.1)

In [96]:
NE_ecdna_absent_grouped_df[NE_ecdna_absent_grouped_df['p-value'] < 0.05]

Unnamed: 0,Cell-Line,H-stat,p-value


In [116]:
NE_ecdna_absent_grouped_df

Unnamed: 0,Cell-Line,H-stat,p-value
13,Pancreas (ecDNA-absent),5.657954,0.059073
18,Soft Tissue (ecDNA-absent),4.939584,0.084602
2,Bowel (ecDNA-absent),4.205037,0.122148
3,Breast (ecDNA-absent),3.083176,0.214041
9,Lung (ecDNA-absent),2.995492,0.223634
16,Prostate (ecDNA-absent),2.929517,0.231134
17,Skin (ecDNA-absent),2.640736,0.267037
19,Uterus (ecDNA-absent),2.337979,0.310681
12,Ovary-Fallopian Tube (ecDNA-absent),1.92819,0.381328
0,Bladder-Urinary Tract (ecDNA-absent),1.458729,0.482215


# Further Oncogroup Analysis:

In [None]:
CE_ecdna_present_grouped_names = CE_ecdna_present_grouped_df["Cell-Line"].tolist()
CE_ecdna_present_grouped_names = [name.split("(")[0][:-1] for name in CE_ecdna_present_grouped_names]
CE_ecdna_absent_grouped_names = CE_ecdna_absent_grouped_df["Cell-Line"].tolist()
CE_ecdna_absent_grouped_names = [name.split("(")[0][:-1] for name in CE_ecdna_absent_grouped_names]

SE_ecdna_present_grouped_names = SE_ecdna_present_grouped_df["Cell-Line"].tolist()
SE_ecdna_present_grouped_names = [name.split("(")[0][:-1] for name in SE_ecdna_present_grouped_names]
SE_ecdna_absent_grouped_names = SE_ecdna_absent_grouped_df["Cell-Line"].tolist()
SE_ecdna_absent_grouped_names = [name.split("(")[0][:-1] for name in SE_ecdna_absent_grouped_names]

# Strip the " (ecDNA-present)" / " (ecDNA-absent)" tag from the NE labels
# (iterating over the NE lists rather than the SE lists)
NE_ecdna_present_grouped_names = NE_ecdna_present_grouped_df["Cell-Line"].tolist()
NE_ecdna_present_grouped_names = [name.split("(")[0][:-1] for name in NE_ecdna_present_grouped_names]
NE_ecdna_absent_grouped_names = NE_ecdna_absent_grouped_df["Cell-Line"].tolist()
NE_ecdna_absent_grouped_names = [name.split("(")[0][:-1] for name in NE_ecdna_absent_grouped_names]

## CE genes:

In [172]:
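# Note: oncotree_lineage_dict keys keep "/" (e.g. "CNS/Brain"), while the grouped tables use "-",
# so lineages whose names contain "/" are never matched by this loop (nor by the SE and NE loops below)
for group in oncotree_lineage_dict: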
    if (group in CE_ecdna_present_grouped_names) and (group in CE_ecdna_absent_grouped_names):
        print(group + ":")
        print(CE_ecdna_present_grouped_df[CE_ecdna_present_grouped_df['Cell-Line'] == (group + " (ecDNA-present)")])
        print(CE_ecdna_absent_grouped_df[CE_ecdna_absent_grouped_df['Cell-Line'] == (group + " (ecDNA-absent)")])
        print("---------------------------------------------------")

Bowel:
               Cell-Line    H-stat   p-value
1  Bowel (ecDNA-present)  2.320034  0.313481
              Cell-Line    H-stat   p-value
2  Bowel (ecDNA-absent)  0.433395  0.805173
---------------------------------------------------
Breast:
                Cell-Line    H-stat   p-value
2  Breast (ecDNA-present)  2.067996  0.355582
               Cell-Line    H-stat   p-value
3  Breast (ecDNA-absent)  0.230798  0.891011
---------------------------------------------------
Head and Neck:
                       Cell-Line    H-stat   p-value
5  Head and Neck (ecDNA-present)  0.367006  0.832349
                      Cell-Line    H-stat   p-value
6  Head and Neck (ecDNA-absent)  0.100649  0.950921
---------------------------------------------------
Kidney:
                Cell-Line    H-stat   p-value
6  Kidney (ecDNA-present)  0.253368  0.881012
               Cell-Line    H-stat   p-value
7  Kidney (ecDNA-absent)  0.310111  0.856368
---------------------------------------------------
Li

## SE genes:

In [171]:
for group in oncotree_lineage_dict:
    if (group in SE_ecdna_present_grouped_names) and (group in SE_ecdna_absent_grouped_names):
        print(group + ":")
        print(SE_ecdna_present_grouped_df[SE_ecdna_present_grouped_df['Cell-Line'] == (group + " (ecDNA-present)")])
        print(SE_ecdna_absent_grouped_df[SE_ecdna_absent_grouped_df['Cell-Line'] == (group + " (ecDNA-absent)")])
        print("---------------------------------------------------")

Bowel:
               Cell-Line     H-stat       p-value
1  Bowel (ecDNA-present)  27.886451  8.801043e-07
              Cell-Line     H-stat       p-value
2  Bowel (ecDNA-absent)  28.052163  8.101217e-07
---------------------------------------------------
Breast:
                Cell-Line     H-stat       p-value
2  Breast (ecDNA-present)  38.991091  3.413439e-09
               Cell-Line     H-stat       p-value
3  Breast (ecDNA-absent)  28.965316  5.131702e-07
---------------------------------------------------
Head and Neck:
                       Cell-Line     H-stat       p-value
5  Head and Neck (ecDNA-present)  32.711469  7.884875e-08
                      Cell-Line     H-stat       p-value
6  Head and Neck (ecDNA-absent)  45.474787  1.334368e-10
---------------------------------------------------
Kidney:
                Cell-Line   H-stat       p-value
6  Kidney (ecDNA-present)  44.8753  1.800746e-10
               Cell-Line     H-stat       p-value
7  Kidney (ecDNA-absent)  37

## NE genes:

In [170]:
for group in oncotree_lineage_dict:
    if (group in NE_ecdna_present_grouped_names) and (group in NE_ecdna_absent_grouped_names):
        print(group + ":")
        print(NE_ecdna_present_grouped_df[NE_ecdna_present_grouped_df['Cell-Line'] == (group + " (ecDNA-present)")])
        print(NE_ecdna_absent_grouped_df[NE_ecdna_absent_grouped_df['Cell-Line'] == (group + " (ecDNA-absent)")])
        print("---------------------------------------------------")
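
The three loops above print the ecDNA-present and ecDNA-absent rows for each shared lineage one after another. They iterate over the keys of `oncotree_lineage_dict`, which still contain "/", while the grouped tables use "-". This is why lineages such as CNS/Brain do not appear in the printed output even though they are present in both tables. As a sketch of an alternative, the lineage name can be recovered from each label and the two tables joined on it. The helper names and toy values below are hypothetical.

```python
import pandas as pd

def strip_tag(label):
    """Drop a trailing ' (ecDNA-...)' tag, e.g. 'Bowel (ecDNA-present)' -> 'Bowel'."""
    return label.rsplit(" (", 1)[0]

def compare_present_absent(present_df, absent_df):
    """Join the two result tables on lineage, keeping lineages found in both."""
    present = present_df.assign(Lineage=present_df["Cell-Line"].map(strip_tag)).drop(columns="Cell-Line")
    absent = absent_df.assign(Lineage=absent_df["Cell-Line"].map(strip_tag)).drop(columns="Cell-Line")
    # Inner join: lineages missing from either table are left out
    return present.merge(absent, on="Lineage", suffixes=(" (present)", " (absent)"))

# Hypothetical toy tables in the same layout as the grouped result tables
toy_present_df = pd.DataFrame({
    "Cell-Line": ["Bowel (ecDNA-present)", "CNS-Brain (ecDNA-present)"],
    "H-stat": [2.0, 1.0],
    "p-value": [0.37, 0.61],
})
toy_absent_df = pd.DataFrame({
    "Cell-Line": ["Bowel (ecDNA-absent)", "Bone (ecDNA-absent)", "CNS-Brain (ecDNA-absent)"],
    "H-stat": [0.5, 2.4, 1.3],
    "p-value": [0.78, 0.30, 0.52],
})

toy_comparison = compare_present_absent(toy_present_df, toy_absent_df)
print(toy_comparison)
```

Because the join key is derived from the table labels themselves, names containing "-" in place of "/" match without any further renaming.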