# medullo-diff
The prevalence of ecDNA in medulloblastoma was estimated at 18% in the previous Nature Genetics paper; in pedpancan, however, it is 16%.
- Is this statistically different?
- Why the difference?
    - The samples that differ are the ICGC samples
    - Could be due to hg19 alignment, or to adult patients
    
## Conclusion
The difference is driven by 7 tumors previously classified as ecDNA+ MB and now classified as ecDNA-. In 6 of the 7 cases, a low-copy cyclic amplicon was previously detected, which may represent either ecDNA or an HSR. Because of differences in our updated methods, these low-copy amplifications are not detected in this analysis. The remaining tumor was previously classified as ecDNA+ on weak evidence of a cyclic amplification; this was probably a false positive and is corrected herein.

In [None]:
import pandas as pd
import sys
sys.path.append('../src')
from data_imports import *
import scipy.stats


In [None]:
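# Workbook holding the medulloblastoma cohort tables (local path).
MEDULLO_TABLES_PATH = "../data/local/41588_2023_1551_MOESM4_ESM.xlsx"

def import_medullo_biosamples():
    """Load the WGS sample cohort sheet, indexed by its first column."""
    return pd.read_excel(MEDULLO_TABLES_PATH, sheet_name="2 WGS Sample Cohort", index_col=0)

def import_medullo_patients():
    """Load the WGS patient cohort sheet, indexed by its first column."""
    return pd.read_excel(MEDULLO_TABLES_PATH, sheet_name="1 WGS Patient Cohort", index_col=0)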

In [None]:
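bs = import_biosamples()
bsm = import_medullo_biosamples()
# Restrict pedpancan to medulloblastoma (MBL) samples.
bs = bs[bs.cancer_type == "MBL"]
# Drop medullo samples that pedpancan excludes from its unique patient set,
# then keep only the unique patient set on the pedpancan side.
bsm = bsm[~bsm.index.isin(bs[~bs.in_unique_patient_set].index)]
bs = bs[bs.in_unique_patient_set]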

In [None]:
medullo_only = set(bsm.index)-set(bs.index)
in_both = set(bsm.index)&set(bs.index)
pedpancan_only = set(bs.index)-set(bsm.index)

In [None]:
print(f"samples in pedpancan not in medullo: {len(pedpancan_only)}")
# It is unclear why BS_M16CDR44 is not included in the medullo dataset, but this is of no consequence.
print(f"samples in medullo not in pedpancan: {len(medullo_only)}")

In [None]:
# Any different classifications?
bsm_bs = bsm[bsm.index.isin(in_both)].copy()
bsm_bs["medullo_ecDNA"] = bsm_bs.ecDNA > 0
bsm_bs["pedpancan_ecDNA"] = bs.loc[bsm_bs.index,"ecDNA_sequences_detected"] > 0
bsm_bs.drop(["ecDNA","Aliases"],axis=1,inplace=True)
print(pd.crosstab(bsm_bs.medullo_ecDNA, bsm_bs.pedpancan_ecDNA)) # oh dear
bsm_bs[bsm_bs.medullo_ecDNA != bsm_bs.pedpancan_ecDNA]
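
Because the two classifications are made on the same tumors, one way to ask whether the reclassification is directional is an exact McNemar test on the discordant pairs. The sketch below is not part of the original analysis: the helper `mcnemar_exact` and the toy calls are hypothetical, and it assumes SciPy 1.7 or later for `scipy.stats.binomtest`.

```python
import pandas as pd
from scipy.stats import binomtest


def mcnemar_exact(old_calls, new_calls):
    """Exact McNemar test on paired boolean calls (hypothetical helper).

    Returns (lost, gained, p_value), where `lost` counts samples that were
    positive before and negative now, and `gained` counts the reverse.
    """
    old_calls = pd.Series(old_calls, dtype=bool)
    new_calls = pd.Series(new_calls, dtype=bool)
    lost = int((old_calls & ~new_calls).sum())
    gained = int((~old_calls & new_calls).sum())
    n_discordant = lost + gained
    if n_discordant == 0:
        # No discordant pairs: there is nothing to test.
        return lost, gained, 1.0
    # Under the null, each discordant pair is equally likely to go either way.
    p_value = binomtest(lost, n_discordant, 0.5, alternative="two-sided").pvalue
    return lost, gained, p_value


# Toy data only; these are NOT the calls from either cohort.
toy = pd.DataFrame({
    "old": [True] * 5 + [False] * 5,
    "new": [True, True, False, False, False] + [False] * 4 + [True],
})
lost, gained, p_value = mcnemar_exact(toy["old"], toy["new"])
print(lost, gained, p_value)
```

With the notebook's own table, the call would presumably be `mcnemar_exact(bsm_bs.medullo_ecDNA, bsm_bs.pedpancan_ecDNA)`.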

In [None]:
# Is the prevalence of ecDNA significantly different between the two cohorts?
def get_all_biosamples(bs, bsm, in_both):
    # Use assign() so the caller's frames are not modified in place.
    bs = bs.assign(
        ecDNA=bs.amplicon_class == 'ecDNA',
        cohort=bs.index.map(lambda x: 'both' if x in in_both else 'pedpancan'),
    )[['ecDNA', 'cohort']]
    # Samples shared with pedpancan are already counted under 'both'.
    bsm = bsm[~bsm.index.isin(in_both)]
    bsm = bsm.assign(ecDNA=bsm.ecDNA > 0, cohort='medullo')[['ecDNA', 'cohort']]
    return pd.concat([bs, bsm])
bs_all = get_all_biosamples(bs, bsm, in_both)

contingency_tbl = pd.crosstab(bs_all.ecDNA > 0, bs_all.cohort)
print(scipy.stats.chi2_contingency(contingency_tbl))
print(contingency_tbl)
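
The printed result of `chi2_contingency` is easier to read when unpacked into named values, and the expected counts are worth checking because the chi-square approximation is usually considered unreliable when any expected cell count is below about 5. A minimal sketch follows, using a toy table with hypothetical counts rather than the cohort data:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy 2 x 3 table with hypothetical counts; NOT the real cohort numbers.
toy_tbl = pd.DataFrame(
    {"both": [250, 40], "medullo": [150, 35], "pedpancan": [45, 8]},
    index=pd.Index([False, True], name="ecDNA"),
)

# The result unpacks as (statistic, p-value, degrees of freedom, expected counts).
chi2, p_value, dof, expected = chi2_contingency(toy_tbl)

# Label the expected counts so they can be compared with the observed table.
expected = pd.DataFrame(expected, index=toy_tbl.index, columns=toy_tbl.columns)
min_expected = expected.to_numpy().min()

print(f"chi2={chi2:.3f}, dof={dof}, p={p_value:.3f}")
print(f"smallest expected count: {min_expected:.1f}")
```

If the smallest expected count is low, an exact test on a reduced 2 x 2 table may be a safer choice.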

In [None]:
# Prevalence from hard-coded counts: 55 / 342, consistent with the ~16% quoted above.
55/(55+144+143)
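
A point estimate like the one above is easier to compare with the earlier 18% figure when it comes with an interval. The sketch below computes a Wilson confidence interval for the same counts; the helper `prevalence_with_ci` is hypothetical, not part of the original analysis, and assumes SciPy 1.7 or later.

```python
from scipy.stats import binomtest


def prevalence_with_ci(n_positive, n_total, confidence_level=0.95):
    """Return (estimate, low, high) using a Wilson interval (hypothetical helper)."""
    estimate = n_positive / n_total
    ci = binomtest(n_positive, n_total).proportion_ci(
        confidence_level=confidence_level, method="wilson"
    )
    return estimate, ci.low, ci.high


# Same counts as the cell above.
estimate, low, high = prevalence_with_ci(55, 55 + 144 + 143)
print(f"prevalence = {estimate:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```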