This script prepares summary statistics files for later use as input to Popcorn, a program for estimating the correlation of causal variant effect sizes across populations, using the Pre-Confluence Project data.

In [1]:
import os
import pandas as pd

!dx ls

.Notebook_archive/
.Notebook_snapshots/
Aim1_Heritability/
Aim2_Polygenicity/
Aim3_WithinAncestryGeneticCorrelation/
Aim4_CrossAncestryGeneticCorrelation/
Aim5_HeritabilityByFunctionalAnnotations/
LD_Score/
images/


In [2]:
!dx cd
!dx cd Aim4_CrossAncestryGeneticCorrelation/jahagirdarob/sumstats_files

# Prepare Population Overall Breast Cancer Meta Summary Statistics Data

In [3]:
!ls "/mnt/project/Aim2_Polygenicity/HZ/Results/Clean_summary_data"     # lists contents of cleaned summary statistics folder

AA_ERneg_sumdata.txt	       European_BCAC_icogs_onco_sumdata.txt
AA_ERpos_sumdata.txt	       European_ERneg_BCAC_icogs_onco_sumdata.txt
AA_overall_sumdata.txt	       European_ERpos_BCAC_icogs_onco_sumdata.txt
EAS_BCAC_BBJ_meta_sumdata.txt


## African Data

Take a look at the summary statistics format.

In [15]:
!head /mnt/project/Aim2_Polygenicity/HZ/Results/Clean_summary_data/AA_overall_sumdata.txt

ID CHR POS effect_allele non_effect_allele Freq_effect BETA SE P imputation_r2 N_case N_control N_eff
1:752721:A:G 1 752721 G A 0.33238160 -0.01572389 0.02353067 0.50398718 0.93063200 9235 10184 4069
1:753474:C:G 1 753474 G C 0.25716098 -0.01792483 0.02616185 0.49324830 0.87351600 9235 10184 3824
1:754182:A:G 1 754182 G A 0.40037756 -0.02450959 0.02328748 0.29257958 0.84690800 9235 10184 3840
1:754192:A:G 1 754192 G A 0.40039101 -0.02450492 0.02328869 0.29269662 0.84675800 9235 10184 3839
1:754334:T:C 1 754334 C T 0.40098838 -0.02574705 0.02326093 0.26834600 0.84834800 9235 10184 3847
1:754503:G:A 1 754503 A G 0.38182553 -0.02245379 0.02365361 0.34248058 0.84828800 9235 10184 3786
1:754964:C:T 1 754964 T C 0.38158567 -0.02302763 0.02368161 0.33085934 0.84523800 9235 10184 3778
1:757640:G:A 1 757640 A G 0.29703346 -0.01269722 0.02507091 0.61253914 0.85268200 9235 10184 3809
1:761732:C:T 1 761732 T C 0.26941848 -0.02283542 0.02595350 0.37893576 0.83990400 9235 10184 3771
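
The header above lists three sample-size columns (N_case, N_control, N_eff). The file does not say how N_eff was derived. For the first data row, the reported value is numerically close to 1 / (2 f (1 - f) SE^2), where f is Freq_effect. The sketch below checks this and contrasts it with the total sample size N_case + N_control. The formula is an assumption inferred from the numbers shown, not a documented definition.

```python
# Values copied from the first data row of AA_overall_sumdata.txt shown above.
freq_effect = 0.33238160
se = 0.02353067
n_case, n_control, n_eff_reported = 9235, 10184, 4069

# Assumed approximation (not stated in the source): N_eff ~ 1 / (2 * f * (1 - f) * SE^2)
n_eff_approx = 1.0 / (2.0 * freq_effect * (1.0 - freq_effect) * se ** 2)

# Total sample size, i.e. the alternative to N_eff when cases and controls are simply summed
n_total = n_case + n_control

print(f"approximate N_eff: {n_eff_approx:.1f} (reported: {n_eff_reported})")
print(f"total sample size: {n_total}")
```

The two definitions differ by roughly a factor of five for this row, so the choice of N column is worth recording alongside any downstream result.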


Modify the format of these summary statistics to match the Popcorn input format.

We use the bim file from our LD reference to derive the rsids from the chr/pos info.

In [4]:
def get_bim_df(bim_dir):
    """
    given a .bim file, returns a dataframe mapping rsids to chr/pos
    """
    bim_df = pd.read_csv(bim_dir, sep='\t', names=["chr", "rsid", "morg", "pos", "a1", "a2"], usecols=["chr", "pos", "rsid"])
    return bim_df
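
As a quick illustration of what get_bim_df expects, the sketch below parses two .bim-style lines from an in-memory buffer. The two rsids and positions are taken from the AFR.bim printout in this notebook; the genetic distance of 0 and the alleles are made-up placeholders, and `io.StringIO` stands in for a file path.

```python
import io
import pandas as pd

# Two .bim-style rows: chr, rsid, genetic distance (morgans), bp position, allele 1, allele 2.
# The alleles and the 0 genetic distance are placeholders for illustration only.
toy_bim = io.StringIO(
    "1\trs558604819\t0\t10642\tA\tG\n"
    "1\trs575272151\t0\t11008\tG\tC\n"
)

toy_bim_df = pd.read_csv(
    toy_bim,
    sep="\t",
    names=["chr", "rsid", "morg", "pos", "a1", "a2"],
    usecols=["chr", "pos", "rsid"],   # keep only what is needed for the rsid lookup
)
print(toy_bim_df)
```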

Load the African bim file into a dataframe.

In [5]:
afr_bim_dir = "/mnt/project/Aim4_CrossAncestryGeneticCorrelation/jahagirdarob/REF/AFR.bim"
afr_bim_df = get_bim_df(afr_bim_dir)
print(afr_bim_df)

          chr         rsid       pos
0           1  rs558604819     10642
1           1  rs575272151     11008
2           1  rs544419019     11012
3           1  rs561109771     11063
4           1   rs62635286     13116
...       ...          ...       ...
14806108   22    rs8142977  51239304
14806109   22  rs561893765  51239794
14806110   22  rs202228854  51240820
14806111   22    rs7287738  51241285
14806112   22  rs568168135  51241386

[14806113 rows x 3 columns]


In [6]:
def get_sumstats_df_w_rsids(sumstats_df, bim_df):
    """
    sumstats_df: summary statistics pandas dataframe w/ chr and pos columns
    bim_df: pandas dataframe w/ chr, pos, and rsid columns
    
    returns the sumstats_df with an added rsid column
    """
    sumstats_df_w_rsids = sumstats_df.merge(bim_df, on=["chr", "pos"])    # inner join: variants missing from bim_df are dropped
    return sumstats_df_w_rsids
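
Because the merge above is an inner join on chr/pos, variants absent from the reference panel are silently dropped, and a position carrying more than one rsid in the .bim file would duplicate the matching summary-statistics row. The sketch below, on made-up toy data, shows one way to count both cases. The `how="left"` and `indicator=True` options and all toy values are illustrative choices, not part of the original pipeline.

```python
import pandas as pd

# Toy summary statistics (values are invented for illustration).
toy_sumstats = pd.DataFrame({
    "chr": [1, 1, 1],
    "pos": [100, 200, 300],
    "beta": [0.01, -0.02, 0.03],
})

# Toy reference panel: position 200 carries two rsids, position 300 is missing.
toy_bim = pd.DataFrame({
    "chr": [1, 1, 1],
    "pos": [100, 200, 200],
    "rsid": ["rsA", "rsB", "rsC"],
})

# A left join with indicator=True keeps unmatched rows and labels them "left_only".
merged = toy_sumstats.merge(toy_bim, on=["chr", "pos"], how="left", indicator=True)

n_unmatched = int((merged["_merge"] == "left_only").sum())
n_dup_positions = int(toy_bim.duplicated(subset=["chr", "pos"]).sum())

print(f"variants without an rsid: {n_unmatched}")
print(f"extra rsids at already-seen positions: {n_dup_positions}")
```

Running the same two counts on the real dataframes would explain any difference between input and output row counts.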

In [7]:
def sort_alleles(sumstats_df):
    """
    sorts the a1 and a2 alphabetically and changes beta and AF values as necessary such that the identity of
    a1 and a2 are consistent across different summary stat files no matter how they were created
    
    sumstats_df: summary statistics pandas dataframe w/ a1 and a2 columns

    returns the sumstats_df with sorted allele columns a1 and a2
    """
    # creates an update_df dataframe including all rows where a1 comes alphabetically after a2
    not_sorted = sumstats_df["a1"] > sumstats_df["a2"]
    update_df = sumstats_df.loc[not_sorted].copy()
    
    # modifies update_df such that alleles are sorted alphabetically
    update_df.rename(columns={"a1": "a2", "a2": "a1"}, inplace=True)
    update_df["beta"] = -update_df["beta"]
    if "AF" in update_df.columns:
        update_df["AF"] = 1 - update_df["AF"]
    
    # updates sumstats_df to sort alleles of unsorted rows
    # (the assignment aligns on column labels, so the renamed a1/a2 columns end up swapped)
    sumstats_df.loc[not_sorted] = update_df
    
    return sumstats_df
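
To make the flip concrete, here is a minimal standalone sketch of the same idea on two invented variants. It uses `numpy.where` instead of the rename-and-assign approach above, so it is an alternative formulation rather than the notebook's function. The column names follow the ones used in this notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a1": ["T", "A"],      # first row is out of alphabetical order
    "a2": ["C", "G"],
    "beta": [-0.025, 0.010],
    "AF": [0.40, 0.30],
})

swap = (toy["a1"] > toy["a2"]).to_numpy()   # rows whose alleles need swapping

sorted_toy = toy.copy()
sorted_toy["a1"] = np.where(swap, toy["a2"], toy["a1"])
sorted_toy["a2"] = np.where(swap, toy["a1"], toy["a2"])
sorted_toy["beta"] = np.where(swap, -toy["beta"], toy["beta"])   # effect now refers to the other allele
sorted_toy["AF"] = np.where(swap, 1 - toy["AF"], toy["AF"])      # frequency of the other allele

print(sorted_toy)
```

Unlike the in-place version, this variant leaves the input dataframe untouched, which can make repeated runs of a cell easier to reason about.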
    

In [8]:
def get_popcorn_input_from_afr_meta_sumstats(sumstats_dir, out_dir, bim_df=None, use_tot_samples=False):
    """
    sumstats_dir: path to the African summary statistics file
    out_dir: path to which the popcorn input file will be written
    bim_df: (optional) pandas dataframe w/ chr, pos, and rsid columns; rsid column is added to output file if included
    use_tot_samples: (optional) if True, N is set to N_case + N_control instead of the effective sample size N_eff
    """
    print(f"get_popcorn_input_from_afr_meta_sumstats: reading sumstats from {sumstats_dir}")
    sumstats_df = pd.read_csv(sumstats_dir, sep=r"\s+")    # sep=r"\s+" replaces the deprecated delim_whitespace=True
    
    # renames columns to conform to popcorn input format
    # CHR POS effect_allele non_effect_allele Freq_effect BETA SE P imputation_r2 N_case N_control N_eff
    print(f"get_popcorn_input_from_afr_meta_sumstats: processing summary statistics")
    col_rename_dict = {"CHR": "chr", "POS": "pos", "non_effect_allele": "a1", "effect_allele": "a2", "BETA": "beta", "Freq_effect": "AF", "N_eff": "N"}
    sumstats_df.rename(columns=col_rename_dict, inplace=True)
    
    if use_tot_samples:     # calculates N summing number of cases and controls if option specified
        sumstats_df['N'] = sumstats_df['N_case'] + sumstats_df['N_control']
        
    sort_alleles(sumstats_df)      # sorts alleles
    
    out_cols = ["chr", "pos", "a1", "a2", "beta", "SE", 'N', "AF"]
    if bim_df is not None:    # adds rsid column if bim_df is provided
        sumstats_df = get_sumstats_df_w_rsids(sumstats_df, bim_df)
        out_cols = ["rsid", "a1", "a2", "beta", "SE", 'N', "AF"]
    
    # writes file in popcorn input format
    print(f"get_popcorn_input_from_afr_meta_sumstats: writing sumstats to {out_dir}")
    out_df = sumstats_df[out_cols]
    out_df.to_csv(out_dir, sep='\t', index=False)
    
    print(out_df)

In [13]:
afr_meta_sumstats_dir = "/mnt/project/Aim2_Polygenicity/HZ/Results/Clean_summary_data/AA_overall_sumdata.txt"
get_popcorn_input_from_afr_meta_sumstats(afr_meta_sumstats_dir, "AFR_meta.sumstats.txt", bim_df=afr_bim_df)

get_popcorn_input_from_afr_meta_sumstats: reading sumstats from /mnt/project/Aim2_Polygenicity/HZ/Results/Clean_summary_data/AA_overall_sumdata.txt


get_popcorn_input_from_afr_meta_sumstats: processing summary statistics
                       ID chr        pos a2 a1        AF      beta        SE  \
0            1:752721:A:G   1     752721  G  A  0.332382 -0.015724  0.023531   
1            1:753474:C:G   1     753474  G  C  0.257161 -0.017925  0.026162   
2            1:754182:A:G   1     754182  G  A  0.400378 -0.024510  0.023287   
3            1:754192:A:G   1     754192  G  A  0.400391 -0.024505  0.023289   
4            1:754334:T:C   1     754334  T  C  0.599012  0.025747  0.023261   
...                   ...  ..        ... .. ..       ...       ...       ...   
18898809  9:141067985:T:C   9  141067985  T  C  0.372259 -0.005047  0.024332   
18898810  9:141068620:A:T   9  141068620  T  A  0.335211  0.004578  0.025237   
18898811  9:141068637:G:A   9  141068637  G  A  0.509695 -0.029646  0.023995   
18898812  9:141069420:G:A   9  141069420  G  A  0.674135  0.002055  0.026133   
18898813  9:141069424:C:G   9  141069424  G  C  

Take a look at the African summary statistics cleaned for popcorn.

In [14]:
!head "AFR_meta.sumstats.txt"

rsid	a1	a2	beta	SE	N	AF
rs3131972	A	G	-0.01572389	0.02353067	4069.0	0.3323816
rs2073814	C	G	-0.01792483	0.02616185	3824.0	0.25716098
rs3131969	A	G	-0.02450959	0.02328748	3840.0	0.40037756
rs3131968	A	G	-0.02450492	0.02328869	3839.0	0.40039101
rs3131967	C	T	0.02574705	0.02326093	3847.0	0.59901162
rs3115859	A	G	0.02245379	0.02365361	3786.0	0.61817447
rs3131966	C	T	-0.02302763	0.02368161	3778.0	0.38158567
rs3115853	A	G	0.01269722	0.02507091	3809.0	0.70296654
rs2286139	C	T	-0.02283542	0.0259535	3771.0	0.26941848
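
Beyond eyeballing the first rows, a few automated checks can catch formatting problems before the file is handed to Popcorn. The helper below is a hypothetical sketch (`check_popcorn_sumstats` is not part of this notebook or of Popcorn). It only tests properties this notebook itself aims for: the expected columns, alphabetically sorted alleles, positive standard errors, and unique rsids. The example rows are copied from the head output above.

```python
import io
import pandas as pd

def check_popcorn_sumstats(df, required=("rsid", "a1", "a2", "beta", "SE", "N")):
    """Hypothetical helper: returns a dict of simple sanity-check results."""
    return {
        "missing_columns": [c for c in required if c not in df.columns],
        "unsorted_alleles": int((df["a1"] > df["a2"]).sum()),
        "nonpositive_se": int((df["SE"] <= 0).sum()),
        "duplicate_rsids": int(df["rsid"].duplicated().sum()),
    }

# Two rows copied from the AFR_meta.sumstats.txt preview.
preview = io.StringIO(
    "rsid\ta1\ta2\tbeta\tSE\tN\tAF\n"
    "rs3131972\tA\tG\t-0.01572389\t0.02353067\t4069.0\t0.3323816\n"
    "rs3131967\tC\tT\t0.02574705\t0.02326093\t3847.0\t0.59901162\n"
)
report = check_popcorn_sumstats(pd.read_csv(preview, sep="\t"))
print(report)
```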


In [15]:
!dx upload "AFR_meta.sumstats.txt"    # uploads the popcorn input formatted file

ID                          file-GbGPQZQ4BzFzJzqQ8pkzXg5z
Class                       file
Project                     project-GYBZ3yj4BzFgB61K7vX9Fq9K
Folder                      /Aim4_CrossAncestryGeneticCorrelation/jahagirdarob
                            /sumstats_files
Name                        AFR_meta.sumstats.txt
State                       closing
Visibility                  visible
Types                       -
Properties                  -
Tags                        -
Outgoing links              -
Created                     Sat Nov 18 17:45:54 2023
Created by                  omjahagirdar
 via the job                job-GbGKY204BzFpQxkqYBF2V6y9
Last modified               Sat Nov 18 17:45:57 2023
Media type                  
archivalState               "live"
cloudAccount                "cloudaccount-dnanexus"


## European Data

Take a look at the summary statistics format.

In [16]:
!head /mnt/project/Aim2_Polygenicity/HZ/Results/Clean_summary_data/European_BCAC_icogs_onco_sumdata.txt

var_name SNP_ID CHR POS effect_allele_iCOGs non_effect_allele_iCOGs Freq_effect_iCOGs Imputation_r2_iCOGs BETA_iCOGs SE_iCOGs P_iCOGs effect_allele_Onco non_effect_allele_Onco Freq_effect_Onco Imputation_r2_Onco BETA_Onco SE_Onco P_Onco effect_allele_meta non_effect_allele_meta BETA_meta SE_meta P_meta N_eff_meta
1_11008_C_G chr1:11008 1 11008 G C 0.0898625 0.335426 0.00688035 0.0315062 0.827133 G C 0.0974118 0.579745 -0.0418897 0.0175537 0.0170151 G C -0.030336853 0.015334299 0.047887458 24614
1_11012_C_G chr1:11012 1 11012 G C 0.0898625 0.335426 0.00688035 0.0315062 0.827133 G C 0.0974118 0.579746 -0.0418896 0.0175537 0.0170154 G C -0.030336777 0.015334299 0.047888019 24614
1_13116_T_G rs201725126 1 13116 G T 0.162848 0.315307 -0.0107051 0.0252618 0.671738 G T 0.175134 0.388268 0.0188337 0.0166418 0.257756 G T 0.0098940386 0.013897235 0.4765001 18244
1_13118_A_G rs200579949 1 13118 G A 0.162848 0.315307 -0.0107049 0.0252618 0.671743 G A 0.175134 0.388268 0.0188338 0.0166418 0.257754 

Load the European bim file into a dataframe.

In [9]:
eur_bim_dir = "/mnt/project/Aim4_CrossAncestryGeneticCorrelation/jahagirdarob/REF/EUR.bim"
eur_bim_df = get_bim_df(eur_bim_dir)
print(eur_bim_df)

         chr         rsid       pos
0          1  rs575272151     11008
1          1  rs544419019     11012
2          1  rs540538026     13110
3          1   rs62635286     13116
4          1  rs200579949     13118
...      ...          ...       ...
8550151   22   rs62240045  51235979
8550152   22    rs3896457  51237063
8550153   22  rs200607599  51237364
8550154   22  rs370652263  51237712
8550155   22  rs202228854  51240820

[8550156 rows x 3 columns]


Modify the format of these summary statistics to match the Popcorn input format.

In [10]:
def get_popcorn_input_from_eur_meta_sumstats(sumstats_dir, out_dir, bim_df=None, overall=True, num_samples=None):
    """
    sumstats_dir: path to the Zhang2020 summary statistics file
    out_dir: path to which the popcorn input file will be written
    bim_df: (optional) pandas dataframe w/ chr, pos, and rsid columns; rsid column is added to output file if included
    overall: Should be set True (default) if summary statistics are overall (ER+ and ER- combined), False otherwise
    num_samples: (optional) if given, overrides the N column (N_eff_meta) with this fixed sample size
    """
    print(f"get_popcorn_input_from_eur_meta_sumstats: reading sumstats from {sumstats_dir}")
    sumstats_df = pd.read_csv(sumstats_dir, sep=r"\s+")    # sep=r"\s+" replaces the deprecated delim_whitespace=True
    
    # renames columns to conform to popcorn input format
    # var_name SNP_ID CHR POS effect_allele_iCOGs non_effect_allele_iCOGs Freq_effect_iCOGs Imputation_r2_iCOGs BETA_iCOGs SE_iCOGs P_iCOGs effect_allele_Onco non_effect_allele_Onco Freq_effect_Onco Imputation_r2_Onco BETA_Onco SE_Onco P_Onco effect_allele_meta non_effect_allele_meta BETA_meta SE_meta P_meta N_eff_meta
    print(f"get_popcorn_input_from_eur_meta_sumstats: processing summary statistics")
    pos_hdr = "POS" if overall else "position_b37"     # accounts for different input column header if not overall summary statistics
    col_rename_dict = {"CHR": "chr", pos_hdr: "pos", "non_effect_allele_meta": "a1", "effect_allele_meta": "a2", "BETA_meta": "beta", "SE_meta": "SE", "N_eff_meta": "N"}
    sumstats_df.rename(columns=col_rename_dict, inplace=True)
    
    if num_samples is not None:
        sumstats_df["N"] = num_samples
    
    sort_alleles(sumstats_df)      # sorts alleles
    
    out_cols = ["chr", "pos", "a1", "a2", "beta", "SE", 'N']
    if bim_df is not None:    # adds rsid column if bim_df is provided
        sumstats_df = get_sumstats_df_w_rsids(sumstats_df, bim_df)
        out_cols = ["rsid", "a1", "a2", "beta", "SE", 'N']
    
    # writes file in popcorn input format
    print(f"get_popcorn_input_from_eur_meta_sumstats: writing sumstats to {out_dir}")
    out_df = sumstats_df[out_cols]
    out_df.to_csv(out_dir, sep='\t', index=False)
    
    print(out_df)
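
To see which of the many input columns actually feed the output, the sketch below applies the same rename mapping to the meta-analysis columns of two rows copied from the European head output above (the iCOGs- and Onco-specific columns are left out for brevity). It stops before the allele swap and only reports which rows would need one.

```python
import pandas as pd

# Meta-analysis columns of two rows copied from the European head output above.
eur_rows = pd.DataFrame({
    "CHR": [1, 1],
    "POS": [11008, 13116],
    "effect_allele_meta": ["G", "G"],
    "non_effect_allele_meta": ["C", "T"],
    "BETA_meta": [-0.030336853, 0.0098940386],
    "SE_meta": [0.015334299, 0.013897235],
    "N_eff_meta": [24614, 18244],
})

rename_map = {
    "CHR": "chr", "POS": "pos",
    "non_effect_allele_meta": "a1", "effect_allele_meta": "a2",
    "BETA_meta": "beta", "SE_meta": "SE", "N_eff_meta": "N",
}
renamed = eur_rows.rename(columns=rename_map)

# Row 0 is already ordered (C < G); row 1 (T > G) is the kind of row that gets swapped and sign-flipped.
needs_swap = renamed["a1"] > renamed["a2"]
print(renamed[["chr", "pos", "a1", "a2", "beta", "SE", "N"]])
print(needs_swap.tolist())
```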

In [20]:
eur_meta_sumstats_dir = "/mnt/project/Aim2_Polygenicity/HZ/Results/Clean_summary_data/European_BCAC_icogs_onco_sumdata.txt"
get_popcorn_input_from_eur_meta_sumstats(eur_meta_sumstats_dir, "EUR_meta.sumstats.txt", bim_df=eur_bim_df)

get_popcorn_input_from_eur_meta_sumstats: reading sumstats from /mnt/project/Aim2_Polygenicity/HZ/Results/Clean_summary_data/European_BCAC_icogs_onco_sumdata.txt
get_popcorn_input_from_eur_meta_sumstats: processing summary statistics
get_popcorn_input_from_eur_meta_sumstats: writing sumstats to EUR_meta.sumstats.txt
                rsid a1 a2      beta        SE      N
0        rs575272151  C  G -0.030337  0.015334  24614
1        rs544419019  C  G -0.030337  0.015334  24614
2         rs62635286  G  T -0.009894  0.013897  18244
3        rs200579949  A  G  0.009894  0.013897  18244
4        rs531730856  C  G -0.012928  0.015374  17278
...              ... .. ..       ...       ...    ...
8458839  rs200189535  C  T  0.008805  0.010918  25417
8458840   rs62240045  A  G  0.007712  0.012762  16392
8458841    rs3896457  C  T  0.002210  0.009466  29057
8458842  rs200607599  A  G  0.054971  0.040306  20452
8458843  rs370652263  A  G -0.010605  0.016027  34363

[8458844 rows x 6 columns]


Take a look at the European summary statistics cleaned for popcorn.

In [21]:
!head "EUR_meta.sumstats.txt"

rsid	a1	a2	beta	SE	N
rs575272151	C	G	-0.030336853	0.015334299	24614
rs544419019	C	G	-0.030336777	0.015334299	24614
rs62635286	G	T	-0.0098940386	0.013897235	18244
rs200579949	A	G	0.0098941688	0.013897235	18244
rs531730856	C	G	-0.012928126	0.015374213	17278
rs546169444	A	T	-0.0097661803	0.014745373	16966
rs75454623	A	G	-0.0062198488	0.010234965	19135
rs199856693	A	G	-0.0113219	0.027799162	16770
rs78601809	G	T	-0.0061639233	0.011607748	18137


In [22]:
!dx upload "EUR_meta.sumstats.txt"

ID                          file-GbGPVk84BzFf1v8Kzj5815vV
Class                       file
Project                     project-GYBZ3yj4BzFgB61K7vX9Fq9K
Folder                      /Aim4_CrossAncestryGeneticCorrelation/jahagirdarob
                            /sumstats_files
Name                        EUR_meta.sumstats.txt
State                       closing
Visibility                  visible
Types                       -
Properties                  -
Tags                        -
Outgoing links              -
Created                     Sat Nov 18 17:48:21 2023
Created by                  omjahagirdar
 via the job                job-GbGKY204BzFpQxkqYBF2V6y9
Last modified               Sat Nov 18 17:48:24 2023
Media type                  
archivalState               "live"
cloudAccount                "cloudaccount-dnanexus"


## East Asian Data

Take a look at the summary statistics format.

In [8]:
!head /mnt/project/Aim2_Polygenicity/HZ/Results/Clean_summary_data/EAS_BCAC_BBJ_meta_sumdata.txt

unique_SNP_id effect_allele_BBJ non_effect_allele_BBJ Freq_effect_BBJ BETA_BBJ SE_BBJ P_BBJ N_eff_BBJ SNPID effect_allele_BCAC non_effect_allele_BCAC Freq_effect_BCAC BETA_BCAC SE_BCAC P_BCAC N_eff_BCAC flip_sign w_BBJ w_BCAC BETA_meta SE_meta P_meta effect_allele_meta non_effect_allele_meta N_eff_meta
1_751343_A_T A T 0.148702883270071 -0.0480527285854757 0.0281407682837429 0.0877135400414866 4987 rs28544273 A T 0.1237 -0.0396 0.0405 0.3277 2812.14329403978 FALSE 1262.78117717823 609.66316110349 -0.0453005414448422 0.0231097656695597 0.0499684813540172 A T 7799.14329403978
1_751756_C_T C T 0.148244184256154 -0.0479579223055557 0.0280750073417772 0.0875979517590267 5023 rs28527770 T C 0.8764 -0.0397 0.0405 0.3268 2814.09735686651 TRUE 1268.70380681243 609.66316110349 -0.0452776414536527 0.0230733035068074 0.0497230389759873 C T 7837.09735686651
1_752566_A_G A G 0.841787774720677 0.0424441957339612 0.0260038935605772 0.102632172474615 5552 rs3094315 A G 0.8665 0.0469 0.0393 0.2328 2798.

Load the East Asian bim file into a dataframe.

In [9]:
eas_bim_dir = "/mnt/project/Aim4_CrossAncestryGeneticCorrelation/jahagirdarob/REF/EAS.bim"
eas_bim_df = get_bim_df(eas_bim_dir)
print(eas_bim_df)

         chr         rsid       pos
0          1  rs575272151     11008
1          1  rs544419019     11012
2          1   rs62635286     13116
3          1  rs200579949     13118
4          1  rs531730856     13273
...      ...          ...       ...
7550536   22  rs200607599  51237364
7550537   22  rs370652263  51237712
7550538   22  rs573137567  51239678
7550539   22  rs202228854  51240820
7550540   22  rs199560686  51244163

[7550541 rows x 3 columns]


Modify the format of these summary statistics to match the Popcorn input format.

In [48]:
def get_popcorn_input_from_eas_meta_input(sumstats_dir, out_dir, bim_df=None):
    """
    sumstats_dir: path to the East Asian BCAC/BBJ meta-analysis summary statistics file
    out_dir: path to which the popcorn input file will be written
    bim_df: (optional) pandas dataframe w/ chr, pos, and rsid columns; rsid column is added to output file if included
    """
    print(f"get_popcorn_input_from_eas_meta_sumstats: reading sumstats from {sumstats_dir}")
    sumstats_df = pd.read_csv(sumstats_dir, sep=r"\s+")    # sep=r"\s+" replaces the deprecated delim_whitespace=True
    
    # renames columns to conform to popcorn input format
    # unique_SNP_id effect_allele_BBJ non_effect_allele_BBJ Freq_effect_BBJ BETA_BBJ SE_BBJ P_BBJ N_eff_BBJ SNPID effect_allele_BCAC non_effect_allele_BCAC Freq_effect_BCAC BETA_BCAC SE_BCAC P_BCAC N_eff_BCAC flip_sign w_BBJ w_BCAC BETA_meta SE_meta P_meta effect_allele_meta non_effect_allele_meta N_eff_meta
    print(f"get_popcorn_input_from_eas_meta_sumstats: processing summary statistics")
    
    # derives chr and pos from the unique_SNP_id column (format: chr_pos_allele_allele)
    snp_id_info_df = sumstats_df["unique_SNP_id"].str.split(pat='_', expand=True)
    snp_id_info_df.rename(columns={0: "chr", 1: "pos"}, inplace=True)
    
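    # only includes autosomal sites: X/Y rows (if any) are excluded here, so the index-aligned
    # assignments below leave their chr/pos as NaN and they find no match in the merge with bim_df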
    snp_id_info_df = snp_id_info_df.loc[~snp_id_info_df["chr"].isin(['X', 'Y'])]
    sumstats_df["chr"] = snp_id_info_df["chr"].astype(int)
    sumstats_df["pos"] = snp_id_info_df["pos"].astype(float).astype(int)
    
    col_rename_dict = {"non_effect_allele_meta": "a1", "effect_allele_meta": "a2", "BETA_meta": "beta", "SE_meta": "SE", "N_eff_meta": "N"}
    sumstats_df.rename(columns=col_rename_dict, inplace=True)
    
    sort_alleles(sumstats_df)      # sorts alleles
    
    out_cols = ["chr", "pos", "a1", "a2", "beta", "SE", 'N']
    if bim_df is not None:    # adds rsid column if bim_df is provided
        sumstats_df = get_sumstats_df_w_rsids(sumstats_df, bim_df)
        out_cols = ["rsid", "a1", "a2", "beta", "SE", 'N']
    
    # writes file in popcorn input format
    print(f"get_popcorn_input_from_eas_meta_sumstats: writing sumstats to {out_dir}")
    out_df = sumstats_df[out_cols]
    out_df.to_csv(out_dir, sep='\t', index=False)
    
    print(out_df)
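
The ID parsing step can be checked in isolation. The sketch below splits a few `unique_SNP_id` strings (the first two copied from the head output above, the `X_...` entry invented, and the BETA values rounded). As an alternative to relying on index alignment, it filters the whole frame with a boolean mask so non-autosomal rows are removed explicitly. This is an illustrative variant, not the notebook's code.

```python
import pandas as pd

toy = pd.DataFrame({
    "unique_SNP_id": ["1_751343_A_T", "1_751756_C_T", "X_12345_A_G"],  # last ID is invented
    "BETA_meta": [-0.0453, -0.0453, 0.01],                             # rounded / invented values
})

# Split "chr_pos_allele_allele" into separate columns; only the first two parts are needed.
parts = toy["unique_SNP_id"].str.split("_", expand=True)

# Keep autosomal rows only, applying the same mask to the IDs and the statistics.
autosomal = ~parts[0].isin(["X", "Y"])
toy = toy.loc[autosomal].copy()
toy["chr"] = parts.loc[autosomal, 0].astype(int)
toy["pos"] = parts.loc[autosomal, 1].astype(int)

print(toy)
```

Filtering the frame itself also keeps X/Y rows out of the output when no bim_df is supplied.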

In [49]:
eas_meta_sumstats_dir = "/mnt/project/Aim2_Polygenicity/HZ/Results/Clean_summary_data/EAS_BCAC_BBJ_meta_sumdata.txt"
get_popcorn_input_from_eas_meta_input(eas_meta_sumstats_dir, "EAS_meta.sumstats.txt",  bim_df=eas_bim_df)

get_popcorn_input_from_eas_meta_sumstats: reading sumstats from /mnt/project/Aim2_Polygenicity/HZ/Results/Clean_summary_data/EAS_BCAC_BBJ_meta_sumdata.txt


get_popcorn_input_from_eas_meta_sumstats: processing summary statistics
get_popcorn_input_from_eas_meta_sumstats: writing sumstats to EAS_meta.sumstats.txt
                rsid a1 a2      beta        SE            N
0         rs28544273  A  T  0.045301  0.023110  7799.143294
1        rs143225517  C  T  0.045278  0.023073  7837.097357
2          rs3094315  A  G -0.043801  0.021686  8350.562889
3          rs3131971  C  T -0.044121  0.022548  7988.880590
4          rs3115860  A  C -0.046694  0.022926  7866.067801
...              ... .. ..       ...       ...          ...
7539138  rs113271527  A  G -0.005000  0.043500  4875.759626
7539139  rs112574713  C  G  0.003400  0.033200  2952.269988
7539140    rs2163055  C  T  0.038500  0.028200  6023.494301
7539141   rs62492145  A  G -0.012000  0.038200  4291.433157
7539142   rs74405861  C  G -0.027400  0.037000  3272.201969

[7539143 rows x 6 columns]


Take a look at the East Asian summary statistics cleaned for popcorn.

In [50]:
!head "EAS_meta.sumstats.txt"

rsid	a1	a2	beta	SE	N
rs28544273	A	T	0.0453005414448422	0.0231097656695597	7799.14329403978
rs143225517	C	T	0.0452776414536527	0.0230733035068074	7837.09735686651
rs3094315	A	G	-0.0438009913836003	0.0216863628058605	8350.56288941241
rs3131971	C	T	-0.0441211404234754	0.0225481633790612	7988.88059003056
rs3115860	A	C	-0.0466943423779331	0.0229257079151948	7866.06780088295
rs3131970	C	T	-0.0470119615303409	0.0228428360407606	7898.2835460389
rs10157329	A	T	0.0457303086456231	0.0260148882788661	7081.07369987925
rs114111569	A	T	-0.0473012635317065	0.0231427308568686	7745.06197244177
rs1048488	C	T	0.0456640900988062	0.0225335993119473	7847.78628870211


In [51]:
!dx upload "EAS_meta.sumstats.txt"    # uploads the popcorn input formatted file

ID                          file-GbJ9G5Q4BzFXGJY0XzPzGB1x
Class                       file
Project                     project-GYBZ3yj4BzFgB61K7vX9Fq9K
Folder                      /Aim4_CrossAncestryGeneticCorrelation/jahagirdarob
                            /sumstats_files
Name                        EAS_meta.sumstats.txt
State                       closing
Visibility                  visible
Types                       -
Properties                  -
Tags                        -
Outgoing links              -
Created                     Sun Nov 19 23:11:18 2023
Created by                  omjahagirdar
 via the job                job-GbJ8gxQ4BzFZygZz8Qpvbb0v
Last modified               Sun Nov 19 23:11:22 2023
Media type                  
archivalState               "live"
cloudAccount                "cloudaccount-dnanexus"


In [52]:
! dx ls

AFR.sumstats.txt : file-Gb2Z25j4BzFV4x7BPBXfGJV8
AFR.sumstats.txt : file-Gb1X2yj4BzFf8vkgq8yjB9bq
AFR_meta.sumstats.txt
EAS.sumstats.txt
EAS_meta.sumstats.txt
EUR.sumstats.txt : file-Gb2Z44Q4BzFxZQb3k4b320Pk
EUR.sumstats.txt : file-Gb1X34j4BzFk3XQBv8PVqy85
EUR.sumstats.txt : file-Gb23kJQ4BzFb8QvJyfzqP3z1
EUR_meta.sumstats.txt


# Prepare Population ER+/ER- Breast Cancer Meta Summary Statistics Data

## African Data

Take a look at the summary statistics format and check that it is the same as the overall summary statistics format.

In [4]:
!head -1 /mnt/project/Aim2_Polygenicity/HZ/Results/Clean_summary_data/AA_overall_sumdata.txt
!head -1 /mnt/project/Aim2_Polygenicity/HZ/Results/Clean_summary_data/AA_ERneg_sumdata.txt
!head -1 /mnt/project/Aim2_Polygenicity/HZ/Results/Clean_summary_data/AA_ERpos_sumdata.txt

ID CHR POS effect_allele non_effect_allele Freq_effect BETA SE P imputation_r2 N_case N_control N_eff
ID CHR POS effect_allele non_effect_allele Freq_effect BETA SE P imputation_r2 N_case N_control N_eff
ID CHR POS effect_allele non_effect_allele Freq_effect BETA SE P imputation_r2 N_case N_control N_eff
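
The same header comparison can be done programmatically, which scales better when more files are added. The sketch below writes two small temporary files standing in for the real paths (an assumption made so the example runs anywhere) and checks that their first lines match. `read_header` is a hypothetical helper name.

```python
import os
import tempfile

header = "ID CHR POS effect_allele non_effect_allele Freq_effect BETA SE P imputation_r2 N_case N_control N_eff\n"

def read_header(path):
    """Returns the whitespace-split column names from the first line of a file."""
    with open(path) as handle:
        return handle.readline().split()

with tempfile.TemporaryDirectory() as tmp_dir:
    # Stand-ins for AA_overall_sumdata.txt and AA_ERneg_sumdata.txt.
    paths = [os.path.join(tmp_dir, name) for name in ("overall.txt", "erneg.txt")]
    for path in paths:
        with open(path, "w") as handle:
            handle.write(header)

    headers = [read_header(path) for path in paths]

same_format = all(h == headers[0] for h in headers)
print(f"headers identical: {same_format}")
```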


Modify the format of these summary statistics to match the Popcorn input format.

In [25]:
afr_meta_sumstats_dir = "/mnt/project/Aim2_Polygenicity/HZ/Results/Clean_summary_data/AA_ERneg_sumdata.txt"
get_popcorn_input_from_afr_meta_sumstats(afr_meta_sumstats_dir, "AFR_meta_ERneg.sumstats.txt", bim_df=afr_bim_df)

get_popcorn_input_from_afr_meta_sumstats: reading sumstats from /mnt/project/Aim2_Polygenicity/HZ/Results/Clean_summary_data/AA_ERneg_sumdata.txt


get_popcorn_input_from_afr_meta_sumstats: processing summary statistics
get_popcorn_input_from_afr_meta_sumstats: writing sumstats to AFR_meta_ERneg.sumstats.txt
                rsid a1 a2      beta        SE       N        AF
0          rs3131972  A  G -0.012059  0.035727  1765.0  0.332382
1          rs2073814  C  G -0.006416  0.039553  1673.0  0.257161
2          rs3131969  A  G -0.028949  0.035316  1669.0  0.400378
3          rs3131968  A  G -0.028961  0.035318  1669.0  0.400391
4          rs3131967  C  T  0.030431  0.035285  1671.0  0.599012
...              ... .. ..       ...       ...     ...       ...
14029445   rs2847047  C  T  0.000480  0.036765  1582.0  0.372259
14029446  rs10780202  A  T  0.016848  0.038188  1538.0  0.335211
14029447   rs4088486  A  G -0.042467  0.036493  1502.0  0.509695
14029448  rs11137400  A  G -0.014139  0.039591  1452.0  0.674135
14029449  rs11137401  C  G  0.015339  0.039602  1436.0  0.332461

[14029450 rows x 7 columns]


Take a look at the African summary statistics cleaned for popcorn.

In [26]:
!head "AFR_meta_ERneg.sumstats.txt"

rsid	a1	a2	beta	SE	N	AF
rs3131972	A	G	-0.0120593	0.03572656	1765.0	0.3323816
rs2073814	C	G	-0.00641626	0.039553	1673.0	0.25716098
rs3131969	A	G	-0.0289486	0.0353164	1669.0	0.40037756
rs3131968	A	G	-0.02896089	0.03531814	1669.0	0.40039101
rs3131967	C	T	0.03043069	0.03528459	1671.0	0.59901162
rs3115859	A	G	0.0205729	0.03579849	1652.0	0.61817447
rs3131966	C	T	-0.02015514	0.03583954	1649.0	0.38158567
rs3115853	A	G	0.0140754	0.03810667	1649.0	0.70296654
rs2286139	C	T	-0.0194629	0.03932637	1642.0	0.26941848


In [27]:
!dx upload "AFR_meta_ERneg.sumstats.txt"    # uploads the popcorn input formatted file

ID                          file-GbXYV0Q4BzFv6x8y25bxxzFG
Class                       file
Project                     project-GYBZ3yj4BzFgB61K7vX9Fq9K
Folder                      /Aim4_CrossAncestryGeneticCorrelation/jahagirdarob
                            /sumstats_files
Name                        AFR_meta_ERneg.sumstats.txt
State                       closing
Visibility                  visible
Types                       -
Properties                  -
Tags                        -
Outgoing links              -
Created                     Tue Nov 28 00:46:58 2023
Created by                  omjahagirdar
 via the job                job-GbXXjyQ4BzFjf8X177yJQvKq
Last modified               Tue Nov 28 00:47:02 2023
Media type                  
archivalState               "live"
cloudAccount                "cloudaccount-dnanexus"


Repeat the conversion for the ER+ summary statistics, still using the effective sample size (the total-sample-size versions are generated further below).

In [28]:
afr_meta_sumstats_dir = "/mnt/project/Aim2_Polygenicity/HZ/Results/Clean_summary_data/AA_ERpos_sumdata.txt"
get_popcorn_input_from_afr_meta_sumstats(afr_meta_sumstats_dir, "AFR_meta_ERpos.sumstats.txt", bim_df=afr_bim_df)

get_popcorn_input_from_afr_meta_sumstats: reading sumstats from /mnt/project/Aim2_Polygenicity/HZ/Results/Clean_summary_data/AA_ERpos_sumdata.txt


get_popcorn_input_from_afr_meta_sumstats: processing summary statistics
get_popcorn_input_from_afr_meta_sumstats: writing sumstats to AFR_meta_ERpos.sumstats.txt
                rsid a1 a2      beta        SE       N        AF
0          rs3131972  A  G -0.009653  0.029988  2505.0  0.332382
1          rs2073814  C  G -0.023882  0.033037  2398.0  0.257161
2          rs3131969  A  G -0.032293  0.029974  2318.0  0.400378
3          rs3131968  A  G -0.032284  0.029975  2317.0  0.400391
4          rs3131967  C  T  0.033041  0.029944  2321.0  0.599012
...              ... .. ..       ...       ...     ...       ...
14029445   rs2847047  C  T -0.030400  0.030785  2257.0  0.372259
14029446  rs10780202  A  T -0.017362  0.031921  2201.0  0.335211
14029447   rs4088486  A  G -0.031260  0.030559  2142.0  0.509695
14029448  rs11137400  A  G  0.021477  0.033048  2083.0  0.674135
14029449  rs11137401  C  G -0.022081  0.033084  2058.0  0.332461

[14029450 rows x 7 columns]


Take a look at the African summary statistics cleaned for popcorn.

In [29]:
!head "AFR_meta_ERpos.sumstats.txt"

rsid	a1	a2	beta	SE	N	AF
rs3131972	A	G	-0.00965253	0.02998752	2505.0	0.3323816
rs2073814	C	G	-0.02388151	0.03303695	2398.0	0.25716098
rs3131969	A	G	-0.03229329	0.02997372	2318.0	0.40037756
rs3131968	A	G	-0.03228352	0.02997505	2317.0	0.40039101
rs3131967	C	T	0.03304077	0.02994418	2321.0	0.59901162
rs3115859	A	G	0.03004083	0.03031778	2304.0	0.61817447
rs3131966	C	T	-0.03028812	0.03035522	2299.0	0.38158567
rs3115853	A	G	0.01361716	0.03192142	2349.0	0.70296654
rs2286139	C	T	-0.02454694	0.03278619	2363.0	0.26941848


In [30]:
!dx upload "AFR_meta_ERpos.sumstats.txt"    # uploads the popcorn input formatted file

ID                          file-GbXYX2Q4BzFq2yBbF79JBbzJ
Class                       file
Project                     project-GYBZ3yj4BzFgB61K7vX9Fq9K
Folder                      /Aim4_CrossAncestryGeneticCorrelation/jahagirdarob
                            /sumstats_files
Name                        AFR_meta_ERpos.sumstats.txt
State                       closing
Visibility                  visible
Types                       -
Properties                  -
Tags                        -
Outgoing links              -
Created                     Tue Nov 28 00:49:14 2023
Created by                  omjahagirdar
 via the job                job-GbXXjyQ4BzFjf8X177yJQvKq
Last modified               Tue Nov 28 00:49:17 2023
Media type                  
archivalState               "live"
cloudAccount                "cloudaccount-dnanexus"

Generate the ER- and ER+ summary statistics again, this time setting N to the total sample size (N_case + N_control) via use_tot_samples=True.


In [11]:
afr_meta_sumstats_dir = "/mnt/project/Aim2_Polygenicity/HZ/Results/Clean_summary_data/AA_ERneg_sumdata.txt"
get_popcorn_input_from_afr_meta_sumstats(afr_meta_sumstats_dir, "AFR_meta_ERneg_totsamples.sumstats.txt", bim_df=afr_bim_df, use_tot_samples=True)

get_popcorn_input_from_afr_meta_sumstats: reading sumstats from /mnt/project/Aim2_Polygenicity/HZ/Results/Clean_summary_data/AA_ERneg_sumdata.txt


get_popcorn_input_from_afr_meta_sumstats: processing summary statistics
get_popcorn_input_from_afr_meta_sumstats: writing sumstats to AFR_meta_ERneg_totsamples.sumstats.txt
                rsid a1 a2      beta        SE      N        AF
0          rs3131972  A  G -0.012059  0.035727  12819  0.332382
1          rs2073814  C  G -0.006416  0.039553  12819  0.257161
2          rs3131969  A  G -0.028949  0.035316  12819  0.400378
3          rs3131968  A  G -0.028961  0.035318  12819  0.400391
4          rs3131967  C  T  0.030431  0.035285  12819  0.599012
...              ... .. ..       ...       ...    ...       ...
14029445   rs2847047  C  T  0.000480  0.036765  12819  0.372259
14029446  rs10780202  A  T  0.016848  0.038188  12819  0.335211
14029447   rs4088486  A  G -0.042467  0.036493  12819  0.509695
14029448  rs11137400  A  G -0.014139  0.039591  12819  0.674135
14029449  rs11137401  C  G  0.015339  0.039602  12819  0.332461

[14029450 rows x 7 columns]
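
The AF of 0.599012 shown for rs3131967 equals 1 - 0.40098838, the effect-allele frequency listed for the fifth variant in the source-format preview earlier in this notebook. This suggests that some records are re-oriented to the other allele during processing. The internals of `get_popcorn_input_from_afr_meta_sumstats` are not shown here, so the following is only a minimal sketch of what re-orienting a single record involves; `flip_record` and its dictionary keys are hypothetical names, and the example values are illustrative.

```python
def flip_record(record):
    """Return a copy of a summary-statistics record expressed for the other allele.

    Hypothetical helper: swaps the two alleles, negates the effect size,
    and replaces the allele frequency with its complement.
    SE and N do not depend on the orientation and are left unchanged.
    """
    flipped = dict(record)
    flipped["a1"], flipped["a2"] = record["a2"], record["a1"]
    flipped["beta"] = -record["beta"]
    flipped["AF"] = 1.0 - record["AF"]
    return flipped


# Illustrative record (only the AF value is taken from the notebook).
example = {"rsid": "rs_example", "a1": "T", "a2": "C",
           "beta": 0.02, "SE": 0.03, "N": 1000, "AF": 0.40098838}
flipped = flip_record(example)
print(flipped["a1"], flipped["a2"], flipped["beta"], round(flipped["AF"], 8))
```

Flipping a record twice should return the original values, which is a convenient sanity check for any alignment step of this kind.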


Take a look at the African ER-negative summary statistics cleaned for Popcorn.

In [12]:
!head "AFR_meta_ERneg_totsamples.sumstats.txt"

rsid	a1	a2	beta	SE	N	AF
rs3131972	A	G	-0.0120593	0.03572656	12819	0.3323816
rs2073814	C	G	-0.00641626	0.039553	12819	0.25716098
rs3131969	A	G	-0.0289486	0.0353164	12819	0.40037756
rs3131968	A	G	-0.02896089	0.03531814	12819	0.40039101
rs3131967	C	T	0.03043069	0.03528459	12819	0.59901162
rs3115859	A	G	0.0205729	0.03579849	12819	0.61817447
rs3131966	C	T	-0.02015514	0.03583954	12819	0.38158567
rs3115853	A	G	0.0140754	0.03810667	12819	0.70296654
rs2286139	C	T	-0.0194629	0.03932637	12819	0.26941848
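
Before uploading, it may be worth checking the written file programmatically rather than by eye. The sketch below is one possible check, assuming the tab-delimited layout shown in the preview above; `check_sumstats` and `EXPECTED_COLUMNS` are hypothetical names, and the function uses only the standard library so that it also runs outside the notebook.

```python
import csv
import io
from itertools import islice

# Column layout taken from the preview above.
EXPECTED_COLUMNS = ["rsid", "a1", "a2", "beta", "SE", "N", "AF"]


def check_sumstats(handle, expected_columns=EXPECTED_COLUMNS, max_rows=1000):
    """Check the header and the first `max_rows` records of a tab-delimited file.

    Returns the number of records checked; raises ValueError on the first problem.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    if reader.fieldnames != list(expected_columns):
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    checked = 0
    for row in islice(reader, max_rows):
        float(row["beta"])  # raises ValueError if beta is not numeric
        if float(row["SE"]) <= 0:
            raise ValueError(f"non-positive SE for {row['rsid']}")
        if float(row["N"]) <= 0:
            raise ValueError(f"non-positive N for {row['rsid']}")
        # AF is only present in some of the files, so check it when available.
        if "AF" in row and not 0.0 <= float(row["AF"]) <= 1.0:
            raise ValueError(f"AF outside [0, 1] for {row['rsid']}")
        checked += 1
    return checked


# First two records copied from the preview above.
sample = (
    "rsid\ta1\ta2\tbeta\tSE\tN\tAF\n"
    "rs3131972\tA\tG\t-0.0120593\t0.03572656\t12819\t0.3323816\n"
    "rs2073814\tC\tG\t-0.00641626\t0.039553\t12819\t0.25716098\n"
)
print(check_sumstats(io.StringIO(sample)))  # prints 2
```

For a real file, the same function can be given an open file handle, for example `open("AFR_meta_ERneg_totsamples.sumstats.txt", newline="")`. For files written without an AF column, pass `expected_columns` without `"AF"`.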


In [13]:
!dx upload "AFR_meta_ERneg_totsamples.sumstats.txt"    # uploads the popcorn input formatted file

ID                          file-GbggKvQ4BzFV4V49x96QYzz5
Class                       file
Project                     project-GYBZ3yj4BzFgB61K7vX9Fq9K
Folder                      /Aim4_CrossAncestryGeneticCorrelation/jahagirdarob
                            /sumstats_files
Name                        AFR_meta_ERneg_totsamples.sumstats.txt
State                       closing
Visibility                  visible
Types                       -
Properties                  -
Tags                        -
Outgoing links              -
Created                     Tue Dec  5 19:18:10 2023
Created by                  omjahagirdar
 via the job                job-GbggB5Q4BzFgqk7V1GxF5Z42
Last modified               Tue Dec  5 19:18:13 2023
Media type                  
archivalState               "live"
cloudAccount                "cloudaccount-dnanexus"


In [14]:
# African ER-positive summary statistics; use_tot_samples=True writes the total sample size as a constant N for every variant
afr_meta_sumstats_dir = "/mnt/project/Aim2_Polygenicity/HZ/Results/Clean_summary_data/AA_ERpos_sumdata.txt"
get_popcorn_input_from_afr_meta_sumstats(afr_meta_sumstats_dir, "AFR_meta_ERpos_totsamples.sumstats.txt", bim_df=afr_bim_df, use_tot_samples=True)

get_popcorn_input_from_afr_meta_sumstats: reading sumstats from /mnt/project/Aim2_Polygenicity/HZ/Results/Clean_summary_data/AA_ERpos_sumdata.txt


  sumstats_df = pd.read_csv(sumstats_dir, delim_whitespace=True)


get_popcorn_input_from_afr_meta_sumstats: processing summary statistics
get_popcorn_input_from_afr_meta_sumstats: writing sumstats to AFR_meta_ERpos_totsamples.sumstats.txt
                rsid a1 a2      beta        SE      N        AF
0          rs3131972  A  G -0.009653  0.029988  14479  0.332382
1          rs2073814  C  G -0.023882  0.033037  14479  0.257161
2          rs3131969  A  G -0.032293  0.029974  14479  0.400378
3          rs3131968  A  G -0.032284  0.029975  14479  0.400391
4          rs3131967  C  T  0.033041  0.029944  14479  0.599012
...              ... .. ..       ...       ...    ...       ...
14029445   rs2847047  C  T -0.030400  0.030785  14479  0.372259
14029446  rs10780202  A  T -0.017362  0.031921  14479  0.335211
14029447   rs4088486  A  G -0.031260  0.030559  14479  0.509695
14029448  rs11137400  A  G  0.021477  0.033048  14479  0.674135
14029449  rs11137401  C  G -0.022081  0.033084  14479  0.332461

[14029450 rows x 7 columns]


Take a look at the African ER-positive summary statistics cleaned for Popcorn.

In [15]:
!head "AFR_meta_ERpos_totsamples.sumstats.txt"

rsid	a1	a2	beta	SE	N	AF
rs3131972	A	G	-0.00965253	0.02998752	14479	0.3323816
rs2073814	C	G	-0.02388151	0.03303695	14479	0.25716098
rs3131969	A	G	-0.03229329	0.02997372	14479	0.40037756
rs3131968	A	G	-0.03228352	0.02997505	14479	0.40039101
rs3131967	C	T	0.03304077	0.02994418	14479	0.59901162
rs3115859	A	G	0.03004083	0.03031778	14479	0.61817447
rs3131966	C	T	-0.03028812	0.03035522	14479	0.38158567
rs3115853	A	G	0.01361716	0.03192142	14479	0.70296654
rs2286139	C	T	-0.02454694	0.03278619	14479	0.26941848


In [16]:
!dx upload "AFR_meta_ERpos_totsamples.sumstats.txt"    # uploads the popcorn input formatted file

ID                          file-GbggPvj4BzFxxVJkJxGXv5fq
Class                       file
Project                     project-GYBZ3yj4BzFgB61K7vX9Fq9K
Folder                      /Aim4_CrossAncestryGeneticCorrelation/jahagirdarob
                            /sumstats_files
Name                        AFR_meta_ERpos_totsamples.sumstats.txt
State                       closing
Visibility                  visible
Types                       -
Properties                  -
Tags                        -
Outgoing links              -
Created                     Tue Dec  5 19:20:19 2023
Created by                  omjahagirdar
 via the job                job-GbggB5Q4BzFgqk7V1GxF5Z42
Last modified               Tue Dec  5 19:20:22 2023
Media type                  
archivalState               "live"
cloudAccount                "cloudaccount-dnanexus"


## European Data

Take a look at the summary statistics format. The overall file and the ER-subtype files use different column names, so the processing function takes a boolean `overall` parameter to account for the differences.

In [31]:
!head -1 /mnt/project/Aim2_Polygenicity/HZ/Results/Clean_summary_data/European_BCAC_icogs_onco_sumdata.txt
!head -1 /mnt/project/Aim2_Polygenicity/HZ/Results/Clean_summary_data/European_ERneg_BCAC_icogs_onco_sumdata.txt
!head -1 /mnt/project/Aim2_Polygenicity/HZ/Results/Clean_summary_data/European_ERpos_BCAC_icogs_onco_sumdata.txt

var_name SNP_ID CHR POS effect_allele_iCOGs non_effect_allele_iCOGs Freq_effect_iCOGs Imputation_r2_iCOGs BETA_iCOGs SE_iCOGs P_iCOGs effect_allele_Onco non_effect_allele_Onco Freq_effect_Onco Imputation_r2_Onco BETA_Onco SE_Onco P_Onco effect_allele_meta non_effect_allele_meta BETA_meta SE_meta P_meta N_eff_meta
var_name phase3_1kg_id chr position_b37 effect_allele_meta non_effect_allele_meta Freq_effect_iCOGs Imputation_r2_iCOGs BETA_iCOGs SE_iCOGs P_iCOGs Freq_effect_Onco Imputation_r2_Onco BETA_Onco SE_Onco P_Onco BETA_meta SE_meta P_meta N_eff_meta
var_name phase3_1kg_id chr position_b37 effect_allele_meta non_effect_allele_meta Freq_effect_iCOGs Imputation_r2_iCOGs BETA_iCOGs SE_iCOGs P_iCOGs Freq_effect_Onco Imputation_r2_Onco BETA_Onco SE_Onco P_Onco BETA_meta SE_meta P_meta N_eff_meta
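
The three headers above differ mainly in the identifier and position columns (`SNP_ID`, `CHR`, `POS` in the overall file versus `phase3_1kg_id`, `chr`, `position_b37` in the subtype files), while the `*_meta` columns are shared. The following is a minimal sketch of how an `overall` flag could select the right column mapping. It assumes that the differently named columns play the same roles in both layouts; `standardize_row` and the target column names are hypothetical and not taken from the actual function.

```python
# Columns that are named differently in the two layouts (names from the headers above).
OVERALL_COLUMNS = {"SNP_ID": "snp_id", "CHR": "chr", "POS": "pos"}
SUBTYPE_COLUMNS = {"phase3_1kg_id": "snp_id", "chr": "chr", "position_b37": "pos"}

# Columns that carry the same name in all three files.
SHARED_COLUMNS = {
    "effect_allele_meta": "effect_allele",
    "non_effect_allele_meta": "non_effect_allele",
    "BETA_meta": "beta",
    "SE_meta": "SE",
    "N_eff_meta": "N_eff",
}


def standardize_row(row, overall):
    """Return only the columns of interest from `row`, under common names.

    `overall=True` selects the overall-file layout, `overall=False` the
    ER-subtype layout. Raises KeyError if an expected column is missing.
    """
    mapping = dict(OVERALL_COLUMNS if overall else SUBTYPE_COLUMNS)
    mapping.update(SHARED_COLUMNS)
    return {new: row[old] for old, new in mapping.items()}


# Illustrative row in the subtype layout (values are made up).
subtype_row = {
    "phase3_1kg_id": "rs_example", "chr": "1", "position_b37": "12345",
    "effect_allele_meta": "G", "non_effect_allele_meta": "C",
    "BETA_meta": "-0.05", "SE_meta": "0.03", "N_eff_meta": "6000",
}
print(standardize_row(subtype_row, overall=False))
```

Keeping the two layouts as explicit dictionaries makes a mismatch fail loudly with a `KeyError` instead of silently producing empty columns.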


Generate the Popcorn input for the European ER-negative summary statistics, using the European bim dataframe (`eur_bim_df`).

In [32]:
# overall=False: the ER-subtype files use the phase3_1kg_id / chr / position_b37 column names
eur_meta_sumstats_dir = "/mnt/project/Aim2_Polygenicity/HZ/Results/Clean_summary_data/European_ERneg_BCAC_icogs_onco_sumdata.txt"
get_popcorn_input_from_eur_meta_sumstats(eur_meta_sumstats_dir, "EUR_meta_ERneg.sumstats.txt", bim_df=eur_bim_df, overall=False)

get_popcorn_input_from_eur_meta_sumstats: reading sumstats from /mnt/project/Aim2_Polygenicity/HZ/Results/Clean_summary_data/European_ERneg_BCAC_icogs_onco_sumdata.txt
get_popcorn_input_from_eur_meta_sumstats: processing summary statistics
get_popcorn_input_from_eur_meta_sumstats: writing sumstats to EUR_meta_ERneg.sumstats.txt
                rsid a1 a2     beta       SE       N
0        rs575272151  C  G -0.05444  0.03106  6010.0
1        rs544419019  C  G -0.05444  0.03106  6010.0
2        rs540538026  A  G  0.01441  0.04900  4240.0
3         rs62635286  G  T  0.04391  0.02778  4586.0
4        rs200579949  A  G -0.04391  0.02778  4586.0
...              ... .. ..      ...      ...     ...
8533390   rs62240045  A  G -0.01922  0.02512  4229.0
8533391    rs3896457  C  T -0.00519  0.01893  7305.0
8533392  rs200607599  A  G  0.07123  0.08080  5127.0
8533393  rs370652263  A  G -0.02375  0.03095  9198.0
8533394  rs202228854  C  T  0.04888  0.04917  4684.0

[8533395 rows x 6 columns]


Take a look at the European ER-negative summary statistics cleaned for Popcorn.

In [33]:
!head "EUR_meta_ERneg.sumstats.txt"

rsid	a1	a2	beta	SE	N
rs575272151	C	G	-0.05444	0.03106	6010.0
rs544419019	C	G	-0.05444	0.03106	6010.0
rs540538026	A	G	0.01441	0.049	4240.0
rs62635286	G	T	0.04391	0.02778	4586.0
rs200579949	A	G	-0.04391	0.02778	4586.0
rs531730856	C	G	-0.04069	0.03016	4490.0
rs546169444	A	T	-0.01356	0.02904	4378.0
rs531646671	A	T	0.04288	0.02746	4675.0
rs541940975	A	G	-0.04288	0.02746	4675.0


In [34]:
!dx upload "EUR_meta_ERneg.sumstats.txt"

ID                          file-GbXYXv04BzFvf4y2Jxk87k1Z
Class                       file
Project                     project-GYBZ3yj4BzFgB61K7vX9Fq9K
Folder                      /Aim4_CrossAncestryGeneticCorrelation/jahagirdarob
                            /sumstats_files
Name                        EUR_meta_ERneg.sumstats.txt
State                       closing
Visibility                  visible
Types                       -
Properties                  -
Tags                        -
Outgoing links              -
Created                     Tue Nov 28 00:50:56 2023
Created by                  omjahagirdar
 via the job                job-GbXXjyQ4BzFjf8X177yJQvKq
Last modified               Tue Nov 28 00:50:59 2023
Media type                  
archivalState               "live"
cloudAccount                "cloudaccount-dnanexus"


In [35]:
# Same processing for the European ER-positive file (subtype column layout, overall=False)
eur_meta_sumstats_dir = "/mnt/project/Aim2_Polygenicity/HZ/Results/Clean_summary_data/European_ERpos_BCAC_icogs_onco_sumdata.txt"
get_popcorn_input_from_eur_meta_sumstats(eur_meta_sumstats_dir, "EUR_meta_ERpos.sumstats.txt", bim_df=eur_bim_df, overall=False)

get_popcorn_input_from_eur_meta_sumstats: reading sumstats from /mnt/project/Aim2_Polygenicity/HZ/Results/Clean_summary_data/European_ERpos_BCAC_icogs_onco_sumdata.txt
get_popcorn_input_from_eur_meta_sumstats: processing summary statistics
get_popcorn_input_from_eur_meta_sumstats: writing sumstats to EUR_meta_ERpos.sumstats.txt
                rsid a1 a2     beta       SE        N
0        rs575272151  C  G -0.02268  0.01886  16300.0
1        rs544419019  C  G -0.02268  0.01886  16300.0
2        rs540538026  A  G -0.04472  0.02964  11632.0
3         rs62635286  G  T  0.00138  0.01693  12361.0
4        rs200579949  A  G -0.00138  0.01693  12361.0
...              ... .. ..      ...      ...      ...
8533390   rs62240045  A  G  0.00559  0.01546  11165.0
8533391    rs3896457  C  T -0.00293  0.01162  19390.0
8533392  rs200607599  A  G  0.06058  0.04959  13609.0
8533393  rs370652263  A  G -0.00320  0.01924  23783.0
8533394  rs202228854  C  T  0.03733  0.03049  12172.0

[8533395 rows x 6 columns]

Take a look at the European ER-positive summary statistics cleaned for Popcorn.

In [36]:
!head "EUR_meta_ERpos.sumstats.txt"

rsid	a1	a2	beta	SE	N
rs575272151	C	G	-0.02268	0.01886	16300.0
rs544419019	C	G	-0.02268	0.01886	16300.0
rs540538026	A	G	-0.04472	0.02964	11632.0
rs62635286	G	T	0.00138	0.01693	12361.0
rs200579949	A	G	-0.00138	0.01693	12361.0
rs531730856	C	G	-0.00759	0.01859	11810.0
rs546169444	A	T	-0.01378	0.01779	11672.0
rs531646671	A	T	0.0169	0.01679	12511.0
rs541940975	A	G	-0.0169	0.01679	12511.0


In [37]:
!dx upload "EUR_meta_ERpos.sumstats.txt"

ID                          file-GbXYYXQ4BzFgfGPX6Xy1fQzJ
Class                       file
Project                     project-GYBZ3yj4BzFgB61K7vX9Fq9K
Folder                      /Aim4_CrossAncestryGeneticCorrelation/jahagirdarob
                            /sumstats_files
Name                        EUR_meta_ERpos.sumstats.txt
State                       closing
Visibility                  visible
Types                       -
Properties                  -
Tags                        -
Outgoing links              -
Created                     Tue Nov 28 00:52:26 2023
Created by                  omjahagirdar
 via the job                job-GbXXjyQ4BzFjf8X177yJQvKq
Last modified               Tue Nov 28 00:52:29 2023
Media type                  
archivalState               "live"
cloudAccount                "cloudaccount-dnanexus"


Generate the European summary statistics again, this time using the total sample size rather than the per-variant effective sample size for N.

In [17]:
# num_samples=21468 sets N to this constant for every variant instead of the per-variant effective sample size
eur_meta_sumstats_dir = "/mnt/project/Aim2_Polygenicity/HZ/Results/Clean_summary_data/European_ERneg_BCAC_icogs_onco_sumdata.txt"
get_popcorn_input_from_eur_meta_sumstats(eur_meta_sumstats_dir, "EUR_meta_ERneg_totsamples.sumstats.txt", bim_df=eur_bim_df, overall=False, num_samples=21468)

get_popcorn_input_from_eur_meta_sumstats: reading sumstats from /mnt/project/Aim2_Polygenicity/HZ/Results/Clean_summary_data/European_ERneg_BCAC_icogs_onco_sumdata.txt
get_popcorn_input_from_eur_meta_sumstats: processing summary statistics
get_popcorn_input_from_eur_meta_sumstats: writing sumstats to EUR_meta_ERneg_totsamples.sumstats.txt
                rsid a1 a2     beta       SE      N
0        rs575272151  C  G -0.05444  0.03106  21468
1        rs544419019  C  G -0.05444  0.03106  21468
2        rs540538026  A  G  0.01441  0.04900  21468
3         rs62635286  G  T  0.04391  0.02778  21468
4        rs200579949  A  G -0.04391  0.02778  21468
...              ... .. ..      ...      ...    ...
8533390   rs62240045  A  G -0.01922  0.02512  21468
8533391    rs3896457  C  T -0.00519  0.01893  21468
8533392  rs200607599  A  G  0.07123  0.08080  21468
8533393  rs370652263  A  G -0.02375  0.03095  21468
8533394  rs202228854  C  T  0.04888  0.04917  21468

[8533395 rows x 6 columns]
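
Compared with the earlier ER-negative run, the only visible change in this output is the N column: the per-variant values (6010.0, 4240.0, ...) are replaced by the constant 21468 passed as `num_samples`. How the function implements this is not shown in the notebook, so the following is only a sketch of that choice; `choose_sample_size` is a hypothetical helper.

```python
def choose_sample_size(n_eff_values, num_samples=None):
    """Return the N column to write for each variant.

    If `num_samples` is None, keep the per-variant effective sample sizes.
    Otherwise use the same total sample size for every variant.
    """
    n_eff_values = list(n_eff_values)
    if num_samples is None:
        return n_eff_values
    return [num_samples] * len(n_eff_values)


# Effective sample sizes of the first three variants in the earlier ER-negative output.
n_eff = [6010.0, 6010.0, 4240.0]
print(choose_sample_size(n_eff))                     # prints [6010.0, 6010.0, 4240.0]
print(choose_sample_size(n_eff, num_samples=21468))  # prints [21468, 21468, 21468]
```

Passing the total as an integer also keeps N free of the trailing `.0` seen in the effective-sample-size files.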


Take a look at the European ER-negative summary statistics (total sample size) cleaned for Popcorn.

In [18]:
!head "EUR_meta_ERneg_totsamples.sumstats.txt"

rsid	a1	a2	beta	SE	N
rs575272151	C	G	-0.05444	0.03106	21468
rs544419019	C	G	-0.05444	0.03106	21468
rs540538026	A	G	0.01441	0.049	21468
rs62635286	G	T	0.04391	0.02778	21468
rs200579949	A	G	-0.04391	0.02778	21468
rs531730856	C	G	-0.04069	0.03016	21468
rs546169444	A	T	-0.01356	0.02904	21468
rs531646671	A	T	0.04288	0.02746	21468
rs541940975	A	G	-0.04288	0.02746	21468


In [19]:
!dx upload "EUR_meta_ERneg_totsamples.sumstats.txt"

ID                          file-GbggQX84BzFpKgfQXf9gz9Vy
Class                       file
Project                     project-GYBZ3yj4BzFgB61K7vX9Fq9K
Folder                      /Aim4_CrossAncestryGeneticCorrelation/jahagirdarob
                            /sumstats_files
Name                        EUR_meta_ERneg_totsamples.sumstats.txt
State                       closing
Visibility                  visible
Types                       -
Properties                  -
Tags                        -
Outgoing links              -
Created                     Tue Dec  5 19:21:45 2023
Created by                  omjahagirdar
 via the job                job-GbggB5Q4BzFgqk7V1GxF5Z42
Last modified               Tue Dec  5 19:21:48 2023
Media type                  
archivalState               "live"
cloudAccount                "cloudaccount-dnanexus"


In [20]:
# num_samples=69501 sets N to this constant for every variant instead of the per-variant effective sample size
eur_meta_sumstats_dir = "/mnt/project/Aim2_Polygenicity/HZ/Results/Clean_summary_data/European_ERpos_BCAC_icogs_onco_sumdata.txt"
get_popcorn_input_from_eur_meta_sumstats(eur_meta_sumstats_dir, "EUR_meta_ERpos_totsamples.sumstats.txt", bim_df=eur_bim_df, overall=False, num_samples=69501)

get_popcorn_input_from_eur_meta_sumstats: reading sumstats from /mnt/project/Aim2_Polygenicity/HZ/Results/Clean_summary_data/European_ERpos_BCAC_icogs_onco_sumdata.txt
get_popcorn_input_from_eur_meta_sumstats: processing summary statistics
get_popcorn_input_from_eur_meta_sumstats: writing sumstats to EUR_meta_ERpos_totsamples.sumstats.txt
                rsid a1 a2     beta       SE      N
0        rs575272151  C  G -0.02268  0.01886  69501
1        rs544419019  C  G -0.02268  0.01886  69501
2        rs540538026  A  G -0.04472  0.02964  69501
3         rs62635286  G  T  0.00138  0.01693  69501
4        rs200579949  A  G -0.00138  0.01693  69501
...              ... .. ..      ...      ...    ...
8533390   rs62240045  A  G  0.00559  0.01546  69501
8533391    rs3896457  C  T -0.00293  0.01162  69501
8533392  rs200607599  A  G  0.06058  0.04959  69501
8533393  rs370652263  A  G -0.00320  0.01924  69501
8533394  rs202228854  C  T  0.03733  0.03049  69501

[8533395 rows x 6 columns]


Take a look at the European ER-positive summary statistics (total sample size) cleaned for Popcorn.

In [21]:
!head "EUR_meta_ERpos_totsamples.sumstats.txt"

rsid	a1	a2	beta	SE	N
rs575272151	C	G	-0.02268	0.01886	69501
rs544419019	C	G	-0.02268	0.01886	69501
rs540538026	A	G	-0.04472	0.02964	69501
rs62635286	G	T	0.00138	0.01693	69501
rs200579949	A	G	-0.00138	0.01693	69501
rs531730856	C	G	-0.00759	0.01859	69501
rs546169444	A	T	-0.01378	0.01779	69501
rs531646671	A	T	0.0169	0.01679	69501
rs541940975	A	G	-0.0169	0.01679	69501


In [22]:
!dx upload "EUR_meta_ERpos_totsamples.sumstats.txt"

ID                          file-GbggV884BzFfF0j72650Bj14
Class                       file
Project                     project-GYBZ3yj4BzFgB61K7vX9Fq9K
Folder                      /Aim4_CrossAncestryGeneticCorrelation/jahagirdarob
                            /sumstats_files
Name                        EUR_meta_ERpos_totsamples.sumstats.txt
State                       closing
Visibility                  visible
Types                       -
Properties                  -
Tags                        -
Outgoing links              -
Created                     Tue Dec  5 19:23:13 2023
Created by                  omjahagirdar
 via the job                job-GbggB5Q4BzFgqk7V1GxF5Z42
Last modified               Tue Dec  5 19:23:16 2023
Media type                  
archivalState               "live"
cloudAccount                "cloudaccount-dnanexus"


In [23]:
!dx ls

AFR.sumstats.txt
AFR_meta.sumstats.txt
AFR_meta_ERneg.sumstats.txt
AFR_meta_ERneg_totsamples.sumstats.txt
AFR_meta_ERpos.sumstats.txt
AFR_meta_ERpos_totsamples.sumstats.txt
EAS.sumstats.txt
EAS_meta.sumstats.txt
EUR.sumstats.txt
EUR_meta.sumstats.txt
EUR_meta_ERneg.sumstats.txt
EUR_meta_ERneg_totsamples.sumstats.txt
EUR_meta_ERpos.sumstats.txt
EUR_meta_ERpos_totsamples.sumstats.txt
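
As a final check, the listing above can be compared against the set of ER-subtype files this section is expected to produce. The sketch below builds the expected names from the naming pattern used in the upload cells and reports any that are missing; the listing is pasted in as a string here, and the variable names are hypothetical.

```python
from itertools import product

# Output of `dx ls`, copied from the cell above.
listing = """
AFR.sumstats.txt
AFR_meta.sumstats.txt
AFR_meta_ERneg.sumstats.txt
AFR_meta_ERneg_totsamples.sumstats.txt
AFR_meta_ERpos.sumstats.txt
AFR_meta_ERpos_totsamples.sumstats.txt
EAS.sumstats.txt
EAS_meta.sumstats.txt
EUR.sumstats.txt
EUR_meta.sumstats.txt
EUR_meta_ERneg.sumstats.txt
EUR_meta_ERneg_totsamples.sumstats.txt
EUR_meta_ERpos.sumstats.txt
EUR_meta_ERpos_totsamples.sumstats.txt
""".split()

# Expected subtype files: population x subtype x (effective N | total N).
expected = [
    f"{population}_meta_{subtype}{suffix}.sumstats.txt"
    for population, subtype, suffix in product(
        ("AFR", "EUR"), ("ERneg", "ERpos"), ("", "_totsamples")
    )
]

missing = sorted(set(expected) - set(listing))
print(missing)  # prints []
```

An empty list means that all eight subtype files (two populations, two subtypes, two sample-size conventions) are present in the project folder.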
