Created a package `corradin_ovp_utils` to:
- Fill the gap in packages that read genetic data in Python (`limix`, `Hail`, `scikit-allel`)
- Facilitate data analysis outside of the outside variant pipeline, for common analyses specific to the Corradin lab
- Decompose the pipeline into data processing code and monitoring/logging code

In [None]:
from corradin_ovp_utils.catalog import test_data_catalog as catalog

Load several different test datasets from the catalog

In [None]:
genetic_file_single = catalog.load("genetic_file_single")
genetic_file_case_control = catalog.load("genetic_file_case_control")
genetic_file_split_by_chrom = catalog.load("genetic_file_split_by_chrom")

sample_file_single = catalog.load("sample_file_single")
sample_file_case_control = catalog.load("sample_file_case_control")

---

## High level

### Read in data

In [None]:
from corradin_ovp_utils.datasets.CombinedGenoPheno import CombinedGenoPheno

You don't have to understand this part. We're just loading the file to give you an idea of the format.

In [None]:
genetic_file_case_chrom22 = genetic_file_case_control.files.case.load(chrom=22)
genetic_file_case_chrom22.load_df()

Unnamed: 0_level_0,dashes,rsid,position,alleleA,alleleB,sample1,sample1,sample1,sample2,sample2,...,sample9769,sample9770,sample9770,sample9770,sample9771,sample9771,sample9771,sample9772,sample9772,sample9772
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,AA,AB,BB,AA,AB,...,BB,AA,AB,BB,AA,AB,BB,AA,AB,BB
0,---,rs77948203,21249165,G,A,1,0.0,0.0,1,0.0,...,0,1,0,0,1,0.0,0.0,1,0,0
1,---,rs1014626,21461017,C,T,0,0.0,1.0,0,0.0,...,1,0,0,1,0,0.0,1.0,0,0,1
2,---,rs9610458,22205353,C,T,0,0.0,1.0,0,0.0,...,0,0,1,0,0,1.0,0.0,0,0,1
3,---,rs5762201,27888455,A,G,0,0.0,1.0,0,0.012,...,1,0,0,1,0,0.0,1.0,0,0,1
4,---,rs1004237,28068501,C,T,1,0.0,0.0,1,0.0,...,0,1,0,0,1,0.0,0.0,1,0,0
5,---,rs134490,28730175,C,T,0,0.232,0.768,0,0.014,...,0,0,1,0,0,0.356,0.644,0,0,1
6,---,rs4821519,37102100,G,C,1,0.0,0.0,0,1.0,...,0,1,0,0,1,0.0,0.0,1,0,0
7,---,rs1003500,37262769,C,T,1,0.0,0.0,1,0.0,...,0,1,0,0,1,0.0,0.0,1,0,0
8,---,rs5756405,37310954,A,G,0,1.0,0.0,1,0.0,...,1,0,1,0,1,0.0,0.0,0,1,0
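
In this format each sample contributes three columns (`AA`, `AB`, `BB`) holding the probabilities of the three genotypes in terms of `alleleA` and `alleleB`. A minimal sketch of how such a triple could be turned into a genotype call like the ones shown later in this notebook. The helper `call_genotype` and the threshold of 0.9 are assumptions for illustration only; the package's actual rule may differ.

```python
from typing import Optional, Sequence


def call_genotype(probs: Sequence[float], allele_a: str, allele_b: str,
                  threshold: float = 0.9) -> Optional[str]:
    """Return the most likely genotype, or None if it is not confident enough.

    `threshold` is an assumed value, not one taken from corradin_ovp_utils.
    """
    # Genotype labels in the same order as the AA, AB, BB columns.
    # The heterozygote is written with its alleles in alphabetical order.
    genotypes = (allele_a * 2, "".join(sorted(allele_a + allele_b)), allele_b * 2)
    best = max(range(3), key=lambda i: probs[i])
    return genotypes[best] if probs[best] >= threshold else None


# rs77948203 (alleleA=G, alleleB=A), sample1 probabilities 1, 0, 0
print(call_genotype([1, 0.0, 0.0], "G", "A"))
# rs134490 (alleleA=C, alleleB=T), sample1 probabilities 0, 0.232, 0.768
print(call_genotype([0, 0.232, 0.768], "C", "T"))
```

With this assumed threshold, the second call returns `None` because no genotype is certain enough, which would explain a missing value for that sample.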


Finding SNPs:
- SNPs can be requested from different chromosomes
- SNPs can be identified by different id columns (here `rsid`, `position` and `alleleB`)

Reading a file now takes at most 1.7 GB of memory, no matter how big the file is, because large files are processed in batches. No more memory problems.
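
To give an idea of how batch processing keeps memory bounded, here is a minimal sketch using the `chunksize` option of `pandas.read_csv`. It is not the package's implementation: the tiny in-memory file, its column layout and the batch size are all made up for illustration.

```python
import io

import pandas as pd

# Tiny stand-in for a large space-delimited genetic file (illustrative columns only).
raw = io.StringIO(
    "rsid position alleleA alleleB\n"
    "rs77948203 21249165 G A\n"
    "rs1014626 21461017 C T\n"
    "rs9610458 22205353 C T\n"
    "rs134490 28730175 C T\n"
)
wanted = {"rs77948203", "rs134490"}

found = []
# With chunksize, read_csv yields one DataFrame per batch, so only
# `chunksize` rows are held in memory at any time.
for batch in pd.read_csv(raw, sep=" ", chunksize=2):
    found.append(batch[batch["rsid"].isin(wanted)])

# Only the matching rows are kept and combined at the end.
found_df = pd.concat(found, ignore_index=True)
print(found_df)
```

Peak memory then depends on the batch size and the number of SNPs found, not on the size of the file.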

In [None]:
test = CombinedGenoPheno.init_from_OVPDataset(
    genetic_file_case_control,
    sample_file_case_control,
    rsid_dict={22: ["rs77948203", "rs9610458", "rs134490", "rs5756405", "21461017", "C"]},
    id_col_list=["rsid", "position", "alleleB"],
)

reading genetic file and collecting found SNPs for file data/test_data/gen_file/test_CASE_MS_chr22.gen


0it [00:00, ?it/s]

processing last batch


0it [00:00, ?it/s]

reading genetic file and collecting found SNPs for file data/test_data/gen_file/test_CONTROL_MS_chr22.gen


0it [00:00, ?it/s]

processing last batch


0it [00:00, ?it/s]

In [None]:
test

CombinedGenoPheno(num_snps=6, num_samples={'case': 9772, 'control': 5175})

Notice that, because we supplied the sample file, the samples now have the appropriate `sample_ids`

In [None]:
test.all_samples_geno_df

id_col,rs77948203,21461017,rs9610458,rs134490,C,rs5756405
sample_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
WTCCCT473540,GG,TT,TT,,GG,AG
WTCCCT473530,GG,TT,TT,TT,CG,AA
WTCCCT473555,GG,TT,TT,TT,GG,
WTCCCT473426,GG,TT,TT,TT,GG,GG
WTCCCT473489,GG,TT,CT,,GG,AA
...,...,...,...,...,...,...
WS574632,GG,TT,CT,TT,GG,GG
WS574661,GG,TT,TT,TT,GG,AA
BLOOD294452,GG,TT,CT,TT,GG,AG
WTCCCT511021,GG,TT,CT,TT,GG,AG


In [None]:
test.get_geno_each_sample_subset("case")

id_col,rs77948203,21461017,rs9610458,rs134490,C,rs5756405
ID_2,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
WTCCCT473540,GG,TT,TT,,GG,AG
WTCCCT473530,GG,TT,TT,TT,CG,AA
WTCCCT473555,GG,TT,TT,TT,GG,
WTCCCT473426,GG,TT,TT,TT,GG,GG
WTCCCT473489,GG,TT,CT,,GG,AA
...,...,...,...,...,...,...
WTCCCT473455,GG,TT,TT,TT,GG,AG
WTCCCT473479,GG,TT,CT,CT,GG,GG
WTCCCT473432,GG,TT,CT,CT,GG,AG
WTCCCT473465,GG,TT,CT,,GG,AA


In [None]:
test.get_geno_each_sample_subset("control")

id_col,rs77948203,21461017,rs9610458,rs134490,C,rs5756405
ID_2,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
WTCCCT443025,GG,TT,TT,CT,GG,AG
WTCCCT443065,GG,TT,CT,CT,GG,AG
WTCCCT443063,GG,TT,TT,CC,GG,GG
WTCCCT443026,GG,TT,CC,CT,GG,AG
WTCCCT443066,GG,TT,CT,TT,GG,GG
...,...,...,...,...,...,...
WS574632,GG,TT,CT,TT,GG,GG
WS574661,GG,TT,TT,TT,GG,AA
BLOOD294452,GG,TT,CT,TT,GG,AG
WTCCCT511021,GG,TT,CT,TT,GG,AG


In [None]:
test.all_geno_df

We can call this function on any dataset (MS, breast cancer, UKBiobank) and on any file types we add in the future. Writing a new file type is easy, and you don't have to know or understand any of the other code to do it.

### Investigate SNP combinations

Decoupling reading the data from investigating SNP combinations means that we can start from any file with the same format as the table below, without reading the data again or needing access to the original genetic file.

It also means we can save this table and load just a subset of SNPs/columns at a time, using very little memory. In the old pipeline, the more SNPs you query, the more memory it takes and the more cores you have to give it, since it stores all the SNPs at all times.
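
A minimal sketch of saving such a table and reading back only some of the SNP columns, using plain pandas and CSV. The sample ids, genotypes and file name are made up for illustration, and the package may well use a different storage format.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for the genotype table (hypothetical samples and genotypes).
geno_df = pd.DataFrame(
    {
        "rs77948203": ["GG", "GG", "AG"],
        "rs9610458": ["TT", "CT", "TT"],
        "rs134490": ["TT", None, "CT"],
    },
    index=pd.Index(["s1", "s2", "s3"], name="sample_id"),
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "geno.csv")
    geno_df.to_csv(path)
    # usecols restricts parsing to the listed columns, so the other
    # SNP columns are never loaded into memory.
    subset = pd.read_csv(
        path,
        usecols=["sample_id", "rs77948203", "rs134490"],
        index_col="sample_id",
    )

print(subset)
```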

In [None]:
from corradin_ovp_utils.odds_ratio import get_geno_combination_df

In [None]:
all_samples_geno_df = test.all_samples_geno_df
all_samples_geno_df

id_col,rs77948203,21461017,rs9610458,rs134490,C,rs5756405
sample_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
WTCCCT473540,GG,TT,TT,,GG,AG
WTCCCT473530,GG,TT,TT,TT,CG,AA
WTCCCT473555,GG,TT,TT,TT,GG,
WTCCCT473426,GG,TT,TT,TT,GG,GG
WTCCCT473489,GG,TT,CT,,GG,AA
...,...,...,...,...,...,...
WS574632,GG,TT,CT,TT,GG,GG
WS574661,GG,TT,TT,TT,GG,AA
BLOOD294452,GG,TT,CT,TT,GG,AG
WTCCCT511021,GG,TT,CT,TT,GG,AG


In [None]:
rsid_combo = get_geno_combination_df(all_samples_geno_df,
                                     rsid_list = ["rs77948203", "rs134490"])
rsid_combo

RsidComboInfo(rsid_list=['rs77948203', 'rs134490'], NA_val='NA')

In [None]:
rsid_combo.df

Unnamed: 0,rs77948203,rs134490,unique_samples_id,unique_samples_count
0,AA,CC,[BLOOD293241],1
1,AA,CT,"[WTCCCT474448, WTCCCT466178, WTCCCT470219, WTC...",19
2,AA,,"[WTCCCT474394, WTCCCT470548, WTCCCT508352, WTC...",9
3,AA,TT,"[WTCCCT474560, WTCCCT469955, WTCCCT470264, WTC...",49
4,AG,CC,"[WTCCCT470291, WTCCCT507975, WTCCCT497897, WTC...",45
5,AG,CT,"[WTCCCT489588, WTCCCT489586, WTCCCT489637, WTC...",493
6,AG,,"[WTCCCT473497, WTCCCT473524, WTCCCT489613, WTC...",248
7,AG,TT,"[WTCCCT473522, WTCCCT473514, WTCCCT473551, WTC...",1159
8,GG,CC,"[WTCCCT489604, WTCCCT489620, WTCCCT489645, WTC...",328
9,GG,CT,"[WTCCCT473552, WTCCCT473447, WTCCCT473505, WTC...",3078


In [None]:
rsid_combo.num_samples_NA

2043

In [None]:
rsid_combo.total_samples_no_NA

12904
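
Note that 2043 + 12904 = 14947, the total number of samples loaded above (9772 cases + 5175 controls), which suggests that `num_samples_NA` counts the samples with a missing genotype in at least one of the chosen SNPs. Under that assumption, a combination table like this one could be built with a pandas `groupby`. This is only a sketch on made-up data, not the code inside `get_geno_combination_df`.

```python
import pandas as pd

# Small stand-in for all_samples_geno_df (hypothetical samples and genotypes).
geno_df = pd.DataFrame(
    {
        "rs77948203": ["GG", "GG", "AG", "GG", "AG"],
        "rs134490": ["TT", None, "CT", "TT", None],
    },
    index=pd.Index(["s1", "s2", "s3", "s4", "s5"], name="sample_id"),
)
rsid_list = ["rs77948203", "rs134490"]
na_val = "NA"

# Give missing genotypes an explicit label so groupby keeps those rows.
filled = geno_df[rsid_list].fillna(na_val).reset_index()

# One row per genotype combination, with the matching samples and their count.
combo_df = (
    filled.groupby(rsid_list)
    .agg(
        unique_samples_id=("sample_id", list),
        unique_samples_count=("sample_id", "count"),
    )
    .reset_index()
)

# Samples with a missing genotype in any of the chosen SNPs (assumed definition).
has_na = (filled[rsid_list] == na_val).any(axis=1)
num_samples_na = int(has_na.sum())
total_samples_no_na = int((~has_na).sum())

print(combo_df)
print(num_samples_na, total_samples_no_na)
```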

In [None]:
rsid_combo.query(rs77948203="AA", rs134490="CT")

Unnamed: 0,rs77948203,rs134490,unique_samples_id,unique_samples_count
1,AA,CT,"[WTCCCT474448, WTCCCT466178, WTCCCT470219, WTC...",19


In [None]:
rsid_combo.query(rs77948203="AA", rs134490="CT").unique_samples_count.item()

19

In [None]:
rsid_combo.query(rs77948203="AA", rs134490="CT").unique_samples_id.item()

array(['WTCCCT474448', 'WTCCCT466178', 'WTCCCT470219', 'WTCCCT468665',
       'WTCCCT471002', 'WTCCCT466993', 'WTCCCT508925', 'WTCCCT508195',
       'WTCCCT476283', 'WTCCCT448730', 'WTCCCT500775', 'WTCCCT508309',
       'CCC2_MS656176', 'WTCCCT444162', 'WTCCCT443119', 'WTCCCT543236',
       'WTCCC88305', 'WTCCCT511322', 'BLOOD292928'], dtype=object)

**We are not limited to combinations of 2 SNPs; any number of SNPs works**

In [None]:
rsid_combo_4_SNPs = get_geno_combination_df(all_samples_geno_df,
                                     rsid_list = ["rs77948203", "21461017", "rs134490", "rs5756405"])
rsid_combo_4_SNPs

RsidComboInfo(rsid_list=['rs77948203', '21461017', 'rs134490', 'rs5756405'], NA_val='NA')

In [None]:
rsid_combo_4_SNPs.df

Unnamed: 0,rs77948203,21461017,rs134490,rs5756405,unique_samples_id,unique_samples_count
0,AA,TT,CC,AA,[BLOOD293241],1
1,AA,TT,CT,AA,"[WTCCCT470219, WTCCCT471002, WTCCCT476283, WTC...",6
2,AA,TT,CT,AG,"[WTCCCT468665, WTCCCT466993, WTCCCT448730, WTC...",5
3,AA,TT,CT,GG,"[WTCCCT474448, WTCCCT466178, WTCCCT508925, WTC...",7
4,AA,TT,CT,,[WTCCCT443119],1
...,...,...,...,...,...,...
59,GG,TT,,,"[WTCCCT515244, WTCCCT465972, WTCCCT467601, WTC...",32
60,GG,TT,TT,AA,"[WTCCCT473530, WTCCCT473468, WTCCCT473462, WTC...",1815
61,GG,TT,TT,AG,"[WTCCCT473435, WTCCCT473500, WTCCCT473537, WTC...",3667
62,GG,TT,TT,GG,"[WTCCCT473426, WTCCCT473456, WTCCCT473515, WTC...",2060


See what genotypes exist in the data

In [None]:
rsid_combo_4_SNPs.get_all_genos("rs77948203")

array(['AA', 'AG', 'GG'], dtype=object)

In [None]:
rsid_combo_4_SNPs.get_all_genos("21461017")

array(['TT', 'CT', 'NA'], dtype=object)

In [None]:
rsid_combo_4_SNPs.get_all_genos("rs134490")

array(['CC', 'CT', 'NA', 'TT'], dtype=object)

---

**`genetic_file_single` and `sample_file_single` are both `OVPDataset` objects, so they have similar interfaces**

In [None]:
genetic_file_single

<corradin_ovp_utils.datasets.OVPDataset.OVPDataset at 0x1529612dffa0>

In [None]:
sample_file_single

<corradin_ovp_utils.datasets.OVPDataset.OVPDataset at 0x1529612e97c0>

In [None]:
genetic_file_case_control.files.case.load_all_chrom()

{1: None,
 2: None,
 3: None,
 4: None,
 5: None,
 6: None,
 7: None,
 8: None,
 9: None,
 10: None,
 11: None,
 12: None,
 13: None,
 14: None,
 15: None,
 16: None,
 17: None,
 18: None,
 19: None,
 20: None,
 21: None,
 22: GenFileObject(chrom=22, file_path=Path('data/test_data/gen_file/test_CASE_MS_chr22.gen'))}

In [None]:
genetic_file_split_by_chrom.files.case.load_all_chrom()

Cannot find file data/test_data/gen_file/test_CASE_MS_chr1.gen
Cannot find file data/test_data/gen_file/test_CASE_MS_chr2.gen
Cannot find file data/test_data/gen_file/test_CASE_MS_chr3.gen
Cannot find file data/test_data/gen_file/test_CASE_MS_chr4.gen
Cannot find file data/test_data/gen_file/test_CASE_MS_chr5.gen
Cannot find file data/test_data/gen_file/test_CASE_MS_chr6.gen
Cannot find file data/test_data/gen_file/test_CASE_MS_chr7.gen
Cannot find file data/test_data/gen_file/test_CASE_MS_chr8.gen
Cannot find file data/test_data/gen_file/test_CASE_MS_chr9.gen
Cannot find file data/test_data/gen_file/test_CASE_MS_chr10.gen
Cannot find file data/test_data/gen_file/test_CASE_MS_chr11.gen
Cannot find file data/test_data/gen_file/test_CASE_MS_chr12.gen
Cannot find file data/test_data/gen_file/test_CASE_MS_chr13.gen
Cannot find file data/test_data/gen_file/test_CASE_MS_chr14.gen
Cannot find file data/test_data/gen_file/test_CASE_MS_chr15.gen
Cannot find file data/test_data/gen_file/test_CAS

{1: None,
 2: None,
 3: None,
 4: None,
 5: None,
 6: None,
 7: None,
 8: None,
 9: None,
 10: None,
 11: None,
 12: None,
 13: None,
 14: None,
 15: None,
 16: None,
 17: None,
 18: None,
 19: None,
 20: None,
 21: None,
 22: GenFileObject(chrom=22, file_path=Path('data/test_data/gen_file/test_CASE_MS_chr22.gen'))}
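
The output above maps each chromosome to a file object, or to `None` when the file for that chromosome does not exist. A minimal sketch of that lookup, assuming one file per chromosome named from a template. The function `find_chrom_files` and its parameters are hypothetical and not part of `corradin_ovp_utils`.

```python
import tempfile
from pathlib import Path
from typing import Dict, Iterable, Optional


def find_chrom_files(folder: Path, template: str,
                     chroms: Iterable[int] = range(1, 23)) -> Dict[int, Optional[Path]]:
    """Map each chromosome to its file path, or to None when the file is missing."""
    result: Dict[int, Optional[Path]] = {}
    for chrom in chroms:
        path = folder / template.format(chrom=chrom)
        if path.exists():
            result[chrom] = path
        else:
            print(f"Cannot find file {path}")
            result[chrom] = None
    return result


# Demo in a temporary folder that only contains a chromosome 22 file.
with tempfile.TemporaryDirectory() as tmp_dir:
    folder = Path(tmp_dir)
    (folder / "test_CASE_MS_chr22.gen").touch()
    files = find_chrom_files(folder, "test_CASE_MS_chr{chrom}.gen")

print({chrom: (path.name if path else None) for chrom, path in files.items()})
```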

---

Load an entire file for a specific chromosome. Without the sample file, the samples are automatically numbered.
- The user decides what counts as "big", so that files above that size are not loaded in full (see `size_limit` below)

In [None]:
genetic_file_case_chrom22 = genetic_file_case_control.files.case.load(chrom=22)
genetic_file_case_chrom22.load_df()

Unnamed: 0_level_0,dashes,rsid,position,alleleA,alleleB,sample1,sample1,sample1,sample2,sample2,...,sample9769,sample9770,sample9770,sample9770,sample9771,sample9771,sample9771,sample9772,sample9772,sample9772
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,AA,AB,BB,AA,AB,...,BB,AA,AB,BB,AA,AB,BB,AA,AB,BB
0,---,rs77948203,21249165,G,A,1,0.0,0.0,1,0.0,...,0,1,0,0,1,0.0,0.0,1,0,0
1,---,rs1014626,21461017,C,T,0,0.0,1.0,0,0.0,...,1,0,0,1,0,0.0,1.0,0,0,1
2,---,rs9610458,22205353,C,T,0,0.0,1.0,0,0.0,...,0,0,1,0,0,1.0,0.0,0,0,1
3,---,rs5762201,27888455,A,G,0,0.0,1.0,0,0.012,...,1,0,0,1,0,0.0,1.0,0,0,1
4,---,rs1004237,28068501,C,T,1,0.0,0.0,1,0.0,...,0,1,0,0,1,0.0,0.0,1,0,0
5,---,rs134490,28730175,C,T,0,0.232,0.768,0,0.014,...,0,0,1,0,0,0.356,0.644,0,0,1
6,---,rs4821519,37102100,G,C,1,0.0,0.0,0,1.0,...,0,1,0,0,1,0.0,0.0,1,0,0
7,---,rs1003500,37262769,C,T,1,0.0,0.0,1,0.0,...,0,1,0,0,1,0.0,0.0,1,0,0
8,---,rs5756405,37310954,A,G,0,1.0,0.0,1,0.0,...,1,0,1,0,1,0.0,0.0,0,1,0


In [None]:
genetic_file_case_chrom22.load_df(size_limit = 10_000)

MemoryError: the file's size (799K) is too big, input limit is 10K.
 Please increase the limit or choose a smaller file
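
A guard like this can be sketched with `os.path.getsize`, checking the file size before anything is read. The helper name `check_size`, the byte-based limit and the message wording are assumptions for illustration; the package's own check may work differently.

```python
import os
import tempfile


def check_size(path: str, size_limit: int) -> int:
    """Return the file size in bytes, or raise MemoryError if it exceeds `size_limit`."""
    size = os.path.getsize(path)
    if size > size_limit:
        raise MemoryError(
            f"the file's size ({size} bytes) is too big, "
            f"input limit is {size_limit} bytes."
        )
    return size


# Demo with a 20,000-byte temporary file and a 10,000-byte limit.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"x" * 20_000)
    tmp_path = handle.name

try:
    check_size(tmp_path, size_limit=10_000)
    error_message = None
except MemoryError as err:
    error_message = str(err)
finally:
    os.remove(tmp_path)

print(error_message)
```

Checking the size up front means an oversized file fails fast, before any memory is spent on parsing it.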