[![CI](https://github.com/corradin-lab/corradin-ovp-utils/actions/workflows/main.yml/badge.svg)](https://github.com/corradin-lab/corradin-ovp-utils/actions/workflows/main.yml)

In [None]:
#hide
from corradin_ovp_utils import *
from corradin_ovp_utils.catalog import test_data_catalog, conf_test_data_catalog
from corradin_ovp_utils.datasets.genetic_file import triplicate_converter
from corradin_ovp_utils.datasets.CombinedGenoPheno import CombinedGenoPheno
from corradin_ovp_utils.odds_ratio import get_geno_combination_df


# Outside Variant utility functions

> A library of functions for the Outside Variant Pipeline that provides a unified API and facilitates interactive exploration, so that anyone can run the pipeline.

## Install

`pip install your_project_name`

## How to use

This library provides a unified API for reading genetic data in different formats and specifications and combining it with phenotype data to build the core data structure of the Outside Variant Pipeline. It decouples data ingestion from downstream analyses, which makes it easy to support new input data formats while keeping the core analysis features unchanged.

### Quick Start

After specifying the inputs in `yaml` files (see `conf/base/catalog_input/genetic_file.yaml` and `conf/base/catalog_input/sample_file.yaml`), we can load the datasets like this:

In [None]:
#collapse_input closed

from kedro.config import ConfigLoader
from kedro.io import DataCatalog
conf_loader = ConfigLoader("conf/base")
conf_test_data_catalog = conf_loader.get("catalog*.yaml", "catalog*/*.yaml")
test_data_catalog = DataCatalog.from_config(conf_test_data_catalog)

Each of them is an `OVPDataset` with a unified API:

In [None]:
genetic_file = test_data_catalog.load("genetic_file")
sample_file = test_data_catalog.load("sample_file")

genetic_file, sample_file

(<corradin_ovp_utils.datasets.OVPDataset.OVPDataset at 0x7fd8383aa4f0>,
 <corradin_ovp_utils.datasets.OVPDataset.OVPDataset at 0x7fd878c9fcd0>)

We can see that `genetic_file` and `sample_file` each contain two files, one for cases and one for controls:

In [None]:
genetic_file.full_file_path

{'case': PosixPath('data/test_data/gen_file/test_CASE_MS_chr22.gen'),
 'control': PosixPath('data/test_data/gen_file/test_CONTROL_MS_chr22.gen')}

In [None]:
sample_file.full_file_path

{'case': PosixPath('data/test_data/sample_file/MS_impute2_ALL_sample_out.tsv'),
 'control': PosixPath('data/test_data/sample_file/ALL_controls_58C_NBS_WTC2_impute2_sample_out.tsv')}

To extract the core data structure of the Outside Variant Pipeline, you need:
- A genetic file
- A sample file
- A list of rsIDs to extract

Feed these inputs into `CombinedGenoPheno` to get back a dataframe of genotypes for those SNPs across all samples (both case and control). **This is the core data structure of the pipeline**.

In [None]:
all_samples_geno_df = CombinedGenoPheno.init_from_OVPDataset(genetic_file, sample_file, rsid_list = ["rs77948203", "rs9610458", "rs134490", "rs5756405"])
all_samples_geno_df

| sample_id | rs77948203 | rs9610458 | rs134490 | rs5756405 |
|---|---|---|---|---|
| WTCCCT473540 | GG | TT | | AG |
| WTCCCT473530 | GG | TT | TT | AA |
| WTCCCT473555 | GG | TT | TT | |
| WTCCCT473426 | GG | TT | TT | GG |
| WTCCCT473489 | GG | CT | | AA |
| ... | ... | ... | ... | ... |
| WS574632 | GG | CT | TT | GG |
| WS574661 | GG | TT | TT | AA |
| BLOOD294452 | GG | CT | TT | AG |
| WTCCCT511021 | GG | CT | TT | AG |
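
Some cells in this output are empty, which corresponds to the low quality (`NA`) genotypes mentioned further down. In the Oxford `.gen` layout, each sample contributes three genotype probabilities per SNP. The sketch below shows one way such a triplet could be turned into a genotype string. It is not the library's implementation: `call_genotype`, the `0.9` threshold, and the alphabetical ordering of heterozygous alleles are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def call_genotype(probs: Sequence[float], ref: str, alt: str,
                  threshold: float = 0.9) -> Optional[str]:
    """Turn one (P(ref/ref), P(ref/alt), P(alt/alt)) triplet into a genotype.

    Hypothetical helper: returns None (an NA call) when the most likely
    genotype does not reach `threshold`, which is an assumed cutoff.
    """
    if len(probs) != 3:
        raise ValueError("expected exactly three genotype probabilities")
    # Heterozygous calls are written with alleles in alphabetical order (assumption).
    candidates = (ref + ref, "".join(sorted(ref + alt)), alt + alt)
    best = max(range(3), key=lambda i: probs[i])
    return candidates[best] if probs[best] >= threshold else None


triplets = [(0.0, 0.02, 0.98), (0.01, 0.97, 0.02), (0.4, 0.35, 0.25)]
calls = [call_genotype(p, ref="A", alt="G") for p in triplets]
print(calls)  # ['GG', 'AG', None]
```

The last triplet has no clearly dominant genotype, so it is reported as missing rather than guessed.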


We can then take this output dataframe and run downstream analyses on it using the functions in this library. For example, let's see the breakdown of genotypes grouped by the two SNPs `"rs77948203"` and `"rs9610458"`:

In [None]:
get_geno_combination_df(geno_each_sample_df=all_samples_geno_df, 
                       rsid_list= ["rs77948203", "rs9610458"], as_df = True)


| | rs77948203 | rs9610458 | unique_samples_id | unique_samples_count |
|---|---|---|---|---|
| 0 | AA | CC | [WTCCCT470057, WTCCCT489315, WTCCCT508408, WTC... | 19 |
| 1 | AA | CT | [WTCCCT474394, WTCCCT470264, WTCCCT470548, WTC... | 34 |
| 2 | AA | | [WTCCCT474448, WTCCCT508352] | 2 |
| 3 | AA | TT | [WTCCCT474560, WTCCCT469955, WTCCCT470219, WTC... | 23 |
| 4 | AG | CC | [WTCCCT466268, WTCCCT489637, WTCCCT488814, WTC... | 360 |
| 5 | AG | CT | [WTCCCT473524, WTCCCT473551, WTCCCT489609, WTC... | 949 |
| 6 | AG | | [WTCCCT489613, WTCCCT497565, WTCCCT468278, WTC... | 61 |
| 7 | AG | TT | [WTCCCT473522, WTCCCT473497, WTCCCT473514, WTC... | 575 |
| 8 | GG | CC | [WTCCCT473500, WTCCCT473552, WTCCCT473505, WTC... | 2593 |
| 9 | GG | CT | [WTCCCT473489, WTCCCT473456, WTCCCT473515, WTC... | 6126 |
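
To make the shape of this output concrete, here is a minimal plain-Python sketch of the grouping idea: samples are bucketed by their combination of genotypes, and each bucket records its sample IDs and their count. This is not how `get_geno_combination_df` is implemented; `group_by_genotype` and the sample IDs `S1` to `S4` are hypothetical, and the sort order (missing genotypes first) is an arbitrary choice.

```python
from collections import defaultdict

# Toy stand-in for the genotype dataframe: sample_id -> {rsid: genotype}.
# None marks a missing (NA) genotype. Sample IDs are made up.
geno_by_sample = {
    "S1": {"rs77948203": "GG", "rs9610458": "TT"},
    "S2": {"rs77948203": "GG", "rs9610458": "CT"},
    "S3": {"rs77948203": "GG", "rs9610458": "TT"},
    "S4": {"rs77948203": "AG", "rs9610458": None},
}


def group_by_genotype(geno_by_sample, rsid_list):
    """Bucket samples by their genotype combination across `rsid_list`."""
    groups = defaultdict(list)
    for sample_id, genotypes in geno_by_sample.items():
        key = tuple(genotypes.get(rsid) for rsid in rsid_list)
        groups[key].append(sample_id)
    # Sort combinations for stable output; None is treated as "" so it sorts first.
    ordered = sorted(groups.items(), key=lambda kv: tuple(g or "" for g in kv[0]))
    return [
        {**dict(zip(rsid_list, combo)),
         "unique_samples_id": ids,
         "unique_samples_count": len(ids)}
        for combo, ids in ordered
    ]


combos = group_by_genotype(geno_by_sample, ["rs77948203", "rs9610458"])
for row in combos:
    print(row)
```

Missing genotypes form their own groups here, mirroring the rows with an empty genotype in the output above.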


Let's add the third SNP, `"rs134490"`, and see how the breakdown changes:

In [None]:
#collapse_output closed
triple_SNPs = get_geno_combination_df(geno_each_sample_df=all_samples_geno_df, 
                       rsid_list= ["rs77948203", "rs9610458", "rs134490"])
triple_SNPs.df


| | rs77948203 | rs9610458 | rs134490 | unique_samples_id | unique_samples_count |
|---|---|---|---|---|---|
| 0 | AA | CC | CT | [WTCCCT508925, CCC2_MS656176, WTCCCT444162, WT... | 4 |
| 1 | AA | CC | | [WTCCCT490220, BLOOD293205] | 2 |
| 2 | AA | CC | TT | [WTCCCT470057, WTCCCT489315, WTCCCT508408, WTC... | 13 |
| 3 | AA | CT | CT | [WTCCCT466178, WTCCCT468665, WTCCCT471002, WTC... | 8 |
| 4 | AA | CT | | [WTCCCT474394, WTCCCT470548, WTCCCT443601] | 3 |
| 5 | AA | CT | TT | [WTCCCT470264, WTCCCT449002, WTCCCT467316, WTC... | 23 |
| 6 | AA | | CT | [WTCCCT474448] | 1 |
| 7 | AA | | | [WTCCCT508352] | 1 |
| 8 | AA | TT | CC | [BLOOD293241] | 1 |
| 9 | AA | TT | CT | [WTCCCT470219, WTCCCT466993, WTCCCT508309, WTC... | 6 |


We can compute basic information about these 3 SNPs:

In [None]:
print("how many samples have at least one low quality (`NA`) genotype?", triple_SNPs.num_samples_NA)
print("how many samples have genotypes of high quality for all 3 SNPs?",triple_SNPs.total_samples_no_NA)

how many samples have at least one low quality (`NA`) genotype? 2460
how many samples have genotypes of high quality for all 3 SNPs? 12487
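
The two counts are complementary: every sample either has at least one `NA` genotype or has none, so together they should cover all 2460 + 12487 = 14947 samples. As a rough illustration of how such counts could be derived from the grouped rows, the sketch below sums `unique_samples_count` over three rows copied from the three-SNP output above. `count_samples` is a hypothetical helper, not the code behind `num_samples_NA` or `total_samples_no_NA`.

```python
# Three rows taken from the three-SNP breakdown; None marks an NA genotype.
rows = [
    {"rs77948203": "AA", "rs9610458": "CC", "rs134490": "TT", "unique_samples_count": 13},
    {"rs77948203": "AA", "rs9610458": "CC", "rs134490": None, "unique_samples_count": 2},
    {"rs77948203": "AA", "rs9610458": None, "rs134490": None, "unique_samples_count": 1},
]
rsids = ["rs77948203", "rs9610458", "rs134490"]


def count_samples(rows, rsids):
    """Return (samples with at least one NA, samples with no NA)."""
    total = sum(r["unique_samples_count"] for r in rows)
    num_na = sum(
        r["unique_samples_count"]
        for r in rows
        if any(r[rsid] is None for rsid in rsids)
    )
    return num_na, total - num_na


num_na, num_complete = count_samples(rows, rsids)
print(num_na, num_complete)  # 3 13
```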



We can query based on genotype of each SNP:

In [None]:
triple_SNPs.query(rs77948203= "AA", rs9610458 = "CT")

| | rs77948203 | rs9610458 | rs134490 | unique_samples_id | unique_samples_count |
|---|---|---|---|---|---|
| 3 | AA | CT | CT | [WTCCCT466178, WTCCCT468665, WTCCCT471002, WTC... | 8 |
| 4 | AA | CT | | [WTCCCT474394, WTCCCT470548, WTCCCT443601] | 3 |
| 5 | AA | CT | TT | [WTCCCT470264, WTCCCT449002, WTCCCT467316, WTC... | 23 |
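
The keyword-style query can be pictured as a filter that keeps only the rows matching every `rsid=genotype` pair given, leaving unmentioned SNPs unconstrained. The following plain-Python sketch illustrates that behavior on four rows copied from the three-SNP breakdown; `query_rows` is a hypothetical stand-in, and the real `query` method may work differently.

```python
def query_rows(rows, **genotypes):
    """Keep rows whose genotype matches every rsid=genotype keyword given."""
    return [
        row for row in rows
        if all(row.get(rsid) == genotype for rsid, genotype in genotypes.items())
    ]


# Rows copied from the three-SNP breakdown; None marks an NA genotype.
rows = [
    {"rs77948203": "AA", "rs9610458": "CC", "rs134490": "TT", "unique_samples_count": 13},
    {"rs77948203": "AA", "rs9610458": "CT", "rs134490": "CT", "unique_samples_count": 8},
    {"rs77948203": "AA", "rs9610458": "CT", "rs134490": None, "unique_samples_count": 3},
    {"rs77948203": "AA", "rs9610458": "CT", "rs134490": "TT", "unique_samples_count": 23},
]

matches = query_rows(rows, rs77948203="AA", rs9610458="CT")
print([m["unique_samples_count"] for m in matches])  # [8, 3, 23]
```

Because `rs134490` is not constrained, all three of its genotype groups (including the missing one) are returned.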


---

### The problem of multiple possible input types

The genetic file can be specified in multiple ways:
- In different formats (.gen, .bgen)
- Split into multiple files for different phenotypes (case/control etc.)
- Split into one file per chromosome

The input files can also be stored:
- On local disk
- On a compute cluster
- In the cloud

The library allows any combination of these options. Let's look at the data catalog to see examples of these:

In [None]:
#collapse_output closed

#printing out the catalog
!cat conf/base/catalog_input/genetic_file.yaml

_MS_gen_file: &MS_gen_file
    type: corradin_ovp_utils.datasets.OVPDataset.OVPDataset
    file_format: genetic_file.GenFileFormat
    load_args:
        prob_n_cols: 3
        initial_cols:
            - "dashes"
            - "rsid"
            - "position"
            - "ref"
            - "alt"
        rsid_col: "rsid"
        ref_col: "ref"
        alt_col: "alt"
        pandas_args:
            sep: " "
            header: null
            
        

genetic_file:
    <<: *MS_gen_file
    file_type: OVPDataset.CaseControlFilePathSchema
    file_path:
        case:
            folder: "data/test_data/gen_file"
            full_file_name: "test_CASE_MS_chr22.gen"
        control:
            folder: "data/test_data/gen_file"
            full_file_name: "test_CONTROL_MS_chr22.gen"

            
genetic_file_common_folder:
    <<: *MS_gen_file
    file_type: OVPDataset.CaseControlFilePathSchema
    file_path:
        common_folder: "data/test_data/gen_file"
        case:
            

In [None]:
!cat conf/base/catalog_input/sample_file.yaml

sample_file:
    type: corradin_ovp_utils.datasets.OVPDataset.OVPDataset
    file_type: OVPDataset.CaseControlFilePathSchema
    file_format: sample_file.SampleFileFormat
    load_args:
        sample_id_col: "ID_2"
        cov_cols: ["sex"]
        missing_col: "missing"
        pandas_args:
            sep: " "
            skiprows: [1] #2nd line of file is extra and should be discarded
    file_path:
        common_folder: "data/test_data/sample_file"
        case:
            full_file_name: "MS_impute2_ALL_sample_out.tsv"
        control:
            full_file_name: "ALL_controls_58C_NBS_WTC2_impute2_sample_out.tsv"
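
The catalog entries above show two ways of writing `file_path`: each of `case` and `control` can carry its own `folder`, or a single `common_folder` can be shared by both. The sketch below shows how such a block could be resolved into one path per group. It is only an illustration of the idea: `resolve_file_paths` is a hypothetical helper, not the logic of `CaseControlFilePathSchema`, and letting a per-entry `folder` take precedence over `common_folder` is an assumption.

```python
from pathlib import Path


def resolve_file_paths(file_path_conf):
    """Map each group (e.g. case/control) to a full path.

    Hypothetical resolution rule: use the entry's own `folder` if present,
    otherwise fall back to the shared `common_folder`.
    """
    conf = dict(file_path_conf)  # copy so the caller's dict is not mutated
    common_folder = conf.pop("common_folder", None)
    resolved = {}
    for group, entry in conf.items():
        folder = entry.get("folder", common_folder)
        if folder is None:
            raise ValueError(f"no folder given for {group!r}")
        resolved[group] = Path(folder) / entry["full_file_name"]
    return resolved


# `file_path` blocks as they appear in the two catalog files above.
sample_conf = {
    "common_folder": "data/test_data/sample_file",
    "case": {"full_file_name": "MS_impute2_ALL_sample_out.tsv"},
    "control": {"full_file_name": "ALL_controls_58C_NBS_WTC2_impute2_sample_out.tsv"},
}
genetic_conf = {
    "case": {"folder": "data/test_data/gen_file",
             "full_file_name": "test_CASE_MS_chr22.gen"},
    "control": {"folder": "data/test_data/gen_file",
                "full_file_name": "test_CONTROL_MS_chr22.gen"},
}

for name, conf in [("sample_file", sample_conf), ("genetic_file", genetic_conf)]:
    paths = resolve_file_paths(conf)
    print(name, {group: path.as_posix() for group, path in paths.items()})
```

Both layouts resolve to the same kind of `case`/`control` mapping that `full_file_path` returned in the Quick Start.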