## Genotypic recalibration with machine learning: usage example

This notebook shows how to use the machine learning methods and accompanying code described in the parent paper.

In [1]:
import pandas as pd

import sys
sys.path.insert(0, '../python')

from preprocessing import VCF, load_suffixes, prepare_input
from recalibrator import Recalibrator

## Training & saving a model

Training uses VCF files produced by running GATK variant calling on reads from a family trio together with a 'synthetic abortus', a sample that mixes the mother's and child's reads. The code that reads and processes the dataset relies on a specific directory structure. The `VCF` class and the `prepare_input` function are all that is needed to read a VCF and convert it into an array that can be fed to a model. Once trained, a recalibrator model can be serialized and saved.

In [None]:
trios = ["ajt", "chd", "corpas", "yri"]

# Pre-processing. Uncomment during first run of the script, then
# comment to avoid re-computing

# for trio in trios:
#     data_dir = '../data/' + trio + '/'
#     df = load_suffixes(data_dir)
#     df.to_csv(trio + '.csv', index=False)
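
Instead of commenting and uncommenting the pre-processing loop by hand, the CSV can be rebuilt only when it is missing. The following is a minimal sketch of that idea; `ensure_csv` and `build_demo` are hypothetical names introduced here for illustration and are not part of the `preprocessing` module.

```python
import os
import tempfile


def ensure_csv(path, build):
    """Call build(path) only when path does not exist yet, then return path."""
    if not os.path.exists(path):
        build(path)
    return path


# Hypothetical builder that records each call and writes a tiny CSV file.
calls = []


def build_demo(path):
    calls.append(path)
    with open(path, "w") as fh:
        fh.write("col\n1\n")


demo_path = os.path.join(tempfile.mkdtemp(), "ajt.csv")
ensure_csv(demo_path, build_demo)  # first call: file is missing, so it is built
ensure_csv(demo_path, build_demo)  # second call: file exists, builder is skipped
```

In this notebook the builder could wrap the calls shown in the commented loop, for example `lambda p: load_suffixes(data_dir).to_csv(p, index=False)`.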

Construct the training dataset by concatenating rows from all the synthetic abortus trios.

In [None]:
# DataFrame.append was removed in pandas 2.0; concatenate the per-trio frames instead
df_train = pd.concat([pd.read_csv(trio + '.csv') for trio in trios])

# df_train = df_train[::10] # Train on subset of input rows

In [None]:
X_train = prepare_input(df_train, target_cols=['justchild^GT'])
y_train = df_train['justchild^GT'].values

r = Recalibrator()
r.train(X_train, y_train)
r.save("model.pickle")
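
The internals of `Recalibrator.save` and `Recalibrator.load` are not shown in this notebook. As a rough illustration of what a save/load round trip with Python's `pickle` module can look like, here is a sketch using a hypothetical `TinyModel` class; it is an assumption for demonstration, not the `Recalibrator` implementation.

```python
import os
import pickle
import tempfile


class TinyModel:
    """Hypothetical stand-in for a trained model; not the real Recalibrator."""

    def __init__(self, weights=None):
        self.weights = weights if weights is not None else []

    def save(self, path):
        # Serialize the learned state to disk in binary mode.
        with open(path, "wb") as fh:
            pickle.dump(self.weights, fh)

    def load(self, path):
        # Restore the learned state from a file written by save().
        with open(path, "rb") as fh:
            self.weights = pickle.load(fh)


path = os.path.join(tempfile.mkdtemp(), "model.pickle")
TinyModel([0.1, 0.2]).save(path)

restored = TinyModel()
restored.load(path)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.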


## Loading a model & recalibrating

In [None]:
r = Recalibrator()
r.load("model.pickle")

abortus = VCF("../data/ajt/abortus.frac0.5.seed151_trio.vcf")
abortus.process(0.5)

# Predicted labels
preds_lr = r.predict_lr(abortus.prepare_input())
abortus.save_predictions(preds_lr, filename="recalibrated_lr.vcf", child="abortus")

## Recalibrating with confidence intervals

We use the `VCF` class's built-in `process` method to process the VCF and extract the fields required by `confidence_intervals`.

In [2]:
from confidence_intervals import confidence_intervals

abortus = VCF("../data/ajt/abortus.frac0.5.seed151_trio.vcf")
abortus.process(0.5, "mother", "father", "abortus")
preds_ci = confidence_intervals(abortus.df_processed)

abortus.save_predictions(preds_ci, filename="recalibrated_ci.vcf", child="abortus")
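
One common way to bound an observed fraction is the normal-approximation (Wald) interval for a binomial proportion, with read depth as the number of trials. The sketch below is illustrative only and does not reproduce `confidence_intervals`; the name `wald_interval`, the default `z = 1.96`, the clipping to [0, 1], and the NaN result for zero depth are all assumptions made here.

```python
import math


def wald_interval(fraction, depth, z=1.96):
    """Return (lower, upper) Wald bounds for a proportion observed at a given depth."""
    if depth <= 0:
        # No reads: the interval is undefined, so avoid dividing by zero.
        return (float("nan"), float("nan"))
    half_width = z * math.sqrt(fraction * (1.0 - fraction) / depth)
    # Clip to the valid range for a proportion (an assumption of this sketch).
    lower = max(0.0, fraction - half_width)
    upper = min(1.0, fraction + half_width)
    return (lower, upper)


lower, upper = wald_interval(0.5, 100)
```

Sites with zero depth are worth handling explicitly, because dividing by a depth of zero in a vectorized version produces runtime warnings and non-finite bounds.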

