## Genotypic recalibration with machine learning: usage example

This notebook shows how to use the machine learning methods and accompanying code described in the parent paper.

In [1]:
import pandas as pd

from preprocessing import VCF, load_suffixes, prepare_input
from recalibrator import Recalibrator

## Training & saving a model

Training uses VCF files produced by running GATK variant calling on reads from a family trio, together with a 'synthetic abortus' sample that mixes the mother's and child's reads. The code that reads and processes the dataset relies on a specific directory structure. The `VCF` class and the `prepare_input` function are all that is needed to read a VCF and convert it into an array that can be passed to a model. Once trained, a recalibrator model can be serialized and saved.

In [2]:
trios = ["ajt", "chd", "corpas", "yri"]

# Pre-processing. Uncomment during first run of the script, then
# comment to avoid re-computing

# for trio in trios:
#     data_dir = '../data/' + trio + '/'
#     df = load_suffixes(data_dir)
#     df.to_csv(trio + '.csv', index=False)
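
Rather than commenting and uncommenting the loop, the pre-processing result can be cached on disk and rebuilt only when it is missing. The sketch below is one way to do that. `cached_csv` and its `build` argument are hypothetical names, and `build` stands in for whatever writes the table to disk.

```python
import os


def cached_csv(csv_path, build):
    """Return csv_path, calling build(csv_path) first only if the file is missing.

    `build` is a hypothetical callable that writes the CSV to the given path.
    """
    if not os.path.exists(csv_path):
        build(csv_path)
    return csv_path
```

With the names used in this notebook, a call might look like `pd.read_csv(cached_csv(trio + '.csv', lambda p: load_suffixes('../data/' + trio + '/').to_csv(p, index=False)))`.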

Construct the training dataset by concatenating rows from all the synthetic abortus trios:

In [3]:
# Stack the rows of every trio's pre-processed CSV into one DataFrame
df_train = pd.concat([pd.read_csv(trio + '.csv') for trio in trios])

# Subsample: keep every 10th row
df_train = df_train[::10]

In [4]:
X_train = prepare_input(df_train, target_cols=['justchild^GT'])
y_train = df_train['justchild^GT'].values

r = Recalibrator()
r.train(X_train, y_train)
r.save("model.pickle")


Training logistic regression

Training XGB
[0]	validation_0-merror:0.047056
Will train until validation_0-merror hasn't improved in 20 rounds.
[1]	validation_0-merror:0.044297
[2]	validation_0-merror:0.044773
[3]	validation_0-merror:0.044297
[4]	validation_0-merror:0.044741
[5]	validation_0-merror:0.043409
[6]	validation_0-merror:0.043409
[7]	validation_0-merror:0.043346
[8]	validation_0-merror:0.042553
[9]	validation_0-merror:0.042236
[10]	validation_0-merror:0.040904
[11]	validation_0-merror:0.04046
[12]	validation_0-merror:0.039795
[13]	validation_0-merror:0.040175
[14]	validation_0-merror:0.039477
[15]	validation_0-merror:0.039129
[16]	validation_0-merror:0.039002
[17]	validation_0-merror:0.038494
[18]	validation_0-merror:0.038051
[19]	validation_0-merror:0.037702
[20]	validation_0-merror:0.037194
[21]	validation_0-merror:0.037353
[22]	validation_0-merror:0.037131
[23]	validation_0-merror:0.036814
[24]	validation_0-merror:0.036211
[25]	validation_0-merror:0.035768
[26]	validation_0-merror:0.035165
[27]	validati
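
The log above reports `merror`, XGBoost's multiclass classification error rate (misclassified examples divided by all examples) on the validation set, and its second line describes early stopping with a patience of 20 rounds. The following is a minimal, library-free sketch of that stopping rule, not XGBoost's own implementation. `best_round_with_patience` is a hypothetical helper.

```python
def best_round_with_patience(errors, patience=20):
    """Scan per-round validation errors and stop once no strict improvement
    has been seen for `patience` consecutive rounds.

    Returns (best_round, best_error, stopped_early).
    """
    best_round, best_error = 0, float("inf")
    for round_idx, err in enumerate(errors):
        if err < best_error:
            # New best score: remember where it happened
            best_round, best_error = round_idx, err
        elif round_idx - best_round >= patience:
            # Patience exhausted: report the best round seen so far
            return best_round, best_error, True
    return best_round, best_error, False


# First rounds of the log above: the error is still improving, so no stop yet.
log_head = [0.047056, 0.044297, 0.044773, 0.044297, 0.044741, 0.043409]
print(best_round_with_patience(log_head))  # (5, 0.043409, False)
```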

## Loading a model & recalibrating

In [5]:
r = Recalibrator()
r.load("model.pickle")

abortus = VCF("../data/ajt/abortus.frac0.5.seed151_trio.vcf")
abortus.process(0.5)

# Predicted labels
preds_lr = r.predict_lr(abortus.prepare_input())
abortus.save_predictions(preds_lr, filename="recalibrated_lr.vcf", sample_name="abortus")

For all other conversions use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  df = df.convert_objects(convert_numeric=True)
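
The file name `model.pickle` suggests that `save` and `load` rely on Python's `pickle` module, although the notebook does not show their implementation. Under that assumption, a save/load pair could look like the sketch below. `TinyModel` is a hypothetical stand-in, not the `Recalibrator` class.

```python
import pickle


class TinyModel:
    """Hypothetical stand-in for a trained model with save/load methods."""

    def __init__(self):
        self.params = {}

    def save(self, path):
        # Serialize only the learned state
        with open(path, "wb") as f:
            pickle.dump(self.params, f)

    def load(self, path):
        # Restore the learned state into an already constructed instance,
        # mirroring the construct-then-load pattern used in the cell above
        with open(path, "rb") as f:
            self.params = pickle.load(f)
```

Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.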


## Recalibrating with confidence intervals

We use the `VCF` class's built-in `process` method to process the VCF and extract the fields required by `confidence_intervals`.

In [6]:
from confidence_intervals import confidence_intervals

abortus = VCF("../data/ajt/abortus.frac0.5.seed151_trio.vcf")
abortus.process(0.5)
preds_ci = confidence_intervals(abortus.df_processed)

abortus.save_predictions(preds_ci, filename="recalibrated_ci.vcf", sample_name="abortus")

Warning output from this cell (truncated) echoes the lines that compute the interval bounds:
  lower_bound = contaminations - z*np.sqrt(contaminations*(1 - contaminations)/df_test[sample_name + '^DP'].values)
  upper_bound = contaminations + z*np.sqrt(contaminations*(1 - contaminations)/df_test[sample_name + '^DP'].values)
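
The `lower_bound` and `upper_bound` expressions echoed in the output above have the form of a normal-approximation (Wald) interval for a binomial proportion, p ± z·sqrt(p(1 - p)/n), with the read depth (`DP`) playing the role of n. The sketch below reproduces that formula for a single site. `wald_interval` and its handling of zero depth are assumptions for illustration, not the behaviour of `confidence_intervals`.

```python
import math


def wald_interval(p, depth, z=1.96):
    """Normal-approximation interval for a proportion p estimated from `depth` reads.

    z=1.96 corresponds to roughly 95% coverage. Returns (nan, nan) when the
    depth is zero or negative, where the formula would divide by zero.
    """
    if depth <= 0:
        return (math.nan, math.nan)
    half_width = z * math.sqrt(p * (1.0 - p) / depth)
    return (p - half_width, p + half_width)


low, high = wald_interval(0.5, 100)
print(round(low, 3), round(high, 3))  # 0.402 0.598
```

The interval narrows as depth grows, so sites covered by few reads get wide, uninformative bounds.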
