## COI estimation with RealMcCOIL
This notebook shows how [RealMcCOIL](https://github.com/EPPIcenter/THEREALMcCOIL) (via the [McCOILR package](https://github.com/OJWatson/McCOILR)) can be used to estimate complexity of infection (COI) from MIP data. See the linked repositories for more information on the packages and the options they accept.

In [1]:
# import modules to use
import sys
sys.path.append("/opt/src")
import numpy as np
import pandas as pd

We need a genotype table. Use the **processing-and-filtering-variant-calls.ipynb** template to create one from your variant calls. COI estimates depend on a high-confidence genotype set, so apply the filtering steps described in that notebook.

Load the genotype table from file. Genotypes called with the standard pipeline are coded as 0 for homozygous reference, 1 for mixed and 2 for homozygous alternate allele; missing genotypes are empty in the file and are read as NaN.

In [2]:
genotype_file = "genotypes.csv"
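# The first six rows of the file are column headers
# (CHROM, POS, REF, ALT, Mutation Name, Targeted); the first column holds sample IDs.
genotypes = pd.read_csv(genotype_file, header=list(range(6)),
                        index_col=0)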
genotypes.head()

CHROM,chr1,chr1,chr1,chr1,chr1,chr1,chr1,chr1,chr1,chr1,...,chr9,chr9,chr9,chr9,chr9,chr9,chr9,chr9,chr9,chr9
POS,139191,139192,140820,155939,155977,155978,308232,308385,349247,349253,...,817567,831911,850753,850754,850756,917264,95529,95530,963651,963670
REF,C,C,A,G,G,C,C,G,C,A,...,T,C,C,A,A,A,C,T,C,G
ALT,T,T,C,A,A,T,T,C,T,G,...,C,T,A,G,C,G,A,A,T,C
Mutation Name,chr1:139191:.:C:T,chr1:139192:C:T,chr1:140820:.:A:C,chr1:155939:.:G:A,chr1:155977:G:A,chr1:155978:.:C:T,chr1:308232:.:C:T,chr1:308385:.:G:C,chr1:349247:C:T,chr1:349253:.:A:G,...,chr9:817567:.:T:C,chr9:831911:.:C:T,chr9:850753:C:A,chr9:850754:A:G,chr9:850756:A:C,chr9:917264:.:A:G,chr9:95529:C:A,chr9:95530:.:T:A,chr9:963651:.:C:T,chr9:963670:G:C
Targeted,Yes,No,Yes,Yes,No,Yes,Yes,Yes,No,Yes,...,Yes,Yes,No,No,No,Yes,No,Yes,Yes,No
D10-DRC-17,1.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0
D10-DRC-11,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,,
D10-JJJ-62,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
D10-JJJ-41,1.0,1.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
D10-JJJ-56,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0
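
To make the expected file layout concrete, here is a minimal sketch that reads a tiny in-memory table with the same six header rows. The sample names and genotype values are made up for illustration; only the header row labels are taken from the table above.

```python
import io

import pandas as pd

# Hypothetical toy file: six header rows, then one row per sample.
# The empty last field of sample-B represents a missing genotype.
toy_csv = io.StringIO(
    "CHROM,chr1,chr1\n"
    "POS,139191,140820\n"
    "REF,C,A\n"
    "ALT,T,C\n"
    "Mutation Name,chr1:139191:.:C:T,chr1:140820:.:A:C\n"
    "Targeted,Yes,Yes\n"
    "sample-A,0.0,2.0\n"
    "sample-B,1.0,\n"
)

# Same call as for the real file: six header rows, sample IDs in column 0.
toy = pd.read_csv(toy_csv, header=list(range(6)), index_col=0)

print(toy.shape)            # samples x variants
print(toy.columns.nlevels)  # number of header levels per variant
print(toy.isna().sum().sum())  # number of missing genotypes
```

Each variant becomes one column with a six-level header, and each sample becomes one row.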


RealMcCOIL uses 0, 0.5, 1 and -1 for homozygous reference, mixed, homozygous alternate and missing genotypes, respectively. Below, we divide the values of our table by 2 and fill the missing values with -1.

In [3]:
mccoil_genotypes = (genotypes/2).fillna(-1)
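
As a quick sanity check of the recoding, the sketch below applies the same transformation to a small made-up table and confirms that only the four values RealMcCOIL expects remain. The toy data and the per-sample missingness summary are illustrative additions, not part of the standard pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical toy genotypes in the pipeline coding: 0, 1, 2 and NaN.
toy_genotypes = pd.DataFrame(
    {"snp1": [0.0, 1.0, 2.0], "snp2": [2.0, np.nan, 0.0]},
    index=["sample-A", "sample-B", "sample-C"],
)

# Same recoding as above: halve the values, then mark missing calls as -1.
toy_mccoil = (toy_genotypes / 2).fillna(-1)

# Verify that nothing outside the expected coding slipped through.
allowed = {0.0, 0.5, 1.0, -1.0}
unexpected = set(np.unique(toy_mccoil.to_numpy())) - allowed
assert not unexpected, f"Unexpected genotype codes: {unexpected}"

# Fraction of missing genotypes per sample, useful for spotting poor samples.
missing_per_sample = (toy_mccoil == -1).mean(axis=1)
print(missing_per_sample)
```

Running the same check on `mccoil_genotypes` before passing it to R can catch a table that was not coded as 0/1/2 in the first place.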

Load the rpy2.ipython extension to run **R** code within this notebook.

In [4]:
%load_ext rpy2.ipython

The cell below starts with %%R, so its entire content is interpreted as R code. The -i option on the first line passes the genotype table, which is a Python object, to the R code. Similarly, the -o option returns the McCOIL R object to Python for downstream use.  

RealMcCOIL also writes the specified output file (output_15.txt in this case), but we will not use it here. The cell below is only an example; use parameters appropriate for your data. These settings, 2000 iterations with a starting COI of 15 and a maximum COI of 25, should be suitable for most MIP data.

In [5]:
%%R -i mccoil_genotypes -o out_cat
library(McCOILR)
library(Rcpp)
# Starting COI for the run; also used to name the output file.
starting_coi <- 15
out_cat <- McCOIL_categorical(mccoil_genotypes, maxCOI=25, totalrun=2000,
                              burnin=500, M0=starting_coi, err_method=3,
                              path="mcoil_output",
                              output=paste0("output_", starting_coi, ".txt"))

Time = 49.00 s


Check the output. The estimates come from a stochastic sampling run, so exact values can differ between runs.

In [28]:
out_cat

Unnamed: 0,file,CorP,name,mean,median,sd,quantile0.025,quantile0.975
1,output_15.txt,C,D10-DRC-17,5.000000,5.000000,0.13302,5.000000,5.000000
2,output_15.txt,C,D10-DRC-11,6.000000,6.000000,0.30799,5.000000,6.000000
3,output_15.txt,C,D10-JJJ-62,21.000000,21.000000,1.24320,19.000000,24.000000
4,output_15.txt,C,D10-JJJ-41,18.000000,18.000000,1.02714,16.000000,20.000000
5,output_15.txt,C,D10-JJJ-56,7.000000,7.000000,0.31429,7.000000,8.000000
...,...,...,...,...,...,...,...,...
1746,output_15.txt,P,"('chr9', '95530', 'T', 'A', 'chr9:95530:.:T:A'...",0.506195,0.503596,0.09898,0.320321,0.699479
1747,output_15.txt,P,"('chr9', '963651', 'C', 'T', 'chr9:963651:.:C:...",0.242574,0.238454,0.05417,0.157902,0.366651
1748,output_15.txt,P,"('chr9', '963670', 'G', 'C', 'chr9:963670:G:C'...",0.028849,0.027717,0.00827,0.016935,0.048962
1749,output_15.txt,e1,e1,0.003506,0.003452,0.00180,0.000314,0.007023


COI calls can be selected using the CorP column, which is C for the per-sample COI rows. The remaining rows hold statistics for the variants (P rows) and the error estimate (e1).  

For the COI rows, the **mean** and **median** columns summarize the COI values sampled for each sample across the iterations, and the quantile columns give the 2.5% and 97.5% bounds.

In [6]:
coi_calls = out_cat.loc[out_cat["CorP"] == "C"]
coi_calls.shape

(41, 8)

In [7]:
coi_calls

Unnamed: 0,file,CorP,name,mean,median,sd,quantile0.025,quantile0.975
1,output_15.txt,C,D10-DRC-17,5.0,5.0,0.0447,5.0,5.0
2,output_15.txt,C,D10-DRC-11,5.0,5.0,0.46021,5.0,6.0
3,output_15.txt,C,D10-JJJ-62,24.0,24.0,1.08739,21.0,25.0
4,output_15.txt,C,D10-JJJ-41,21.0,20.0,1.39506,18.0,23.0
5,output_15.txt,C,D10-JJJ-56,7.0,7.0,0.14806,7.0,7.0
6,output_15.txt,C,D10-JJJ-49,20.0,20.0,1.29324,18.0,23.0
7,output_15.txt,C,D10-DRC-9,6.0,6.0,0.08913,6.0,6.0
8,output_15.txt,C,D10-JJJ-34,19.0,19.0,1.28601,17.0,22.0
9,output_15.txt,C,D10-JJJ-40,21.0,21.0,1.39645,19.0,24.0
10,output_15.txt,C,D10-JJJ-36,22.0,22.0,1.26082,20.0,24.0
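
A possible follow-up on the COI table is sketched below on made-up values that use the same column names as the output above. It flags samples whose upper bound reaches the maxCOI setting, where the estimate may be constrained by the ceiling, and marks samples as polyclonal when even the lower bound exceeds 1. Both rules and the names `at_ceiling` and `polyclonal` are assumptions chosen for illustration, not part of McCOILR.

```python
import pandas as pd

max_coi = 25  # assumed to match the maxCOI value passed to McCOIL_categorical

# Hypothetical COI rows with the same column names as the McCOILR output.
toy_calls = pd.DataFrame(
    {
        "name": ["sample-A", "sample-B", "sample-C"],
        "median": [5.0, 24.0, 1.0],
        "quantile0.025": [5.0, 21.0, 1.0],
        "quantile0.975": [5.0, 25.0, 1.0],
    }
)

summary = toy_calls.set_index("name")

# Upper bound at the ceiling: the estimate may be limited by maxCOI.
summary["at_ceiling"] = summary["quantile0.975"] >= max_coi

# Conservative polyclonality call: the lower bound itself is above 1.
summary["polyclonal"] = summary["quantile0.025"] > 1

print(summary[["median", "at_ceiling", "polyclonal"]])
```

Samples flagged as `at_ceiling` may be worth rerunning with a higher maximum COI or inspecting for genotype quality before the estimate is interpreted.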


Save the COI calls to a file.

In [8]:
coi_calls.to_csv("COI_calls.csv")