## Code visibility
Use the Show/Hide Code button on the top left to show or hide the code. It is hidden in the HTML files by default.

In [1]:
# RUN
import sys
sys.path.append("/opt/src")
import mip_functions as mip
import pickle
import json
import copy
import os
import numpy as np
import subprocess
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import pandas as pd
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from wrangler_stats import get_stats
wdir = "/opt/analysis/"
data_dir = "/opt/data/"

Classes reloading.
functions reloading


After each MIPWrangler run, three statistics files are generated. They help assess how well the sequencing run and the processing performed. The files are located in the wrangler directory. Their names may differ slightly depending on the miptools version (for example, they may contain date information and may or may not be compressed).

For the test analysis run, I bound wrangler_dir to /opt/data when I started this notebook, so the file paths below reflect that.

## Numbers for read extraction from fastq files

In [36]:
extraction_summary_file = "/opt/data/extractInfoSummary.txt.gz"
extraction = get_stats(extraction_summary_file)

In [37]:
extraction.head()

Unnamed: 0,Sample,total,unmatched,indeterminate,smallFragment,totalMatched,goodReads,failedLigationArm,failedMinLen(<30),failed_q30<0.75,containsNs,badStitch
0,D10-JJJ-47,7049,2965,0,0,4084,83,4,0,1,0,3996
2,D10-JJJ-48,6820,3162,0,0,3658,120,3,0,1,0,3534
4,D10-JJJ-20,1,1,0,0,0,0,0,0,0,0,0
6,D10-JJJ-11,216,130,0,0,86,1,0,0,2,0,83
8,D10-JJJ-37,26897,1720,0,0,25177,21407,393,0,946,6,2425


Explanation of important field names in the extractInfoSummary file (all values are read counts for that sample):
*  total: number of total reads for the sample
*  totalMatched: reads that had a proper extension arm sequence
*  failedLigationArm: reads that did not have the matching ligation arm sequence
*  badStitch: read pairs that did not stitch properly
*  goodReads: reads used downstream
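
The field definitions suggest a simple accounting that can be checked against the rows printed by `extraction.head()`. The sketch below assumes, based only on the numbers shown (not on MIPWrangler documentation), that `total` splits into `totalMatched` plus the unassigned bins, and that `totalMatched` splits into `goodReads` plus the individual failure bins. The name `extraction_head` is made up for this illustration.

```python
import pandas as pd

# Rows copied from the extraction.head() output above.
extraction_head = pd.DataFrame(
    {
        "Sample": ["D10-JJJ-47", "D10-JJJ-48", "D10-JJJ-20",
                   "D10-JJJ-11", "D10-JJJ-37"],
        "total": [7049, 6820, 1, 216, 26897],
        "unmatched": [2965, 3162, 1, 130, 1720],
        "indeterminate": [0, 0, 0, 0, 0],
        "smallFragment": [0, 0, 0, 0, 0],
        "totalMatched": [4084, 3658, 0, 86, 25177],
        "goodReads": [83, 120, 0, 1, 21407],
        "failedLigationArm": [4, 3, 0, 0, 393],
        "failedMinLen(<30)": [0, 0, 0, 0, 0],
        "failed_q30<0.75": [1, 1, 0, 2, 946],
        "containsNs": [0, 0, 0, 0, 6],
        "badStitch": [3996, 3534, 0, 83, 2425],
    }
)

# Assumption inferred from the numbers: reads not matched to a probe
# land in one of these bins.
unassigned_cols = ["unmatched", "indeterminate", "smallFragment"]
# Assumption inferred from the numbers: a matched read is either good
# or counted in exactly one failure bin.
matched_cols = ["goodReads", "failedLigationArm", "failedMinLen(<30)",
                "failed_q30<0.75", "containsNs", "badStitch"]

total_ok = extraction_head["total"].eq(
    extraction_head["totalMatched"]
    + extraction_head[unassigned_cols].sum(axis=1)
)
matched_ok = extraction_head["totalMatched"].eq(
    extraction_head[matched_cols].sum(axis=1)
)

print(total_ok.all(), matched_ok.all())
```

If both checks hold on the full table as well, each fraction reported below can be read as a share of one partition of the reads.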

Get total numbers for each field

In [38]:
extraction.sum(numeric_only=True).sort_values(ascending=False)

total                643615
totalMatched         544671
goodReads            436599
unmatched             98944
badStitch             85658
failed_q30<0.75       15235
failedLigationArm      7039
containsNs              140
failedMinLen(<30)         0
smallFragment             0
indeterminate             0
dtype: int64

Same statistic in fraction of total:

In [41]:
extraction.sum(numeric_only=True).div(extraction.sum(numeric_only=True)["total"], axis=0).sort_values(
    ascending=False)

total                1.000000
totalMatched         0.846268
goodReads            0.678354
unmatched            0.153732
badStitch            0.133089
failed_q30<0.75      0.023671
failedLigationArm    0.010937
containsNs           0.000218
failedMinLen(<30)    0.000000
smallFragment        0.000000
indeterminate        0.000000
dtype: float64

67.8% of all reads (the goodReads fraction) will be used in clustering.

## Numbers for forward and reverse read stitching

In [47]:
# Load the stitching info file
stitch_file = "/opt/data/stitchInfoByTarget.txt.gz"
sti = get_stats(stitch_file)
sti.head()

Unnamed: 0,Sample,mipTarget,mipFamily,total,r1EndsInR2,r1BeginsInR2,OverlapFail,PerfectOverlap
0,D10-JJJ-47,PF3D7-1322700_S0_Sub0_mip0_ref,PF3D7-1322700_S0_Sub0_mip0,12,0,8,4,0
1,D10-JJJ-47,PF3D7-1451200_S0_Sub0_mip14_ref,PF3D7-1451200_S0_Sub0_mip14,6,0,6,0,0
2,D10-JJJ-47,ama1_S0_Sub0_mip0_ref,ama1_S0_Sub0_mip0,93,4,89,0,0
3,D10-JJJ-47,ama1_S0_Sub0_mip1_ref,ama1_S0_Sub0_mip1,3,0,3,0,0
4,D10-JJJ-47,ama1_S0_Sub0_mip2_ref,ama1_S0_Sub0_mip2,112,0,111,1,0


The stitching info file has one line per mip per sample and 5 data columns:
  * **total**: Total number of reads for that sample/mip combination.
  * **r1EndsInR2**: Number of reads that properly stitched.
  * **r1BeginsInR2**: Indicates primer/adapter dimers or small junk sequence.
  * **OverlapFail**: No high-quality overlap was found. This could mean the sequences were low quality or there was not enough overlap, for example when the captured region is 500 bp but was sequenced with 150 bp paired-end reads. Another example is an insertion in the captured region that is large enough to keep the reads from overlapping.
  * **PerfectOverlap**: Unlikely scenario in which the two reads overlap perfectly.

Only the **r1EndsInR2** and **PerfectOverlap** reads are used for the rest of the pipeline.
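
As a minimal sketch of the rule above, the usable read count per sample/MIP row can be computed by adding the two retained categories. The rows are copied from the `sti.head()` output; the names `sti_head`, `usable` and `usable_fraction` are introduced here for illustration only.

```python
import pandas as pd

# Rows copied from the sti.head() output above.
sti_head = pd.DataFrame(
    {
        "mipTarget": [
            "PF3D7-1322700_S0_Sub0_mip0_ref",
            "PF3D7-1451200_S0_Sub0_mip14_ref",
            "ama1_S0_Sub0_mip0_ref",
            "ama1_S0_Sub0_mip1_ref",
            "ama1_S0_Sub0_mip2_ref",
        ],
        "total": [12, 6, 93, 3, 112],
        "r1EndsInR2": [0, 0, 4, 0, 0],
        "r1BeginsInR2": [8, 6, 89, 3, 111],
        "OverlapFail": [4, 0, 0, 0, 1],
        "PerfectOverlap": [0, 0, 0, 0, 0],
    }
)

categories = ["r1EndsInR2", "r1BeginsInR2", "OverlapFail", "PerfectOverlap"]

# In these rows the four categories account for every read.
categories_sum_to_total = sti_head[categories].sum(axis=1).eq(sti_head["total"])

# Reads carried forward, following the text: r1EndsInR2 plus PerfectOverlap.
sti_head["usable"] = sti_head["r1EndsInR2"] + sti_head["PerfectOverlap"]
sti_head["usable_fraction"] = sti_head["usable"] / sti_head["total"]

print(categories_sum_to_total.all())
print(sti_head[["mipTarget", "total", "usable", "usable_fraction"]])
```

For this sample almost all reads fall into r1BeginsInR2, so very few are usable, which matches the low goodReads count for D10-JJJ-47 in the extraction summary.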

Let's look at the total number of each category.

In [30]:
sti.sum(numeric_only=True).sort_values(ascending=False)

total             544671
r1EndsInR2        459013
r1BeginsInR2       44323
OverlapFail        41334
PerfectOverlap         1
dtype: int64

Out of 544671 reads, 459013 stitched fine.  
Let's look at them in terms of fraction of total.

In [32]:
sti.sum(numeric_only=True).div(sti.sum(numeric_only=True)["total"], axis=0).sort_values(ascending=False)

total             1.000000
r1EndsInR2        0.842734
r1BeginsInR2      0.081376
OverlapFail       0.075888
PerfectOverlap    0.000002
dtype: float64

84% of the reads are fine and will be used for the next steps in the pipeline.  

We can also look at the stats per MIP to get an idea of which MIPs are performing well or poorly. MIPs with elevated failure rates may warrant closer attention.

In [33]:
sti.groupby("mipTarget").sum(numeric_only=True).sort_values("r1EndsInR2", ascending=False)

mipTarget,total,r1EndsInR2,r1BeginsInR2,OverlapFail,PerfectOverlap
crt_S0_Sub0_mip1_ref,49519,47344,789,1386,0
crt_S0_Sub0_mip0_ref,33161,30921,1448,792,0
win42_S0_Sub0_mip1_ref,30226,28811,128,1287,0
dhfr-ts_S0_Sub0_mip0_ref,22783,21366,602,815,0
cytb_S0_Sub0_mip3,21872,20418,29,1425,0
atp6_S0_Sub0_mip12,18973,17938,140,895,0
crt_S0_Sub2_mip6_ref,46099,17387,23021,5691,0
csp_S0_Sub0_mip1,16531,15166,510,854,1
atp6_S0_Sub0_mip10,16342,15030,270,1042,0
win24_S0_Sub0_mip0_ref,16116,14606,898,612,0


In [35]:
sti.groupby("mipTarget").sum(numeric_only=True).div(sti.groupby("mipTarget").sum(numeric_only=True)["total"],
                                                   axis=0).sort_values("r1EndsInR2", ascending=False)

mipTarget,total,r1EndsInR2,r1BeginsInR2,OverlapFail,PerfectOverlap
win54_S0_Sub0_mip0_ref,1.0,0.961287,0.011436,0.027277,0.0
k13_S0_Sub0_mip1_ref,1.0,0.959052,0.009755,0.031193,0.0
cytb_S0_Sub0_mip2,1.0,0.957862,0.00244,0.039698,0.0
crt_S0_Sub0_mip1_ref,1.0,0.956077,0.015933,0.027989,0.0
k13_S0_Sub0_mip4_ref,1.0,0.955904,0.009064,0.035032,0.0
k13_S0_Sub0_mip5_ref,1.0,0.954848,0.010882,0.03427,0.0
mdr1_S0_Sub0_mip12,1.0,0.953522,0.012606,0.033872,0.0
win42_S0_Sub0_mip1_ref,1.0,0.953186,0.004235,0.042579,0.0
mdr1_S0_Sub0_mip1,1.0,0.951072,0.018225,0.030702,0.0
ama1_S0_Sub0_mip3_ref,1.0,0.948352,0.014086,0.037563,0.0
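
Sorting shows the best MIPs first, so poorly stitching MIPs are easy to miss. One possible way to surface them is to filter on the stitched fraction. The sketch below uses a few rows copied from the per-MIP totals table (the In [33] output). Both cutoffs are hypothetical values chosen for illustration and are not MIPTools defaults.

```python
import pandas as pd

# A few rows copied from the per-MIP totals table above.
per_mip = pd.DataFrame(
    {
        "total": [49519, 33161, 46099, 16531],
        "r1EndsInR2": [47344, 30921, 17387, 15166],
        "r1BeginsInR2": [789, 1448, 23021, 510],
        "OverlapFail": [1386, 792, 5691, 854],
    },
    index=pd.Index(
        [
            "crt_S0_Sub0_mip1_ref",
            "crt_S0_Sub0_mip0_ref",
            "crt_S0_Sub2_mip6_ref",
            "csp_S0_Sub0_mip1",
        ],
        name="mipTarget",
    ),
)

MIN_STITCHED_FRACTION = 0.8  # hypothetical cutoff, tune for your panel
MIN_TOTAL_READS = 1000       # hypothetical; skips MIPs with too few reads to judge

# Express each category as a fraction of the MIP's total reads.
fractions = per_mip.div(per_mip["total"], axis=0)

# Keep MIPs with enough reads whose stitched fraction is below the cutoff.
low_stitch = (fractions["r1EndsInR2"] < MIN_STITCHED_FRACTION) & (
    per_mip["total"] >= MIN_TOTAL_READS
)
flagged = fractions[low_stitch].sort_values("r1EndsInR2")

print(flagged)
```

With these rows only crt_S0_Sub2_mip6_ref is flagged: 23021 of its 46099 reads fall into r1BeginsInR2.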


## Extraction statistics per sample per probe

In [43]:
extraction_by_target = "/opt/data/extractInfoByTarget.txt.gz"
ext_by_target = get_stats(extraction_by_target)
ext_by_target.head()

Unnamed: 0,Sample,mipTarget,mipFamily,totalMatched,goodReads,failedLigationArm,failedMinLen(<30),failed_q30<0.75,containsNs,badStitch
0,D10-JJJ-47,PF3D7-1322700_S0_Sub0_mip0_ref,PF3D7-1322700_S0_Sub0_mip0,12,0,0,0,0,0,12
1,D10-JJJ-47,PF3D7-1451200_S0_Sub0_mip14_ref,PF3D7-1451200_S0_Sub0_mip14,6,0,0,0,0,0,6
2,D10-JJJ-47,ama1_S0_Sub0_mip0_ref,ama1_S0_Sub0_mip0,93,4,0,0,0,0,89
3,D10-JJJ-47,ama1_S0_Sub0_mip1_ref,ama1_S0_Sub0_mip1,3,0,0,0,0,0,3
4,D10-JJJ-47,ama1_S0_Sub0_mip2_ref,ama1_S0_Sub0_mip2,112,0,0,0,0,0,112


In [46]:
ext_by_target.groupby("mipTarget").sum(numeric_only=True).div(
    ext_by_target.groupby("mipTarget").sum(
        numeric_only=True)["totalMatched"], axis=0).sort_values("goodReads", ascending=False)

mipTarget,totalMatched,goodReads,failedLigationArm,failedMinLen(<30),failed_q30<0.75,containsNs,badStitch
crt_S0_Sub0_mip1_ref,1.0,0.944708,0.010743,0.0,0.000202,0.000424,0.043923
mdr1_S0_Sub0_mip12,1.0,0.923691,0.010585,0.0,0.018861,0.000385,0.046478
k13_S0_Sub0_mip4_ref,1.0,0.921852,0.014454,0.0,0.019598,0.0,0.044096
win54_S0_Sub0_mip0_ref,1.0,0.919165,0.012928,0.0,0.029123,7.1e-05,0.038713
cytb_S0_Sub0_mip2,1.0,0.916057,0.012087,0.0,0.029718,0.0,0.042138
crt_S0_Sub0_mip0_ref,1.0,0.915503,0.016345,0.0,0.000271,0.000332,0.067549
atp6_S0_Sub0_mip12,1.0,0.914247,0.011121,0.0,0.019818,0.000264,0.054551
dhfr-ts_S0_Sub0_mip2,1.0,0.906518,0.01215,0.0,0.02773,0.000143,0.053459
win42_S0_Sub0_mip1_ref,1.0,0.906372,0.013399,0.0,0.03315,0.000265,0.046814
cytb_S0_Sub0_mip3,1.0,0.8969,0.019706,0.0,0.016551,0.000366,0.066478
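
The per-target extraction file shares the `Sample` and `mipTarget` columns with the stitching file, so the two can be joined to compare them row by row. The sketch below does this for the five rows printed by `ext_by_target.head()` and `sti.head()`. That `badStitch` equals `r1BeginsInR2 + OverlapFail` is only an observation on these displayed rows, not a documented relationship, and the names `ext_head`, `sti_head` and `stitch_failures` are introduced for this example.

```python
import pandas as pd

keys = ["Sample", "mipTarget"]
targets = [
    "PF3D7-1322700_S0_Sub0_mip0_ref",
    "PF3D7-1451200_S0_Sub0_mip14_ref",
    "ama1_S0_Sub0_mip0_ref",
    "ama1_S0_Sub0_mip1_ref",
    "ama1_S0_Sub0_mip2_ref",
]

# Columns copied from the sti.head() output.
sti_head = pd.DataFrame(
    {
        "Sample": ["D10-JJJ-47"] * 5,
        "mipTarget": targets,
        "r1BeginsInR2": [8, 6, 89, 3, 111],
        "OverlapFail": [4, 0, 0, 0, 1],
    }
)

# Columns copied from the ext_by_target.head() output.
ext_head = pd.DataFrame(
    {
        "Sample": ["D10-JJJ-47"] * 5,
        "mipTarget": targets,
        "totalMatched": [12, 6, 93, 3, 112],
        "badStitch": [12, 6, 89, 3, 112],
    }
)

# One row per sample/MIP pair is expected in both tables;
# validate raises if that expectation is violated.
merged = ext_head.merge(sti_head, on=keys, how="inner", validate="one_to_one")

merged["stitch_failures"] = merged["r1BeginsInR2"] + merged["OverlapFail"]
merged["agrees"] = merged["badStitch"].eq(merged["stitch_failures"])

print(merged[keys + ["badStitch", "stitch_failures", "agrees"]])
```

Rows where the two counts disagree on the full tables would be a reasonable starting point for a closer look at how a given MIP's reads were classified.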
