# Demux and Alignment QC
This notebook is for QC-ing the demux and alignment of sequencing runs. It is laid out to run through a list of 'run_ids' and, for each run, find the fastqs, alignment files, and dropouts.

# Imports

In [259]:
import csv
import os
import utilities.s3_util as s3_util
import itertools
import pandas as pd

from collections import defaultdict, Counter


# Count Files in S3 by Sequencing Run

Make a list of all experiment IDs in a given bucket.

In [100]:
run_ids1 = ['180705_A00111_0170_BH5LTFDSXX',
'180719_A00111_0174_AH725LDSXX',
'180727_A00111_0178_AH72YWDSXX',
'180813_A00111_0188_AH7G2FDSXX',
'180822_A00111_0194_AH7HCMDSXX',
'180822_A00111_0195_BH7JT3DSXX',
'180831_A00111_0201_BH7WGCDSXX',
'180911_A00111_0207_BHC7KNDSXX',
'180911_A00111_0208_AH7VGKDSXX',
'180918_A00111_0213_BHGKTWDMXX']

## QC Runs in `run_ids1`

In [121]:
runs1 = pd.DataFrame(index=run_ids1, columns = ['nCells_Demux', 'nFastqs/cell' , 'nAligment_Files', 'nCells_Aligned', 'Alignment_Dropout', 'nIncomplete_Alignment'])

In [120]:
runs1.head(2)

Unnamed: 0,nCells_Demux,nFastqs/cell,nAligment_Files,nCells_Aligned,Alignment_Dropout
180705_A00111_0170_BH5LTFDSXX,6321,{2},31525,6305,16
180719_A00111_0174_AH725LDSXX,7030,{2},35055,7011,19


Crawl through S3 and count the files for each run to assess whether any alignment seriously failed.

Get the samples from the folder of fastq files. In the generator expression, the `for` clause (the second part) gets all of the filenames from `bucket` that start with the specified prefix (i.e. the fastqs for this run). The first part takes each filename, removes the folders (using `basename`), and then splits from the right on '_' to get the sample name.

We use a `Counter` to a) remove duplicates and b) make sure we have two fastqs for every sample. The same strategy counts alignment files and finds cells with fewer than the expected 5 alignment files.
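The parsing and counting described above can be sketched without touching S3. This is a minimal illustration, not the notebook's actual data: the keys, folder names, and file extensions below are invented, and the helper function names are hypothetical. Only the `basename`, `rsplit('_', 2)`, `split('.', 2)`, and `Counter` steps come from the notebook.

```python
import os
from collections import Counter

EXPECTED_ALIGNMENT_FILES = 5  # per-cell count stated in the text above


def sample_from_fastq(key):
    """Drop the folders, then strip the last two '_'-separated fields."""
    return os.path.basename(key).rsplit('_', 2)[0]


def sample_from_alignment(key):
    """Drop the folders, then keep everything before the first '.'."""
    return os.path.basename(key).split('.', 2)[0]


# Hypothetical keys, invented for illustration only.
fastq_keys = [
    'some/prefix/fastqs/A1_B000_S1_R1_001.fastq.gz',
    'some/prefix/fastqs/A1_B000_S1_R2_001.fastq.gz',
    'some/prefix/fastqs/A2_B000_S2_R1_001.fastq.gz',
    'some/prefix/fastqs/A2_B000_S2_R2_001.fastq.gz',
]
alignment_keys = (
    [f'some/prefix/results/A1_B000_S1.ext{i}' for i in range(5)]
    + [f'some/prefix/results/A2_B000_S2.ext{i}' for i in range(3)]
)

# One Counter entry per sample; the value is the number of files seen.
fastq_counts = Counter(sample_from_fastq(k) for k in fastq_keys)
alignment_counts = Counter(sample_from_alignment(k) for k in alignment_keys)

# Samples with fewer alignment files than expected.
incomplete = sorted(
    s for s, n in alignment_counts.items() if n < EXPECTED_ALIGNMENT_FILES
)

print(len(fastq_counts), set(fastq_counts.values()), incomplete)
```

If every sample has both reads, `set(fastq_counts.values())` collapses to `{2}`, which is what the `nFastqs/cell` column reports.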

In [106]:
for run in run_ids1:
    #Count the number of cells that demuxed in S3 using their basename
    samples = Counter(os.path.basename(fn).rsplit('_', 2)[0] 
                  for fn in s3_util.get_files(
                      bucket='czb-maca',
                      prefix=f'Plate_seq/24_month/{run}/fastqs'))
    
    #Append that number in the df at the location of run_id
    runs1.loc[run,'nCells_Demux'] = len(samples) 
    runs1.loc[run,'nFastqs/cell'] = set(samples.values()) 
    
    #Count the number of alignment files in S3 using their basename
    alignment_files = [os.path.basename(fn)
                   for fn in s3_util.get_files(
                       bucket='czb-maca',
                       prefix=f'Plate_seq/24_month/{run}/results_new_aegea')]
    
    #Append the number of alignment files, aligned cells, and the dropout in the df at the location of run_id
    runs1.loc[run,'nAligment_Files'] = len(alignment_files) 
    runs1.loc[run,'nCells_Aligned'] = len(alignment_files)/5
    runs1.loc[run,'Alignment_Dropout'] = len(samples) - len(alignment_files)/5
    
    #Count the cells with fewer than 5 alignment files (incomplete alignment)
    sample_count = Counter(fn.split('.', 2)[0] for fn in alignment_files)
    inc = [s for s in sample_count if sample_count[s] < 5]
    #Record the count in its own column, separate from 'Alignment_Dropout'
    runs1.loc[run,'nIncomplete_Alignment'] = len(inc) 
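
The loop above records how many cells dropped out, not which ones. One way to list them is a set difference between the sample names seen in the fastqs and those seen in the alignment files. This is a minimal sketch that assumes the names parsed from the two sets of filenames are directly comparable, which this notebook does not confirm. The function name and the sample names are hypothetical.

```python
def find_dropouts(demuxed_samples, aligned_samples):
    """Return sample names that were demuxed but have no alignment files."""
    return sorted(set(demuxed_samples) - set(aligned_samples))


# Hypothetical sample names, invented for illustration only.
demuxed = ['A1_B000_S1', 'A2_B000_S2', 'A3_B000_S3']
aligned = ['A1_B000_S1', 'A3_B000_S3']

dropped = find_dropouts(demuxed, aligned)
print(dropped)
```

Inside the loop, the inputs would be the keys of `samples` and `sample_count`.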

In [107]:
runs1

Unnamed: 0,nCells_Demux,nFastqs/cell,nAligment_Files,nCells_Aligned,Alignment_Dropout
180705_A00111_0170_BH5LTFDSXX,6321,{2},31525,6305,16
180719_A00111_0174_AH725LDSXX,7030,{2},35055,7011,19
180727_A00111_0178_AH72YWDSXX,7024,{2},35010,7002,22
180813_A00111_0188_AH7G2FDSXX,6993,{2},33620,6724,269
180822_A00111_0194_AH7HCMDSXX,7008,{2},34960,6992,16
180822_A00111_0195_BH7JT3DSXX,7020,{2},34515,6903,117
180831_A00111_0201_BH7WGCDSXX,7677,{2},34165,6833,844
180911_A00111_0207_BHC7KNDSXX,6321,{2},30545,6109,212
180911_A00111_0208_AH7VGKDSXX,348,{2},1740,348,0
180918_A00111_0213_BHGKTWDMXX,1774,{2},8835,1767,7
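
Raw dropout counts are hard to compare across runs of different sizes, so one option for judging whether an alignment "seriously failed" is to express dropout as a fraction of demuxed cells. The sketch below uses three rows copied from the table above. The 5% threshold is an arbitrary assumption for illustration, not a cutoff taken from this notebook.

```python
# (nCells_Demux, nCells_Aligned) for three runs, copied from the table above.
runs = {
    '180705_A00111_0170_BH5LTFDSXX': (6321, 6305),
    '180813_A00111_0188_AH7G2FDSXX': (6993, 6724),
    '180831_A00111_0201_BH7WGCDSXX': (7677, 6833),
}

DROPOUT_THRESHOLD = 0.05  # hypothetical cutoff, chosen only for this example

# Fraction of demuxed cells that have no alignment output.
rates = {
    run: (demuxed - aligned) / demuxed
    for run, (demuxed, aligned) in runs.items()
}

flagged = [run for run, rate in rates.items() if rate > DROPOUT_THRESHOLD]

for run, rate in rates.items():
    print(f'{run}: {rate:.2%}')
print('flagged:', flagged)
```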
