### Summarizing Alignment Statistics

The final BAM file will have a name ending in ```*star_gene_exon_tagged.bam```. We are going to look at that file on the command line to see how the alignment information is stored.

Samtools is the package that lets us view and manipulate SAM/BAM files. Let's use ```samtools view``` to look at the BAM file. We are going to pipe the output into ```less -S```, which lets us scroll through the file with each record kept on a single line (long lines are cut off at the edge of the screen instead of wrapping; use the arrow keys to scroll sideways and ```q``` to quit). Replace ```bamfile``` in the command below with the name of your BAM file.

```bash
samtools view bamfile | less -S
```

Let's take a look at one entry in the bam file. This read mapped to the coding sequence of the gene Ppp1r14c. 

```bash
SRR1853178.360590975    0       10      3366531 255     60M     *       0       0       CCGCAAGGATCCAGCGTCTAGGCGCGCGGAGCAGGTGCGGGCCACCGTATGCGGCTGTTG    A<AAAFFFFFFF<FFAAF<.FFFFAFFF7.FFFFFFFFFA.FF.FFFFA)FFFFFFF7<F    XC:Z:AGTGGGATAGTC       MD:Z:60 GE:Z:Ppp1r14c   XF:Z:CODING     PG:Z:STAR       RG:Z:A  NH:i:1  NM:i:0  XM:Z:TCTCTTTT   UQ:i:0  AS:i:59 GS:Z:+
```

Dropseqtools stores the cell barcode in the XC tag, the UMI in the XM tag, and the gene name in the GE tag. We are going to use these tags to compute summary statistics on how our reads are distributed across cell barcodes and on the duplication rates of UMIs.
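To make the tag layout concrete, here is a minimal sketch in plain Python that pulls the tags out of the example record shown above. The helper name `parse_sam_tags` is made up for this illustration and is not part of Dropseqtools or samtools. It assumes the record is tab-separated, as in a real SAM line, and that the first 11 columns are the mandatory SAM fields.

```python
def parse_sam_tags(sam_line):
    """Return the optional TAG:TYPE:VALUE fields of one SAM record as a dict.

    Hypothetical helper for illustration only.
    """
    fields = sam_line.rstrip("\n").split("\t")
    tags = {}
    # The first 11 columns are the mandatory SAM fields; tags start after them.
    for field in fields[11:]:
        tag, type_code, value = field.split(":", 2)
        # Type "i" marks an integer tag; everything else is kept as text.
        tags[tag] = int(value) if type_code == "i" else value
    return tags


# The example record from above, rebuilt as a tab-separated line.
record = "\t".join([
    "SRR1853178.360590975", "0", "10", "3366531", "255", "60M", "*", "0", "0",
    "CCGCAAGGATCCAGCGTCTAGGCGCGCGGAGCAGGTGCGGGCCACCGTATGCGGCTGTTG",
    "A<AAAFFFFFFF<FFAAF<.FFFFAFFF7.FFFFFFFFFA.FF.FFFFA)FFFFFFF7<F",
    "XC:Z:AGTGGGATAGTC", "MD:Z:60", "GE:Z:Ppp1r14c", "XF:Z:CODING",
    "PG:Z:STAR", "RG:Z:A", "NH:i:1", "NM:i:0", "XM:Z:TCTCTTTT",
    "UQ:i:0", "AS:i:59", "GS:Z:+",
])

tags = parse_sam_tags(record)
print("cell barcode:", tags["XC"])
print("UMI:", tags["XM"])
print("gene:", tags["GE"])
```

In practice a library such as pysam handles this parsing for you; the sketch is only meant to show where the three tags live in a record.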

## Running QC scripts

**1) Softlink scripts**

```bash
mkdir -p ~/jupyter_notebooks/macosko_results/
cd ~/jupyter_notebooks/macosko_results/
ln -s /oasis/tscc/scratch/cshl_2018/shared_scripts/dropseqtools_qc.py ./
```

**2) Install a few more packages in the py3_cshl environment**

```bash
source activate py3_cshl
conda install pysam tqdm 
```

Once the installs have finished, leave the environment with:

```bash
source deactivate
source ~/.bashrc
```

## Using Jupyter to run QC scripts

**1) Load Jupyter Notebook**

Follow the instructions [here](URL_later). Make a folder in your home directory for all your analysis notebooks:

```bash
mkdir -p ~/jupyter_notebooks/
```

Then make another folder specifically for the dataset we are working with:

```bash
mkdir -p ~/jupyter_notebooks/macosko_results/
```

Once Jupyter is running, open a new notebook in the ```macosko_results``` folder using the python3 kernel that comes from our ```py3_cshl``` environment.

**2) Import required packages**

For this analysis we will only be using the dropseqtools_qc script, so import that and give it a shorthand notation (dq). 

```python
import dropseqtools_qc as dq
```

**3) Define folders and files**

We will use variables to store the full paths of commonly used folders. In the code below, ```results_dir``` is a variable that holds a string: the full path of the folder where the results are located. A string is any text between quotation marks. You can join strings with ```+```, as shown when we append the name of the BAM file to the full path. Both folder paths end in ```/```, which is what makes this simple joining work. Replace ```ucsd-trainXY``` with your own user name.

```python
results_dir = "/home/ucsd-trainXY/cshl_2018/dropseqtools_results_macosko/downsampled_100M/"
save_dir = "/home/ucsd-trainXY/scratch/projects/macosko_batch1/dropseqtools_qc/"
```

```python
bamfile = results_dir+"12_my_clean.sorted.bam"
```
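Joining paths with ```+``` only works when the folder path ends in a slash. As a small aside, the sketch below compares it with ```os.path.join``` from the Python standard library, which adds the separator when it is missing. The paths are the ones used above; nothing here touches the file system.

```python
import os

results_dir = "/home/ucsd-trainXY/cshl_2018/dropseqtools_results_macosko/downsampled_100M/"
bam_name = "12_my_clean.sorted.bam"

# With a trailing slash, both approaches give the same path.
bam_plus = results_dir + bam_name
bam_join = os.path.join(results_dir, bam_name)
print(bam_plus == bam_join)

# Without the trailing slash, "+" glues the folder and file names together...
no_slash = results_dir.rstrip("/")
broken = no_slash + bam_name
print(broken)

# ...while os.path.join inserts the missing separator.
safe = os.path.join(no_slash, bam_name)
print(os.path.basename(safe))
```

Either style is fine for this tutorial as long as the folder variables keep their trailing slash.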

**4) Count the number of reads mapping to each cell barcode**

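The ```dropseqtools_qc``` script has a function called ```get_cell_barcode_counts```. To call it, write the shorthand for the module (```dq```), then a dot, then the function name. Here we pass it the path to the BAM file and a file name inside ```save_dir```, ```cell_barcode_counts.pickle```.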

```python
cell_bc_counts = dq.get_cell_barcode_counts(bamfile, save_dir+"cell_barcode_counts.pickle")
```
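To give a feel for what counting reads per cell barcode means, here is a minimal sketch on toy data. It is not the implementation of ```dq.get_cell_barcode_counts```, whose internals are not shown in this tutorial; the function name `count_reads_per_barcode` and the barcode list are made up for illustration. The idea is simply to tally how often each XC value appears.

```python
from collections import Counter


def count_reads_per_barcode(barcodes):
    """Tally how many reads carry each cell barcode (hypothetical helper)."""
    return Counter(barcodes)


# Toy XC tag values, one per read (invented for this example).
toy_barcodes = [
    "AGTGGGATAGTC",
    "AGTGGGATAGTC",
    "TTTGCCAAGGTA",
    "AGTGGGATAGTC",
    "CCATGGACTTAG",
]

counts = count_reads_per_barcode(toy_barcodes)

# List barcodes from most to least reads.
for barcode, n_reads in counts.most_common():
    print(barcode, n_reads)
```

Barcodes with many reads are likely to be real cells, while barcodes with only a handful of reads are often empty beads or sequencing errors.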

**5) Analyze UMI and gene counts per cell barcode**

```python
cell_barcodes_to_analyze = dq.get_cell_barcodes_to_analyze(cell_bc_counts)
```

```python
umi_counts = dq.count_umis_per_barcode(bamfile, cell_barcodes_to_analyze, save_dir+"barcodes_genes_umi")
```
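The idea behind UMI counting can be sketched on toy data as follows. This is an illustration under simple assumptions, not the code inside ```dq.count_umis_per_barcode```: reads with the same cell barcode, gene, and UMI are treated as copies of one molecule, with no correction for sequencing errors in the UMI. The function name `count_unique_umis` and the read tuples are invented for the example.

```python
from collections import defaultdict


def count_unique_umis(reads):
    """Count distinct (gene, UMI) pairs per cell barcode (hypothetical helper).

    reads: iterable of (cell_barcode, gene, umi) tuples.
    """
    seen = defaultdict(set)
    for barcode, gene, umi in reads:
        seen[barcode].add((gene, umi))
    return {barcode: len(pairs) for barcode, pairs in seen.items()}


# Toy (XC, GE, XM) values, invented for this example.
toy_reads = [
    ("AGTGGGATAGTC", "Ppp1r14c", "TCTCTTTT"),
    ("AGTGGGATAGTC", "Ppp1r14c", "TCTCTTTT"),  # same molecule seen twice
    ("AGTGGGATAGTC", "Ppp1r14c", "GGACTTCA"),
    ("AGTGGGATAGTC", "Actb", "TCTCTTTT"),      # same UMI, different gene
    ("TTTGCCAAGGTA", "Actb", "AACCGGTT"),
]

umis_per_barcode = count_unique_umis(toy_reads)
print(umis_per_barcode)

# Fraction of reads that are duplicates of an already counted molecule.
n_reads = len(toy_reads)
n_molecules = sum(umis_per_barcode.values())
duplication_rate = (n_reads - n_molecules) / n_reads
print(duplication_rate)
```

A high duplication rate means that sequencing deeper would mostly re-read molecules we have already counted.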

```python
from matplotlib import pyplot as plt
import seaborn as sns
%matplotlib inline

plt.plot(umi_counts['cumulative'].values)
```
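For readers wondering how a cumulative curve like this is typically built, here is a minimal sketch on toy numbers. It assumes the curve is the running share of counts after sorting barcodes from largest to smallest; how ```dropseqtools_qc``` actually fills the ```cumulative``` column is not shown in this tutorial. The function name `cumulative_fraction` and the counts are invented for illustration.

```python
from itertools import accumulate


def cumulative_fraction(counts):
    """Running share of the total, with the largest counts first (hypothetical helper)."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    return [running / total for running in accumulate(ordered)]


# Toy per-barcode counts (invented): two large barcodes and three small ones.
toy_counts = [5, 50, 1, 40, 4]

curve = cumulative_fraction(toy_counts)
print(curve)
```

In a curve like this, the point where it flattens out (the "knee") separates the few barcodes that hold most of the counts from the long tail of barcodes with very few.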