# Answers to Exercise 1 (Downloading data)

### Haber et al.
- Find the GEO accession number at the end of the publication: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6022292/
- Visit: [GSE92332](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE92332)
- Locate the accompanying SRA project link: [SRP095033](https://www.ncbi.nlm.nih.gov/sra?term=SRP095033)
- Search for "SRP095033 Atlas1" in the search bar since there are many experiments associated with this project
- Atlas1 contains two runs; download one of them: [SRR6254355](https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR6254355)
- 10X raw data is sometimes uploaded as a BAM file. Navigate to the "Download" tab and download: [Atlas1.bam](https://sra-download.ncbi.nlm.nih.gov/traces/sra53/SRZ/006254/SRR6254355/Atlas1.bam)
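
The links in these answers follow a small number of URL patterns, so the navigation can be sketched in code. The helper below is hypothetical (`accession_url` is not part of any NCBI tool); it only reuses the URL patterns that appear in the links above and picks one based on the accession prefix.

```python
import re

# URL patterns copied from the links used in these answers.
URL_TEMPLATES = {
    "GSE": "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc={acc}",
    "GSM": "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc={acc}",
    "SRP": "https://www.ncbi.nlm.nih.gov/sra?term={acc}",
    "SRX": "https://www.ncbi.nlm.nih.gov/sra?term={acc}",
    "SRR": "https://trace.ncbi.nlm.nih.gov/Traces/sra/?run={acc}",
}


def accession_url(accession):
    """Return the browser URL for a GEO or SRA accession (hypothetical helper)."""
    match = re.fullmatch(r"([A-Z]{3})\d+", accession)
    if match is None or match.group(1) not in URL_TEMPLATES:
        raise ValueError(f"Unrecognized accession: {accession}")
    return URL_TEMPLATES[match.group(1)].format(acc=accession)


# The three accessions visited in the Haber et al. answer, in order.
haber_urls = [accession_url(acc) for acc in ("GSE92332", "SRP095033", "SRR6254355")]
for url in haber_urls:
    print(url)
```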

### Macosko et al.
- Find the GEO accession number at the end of the publication: [Highly parallel genome-wide expression profiling of individual cells using nanoliter droplets](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4481139/)
- Visit: [GSE63473](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE63473)
- This GEO ID represents a superseries, which contains the P14 mouse retina 1 sample: [GSM1626793](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1626793)
- Locate the accompanying SRA experiment link: [SRX907219](https://www.ncbi.nlm.nih.gov/sra?term=SRX907219)
- This experiment contains one run: [SRR1853178](https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR1853178)
- Use `fastq-dump` to download the file: `fastq-dump --gzip --split-files SRR1853178`
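
If you would rather launch the download from Python, the same command can be wrapped with `subprocess`. This is a minimal sketch: the function names are made up for illustration, the flags are exactly the ones shown above, and the default is a dry run that only prints the command.

```python
import shutil
import subprocess


def fastq_dump_command(run_accession):
    """Build the fastq-dump call from the answer above as an argument list."""
    return ["fastq-dump", "--gzip", "--split-files", run_accession]


def download_run(run_accession, dry_run=True):
    """Print the command; execute it only when dry_run is False and the tool exists."""
    command = fastq_dump_command(run_accession)
    print(" ".join(command))
    if dry_run:
        return command
    if shutil.which("fastq-dump") is None:
        raise RuntimeError("fastq-dump not found on PATH (it ships with the SRA Toolkit)")
    subprocess.run(command, check=True)
    return command


command = download_run("SRR1853178")  # dry run: prints the command only
```

Call `download_run("SRR1853178", dry_run=False)` on a machine where the SRA Toolkit is available to actually fetch the reads.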

# Answers to Exercise 2 (Aggregating QC summaries)

In [1]:
import pandas as pd
import glob
import os
import re

### Find and list all files we want to compile
- Here the paths are hardcoded; you could also use `glob` to find them.
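
A minimal sketch of the `glob` alternative, assuming every sample keeps its summary at `<sample>/outs/metrics_summary.csv`. A temporary directory stands in for the course folder so the example runs anywhere; `base_dir` and `found_summaries` are illustrative names.

```python
import glob
import os
import tempfile

# Stand-in for the course directory so the sketch runs anywhere.
base_dir = tempfile.mkdtemp()
for sample in ("1k_brain_cells", "1k_heart_cells"):
    outs_dir = os.path.join(base_dir, sample, "outs")
    os.makedirs(outs_dir)
    open(os.path.join(outs_dir, "metrics_summary.csv"), "w").close()

# One wildcard for the sample directory; the rest of the layout is fixed.
pattern = os.path.join(base_dir, "*", "outs", "metrics_summary.csv")
found_summaries = sorted(glob.glob(pattern))
print([os.path.relpath(path, base_dir) for path in found_summaries])
```

On TSCC, `base_dir` would instead be `/oasis/tscc/scratch/CSHL_single_cell_2019/programming_exercise_material`.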

In [2]:
brain_cells_summary = '/oasis/tscc/scratch/CSHL_single_cell_2019/programming_exercise_material/1k_brain_cells/outs/metrics_summary.csv'
heart_cells_summary = '/oasis/tscc/scratch/CSHL_single_cell_2019/programming_exercise_material/1k_heart_cells/outs/metrics_summary.csv'


all_metrics_summaries = [brain_cells_summary, heart_cells_summary]
all_metrics_summaries

['/oasis/tscc/scratch/CSHL_single_cell_2019/programming_exercise_material/1k_brain_cells/outs/metrics_summary.csv',
 '/oasis/tscc/scratch/CSHL_single_cell_2019/programming_exercise_material/1k_heart_cells/outs/metrics_summary.csv']

### Use pd.read_csv() and pd.concat() to merge the dataframes into one

In [3]:
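frames = []
for summary in all_metrics_summaries:
    df = pd.read_csv(summary)
    df.index = [summary]  # label the single row with its source path
    frames.append(df)

# Concatenate once, after the loop, instead of growing a dataframe row by row
merged = pd.concat(frames)
merged.sort_index()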

| | Estimated Number of Cells | Mean Reads per Cell | Median Genes per Cell | Number of Reads | Valid Barcodes | Sequencing Saturation | Q30 Bases in Barcode | Q30 Bases in RNA Read | Q30 Bases in Sample Index | Q30 Bases in UMI | Reads Mapped to Genome | Reads Mapped Confidently to Genome | Reads Mapped Confidently to Intergenic Regions | Reads Mapped Confidently to Intronic Regions | Reads Mapped Confidently to Exonic Regions | Reads Mapped Confidently to Transcriptome | Reads Mapped Antisense to Gene | Fraction Reads in Cells | Total Genes Detected | Median UMI Counts per Cell |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| /oasis/tscc/scratch/CSHL_single_cell_2019/programming_exercise_material/1k_brain_cells/outs/metrics_summary.csv | 997 | 52964 | 2742 | 52805264 | 98.1% | 57.3% | 97.3% | 86.0% | 97.3% | 97.7% | 94.1% | 89.2% | 3.3% | 20.7% | 65.1% | 62.3% | 1.3% | 80.1% | 16230 | 8146 |
| /oasis/tscc/scratch/CSHL_single_cell_2019/programming_exercise_material/1k_heart_cells/outs/metrics_summary.csv | 712 | 124821 | 1469 | 88872840 | 96.9% | 86.8% | 97.3% | 93.2% | 95.3% | 97.3% | 92.0% | 85.2% | 3.3% | 12.8% | 69.0% | 66.4% | 0.9% | 90.8% | 17544 | 5145 |
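
The full paths make for a wide, hard-to-read index. One option is to label each row with the sample directory name instead. The sketch below assumes the `<sample>/outs/metrics_summary.csv` layout seen in the paths above; `sample_name` is a hypothetical helper, not something from the exercise.

```python
import os


def sample_name(summary_path):
    """Return the sample directory name, two levels above metrics_summary.csv."""
    outs_dir = os.path.dirname(summary_path)   # .../<sample>/outs
    sample_dir = os.path.dirname(outs_dir)     # .../<sample>
    return os.path.basename(sample_dir)


paths = [
    "/oasis/tscc/scratch/CSHL_single_cell_2019/programming_exercise_material/1k_brain_cells/outs/metrics_summary.csv",
    "/oasis/tscc/scratch/CSHL_single_cell_2019/programming_exercise_material/1k_heart_cells/outs/metrics_summary.csv",
]
names = [sample_name(path) for path in paths]
print(names)  # ['1k_brain_cells', '1k_heart_cells']
```

In the merging loop, this would replace the index assignment with `df.index = [sample_name(summary)]`.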


In [4]:
# Save the file
merged.to_csv(
    '~/scratch/my_qc.csv'
)
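
Note that the percentage columns (for example `Valid Barcodes`, shown as `98.1%`) are stored as strings, so they need converting before any numerical comparison between samples. A minimal sketch in plain Python, with `percent_to_float` as an illustrative helper name:

```python
def percent_to_float(value):
    """Convert a string such as '98.1%' to the float 98.1; pass other values through."""
    if isinstance(value, str) and value.endswith("%"):
        return float(value[:-1])
    return value


# Two columns from the brain-cell row of the merged summary.
brain_row = {"Estimated Number of Cells": 997, "Valid Barcodes": "98.1%"}
cleaned = {column: percent_to_float(value) for column, value in brain_row.items()}
print(cleaned)  # {'Estimated Number of Cells': 997, 'Valid Barcodes': 98.1}
```

With pandas, the equivalent for a single column would be along the lines of `merged['Valid Barcodes'].str.rstrip('%').astype(float)`.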

# Answers to Exercise 3

```bash
module load scanpy
scanpy_notebook
```

In [5]:
import scanpy as sc
matrix_path = '/oasis/tscc/scratch/CSHL_single_cell_2019/programming_exercise_material/1k_heart_cells/outs/filtered_feature_bc_matrix/'

adata = sc.read_10x_mtx(
    matrix_path,                # the directory with the `.mtx` file
    var_names='gene_ids',       # use gene IDs (not gene symbols) for the variable names (variables-axis index)
    cache=True                  # write a cache file for faster subsequent reading
)

sc.pl.highest_expr_genes(adata, n_top=20)

[Figure: the 20 most highly expressed genes, as plotted by `sc.pl.highest_expr_genes`]
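
Because `var_names='gene_ids'` was used, the genes are labeled by ID rather than by symbol. The sketch below shows one way to look up symbols for those IDs. It assumes the matrix directory contains a gzipped, tab-separated `features.tsv.gz` with gene ID, gene symbol, and feature type columns (an assumption about the Cell Ranger output layout, so check your own files). The two rows written to a temporary file are illustrative stand-ins, not data from the exercise, and `read_gene_table` is a hypothetical helper.

```python
import csv
import gzip
import os
import tempfile

# Two illustrative rows in the three-column layout assumed for features.tsv.gz.
example_rows = [
    ("ENSMUSG00000051951", "Xkr4", "Gene Expression"),
    ("ENSMUSG00000089699", "Gm1992", "Gene Expression"),
]
features_path = os.path.join(tempfile.mkdtemp(), "features.tsv.gz")
with gzip.open(features_path, "wt", newline="") as handle:
    csv.writer(handle, delimiter="\t").writerows(example_rows)


def read_gene_table(path):
    """Map gene ID -> gene symbol from a gzipped, tab-separated features file."""
    with gzip.open(path, "rt", newline="") as handle:
        return {row[0]: row[1] for row in csv.reader(handle, delimiter="\t")}


id_to_symbol = read_gene_table(features_path)
print(id_to_symbol["ENSMUSG00000051951"])  # Xkr4
```

Alternatively, passing `var_names='gene_symbols'` to `sc.read_10x_mtx` labels the genes by symbol directly.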

# Installing Singularity through Vagrant (optional; may be troublesome depending on your laptop permissions)

1. Install [Vagrant](https://www.vagrantup.com/). This can be done a few ways:
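    - Through Homebrew (mac); recent Homebrew versions use `brew install --cask` in place of the older `brew cask install`:
        - `brew install --cask virtualbox`
        - `brew install --cask vagrant`
        - `brew install --cask vagrant-manager`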
    - Downloading a binary (mac + windows)
        - use the URL 
2. Install Singularity
    - Make a directory and change into it (e.g. `mkdir singularity-vm; cd singularity-vm`).
    - Create a "Vagrantfile" in this directory with `vagrant init`. A Vagrantfile is a set of configurations and instructions for creating a virtual machine.

        ```bash
        vagrant init singularityware/singularity-2.4
        ```

    - Make another directory; through it you will be able to transfer files to and from your virtual machine.
    
        ```bash
        mkdir data
        ```
        
    - Ensure the "Vagrantfile" contains the following line (this "mounts" the directory you created inside the virtual machine):
    
        ```ruby
        config.vm.synced_folder "data", "/vagrant_data"
        ```
    
    - Download one of the images that we're using on TSCC into your mounted folder (`data`):

        [scanpy](https://external-collaborator-data.s3-us-west-1.amazonaws.com/public-software/scanpy-1.4.1.img)
    
    - Initialize the virtual machine and log in! Your virtual machine should have Singularity installed.
    
        ```bash
        vagrant up
        vagrant ssh
        ```
        
    - You should be able to see your `/vagrant_data` folder with the image inside. Change into that folder and try running Jupyter inside the image with Singularity (use IP 0.0.0.0, since this is the easiest to run with the default Vagrantfile config):
    
        ```bash
        singularity exec scanpy-1.4.1.img jupyter notebook --ip 0.0.0.0
        ```
3. Create a notebook using the 'scanpy' kernel. If this works, voila!
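
If the `vagrant_data` folder does not show up inside the virtual machine, a quick thing to check is whether the Vagrantfile really contains an uncommented `synced_folder` line. The sketch below is only an illustration: `has_synced_folder` is a hypothetical helper, and the demo Vagrantfile contents are a simplified stand-in written to a temporary directory.

```python
import os
import tempfile

EXPECTED_LINE = 'config.vm.synced_folder "data", "/vagrant_data"'


def has_synced_folder(vagrantfile_path, expected=EXPECTED_LINE):
    """Return True if an uncommented line of the Vagrantfile matches `expected`."""
    with open(vagrantfile_path) as handle:
        return any(line.strip() == expected for line in handle)


def write_demo_vagrantfile(path, extra_lines=()):
    """Write a simplified stand-in Vagrantfile for the demonstration."""
    lines = [
        'Vagrant.configure("2") do |config|',
        '  config.vm.box = "singularityware/singularity-2.4"',
        '  # config.vm.synced_folder "../data", "/vagrant_data"',
        *extra_lines,
        "end",
    ]
    with open(path, "w") as handle:
        handle.write("\n".join(lines) + "\n")


demo_path = os.path.join(tempfile.mkdtemp(), "Vagrantfile")

write_demo_vagrantfile(demo_path)
before = has_synced_folder(demo_path)   # only a commented-out line is present

write_demo_vagrantfile(demo_path, extra_lines=["  " + EXPECTED_LINE])
after = has_synced_folder(demo_path)    # the required line has been added

print(before, after)  # False True
```

Run `has_synced_folder("Vagrantfile")` from your `singularity-vm` directory to check the real file.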