# 00) Overview

The final project will be similar to Assignments 3 and 4; however, we will be asking you to **plan and complete** *your own* analysis.

## Requirements:
You will need to decide *and detail* ONE OF:  
a) a novel analysis/technique to try on the Myeloid vs non-Myeloid dataset   
b) a new dataset to re-run the Assignment 3 and/or 4 code on  
c) an alternative analysis 

You do not need both a new dataset and a new method, but you are certainly allowed to do both (this counts as an alternative analysis, option c). If you would like to submit any other alternative analysis, please contact eyes and/or Prof. Lareau first (this is not necessary if you simply intend to do both a) and b)).

## Example(s):
We've hinted at batch effects over the course of the semester, but we have not corrected for them thus far.

One sample project is to choose a batch-effect correction technique, e.g. ComBat (a 2007 technique available in the R package `sva`, see https://rdrr.io/bioc/sva/man/ComBat.html; for a Python version there is pyComBat, among others), and then apply it to the Myeloid vs non-Myeloid data. These are different batches, so the input batch variable for ComBat would be a vector like `[1 1 1 1 1 ... 0 0 0 0]`, where 1 corresponds to a cell from the Myeloid batch and 0 to a cell from the non-Myeloid batch.
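As a rough illustration of the batch vector described above, the sketch below builds one label per cell and applies a simple per-batch, per-gene mean-centering. This is deliberately *not* ComBat (which also adjusts variances and uses empirical Bayes shrinkage); it is only a stand-in to show how a batch vector lines up with the rows of the data. The toy matrix, the cell counts, and the helper name `center_by_batch` are assumptions made for this example.

```python
import numpy as np

def center_by_batch(expr, batch):
    """Subtract each batch's per-gene mean, then add back the overall per-gene mean.

    expr  : (n_cells, n_genes) array of (log-)normalized expression
    batch : length-n_cells array of labels, e.g. 1 = Myeloid, 0 = non-Myeloid
    NOTE: location-only adjustment for illustration; ComBat does more than this.
    """
    expr = np.asarray(expr, dtype=float)
    batch = np.asarray(batch)
    if batch.shape[0] != expr.shape[0]:
        raise ValueError("batch must have one label per cell (row)")
    corrected = expr.copy()
    overall_mean = expr.mean(axis=0)
    for label in np.unique(batch):
        rows = batch == label
        corrected[rows] = expr[rows] - expr[rows].mean(axis=0) + overall_mean
    return corrected

# toy data (hypothetical): 5 "Myeloid" cells followed by 4 "non-Myeloid" cells, 3 genes
rng = np.random.default_rng(0)
toy_expr = rng.normal(size=(9, 3))
toy_expr[:5] += 2.0                      # simulate a constant shift in the first batch
toy_batch = np.array([1] * 5 + [0] * 4)  # same layout as [1 1 1 1 1 0 0 0 0]

toy_corrected = center_by_batch(toy_expr, toy_batch)
```

Keep in mind that in this dataset the batch label coincides with the Myeloid/non-Myeloid label, so any correction may remove real biological signal along with technical variation; that trade-off is worth discussing in your writeup.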

You will then need to compare the results (clustering, differential gene expression, etc.) to those obtained from the data without batch-effect correction and note any differences between the figures.
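Besides comparing figures by eye, one option is to put a number on how separated the two batches are before and after correction. The sketch below is one simple, made-up score (distance between batch centroids divided by the average within-batch spread), not a standard metric from any package; the function name `batch_separation` and the simulated embeddings are assumptions for illustration only.

```python
import numpy as np

def batch_separation(embedding, batch):
    """Distance between the two batch centroids divided by the average distance
    of cells to their own batch centroid (larger = batches more separated)."""
    embedding = np.asarray(embedding, dtype=float)
    batch = np.asarray(batch)
    labels = np.unique(batch)
    if labels.size != 2:
        raise ValueError("this sketch assumes exactly two batches")
    centroids = [embedding[batch == lab].mean(axis=0) for lab in labels]
    between = np.linalg.norm(centroids[0] - centroids[1])
    within = np.mean([
        np.linalg.norm(embedding[batch == lab] - c, axis=1).mean()
        for lab, c in zip(labels, centroids)
    ])
    return between / within

# simulated 2-D embeddings (hypothetical stand-ins for e.g. the first two PCs)
rng = np.random.default_rng(1)
batch = np.array([1] * 50 + [0] * 50)
uncorrected = rng.normal(size=(100, 2))
uncorrected[batch == 1] += 5.0            # simulated batch shift
corrected = uncorrected.copy()
for lab in (0, 1):                        # crude correction: center each batch
    corrected[batch == lab] -= corrected[batch == lab].mean(axis=0)

score_before = batch_separation(uncorrected, batch)
score_after = batch_separation(corrected, batch)
```

A score like this only summarizes one aspect of the data, so it should complement, not replace, the side-by-side clustering and differential expression comparisons.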

ComBat is a fairly old method, so if you choose it, also look up a more recent technique and explain what benefits you would expect from the newer method and *why* (e.g. is there some nonlinear effect? If so, what kind and where?).

For method ideas: other examples might include an imputation technique for sparse data (e.g. MAGIC https://www.krishnaswamylab.org/projects/magic) -- feel free to contact eyes or Prof. Lareau for more ideas/suggestions about methods!

For datasets: count-based matrices are ideal (preferably not too big, since large ones might be very slow or cause engineering issues) -- Tabula Muris (https://tabula-muris.ds.czbiohub.org/) or the Chan-Zuckerberg Human Cell Atlas (https://www.humancellatlas.org/portals/) are good places to start looking :)
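When evaluating a candidate dataset, a quick sanity check can tell you whether the matrix looks like raw counts and how large and sparse it is. The helper below is a minimal sketch; the name `summarize_counts`, the toy matrix, and the assumption that rows are cells and columns are genes are all illustrative choices, not part of the course code.

```python
import numpy as np

def summarize_counts(matrix):
    """Report basic properties of a cells x genes matrix."""
    matrix = np.asarray(matrix)
    n_cells, n_genes = matrix.shape
    # raw counts should be non-negative whole numbers
    looks_like_counts = bool(
        np.all(matrix >= 0) and np.allclose(matrix, np.round(matrix))
    )
    return {
        "n_cells": n_cells,
        "n_genes": n_genes,
        "fraction_zero": float(np.mean(matrix == 0)),
        "looks_like_counts": looks_like_counts,
    }

# toy matrix (hypothetical): 4 cells x 3 genes
toy_counts = np.array([[0, 3, 1],
                       [2, 0, 0],
                       [0, 0, 5],
                       [1, 0, 0]])
summary = summarize_counts(toy_counts)
log_summary = summarize_counts(np.log1p(toy_counts))  # already-transformed data
```

If `looks_like_counts` comes back `False`, the data may already be normalized or log-transformed, which affects which preprocessing steps from the earlier assignments still apply.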


## Deliverables

For full credit, you will need to turn in:

1) Your .ipynb notebook containing your analysis code.   
2) A matching pdf of the notebook with figures partially visible.   
3) A 2-3 page .txt, .doc(x), or PDF writeup of your findings, including but not limited to:   
* why you chose your method and/or dataset of interest. Note: if you are using the Myeloid data provided or the Assignment 3/4 analysis instead of doing **c) an alternative analysis** above, you **must** justify your method/dataset selection with an explanation of why it is interesting or good data.   
* a summary of your dataset (columns, cell count, etc) or method chosen. If there is a paper/pre-print for your dataset/method, please provide a citation and quick summary of the paper.  
* 2-3 key figures with labeled + appropriate axes, titles, and captions (hint: add a textbox in Word (etc) for the caption to make it easier).  
* a results section with 2+ findings and discussion as to what the findings mean / why they are important
* try to note a limitation or room for improvement in each part of your response!
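For the key figures, the sketch below shows one way to produce a scatter plot with labeled axes, a title, and a legend using matplotlib. The helper name `labeled_scatter`, the random toy embedding, and the label text are assumptions for illustration; substitute your own embedding and group labels.

```python
import numpy as np
import matplotlib.pyplot as plt

def labeled_scatter(embedding, labels, title, xlabel, ylabel):
    """Scatter plot of a 2-D embedding, one color per group, with axis labels."""
    fig, ax = plt.subplots(figsize=(5, 4))
    for lab in np.unique(labels):
        rows = labels == lab
        ax.scatter(embedding[rows, 0], embedding[rows, 1], s=8, label=str(lab))
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.set_title(title)
    ax.legend(title="Group")
    fig.tight_layout()
    return fig, ax

# toy data (hypothetical): 60 cells in a random 2-D embedding
rng = np.random.default_rng(2)
toy_embedding = rng.normal(size=(60, 2))
toy_labels = np.array(["Myeloid"] * 30 + ["non-Myeloid"] * 30)

fig, ax = labeled_scatter(
    toy_embedding, toy_labels,
    title="Toy embedding colored by group",
    xlabel="Component 1", ylabel="Component 2",
)
```

Returning `fig` makes it easy to call `fig.savefig(...)` afterwards if you want to place the same image in your writeup.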



## 0) INSTALLS

1. Using a mounted drive (or local installation) is recommended 
2. The following code will install packages for our analysis this week; installations will take >= 6 minutes and *it's best to not spend an hour of your life re-installing it* :)

See Debug tips if you have installation errors

In [None]:
## mount Google drive (optional, recommended)
## DEBUG TIP ##
## when restarting session, be sure to run first two lines (else you can comment out)
from google.colab import drive
drive.mount('/content/drive')

!mkdir -p /content/drive/MyDrive/bioe_190_290/
!mkdir -p /content/drive/MyDrive/bioe_190_290/pythonpkg/
!mkdir -p /content/drive/MyDrive/bioe_190_290/project/

## set working dir (feel free to change)
%cd /content/drive/MyDrive/bioe_190_290/project/

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/MyDrive/bioe_190_290/assignment4


In [None]:
%%capture
## NOTE: %%capture removes the outputs -- if you have an import/install error (etc)
## you should remove the line. Otherwise, let this cell run for ~10 minutes on your 
## first installation (on subsequent runs this should be instantaneous as it skips)

pkg_contents = !ls /content/drive/MyDrive/bioe_190_290/pythonpkg/
do_install = len(pkg_contents) < 10 

## this umap dependency has to be reinstalled each time (it's quick!)
!pip install pynndescent 

if do_install:
  !pip install --target=/content/drive/MyDrive/bioe_190_290/pythonpkg/ scprep phate tasklogger igraph
  !pip install --target=/content/drive/MyDrive/bioe_190_290/pythonpkg/ umap-learn magic-impute louvain

pheno_install = !ls /content/drive/MyDrive/bioe_190_290/pythonpkg | grep pheno
pheno_install = len(pheno_install) < 1

if pheno_install:
  !pip install --target=/content/drive/MyDrive/bioe_190_290/pythonpkg/ git+https://github.com/dpeerlab/phenograph.git

## this is IMPORTANT -- otherwise python won't be able to find the packages installed to Drive above
import sys
if sys.path[0] != '/content/drive/MyDrive/bioe_190_290/pythonpkg/':
  sys.path.insert(0, '/content/drive/MyDrive/bioe_190_290/pythonpkg/')
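
DEBUG TIP: if an import fails or seems to pick up the wrong copy of a package, it can help to check where Python would load it from. The snippet below is a small sketch using only the standard library; the helper name `where_is` is made up for this example, and `json` is used only so the example runs anywhere (in the notebook you might check a package such as `phenograph` instead).

```python
import importlib.util

def where_is(package_name):
    """Return the file path a package would be imported from, or None if not found."""
    spec = importlib.util.find_spec(package_name)
    return None if spec is None else spec.origin

# standard-library module, used here only as a stand-in
json_location = where_is("json")

# a name that should not exist, to show the "not found" case
missing_location = where_is("a_package_that_does_not_exist_12345")

print("json ->", json_location)
print("missing ->", missing_location)
```

If a package you installed to Drive resolves to `None` or to a path outside the `pythonpkg` folder, re-run the `sys.path.insert` lines above before importing.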

In [None]:
import pandas as pd
import numpy as np
import scprep, phate, umap ## sometimes this takes 2-3 minutes in a new session  

import sklearn
import sklearn.cluster
import sklearn.manifold
import graphtools as gt
import magic
import phenograph
import louvain
import matplotlib.pyplot as plt

%matplotlib inline 



In [None]:
## TODO -- if you need to install a pip package and would like it to be saved 
## in Google drive, use the following code:

## for github, bash, or apt-get packages this should also work
## conda would require modifications

# this checks if my_install is present in drive (to prevent re-installation )
my_install = !ls /content/drive/MyDrive/bioe_190_290/pythonpkg | grep ..my dir..
my_install = len(my_install) < 1 

if my_install:
  !pip install --target=/content/drive/MyDrive/bioe_190_290/pythonpkg/ git+https://github.com/dpeerlab/phenograph.git

import sys
if sys.path[0] != '/content/drive/MyDrive/bioe_190_290/pythonpkg/':
  sys.path.insert(0, '/content/drive/MyDrive/bioe_190_290/pythonpkg/')

import ..package name..

<a id='loading'></a>
## 1. OPTIONAL: Loading preprocessed data

In [None]:
%%bash 

## this will copy your preprocessed data from the assignment4 folder
## if you are uploading the bCourses version (etc) this should not have an error 
## (unless you are somehow in the wrong directory)
if ! test -f "data.pickle.gz"; then
    cp ../assignment4/data.pickle.gz .
fi

if ! test -f "metadata.pickle.gz"; then
    cp ../assignment4/metadata.pickle.gz .
fi

In [None]:
## load saved data (mount drive if not found or upload from bCourses)
data = pd.read_pickle('data.pickle.gz')
metadata = pd.read_pickle('metadata.pickle.gz')

## we will rerun PCA so that you can visualize the clustering process
data_pca = scprep.reduce.pca(data, n_components=50, method='dense')
data_pca.head()

In [None]:
## TODO your code here