# How to be a Bioinformagician part 1
- Author: Olga Botvinnik
- Date: 2018-10-09

Abstract: Many computational problems have already been solved and yet hundreds of hours are lost to re-solving them. This series provides tips and tricks that solve common pain points in bioinformatics, using AWS, reading/writing CSVs, extracting data out of file names, and lots more!

## Prerequisites
1. Everything you do should be full screen. Use [divvy](http://mizage.com/divvy/) to make managing windows easier - ask IT for a license
1. [anaconda python](https://www.anaconda.com/download/#macos) - Install the 3.x (e.g. 3.7) version if you don't have it already
1. aws account - Ask Olga or James for one
1. awscli - on the command line, `pip install awscli` after you install Anaconda Python
1. aegea - on the command line, `conda install aegea` after you install Anaconda Python
1. nbconda - on the command line, `conda install nbconda` after you install Anaconda Python
1. reflow. Click this link: https://github.com/grailbio/reflow/releases/download/reflow0.6.8/reflow0.6.8.darwin.amd64 Then on the command line:
    ```
    cd ~/Downloads
    chmod ugo+x reflow0.6.8.darwin.amd64
    sudo mv reflow0.6.8.darwin.amd64 /usr/local/bin/reflow
    ```
    Now the command `reflow` should output a lot of stuff:
    
    ```
      reflow
    The reflow command helps users run Reflow programs, inspect their
    outputs, and query their statuses.

    The command comprises a set of subcommands; the list of supported
    commands can be obtained by running

        reflow -help

    ... (more stuff) ...
    ```
    Then configure reflow, following the configuration instructions in the [Confluence entry on Reflow](https://czbiohub.atlassian.net/wiki/spaces/DS/pages/838205454/reflow):
    ```
    AWS_SDK_LOAD_CONFIG=1 reflow setup-ec2
    AWS_SDK_LOAD_CONFIG=1 reflow setup-s3-repository czbiohub-reflow-quickstart-cache
    AWS_SDK_LOAD_CONFIG=1 reflow setup-dynamodb-assoc czbiohub-reflow-quickstart
    export AWS_REGION=us-west-2
    ```
1. Claim a folder within `s3://czbiohub-cupcakes/` with today's date and your username, e.g.:
    ```
    s3://czbiohub-cupcakes/2018-10-09/olgabot/
    ```
1. [GitHub](http://github.com) username and membership to [@czbiohub](https://github.com/czbiohub/) GitHub group.

Highly recommended:
- If you haven't seen it already, follow https://github.com/czbiohub/codonboarding
- Especially 
    - install homebrew - it makes your life better for installing packages on mac
    ```
    /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
    ```
    - install Oh My ZSH: https://ohmyz.sh/
    - Install exa: https://the.exa.website/ - MUCH easier to install if you installed homebrew.
    ```
    brew install exa
    ```


## Installation

### 1. Clone the cupcakes repo to `~/code` if you haven't already


```
mkdir ~/code
cd ~/code
git clone https://github.com/czbiohub/cupcakes
```

If you've already cloned it, change to the `master` branch and get the latest code with `git pull`:

```
cd ~/code/cupcakes/
git checkout master
git pull origin master
```


### 2. Create a "bioinformagician" environment

```
conda env create --name bioinformagician --file ~/code/cupcakes/2018/olgas_bioinformagician_tricks/environment.yml
```

Then activate the environment

```
source activate bioinformagician

```

## Running the notebook


`cd` to your `~/code` directory in the terminal, then type `jupyter notebook`:

```
cd ~/code
jupyter notebook
```

In the file browser, navigate to `cupcakes/2018/olgas_bioinformagician_tricks` and open `001_how_to_be_a_bioinformagician_part01.ipynb`


In the notebook, make sure it is using the kernel named "Python [conda env: bioinformagician]"

## Now we are ready to read the data!

Run each cell below by pressing Shift+Enter


1. Helpful Jupyter notebook keystrokes: 
    - Ctrl-M-A add cell above
    - Ctrl-M-B add cell below
    - Ctrl-M-d d delete cell
    - Ctrl-M-i interrupt
    - Ctrl-M-0 restart

The ones above are the shortcuts I use the most. Go to Help > Keyboard Shortcuts to see them all.

In [1]:
# Standard convention is to import python standard libraries first, then third-party libraries after that
# See list of standard libraries here: https://docs.python.org/3/library/
# Both import lists should be alphabetically sorted

# --- Python standard library --- #
# Easily grab filenames from a folder
import glob

# Amazing library that I use almost every day for:
# - chaining lists together into mega-lists
# - "multiplying" lists against each other to get the full product of combinations
import itertools

# Read/write javascript object notation (JSON) files
import json

# Perform path manipulations
import os

# --- Third-party (non-standard Python) libraries --- #
# python dataframes. very similar to R dataframes
import pandas as pd

# Make the number of characters allowed per column super big since our filenames are long
pd.options.display.max_colwidth = 500

### Read csv of lung cancer fastqs for which a kmer signature was calculated

In [2]:
compute_samples = pd.read_csv('lung-cancer/compute/samples.csv')
print(compute_samples.shape)
compute_samples.head()

(5054, 10)


Unnamed: 0,id,read1,read2,name,output,trim_low_abundance_kmers,dna,protein,ksizes,scaled
0,A10_B000419_S34,s3://czbiohub-seqbot/fastqs/180516_A00111_0149_AH5CM2DSXX/rawdata/A10_B000419_S34/A10_B000419_S34_R1_001.fastq.gz,s3://czbiohub-seqbot/fastqs/180516_A00111_0149_AH5CM2DSXX/rawdata/A10_B000419_S34/A10_B000419_S34_R2_001.fastq.gz,A10_B000419_S34,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature,True,True,True,21273351,1000
1,A10_B000420_S82,s3://czbiohub-seqbot/fastqs/180516_A00111_0149_AH5CM2DSXX/rawdata/A10_B000420_S82/A10_B000420_S82_R1_001.fastq.gz,s3://czbiohub-seqbot/fastqs/180516_A00111_0149_AH5CM2DSXX/rawdata/A10_B000420_S82/A10_B000420_S82_R2_001.fastq.gz,A10_B000420_S82,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000420_S82.signature,True,True,True,21273351,1000
2,A10_B002073_S166,s3://czbiohub-seqbot/fastqs/180516_A00111_0149_AH5CM2DSXX/rawdata/A10_B002073_S166/A10_B002073_S166_R1_001.fastq.gz,s3://czbiohub-seqbot/fastqs/180516_A00111_0149_AH5CM2DSXX/rawdata/A10_B002073_S166/A10_B002073_S166_R2_001.fastq.gz,A10_B002073_S166,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B002073_S166.signature,True,True,True,21273351,1000
3,A10_B002078_S202,s3://czbiohub-seqbot/fastqs/180516_A00111_0149_AH5CM2DSXX/rawdata/A10_B002078_S202/A10_B002078_S202_R1_001.fastq.gz,s3://czbiohub-seqbot/fastqs/180516_A00111_0149_AH5CM2DSXX/rawdata/A10_B002078_S202/A10_B002078_S202_R2_001.fastq.gz,A10_B002078_S202,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B002078_S202.signature,True,True,True,21273351,1000
4,A10_B002095_S118,s3://czbiohub-seqbot/fastqs/180516_A00111_0149_AH5CM2DSXX/rawdata/A10_B002095_S118/A10_B002095_S118_R1_001.fastq.gz,s3://czbiohub-seqbot/fastqs/180516_A00111_0149_AH5CM2DSXX/rawdata/A10_B002095_S118/A10_B002095_S118_R2_001.fastq.gz,A10_B002095_S118,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B002095_S118.signature,True,True,True,21273351,1000


## Read the documentation of the `sourmash_search.rf` file to see what we need

In [3]:
! reflow doc reflow/sourmash_search.rf

Parameters

val signature string (required)
    S3 path to single signature file e.g.
    s3://olgabot-maca/facs/sourmash_compute_all/A1-B000610-3_56_F-1-1.sig
val database string (required)
    S3 full path to the sourmash database folder containing the database folder e.g.
    s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k21-protein/tabula-muris-k21-protein/
    Note: this folder contains tabula-muris-k21-protein.sbt.json and a bunch of
    hidden files
val database_name string (required)
    Name of the database e.g.: tabula-muris-k21-protein
val output string (required)
    CSV file to write with search results e.g
    s3://olgabot-maca/facs/sourmash_search/A1-B000610-3_56_F-1-1_tabula-muris-k21-protein.csv
val ksize int = 21
    Size of kmer to use (can only use one for index)
val sequence_to_compare string = "dna"
    What to compare, could be either "protein" or "dna"
val ignore_abundance bool = false
    Whether or not to include the abundance of k

We will be creating a *Reflow batch job* which will submit 100s of lung cancer signatures to be looked up in the tabula muris database.


For every argument, we'll need to create a column in a CSV that has that title, e.g. here we need the columns:

- signature
- database
- database_name
- output
- ksize
- sequence_to_compare
- ignore_abundance
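
One way to guard against a missing or misspelled column before submitting the batch is to compare a dataframe's columns against this list. This is only a minimal sketch: the helper name `missing_columns` and the one-row example table are hypothetical, and only the column names come from the `reflow doc` output above.

```python
import pandas as pd

# Parameter names taken from the `reflow doc` output
REQUIRED_COLUMNS = ['signature', 'database', 'database_name', 'output',
                    'ksize', 'sequence_to_compare', 'ignore_abundance']


def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from df, in order."""
    return [col for col in required if col not in df.columns]


# Hypothetical, incomplete table for illustration
example = pd.DataFrame({'signature': ['s3://bucket/a.sig'], 'ksize': [21]})
print(missing_columns(example))
```

An empty list means every parameter has a matching column.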

## Look at the AWS folder of tabula muris kmer signature databases and save the listing to a file

In [4]:
prefix = 's3://olgabot-maca/facs/sourmash_index_all'
txt = 'sourmash_databases.txt'

! aws s3 ls $prefix/ > $txt
! cat $txt

                           PRE tabula-muris-k21-dna/
                           PRE tabula-muris-k21-protein/
                           PRE tabula-muris-k27-dna/
                           PRE tabula-muris-k27-protein/
                           PRE tabula-muris-k33-dna/
                           PRE tabula-muris-k33-protein/
                           PRE tabula-muris-k51-dna/
                           PRE tabula-muris-k51-protein/


Show that we now have a file called `sourmash_databases.txt`

In [5]:
ls -lha

total 12168
drwxr-xr-x  12 olgabot  staff   408B Oct  8 17:19 ./
drwxr-xr-x  13 olgabot  staff   442B Oct  8 13:59 ../
drwxr-xr-x   4 olgabot  staff   136B Oct  8 14:12 .ipynb_checkpoints/
-rw-r--r--   1 olgabot  staff    30K Oct  8 17:19 001_how_to_be_a_bioinformagician_part01.ipynb
-rw-r--r--   1 olgabot  staff   5.9M Oct  8 14:12 002_how_to_be_a_bioinformagician_part02.ipynb
-rw-r--r--   1 olgabot  staff   5.5K Oct  5 17:31 all_my_tricks_prep_notes.ipynb
-rw-r--r--   1 olgabot  staff   6.1K Oct  8 13:35 environment.yml
-rw-r--r--   1 olgabot  staff   2.9K Oct  8 13:35 environment_no_versions.yml
-rw-r--r--   1 olgabot  staff   2.9K Oct  8 14:08 environment_no_versions_no_ng.yml
drwxr-xr-x   4 olgabot  staff   136B Oct  8 13:13 lung-cancer/
drwxr-xr-x   3 olgabot  staff   102B Oct  8 13:59 reflow/
-rw-r--r--   1 olgabot  staff   440B Oct  8 17:19 sourmash_databases.txt


Running each line one-by-one is left as an exercise to the reader :)

- To uncomment each cell, put your cursor in it, then:
    1. select-all with Command-A 
    2. uncomment with Command-/

In [6]:
# databases = pd.read_table(txt, sep=r'\s+', header=None, names=['is_prefix', 'database_name'])
# databases

In [7]:
# databases['database_name'] = databases['database_name'].str.strip('/')
# databases

In [8]:
# databases = databases.drop('is_prefix', axis=1)
# databases

In [9]:
# databases['ksize'] = databases['database_name'].str.extract(r'k(\d+)').astype(int)
# databases

In [10]:
# databases['sequence_to_compare'] = databases['database_name'].map(lambda x: x.split('-')[-1])
# databases

In [11]:
# databases['database'] = databases['database_name'].map(lambda x: f"{prefix}/{x}/{x}/")
# databases

In [12]:
# databases = databases.set_index('database_name')
# databases

### Load the table with pandas

In [13]:
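# Parse the whitespace-separated `aws s3 ls` listing: column 1 is "PRE", column 2 is the folder name
databases = pd.read_table(txt, sep=r'\s+', header=None, names=['is_prefix', 'database_name'])
# Remove the trailing slash from each folder name
databases['database_name'] = databases['database_name'].str.strip('/')
databases = databases.drop('is_prefix', axis=1)
# Pull the kmer size out of the name, e.g. "k21" -> 21 (raw string so \d is read as a regex digit)
databases['ksize'] = databases['database_name'].str.extract(r'k(\d+)').astype(int)
# The last dash-separated field is "dna" or "protein"
databases['sequence_to_compare'] = databases['database_name'].map(lambda x: x.split('-')[-1])
# The database lives in a folder nested inside a folder of the same name
databases['database'] = databases['database_name'].map(lambda x: f"{prefix}/{x}/{x}/")
databases = databases.set_index('database_name')
databases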

database_name,ksize,sequence_to_compare,database
tabula-muris-k21-dna,21,dna,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k21-dna/tabula-muris-k21-dna/
tabula-muris-k21-protein,21,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k21-protein/tabula-muris-k21-protein/
tabula-muris-k27-dna,27,dna,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k27-dna/tabula-muris-k27-dna/
tabula-muris-k27-protein,27,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k27-protein/tabula-muris-k27-protein/
tabula-muris-k33-dna,33,dna,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k33-dna/tabula-muris-k33-dna/
tabula-muris-k33-protein,33,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k33-protein/tabula-muris-k33-protein/
tabula-muris-k51-dna,51,dna,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k51-dna/tabula-muris-k51-dna/
tabula-muris-k51-protein,51,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k51-protein/tabula-muris-k51-protein/
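
If you don't have AWS access handy, the same parsing steps can be tried on a pasted copy of the listing. The sketch below assumes two lines of the listing shown earlier and uses `io.StringIO` in place of the `sourmash_databases.txt` file; `sep=r'\s+'` is used as an equivalent of `delim_whitespace=True`.

```python
import io

import pandas as pd

prefix = 's3://olgabot-maca/facs/sourmash_index_all'

# Two lines copied from the `aws s3 ls` output, standing in for the text file
listing = """\
                           PRE tabula-muris-k21-dna/
                           PRE tabula-muris-k21-protein/
"""

databases = pd.read_csv(io.StringIO(listing), sep=r'\s+', header=None,
                        names=['is_prefix', 'database_name'])
databases['database_name'] = databases['database_name'].str.strip('/')
databases = databases.drop('is_prefix', axis=1)
# expand=False returns a Series, so the result can be cast and assigned directly
databases['ksize'] = databases['database_name'].str.extract(
    r'k(\d+)', expand=False).astype(int)
databases['sequence_to_compare'] = databases['database_name'].str.split('-').str[-1]
databases['database'] = databases['database_name'].map(
    lambda x: f"{prefix}/{x}/{x}/")
databases = databases.set_index('database_name')
print(databases)
```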


### Only compare on protein databases

Since we're mapping human signatures onto a mouse database, we only want to compare protein signatures, because protein sequences are more conserved across species than nucleotide sequences.

In [14]:
protein_databases = databases.query('sequence_to_compare == "protein"')
protein_databases

database_name,ksize,sequence_to_compare,database
tabula-muris-k21-protein,21,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k21-protein/tabula-muris-k21-protein/
tabula-muris-k27-protein,27,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k27-protein/tabula-muris-k27-protein/
tabula-muris-k33-protein,33,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k33-protein/tabula-muris-k33-protein/
tabula-muris-k51-protein,51,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k51-protein/tabula-muris-k51-protein/
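
For readers new to `.query()`, the same filter can also be written with a boolean mask. Here is a small sketch on a hypothetical two-row table (not the real `databases` table) showing that both forms select the same rows:

```python
import pandas as pd

# Made-up table with the same layout as `databases`
toy = pd.DataFrame(
    {'sequence_to_compare': ['dna', 'protein']},
    index=pd.Index(['toy-k21-dna', 'toy-k21-protein'], name='database_name'))

by_query = toy.query('sequence_to_compare == "protein"')
by_mask = toy[toy['sequence_to_compare'] == 'protein']

print(by_query.equals(by_mask))
```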


Now comes the cool part!

## "Multiply" the samples x databases x ignore abundances to get every combination

We want to map each human lung sample onto ALL FOUR databases, PLUS we want to try both `ignore_abundance=True` and `ignore_abundance=False` 

Use `product` from Python's [itertools](https://docs.python.org/3/library/itertools.html) which is my favorite standard library module. (Though [collections](https://docs.python.org/3/library/collections.html) is a close second)

Remember that Python was designed as "batteries included", so if you find yourself writing a ton of nested for loops, know that many people have run into the same problem before and have figured out better ways to solve it.

In [15]:
ignore_abundances = True, False

data = list(itertools.product(compute_samples['output'], ignore_abundances,
                              protein_databases.index))

samples = pd.DataFrame(data, columns=['signature', 'ignore_abundance', 'database_name'])
print(samples.shape)
samples

(40432, 3)


Unnamed: 0,signature,ignore_abundance,database_name
0,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature,True,tabula-muris-k21-protein
1,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature,True,tabula-muris-k27-protein
2,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature,True,tabula-muris-k33-protein
3,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature,True,tabula-muris-k51-protein
4,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature,False,tabula-muris-k21-protein
5,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature,False,tabula-muris-k27-protein
6,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature,False,tabula-muris-k33-protein
7,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature,False,tabula-muris-k51-protein
8,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000420_S82.signature,True,tabula-muris-k21-protein
9,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000420_S82.signature,True,tabula-muris-k27-protein
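
A quick way to sanity-check the shape: the number of rows should equal the product of the input lengths, here 5054 samples x 2 abundance settings x 4 databases = 40432. The sketch below shows the same check on tiny made-up inputs; the signature and database names are placeholders.

```python
import itertools

import pandas as pd

# Placeholder inputs, not the real lung cancer data
signatures = ['cell_a.signature', 'cell_b.signature', 'cell_c.signature']
ignore_abundances = True, False
database_names = ['db-k21-protein', 'db-k27-protein']

# The rightmost input varies fastest, as in the table above
data = list(itertools.product(signatures, ignore_abundances, database_names))
toy_samples = pd.DataFrame(
    data, columns=['signature', 'ignore_abundance', 'database_name'])

expected_rows = len(signatures) * len(ignore_abundances) * len(database_names)
assert len(toy_samples) == expected_rows
print(toy_samples.shape)
```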


### Extract the `cell_id` from the signature filename

In [16]:
samples['cell_id'] = samples['signature'].map(lambda x: os.path.basename(x).split('.')[0])
samples.head()

Unnamed: 0,signature,ignore_abundance,database_name,cell_id
0,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature,True,tabula-muris-k21-protein,A10_B000419_S34
1,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature,True,tabula-muris-k27-protein,A10_B000419_S34
2,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature,True,tabula-muris-k33-protein,A10_B000419_S34
3,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature,True,tabula-muris-k51-protein,A10_B000419_S34
4,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature,False,tabula-muris-k21-protein,A10_B000419_S34
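
Note that `split('.')[0]` keeps only the text before the first period, which works here because these cell ids contain no periods. If yours might, one alternative is `os.path.splitext`, which removes only the final extension. A sketch with one real path from above and one hypothetical path whose id contains a period:

```python
import os

paths = [
    's3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature',
    # Hypothetical id with a period in it
    's3://some-bucket/some-folder/cell.v2.signature',
]

for path in paths:
    basename = os.path.basename(path)
    first_period = basename.split('.')[0]           # text before the FIRST period
    last_extension = os.path.splitext(basename)[0]  # drops only the LAST extension
    print(first_period, last_extension)
```

For the first path both approaches agree; for the second, only `splitext` keeps the full `cell.v2` id.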


### Add an output location

This is the full path to where we'll be storing the output csv files from the search results

In [17]:
# Change the following to the output location you chose! Should start with s3://czbiohub-cupcakes, 
# e.g. s3://czbiohub-cupcakes/2018-10-09/olgabot/ but use YOUR username (not "olgabot")!
# Include the trailing slash, since the prefix is concatenated directly with the database name.
output_prefix = ''

samples['output'] = output_prefix + samples['database_name'] + '/' + samples['cell_id'] + '.csv'
samples['output'].head()

0    tabula-muris-k21-protein/A10_B000419_S34.csv
1    tabula-muris-k27-protein/A10_B000419_S34.csv
2    tabula-muris-k33-protein/A10_B000419_S34.csv
3    tabula-muris-k51-protein/A10_B000419_S34.csv
4    tabula-muris-k21-protein/A10_B000419_S34.csv
Name: output, dtype: object
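
Because the output path is built by plain string concatenation, a prefix without a trailing slash would run straight into the database name. One defensive option, sketched below, is to normalize the prefix first. The prefix shown is the example location from the prerequisites, and the one-row table is made up for illustration.

```python
import pandas as pd

# Example prefix WITHOUT a trailing slash; replace olgabot with your username
output_prefix = 's3://czbiohub-cupcakes/2018-10-09/olgabot'

# Guarantee exactly one trailing slash
output_prefix = output_prefix.rstrip('/') + '/'

toy = pd.DataFrame({'database_name': ['tabula-muris-k21-protein'],
                    'cell_id': ['A10_B000419_S34']})
toy['output'] = output_prefix + toy['database_name'] + '/' + toy['cell_id'] + '.csv'
print(toy['output'][0])
```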

## Add the protein database information to the samples
We'll use the [`join()` method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html) to combine the `protein_databases` table with our `samples` table, so we have the database `ksize`, `sequence_to_compare` and full URL.

[Here](https://chrisalbon.com/python/data_wrangling/pandas_join_merge_dataframe/) is a nice blog post showing very clear examples of merge, join, and concatenate with pandas dataframes.

In [18]:
samples_databases = samples.join(protein_databases, on='database_name')
print(samples_databases.shape)
samples_databases.head()

(40432, 8)


Unnamed: 0,signature,ignore_abundance,database_name,cell_id,output,ksize,sequence_to_compare,database
0,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature,True,tabula-muris-k21-protein,A10_B000419_S34,tabula-muris-k21-protein/A10_B000419_S34.csv,21,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k21-protein/tabula-muris-k21-protein/
1,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature,True,tabula-muris-k27-protein,A10_B000419_S34,tabula-muris-k27-protein/A10_B000419_S34.csv,27,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k27-protein/tabula-muris-k27-protein/
2,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature,True,tabula-muris-k33-protein,A10_B000419_S34,tabula-muris-k33-protein/A10_B000419_S34.csv,33,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k33-protein/tabula-muris-k33-protein/
3,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature,True,tabula-muris-k51-protein,A10_B000419_S34,tabula-muris-k51-protein/A10_B000419_S34.csv,51,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k51-protein/tabula-muris-k51-protein/
4,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature,False,tabula-muris-k21-protein,A10_B000419_S34,tabula-muris-k21-protein/A10_B000419_S34.csv,21,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k21-protein/tabula-muris-k21-protein/
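
To see what `join(..., on=...)` is doing, here is a minimal sketch with two toy tables: the left table's `database_name` column is matched against the right table's index, and the right table's columns are appended to each matching row. The table contents are made up.

```python
import pandas as pd

left = pd.DataFrame({'cell_id': ['cell_a', 'cell_a', 'cell_b'],
                     'database_name': ['db-k21', 'db-k27', 'db-k21']})
right = pd.DataFrame({'ksize': [21, 27]},
                     index=pd.Index(['db-k21', 'db-k27'], name='database_name'))

# Rows of `left` are kept in order; `ksize` is looked up by database_name
joined = left.join(right, on='database_name')
print(joined)
```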


## Create a unique id for each row

Reflow will create a log file for every job, and if you have duplicate ids, then those logs will get overwritten, and it won't treat those jobs as unique. So you want to have UNIQUE ids for each row.

We use the [`apply()` method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.apply.html) to run the same function on every *row* of the dataframe. If we hadn't specified `axis=1`, it would have tried to apply the function to every *column* (`axis=0`). I usually forget which one is which so I try the function on one side and if it doesn't work I know it's the opposite.

In [19]:
samples_databases['id'] = samples_databases.apply(lambda x: 
                              '{cell_id}_ignore-abundance={ignore_abundance}_{database_name}'.format(**x), 
                              axis=1)
samples_databases.head()

Unnamed: 0,signature,ignore_abundance,database_name,cell_id,output,ksize,sequence_to_compare,database,id
0,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature,True,tabula-muris-k21-protein,A10_B000419_S34,tabula-muris-k21-protein/A10_B000419_S34.csv,21,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k21-protein/tabula-muris-k21-protein/,A10_B000419_S34_ignore-abundance=True_tabula-muris-k21-protein
1,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature,True,tabula-muris-k27-protein,A10_B000419_S34,tabula-muris-k27-protein/A10_B000419_S34.csv,27,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k27-protein/tabula-muris-k27-protein/,A10_B000419_S34_ignore-abundance=True_tabula-muris-k27-protein
2,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature,True,tabula-muris-k33-protein,A10_B000419_S34,tabula-muris-k33-protein/A10_B000419_S34.csv,33,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k33-protein/tabula-muris-k33-protein/,A10_B000419_S34_ignore-abundance=True_tabula-muris-k33-protein
3,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature,True,tabula-muris-k51-protein,A10_B000419_S34,tabula-muris-k51-protein/A10_B000419_S34.csv,51,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k51-protein/tabula-muris-k51-protein/,A10_B000419_S34_ignore-abundance=True_tabula-muris-k51-protein
4,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature,False,tabula-muris-k21-protein,A10_B000419_S34,tabula-muris-k21-protein/A10_B000419_S34.csv,21,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k21-protein/tabula-muris-k21-protein/,A10_B000419_S34_ignore-abundance=False_tabula-muris-k21-protein
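
One way to remember the `axis` argument: with `axis=1` the function receives one *row* at a time (a Series indexed by column names), and with `axis=0` it receives one *column* at a time. A small sketch on a made-up table:

```python
import pandas as pd

toy = pd.DataFrame({'cell_id': ['cell_a', 'cell_b'],
                    'database_name': ['db-k21', 'db-k27']})

# axis=1: x is a row, so {cell_id} and {database_name} are single values
row_ids = toy.apply(lambda x: '{cell_id}_{database_name}'.format(**x), axis=1)
print(list(row_ids))

# axis=0: x is a whole column, so we can join all of its values
column_summary = toy.apply(lambda x: '|'.join(x), axis=0)
print(dict(column_summary))
```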


## Subset to a few dozen cells
We don't need to see the result of ALL human cells, which is ~5,000. We can just look at the output of a few to get a feel for how well it is working. Below are the sample ids that I chose, and we'll subset using the [`.query()` method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html) and access the python variable with `@chosen_ids`.

In [20]:
chosen_ids = ['C14_B003528_S62',
 'D1_B003125_S25',
 'E19_B003570_S199',
 'F21_B000420_S213',
 'G10_B003586_S142',
 'G4_B003570_S232',
 'G9_B003511_S57',
 'H7_B003588_S211',
 'I22_B002095_S22',
 'I3_B003573_S63',
 'J11_B003573_S95',
 'J8_B003528_S224',
 'K7_B002073_S103',
 'L16_B003588_S16',
 'L5_B003588_S5',
 'M1_B000420_S61',
 'M23_B002097_S251',
 'N15_B000420_S99',
 'O3_B003573_S207',
 'P14_B000420_S146',
 'P2_B003125_S14']

samples_subset = samples_databases.query('cell_id in @chosen_ids')
print(samples_subset.shape)
samples_subset.head()

(168, 9)


Unnamed: 0,signature,ignore_abundance,database_name,cell_id,output,ksize,sequence_to_compare,database,id
5568,s3://olgabot-maca/lung_cancer/sourmash_v4/C14_B003528_S62.signature,True,tabula-muris-k21-protein,C14_B003528_S62,tabula-muris-k21-protein/C14_B003528_S62.csv,21,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k21-protein/tabula-muris-k21-protein/,C14_B003528_S62_ignore-abundance=True_tabula-muris-k21-protein
5569,s3://olgabot-maca/lung_cancer/sourmash_v4/C14_B003528_S62.signature,True,tabula-muris-k27-protein,C14_B003528_S62,tabula-muris-k27-protein/C14_B003528_S62.csv,27,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k27-protein/tabula-muris-k27-protein/,C14_B003528_S62_ignore-abundance=True_tabula-muris-k27-protein
5570,s3://olgabot-maca/lung_cancer/sourmash_v4/C14_B003528_S62.signature,True,tabula-muris-k33-protein,C14_B003528_S62,tabula-muris-k33-protein/C14_B003528_S62.csv,33,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k33-protein/tabula-muris-k33-protein/,C14_B003528_S62_ignore-abundance=True_tabula-muris-k33-protein
5571,s3://olgabot-maca/lung_cancer/sourmash_v4/C14_B003528_S62.signature,True,tabula-muris-k51-protein,C14_B003528_S62,tabula-muris-k51-protein/C14_B003528_S62.csv,51,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k51-protein/tabula-muris-k51-protein/,C14_B003528_S62_ignore-abundance=True_tabula-muris-k51-protein
5572,s3://olgabot-maca/lung_cancer/sourmash_v4/C14_B003528_S62.signature,False,tabula-muris-k21-protein,C14_B003528_S62,tabula-muris-k21-protein/C14_B003528_S62.csv,21,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k21-protein/tabula-muris-k21-protein/,C14_B003528_S62_ignore-abundance=False_tabula-muris-k21-protein
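
The `@` prefix tells `.query()` to look up a Python variable in the surrounding scope rather than a column name. A sketch with made-up ids, alongside the equivalent `.isin()` call:

```python
import pandas as pd

toy = pd.DataFrame({'cell_id': ['cell_a', 'cell_b', 'cell_c', 'cell_a']})
chosen = ['cell_a', 'cell_c']

# @chosen refers to the Python list defined above
subset = toy.query('cell_id in @chosen')
same_subset = toy[toy['cell_id'].isin(chosen)]
print(subset.index.tolist())
```

Note that the original row numbers are kept in the index, which is why the subset above starts at row 5568 rather than 0.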


## Remove the sample ID column and set `id` as the index

In [21]:
samples_no_cell_id = samples_subset.drop(columns=['cell_id'])
samples_no_cell_id = samples_no_cell_id.set_index('id')
print(samples_no_cell_id.shape)
samples_no_cell_id.head()

(168, 7)


id,signature,ignore_abundance,database_name,output,ksize,sequence_to_compare,database
C14_B003528_S62_ignore-abundance=True_tabula-muris-k21-protein,s3://olgabot-maca/lung_cancer/sourmash_v4/C14_B003528_S62.signature,True,tabula-muris-k21-protein,tabula-muris-k21-protein/C14_B003528_S62.csv,21,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k21-protein/tabula-muris-k21-protein/
C14_B003528_S62_ignore-abundance=True_tabula-muris-k27-protein,s3://olgabot-maca/lung_cancer/sourmash_v4/C14_B003528_S62.signature,True,tabula-muris-k27-protein,tabula-muris-k27-protein/C14_B003528_S62.csv,27,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k27-protein/tabula-muris-k27-protein/
C14_B003528_S62_ignore-abundance=True_tabula-muris-k33-protein,s3://olgabot-maca/lung_cancer/sourmash_v4/C14_B003528_S62.signature,True,tabula-muris-k33-protein,tabula-muris-k33-protein/C14_B003528_S62.csv,33,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k33-protein/tabula-muris-k33-protein/
C14_B003528_S62_ignore-abundance=True_tabula-muris-k51-protein,s3://olgabot-maca/lung_cancer/sourmash_v4/C14_B003528_S62.signature,True,tabula-muris-k51-protein,tabula-muris-k51-protein/C14_B003528_S62.csv,51,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k51-protein/tabula-muris-k51-protein/
C14_B003528_S62_ignore-abundance=False_tabula-muris-k21-protein,s3://olgabot-maca/lung_cancer/sourmash_v4/C14_B003528_S62.signature,False,tabula-muris-k21-protein,tabula-muris-k21-protein/C14_B003528_S62.csv,21,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k21-protein/tabula-muris-k21-protein/


## Create a folder to save reflow workflow to

What is our current directory?

In [22]:
pwd

'/Users/olgabot/code/cupcakes/2018/olgas_bioinformagician_tricks'

Make the folder

In [23]:
folder = 'lung-cancer/search_protein_databases'
! mkdir $folder

mkdir: lung-cancer/search_protein_databases: File exists


Double-check that the folder exists

In [24]:
ls -lha

total 12168
drwxr-xr-x  12 olgabot  staff   408B Oct  8 17:19 ./
drwxr-xr-x  13 olgabot  staff   442B Oct  8 13:59 ../
drwxr-xr-x   4 olgabot  staff   136B Oct  8 14:12 .ipynb_checkpoints/
-rw-r--r--   1 olgabot  staff    30K Oct  8 17:19 001_how_to_be_a_bioinformagician_part01.ipynb
-rw-r--r--   1 olgabot  staff   5.9M Oct  8 14:12 002_how_to_be_a_bioinformagician_part02.ipynb
-rw-r--r--   1 olgabot  staff   5.5K Oct  5 17:31 all_my_tricks_prep_notes.ipynb
-rw-r--r--   1 olgabot  staff   6.1K Oct  8 13:35 environment.yml
-rw-r--r--   1 olgabot  staff   2.9K Oct  8 13:35 environment_no_versions.yml
-rw-r--r--   1 olgabot  staff   2.9K Oct  8 14:08 environment_no_versions_no_ng.yml
drwxr-xr-x   4 olgabot  staff   136B Oct  8 13:13 lung-cancer/
drwxr-xr-x   3 olgabot  staff   102B Oct  8 13:59 reflow/
-rw-r--r--   1 olgabot  staff   440B Oct  8 17:19 sourmash_databases.txt


### Write reflow batch `config.json` and `samples.csv` file

We'll run the program `sourmash_search.rf` which is in the `reflow` folder here. I recommend keeping your reflow scripts separate from their batch folders as you may use the same script across multiple folders.

In [26]:
config = {
    # Since the folder we're writing to is relative to here as "lung-cancer/search_protein_databases"
    # but the reflow folder is "reflow/" then we need to go up two directories with "../.."
    "program": "../../reflow/sourmash_search.rf",
    "runs_file": "samples.csv"
}

# Make sure the index (the ids!) are unique
assert samples_no_cell_id.index.is_unique

samples_no_cell_id.to_csv(f'{folder}/samples.csv', index=True)


with open(f'{folder}/config.json', 'w') as f:
    json.dump(config, f)
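
To convince yourself that `../..` is the right number of levels, you can resolve the program path relative to the batch folder. This is a minimal sketch using only `posixpath` string manipulation (chosen so the result uses forward slashes on any platform); no files need to exist.

```python
import posixpath

folder = 'lung-cancer/search_protein_databases'
program = '../../reflow/sourmash_search.rf'

# Resolve the program path as seen from inside the batch folder
resolved = posixpath.normpath(posixpath.join(folder, program))
print(resolved)
```

The result should be the path of the script relative to the notebook's directory.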

### Look at the contents of the folder

In [27]:
ls -lha $folder

total 352
drwxr-xr-x  7 olgabot  staff   238B Oct  8 17:17 ./
drwxr-xr-x  4 olgabot  staff   136B Oct  8 13:13 ../
-rw-r--r--  1 olgabot  staff    74B Oct  8 17:20 config.json
-rw-r--r--  1 olgabot  staff    51K Oct  8 17:20 samples.csv
-rw-------  1 olgabot  staff     3B Oct  8 17:17 state.bak
-rw-------  1 olgabot  staff   114K Oct  8 17:17 state.json
-rwxr-xr-x  1 olgabot  staff     0B Oct  8 17:17 state.lock*


#### Count the number of lines in `samples.csv` to make sure it matches our table

In [29]:
! wc -l $folder/samples.csv

     169 lung-cancer/search_protein_databases/samples.csv


Great, it's 168 rows + 1 header row = 169 rows!
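
The same check can be done without leaving Python. This sketch writes a tiny made-up table to a temporary folder and confirms that the file has one line per row plus the header; the job ids and values are placeholders.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({'ksize': [21, 27]},
                   index=pd.Index(['job_1', 'job_2'], name='id'))
assert toy.index.is_unique

with tempfile.TemporaryDirectory() as folder:
    csv = os.path.join(folder, 'samples.csv')
    toy.to_csv(csv, index=True)
    # Count lines the same way `wc -l` would
    with open(csv) as f:
        n_lines = sum(1 for line in f)

print(n_lines == len(toy) + 1)
```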

## To save your changes for the future, create a branch and commit your changes

Since you are saving the output to YOUR own bucket, you'll want to make sure you have the code that made these changes, and the best way to do that is to use `git`.

Create a branch named like this: `yourgithubusername/bioinformagician-part1`, e.g.:

```
git checkout -b olgabot/bioinformagician-part1
```

Add all the files in the `olgas_bioinformagician_tricks` folder:

```
cd ~/code/cupcakes/2018/
git add -A olgas_bioinformagician_tricks
```

Write a message about what files you're committing and why:

```
git commit -m "Use a s3 bucket I can write to"
```

Try to push the changes:

```
git push
```

Then you'll get a "fatal error" (but really nobody died so why the freakout?) that looks like this:


```
fatal: The current branch olgabot/enable_quality_filtering has no upstream branch.
To push the current branch and set the remote as upstream, use

    git push --set-upstream origin olgabot/enable_quality_filtering
```


Copy/paste THEIR `git push` command which will properly link up your own branch name with the remote branch name on GitHub. (This is what I always do... I'm too lazy to write out the full command myself)

[Check out your branch in the whole tree here!](https://github.com/czbiohub/cupcakes/network)

## Running reflow jobs

Finally, we're ready to run our 168 jobs! Reflow very nicely manages the jobs so we can "set it and forget it" while they run, and not have to worry about starting and stopping the instances.


To run the reflow batch, go to the terminal and navigate to the `cupcakes/2018/olgas_bioinformagician_tricks/lung-cancer/search_protein_databases` directory. Once you're there, run this command:

```
reflow runbatch
```

You should see an output that looks like this:

```
  allocate {mem:24.0GiB cpu:1 disk:1.0GiB}[168]:  provisioning new instance                          1s
  r4.xlarge:                                      awaiting fulfillment of spot request sir-18qg9fkp  1s
  r4.xlarge:                                      awaiting fulfillment of spot request sir-fe6881cm  1s
  r4.xlarge:                                      awaiting fulfillment of spot request sir-3q5g81vp  1s
  r4.xlarge:                                      awaiting fulfillment of spot request sir-g51882ap  1s
  r4.xlarge:                                      awaiting fulfillment of spot request sir-qff8azcn  1s
batch /Users/olgabot/code/cupcakes/2018/olgas_bioinformagician_tricks/lung-cancer/search_protein_databases: remaining: 168
  2b8d606f:  waiting  1s
  ce90ab78:  waiting  1s
  b301bf64:  waiting  1s
  dbdb984c:  waiting  1s
  c36afd43:  waiting  1s
  b94b9ddd:  waiting  1s
  3a866e4d:  waiting  1s
  3b6ff8ab:  waiting  1s
  52d5ecff:  waiting  1s
  3838f6a3:  waiting  1s
  9030d940:  waiting  1s
  b352b3f3:  waiting  1s
  a60c9a93:  waiting  1s
  c9e1930f:  waiting  1s
  6f47939f:  waiting  1s
  f842c52f:  waiting  1s
  a054367b:  waiting  1s
```

To stop the batch, press Control+C.