# How to be a Bioinformagician part 1
- Author: Olga Botvinnik
- Date: 2018-10-09

Abstract: Many computational problems have already been solved, and yet hundreds of hours are lost to re-solving them. This series provides tips and tricks that solve common pain points in bioinformatics: using AWS, reading and writing CSVs, and dealing with large CSVs.

## Prerequisites
1. [divvy](http://mizage.com/divvy/) - ask IT for a license
1. [anaconda python](https://www.anaconda.com/download/#macos) - Install the 3.x (e.g. 3.7) version if you don't have it already
1. awscli - on the command line, `pip install awscli` after you install Anaconda Python
1. aegea - on the command line, `conda install aegea` after you install Anaconda Python
1. aws account - Ask Olga or James for one
1. reflow. Click this link: https://github.com/grailbio/reflow/releases/download/reflow0.6.8/reflow0.6.8.darwin.amd64 Then on the command line:
```
cd ~/Downloads
chmod ugo+x reflow0.6.8.darwin.amd64
sudo mv reflow0.6.8.darwin.amd64 /usr/local/bin/reflow
```
    Now the command `reflow` should output a lot of stuff:
```
reflow
```
    Then configure reflow, following the [Confluence entry on Reflow](https://czbiohub.atlassian.net/wiki/spaces/DS/pages/838205454/reflow) instructions for configuration:
```
AWS_SDK_LOAD_CONFIG=1 reflow setup-ec2
AWS_SDK_LOAD_CONFIG=1 reflow setup-s3-repository czbiohub-reflow-quickstart-cache
AWS_SDK_LOAD_CONFIG=1 reflow setup-dynamodb-assoc czbiohub-reflow-quickstart
export AWS_REGION=us-west-2
```
1. Claim a folder within `s3://czbiohub-cupcakes/` with today's date and your username, e.g.:
```
s3://czbiohub-cupcakes/2018-10-09/olgabot/
```


### Note: This notebook *can* be run locally 

But there are some weird version issues, and anyway it's cooler to launch from AWS, so that's what we're doing.


# Step 0: Install Reflow on your laptop

## 0.01 Install Reflow


```
curl -L https://github.com/grailbio/reflow/releases/download/reflow0.6.8/reflow0.6.8.darwin.amd64 -o reflow
chmod ugo+x reflow
sudo mv reflow /usr/local/bin/reflow
```

Now the command `reflow` should output a lot of stuff:

```
reflow
```


## 0.02 Configure Reflow



```
AWS_SDK_LOAD_CONFIG=1 reflow setup-ec2
AWS_SDK_LOAD_CONFIG=1 reflow setup-s3-repository czbiohub-reflow-quickstart-cache
AWS_SDK_LOAD_CONFIG=1 reflow setup-dynamodb-assoc czbiohub-reflow-quickstart
export AWS_REGION=us-west-2
```

# Step 1: Launch Jupyter Notebooks from AWS

## 1.01 Use an existing packer image

### YOU can make your own packer image

I get questions about "how can I make my own image?" and people don't realize it's much easier than they think!

- You aren't beholden to what already exists there
- If the README instructions suck, ask me or James questions and we'll try to answer them. We'll also probably tell you to fix the README for future people with the same question
- Look at the [installation script for Jupyter](https://github.com/czbiohub/packer-images/blob/master/scripts/jupyter.sh) - it's super simple. You've already used those commands on AWS so it's not a huge leap to make a packer image.

Here's the `aegea launch` command. You'll want to change `olgabot-cc-jupyter` to your own instance name:


```
aegea launch --ami-tags Name=czbiohub-jupyter -t t2.xlarge --security-groups='R/RStudio Server and JupyterHub' --iam-role S3fromEC2 --duration-hours 24 olgabot-cc-jupyter
```

We're using a small instance (`t2.xlarge` - see [all instance options](https://www.ec2instances.info/)) because it has a good amount of memory and is pretty cheap. We don't need a lot of CPUs since we'll only be running a few things at a time.



#### Aegea launch errors
If you're getting an error that looks like:

```
 Mon  8 Oct - 10:31  ~ 
  aegea launch --ami-tags Name=czbiohub-jupyter -t t2.xlarge --security-groups='R/RStudio Server and JupyterHub' --iam-role S3fromEC2 --duration-hours 24 olgabot-cc-jupyter
Traceback (most recent call last):
  File "/anaconda3/bin/aegea", line 23, in <module>
    aegea.main()
  File "/anaconda3/lib/python3.6/site-packages/aegea/__init__.py", line 78, in main
    result = parsed_args.entry_point(parsed_args)
  File "/anaconda3/lib/python3.6/site-packages/aegea/launch.py", line 50, in launch
    dns_zone = DNSZone(config.dns.get("private_zone"))
  File "/anaconda3/lib/python3.6/site-packages/aegea/util/aws/__init__.py", line 183, in __init__
    raise AegeaException(msg.format(len(private_zones)))
aegea.util.exceptions.AegeaException: Found 2 private DNS zones; unable to determine zone to use. Set the dns.private_zone key in Aegea config
```


*NOTE: I found this fix by SEARCHING slack for "aegea dns". If you're getting an error, it's likely many other people are, too, so SEARCH slack if you're not getting a response on #eng-support right away*

You can do one of two things, one that will fix the problem forever or a quick fix that will only work once.


##### Edit your aegea config file so it never happens again

```
printf 'dns:\n  private_zone: aegea\n' >> ~/.config/aegea/config.yml
```

##### Fix it just this one time

Add `--no-dns` to your `aegea launch` command before the instance name (the last argument):

```
aegea launch --ami-tags Name=czbiohub-jupyter -t t2.xlarge --security-groups='R/RStudio Server and JupyterHub' --iam-role S3fromEC2 --no-dns --duration-hours 24 olgabot-cc-jupyter
```


## 1.02 Log into your instance


```
aegea ssh ubuntu@olgabot-cc-jupyter
```

## 1.03 Start Screen/Tmux
This will keep Jupyter notebook running forever even if your network connection breaks

Do one of:
```
screen
```
--- OR ---

```
tmux
```


## 1.04 Clone the `cupcakes` and `kmer-hashing` repositories

Cloning these puts this notebook on your AWS instance!

```
git clone https://github.com/czbiohub/cupcakes/
git clone https://github.com/czbiohub/kmer-hashing/
```




## 1.08 Launch jupyter notebook


```
jupyter notebook
```

## 1.09 Open another tab in your terminal with Command-T

Multiple tabs >> (are much better than)  multiple windows because it's much easier to navigate between them

- `Command-Shift-[` moves one tab to the left
- `Command-Shift-]` moves one tab to the right


## 1.10 Tunnel the notebook from AWS to your computer

This binds the remote port `8888` to your local port `8877`

```
aegea ssh ubuntu@olgabot-cc-jupyter -NL localhost:8877:localhost:8888 
```
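If the browser page in the next step doesn't load, it helps to know whether anything is actually listening on the local end of the tunnel. Here's a minimal sketch using only Python's standard library. The function name `port_is_open` is made up for this example, and it only checks that *something* accepts connections on the port, not that it's Jupyter.

```python
import socket


def port_is_open(host='localhost', port=8877, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        # create_connection raises OSError (e.g. connection refused) on failure
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if port_is_open():
    print('Tunnel looks up: open http://localhost:8877')
else:
    print('Nothing is listening on port 8877 yet')
```

Run it on your laptop, not on the AWS instance, since `8877` is the local side of the tunnel.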

## 1.11 Go to http://localhost:8877 on your laptop

The password is the same as the InnerHub wifi password.

## 1.12 Navigate to the cupcakes/2018 folder

- Open `001_how_to_be_a_bioinformagician_part01.ipynb`

## Now we are ready to read the data!

In [2]:
# Standard convention is to import python standard libraries first, then third-party libraries after that
# See list of standard libraries here: https://docs.python.org/3/library/
# Both import lists should be alphabetically sorted

# --- Python standard library --- #
# Easily grab filenames from a folder
import glob

# Amazing library that I use almost every day for:
# - chaining lists together into mega-lists
# - "multiplying" lists against each other to get the full product of combinations
import itertools

# Read/write javascript object notation (JSON) files
import json

# Perform path manipulations
import os

# --- Third-party (non-standard Python) libraries --- #
# python dataframes. very similar to R dataframes
import pandas as pd

# Make the number of characters allowed per column super big since our filenames are long
pd.options.display.max_colwidth = 500

In [3]:
compute_samples = pd.read_csv('../sourmash/lung_cancer_v4/compute/samples.csv')
print(compute_samples.shape)
compute_samples.head()

(5054, 10)


Unnamed: 0,id,read1,read2,name,output,trim_low_abundance_kmers,dna,protein,ksizes,scaled
0,A10_B000419_S34,s3://czbiohub-seqbot/fastqs/180516_A00111_0149_AH5CM2DSXX/rawdata/A10_B000419_S34/A10_B000419_S34_R1_001.fastq.gz,s3://czbiohub-seqbot/fastqs/180516_A00111_0149_AH5CM2DSXX/rawdata/A10_B000419_S34/A10_B000419_S34_R2_001.fastq.gz,A10_B000419_S34,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature,True,True,True,21273351,1000
1,A10_B000420_S82,s3://czbiohub-seqbot/fastqs/180516_A00111_0149_AH5CM2DSXX/rawdata/A10_B000420_S82/A10_B000420_S82_R1_001.fastq.gz,s3://czbiohub-seqbot/fastqs/180516_A00111_0149_AH5CM2DSXX/rawdata/A10_B000420_S82/A10_B000420_S82_R2_001.fastq.gz,A10_B000420_S82,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000420_S82.signature,True,True,True,21273351,1000
2,A10_B002073_S166,s3://czbiohub-seqbot/fastqs/180516_A00111_0149_AH5CM2DSXX/rawdata/A10_B002073_S166/A10_B002073_S166_R1_001.fastq.gz,s3://czbiohub-seqbot/fastqs/180516_A00111_0149_AH5CM2DSXX/rawdata/A10_B002073_S166/A10_B002073_S166_R2_001.fastq.gz,A10_B002073_S166,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B002073_S166.signature,True,True,True,21273351,1000
3,A10_B002078_S202,s3://czbiohub-seqbot/fastqs/180516_A00111_0149_AH5CM2DSXX/rawdata/A10_B002078_S202/A10_B002078_S202_R1_001.fastq.gz,s3://czbiohub-seqbot/fastqs/180516_A00111_0149_AH5CM2DSXX/rawdata/A10_B002078_S202/A10_B002078_S202_R2_001.fastq.gz,A10_B002078_S202,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B002078_S202.signature,True,True,True,21273351,1000
4,A10_B002095_S118,s3://czbiohub-seqbot/fastqs/180516_A00111_0149_AH5CM2DSXX/rawdata/A10_B002095_S118/A10_B002095_S118_R1_001.fastq.gz,s3://czbiohub-seqbot/fastqs/180516_A00111_0149_AH5CM2DSXX/rawdata/A10_B002095_S118/A10_B002095_S118_R2_001.fastq.gz,A10_B002095_S118,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B002095_S118.signature,True,True,True,21273351,1000


## look at aws folder and save aws output to file

In [7]:
prefix = 's3://olgabot-maca/facs/sourmash_index_all'
txt = 'sourmash_databases.txt'

! aws s3 ls $prefix/ > $txt
! cat $txt

                           PRE tabula-muris-k21-dna/
                           PRE tabula-muris-k21-protein/
                           PRE tabula-muris-k27-dna/
                           PRE tabula-muris-k27-protein/
                           PRE tabula-muris-k33-dna/
                           PRE tabula-muris-k33-protein/
                           PRE tabula-muris-k51-dna/
                           PRE tabula-muris-k51-protein/


Running each line one-by-one is left as an exercise to the reader :)

- To uncomment each cell, put your cursor in it, then:
    1. select-all with Command-A 
    2. uncomment with Command-/

In [None]:
# databases = pd.read_table(txt, delim_whitespace=True, header=None, names=['is_prefix', 'database_name'])
# databases

In [None]:
# databases['database_name'] = databases['database_name'].str.strip('/')
# databases

In [None]:
# databases = databases.drop('is_prefix', axis=1)
# databases

In [None]:
# databases['ksize'] = databases['database_name'].str.extract(r'k(\d+)', expand=False).astype(int)
# databases

In [None]:
# databases['sequence_to_compare'] = databases['database_name'].map(lambda x: x.split('-')[-1])
# databases

In [None]:
# databases = databases.set_index('database_name')
# databases

In [8]:
databases = pd.read_table(txt, delim_whitespace=True, header=None, names=['is_prefix', 'database_name'])
databases['database_name'] = databases['database_name'].str.strip('/')
databases = databases.drop('is_prefix', axis=1)
databases['ksize'] = databases['database_name'].str.extract(r'k(\d+)', expand=False).astype(int)
databases['sequence_to_compare'] = databases['database_name'].map(lambda x: x.split('-')[-1])
databases['database'] = databases['database_name'].map(lambda x: f'{prefix}/{x}/{x}/')
databases = databases.set_index('database_name')
databases

database_name,ksize,sequence_to_compare,database
tabula-muris-k21-dna,21,dna,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k21-dna/tabula-muris-k21-dna/
tabula-muris-k21-protein,21,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k21-protein/tabula-muris-k21-protein/
tabula-muris-k27-dna,27,dna,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k27-dna/tabula-muris-k27-dna/
tabula-muris-k27-protein,27,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k27-protein/tabula-muris-k27-protein/
tabula-muris-k33-dna,33,dna,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k33-dna/tabula-muris-k33-dna/
tabula-muris-k33-protein,33,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k33-protein/tabula-muris-k33-protein/
tabula-muris-k51-dna,51,dna,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k51-dna/tabula-muris-k51-dna/
tabula-muris-k51-protein,51,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k51-protein/tabula-muris-k51-protein/
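
To see what each parsing step does without needing AWS access or pandas, here's a rough standard-library sketch of the same idea. The `listing` text is a shortened copy of the `aws s3 ls` output above, and `parse_listing` is a name invented for this example, not something from the notebook.

```python
import re

# A few lines in the shape printed by `aws s3 ls <prefix>/`
listing = """\
                           PRE tabula-muris-k21-dna/
                           PRE tabula-muris-k21-protein/
                           PRE tabula-muris-k51-protein/
"""
prefix = 's3://olgabot-maca/facs/sourmash_index_all'


def parse_listing(text, prefix):
    """Return one dict per 'PRE <name>/' line."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and anything that is not a "PRE <name>/" entry
        if len(fields) != 2 or fields[0] != 'PRE':
            continue
        name = fields[1].strip('/')
        # Pull the k-mer size out of names like "tabula-muris-k21-dna"
        match = re.search(r'k(\d+)', name)
        if match is None:
            continue
        rows.append({
            'database_name': name,
            'ksize': int(match.group(1)),
            'sequence_to_compare': name.split('-')[-1],
            'database': f'{prefix}/{name}/{name}/',
        })
    return rows


databases_rows = parse_listing(listing, prefix)
for row in databases_rows:
    print(row['database_name'], row['ksize'], row['sequence_to_compare'])
```

Each key in the dict corresponds to one of the columns built in the pandas cell.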


### Only compare on protein databases

In [9]:
protein_databases = databases.query('sequence_to_compare == "protein"')
protein_databases

database_name,ksize,sequence_to_compare,database
tabula-muris-k21-protein,21,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k21-protein/tabula-muris-k21-protein/
tabula-muris-k27-protein,27,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k27-protein/tabula-muris-k27-protein/
tabula-muris-k33-protein,33,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k33-protein/tabula-muris-k33-protein/
tabula-muris-k51-protein,51,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k51-protein/tabula-muris-k51-protein/


## "Multiply" the samples x databases x ignore abundances to get every combination

Use `product` from Python's [itertools](https://docs.python.org/3/library/itertools.html) which is my favorite standard library module. (Though [collections](https://docs.python.org/3/library/collections.html) is a close second)

Remember that Python was designed as "batteries included", so if you find yourself writing a ton of nested for loops, know that many people have done that before and have figured out better ways to do it.
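
As a small illustration of that point, here's a sketch comparing three nested for loops with a single `itertools.product` call. The signature filenames are made-up stand-ins, not real files from the bucket.

```python
import itertools

# Hypothetical stand-ins for the real signature paths and database names
signatures = ['sample_A.signature', 'sample_B.signature']
ignore_abundances = (True, False)
database_names = ['tabula-muris-k21-protein', 'tabula-muris-k27-protein']

# The nested-loop way
nested = []
for signature in signatures:
    for ignore_abundance in ignore_abundances:
        for database_name in database_names:
            nested.append((signature, ignore_abundance, database_name))

# The itertools way: same tuples, same order (the last argument varies fastest)
combinations = list(itertools.product(signatures, ignore_abundances, database_names))

assert combinations == nested
print(len(combinations))  # 2 * 2 * 2 = 8
```

The same arithmetic explains the size of the real table: 5054 samples x 2 abundance settings x 4 protein databases = 40432 rows.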

In [3]:
ignore_abundances = True, False

data = list(itertools.product(compute_samples['output'], ignore_abundances,
                              protein_databases.index))

samples = pd.DataFrame(data, columns=['signature', 'ignore_abundance', 'database_name'])
print(samples.shape)
samples

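NameError: name 'compute_samples' is not defined

This error means `compute_samples` has not been loaded in the current kernel. Run the `pd.read_csv` cell above first, then rerun this cell.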

In [11]:


samples['sample_id'] = samples['signature'].map(lambda x: os.path.basename(x).split('.')[0])
samples.head()

Unnamed: 0,ignore_abundance,signature,database_name,sample_id
0,True,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature,tabula-muris-k21-protein,A10_B000419_S34
1,True,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature,tabula-muris-k27-protein,A10_B000419_S34
2,True,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature,tabula-muris-k33-protein,A10_B000419_S34
3,True,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature,tabula-muris-k51-protein,A10_B000419_S34
4,True,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000420_S82.signature,tabula-muris-k21-protein,A10_B000420_S82


In [13]:
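output_prefix = 's3://olgabot-maca/lung_cancer/sourmash_search/'

# Build a per-run id first, since `samples` has no `id` column yet
samples['id'] = samples['sample_id'] + '_ignore-abundance=' + samples['ignore_abundance'].astype(str)
samples['output'] = output_prefix + samples['database_name'] + '/' + samples['id'] + '.csv'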
samples['output'].head()

0    s3://olgabot-maca/lung_cancer/sourmash_search/tabula-muris-k21-protein/A10_B000419_S34_ignore-abundance=True.csv
1    s3://olgabot-maca/lung_cancer/sourmash_search/tabula-muris-k27-protein/A10_B000419_S34_ignore-abundance=True.csv
2    s3://olgabot-maca/lung_cancer/sourmash_search/tabula-muris-k33-protein/A10_B000419_S34_ignore-abundance=True.csv
3    s3://olgabot-maca/lung_cancer/sourmash_search/tabula-muris-k51-protein/A10_B000419_S34_ignore-abundance=True.csv
4    s3://olgabot-maca/lung_cancer/sourmash_search/tabula-muris-k21-protein/A10_B000420_S82_ignore-abundance=True.csv
Name: output, dtype: object
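
Here's the same path-building logic for a single row in plain Python, which can be easier to debug than the whole-column version. The helper names `sample_id_from_signature` and `make_output_path` are invented for this sketch.

```python
import os


def sample_id_from_signature(signature):
    """'s3://bucket/path/A10_B000419_S34.signature' -> 'A10_B000419_S34'"""
    return os.path.basename(signature).split('.')[0]


def make_output_path(output_prefix, database_name, sample_id, ignore_abundance):
    """Build the per-run CSV path; output_prefix is assumed to end with '/'."""
    run_id = f'{sample_id}_ignore-abundance={ignore_abundance}'
    return f'{output_prefix}{database_name}/{run_id}.csv'


signature = 's3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature'
sample_id = sample_id_from_signature(signature)
output = make_output_path('s3://olgabot-maca/lung_cancer/sourmash_search/',
                          'tabula-muris-k21-protein', sample_id, True)
print(output)
```

The printed path should match row 0 of the `output` column above.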

In [27]:
samples_databases = samples.join(protein_databases, on='database_name')
print(samples_databases.shape)
samples_databases.head()

(40432, 9)


Unnamed: 0,ignore_abundance,signature,database_name,sample_id,id,output,ksize,sequence_to_compare,database
0,True,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature,tabula-muris-k21-protein,A10_B000419_S34,A10_B000419_S34_ignore-abundance=True,s3://olgabot-maca/lung_cancer/sourmash_search/tabula-muris-k21-protein/A10_B000419_S34_ignore-abundance=True.csv,21,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k21-protein/tabula-muris-k21-protein/
1,True,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature,tabula-muris-k27-protein,A10_B000419_S34,A10_B000419_S34_ignore-abundance=True,s3://olgabot-maca/lung_cancer/sourmash_search/tabula-muris-k27-protein/A10_B000419_S34_ignore-abundance=True.csv,27,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k27-protein/tabula-muris-k27-protein/
2,True,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature,tabula-muris-k33-protein,A10_B000419_S34,A10_B000419_S34_ignore-abundance=True,s3://olgabot-maca/lung_cancer/sourmash_search/tabula-muris-k33-protein/A10_B000419_S34_ignore-abundance=True.csv,33,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k33-protein/tabula-muris-k33-protein/
3,True,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature,tabula-muris-k51-protein,A10_B000419_S34,A10_B000419_S34_ignore-abundance=True,s3://olgabot-maca/lung_cancer/sourmash_search/tabula-muris-k51-protein/A10_B000419_S34_ignore-abundance=True.csv,51,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k51-protein/tabula-muris-k51-protein/
4,True,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000420_S82.signature,tabula-muris-k21-protein,A10_B000420_S82,A10_B000420_S82_ignore-abundance=True,s3://olgabot-maca/lung_cancer/sourmash_search/tabula-muris-k21-protein/A10_B000420_S82_ignore-abundance=True.csv,21,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k21-protein/tabula-muris-k21-protein/


In [29]:
samples_databases['id'] = samples_databases.apply(lambda x: 
                              '{sample_id}_ignore-abundance={ignore_abundance}_{database_name}'.format(**x), 
                              axis=1)
samples_databases.head()

Unnamed: 0,ignore_abundance,signature,database_name,sample_id,id,output,ksize,sequence_to_compare,database
0,True,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature,tabula-muris-k21-protein,A10_B000419_S34,A10_B000419_S34_ignore-abundance=True_tabula-muris-k21-protein,s3://olgabot-maca/lung_cancer/sourmash_search/tabula-muris-k21-protein/A10_B000419_S34_ignore-abundance=True.csv,21,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k21-protein/tabula-muris-k21-protein/
1,True,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature,tabula-muris-k27-protein,A10_B000419_S34,A10_B000419_S34_ignore-abundance=True_tabula-muris-k27-protein,s3://olgabot-maca/lung_cancer/sourmash_search/tabula-muris-k27-protein/A10_B000419_S34_ignore-abundance=True.csv,27,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k27-protein/tabula-muris-k27-protein/
2,True,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature,tabula-muris-k33-protein,A10_B000419_S34,A10_B000419_S34_ignore-abundance=True_tabula-muris-k33-protein,s3://olgabot-maca/lung_cancer/sourmash_search/tabula-muris-k33-protein/A10_B000419_S34_ignore-abundance=True.csv,33,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k33-protein/tabula-muris-k33-protein/
3,True,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000419_S34.signature,tabula-muris-k51-protein,A10_B000419_S34,A10_B000419_S34_ignore-abundance=True_tabula-muris-k51-protein,s3://olgabot-maca/lung_cancer/sourmash_search/tabula-muris-k51-protein/A10_B000419_S34_ignore-abundance=True.csv,51,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k51-protein/tabula-muris-k51-protein/
4,True,s3://olgabot-maca/lung_cancer/sourmash_v4/A10_B000420_S82.signature,tabula-muris-k21-protein,A10_B000420_S82,A10_B000420_S82_ignore-abundance=True_tabula-muris-k21-protein,s3://olgabot-maca/lung_cancer/sourmash_search/tabula-muris-k21-protein/A10_B000420_S82_ignore-abundance=True.csv,21,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k21-protein/tabula-muris-k21-protein/


In [30]:
chosen_ids = ['C14_B003528_S62',
 'D1_B003125_S25',
 'E19_B003570_S199',
 'F21_B000420_S213',
 'G10_B003586_S142',
 'G4_B003570_S232',
 'G9_B003511_S57',
 'H7_B003588_S211',
 'I22_B002095_S22',
 'I3_B003573_S63',
 'J11_B003573_S95',
 'J8_B003528_S224',
 'K7_B002073_S103',
 'L16_B003588_S16',
 'L5_B003588_S5',
 'M1_B000420_S61',
 'M23_B002097_S251',
 'N15_B000420_S99',
 'O3_B003573_S207',
 'P14_B000420_S146',
 'P2_B003125_S14']

samples_subset = samples_databases.query('sample_id in @chosen_ids')
print(samples_subset.shape)
samples_subset.head()

(168, 9)


Unnamed: 0,ignore_abundance,signature,database_name,sample_id,id,output,ksize,sequence_to_compare,database
2784,True,s3://olgabot-maca/lung_cancer/sourmash_v4/C14_B003528_S62.signature,tabula-muris-k21-protein,C14_B003528_S62,C14_B003528_S62_ignore-abundance=True_tabula-muris-k21-protein,s3://olgabot-maca/lung_cancer/sourmash_search/tabula-muris-k21-protein/C14_B003528_S62_ignore-abundance=True.csv,21,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k21-protein/tabula-muris-k21-protein/
2785,True,s3://olgabot-maca/lung_cancer/sourmash_v4/C14_B003528_S62.signature,tabula-muris-k27-protein,C14_B003528_S62,C14_B003528_S62_ignore-abundance=True_tabula-muris-k27-protein,s3://olgabot-maca/lung_cancer/sourmash_search/tabula-muris-k27-protein/C14_B003528_S62_ignore-abundance=True.csv,27,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k27-protein/tabula-muris-k27-protein/
2786,True,s3://olgabot-maca/lung_cancer/sourmash_v4/C14_B003528_S62.signature,tabula-muris-k33-protein,C14_B003528_S62,C14_B003528_S62_ignore-abundance=True_tabula-muris-k33-protein,s3://olgabot-maca/lung_cancer/sourmash_search/tabula-muris-k33-protein/C14_B003528_S62_ignore-abundance=True.csv,33,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k33-protein/tabula-muris-k33-protein/
2787,True,s3://olgabot-maca/lung_cancer/sourmash_v4/C14_B003528_S62.signature,tabula-muris-k51-protein,C14_B003528_S62,C14_B003528_S62_ignore-abundance=True_tabula-muris-k51-protein,s3://olgabot-maca/lung_cancer/sourmash_search/tabula-muris-k51-protein/C14_B003528_S62_ignore-abundance=True.csv,51,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k51-protein/tabula-muris-k51-protein/
4368,True,s3://olgabot-maca/lung_cancer/sourmash_v4/D1_B003125_S25.signature,tabula-muris-k21-protein,D1_B003125_S25,D1_B003125_S25_ignore-abundance=True_tabula-muris-k21-protein,s3://olgabot-maca/lung_cancer/sourmash_search/tabula-muris-k21-protein/D1_B003125_S25_ignore-abundance=True.csv,21,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k21-protein/tabula-muris-k21-protein/


In [31]:
samples_no_sample_id = samples_subset.drop(columns=['sample_id'])
samples_no_sample_id = samples_no_sample_id.set_index('id')
print(samples_no_sample_id.shape)
samples_no_sample_id.head()

(168, 7)


id,ignore_abundance,signature,database_name,output,ksize,sequence_to_compare,database
C14_B003528_S62_ignore-abundance=True_tabula-muris-k21-protein,True,s3://olgabot-maca/lung_cancer/sourmash_v4/C14_B003528_S62.signature,tabula-muris-k21-protein,s3://olgabot-maca/lung_cancer/sourmash_search/tabula-muris-k21-protein/C14_B003528_S62_ignore-abundance=True.csv,21,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k21-protein/tabula-muris-k21-protein/
C14_B003528_S62_ignore-abundance=True_tabula-muris-k27-protein,True,s3://olgabot-maca/lung_cancer/sourmash_v4/C14_B003528_S62.signature,tabula-muris-k27-protein,s3://olgabot-maca/lung_cancer/sourmash_search/tabula-muris-k27-protein/C14_B003528_S62_ignore-abundance=True.csv,27,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k27-protein/tabula-muris-k27-protein/
C14_B003528_S62_ignore-abundance=True_tabula-muris-k33-protein,True,s3://olgabot-maca/lung_cancer/sourmash_v4/C14_B003528_S62.signature,tabula-muris-k33-protein,s3://olgabot-maca/lung_cancer/sourmash_search/tabula-muris-k33-protein/C14_B003528_S62_ignore-abundance=True.csv,33,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k33-protein/tabula-muris-k33-protein/
C14_B003528_S62_ignore-abundance=True_tabula-muris-k51-protein,True,s3://olgabot-maca/lung_cancer/sourmash_v4/C14_B003528_S62.signature,tabula-muris-k51-protein,s3://olgabot-maca/lung_cancer/sourmash_search/tabula-muris-k51-protein/C14_B003528_S62_ignore-abundance=True.csv,51,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k51-protein/tabula-muris-k51-protein/
D1_B003125_S25_ignore-abundance=True_tabula-muris-k21-protein,True,s3://olgabot-maca/lung_cancer/sourmash_v4/D1_B003125_S25.signature,tabula-muris-k21-protein,s3://olgabot-maca/lung_cancer/sourmash_search/tabula-muris-k21-protein/D1_B003125_S25_ignore-abundance=True.csv,21,protein,s3://olgabot-maca/facs/sourmash_index_all/tabula-muris-k21-protein/tabula-muris-k21-protein/


In [32]:
# folder = f'../sourmash/lung_cancer_v4/search_{database_name}'
folder = '../sourmash/lung_cancer_v4/search_protein_databases'
# -p avoids the "File exists" error when the folder is already there
! mkdir -p $folder


In [38]:


config = {
    "program": "../../../reflow/sourmash_search.rf",
    "runs_file": "samples.csv"
}


# Make sure the index (the ids!) are unique
assert samples_no_sample_id.index.is_unique


samples_no_sample_id.to_csv(f'{folder}/samples.csv', index=True)


with open(f'{folder}/config.json', 'w') as f:
    json.dump(config, f)

In [39]:
pwd

'/Users/olgabot/code/kmer-hashing/notebooks'

In [35]:
ls -lha $folder

total 136
drwxr-xr-x  4 olgabot  staff   136B Sep 24 04:29 ./
drwxr-xr-x  6 olgabot  staff   204B Sep 21 10:31 ../
-rw-r--r--  1 olgabot  staff    77B Oct  5 17:51 config.json
-rw-r--r--  1 olgabot  staff    62K Oct  5 17:51 samples.csv


In [36]:
! wc -l $folder/samples.csv

     169 ../sourmash/lung_cancer_v4/search_protein_databases/samples.csv
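
The 169 lines are the 168 runs plus one header row. The `ls` and `wc -l` checks above can also be sketched in Python, which lets you confirm the ids are still unique after the round trip to disk. This is a sketch under assumptions: `check_run_folder` is a made-up helper, it assumes `runs_file` sits next to `config.json` (as it does in this notebook), and the demo uses a throwaway folder with two fake runs rather than the real `samples.csv`.

```python
import csv
import json
import os
import tempfile


def check_run_folder(folder):
    """Return the number of runs after sanity-checking config.json and its runs file."""
    with open(os.path.join(folder, 'config.json')) as f:
        config = json.load(f)
    runs_path = os.path.join(folder, config['runs_file'])
    with open(runs_path, newline='') as f:
        rows = list(csv.DictReader(f))
    ids = [row['id'] for row in rows]
    if len(ids) != len(set(ids)):
        raise ValueError('duplicate ids in ' + runs_path)
    return len(rows)


# Demo on a temporary folder with two made-up runs
with tempfile.TemporaryDirectory() as demo_folder:
    with open(os.path.join(demo_folder, 'config.json'), 'w') as f:
        json.dump({'program': 'sourmash_search.rf', 'runs_file': 'samples.csv'}, f)
    with open(os.path.join(demo_folder, 'samples.csv'), 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['id', 'ksize'])
        writer.writerow(['run_1', 21])
        writer.writerow(['run_2', 27])
    n_runs = check_run_folder(demo_folder)

print(n_runs)
```

On the real folder, `check_run_folder(folder)` would be expected to return 168 if the files were written as shown above.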
