# 02 - Add hydrophobic-polar protein encoding

Explore an additional protein coding in sourmash (hydrophillic/hydrophobic encoding)

Sometimes we want to work with development versions of software tools. To keep these separate from the master versions, we need to make new conda environments to run them in.

Because of how jupyter and conda interact, it's best to install software on the command line, and run jupyter notebook from within that conda environment. So execute the following on the command line to install sourmash with the hp encoding code and then launch a jupyter notebook.

## Run this on the command line, prior to starting this notebook:
```
# First, make a folder to work in:
cd ~/2019-fall-rotation-diblab # change into your rotation folder

#now download the czbiohub version of sourmash (need changes in PR #758 https://github.com/dib-lab/sourmash/pull/758)
git clone https://github.com/czbiohub/sourmash czbiohub_sourmash #drops the code into new "czbiohub_sourmash" folder

cd czbiohub_sourmash
conda create -y -c bioconda -c conda-forge -n sourmash-hp jupyter bam2fasta python=3.7 # new environment with python=3.7 and jupyter
cd czbiohub_sourmash
pip install -e '.' # do a developer install of the code in this folder

# activate the environment
conda activate sourmash-hp

#check that the software was installed
sourmash
sourmash compute -h

# now let's run a jupyter notebook from this environment:
cd ../ # change back out to main rotation directory
jupyter notebook
```

In [6]:
# Now, let's check the sourmash compute options
!sourmash compute -h

usage: sourmash [--protein] [--no-protein] [--dayhoff] [--no-dayhoff] [--hp]
                [--no-hp] [--dna] [--no-dna] [-q] [--input-is-protein]
                [-k KSIZES] [-n NUM_HASHES] [--check-sequence] [-f]
                [-o OUTPUT] [--singleton] [--merge MERGED] [--name-from-first]
                [--input-is-10x] [--count-valid-reads COUNT_VALID_READS]
                [--write-barcode-meta-csv WRITE_BARCODE_META_CSV]
                [-p PROCESSES] [--save-fastas SAVE_FASTAS]
                [--line-count LINE_COUNT] [--track-abundance]
                [--scaled SCALED] [--seed SEED] [--randomize]
                [--license LICENSE]
                [--rename-10x-barcodes RENAME_10X_BARCODES]
                [--barcodes-file BARCODES_FILE]
                filenames [filenames ...]
sourmash: error: the following arguments are required: filenames


You should see a new option available for sourmash compute: `--hp`. 

You previously ran:
+ DNA k = 21, 31, 51; scaled = 2000
+ RNA k = 21, 31, 51; scaled = 2000
+ protein k = 7, 11, 17; scaled = 2000; no encoding
+ protein k = 7, 11, 17; scaled = 2000; dayhoff encoding

You should now be able to compute the following signatures:
+ protein k = 7, 11, 17; scaled = 2000; hp encoding

### Task: 
  - Build signatures with hp encoding and sourmash compare matrices, as before. How does hp encoding compare to dayhoff encoding, or signatures built with the full AA alphabet? What about the DNA/RNA level? Can you build a visualization to showcase these differences?

For example, the following generates hp signatures for the bacteroides dataset:

In [9]:
%%bash 
# ^this is magic code telling jupyter that this block is entirely bash code.

mkdir -p sigs/bacteroides/protein/hp
# generate hp signatures for bacteroides
for infile in 2018-test_datasets/bacteroides/protein/*faa.gz; do out_name=$(basename $infile .faa.gz); sourmash compute -k 7,11,17 --input-is-protein --hp --track-abundance -o sigs/bacteroides/protein/hp/${out_name}.sig ${infile}; done
# and build the compare matrix for k=7:
sourmash compare -k 7 --csv sourmash_compare/bacteroides_k7_prot_hp_comp.csv sigs/bacteroides/protein/hp/*.sig


[K== This is sourmash version 2.0.0a10.dev125+gbc9a050. ==
[K== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

[Kcomputing signatures for files: 2018-test_datasets/bacteroides/protein/GCA_000011065.1_ASM1106v1_protein.faa.gz
[KComputing signature for ksizes: [7, 11, 17]
[KComputing only protein (and not nucleotide) signatures.
[KComputing a total of 3 signature(s).
[KTracking abundance of input k-mers.
[K... reading sequences from 2018-test_datasets/bacteroides/protein/GCA_000011065.1_ASM1106v1_protein.faa.gz
[K...2018-test_datasets/bacteroides/protein/GCA_000011065.1_ASM1106v1_protein.faa.gz 4824 sequences[Kcalculated 6 signatures for 4825 sequences in 2018-test_datasets/bacteroides/protein/GCA_000011065.1_ASM1106v1_protein.faa.gz
[Ktime taken to save signatures is 0.00308 seconds[Ksaved signature(s) to sigs/bacteroides/protein/hp/GCA_000011065.1_ASM1106v1_protein.sig. Note: signature license is CC0.
[K== This is sourmash version 2.0.0a10.d

CalledProcessError: Command 'b'# ^this is magic code telling jupyter that this block is entirely bash code.\n\n#For example, you can run:\n\nmkdir -p sigs/bacteroides/protein/hp\n# generate hp signatures for bacteroides\nfor infile in 2018-test_datasets/bacteroides/protein/*faa.gz; do out_name=$(basename $infile .faa.gz); sourmash compute -k 7,11,17 --input-is-protein --hp --track-abundance -o sigs/bacteroides/protein/hp/${out_name}.sig ${infile}; done\n# and build the compare matrix for k=7:\nsourmash compare -k 7 --csv sourmash_compare/bacteroides_k7_prot_hp_comp.csv sigs/bacteroides/protein/hp/*.sig\n'' returned non-zero exit status 2.