Paths setup:
- data dir (including indexing file)
- external software paths
- temp data path
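
A minimal sketch of how the three kinds of paths could be kept in one configuration file. This is not the actual `setup_paths` from `lib_phage`; the function name `load_or_create_config`, the JSON layout, and the default values are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


def load_or_create_config(config_file, defaults):
    """Return the stored path configuration, creating the file from defaults if absent."""
    config_file = Path(config_file)
    if config_file.exists():
        with config_file.open() as handle:
            return json.load(handle)
    # No configuration yet: write the defaults so later runs reuse them.
    config_file.parent.mkdir(parents=True, exist_ok=True)
    with config_file.open("w") as handle:
        json.dump(defaults, handle, indent=2)
    return dict(defaults)


# Hypothetical defaults covering the three kinds of paths listed above.
DEFAULT_PATHS = {
    "data_dir": "data",
    "bin_paths": {"phanotate": "phanotate.py", "hhblits": "hhblits"},
    "temp_dir": str(Path(tempfile.gettempdir()) / "phage_tmp"),
}
```

Calling `load_or_create_config("config/paths.json", DEFAULT_PATHS)` would create the file on the first run and read it back on later runs, so each machine can keep its own paths outside the notebook.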

Levels of data analysis:
- Protein domain (Krzysiek)
- Protein (Bogna)
- Gene cluster (?)
- Genome (Janusz)

Structure:
- workflow notebook(s): one for now, but more may be needed (whitelist them for usage)
- lib dir with code to execute in notebook
- submodules dir with external code developed by others
- data created at individual steps should contain metadata about software used
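
One possible way to attach software metadata to each step's output is a small JSON file written next to it. The sketch below is only an illustration: the helper name `write_step_metadata`, the `.meta.json` suffix, and the recorded fields are assumptions, not an agreed format.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def write_step_metadata(output_file, software, version, params):
    """Write a JSON sidecar next to output_file describing how it was produced."""
    output_file = Path(output_file)
    metadata = {
        "output": output_file.name,
        "software": software,
        "version": version,
        "params": params,
        "created": datetime.now(timezone.utc).isoformat(),
    }
    # The sidecar keeps the full output name and appends a suffix to it.
    sidecar = output_file.with_name(output_file.name + ".meta.json")
    with sidecar.open("w") as handle:
        json.dump(metadata, handle, indent=2, sort_keys=True)
    return sidecar
```

The version string and parameters are passed in by the caller, so the sketch does not assume how any particular tool reports its version.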

Overall pipeline:
1. Get/load genome data (path to data dir and look for indexing file)
2. Compile data (RM 1_compile data):
- extract genomes (should it be done along with loading?)
- predict ORFs (run phanotate [params], but leave an option for other software)
- finalise for next step
3. Get representative proteins
4. Perform all vs all comparison
- create profiles for each protein with hhblits [params]
- build db
- search all vs all [params]
- create results table
- network creation with define_families [params]
5. Protein annotation (RM 3_annotations)
6. Domain annotation with phage-domains-finder (or other software)
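
To make the "results table" and "network creation" parts of step 4 more concrete, here is a rough sketch that groups proteins into families as connected components of a similarity network. It is not what `define_families` does (its behaviour is not described here); the connected-components approach, the function name `group_into_families`, the 0 to 1 score scale, and the 0.95 threshold are all assumptions chosen for illustration.

```python
from collections import defaultdict


def group_into_families(proteins, hits, min_score=0.95):
    """Group proteins into connected components of the similarity network.

    hits is an iterable of (query, target, score) rows; every query and
    target is assumed to be present in proteins.
    """
    parent = {protein: protein for protein in proteins}

    def find(node):
        # Follow parent links to the root, shortening the path on the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for query, target, score in hits:
        if score >= min_score and query != target:
            root_q, root_t = find(query), find(target)
            if root_q != root_t:
                parent[root_t] = root_q  # merge the two families

    families = defaultdict(list)
    for protein in proteins:
        families[find(protein)].append(protein)
    return sorted(sorted(members) for members in families.values())


# Made-up example rows standing in for the all vs all results table.
hits = [
    ("p1", "p2", 0.99),
    ("p2", "p3", 0.97),
    ("p3", "p4", 0.40),  # below the threshold, so no edge is added
]
print(group_into_families(["p1", "p2", "p3", "p4"], hits))
# prints [['p1', 'p2', 'p3'], ['p4']]
```

Single-linkage grouping like this is simple but can chain distant proteins together through intermediates, so a real family definition may need stricter criteria.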

Issues to solve:
- data storage and computing resources (Argon has only 30 CPUs)
- break the workflow into separate notebooks at computational bottlenecks, such as the hhblits profiles?
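
If the workflow is split at the expensive steps, each notebook needs a way to pick up the results of the previous one. A minimal sketch of one option, caching a step's result on disk so it is computed only once; the helper name `run_cached` and the use of pickle files are assumptions, and pickle may not suit very large outputs.

```python
import pickle
from pathlib import Path


def run_cached(step_name, compute, cache_dir):
    """Return the cached result of a step, computing and storing it on first use."""
    cache_file = Path(cache_dir) / (step_name + ".pkl")
    if cache_file.exists():
        with cache_file.open("rb") as handle:
            return pickle.load(handle)
    # Not cached yet: run the (possibly slow) step and store its result.
    result = compute()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    with cache_file.open("wb") as handle:
        pickle.dump(result, handle)
    return result
```

A later notebook could then call `run_cached` with the same step name and cache directory and receive the stored result without rerunning the step.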

What machine do we need to run this package?
- multicore (16+ cores) for running computationally intensive software such as hhsuite; memory can be an issue too
- significant disk space (1 TB+) to store the uniclust DB and other databases for hhsuite, along with genomic data
- ability to install/compile external software (hhsuite, brew/apt packages from Rafał's code)
- virtualenv support to run notebooks, Python 3.6+ installed
- easy access via ssh for all members of the group
- some sort of queue system would be nice but not essential
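
The core and disk requirements above can be checked on a candidate machine with the standard library alone. This is a small sketch; the function name `check_machine` is hypothetical, and the default thresholds simply mirror the 16+ cores and 1 TB+ figures from the list.

```python
import os
import shutil


def check_machine(path=".", min_cores=16, min_free_tb=1.0):
    """Report whether this machine meets the core and free-disk thresholds."""
    cores = os.cpu_count() or 1  # cpu_count() may return None
    free_tb = shutil.disk_usage(path).free / 1e12  # bytes to terabytes
    return {
        "cores": cores,
        "free_tb": round(free_tb, 2),
        "enough_cores": cores >= min_cores,
        "enough_disk": free_tb >= min_free_tb,
    }
```

Passing the data directory as `path` checks the file system where the databases would actually be stored; memory and queue-system availability are not covered by this sketch.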

In [None]:
### setup ###

### imports
from lib_phage.utils import setup_paths
from lib_phage.database import load_indexing_file, validate_data
from lib_phage.predict_ORF import predict_ORFs_phanotate

### paths
# load the existing configuration from file, or create a new one if there is none
data_dir, bin_paths, temp_dir = setup_paths()

### parameters used by all the software
# ORF prediction params

In [None]:
# 1. Get/load genome data (path to data dir and look for indexing file)
# look for indexing file in data dir
# if found: load the indexing file into a df (versioning of the indexing file?)
# if not found: download and index the data

data_index = load_indexing_file(data_dir=data_dir, data_version='')

# select genomes of interest for further analysis
#? how to handle genome selection in the notebook so that it stays universal? keep a discrete dataset to analyse?
#? or just provide an example notebook and whitelist its name in git

In [None]:
# 2. Compile data (RM 1_compile data):
# - validate genome data (as in the script)

dataset = validate_data(data_index)

# - predict ORFs (run phanotate [params], but leave an option for other software)

predicted_orfs = predict_ORFs_phanotate(dataset, phanotate_bin=bin_paths['phanotate'])

# - finalise for next step - combine




In [None]:
# 3. Get representative proteins


In [None]:
# 4. Perform all vs all comparison
# - create profiles for each protein with hhblits [params]
# - build db
# - search all vs all [params]
# - create results table
# - network creation with define_families [params]


In [None]:
# 5. Protein annotation (RM 3_annotations)


In [None]:
# 6. Domain annotation with phage-domains-finder (or other software)