# FAMUS: Functional Annotation in Multiple datasets Using Siamese neural networks

Hello! Thank you for trying FAMUS.   
This notebook shows how to use FAMUS to annotate sequences and train custom models.

## Main scripts

FAMUS has two main scripts: easy_train and easy_classify.  
To classify sequences using pre-existing models, we will only need easy_classify.
The models to use for classification are configured in cfg.yaml but can be overridden on the command line.  
Both scripts tolerate interruptions - if they are stopped before finishing, they will continue from where they left off when restarted.

### easy_classify.py

Labels input sequences using existing database models installed in `./models/`.  
An interrupted run is resumed if the script is restarted with the same input and output as that run.

**Note:** depending on the number of CPU cores, the example may take a while to run. After downloading the models, it is recommended to run `python3 -m convert_sdf` once to convert the training data from JSON to pickle binaries, which makes classification faster. If you skip this step, remove the `--load_sdf_from_pickle` flag from the easy_classify command.  

In [None]:
!python3 -m convert_sdf

Command line arguments for easy_classify (arguments that are not supplied will be read from cfg.yaml):
- input_fasta_file_path - the path of the FASTA file with the sequences to classify. (required)
- output_dir - the directory to save the results to. (required)
- n_processes - number of CPU cores to use.
- device - cpu/cuda. In HPC environments with many CPU cores, there is little practical difference between the two.
- chunksize - how many sequences to classify per iteration. Decrease if GPU RAM becomes an issue (default is 20,000).
- models - space-separated list of model names to use. 
- model_type - full/light - type of model to use (light is slightly less accurate but significantly faster).
- load_sdf_from_pickle - loads training data from pickle instead of JSON. Only usable after running `python3 -m convert_sdf`.

In [None]:
!python3 -m easy_classify --input_fasta_file_path examples/example_for_classification.fasta --output_dir examples/classification_example_results/ --device cpu --n_processes 32 --models kegg interpro --model_type light --load_sdf_from_pickle
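The same command can also be assembled programmatically, for example to set `--n_processes` from the number of cores on the current machine rather than hard-coding it. The sketch below is only an illustration: `build_classify_command` is a hypothetical helper, not part of FAMUS, and it uses only the flags that appear in the example command above.

```python
import os
import shlex


def build_classify_command(input_fasta, output_dir, models, model_type="light",
                           device="cpu", n_processes=None, use_pickle=False):
    """Hypothetical helper: assemble an easy_classify command as a list of arguments."""
    if n_processes is None:
        # Fall back to the number of cores reported by the OS (at least 1).
        n_processes = os.cpu_count() or 1
    cmd = [
        "python3", "-m", "easy_classify",
        "--input_fasta_file_path", input_fasta,
        "--output_dir", output_dir,
        "--device", device,
        "--n_processes", str(n_processes),
        "--models", *models,
        "--model_type", model_type,
    ]
    if use_pickle:
        # Only meaningful after `python3 -m convert_sdf` has been run once.
        cmd.append("--load_sdf_from_pickle")
    return cmd


cmd = build_classify_command(
    "examples/example_for_classification.fasta",
    "examples/classification_example_results/",
    models=["kegg", "interpro"],
    n_processes=4,
)
# Print the command for inspection instead of running it.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the FAMUS directory if you prefer launching the run from Python.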

In [None]:
!head examples/classification_example_results/*

### easy_train.py

Used to create your own models.  
An interrupted run is resumed if the input directory/model name is the same, **but** the input fasta directory, unknown sequence fasta, number of epochs, batch size and model type must also match the interrupted run, or an error will be raised.


Command line arguments for easy_train (arguments that are not supplied will be read from cfg.yaml):
- input_fasta_dir_path - the path of the directory holding fasta files where each file defines a protein family (required). **Note:** every file name **must** end in .fasta, and files must not be named unknown.fasta (since unknown is reserved for unknown sequences)
- model_type - full/light. The type of model to create - full models take longer to train and classify but are slightly more accurate.
- model_name - optional name for the model that will be used to refer to it in easy_classify. If not specified, the input directory base name will be used.
- unknown_sequences_fasta_path - fasta file with sequences of unknown function as negative examples for the model. Optional but recommended.
- n_processes - number of CPU cores to use.
- num_epochs - number of epochs to train the model for.
- batch_size - training batch size.
- stop_before_training - calling easy_train with --stop_before_training will exit before starting to train the model (useful, for example, for preprocessing in a high-CPU environment and then training the model in a different environment with CUDA).
- device - cpu/cuda.
- chunksize - reduce if GPU RAM becomes an issue when calculating threshold using GPU.
- save_every - save a checkpoint of the model's state every \<save_every> steps. Will load the last checkpoint automatically if the script is restarted.
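
Before starting a long training run, it can help to check the input directory against the naming rules for `input_fasta_dir_path` listed above. The following is a minimal sketch, not part of FAMUS: `check_family_dir` is a hypothetical helper, and skipping subdirectories is an assumption made here for illustration.

```python
from pathlib import Path


def check_family_dir(fasta_dir):
    """Hypothetical helper: list naming problems in a directory of per-family fasta files."""
    problems = []
    for path in sorted(Path(fasta_dir).iterdir()):
        if not path.is_file():
            # Assumption: subdirectories are ignored rather than reported.
            continue
        if not path.name.endswith(".fasta"):
            problems.append(f"{path.name}: file name does not end in .fasta")
        elif path.name == "unknown.fasta":
            problems.append(f"{path.name}: 'unknown' is reserved for unknown sequences")
    return problems


fasta_dir = Path("examples/example_orthologs")
if fasta_dir.is_dir():
    for problem in check_family_dir(fasta_dir):
        print(problem)
```

An empty result means every file name follows the two rules; it says nothing about whether the file contents are valid fasta.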

In [None]:
!python3 -m easy_train --input_fasta_dir_path examples/example_orthologs/ --model_type light --model_name adar_example --unknown_sequences_fasta_path examples/unknowns.fasta --device cpu --chunksize 1000 --num_epochs 100 --save_every 1000