# FAMUS: Functional Annotation in Multiple datasets Using Siamese neural networks

Hello! Thank you for trying FAMUS.
This notebook shows how to use FAMUS, and demonstrates the individual steps taken in the FAMUS pipeline.

## Main scripts

FAMUS has two main scripts: easy_train and easy_classify.  
To classify sequences using pre-existing models, we will only need easy_classify.
The models to use for classification are configured in cfg.yaml.

### easy_classify.py

Used to label input sequences based on existing database models installed in `./models/`.

**Note:** depending on the number of CPU cores, the example may take a while to run. After downloading the models, it is recommended to run `python3 -m convert_sdf` to convert the training data from JSON to pickle binaries, which makes classification faster. If you skip this step, remove the `--load_sdf_from_pickle` flag from the classification command.  

In [None]:
!python -m convert_sdf
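
To illustrate the kind of conversion described in the note, here is a minimal sketch of turning a JSON file into a pickle binary. It is not the implementation of `convert_sdf`: the function name `json_to_pickle`, the file names, and the toy dict-of-dicts layout are all assumptions made for this example.

```python
import json
import pickle
import tempfile
from pathlib import Path


def json_to_pickle(json_path, pickle_path):
    """Read a JSON file and write the same object as a pickle binary.

    Hypothetical helper for illustration; FAMUS's convert_sdf may work differently.
    """
    with open(json_path, "r") as fh:
        data = json.load(fh)
    with open(pickle_path, "wb") as fh:
        pickle.dump(data, fh, protocol=pickle.HIGHEST_PROTOCOL)
    return data


with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "toy_train.json"
    pickle_path = Path(tmp) / "toy_train.pkl"

    # Toy data: sequence name -> {profile name: score}. This layout is an assumption.
    toy = {"seq1": {"profile_a": 12.5}, "seq2": {}}
    json_path.write_text(json.dumps(toy))

    json_to_pickle(json_path, pickle_path)

    # Load the pickle back and check that nothing was lost in the conversion.
    with open(pickle_path, "rb") as fh:
        restored = pickle.load(fh)

round_trip_ok = restored == toy
print("round trip ok:", round_trip_ok)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.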

Command-line arguments for easy_classify (arguments not given on the command line are read from cfg.yaml):
- input_fasta_file_path - the path of the sequences for classification. (required)
- output_dir - the directory to save the results to. (required)
- n_processes - number of CPU cores to use.
- device - cpu/cuda
- chunksize - how many sequences to classify per iteration. Decrease if RAM becomes an issue (default is 20,000).
- models - comma-separated list of model names to use. 
- model_type - full/light - type of model to use (light is slightly less accurate but significantly faster).
- load_sdf_from_pickle - loads training data from pickle instead of JSON. Only usable after running `python -m convert_sdf`.
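
The fallback behaviour described above (arguments not given on the command line are read from cfg.yaml) can be sketched roughly as follows. This is an illustration only, not FAMUS's actual code: `CONFIG_DEFAULTS` is a hypothetical stand-in for the values loaded from cfg.yaml, `resolve_settings` is a made-up helper, and the `n_processes` default of 4 is invented for the example. Only a subset of the arguments is shown.

```python
import argparse

# Hypothetical stand-in for values loaded from cfg.yaml.
# The chunksize default comes from the list above; n_processes=4 is made up.
CONFIG_DEFAULTS = {
    "n_processes": 4,
    "device": "cpu",
    "chunksize": 20000,
}


def resolve_settings(argv, config):
    """Return settings where command-line values override config values."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_fasta_file_path", required=True)
    parser.add_argument("--output_dir", required=True)
    # default=None lets us distinguish "not given" from an explicit value.
    parser.add_argument("--n_processes", type=int, default=None)
    parser.add_argument("--device", choices=["cpu", "cuda"], default=None)
    parser.add_argument("--chunksize", type=int, default=None)
    args = vars(parser.parse_args(argv))

    # Start from the config, then overlay only the arguments that were given.
    settings = dict(config)
    settings.update({key: value for key, value in args.items() if value is not None})
    return settings


settings = resolve_settings(
    [
        "--input_fasta_file_path", "examples/for_classification.fasta",
        "--output_dir", "examples/classification_example_results/",
        "--n_processes", "32",
    ],
    CONFIG_DEFAULTS,
)
print(settings["n_processes"], settings["device"], settings["chunksize"])
```

Here `n_processes` is taken from the command line, while `device` and `chunksize` fall back to the config values.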

In [6]:
!python3 -m easy_classify --input_fasta_file_path examples/for_classification.fasta --output_dir examples/classification_example_results/ --device cpu --n_processes 32 --model_type light --load_sdf_from_pickle

Starting easy_classify.py...
Preprocessing data for kegg
Starting preprocessing
Input fasta: examples/for_classification.fasta
Input full profiles dir: /davidb/guyshur/famus/models/light/kegg/data_dir/subcluster_profiles/
Input sdf train: /davidb/guyshur/famus/models/light/kegg/data_dir/sdf_train.pkl
Data directory: examples/classification_example_results/kegg
Number of threads: 32
Running full hmmsearch
running hmmsearch for 24784 profiles
hmmsearch done
concatenating hmmsearch results
concatenating 24784 files
concatenating files: 100%|██████████████| 24784/24784 [00:05<00:00, 4204.40it/s]
postprocessing hmmsearch results
Running sdf_classify
loading hmmsearch results
reading sparse dict: 100%|███████████| 24916/24916 [00:00<00:00, 1455890.53it/s]
saving sdf
Finished preprocessing
Classifying data for kegg
device: cpu
Loading model
Loading dataframes
threshold: 0.19
Getting embeddings
Loading embeddings
Calculating embeddings
Getting chunk 1/1 for embedding calculation
embedding
Gett

In [5]:
!head -30 examples/classification_example_results/*

==> examples/classification_example_results/eggnog_classification_results.csv <==
tr|Q5J4D4|Q5J4D4_MOUSE	unknown
tr|H9F8T5|H9F8T5_MACMU	KOG2777
tr|U3KFW5|U3KFW5_FICAL	unknown
tr|Q6PB89|Q6PB89_MOUSE	unknown
tr|V8PFX0|V8PFX0_OPHHA	unknown
tr|V8P490|V8P490_OPHHA	KOG2777
tr|Q5J3Q7|Q5J3Q7_MOUSE	unknown
tr|Q4FJS9|Q4FJS9_MOUSE	KOG2777
tr|Q86X17|Q86X17_HUMAN	KOG2777
tr|Q3UTX2|Q3UTX2_MOUSE	unknown
tr|T1R8V7|T1R8V7_CYPCA	unknown
tr|W8B739|W8B739_CERCA	KOG2777
tr|Q5VW43|Q5VW43_HUMAN	unknown
tr|Q5J4D5|Q5J4D5_MOUSE	unknown
tr|Q3UUK2|Q3UUK2_MOUSE	KOG3732
tr|W5N371|W5N371_LEPOC	KOG2777
tr|Q5J3Q8|Q5J3Q8_MOUSE	unknown

==> examples/classification_example_results/eggnog_classification_results.tsv <==
tr|Q3UUK2|Q3UUK2_MOUSE	KOG3732
tr|Q3UTX2|Q3UTX2_MOUSE	unknown
tr|T1R8V7|T1R8V7_CYPCA	unknown
tr|W5N371|W5N371_LEPOC	KOG2777
tr|Q5VW43|Q5VW43_HUMAN	unknown
tr|Q5J4D5|Q5J4D5_MOUSE	unknown
tr|W8B739|W8B739_CERCA	KOG2777
tr|V8P490|V8P490_OPHHA	KOG2777
tr|Q6PB89|Q6PB89_MOUSE	unknown
tr|H9F8T5|H9F8T5_MACMU	KOG277