<a href="https://colab.research.google.com/github/alex-tianhuang/idrfeatlib/blob/main/notebooks/Featurize.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Welcome!
========

Welcome to Featurize.ipynb, a quick and easy Colab notebook for computing sequence features over a file of intrinsically disordered regions (IDRs).

To use this notebook for the first time, hover your mouse over the top left corner of each cell (around where the play button or cell number is) and click the play button to run it. Run the three cells in order and respond to the prompts that appear.

To reuse this notebook, rerun the third cell below and respond to the prompts differently or upload different files.

In [None]:
# Setup environment by downloading and installing the idrfeatlib repo
#
# You only need to run this once
!rm -rf idrfeatlib
!git clone https://github.com/alex-tianhuang/idrfeatlib
!pip install idrfeatlib/

In [None]:
# Prepare/define the featurization scripts.

def main(args):
    """
    Taken from idrfeatlib/featurize.py (25-07-24) @tianh.
    """
    from idrfeatlib.utils import read_fasta, read_regions_csv, iter_nested
    from idrfeatlib import FeatureVector
    from idrfeatlib.featurizer import Featurizer, compile_featurizer
    from idrfeatlib.native import compile_native_featurizer
    import json
    import tqdm
    import sys
    if args.feature_file:
        with open(args.feature_file, "r") as file:
            config = json.load(file)
        featurizer, errors = compile_featurizer(config)
    else:
        featurizer, errors = compile_native_featurizer()
    for featname, error in errors.items():
        print("error compiling `%s`: %s" % (featname, error), file=sys.stderr)
    featurizer = Featurizer(featurizer)
    fa = read_fasta(args.input_sequences)
    if args.input_regions is None:
        seqs = tqdm.tqdm(fa, total=len(fa), desc="featurizing sequences")
        feature_vectors, errors = featurizer.featurize_to_matrices(seqs)
        for proteinid, error in errors.items():
            print("error for `%s`: %s" % (proteinid, error), file=sys.stderr)
        FeatureVector.dump(list(feature_vectors.items()), args.output_file, "ProteinID")
    else:
        fa = dict(fa)
        regions = read_regions_csv(args.input_regions)
        seqs = []
        for protid, regionid, (start, stop) in iter_nested(regions, 2):
            if (entry := fa.get(protid)) is None:
                continue
            whole_seq = entry
            assert isinstance(whole_seq, str), type(whole_seq)
            seq = whole_seq[start:stop]
            if len(seq) != stop - start:
                print("invalid region `%s` for protein `%s` (start=%d,stop=%d,seqlen=%s)" % (regionid, protid, start, stop, len(whole_seq)), file=sys.stderr)
                continue
            seqs.append(((protid, regionid), seq))
        feature_vectors, errors = featurizer.featurize_to_matrices(tqdm.tqdm(seqs, desc="featurizing sequences"))
        for (protid, regionid), error in errors.items():
            print("error for (protid=%s, regionid=%s): %s" % (protid, regionid, error), file=sys.stderr)
        FeatureVector.dump(list(feature_vectors.items()), args.output_file, ("ProteinID", "RegionID"))


def run_colab_wrapper(output_name):
    """
    My `main` function expects an argparse.Namespace and I'm not changing it
    so most of this function is weirdly constructing this one Namespace object
    """
    from google.colab import files
    import os
    import argparse

    # The aforementioned Namespace object, starting out empty
    args = argparse.Namespace()

    args.input_sequences = 'input_sequences.fasta'
    goto_upload = True
    if os.path.exists(args.input_sequences):
        choice = input(f"The file {args.input_sequences} already exists. Would you like to overwrite it? (y/n)")
        if choice.lower() != 'y':
            goto_upload = False
    if goto_upload:
        files.upload_file(args.input_sequences)

    choice = input("Would you like to upload a file containing region boundaries? (y/n)")
    if choice.lower() == 'y':
        args.input_regions = 'input_regions.csv'
        files.upload_file(args.input_regions)
    else:
        args.input_regions = None

    args.feature_file = 'feature_config.json'
    goto_upload = True
    if os.path.exists(args.feature_file):
        choice = input(f"The file {args.feature_file} already exists. Would you like to overwrite it? (y/n)")
        if choice.lower() != 'y':
            goto_upload = False
            print(f"Ignoring existing {args.feature_file}; using the native featurizer instead.")
    else:
        choice = input("Would you like to upload a file containing feature configuration? (y/n)")
        if choice.lower() != 'y':
            goto_upload = False
    if goto_upload:
        files.upload_file(args.feature_file)
    else:
        args.feature_file = None

    abort = False
    args.output_file = output_name
    if os.path.exists(args.output_file):
        choice = input(f"The file {args.output_file} already exists. Would you like to overwrite it? (y/n)")
        if choice.lower() != 'y':
            abort = True
    if abort:
        print(f"Aborting because {args.output_file} already exists and you chose not to overwrite it.")
        return
    main(args)
    print(f"Downloading output file to {args.output_file}")
    files.download(args.output_file)

In [None]:
# This cell will:
#
# 1. Ask for input files (respond no if unsure and see what happens)
# 2. Featurize sequences in the appropriate file
# 3. Ask to download the output file, called `output_features.csv`
#
# After running the cells above, you can rerun this cell as many times as you would like.

run_colab_wrapper("output_features.csv")


Extra
=====
The absolutely essential bits are above, but below are some extra tips.


Uploading regions
-----------------
If you have a FASTA file of full protein sequences and a CSV file of named protein regions with their start/stop boundaries, like the following:
```
ProteinID,RegionID,Start,Stop
DDX3X,N-IDR,0,142
protein_x,region3,24,72
...
```
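Then upload this file when asked about a region boundaries CSV, and the notebook will automatically cut out the regions for you. Regions are cut as `sequence[Start:Stop]`, so `Start` is 0-based and `Stop` is exclusive. Proteins missing from the FASTA file are skipped, and regions that extend past the end of a sequence are reported as invalid and skipped.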
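
As a rough illustration of how boundaries are interpreted, here is a minimal sketch that mirrors the slicing and length check in the `main` function above. The helper name `cut_region` and the toy sequence are made up for this example and are not part of idrfeatlib.

```python
def cut_region(whole_seq, start, stop):
    """Hypothetical helper: return whole_seq[start:stop], or None if the region is invalid."""
    seq = whole_seq[start:stop]
    # A slice shorter than requested means the boundaries fall outside the sequence.
    if len(seq) != stop - start:
        return None
    return seq

toy_seq = "ACDEFGHIKLMNPQRSTVWY"  # made-up 20-residue sequence

first_five = cut_region(toy_seq, 0, 5)    # Start is 0-based, Stop is exclusive
too_long = cut_region(toy_seq, 15, 25)    # runs past the end of the sequence
print(first_five, too_long)
```

Here `first_five` holds the first five residues, while `too_long` is `None` because the requested region is longer than what the sequence can supply.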

Using custom features
---------------------
Follow this section to featurize using your own features.

Suppose you have uploaded a script named `FOO.py`:

    def bar(sequence, secret_param):
        return len(sequence) + secret_param

To turn this into a sequence feature, make a JSON object that looks like this:

    {
        "features": {
            "my_feature": {
                "compute": "custom",
                "libpath": "FOO.py",
                "funcname": "bar",
                "kwargs": {
                    "secret_param": 42
                }
            }
        }
    }

When prompted about the feature configuration file, upload this JSON (the notebook stores it as `feature_config.json`). In this example, the notebook will then compute `len(seq) + 42` for each sequence.
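
To make the meaning of the `libpath`, `funcname`, and `kwargs` fields more concrete, here is a minimal sketch of how such an entry could be turned into a function of one sequence. This is only an illustration of what the entry describes, not idrfeatlib's actual implementation; the helper `load_custom_feature` is hypothetical, and the script is recreated in a temporary directory so the sketch is self-contained.

```python
import importlib.util
import os
import tempfile

def load_custom_feature(entry):
    """Hypothetical helper: turn one "custom" entry into a one-argument function."""
    # Import the script at `libpath` as a module.
    spec = importlib.util.spec_from_file_location("custom_feature_module", entry["libpath"])
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    # Look up `funcname` and bind the configured keyword arguments.
    func = getattr(module, entry["funcname"])
    kwargs = entry.get("kwargs", {})
    return lambda sequence: func(sequence, **kwargs)

# Recreate the example script in a temporary directory.
workdir = tempfile.mkdtemp()
script_path = os.path.join(workdir, "FOO.py")
with open(script_path, "w") as handle:
    handle.write("def bar(sequence, secret_param):\n    return len(sequence) + secret_param\n")

entry = {
    "compute": "custom",
    "libpath": script_path,
    "funcname": "bar",
    "kwargs": {"secret_param": 42},
}
my_feature = load_custom_feature(entry)
print(my_feature("ACDEF"))  # 5 residues + 42 = 47
```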

If you have multiple functions, give each one its own name under `features`, like this:

    {
        "features": {
            "feature_a": ...,
            "feature_b": ...,
            ...
        }
    }
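
If you would rather generate the JSON than write it by hand, one possible approach is sketched below. It only uses the fields shown in the example above; the helper `custom_entry` is hypothetical, and reusing `bar` with two different `secret_param` values is just for illustration.

```python
import json

def custom_entry(libpath, funcname, **kwargs):
    """Hypothetical helper: build one "custom" feature entry."""
    return {
        "compute": "custom",
        "libpath": libpath,
        "funcname": funcname,
        "kwargs": kwargs,
    }

config = {
    "features": {
        # Two features backed by the same function, with different parameters.
        "feature_a": custom_entry("FOO.py", "bar", secret_param=42),
        "feature_b": custom_entry("FOO.py", "bar", secret_param=7),
    }
}

# Text to save on your own machine and upload when prompted.
config_text = json.dumps(config, indent=4)
print(config_text)
```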

Admittedly, this feature format is a cumbersome way to run pre-existing Python functions. However, if you get used to it, you may find it useful when integrating with other tools that use this feature specification format (none are currently available, but maybe someday).