<a href="https://colab.research.google.com/github/alex-tianhuang/idrfeatlib/blob/main/notebooks/Featurize.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Welcome!
========

Welcome to featurize.ipynb, a quick and easy Colab notebook for computing sequence features over a file of IDRs.

To use this notebook for the first time, hover your mouse over the top left corner of each cell (around where the play button or cell number is) and click the play button to run it. Run each of the three cells below in order and respond to the prompts.

To reuse this notebook, keep running the third cell below and responding to the prompts differently or uploading different files.

What sequence features are there?
=================================

At its simplest, this notebook takes in sequences (a FASTA file), and outputs sequence features in the form of a very wide csv.

A full table of all the features and their interpretations will come soon. As a bird's-eye view, the outputs generally start like:
```
ProteinID,AA_A,AA_C,...,AA_Y,...
Protein1,0.039,0.01,...,0.02,...
Protein2,0.045,0.02,...,0.02,...
...
```

which indicates `Protein1` is made of 3.9% alanines, 1% cysteines, ..., and 2% tyrosines.
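To make the meaning of the `AA_*` columns concrete, here is a minimal sketch of computing per-residue fractions for one sequence. It only illustrates the idea: `aa_fractions` is a hypothetical helper, not idrfeatlib's implementation, and the library may treat non-standard residues differently.

```python
from collections import Counter

# The 20 standard amino acids, sorted by one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_fractions(sequence):
    """Return {"AA_A": fraction, ...} for one sequence (hypothetical helper)."""
    if not sequence:
        raise ValueError("cannot compute fractions of an empty sequence")
    counts = Counter(sequence)
    length = len(sequence)
    # Counter returns 0 for residues that never occur.
    return {"AA_" + aa: counts[aa] / length for aa in AMINO_ACIDS}

fractions = aa_fractions("MAADDKY")
print(fractions["AA_A"])  # 2 of the 7 residues are alanine
```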

Then there are many motif- or repeat-related columns (89 as of June 24th, 2025).

Some columns like `D_repeats` are simple to understand and correspond to the number of amino acids spanned by Ds that are immediately followed by another D (D repeats).
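As a rough illustration of a repeat count, the sketch below sums the lengths of all runs of two or more consecutive Ds. This is one plausible reading of the description above, not the library's code; `repeat_span` is a hypothetical helper, and idrfeatlib's exact counting rule (for example, whether the last D of a run is included) may differ.

```python
import re

def repeat_span(sequence, residue="D"):
    """Count residues lying in runs of two or more `residue` (hypothetical helper)."""
    runs = re.findall(re.escape(residue) + "{2,}", sequence)
    return sum(len(run) for run in runs)

# "DD" and "DDD" are counted; the lone trailing "D" is not.
print(repeat_span("ADDKDDDGD"))
```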

Some of the columns like `DOC_PP1_RVXF_1` or `LIG_PCNA_PIPBox1` come from names of motifs in the Eukaryotic Linear Motif (ELM) resource.
These are integer-valued counts of motifs or of the number of amino acids spanned by those motifs, and you can read more about them by looking them up in the ELM resource.

Then there are some columns that correspond to the fraction of the sequence made up of particular groups of amino acids, like:

```
ProteinID,...,acidic,basic,fcr,...
Protein1,...,0.05,0.14,0.27,...
Protein2,...,0.06,0.12,0.30,...
...
```

There are a couple that correspond to residue spacing statistics like `custom_kappa`, `arospacing`, `SCD`...

And some specialized sequence features like `isoelectric_point`, `net_charge`, and so on.
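Because the output is a very wide CSV, it often helps to pull out one family of columns at a time. The sketch below uses only the standard library and a tiny stand-in table built from the example values above; with a real output you would open `output_features.csv` instead of the in-memory sample.

```python
import csv
import io

# A tiny stand-in for output_features.csv, using values from the examples above.
sample = io.StringIO(
    "ProteinID,AA_A,AA_C,acidic,basic\n"
    "Protein1,0.039,0.01,0.05,0.14\n"
    "Protein2,0.045,0.02,0.06,0.12\n"
)

rows = list(csv.DictReader(sample))

# Keep only the amino acid composition columns for each protein.
composition = {
    row["ProteinID"]: {
        name: float(value)
        for name, value in row.items()
        if name.startswith("AA_")
    }
    for row in rows
}
print(composition["Protein1"])
```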

In [None]:
# Setup environment by downloading and installing the idrfeatlib repo.
#
# You only need to run this once.
!file idrfeatlib/ >/dev/null && rm -rf idrfeatlib >/dev/null
!git clone https://github.com/alex-tianhuang/idrfeatlib --quiet
!pip install idrfeatlib/

In [None]:
# Prepare/define the featurization scripts.

def main(args):
    """
    Taken from idrfeatlib/featurize.py (25-06-24) @tianh.
    """
    from idrfeatlib.utils import read_fasta, read_regions_csv, iter_nested
    from idrfeatlib import FeatureVector
    from idrfeatlib.featurizer import Featurizer, compile_featurizer
    from idrfeatlib.native import compile_native_featurizer
    import json
    import tqdm
    import sys
    if args.feature_file:
        with open(args.feature_file, "r") as file:
            config = json.load(file)
        featurizer, errors = compile_featurizer(config)
    else:
        featurizer, errors = compile_native_featurizer()
    for featname, error in errors.items():
        print("error compiling `%s`: %s" % (featname, error), file=sys.stderr)
    featurizer = Featurizer(featurizer)
    if args.rtf:
        fa = read_fasta_rtf(args.input_sequences)
    else:
        fa = read_fasta(args.input_sequences)
    if args.input_regions is None:
        seqs = tqdm.tqdm(fa, total=len(fa), desc="featurizing sequences")
        feature_vectors, errors = featurizer.featurize_to_matrices(seqs)
        for proteinid, error in errors.items():
            print("error for `%s`: %s" % (proteinid, error), file=sys.stderr)
        FeatureVector.dump(list(feature_vectors.items()), args.output_file, "ProteinID")
    else:
        fa = dict(fa)
        regions = read_regions_csv(args.input_regions)
        seqs = []
        for protid, regionid, (start, stop) in iter_nested(regions, 2):
            if (entry := fa.get(protid)) is None:
                continue
            whole_seq = entry
            assert isinstance(whole_seq, str), type(whole_seq)
            seq = whole_seq[start:stop]
            if len(seq) != stop - start:
                print("invalid region `%s` for protein `%s` (start=%d,stop=%d,seqlen=%s)" % (regionid, protid, start, stop, len(whole_seq)), file=sys.stderr)
                continue
            seqs.append(((protid, regionid), seq))
        feature_vectors, errors = featurizer.featurize_to_matrices(tqdm.tqdm(seqs, desc="featurizing sequences"))
        for (protid, regionid), error in errors.items():
            print("error for (protid=%s, regionid=%s): %s" % (protid, regionid, error), file=sys.stderr)
        FeatureVector.dump(list(feature_vectors.items()), args.output_file, ("ProteinID", "RegionID"))

def display_csv(output_name):
    """
    Show the table in the notebook.

    I assume colab will forever keep pandas as available by default.
    """
    from IPython.display import display
    import pandas as pd

    df = pd.read_csv(output_name)

    print()
    print("Showing output below")
    print("--------------------")
    print()
    display(df)
    print()

def run_colab_wrapper(output_name, rtf=False):
    """
    My `main` function expects an argparse.Namespace and I'm not changing it
    so most of this function is weirdly constructing this one Namespace object
    """
    from google.colab import files
    import os
    import argparse

    # The aforementioned Namespace object, starting as an empty args
    args = argparse.ArgumentParser().parse_args([])

    args.input_sequences = 'input_sequences.fasta'
    goto_upload = True
    if os.path.exists(args.input_sequences):
        choice = input(f"The file {args.input_sequences} already exists. Would you like to overwrite it? (y/n)")
        if choice.lower() != 'y':
            goto_upload = False
    if goto_upload:
        files.upload_file(args.input_sequences)

    choice = input("Would you like to upload a file containing region boundaries? (y/n)")
    if choice.lower() == 'y':
        args.input_regions = 'input_regions.csv'
        files.upload_file(args.input_regions)
    else:
        args.input_regions = None

    args.feature_file = 'feature_config.json'
    goto_upload = True
    if os.path.exists(args.feature_file):
        choice = input(f"The file {args.feature_file} already exists. Would you like to overwrite it? (y/n)")
        if choice.lower() != 'y':
            goto_upload = False
            print(f"Ignoring {args.feature_file}")
    else:
        choice = input("Would you like to upload a file containing feature configuration? (y/n)")
        if choice.lower() != 'y':
            goto_upload = False
    if goto_upload:
        files.upload_file(args.feature_file)
    else:
        args.feature_file = None

    abort = False
    args.output_file = output_name
    if os.path.exists(args.output_file):
        choice = input(f"The file {args.output_file} already exists. Would you like to overwrite it? (y/n)")
        if choice.lower() != 'y':
            abort = True
    if abort:
        print(f"Aborting because user wants to do something with {args.output_file}.")
        return

    args.rtf = rtf

    main(args)
    display_csv(args.output_file)
    print(f"Downloading output file to {args.output_file}")
    files.download(args.output_file)

In [None]:
# This cell will:
#
# 1. Ask for input files:
#    - (required) a FASTA file of sequences
#    - (optional) a CSV file of region boundaries, see extra section below
#    - (optional) a JSON file for defining custom features, see extra section below
# 2. Featurize sequences from the FASTA file
# 3. Ask to download the output file, called `output_features.csv`
#
# After running the cells above once, run this cell as many times as you would like.

run_colab_wrapper("output_features.csv")
print("Done!")


Troubleshooting
===============

FASTA parsing errors
--------------------

You may see many warnings that look like:

    /usr/local/lib/python3.11/dist-packages/idrfeatlib/utils.py:24: UserWarning: non-empty line before first fasta header: ...

Typically the script will still run to completion and print `Done!`, but the downloaded file is a CSV with only a header and no data.

This typically means that the file you selected does not look like a FASTA file, with `>` header lines. If you're still getting that error but the file looks like a FASTA file, see the `RTF files` section below.
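If you are unsure what kind of file you have, a quick check of its first non-empty line can help. The sketch below is a hypothetical helper (`sniff_fasta` is not part of idrfeatlib) that distinguishes a plain FASTA file from an RTF-wrapped one by looking for a leading `>` or `{\rtf`.

```python
def sniff_fasta(path):
    """Classify a file by its first non-empty line (hypothetical helper)."""
    with open(path, "r", errors="replace") as file:
        for line in file:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                return "fasta"
            if line.startswith("{\\rtf"):
                return "rtf"
            return "unknown"
    # No non-empty lines at all.
    return "empty"
```

For example, running `sniff_fasta("input_sequences.fasta")` after an upload returns `"rtf"` for a file saved by a rich text editor.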

RTF files
---------

Sometimes, when viewing a fasta.rtf file in a rich text editor (like TextEdit on macOS), it will appear as if there are no characters before the first `>`, but there will be warnings like:

    /usr/local/lib/python3.11/dist-packages/idrfeatlib/utils.py:24: UserWarning: non-empty line before first fasta header: {\rtf1\ansi\ansicpg1252\cocoartf2513

Again, the script runs to completion and prints `Done!`, but the downloaded file is a CSV with only a header and no data.

This is likely because there are invisible characters that help format the text but are messing up the FASTA parser. To solve this issue, you can run the two code cells below:

In [None]:
# Install and prepare a fasta.rtf parsing script for use in the cell below.
!pip install striprtf
from striprtf.striprtf import rtf_to_text
import warnings
def read_fasta_rtf(path):
    """Same as idrfeatlib.utils.read_fasta, except the file is passed through rtf_to_text first."""
    with open(path, "r") as file:
        text = file.read()
    content = rtf_to_text(text)
    fasta_list = []
    for line in content.splitlines():
        line = line.strip()
        if line.startswith(">"):
            header = line[1:]
            fasta_list.append([header, []])
        elif fasta_list:
            fasta_list[-1][1].append(line)
        elif line:
            warnings.warn("Ignoring non-empty line before first fasta header: {}".format(line))
    for index, (header, seqlines) in enumerate(fasta_list):
        fasta_list[index] = (header, "".join(seqlines))
    return fasta_list


In [None]:
# This cell will:
#
# 1. Ask for input files:
#    - (required) a FASTA.RTF file of sequences
#    - (optional) a CSV file of region boundaries, see extra section below
#    - (optional) a JSON file for defining custom features, see extra section below
# 2. Featurize sequences from the FASTA file
# 3. Ask to download the output file, called `output_features.csv`
#
# After running the cells above once, run this cell as many times as you would like.

run_colab_wrapper("output_features.csv", rtf=True)
print("Done!")

Extra
=====
The absolutely essential bits are above, but below are some extra tips.


Uploading regions
-----------------
If you have a FASTA file of full protein sequences and a CSV of named protein regions with their start/stop boundaries, like the following:
```
ProteinID,RegionID,Start,Stop
DDX3X,N-IDR,0,142
protein_x,region3,24,72
...
```
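Then upload this file when asked about a region boundaries CSV, and the notebook will automatically cut out the regions for you. Each region is extracted as a Python slice `sequence[Start:Stop]`, so `Start` is 0-based and `Stop` is exclusive. Regions that extend past the end of the sequence, or whose `ProteinID` is not in the FASTA file, are skipped.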
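The cutting step can be sketched as follows. It mirrors the `whole_seq[start:stop]` slice and the length check in the `main` function above; `cut_region` itself and the example sequence are made up for illustration.

```python
def cut_region(whole_seq, start, stop):
    """Return whole_seq[start:stop], or None if the region does not fit."""
    region = whole_seq[start:stop]
    # Same validity check as the notebook: a region that runs past the end
    # of the sequence yields a shorter slice than requested.
    if len(region) != stop - start:
        return None
    return region

sequence = "MSHVAVENALGLDQQFAGLDLNSSDNQS"  # made-up sequence
print(cut_region(sequence, 0, 5))    # first five residues (0-based, stop exclusive)
print(cut_region(sequence, 24, 72))  # stop exceeds the sequence length
```

Because `Stop` is exclusive, a region with `Start=0` and `Stop=142` covers the first 142 residues.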

Using custom features
---------------------
Follow this section to featurize using your own features.

Suppose you have a script uploaded as `FOO.py`:

    def bar(sequence, secret_param):
        return len(sequence) + secret_param

To turn this into a sequence feature, make a json object that looks like this:

    {
        "features": {
            "my_feature": {
                "compute": "custom",
                "libpath": "FOO.py",
                "funcname": "bar",
                "kwargs": {
                    "secret_param": 42
                }
            }
        }
    }

Save this as `feature_config.json` and upload it when prompted about the feature configuration file. The notebook will then compute `len(seq) + 42` for each sequence (for example).
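Hand-written JSON is easy to get subtly wrong (trailing commas, single quotes), so one option is to generate the file with Python on your own machine and then upload it at the prompt. The sketch below writes the example configuration shown above and reads it back as a sanity check; it uses only the standard `json` module and does not call idrfeatlib.

```python
import json

# The same configuration as the JSON example above, as a Python dict.
config = {
    "features": {
        "my_feature": {
            "compute": "custom",
            "libpath": "FOO.py",
            "funcname": "bar",
            "kwargs": {"secret_param": 42},
        }
    }
}

with open("feature_config.json", "w") as file:
    json.dump(config, file, indent=4)

# Read the file back to confirm it is valid JSON with the expected content.
with open("feature_config.json", "r") as file:
    loaded = json.load(file)
print(sorted(loaded["features"]))
```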

If you have multiple functions, you can use multiple names this way like:

    {
        "features": {
            "feature_a": ...,
            "feature_b": ...,
            ...
        }
    }

Admittedly, using this feature format to run pre-existing Python functions is somewhat cumbersome. However, if you get used to doing it this way, there's a chance you'll find it more useful when integrating with other tools (not currently available, but maybe someday) that use this feature specification format.