# Protein feature extraction pipeline

This notebook contains the pipeline for extracting features from protein sequences. It shows the pipeline's output without requiring you to run the `pipeline.py` file locally.

In [1]:
import pyarrow as pa
import pandas as pd
import os
import glob
import logging
from fondant.pipeline import Pipeline
from fondant.pipeline.runner import DockerRunner

logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s - %(levelname)s - %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S'
)

from config import MOCK_DATA_PATH_FONDANT

# set to True if the manifest file of the most recent run was removed
REMOVED_MANIFEST = False

# path of the most recent output folder; only used by the clean-up step before the run
OUTPUT_FOLDER = None

## Generate Mock data

In [2]:
!python utils/generate_mock_data.py

In [3]:
# show content of the mock data
# the leading dot makes the path relative to the current directory
mock_df = pd.read_parquet("." + MOCK_DATA_PATH_FONDANT)
mock_df

| Unnamed: 0 | sequence | name |
|---|---|---|
| 0 | MNQRGMPIQSLVTNVKINRLEENDCIHTRHRVRPGRTDGKNLHAMM... | Seq1 |
| 1 | MAGLKPEVPLHDGINKFGKSDFAGQEGPKIVTTTDKALLVANGALK... | Seq2 |
| 2 | MVDLKKELKNFVDSDFPGSPKQEAQGIDVRILLSFNNAAFREALII... | Seq3 |
| 3 | MELILAKARLEFECDWGLLMLEPCVPPTKIFADRNYAVGVMFESDK... | Seq4 |
| 4 | MRVLCDGSTGYACAKNTRIRFREKVASVLAKIQGYEQTFPHHMPNM... | Seq5 |
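
For orientation, mock data of this shape could be generated along the following lines. This is a minimal sketch and not the contents of `utils/generate_mock_data.py`: the function name `make_mock_sequences`, the default sequence length, and the seed are assumptions made for illustration.

```python
import random

# The 20 standard amino acids as one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def make_mock_sequences(n, length=50, seed=0):
    """Return n (name, sequence) pairs; every sequence starts with methionine (M)."""
    rng = random.Random(seed)  # seeded so the mock data is reproducible
    records = []
    for i in range(1, n + 1):
        body = "".join(rng.choice(AMINO_ACIDS) for _ in range(length - 1))
        records.append((f"Seq{i}", "M" + body))
    return records


for name, sequence in make_mock_sequences(3, length=20):
    print(name, sequence)
```

Seeding the generator keeps repeated runs identical, which helps when comparing pipeline outputs between runs.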


## Creating the pipeline

In [4]:
# Create a new pipeline

BASE_PATH = ".fondant"
PIPELINE_NAME = "feature_extraction_pipeline"

pipeline = Pipeline(
    name=PIPELINE_NAME,
    base_path=BASE_PATH,
    description="A pipeline to extract features from protein sequences."
)

## Loading the dataset

In [5]:
# Read the dataset

dataset = pipeline.read(
    "load_from_parquet",
    arguments={
        "dataset_uri": MOCK_DATA_PATH_FONDANT,
    },
    produces={
        "sequence": pa.string()
    }
)

[2024-06-24 17:18:20,798 | fondant.pipeline.pipeline | INFO] The consumes section of the component spec is not defined. Can not infer consumes of the OperationSpec. Please define a consumes section in the dataset interface. 


## Components

---

### generate_protein_sequence_checksum_component

This component generates a checksum for the protein sequence.
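
As an illustration of the idea, a checksum can serve as a stable identifier for a sequence. The sketch below is not the component's implementation: the choice of SHA-256, the normalization step, and the function name `sequence_checksum` are assumptions, and the component may use a different algorithm.

```python
import hashlib


def sequence_checksum(sequence: str) -> str:
    """Return a hex digest that identifies a protein sequence."""
    # Normalize so that whitespace and letter case do not change the checksum.
    normalized = sequence.strip().upper()
    return hashlib.sha256(normalized.encode("ascii")).hexdigest()


print(sequence_checksum("MNQRGMPIQSLVTNVKINRLEENDCIHTRHRVRPGRTDGKNLHAMM"))
```

Identical sequences always map to the same digest, so the checksum can be used to detect sequences that were already processed.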

---

### biopython_component

Extracts features from the protein sequence using Biopython.

---

### iFeatureOmega_component

Extracts features from the protein sequence using the [iFeatureOmega-CLI GitHub repo](https://github.com/Superzchen/iFeatureOmega-CLI). Arguments are used to specify the type of features to extract.
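
To give a feel for what a descriptor is, the sketch below computes the amino acid composition, which is what the `AAC` descriptor used in the pipeline cell stands for: the fraction of each of the 20 standard amino acids in a sequence. It is a plain-Python illustration and does not call iFeatureOmega; the function name `amino_acid_composition` and the handling of non-standard letters are assumptions.

```python
from collections import Counter

# The 20 standard amino acids as one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> dict:
    """Return the fraction of each standard amino acid in the sequence."""
    counts = Counter(sequence.upper())
    # Only standard residues count towards the total (an assumption of this sketch).
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


composition = amino_acid_composition("MAAK")
print(composition["A"], composition["K"], composition["M"])
```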

---

### filter_pdb_component

Filters out sequences whose PDB file has already been predicted, to avoid redundant predictions. Set the following arguments before running the pipeline:
```json
"method": "local",
"local_pdb_path": "/data/<your-pdb-folder-path>",
"bucket_name": "your-bucket-name",
"project_id": "your-project-id",
"google_cloud_credentials_path": "/data/<your-credentials>.json"
```

When using only local storage, leave `bucket_name`, `project_id`, and `google_cloud_credentials_path` as empty strings. Remote storage requires a Google Cloud Storage bucket, a project ID, and a credentials file.
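
For the local case, the filtering step can be pictured roughly as follows. This is a sketch under assumptions, not the component's code: it assumes that each predicted structure is stored as `<checksum>.pdb` in the PDB folder, and the function name `split_by_existing_pdb` is hypothetical.

```python
import tempfile
from pathlib import Path


def split_by_existing_pdb(checksums, pdb_dir):
    """Split checksums into (already_predicted, to_predict) based on files in pdb_dir."""
    # Assumed naming scheme: one file per structure, named <checksum>.pdb.
    existing = {path.stem for path in Path(pdb_dir).glob("*.pdb")}
    already_predicted = [c for c in checksums if c in existing]
    to_predict = [c for c in checksums if c not in existing]
    return already_predicted, to_predict


# Small demonstration with a temporary folder standing in for the PDB folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "abc123.pdb").write_text("HEADER mock structure\n")
    done, todo = split_by_existing_pdb(["abc123", "def456"], tmp)

print(done, todo)
```

Only the entries in the second list would need to be sent to the structure prediction step.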

---

### predict_protein_3D_structure_component

Predicts the 3D structure of the protein using ESMFold. This component requires a `.env` file with the following variables:
```env
HF_API_KEY=""
HF_ENDPOINT_URL=""
```
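
A component that depends on these variables can fail early when one is missing or left empty. The sketch below assumes the variables from the `.env` file end up in the process environment (how they get there is not shown here); the function name `load_hf_settings` is hypothetical.

```python
import os

REQUIRED_VARS = ("HF_API_KEY", "HF_ENDPOINT_URL")


def load_hf_settings(env=os.environ):
    """Return the required settings, or raise if any is missing or empty."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing or empty variables: {', '.join(missing)}")
    return {name: env[name] for name in REQUIRED_VARS}


# Demonstration with an explicit mapping instead of the real environment.
settings = load_hf_settings(
    {"HF_API_KEY": "example-key", "HF_ENDPOINT_URL": "https://example.invalid"}
)
print(sorted(settings))
```

Checking up front gives a clear error message instead of a failed request in the middle of a run.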

---

### store_pdb_component

Stores the PDB files using the configured storage method. Set the following arguments before running the pipeline:
```json
"method": "local",
"local_pdb_path": "/data/<your-pdb-folder-path>",
"bucket_name": "your-bucket-name",
"project_id": "your-project-id",
"google_cloud_credentials_path": "/data/<your-credentials>.json"
```

When using only local storage, leave `bucket_name`, `project_id`, and `google_cloud_credentials_path` as empty strings. Remote storage requires a Google Cloud Storage bucket, a project ID, and a credentials file.

---

### msa_component

Generates the multiple sequence alignment for the protein sequences using [Clustal Omega](http://www.clustal.org/omega/). Alignment can be slow, so it is recommended to run this component on a small number of sequences or to leave it out of the pipeline.

---

### unikp_component

Uses the UniKP endpoint on HuggingFace to predict the kinetic parameters of a protein sequence and substrate (SMILES) combination. The combinations are read from the file given in the argument below; see the README for a description of the contents of this file.

```yaml
"protein_smiles_path": "/data/<path_protein_smiles>"
```

---

### peptide_features_component

Calculates the features from the protein sequence using the `peptides` package.

---

### deepTMpred_component

Predicts the transmembrane regions of the protein sequence using the [DeepTMpred GitHub repository](https://github.com/ISYSLAB-HUST/DeepTMpred).

In [6]:
_ = dataset.apply(
    "./components/biopython_component"
).apply(
    "./components/generate_protein_sequence_checksum_component"
).apply(
    "./components/iFeatureOmega_component",
    # the partition size is currently forced to 5 rows; this needs a better solution, see the README for more info
    input_partition_rows=5,
    arguments={
        "descriptors": ["AAC", "CTDC", "CTDT"]
    }
).apply(
    "./components/filter_pdb_component",
    arguments={
        "method": "local",
        "local_pdb_path": "/data/pdb_files",
        "bucket_name": "",
        "project_id": "",
        "google_cloud_credentials_path": ""
    }
).apply(
    "./components/predict_protein_3D_structure_component",
).apply(
    "./components/store_pdb_component",
    arguments={
        "method": "local",
        "local_pdb_path": "/data/pdb_files/",
        "bucket_name": "elated-chassis-400207_dbtl_pipeline_outputs",
        "project_id": "elated-chassis-400207",
        "google_cloud_credentials_path": "/data/google_cloud_credentials.json"
    }
).apply(
    "./components/msa_component",
    input_partition_rows=10000
# ).apply(
#     "./components/pdb_features_component"
# ).apply(
#     "./components/unikp_component",
#     arguments={
#         "protein_smiles_path": "/data/protein_smiles.json",
#     },
).apply(
    "./components/peptide_features_component"
# ).apply(
#     "./components/DeepTMpred_component"
)



## Run the pipeline

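Outside this notebook, the `pipeline.py` file is run from the command line. The following command runs the pipeline and mounts the local `data` folder at `/data` inside the containers:

```bash
fondant run local pipeline.py --extra-volumes <full_path_to_project>/data:/data
```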

In [7]:
# import shutil

# remove the most recent output folder if the manifest file is removed
# without a manifest file in the most recent output folder, the pipeline cannot be run
# if OUTPUT_FOLDER and REMOVED_MANIFEST:
# 	shutil.rmtree(OUTPUT_FOLDER)
# 	# remove cache
# 	shutil.rmtree(os.path.join(BASE_PATH, PIPELINE_NAME, "cache"))

# mount the local `data` folder (absolute path) at /data inside the containers
mounted_data = f"{os.path.abspath('data')}:/data"

DockerRunner().run(input=pipeline, extra_volumes=mounted_data)

[2024-06-24 17:18:20,966 | root | INFO] Found reference to un-compiled pipeline... compiling
[2024-06-24 17:18:20,966 | fondant.pipeline.compiler | INFO] Compiling feature_extraction_pipeline to .fondant/compose.yaml
[2024-06-24 17:18:20,967 | fondant.pipeline.compiler | INFO] Base path found on local system, setting up .fondant as mount volume
[2024-06-24 17:18:20,967 | fondant.pipeline.pipeline | INFO] Sorting pipeline component graph topologically.
[2024-06-24 17:18:20,990 | fondant.pipeline.pipeline | INFO] All pipeline component specifications match.
[2024-06-24 17:18:20,991 | fondant.pipeline.compiler | INFO] Compiling service for load_from_parquet
[2024-06-24 17:18:20,991 | fondant.pipeline.compiler | INFO] Compiling service for biopython_component
[2024-06-24 17:18:20,991 | fondant.pipeline.compiler | INFO] Found Dockerfile for biopython_component, adding build step.
[2024-06-24 17:18:20,992 | fondant.pipeline.compiler | INFO] Compiling service for generate_protein_sequence_che

Starting pipeline run...


 load_from_parquet Pulled 


#0 building with "desktop-linux" instance using docker driver

#1 [biopython_component internal] load build definition from Dockerfile
#1 transferring dockerfile: 480B done
#1 DONE 0.0s

#2 [biopython_component internal] load metadata for docker.io/fndnt/fondant:0.11.dev5-py3.10
#2 DONE 0.0s

#3 [biopython_component internal] load .dockerignore
#3 transferring context: 2B done
#3 DONE 0.0s

#4 [biopython_component 1/6] FROM docker.io/fndnt/fondant:0.11.dev5-py3.10
#4 DONE 0.0s

#5 [biopython_component internal] load build context
#5 transferring context: 3.25kB done
#5 DONE 0.0s

#6 [biopython_component 3/6] COPY requirements.txt ./
#6 CACHED

#7 [biopython_component 2/6] RUN apt-get update &&     apt-get upgrade -y &&     apt-get install git -y
#7 CACHED

#8 [biopython_component 4/6] RUN pip3 install --no-cache-dir -r requirements.txt
#8 CACHED

#9 [biopython_component 5/6] WORKDIR /component/src
#9 CACHED

#10 [biopython_component 6/6] COPY src/ .
#10 CACHED

#11 [biopython_component

 Container feature_extraction_pipeline-deeptm_prediction_component-1  Stopping
 Container feature_extraction_pipeline-deeptm_prediction_component-1  Stopped
 Container feature_extraction_pipeline-deeptm_prediction_component-1  Removing
 Container feature_extraction_pipeline-deeptm_prediction_component-1  Removed
 Container feature_extraction_pipeline-load_from_parquet-1  Recreate
 Container feature_extraction_pipeline-load_from_parquet-1  Recreated
 Container feature_extraction_pipeline-biopython_component-1  Recreate
 Container feature_extraction_pipeline-biopython_component-1  Recreated
 Container feature_extraction_pipeline-generate_protein_sequence_checksum_component-1  Recreate
 Container feature_extraction_pipeline-generate_protein_sequence_checksum_component-1  Recreated
 Container feature_extraction_pipeline-ifeatureomega_component-1  Recreate
 Container feature_extraction_pipeline-ifeatureomega_component-1  Recreated
 Container feature_extraction_pipeline-filter_pdb_component-

Attaching to biopython_component-1, filter_pdb_component-1, generate_protein_sequence_checksum_component-1, ifeatureomega_component-1, load_from_parquet-1, msa_component-1, peptide_features_component-1, predict_protein_3d_structure_component-1, store_pdb_component-1


load_from_parquet-1                             | [2024-06-24 15:18:27,915 | fondant.cli | INFO] Component `LoadFromParquet` found in module main
load_from_parquet-1                             | [2024-06-24 15:18:27,919 | fondant.component.executor | INFO] Skipping component execution
load_from_parquet-1                             | [2024-06-24 15:18:27,920 | fondant.component.executor | INFO] Matching execution detected for component. The last execution of the component originated from `feature_extraction_pipeline-20240613115444`.
load_from_parquet-1                             | [2024-06-24 15:18:27,921 | fondant.component.executor | INFO] Saving output manifest to /.fondant/feature_extraction_pipeline/feature_extraction_pipeline-20240624171820/load_from_parquet/manifest.json
load_from_parquet-1                             | [2024-06-24 15:18:27,921 | fondant.component.executor | INFO] Writing cache key with manifest reference to /.fondant/feature_extraction_pipeline/cache/d41a53a1

load_from_parquet-1 exited with code 0


biopython_component-1                           | [2024-06-24 15:18:29,687 | fondant.cli | INFO] Component `BiopythonComponent` found in module main
biopython_component-1                           | [2024-06-24 15:18:29,690 | fondant.component.executor | INFO] Caching disabled for the component
biopython_component-1                           | [2024-06-24 15:18:29,690 | root | INFO] Executing component
biopython_component-1                           | [2024-06-24 15:18:29,967 | distributed.http.proxy | INFO] To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
biopython_component-1                           | [2024-06-24 15:18:29,986 | distributed.scheduler | INFO] State start
biopython_component-1                           | [2024-06-24 15:18:29,991 | distributed.scheduler | INFO]   Scheduler at:     tcp://127.0.0.1:43155
biopython_component-1                           | [2024-06-24 15:18:29,992 | distributed.schedu

biopython_component-1 exited with code 0


generate_protein_sequence_checksum_component-1  | [2024-06-24 15:18:34,335 | fondant.cli | INFO] Component `GenerateProteinSequenceChecksumComponent` found in module main
generate_protein_sequence_checksum_component-1  | [2024-06-24 15:18:34,338 | fondant.component.executor | INFO] Caching disabled for the component
generate_protein_sequence_checksum_component-1  | [2024-06-24 15:18:34,338 | root | INFO] Executing component
generate_protein_sequence_checksum_component-1  | [2024-06-24 15:18:34,607 | distributed.http.proxy | INFO] To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
generate_protein_sequence_checksum_component-1  | [2024-06-24 15:18:34,626 | distributed.scheduler | INFO] State start
generate_protein_sequence_checksum_component-1  | [2024-06-24 15:18:34,631 | distributed.scheduler | INFO]   Scheduler at:     tcp://127.0.0.1:39847
generate_protein_sequence_checksum_component-1  | [2024-06-24 15:18:34,63

generate_protein_sequence_checksum_component-1 exited with code 0


ifeatureomega_component-1                       | [2024-06-24 15:18:39,910 | matplotlib.font_manager | INFO] generated new fontManager
ifeatureomega_component-1                       | [2024-06-24 15:18:40,204 | fondant.cli | INFO] Component `IFeatureOmegaComponent` found in module main
ifeatureomega_component-1                       | [2024-06-24 15:18:40,211 | fondant.component.executor | INFO] Caching disabled for the component
ifeatureomega_component-1                       | [2024-06-24 15:18:40,211 | root | INFO] Executing component
ifeatureomega_component-1                       | [2024-06-24 15:18:40,494 | distributed.http.proxy | INFO] To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
ifeatureomega_component-1                       | [2024-06-24 15:18:40,513 | distributed.scheduler | INFO] State start
ifeatureomega_component-1                       | [2024-06-24 15:18:40,517 | distributed.scheduler | INFO

ifeatureomega_component-1 exited with code 0


filter_pdb_component-1                          | [2024-06-24 15:18:46,766 | fondant.cli | INFO] Component `FilterPDBComponent` found in module main
filter_pdb_component-1                          | [2024-06-24 15:18:46,772 | fondant.component.executor | INFO] Caching disabled for the component
filter_pdb_component-1                          | [2024-06-24 15:18:46,772 | root | INFO] Executing component
filter_pdb_component-1                          | [2024-06-24 15:18:47,079 | distributed.http.proxy | INFO] To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
filter_pdb_component-1                          | [2024-06-24 15:18:47,097 | distributed.scheduler | INFO] State start
filter_pdb_component-1                          | [2024-06-24 15:18:47,110 | distributed.scheduler | INFO]   Scheduler at:     tcp://127.0.0.1:39747
filter_pdb_component-1                          | [2024-06-24 15:18:47,110 | distributed.schedu

filter_pdb_component-1 exited with code 0


predict_protein_3d_structure_component-1        | [2024-06-24 15:18:51,410 | fondant.cli | INFO] Component `PredictProtein3DStructureComponent` found in module main
predict_protein_3d_structure_component-1        | [2024-06-24 15:18:51,415 | fondant.component.executor | INFO] Caching disabled for the component
predict_protein_3d_structure_component-1        | [2024-06-24 15:18:51,415 | root | INFO] Executing component
predict_protein_3d_structure_component-1        | [2024-06-24 15:18:51,700 | distributed.http.proxy | INFO] To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
predict_protein_3d_structure_component-1        | [2024-06-24 15:18:51,720 | distributed.scheduler | INFO] State start
predict_protein_3d_structure_component-1        | [2024-06-24 15:18:51,724 | distributed.scheduler | INFO]   Scheduler at:     tcp://127.0.0.1:35995
predict_protein_3d_structure_component-1        | [2024-06-24 15:18:51,724 | di

predict_protein_3d_structure_component-1 exited with code 0


store_pdb_component-1                           | [2024-06-24 15:18:55,612 | fondant.cli | INFO] Component `StorePDBComponent` found in module main
store_pdb_component-1                           | [2024-06-24 15:18:55,619 | fondant.component.executor | INFO] Caching disabled for the component
store_pdb_component-1                           | [2024-06-24 15:18:55,619 | root | INFO] Executing component
store_pdb_component-1                           | [2024-06-24 15:18:55,875 | distributed.http.proxy | INFO] To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
store_pdb_component-1                           | [2024-06-24 15:18:55,894 | distributed.scheduler | INFO] State start
store_pdb_component-1                           | [2024-06-24 15:18:55,898 | distributed.scheduler | INFO]   Scheduler at:     tcp://127.0.0.1:45693
store_pdb_component-1                           | [2024-06-24 15:18:55,898 | distributed.schedul

store_pdb_component-1 exited with code 0


msa_component-1                                 | [2024-06-24 15:18:59,836 | fondant.cli | INFO] Component `MSAComponent` found in module main
msa_component-1                                 | [2024-06-24 15:18:59,841 | fondant.component.executor | INFO] Caching disabled for the component
msa_component-1                                 | [2024-06-24 15:18:59,841 | root | INFO] Executing component
msa_component-1                                 | [2024-06-24 15:19:00,112 | distributed.http.proxy | INFO] To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
msa_component-1                                 | [2024-06-24 15:19:00,131 | distributed.scheduler | INFO] State start
msa_component-1                                 | [2024-06-24 15:19:00,135 | distributed.scheduler | INFO]   Scheduler at:     tcp://127.0.0.1:45005
msa_component-1                                 | [2024-06-24 15:19:00,135 | distributed.scheduler | 

msa_component-1 exited with code 0


peptide_features_component-1                    | [2024-06-24 15:19:04,093 | fondant.cli | INFO] Component `PeptideFeaturesComponent` found in module main
peptide_features_component-1                    | [2024-06-24 15:19:04,099 | fondant.component.executor | INFO] Caching disabled for the component
peptide_features_component-1                    | [2024-06-24 15:19:04,099 | root | INFO] Executing component
peptide_features_component-1                    | [2024-06-24 15:19:04,365 | distributed.http.proxy | INFO] To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
peptide_features_component-1                    | [2024-06-24 15:19:04,385 | distributed.scheduler | INFO] State start
peptide_features_component-1                    | [2024-06-24 15:19:04,389 | distributed.scheduler | INFO]   Scheduler at:     tcp://127.0.0.1:38037
peptide_features_component-1                    | [2024-06-24 15:19:04,390 | distributed.

peptide_features_component-1 exited with code 0
Finished pipeline run.


## Results

The results below are taken from the pipeline output stored in the `.fondant` directory. This directory contains the output of each component, together with the cache of previous runs. The pipeline does not yet use the `write_to_file` component, so the results are collected individually from the output of each component.

In [8]:
# find the most recent run folder: BASE_PATH/PIPELINE_NAME/PIPELINE_NAME-<timestamp>
matching_folders = glob.glob(f"{BASE_PATH}/{PIPELINE_NAME}/{PIPELINE_NAME}-*")

# fail with a clear message instead of a NameError when no run has produced output yet
if not matching_folders:
    raise FileNotFoundError(f"No pipeline output found in {BASE_PATH}/{PIPELINE_NAME}")

last_folder = max(matching_folders, key=os.path.getctime)

logging.info(f"Last folder: {last_folder}")


[2024-06-24 17:19:07,100 | root | INFO] Last folder: .fondant/feature_extraction_pipeline/feature_extraction_pipeline-20240624171820
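
File-system timestamps can change when a folder is copied or restored, so an alternative is to order the run folders by the timestamp in their names. The sketch below assumes the suffix follows the `%Y%m%d%H%M%S` pattern, which is inferred from the folder names in the logs above and not confirmed elsewhere; the helper `run_timestamp` is hypothetical.

```python
from datetime import datetime


def run_timestamp(folder: str, pipeline_name: str) -> datetime:
    """Parse the <timestamp> suffix of a PIPELINE_NAME-<timestamp> run folder."""
    suffix = folder.rstrip("/").rsplit(f"{pipeline_name}-", 1)[-1]
    return datetime.strptime(suffix, "%Y%m%d%H%M%S")


# Two run folders that appear in the logs of this notebook.
folders = [
    ".fondant/feature_extraction_pipeline/feature_extraction_pipeline-20240613115444",
    ".fondant/feature_extraction_pipeline/feature_extraction_pipeline-20240624171820",
]
latest = max(folders, key=lambda f: run_timestamp(f, "feature_extraction_pipeline"))
print(latest)
```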


In [9]:
from pathlib import Path

def merge_parquet_folders(folder_path):
    """Return one DataFrame per component folder, built from its parquet partitions."""
    df_list = []

    for folder in Path(folder_path).iterdir():
        if folder.is_dir():
            logging.info(f"Reading parquet partitions from: {folder}")
            parquet_files = list(folder.glob("*.parquet"))
            logging.info(f"Found {len(parquet_files)} parquet files")
            dfs = [pd.read_parquet(file) for file in parquet_files]
            # skip empty partitions; a folder without any data is left out entirely
            dfs = [x for x in dfs if not x.empty]
            if len(dfs) == 0:
                continue
            # stack the partitions of this component into a single DataFrame
            df = pd.concat(dfs)
            df_list.append(df)

    return df_list

In [10]:
dataframe_list = merge_parquet_folders(last_folder)

# join the component outputs side by side and keep the first copy of each duplicated column
df_final = pd.concat(dataframe_list, axis=1)
df_final = df_final.loc[:, ~df_final.columns.duplicated()]

# drop columns that are not stored properly in a csv
columns_to_remove = ['pdb_string']
df_final = df_final.drop(columns=columns_to_remove)

# write to file
df_final.to_csv(f"{last_folder}/final_output.csv", index=False)



[2024-06-24 17:19:07,128 | root | INFO] Reading parquet partitions from: .fondant/feature_extraction_pipeline/feature_extraction_pipeline-20240624171820/filter_pdb_component
[2024-06-24 17:19:07,129 | root | INFO] Found 8 parquet files
[2024-06-24 17:19:07,157 | root | INFO] Reading parquet partitions from: .fondant/feature_extraction_pipeline/feature_extraction_pipeline-20240624171820/predict_protein_3d_structure_component
[2024-06-24 17:19:07,158 | root | INFO] Found 8 parquet files
[2024-06-24 17:19:07,182 | root | INFO] Reading parquet partitions from: .fondant/feature_extraction_pipeline/feature_extraction_pipeline-20240624171820/biopython_component
[2024-06-24 17:19:07,183 | root | INFO] Found 8 parquet files
[2024-06-24 17:19:07,206 | root | INFO] Reading parquet partitions from: .fondant/feature_extraction_pipeline/feature_extraction_pipeline-20240624171820/load_from_parquet
[2024-06-24 17:19:07,207 | root | INFO] Found 0 parquet files
[2024-06-24 17:19:07,207 | root | INFO] Re