# Protein feature extraction pipeline

This notebook walks through the pipeline that extracts features from protein sequences. It shows the pipeline's output without requiring you to run the `pipeline.py` file locally.

In [1]:
import pyarrow as pa
# import pandas as pd
from fondant.dataset import Dataset
import os
from config import MOCK_DATA_PATH_FONDANT

# flag that is set to True once the manifest files have been removed from the output folder
REMOVED_MANIFEST = False

# path to the most recent output folder; set after the pipeline has run
OUTPUT_FOLDER = None

## Generate Mock data

In [2]:
!python utils/generate_mock_data.py

In [3]:
# show content of the mock data
import pandas as pd
mock_df = pd.read_parquet("." + MOCK_DATA_PATH_FONDANT)  # dot added to make it relative to the current directory
mock_df

Unnamed: 0,sequence,name
0,MNQRGMPIQSLVTNVKINRLEENDCIHTRHRVRPGRTDGKNLHAMM...,Seq1
1,MAGLKPEVPLHDGINKFGKSDFAGQEGPKIVTTTDKALLVANGALK...,Seq2
2,MVDLKKELKNFVDSDFPGSPKQEAQGIDVRILLSFNNAAFREALII...,Seq3
3,MELILAKARLEFECDWGLLMLEPCVPPTKIFADRNYAVGVMFESDK...,Seq4
4,MRVLCDGSTGYACAKNTRIRFREKVASVLAKIQGYEQTFPHHMPNM...,Seq5


## Loading the dataset

In [4]:
# # Create a new pipeline

BASE_PATH = ".fondant"
PIPELINE_NAME = "feature_extraction_pipeline"

# dataset = Dataset(
# 	name=PIPELINE_NAME,
# 	base_path=BASE_PATH,
# 	description="A pipeline to extract features from protein sequences."
# )

## Creating the pipeline

In [5]:
# Read the dataset

raw_data = Dataset.create(
	"load_from_parquet",
	arguments={
		"dataset_uri": MOCK_DATA_PATH_FONDANT,
	},
	produces={
		"sequence": pa.string()
	}
)



[2024-06-12 13:15:22,274 | fondant.dataset.dataset | INFO] The consumes section of the component spec is not defined. Can not infer consumes of the OperationSpec. Please define a consumes section in the dataset interface. 


## Components

---

### generate_protein_sequence_checksum_component

This component generates a checksum for the protein sequence.
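
As a rough illustration of what such a checksum can look like, the sketch below hashes a normalised sequence with SHA-256 from the standard library. The choice of hash and the `sequence_checksum` helper are assumptions made for this example; the component itself may use a different algorithm.

```python
import hashlib


def sequence_checksum(sequence: str) -> str:
    """Return a hex digest that identifies a protein sequence (illustrative helper)."""
    # Normalise first so that whitespace and letter case do not change the checksum.
    normalised = "".join(sequence.split()).upper()
    return hashlib.sha256(normalised.encode("ascii")).hexdigest()


# Identical sequences always map to the same digest, which makes it usable as a key.
print(sequence_checksum("MNQRGMPIQSLV"))
```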

---

### biopython_component

Extracts features from the protein sequence using Biopython.

---

### iFeatureOmega_component

Extracts features from the protein sequence using the [iFeatureOmega-CLI GitHub repo](https://github.com/Superzchen/iFeatureOmega-CLI). Arguments are used to specify the type of features to extract.
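
To give a feel for what a descriptor is, AAC (amino acid composition) is the relative frequency of each of the 20 standard amino acids in a sequence. The sketch below computes it in plain Python. It is not the iFeatureOmega implementation, the `amino_acid_composition` helper is a hypothetical name, and normalising by the number of standard residues (rather than the raw sequence length) is an assumption of this example.

```python
from collections import Counter
from typing import Dict

STANDARD_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> Dict[str, float]:
    """Return the fraction of each standard amino acid in the sequence."""
    counts = Counter(sequence.upper())
    # Only standard residues contribute to the total, so the fractions sum to 1.
    total = sum(counts[aa] for aa in STANDARD_AMINO_ACIDS)
    if total == 0:
        return {aa: 0.0 for aa in STANDARD_AMINO_ACIDS}
    return {aa: counts[aa] / total for aa in STANDARD_AMINO_ACIDS}


composition = amino_acid_composition("MNQRGMPIQSLVTNVKINRLEENDCIHTRHRVRPGRTDGKNLHAMM")
print(len(composition))  # one value per standard amino acid
```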

---

### filter_pdb_component

Filters PDB files that are already predicted to avoid redundant predictions. Arguments need to be specified before running the pipeline:
```json
"storage_type": "local",
"pdb_path": "/data/<your-pdb-folder-path>",
"bucket_name": "your-bucket-name",
"project_id": "your-project-id",
"google_cloud_credentials_path": "/data/<your-credentials>.json"
```

When using only local storage, leave `bucket_name`, `project_id`, and `google_cloud_credentials_path` as empty strings. Remote storage requires a Google Cloud Storage bucket, its credentials, and a project ID.
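
For the local case, the filtering idea can be sketched as follows: check which identifiers already have a `.pdb` file in the folder and keep only the rest for prediction. The one-file-per-identifier naming scheme (`<id>.pdb`) and the `ids_without_pdb` helper are assumptions made for this example, not the component's documented behaviour.

```python
from pathlib import Path
from typing import Iterable, List


def ids_without_pdb(ids: Iterable[str], pdb_dir: str) -> List[str]:
    """Return the identifiers that have no `<id>.pdb` file in `pdb_dir` yet."""
    folder = Path(pdb_dir)
    # A missing folder simply means nothing has been predicted so far.
    existing = {path.stem for path in folder.glob("*.pdb")} if folder.is_dir() else set()
    return [identifier for identifier in ids if identifier not in existing]
```

Only the identifiers returned by such a check would need to be sent on to the structure prediction step.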

---

### predict_protein_3D_structure_component

Predicts the 3D structure of the protein using ESMFold. This component requires a `.env` file with the following variables:
```env
HF_API_KEY=""
HF_ENDPOINT_URL=""
```
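
Before starting a run it can help to fail early when these variables are missing. A minimal sketch, assuming the values from the `.env` file have already been loaded into the process environment; `missing_env_vars` is a hypothetical helper, not part of the component.

```python
import os
from typing import List, Mapping, Optional

REQUIRED_VARS = ("HF_API_KEY", "HF_ENDPOINT_URL")


def missing_env_vars(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing or empty variables: " + ", ".join(missing))
```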

---

### store_pdb_component

Stores the PDB files in the provided storage_type. Arguments need to be specified before running the pipeline:
```json
"storage_type": "local",
"pdb_path": "/data/<your-pdb-folder-path>",
"bucket_name": "your-bucket-name",
"project_id": "your-project-id",
"google_cloud_credentials_path": "/data/<your-credentials>.json"
```

When using only local storage, leave `bucket_name`, `project_id`, and `google_cloud_credentials_path` as empty strings. Remote storage requires a Google Cloud Storage bucket, its credentials, and a project ID.

---

### msa_component

Generates the multiple sequence alignment for the protein sequence using [Clustal Omega](http://www.clustal.org/omega/). Because the alignment can be time-consuming, it is recommended to align only a small number of sequences, or to skip this component entirely.
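
Clustal Omega accepts FASTA input, so one way to cap the workload is to write only the first few sequences to the input. The sketch below assumes `(name, sequence)` pairs as input; `to_fasta` and its `max_sequences` parameter are hypothetical names, not part of the component.

```python
from typing import Iterable, Optional, Tuple


def to_fasta(records: Iterable[Tuple[str, str]], max_sequences: Optional[int] = None) -> str:
    """Format (name, sequence) pairs as FASTA, keeping at most `max_sequences` records."""
    lines = []
    for index, (name, sequence) in enumerate(records):
        if max_sequences is not None and index >= max_sequences:
            break
        lines.append(f">{name}")  # header line
        lines.append(sequence)    # sequence line
    return "".join(line + "\n" for line in lines)


print(to_fasta([("Seq1", "MNQRGMPIQS"), ("Seq2", "MAGLKPEVPL")], max_sequences=1))
```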

---

### unikp_component

Uses the UniKP endpoint on HuggingFace to predict the kinetic parameters of a protein sequence and substrate (SMILES) combination. It takes the path below as an argument; see the README for a description of the contents of that file.

```yaml
"protein_smiles_path": "/data/<path_protein_smiles>"
```

---

### peptide_component

Calculates the features from the protein sequence using the `peptides` package.

---

### deepTMpred_component

Predicts the transmembrane regions of the protein sequence using the [DeepTMpred GitHub repository](https://github.com/ISYSLAB-HUST/DeepTMpred).

In [6]:
final_dataset = raw_data.apply(
	"./components/biopython_component"
# ).apply(
# 	"./components/generate_protein_sequence_checksum_component"
# ).apply(
# 	"./components/iFeatureOmega_component",
# 	# currently forcing the number of rows to 5, but there needs to be a better way to do this, see readme for more info
# 	input_partition_rows=5,
# 	arguments={
# 		"descriptors": ["AAC", "CTDC", "CTDT"]
# 	}
# ).apply(
# 	"./components/filter_pdb_component",
# 	arguments={
# 		"method": "local",
# 		"local_pdb_path": "/data/pdb_files",
# 		"bucket_name": "",
# 		"project_id": "",
# 		"google_cloud_credentials_path": ""
# 	}
# ).apply(
# 	"./components/predict_protein_3D_structure_component",
# ).apply(
# 	"./components/store_pdb_component",
# 	arguments={
# 		"method": "local",
# 		"local_pdb_path": "/data/pdb_files/",
# 		"bucket_name": "",
# 		"project_id": "",
# 		"google_cloud_credentials_path": ""
# 	}
# ).apply(
# 	"./components/msa_component",
# ).apply(
# 	"./components/pdb_features_component"
# ).apply(
# 	"./components/unikp_component",
# 	arguments={
# 		"protein_smiles_path": "/data/protein_smiles.json",
# 	},
# ).apply(
# 	"./components/peptide_features_component"
# ).apply(
# 	"./components/DeepTMpred_component"
)



## Run the pipeline

The `pipeline.py` file needs to be run using the command line. The following command will run the pipeline:

```bash
fondant < full_path_to_pipeline.py >\data:/data
```
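
The `:/data` suffix in the command is a Docker-style volume mapping: the local folder on the left of the colon is made available as `/data` inside the component containers. A small sketch for building that string from a relative folder name; `volume_mapping` is a hypothetical helper, and the `data` folder name is taken from this project's layout.

```python
import os


def volume_mapping(host_dir: str, container_dir: str = "/data") -> str:
    """Build a `host_path:container_path` volume string from a local folder."""
    # Docker needs an absolute host path, so resolve relative folder names first.
    return f"{os.path.abspath(host_dir)}:{container_dir}"


print(volume_mapping("data"))
```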

In [7]:
from fondant.dataset.runner import DockerRunner
import shutil

# remove the most recent output folder if the manifest file is removed
# without a manifest file in the most recent output folder, the pipeline cannot be run
if OUTPUT_FOLDER and REMOVED_MANIFEST:
	shutil.rmtree(OUTPUT_FOLDER)
	# remove cache
	shutil.rmtree(os.path.join(BASE_PATH, PIPELINE_NAME, "cache"))

# map the absolute path of the local data folder to /data inside the containers
mounted_data = os.path.join(os.path.abspath("data"), ":/data")

runner = DockerRunner()
runner.run(dataset=final_dataset, working_directory=mounted_data, extra_volumes=mounted_data)

[2024-06-12 13:15:22,363 | root | INFO] Found reference to un-compiled workflow... compiling
[2024-06-12 13:15:22,365 | fondant.dataset.compiler | INFO] Base path not found on local system, created base path and setting up /home/pietercoussement/Software/Sandbox/protein-feature-extraction/data/:/data as mount volume
[2024-06-12 13:15:22,365 | fondant.dataset.dataset | INFO] Sorting workflow graph topologically.
[2024-06-12 13:15:22,367 | fondant.dataset.dataset | INFO] All workflow component specifications match.
[2024-06-12 13:15:22,368 | fondant.dataset.compiler | INFO] Compiling service for load_from_parquet
[2024-06-12 13:15:22,368 | fondant.dataset.compiler | INFO] Compiling service for biopython_component
[2024-06-12 13:15:22,369 | fondant.dataset.compiler | INFO] Found Dockerfile for biopython_component, adding build step.
[2024-06-12 13:15:22,376 | fondant.dataset.compiler | INFO] Successfully compiled to .fondant/compose.yaml
 load_from_parquet Pulling 


Docker version:
(26, 1, 4)
Starting workflow run...


 13808c22b207 Pulling fs layer 
 6c9a484475c1 Pulling fs layer 
 fb408522af25 Pulling fs layer 
 54ac57f98245 Pulling fs layer 
 de5947f22207 Pulling fs layer 
 ce6504e44327 Pulling fs layer 
 967763eca59b Pulling fs layer 
 c36f755488a2 Pulling fs layer 
 61055d3d12d4 Pulling fs layer 
 6b82aacabbe1 Pulling fs layer 
 8e6d0a0acda7 Pulling fs layer 
 967763eca59b Waiting 
 c36f755488a2 Waiting 
 61055d3d12d4 Waiting 
 6b82aacabbe1 Waiting 
 54ac57f98245 Waiting 
 de5947f22207 Waiting 
 8e6d0a0acda7 Waiting 
 ce6504e44327 Waiting 
 6c9a484475c1 Downloading [>                                                  ]  36.13kB/3.508MB
 fb408522af25 Downloading [>                                                  ]  127.2kB/12.38MB
 fb408522af25 Downloading [===>                                               ]  920.8kB/12.38MB
 13808c22b207 Downloading [>                                                  ]  294.9kB/29.13MB
 13808c22b207 Downloading [=>                                               

#0 building with "desktop-linux" instance using docker driver

#1 [biopython_component internal] load build definition from Dockerfile
#1 transferring dockerfile: 476B done
#1 DONE 0.1s

#2 [biopython_component internal] load metadata for docker.io/fndnt/fondant:0.12.1-py3.9
#2 DONE 1.5s

#3 [biopython_component internal] load .dockerignore
#3 transferring context: 2B done
#3 DONE 0.2s

#4 [biopython_component internal] load build context
#4 transferring context: 123B done
#4 DONE 0.0s

#5 [biopython_component 1/6] FROM docker.io/fndnt/fondant:0.12.1-py3.9@sha256:138b4b8ddef694c5256ec9c0e67f37460b1258a7aed66f1225c0e6c8b4764950
#5 resolve docker.io/fndnt/fondant:0.12.1-py3.9@sha256:138b4b8ddef694c5256ec9c0e67f37460b1258a7aed66f1225c0e6c8b4764950 0.0s done
#5 sha256:e8690706172b28215b62640a817fd89cc50c6a011c0877e20be0956b6312bc70 0B / 244B 0.1s
#5 sha256:4a81626d2c6be5f7d7ee1da92369e83a013dd773f83958f463314daa30b7e2dd 0B / 11.89MB 0.1s
#5 sha256:8c60f620628dd111479032e759bd057e2c05944ad3

 Network dataset-17_default  Creating
 Network dataset-17_default  Created
 Container dataset-17-load_from_parquet-1  Creating
 Container dataset-17-load_from_parquet-1  Created
 Container dataset-17-biopython_component-1  Creating
 Container dataset-17-biopython_component-1  Created
load_from_parquet-1    | [2024-06-12 11:16:27,624 | fondant.cli | INFO] Component `LoadFromParquet` found in module main
load_from_parquet-1    | [2024-06-12 11:16:27,628 | fondant.component.executor | INFO] Caching is currently temporarily disabled.
load_from_parquet-1    | [2024-06-12 11:16:27,628 | fondant.component.executor | INFO] No matching execution for component detected
load_from_parquet-1    | [2024-06-12 11:16:27,628 | root | INFO] Executing component
load_from_parquet-1    | [2024-06-12 11:16:27,968 | distributed.http.proxy | INFO] To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
load_from_parquet-1    | [2024-06-12 11:1

load_from_parquet-1 exited with code 0


biopython_component-1  | [2024-06-12 11:16:33,319 | fondant.cli | INFO] Component `BiopythonComponent` found in module main
biopython_component-1  | [2024-06-12 11:16:33,324 | fondant.component.executor | INFO] Caching disabled for the component
biopython_component-1  | [2024-06-12 11:16:33,325 | root | INFO] Executing component
biopython_component-1  | [2024-06-12 11:16:33,851 | distributed.http.proxy | INFO] To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
biopython_component-1  | [2024-06-12 11:16:33,902 | distributed.scheduler | INFO] State start
biopython_component-1  | [2024-06-12 11:16:33,908 | distributed.scheduler | INFO]   Scheduler at:     tcp://127.0.0.1:37543
biopython_component-1  | [2024-06-12 11:16:33,908 | distributed.scheduler | INFO]   dashboard at:  http://127.0.0.1:8787/status
biopython_component-1  | [2024-06-12 11:16:33,909 | distributed.scheduler | INFO] Registering Worker plugin shuffle
b

biopython_component-1 exited with code 1
Finished workflow run.


## Results

The results below are taken from the pipeline output stored in the `.fondant` directory, which holds the output of each component together with the cache of the previous run. The pipeline does not yet include a `write_to_file` component, so the results are read from each component's output individually.

In [8]:
import glob

# get the most recent folder in the folder named: BASE_PATH + PIPELINE_NAME + PIPELINE_NAME-<timestamp>
matching_folders = glob.glob(f"{BASE_PATH}/{PIPELINE_NAME}/{PIPELINE_NAME}-*")

if matching_folders:
    OUTPUT_FOLDER = max(matching_folders, key=os.path.getctime)
else:
    # exit() does not reliably stop a notebook cell, so the cleanup below is guarded instead
    print("No matching folders found")

# OUTPUT_FOLDER stays None when nothing matched, and os.path.exists(None) raises a TypeError
if OUTPUT_FOLDER and os.path.exists(OUTPUT_FOLDER):
	# remove the manifest file from each folder in the output folder
	for root, dirs, files in os.walk(OUTPUT_FOLDER):
		for file in files:
			if file == "manifest.json":
				os.remove(os.path.join(root, file))
				REMOVED_MANIFEST = True

No matching folders found

In [None]:
import os
import pandas as pd

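def merge_parquet_folders(folder_path):
	"""Merge the parquet output of every component folder into one DataFrame, joined on `sequence`."""
	merge_df = pd.DataFrame()
	
	for folder in os.listdir(folder_path):
		# each component folder holds the parquet partitions written by that component
		parquet_partitions = os.path.join(folder_path, folder)
		df = pd.read_parquet(parquet_partitions)
		
		if merge_df.empty:
			# first component: nothing to merge with yet
			merge_df = df
		else:
			merge_df = merge_df.merge(df, on="sequence")
	
	return merge_df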

In [None]:
if REMOVED_MANIFEST and os.path.exists(OUTPUT_FOLDER):
	merged_df = merge_parquet_folders(OUTPUT_FOLDER)
	# a bare expression inside an if block is not rendered, so display it explicitly
	display(merged_df)

In [None]:
if REMOVED_MANIFEST and os.path.exists(OUTPUT_FOLDER):
	# create the export folder if it does not exist yet
	output_path = os.path.join(os.path.abspath("data"), "export")
	os.makedirs(output_path, exist_ok=True)

	merged_df.to_parquet(os.path.join(output_path, "results.parquet"))

In [None]:
# read the output file

output_df = pd.read_parquet("./data/export/results.parquet")
output_df