# Protein feature extraction pipeline

This notebook contains the pipeline for extracting features from protein sequences. It shows the pipeline's output without requiring you to run the `pipeline.py` file locally.

In [1]:
import pyarrow as pa
import pandas as pd
from fondant.pipeline import Pipeline
import os
from config import MOCK_DATA_PATH_FONDANT

# set to True once the manifest files have been removed from the output folder
REMOVED_MANIFEST = False

# path to the most recent output folder; set after the pipeline has run
OUTPUT_FOLDER = None

## Generate Mock data

In [2]:
!python utils/generate_mock_data.py

In [3]:
# show content of the mock data
import pandas as pd
mock_df = pd.read_parquet("." + MOCK_DATA_PATH_FONDANT)  # dot added to make it relative to the current directory
mock_df

Unnamed: 0,sequence,name
0,MNQRGMPIQSLVTNVKINRLEENDCIHTRHRVRPGRTDGKNLHAMM...,Seq1
1,MAGLKPEVPLHDGINKFGKSDFAGQEGPKIVTTTDKALLVANGALK...,Seq2
2,MVDLKKELKNFVDSDFPGSPKQEAQGIDVRILLSFNNAAFREALII...,Seq3
3,MELILAKARLEFECDWGLLMLEPCVPPTKIFADRNYAVGVMFESDK...,Seq4
4,MRVLCDGSTGYACAKNTRIRFREKVASVLAKIQGYEQTFPHHMPNM...,Seq5


## Creating the pipeline

In [4]:
# Create a new pipeline

BASE_PATH = ".fondant"
PIPELINE_NAME = "feature_extraction_pipeline"

pipeline = Pipeline(
	name=PIPELINE_NAME,
	base_path=BASE_PATH,
	description="A pipeline to extract features from protein sequences."
)

## Loading the dataset

In [5]:
# Read the dataset

dataset = pipeline.read(
	"load_from_parquet",
	arguments={
		"dataset_uri": MOCK_DATA_PATH_FONDANT,
	},
	produces={
		"sequence": pa.string()
	}
)

[2024-06-12 15:27:11,305 | fondant.pipeline.pipeline | INFO] The consumes section of the component spec is not defined. Can not infer consumes of the OperationSpec. Please define a consumes section in the dataset interface. 


## Components

---

### generate_protein_sequence_checksum_component

This component generates a checksum for the protein sequence.
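
As a rough illustration of what such a checksum could look like, the sketch below hashes a sequence with SHA-256 from the standard library. The choice of SHA-256, the normalisation step, and the `sequence_checksum` helper are assumptions made for this example; the component's actual algorithm is defined in its own source.

```python
import hashlib


def sequence_checksum(sequence: str) -> str:
    """Return a hex digest identifying a protein sequence (hypothetical helper)."""
    # Assumption: ignore surrounding whitespace and letter case so that
    # equivalent sequences map to the same checksum.
    normalised = sequence.strip().upper()
    return hashlib.sha256(normalised.encode("ascii")).hexdigest()


checksum = sequence_checksum("MNQRGMPIQSLVTNVKINRLEENDCIHTRHRVRPGRTDGKNLHAMM")
print(checksum)
```

A stable checksum of this kind gives every sequence a short, fixed-length identifier that later steps can use as a lookup key.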

---

### biopython_component

Extracts features from the protein sequence using Biopython.
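
Two of the fields this component adds, `sequence_length` and `aromaticity`, are simple enough to sketch in plain Python. The helper below follows the usual definition of aromaticity as the relative frequency of Phe, Trp, and Tyr. It is an illustration only and does not call Biopython; the `basic_features` name is hypothetical.

```python
AROMATIC_RESIDUES = frozenset("FWY")  # phenylalanine, tryptophan, tyrosine


def basic_features(sequence: str) -> dict:
    """Compute two simple per-sequence features (hypothetical helper)."""
    length = len(sequence)
    aromatic_count = sum(residue in AROMATIC_RESIDUES for residue in sequence)
    return {
        "sequence_length": length,
        # Guard against empty input to avoid a division by zero.
        "aromaticity": aromatic_count / length if length else 0.0,
    }


print(basic_features("MFWYAK"))  # {'sequence_length': 6, 'aromaticity': 0.5}
```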

---

### iFeatureOmega_component

Extracts features from the protein sequence using the [iFeatureOmega-CLI GitHub repo](https://github.com/Superzchen/iFeatureOmega-CLI). Arguments are used to specify the type of features to extract.
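
As one example of a descriptor, AAC (amino acid composition) is the fraction of each of the 20 standard amino acids in a sequence. The sketch below computes it in plain Python to show the shape of such a feature vector. It does not call iFeatureOmega, and the `aac` helper and its `AAC_<residue>` key names are assumptions for this example.

```python
from collections import Counter

STANDARD_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(sequence: str) -> dict:
    """Return the fraction of each standard amino acid (hypothetical helper)."""
    counts = Counter(sequence)
    length = len(sequence)
    return {
        f"AAC_{residue}": (counts[residue] / length if length else 0.0)
        for residue in STANDARD_AMINO_ACIDS
    }


features = aac("MAGLKPEVPL")
print(len(features), features["AAC_P"])  # 20 0.2
```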

---

### filter_pdb_component

Filters PDB files that are already predicted to avoid redundant predictions. Arguments need to be specified before running the pipeline:
```json
"storage_type": "local",
"pdb_path": "/data/<your-pdb-folder-path>",
"bucket_name": "your-bucket-name",
"project_id": "your-project-id",
"google_cloud_credentials_path": "/data/<your-credentials>.json"
```

For local storage only, leave `bucket_name`, `project_id`, and `google_cloud_credentials_path` as empty strings. Remote storage requires a Google Cloud Storage bucket, a project ID, and credentials.
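
For the local case, the filtering idea can be sketched as follows: list the structures already present in the PDB folder and keep only the entries that still need a prediction. The `<checksum>.pdb` file naming and the `sequences_needing_prediction` helper are assumptions for illustration; the component may identify existing files differently.

```python
import tempfile
from pathlib import Path


def sequences_needing_prediction(checksums, pdb_dir):
    """Return the checksums that have no matching .pdb file in pdb_dir."""
    existing = {path.stem for path in Path(pdb_dir).glob("*.pdb")}
    return [checksum for checksum in checksums if checksum not in existing]


# Demonstrate with a throwaway folder that already holds one structure.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "abc123.pdb").write_text("HEADER already predicted\n")
    todo = sequences_needing_prediction(["abc123", "def456"], tmp)

print(todo)  # ['def456']
```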

---

### predict_protein_3D_structure_component

Predicts the 3D structure of the protein using ESMFold. This component requires a `.env` file with the following variables:
```env
HF_API_KEY=""
HF_ENDPOINT_URL=""
```
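
A minimal sketch of how these two variables could be read and checked before a run, using only the standard library. The `parse_env_text` and `missing_settings` helpers are hypothetical, and the key shown is a placeholder; the component itself may rely on a dedicated package for loading `.env` files.

```python
REQUIRED_SETTINGS = ("HF_API_KEY", "HF_ENDPOINT_URL")


def parse_env_text(text: str) -> dict:
    """Parse KEY="value" lines, skipping blank lines and comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"')
    return settings


def missing_settings(settings: dict) -> list:
    """Return the required names that are absent or empty."""
    return [name for name in REQUIRED_SETTINGS if not settings.get(name)]


example = 'HF_API_KEY="placeholder-key"\nHF_ENDPOINT_URL=""\n'
print(missing_settings(parse_env_text(example)))  # ['HF_ENDPOINT_URL']
```

Checking for empty values up front surfaces a missing endpoint or key before any prediction request is sent.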

---

### store_pdb_component

Stores the PDB files in the provided storage_type. Arguments need to be specified before running the pipeline:
```json
"storage_type": "local",
"pdb_path": "/data/<your-pdb-folder-path>",
"bucket_name": "your-bucket-name",
"project_id": "your-project-id",
"google_cloud_credentials_path": "/data/<your-credentials>.json"
```

For local storage only, leave `bucket_name`, `project_id`, and `google_cloud_credentials_path` as empty strings. Remote storage requires a Google Cloud Storage bucket, a project ID, and credentials.

---

### msa_component

Generates a multiple sequence alignment for the protein sequences using [Clustal Omega](http://www.clustal.org/omega/). Because the alignment can take a long time, it is recommended to run it on a small number of sequences or to leave this component out.
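
For orientation, the sketch below writes sequences as FASTA text and assembles a Clustal Omega command line without executing it. It assumes the `clustalo` binary is available on the `PATH`; the file names and both helper functions are made up for this example and do not reflect how the component invokes the tool.

```python
def to_fasta(named_sequences: dict) -> str:
    """Format {name: sequence} pairs as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in named_sequences.items())


def clustalo_command(input_fasta: str, output_fasta: str) -> list:
    """Build the argument list for an alignment run (not executed here)."""
    return [
        "clustalo",
        "-i", input_fasta,
        "-o", output_fasta,
        "--outfmt=fasta",
        "--force",  # overwrite an existing output file
    ]


fasta_text = to_fasta({"Seq1": "MNQRGMPIQS", "Seq2": "MAGLKPEVPL"})
print(fasta_text)
print(" ".join(clustalo_command("input.fasta", "aligned.fasta")))
```

Once the binary is installed and the FASTA file is written, the argument list can be passed to `subprocess.run(command, check=True)`.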

---

### unikp_component

Uses the UniKP endpoint on Hugging Face to predict the kinetic parameters of a protein sequence and substrate (SMILES) combination. See the README for a description of the contents of the file that `protein_smiles_path` points to.

```yaml
"protein_smiles_path": "/data/<path_protein_smiles>"
```

---

### peptide_component

Calculates the features from the protein sequence using the `peptides` package.

---

### deepTMpred_component

Predicts the transmembrane regions of the protein sequence using the [DeepTMpred GitHub repository](https://github.com/ISYSLAB-HUST/DeepTMpred).

In [8]:
_ = dataset.apply(
	"./components/biopython_component"
).apply(
	"./components/generate_protein_sequence_checksum_component"
# ).apply(
# 	"./components/iFeatureOmega_component",
# 	# currently forcing the number of rows to 5, but there needs to be a better way to do this, see readme for more info
# 	input_partition_rows=5,
# 	arguments={
# 		"descriptors": ["AAC", "CTDC", "CTDT"]
# 	}
# ).apply(
# 	"./components/filter_pdb_component",
# 	arguments={
# 		"method": "local",
# 		"local_pdb_path": "/data/pdb_files",
# 		"bucket_name": "",
# 		"project_id": "",
# 		"google_cloud_credentials_path": ""
# 	}
# ).apply(
# 	"./components/predict_protein_3D_structure_component",
# ).apply(
# 	"./components/store_pdb_component",
# 	arguments={
# 		"method": "local",
# 		"local_pdb_path": "/data/pdb_files/",
# 		"bucket_name": "elated-chassis-400207_dbtl_pipeline_outputs",
# 		"project_id": "elated-chassis-400207",
# 		"google_cloud_credentials_path": "/data/google_cloud_credentials.json"
# 	}
).apply(
	"./components/msa_component",
# ).apply(
# 	"./components/pdb_features_component"
# ).apply(
# 	"./components/unikp_component",
# 	arguments={
# 		"protein_smiles_path": "/data/protein_smiles.json",
# 	},
# ).apply(
# 	"./components/peptide_features_component"
# ).apply(
# 	"./components/DeepTMpred_component"
)





## Run the pipeline

Outside this notebook, the `pipeline.py` file is run from the command line, with the local `data` folder mounted at `/data`. The following command runs the pipeline:

```bash
fondant run local <full_path_to_pipeline.py> --extra-volumes <full_path_to_data_folder>:/data
```

In [9]:
from fondant.pipeline.runner import DockerRunner
import shutil

# remove the most recent output folder if the manifest file is removed
# without a manifest file in the most recent output folder, the pipeline cannot be run
if OUTPUT_FOLDER and REMOVED_MANIFEST:
	shutil.rmtree(OUTPUT_FOLDER)
	# remove cache
	shutil.rmtree(os.path.join(BASE_PATH, PIPELINE_NAME, "cache"))

# mount the local "data" folder at /data inside the component containers
mounted_data = f"{os.path.abspath('data')}:/data"

DockerRunner().run(input=pipeline, extra_volumes=mounted_data)

[2024-06-12 15:27:42,680 | root | INFO] Found reference to un-compiled pipeline... compiling
[2024-06-12 15:27:42,680 | fondant.pipeline.compiler | INFO] Compiling feature_extraction_pipeline to .fondant/compose.yaml
[2024-06-12 15:27:42,681 | fondant.pipeline.compiler | INFO] Base path found on local system, setting up .fondant as mount volume
[2024-06-12 15:27:42,681 | fondant.pipeline.pipeline | INFO] Sorting pipeline component graph topologically.


InvalidPipelineDefinition: Component 'pdb_features_component' is trying to invoke thefield 'pdb_string', which has not been defined or createdin the previous components. 
Available field names: ['sequence', 'sequence_length', 'molecular_weight', 'aromaticity', 'isoelectric_point', 'instability_index', 'gravy', 'helix', 'turn', 'sheet', 'charge_at_ph3', 'charge_at_ph5', 'charge_at_ph7', 'charge_at_ph9', 'molar_extinction_coefficient_oxidized', 'molar_extinction_coefficient_reduced', 'flexibility_max', 'flexibility_min', 'flexibility_mean', 'sequence_checksum', 'msa_sequence']

## Results

The results below are taken from the pipeline output stored in the `.fondant` directory, which holds the output of each component together with the cache of the previous run. The pipeline does not yet include a `write_to_file` component, so the results are collected from each component's output individually.

In [None]:
import glob

# get the most recent folder in the folder named: BASE_PATH + PIPELINE_NAME + PIPELINE_NAME-<timestamp>
matching_folders = glob.glob(f"{BASE_PATH}/{PIPELINE_NAME}/{PIPELINE_NAME}-*")

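if not matching_folders:
	# raise instead of calling exit(), which would shut down the notebook kernel
	raise FileNotFoundError(f"No output folders found for {PIPELINE_NAME} in {BASE_PATH}")

OUTPUT_FOLDER = max(matching_folders, key=os.path.getctime)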

if os.path.exists(OUTPUT_FOLDER):
	# remove the manifest file from each folder in the output folder
	for root, dirs, files in os.walk(OUTPUT_FOLDER):
		for file in files:
			if file == "manifest.json":
				os.remove(os.path.join(root, file))
				REMOVED_MANIFEST = True

In [None]:
import os
import pandas as pd

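def merge_parquet_folders(folder_path):
	"""Merge the Parquet output of every component folder on the `sequence` column."""
	# None marks "nothing loaded yet"; checking `.empty` would also discard
	# a first component output that legitimately contains no rows
	merge_df = None

	for folder in os.listdir(folder_path):
		parquet_partitions = os.path.join(folder_path, folder)
		df = pd.read_parquet(parquet_partitions)

		# the first folder seeds the result, later folders are joined onto it
		if merge_df is None:
			merge_df = df
		else:
			merge_df = merge_df.merge(df, on="sequence")

	return merge_df if merge_df is not None else pd.DataFrame()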
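
To make the merge step concrete, the sketch below joins two small in-memory frames on `sequence`, the same way the per-component outputs are combined. The column values are invented for the example.

```python
import pandas as pd

# Stand-ins for the output of two components.
biopython_out = pd.DataFrame(
    {"sequence": ["MKV", "MAG"], "sequence_length": [3, 3]}
)
checksum_out = pd.DataFrame(
    {"sequence": ["MKV", "MAG"], "sequence_checksum": ["a1", "b2"]}
)

# DataFrame.merge performs an inner join by default.
merged = biopython_out.merge(checksum_out, on="sequence")
print(list(merged.columns))  # ['sequence', 'sequence_length', 'sequence_checksum']
```

Because the join is an inner join, a sequence that is missing from any one component's output drops out of the merged result. Passing `how="outer"` would keep such rows with missing values instead.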

In [None]:
if REMOVED_MANIFEST and os.path.exists(OUTPUT_FOLDER):
	merged_df = merge_parquet_folders(OUTPUT_FOLDER)
	# a bare expression inside an `if` block is not rendered, so display it explicitly
	display(merged_df)

In [None]:
if REMOVED_MANIFEST and os.path.exists(OUTPUT_FOLDER):
	# create the export folder if it does not exist yet
	output_path = os.path.join(os.path.abspath("data"), "export")
	os.makedirs(output_path, exist_ok=True)

	merged_df.to_parquet(os.path.join(output_path, "results.parquet"))

In [None]:
# read the output file

output_df = pd.read_parquet("./data/export/results.parquet")
output_df