# Protein feature extraction pipeline

This notebook walks through the pipeline that extracts features from protein sequences. It shows the pipeline's output so that you don't need to run the `pipeline.py` file locally.

In [2]:
import pyarrow as pa
from fondant.pipeline import Pipeline
import sys
import os

# Import the mock data path from the config file
config_path = os.path.abspath(os.path.join('..'))
sys.path.append(config_path)
from config import MOCK_DATA_PATH_FONDANT

## Creating the pipeline

In [8]:
# Create a new pipeline

pipeline = Pipeline(
	name="feature_extraction_pipeline",
	base_path=".fondant",
	description="A pipeline to extract features from protein sequences."
)

## Loading the dataset

In [9]:
# Read the dataset

dataset = pipeline.read(
	"load_from_parquet",
	arguments={
		"dataset_uri": MOCK_DATA_PATH_FONDANT,
	},
	produces={
		"sequence": pa.string()
	}
)

## Components

The pipeline uses the following components:

- `generate_protein_sequence_checksum_component`: This component will generate a checksum for the protein sequence.

- `biopython_component`: This component will extract features from the protein sequence using Biopython.

- `iFeatureOmega_component`: This component will extract features from the protein sequence using iFeature Omega. This component uses arguments to specify the type of features to extract.

- `filter_pdb_component`: This component will filter the PDB files that are already predicted, so the pipeline doesn't need to predict them again. You'll need to specify the following arguments before running the pipeline:
```json
"storage_type": "local",
"pdb_path": "/data/pdb_files/",
"bucket_name": "your-bucket-name",
"project_id": "your-project-id",
"google_cloud_credentials_path": "/data/google_cloud_credentials.json"
```

If `storage_type` is `local`, you can leave `bucket_name`, `project_id` and `google_cloud_credentials_path` as empty strings. Setting it to `remote` requires a Google Cloud Storage bucket, credentials and a project ID.

- `predict_protein_3D_structure_component`: This component will predict the 3D structure of the protein using ESMFold.

- `store_pdb_component`: This component will store the PDB files in the provided `storage_type`. You'll need to specify the following arguments before running the pipeline:
```json
"storage_type": "local",
"pdb_path": "/data/pdb_files/",
"bucket_name": "your-bucket-name",
"project_id": "your-project-id",
"google_cloud_credentials_path": "/data/google_cloud_credentials.json"
```

If `storage_type` is `local`, you can leave `bucket_name`, `project_id` and `google_cloud_credentials_path` as empty strings. Setting it to `remote` requires a Google Cloud Storage bucket, credentials and a project ID.
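To make the checksum step more concrete, here is a minimal sketch of a table-driven 64-bit CRC over a protein sequence. It is modeled on the SwissProt-style CRC-64 that Biopython exposes as `Bio.SeqUtils.CheckSum.crc64`, and it produces the same `CRC-` plus 16 hex digits format seen in the results. Whether `generate_protein_sequence_checksum_component` uses exactly this routine is an assumption, and the function name `sequence_checksum` and the example sequence are hypothetical.

```python
# Sketch only: a table-driven CRC-64 in the style of Biopython's crc64.
# That the pipeline component uses this exact routine is an assumption.
_POLY = 0xD800000000000000  # reflected polynomial of the SwissProt-style CRC-64


def _build_table() -> list:
    """Precompute the CRC contribution of every possible byte value."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ _POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _build_table()


def sequence_checksum(sequence: str) -> str:
    """Return a 'CRC-' prefixed, 16-hex-digit checksum for a sequence."""
    crc = 0
    for residue in sequence:
        crc = (crc >> 8) ^ _TABLE[(crc ^ ord(residue)) & 0xFF]
    return f"CRC-{crc:016X}"


# Arbitrary example sequence, not taken from the dataset.
print(sequence_checksum("MKTAYIAKQR"))
```

Because the checksum depends only on the sequence, identical sequences always map to the same value, which is what makes it usable as a lookup key for PDB files that were predicted earlier.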

In [11]:
# Apply the components to the dataset

_ = dataset.apply(
	"../components/generate_protein_sequence_checksum_component"
).apply(
	"../components/biopython_component"
).apply(
	"../components/iFeatureOmega_component",
	# The number of rows per partition is currently forced to 5; a better approach is still needed (see the README for more info)
	input_partition_rows=5,
	arguments={
		"descriptors": ["AAC", "GAAC", "Moran", "Geary", "NMBroto", "APAAC"]
	}
).apply(
	"../components/filter_pdb_component",
	arguments={
		"storage_type": "local",
		"pdb_path": "/data/pdb_files",
		"bucket_name": "elated-chassis-400207_dbtl_pipeline_outputs",
		"project_id": "elated-chassis-400207",
		"google_cloud_credentials_path": "/data/google_cloud_credentials.json"
	},
).apply(
	"../components/predict_protein_3D_structure_component",
).apply(
	"../components/store_pdb_component",
	arguments={
		"storage_type": "local",
		"pdb_path": "/data/pdb_files/",
		"bucket_name": "elated-chassis-400207_dbtl_pipeline_outputs",
		"project_id": "elated-chassis-400207",
		"google_cloud_credentials_path": "/data/google_cloud_credentials.json"
	}
)
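The `storage_type` rule described for `filter_pdb_component` and `store_pdb_component` can be checked before the pipeline is run. The sketch below is a hypothetical helper (`check_storage_arguments` is not part of the components) that encodes only what the notes state: `local` may leave the Google Cloud fields empty, while `remote` needs all of them.

```python
# Hypothetical helper, not part of the pipeline: checks the argument rule
# described for filter_pdb_component and store_pdb_component.
REMOTE_ONLY_KEYS = ("bucket_name", "project_id", "google_cloud_credentials_path")


def check_storage_arguments(arguments: dict) -> list:
    """Return a list of problems found in a storage arguments dict."""
    problems = []
    storage_type = arguments.get("storage_type")
    if storage_type not in ("local", "remote"):
        problems.append(f"unknown storage_type: {storage_type!r}")
    if storage_type == "remote":
        # Remote storage needs every Google Cloud setting filled in.
        for key in REMOTE_ONLY_KEYS:
            if not arguments.get(key):
                problems.append(f"{key} is required for remote storage")
    return problems


local_args = {
    "storage_type": "local",
    "pdb_path": "/data/pdb_files/",
    "bucket_name": "",
    "project_id": "",
    "google_cloud_credentials_path": "",
}
print(check_storage_arguments(local_args))  # → []
```

Running the same check on a dict with `"storage_type": "remote"` and empty Google Cloud fields returns one message per missing field.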



## Run the pipeline

The `pipeline.py` file needs to be run using the command line. The following command will run the pipeline:

```bash
fondant < full_path_to_pipeline.py >\data:/data
```

Because this is a notebook, the following cell instead uses Fondant's `DockerRunner` class to run the pipeline.

In [12]:
from fondant.pipeline.runner import DockerRunner

DockerRunner().run(input=pipeline, extra_volumes="/data")

[2024-03-25 17:12:54,646 | root | INFO] Found reference to un-compiled pipeline... compiling
[2024-03-25 17:12:54,647 | fondant.pipeline.compiler | INFO] Compiling feature_extraction_pipeline to .fondant/compose.yaml
[2024-03-25 17:12:54,647 | fondant.pipeline.compiler | INFO] Base path found on local system, setting up .fondant as mount volume
[2024-03-25 17:12:54,648 | fondant.pipeline.pipeline | INFO] Sorting pipeline component graph topologically.
[2024-03-25 17:12:54,673 | fondant.pipeline.pipeline | INFO] All pipeline component specifications match.
[2024-03-25 17:12:54,674 | fondant.pipeline.compiler | INFO] Compiling service for load_from_parquet
[2024-03-25 17:12:54,675 | fondant.pipeline.compiler | INFO] Compiling service for generate_protein_sequence_checksum_component
[2024-03-25 17:12:54,676 | fondant.pipeline.compiler | INFO] Found Dockerfile for generate_protein_sequence_checksum_component, adding build step.
[2024-03-25 17:12:54,678 | fondant.pipeline.compiler | INFO] C

Starting pipeline run...
Finished pipeline run.


## Results

The following results have been taken from the output of the pipeline, which is stored in the `.fondant` directory. This directory contains the output of each component, together with the cache of the previous run.

Currently, the pipeline doesn't implement the `write_to_file` component, so the results are read individually from the output of each component.

In [12]:
import pandas as pd

In [13]:
df_load_from_parquet = pd.read_parquet('../data/export/load_from_parquet/')
df_load_from_parquet.head()

id,sequence
0_1,MNQRGMPIQSLVTNVKINRLEENDCIHTRHRVRPGRTDGKNLHAMM...
0_2,MAGLKPEVPLHDGINKFGKSDFAGQEGPKIVTTTDKALLVANGALK...
0_3,MVDLKKELKNFVDSDFPGSPKQEAQGIDVRILLSFNNAAFREALII...
0_4,MELILAKARLEFECDWGLLMLEPCVPPTKIFADRNYAVGVMFESDK...
0_5,MRVLCDGSTGYACAKNTRIRFREKVASVLAKIQGYEQTFPHHMPNM...


In [14]:
df_biopython = pd.read_parquet('../data/export/biopython_component/')
df_biopython.head()

id,sequence,sequence_length,molecular_weight,aromaticity,isoelectric_point,instability_index,gravy,helix,turn,sheet,charge_at_ph7,charge_at_ph5,molar_extinction_coefficient_oxidized,molar_extinction_coefficient_reduced
0_1,MNQRGMPIQSLVTNVKINRLEENDCIHTRHRVRPGRTDGKNLHAMM...,600,66369.0679,0.06,5.397908,38.074,-0.334833,0.385,0.268333,0.328333,-17.260409,9.933956,54320,55070
0_5,MRVLCDGSTGYACAKNTRIRFREKVASVLAKIQGYEQTFPHHMPNM...,420,47355.5634,0.088095,5.392736,42.44,-0.518333,0.319048,0.283333,0.32619,-16.985081,7.337325,58330,58580
0_2,MAGLKPEVPLHDGINKFGKSDFAGQEGPKIVTTTDKALLVANGALK...,400,43254.8112,0.055,5.964593,38.948025,-0.26375,0.335,0.3225,0.335,-6.448485,11.400447,31860,32110
0_3,MVDLKKELKNFVDSDFPGSPKQEAQGIDVRILLSFNNAAFREALII...,550,60158.722,0.072727,5.349652,38.161636,-0.183273,0.341818,0.296364,0.34,-12.666847,6.634677,31860,32110
0_4,MELILAKARLEFECDWGLLMLEPCVPPTKIFADRNYAVGVMFESDK...,350,39615.9422,0.091429,4.825028,40.802857,-0.248,0.38,0.248571,0.357143,-18.025837,-3.682992,61420,61795
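Some of the columns above are simple ratios over the sequence. As an illustration, the sketch below computes `sequence_length` and `aromaticity` with the standard library only, taking aromaticity to be the relative frequency of Phe, Trp and Tyr, which is how Biopython's `ProteinAnalysis.aromaticity()` defines it. The component itself uses Biopython; the function name and the example sequence are hypothetical.

```python
# Sketch only: aromaticity as the fraction of F, W and Y residues.
AROMATIC_RESIDUES = "FWY"


def aromaticity(sequence: str) -> float:
    """Return the relative frequency of aromatic residues in a sequence."""
    if not sequence:
        return 0.0  # design choice for this sketch: empty input yields 0.0
    aromatic_count = sum(sequence.count(aa) for aa in AROMATIC_RESIDUES)
    return aromatic_count / len(sequence)


# Arbitrary example: 10 residues, 3 of them aromatic.
example = "MFWYAAAAAA"
print(len(example), aromaticity(example))  # → 10 0.3
```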


In [15]:
df_ifeature_omega_component = pd.read_parquet('../data/export/ifeature_omega_component/')
df_ifeature_omega_component.head()

id,sequence,AAC_A,AAC_C,AAC_D,AAC_E,AAC_F,AAC_G,AAC_H,AAC_I,AAC_K,...,NMBroto_BEGF750101.lag3,NMBroto_BEGF750102.lag1,NMBroto_BEGF750102.lag2,NMBroto_BEGF750102.lag3,NMBroto_BEGF750103.lag1,NMBroto_BEGF750103.lag2,NMBroto_BEGF750103.lag3,NMBroto_BHAR880101.lag1,NMBroto_BHAR880101.lag2,NMBroto_BHAR880101.lag3
0_1,MNQRGMPIQSLVTNVKINRLEENDCIHTRHRVRPGRTDGKNLHAMM...,0.09,0.02,0.048333,0.093333,0.021667,0.081667,0.023333,0.046667,0.056667,...,-0.001961,-0.041237,0.100768,-0.049149,0.086146,0.057148,0.028043,-0.021107,0.125624,-0.004387
0_2,MAGLKPEVPLHDGINKFGKSDFAGQEGPKIVTTTDKALLVANGALK...,0.0875,0.0125,0.05,0.0625,0.015,0.08,0.03,0.0675,0.06,...,-0.054463,-0.00263,0.063674,-0.035028,-0.056075,0.069239,-0.02858,0.026617,0.061418,0.047221
0_3,MVDLKKELKNFVDSDFPGSPKQEAQGIDVRILLSFNNAAFREALII...,0.107273,0.009091,0.06,0.067273,0.043636,0.063636,0.016364,0.067273,0.052727,...,0.087268,0.03219,0.011671,0.006448,0.069078,0.121472,0.136282,0.044959,0.04226,0.005347
0_4,MELILAKARLEFECDWGLLMLEPCVPPTKIFADRNYAVGVMFESDK...,0.071429,0.017143,0.04,0.108571,0.042857,0.074286,0.014286,0.057143,0.048571,...,-0.016027,0.015219,0.038238,-0.181714,0.066448,0.071851,-0.089194,-0.005339,0.032348,-0.036633
0_5,MRVLCDGSTGYACAKNTRIRFREKVASVLAKIQGYEQTFPHHMPNM...,0.1,0.011905,0.078571,0.071429,0.033333,0.061905,0.040476,0.061905,0.052381,...,0.038768,0.092617,-0.027349,-0.093597,0.083518,0.039504,-0.037585,0.091911,0.019569,0.001394
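The `AAC_*` columns hold the amino acid composition descriptor: for each of the 20 standard amino acids, the fraction of residues in the sequence that are that amino acid. A minimal standard-library sketch of that calculation follows. It is not the iFeature Omega implementation, and the function name and example sequence are hypothetical.

```python
# Sketch only: amino acid composition (AAC) computed with the standard library.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> dict:
    """Map 'AAC_<residue>' to the fraction of the sequence it accounts for."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    length = len(sequence)
    return {f"AAC_{aa}": sequence.count(aa) / length for aa in AMINO_ACIDS}


# Arbitrary 10-residue example containing two A and two K residues.
features = amino_acid_composition("MKTAYIAKQR")
print(features["AAC_A"], features["AAC_K"])  # → 0.2 0.2
```

For a sequence made up only of standard residues, the 20 values sum to 1 (up to floating-point rounding).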


> This section uses mock data from existing PDB files, which is why `mock_pdb_string` appears in the results.

> The PDB files are placed in the `data/pdb_files` directory, which is not included in the repository.

In [16]:
df_filter = pd.read_parquet('../data/export/filter_pdb_component/')
df_filter.head()

id,sequence,sequence_checksum,pdb_string
0_1,MNQRGMPIQSLVTNVKINRLEENDCIHTRHRVRPGRTDGKNLHAMM...,CRC-94CF2EE011C80480,mock_pdb_string
0_5,MRVLCDGSTGYACAKNTRIRFREKVASVLAKIQGYEQTFPHHMPNM...,CRC-747F108552578E1D,mock_pdb_string
0_2,MAGLKPEVPLHDGINKFGKSDFAGQEGPKIVTTTDKALLVANGALK...,CRC-68D748EC385E9BEC,mock_pdb_string
0_3,MVDLKKELKNFVDSDFPGSPKQEAQGIDVRILLSFNNAAFREALII...,CRC-3B9E0764E7D3C737,mock_pdb_string
0_4,MELILAKARLEFECDWGLLMLEPCVPPTKIFADRNYAVGVMFESDK...,CRC-B08C4E4E86E87F17,mock_pdb_string


In [17]:
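# Note: this cell reads the filter_pdb_component export, so it shows the same rows as the previous cell
df_predict = pd.read_parquet('../data/export/filter_pdb_component/')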
df_predict.head()

id,sequence,sequence_checksum,pdb_string
0_1,MNQRGMPIQSLVTNVKINRLEENDCIHTRHRVRPGRTDGKNLHAMM...,CRC-94CF2EE011C80480,mock_pdb_string
0_5,MRVLCDGSTGYACAKNTRIRFREKVASVLAKIQGYEQTFPHHMPNM...,CRC-747F108552578E1D,mock_pdb_string
0_2,MAGLKPEVPLHDGINKFGKSDFAGQEGPKIVTTTDKALLVANGALK...,CRC-68D748EC385E9BEC,mock_pdb_string
0_3,MVDLKKELKNFVDSDFPGSPKQEAQGIDVRILLSFNNAAFREALII...,CRC-3B9E0764E7D3C737,mock_pdb_string
0_4,MELILAKARLEFECDWGLLMLEPCVPPTKIFADRNYAVGVMFESDK...,CRC-B08C4E4E86E87F17,mock_pdb_string


In [18]:
df_store = pd.read_parquet('../data/export/store_pdb_component/')
df_store.head()

id,sequence,sequence_checksum,pdb_string
0_1,MNQRGMPIQSLVTNVKINRLEENDCIHTRHRVRPGRTDGKNLHAMM...,CRC-94CF2EE011C80480,mock_pdb_string
0_5,MRVLCDGSTGYACAKNTRIRFREKVASVLAKIQGYEQTFPHHMPNM...,CRC-747F108552578E1D,mock_pdb_string
0_2,MAGLKPEVPLHDGINKFGKSDFAGQEGPKIVTTTDKALLVANGALK...,CRC-68D748EC385E9BEC,mock_pdb_string
0_3,MVDLKKELKNFVDSDFPGSPKQEAQGIDVRILLSFNNAAFREALII...,CRC-3B9E0764E7D3C737,mock_pdb_string
0_4,MELILAKARLEFECDWGLLMLEPCVPPTKIFADRNYAVGVMFESDK...,CRC-B08C4E4E86E87F17,mock_pdb_string


In [19]:
df_unikp = pd.read_parquet('../data/export/unikp_component/')
df_unikp.head()

id,sequence,unikp_kinetic_prediction
0_5,MRVLCDGSTGYACAKNTRIRFREKVASVLAKIQGYEQTFPHHMPNM...,"{""CC(=O)Nc1ccc(O)cc1"": {""Km"": 0.15384583775759..."
0_1,MNQRGMPIQSLVTNVKINRLEENDCIHTRHRVRPGRTDGKNLHAMM...,"{""CC(=O)Nc1ccc(O)cc1"": {""Km"": 0.14835407296487..."
0_2,MAGLKPEVPLHDGINKFGKSDFAGQEGPKIVTTTDKALLVANGALK...,"{""CC(=O)Nc1ccc(O)cc1"": {""Km"": 0.12049170449441..."
0_3,MVDLKKELKNFVDSDFPGSPKQEAQGIDVRILLSFNNAAFREALII...,"{""CC(=O)Nc1ccc(O)cc1"": {""Km"": 0.11393831815358..."
0_4,MELILAKARLEFECDWGLLMLEPCVPPTKIFADRNYAVGVMFESDK...,"{""CC(=O)Nc1ccc(O)cc1"": {""Km"": 0.15480049145100..."


In [20]:
# Get the per-SMILES predictions for the first row of df_unikp
import json

# Use .iloc for positional access, since the index holds string ids
str_prediction = df_unikp['unikp_kinetic_prediction'].iloc[0]

json.loads(str_prediction)

{'CC(=O)Nc1ccc(O)cc1': {'Km': 0.15384583775759242,
  'Kcat': 4.331166579781699,
  'Vmax': 5.554580358129042},
 'CC[C@H]1C(=O)N(CC(=O)N([C@H](C(=O)N[C@H](C(=O)N([C@H](C(=O)N[C@H](C(=O)N[C@@H](C(=O)N([C@H](C(=O)N([C@H](C(=O)N([C@H](C(=O)N([C@H](C(=O)N1)[C@@H]([C@H](C)C/C=C/C)O)C)C(C)C)C)CC(C)C)C)CC(C)C)C)C)C)CC(C)C)C)C(C)C)CC(C)C)C)C': {'Km': 0.12757331986399847,
  'Kcat': 2.5686074344240586,
  'Vmax': 4.100352828666978}}
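The parsed prediction is a nested mapping from SMILES string to `Km`, `Kcat` and `Vmax`. The sketch below flattens one such JSON string into a row per substrate, which is easier to tabulate. The payload is copied from the first entry of the output above, and the helper name `flatten_prediction` is hypothetical.

```python
import json

# Example payload copied from the first entry of the output above.
str_prediction = (
    '{"CC(=O)Nc1ccc(O)cc1": {"Km": 0.15384583775759242, '
    '"Kcat": 4.331166579781699, "Vmax": 5.554580358129042}}'
)


def flatten_prediction(raw: str) -> list:
    """Turn {smiles: {metric: value}} JSON into a list of flat row dicts."""
    predictions = json.loads(raw)
    return [{"smiles": smiles, **values} for smiles, values in predictions.items()]


rows = flatten_prediction(str_prediction)
for row in rows:
    print(row["smiles"], row["Km"], row["Kcat"], row["Vmax"])
```

A list of flat dicts like this can be passed directly to `pd.DataFrame(rows)` if a tabular view is preferred.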