First, let's import the required methods and define some directories to store data.

In [2]:
import os
import pandas as pd

from phytochempy.compound_properties import get_npclassifier_classes_from_df
from phytochempy.data_compilation_utilities import merge_and_tidy_compound_datasets, tidy_final_dataset
from phytochempy.knapsack_searches import get_knapsack_data
from phytochempy.wikidata_searches import get_wikidata, get_wikidata_id_for_taxon

temporary_output_folder = 'temp_outputs'
tidy_outputs_folder = 'outputs'
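The calls later in this example write files into both folders. Whether the library creates missing folders itself is not stated here, so as a precaution the following minimal sketch creates them up front. It uses only the standard library and the two folder names defined above.

```python
import os

# Folder names copied from the cell above.
temporary_output_folder = 'temp_outputs'
tidy_outputs_folder = 'outputs'

for folder in (temporary_output_folder, tidy_outputs_folder):
    # exist_ok=True makes this safe to re-run when the folder already exists.
    os.makedirs(folder, exist_ok=True)
```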



In this first example, we look at the phytochemistry of a single species. To do this, we collect phytochemicals for the entire family. This may seem like overkill for a single species and could be optimised in future, but in general we strongly recommend searching at a higher taxonomic level than the taxa of interest, then resolving the returned names and selecting the relevant data. There are two reasons for this: the nomenclature of families and orders is much more stable than that of species and genera, and, owing to the intricacies of searching KNApSAcK, searches are set up to work at the family level.

In [3]:
sp = 'Catalpa bignonioides'

relevant_family = 'Bignoniaceae'

Now let's get the Wikidata data. This first requires the Wikidata ID for the clade of interest. We have added a utility function to get this ID (`get_wikidata_id_for_taxon`), but recommend double-checking the result, as there are a few reasons a simple search for it may fail.
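One simple way to double-check an ID is to open its Wikidata page and confirm that it describes the intended taxon. The sketch below is not part of the library: `wikidata_entity_url` is a hypothetical helper that checks the ID has the usual `Q` plus digits form and builds the page URL for manual inspection.

```python
import re


def wikidata_entity_url(qid: str) -> str:
    """Return the Wikidata page URL for an item ID such as 'Q213453'."""
    # Item IDs are the letter Q followed by a positive integer.
    if not re.fullmatch(r"Q[1-9]\d*", qid):
        raise ValueError(f"Not a valid Wikidata item ID: {qid!r}")
    return f"https://www.wikidata.org/wiki/{qid}"


print(wikidata_entity_url('Q213453'))
```

This only catches malformed IDs. Whether the ID points at the right taxon still has to be confirmed by eye on the page itself.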

In [4]:
wiki_data_id_for_family = 'Q213453'

get_wikidata(wiki_data_id_for_family, os.path.join(temporary_output_folder, 'wikidata_Bignoniaceae.csv'),
             os.path.join(tidy_outputs_folder, 'wikidata_Bignoniaceae.csv'))

SELECT DISTINCT ?structure ?structureLabel ?structure_smiles ?structure_cas ?structure_inchikey ?organism ?organism_name ?ipniID ?chembl_id WHERE {VALUES ?taxon { wd:Q213453}?organism (wdt:P171*) ?taxon;wdt:P225 ?organism_name.?structure (p:P703/ps:P703) ?organism. OPTIONAL {?structure wdt:P235 ?structure_inchikey.}OPTIONAL {?structure wdt:P233 ?structure_smiles.}OPTIONAL {?structure wdt:P231 ?structure_cas.}OPTIONAL {?organism wdt:P961 ?ipniID.}OPTIONAL {?structure wdt:P592 ?chembl_id.}      SERVICE wikibase:label {bd:serviceParam wikibase:language "en".}}    LIMIT 100000
Loading WCVP locally if exists...
from: C:\Users\ari11kg\Documents\venvs\phytochempy\lib\site-packages\wcvpy\wcvp_download\inputs\wcvp.zip
Downloading latest WCVP version...
to: C:\Users\ari11kg\Documents\venvs\phytochempy\lib\site-packages\wcvpy\wcvp_download\inputs\wcvp.zip
Parsing the checklist
Time elapsed for (down)loading WCVP: 159.66256189346313s


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ipni_matched['matched_by'] = 'ipni_id'


Loading WCVP locally if exists...
from: C:\Users\ari11kg\Documents\venvs\phytochempy\lib\site-packages\wcvpy\wcvp_download\inputs\wcvp.zip
Using up to date WCVP.
Parsing the checklist
Time elapsed for (down)loading WCVP: 74.84020256996155s


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  records['submitted'].ffill(inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  records['match_state'].ffill(inplace=True)


Trying to resolve 0 names with OpenRefine
Temp file for this run:
name matching temp outputs\final_resolutionsf33cff49a5b12f9a07a4dddfb7a82295.csv


Now let's get the KNApSAcK data too. There are a few reasons this might fail for some of the genera in the family; any such failure causes the process to raise an error without saving the data. This could be improved by caching results.

In [4]:
get_knapsack_data([relevant_family], temporary_output_folder, os.path.join(tidy_outputs_folder, 'knapsack_data_Bignoniaceae.csv'))

Loading WCVP locally if exists...
from: C:\Users\ari11kg\Documents\venvs\phytochempy\lib\site-packages\wcvpy\wcvp_download\inputs\wcvp.zip
Using up to date WCVP.
Parsing the checklist
Time elapsed for (down)loading WCVP: 146.9084050655365s


Searching genera in Knapsack for Bignoniaceae…: 100%|█| 270/270 [05:01<00:00,  1


Loading WCVP locally if exists...
from: C:\Users\ari11kg\Documents\venvs\phytochempy\lib\site-packages\wcvpy\wcvp_download\inputs\wcvp.zip
Using up to date WCVP.
Parsing the checklist
Time elapsed for (down)loading WCVP: 89.14287567138672s


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  records['submitted'].ffill(inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  records['match_state'].ffill(inplace=True)


Check tempfile: name matching temp outputs\unmatched_samples_with_multiple_knms_hits598a7a68120eeacf96440f6799ed4baa.csv.
Trying to resolve 45 names with OpenRefine
Resolving submitted names which weren't initially matched using KNMS..
Check tempfile: name matching temp outputs\unmatched_to_autoresolve41ea604d004bdb6a175f32223167f347.csv.
This may take some time... This can be sped up by specifying families of interest (if you haven't already done so) or checking the temp file for misspelled submissions.


Searching automated matches: 100%|██████| 20/20 [00:00<00:00, 31.54it/s]


Check tempfile: name matching temp outputs\unmatched_samplesfeac4a81abe8d3c572af0793caecf456.csv.
Temp file for this run:
name matching temp outputs\final_resolutions95e583c5f8cc1dee36d32f22ac9bbd3c.csv


Resolving CAS IDs..: 100%|██████████| 210/210 [05:30<00:00,  1.58s/it]
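The caching improvement mentioned for the KNApSAcK search could look roughly like the sketch below. It is an illustration under stated assumptions, not how the library works: `cached_fetch` and `fake_genus_search` are hypothetical names, and the stand-in search returns made-up data. The idea is that each genus result is written to its own JSON file, so a failure on one genus does not discard the results already collected for the others.

```python
import json
import os
import tempfile


def cached_fetch(key, fetch_fn, cache_dir):
    """Return fetch_fn(key), reusing a JSON file in cache_dir if one exists."""
    os.makedirs(cache_dir, exist_ok=True)
    cache_path = os.path.join(cache_dir, f"{key}.json")
    if os.path.exists(cache_path):
        with open(cache_path, encoding="utf-8") as f:
            return json.load(f)
    result = fetch_fn(key)  # may raise; nothing is cached in that case
    with open(cache_path, "w", encoding="utf-8") as f:
        json.dump(result, f)
    return result


calls = []


def fake_genus_search(genus):
    # Hypothetical stand-in for a slow per-genus web search.
    calls.append(genus)
    return [{"genus": genus, "compound": "example compound"}]


with tempfile.TemporaryDirectory() as cache_dir:
    first = cached_fetch("Catalpa", fake_genus_search, cache_dir)
    second = cached_fetch("Catalpa", fake_genus_search, cache_dir)

# The second call is served from the cache, so the search ran only once.
print(first == second, len(calls))
```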


While the `temporary_output_folder` holds the raw data downloaded from the data sources, 'cleaned' versions of the downloaded data, with names resolved using `wcvpy`, should now be saved in the `tidy_outputs_folder`.

Now let's look at the given species and merge both datasets together. This will combine the data and add a `Standard_SMILES` column, which uses RDKit sanitization to standardise molecules and resolve them to parent fragments.

In [5]:
tidy_wiki_data = pd.read_csv(os.path.join(tidy_outputs_folder, 'wikidata_Bignoniaceae.csv'), index_col=0)
wiki_species_data = tidy_wiki_data[tidy_wiki_data['accepted_species'] == sp]

tidy_knapsack_data = pd.read_csv(os.path.join(tidy_outputs_folder, 'knapsack_data_Bignoniaceae.csv'), index_col=0)
knapsack_species_data = tidy_knapsack_data[tidy_knapsack_data['accepted_species'] == sp]

all_compounds_in_species = merge_and_tidy_compound_datasets([wiki_species_data,knapsack_species_data],
                                                            os.path.join(tidy_outputs_folder, 'Catalpa_bignonioides_data.csv'))



Standardising SMILES
Getting MAIP standardisation of SMILES


[16:18:51] Initializing MetalDisconnector
[16:18:51] Running MetalDisconnector
[16:18:51] Initializing Normalizer
[16:18:51] Running Normalizer
[16:18:51] Running LargestFragmentChooser

Now that we have our compound presence data, we can begin to enrich it. There are other methods in this library, but here we'll focus on NPClassifier. We first find the pathways for each compound, and then split these pathways into separate binary columns, i.e. they are one-hot encoded.

In [4]:
data_with_npclass_classes = get_npclassifier_classes_from_df(all_compounds_in_species, 'Standard_SMILES', temporary_output_folder)

0it [00:00, ?it/s]
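To make the one-hot encoding step concrete, here is a small pandas sketch on made-up rows. The column name `pathway`, the `;` separator and the placeholder compound names are assumptions for illustration; the actual column names produced by `get_npclassifier_classes_from_df` may differ.

```python
import pandas as pd

# Made-up example rows; 'pathway' and the ';' separator are assumptions.
df = pd.DataFrame({
    'compound': ['compound_1', 'compound_2', 'compound_3'],
    'pathway': ['Terpenoids', 'Alkaloids;Terpenoids', None],
})

# One binary column per pathway; a compound with no pathway gets all zeros.
one_hot = df['pathway'].str.get_dummies(sep=';')
encoded = pd.concat([df, one_hot], axis=1)
print(encoded)
```

A compound assigned to several pathways ends up with a 1 in each of the corresponding columns, which is why the binary columns are not mutually exclusive.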


Now let's tidy this all up by removing rows with missing values and dropping duplicate organism-compound pairs based on the defined `COMPOUND_ID_COL`.

In [5]:
COMPOUND_ID_COL = 'Standard_SMILES'

tidy_final_dataset(data_with_npclass_classes, os.path.join(tidy_outputs_folder, 'Catalpa_bignonioides_deduplicated_data.csv'), COMPOUND_ID_COL)
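For readers who want to see what this kind of tidying amounts to, the following pandas sketch drops rows with missing values and then removes duplicate organism-compound pairs. It is an approximation on made-up placeholder rows, not the implementation of `tidy_final_dataset`; in particular, which columns the library checks for missing values is an assumption here.

```python
import pandas as pd

COMPOUND_ID_COL = 'Standard_SMILES'

# Made-up placeholder rows: one duplicate pair and one missing compound ID.
df = pd.DataFrame({
    'accepted_species': ['species_a', 'species_a', 'species_a', 'species_b'],
    COMPOUND_ID_COL: ['CCO', 'CCO', None, 'CCO'],
})

pair_cols = ['accepted_species', COMPOUND_ID_COL]
tidy = (
    df.dropna(subset=pair_cols)            # remove rows with missing values
      .drop_duplicates(subset=pair_cols)   # keep one row per organism-compound pair
      .reset_index(drop=True)
)
print(tidy)
```

The same compound in two different species is kept twice, because deduplication is applied to the pair and not to the compound alone.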
