This tutorial will look at acquiring data for a single species and compiling it into a useful format. Note that the methods are easily extensible to genera, families and other taxonomic groupings.

First, let's import the required methods and define some directories to store data. In general, the methods output raw data and cached results into a defined temporary output folder, while cleaned outputs are saved into a tidy output folder.

To collect species data, we first collect phytochemicals for the entire family. This may seem like overkill for a single species and could be optimised in future, but in general we strongly recommend searching at a higher taxonomic level than the taxon of interest, then resolving the returned names and selecting the relevant data. There are two reasons for this: the nomenclature of families and orders is much more stable than that of species and genera, and, due to intricacies of searching KNApSAcK, searches are set up to work at the level of families.

In [None]:
import os
import pandas as pd

from phytochempy.data_compilation_utilities import merge_and_tidy_compound_datasets, tidy_final_dataset
from phytochempy.knapsack_searches import get_knapsack_data
from phytochempy.wikidata_searches import get_wikidata

temporary_output_folder = 'temp_outputs'
tidy_outputs_folder = 'outputs'

sp = 'Catalpa bignonioides'
relevant_family = 'Bignoniaceae'
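
The functions used below write into both folders. As a precaution, and assuming the library does not create missing directories itself (this is not confirmed here), the folders can be created up front with the standard library:

```python
import os

# Same folder names as in the cell above.
temporary_output_folder = 'temp_outputs'
tidy_outputs_folder = 'outputs'

# exist_ok=True makes this safe to re-run when the folders already exist.
for folder in (temporary_output_folder, tidy_outputs_folder):
    os.makedirs(folder, exist_ok=True)
```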

Now let's get the Wikidata data. This first requires the Wikidata ID for the clade of interest. We have added a utility function to get this ID (`get_wikidata_id_for_taxon`), but recommend checking the result manually (for example by searching on https://www.wikidata.org/), because a simple search can fail for a few reasons.

In [None]:
wiki_data_id_for_higher_taxon = 'Q213453' # This is the ID for Bignoniaceae

get_wikidata(wiki_data_id_for_higher_taxon, os.path.join(temporary_output_folder, 'wikidata_Bignoniaceae.csv'),
             os.path.join(tidy_outputs_folder, 'wikidata_Bignoniaceae.csv'))
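
To make the manual check of the ID a little easier, a small helper can validate the ID format and build the address of the corresponding Wikidata page, which can then be opened in a browser. This is only a sketch: `wikidata_entity_url` is a hypothetical helper written for this tutorial and is not part of `phytochempy`.

```python
import re


def wikidata_entity_url(wikidata_id: str) -> str:
    """Hypothetical helper: return the Wikidata page URL for an item ID such as 'Q213453'."""
    # Wikidata item IDs are the letter Q followed by a positive integer.
    if not re.fullmatch(r'Q[1-9]\d*', wikidata_id):
        raise ValueError(f'Not a Wikidata item ID: {wikidata_id!r}')
    return f'https://www.wikidata.org/wiki/{wikidata_id}'


url_to_check = wikidata_entity_url('Q213453')
print(url_to_check)  # https://www.wikidata.org/wiki/Q213453
```

Opening the printed address should show the page for the taxon you expect, here Bignoniaceae.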

Now let's get the KNApSAcK data too. Note that there are a few reasons this might fail for some of the genera in a family, in which case the process raises an error and does not save the data. If you run into this problem, please raise an issue and I will try to find a solution.

In [None]:
list_of_families_to_search = [relevant_family]
get_knapsack_data(list_of_families_to_search, temporary_output_folder, os.path.join(tidy_outputs_folder, 'knapsack_data_Bignoniaceae.csv'))

The `temporary_output_folder` holds the raw data downloaded from the data sources, while 'cleaned' versions of the downloaded data should now be saved in the `tidy_outputs_folder`. The cleaned versions include names resolved using `wcvpy`.

Now that we have the resolved names, we can filter the records based on accepted names.

In [2]:
tidy_wiki_data = pd.read_csv(os.path.join(tidy_outputs_folder, 'wikidata_Bignoniaceae.csv'), index_col=0)
wiki_species_data = tidy_wiki_data[tidy_wiki_data['accepted_species'] == sp]

tidy_knapsack_data = pd.read_csv(os.path.join(tidy_outputs_folder, 'knapsack_data_Bignoniaceae.csv'), index_col=0)
knapsack_species_data = tidy_knapsack_data[tidy_knapsack_data['accepted_species'] == sp]
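
The filtering step above is ordinary pandas boolean indexing. The sketch below shows it on a tiny made-up table; only the `accepted_species` column name comes from the tidy outputs, while `example_compound` and the rows themselves are invented for illustration.

```python
import pandas as pd

# Made-up records for illustration only; real tidy outputs contain more columns.
toy_data = pd.DataFrame({
    'example_compound': ['compound_a', 'compound_b', 'compound_c'],
    'accepted_species': ['Catalpa bignonioides', 'Catalpa ovata', 'Catalpa bignonioides'],
})

sp = 'Catalpa bignonioides'

# The comparison builds a boolean mask; indexing with it keeps only the True rows.
is_target_species = toy_data['accepted_species'] == sp
toy_species_data = toy_data[is_target_species]

print(len(toy_species_data))  # 2
```

Because the comparison is made on the resolved accepted name, it does not depend on how each source originally spelled the organism name.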

Now you have the species data!

If you want to continue enriching the data, the utility function `merge_and_tidy_compound_datasets` will merge the acquired datasets and add some useful columns. In particular, a `Standard_SMILES` column is calculated from the SMILES strings given in the data, using RDKit sanitization to standardise molecules and resolve them to parent fragments.

In [None]:
all_compounds_in_species = merge_and_tidy_compound_datasets([wiki_species_data, knapsack_species_data],
                                                            os.path.join(tidy_outputs_folder, 'Catalpa_bignonioides_data.csv'))
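
To give a feel for what standardising a SMILES string can involve, here is a minimal sketch using RDKit's `rdMolStandardize` module. It is an assumption about the general approach, not the code `phytochempy` actually runs, and `standardise_smiles` is a hypothetical name.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def standardise_smiles(smiles):
    """Hypothetical helper: return a SMILES for the standardised parent fragment, or None."""
    # MolFromSmiles parses and sanitizes; it returns None if the string cannot be parsed.
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    # Cleanup applies RDKit's default standardisation steps.
    cleaned = rdMolStandardize.Cleanup(mol)
    # FragmentParent keeps the main fragment, dropping counterions such as sodium.
    parent = rdMolStandardize.FragmentParent(cleaned)
    return Chem.MolToSmiles(parent)


# Sodium acetate is used here purely as a convenient two-fragment example.
example = standardise_smiles('CC(=O)[O-].[Na+]')
print(example)
```

Standardising in this way means that the same compound reported as different salts or written with different SMILES conventions is more likely to end up with one shared identifier.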

Having all the data is great, but you probably won't want to keep duplicate organism-compound pairs. A utility function `tidy_final_dataset` will deduplicate the data for you for a given compound ID.

In [4]:
COMPOUND_ID_COL = 'Standard_SMILES'
TAXON_GROUPING = 'accepted_species'

tidy_final_dataset(all_compounds_in_species,
                   os.path.join(tidy_outputs_folder, 'Catalpa_bignonioides_deduplicated_data.csv'),
                   COMPOUND_ID_COL, TAXON_GROUPING)
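
The idea of deduplicating organism-compound pairs can be illustrated with plain pandas. This is a sketch of the concept on made-up rows, not the implementation of `tidy_final_dataset`; the `example_source` column and the placeholder SMILES are invented and do not represent real findings for this species.

```python
import pandas as pd

COMPOUND_ID_COL = 'Standard_SMILES'
TAXON_GROUPING = 'accepted_species'

# Made-up records: the same species-compound pair reported by two sources.
toy_records = pd.DataFrame({
    'accepted_species': ['Catalpa bignonioides'] * 3,
    'Standard_SMILES': ['CCO', 'CCO', 'CC(=O)O'],
    'example_source': ['source_1', 'source_2', 'source_1'],
})

# Keep the first record for each (taxon, compound) pair and drop the rest.
toy_deduplicated = toy_records.drop_duplicates(subset=[TAXON_GROUPING, COMPOUND_ID_COL], keep='first')

print(len(toy_deduplicated))  # 2
```

Choosing a different compound ID column changes what counts as a duplicate, so it is worth picking the identifier that best matches your question.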