First, let's import the methods and data used in the acquiring data tutorial.

In [None]:
import os
import pandas as pd

from phytochempy.compound_properties import get_npclassifier_classes_from_df
from phytochempy.chemical_diversity_metrics import calculate_FAD_measures, get_pathway_based_diversity_measures

temporary_output_folder = 'temp_outputs'
tidy_outputs_folder = 'outputs'

COMPOUND_ID_COL = 'Standard_SMILES'
TAXON_GROUPING = 'accepted_species'

Catalpa_bignonioides_deduplicated_data = pd.read_csv(os.path.join(tidy_outputs_folder, 'Catalpa_bignonioides_deduplicated_data.csv'), index_col=0)

Before continuing, let's get the NPClassifier class information for the compounds. There are other methods in this library for enriching data, but here we'll focus on NPClassifier because it allows some of the diversity metrics to be calculated.

In [None]:
data_with_npclass_classes = get_npclassifier_classes_from_df(Catalpa_bignonioides_deduplicated_data, COMPOUND_ID_COL, temporary_output_folder)

Next, remove groups that contain only a single compound. This isn't strictly necessary, but some diversity indices are poorly defined for a group containing only a single compound.

In [3]:
# Remove taxon groups (here, species) with only a single known compound prior to calculations
counts = data_with_npclass_classes.value_counts(TAXON_GROUPING)
groups_with_single_compounds = pd.DataFrame({TAXON_GROUPING: counts.index, 'N': counts.values})
groups_with_single_compounds = groups_with_single_compounds[groups_with_single_compounds['N'] < 2][TAXON_GROUPING].values.tolist()

group_compound_data = data_with_npclass_classes[~data_with_npclass_classes[TAXON_GROUPING].isin(groups_with_single_compounds)]
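
To see why single-compound groups are worth removing, consider any measure built from pairwise distances between compounds, such as a mean pairwise distance. The sketch below is illustrative only: `mean_pairwise_distance` and `toy_distance` are hypothetical helpers, not part of phytochempy, and the numbers stand in for real compounds. With a single compound there are no pairs, so the mean is undefined.

```python
from itertools import combinations
import math


def mean_pairwise_distance(items, distance):
    """Mean distance over all unordered pairs; NaN when there are no pairs."""
    pairs = list(combinations(items, 2))
    if not pairs:
        # A group with fewer than two items has no pairs to average over
        return math.nan
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


def toy_distance(a, b):
    """Stand-in distance between two toy 'compounds' (assumption for illustration)."""
    return abs(a - b)


three_compounds = mean_pairwise_distance([0.0, 1.0, 3.0], toy_distance)
one_compound = mean_pairwise_distance([2.0], toy_distance)
```

Here `three_compounds` is well defined, while `one_compound` is NaN, which is the kind of value the filtering step avoids.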

Now we can start calculating chemodiversity indices for groups of compounds in the dataset.

In [None]:
FAD_measures = calculate_FAD_measures(group_compound_data, TAXON_GROUPING)

The NPClassifier method finds the pathways for each compound and then one-hot encodes them, i.e. separates them into binary columns, based on the taxon grouping and compound ID columns. For each group of compounds, these columns are used to calculate the average number of compounds falling into each pathway, and these values are finally used to calculate the pathway diversity measures.

In [None]:
abundance_diversity = get_pathway_based_diversity_measures(group_compound_data, TAXON_GROUPING, COMPOUND_ID_COL)
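
The pathway-based calculation described above can be sketched on toy data. This is a minimal illustration under stated assumptions, not phytochempy's implementation: the toy DataFrame, the pathway column names, and the choice of the Shannon index as the diversity measure are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row per (taxon, compound) with one-hot pathway columns
toy = pd.DataFrame({
    'accepted_species': ['A', 'A', 'A', 'A', 'B', 'B'],
    'Standard_SMILES': ['c1', 'c2', 'c3', 'c4', 'c5', 'c6'],
    'Terpenoids': [1, 1, 0, 0, 1, 1],
    'Alkaloids': [0, 0, 1, 1, 0, 0],
})
pathway_cols = ['Terpenoids', 'Alkaloids']

# Mean of each binary column per group: the share of the group's compounds in each pathway
mean_per_pathway = toy.groupby('accepted_species')[pathway_cols].mean()

# Normalise so that each group's pathway values sum to 1
proportions = mean_per_pathway.div(mean_per_pathway.sum(axis=1), axis=0)


def shannon(p):
    """Shannon index of a vector of proportions, ignoring empty pathways."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())


toy_shannon = proportions.apply(shannon, axis=1)
```

In this toy case, species A is split evenly across two pathways and gets a Shannon index of ln 2, while species B has compounds in only one pathway and gets an index of zero.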