# Variationist Example 2: Custom Metrics

**Welcome! This example guide shows how to use your own custom metrics in 🕵️‍♀️ Variationist.**


In [1]:
# Install using pip
!pip install variationist

Collecting variationist
  Downloading variationist-0.1.4-py3-none-any.whl (73 kB)
Collecting altair==5.2.0 (from variationist)
  Downloading altair-5.2.0-py3-none-any.whl (996 kB)
Collecting datasets==2.17.1 (from variationist)
  Downloading datasets-2.17.1-py3-none-any.whl (536 kB)
Collecting emoji==2.10.1 (from variationist)
  Downloading emoji-2.10.1-py2.py3-none-any.whl (421 kB)
Collecting geopandas==0.14.3 (from variationist)
  Downloading geopandas-0.14.3-py3-none-any.whl (1.1

In [2]:
from variationist import Inspector, InspectorArgs, Visualizer, VisualizerArgs

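Load your own dataset in .tsv or .csv format, or use one that is available on the Hugging Face Hub. A Hugging Face dataset is specified as a string of the form `hf::<dataset>::<subset>::<split>`.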


In [3]:
# my_dataset = "my_dataset.tsv"
my_dataset = "hf::truthful_qa::generation::validation"
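To make the `hf::` string format concrete, here is a minimal sketch of how such an identifier can be split into its parts. It follows the loader's own log message (the third element is the subset, the last is the split). The names `HFDatasetSpec` and `parse_hf_identifier` are hypothetical helpers written for this illustration; they are not part of Variationist, and its internal parsing may differ.

```python
from typing import NamedTuple


class HFDatasetSpec(NamedTuple):
    """Hypothetical container for the parts of an 'hf::' identifier."""
    name: str
    subset: str
    split: str


def parse_hf_identifier(identifier: str) -> HFDatasetSpec:
    """Split 'hf::<dataset>::<subset>::<split>' into its three parts.

    Illustration only: this mirrors the format described in the loader's
    log message, not Variationist's actual implementation.
    """
    parts = identifier.split("::")
    if len(parts) != 4 or parts[0] != "hf":
        raise ValueError(
            f"expected 'hf::<dataset>::<subset>::<split>', got {identifier!r}"
        )
    _, name, subset, split = parts
    return HFDatasetSpec(name=name, subset=subset, split=split)


spec = parse_hf_identifier("hf::truthful_qa::generation::validation")
print(spec.name, spec.subset, spec.split)
```

With the `datasets` library, these three parts would correspond to a call such as `load_dataset(spec.name, spec.subset, split=spec.split)`; whether Variationist issues exactly this call is an assumption.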

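Let's use a metric of our own alongside Variationist's built-in ones.

A custom metric is a plain Python function. It receives the values of each variable, the corresponding subsets of texts, and the inspector arguments, and it returns a nested dictionary keyed first by variable name and then by value.

The metric below counts the 5 W-words (what, where, who, why, when) in the questions of each category of the `truthful_qa` dataset. For every question it adds one for each W-word found, so a question containing both "what" and "when" contributes two to its category's total. Note that the `w in text` check is case-sensitive, so a capitalized "What" is only matched if the texts have been lowercased.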

In [4]:
W_WORDS = ["what", "where", "who", "why", "when"]


def count_5w(label_values_dict, subsets_of_interest, inspector_args):
    """Count W-word matches in the texts of each value of each variable."""
    results_dict = {}
    # for each variable we are analyzing
    for label in label_values_dict:
        results_dict[label] = {}
        # we loop through each value that variable can take
        for i in range(len(label_values_dict[label])):
            value = label_values_dict[label][i]
            results_dict[label][value] = 0
            # we loop through each text in the subset we are currently analyzing
            for text in subsets_of_interest[label][i]:
                # add one for every W-word found in this text
                for w in W_WORDS:
                    if w in text:
                        results_dict[label][value] += 1

    return results_dict
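If you would rather count each question at most once, the following variant is one possible sketch. It counts the questions in each category that contain at least one W-word, compares whole words rather than substrings, and ignores case. The function name `count_questions_with_5w` and the toy data are made up for this illustration, and the code assumes each text arrives either as a string or as a list of tokens, which may not match exactly what Variationist passes in.

```python
W_WORD_SET = {"what", "where", "who", "why", "when"}


def count_questions_with_5w(label_values_dict, subsets_of_interest, inspector_args):
    """Count the texts per value that contain at least one W-word."""
    results_dict = {}
    for label, values in label_values_dict.items():
        results_dict[label] = {}
        for i, value in enumerate(values):
            n_matching = 0
            for text in subsets_of_interest[label][i]:
                # Assumption: a text is either a raw string or a list of tokens.
                tokens = text.split() if isinstance(text, str) else text
                # Lowercase and trim basic punctuation before comparing whole words.
                words = {str(tok).lower().strip("?,.!") for tok in tokens}
                if words & W_WORD_SET:
                    n_matching += 1
            results_dict[label][value] = n_matching
    return results_dict


# Toy inputs (hypothetical) with the same shape the metric above indexes into.
toy_labels = {"category": ["Science", "History"]}
toy_subsets = {
    "category": [
        ["What is water made of?", "When and why does ice melt?", "Is the sky blue?"],
        ["Who built the pyramids?", "Somewhere in Rome"],
    ]
}

toy_result = count_questions_with_5w(toy_labels, toy_subsets, None)
print(toy_result)  # {'category': {'Science': 2, 'History': 1}}
```

In the toy data, "When and why does ice melt?" is counted once even though it contains two W-words, and "Somewhere in Rome" is not counted because "where" only appears as part of a longer word.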

In [5]:
# Define the inspector arguments. For now, let's just select a handful of metrics.
inspector_args = InspectorArgs(
    text_names=["question"],
    var_names=["category"],
    metrics=["pmi", "stats", "npw_relevance", "ttr", "freq", count_5w],
    language="en",
    n_tokens=1,
    n_cooc=1,
)

In [6]:
# Create an inspector instance, run it, and get the results as JSON
results = Inspector(dataset=my_dataset, args=inspector_args).inspect()

INFO: No values have been set for var_types. Defaults to nominal.
INFO: No values have been set for var_semantics. Defaults to general.
INFO: The metadata we will be using for the current analysis are:
{'text_names': ['question'], 'var_names': ['category'], 'metrics': ['pmi', 'stats', 'npw_relevance', 'ttr', 'freq', 'count_5w'], 'var_types': ['nominal'], 'var_semantics': ['general'], 'var_subsets': None, 'var_bins': [0], 'tokenizer': 'whitespace', 'language': 'en', 'n_tokens': 1, 'n_cooc': 1, 'unique_cooc': False, 'cooc_window_size': 0, 'freq_cutoff': 3, 'stopwords': False, 'custom_stopwords': None, 'lowercase': False, 'ignore_null_var': False}
INFO: all column identifiers are treated as column names.
INFO: 'Loading hf::truthful_qa::generation::validation' as a HuggingFace dataset. We assume the third element in the specified string is the subset ("generation") and the last is the split ("validation").



INFO: Tokenizing the question column...


100%|██████████| 817/817 [00:00<00:00, 16718.53it/s]


INFO: Currently calculating metric: 'pmi'


100%|██████████| 38/38 [00:00<00:00, 5942.49it/s]


INFO: Currently calculating metric: 'stats'
INFO: Currently calculating metric: 'npw_relevance'


100%|██████████| 38/38 [00:00<00:00, 5765.57it/s]


INFO: Currently calculating metric: 'ttr'


100%|██████████| 38/38 [00:00<00:00, 4173.87it/s]


INFO: Currently calculating metric: 'freq'


100%|██████████| 38/38 [00:00<00:00, 6293.77it/s]

INFO: Currently calculating metric: 'count_5w'





In [7]:
# Define the visualizer arguments
visualizer_args = VisualizerArgs(
    output_folder="output",
    zoomable=True,
    ngrams=None,
    output_formats=["html", "png"],
)

# Create dynamic visualizations of the results
charts = Visualizer(input_json=results, args=visualizer_args).create()

Reading json data...
INFO: Creating a BarChart object for metric "pmi"...
INFO: Saving it to the filepath: "output/pmi/BarChart.html".
INFO: Saving it to the filepath: "output/pmi/BarChart.png".
INFO: Creating a BarChart object for metric "stats"...
INFO: Saving it to the filepath: "output/stats/StatsBarChart.html".
INFO: Saving it to the filepath: "output/stats/StatsBarChart.png".
INFO: Creating a BarChart object for metric "npw_relevance"...
INFO: Saving it to the filepath: "output/npw_relevance/BarChart.html".
INFO: Saving it to the filepath: "output/npw_relevance/BarChart.png".
INFO: Creating a BarChart object for metric "ttr"...
INFO: Saving it to the filepath: "output/ttr/DiversityBarChart.html".
INFO: Saving it to the filepath: "output/ttr/DiversityBarChart.png".
INFO: Creating a BarChart object for metric "freq"...
INFO: Saving it to the filepath: "output/freq/BarChart.html".
INFO: Saving it to the filepath: "output/freq/BarChart.png".
INFO: Creating a BarChart object for metri

In [8]:
charts["count_5w"]["BarChart"]