# Variationist Example 1: Custom Tokenizers

**Welcome! This example guide shows how to use your own custom tokenizer inside 🕵️‍♀️ Variationist.**

In [1]:
# Install using pip
!pip install --upgrade variationist



In [2]:
from variationist import Inspector, InspectorArgs, Visualizer, VisualizerArgs

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


Load your own dataset in .tsv or .csv format, or use a dataset that is available on HuggingFace. This time, we will use the test set of the `yelp_review_full` dataset.


In [20]:
my_dataset = "hf::yelp_review_full::test"
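The dataset string above packs several pieces of information into one value separated by `::`. As a rough illustration only (this is not Variationist's own parser, and the helper name `split_dataset_string` is hypothetical), the string can be read as a source prefix, a dataset name, and a final element that is treated as the split:

```python
def split_dataset_string(dataset: str) -> dict:
    """Illustrative helper (hypothetical, not part of Variationist):
    break an 'hf::<name>::...::<split>' string into its parts."""
    parts = dataset.split("::")
    if len(parts) < 3 or parts[0] != "hf":
        raise ValueError(f"Expected 'hf::<name>::<split>', got: {dataset!r}")
    return {
        "source": parts[0],   # the "hf" prefix
        "name": parts[1],     # dataset name on the hub
        "extra": parts[2:-1], # anything between name and split (assumption)
        "split": parts[-1],   # the last element is taken as the split
    }


parsed = split_dataset_string("hf::yelp_review_full::test")
print(parsed["name"], parsed["split"])
```

This only mirrors the convention described in the log message further down ("the last element in the specified string is the split"); the library's actual handling may differ.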

Let's define a custom tokenization function. In this case, it will just consider characters as _units_.

In [21]:
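def char_tokenization_fn(text_column, inspector_args):
    # Turn every text into a list of its single characters.
    # inspector_args is not used here; it is kept to preserve the original signature.
    tok_lines = text_column.apply(list)
    return tok_lines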
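To make the idea of characters as units concrete, the sketch below shows what `list` does to a single string, and how the same pattern could be extended to overlapping character bigrams. The names `char_ngrams` and `char_bigram_tokenization_fn` are hypothetical, and the sketch assumes, as the function above does, that `text_column` exposes an `.apply` method (as a pandas Series does).

```python
def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of a string as a list."""
    text = str(text)
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def char_bigram_tokenization_fn(text_column, inspector_args):
    # Hypothetical variant with the same signature as char_tokenization_fn.
    # inspector_args is unused; text_column is assumed to provide .apply().
    return text_column.apply(char_ngrams)


# What the character tokenizer does to one text:
single_chars = list("Great food!")
print(single_chars)

# The bigram variant applied to one text:
bigrams = char_ngrams("Great", n=2)
print(bigrams)
```

Note that spaces and punctuation become units too, since no filtering is applied at the tokenization step.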

Now we set the `tokenizer` argument to our custom function.

In [22]:
# Define the inspector arguments. For now, let's just select a handful of metrics.
inspector_args = InspectorArgs(text_names=["text"],
                               var_names=["label"],
                               metrics=["pmi",
                                        "stats",
                                        "npw_relevance",
                                        "ttr",
                                        "freq"],
                               tokenizer=char_tokenization_fn,
                               stopwords=True,
                               language="en",
                               n_tokens=1,
                               n_cooc=1)

In [23]:
# Create an inspector instance, run it, and get the results in json
results = Inspector(dataset=my_dataset, args=inspector_args).inspect()

INFO: No values have been set for var_types. Defaults to nominal.
INFO: No values have been set for var_semantics. Defaults to general.
INFO: The metadata we will be using for the current analysis are:
{'text_names': ['text'], 'var_names': ['label'], 'metrics': ['pmi', 'stats', 'npw_relevance', 'ttr', 'freq'], 'var_types': ['nominal'], 'var_semantics': ['general'], 'var_subsets': None, 'var_bins': [0], 'tokenizer': 'char_tokenization_fn', 'language': 'en', 'n_tokens': 1, 'n_cooc': 1, 'unique_cooc': False, 'cooc_window_size': 0, 'freq_cutoff': 3, 'stopwords': True, 'custom_stopwords': None, 'lowercase': False, 'ignore_null_var': False}
INFO: all column identifiers are treated as column names.
INFO: 'Loading hf::yelp_review_full::test' as a HuggingFace dataset. We assume the last element in the specified string is the split ("test").
INFO: Tokenizing the text column...
INFO: Currently calculating metric: 'pmi'


100%|██████████| 5/5 [00:00<00:00, 10.64it/s]


INFO: Currently calculating metric: 'stats'
INFO: Currently calculating metric: 'npw_relevance'


100%|██████████| 5/5 [00:00<00:00, 10.64it/s]


INFO: Currently calculating metric: 'ttr'


100%|██████████| 5/5 [00:00<00:00, 30.79it/s]


INFO: Currently calculating metric: 'freq'


100%|██████████| 5/5 [00:00<00:00, 10.85it/s]


In [24]:
# Define the visualizer arguments
visualizer_args = VisualizerArgs(
    output_folder="output", zoomable=True, ngrams=None, output_formats=["html", "png"])

# Create dynamic visualizations of the results
charts = Visualizer(input_json=results, args=visualizer_args).create()

Reading json data...
INFO: Creating a BarChart object for metric "pmi"...
INFO: Saving it to the filepath: "output/pmi/BarChart.html".
INFO: Saving it to the filepath: "output/pmi/BarChart.png".
INFO: Creating a BarChart object for metric "stats"...
INFO: Saving it to the filepath: "output/stats/StatsBarChart.html".
INFO: Saving it to the filepath: "output/stats/StatsBarChart.png".
INFO: Creating a BarChart object for metric "npw_relevance"...
INFO: Saving it to the filepath: "output/npw_relevance/BarChart.html".
INFO: Saving it to the filepath: "output/npw_relevance/BarChart.png".
INFO: Creating a BarChart object for metric "ttr"...
INFO: Saving it to the filepath: "output/ttr/DiversityBarChart.html".
INFO: Saving it to the filepath: "output/ttr/DiversityBarChart.png".
INFO: Creating a BarChart object for metric "freq"...
INFO: Saving it to the filepath: "output/freq/BarChart.html".
INFO: Saving it to the filepath: "output/freq/BarChart.png".


In [25]:
charts["stats"]["BarChart"]