# Variationist Quickstart

**Welcome! This quickstart guide shows the basics of how Variationist works and a sample of what you can do with it.**

🕵️‍♀️ Variationist is a highly modular, flexible, and customizable tool to analyze and explore language variation and bias in written language data. It allows researchers, from NLP practitioners to linguists and social scientists, to investigate language use across a wide range of use cases.



## Installing Variationist

You can install Variationist either with `pip` or by cloning our GitHub repo.

In [1]:
# Install using pip
!pip install variationist

Collecting variationist
  Downloading variationist-0.1.4-py3-none-any.whl (73 kB)
Collecting altair==5.2.0 (from variationist)
  Downloading altair-5.2.0-py3-none-any.whl (996 kB)
Collecting datasets==2.17.1 (from variationist)
  Downloading datasets-2.17.1-py3-none-any.whl (536 kB)
Collecting emoji==2.10.1 (from variationist)
  Downloading emoji-2.10.1-py2.py3-none-any.whl (421 kB)
Collecting geopandas==0.14.3 (from variationist)
  Downloading geopandas-0.14.3-py3-none-any.whl (1.1

In [2]:
# If your dataset has a large number of examples, you might need this to display large charts
!pip install "vegafusion[embed]>=1.4.0"
import altair as alt
alt.data_transformers.enable("vegafusion")

Collecting vegafusion[embed]>=1.4.0
  Downloading vegafusion-1.6.9-py3-none-any.whl (54 kB)
Collecting vegafusion-python-embed==1.6.9 (from vegafusion[embed]>=1.4.0)
  Downloading vegafusion_python_embed-1.6.9-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (25.1 MB)
Installing collected packages: vegafusion-python-embed, vegafusion
Successfully installed vegafusion-1.6.9 vegafusion-python-embed-1.6.9


DataTransformerRegistry.enable('vegafusion')

In [3]:
from variationist import Inspector, InspectorArgs, Visualizer, VisualizerArgs

Load your own dataset in .tsv or .csv format, or use a dataset that is available on Hugging Face. A Hugging Face dataset is specified as `hf::<dataset name>::<split>`.


In [4]:
# my_dataset = "my_dataset.tsv"
my_dataset = "hf::yelp_review_full::train"
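The `hf::<name>::<split>` string convention can be illustrated with a small parser. This is only a sketch: `DatasetSpec` and `parse_dataset_spec` are hypothetical names invented here, not part of Variationist, and the only rule taken from the library is the one its own log output reports, namely that the last element of the string is treated as the split.

```python
from typing import NamedTuple, Optional


class DatasetSpec(NamedTuple):
    source: str           # "hf" for a Hugging Face dataset, "file" for a local path
    name: str             # dataset identifier or file path
    split: Optional[str]  # split name for Hugging Face datasets, otherwise None


def parse_dataset_spec(spec: str) -> DatasetSpec:
    """Hypothetical helper: split an 'hf::<name>::<split>' string into its parts."""
    if spec.startswith("hf::"):
        parts = spec.split("::")
        if len(parts) < 3:
            raise ValueError(f"expected 'hf::<name>::<split>', got {spec!r}")
        # Assumption mirrored from the log output: the last element is the split.
        return DatasetSpec("hf", "::".join(parts[1:-1]), parts[-1])
    # Anything else is treated as a local .tsv/.csv path in this sketch.
    return DatasetSpec("file", spec, None)


hf_spec = parse_dataset_spec("hf::yelp_review_full::train")
file_spec = parse_dataset_spec("my_dataset.tsv")
print(hf_spec)
print(file_spec)
```

How Variationist actually parses this string internally may differ; the sketch only makes the naming convention explicit.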

In [5]:
# Define the inspector arguments. For now, let's just select a handful of metrics.
inspector_args = InspectorArgs(
    text_names=["text"],
    var_names=["label"],
    metrics=["pmi", "stats", "npw_relevance", "ttr", "freq"],
    stopwords=True,
    language="en",
    n_tokens=1,
    n_cooc=1,
)
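To give an intuition for what a metric such as `pmi` measures, here is a minimal sketch of textbook pointwise mutual information between a token and a label, computed on an invented toy corpus. It is not Variationist's implementation, which may differ in tokenization, smoothing, and normalization; the function name `pmi_by_label` and the data are assumptions made for illustration.

```python
import math
from collections import Counter

# Toy corpus of (text, label) pairs; invented for illustration only.
docs = [
    ("great pizza great service", 4),
    ("great staff", 4),
    ("cold pizza rude staff", 0),
    ("rude service", 0),
]


def pmi_by_label(docs):
    """Return {label: {token: PMI(token, label)}} using whitespace tokens."""
    token_label = Counter()  # joint counts of (token, label)
    token_total = Counter()  # marginal counts of each token
    label_total = Counter()  # number of tokens seen under each label
    for text, label in docs:
        for token in text.split():
            token_label[(token, label)] += 1
            token_total[token] += 1
            label_total[label] += 1
    n = sum(token_total.values())
    scores = {}
    for (token, label), joint in token_label.items():
        # PMI = log2( P(token, label) / (P(token) * P(label)) )
        pmi = math.log2(joint * n / (token_total[token] * label_total[label]))
        scores.setdefault(label, {})[token] = pmi
    return scores


scores = pmi_by_label(docs)
for label, token_scores in scores.items():
    print(label, sorted(token_scores.items(), key=lambda kv: -kv[1]))
```

In this toy corpus a token that occurs only under one label (such as "great") gets a positive score for that label, while a token spread evenly across both labels (such as "pizza") scores zero.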

In [6]:
# Create an inspector instance, run it, and get the results as JSON
results = Inspector(dataset=my_dataset, args=inspector_args).inspect()

INFO: No values have been set for var_types. Defaults to nominal.
INFO: No values have been set for var_semantics. Defaults to general.
INFO: The metadata we will be using for the current analysis are:
{'text_names': ['text'], 'var_names': ['label'], 'metrics': ['pmi', 'stats', 'npw_relevance', 'ttr', 'freq'], 'var_types': ['nominal'], 'var_semantics': ['general'], 'var_subsets': None, 'var_bins': [0], 'tokenizer': 'whitespace', 'language': 'en', 'n_tokens': 1, 'n_cooc': 1, 'unique_cooc': False, 'cooc_window_size': 0, 'freq_cutoff': 3, 'stopwords': True, 'custom_stopwords': None, 'lowercase': False, 'ignore_null_var': False}
INFO: all column identifiers are treated as column names.
INFO: 'Loading hf::yelp_review_full::train' as a HuggingFace dataset. We assume the last element in the specified string is the split ("train").


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Downloading readme:   0%|          | 0.00/6.72k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/299M [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/23.5M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/650000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/50000 [00:00<?, ? examples/s]

INFO: Tokenizing the text column...


100%|██████████| 650000/650000 [03:09<00:00, 3432.04it/s]


INFO: Currently calculating metric: 'pmi'


100%|██████████| 5/5 [00:15<00:00,  3.18s/it]


INFO: Currently calculating metric: 'stats'
INFO: Currently calculating metric: 'npw_relevance'


100%|██████████| 5/5 [00:14<00:00,  2.80s/it]


INFO: Currently calculating metric: 'ttr'


100%|██████████| 5/5 [00:06<00:00,  1.22s/it]


INFO: Currently calculating metric: 'freq'


100%|██████████| 5/5 [00:12<00:00,  2.48s/it]
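The `ttr` metric computed above refers to the type-token ratio, a common measure of lexical diversity. A minimal sketch of the standard definition (distinct tokens divided by total tokens, per label) follows; the toy data and the name `ttr_by_label` are assumptions for illustration, and Variationist's own computation may handle tokenization and stopwords differently.

```python
from collections import defaultdict

# Toy (text, label) pairs; invented for illustration only.
docs = [
    ("great pizza great service", 4),
    ("cold pizza rude staff", 0),
]


def ttr_by_label(docs):
    """Return {label: type-token ratio} using whitespace tokenization."""
    tokens = defaultdict(list)
    for text, label in docs:
        tokens[label].extend(text.split())
    # TTR = number of distinct tokens (types) / total number of tokens
    return {
        label: len(set(toks)) / len(toks)
        for label, toks in tokens.items()
        if toks
    }


ratios = ttr_by_label(docs)
print(ratios)
```

A ratio close to 1 means almost every token is distinct; repeated words pull the ratio down.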


In [7]:
# Define the visualizer arguments
visualizer_args = VisualizerArgs(
    output_folder="output",
    output_formats=["html", "png"],
)

# Create dynamic visualizations of the results
charts = Visualizer(input_json=results, args=visualizer_args).create()

Reading json data...
INFO: Creating a BarChart object for metric "pmi"...
INFO: Saving it to the filepath: "output/pmi/BarChart.html".
The dataset is too big to be serialized as PNG efficiently. Please use the interactive HTML.
INFO: Creating a BarChart object for metric "stats"...
INFO: Saving it to the filepath: "output/stats/StatsBarChart.html".
The dataset is too big to be serialized as PNG efficiently. Please use the interactive HTML.
INFO: Creating a BarChart object for metric "npw_relevance"...
INFO: Saving it to the filepath: "output/npw_relevance/BarChart.html".
The dataset is too big to be serialized as PNG efficiently. Please use the interactive HTML.
INFO: Creating a BarChart object for metric "ttr"...
INFO: Saving it to the filepath: "output/ttr/DiversityBarChart.html".
The dataset is too big to be serialized as PNG efficiently. Please use the interactive HTML.
INFO: Creating a BarChart object for metric "freq"...
INFO: Saving it to the filepath: "output/freq/BarChart.html
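The log above suggests that each chart is written to `<output_folder>/<metric>/<ChartName>.html`. Assuming that layout, a short standard-library sketch can list what was saved. The helper name `list_saved_charts` is hypothetical, and the demo builds a throwaway folder rather than reading real Variationist output.

```python
import tempfile
from pathlib import Path


def list_saved_charts(output_folder="output"):
    """Return {metric: [chart file names]} for HTML charts under output_folder.

    Assumes the <output_folder>/<metric>/<ChartName>.html layout seen in the log.
    """
    charts = {}
    for path in sorted(Path(output_folder).glob("*/*.html")):
        charts.setdefault(path.parent.name, []).append(path.name)
    return charts


# Demo on a temporary folder that mimics the layout shown in the log.
with tempfile.TemporaryDirectory() as tmp:
    for metric, name in [("pmi", "BarChart.html"), ("ttr", "DiversityBarChart.html")]:
        (Path(tmp) / metric).mkdir()
        (Path(tmp) / metric / name).write_text("<html></html>")
    demo = list_saved_charts(tmp)

print(demo)
```

After a real run, calling `list_saved_charts("output")` would be one way to check which interactive HTML files are available when PNG export is skipped.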

In [8]:
charts["freq"]["BarChart"]

In [9]:
charts["npw_relevance"]["BarChart"]

Hm. It looks like "food" and "time" are very common across all of our labels, so this does not tell us much. Let's try filtering them out with the `custom_stopwords` parameter of `InspectorArgs`.



In [10]:
inspector_args = InspectorArgs(
    text_names=["text"],
    var_names=["label"],
    metrics=["pmi", "stats", "npw_relevance", "ttr", "freq"],
    stopwords=True,
    custom_stopwords=["food", "time"],
    language="en",
    n_tokens=1,
    n_cooc=1,
)
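Conceptually, a custom stopword list removes the listed tokens before any counting takes place, so words that dominate every label no longer crowd out more informative ones. The sketch below shows the idea on invented reviews with plain Python; `top_tokens` is a hypothetical helper and does not reproduce how Variationist applies `custom_stopwords` internally.

```python
from collections import Counter


def top_tokens(texts, stopwords=frozenset(), k=3):
    """Return the k most frequent whitespace tokens that are not stopwords."""
    counts = Counter(
        tok for text in texts for tok in text.split() if tok not in stopwords
    )
    return [tok for tok, _ in counts.most_common(k)]


# Toy reviews; invented for illustration only.
reviews = [
    "food was great",
    "food took a long time",
    "great food on time",
    "food again",
]

before = top_tokens(reviews, k=1)
after = top_tokens(reviews, stopwords={"food", "time"})
print(before)
print(after)
```

Matching here is exact and case-sensitive, which is consistent with the `'lowercase': False` setting shown in the logged metadata.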

In [11]:
# Create an inspector instance, run it, and get the results as JSON
results = Inspector(dataset=my_dataset, args=inspector_args).inspect()

INFO: No values have been set for var_types. Defaults to nominal.
INFO: No values have been set for var_semantics. Defaults to general.
INFO: The metadata we will be using for the current analysis are:
{'text_names': ['text'], 'var_names': ['label'], 'metrics': ['pmi', 'stats', 'npw_relevance', 'ttr', 'freq'], 'var_types': ['nominal'], 'var_semantics': ['general'], 'var_subsets': None, 'var_bins': [0], 'tokenizer': 'whitespace', 'language': 'en', 'n_tokens': 1, 'n_cooc': 1, 'unique_cooc': False, 'cooc_window_size': 0, 'freq_cutoff': 3, 'stopwords': True, 'custom_stopwords': ['food', 'time'], 'lowercase': False, 'ignore_null_var': False}
INFO: all column identifiers are treated as column names.
INFO: 'Loading hf::yelp_review_full::train' as a HuggingFace dataset. We assume the last element in the specified string is the split ("train").
INFO: Tokenizing the text column...


100%|██████████| 650000/650000 [03:12<00:00, 3383.15it/s]


INFO: Currently calculating metric: 'pmi'


100%|██████████| 5/5 [00:11<00:00,  2.34s/it]


INFO: Currently calculating metric: 'stats'
INFO: Currently calculating metric: 'npw_relevance'


100%|██████████| 5/5 [00:11<00:00,  2.20s/it]


INFO: Currently calculating metric: 'ttr'


100%|██████████| 5/5 [00:05<00:00,  1.15s/it]


INFO: Currently calculating metric: 'freq'


100%|██████████| 5/5 [00:11<00:00,  2.23s/it]


In [13]:
# Define the visualizer arguments
visualizer_args = VisualizerArgs(
    output_folder="output",
    output_formats=["html", "png"],
)

# Create dynamic visualizations of the results
charts = Visualizer(input_json=results, args=visualizer_args).create()

Reading json data...
INFO: Creating a BarChart object for metric "pmi"...
INFO: Saving it to the filepath: "output/pmi/BarChart.html".
The dataset is too big to be serialized as PNG efficiently. Please use the interactive HTML.
INFO: Creating a BarChart object for metric "stats"...
INFO: Saving it to the filepath: "output/stats/StatsBarChart.html".
The dataset is too big to be serialized as PNG efficiently. Please use the interactive HTML.
INFO: Creating a BarChart object for metric "npw_relevance"...
INFO: Saving it to the filepath: "output/npw_relevance/BarChart.html".
The dataset is too big to be serialized as PNG efficiently. Please use the interactive HTML.
INFO: Creating a BarChart object for metric "ttr"...
INFO: Saving it to the filepath: "output/ttr/DiversityBarChart.html".
The dataset is too big to be serialized as PNG efficiently. Please use the interactive HTML.
INFO: Creating a BarChart object for metric "freq"...
INFO: Saving it to the filepath: "output/freq/BarChart.html

In [14]:
charts["npw_relevance"]["BarChart"]