# Use case 1: Language Variation Across Space

Welcome! In this use case, we show how to use 🕵️‍♀️ Variationist to explore language variation across space.

## Preliminaries

As a first step, we install 🕵️‍♀️ Variationist and import the needed components.

In [1]:
# Install Variationist from PyPI using pip
!pip install variationist

# Import the library
from variationist import Inspector, InspectorArgs, Visualizer, VisualizerArgs

# If your dataset has a large number of examples, you might need this to display large charts
!pip install "vegafusion[embed]>=1.4.0"
import altair as alt
alt.data_transformers.enable("vegafusion")



DataTransformerRegistry.enable('vegafusion')

In [11]:
# Specify the dataset filepath
dataset = "../data/diatopit.tsv"

### 1) Define the inspector arguments

We use `text` as our text column and `region` as our variable (`nominal` type with `spatial` semantics).

We use `npw_relevance` as our metric and single tokens (`n_tokens=1`) as our unit of language, obtained with the default whitespace tokenizer.

For preprocessing, we request stopword removal for Italian (`it`), which uses a default *stopwords-iso* list, and extend that list with some extra tokens (the anonymization placeholders `user` and `url`) through the `custom_stopwords` parameter. We also request lowercasing of our units.

In [46]:
ins_args = InspectorArgs(
    text_names=["text"],
    var_names=["region"], var_types=["nominal"], var_semantics=["spatial"],
    n_tokens=1,
    metrics=["npw_relevance"],
    stopwords=True, language="it", custom_stopwords=["user", "url"], lowercase=True)
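
To make the preprocessing options above more concrete, here is a minimal sketch of what lowercasing, whitespace tokenization, and stopword removal amount to. It is an illustration only and not Variationist's internal code: the `preprocess` helper is hypothetical, the order of the steps is an assumption, and the tiny `ITALIAN_STOPWORDS` set is a stand-in for the much larger *stopwords-iso* list.

```python
# Illustrative sketch only: this is NOT Variationist's internal code.
# The tiny stopword set is a hypothetical stand-in for the stopwords-iso list.
ITALIAN_STOPWORDS = {"il", "la", "di", "e", "che"}
CUSTOM_STOPWORDS = {"user", "url"}


def preprocess(text, stopwords):
    """Lowercase the text, split it on whitespace, and drop stopwords."""
    tokens = text.lower().split()
    return [token for token in tokens if token not in stopwords]


all_stopwords = ITALIAN_STOPWORDS | CUSTOM_STOPWORDS
example_tokens = preprocess("USER Daje che la partita inizia URL", all_stopwords)
print(example_tokens)
```

In this sketch, lowercasing happens before filtering, so the placeholders are removed regardless of how they are capitalized in the text.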

### 2) Run the inspector and get the results

We then initialize the `Inspector` by passing our `dataset` and the `ins_args` defined above, and call its `inspect()` method. For this example, we do not serialize the results; instead, we keep the interchangeable JSON object in the variable `res` for direct analysis.

In [47]:
res = Inspector(dataset, ins_args).inspect()

INFO: The metadata we will be using for the current analysis are:
{'text_names': ['text'], 'var_names': ['region'], 'metrics': ['npw_relevance'], 'var_types': ['nominal'], 'var_semantics': ['spatial'], 'var_subsets': None, 'var_bins': [0], 'tokenizer': 'whitespace', 'language': 'it', 'n_tokens': 1, 'n_cooc': 1, 'unique_cooc': False, 'cooc_window_size': 0, 'freq_cutoff': 3, 'stopwords': True, 'custom_stopwords': ['user', 'url'], 'lowercase': True, 'ignore_null_var': False}
INFO: all column identifiers are treated as column names.
INFO: '../data/diatopit.tsv' is loaded as a TSV file.
INFO: given the provided column names, we consider the first line as the header.
INFO: Tokenizing the text column...


100%|██████████| 15039/15039 [00:00<00:00, 34551.04it/s]


INFO: Currently calculating metric: 'npw_relevance'


100%|██████████| 20/20 [00:00<00:00, 499.71it/s]


### 3) Define the visualizer arguments

We then define the arguments for the visualizer. We choose not to serialize the charts, and simply pass our `shapefile_path` and the `shapefile_var_name` column (both required to create charts for analyses that include variables with `spatial` semantics).

Shapefiles are required to draw a background map in the chart. *.shp* files can be found in many online repositories such as [geodata](https://geodata.lib.berkeley.edu/), and are also typically provided by national or regional institutions on their websites. We only need to specify the *.shp* file; the other files associated with it are read from the same folder (they share its name but have different extensions).

The `shapefile_var_name` parameter refers to the column in the `shapefile_path` file that contains the names of the areas, which should match the possible values of the variable of interest (e.g., if the variable of interest is "state", this should be the name of the shapefile column that lists the possible states).

In our case, we take the shapefile from [here](https://www.istat.it/storage/cartografia/confini_amministrativi/generalizzati/Limiti01012022_g.zip), and the column denoting regions is named "DEN_REG".

In [48]:
# Define the path to the shapefile and the column referring to the variable of interest
shapefile = "../data/shp-italy-regions/Reg01012022_g_WGS84.shp"
shapefile_var = "DEN_REG"

# Define the visualizer arguments
vis_args = VisualizerArgs(shapefile_path=shapefile, shapefile_var_name=shapefile_var)
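
Since the companion files are expected to sit next to the *.shp* file, it can be useful to check for them before creating the charts. The sketch below is a hypothetical helper (`missing_sidecars` is not part of Variationist) that looks for the two companion files the shapefile format requires, *.shx* and *.dbf*. It assumes lowercase extensions, which may not hold on every file system.

```python
from pathlib import Path

# Companion extensions that the shapefile format requires next to the .shp file.
REQUIRED_SIDECARS = (".shx", ".dbf")


def missing_sidecars(shp_path):
    """Return the required companion files that are absent next to shp_path."""
    shp = Path(shp_path)
    return [
        shp.with_suffix(ext).name
        for ext in REQUIRED_SIDECARS
        if not shp.with_suffix(ext).exists()
    ]


missing = missing_sidecars("../data/shp-italy-regions/Reg01012022_g_WGS84.shp")
if missing:
    print("Missing companion files:", missing)
```

An empty list means both companion files were found in the same folder as the *.shp* file.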

### 4) Create interactive charts for all metrics

Finally, we initialize the `Visualizer` by passing our `res` and the `vis_args` arguments defined above, and call the `create()` function.

To avoid mismatches between the names in our `region` column and those of the `shapefile_var_name`, we may occasionally need to rename the variable values prior to instantiating the class (Variationist warns the user about that). For instance, there are hyphen differences between values in the DiatopIt `region` column and the "DEN_REG" `shapefile_var_name`. We thus rename some values prior to visualization.

In [49]:
res

{'metadata': {'text_names': ['text'],
  'var_names': ['region'],
  'metrics': ['npw_relevance'],
  'var_types': ['nominal'],
  'var_semantics': ['spatial'],
  'var_subsets': None,
  'var_bins': [0],
  'tokenizer': 'whitespace',
  'language': 'it',
  'n_tokens': 1,
  'n_cooc': 1,
  'unique_cooc': False,
  'cooc_window_size': 0,
  'freq_cutoff': 3,
  'stopwords': True,
  'custom_stopwords': ['user', 'url'],
  'lowercase': True,
  'ignore_null_var': False,
  'dataset': '../data/diatopit.tsv'},
 'metrics': {'npw_relevance': {'region': {'Marche': {'na': 1.0,
     'scritturebrevi': 0.8920687778591132,
     'sblab2021': 0.8553928368450885,
     'linguamadre': 0.7080708589869262,
     'de': 0.6268814252049437,
     'lu': 0.6188003909701256,
     'so': 0.4389471714016869,
     '😭': 0.4240483348931391,
     'ce': 0.42239789289840163,
     '😁': 0.3989709777721206,
     'pe': 0.35561068749930524,
     'n': 0.31402793093824777,
     'daje': 0.2795175076126371,
     '🤣': 0.2760090238092514,
     'je
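
The nested structure shown above (metric, then variable, then variable value, then token scores) can also be explored directly. The following sketch uses a toy object that mirrors that nesting, with scores rounded from the output above; `top_tokens` is a hypothetical helper, not a Variationist function.

```python
# Toy object mirroring the nesting shown above; scores are rounded from that output.
toy_res = {
    "metrics": {
        "npw_relevance": {
            "region": {
                "Marche": {
                    "na": 1.0,
                    "scritturebrevi": 0.892,
                    "sblab2021": 0.855,
                    "linguamadre": 0.708,
                },
            }
        }
    }
}


def top_tokens(results, metric, variable, value, k=3):
    """Return the k highest-scoring (token, score) pairs for one variable value."""
    scores = results["metrics"][metric][variable][value]
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


print(top_tokens(toy_res, "npw_relevance", "region", "Marche"))
```

Sorting explicitly avoids relying on the order in which the scores happen to be stored.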

In [50]:
# Rename the mismatching column values
for metric in res["metrics"]:
    regions = res["metrics"][metric]["region"]
    regions["Friuli Venezia Giulia"] = regions.pop("Friuli-Venezia Giulia")
    regions["Emilia-Romagna"] = regions.pop("Emilia Romagna")

charts = Visualizer(input_json=res, args=vis_args).create()

Reading json data...
INFO: Creating a BarChart object for metric "npw_relevance"...
INFO: Creating a ChoroplethChart object for metric "npw_relevance"...
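
Calling `pop` on a key that has already been renamed raises a `KeyError`, so re-running the renaming cell on the same `res` object fails. The sketch below shows one possible way to make the renaming repeatable and to list values that still lack a counterpart in the shapefile. Both helpers are hypothetical (not part of Variationist), and the toy scores and shapefile names are made up for illustration.

```python
def rename_values(results, variable, mapping):
    """Rename variable values in place for every metric, skipping absent keys."""
    for metric_scores in results["metrics"].values():
        values = metric_scores[variable]
        for old_name, new_name in mapping.items():
            if old_name in values:
                values[new_name] = values.pop(old_name)


def unmatched_values(results, variable, shapefile_names):
    """Return variable values that have no counterpart among the shapefile names."""
    found = set()
    for metric_scores in results["metrics"].values():
        found.update(metric_scores[variable])
    return sorted(found - set(shapefile_names))


# Toy data: the scores and the list of shapefile names are made up for illustration.
demo_res = {"metrics": {"npw_relevance": {"region": {
    "Friuli-Venezia Giulia": {"mandi": 0.5},
    "Marche": {"daje": 0.28},
}}}}
demo_shapefile_names = ["Friuli Venezia Giulia", "Marche"]
demo_mapping = {"Friuli-Venezia Giulia": "Friuli Venezia Giulia"}

before = unmatched_values(demo_res, "region", demo_shapefile_names)
rename_values(demo_res, "region", demo_mapping)
rename_values(demo_res, "region", demo_mapping)  # repeating the call is harmless
after = unmatched_values(demo_res, "region", demo_shapefile_names)
print(before, after)
```

With real data, the shapefile names would come from the `shapefile_var_name` column of the shapefile rather than from a hand-written list.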


We can then display the chart objects for our chosen metric (i.e., `npw_relevance`). To do so, we just need to know which charts have been created for that metric, either by looking at the keys of `charts["npw_relevance"]` or from the 🕵️‍♀️ Variationist console output above.

We see that both a BarChart and a ChoroplethChart have been created. Let's visualize the **BarChart** first.

In [51]:
charts["npw_relevance"]["BarChart"]

In [None]:
# charts["npw_relevance"]["ChoroplethChart"]

### What if we have coordinates?

We use the same settings as before, but now define `latitude` and `longitude` as our variables (both of type `coordinates` with `spatial` semantics). The remaining steps are unchanged, except that we no longer need to specify `shapefile_var_name`.

In [None]:
# Dataset filepath
dataset = "../data/diatopit.tsv"

# 1) Define the inspector arguments
ins_args = InspectorArgs(
    text_names=["text"],
    var_names=["latitude", "longitude"],
    var_types=["coordinates", "coordinates"], var_semantics=["spatial", "spatial"],
    n_tokens=1,
    metrics=["npw_relevance"],
    stopwords=True, language="it", custom_stopwords=["user", "url"], lowercase=True)

# 2) Run the inspector and get the results
res = Inspector(dataset, ins_args).inspect()

# 3) Define the visualizer arguments
shapefile = "../data/shp-italy-regions/Reg01012022_g_WGS84.shp" # path to the shapefile
vis_args = VisualizerArgs(output_folder="results", shapefile_path=shapefile)

# 4) Create interactive charts for all metrics
charts = Visualizer(input_json=res, args=vis_args).create()
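
Before running an analysis on coordinates, it may help to check that the `latitude` and `longitude` columns contain usable values. The sketch below is a hypothetical pre-check, not part of Variationist, and it makes no claim about how the library itself treats such rows. It flags rows whose coordinates are missing, non-numeric, or outside the valid ranges, and it runs on a made-up in-memory TSV.

```python
import csv
import io


def invalid_coordinate_rows(tsv_file, lat_col="latitude", lon_col="longitude"):
    """Return 1-based data row numbers whose coordinates are missing or out of range."""
    bad_rows = []
    reader = csv.DictReader(tsv_file, delimiter="\t")
    for row_number, row in enumerate(reader, start=1):
        try:
            lat, lon = float(row[lat_col]), float(row[lon_col])
        except (KeyError, TypeError, ValueError):
            bad_rows.append(row_number)
            continue
        if not (-90 <= lat <= 90 and -180 <= lon <= 180):
            bad_rows.append(row_number)
    return bad_rows


# Toy TSV with made-up rows; a real call would pass an open file handle instead.
toy_tsv = io.StringIO(
    "text\tlatitude\tlongitude\n"
    "daje\t43.6\t13.5\n"
    "ciao\t\t12.5\n"
    "boh\t143.6\t13.5\n"
)
bad = invalid_coordinate_rows(toy_tsv)
print(bad)
```

Row numbers count data rows only, so the header line is not included.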

In [None]:
# This does not work on notebooks, but we can just load the html file!
charts["npw_relevance"]["ScatterGeoChart"]

# from IPython.display import IFrame

# IFrame(src='../results/npw_relevance/ScatterGeoChart.html', width=500, height=500)

# from IPython.display import HTML
# HTML('../results/npw_relevance/ScatterGeoChart.html')