# pygamma-agreement Example Notebook

This notebook walks through the basic usage of the pygamma-agreement library.

First, let's load a CSV file into a `Continuum` object and inspect
that object's properties. We'll use one of the short example files from
the `tests/data/` directory.

In [None]:
from pygamma_agreement import Continuum

continuum = Continuum.from_csv("tests/data/AlexPaulSuzan.csv")
continuum
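If you want to build such a file yourself, here is a minimal sketch using only the standard library. It assumes a header-less, four-column layout (annotator, annotation, segment start, segment end); the rows below are made up for illustration, so check the library's documentation for the exact format it expects.

```python
import csv
import io

# Hypothetical rows, assumed layout: annotator, annotation, start, end.
rows = [
    ("Alex", "Verb", 0.0, 5.2),
    ("Alex", "Noun", 6.0, 9.5),
    ("Paul", "Verb", 0.4, 5.0),
]

# Write the rows to an in-memory CSV (swap the buffer for a real file if needed).
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Read the text back and group the units by annotator.
units_by_annotator = {}
for annotator, annotation, start, end in csv.reader(io.StringIO(csv_text)):
    unit = (float(start), float(end), annotation)
    units_by_annotator.setdefault(annotator, []).append(unit)

print(units_by_annotator)
```

Grouping by annotator mirrors the idea of a continuum: one timeline of units per annotator.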

A continuum is made of _units_. A unit is a segment, optionally carrying a
text annotation, that was produced by a human or automatic annotator.
Let's have a look at these units:

In [None]:
for annotator, annotations in continuum:
    for segment, unit in annotations.items():
        print(annotator, unit)

We can also list some basic properties of this continuum instance:

In [None]:
print(f"All annotators: {continuum.annotators}")
print(f"All categories: {continuum.categories}")
print(f"Unit count: {continuum.num_units}")
print(f"Average number of units per annotator: {continuum.avg_num_annotations_per_annotator}")
print(f"Average unit length: {continuum.avg_length_unit}")

Enough playing around. Let's get down to business and do what you probably
came here for: computing the Gamma inter-annotator agreement.
For that, we'll use a "combined categorical dissimilarity", which is, simply put,
a dissimilarity that measures both the temporal and the categorical differences
between two units annotated by two different annotators of the continuum.
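To make that idea concrete, here is a small pure-Python sketch of a combined dissimilarity, following the positional and categorical formulas from the Gamma paper. It is an illustration, not the library's implementation: `ToyUnit`, the function names, the binary category distance, and the default weights `alpha`, `beta` and `delta_empty` are all assumptions made for this example.

```python
from typing import NamedTuple


class ToyUnit(NamedTuple):
    """Hypothetical stand-in for an annotated unit."""
    start: float
    end: float
    category: str


def positional_dissimilarity(u, v, delta_empty=1.0):
    # Boundary gaps, normalised by the summed unit lengths, then squared.
    boundary_gap = abs(u.start - v.start) + abs(u.end - v.end)
    total_length = (u.end - u.start) + (v.end - v.start)
    return (boundary_gap / total_length) ** 2 * delta_empty


def categorical_dissimilarity(u, v, delta_empty=1.0):
    # Assumed binary distance: 0 for identical categories, 1 otherwise.
    return (0.0 if u.category == v.category else 1.0) * delta_empty


def combined_dissimilarity(u, v, alpha=1.0, beta=1.0, delta_empty=1.0):
    # Weighted sum of the temporal and categorical parts.
    return (alpha * positional_dissimilarity(u, v, delta_empty)
            + beta * categorical_dissimilarity(u, v, delta_empty))


same = combined_dissimilarity(ToyUnit(0, 10, "a"), ToyUnit(0, 10, "a"))
shifted = combined_dissimilarity(ToyUnit(0, 10, "a"), ToyUnit(2, 12, "b"))
print(same, shifted)
```

Two identical units score 0, while shifting the boundaries or changing the category both push the value up.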

We'll also ask for 30 random continua to be sampled for the Gamma measure's
chance estimation.

In [None]:
from pygamma_agreement import CombinedCategoricalDissimilarity

dissimilarity = CombinedCategoricalDissimilarity(continuum.categories)

gamma_results = continuum.compute_gamma(dissimilarity,
                                        n_samples=30)
print(f"The gamma agreement is {gamma_results.gamma}")
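The role of the sampled continua can be sketched in a few lines. The Gamma paper defines the agreement as one minus the ratio between the observed disorder and the disorder expected by chance, the latter being estimated as the mean disorder of the random continua. The function name and the disorder values below are made up for illustration.

```python
from statistics import mean


def gamma_from_disorders(observed_disorder, chance_disorders):
    # Expected (chance) disorder: mean over the sampled random continua.
    expected_disorder = mean(chance_disorders)
    return 1.0 - observed_disorder / expected_disorder


# Hypothetical values: one observed disorder, three sampled chance disorders.
toy_gamma = gamma_from_disorders(0.5, [1.5, 2.0, 2.5])
print(toy_gamma)
```

A value close to 1 means the observed disorder is far below chance level, while a value at or below 0 means the annotators do no better than random annotations.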

We can also retrieve the best alignment for that continuum. An alignment
groups units into tuples, with one slot per annotator, and the best alignment
is the one with the lowest _disorder_ (intuitively, the one where units from
different annotators are matched up in the best possible way):

In [None]:
best_alignment = continuum.get_best_alignment(dissimilarity)
best_alignment
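For intuition only, here is a brute-force toy version of that search for two annotators. It is a deliberate simplification and is not meant to reflect how the library finds the alignment: it assumes both annotators produced the same number of units, ignores categories, never pairs a unit with an "empty" unit, and uses hypothetical names and data throughout.

```python
from itertools import permutations


def positional_cost(u, v):
    # Squared, length-normalised boundary distance between (start, end) segments.
    boundary_gap = abs(u[0] - v[0]) + abs(u[1] - v[1])
    total_length = (u[1] - u[0]) + (v[1] - v[0])
    return (boundary_gap / total_length) ** 2


def brute_force_alignment(units_a, units_b):
    # Try every one-to-one pairing and keep the one with the lowest mean cost.
    best_pairs, best_disorder = None, float("inf")
    for candidate in permutations(units_b):
        pairs = list(zip(units_a, candidate))
        disorder = sum(positional_cost(u, v) for u, v in pairs) / len(pairs)
        if disorder < best_disorder:
            best_pairs, best_disorder = pairs, disorder
    return best_pairs, best_disorder


# Made-up segments for two annotators, listed in different orders.
alex = [(0, 10), (12, 20), (25, 30)]
suzan = [(24, 31), (1, 9), (13, 21)]
pairs, disorder = brute_force_alignment(alex, suzan)
print(pairs, disorder)
```

Enumerating permutations grows factorially with the number of units, which hints at why alignment is the expensive part of the computation.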

Up until now, we have used a very simple and very short example file, so all the
gamma computations should have been pretty fast. However, computing the gamma
measure can be quite costly for bigger files. Let's re-run the gamma measure
on some bigger files to give you a better feeling for how long it can take.

We'll also be using the `precision_level` parameter. The closer it is to 0,
the more precise the gamma estimate will be, at the price of a longer computation.

In [None]:
from time import time

files = [
    "tests/data/2by1000.csv",
    "tests/data/2by5000.csv",
    "tests/data/3by100.csv",
]

for file in files:
    start_time = time()
    continuum = Continuum.from_csv(file)
    gamma_results = continuum.compute_gamma(dissimilarity, precision_level=0.02)
    end_time = time()
    print(f"Took {end_time - start_time:.2f}s for {file}")
    print(f"Had to sample {gamma_results.n_samples} continua")
    print(f"Gamma is {gamma_results.gamma} (with estimation range {gamma_results.approx_gamma_range})")