# ==== INTERACTIVE CLUSTERING : ANNOTATION TIME STUDY ====
> ### Stage 1: Model annotation time with the Interactive Clustering methodology and plot some figures.

-----

## READ-ME BEFORE RUNNING

### Quick Description

This notebook **models annotation time in interactive clustering experiments**.
- Environments are represented by subdirectories of the `/experiments` folder.
- Each subdirectory of the `/experiments` folder represents an annotation experiment with several annotators.

### Description of each step

First of all, **load the experiment synthesis XLSX file** that was produced during the annotation experiment.
- It contains sessions of annotation for each annotator.
- Each session contains the number of constraints annotated and the time needed for it.
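
As a rough illustration of how the per-session columns used later in this notebook relate to each other, the sketch below derives speed figures from a constraint count and a duration. The column names come from the notebook; the assumption that the XLSX file computes them with exactly these formulas is not verified here, and `derive_speed_columns` is a hypothetical helper.

```python
from typing import Dict


def derive_speed_columns(constraints_number: int, needed_seconds: float) -> Dict[str, float]:
    """Derive per-session speed figures (assumed formulas, not read from the XLSX file)."""
    if constraints_number <= 0 or needed_seconds <= 0:
        raise ValueError("constraints_number and needed_seconds must be positive")
    return {
        "SECONDS_PER_CONSTRAINT": needed_seconds / constraints_number,
        "CONSTRAINTS_PER_MINUTE": constraints_number / (needed_seconds / 60),
        "CONSTRAINTS_PER_HOUR": constraints_number / (needed_seconds / 3600),
    }


# Hypothetical session: 120 constraints annotated in 30 minutes.
session = derive_speed_columns(constraints_number=120, needed_seconds=1800)
print(session)
```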

Then, several analyses are performed:
1. Check the hypotheses for parametric modeling
2. Model annotation time as a function of the number of constraints
3. Model annotation speed as a function of the session number

-----

## 1. IMPORT PYTHON DEPENDENCIES

In [None]:
from typing import Dict, List, Optional, Tuple, Union
import json
import numpy as np
import openpyxl
import pandas as pd
from matplotlib import pyplot as plt
from matplotlib.figure import Figure
import matplotlib.cm as cm
from matplotlib.colors import Normalize
from scipy import stats as scipystats
import statistics
import statsmodels
import statsmodels.api
import statsmodels.formula.api

-----

## 2. LOAD DATA

### 2.1. Load data from XLSX file.

In [None]:
df_annotation_time: pd.DataFrame = pd.read_excel(
    io="../experiments/mlsum_fr_train_subset_v1.0.0.schild/results.xlsx",
    sheet_name="time",
    engine="openpyxl",
)
#df_annotation_time["CONSTRAINTS_PER_MINUTE"] = df_annotation_time["CONSTRAINTS_PER_MINUTE"].replace(",", ".").astype(float)
#df_annotation_time["CONSTRAINTS_PER_HOUR"] = df_annotation_time["CONSTRAINTS_PER_HOUR"].replace(",", ".").astype(float)
#df_annotation_time["SECONDS_PER_CONSTRAINT"] = df_annotation_time["SECONDS_PER_CONSTRAINT"].replace(",", ".").astype(float)
df_annotation_time.head()

In [None]:
print("Constraints number: mean={0:.2f}, median={1:.2f}, min={2:.2f}, max={3:.2f}, sigma={4:.2f}, sem={5:.2f}".format(
    np.mean(df_annotation_time[df_annotation_time["EXPERIMENT_ID"]==1]["CONSTRAINTS_NUMBER"]),
    np.median(df_annotation_time[df_annotation_time["EXPERIMENT_ID"]==1]["CONSTRAINTS_NUMBER"]),
    min(df_annotation_time[df_annotation_time["EXPERIMENT_ID"]==1]["CONSTRAINTS_NUMBER"]),
    max(df_annotation_time[df_annotation_time["EXPERIMENT_ID"]==1]["CONSTRAINTS_NUMBER"]),
    np.std(df_annotation_time[df_annotation_time["EXPERIMENT_ID"]==1]["CONSTRAINTS_NUMBER"]),
    scipystats.sem(df_annotation_time[df_annotation_time["EXPERIMENT_ID"]==1]["CONSTRAINTS_NUMBER"]),
))

In [None]:
print("Session number: mean={0:.2f}, min={1:.2f}, max={2:.2f}, sigma={3:.2f}".format(
    np.mean(df_annotation_time[df_annotation_time["EXPERIMENT_ID"]==1]["SESSION_ID"]),
    min(df_annotation_time[df_annotation_time["EXPERIMENT_ID"]==1]["SESSION_ID"]),
    max(df_annotation_time[df_annotation_time["EXPERIMENT_ID"]==1]["SESSION_ID"]),
    np.std(df_annotation_time[df_annotation_time["EXPERIMENT_ID"]==1]["SESSION_ID"]),
))

In [None]:
print("Needed seconds per constraints: mean={0:.2f}, min={1:.2f}, max={2:.2f}, sigma={3:.2f}".format(
    np.mean(df_annotation_time[df_annotation_time["EXPERIMENT_ID"]==1]["SECONDS_PER_CONSTRAINT"]),
    min(df_annotation_time[df_annotation_time["EXPERIMENT_ID"]==1]["SECONDS_PER_CONSTRAINT"]),
    max(df_annotation_time[df_annotation_time["EXPERIMENT_ID"]==1]["SECONDS_PER_CONSTRAINT"]),
    np.std(df_annotation_time[df_annotation_time["EXPERIMENT_ID"]==1]["SECONDS_PER_CONSTRAINT"]),
))

In [None]:
print("Annotated constraints per minute: mean={0:.2f}, min={1:.2f}, max={2:.2f}, sigma={3:.2f}, sem={4:.2f}".format(
    np.mean(df_annotation_time[df_annotation_time["EXPERIMENT_ID"]==1]["CONSTRAINTS_PER_MINUTE"]),
    min(df_annotation_time[df_annotation_time["EXPERIMENT_ID"]==1]["CONSTRAINTS_PER_MINUTE"]),
    max(df_annotation_time[df_annotation_time["EXPERIMENT_ID"]==1]["CONSTRAINTS_PER_MINUTE"]),
    np.std(df_annotation_time[df_annotation_time["EXPERIMENT_ID"]==1]["CONSTRAINTS_PER_MINUTE"]),
    scipystats.sem(df_annotation_time[df_annotation_time["EXPERIMENT_ID"]==1]["CONSTRAINTS_PER_MINUTE"]),
))
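
The cells above repeat the same filter and the same statistics for each column. A minimal sketch of a reusable helper is given below, written in plain Python so it runs on its own. `describe` is a hypothetical name and the input values are made up for illustration. It follows the conventions of the calls above: `np.std` defaults to the population standard deviation (`ddof=0`), while `scipy.stats.sem` defaults to `ddof=1`.

```python
import math
import statistics
from typing import Dict, Sequence


def describe(values: Sequence[float]) -> Dict[str, float]:
    """Summary statistics with the same conventions as the cells above."""
    n = len(values)
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        # Population standard deviation, like np.std with its default ddof=0.
        "sigma": statistics.pstdev(values),
        # Standard error of the mean, like scipy.stats.sem with its default ddof=1.
        "sem": statistics.stdev(values) / math.sqrt(n),
    }


# Hypothetical values, for illustration only.
example = describe([2, 4, 4, 4, 5, 5, 7, 9])
print(", ".join(f"{name}={value:.2f}" for name, value in example.items()))
```

In the notebook, the same helper could be applied to a column of the data frame after filtering on `EXPERIMENT_ID` once.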

### 2.2. Check the hypotheses for parametric modeling

The Shapiro-Wilk test tests the null hypothesis that the data was drawn from a normal distribution.

In [None]:
scipystats.shapiro(x=df_annotation_time["CONSTRAINTS_PER_MINUTE"]).pvalue
# 4.17e-05 => "CONSTRAINTS_PER_MINUTE" wasn't drawn from a normal distribution.

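The Kolmogorov-Smirnov test tests the null hypothesis that the data was drawn from a given distribution (here: the standard normal distribution, since `scipystats.norm.cdf` is passed without location or scale parameters).

In [None]:
scipystats.kstest(rvs=df_annotation_time["CONSTRAINTS_PER_MINUTE"], cdf=scipystats.norm.cdf).pvalue
# 2.71e-251 => "CONSTRAINTS_PER_MINUTE" wasn't drawn from the standard normal distribution N(0, 1).
# Note: the data is not standardized here, so a difference in mean or scale alone is enough to reject.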
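
To illustrate why the reference distribution matters, the sketch below computes the Kolmogorov-Smirnov distance of a small, made-up sample against N(0, 1) and against a normal distribution fitted to the sample. The helper names and the data are hypothetical, and the code is plain Python rather than the SciPy call used above. With SciPy, the fitted variant would presumably be written by passing `args=(mean, std)` to `kstest`. Estimating the parameters from the same data makes the usual Kolmogorov-Smirnov p-values too optimistic, which is the motivation for the Lilliefors correction.

```python
import math
import statistics
from typing import Sequence


def normal_cdf(x: float, mean: float = 0.0, std: float = 1.0) -> float:
    """Cumulative distribution function of N(mean, std**2)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def ks_statistic(values: Sequence[float], mean: float = 0.0, std: float = 1.0) -> float:
    """Two-sided Kolmogorov-Smirnov distance between the sample and N(mean, std**2)."""
    ordered = sorted(values)
    n = len(ordered)
    distance = 0.0
    for rank, value in enumerate(ordered, start=1):
        cdf = normal_cdf(value, mean, std)
        # Compare the model CDF with the empirical CDF just before and just after the point.
        distance = max(distance, rank / n - cdf, cdf - (rank - 1) / n)
    return distance


# Hypothetical annotation speeds (constraints per minute), for illustration only.
speeds = [4.0, 5.0, 6.0, 5.5, 4.5]
d_standard = ks_statistic(speeds)  # reference: N(0, 1)
d_fitted = ks_statistic(speeds, statistics.fmean(speeds), statistics.stdev(speeds))
print(f"D vs N(0, 1): {d_standard:.3f}, D vs fitted normal: {d_fitted:.3f}")
```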

> Conclusion: a non-parametric model is needed.

-----

## 3. ANALYZE DATA

### 3.0. Estimation of theoretical annotation time

Load constraints and texts to annotate.

In [None]:
# Get texts.
with open("../experiments/mlsum_fr_train_subset_v1.0.0.schild/texts.json", "r") as text_file:
    dict_of_texts = json.load(text_file)
# Get constraints.
with open("../experiments/mlsum_fr_train_subset_v1.0.0.schild/constraints_-_template_to_annotate_1.json", "r") as constraints_file:
    dict_of_constraints = json.load(constraints_file)

Estimate the time needed to read one text.

In [None]:
# Get the list of texts to read (two texts per constraint).
list_of_preprocessed_texts: List[str] = []
for constraint in dict_of_constraints.values():
    list_of_preprocessed_texts.append(dict_of_texts[constraint["data"]["id_1"]]["text_preprocessed"])
    list_of_preprocessed_texts.append(dict_of_texts[constraint["data"]["id_2"]]["text_preprocessed"])
# Get texts size.
list_of_preprocessed_text_sizes: List[int] = [
    len(text.split(" "))
    for text in list_of_preprocessed_texts
]
print("Text size: mean={0:.2f}, min={1:.2f}, max={2:.2f}, sigma={3:.2f}, sem={4:.2f}".format(
    np.mean(list_of_preprocessed_text_sizes),
    min(list_of_preprocessed_text_sizes),
    max(list_of_preprocessed_text_sizes),
    np.std(list_of_preprocessed_text_sizes),
    scipystats.sem(list_of_preprocessed_text_sizes),
))

In [None]:
# Constant: reading speed in words per minute.
mean_word_read_per_minute: float = 238  # https://psycnet.apa.org/record/2019-59523-001
mean_word_read_per_second: float = mean_word_read_per_minute / 60  # https://psycnet.apa.org/record/2019-59523-001

# Estimate texts read time.
list_of_preprocessed_text_read_times: List[float] = [
    text_size / mean_word_read_per_second
    for text_size in list_of_preprocessed_text_sizes
]
print("Read time: mean={0:.2f}, min={1:.2f}, max={2:.2f}, sigma={3:.2f}, sem={4:.2f}".format(
    np.mean(list_of_preprocessed_text_read_times),
    min(list_of_preprocessed_text_read_times),
    max(list_of_preprocessed_text_read_times),
    np.std(list_of_preprocessed_text_read_times),
    scipystats.sem(list_of_preprocessed_text_read_times),
))

Estimate the time needed to understand a text.

In [None]:
neuroscience_P600: float = 0.600  # 600 ms, https://en.wikipedia.org/wiki/P600_(neuroscience)
needed_time_to_understand_a_text: float = neuroscience_P600
print("Understand one text:", needed_time_to_understand_a_text)
needed_time_to_understand_two_texts_concordance: float = neuroscience_P600
print("Understand two texts concordance:", needed_time_to_understand_two_texts_concordance)

Estimate the time of reaction and application.

In [None]:
#needed_time_to_act: float = 1
#print("Motor reaction:", needed_time_to_act)
application_delays: float = 1
print("Wait for application:", application_delays)

Complete estimation.

In [None]:
# Estimate the total annotation time per constraint.
# A separate list is used so that re-running this cell does not compound the read times.
list_of_theoretical_annotation_times: List[float] = [
    (
        # Read two texts (approximated as twice the read time of one text).
        # Note: needed_time_to_understand_a_text is not added here.
        2 * text_read_time
        # Estimate similarity and choose constraint to add.
        + needed_time_to_understand_two_texts_concordance
        # Add the constraint in the application.
        + application_delays
    )
    for text_read_time in list_of_preprocessed_text_read_times
]
print("Total theoretical needed time: mean={0:.2f}, min={1:.2f}, max={2:.2f}, sigma={3:.2f}, sem={4:.2f}".format(
    np.mean(list_of_theoretical_annotation_times),
    min(list_of_theoretical_annotation_times),
    max(list_of_theoretical_annotation_times),
    np.std(list_of_theoretical_annotation_times),
    scipystats.sem(list_of_theoretical_annotation_times),
))
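
Written as a formula, the estimate computed above is the following. The symbols are notation introduced here for readability and do not appear in the notebook: $w$ is the number of words in a text, $v = 238/60$ is the reading speed in words per second, $t_c = 0.6$ s is the time to judge the concordance of the two texts, and $t_a = 1$ s is the application delay.

```latex
T(w) = \frac{2\,w}{v} + t_c + t_a
```

For example, a 15-word text gives $T(15) \approx 7.56 + 0.6 + 1 = 9.16$ seconds per constraint.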

Some reference points from `Snow et al. (2008). Cheap and Fast — But is it Good?`

In [None]:
# Word Sense Disambiguation
annotation_time_needed: float = 8.59 * 3600  # in second
dataset_size: int = 1770
time_for_one_annotation: float = annotation_time_needed / dataset_size
print("time_for_one_annotation:", time_for_one_annotation)
theorical_approximation: float = 2 * 15/mean_word_read_per_second + needed_time_to_understand_two_texts_concordance + application_delays
print("theorical_approximation:", theorical_approximation)

In [None]:
# Word Similarity
annotation_time_needed: float = 0.17 * 3600  # in second
dataset_size: int = 300
time_for_one_annotation: float = annotation_time_needed / dataset_size
print("time_for_one_annotation:", time_for_one_annotation)
theorical_approximation: float = 2 * 1/mean_word_read_per_second + needed_time_to_understand_two_texts_concordance + application_delays
print("theorical_approximation:", theorical_approximation)

In [None]:
# Recognizing Textual Entailment
annotation_time_needed: float = 89.3 * 3600  # in second
dataset_size: int = 8000
time_for_one_annotation: float = annotation_time_needed / dataset_size
print("time_for_one_annotation:", time_for_one_annotation)
theorical_approximation: float = 2 * 15/mean_word_read_per_second + 10 + application_delays
print("theorical_approximation:", theorical_approximation)
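
The three comparisons above can be condensed into one loop, which makes the observed and estimated times easier to read side by side. This is only a sketch: the hours, dataset sizes, text lengths and judgment times are the values used in the cells above, while the constant names, the `TASKS` table and the ratio column are introduced here for illustration.

```python
from typing import Dict, List, Tuple

WORDS_PER_SECOND: float = 238 / 60  # reading speed used earlier in the notebook
APPLICATION_DELAY: float = 1.0  # seconds, as in the notebook

# (task, total hours, number of annotations, words per text, judgment time in seconds)
TASKS: List[Tuple[str, float, int, int, float]] = [
    ("Word Sense Disambiguation", 8.59, 1770, 15, 0.6),
    ("Word Similarity", 0.17, 300, 1, 0.6),
    ("Recognizing Textual Entailment", 89.3, 8000, 15, 10.0),
]

comparison: Dict[str, Tuple[float, float]] = {}
for task, hours, dataset_size, words_per_text, judgment_time in TASKS:
    # Observed time per annotation, in seconds.
    observed = hours * 3600 / dataset_size
    # Estimated time: read two texts, judge them, wait for the application.
    estimated = 2 * words_per_text / WORDS_PER_SECOND + judgment_time + APPLICATION_DELAY
    comparison[task] = (observed, estimated)
    print(f"{task}: observed={observed:.2f}s, estimated={estimated:.2f}s, ratio={observed / estimated:.2f}")
```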

### 3.1. Analyze annotation time per constraint

In [None]:
# Fit the model to the data and print results.
model_annotation_time = statsmodels.formula.api.glm(
    formula="NEEDED_SECONDS ~ 0 + CONSTRAINTS_NUMBER",
    #formula="NEEDED_SECONDS ~ 1 + CONSTRAINTS_NUMBER",
    data=df_annotation_time[df_annotation_time["EXPERIMENT_ID"]==1],
)
results_annotation_time = model_annotation_time.fit()
print(results_annotation_time.summary())

In [None]:
# Print the fitted model.
print(
    "NEEDED_SECONDS ~",
    "{0:.2E}".format(results_annotation_time.params["Intercept"]) if "Intercept" in results_annotation_time.params.keys() else "",
    "+ {0:.2E}*{1}".format(results_annotation_time.params["CONSTRAINTS_NUMBER"], "CONSTRAINTS_NUMBER")
)

In [None]:
# Define the interpolation function.
def interpolation_annotation_time(constraints_number) -> Tuple[float, float, float]:
    # Initialization.
    res_low: float = 0.0
    res: float = 0.0
    res_high: float = 0.0
    # Intercept.
    if "Intercept" in results_annotation_time.params.keys():
        res_low += (results_annotation_time.params["Intercept"] - results_annotation_time.bse["Intercept"])
        res += results_annotation_time.params["Intercept"]
        res_high += (results_annotation_time.params["Intercept"] + results_annotation_time.bse["Intercept"])
    # constraints_number.
    res_low += (results_annotation_time.params["CONSTRAINTS_NUMBER"] - results_annotation_time.bse["CONSTRAINTS_NUMBER"]) * constraints_number
    res += results_annotation_time.params["CONSTRAINTS_NUMBER"] * constraints_number
    res_high += (results_annotation_time.params["CONSTRAINTS_NUMBER"] + results_annotation_time.bse["CONSTRAINTS_NUMBER"]) * constraints_number
    # Return.
    return res_low, res, res_high
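
For a Gaussian model with an identity link and no intercept, the fitted coefficient is the ordinary least-squares slope through the origin. The sketch below reproduces that computation and the band returned by `interpolation_annotation_time` in plain Python. The function names and the data are hypothetical, and the standard error is expected to match the `bse` value only if the model uses its default scale estimate (residual sum of squares divided by the residual degrees of freedom). The band drawn in this notebook is one standard error wide on each side (`z=1.0`); `z=1.96` would approximate a 95% interval on the slope under a normal approximation.

```python
import math
from typing import Sequence, Tuple


def fit_through_origin(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope of y ~ 0 + x and its standard error."""
    sum_xx = sum(x * x for x in xs)
    slope = sum(x * y for x, y in zip(xs, ys)) / sum_xx
    residual_sum_of_squares = sum((y - slope * x) ** 2 for x, y in zip(xs, ys))
    # One parameter is estimated, so the residual degrees of freedom are n - 1.
    scale = residual_sum_of_squares / (len(xs) - 1)
    return slope, math.sqrt(scale / sum_xx)


def prediction_band(slope: float, standard_error: float, x: float, z: float = 1.0) -> Tuple[float, float, float]:
    """Low, central and high predictions, using slope +/- z standard errors."""
    return (slope - z * standard_error) * x, slope * x, (slope + z * standard_error) * x


# Hypothetical sessions: number of constraints and seconds needed, for illustration only.
constraints = [100, 200, 300, 400]
seconds = [1000, 2100, 2900, 4000]
slope, standard_error = fit_through_origin(constraints, seconds)
low, mid, high = prediction_band(slope, standard_error, x=500)
print(f"slope={slope:.3f} s/constraint, se={standard_error:.3f}")
print(f"500 constraints: {low / 60:.1f} to {high / 60:.1f} minutes (central: {mid / 60:.1f})")
```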

In [None]:
# Create a new figure.
fig_plot_annotation_time: Figure = plt.figure(figsize=(15, 7.5), dpi=300)
axis_plot_annotation_time = fig_plot_annotation_time.gca()

# Set range of axis.
axis_plot_annotation_time.set_xlim(xmin=-3, xmax=555)
axis_plot_annotation_time.set_ylim(ymin=-1, ymax=101)

# Plot annotation time.
axis_plot_annotation_time.plot(
    df_annotation_time[df_annotation_time["EXPERIMENT_ID"]==1]["CONSTRAINTS_NUMBER"],  # x
    df_annotation_time[df_annotation_time["EXPERIMENT_ID"]==1]["NEEDED_SECONDS"]/60,  # y
    label="Observed annotation time",
    marker="x",
    markerfacecolor="red",
    markersize=5,
    color="red",
    linewidth=0,
    linestyle="",
)
axis_plot_annotation_time.plot(
    range(50, 501, 10),  # x
    [
        interpolation_annotation_time(x)[1]/60
        for x in range(50, 501, 10)
    ],  # y
    label="Modeled annotation time",
    marker="",
    markerfacecolor="red",
    markersize=3,
    color="red",
    linewidth=2,
    linestyle="--",
)
axis_plot_annotation_time.fill_between(
    x=range(50, 501, 10),  # x
    y1=[
        interpolation_annotation_time(x)[0]/60
        for x in range(50, 501, 10)
    ],  # y1
    y2=[
        interpolation_annotation_time(x)[2]/60
        for x in range(50, 501, 10)
    ],  # y2
    color="red",
    alpha=0.2,
)

# Set axis name.
axis_plot_annotation_time.set_xlabel("number of constraints [#]", fontsize=18,)
axis_plot_annotation_time.set_ylabel("annotation time [min]", fontsize=18,)

# Plot the legend.
axis_plot_annotation_time.legend(
    loc="upper left",
    fontsize=15,
)

# Plot the grid.
axis_plot_annotation_time.grid(True)
    
# Store the graph.
fig_plot_annotation_time.savefig(
    "../results/etude-temps-annotation-1-modelisation-temps.png",
    dpi=300,
    transparent=True,
    bbox_inches="tight",
)

### 3.2. Model annotation speed per session

In [None]:
# Fit the model to the data and print results.
model_annotation_speed = statsmodels.formula.api.glm(
    formula="CONSTRAINTS_PER_MINUTE ~ 1 + SESSION_ID",
    data=df_annotation_time[df_annotation_time["EXPERIMENT_ID"]==1],
    family=statsmodels.api.families.Gaussian(
        link=statsmodels.genmod.families.links.identity()
        #link=statsmodels.genmod.families.links.log()
    ),
)
results_annotation_speed = model_annotation_speed.fit()
print(results_annotation_speed.summary())

> Conclusion: the inter-annotator variance is too high to conclude anything about the effect of the session ID.

### 3.3. Case study of some annotators

> Specific study of annotators `1`, `5` (constant speed) and `7`, `9` (increasing speed), as plotted below.

In [None]:
# Create a new figure.
fig_plot_annotator_speed_study: Figure = plt.figure(figsize=(15, 7.5), dpi=300)
axis_plot_annotator_speed_study = fig_plot_annotator_speed_study.gca()

# Set axis.
axis_plot_annotator_speed_study.set_xlim(xmin=0.9, xmax=9.1)
axis_plot_annotator_speed_study.set_ylim(ymin=-0.1, ymax=16.1)

# Plot for annotation speed for some annotators.
colors = [
    "orange", "red",
    "blue", "purple",
]
markers = [
    ">", ">",
    "^", "^",
]
for i, annotator_id in enumerate([
    1, 5,  # constant slope
    7, 9, # increasing slope
]):
    axis_plot_annotator_speed_study.plot(
        df_annotation_time[(df_annotation_time["ANNOTATOR_ID"]==annotator_id)&(df_annotation_time["EXPERIMENT_ID"]==1)]["SESSION_ID"],  # x
        df_annotation_time[(df_annotation_time["ANNOTATOR_ID"]==annotator_id)&(df_annotation_time["EXPERIMENT_ID"]==1)]["CONSTRAINTS_PER_MINUTE"],  # y
        label="Observed annotation speed for annotator "+str(annotator_id),
        marker=markers[i],
        markerfacecolor=colors[i],
        markersize=3,
        color=colors[i],
        linewidth=1,
        linestyle="--",
    )
    
# Mean annotation speed and its standard error.
average_annotation_speed: float = np.mean(df_annotation_time[df_annotation_time["EXPERIMENT_ID"]==1]["CONSTRAINTS_PER_MINUTE"])
sem_annotation_speed: float = scipystats.sem(df_annotation_time[df_annotation_time["EXPERIMENT_ID"]==1]["CONSTRAINTS_PER_MINUTE"])
axis_plot_annotator_speed_study.plot(
    range(0, 10),  # x
    [
        average_annotation_speed
        for _ in range(0, 10)
    ],  # y
    label="Mean annotation speed",
    marker="",
    markerfacecolor="black",
    markersize=5,
    color="black",
    linewidth=2,
    linestyle="-",
)
axis_plot_annotator_speed_study.fill_between(
    x=range(0, 10),  # x
    y1=[
        average_annotation_speed-sem_annotation_speed
        for _ in range(0, 10)
    ],  # y1
    y2=[
        average_annotation_speed+sem_annotation_speed
        for _ in range(0, 10)
    ],  # y2
    color="black",
    alpha=0.2,
)

# Set axis name.
axis_plot_annotator_speed_study.set_xlabel("annotation session [#]", fontsize=18,)
axis_plot_annotator_speed_study.set_ylabel("annotation speed [#/min]", fontsize=18,)

# Plot the legend.
axis_plot_annotator_speed_study.legend(
    loc="lower right",
    fontsize=15,
)

# Plot the grid.
axis_plot_annotator_speed_study.grid(True)
    
# Store the graph.
fig_plot_annotator_speed_study.savefig(
    "../results/etude-temps-annotation-3-etude-de-cas.png",
    dpi=300,
    transparent=True,
    bbox_inches="tight",
)
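
The "constant slope" and "increasing slope" labels in the cell above could be quantified by fitting one least-squares slope per annotator. The sketch below is one possible way to do so in plain Python. `slope_per_annotator` is a hypothetical helper, and the rows are made-up values, not measurements from the experiment.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def slope_per_annotator(rows: List[Tuple[int, int, float]]) -> Dict[int, float]:
    """Least-squares slope of speed against session number, one per annotator.

    Each row is (annotator_id, session_id, constraints_per_minute).
    """
    points_by_annotator = defaultdict(list)
    for annotator_id, session_id, speed in rows:
        points_by_annotator[annotator_id].append((session_id, speed))
    slopes: Dict[int, float] = {}
    for annotator_id, points in points_by_annotator.items():
        mean_x = sum(x for x, _ in points) / len(points)
        mean_y = sum(y for _, y in points) / len(points)
        covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
        variance = sum((x - mean_x) ** 2 for x, _ in points)
        # Requires at least two distinct sessions per annotator.
        slopes[annotator_id] = covariance / variance
    return slopes


# Hypothetical rows: one annotator with a stable speed, one who speeds up.
rows = [
    (1, 1, 5.0), (1, 2, 5.0), (1, 3, 5.0),
    (7, 1, 4.0), (7, 2, 5.0), (7, 3, 6.0),
]
slopes = slope_per_annotator(rows)
print(slopes)
```

A slope close to zero corresponds to a stable annotation speed, and a positive slope to an annotator who gets faster over the sessions.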

-----

## Discussion

1. Hypothesis: annotation time is linear in the number of constraints
    - OK: display the time per constraint

2. Hypothesis: annotation speed increases with the number of sessions
    - KO: the inter-annotator variation is too high
    - Descriptive statistics
    - Discussion of a few cases: one annotator who speeds up, one who stagnates?