
Commit

docs: update all README.md
erwanschild committed Nov 13, 2023
1 parent 90520a3 commit 71c3d43
Showing 10 changed files with 95 additions and 42 deletions.
4 changes: 2 additions & 2 deletions 1_efficience_study/README.md
@@ -54,7 +54,7 @@ Then follow notebooks instructions.

Due to the volume of data generated (around 2 GB), not all results are versioned on GitHub.

- results are zipped in a `.tar.gz` file and versioned on Zenodo : `TODO`.
- results are zipped in a `.tar.gz` file and versioned on Zenodo: `Schild, E. (2021). cognitivefactory/interactive-clustering-comparative-study. Zenodo. https://doi.org/10.5281/zenodo.5648255`.
- a summary of results is stored in `results`.

NB: To save results in a `.tar.gz` file, you can use the following command:
@@ -66,4 +66,4 @@ tar -czf 1_efficience_study.tar.gz experiments/ notebook/ results/ README.md
## Scientific contribution

- A research paper is dedicated to this study : `Schild, E., Durantin, G., Lamirel, J., & Miconi, F. (2022). Iterative and Semi-Supervised Design of Chatbots Using Interactive Clustering. International Journal of Data Warehousing and Mining (IJDWM), 18(2), 1-19. http://doi.org/10.4018/IJDWM.298007. <hal-03648041>.`
- Two sections of my PhD report are dedicated to this study : `TODO`.
- Two sections of my PhD report are dedicated to this study: `Schild, E. (2024, in press). De l'Importance de Valoriser l'Expertise Humaine dans l'Annotation : Application à la Modélisation de Textes en Intentions à l'aide d'un Clustering Interactif. Université de Lorraine.` (Sections 4.1 and 4.2)
6 changes: 3 additions & 3 deletions 2_computation_time_study/README.md
@@ -1,4 +1,4 @@
# Interactive Clustering : 2. Annotation Error Study
# Interactive Clustering : 2. Annotation Time Study

The main goal of this study is to **estimate the execution time needed** for algorithms to reach their objectives.

@@ -45,7 +45,7 @@ Then follow notebooks instructions.

Due to the volume of data generated (around 2 GB), not all results are versioned on GitHub.

- results are zipped in a `.tar.gz` file and versioned on Zenodo : `TODO`.
- results are zipped in a `.tar.gz` file and versioned on Zenodo: `Schild, E. (2021). cognitivefactory/interactive-clustering-comparative-study. Zenodo. https://doi.org/10.5281/zenodo.5648255`.
- a summary of results is stored in `results`.

To save results in a `.tar.gz` file, you can use the following command:
@@ -56,4 +56,4 @@ tar -czf 2_computation_time_study.tar.gz experiments/ notebook/ results/ README.

## Scientific contribution

- One section of my PhD report is dedicated to this study : `TODO`.
- One section of my PhD report is dedicated to this study: `Schild, E. (2024, in press). De l'Importance de Valoriser l'Expertise Humaine dans l'Annotation : Application à la Modélisation de Textes en Intentions à l'aide d'un Clustering Interactif. Université de Lorraine.` (Section 4.3)
9 changes: 5 additions & 4 deletions 3_annotation_time_study/README.md
@@ -21,15 +21,16 @@ Several instructions are given to annotators:
- *Objective of the experiment*: "I want to know the time required to annotate a certain number of constraints; In other words: To annotate 1000 constraints, how long do I need?";
- *Annotation instructions*: "Perform at least 15 minutes of annotation for regularity; If possible, isolate yourself so as not to be disturbed and not to distort the results; For each series, note the annotated time and number of constraints; If you don't know what to annotate (too ambiguous, unknown vocabulary, ...), go to the next one without annotating (you are supposed to be press experts!)".

Then, GLM modelisations are made on annotation time per bacth size and speed evolution over session..
Then, GLM models are fitted to annotation time per batch size and to speed evolution over sessions.

## Implementation

1. Constraints to annotate are randomly selected.
2. Annotation project can be imported in the annotation app with zipped archive.
2. The annotation project can be imported into the annotation app as a zipped archive, and operators can use the app to annotate constraints.
3. After annotation, time models are fitted to the experiment results.
4. Then, several graphs are produced to represent annotation time.

All these steps are implemented in `Python`, and can be run within `Jupyter Notebooks`.
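
As an illustration of the time modeling in step 3, here is a minimal `Python` sketch fitting a Gaussian GLM of annotation time against batch size with `statsmodels`; the data values, column names, and family choice are illustrative assumptions, not the exact setup of the study.

```python
# Illustrative sketch of the GLM time modeling step (data values, column
# names, and the Gaussian family are assumptions, not the study's exact setup).
import pandas as pd
import statsmodels.api as sm

# Hypothetical experiment log: one row per annotation series.
data = pd.DataFrame({
    "batch_size": [10, 25, 50, 100, 10, 25, 50, 100],
    "annotation_time": [55.0, 130.0, 260.0, 540.0, 60.0, 125.0, 250.0, 530.0],  # seconds
})

# Fit a Gaussian GLM of annotation time as a function of batch size.
exog = sm.add_constant(data["batch_size"])
model = sm.GLM(data["annotation_time"], exog, family=sm.families.Gaussian()).fit()
print(model.summary())

# Answer the experiment's question: how long to annotate 1000 constraints?
# (each exog row is [intercept, batch_size])
print(model.predict([[1.0, 1000.0]]))
```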

## Installation and Execution

@@ -44,7 +45,7 @@ Then follow notebooks instructions.

Due to the volume of data generated (around 1 GB), not all results are versioned on GitHub.

- results are zipped in a `.tar.gz` file and versioned on Zenodo : `TODO`.
- results are zipped in a `.tar.gz` file and versioned on Zenodo: `Schild, E. (2021). cognitivefactory/interactive-clustering-comparative-study. Zenodo. https://doi.org/10.5281/zenodo.5648255`.
- a summary of results is stored in `results`.

To save results in a `.tar.gz` file, you can use the following command:
@@ -55,4 +56,4 @@ tar -czf 3_annotation_time_study.tar.gz experiments/ notebook/ results/ README.m

## Scientific contribution

- One section of my PhD report is dedicated to this study : `TODO`.
- One section of my PhD report is dedicated to this study: `Schild, E. (2024, in press). De l'Importance de Valoriser l'Expertise Humaine dans l'Annotation : Application à la Modélisation de Textes en Intentions à l'aide d'un Clustering Interactif. Université de Lorraine.` (Section 4.3)
6 changes: 4 additions & 2 deletions 4_constraints_number_study/README.md
@@ -29,6 +29,8 @@ To analyze the constraints needed, GLM modelizations are performed with experimen
3. When all experiments have run, constraint number requirements are modeled from the experiment results.
4. Graphs and the total cost can then be estimated from these models.

All these steps are implemented in `Python`, and can be run within `Jupyter Notebooks`.
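
As a sketch of steps 3 and 4, the snippet below fits a simple saturating curve to v-measure results and inverts it to estimate the number of constraints needed for a target quality; the curve choice stands in for the study's exact GLM modelization, and all values are illustrative assumptions.

```python
# Illustrative sketch of steps 3 and 4: a saturating curve fit stands in for
# the study's GLM modelization; all values here are assumptions.
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical experiment results: v-measure observed after n annotated constraints.
n_constraints = np.array([0.0, 200.0, 400.0, 600.0, 800.0, 1000.0])
v_measure = np.array([0.10, 0.45, 0.65, 0.78, 0.85, 0.90])

# Saturating growth curve: v(n) = vmax * (1 - exp(-n / tau)).
def saturating(n, vmax, tau):
    return vmax * (1.0 - np.exp(-n / tau))

(vmax, tau), _ = curve_fit(saturating, n_constraints, v_measure, p0=[1.0, 500.0])

# Invert the model to estimate the constraints needed to reach a target v-measure.
target = 0.85
needed = -tau * np.log(1.0 - target / vmax)
print(f"estimated constraints needed for v-measure {target}: {needed:.0f}")

# Combine with a time-per-constraint estimate (see 3_annotation_time_study)
# to turn this requirement into a total annotation cost.
seconds_per_constraint = 5.0  # assumed value
print(f"estimated annotation cost: {needed * seconds_per_constraint / 3600:.1f} hours")
```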


## Installation and Execution

@@ -41,7 +43,7 @@ Then follow notebooks instructions.

Due to the volume of data generated (around 15 GB), not all results are versioned on GitHub.

- results are zipped in a `.tar.gz` file and versioned on Zenodo : `TODO`.
- results are zipped in a `.tar.gz` file and versioned on Zenodo: `Schild, E. (2021). cognitivefactory/interactive-clustering-comparative-study. Zenodo. https://doi.org/10.5281/zenodo.5648255`.
- a summary of results is stored in `results`.

To save results in a `.tar.gz` file, you can use the following command:
@@ -52,4 +54,4 @@ tar -czf 4_constraints_number_study.tar.gz experiments/ notebook/ results/ READM

## Scientific contribution

- One section of my PhD report is dedicated to this study : `TODO`.
- One section of my PhD report is dedicated to this study: `Schild, E. (2024, in press). De l'Importance de Valoriser l'Expertise Humaine dans l'Annotation : Application à la Modélisation de Textes en Intentions à l'aide d'un Clustering Interactif. Université de Lorraine.` (Section 4.3)
8 changes: 5 additions & 3 deletions 5_relevance_study/README.md
@@ -34,11 +34,13 @@ Business relevance annotations are redone and are focused on summary (not on clust
## Implementation

1. Get previous results of convergence, and choose some iterations to analyze.
2. Perform manuel business relevance annotation.
2. Perform manual business relevance annotation.
3. Compute linguistic analysis, then perform semi-assisted business relevance annotation.
4. Call a large language model to summarize topics in clusters, then perform assisted business relevance annotation.
5. Compare the methods.

All these steps are implemented in `Python`, and can be run within `Jupyter Notebooks`.
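
As an illustration of the linguistic analysis in step 3, here is a minimal sketch extracting top TF-IDF keywords per cluster to support semi-assisted relevance annotation; the texts, cluster assignments, and the TF-IDF choice are illustrative assumptions, and the LLM summarization of step 4 is not shown since it depends on the model used.

```python
# Illustrative sketch of step 3: top TF-IDF keywords per cluster to support
# semi-assisted relevance annotation (texts, clusters, and method are assumptions).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "demande de carte bancaire",
    "carte bancaire perdue",
    "ouvrir un nouveau compte",
    "fermer mon compte courant",
]
clusters = np.array([0, 0, 1, 1])

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(texts).toarray()
vocabulary = np.array(vectorizer.get_feature_names_out())

# For each cluster, print the highest-scoring terms as a quick topic summary.
for cluster_id in np.unique(clusters):
    mean_scores = tfidf[clusters == cluster_id].mean(axis=0)
    top_terms = vocabulary[np.argsort(mean_scores)[::-1][:3]]
    print(f"cluster {cluster_id}: {', '.join(top_terms)}")
```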


## Installation and Execution

@@ -62,7 +64,7 @@ See the export notebook of this study and paste the exported files in the `previ

Due to the volume of data generated (around 1 GB), not all results are versioned on GitHub.

- results are zipped in a `.tar.gz` file and versioned on Zenodo : `TODO`.
- results are zipped in a `.tar.gz` file and versioned on Zenodo: `Schild, E. (2021). cognitivefactory/interactive-clustering-comparative-study. Zenodo. https://doi.org/10.5281/zenodo.5648255`.
- a summary of results is stored in `results`.

To save results in a `.tar.gz` file, you can use the following command:
@@ -73,4 +75,4 @@ tar -czf 5_relevance_study.tar.gz experiments/ notebook/ previous/ results/ READ

## Scientific contribution

- One section of my PhD report is dedicated to this study : `TODO`.
- One section of my PhD report is dedicated to this study: `Schild, E. (2024, in press). De l'Importance de Valoriser l'Expertise Humaine dans l'Annotation : Application à la Modélisation de Textes en Intentions à l'aide d'un Clustering Interactif. Université de Lorraine.` (Section 4.4)
19 changes: 15 additions & 4 deletions 6_rentability_study/README.md
@@ -12,12 +12,23 @@ In fact, we study two options: (1) evolution of agreement between annotation an

## Experimental protocol

`TODO`.
Based on previous executions of _Interactive Clustering_, we study the rentability of each new iteration of the method.

Two methods are used:
- evolution of agreement between annotations and the previous clustering: if annotations differ from the clustering suggestions, then there are still corrections to apply; otherwise, if annotations and clustering results agree, then no more corrections are needed.
- evolution of similarity between two successive clusterings: if the clustering does not change after several iterations, then the clustering is stable and no more corrections are needed.

To choose the best solution, we compute the correlation between the evolution of these measures and the evolution of the v-measure.


## Implementation

`TODO`.
1. Get previous results of convergence, and choose some iterations to analyze.
2. Iteration by iteration, compute the agreement score between annotations and the previous clustering, then compute the correlation between this agreement and the ground truth v-measure.
3. Iteration by iteration, compare consecutive clusterings, then compute the correlation between this similarity and the ground truth v-measure.
4. Compare the methods and display evolution diagrams.

All these steps are implemented in `Python`, and can be run within `Jupyter Notebooks`.
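
As an illustration of steps 2 to 4, here is a minimal sketch using the adjusted Rand index as the similarity between successive clusterings and a Pearson correlation against the ground truth v-measure; the label sequences and metric choices are illustrative assumptions.

```python
# Illustrative sketch of steps 3 and 4: similarity between successive
# clusterings, correlated with the ground truth v-measure (labels are assumptions).
from scipy.stats import pearsonr
from sklearn.metrics import adjusted_rand_score, v_measure_score

# Hypothetical clustering results over successive iterations, plus ground truth.
iterations = [
    [0, 0, 1, 1, 2, 2, 1, 0],
    [0, 0, 1, 1, 2, 2, 2, 0],
    [0, 0, 1, 1, 2, 2, 2, 1],
    [0, 0, 1, 1, 2, 2, 2, 1],
]
groundtruth = [0, 0, 1, 1, 2, 2, 2, 1]

# Similarity between each pair of successive clusterings (observable stability signal).
stability = [adjusted_rand_score(a, b) for a, b in zip(iterations, iterations[1:])]

# Ground truth v-measure at the same iterations (unknown in a real deployment).
quality = [v_measure_score(groundtruth, labels) for labels in iterations[1:]]

# Correlation between the observable stability signal and the hidden quality.
correlation, p_value = pearsonr(stability, quality)
print(f"correlation: {correlation:.2f} (p-value: {p_value:.2f})")
```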


## Installation and Execution
@@ -38,7 +49,7 @@ See the export notebook of this study and paste the exported files in the `previ

Due to the volume of data generated (around 2 GB), not all results are versioned on GitHub.

- results are zipped in a `.tar.gz` file and versioned on Zenodo : `TODO`.
- results are zipped in a `.tar.gz` file and versioned on Zenodo: `Schild, E. (2021). cognitivefactory/interactive-clustering-comparative-study. Zenodo. https://doi.org/10.5281/zenodo.5648255`.
- a summary of results is stored in `results`.

To save results in a `.tar.gz` file, you can use the following command:
@@ -49,4 +60,4 @@ tar -czf 6_rentability_study.tar.gz experiments/ notebook/ previous/ results/ RE

## Scientific contribution

- One section of my PhD report is dedicated to this study : `TODO`.
- One section of my PhD report is dedicated to this study: `Schild, E. (2024, in press). De l'Importance de Valoriser l'Expertise Humaine dans l'Annotation : Application à la Modélisation de Textes en Intentions à l'aide d'un Clustering Interactif. Université de Lorraine.` (Section 4.5)
29 changes: 20 additions & 9 deletions 7_inter_annotators_score_study/README.md
@@ -6,24 +6,35 @@ The main goal of this study is to **estimate the inter-annotators score** during
## Hypotheses

This sub-repository provides an environment to carry out a comparative study of _Interactive Clustering_ implementation around one hypothesis.
- **Robustness hypothesis**: _TODO._

In this study, we focus on annotation time.
- **Robustness hypothesis**: _During an annotation methodology based on Interactive Clustering, it is possible to estimate the rate of inconsistencies appearing in constraints and their impact on the results of the method._

In this study, we focus on the inter-annotators score during constraints annotation.

## Experimental protocol

`TODO`.
The proposed study consists in having several operators annotate constraints, then estimating the inter-annotator agreement. We reuse the annotation instructions from the annotation time experiment (`3_annotation_time_study`).

Several instructions are given to annotators:
- *Operator context*: “You are newspaper experts; You want to classify articles into categories based on their title; You do not know precisely which categories you will use to classify your articles; But you know how to identify the similarity of two articles”;
- *Context on the dataset*: "Topics are common newspaper categories; The ground truth contains between 10 and 20 of the most common categories of the press; Ground truth contains between 30 and 100 articles per category; You can look at the unannotated dataset as much as you want";
- *Objective of the experiment*: "Annotate 400 constraints, and estimate the difference among 4 annotators";
- *Annotation instructions*: "Perform at least 15 minutes of annotation for regularity; If possible, isolate yourself so as not to be disturbed and not to distort the results; If you don't know what to annotate (too ambiguous, unknown vocabulary, ...), go to the next one without annotating (you are supposed to be press experts!)".

Then, Krippendorff's alpha is used to compute agreement scores.


## Implementation

`TODO`.
1. Constraints to annotate are selected (200 MUST-LINK, 200 CANNOT-LINK, based on ground truth).
2. The annotation project can be imported into the annotation app as a zipped archive, and operators can use the app to annotate constraints.
3. After annotation, agreement scores are computed from the experiment results.

All these steps are implemented in `Python`, and can be run within `Jupyter Notebooks`.
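
As an illustration of the agreement computation in step 3, here is a minimal sketch using the `krippendorff` package on an annotator-by-constraint matrix; the annotation values are illustrative assumptions, with `np.nan` marking a skipped constraint.

```python
# Illustrative sketch of step 3: Krippendorff's alpha over an annotator-by-
# constraint matrix (values are assumptions; np.nan marks a skipped constraint).
import numpy as np
import krippendorff

# Rows = annotators, columns = constraints; 1 = MUST-LINK, 0 = CANNOT-LINK.
annotations = np.array([
    [1, 1, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 0, np.nan, 1],
    [1, 1, 0, 0, np.nan],
])

alpha = krippendorff.alpha(reliability_data=annotations, level_of_measurement="nominal")
print(f"Krippendorff's alpha: {alpha:.2f}")
```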


## Installation and Execution

Follow the description of `README.md` repository file in order to setup your Python/R environment.
Follow the description of the `README.md` repository file to set up your `Python` environment.

Then follow notebooks instructions.

@@ -32,8 +43,8 @@ Then follow notebooks instructions.

Due to the volume of data generated (around 1 GB), not all results are versioned on GitHub.

- results are zipped in a `.tar.gz` file and versioned on Zenodo : `TODO`.
- a summary of results are stored in `results`..
- results are zipped in a `.tar.gz` file and versioned on Zenodo: `Schild, E. (2021). cognitivefactory/interactive-clustering-comparative-study. Zenodo. https://doi.org/10.5281/zenodo.5648255`.
- a summary of results is stored in `results`.

To save results in a `.tar.gz` file, you can use the following command:
@@ -43,4 +54,4 @@ tar -czf 7_inter_annotators_score_study.tar.gz experiments/ notebook/ results/ R

## Scientific contribution

- One section of my PhD is dedicated to this study : `TODO`.
- One section of my PhD report is dedicated to this study: `Schild, E. (2024, in press). De l'Importance de Valoriser l'Expertise Humaine dans l'Annotation : Application à la Modélisation de Textes en Intentions à l'aide d'un Clustering Interactif. Université de Lorraine.` (Section 4.6)
26 changes: 19 additions & 7 deletions 8_annotation_error_fix_study/README.md
@@ -6,17 +6,29 @@ The main goal of this study is to **evaluate errors impact** and **verify confli
## Hypotheses

This sub-repository provides an environment to carry out a comparative study of _Interactive Clustering_ implementation around one hypothesis.
- **Robustness hypothesis**: _TODO._
- **Robustness hypothesis**: _During an annotation methodology based on Interactive Clustering, it is possible to estimate the rate of inconsistencies appearing in constraints and their impact on the results of the method._

In this study, we focus on the impact of annotation errors during constraints annotation and the importance of conflict corrections.


## Experimental protocol

`TODO`.
The proposed study consists in performing _Interactive Clustering_ iterations with specific settings in order to annotate an unlabeled dataset, starting from no known constraints and ending when all the possible constraints between questions are defined.
The human annotator is simulated by the algorithm, and annotations are made by comparing with ground truth labels: two questions are annotated with a `MUST_LINK` constraint if they come from the same intent, and with a `CANNOT_LINK` constraint otherwise. During annotation, some wrong annotations are inserted to simulate operator mistakes (the ground truth is the reference).

The study focuses on the impact of the error rate on clustering results.
If annotation conflicts are detected, two strategies are compared: (1) no correction and (2) correction with the true annotation.

To analyze the impact of errors, we compare the evolution of ground truth agreement across error experiments and analyze the significance of correction.


## Implementation

`TODO`.
1. Run the full _Interactive Clustering_ annotation methodology and insert a fixed rate of errors at each iteration. If a conflict is detected, use the experiment strategy to ignore or fix it.
2. When all experiments have run, display the mean evolution of ground truth agreement per error rate and per conflict-fix strategy.
3. Compare the evolutions and discuss the significance of correction.

All these steps are implemented in `Python`, and can be run within `Jupyter Notebooks`.
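
As an illustration of the simulated annotator in step 1, here is a minimal sketch that answers from the ground truth but flips its answer with a fixed probability; the constraint codes and the uniform error model are illustrative assumptions.

```python
# Illustrative sketch of step 1: a simulated annotator that answers from the
# ground truth but flips its answer with a fixed probability (assumed error model).
import random

def simulated_annotation(intent_a: str, intent_b: str, error_rate: float) -> str:
    """Annotate a pair from ground truth intents, with a chance of operator mistake."""
    answer = "MUST_LINK" if intent_a == intent_b else "CANNOT_LINK"
    if random.random() < error_rate:
        # Simulate an operator mistake: return the wrong constraint.
        answer = "CANNOT_LINK" if answer == "MUST_LINK" else "MUST_LINK"
    return answer

random.seed(42)
print(simulated_annotation("card_lost", "card_lost", error_rate=0.10))
```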


## Installation and Execution
@@ -35,10 +47,10 @@ See the export notebook of this study and paste the exported files in the `previ

## Results

Due to the volume of data generated (around 35 GB), not all results are versioned on GitHub.
Due to the volume of data generated (around 30 GB), not all results are versioned on GitHub.

- results are zipped in a `.tar.gz` file and versioned on Zenodo : `TODO`.
- a summary of results are stored in `results`..
- results are zipped in a `.tar.gz` file and versioned on Zenodo: `Schild, E. (2021). cognitivefactory/interactive-clustering-comparative-study. Zenodo. https://doi.org/10.5281/zenodo.5648255`.
- a summary of results is stored in `results`.

To save results in a `.tar.gz` file, you can use the following command:
@@ -48,4 +60,4 @@ tar -czf 8_annotation_error_fix_study.tar.gz experiments/ notebook/ previous/ re

## Scientific contribution

- One section of my PhD is dedicated to this study : `TODO`.
- One section of my PhD report is dedicated to this study: `Schild, E. (2024, in press). De l'Importance de Valoriser l'Expertise Humaine dans l'Annotation : Application à la Modélisation de Textes en Intentions à l'aide d'un Clustering Interactif. Université de Lorraine.` (Section 4.6)