# Introduction to Natural Language Processing 2 Lab08

**This lab is mainly about data and model analysis. There is very little code. Make sure you send back a proper report with your code, guideline, annotated sheets, and theoretical answers.**


---

## Introduction (1 point)

Your company wants to sell a moderation API tackling toxic content on Twitter. They ask you to come up with a model which detects toxic tweets. You remember your NLP classes, start looking for existing models or datasets, and find a collection of [academic Twitter datasets on the HuggingFace hub](https://huggingface.co/datasets/tweet_eval). In particular, the `hate` and `offensive` datasets seem close to what you are looking for.

1. (1 point) Pick one of the datasets between `hate` and `offensive`, and justify your choice. Remember that it is for a commercial application.

Hateful tweets are a subset of offensive tweets, and a moderation API should flag both as toxic. Since the `offensive` dataset already contains some hateful tweets, we chose to train our model on it: a model that detects offensive tweets should also catch hateful ones.

## Evaluating the dataset (5 points)

Before using the data to train a model, you have the right reflex and start with a data analysis.

1. (1 point) Describe the dataset. Look at the splits, proportion of classes, and see what you can figure out by just looking at the text.

In [1]:
%%capture
! pip install transformers
! pip install datasets
! pip install bertopic
! pip install evaluate

In [2]:
from typing import Tuple
import pandas as pd
import numpy as np

from datasets import load_dataset, Dataset

from transformers import pipeline, AutoTokenizer, TFAutoModelForSequenceClassification, AutoModelForSequenceClassification
from bertopic import BERTopic
import evaluate

from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import precision_recall_fscore_support
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from statsmodels.stats.inter_rater import fleiss_kappa, aggregate_raters
from sklearn.pipeline import Pipeline

from umap import UMAP

SEED = 42

# Set random_state for BERTopic model
umap_model = UMAP(random_state=SEED)

In [None]:
# Loading dataset
dataset = load_dataset("tweet_eval", "offensive")

print("Dataset size:", len(dataset["train"]) + len(dataset["test"]) + len(dataset["validation"]))

splits = ["train", "validation", "test"]
for split in splits:
  # Label 1 = offensive tweet, label 0 = non-offensive tweet
  offensive_rate = len(dataset[split].filter(lambda x: x["label"] == 1)) / len(dataset[split])

  print("Split:", split, ", proportion of offensive tweets:", offensive_rate)

print("\n" + "Offensive tweet examples:")
print(np.array(dataset["train"].filter(lambda x: x["label"] == 1)["text"][:10]))

print("\n" + "Non-offensive tweet examples:")
print(np.array(dataset["train"].filter(lambda x: x["label"] == 0)["text"][:10]))

Dataset size: 14100

Split: train , proportion of offensive tweets: 0.3307317891910037

Split: validation , proportion of offensive tweets: 0.3466767371601209

Split: test , proportion of offensive tweets: 0.27906976744186046

Offensive tweet examples:
['@user Eight years the republicans denied obama’s picks. Breitbarters outrage is as phony as their fake president.'
 "@user She has become a parody unto herself? She has certainly taken some heat for being such an....well idiot. Could be optic too  Who know with Liberals  They're all optics.  No substance"
 '@user Your looking more like a plant #maga #walkaway'
 '@user Antifa would burn a Conservatives house down and CNN would be there lighting the torches &amp; throwing gas on the flames.'
 '@user They cite Jones being banned for violating Twitter\'s ToS. There are blue checkmarks spewing the same, if not worse, kind of shit. If you are going to play the anyone can get banned"" card. Shouldn\'t these people also receive bans and suspensions? #VerifiedHate""'
 "@user also shitbiscuit stole most of the Tempe girls u can't blame that on me.."
 '@user @user Well, see, if I start talking at Dana abou


['@user Bono... who cares. Soon people will understand that they gain nothing from following a phony celebrity. Become a Leader of your people instead or help and support your fellow countrymen.'
 '@user Get him some line help. He is gonna be just fine. As the game went on you could see him progressing more with his reads. He brought what has been missing. The deep ball presence. Now he just needs a little more time'
 '@user @user She is great. Hi Fiona!'
 '@user @user @user @user @user @user @user @user @user @user @user @user @user @user @user This is the VetsResistSquadron"" is Bullshit.. They are girl scout veterans, I have never met any other veterans or served with anyone that was a gun control advocate? Have you?""'
 '@user @user Lol. Except he’s the most successful president in our lifetimes. He’s undone most of the damage Obummer did and set America on the right path again. #MAGA'
 '@user Been a Willie fan since before most of you were born....LOVE that he is holding a rally w

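The dataset contains 14,100 tweets: 11,916 in the train split, 1,324 in the validation split, and 860 in the test split. Offensive tweets make up about 33% of the train split, 35% of the validation split, and 28% of the test split, so the splits are only roughly stratified and the classes are imbalanced (about one offensive tweet for two non-offensive ones).

Offensive tweets contain many more words with a negative connotation (such as "outrage", "idiot", "burn", "lie", "whore", "propaganda") than non-offensive tweets.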
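The observation about negatively connoted words can be checked more systematically by comparing word frequencies between the two classes. The following is a minimal sketch on a handful of made-up tweets: the `toy_tweets` list and the `distinctive_words` helper are illustrative assumptions, not part of the lab. On the real data, one would pass `dataset["train"]["text"]` and `dataset["train"]["label"]` instead.

```python
import re
from collections import Counter


def distinctive_words(texts, labels, target=1, min_count=2, top_k=5):
    """Rank words by how much more often they occur in the target class."""
    in_target, elsewhere = Counter(), Counter()
    for text, label in zip(texts, labels):
        tokens = re.findall(r"[a-z']+", text.lower())
        (in_target if label == target else elsewhere).update(tokens)
    # Add-one smoothing so that words unseen in the other class do not divide by zero
    ratios = {
        word: (count + 1) / (elsewhere[word] + 1)
        for word, count in in_target.items()
        if count >= min_count
    }
    return sorted(ratios, key=ratios.get, reverse=True)[:top_k]


# Hypothetical toy data, for illustration only (1 = offensive, 0 = non-offensive)
toy_tweets = [
    ("@user you are an idiot", 1),
    ("@user what an idiot move", 1),
    ("@user thank you so much", 0),
    ("@user you are great", 0),
]
texts, labels = zip(*toy_tweets)
print(distinctive_words(texts, labels))
```

Words such as "user" that are equally frequent in both classes end up at the bottom of the ranking, which makes the class-specific vocabulary easier to spot than by reading raw examples.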

2. (3 points) Use [BERTopic](https://github.com/MaartenGr/BERTopic) to extract the topics within the data, and the main topics within each class. Please, think about [fixing the random seed](https://stackoverflow.com/questions/71320201/how-to-fix-random-seed-for-bertopic).
  * A [good model](https://github.com/MaartenGr/BERTopic#embedding-models) for sentence similarity is `all-MiniLM-L6-v2`, as it is [fast, light, and pretty accurate](https://www.sbert.net/docs/pretrained_models.html). You can use another one, but make sure to document your choice.

In [None]:
def get_main_topics(data: list) -> Tuple[np.ndarray, BERTopic]:
  '''
  Returns the main topics from the data.

  Parameters:
    data (list): List of strings from which topics are extracted.

  Returns:
    main_topics (Tuple[numpy.ndarray, BERTopic]): Tuple containing the top word of each extracted topic (deduplicated) and the fitted model.
  '''
  # Creating and fitting the model
  model = BERTopic(embedding_model="all-MiniLM-L6-v2", umap_model=umap_model)
  model.fit_transform(data)

  # Fetching topics as {topic_id: [(word, score), ...]}
  topics = model.get_topics()

  # Getting the word that best represents each topic (highest score)
  main_topics = [max(topic, key=lambda word: word[1])[0] for topic in topics.values()]
  return pd.Series(main_topics).unique(), model

neutral_topics, neutral_topic_model = get_main_topics(dataset["train"].filter(lambda x: x["label"] == 0)["text"])
offensive_topics, offensive_topic_model = get_main_topics(dataset["train"].filter(lambda x: x["label"] == 1)["text"])

print("Main word for each offensive topic:")
print(offensive_topics)

print("Main word for each neutral topic:")
print(neutral_topics)






Main word for each offensive topic:
['user' 'liberals' 'gun' 'antifa' 'maga' 'kavanaugh' 'women' 'bitch' 'she'
 'racist' 'nfl' 'tories' 'pope' 'sexy' 'nigga' 'iran' 'trudeau' 'news'
 'obama' 'trump' 'he' 'fat' 'court' 'was' 'puerto' 'try' 'abortion']
Main word for each neutral topic:
['the' 'he' 'antifa' 'gun' 'beautiful' 'liberals' 'conservatives'
 'kavanaugh' 'user' 'brexit' 'maga' 'god' 'canada' 'trump' 'housing'
 'money' 'kag' 'serena' 'nfl' 'black' 'jail' 'lying' 'innocent' 'puerto'
 'fbi' 'twitter' 'she' 'obama' 'thank' 'pple' 'women' 'birthday' 'read'
 'feinstein' 'flake' 'movement' 'economy' 'blocked' 'kerry' 'cnn' 'player'
 'vote' 'price' 'nra' 'hillary' 'president' 'levi' 'california'
 'conscientiousness' 'birds']


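The tables below give more information about the topics. Each row represents a topic: the `Count` column gives the number of tweets assigned to the topic, and the `Name` column concatenates the topic id with its four most representative words. Topic -1 groups the outlier tweets that BERTopic could not assign to any topic.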

In [None]:
print("Offensive tweet topics")
offensive_topic_model.get_topic_info()

Offensive tweet topics


| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 911 | `-1_user_is_the_to` |
| 1 | 0 | 1049 | `0_user_is_shit_you` |
| 2 | 1 | 373 | `1_liberals_the_user_conservatives` |
| 3 | 2 | 355 | `2_gun_control_the_to` |
| 4 | 3 | 287 | `3_antifa_user_the_and` |
| 5 | 4 | 143 | `4_maga_trump_and_the` |
| 6 | 5 | 80 | `5_kavanaugh_to_maga_the` |
| 7 | 6 | 66 | `6_women_sexual_the_liberals` |
| 8 | 7 | 65 | `7_bitch_bitches_me_love` |
| 9 | 8 | 58 | `8_she_her_to_is` |


In [None]:
print("Neutral tweet topics")
neutral_topic_model.get_topic_info()[:50]

Neutral tweet topics


| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 1933 | `-1_the_to_is_and` |
| 1 | 0 | 1994 | `0_he_is_she_you` |
| 2 | 1 | 781 | `1_antifa_user_the_of` |
| 3 | 2 | 596 | `2_gun_control_guns_laws` |
| 4 | 3 | 369 | `3_beautiful_you_love_she` |
| 5 | 4 | 341 | `4_liberals_user_they_the` |
| 6 | 5 | 143 | `5_conservatives_are_that_the` |
| 7 | 6 | 135 | `6_kavanaugh_maga_judge_to` |
| 8 | 7 | 134 | `7_user_follow_treph_gt` |
| 9 | 8 | 126 | `8_brexit_uk_the_eu` |


Here we can see a visualisation of the offensive topics, which helps to distinguish similar subjects.

In [None]:
print("Offensive topic visualization")
offensive_topic_model.visualize_topics()

Offensive topic visualization


Same here for 'non-offensive' topics.

In [None]:
print("Neutral topic visualization")
neutral_topic_model.visualize_topics()

Neutral topic visualization


3. (1 point) What do you think about the results? How do you think it could impact a model trained on these data?


Many topics appear in both the 'offensive' and the 'non-offensive' classes, in particular the cluster made of the biggest topics (Antifa and gun control). Slurs are far more frequent in the 'offensive' topics. The 'non-offensive' class contains more topics overall, which is consistent with it holding about twice as many tweets. Some topics are exclusive to one class: for instance, "Brexit" only appears in the 'non-offensive' class, and most topics built around slurs only appear in the 'offensive' class.

A model trained on these data would therefore probably classify a tweet containing slurs as 'offensive', and tweets about topics exclusive to the 'non-offensive' class as 'non-offensive'.

4. **Bonus** By default, BERTopic extracts single keywords. Play with the model to extract bigrams or more. See if you can go deeper in your analysis.

## Evaluate a model (5 points)

You were thinking about fine-tuning a [RoBERTa](https://arxiv.org/abs/1907.11692) model on the dataset, but RoBERTa has been trained on 2019 data, which does not include any tweets. Moreover, pretraining a model from scratch can be costly. Fortunately, a [reliable entity](https://github.com/cardiffnlp) pretrained RoBERTa on tweets and even fine-tuned it on both datasets [here](https://huggingface.co/cardiffnlp/twitter-roberta-base-offensive?text=I+like+you.+I+love+you) and [here](https://huggingface.co/cardiffnlp/twitter-roberta-base-hate?text=I+like+you.+I+love+you).

1. (2 points) Evaluate their model on the test split of the dataset you picked, using precision, recall, and F1-score.

In [None]:
# Defining the checkpoint we use for the pretrained RoBERTa
checkpoint = "cardiffnlp/twitter-roberta-base-offensive"

# Loading the weights and the architecture of the model from the checkpoint
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
# Loading the tokenizer of the model from the checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# Creating a pipeline with the tokenizer and the model
classifier = pipeline('text-classification', model=model, tokenizer=tokenizer)


In [None]:
# Creation of the evaluator
evaluator = evaluate.evaluator("text-classification")

# Evaluation of the precision, recall and f1 score of the model on the 'test' split of the dataset
evaluation = evaluator.compute(
    model_or_pipeline=classifier,
    data=dataset["test"],
    metric=evaluate.combine(["precision", "recall", "f1"]),
    label_mapping={"LABEL_0": 0, "LABEL_1": 1}
)

print("Precision:", evaluation["precision"])
print("Recall:", evaluation["recall"])
print("F1-score:", evaluation["f1"])


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Precision: 0.7960199004975125
Recall: 0.6666666666666666
F1-score: 0.7256235827664399
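
To make these three figures concrete, here is a small sketch of how they derive from confusion-matrix counts. The counts are an assumption inferred from the reported values and from the test split (860 tweets, 240 of them offensive): 160 true positives, 41 false positives, and 80 false negatives reproduce the reported numbers, but the notebook does not print the confusion matrix itself.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 for one class from raw counts."""
    precision = tp / (tp + fp)  # share of flagged tweets that are truly offensive
    recall = tp / (tp + fn)     # share of offensive tweets that were flagged
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1


# Assumed counts for the offensive class (inferred, not printed by the notebook)
precision, recall, f1 = precision_recall_f1(tp=160, fp=41, fn=80)
print(f"Precision: {precision:.4f}, recall: {recall:.4f}, F1: {f1:.4f}")
```

Under these assumed counts, the model would miss one offensive tweet out of three (80 of 240), which matters more for a moderation API than the 41 wrongly flagged tweets suggest at first glance.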


To see how the model would fare on production data, you have 10K English tweets and replies available on the tweets.json file (taken from [internet archive](https://archive.org/details/archiveteam-twitter-stream-2021-07)). Note that the language was filtered using the Twitter API, so there might still be tweets in more than just English. The JSON fields were trimmed to minimum and the text was already preprocessed to mask user handles and URLs, like the tweets in your dataset.

2. (3 points) Extract the top 50 tweets your model is most confident about in the target class (offensive or hateful), the top 50 in the neutral class, and the top 50 your model is most uncertain about. Do you believe the model is doing a great job? For at least 2 tweets the model wrongly classified in your target class, try explaining what could have gone wrong.

In [None]:
# Loading data from "tweets.json"
json_dataset = pd.read_json('tweets.json')
json_dataset = Dataset.from_pandas(json_dataset)

# Making predictions with the pretrained RoBERTa model on data from "tweets.json"
predictions = classifier.predict(json_dataset["text"])

In [None]:
# Fetching prediction confidence for 'non-offensive' tweets (0 for tweets predicted as offensive)
pred_confidence = np.array([pred["score"] if pred["label"] == "LABEL_0" else 0 for pred in predictions])
# Getting the 50th highest confidence (index 49, since indices start at 0)
confidence_threshold = sorted(pred_confidence, reverse=True)[49]
# Creating a mask that keeps confidences at least as high as the 50th highest one
most_confident_mask = pred_confidence >= confidence_threshold
# Fetching the top 50 most confident 'non-offensive' predictions (ties at the threshold may add a few more)
most_confident_preds = np.array(json_dataset["text"])[most_confident_mask]

# Same here for 'offensive' tweets
offensive_pred_confidence = np.array([pred["score"] if pred["label"] == "LABEL_1" else 0 for pred in predictions])
offensive_confidence_threshold = sorted(offensive_pred_confidence, reverse=True)[49]
offensive_most_confident_mask = offensive_pred_confidence >= offensive_confidence_threshold
offensive_most_confident_preds = np.array(json_dataset["text"])[offensive_most_confident_mask]

# Same here to get the predictions with the lowest confidence, whatever the predicted class
least_pred_confidence = np.array([pred["score"] for pred in predictions])
least_confidence_threshold = sorted(least_pred_confidence, reverse=False)[49]
least_confident_mask = least_pred_confidence <= least_confidence_threshold
least_confident_preds = np.array(json_dataset["text"])[least_confident_mask]

print("Top 50 most confident 'non-offensive' predictions:")
for pred in most_confident_preds:
  print("- ", pred)

print("\n============================================================\n")

print("Top 50 most confident 'offensive' predictions:")
for pred in offensive_most_confident_preds:
  print("- ", pred)

print("\n============================================================\n")

print("Top 50 least confident predictions:")
for pred in least_confident_preds:
  print("- ", pred)

Top 50 most confident 'non-offensive' predictions:
-  Thanks Stephan! I appreciate you ♥ I'm just happy to be able to spread some more good vibes!
-  You make me happy Collin :) so thankful for all the memories we have &lt;3
-  It's what you've all been fighting so hard for! I'm so thrilled to see this happening - at last!
-  The #buildinpublic community is so supportive, I'm so grateful 🙏
-  Thank you😊 you too 😘
-  Thank you for your supporttt ! 😍❤️✨
-  Thanks for your kind words ❤️ our team are all delighted for you both
-  Oh, that would be great. I will be waiting, thank you💕
-  Love you! ♥️
-  Thank you! I’m glad!
-  Thanks Mike! And thanks for all your input during this time - I hope to continue our conversations
-  Thank u so much Yejiiii 💖💖💖
-  Good morning to you JJ, thank you 😊. I hope you’re well and enjoy a terrific Thursday
-  Thank you ☺️
-  Excellent dear and beautiful friends!! 👏👏👏🙏🏻😍💖💕🤗
-  Thank you ☺️☺️🙏🙏
-  hey anne!  I hope you have a good day ❤️
-  And thanks for t

The 50 most confident non-offensive predictions are indeed not offensive at all: they are very positive tweets, many of them thank-you messages or expressions of love.

The 50 most confident offensive predictions are very offensive and negative tweets: they contain many slurs, insults, and inappropriate messages.

As expected, the 50 least confident predictions are tweets that are often neither positive nor negative. In many of them, the author is talking about himself or herself, whereas in the confidently classified tweets the author is usually addressing someone or something else (to insult them in offensive tweets, or to thank them and express love in non-offensive tweets).

On these extremes, the model therefore behaves as we expect and does a great job.
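
The extraction of the most and least confident predictions can also be sketched in plain Python. The `top_k_by_confidence` helper and the toy predictions below are hypothetical and only mimic the `{"label": ..., "score": ...}` dictionaries returned by the classification pipeline. Slicing a sorted list returns exactly `k` items per group, even when several scores are tied.

```python
def top_k_by_confidence(predictions, texts, k=50):
    """Return the k most confident texts per label and the k least confident overall."""
    scored = list(zip(predictions, texts))
    result = {}
    for label in sorted({pred["label"] for pred in predictions}):
        in_class = [(pred["score"], text) for pred, text in scored if pred["label"] == label]
        in_class.sort(key=lambda pair: pair[0], reverse=True)
        result[label] = [text for _, text in in_class[:k]]
    # The score is the confidence in the predicted label, so the lowest
    # scores correspond to the most uncertain predictions
    uncertain = sorted(scored, key=lambda pair: pair[0]["score"])[:k]
    result["uncertain"] = [text for _, text in uncertain]
    return result


# Hypothetical pipeline outputs, for illustration only
toy_predictions = [
    {"label": "LABEL_0", "score": 0.99},
    {"label": "LABEL_1", "score": 0.95},
    {"label": "LABEL_0", "score": 0.52},
    {"label": "LABEL_1", "score": 0.61},
]
toy_texts = ["thank you", "insult A", "ambiguous", "insult B"]
selection = top_k_by_confidence(toy_predictions, toy_texts, k=1)
print(selection)
```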

We will look at wrongly classified tweets and try to understand why the model failed.

In [None]:
# Getting predictions on the test split of the tweet_eval dataset (only on the first 200 tweets to avoid long computation time)
preds = classifier.predict(dataset["test"]["text"][:200])
# Creating a mask that selects the wrong predictions
error_mask = np.array([0 if pred["label"] == "LABEL_0" else 1 for pred in preds]) != np.array(dataset["test"]["label"][:200])
# Fetching the text of the wrongly classified tweets
errors_text = np.array(dataset["test"]["text"][:200])[error_mask]
# Fetching the true label of the wrongly classified tweets
errors_label = np.array(dataset["test"]["label"][:200])[error_mask]

print("Some classification errors:")
for i in range(5):
  print("True label:", "offensive" if errors_label[i] == 1 else "non-offensive")
  print("Wrongly classified text:", errors_text[i])
  print()

Some classification errors:
True label: non-offensive
Wrongly classified text: @user @user @user I got in a pretty deep debate with my friend and she told me that latinos for Trump and blacks for Trump were paid supporters 😂 then I said you mean antifa are paid domestic terrorist and she said No they are  anti-fascist then I said they are the fascist are you kidding me?!

True label: non-offensive
Wrongly classified text: #BeckyLynch is beautiful one of the few women in wrestling that absolutely need no work done at all. She's absolutely beautiful just the way she is. Doesn't need giant boobs or a fake booty. @user is just simply amazing #HIAC

True label: offensive
Wrongly classified text: #ConstitutionDay It's very odd for the alt right conservatives to say that we are ruining the constitution just because we want #GunControlNow but they are the ones ruining the constitution getting upset because foreigners are coming to this land who are not White wanting to live

True label: offens

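These wrongly classified tweets contain neither direct insults nor thanks: they are neither clearly positive nor clearly negative, so they are harder to classify. In the first two examples, which the model wrongly flagged as offensive, a plausible explanation is the vocabulary: the first tweet reports a debate using charged words ("terrorist", "fascist"), and the second is a compliment that mentions "giant boobs" and a "fake booty". The model probably reacts to these words rather than to the intent of the tweet.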

3. **Bonus** Use [SHAP](https://github.com/slundberg/shap/tree/45b85c1837283fdaeed7440ec6365a886af4a333#natural-language-example-transformers) on the provided tweets, or manually written texts, to see if you can find topics on which the model is biased.
4. **Bonus** Train a naive Bayes model on the data, and compare its results with this model.

In [None]:
# Creating the naive Bayes model
naive_bayes = Pipeline([('Vect', CountVectorizer()), ('Mnb', MultinomialNB())])
# Fitting the naive Bayes model on the train split
naive_bayes.fit(dataset["train"]["text"], dataset["train"]["label"])
# Predicting on the test split and computing per-class metrics
naive_bayes_predictions = naive_bayes.predict(dataset["test"]["text"])
precision, recall, fscore, _ = precision_recall_fscore_support(dataset["test"]["label"], naive_bayes_predictions)

# Macro-averaging: unweighted mean of the two per-class scores
print("Naive Bayes precision:", (precision[0] + precision[1]) / 2)
print("Naive Bayes recall:", (recall[0] + recall[1]) / 2)
print("Naive Bayes F1-score:", (fscore[0] + fscore[1]) / 2)

Naive Bayes precision: 0.7294573643410853
Naive Bayes recall: 0.7138440860215054
Naive Bayes F1-score: 0.7205924510272337


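The naive Bayes metrics are surprisingly high: its precision is lower than that of the pretrained RoBERTa model, but its recall is higher and its F1-score is similar. The comparison is only indicative, however: the naive Bayes figures are macro-averaged over both classes, whereas the RoBERTa figures were computed with the default settings of the `evaluate` metrics, which should score the offensive class only.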
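
How much the averaging mode changes the reported figures can be seen on a toy example. The sketch below uses made-up labels and two hypothetical helpers (`per_class_prf`, `macro_prf`) to contrast scoring the positive class only with the unweighted mean over both classes.

```python
def per_class_prf(y_true, y_pred, positive):
    """Precision, recall, and F1 when `positive` is treated as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_prf(y_true, y_pred):
    """Unweighted mean of the per-class scores (macro-averaging)."""
    scores = [per_class_prf(y_true, y_pred, c) for c in sorted(set(y_true))]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))


# Hypothetical imbalanced labels (1 = offensive), for illustration only
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

binary = per_class_prf(y_true, y_pred, positive=1)
macro = macro_prf(y_true, y_pred)
print("Offensive class only:", [round(v, 3) for v in binary])
print("Macro-average:       ", [round(v, 3) for v in macro])
```

On imbalanced data the macro-average is pulled up by the majority class, which is usually easier to predict, so the two kinds of figures should not be compared directly.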

## Annotate data (7 points)

Regardless of the model's performance, you decide to annotate your own collection of tweets.

1. (1 point) Extract about 100 tweets containing at least 20% of your target class (offensive/hateful), from the 10K tweets provided. You can use the pretrained model to help you find tweets in the target class.


In [None]:
offensive_tweets = np.array(json_dataset["text"])[[pred["label"] == "LABEL_1" for pred in predictions]]
non_offensive_tweets = np.array(json_dataset["text"])[[pred["label"] == "LABEL_0" for pred in predictions]]

extracted_tweets = list(non_offensive_tweets[:55]) + list(offensive_tweets[:45])
pd.Series(extracted_tweets).to_csv("extracted_tweets.csv", header=False, index=True, encoding='utf-8')
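
The cell above takes the first 55 and 45 predicted tweets in file order. An alternative, shown here only as a sketch with a hypothetical helper and toy data, draws them at random with a fixed seed and shuffles the result, so that annotators do not see the two predicted classes grouped together.

```python
import random


def sample_for_annotation(offensive, non_offensive, n_total=100, target_share=0.45, seed=42):
    """Draw a reproducible, shuffled sample with a given share of the target class."""
    rng = random.Random(seed)
    n_offensive = round(n_total * target_share)
    sample = rng.sample(offensive, n_offensive) + rng.sample(non_offensive, n_total - n_offensive)
    rng.shuffle(sample)  # mix both classes so their order gives no hint to annotators
    return sample


# Hypothetical tweets standing in for the model-predicted classes
toy_offensive = [f"offensive tweet {i}" for i in range(200)]
toy_neutral = [f"neutral tweet {i}" for i in range(800)]
sample = sample_for_annotation(toy_offensive, toy_neutral)
print(len(sample), sum(text.startswith("offensive") for text in sample))
```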

2. (3 points) Altogether, write down an annotation guideline (which should be at least 2/3 of a page long).
    * What does the target class look like?
    * Any examples you could provide for ambiguous cases?
    * Keep a "Can't tell / not annotable" class. Make sure you document what this class means in your guideline.


The guidelines can be found in the guidelines.pdf file.

3. (1 point) Every person in your group is going to annotate these tweets separately. So if you are 4, annotate them 4 times.
    * Typically, create a Google sheet or an Excel document, one tab per person, in each tab one column for the text, and another for the class.



Annotations can be found [here](https://docs.google.com/spreadsheets/d/1JJ3l4RV2JSSEi8DR6IapNUuO_3gWnvvYdyvx_I_ak3I/edit?usp=sharing)



4. (2 points) Evaluate your inter-annotator agreement using Cohen Kappa (if you are 2) or Fleiss Kappa.
    * If, like your teacher, you have issues making the [NLTK implementation](https://www.nltk.org/_modules/nltk/metrics/agreement.html) work on the latest version of Python (3.10+), you can use the [scikit-learn implementation](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.cohen_kappa_score.html) of Cohen Kappa, and compute a matrix by pair of annotators.
    * What does the score mean? Are you doing a good job annotating the data and, if not, why?

In [5]:
annotations = pd.read_csv("Rate_dataframe.csv")
raters = annotations.columns

# Get all pairs of raters
rater_pairs = [(a, b) for idx, a in enumerate(raters) for b in raters[idx + 1:]]

for rater1, rater2 in rater_pairs:
  print("Cohen Kappa score between", rater1, "and", rater2,":")
  print(cohen_kappa_score(annotations[rater1], annotations[rater2]))
  print()

# Replacing -1 annotations with 2 because aggregate_raters only takes non-negative values
annotations = annotations.replace(-1, 2)
# Aggregating raters in order to compute Fleiss Kappa score
aggregated_raters = aggregate_raters(annotations.values, 3)
# Computing Fleiss Kappa score
fleiss_kappa_score = fleiss_kappa(aggregated_raters[0])
print("Fleiss Kappa score:", fleiss_kappa_score)

Cohen Kappa score between Adrien and Jules :
0.30991217063989973

Cohen Kappa score between Adrien and Noé :
0.3139757498404595

Cohen Kappa score between Adrien and Yacine :
0.34041184041184036

Cohen Kappa score between Jules and Noé :
0.37877770315648085

Cohen Kappa score between Jules and Yacine :
0.3699684984249213

Cohen Kappa score between Noé and Yacine :
0.4494761143668976

Fleiss Kappa score: 0.3535488634859599


The Kappa score measures the degree of agreement between several raters, corrected for chance: it is the observed agreement rate minus the agreement expected by chance, divided by one minus the agreement expected by chance. A score of 1 means perfect agreement, a score of 0 means that the raters agree no more than chance would predict, and a negative score means that they agree less than chance.
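
As an illustration, Cohen's Kappa for two raters can be computed directly from its usual definition, (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The sketch below uses made-up annotations and a hypothetical `cohen_kappa` helper; it is meant to show the arithmetic, not to replace `cohen_kappa_score`.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's Kappa for two raters who labelled the same items."""
    n = len(rater_a)
    # Observed agreement: share of items on which both raters gave the same label
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: probability that both pick the same label independently
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n**2
    return (observed - expected) / (1 - expected)


# Hypothetical annotations (1 = offensive, 0 = non-offensive, -1 = can't tell)
rater_a = [1, 1, 0, 0, 0, 1, 0, -1, 0, 1]
rater_b = [1, 0, 0, 0, 1, 1, 0, -1, 0, 0]
kappa = cohen_kappa(rater_a, rater_b)
print(round(kappa, 3))
```

Here the two raters agree on 7 items out of 10, but because 0.43 of that agreement is expected by chance, the Kappa score is noticeably lower than the raw agreement rate.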

We got a Fleiss Kappa score of 0.35, which is usually read as fair agreement between us. We could improve the agreement by making the guidelines more specific.

To get better annotations, we could also discuss together which label to give to a tweet instead of annotating separately and independently.
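
For more than two raters, Fleiss' Kappa follows the same idea: the average pairwise agreement per tweet is compared with the agreement expected from the overall label proportions. The following sketch computes it from scratch on made-up ratings; the `fleiss_kappa_from_ratings` helper is hypothetical and assumes that every item was labelled by the same number of raters.

```python
from collections import Counter


def fleiss_kappa_from_ratings(ratings):
    """Fleiss' Kappa; `ratings` holds one list per item with the label given by each rater."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [Counter(item) for item in ratings]
    categories = set().union(*counts)
    # Per-item agreement: share of rater pairs that gave the same label
    p_items = [
        (sum(c * c for c in count.values()) - n_raters) / (n_raters * (n_raters - 1))
        for count in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall proportion of each label
    p_cat = [sum(count[cat] for count in counts) / (n_items * n_raters) for cat in categories]
    p_e = sum(p * p for p in p_cat)
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical ratings: 4 tweets, each labelled by 4 raters (1 = offensive, 0 = non-offensive)
toy_ratings = [
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
]
toy_kappa = fleiss_kappa_from_ratings(toy_ratings)
print(round(toy_kappa, 3))
```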

5. **Bonus** Iterate on your annotation guideline with what you learned. Please send both versions in your report.

Please provide the annotation sheets, the guideline, and the inter-annotator agreement in your report.


Agreement after updating the guidelines and doing another annotation iteration:

In [6]:
annotations = pd.read_csv("Rate_dataframe2.csv")
raters = annotations.columns

# Get all pairs of raters
rater_pairs = [(a, b) for idx, a in enumerate(raters) for b in raters[idx + 1:]]

for rater1, rater2 in rater_pairs:
  print("Cohen Kappa score between", rater1, "and", rater2,":")
  print(cohen_kappa_score(annotations[rater1], annotations[rater2]))
  print()

# Replacing -1 annotations with 2 because aggregate_raters only takes non-negative values
annotations = annotations.replace(-1, 2)
# Aggregating raters in order to compute Fleiss Kappa score
aggregated_raters = aggregate_raters(annotations.values, 3)
# Computing Fleiss Kappa score
fleiss_kappa_score = fleiss_kappa(aggregated_raters[0])
print("Fleiss Kappa score:", fleiss_kappa_score)

Cohen Kappa score between Adrien and Jules :
0.45465877220317863

Cohen Kappa score between Adrien and Noé :
0.48096885813148793

Cohen Kappa score between Adrien and Yacine :
0.40204563335955934

Cohen Kappa score between Jules and Noé :
0.6329152646069134

Cohen Kappa score between Jules and Yacine :
0.4022988505747126

Cohen Kappa score between Noé and Yacine :
0.49293177627535345

Fleiss Kappa score: 0.4771140318520388


The Fleiss Kappa score is clearly better now: it rose from 0.35 to 0.48, which is usually read as moderate agreement.