
- I need you to dive deeper into Gemini 2.5 Pro's explanations and how hashtags are used in them.
    - My quick analysis showed that Gemini uses the word “hashtag” in its explanations a lot.
    - Question 1: The model also uses “hashtag” explanations much more often in some categories. Why is this? Is it a problem, or is it human-like?
    - Question 2: You hypothesized that hashtags are used as a shortcut and that the model does not look at the videos at all. Is this true? Or does the model look at the videos and then just justify its labels using hashtags?
    - To look at the videos, copy a `post_id` and navigate to UI library > query=saxony-elections-… > subset=All Dates > Single Post Id=`post_id`
- Related hypothesis: Sometimes Gemini 2.5 Pro is right, and the humans are wrong.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
from bench_lib.evaluation import load_ai_labels, load_human_labels

human_labels_long, questions, comment_cols = load_human_labels(long=True)
ai_labels = load_ai_labels(["gemini-pro-no-schema"], questions, comment_cols)

In [None]:
# Keep only the posts for which both the humans and the AI provided comments. A post may have
# comments from more than one human, so those rows are aggregated before the means are computed below.

# TODO: build a dataframe with the human and AI comments for matching post_ids side by side

human_commented = human_labels_long[human_labels_long["comment"].notnull() & (human_labels_long["comment"].str.strip() != "")]
common_post_ids = set(human_commented["post_id"]).intersection(set(ai_labels["post_id"]))
print(len(common_post_ids))  # a bare expression in the middle of a cell is not displayed

human_common_commented = human_commented[human_commented["post_id"].isin(common_post_ids)]
ai_common_commented = ai_labels[ai_labels["post_id"].isin(common_post_ids)]
len(human_common_commented["post_id"].unique()) == len(ai_common_commented["post_id"].unique())
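
The side-by-side dataframe from the TODO above could be sketched roughly as follows. This is a minimal sketch on toy data, assuming the AI frame has one row per (`post_id`, `variable`) and that both frames share the column names used in this notebook; `human_toy`, `ai_toy`, and all comment texts are hypothetical stand-ins, not real data.

```python
import pandas as pd

# Hypothetical toy stand-ins for human_common_commented and ai_common_commented.
human_toy = pd.DataFrame({
    "post_id": ["a", "a", "b"],
    "variable": ["is_political", "is_political", "is_saxony"],
    "value": [1, 1, 0],
    "comment": ["mentions a party", "#election in caption", "no local reference"],
})
ai_toy = pd.DataFrame({
    "post_id": ["a", "b"],
    "variable": ["is_political", "is_saxony"],
    "value": [1, 1],
    "comment": ["The hashtag #election signals politics.", "The speaker names Dresden."],
})

# Collapse multiple human comments per (post, category) into one list each.
human_lists = (
    human_toy.groupby(["post_id", "variable"])["comment"]
    .apply(list)
    .reset_index(name="human_comments")
)

# One row per (post, category) with the human and model comments side by side.
side_by_side = human_lists.merge(
    ai_toy.rename(columns={"comment": "ai_comment", "value": "ai_value"}),
    on=["post_id", "variable"],
    how="inner",
)
print(side_by_side)
```

The human labels are dropped here for brevity; they could be collected into a list in the same way as the comments.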

In [None]:
# TODO: consider using Python's emoji library instead to cover the full emoji range
import re
# Non-capturing groups (?:...) avoid the pandas warning about match groups in str.contains.
pattern = (
    r"(?:hashtags?|#\w+"  # "hashtag", "hashtags", or #word
    r"|[\U0001F600-\U0001F64F]"  # emoticons: common face emojis (😀-🙏)
    r"|[\U0001F300-\U0001F5FF]"  # symbols & pictographs, includes hearts (💙)
    r"|[\U0001F1E6-\U0001F1FF])"  # regional indicator symbols, used for flags (🇩🇪)
)

human_common_commented = human_common_commented.assign(
    comment_contains_hashtag=human_common_commented["comment"].str.contains(pattern, regex=True, flags=re.IGNORECASE, na=False)
)
ai_common_commented = ai_common_commented.assign(
    comment_contains_hashtag=ai_common_commented["comment"].str.contains(pattern, regex=True, flags=re.IGNORECASE, na=False)
)
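
Because the combined pattern lumps three different things together, it may help to see how each part behaves on its own. The sketch below splits the flag into the word "hashtag", a literal `#tag`, and an emoji; the example comments and the column names `mentions_word`, `copies_tag`, and `copies_emoji` are made up for illustration and do not come from the dataset.

```python
import re
import pandas as pd

# Hypothetical example comments, not taken from the dataset.
comments = pd.Series([
    "The hashtags point to a political party.",
    "Caption contains #sachsen and #wahl.",
    "Blue heart 💙 next to the flag 🇩🇪.",
    "A man talks about the weather.",
])

word_pat = r"hashtags?"              # the word "hashtag" / "hashtags"
tag_pat = r"#\w+"                    # a literal #tag copied from the post
emoji_pat = (
    r"[\U0001F600-\U0001F64F"        # emoticons
    r"\U0001F300-\U0001F5FF"         # symbols & pictographs
    r"\U0001F1E6-\U0001F1FF]"        # regional indicators (flags)
)

flags = pd.DataFrame({
    "mentions_word": comments.str.contains(word_pat, flags=re.IGNORECASE, regex=True),
    "copies_tag": comments.str.contains(tag_pat, regex=True),
    "copies_emoji": comments.str.contains(emoji_pat, regex=True),
})
# Equivalent to the single combined flag used in this notebook.
flags["any_match"] = flags.any(axis=1)
print(flags)
```

Keeping the three flags separate would make it possible to check whether the human/model gap comes from the model talking about hashtags or from it copying them verbatim.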

In [None]:
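# Average within each (post, category, label) first, so that posts with several human comments
# do not count more than once in the per-category means below.
agg = human_common_commented.groupby(["post_id", "variable", "value"])["comment_contains_hashtag"].mean().reset_index()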

In [None]:
h_mean_per_cat = agg.groupby(["variable", "value"], as_index=False)["comment_contains_hashtag"].mean()
m_mean_per_cat = ai_common_commented.groupby(["variable", "value"], as_index=False)["comment_contains_hashtag"].mean()
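
To illustrate why the human means are computed in two steps, here is a small sketch on hypothetical toy data: a post with several human comments would otherwise dominate the category mean. The frame `toy` and its values are invented for this example.

```python
import pandas as pd

# Hypothetical toy data: post "a" has three human comments, post "b" has one.
toy = pd.DataFrame({
    "post_id": ["a", "a", "a", "b"],
    "variable": ["is_political"] * 4,
    "value": [1, 1, 1, 1],
    "comment_contains_hashtag": [True, True, True, False],
})

# Naive mean over rows: post "a" counts three times.
naive = toy["comment_contains_hashtag"].mean()

# Two-step mean: average within each post first, then across posts.
per_post = toy.groupby(["post_id", "variable", "value"])["comment_contains_hashtag"].mean()
two_step = per_post.groupby(level=["variable", "value"]).mean().iloc[0]

print(naive, two_step)
```

In this toy case the naive mean is 0.75 while the two-step mean is 0.5, because each post contributes exactly once to the second figure.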

In [None]:
h_mean_per_cat

In [None]:
m_mean_per_cat

#### @Tomas: My quick analysis showed that Gemini uses the word “hashtag” in its explanations a lot. Question 1: The model also uses “hashtag” explanations much more often in some categories. Why is this? Is it a problem, or is it human-like?

#### Out of all posts that are commented on by both humans and Gemini, what does the hashtag distribution per category look like?
(“Hashtag” is defined more broadly here: in addition to the word “hashtag”, we also count explicit hashtag usage, i.e. hashtags and emojis that the model copies verbatim from the post. This is why the discrepancy between humans and the model is even larger this time.)

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

def plot_human_vs_model_means(h_mean_per_cat, m_mean_per_cat):
    
    h_mean_per_cat = h_mean_per_cat.copy()
    h_mean_per_cat["labeler"] = "Human"
    
    m_mean_per_cat = m_mean_per_cat.copy()
    m_mean_per_cat["labeler"] = "Model"
    
    combined_df = pd.concat([h_mean_per_cat, m_mean_per_cat], ignore_index=True)
    
    plt.figure(figsize=(12, 6))
    sns.barplot(
        data=combined_df,
        x="variable",
        y="comment_contains_hashtag",
        hue="labeler",
        errorbar=None,  # seaborn >= 0.12; replaces the deprecated ci=None
        palette="muted"
    )
    
    plt.xlabel("Category (Variable)", fontsize=12)
    plt.ylabel("Mean (Comment contains hashtag)", fontsize=12)
    plt.title("Comparison of Human vs Model Means per Category", fontsize=14)
    plt.xticks(rotation=45, ha="right", fontsize=10)
    plt.legend(title="Labeler", fontsize=10)
    plt.tight_layout()
    plt.show()

In [None]:
plot_human_vs_model_means(h_mean_per_cat, m_mean_per_cat)

There seems to be an alignment between the human labelers and the model in how often they use hashtags when commenting on categories such as is_intolerant, is_political, and is_saxony. My assumption is that, even though a category like is_intolerant received a high disagreement score, it is relatively easy to rely on hashtags for this group of categories (is_intolerant, is_political, is_saxony). In contrast, is_eudaimonic_entertainment and is_hedonic_entertainment are different: they are difficult to explain from hashtags alone, and judging them correctly requires an interplay of all modalities (video, audio, and text). The model does seem to comment with hashtags more often than the human labelers, though. This might indicate that, if the model is taking shortcuts, it prefers to justify its labels using hashtags, but for now this is just speculation.
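
One way to make the comparison above less dependent on reading the bar chart is to put the two rate tables next to each other and look at the difference per category. The following is only a sketch on invented numbers; `h_toy` and `m_toy` stand in for `h_mean_per_cat` and `m_mean_per_cat`, and the rates are not real results.

```python
import pandas as pd

# Hypothetical per-category rates, standing in for h_mean_per_cat and m_mean_per_cat.
h_toy = pd.DataFrame({
    "variable": ["is_political", "is_hedonic_entertainment"],
    "value": [1, 1],
    "comment_contains_hashtag": [0.20, 0.00],
})
m_toy = pd.DataFrame({
    "variable": ["is_political", "is_hedonic_entertainment"],
    "value": [1, 1],
    "comment_contains_hashtag": [0.60, 0.10],
})

# Align both tables on (category, label) and compute the model-minus-human gap.
gap = h_toy.merge(m_toy, on=["variable", "value"], how="outer", suffixes=("_human", "_model"))
gap["model_minus_human"] = (
    gap["comment_contains_hashtag_model"] - gap["comment_contains_hashtag_human"]
)
gap = gap.sort_values("model_minus_human", ascending=False).reset_index(drop=True)
print(gap)
```

An outer merge is used so that a category/label combination present for only one labeler shows up as a missing value instead of silently disappearing.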

This was an older comment about the first histogram we looked at, which used a naive mean calculation without any normalization steps. However, I still think it holds true that categories such as is_eudaimonic_entertainment and is_hedonic_entertainment can serve as distinguishing categories, and we can keep them for the second round as well. Scaling up to 20 categories, while possibly helpful for sociological analysis, should be avoided for practical reasons: comments carry more importance for the AI analysis, and no one would be willing to label 20 categories and then comment on each of them. Five categories, on the other hand, could still be a reasonable limit.

#### Question 2: You hypothesized that hashtags are used as a shortcut and that the model does not look at the videos at all. Is this true? Or does the model look at the videos and then just justify its labels using hashtags?

Preview of categories of interest and labeler comments with hashtags

In [None]:
h_comments_with_hashtags = human_common_commented[human_common_commented["comment_contains_hashtag"]]

In [None]:
m_comments_with_hashtags = ai_common_commented[ai_common_commented["comment_contains_hashtag"]]

In [None]:
categories_of_interest = ["is_eudaimonic_entertainment", "is_hedonic_entertainment", "is_intolerant"]
h_filtered_comments = h_comments_with_hashtags[h_comments_with_hashtags["variable"].isin(categories_of_interest)]
m_filtered_comments = m_comments_with_hashtags[m_comments_with_hashtags["variable"].isin(categories_of_interest)]
h_grouped_comments = h_filtered_comments.groupby(["post_id", "variable", "value"])["comment"].apply(list).reset_index()
m_grouped_comments = m_filtered_comments.groupby(["post_id", "variable", "value"])["comment"].apply(list).reset_index()

# Bare expressions in the middle of a cell are not displayed, so print them explicitly.
print(h_grouped_comments.head())
print(len(h_grouped_comments))  # only 5 instances; 16 with is_saxony and is_political included
print(m_grouped_comments.head())
print(len(m_grouped_comments))  # 126 instances; 315 with is_saxony and is_political included

In [None]:
# Get videos to inspect
matching_post_ids = set(h_grouped_comments["post_id"]) & set(m_grouped_comments["post_id"])
matching_post_ids

Library Video Inspection Results
Only the narrowed-down set of videos selected in the code above was inspected.

In [None]:
h_grouped_comments

In [None]:
print(h_grouped_comments[h_grouped_comments["post_id"] == "7388827240718404896"]["comment"].values)
print(h_grouped_comments[h_grouped_comments["post_id"] == "7388827240718404896"]["value"].values)

In [None]:
# A good example where the model considers all modalities: it correctly identifies the genre of the music
# being played without a clear indication in the video, and uses hashtags only for further elaboration.
# One might of course argue that the language of the hashtags is an indicator of the "Metalheads" group,
# but I guess this level of nuance could only be captured by looking further into the model's generation,
# similar to what Philipp mentioned during the meeting about the use of the number 88 among certain right-wing groups.
# The more I look at individual posts, the more I think there may be other indicators that blur the
# judgement of whether, or to what degree, the model uses other modalities.
# The video also shows people making the heavy metal "Devil's horns" gesture with their hands. Interestingly,
# a hand gesture that resembles it, known in Germany as the "Wolfsgruß", is used among right-wing Turkish groups.
# Overall, at least for examples like this, I think it is a good idea to provide the model with metadata;
# otherwise, given only a video without any context, it might be hard to distinguish between all of these.

print(m_grouped_comments[m_grouped_comments["post_id"] == "7388827240718404896"]["comment"].values)
print(m_grouped_comments[m_grouped_comments["post_id"] == "7388827240718404896"]["value"].values)

In [None]:
print(h_grouped_comments[h_grouped_comments["post_id"] == "7408532445294791969"]["comment"].values)
print(h_grouped_comments[h_grouped_comments["post_id"] == "7408532445294791969"]["value"].values)

In [None]:
print(m_grouped_comments[m_grouped_comments["post_id"] == "7408532445294791969"]["comment"].values)
print(m_grouped_comments[m_grouped_comments["post_id"] == "7408532445294791969"]["value"].values)

In [None]:
print(h_grouped_comments[h_grouped_comments["post_id"] == "7410732413246082336"]["comment"].values)
print(h_grouped_comments[h_grouped_comments["post_id"] == "7410732413246082336"]["value"].values)

In [None]:
# An interesting example by Gemini. The last part in particular looks thought through and articulate
# compared to the plain comments written by the humans.
# You mentioned as a related hypothesis "Sometimes Gemini 2.5 Pro is right, and the humans are wrong."
# Even though both sides reach the same conclusion on this one and labelled the post as 1,
# the way they get there seems to differ: the model gives a more nuanced justification
# for its label.
print(m_grouped_comments[m_grouped_comments["post_id"] == "7410732413246082336"]["comment"].values)
print(m_grouped_comments[m_grouped_comments["post_id"] == "7410732413246082336"]["value"].values)

In [None]:
print(h_grouped_comments[h_grouped_comments["post_id"] == "7416726076744863008"]["comment"].values)
print(h_grouped_comments[h_grouped_comments["post_id"] == "7416726076744863008"]["value"].values)

In [None]:
# Another example by Gemini where it clearly uses hashtags only for further elaboration
print(m_grouped_comments[m_grouped_comments["post_id"] == "7416726076744863008"]["comment"].values)
print(m_grouped_comments[m_grouped_comments["post_id"] == "7416726076744863008"]["value"].values)
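
The repeated print cells above could be folded into one small helper. The sketch below is an assumption about how that might look; the function name `show_post` and the toy frames are hypothetical, and the real call would pass `h_grouped_comments` and `m_grouped_comments` instead.

```python
import pandas as pd

def show_post(post_id, human_df, model_df):
    """Print the labels and comments for one post from both labelers."""
    rows = {}
    for name, df in (("human", human_df), ("model", model_df)):
        subset = df[df["post_id"] == post_id]
        rows[name] = subset
        print(f"--- {name} ---")
        for _, row in subset.iterrows():
            print(f"{row['variable']} = {row['value']}: {row['comment']}")
    return rows

# Hypothetical toy frames with the same columns as the grouped comment frames.
h_toy = pd.DataFrame({
    "post_id": ["p1", "p2"],
    "variable": ["is_intolerant", "is_intolerant"],
    "value": [1, 0],
    "comment": [["#tag only"], ["nothing offensive"]],
})
m_toy = pd.DataFrame({
    "post_id": ["p1"],
    "variable": ["is_intolerant"],
    "value": [1],
    "comment": [["The speaker insults a group; the hashtags reinforce this."]],
})

result = show_post("p1", h_toy, m_toy)
```

Returning the filtered rows as well as printing them makes it easy to reuse the helper in a loop over `matching_post_ids`.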

In [None]:
# TODO:
# - Get the posts where human labelers justify their labels using the audio modality
#   (keywords: sound, audio, voice, music, song; also: combination, both, either),
#   especially posts where there is high disagreement between humans.
# - Get the same posts from the AI labels and compare the justifications: are they mostly hashtag-based or not?
# - Calculate the percentage of posts justified by the audio modality and the percentage justified by hashtags,
#   as well as the agreement between human and AI labels on these posts.
#   RQ: the impact of different modalities while labeling.
# - See Tomas' regression analysis + disagreement/agreement_score for the correlation between human and AI labels.
#   Related hypothesis: Sometimes Gemini 2.5 Pro is right, and the humans are wrong.
#   Is this due to modality? Do some modalities introduce noise? -> ablation study
# - Go back to the first Qwen generations (toxicainment output file).
#   In one example, Qwen 72B recognizes the joke (or whatever that was) "ein Dreier", whereas Qwen 3B and 7B
#   proceed directly with a hashtag justification.
# - A lower parameter count means less representation space, but Qwen also applies some tricks to optimize
#   memory usage. Does this have any impact, e.g. a smaller model combined with resized/smaller images?
#   Is there any distortion in the quality of the image the model sees? The Qwen team probably knows this
#   well enough to have published it, so there is no need to check.

#human_commented = human_commented.assign(comment_contains_audio=human_commented["comment"].str.contains("audio|sound|voice|music|song"))
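
The audio-keyword part of the TODO above could be started along these lines. This is a rough sketch on made-up comments: the word-boundary pattern, the optional plural `s`, and the column names `mentions_audio` and `mentions_hashtag` are my assumptions, and keyword matching will of course miss paraphrases.

```python
import re
import pandas as pd

# Hypothetical example comments, not taken from the dataset.
toy = pd.DataFrame({
    "post_id": ["a", "b", "c"],
    "comment": [
        "The music and the voice-over make this entertaining.",
        "Hashtag #afd shows the political context.",
        "Sounds like a campaign song, see #wahl.",
    ],
})

# Word boundaries avoid matching the keywords inside longer words.
audio_pat = r"\b(?:audio|sound|voice|music|song)s?\b"
hashtag_pat = r"hashtags?|#\w+"

toy = toy.assign(
    mentions_audio=toy["comment"].str.contains(audio_pat, flags=re.IGNORECASE, regex=True, na=False),
    mentions_hashtag=toy["comment"].str.contains(hashtag_pat, flags=re.IGNORECASE, regex=True, na=False),
)

# Share of comments that mention each kind of evidence.
shares = toy[["mentions_audio", "mentions_hashtag"]].mean()
print(toy[["post_id", "mentions_audio", "mentions_hashtag"]])
print(shares)
```

Running the same flags on the human and the AI comments for the same posts would give the two percentages mentioned in the TODO, which can then be compared per category.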