# CARDS Exploratory Data Analysis

In this notebook, we explore the Augmented CARDS dataset and sample a small benchmark of 600 claims, each with a corresponding label (0-5).

In [1]:
import string

import pandas as pd
import plotly.express as px

## Load the Augmented CARDS dataset

In [2]:
# Download the augmented_cards.csv file from https://figshare.com/articles/dataset/Untitled_Item/25465036
df = pd.read_csv("../../data/benchmark/augmented_cards.csv")
df

Unnamed: 0,text,binary_claim,cards_claim,acards_claim,full_claim,DATASET,PARTITION
0,! !! # climate change feelings https://t.co/X8...,0,,,,waterloo,VALID
1,!! climate change should not be a partisan iss...,0,,,,waterloo,TRAIN
2,""" ""Current scientific understanding provides l...",0,0_0,0_0,,cards,TRAIN
3,""" ""International accords and underlying region...",0,0_0,0_0,,cards,VALID
4,""" ...all models contain large errors in precip...",1,5_1,5_1,,cards,TRAIN
...,...,...,...,...,...,...,...
75648,💪Be the change.✅Shape the future.📣Submit your ...,0,0_0,0_0,0_0,golden,TEST
75649,📝 Tackling the climate crisis is daunting. Thi...,0,0_0,0_0,0_0,golden,TEST
75650,📺Our Technological Innovations and Climate Cha...,0,0_0,0_0,0_0,golden,TEST
75651,🚨 SHOCKING: Top Conservative donor Lord Frost ...,0,0_0,0_0,0_0,golden,TEST


In [3]:
# There are 3 dataset sources
df["DATASET"].value_counts()

DATASET
waterloo    43943
cards       28999
golden       2711
Name: count, dtype: int64

`waterloo`

Quote from [Hierarchical CARDS](https://www.nature.com/articles/s43247-024-01573-7.pdf):
> To enhance the model’s performance, we incorporated the Climate Change Twitter Dataset labelled by the University of Waterloo, featuring a 90/10 ratio of verified and misleading tweets (https://www.kaggle.com/datasets/edqian/twitter-climate-change-sentiment-dataset), to the binary classifier training set.

_The issue is that the labels for this dataset are binary only._

`cards`

Quotes from [CARDS](https://osf.io/preprints/socarxiv/crxfm):

> We wrote custom software to harvest all content from 20 conservative think tanks and 33 climate contrarian blogs and the 255 climate-related content of 20 conservative think-tanks over period from 1998 to 2020. Extended Data Tables 1-2 provide a full list of the blogs and CTTs included in this study, as well as the number of documents provided by each source. We collected a total of 249,413 climate change relevant documents—which contain over 174 million words (tokens)—from these 53 sources over the relevant time period. Extended Data Figs. 1-2 illustrate the total document frequencies over time, offering the monthly counts of documents for blogs and CTTs.

>  In order to provide an accurate assessment of model performance in light of noisy label information and to facilitate comparison across deep and shallow classifiers, we split our annotated paragraphs into a training set (n = 23,436), validation set (n = 2,605), and an "error free" test set (n = 2,904). To arrive at the "error free" test set, we 1) generated a random sample of annotated paragraphs which matched the class distribution in the training set and 2) re-annotated the test set to fix clear annotation errors.

_It's unclear what "error free" implies for the training and validation sets. Does it mean that they might still contain labeling errors?_

`golden`

Quote from [Hierarchical CARDS](https://www.nature.com/articles/s43247-024-01573-7.pdf):
> To assess the model’s capabilities, climate change experts labelled a
testing set of tweets following the CARDS taxonomy. This dataset, denoted as Expert
Annotated Climate Tweets in Table 1, was composed of ~~2607~~ 2711 tweets related to
climate change, sampled from the platform in the second half of 2022.



In [4]:
# Remove duplicates, ignoring case and ASCII punctuation (the first occurrence is kept)
canonical_text = (
    df["text"].str.lower().str.replace(f"[{string.punctuation}]", "", regex=True)
)
df = df.loc[canonical_text.drop_duplicates().index]
df

Unnamed: 0,text,binary_claim,cards_claim,acards_claim,full_claim,DATASET,PARTITION
0,! !! # climate change feelings https://t.co/X8...,0,,,,waterloo,VALID
1,!! climate change should not be a partisan iss...,0,,,,waterloo,TRAIN
2,""" ""Current scientific understanding provides l...",0,0_0,0_0,,cards,TRAIN
3,""" ""International accords and underlying region...",0,0_0,0_0,,cards,VALID
4,""" ...all models contain large errors in precip...",1,5_1,5_1,,cards,TRAIN
...,...,...,...,...,...,...,...
75648,💪Be the change.✅Shape the future.📣Submit your ...,0,0_0,0_0,0_0,golden,TEST
75649,📝 Tackling the climate crisis is daunting. Thi...,0,0_0,0_0,0_0,golden,TEST
75650,📺Our Technological Innovations and Climate Cha...,0,0_0,0_0,0_0,golden,TEST
75651,🚨 SHOCKING: Top Conservative donor Lord Frost ...,0,0_0,0_0,0_0,golden,TEST
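
As a minimal sketch of what the canonicalization in `In [4]` does, the same steps can be run on a toy frame. The three texts below are made up for illustration and are not taken from the dataset.

```python
import string

import pandas as pd

# Toy texts (made up for illustration, not taken from the dataset)
toy = pd.DataFrame(
    {
        "text": [
            "Climate change is real!",
            "climate change is real",
            "CO2 is plant food.",
        ]
    }
)

# Same canonical form as in the cell above: lowercase, then strip ASCII punctuation
canonical = (
    toy["text"].str.lower().str.replace(f"[{string.punctuation}]", "", regex=True)
)

# drop_duplicates keeps the first occurrence of each canonical text
deduplicated = toy.loc[canonical.drop_duplicates().index]
n_removed = len(toy) - len(deduplicated)
print(deduplicated["text"].tolist(), n_removed)
```

Only ASCII punctuation is stripped, so texts that differ by emoji, whitespace, or typographic quotes are still treated as distinct.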


In [5]:
# Remove texts that only have a binary label (bye waterloo)
display(df["cards_claim"].isna().value_counts())
df = df.dropna(subset="cards_claim")
df

cards_claim
True     41082
False    31509
Name: count, dtype: int64

Unnamed: 0,text,binary_claim,cards_claim,acards_claim,full_claim,DATASET,PARTITION
2,""" ""Current scientific understanding provides l...",0,0_0,0_0,,cards,TRAIN
3,""" ""International accords and underlying region...",0,0_0,0_0,,cards,VALID
4,""" ...all models contain large errors in precip...",1,5_1,5_1,,cards,TRAIN
5,""" ...analyzed storminess across the whole of s...",1,1_7,1_7,,cards,TRAIN
6,""" A paper published ...in the Journal of Geoph...",1,2_1,2_1,,cards,TRAIN
...,...,...,...,...,...,...,...
75648,💪Be the change.✅Shape the future.📣Submit your ...,0,0_0,0_0,0_0,golden,TEST
75649,📝 Tackling the climate crisis is daunting. Thi...,0,0_0,0_0,0_0,golden,TEST
75650,📺Our Technological Innovations and Climate Cha...,0,0_0,0_0,0_0,golden,TEST
75651,🚨 SHOCKING: Top Conservative donor Lord Frost ...,0,0_0,0_0,0_0,golden,TEST


In [6]:
# Remove the single remaining waterloo text
display(df["DATASET"].value_counts())
df = df[df["DATASET"] != "waterloo"]
df = df.copy()

DATASET
cards       28868
golden       2640
waterloo        1
Name: count, dtype: int64

In [7]:
# Rename "golden" to "twitter" for clarity
df["DATASET"] = df["DATASET"].replace("golden", "twitter")
df["DATASET"].value_counts()

DATASET
cards      28868
twitter     2640
Name: count, dtype: int64

## Explore the distribution of claims

In [8]:
# Add a column for CARDS claim level 1 & 2
df = df.rename(columns={"cards_claim": "cards_claim_2"})
df["cards_claim_1"] = df["cards_claim_2"].str.split("_").str[0]
df

Unnamed: 0,text,binary_claim,cards_claim_2,acards_claim,full_claim,DATASET,PARTITION,cards_claim_1
2,""" ""Current scientific understanding provides l...",0,0_0,0_0,,cards,TRAIN,0
3,""" ""International accords and underlying region...",0,0_0,0_0,,cards,VALID,0
4,""" ...all models contain large errors in precip...",1,5_1,5_1,,cards,TRAIN,5
5,""" ...analyzed storminess across the whole of s...",1,1_7,1_7,,cards,TRAIN,1
6,""" A paper published ...in the Journal of Geoph...",1,2_1,2_1,,cards,TRAIN,2
...,...,...,...,...,...,...,...,...
75648,💪Be the change.✅Shape the future.📣Submit your ...,0,0_0,0_0,0_0,twitter,TEST,0
75649,📝 Tackling the climate crisis is daunting. Thi...,0,0_0,0_0,0_0,twitter,TEST,0
75650,📺Our Technological Innovations and Climate Cha...,0,0_0,0_0,0_0,twitter,TEST,0
75651,🚨 SHOCKING: Top Conservative donor Lord Frost ...,0,0_0,0_0,0_0,twitter,TEST,0


![](./cards_taxonomy.png)

In [9]:
df["is_contrarian_claim"] = df["cards_claim_1"] != "0"
px.sunburst(
    df.sort_values("cards_claim_2"),
    path=["is_contrarian_claim", "cards_claim_1", "cards_claim_2"],
    color="cards_claim_1",
    title="Distribution of claims by taxonomy level",
    width=600,
    height=600,
).update_traces(sort=False, hovertemplate="%{id}: <b>%{value:,d}</b>").update_layout(
    dict(margin=dict(l=20, r=20, t=50, b=20), title_x=0.5)
)

In [10]:
df["cards_claim_2"].value_counts().sort_values()

cards_claim_2
3_6        1
3_4        1
1_8        5
1_0        6
4_3        9
3_0       12
1_2      202
3_1      263
1_6      277
4_5      287
4_2      302
1_3      344
4_4      357
3_3      424
2_3      447
1_1      449
3_2      453
4_1      530
1_7      620
1_4      631
2_1     1151
5_1     1841
5_2     2014
0_0    20882
Name: count, dtype: int64

I propose we begin with the CARDS Level 1 taxonomy for the following reasons:

- Starting with a simpler taxonomy of 5 + 1 classes, rather than 27 + 1, will streamline our initial work.
- While the CARDS Level 2 examples provide some depth, they may not be fully exhaustive.
- Several CARDS Level 2 claims have very limited labeled examples. For instance, claim `3_6` has only one example available.
- Even the authors noted potential confusion among certain sublabels, as illustrated below:
  > Due to thematic overlap between sub-claims 5.2 (Movement is unreliable) and 5.3 (Climate is a conspiracy), we collapsed these claims into a single measure.
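
The sparsity argument can be checked with a short sketch. The counts are copied from the `value_counts()` output above; the threshold of 100 examples is a hypothetical cut-off chosen for illustration only.

```python
import pandas as pd

# Level-2 counts copied from the value_counts() output above
level2_counts = pd.Series(
    {
        "3_6": 1, "3_4": 1, "1_8": 5, "1_0": 6, "4_3": 9, "3_0": 12,
        "1_2": 202, "3_1": 263, "1_6": 277, "4_5": 287, "4_2": 302, "1_3": 344,
        "4_4": 357, "3_3": 424, "2_3": 447, "1_1": 449, "3_2": 453, "4_1": 530,
        "1_7": 620, "1_4": 631, "2_1": 1151, "5_1": 1841, "5_2": 2014, "0_0": 20882,
    }
)

# Hypothetical threshold, chosen for illustration only
min_examples = 100
sparse_level2 = level2_counts[level2_counts < min_examples]

# Aggregate to level 1 by keeping the part of each label before the underscore
level1_counts = level2_counts.groupby(lambda label: label.split("_")[0]).sum()

print(
    f"{len(sparse_level2)} of {len(level2_counts)} level-2 labels "
    f"have fewer than {min_examples} examples"
)
print(level1_counts.sort_index())
```

Once aggregated, every level-1 class has more than a thousand examples in the full dataset, which is part of why level 1 is the more comfortable starting point.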

## Limitations of the CARDS Dataset

The CARDS dataset provides claims extracted directly from articles.
However, the full context of these articles is not available 😢.
Additionally, some claims originate from short tweets.

The median claim length is approximately 40 words.
That is roughly one tenth the length of the 2-minute excerpts from Mediatree...

In [11]:
# Median length per dataset
df["text"].str.split().str.len().groupby(df["DATASET"]).median()

DATASET
cards      43.0
twitter    32.0
Name: text, dtype: float64

## Create a temporary benchmark

This sample gives us something to get started with until we have our own Mediatree benchmark.

In [12]:
benchmark = df

In [13]:
# Claims from the class `0` (no contrarian claim) often have nothing to do with climate, such as:
# > If you drive regularly along Route 100 in Howard County, you might notice the lighting is a little dimmer than it used to be
# To avoid making the classification too easy, we only sample examples that contain some climate-related keywords

# I asked ChatGPT to generate these
climate_keywords = [
    "climate change",
    "global warming",
    "greenhouse",
    "carbon",
    "sustainability",
    "ecosystem",
    "biodiversity",
    "renewable energy",
    "climate policy",
    "adaptation and mitigation",
]
# Remove claims that are too easy to classify
mask_to_drop = (benchmark["cards_claim_1"] == "0") & ~benchmark[
    "text"
].str.lower().str.contains("|".join(climate_keywords), regex=True)
benchmark = benchmark.drop(mask_to_drop[mask_to_drop].index)
benchmark

Unnamed: 0,text,binary_claim,cards_claim_2,acards_claim,full_claim,DATASET,PARTITION,cards_claim_1,is_contrarian_claim
3,""" ""International accords and underlying region...",0,0_0,0_0,,cards,VALID,0,False
4,""" ...all models contain large errors in precip...",1,5_1,5_1,,cards,TRAIN,5,True
5,""" ...analyzed storminess across the whole of s...",1,1_7,1_7,,cards,TRAIN,1,True
6,""" A paper published ...in the Journal of Geoph...",1,2_1,2_1,,cards,TRAIN,2,True
7,""" Eme et al. conclude , in their words, that ""...",1,3_2,3_2,,cards,TRAIN,3,True
...,...,...,...,...,...,...,...,...,...
75647,👁 Top insights:- People can be moved to action...,0,0_0,0_0,0_0,twitter,TEST,0,False
75648,💪Be the change.✅Shape the future.📣Submit your ...,0,0_0,0_0,0_0,twitter,TEST,0,False
75650,📺Our Technological Innovations and Climate Cha...,0,0_0,0_0,0_0,twitter,TEST,0,False
75651,🚨 SHOCKING: Top Conservative donor Lord Frost ...,0,0_0,0_0,0_0,twitter,TEST,0,False
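
The filter in `In [13]` can be illustrated on a toy frame. This is a sketch under assumptions: the rows and labels below are made up (the first text is the example quoted in the cell's comment), the keyword list is shortened, and `re.escape` is added as a precaution that the original cell does not use.

```python
import re

import pandas as pd

# Toy rows; the texts and labels are made up for illustration
toy = pd.DataFrame(
    {
        "text": [
            "If you drive regularly along Route 100 in Howard County, "
            "you might notice the lighting is a little dimmer than it used to be",
            "Global warming is a major policy challenge.",
            "Carbon dioxide is plant food.",
        ],
        "cards_claim_1": ["0", "0", "3"],
    }
)

# Shortened keyword list; re.escape guards against regex metacharacters in keywords
keywords = ["climate change", "global warming", "carbon"]
pattern = "|".join(re.escape(keyword) for keyword in keywords)

# An "easy negative" is a class-0 text that mentions none of the keywords
has_keyword = toy["text"].str.lower().str.contains(pattern, regex=True)
is_easy_negative = (toy["cards_claim_1"] == "0") & ~has_keyword

filtered = toy[~is_easy_negative]
print(filtered[["text", "cards_claim_1"]])
```

Contrarian claims (labels 1 to 5) are always kept; only class-0 texts without any keyword are dropped.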


In [14]:
# Since the authors mention that the test set from CARDS is the only one to be noise-free, we'll sample only from the test set
# The twitter split should be noise-free too
benchmark = benchmark[
    (benchmark["DATASET"] == "twitter")
    | ((benchmark["DATASET"] == "cards") & (benchmark["PARTITION"] == "TEST"))
]

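# Sample 100 claims per level-1 label (6 x 100 = 600), which balances the classes
# Within each label, random sampling should approximately preserve the original distribution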
benchmark = benchmark.groupby("cards_claim_1").sample(
    100, replace=False, random_state=42
)
benchmark = benchmark.rename(
    columns={"text": "claim", "cards_claim_1": "label", "DATASET": "source"}
)[["claim", "label", "source"]].reset_index(drop=True)
benchmark = benchmark.sort_values(["label", "source", "claim"])
benchmark

Unnamed: 0,claim,label,source
95,"""We humans are creating the conditions for our...",0,cards
89,55. Comments about climate change made by envi...,0,cards
34,A Week of Dire Predictions: Climate Change Con...,0,cards
39,A report on the flood disaster and climate cha...,0,cards
62,"According to a new poll , just 42 percent of G...",0,cards
...,...,...,...
552,This friggin tool and his climate change hypoc...,5,twitter
546,Today I see far more extreme climate misinform...,5,twitter
514,Watching @iancollinsuk on @TalkTV commenting t...,5,twitter
593,Where do they come up with this shit???Environ...,5,twitter
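
Note that `groupby(...).sample(n)` draws a fixed number of rows per group, so the benchmark is balanced across labels whatever the original class frequencies were. A minimal sketch on a toy frame (made-up labels and sources) shows this, along with a label-by-source table that could be used to inspect the resulting mix:

```python
import pandas as pd

# Toy frame with imbalanced labels; values are made up for illustration
toy = pd.DataFrame(
    {
        "label": ["0"] * 6 + ["1"] * 4,
        "source": [
            "cards", "cards", "cards", "cards", "twitter", "twitter",
            "cards", "twitter", "twitter", "twitter",
        ],
    }
)

# Draw the same number of rows from every label group
sampled = toy.groupby("label").sample(2, replace=False, random_state=42)

# Each label ends up with exactly 2 rows, regardless of its original frequency
per_label = sampled["label"].value_counts()

# Label-by-source table to inspect the mix of sources within each label
label_by_source = pd.crosstab(sampled["label"], sampled["source"])
print(per_label)
print(label_by_source)
```

With `replace=False`, the call raises an error if any group has fewer rows than requested, which doubles as a check that each label has at least 100 candidates.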


In [15]:
benchmark.to_csv("../../data/benchmark/cards_sample_600.csv", index=False)
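
One thing to keep in mind when loading the saved file: the labels are strings (`"0"` to `"5"`) in this notebook, but `pd.read_csv` infers integers by default. A small sketch using an in-memory buffer and made-up rows, rather than the real file:

```python
import io

import pandas as pd

# Toy benchmark with string labels, mirroring the columns saved above
toy = pd.DataFrame(
    {
        "claim": ["first claim", "second claim"],
        "label": ["0", "5"],
        "source": ["cards", "twitter"],
    }
)

# Round-trip through CSV in memory instead of writing to disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default parsing infers an integer dtype for the label column
buffer.seek(0)
reloaded_default = pd.read_csv(buffer)

# Passing dtype keeps the labels as strings, as they are in this notebook
buffer.seek(0)
reloaded_str = pd.read_csv(buffer, dtype={"label": str})

print(reloaded_default["label"].dtype, reloaded_str["label"].tolist())
```

Either representation works, as long as downstream code compares labels against the same type.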