# CARDS Article Reproduction

**Author:** Cristian Alexanther Rojas Cardenas
**ID**: 32775849

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Load-and-Preprocess" data-toc-modified-id="Load-and-Preprocess-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Load and Preprocess</a></span></li><li><span><a href="#Training-Models" data-toc-modified-id="Training-Models-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Training Models</a></span><ul class="toc-item"><li><span><a href="#Logistic-Regression" data-toc-modified-id="Logistic-Regression-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Logistic Regression</a></span></li><li><span><a href="#Roberta" data-toc-modified-id="Roberta-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Roberta</a></span></li></ul></li><li><span><a href="#Inference" data-toc-modified-id="Inference-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Inference</a></span><ul class="toc-item"><li><span><a href="#Validation" data-toc-modified-id="Validation-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Validation</a></span></li><li><span><a href="#Test" data-toc-modified-id="Test-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Test</a></span></li></ul></li><li><span><a href="#Data-Generated" data-toc-modified-id="Data-Generated-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Data Generated</a></span></li></ul></div>

In [1]:
# Load the required packages

import re
import torch
import pickle
import unicodedata
import numpy as np
import pandas as pd

from IPython.display import display, Markdown, Latex

from scipy.special import softmax

from sklearn.preprocessing import LabelEncoder
from sklearn.utils.class_weight import compute_class_weight
from sklearn.metrics import f1_score, accuracy_score, precision_score, recall_score
from sklearn import preprocessing
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

from simpletransformers.classification import ClassificationModel

from matplotlib import pyplot as plt
from sklearn.metrics import (
    accuracy_score, roc_auc_score, 
    classification_report, confusion_matrix, ConfusionMatrixDisplay)

from cards.utils import read_csv
import cards.preprocess as pp
from cards.fit.logistic import fit_logistic_classifier

from tqdm.notebook import tqdm
pd.set_option('display.max_colwidth', None)
tqdm.pandas()

if torch.cuda.is_available():    
    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use GPU {}:'.format(
        torch.cuda.current_device()), torch.cuda.get_device_name(torch.cuda.current_device()))

else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

2023-04-05 15:17:47.792796: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2023-04-05 15:17:47.792825: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: crarojasca-Blade-14-RZ09-0370
2023-04-05 15:17:47.792831: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: crarojasca-Blade-14-RZ09-0370
2023-04-05 15:17:47.792876: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 530.30.2
2023-04-05 15:17:47.792890: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 530.30.2
2023-04-05 15:17:47.792894: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 530.30.2


No GPU available, using the CPU instead.


In [2]:
with open('cards/models/label_encoder.pkl', 'rb') as f:
    le = pickle.load(f)

https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


In [3]:
le.classes_.shape

(18,)

## Load and Preprocess

In [5]:
# Load and pre-process the text data
# Define text pre-processing functions
def remove_between_square_brackets(text):
    return re.sub(r'\[[^]]*\]', '', text)
def remove_non_ascii(text):
    """Remove non-ASCII characters from a string"""
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
def strip_underscores(text):
    return re.sub(r'_+', ' ', text)
def remove_multiple_spaces(text):
    return re.sub(r'\s{2,}', ' ', text)

# Merge text pre-processing functions
def denoise_text(text):
    text = remove_between_square_brackets(text)
    text = remove_non_ascii(text)
    text = strip_underscores(text)
    text = remove_multiple_spaces(text)
    return text.strip()
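
To make the effect of the four cleaning steps concrete, here is a minimal, self-contained sketch that inlines the same regular expressions and runs them on a made-up sentence. The sample string is an assumption chosen for illustration; it is not taken from the CARDS data.

```python
import re
import unicodedata

def denoise_text(text):
    # Same four steps as the cell above, inlined so the sketch runs on its own.
    text = re.sub(r'\[[^]]*\]', '', text)   # drop [bracketed] spans
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    text = re.sub(r'_+', ' ', text)         # underscores become spaces
    text = re.sub(r'\s{2,}', ' ', text)     # collapse repeated whitespace
    return text.strip()

# Hypothetical input containing an accent, a bracketed reference,
# a double underscore and extra spaces.
sample = "Caf\u00e9 prices [1] rose__sharply   in 2020 "
cleaned = denoise_text(sample)
print(cleaned)
```

Note that the accented character is reduced to its ASCII base letter rather than removed, because the NFKD normalisation splits it into a letter plus a combining mark before the ASCII encoding step drops the mark.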

In [3]:
# Load the data
train = pd.read_csv('cards/data/training/training.csv')
train["PARTITION"] = "TRAIN"
valid = pd.read_csv('cards/data/training/validation.csv')
valid["PARTITION"] = "VALID"
test = pd.read_csv('cards/data/training/test.csv')
test["PARTITION"] = "TEST"

data = pd.concat([train, valid, test], ignore_index=True)

# Pre-process the text
data['roberta_preprocessed'] = data["text"].astype(str).apply(denoise_text)

data['lr_preprocessed'] = data["text"].progress_apply(
    lambda text: pp.tokenize(pp.denoise_text(str(text)), remove_stops=True))

# Create a fresh label encoder (this replaces the pickled one loaded earlier)
le = LabelEncoder()

# Encode the labels
data['label'] = le.fit_transform(data.claim)

data.head()

  0%|          | 0/28945 [00:00<?, ?it/s]



Unnamed: 0,text,claim,PARTITION,roberta_preprocessed,lr_preprocessed,label
0,What do you do if you are a global warming alarmist and real-world temperatures do not warm as much as your climate model predicted?,5_1,TRAIN,What do you do if you are a global warming alarmist and real-world temperatures do not warm as much as your climate model predicted?,global warming alarmist real world temperatures warm much climate model predicted,16
1,"(2.) A sun-blocking volcanic aerosols component to explain the sudden but temporary cooling of global sea surface temperatures that are caused by catastrophic volcanic eruptions; and,",0_0,TRAIN,"(2.) A sun-blocking volcanic aerosols component to explain the sudden but temporary cooling of global sea surface temperatures that are caused by catastrophic volcanic eruptions; and,",2 sun blocking volcanic aerosols component explain sudden temporary cooling global sea surface temperatures caused catastrophic volcanic eruptions,0
2,"Now, I am very interested in the AMO, since it strongly influences Atlantic hurricanes, Arctic sea ice, and Greenland climate. We are already seeing a recovery of the Atlantic sector of the Arctic sea ice, and some hints of cooling in Greenland.",1_1,TRAIN,"Now, I am very interested in the AMO, since it strongly influences Atlantic hurricanes, Arctic sea ice, and Greenland climate. We are already seeing a recovery of the Atlantic sector of the Arctic sea ice, and some hints of cooling in Greenland.",interested amo since strongly influences atlantic hurricanes arctic sea ice greenland climate already seeing recovery atlantic sector arctic sea ice hints cooling greenland,1
3,"Dr. Christy addressed recent challenges to the satellite data. One paper claimed to show that the satellite data actually show warming. The author, however, used only 9 percent on the satellite data the data with the least coverage and the greatest error. Each attack of the satellite data has disregarded the fact that this record is independently validated by a 98 percent correspondence with the radiosonde balloon data. These same scientists seem to put a lot of credence in surface temperature data that only cover 10 percent of the globe, nearly all of which is in the Northern Hemisphere.",0_0,TRAIN,"Dr. Christy addressed recent challenges to the satellite data. One paper claimed to show that the satellite data actually show warming. The author, however, used only 9 percent on the satellite data the data with the least coverage and the greatest error. Each attack of the satellite data has disregarded the fact that this record is independently validated by a 98 percent correspondence with the radiosonde balloon data. These same scientists seem to put a lot of credence in surface temperature data that only cover 10 percent of the globe, nearly all of which is in the Northern Hemisphere.",dr. christy addressed recent challenges satellite data one paper claimed show satellite data actually show warming author however used 9 percent satellite data data least coverage greatest error attack satellite data disregarded fact record independently validated 98 percent correspondence radiosonde balloon data scientists seem put lot credence surface temperature data cover 10 percent globe nearly northern hemisphere,0
4,"After a brief protest from Massachusetts Republicans in their state Senate, the commonwealth is on the verge of changing its law to allow Gov. Deval Patrick (D) to appoint an interim Senator until the special election to fill the late Sen. Edward Kennedy's seat can be held in January.",0_0,TRAIN,"After a brief protest from Massachusetts Republicans in their state Senate, the commonwealth is on the verge of changing its law to allow Gov. Deval Patrick (D) to appoint an interim Senator until the special election to fill the late Sen. Edward Kennedy's seat can be held in January.",brief protest massachusetts republicans state senate commonwealth verge changing law allow gov. deval patrick appoint interim senator special election fill late sen. edward kennedy 's seat held january,0
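
In the table above, claim `5_1` is encoded as 16 and `0_0` as 0. This is consistent with how `LabelEncoder` assigns codes: it sorts the distinct labels and uses each label's position in that sorted list. The sketch below reproduces the idea in plain Python on a toy subset of three labels, so the resulting codes differ from the 18-class codes used in this notebook.

```python
# Toy subset of claim labels; the real data has 18 distinct classes.
labels = ["5_1", "0_0", "1_1", "0_0"]

# Sorted distinct labels play the role of le.classes_.
classes = sorted(set(labels))
to_code = {label: i for i, label in enumerate(classes)}

# Equivalent in spirit to le.transform(...) and le.inverse_transform(...).
encoded = [to_code[label] for label in labels]
decoded = [classes[i] for i in encoded]

print(classes, encoded)
```

Because the mapping depends only on the sorted set of labels, refitting an encoder on data with the same label set should give the same codes as the pickled encoder.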


## Training Models
### Logistic Regression

In [4]:
def fit_logistic_classifier(X, y):

    # Vectorize
    vectorizer = TfidfVectorizer(min_df=3,  max_features=None,
                                strip_accents='unicode',
                                ngram_range=(1, 2), use_idf=True, smooth_idf=True, sublinear_tf=True)

    # Fit final logistic classifier. Hyperparameters tuned via grid search using
    #  10-fold cross-validation
    clf_logit = LogisticRegression(C=7.96,
                                solver='lbfgs',
                                multi_class='ovr',
                                max_iter=200,
                                class_weight='balanced')
    
    pipe = Pipeline([('vectorizer', vectorizer), ('clf_logit', clf_logit)])
    pipe.fit(X, y)

    return pipe

# Fit the model
data_train = data.loc[data.PARTITION == "TRAIN"]
lr_model = fit_logistic_classifier(data_train.lr_preprocessed, data_train.label)

data['lr_pred'] = le.inverse_transform(lr_model.predict(data.lr_preprocessed))
data['lr_proba'] = lr_model.predict_proba(data.lr_preprocessed).tolist()

### Roberta

In [5]:
# Define the model 
architecture = 'roberta'
# model_name = 'CARDS_RoBERTa_Classifier'
model_name = "cards/models/CARDS_RoBERTa_Classifier"

# Load the classifier
roberta_model = ClassificationModel(architecture, model_name)

Some weights of the model checkpoint at cards/models/CARDS_RoBERTa_Classifier were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [6]:
# Predict the labels
predictions, raw_outputs = roberta_model.predict(list(data.roberta_preprocessed))

data['roberta_pred'] = le.inverse_transform(predictions)
data['roberta_proba'] = [softmax(element[0]) for element in raw_outputs]

## Inference

In [None]:
def report(y_true, y_pred, scores,  classes):
    
    acc = accuracy_score(y_true, y_pred)
    roc_auc = roc_auc_score(y_true, scores, multi_class="ovr", average="weighted")
    
    print(f"Accuracy: {acc}")
    print(f"AUC: {roc_auc}")
    print(classification_report(le.transform(y_true), le.transform(y_pred), target_names=classes))
    c_m = confusion_matrix(y_true, y_pred)
    cmp = ConfusionMatrixDisplay(
        c_m, display_labels=classes)
    fig, ax = plt.subplots(figsize=(10,10))
    cmp.plot(ax=ax)   

classes = le.classes_
# Evaluate on the training partition
data_train = data[data.PARTITION=="TRAIN"]

display(Markdown("**Logistic Regression**"))
report(data_train.claim, data_train['lr_pred'].values, 
       np.stack(data_train['lr_proba'].values, axis=0), classes)

display(Markdown("**Roberta**"))
report(data_train.claim, data_train['roberta_pred'].values, 
       np.stack(data_train['roberta_proba'].values, axis=0), classes)
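
For readers who want to see what the accuracy and confusion matrix in `report` measure, the following is a minimal pure-Python sketch on five made-up label pairs. The labels and predictions are assumptions for illustration only; the row and column convention (rows are true labels, columns are predicted labels, both in sorted order) matches `sklearn.metrics.confusion_matrix`.

```python
from collections import Counter

# Hypothetical true and predicted claim labels.
y_true = ["0_0", "5_1", "5_1", "1_1", "0_0"]
y_pred = ["0_0", "5_1", "0_0", "1_1", "0_0"]

# Accuracy: fraction of positions where prediction equals truth.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Confusion matrix: rows are true labels, columns are predicted labels.
classes = sorted(set(y_true) | set(y_pred))
pair_counts = Counter(zip(y_true, y_pred))
matrix = [[pair_counts[(t, p)] for p in classes] for t in classes]

print(accuracy)
for label, row in zip(classes, matrix):
    print(label, row)
```

The off-diagonal entry in the `5_1` row shows one `5_1` text predicted as `0_0`, which is the kind of error the plotted matrix makes visible per class.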

### Validation

In [None]:
classes = le.classes_
data_valid= data[data.PARTITION=="VALID"]

display(Markdown("**Logistic Regression**"))
report(data_valid.claim, data_valid['lr_pred'].values, 
       np.stack(data_valid['lr_proba'].values, axis=0), classes)

display(Markdown("**Roberta**"))
report(data_valid.claim, data_valid['roberta_pred'].values, 
       np.stack(data_valid['roberta_proba'].values, axis=0), classes)

### Test

<img src="images/image.png" alt="drawing" width="600"/>

In [None]:
classes = le.classes_
data_test = data[data.PARTITION=="TEST"].copy(deep=True)

display(Markdown("**Logistic Regression**"))
report(data_test.claim, data_test['lr_pred'].values, 
       np.stack(data_test['lr_proba'].values, axis=0), classes)

display(Markdown("**Roberta**"))
report(data_test.claim, data_test['roberta_pred'].values, 
       np.stack(data_test['roberta_proba'].values, axis=0), classes)

In [None]:
data_test["score_diff"] = abs(data_test.lr_proba.apply(max) - data_test.roberta_proba.apply(max))
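
As a small illustration of what `score_diff` captures, the sketch below computes the same quantity on two made-up rows using plain lists instead of DataFrame columns. The probability values are assumptions, not model output.

```python
# Hypothetical per-class probabilities for two texts from each model.
lr_proba = [[0.7, 0.2, 0.1], [0.4, 0.35, 0.25]]
roberta_proba = [[0.9, 0.05, 0.05], [0.1, 0.1, 0.8]]

# Absolute gap between each model's top confidence, row by row.
score_diff = [abs(max(a) - max(b)) for a, b in zip(lr_proba, roberta_proba)]

# Index of each model's top class, to check whether they agree.
lr_top = [row.index(max(row)) for row in lr_proba]
roberta_top = [row.index(max(row)) for row in roberta_proba]

print(score_diff, lr_top, roberta_top)
```

In the second row the two maxima belong to different classes, so a large `score_diff` compares how confident each model is, not whether the models agree on the label.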

In [None]:
(
    data_test[data_test.claim == data_test.roberta_pred]
    .sort_values("score_diff", ascending=False)
)[["text", "claim", "lr_pred", "roberta_pred", "score_diff"]][:10]

# 2.1 "It's geological"
# 5.2 "Proxies are unreliable" -> "Environmentalists are biased" 
# 1.1 "It's the temperature" -> "No claim"

In [None]:
(
    data_test[data_test.claim != data_test.roberta_pred]
    .sort_values("score_diff", ascending=False)
)[["text", "claim", "lr_pred", "roberta_pred", "score_diff"]][:10]

# 2.1.4 "Past climate change" -> "No claim"
# 2.3.3 "CO2 lags climate" -> 5.1.4 "Models are unreliable"
# 1.4 "Hiatus in warming" -> 2.1 "Past climate change, it's geological" -> 5.1 "Proxies are unreliable?"

In [None]:
data.to_csv("CARDS_scored.csv")

## Data Generated

In [2]:
# Load and pre-process the text data
# Define text pre-processing functions
def remove_between_square_brackets(text):
    return re.sub(r'\[[^]]*\]', '', text)
def remove_non_ascii(text):
    """Remove non-ASCII characters from a string"""
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
def strip_underscores(text):
    return re.sub(r'_+', ' ', text)
def remove_multiple_spaces(text):
    return re.sub(r'\s{2,}', ' ', text)

# Merge text pre-processing functions
def denoise_text(text):
    text = remove_between_square_brackets(text)
    text = remove_non_ascii(text)
    text = strip_underscores(text)
    text = remove_multiple_spaces(text)
    return text.strip()

# Define the model 
architecture = 'roberta'
# model_name = 'CARDS_RoBERTa_Classifier'
model_name = "cards/models/CARDS_RoBERTa_Classifier"

# Load the classifier
roberta_model = ClassificationModel(architecture, model_name, use_cuda=False)

Some weights of the model checkpoint at cards/models/CARDS_RoBERTa_Classifier were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [11]:
text = """The Paris Agreement is a waste of time and money. It will do little to actually reduce emissions and will only hurt thep_text = pp.tokenize(pp.denoise_text(str(text)), remove_stops=True)"""
predictions, raw_outputs = roberta_model.predict([text])
prediction = le.inverse_transform(predictions)
score = [softmax(element[0]) for element in raw_outputs]
prediction, score

  0%|          | 0/1 [00:00<?, ?it/s]

  0%|          | 0/1 [00:00<?, ?it/s]

(array(['4_2'], dtype=object),
 [array([1.92078608e-02, 1.24708095e-04, 1.20347048e-04, 9.06095597e-05,
         1.74216548e-04, 1.25754849e-04, 1.06050310e-04, 1.50851931e-04,
         9.09488312e-05, 1.08554764e-04, 1.03412133e-04, 9.63269202e-05,
         4.45473362e-03, 9.73702785e-01, 4.06041881e-04, 1.04975681e-04,
         7.32501268e-05, 7.58571408e-04])])
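
The score vector above has its largest entry at index 13, which the label encoder maps back to `4_2`. The sketch below shows, in plain Python, how a softmax turns raw logits into probabilities that sum to one and how the top class index is read off. The logits are made-up values, and the hand-written `softmax` is a stand-in for `scipy.special.softmax` on a 1-D input.

```python
import math

def softmax(logits):
    # Subtract the maximum first so exp() cannot overflow; the result is unchanged.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a three-class problem.
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)

# Index of the most probable class, analogous to the model's prediction.
best = max(range(len(probs)), key=probs.__getitem__)

print(probs, best)
```

Softmax preserves the ordering of the logits, so the predicted class is the same whether it is taken from the raw outputs or from the probabilities.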

In [None]:
FILE = "datasets/generated_disinformation_taxonomy_CARDS_CHATGPT_specific_samples_V2.csv"

data = pd.read_csv(FILE)

# Pre-process the text
data['roberta_preprocessed'] = data["text"].astype(str).apply(denoise_text)

data['roberta_preprocessed'] = data["roberta_preprocessed"].apply(lambda x: x.split("5.")[0])

# Predict the labels
predictions, raw_outputs = roberta_model.predict(list(data.roberta_preprocessed))

data['cards_pred'] = le.inverse_transform(predictions)
data['cards_proba'] = [softmax(element[0]) for element in raw_outputs]

data.to_csv(FILE, index=False)

In [9]:
data.to_csv("datasets/generated_disinformation_taxonomy_CARDS_CHATGPT_specific_samples_predict.csv", 
            index=False)

In [11]:
data["cards_pred"] = le.inverse_transform(predictions)
data['cards_proba'] = [softmax(element[0]) for element in raw_outputs]
data.to_csv(FILE, index=False)

In [8]:
data.loc[data.DATASET=="cards", 'cards_pred'] = le.inverse_transform(predictions)

In [14]:
data.loc[data.DATASET=="cards", 'cards_proba'] = [str(softmax(element[0]).tolist()) for element in raw_outputs]

In [17]:
data.loc[data.DATASET=="cards", 'cards_pred']

0        5_1
1        0_0
2        1_1
3        5_1
4        0_0
        ... 
28940    0_0
28941    5_2
28942    5_2
28943    5_2
28944    5_2
Name: cards_pred, Length: 28945, dtype: object

In [18]:
FILE

'datasets/cards_waterloo.csv'

In [15]:
data.to_csv(FILE, index=False)

In [6]:
le.inverse_transform(predictions).shape

(28945,)

In [None]:
def report(y_true, y_pred, scores,  classes):
    
    acc = accuracy_score(y_true, y_pred)
    
    print(f"Accuracy: {acc}")
    print(classification_report(le.transform(y_true), le.transform(y_pred), target_names=classes))
    c_m = confusion_matrix(y_true, y_pred)
    cmp = ConfusionMatrixDisplay(
        c_m, display_labels=classes)
    fig, ax = plt.subplots(figsize=(10,10))
    cmp.plot(ax=ax)   

display(Markdown("**Roberta**"))
classes = le.classes_
report(data.generated_label.values, data['roberta_pred'].values, 
       np.stack(data['roberta_proba'].values, axis=0)[:,1:], classes)

In [None]:
ran_generated = pd.read_csv("datasets/generated_disinformation.csv")

# Pre-process the text
ran_generated['roberta_preprocessed'] = ran_generated["text"].astype(str).apply(denoise_text)

ran_generated['lr_preprocessed'] = ran_generated["text"].progress_apply(
    lambda text: pp.tokenize(pp.denoise_text(str(text)), remove_stops=True))

In [None]:
# Define the model 
architecture = 'roberta'
# model_name = 'CARDS_RoBERTa_Classifier'
model_name = "cards/models/CARDS_RoBERTa_Classifier"

# Load the classifier
roberta_model = ClassificationModel(architecture, model_name)

# Predict the labels
predictions, raw_outputs = roberta_model.predict(list(ran_generated.roberta_preprocessed))

In [None]:
import pickle
# Load label encoder
with open('cards/models/label_encoder.pkl', 'rb') as f:
    le = pickle.load(f)

ran_generated['roberta_pred'] = le.inverse_transform(predictions)
ran_generated['roberta_proba'] = [softmax(element[0]) for element in raw_outputs]

In [None]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(5, 8))

# Horizontal bars: counts run along the x-axis, labels along the y-axis
ran_generated['roberta_pred'].value_counts().plot.barh(ax=ax)
ax.set_title("Distribution of Randomly Generated Data")
ax.invert_yaxis()
ax.set_xlabel("Count")
ax.set_ylabel("Label")

In [None]:
ran_generated['roberta_pred'].value_counts().to_frame()

In [2]:
import pandas as pd
tmp = pd.read_csv("datasets/augmented/{seed}/cards_augmented_{n}_{seed}.csv")
tmp.groupby(["claim", "DATASET"]).text.count()
tmp

Unnamed: 0.1,Unnamed: 0,text,claim,PARTITION,DATASET,based_claims,roberta_preprocessed,cards_pred,cards_proba,labels,cards_aug_400_pred,cards_aug_400_proba
0,0,What do you do if you are a global warming ala...,5_1,TRAIN,cards,,,,,16,5_1,[2.66602323e-05 9.55529793e-06 6.46228181e-06 ...
1,1,(2.) A sun-blocking volcanic aerosols componen...,0_0,TRAIN,cards,,,,,0,0_0,[9.99892366e-01 3.43110850e-06 1.55333296e-06 ...
2,2,"Now, I am very interested in the AMO, since it...",1_1,TRAIN,cards,,,,,1,1_1,[3.40918034e-04 9.97579768e-01 2.22130897e-04 ...
3,3,Dr. Christy addressed recent challenges to the...,0_0,TRAIN,cards,,,,,0,5_1,[1.15811178e-01 7.15264941e-05 2.30180540e-05 ...
4,4,After a brief protest from Massachusetts Repub...,0_0,TRAIN,cards,,,,,0,0_0,[9.99924239e-01 3.15635907e-06 1.35488427e-06 ...
...,...,...,...,...,...,...,...,...,...,...,...,...
35740,35740,The climate change movement is nothing more th...,5_2,TRAIN,generated-chatgpt,"[26074, 28886, 28838]",The climate change movement is nothing more th...,5_2,[1.86649233e-04 1.36169071e-05 7.82586332e-06 ...,17,5_2,[4.60948588e-06 1.33967328e-05 1.83201059e-05 ...
35741,35741,Climate change alarmists are like religious ze...,5_2,TRAIN,generated-chatgpt,"[26507, 28899, 28911]",Climate change alarmists are like religious ze...,5_1,[2.18725477e-03 8.55337122e-05 5.04474727e-05 ...,17,5_2,[2.58458459e-06 2.44979204e-05 2.65273978e-05 ...
35742,35742,The climate has been changing for millions of ...,5_2,TRAIN,generated-chatgpt,"[28913, 28852, 28861]",The climate has been changing for millions of ...,2_1,[1.96669253e-04 8.84273677e-06 2.08247946e-05 ...,17,5_2,[5.06228738e-06 1.37137704e-05 1.53661954e-05 ...
35743,35743,Climate change is just a way for politicians t...,5_2,TRAIN,generated-chatgpt,"[28622, 26097, 28855]",Climate change is just a way for politicians t...,5_2,[4.80194881e-04 1.51505610e-05 1.03923381e-05 ...,17,5_2,[4.06765238e-06 1.31177262e-05 1.34945348e-05 ...


claim  DATASET          
0_0    cards                19867
1_1    cards                  421
       generated-chatgpt      400
1_2    cards                  184
       generated-chatgpt      400
1_3    cards                  284
       generated-chatgpt      400
1_4    cards                  605
       generated-chatgpt      400
1_6    cards                  236
       generated-chatgpt      400
1_7    cards                  538
       generated-chatgpt      400
2_1    cards                 1000
       generated-chatgpt      400
2_3    cards                  425
       generated-chatgpt      400
3_1    cards                  256
       generated-chatgpt      400
3_2    cards                  424
       generated-chatgpt      400
3_3    cards                  405
       generated-chatgpt      400
4_1    cards                  428
       generated-chatgpt      400
4_2    cards                  245
       generated-chatgpt      400
4_4    cards                  311
       generated-chatgp