# Data Enrichment

In this notebook we add additional features to the dataset using pretrained NLP models. Specifically, we extract the sentiment of each quote and classify each quote into one of 17 classes of climate change misconceptions.

Contents:
- [Sentiment Analysis](#sentiment)
- [Claims Misinformation Prediction using CARDS](#cards)

Import the needed libraries

In [1]:
import pandas as pd
from transformers import pipeline

Load the quotes we extracted using ClimaTextBERT

In [98]:
x = pd.read_json('climate_change_quotes_small.json')

<a id='sentiment'></a>
# Sentiment Analysis
For the sentiment analysis task we rely on a pretrained DistilBERT model fine-tuned for binary sentiment classification on the SST-2 dataset. The model's reported accuracy on that benchmark is in the low 90s (percent).

In [161]:
classifier = pipeline('sentiment-analysis', device=0)

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


Running the model

In [None]:
def classify(q):
    '''
        Convert the pipeline output (a label and a confidence score)
        into a single value between -1 and 1, where negative values
        correspond to negative sentiment and positive values to
        positive sentiment.
    '''
    result = classifier(q)[0]
    # The label is either 'POSITIVE' or 'NEGATIVE'; use it as the sign.
    sign = 1.0 if result['label'] == 'POSITIVE' else -1.0
    return sign * result['score']
    
scores = x.quotation.apply(classify)
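The conversion from a label and score to a signed value can be checked without loading the model. The following is a minimal sketch that assumes the pipeline returns dictionaries with `label` and `score` keys, as in the cell above; the helper name `signed_score` and the example values are hypothetical.

```python
def signed_score(result):
    """Map a {'label', 'score'} dict to a value in [-1, 1]."""
    sign = 1.0 if result['label'] == 'POSITIVE' else -1.0
    return sign * result['score']

# Made-up pipeline outputs, for illustration only.
examples = [
    {'label': 'POSITIVE', 'score': 0.98},
    {'label': 'NEGATIVE', 'score': 0.75},
]
signed = [signed_score(r) for r in examples]
print(signed)
```

A confident negative prediction ends up close to -1 and a confident positive one close to 1. Since the score is the confidence of the predicted class, values near 0 should be rare for a binary classifier.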

In [204]:
x['sentimentScores'] = scores

Save the outputs

In [209]:
x.to_json('climate_change_quotes_small_v2.json')

<a id='cards'></a>
## Claims prediction with the RoBERTa classifier from Coan et al. (2021), 'Computer-assisted detection and classification of misinformation about climate change'

The code below was adapted from the `cards_inference.ipynb` notebook supplied with the above-mentioned article.

In [None]:
import pandas as pd
import re
import unicodedata
import time
from simpletransformers.classification import ClassificationModel
from scipy.special import softmax
import torch
# Load device
if torch.cuda.is_available():    
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use GPU {}:'.format(torch.cuda.current_device()), torch.cuda.get_device_name(torch.cuda.current_device()))
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

In [None]:
def remove_between_square_brackets(text):
    """Remove any text enclosed in square brackets, brackets included"""
    return re.sub(r'\[[^]]*\]', '', text)
def remove_non_ascii(text):
    """Remove non-ASCII characters from a string"""
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
def strip_underscores(text):
    return re.sub(r'_+', ' ', text)
def remove_multiple_spaces(text):
    return re.sub(r'\s{2,}', ' ', text)

def denoise_text(text):
    text = remove_between_square_brackets(text)
    text = remove_non_ascii(text)
    text = strip_underscores(text)
    text = remove_multiple_spaces(text)
    return text.strip()
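To see what the denoising steps do to a quote, here is a minimal self-contained sketch that applies the same four steps in the same order to a made-up input. The function name `denoise_example` and the sample string are hypothetical; the regular expressions mirror the ones defined above.

```python
import re
import unicodedata

def denoise_example(text):
    # 1. Drop bracketed annotations such as "[laughter]".
    text = re.sub(r'\[[^]]*\]', '', text)
    # 2. Decompose accented characters and drop anything non-ASCII.
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    # 3. Replace runs of underscores with a single space.
    text = re.sub(r'_+', ' ', text)
    # 4. Collapse repeated whitespace and trim the ends.
    text = re.sub(r'\s{2,}', ' ', text)
    return text.strip()

sample = "Caf\u00e9 owners [laughter] say__it is   warming"
cleaned = denoise_example(sample)
print(cleaned)  # Cafe owners say it is warming
```

Note that characters without an ASCII decomposition, such as curly quotes, are removed entirely rather than replaced.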

Load the quotes to be extended with the CARDS output

In [None]:
# Load the text data
data = pd.read_json('climate_change_quotes_small_v2.json')
print('{} quotes were loaded. Here are the first few rows of the data:'.format(len(data)))
data.head()

Denoise the quotes

In [None]:
data['quotation_denoised'] = data['quotation'].astype(str).apply(denoise_text)

Load model

In [None]:
%%time

# Define the model 
architecture = 'roberta'
model_name = 'CARDS_RoBERTa_Classifier'

# Load the classifier; use the GPU only if one is actually available
model = ClassificationModel(architecture, model_name, use_cuda=torch.cuda.is_available())

Run model

In [None]:
%%time
predictions, raw_outputs = model.predict(list(data.quotation_denoised))
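The `raw_outputs` returned next to the predictions are unnormalised scores, one row per quote and one column per class. If a confidence value is wanted alongside each predicted label, the rows can be passed through a softmax. The sketch below uses plain Python on made-up numbers so that it runs without the model; with the imports above, `softmax(raw_outputs, axis=1)` from `scipy.special` should give the same result, assuming `raw_outputs` is a 2-D array. The names `softmax_row` and `example_raw_outputs` are hypothetical.

```python
import math

def softmax_row(logits):
    """Turn one row of raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up raw outputs for two quotes and three classes.
example_raw_outputs = [[2.0, 0.0, -1.0], [-0.5, 1.5, 0.5]]

probabilities = [softmax_row(row) for row in example_raw_outputs]
top_class = [row.index(max(row)) for row in probabilities]
top_prob = [max(row) for row in probabilities]
print(top_class, top_prob)
```

The index of the largest probability matches the predicted class, and the probability itself could be stored in an additional column if low-confidence predictions need to be filtered later.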

In [None]:
data['cardsPredLabel'] = predictions
data

In [None]:
data = data.drop(columns=['quotation_denoised'])

Save the outputs

In [None]:
data.to_json('climate_change_quotes_small_v3.json')