### DANGEROUS TERRITORY:
This notebook can take up a lot of disk space and needs substantial processing power (ideally a CUDA-capable GPU).

# Reddit Climate Change - Modeling Sentiment & Emotion
Supervision: Prof. Dr. Jan Fabian Ehmke

Group members: Britz Luis, Huber Anja, Krause Felix Elias, Preda Yvonne-Nadine

Time: Summer term 2023 

Data: https://www.kaggle.com/datasets/pavellexyr/the-reddit-climate-change-dataset

In [2]:
from transformers import pipeline # Protobuf version <4 (e.g. 3.20.3) might be needed!
import pandas as pd
import numpy as np
import os
import torch

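if torch.cuda.is_available():
    # Query the device name only after confirming CUDA is available; otherwise this call fails on CPU-only machines
    device = torch.cuda.current_device()
    print(f"CUDA device found: {torch.cuda.get_device_name(device)}")
    print("### \n WARNING: YOU WILL TRAIN ON DETECTED GPU \n###")
else:
    device = -1  # -1 tells the transformers pipeline to run on the CPU
    print("No CUDA device found, running on CPU")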

CUDA device found: NVIDIA GeForce GTX 1060 6GB
### 
###


## Load data

In [8]:
# Load posts
df = pd.read_csv("data/preprocessed_posts.csv", header=0, nrows=100)
df.head(3)

Unnamed: 0,type,id,subreddit.id,subreddit.name,subreddit.nsfw,created_utc,permalink,domain,url,selftext,title,score,title_clean,selftext_clean,language,created_date,created_day,created_month,created_year,created_time
0,post,x2slxy,2w844,nostupidquestions,False,1661990182,https://old.reddit.com/r/NoStupidQuestions/com...,self.nostupidquestions,,Ok so I was having a conversation with my neig...,Sharks and climate change,2,Sharks and,Ok so I was having a conversation with my neig...,en,2022-08-31,31,8,2022,23:56:22
1,post,x2pkij,2wnw4,stonerthoughts,False,1661982208,https://old.reddit.com/r/StonerThoughts/commen...,self.stonerthoughts,,It's actually the planet moving closer and clo...,what if instead of climate change...,3,what if instead of ...,It's actually the planet moving closer and clo...,en,2022-08-31,31,8,2022,21:43:28
2,post,x2mtg7,2r3rn,anarchocapitalism,False,1661975381,https://old.reddit.com/r/anarchocapitalism/com...,self.anarchocapitalism,,"Duh. Yes, temperatures go up and down. How i...",Climate Change Is REAL,0,Is REAL,"Duh. Yes, temperatures go up and down. How i...",en,2022-08-31,31,8,2022,19:49:41


In [9]:
# Load comments
df = pd.read_csv("data/preprocessed_comments.gzip", compression="gzip", header=0, nrows=100) # FIXME For now only small sample!
# TODO Maybe filter e.g. for year/range
df.head(3)

Unnamed: 0,id,subreddit.name,subreddit.nsfw,created_utc,permalink,sentiment,score,body_clean,created_date,created_day,created_month,created_year,created_time
0,imlddn9,news,False,1661990368,https://old.reddit.com/r/news/comments/x2cszk/...,0.5719,2.0,Yeah but what the above commenter is saying is...,2022-08-31,31.0,8.0,2022.0,23:59:28
1,imldbeh,ohio,False,1661990340,https://old.reddit.com/r/Ohio/comments/x2awnp/...,-0.9877,2.0,Any comparison of efficiency between solar and...,2022-08-31,31.0,8.0,2022.0,23:59:00
2,imldado,newzealand,False,1661990327,https://old.reddit.com/r/newzealand/comments/x...,-0.1143,1.0,I'm honestly waiting for and the impacts of ...,2022-08-31,31.0,8.0,2022.0,23:58:47


In [110]:
# Extract sample text
df.body.iloc[10]

'https://www.google.com/amp/s/www.foxnews.com/opinion/climate-change-congress-must-act-cleaner-healthier-world.amp\n\nThat sure is some far left propaganda, fox news...'

## HF Transformers Models

Be aware: the models are rather large, so the first run may take a while because they have to be downloaded (~500MB; they are cached in a folder like "C:\Users\Felix\.cache\huggingface\hub").
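
To check where the downloads end up and how much space they take, something like the following may help. It is only a sketch: it assumes the default Hugging Face cache layout (`~/.cache/huggingface/hub`, optionally redirected through the `HF_HUB_CACHE` or `HF_HOME` environment variables) instead of asking the library itself, and the helper names `hf_cache_dir` and `dir_size_mb` are made up for this example.

```python
import os
from pathlib import Path

def hf_cache_dir(env=os.environ):
    """Guess the hub cache folder (assumed defaults, not read from the library)."""
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"])
    hf_home = env.get("HF_HOME") or Path.home() / ".cache" / "huggingface"
    return Path(hf_home) / "hub"

def dir_size_mb(path):
    """Sum the size of all regular files below path, in megabytes."""
    path = Path(path)
    if not path.is_dir():
        return 0.0
    # Skip symlinks so files linked from several places are not counted twice
    files = (f for f in path.rglob("*") if f.is_file() and not f.is_symlink())
    return sum(f.stat().st_size for f in files) / 1e6

cache = hf_cache_dir()
print(f"{cache}: {dir_size_mb(cache):.0f} MB")
```

If the folder does not exist yet (nothing downloaded so far), the reported size is simply 0 MB.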

All HF sentiment models: https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads&search=sentiment

All HF emotion detection models: https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads&search=emotion

(There are also models for irony, stances on climate/feminism/activism, etc.)

#### Sentiment Model

https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment

ENCODING: 0 -> Negative; 1 -> Neutral; 2 -> Positive
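
The encoding matters because the pipeline may report generic label names instead of words. A minimal sketch of a translation step, assuming the raw labels look like `LABEL_0`, `LABEL_1`, `LABEL_2` (check one real prediction first); `SENTIMENT_NAMES` and `readable_sentiment` are hypothetical names, and the input below is a hand-written example, not model output.

```python
# Assumed raw label names for this checkpoint; verify against an actual prediction
SENTIMENT_NAMES = {"LABEL_0": "negative", "LABEL_1": "neutral", "LABEL_2": "positive"}

def readable_sentiment(prediction):
    """Return a copy of one prediction dict with a human-readable label."""
    label = prediction["label"]
    # Unknown labels are passed through unchanged
    return {**prediction, "label": SENTIMENT_NAMES.get(label, label)}

# Hand-written example input (not a real model result)
print(readable_sentiment({"label": "LABEL_2", "score": 0.97}))
```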

In [None]:
model = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment", device=device)

#### Climate model

https://huggingface.co/cardiffnlp/twitter-roberta-base-stance-climate

Paper: https://aclanthology.org/S16-1003.pdf

In [None]:
model = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-stance-climate", device=device)

#### Large emotion model (28 states detected)

Model: https://huggingface.co/arpanghoshal/EmoRoBERTa

Labels detected: 'remorse', 'disappointment', 'sadness', 'gratitude', 'realization', 'disapproval', 'neutral', 'approval', 'embarrassment', 'caring', 'curiosity', 'confusion', 'annoyance', 'joy', 'optimism', 'relief', 'excitement', 'admiration', 'love', 'disgust', 'grief', 'amusement', 'anger', 'surprise', 'pride', 'nervousness', 'fear', 'desire'

In [40]:
model = pipeline('sentiment-analysis', model='arpanghoshal/EmoRoBERTa', top_k=3, device=device) # top_k=None lists all labels

All model checkpoint layers were used when initializing TFRobertaForSequenceClassification.

All the layers of TFRobertaForSequenceClassification were initialized from the model checkpoint at arpanghoshal/EmoRoBERTa.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForSequenceClassification for predictions without further training.


#### Small emotion model (7 states)

Model: https://huggingface.co/j-hartmann/emotion-english-distilroberta-base

Labels detected: 'surprise', 'neutral', 'fear', 'anger', 'joy', 'disgust', 'sadness'

In [14]:
model = pipeline("text-classification", model="j-hartmann/emotion-english-distilroberta-base", top_k=1, device=device) # top_k=None lists all labels

#### Test model

In [33]:
sample_prediction = model("Wow! I really did not know that the sky was blue!")
sample_prediction



[[{'label': 'surprise', 'score': 0.9810950756072998}]]

In [None]:
# Find labels included (the result for a single text is a nested list, hence the [0])
#labels = [i["label"] for i in sample_prediction[0]]
#print(labels)

## Apply model to dataframe

In [36]:
def reduce_text(text, max_length=np.inf):
    # Cut the text to at most max_length characters (characters, not model tokens)
    if len(text) > max_length:
        text = text[0:max_length]
    return text

In [37]:
# Reduce texts if necessary
#df["body"] = df["body"].apply(reduce_text, args=(512,)) # e.g. for small sentiment model
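
Cutting at 512 characters is only a rough stand-in for the model's input limit, which is counted in tokens. As an alternative sketch, the text-classification pipeline can forward `truncation=True` and `max_length` to its tokenizer at call time. `classify_truncated` is a hypothetical wrapper, and 512 tokens is assumed to be the limit of the RoBERTa-based checkpoints used here.

```python
def classify_truncated(classifier, texts, max_tokens=512):
    """Run a text-classification pipeline and let its tokenizer cut long inputs.

    classifier is expected to be a transformers pipeline such as `model` above.
    """
    return classifier(list(texts), truncation=True, max_length=max_tokens)

# Usage with the pipeline from this notebook (not executed here):
# df["label"] = classify_truncated(model, df["body"])
```

Compared with character-based cutting, this keeps as much of each text as the model can actually read.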

In [39]:
# Apply to dataframe
df["label"] = model(list(df["body"])) # With GPU around 3x faster, still ~0.1 sec/text
# With top_k=1 every prediction is a one-element list, so take element 0
#df["top_label"] = [i[0]["label"] for i in df["label"]]
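
Because each prediction comes back as a list of `{'label', 'score'}` dicts, it can be convenient to split the best entry into separate columns. A minimal sketch, with `top_prediction` as a hypothetical helper and hand-written example predictions instead of real model output:

```python
def top_prediction(prediction):
    """Return (label, score) of the highest-scoring entry of one pipeline result."""
    # With top_k set, a result is a list of dicts per text; otherwise it is a single dict
    entries = prediction if isinstance(prediction, list) else [prediction]
    best = max(entries, key=lambda entry: entry["score"])
    return best["label"], best["score"]

# Hand-written examples in the shape the pipeline returns
example = [
    [{"label": "fear", "score": 0.36}],
    [{"label": "neutral", "score": 0.88}, {"label": "anger", "score": 0.05}],
]
print([top_prediction(p) for p in example])

# Applied to the dataframe from this notebook (not executed here):
# df["top_label"], df["top_score"] = zip(*df["label"].map(top_prediction))
```

Taking the maximum instead of the first element also works when `top_k` is larger than 1 or `None`.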



In [40]:
df

Unnamed: 0,type,id,subreddit.id,subreddit.name,subreddit.nsfw,created_utc,permalink,body,sentiment,score,label,top_label
218,comment,imkym8u,2qk5q,climateskeptics,False,1661983952,https://old.reddit.com/r/climateskeptics/comme...,Although the film crew swore up and down they ...,0.4841,5,"[{'label': 'fear', 'score': 0.35692259669303894}]",fear
441,comment,imkkeo7,2qk5q,climateskeptics,False,1661978410,https://old.reddit.com/r/climateskeptics/comme...,That is entirely what the other papers say. I...,0.8432,1,"[{'label': 'neutral', 'score': 0.8820799589157...",neutral
1025,comment,imjj5o9,2qk5q,climateskeptics,False,1661964273,https://old.reddit.com/r/climateskeptics/comme...,Climate change will always exist and always ha...,0.0000,1,"[{'label': 'neutral', 'score': 0.8541768789291...",neutral
1095,comment,imjfcxh,2qk5q,climateskeptics,False,1661962830,https://old.reddit.com/r/climateskeptics/comme...,The biggest problem is that the main water rig...,-0.4878,1,"[{'label': 'neutral', 'score': 0.56916344165802}]",neutral
1292,comment,imj2c6k,2qk5q,climateskeptics,False,1661957784,https://old.reddit.com/r/climateskeptics/comme...,Are you suggesting some conspiracy by NOAA to ...,0.8990,0,"[{'label': 'neutral', 'score': 0.8405239582061...",neutral
...,...,...,...,...,...,...,...,...,...,...,...,...
11426,comment,im2fcw7,2qk5q,climateskeptics,False,1661644739,https://old.reddit.com/r/climateskeptics/comme...,Climate change is like Long COVID. They blame ...,0.0258,1,"[{'label': 'anger', 'score': 0.4453723728656769}]",anger
11444,comment,im2dlf0,2qk5q,climateskeptics,False,1661643881,https://old.reddit.com/r/climateskeptics/comme...,"No, but climate change policies definitely did...",-0.6956,2,"[{'label': 'anger', 'score': 0.8326584696769714}]",anger
12336,comment,im06d9q,2qk5q,climateskeptics,False,1661609774,https://old.reddit.com/r/climateskeptics/comme...,Fair but even smaller ponds and streams are be...,0.1655,1,"[{'label': 'neutral', 'score': 0.6826416850090...",neutral
12340,comment,im063s7,2qk5q,climateskeptics,False,1661609657,https://old.reddit.com/r/climateskeptics/comme...,Are you so lacking in any integrity whatsoever...,0.9068,-1,"[{'label': 'disgust', 'score': 0.5141499638557...",disgust


## Save results

In [None]:
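comment_save_path = "data/comments_results.csv"

# Only write if the file does not exist yet, so earlier results are never overwritten
if not os.path.isfile(comment_save_path):
    df.to_csv(comment_save_path)
    print("File saved!")
else:
    print("Warning: file already exists, nothing was saved")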
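
One thing to keep in mind when loading the file again: CSV has no list type, so the `label` column comes back as the string representation of each list. A sketch of how such cells could be parsed, assuming they look like the `label` values in the table above; `parse_label_cell` is a hypothetical helper.

```python
import ast

def parse_label_cell(cell):
    """Turn a stringified prediction like "[{'label': 'fear', ...}]" back into a list."""
    if isinstance(cell, str):
        return ast.literal_eval(cell)  # only evaluates Python literals, no arbitrary code
    return cell  # already a list (or a missing value), leave untouched

cell = "[{'label': 'fear', 'score': 0.35692259669303894}]"
print(parse_label_cell(cell)[0]["label"])

# With pandas (not executed here):
# results = pd.read_csv("data/comments_results.csv", index_col=0,
#                       converters={"label": parse_label_cell})
```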