### DANGEROUS TERRITORY:
This notebook can use a lot of disk space and needs considerable processing power (ideally a CUDA-capable GPU).

# Reddit Climate Change - Modeling Sentiment & Emotion
Supervision: Prof. Dr. Jan Fabian Ehmke

Group members: Britz Luis, Huber Anja, Krause Felix Elias, Preda Yvonne-Nadine

Time: Summer term 2023 

Data: https://www.kaggle.com/datasets/pavellexyr/the-reddit-climate-change-dataset

In [6]:
from transformers import pipeline # Protobuf version <4 (e.g. 3.20.3) might be needed!

import pandas as pd
import numpy as np
import os
import torch
import pickle
from tqdm import tqdm
import time
from datetime import datetime
import sys
import multiprocessing.dummy as mp 
import logging

In [7]:
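if torch.cuda.is_available():
    # Only query the device name once a GPU is confirmed, otherwise this call fails on CPU-only machines
    device = torch.cuda.current_device()
    batch_size = 8
    print(f"CUDA device found: {torch.cuda.get_device_name(device)}")
    print("### \n WARNING: YOU WILL TRAIN ON DETECTED GPU \n###")
else:
    # No GPU available: fall back to CPU (device=-1 for HF pipelines)
    batch_size = 1
    device = -1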

CUDA device found: NVIDIA GeForce GTX 1060 6GB
### 
###


## Load data

In [8]:
# Load comments
df = pd.read_csv("data/comments_final.csv", header=0, index_col=0)

# Sanity check
if not df[df.isna().any(axis=1)].empty:
    raise Exception("Sanity check failed! Empty rows detected!")

df.head(3)

Unnamed: 0,id,subreddit.name,subreddit.nsfw,created_utc,permalink,sentiment,score,created_date,created_day,created_month,created_year,created_time,topic_number,topic_name,topic_most_used_words,body_clean_full
0,c0i14fb,askreddit,False,1262306000.0,https://old.reddit.com/r/AskReddit/comments/ak...,0.7998,1.0,2010-01-01,1,1,2010,00:34:07,-1,-1_climate_people_global_warming,climate - people - global - warming - just - s...,"should be ""San Diego Weatherman has an opinion..."
1,c0i195b,worldnews,False,1262313000.0,https://old.reddit.com/r/worldnews/comments/ak...,0.4754,0.0,2010-01-01,1,1,2010,02:30:18,0,0_people_just_climate_global,people - just - climate - global - don - like ...,Both Iggy and Harper would have marched us int...
2,c0i1a0w,environment,False,1262314000.0,https://old.reddit.com/r/environment/comments/...,0.0242,1.0,2010-01-01,1,1,2010,02:54:40,0,0_people_just_climate_global,people - just - climate - global - don - like ...,"A man who though a moderate Tory , has a mixed..."


In [9]:
df.shape

(1041570, 16)

In [10]:
# Extract sample text
df.body_clean_full.iloc[21]



In [11]:
# Get samples per year
years = df.created_year.unique()

for year in years:
    print(year, df[df.created_year == year].shape[0])

2010 15986
2011 26432
2012 40264
2013 63797
2014 99626
2015 99550
2016 99589
2017 99544
2018 99491
2019 99419
2020 99431
2021 99332
2022 99109


## HF Transformers Models

Be aware: the models are rather large, so the first run may take a while to download them (~500 MB per model; cached in a location such as "C:\Users\Felix\.cache\huggingface\hub").

All HF sentiment models: https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads&search=sentiment

All HF emotion detection models: https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads&search=emotion

### Sentiment Model

https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment

ENCODING: 0 -> Negative; 1 -> Neutral; 2 -> Positive
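
A minimal sketch of how this encoding could be applied to the pipeline output, assuming the model returns raw labels of the form `LABEL_0`, `LABEL_1`, `LABEL_2` (the form used further below when filtering for `LABEL_0`). `SENTIMENT_NAMES` and `decode_sentiment` are hypothetical helpers and not part of this notebook; the score in the example is made up.

```python
# Hypothetical helper: map raw pipeline labels to readable sentiment names.
SENTIMENT_NAMES = {"LABEL_0": "Negative", "LABEL_1": "Neutral", "LABEL_2": "Positive"}

def decode_sentiment(prediction):
    """Return the readable name for one prediction dict, e.g. {"label": "LABEL_2", "score": 0.9}."""
    # Unknown labels are passed through unchanged instead of raising an error.
    return SENTIMENT_NAMES.get(prediction["label"], prediction["label"])

print(decode_sentiment({"label": "LABEL_0", "score": 0.87}))
```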

GPU: ~0.0105 sec/text

In [None]:
model = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment", device=device, batch_size=batch_size)
model_name = "sentiment"

### Climate stance model

https://huggingface.co/cardiffnlp/twitter-roberta-base-stance-climate

Paper: https://aclanthology.org/S16-1003.pdf

"Climate Change is a Real Concern" -> favor/against/none

GPU: ~0.014 sec/text

In [4]:
model = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-stance-climate", device=device, batch_size=batch_size)
model_name = "climate_stance"

### Climate sentiment model

https://huggingface.co/climatebert/distilroberta-base-climate-sentiment 

-> neutral, opportunity, risk

In [None]:
model = pipeline("sentiment-analysis", model="climatebert/distilroberta-base-climate-sentiment", device=device, batch_size=batch_size)
model_name = "climate_sentiment"

### Large emotion model (28 states detected)

Model: https://huggingface.co/arpanghoshal/EmoRoBERTa

Labels detected: 'remorse', 'disappointment', 'sadness', 'gratitude', 'realization', 'disapproval', 'neutral', 'approval', 'embarrassment', 'caring', 'curiosity', 'confusion', 'annoyance', 'joy', 'optimism', 'relief', 'excitement', 'admiration', 'love', 'disgust', 'grief', 'amusement', 'anger', 'surprise', 'pride', 'nervousness', 'fear', 'desire'

GPU: ~0.11 sec/text

In [4]:
model = pipeline('sentiment-analysis', model='arpanghoshal/EmoRoBERTa', top_k=1, device=device, batch_size=batch_size) # top_k=None lists all labels
model_name = "emotion_large"

# Only apply for non outlier topics!
df = df[df.topic_number != -1]
df.shape

All model checkpoint layers were used when initializing TFRobertaForSequenceClassification.

All the layers of TFRobertaForSequenceClassification were initialized from the model checkpoint at arpanghoshal/EmoRoBERTa.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForSequenceClassification for predictions without further training.


(486402, 16)
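
As a rough back-of-the-envelope check, the per-text GPU timing quoted above can be turned into a total runtime estimate for the 486,402 non-outlier comments. This is only a sketch: actual throughput depends on batch size, text length and hardware, and `estimate_hours` is a hypothetical helper.

```python
# Rough runtime estimate from the per-text GPU timing noted above (an approximation).
SECONDS_PER_TEXT = 0.11   # timing quoted for the large emotion model
N_TEXTS = 486_402         # non-outlier comments, see df.shape above

def estimate_hours(n_texts, seconds_per_text):
    # total seconds divided by 3600 seconds per hour
    return n_texts * seconds_per_text / 3600

print(f"Estimated runtime: {estimate_hours(N_TEXTS, SECONDS_PER_TEXT):.1f} h")
```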

### Small emotion model (7 states)

Model: https://huggingface.co/j-hartmann/emotion-english-distilroberta-base

Labels detected: 'surprise', 'neutral', 'fear', 'anger', 'joy', 'disgust', 'sadness'

GPU: ~0.011 sec/text

In [4]:
model = pipeline("text-classification", model="j-hartmann/emotion-english-distilroberta-base", top_k=1, device=device, batch_size=batch_size) # top_k=None lists all labels
model_name = "emotion_small"

## Test model functionality

In [None]:
sample_prediction = model("Climate change is a big scam! Why is everyone so upset?!?!")
sample_prediction

In [None]:
# Find labels included
labels = [i["label"] for i in sample_prediction]
print(labels)

## Apply model manually

In [None]:
# Create backup of df
#df_backup = df.copy()

In [5]:
# Reduce text to character limit if there is one
def reduce_text(text, max_length=np.inf):
    if len(text) > max_length:
        text = text[0:max_length]
    return text

In [6]:
# Reduce texts if necessary
df["body_clean_full"] = df["body_clean_full"].apply(reduce_text, args=(512, )) # e.g. for small sentiment model
print("Comment text reduced to 512 characters!")

Comment text reduced to 512 characters!


In [None]:
# Only use subset
df = df.sample(frac=1, random_state=42) # shuffle (sample() returns a new dataframe, so it must be assigned)
df = df.iloc[:1000, :]

In [None]:
# CAUTION: THIS WILL RUN INFERENCE AND CAN THUS TAKE SOME TIME
#df["label"] = model(list(df["body_clean_full"]))
results = df.body_clean_full.iloc[:1000].map(model)

In [None]:
# Recommended inference on GPU
# https://huggingface.co/docs/transformers/pipeline_tutorial#using-pipelines-on-a-dataset
def load_iterator():
    for i in df.body_clean_full.iloc[:1000]:
        # DEBUGGING
        # i = i.replace("\"", " ")
        # i = i.replace("'", " ")
        # print(i)
        # print("")
        yield i

results = []

for out in model(load_iterator()):
    results.append(out)

In [None]:
# Extract (top) label
results = [i[0]["label"] for i in results]

In [None]:
# Print all top labels found
np.unique(results)

In [None]:
# Get sample texts for certain label
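# Use a plain array as mask: a pd.Series would be aligned on its own 0..n index, not on the index of df
print(df.iloc[:1000,:][np.array(results) == "LABEL_0"].body_clean_full.iloc[8])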

### DEBUGGING FOLLOWS

In [86]:
# Filter data for faulty months and years
df = df[
    ((df.created_year == 2015) & (df.created_month == 11))
    | ((df.created_year == 2021) & (df.created_month == 4))
    | ((df.created_year == 2021) & (df.created_month == 10))
]

In [None]:
import re
df.body_clean_full = df.body_clean_full.apply(lambda x: re.sub(r'[^a-zA-Z0-9\s\.,?!;:\'"()\[\]\{\}\-]', "", x))

In [89]:
df.body_clean_full.values[5926]

"The only people who don't think that climate change is the biggest threat, are the people who don't understand climate change.  Between ISIS and climate change, only one of those issues can potentially destroy the habitability of the entire planet."

In [None]:
# Reproducing error
df[(df.created_year == 2015) & (df.created_month == 11)].body_clean_full.iloc[5926:].map(model)

In [None]:
df.body_clean_full.iloc[:1000].iloc[0] # positional access; [0] would look up the index label 0

## Run prediction per year

In [None]:
# Only use subset
# df = df.sample(frac=1, random_state=42) # shuffle
# df = df.iloc[:1000, :]

In [6]:
# Reduce texts if necessary
df["body_clean_full"] = df["body_clean_full"].apply(reduce_text, args=(512, )) # e.g. for small sentiment model
print("Comment text reduced to 512 characters!")

Comment text reduced to 512 characters!


In [7]:
# Sanity check
df[df.body_clean_full.apply(len) > 512]

Unnamed: 0,id,subreddit.name,subreddit.nsfw,created_utc,permalink,sentiment,score,created_date,created_day,created_month,created_year,created_time,topic_number,topic_name,topic_most_used_words,body_clean_full


In [None]:
df.created_year.unique()

In [7]:
# Clean data even further! (Necessary for years 2015 and 2021)
import re
df.body_clean_full = df.body_clean_full.apply(lambda x: re.sub(r'[^a-zA-Z0-9\s\.,?!;:\'"()\[\]\{\}\-]', "", x))
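
A minimal sketch of what this cleaning step does to a single string, using the same pattern as above. `NOT_ALLOWED` and `strip_special_chars` are hypothetical names and the sample strings are made up. Note that accented letters are removed as well, so some words get altered.

```python
import re

# Same pattern as in the cell above: matches every character that is NOT whitelisted.
NOT_ALLOWED = r'[^a-zA-Z0-9\s\.,?!;:\'"()\[\]\{\}\-]'

def strip_special_chars(text):
    # Delete all matched characters; whitespace around them is kept as-is.
    return re.sub(NOT_ALLOWED, "", text)

print(strip_special_chars("café ☕ costs 3€!"))
print(strip_special_chars("It's (really) 2°C-ish."))
```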

In [8]:
# Start logging
# https://docs.python.org/3/howto/logging.html
folder = f"data/{model_name}_labels/"
os.makedirs(folder, exist_ok=True) # also creates "data/" if it is missing

logging.basicConfig(filename=folder + f"{model_name}_logs.log", level=logging.DEBUG)
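
Python's `logging` module can add timestamps itself through a format string (`logging.Formatter`, or the `format` argument of `logging.basicConfig`), so log messages would not need a manual `str(datetime.now())` prefix. A minimal sketch, using a hypothetical logger name and an in-memory stream instead of the log file of this notebook:

```python
import io
import logging

stream = io.StringIO()  # stands in for the log file in this sketch
handler = logging.StreamHandler(stream)
# %(asctime)s inserts the timestamp automatically for every record.
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))

logger = logging.getLogger("inference_demo")  # hypothetical logger name
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)
logger.propagate = False  # keep the demo output out of the root logger

logger.info("Running inference for year 2021")
print(stream.getvalue())
```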

In [9]:
def save_file(save_df, idx, to_csv=False):
    # Pickle the labels of one run (year or year-month); fall back to csv if pickling fails
    # NOTE: to_csv is currently unused
    idx = str(idx)
    folder = f"data/{model_name}_labels/"
    path = folder + f"{model_name}_{idx}"

    # Never overwrite existing results: append a short time-based suffix instead
    if os.path.exists(path + ".pkl"): 
        path = path + "_" + str(datetime.now())[-5:]
        logging.warning(str(datetime.now()) + f" Warning: path already existed, will save as: {path}")
    os.makedirs(folder, exist_ok=True)

    try:
        with open(path + ".pkl", "wb") as f:
            pickle.dump(save_df, f)
        logging.info(str(datetime.now()) + f" Year {idx} saved in {path}")
    except Exception:
        # Save to csv if failed
        try:
            logging.info(str(datetime.now()) + " Pickling failed, will try to save as csv")
            save_df.to_csv(path + "_BACKUP.csv", index=False)
        except Exception:
            logging.error(str(datetime.now()) + f" ERROR: Failed to save year {idx}!")
            logging.info(str(sys.exc_info()))
            return False

    return True
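
The collision handling above appends the last five characters of `str(datetime.now())`, which are normally microsecond digits. A possible alternative, sketched below with a hypothetical helper `unique_path`, is to build the suffix with `strftime` so that it is sortable and never contains characters such as `:`. The file name and date in the example are made up.

```python
import os
import tempfile
from datetime import datetime

def unique_path(path, now=None):
    """Return path unchanged if path + '.pkl' is free, else append a timestamp suffix."""
    if not os.path.exists(path + ".pkl"):
        return path
    now = now or datetime.now()
    return path + "_" + now.strftime("%Y%m%d-%H%M%S-%f")

with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, "emotion_small_2021")
    first = unique_path(base)
    open(first + ".pkl", "wb").close()  # simulate an existing label file
    second = unique_path(base, now=datetime(2023, 5, 15, 19, 5, 51))
    print(os.path.basename(second))
```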

In [None]:
# Sanity check
#save_file(df, 2004)

In [10]:
max_length = 512

def executer(df_sample, year):
    idx = 0 # defined before the try block so that the except branch can always log it
    try:
        logging.info(str(datetime.now()) + f" Running inference for year {year}, with {df_sample.shape[0]} samples")

        # Run inference as recommended for GPU
        # https://huggingface.co/docs/transformers/pipeline_tutorial#using-pipelines-on-a-dataset
        def load_iterator():
            for i in df_sample.body_clean_full: # FIXME DEBUGGING .iloc[83814:]
                yield i[0:max_length]

        labels = []
        for idx, out in enumerate(model(load_iterator())): # Run inference and collect labels
            labels.append(out)

        #labels = model(list(df_sample["body_clean_full"])) # OLD WAY TO RUN INFERENCE

        # Emotion models are created with top_k=1 and return a list of dicts per text
        if model_name in ["emotion_small", "emotion_large"]:
            labels = [i[0]["label"] for i in labels]
        else: 
            labels = [i["label"] for i in labels]

        df_save = pd.DataFrame({"id": df_sample.id, model_name: labels})
        save_file(df_save, year)

    except Exception:
        logging.error(str(datetime.now()) + f" ERROR with year {year}: \n" + str(sys.exc_info()))
        logging.info(str(datetime.now()) + f" Last idx checked: {idx}, with id: {df_sample.iloc[idx,:].id}")
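
The branch on `model_name` above reflects two output shapes: with `top_k=1` each result is a list holding one dict, otherwise it is a single dict. A minimal sketch of a shape-based alternative, using a hypothetical helper `extract_label` and made-up example outputs:

```python
def extract_label(out):
    # Results produced with top_k=1 arrive as a one-element list; unwrap it first.
    if isinstance(out, list):
        out = out[0]
    return out["label"]

outputs = [
    {"label": "favor", "score": 0.9},       # shape without top_k (illustrative values)
    [{"label": "neutral", "score": 0.8}],   # shape with top_k=1 (illustrative values)
]
print([extract_label(o) for o in outputs])
```

Checking the shape instead of the model name would keep working if another model is later configured with `top_k=1`.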

In [None]:
# Sanity check
#executer(df[df.created_year == 2021], 2021)

In [11]:
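# RUN INFERENCE ON ALL DATA

years = df.created_year.unique() # all years in the data
years = [2015, 2021] # override: only re-run the problematic years; remove this line to process all years
#years = np.sort(years)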

logging.info(f"Running inference for {model_name} \n")
print(datetime.now(), f"Running inference for {model_name} \n")

for year in tqdm(years):
    df_sample = df[df.created_year == year]
    
    executer(df_sample, year) # optimized way through iterator

logging.info(f"Inference finished for {model_name}")

2023-05-15 19:05:51.975160 Running inference for emotion_large 



100%|██████████| 2/2 [3:01:40<00:00, 5450.26s/it]  


In [None]:
df.body_clean_full[df.created_year == 2015].iloc[83815]

### Re-running problematic years monthwise

In [12]:
# RUN INFERENCE ON ALL DATA - PER YEAR AND MONTH

years = [2021]#, 2015]
months = [4, 10]
#months = df.created_month.unique()
#months[months.sort()]

logging.info(f"\nSAFETY RUN PER MONTH AND YEAR \n")
logging.info(f"Running inference for {model_name} \n")
print(datetime.now(), f"Running inference for {model_name} \n")

for year in tqdm(years):
    for month in months:
        df_sample = df[(df.created_year == year) & (df.created_month == month)]
    
        executer(df_sample, str(year) + "-" + str(month)) # optimized way through iterator
        #executer2(df_sample, year) # old way

logging.info(f"Inference finished for {model_name}")

2023-05-15 17:09:14.592877 Running inference for emotion_small 



100%|██████████| 1/1 [02:05<00:00, 125.75s/it]


## Checking faulty data months

In [None]:
problem_1 = df[(df.created_year == 2015) & (df.created_month == 11)].body_clean_full.values[5927]
problem_1

In [None]:
problem_2 = df[(df.created_year == 2021) & (df.created_month == 4)].body_clean_full.values[1583]
problem_2

In [None]:
problem_3 = df[(df.created_year == 2021) & (df.created_month == 10)].body_clean_full.values[1511]
problem_3

In [None]:
len(problem_1)

In [None]:
# TODO try to remove all special characters of month

## Load results

In [None]:
# Load labels file
with open(f"data/{model_name}_labels/{model_name}_2021.pkl", "rb") as f:
    labels_df = pickle.load(f)

In [None]:
labels_df

## Join labels to full results

In [51]:
model_name = "emotion_large"

In [52]:
path = f"data/{model_name}_labels/"

files = os.listdir(path)
files = [i for i in files if i.endswith(".pkl")]

first = True
for f in files:
    with open(path + f, "rb") as temp_df:
        labels_df = pickle.load(temp_df)

    if first:
        temp = labels_df
        first = False
    else:
        temp = pd.concat([labels_df, temp])

temp = temp.drop_duplicates() # removes duplicates from already inferred data, necessary for emotion_small!

In [53]:
temp

Unnamed: 0,id,emotion_large
1071585,hqqvzdi,realization
1071586,hqqvzz0,approval
1071587,hqqwdqz,neutral
1071588,hqqwq4u,annoyance
1071591,hqqyf7x,disgust
...,...,...
15979,c1azzhq,neutral
15980,c1azzu8,realization
15981,c1b000n,annoyance
15982,c1b005v,approval


In [5]:
# Sanity check
df[df.id == "hqqvwdx"]

Unnamed: 0,id,subreddit.name,subreddit.nsfw,created_utc,permalink,sentiment,score,created_date,created_day,created_month,created_year,created_time,topic_number,topic_name,topic_most_used_words,body_clean_full,climate_stance,emotion_small,emotion_large
942461,hqqvwdx,whitepeopletwitter,False,1640996000.0,https://old.reddit.com/r/WhitePeopleTwitter/co...,0.0,1.0,2022-01-01,1,1,2022,00:06:49,-1,-1_people_just_like_climate,people - just - like - climate - don - think -...,-clinton/yes-donald-trump-did-call-climate-ch...,favor,surprise,


In [55]:
# Join by index with comments.csv
old_shape = df.shape
print("Size of full dataframe:  ", old_shape)
print("Size of labels dataframe:", temp.shape)

# Join dataframes
df = pd.merge(df, temp, how="left", on="id")

# Verify join
if old_shape[0] != df.shape[0]:
    raise Exception(f"ERROR: lengths mismatch! Error when joining! \n {old_shape[0]} -> {df.shape[0]}")

Size of full dataframe:   (1041570, 18)
Size of labels dataframe: (486402, 2)
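
A left join only keeps the row count stable if each `id` appears at most once in the labels dataframe. `drop_duplicates()` removes fully identical rows, so an `id` that received two different labels would survive it. A minimal sketch, on made-up toy data, of how the `validate` argument of `pd.merge` could catch this case before the row counts are compared:

```python
import pandas as pd

# Toy data (made up): id "b" has two different labels.
comments = pd.DataFrame({"id": ["a", "b", "c"], "body": ["x", "y", "z"]})
labels = pd.DataFrame({"id": ["a", "b", "b"], "emotion_large": ["joy", "fear", "anger"]})

try:
    pd.merge(comments, labels, how="left", on="id", validate="many_to_one")
    merge_ok = True
except pd.errors.MergeError:
    merge_ok = False  # "b" is not unique in the labels dataframe

# Keeping one label per id makes the join safe again.
labels_unique = labels.drop_duplicates(subset="id", keep="first")
merged = pd.merge(comments, labels_unique, how="left", on="id", validate="many_to_one")
print(merge_ok, merged.shape)
```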


In [56]:
df

Unnamed: 0,id,subreddit.name,subreddit.nsfw,created_utc,permalink,sentiment,score,created_date,created_day,created_month,created_year,created_time,topic_number,topic_name,topic_most_used_words,body_clean_full,climate_stance,emotion_small,emotion_large
0,c0i14fb,askreddit,False,1.262306e+09,https://old.reddit.com/r/AskReddit/comments/ak...,0.7998,1.0,2010-01-01,1,1,2010,00:34:07,-1,-1_climate_people_global_warming,climate - people - global - warming - just - s...,"should be ""San Diego Weatherman has an opinion...",favor,surprise,
1,c0i195b,worldnews,False,1.262313e+09,https://old.reddit.com/r/worldnews/comments/ak...,0.4754,0.0,2010-01-01,1,1,2010,02:30:18,0,0_people_just_climate_global,people - just - climate - global - don - like ...,Both Iggy and Harper would have marched us int...,favor,fear,neutral
2,c0i1a0w,environment,False,1.262314e+09,https://old.reddit.com/r/environment/comments/...,0.0242,1.0,2010-01-01,1,1,2010,02:54:40,0,0_people_just_climate_global,people - just - climate - global - don - like ...,"A man who though a moderate Tory , has a mixed...",favor,surprise,approval
3,c0i1hsb,askreddit,False,1.262330e+09,https://old.reddit.com/r/AskReddit/comments/ak...,0.7579,3.0,2010-01-01,1,1,2010,07:05:41,-1,-1_climate_people_global_warming,climate - people - global - warming - just - s...,Changing the oil *filter* every single time yo...,favor,neutral,
4,c0i1pd9,politics,False,1.262349e+09,https://old.reddit.com/r/politics/comments/akc...,-0.9849,32.0,2010-01-01,1,1,2010,12:37:36,-1,-1_climate_people_global_warming,climate - people - global - warming - just - s...,; We have no history - ours goes back only y...,none,disgust,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1041565,imlbfv6,terrifyingasfuck,False,1.661990e+09,https://old.reddit.com/r/TerrifyingAsFuck/comm...,0.3182,-6.0,2022-08-31,31,8,2022,23:45:08,0,0_people_like_just_don,people - like - just - don - think - world - w...,I'm sure it's climate change. Probably has no...,favor,neutral,approval
1041566,imlbh9l,damnthatsinteresting,False,1.661990e+09,https://old.reddit.com/r/Damnthatsinteresting/...,0.4404,5.0,2022-08-31,31,8,2022,23:45:25,-1,-1_people_just_like_climate,people - just - like - climate - don - think -...,You should check out Paul Nicklen's (the guy i...,favor,neutral,
1041567,imlcpab,askreddit,False,1.661990e+09,https://old.reddit.com/r/AskReddit/comments/x2...,0.4690,2.0,2022-08-31,31,8,2022,23:54:25,-1,-1_people_just_like_climate,people - just - like - climate - don - think -...,They need to change laws so it's more worth se...,favor,neutral,
1041568,imlctc0,pastors,False,1.661990e+09,https://old.reddit.com/r/pastors/comments/x2il...,0.9779,2.0,2022-08-31,31,8,2022,23:55:14,4,4_trans_gender_women_men,trans - gender - women - men - gay - sex - peo...,Can i suggest maybe honing in on LGBTQ? It's ...,none,neutral,curiosity


In [58]:
# Save new file
save_path = "data/comments_final_labels.csv"

if not os.path.exists(save_path):
    df.to_csv(save_path)
else:
    print(f"WARNING: File not saved as {save_path} already exists!")