# Reddit Climate Change - Modeling & Evaluation
Supervision: Prof. Dr. Jan Fabian Ehmke

Group members: Britz Luis, Huber Anja, Krause Felix Elias, Preda Yvonne-Nadine

Time: Summer term 2023 

Data: https://www.kaggle.com/datasets/pavellexyr/the-reddit-climate-change-dataset

In [None]:
#  Topic detection

# LDA
# https://towardsdatascience.com/end-to-end-topic-modeling-in-python-latent-dirichlet-allocation-lda-35ce4ed6b3e0
# http://rstudio-pubs-static.s3.amazonaws.com/79360_850b2a69980c4488b1db95987a24867a.html

# BERTopic
# https://maartengr.github.io/BERTopic/index.html


In [1]:
# Preparing environment
#%pip install bertopic
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer, util
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.vectorizers import ClassTfidfTransformer
import pandas as pd



In [25]:
# Load data
clean_comments = pd.read_csv("data/preprocessed_comments.csv")



In [26]:
# Drop rows that contain missing values
clean_comments = clean_comments.dropna(axis=0)

In [29]:
# Re-derive year, month and day as zero-padded strings from created_date,
# because the CSV import reads these columns as numeric values
created = pd.to_datetime(clean_comments["created_date"])  # parse the dates only once
clean_comments["created_year"] = created.dt.strftime('%Y')
clean_comments["created_month"] = created.dt.strftime('%m')
clean_comments["created_day"] = created.dt.strftime('%d')

In [30]:
clean_comments.head()

Unnamed: 0,id,subreddit.name,subreddit.nsfw,created_utc,permalink,sentiment,score,created_date,created_day,created_month,created_year,created_time,body_clean
0,i79uz1c,oddlyterrifying,False,1651658000.0,https://old.reddit.com/r/oddlyterrifying/comme...,-0.5574,3.0,2022-05-04,4,5,2022,09:58:26,Oh shit there's a new one out? Last one k watc...
1,hz51unj,technews,False,1646280000.0,https://old.reddit.com/r/technews/comments/t53...,0.4588,1.0,2022-03-03,3,3,2022,03:54:08,"We’re never going to reopen those wells, its e..."
2,i3ic64d,worldnews,False,1649177000.0,https://old.reddit.com/r/worldnews/comments/tw...,0.6249,1.0,2022-04-05,5,4,2022,16:36:35,Climate Change is the Great Filter.
3,id3tlo2,ontario,False,1655760000.0,https://old.reddit.com/r/ontario/comments/vglj...,0.296,0.0,2022-06-20,20,6,2022,21:16:31,Climate change also means greater crop yields ...
4,iebulu4,news,False,1656603000.0,https://old.reddit.com/r/news/comments/vo98pd/...,-0.6115,12.0,2022-06-30,30,6,2022,15:26:18,The decline into total destruction by climate ...


# Modeling Topic Clusters

In [31]:
# Create a subset for every year, keyed as "comments_<year>"
year_groups = clean_comments.groupby('created_year')

year_dfs = {'comments_{}'.format(year): group for year, group in year_groups}

In [32]:
year_dfs["comments_2021"]

Unnamed: 0,id,subreddit.name,subreddit.nsfw,created_utc,permalink,sentiment,score,created_date,created_day,created_month,created_year,created_time,body_clean
99109,gtp03oi,yanggang,False,1.617807e+09,https://old.reddit.com/r/YangGang/comments/mli...,0.7522,1.0,2021-04-07,07,04,2021,14:54:16,"Ah ok I got my facts mixed up, thx for correct..."
99110,hkrr8ry,politicalcompass,False,1.637012e+09,https://old.reddit.com/r/PoliticalCompass/comm...,-0.5859,1.0,2021-11-15,15,11,2021,21:28:46,Oh boohoo muh gas prices. Gas prices were arti...
99111,h3cru7f,danlebatardshow,False,1.624908e+09,https://old.reddit.com/r/DanLeBatardShow/comme...,-0.4497,2.0,2021-06-28,28,06,2021,19:21:54,"it is, but i made the ""blame"" assumptions too ..."
99112,hlu80m8,neoliberal,False,1.637714e+09,https://old.reddit.com/r/neoliberal/comments/r...,-0.8639,7.0,2021-11-24,24,11,2021,00:33:26,Making carbon emissions reflect the externalit...
99113,h4ym10o,worldnews,False,1.626122e+09,https://old.reddit.com/r/worldnews/comments/oi...,-0.4491,20.0,2021-07-12,12,07,2021,20:34:18,This could be a watershed moment if civilizati...
...,...,...,...,...,...,...,...,...,...,...,...,...,...
198438,gyv4lk2,politicaldiscussion,False,1.621542e+09,https://old.reddit.com/r/PoliticalDiscussion/c...,0.4873,0.0,2021-05-20,20,05,2021,20:14:52,I don’t see how republicans win this. \n\nAcco...
198439,hijfvds,environment,False,1.635530e+09,https://old.reddit.com/r/environment/comments/...,-0.7770,0.0,2021-10-29,29,10,2021,17:45:58,"Dude, electric isn't the only solution to the ..."
198440,h2wmlqa,askscience,False,1.624555e+09,https://old.reddit.com/r/askscience/comments/o...,-0.0258,11.0,2021-06-24,24,06,2021,17:17:25,Those leave the area relatively uninhabitable ...
198441,h6b21z8,environment,False,1.627082e+09,https://old.reddit.com/r/environment/comments/...,0.5000,1.0,2021-07-23,23,07,2021,23:19:18,Everything I said in that comment is as true a...


In [33]:
# Collect the cleaned comment bodies of one year in an array to feed into BERTopic
docs = year_dfs["comments_2010"].body_clean.values


In [23]:
# BERTopic stepwise
# Step 1 - Extract embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Step 2 - Reduce dimensionality
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine')

# Step 3 - Cluster reduced embeddings
hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

# Step 4 - Tokenize topics
vectorizer_model = CountVectorizer(stop_words="english")

# Step 5 - Create topic representation
ctfidf_model = ClassTfidfTransformer()

# All steps together
topic_model = BERTopic(
  embedding_model=embedding_model,    # Step 1 - Extract embeddings
  umap_model=umap_model,              # Step 2 - Reduce dimensionality
  hdbscan_model=hdbscan_model,        # Step 3 - Cluster reduced embeddings
  vectorizer_model=vectorizer_model,  # Step 4 - Tokenize topics
  ctfidf_model=ctfidf_model,          # Step 5 - Create topic representation
  calculate_probabilities=False,      # Skips per-topic probabilities, which speeds up fitting
  min_topic_size=300,                 # Not used when a custom hdbscan_model is passed; min_cluster_size applies instead
  nr_topics="auto",                   # Merges similar topics after clustering
  verbose=True
)

topics, probs = topic_model.fit_transform(docs)
topic_model.get_topic_info()

Batches: 100%|██████████| 500/500 [04:10<00:00,  2.00it/s]
2023-05-03 10:44:09,965 - BERTopic - Transformed documents to Embeddings
2023-05-03 10:44:40,232 - BERTopic - Reduced dimensionality
2023-05-03 10:44:40,674 - BERTopic - Clustered reduced embeddings
2023-05-03 10:44:45,574 - BERTopic - Reduced number of topics from 106 to 74


Unnamed: 0,Topic,Count,Name
0,-1,8554,-1_people_gt_global_just
1,0,1544,0_science_people_gt_don
2,1,1154,1_weather_warming_global_snow
3,2,505,2_temperature_atmosphere_water_vapor
4,3,315,3_obama_rudd_labor_party
...,...,...,...
69,68,16,68_grassroots_poptart_article_typety
70,69,16,69_causation_correlation_does_equal
71,70,16,70_economics_economists_nafta_maths
72,71,15,71_ica_water_peru_aquifer


In [24]:
# Get the top terms and their c-TF-IDF weights for a specific topic
topic_model.get_topic(0)

[('science', 0.01219470217097759),
 ('people', 0.010584660869951001),
 ('gt', 0.009517132731414605),
 ('don', 0.0086000351186167),
 ('scientific', 0.008597060961067942),
 ('global', 0.00822324026848448),
 ('think', 0.00787297838775973),
 ('scientists', 0.007625308000465114),
 ('evolution', 0.007521849591286694),
 ('warming', 0.007371242329106124)]

In [27]:
# Store the per-document topic assignments in a dataframe
doc_info = topic_model.get_document_info(docs)

In [28]:
# Check out document information
doc_info.head()

Unnamed: 0,Document,Topic,Name,Top_n_words,Probability,Representative_document
0,Industrial output --&gt; Increased atmospheric...,-1,-1_people_gt_global_just,people - gt - global - just - science - don - ...,0.0,False
1,This is true but only because Australia lacks ...,-1,-1_people_gt_global_just,people - gt - global - just - science - don - ...,0.0,False
2,"Please, explain to us the whole concept.\n\nNe...",-1,-1_people_gt_global_just,people - gt - global - just - science - don - ...,0.0,False
3,"It hasn't been ""d"" some political types prefer...",1,1_weather_warming_global_snow,weather - warming - global - snow - ice - temp...,1.0,False
4,"&gt; It's called "" "" or as the right wing has ...",1,1_weather_warming_global_snow,weather - warming - global - snow - ice - temp...,1.0,False
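
Topic -1 collects the comments that HDBSCAN could not assign to any cluster, so it can be useful to check how large that share is before interpreting the topics. The following is a minimal sketch on a made-up frame that only mimics the `Document`, `Topic` and `Name` columns shown above; the variable names `toy_doc_info`, `outlier_share`, `assigned` and `topic_sizes` are hypothetical.

```python
import pandas as pd

# Toy stand-in for doc_info, using the column names from the output above
toy_doc_info = pd.DataFrame({
    "Document": ["comment a", "comment b", "comment c", "comment d", "comment e"],
    "Topic": [-1, -1, -1, 1, 1],
    "Name": [
        "-1_people_gt_global_just",
        "-1_people_gt_global_just",
        "-1_people_gt_global_just",
        "1_weather_warming_global_snow",
        "1_weather_warming_global_snow",
    ],
})

# Share of documents that were left unassigned (topic -1)
outlier_share = (toy_doc_info["Topic"] == -1).mean()

# Keep only documents that belong to an actual topic
assigned = toy_doc_info[toy_doc_info["Topic"] != -1]

# Number of documents per remaining topic
topic_sizes = assigned["Topic"].value_counts()

print(outlier_share)
print(topic_sizes)
```

The same three statements should work on the real `doc_info` frame, assuming it keeps the `Topic` column shown in the output.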
