# Reddit Climate Change - Data Preparation
Supervision: Prof. Dr. Jan Fabian Ehmke

Group members: Britz Luis, Huber Anja, Krause Felix Elias, Preda Yvonne-Nadine

Time: Summer term 2023 

Data: https://www.kaggle.com/datasets/pavellexyr/the-reddit-climate-change-dataset

In [1]:
# Loading packages
import pandas as pd
import matplotlib.pyplot as plt

## Load data and pre-processing

### Data import

In [22]:
# Load the data (nrows limits how many rows are read from each file)
raw_comments = pd.read_csv('data/the-reddit-climate-change-dataset-comments.csv', nrows=1000)
raw_posts = pd.read_csv('data/the-reddit-climate-change-dataset-posts.csv', nrows=50000)
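
Both files are only partially read here via `nrows`. If the complete files are needed later, one option is to stream them in chunks instead of loading everything at once. The following is a minimal sketch on an in-memory stand-in for a CSV file; the sample rows and the chunk size are made up for illustration and are not taken from the dataset.

```python
import io
import pandas as pd

# Hypothetical stand-in for one of the CSV files (made-up rows)
csv_text = "id,body\n1,first\n2,\n3,third\n4,fourth\n5,fifth\n"

# nrows reads only the first N data rows
head = pd.read_csv(io.StringIO(csv_text), nrows=2)

# chunksize returns an iterator of DataFrames, so each piece can be
# filtered before it is kept in memory
chunks = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    chunks.append(chunk.dropna(subset=["body"]))
full = pd.concat(chunks, ignore_index=True)

print(len(head), len(full))
```

Filtering inside the loop keeps the peak memory use close to the size of one chunk plus the rows that survive the filter.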

### Empty, removed and deleted entries

In [23]:
# Remove removed, deleted, and empty posts from the post dataset
is_placeholder = raw_posts['selftext'].isin(['[removed]', '[deleted]'])
clean_posts = raw_posts[~is_placeholder].dropna(subset=['selftext'])

In [15]:
# Remove empty comments from the comment dataset
clean_comments = raw_comments.dropna(subset=['body'])
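
Before dropping rows it can help to check how many entries each rule affects. The sketch below does this on a made-up frame that stands in for `raw_posts`; the sample values are invented and the variable names are chosen only for this illustration.

```python
import pandas as pd

# Made-up posts standing in for raw_posts
raw = pd.DataFrame({
    "selftext": ["real text", "[removed]", "[deleted]", None, "more text"],
})

is_removed = raw["selftext"] == "[removed]"
is_deleted = raw["selftext"] == "[deleted]"
is_empty = raw["selftext"].isna()

# Number of rows flagged by each rule
summary = {
    "removed": int(is_removed.sum()),
    "deleted": int(is_deleted.sum()),
    "empty": int(is_empty.sum()),
}

# Keep only the rows that none of the rules flag
clean = raw[~(is_removed | is_deleted | is_empty)]
print(summary, len(clean))
```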

### Duplicates

In [4]:
# Drop duplicates in post and comment dataset
clean_posts = clean_posts.drop_duplicates(subset=["selftext","type"])
clean_comments = clean_comments.drop_duplicates(subset=["body","type"])
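
With `subset`, two rows count as duplicates only if they agree in all listed columns, and by default pandas keeps the first occurrence. A small sketch with made-up comments illustrates this:

```python
import pandas as pd

# Made-up comments: rows "a" and "b" share body and type, row "c" differs in type
comments = pd.DataFrame({
    "id": ["a", "b", "c"],
    "type": ["comment", "comment", "post"],
    "body": ["same text", "same text", "same text"],
})

# Row "b" is dropped; row "c" stays because its type differs
deduped = comments.drop_duplicates(subset=["body", "type"])
print(deduped["id"].tolist())
```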

### Sort out specific words

In [5]:
# Remove the words "climate" and "change" from the comment and post datasets
# (case-insensitive; the match is on substrings, so "changes" becomes "s")
deleted_words = ("climate", "change")

for word in deleted_words:
    clean_posts["selftext"] = clean_posts["selftext"].str.replace(word, "", case=False)
    clean_posts["title"] = clean_posts["title"].str.replace(word, "", case=False)
    clean_comments["body"] = clean_comments["body"].str.replace(word, "", case=False)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  clean_comments["body"] = clean_comments["body"].str.replace(x, "", case=False)
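
The warning above is pandas' notice that a column is being assigned on a DataFrame derived from another one; working on an explicit `.copy()` is the usual way to avoid it. The sketch below combines this with a possible alternative to the loop: a single regular expression with word boundaries, so that words which merely contain "change" are left intact. This is an illustration on made-up data, not the approach used in this notebook.

```python
import pandas as pd

# Made-up comments standing in for raw_comments
raw = pd.DataFrame({"body": ["Climate change is real", "Exchanged views on climate", None]})

# .copy() makes the cleaned frame independent of raw
clean = raw.dropna(subset=["body"]).copy()

# \b restricts the match to whole words, so "Exchanged" is left alone
pattern = r"\b(?:climate|change)\b"
clean["body"] = (
    clean["body"]
    .str.replace(pattern, "", case=False, regex=True)
    .str.replace(r"\s+", " ", regex=True)  # collapse the gaps left behind
    .str.strip()
)
print(clean["body"].tolist())
```

With plain substring replacement, "Exchanged" would be reduced to "Exd", which is why the word-boundary variant may be preferable if the text is analysed word by word afterwards.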


### Bots

In [24]:
# Remove posts from bot subreddits
# Not all content in these subreddits is bot-generated, so this may discard some genuine posts or comments.
# Since enough data remains, we exclude these subreddits entirely.

bot_subreddits = ['bottown2',
                  'subredditsummarybot',
                  'newsbotbot',
                  'blenderbot',
                  'wutbotposts',
                  'testanimalsupportbot',
                  'interfaithbotdialogue',
                  'bottowngarden',
                  'bottownfriends',
                  'bottown22',
                  'bottown_polibot',
                  'bottown1',
                  'bottown',
                  'testingground4bots',
                  'botterminator',
                  'popularnewsbot',
                  'twitter_bot',
                  'bottalks',
                  'u_anticensor_bot',
                  'u_yangpolicyinfo_bot',
                  'uknewsbyabot',
                  'u_userleansbot',
                  'talkwithgpt2bots',
                  'removalbot',
                  'pulsarbot',
                  'repostsleuthbot',
                  'nwordcountbot',
                  'gwcoepbot',
                  'modbot_staging',
                  'u_commonmisspellingbot',
                  'brokentranslatebot',
                  'gbpolbot',
                  'u_bot4bot',
                  'botsrights',
                  'botsscrewingup',
                  'articlebot',
                  'stabbot',
                  'bot4bottesting',
                  'newsbotmarket',
                  'mimeticsbot',
                  'airsoft_bot',
                  'bottesting',
                  'trollabot',
                  'trollbot',
                  'spacenewsbot',
                  'israelnewsbot',
                  'newsbiasbot',
                  'wikileaksemailbot',
                  'thelinkfixerbot',
                  'quizzybot',
                  'sentimentviewbot',
                  'open_bots_test',
                  'printrbot',
                  'isreactionarybot',
                  'foreveralonebots',
                  'dogetipbot',
                  'havoc_bot',
                  'botrequests',
                  'autowikibot',
                  'atheismbot',
                  'webbot']

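# Drop every post whose subreddit name contains one of the names above (plain substring match)
for name in bot_subreddits:
    clean_posts = clean_posts[~clean_posts['subreddit.name'].str.contains(name, regex=False)]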

In [None]:
# Sort out bot subreddits in comment dataset
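
This cell is still empty. One possible way to fill it is sketched below, assuming that the comment dataset also has a `subreddit.name` column. The sketch uses a shortened, illustrative subset of the `bot_subreddits` list and made-up comments; it builds one combined pattern instead of filtering once per name.

```python
import re
import pandas as pd

# Shortened, illustrative subset of the bot_subreddits list
bot_names = ["bottown", "newsbotbot", "autowikibot"]

# Made-up comments; the 'subreddit.name' column is an assumption
comments = pd.DataFrame({
    "subreddit.name": ["climate", "bottown2", "autowikibot", None, "science"],
    "body": ["a", "b", "c", "d", "e"],
})

# One alternation pattern instead of one filter pass per name
pattern = "|".join(re.escape(name) for name in bot_names)

# na=False treats a missing subreddit name as "not a bot subreddit"
is_bot = comments["subreddit.name"].str.contains(pattern, regex=True, na=False)
filtered = comments[~is_bot]
print(filtered["body"].tolist())
```

Passing `na=False` matters here: without it, a missing subreddit name would produce a missing value in the mask, and negating such a mask with `~` can fail.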

### Convert date and time information

In [13]:
# Create new columns with date and time information
# The Unix timestamp is parsed once per dataset; the string columns are derived from it
for df in (clean_posts, clean_comments):
    created = pd.to_datetime(df['created_utc'], utc=True, unit='s')
    df['created_date'] = created.dt.strftime('%Y-%m-%d')
    df['created_day'] = created.dt.strftime('%d')
    df['created_month'] = created.dt.strftime('%m')
    df['created_year'] = created.dt.strftime('%Y')
    df['created_time'] = created.dt.strftime('%H:%M:%S')
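
To check that the conversion behaves as intended, it can be run on a few Unix timestamps whose UTC date is known. The sketch below uses made-up rows; it also keeps the year as an integer, which is an optional variation and not part of the cell above.

```python
import pandas as pd

# Known Unix timestamps (seconds since 1970-01-01 00:00:00 UTC)
df = pd.DataFrame({"created_utc": [0, 86399, 1_000_000_000]})

# Parse once, then derive the individual columns
created = pd.to_datetime(df["created_utc"], utc=True, unit="s")

# Same format strings as in the notebook cell
df["created_date"] = created.dt.strftime("%Y-%m-%d")
df["created_time"] = created.dt.strftime("%H:%M:%S")

# Optional variation: the year as an integer instead of a string
df["created_year"] = created.dt.year

print(df)
```

Note that `strftime` produces strings, so columns such as `created_month` sort and compare as text unless they are converted to numbers later.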

### Output file

In [None]:
# Output CSV file with relevant data
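
This cell is still empty. A minimal sketch of the export step follows, assuming that only a subset of columns should be written. The frame, the column selection, and the file name `clean_posts.csv` are hypothetical and serve only as an illustration.

```python
import os
import tempfile
import pandas as pd

# Made-up frame standing in for clean_posts
clean = pd.DataFrame({
    "id": ["a1", "b2"],
    "title": ["First", "Second"],
    "created_date": ["2022-01-01", "2022-01-02"],
    "unused": [0, 1],
})

# Hypothetical choice of columns to keep
relevant_columns = ["id", "title", "created_date"]

# A temporary directory is used so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "clean_posts.csv")
    # index=False avoids writing the row index as an extra column
    clean[relevant_columns].to_csv(out_path, index=False)
    restored = pd.read_csv(out_path)

print(restored.columns.tolist())
```

In the notebook itself, `out_path` would point to a location in the project folder instead of a temporary directory.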