### Pre-talk notes for Speaker!
* get API keys

During talk:
* Run Jupyter from Terminal
* Log in to UKDS Reddit account
* Open the [Reddit API Dashboard](https://old.reddit.com/prefs/apps/)
* Share public link -> https://github.com/UKDataServiceOpen/working-with-twitter-data/blob/main/TwarcDemo.ipynb
* Share Binder Link -> TODO
* Make Binder -> TODO

Talk time - TODO

In [40]:
# API KEYS - DELETE AFTER TALK
client_id="REPLACEME"
client_secret="REPLACEME"

# Reddit Scraping Demo
In this notebook we detail the recommended way to collect large amounts of Reddit data. 

Previously we recommended Twitter/X as a fantastic source of social media textual data. Sadly, that's no longer the case.

The Reddit API is a close approximation, though markedly more complicated. Where a tweet stands on its own without context, most Reddit comments relate in some way to the post title or to earlier comments in the thread, so they carry more inherent bias in that regard.

The good news is that the Reddit API is free. We can get roughly 100 comments per request × 60 requests per minute = ~6,000 comments/minute. With clever programming we can theoretically grab all comments from a subreddit.
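As a rough back-of-the-envelope check on those numbers, the sketch below estimates how long a collection might take. The function name `estimated_minutes` and its default values are illustrative assumptions taken from the figures above, not limits guaranteed by Reddit.

```python
import math


def estimated_minutes(total_comments, comments_per_request=100, requests_per_minute=60):
    """Rough lower bound on collection time, in minutes."""
    # Each request returns at most `comments_per_request` comments.
    requests_needed = math.ceil(total_comments / comments_per_request)
    return requests_needed / requests_per_minute


print(estimated_minutes(6_000))   # 1.0
print(estimated_minutes(50_000))
```

Real runs will usually be slower, since comment threads rarely come back in full batches of 100.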

The best package for scraping Reddit is the Python Reddit API Wrapper, or [PRAW](https://praw.readthedocs.io/en/stable/getting_started/installation.html).

# Installation
You will need Python 3 and pip3 available on your local machine. I recommend getting both by installing [Anaconda](https://docs.anaconda.com/anaconda/install/index.html).

We can check these exist by typing the following in a terminal:
```
python
```
which should open a REPL and print out our Python version.
And:
```
pip3
```
which should print the pip3 help text. If either command fails, you will need to install Python from [here](https://www.python.org/downloads/), which may take some time.

Finally, install PRAW:
```
pip3 install praw
```
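The same check can be done from inside Python. This is a minimal sketch, assuming we only care about the three packages this notebook imports; `missing_packages` is a helper name made up for this example.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Python version:", sys.version.split()[0])
print("Missing packages:", missing_packages(["praw", "pandas", "openpyxl"]))
```

An empty list means everything is already installed and the `pip3 install` lines in the next cell can stay commented out.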

In [41]:
# !pip3 install praw
# !pip3 install pandas
# !pip3 install openpyxl

In [42]:
# You will have to install PRAW if you haven't already
import praw
# This checks that PRAW is installed
print(praw.__version__)

7.8.1


In [43]:
# Import other helpful packages
import pandas as pd
import openpyxl
from datetime import datetime, timezone

# API Keys
To use PRAW with the Reddit API, you will need a Reddit account and an API key:
1. Make a Reddit account [here](https://www.reddit.com/register).
2. Make a Reddit API key [here](https://www.reddit.com/prefs/apps).
3. Click "are you a developer? Make an app..."
4. Mark it as a Web app
5. Fill in all details with a name, description, and URLs relating to your project.

You should see two values:
1. Client ID
2. Client Secret

We won't go into much detail on these, but treat both values like a password: anyone who has them can make API requests in your app's name. If you are using a personal account and care about its history, do not share these EVER. I'd recommend making a new Reddit account for research purposes.
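One way to keep the two values out of the notebook itself is to read them from environment variables. This is only a sketch: the variable names `REDDIT_CLIENT_ID` and `REDDIT_CLIENT_SECRET` and the helper `load_reddit_credentials` are assumptions made for this example, not something PRAW requires.

```python
import os


def load_reddit_credentials(env=os.environ):
    """Read the client ID and secret from environment variables.

    REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET are names chosen for this sketch.
    """
    try:
        return env["REDDIT_CLIENT_ID"], env["REDDIT_CLIENT_SECRET"]
    except KeyError as missing:
        raise RuntimeError(f"Set the {missing.args[0]} environment variable first") from None


# Demo with a plain dict standing in for the real environment
demo_env = {"REDDIT_CLIENT_ID": "REPLACEME", "REDDIT_CLIENT_SECRET": "REPLACEME"}
demo_id, demo_secret = load_reddit_credentials(demo_env)
print(demo_id, demo_secret)
```

Calling `load_reddit_credentials()` with no argument reads the real environment, so the notebook can be shared without the keys in it.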

# Get all comments from a single Reddit post

In [44]:
# 1. Authenticate (replace with your credentials)
reddit = praw.Reddit(
    client_id=client_id,
    client_secret=client_secret,
    user_agent="python:testscript:v0.1 (by /u/myusername)",
)
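The `user_agent` string in the cell above appears to follow the `<platform>:<app ID>:<version> (by /u/<username>)` pattern that Reddit's API guidelines suggest. A small helper can build it from its parts; `build_user_agent` is a hypothetical name used only for this sketch.

```python
def build_user_agent(platform, app_id, version, username):
    """Assemble a descriptive user agent string from its four parts."""
    return f"{platform}:{app_id}:{version} (by /u/{username})"


print(build_user_agent("python", "testscript", "v0.1", "myusername"))
# python:testscript:v0.1 (by /u/myusername)
```

Swap in your own app name and Reddit username so that your requests are identifiable.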

In [45]:
# 2. Get the submission
url = "https://old.reddit.com/r/ghana/comments/18b74n9/i_would_love_to_visit_ghana_what_should_i_know/"
submission = reddit.submission(url=url)

# 3. Load all comments
submission.comments.replace_more(limit=None)
all_comments = submission.comments.list()

# 4. Prepare list of dicts for DataFrame
data = []
for comment in all_comments:
    data.append({
        "author": str(comment.author),
        "score": comment.score,
        "created_utc": datetime.fromtimestamp(comment.created_utc, timezone.utc).replace(tzinfo=None),
        "body": comment.body.replace("\n", " "),
    })

# Preview the first 4 comments
for comment in all_comments[:4]:
    print(comment.body)

Thanks OP for your submission. This sub is heavily moderated by Auto Mod and your post may be mistakenly removed automatically. Please send a message to the mods or u/JuliusCeaserBoneHead for manual approval. Before you do that, make sure your post does not break any of r/Ghana rules especially rule 4 (No Self Promotion).


*I am a bot, and this action was performed automatically. Please [contact the moderators of this subreddit](/message/compose/?to=/r/ghana) if you have any questions or concerns.*
Travelling here between June - September means you'll meet pretty mild weather. Pack enough comfortable tshirts and shorts to accommodate for multiple changes a day. Don't stay in Accra throughout your visit. 

Go to :

-Akosombo (see the Volta lake and multiple riverfront villas)
-Tamale (Bole national park and Paga crocodile pond if possible in the upper east)
-then Kumasi on your way back (Lake Bosomtwe, Kumasi royal museum)

The local cuisine isn't bad. Keep an open mind.

Not too sure 
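The row-building step in the loop above can be pulled into a function and tried out without touching the API. This is a sketch under a few assumptions: `comment_to_row` is a name made up here, a `SimpleNamespace` stands in for a real PRAW comment, and the `"[deleted]"` label is our own choice. In PRAW, `comment.author` is `None` when the account has been deleted, which `str()` would otherwise turn into the text `"None"`.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def comment_to_row(comment):
    """Flatten one comment-like object into a dict of plain Python values."""
    return {
        # Label deleted accounts explicitly instead of storing the string "None"
        "author": str(comment.author) if comment.author is not None else "[deleted]",
        "score": comment.score,
        # Convert the Unix timestamp to UTC, then drop the timezone info
        "created_utc": datetime.fromtimestamp(comment.created_utc, timezone.utc).replace(tzinfo=None),
        # Collapse newlines and repeated spaces into single spaces
        "body": " ".join(comment.body.split()),
    }


# A stand-in object with the same attribute names as a PRAW comment
fake_comment = SimpleNamespace(author=None, score=3, created_utc=1701763246, body="Line one\n\nLine two")
print(comment_to_row(fake_comment))
```

Dropping the timezone keeps the column compatible with the Excel export later on, which does not accept timezone-aware datetimes. Note that `" ".join(body.split())` collapses all runs of whitespace, which is slightly more aggressive than the `replace("\n", " ")` used in the cell above.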

In [46]:
# 5. Create DataFrame
df = pd.DataFrame(data)

df.head()

| | author | score | created_utc | body |
|---|---|---|---|---|
| 0 | AutoModerator | 1 | 2023-12-05 08:00:46 | Thanks OP for your submission. This sub is hea... |
| 1 | OG_rafiki | 20 | 2023-12-05 08:33:14 | Travelling here between June - September means... |
| 2 | steepcurve | 11 | 2023-12-05 08:55:53 | If you go to pub/club. Keep count on your drin... |
| 3 | Adorable_Rub_8257 | 7 | 2023-12-07 22:46:28 | Reading all the comments here got me smiling. ... |
| 4 | MyRockMyRefuge | 4 | 2023-12-05 08:52:48 | It’s great to hear you want to visit Ghana. Th... |


In [47]:
# 6. Save to CSV
df.to_csv("data/single_post_reddit_comments.csv", index=False, encoding="utf-8")

print(f"Saved {len(df)} comments to data/single_post_reddit_comments.csv")

Saved 42 comments to data/single_post_reddit_comments.csv
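The save cells write into a `data/` folder, and pandas will raise an error if that folder does not exist yet. A minimal sketch of a guard, assuming a helper called `ensure_parent_dir` (a name invented for this example), demonstrated in a temporary directory so it leaves nothing behind:

```python
import tempfile
from pathlib import Path


def ensure_parent_dir(path):
    """Create the folder that `path` lives in (if needed) and return the path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


with tempfile.TemporaryDirectory() as tmp:
    out_path = ensure_parent_dir(Path(tmp) / "data" / "example.csv")
    print(out_path.parent.is_dir())  # True
```

In the notebook, the idea would be to pass `ensure_parent_dir("data/single_post_reddit_comments.csv")` as the first argument to `df.to_csv`.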


In [48]:
# 7. Or save to an Excel spreadsheet
df.to_excel("data/single_post_reddit_comments.xlsx", index=False, engine='openpyxl')

print(f"Saved {len(df)} comments to data/single_post_reddit_comments.xlsx")

Saved 42 comments to data/single_post_reddit_comments.xlsx


# Get all comments from multiple Reddit posts

In [49]:
# List of URLs
urls = [
    "https://old.reddit.com/r/ghana/comments/18b74n9/i_would_love_to_visit_ghana_what_should_i_know/",
    "https://old.reddit.com/r/geography/comments/1bpw6wx/tell_me_something_interesting_about_ghana/",
    "https://old.reddit.com/r/travel/comments/sq9gv6/ghana_is_a_gorgeous_safe_and_diverse_country/",
]

# Prepare list to hold all comment data
data = []

for url in urls:
    submission = reddit.submission(url=url)
    submission.comments.replace_more(limit=None)
    all_comments = submission.comments.list()

    for comment in all_comments:
        data.append({
            "url": url,
            "post_id": submission.id,
            "post_title": submission.title,
            "author": str(comment.author),
            "score": comment.score,
            "created_utc": datetime.fromtimestamp(comment.created_utc, timezone.utc).replace(tzinfo=None),
            "body": comment.body.replace("\n", " "),
        })

# Preview the first 4 collected comments
for comment in data[:4]:
    print(f"[{comment['post_id']}] {comment['post_title']} - {comment['author']}: {comment['body']}")


[18b74n9] I would love to visit Ghana. What should I know? - AutoModerator: Thanks OP for your submission. This sub is heavily moderated by Auto Mod and your post may be mistakenly removed automatically. Please send a message to the mods or u/JuliusCeaserBoneHead for manual approval. Before you do that, make sure your post does not break any of r/Ghana rules especially rule 4 (No Self Promotion).   *I am a bot, and this action was performed automatically. Please [contact the moderators of this subreddit](/message/compose/?to=/r/ghana) if you have any questions or concerns.*
[18b74n9] I would love to visit Ghana. What should I know? - OG_rafiki: Travelling here between June - September means you'll meet pretty mild weather. Pack enough comfortable tshirts and shorts to accommodate for multiple changes a day. Don't stay in Accra throughout your visit.   Go to :  -Akosombo (see the Volta lake and multiple riverfront villas) -Tamale (Bole national park and Paga crocodile pond if possible i
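When the URL list is assembled by hand, the same post can sneak in twice (for example once as an `old.reddit.com` link and once as a `www.reddit.com` link). The sketch below pulls the post ID out of each URL and keeps only the first link per post. Both function names are made up for this example, and it assumes the usual `/r/<subreddit>/comments/<id>/...` URL shape seen in the list above.

```python
from urllib.parse import urlparse


def submission_id_from_url(url):
    """Return the post ID that follows /comments/ in a Reddit URL, or None."""
    parts = [part for part in urlparse(url).path.split("/") if part]
    if "comments" in parts:
        idx = parts.index("comments")
        if idx + 1 < len(parts):
            return parts[idx + 1]
    return None


def unique_by_submission(urls):
    """Keep the first URL seen for each post ID, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = submission_id_from_url(url) or url
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique


print(submission_id_from_url(
    "https://old.reddit.com/r/ghana/comments/18b74n9/i_would_love_to_visit_ghana_what_should_i_know/"
))  # 18b74n9
```

Running `urls = unique_by_submission(urls)` before the loop would avoid downloading the same thread twice.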

In [50]:
# 5. Create DataFrame
df = pd.DataFrame(data)

df.head()

| | url | post_id | post_title | author | score | created_utc | body |
|---|---|---|---|---|---|---|---|
| 0 | https://old.reddit.com/r/ghana/comments/18b74n... | 18b74n9 | I would love to visit Ghana. What should I know? | AutoModerator | 1 | 2023-12-05 08:00:46 | Thanks OP for your submission. This sub is hea... |
| 1 | https://old.reddit.com/r/ghana/comments/18b74n... | 18b74n9 | I would love to visit Ghana. What should I know? | OG_rafiki | 21 | 2023-12-05 08:33:14 | Travelling here between June - September means... |
| 2 | https://old.reddit.com/r/ghana/comments/18b74n... | 18b74n9 | I would love to visit Ghana. What should I know? | steepcurve | 10 | 2023-12-05 08:55:53 | If you go to pub/club. Keep count on your drin... |
| 3 | https://old.reddit.com/r/ghana/comments/18b74n... | 18b74n9 | I would love to visit Ghana. What should I know? | Adorable_Rub_8257 | 8 | 2023-12-07 22:46:28 | Reading all the comments here got me smiling. ... |
| 4 | https://old.reddit.com/r/ghana/comments/18b74n... | 18b74n9 | I would love to visit Ghana. What should I know? | MyRockMyRefuge | 3 | 2023-12-05 08:52:48 | It’s great to hear you want to visit Ghana. Th... |


In [51]:
# 6. Save to CSV
df.to_csv("data/multi_post_reddit_comments.csv", index=False, encoding="utf-8")

print(f"Saved {len(df)} comments to data/multi_post_reddit_comments.csv")

Saved 797 comments to data/multi_post_reddit_comments.csv


In [52]:
# 7. Or save to an Excel spreadsheet
df.to_excel("data/multi_post_reddit_comments.xlsx", index=False, engine='openpyxl')

print(f"Saved {len(df)} comments to data/multi_post_reddit_comments.xlsx")

Saved 797 comments to data/multi_post_reddit_comments.xlsx


# Get all comments from an entire subreddit

In [53]:
subreddit_name = "ghana"  # change to any subreddit you want

# Get top 100 posts from the last year
subreddit = reddit.subreddit(subreddit_name)
submissions = subreddit.top(time_filter="year", limit=100)

data = []

for submission in submissions:
    submission.comments.replace_more(limit=None)
    all_comments = submission.comments.list()

    for comment in all_comments:
        data.append({
            "url": f"https://old.reddit.com{submission.permalink}",
            "post_id": submission.id,
            "post_title": submission.title,
            "author": str(comment.author),
            "score": comment.score,
            "created_utc": datetime.fromtimestamp(comment.created_utc, timezone.utc).replace(tzinfo=None),
            "body": comment.body.replace("\n", " "),
        })

# Preview the first 4 comments
for comment in data[:4]:
    print(f"[{comment['post_id']}] {comment['post_title']} - {comment['author']}: {comment['body']}")


[1g2mnuu] Bolt Driver - AutoModerator: Introducing the !medaase app. If someone's comment/post helps you, use !medaase as a reply to them to add a reputation to their profile. Users with the highest reputations will have their comments and posts auto approved and rise to the top of comments. Users can also use their reputation as a flair. Hello /u/ONDickson_, Did your post get removed? please read the subreddit rules. /r/ghana/about/rules/. Please send a message to r/ghana or u/JuliusCeaserBoneHead for manual approval.   *I am a bot, and this action was performed automatically. Please [contact the moderators of this subreddit](/message/compose/?to=/r/ghana) if you have any questions or concerns.*
[1g2mnuu] Bolt Driver - Kenshia09: Such act of kindness. This gives me hope in humanity 🥹
[1g2mnuu] Bolt Driver - ONDickson_: [Bolt Response](https://imgur.com/gallery/bolt-response-1gTLOvn)
[1g2mnuu] Bolt Driver - sbirdhall: Thou shall not steal. But you or your friend need to be more careful
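As the preview shows, the first comment on a post is often an automated moderator message. Whether to keep these is a research decision, but if they are not wanted, a filter like the sketch below could be applied to `data` before building the DataFrame. `drop_authors` and the sample rows are illustrative assumptions, not part of PRAW.

```python
def drop_authors(rows, excluded=("AutoModerator",)):
    """Return only the rows whose author is not in `excluded`."""
    excluded = set(excluded)
    return [row for row in rows if row["author"] not in excluded]


# Made-up rows with the same keys as the dicts collected above
sample = [
    {"author": "AutoModerator", "body": "I am a bot..."},
    {"author": "example_user", "body": "A human reply"},
]
print(len(drop_authors(sample)))  # 1
```

Keeping the unfiltered `data` list around as well makes it easy to report how many comments were removed.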

In [54]:
# 5. Create DataFrame
df = pd.DataFrame(data)

# print first 5 rows, formatted
df.head()

| | url | post_id | post_title | author | score | created_utc | body |
|---|---|---|---|---|---|---|---|
| 0 | https://old.reddit.com/r/ghana/comments/1g2mnu... | 1g2mnuu | Bolt Driver | AutoModerator | 1 | 2024-10-13 10:25:50 | Introducing the !medaase app. If someone's com... |
| 1 | https://old.reddit.com/r/ghana/comments/1g2mnu... | 1g2mnuu | Bolt Driver | Kenshia09 | 74 | 2024-10-13 11:15:13 | Such act of kindness. This gives me hope in hu... |
| 2 | https://old.reddit.com/r/ghana/comments/1g2mnu... | 1g2mnuu | Bolt Driver | ONDickson_ | 25 | 2024-10-13 10:26:26 | [Bolt Response](https://imgur.com/gallery/bolt... |
| 3 | https://old.reddit.com/r/ghana/comments/1g2mnu... | 1g2mnuu | Bolt Driver | sbirdhall | 21 | 2024-10-13 15:48:26 | Thou shall not steal. But you or your friend n... |
| 4 | https://old.reddit.com/r/ghana/comments/1g2mnu... | 1g2mnuu | Bolt Driver | kweikuz | 11 | 2024-10-13 13:17:50 | we do exist |


In [55]:
# Print some basic information
df.describe()

| | score | created_utc |
|---|---|---|
| count | 8169.0 | 8169 |
| mean | 4.378382 | 2024-09-29 06:37:57.768637696 |
| min | -54.0 | 2024-04-24 13:15:37 |
| 25% | 1.0 | 2024-08-17 12:07:00 |
| 50% | 2.0 | 2024-09-25 13:54:54 |
| 75% | 4.0 | 2024-12-04 15:10:58 |
| max | 165.0 | 2025-07-26 02:29:55 |
| std | 10.30668 | |


In [56]:
# 6. Save to CSV
df.to_csv("data/ghana_subreddit_comments.csv", index=False, encoding="utf-8")

print(f"Saved {len(df)} comments to data/ghana_subreddit_comments.csv")

Saved 8169 comments to data/ghana_subreddit_comments.csv


In [57]:
# 7. Or save to an Excel spreadsheet
df.to_excel("data/ghana_subreddit_comments.xlsx", index=False, engine='openpyxl')

print(f"Saved {len(df)} comments to data/ghana_subreddit_comments.xlsx")

Saved 8169 comments to data/ghana_subreddit_comments.xlsx
