# Reddit Scraper for PlayStation Posts 🎮

**robots.txt:** https://www.reddit.com/robots.txt
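Before scraping, it can help to check a URL against the rules in that robots.txt file. The following is a minimal sketch using the standard library's `urllib.robotparser`. The helper name `is_allowed` and the sample rules are assumptions made up for illustration; they are not Reddit's actual rules.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, url, user_agent="*"):
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse rules from an iterable of text lines
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only; these are NOT Reddit's real rules.
sample_rules = [
    "User-agent: *",
    "Disallow: /private/",
]

print(is_allowed(sample_rules, "https://www.example.com/r/PSVR/top/?t=year"))
print(is_allowed(sample_rules, "https://www.example.com/private/page"))
```

To check the live file instead, `RobotFileParser` also offers `set_url()` followed by `read()`, which downloads and parses the rules in one step.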

Subreddits to scrape:

- https://www.reddit.com/r/playstation/
- https://www.reddit.com/r/PlaystationPortal/
- https://www.reddit.com/r/PlayStationPlus/
- https://www.reddit.com/r/PS4/
- https://www.reddit.com/r/PS5/
- https://www.reddit.com/r/PSVR/

**Note:** I use the `/top/?t=year` URL (top posts of the past year) to filter out the noise, then take the top 1,000 posts from that listing.
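As a rough sketch of that note, the listing URL can be built per subreddit and the result capped at 1,000 posts. The names `SUBREDDITS`, `MAX_POSTS`, `build_top_url`, and `keep_top` are hypothetical helpers, not part of the notebook.

```python
from urllib.parse import urlencode

# Subreddit names taken from the list above.
SUBREDDITS = ["playstation", "PlaystationPortal", "PlayStationPlus", "PS4", "PS5", "PSVR"]
MAX_POSTS = 1000  # cap described in the note


def build_top_url(subreddit, period="year"):
    """Build the 'top posts' listing URL for one subreddit and time period."""
    query = urlencode({"t": period})
    return f"https://www.reddit.com/r/{subreddit}/top/?{query}"


def keep_top(posts, limit=MAX_POSTS):
    """Keep at most `limit` posts, assuming they are already in ranked order."""
    return posts[:limit]


for name in SUBREDDITS:
    print(build_top_url(name))
```

`keep_top` relies on the page listing posts in ranked order, which is an assumption about how the `/top/` view is rendered.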

In [3]:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
from bs4 import BeautifulSoup
import pandas as pd

In [4]:
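def scroll_to_bottom(driver):
    """Scroll until the page height stops growing, so lazily loaded posts get rendered."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(5)  # give the next batch of posts time to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:  # nothing new loaded, so stop scrolling
            break
        last_height = new_height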
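The loop above runs until the page stops growing, which can take a long time on a large feed. One possible variation, sketched here under the assumption that a hard cap is acceptable, adds a `max_scrolls` limit and a configurable pause. `scroll_with_limit` and its parameters are hypothetical and not part of the notebook.

```python
import time


def scroll_with_limit(driver, pause=5.0, max_scrolls=200):
    """Scroll until the page height stops growing or max_scrolls is reached.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    while scrolls < max_scrolls:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        scrolls += 1
        time.sleep(pause)  # wait for newly requested posts to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:  # no growth, so assume the feed has ended
            break
        last_height = new_height
    return scrolls
```

It only relies on `driver.execute_script`, so any object with that method can stand in for a real browser when trying it out.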

In [5]:
def extract_data(driver):
    """Parse the loaded page and return one dict per post."""
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    articles = soup.find_all('article')
    data = []
    for article in articles:
        # Post metadata is read from attributes of the <shreddit-post> element
        post = article.find('shreddit-post')
        if post:
            data.append({
                'Date': post.get('created-timestamp', ''),
                'Title': post.get('post-title', ''),
                'Link': 'https://www.reddit.com' + post.get('permalink', ''),
                'Comments': post.get('comment-count', '0'),
                'Author': post.get('author', ''),
                'Upvotes': post.get('score', '0')
            })
    return data
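Every value returned by `extract_data` is a string, because it comes from an HTML attribute. A minimal sketch of converting the fields to richer types is shown below. `clean_record` is a hypothetical helper, and the timestamp format is inferred from the printed output further down (for example `2023-06-05T21:05:14.701000+0000`), so it may need adjusting if the attribute format differs.

```python
from datetime import datetime

# Format inferred from the printed 'Date' column; treat it as an assumption.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def clean_record(record):
    """Return a copy of one scraped row with a parsed date and integer counts."""
    cleaned = dict(record)
    raw_date = record.get("Date", "")
    cleaned["Date"] = datetime.strptime(raw_date, TIMESTAMP_FORMAT) if raw_date else None
    cleaned["Comments"] = int(record.get("Comments") or 0)
    cleaned["Upvotes"] = int(record.get("Upvotes") or 0)
    return cleaned


# Illustrative row: Link, Comments, Author, and Upvotes are made-up placeholders.
sample = {
    "Date": "2023-06-05T21:05:14.701000+0000",
    "Title": "The PSVR2's price isn't looking so bad now.",
    "Link": "https://www.reddit.com/r/PSVR/comments/placeholder/",
    "Comments": "12",
    "Author": "placeholder_user",
    "Upvotes": "345",
}

print(clean_record(sample))
```

Parsing the counts as integers makes later sorting and filtering numeric rather than alphabetical.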

In [1]:
def scrape_reddit():
    # note for me: r/Playstation 🟢 r/PlaystationPortal 🟢 r/PlaystationPlus 🟢 r/PS4 🟢 r/PS5 🟢 r/PSVR 🟢
    driver = webdriver.Chrome()
    try:
        driver.get("https://www.reddit.com/r/PSVR/top/?t=year")
        scroll_to_bottom(driver)
        scraped_data = extract_data(driver)
    finally:
        driver.quit()  # close the browser even if scraping fails
    return pd.DataFrame(scraped_data)
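The function above targets one hard-coded subreddit, so the URL has to be edited by hand for each run. One way to cover all six in a single pass is sketched below. It assumes a hypothetical `scrape_one(url)` callable that returns a list of row dicts, for example a variant of `scrape_reddit` that accepts the URL as an argument; `scrape_many` and the added `Subreddit` column are not part of the notebook.

```python
def scrape_many(subreddits, scrape_one):
    """Call scrape_one(url) for each subreddit and tag every row with its source."""
    rows = []
    for name in subreddits:
        url = f"https://www.reddit.com/r/{name}/top/?t=year"
        for row in scrape_one(url):
            # Copy the row and record which subreddit it came from
            rows.append({**row, "Subreddit": name})
    return rows


def fake_scrape_one(url):
    """Stand-in scraper for demonstration; returns one placeholder row per URL."""
    return [{"Title": "placeholder title", "Link": url}]


rows = scrape_many(["PS4", "PS5", "PSVR"], fake_scrape_one)
print(len(rows), rows[0]["Subreddit"])
```

Passing the scraper in as an argument keeps the loop independent of Selenium, so it can be tried with a stand-in like `fake_scrape_one` before opening a real browser.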

In [6]:
df = scrape_reddit()
print(df)

                                Date  \
0    2023-03-05T22:09:03.058000+0000   
1    2023-03-09T15:42:22.709000+0000   
2    2023-02-25T19:24:20.976000+0000   
3    2023-02-18T16:02:00.029000+0000   
4    2023-06-05T21:05:14.701000+0000   
..                               ...   
243  2023-02-17T21:02:57.490000+0000   
244  2023-02-25T21:11:55.799000+0000   
245  2023-02-18T19:14:10.881000+0000   
246  2023-03-07T14:26:49.853000+0000   
247  2023-02-22T08:45:13.177000+0000   

                                                 Title  \
0    The VR2 Controller charging station just burne...   
1                  I can’t see any improvements at all   
2    The only negative thing about these controller...   
3    I'm ecstatic. Can't believe it I got it this e...   
4          The PSVR2's price isn't looking so bad now.   
..                                                 ...   
243  Without Parole: "The PSVR2 game announcements ...   
244  POV for those curious about PSVR2 image clarit... 

# Testing with itables 🧪

In [7]:
from itables import show

In [8]:
print(df.shape)

(248, 6)


In [9]:
show(df)



In [10]:
# Save the raw results to CSV (uncomment to write the file)
# df.to_csv('RAW_rPSVR_reddit_titles.csv', index=False)