# Subreddit scraper

To fine-tune contextual word embeddings, we collect additional unlabeled data from Reddit using the [Pushshift API](https://github.com/pushshift/api) through [PSAW](https://psaw.readthedocs.io/en/latest/).

In [1]:
import praw
import time
from psaw import PushshiftAPI

In [3]:
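import re

def log(action):
    """Print the time elapsed since the last checkpoint, prefixed with `action`."""
    global time_passed
    new_time = time.time()
    print(action + f'took {(new_time - time_passed)} seconds')
    time_passed = new_time

def save_post(text, filename):
    """Append the given post text to the specified file as a single line."""
    # str.replace treats "\n+" as a literal string, so use re.sub to collapse newline runs.
    text = re.sub(r'\n+', ' ', text)
    with open(filename, 'a', encoding="utf-8") as f:
        f.write(text + '\n')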
        
def scrape_subreddit(reddit, subname, filename, n_posts=10):
    """Save up to n_posts text posts from the given subreddit to the given file.

    Submissions are fetched through the module-level `api` client; the
    `reddit` argument is not used directly.
    """
    posts = api.search_submissions(subreddit=subname, mem_safe=True)
    i = 0
    for submission in posts:
        try:
            text = submission.selftext
            # Keep only non-empty posts whose text has not been removed or deleted.
            if text and '[removed]' not in text and '[deleted]' not in text:
                save_post(text, filename)
                i += 1
        except AttributeError:
            # Submissions without a selftext attribute are skipped.
            pass

        if i >= n_posts:
            break

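The post filter and the newline collapsing can be tried without network access by pulling them out into standalone helpers. The following is only a minimal sketch: the names `is_usable`, `to_single_line` and `PLACEHOLDERS` are hypothetical and not part of the notebook.

```python
import re

# Markers that appear in place of a removed or deleted post body.
PLACEHOLDERS = ('[removed]', '[deleted]')


def is_usable(text):
    """Return True if the post body is non-empty and contains no placeholder marker."""
    return bool(text) and not any(p in text for p in PLACEHOLDERS)


def to_single_line(text):
    """Collapse each run of newlines into one space so a post fits on one line."""
    return re.sub(r'\n+', ' ', text)


samples = ['First line.\n\nSecond line.', '[removed]', '', None]
kept = [to_single_line(s) for s in samples if is_usable(s)]
print(kept)  # ['First line. Second line.']
```

Writing one post per line keeps the output file easy to split back into individual posts later.
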
## Credentials
Replace the empty strings below with your own Reddit app credentials. See the [First Steps section of Reddit's OAuth2 quick-start guide](https://github.com/reddit-archive/reddit/wiki/OAuth2-Quick-Start-Example#first-steps) for how to create a Reddit app.

In [2]:
reddit = praw.Reddit(client_id='',
                     client_secret='', 
                     password='',
                     user_agent='', 
                     username='')
api = PushshiftAPI(reddit)
time_passed = time.time()

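One way to keep credentials out of the notebook itself is to read them from environment variables. This is a sketch under assumptions: the variable names (`REDDIT_CLIENT_ID` and so on) and the helper `load_credentials` are hypothetical and should be adapted to your own setup.

```python
import os

# Hypothetical environment variable names, keyed by praw.Reddit keyword argument.
ENV_KEYS = {
    'client_id': 'REDDIT_CLIENT_ID',
    'client_secret': 'REDDIT_CLIENT_SECRET',
    'password': 'REDDIT_PASSWORD',
    'user_agent': 'REDDIT_USER_AGENT',
    'username': 'REDDIT_USERNAME',
}


def load_credentials(env=os.environ):
    """Build the keyword arguments for praw.Reddit from environment variables."""
    missing = [var for var in ENV_KEYS.values() if not env.get(var)]
    if missing:
        raise ValueError('missing environment variables: ' + ', '.join(sorted(missing)))
    return {arg: env[var] for arg, var in ENV_KEYS.items()}
```

With the variables set, the client could then be created as `reddit = praw.Reddit(**load_credentials())`.
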
## Scraping details
Customize the list of subreddits, the output file, and the number of posts per subreddit below.

In [4]:
subreddits = [
    'domesticviolence',
    'survivorsofabuse', 
    'anxiety',
    'stress',
    'almosthomeless',
    'assistance',
    'food_pantry',
    'homeless',
    'ptsd',
    'relationships'
]
FILE = 'scraped-data-all.txt' # scraped data destination
N = 10000 # number of posts per subreddit

## Run scraping

Depending on which Pushshift shards are active, the requested number of posts may not always be reached.

In [5]:
for sub in subreddits:
    scrape_subreddit(reddit, sub, FILE, N)
    log(f'{sub} done ')

domesticviolence done took 261.1838638782501 seconds
survivorsofabuse done took 86.72371578216553 seconds
anxiety done took 248.17567539215088 seconds
stress done took 115.80576658248901 seconds
almosthomeless done took 67.03820085525513 seconds
assistance done took 790.1212203502655 seconds
food_pantry done took 85.13964891433716 seconds
homeless done took 323.1671495437622 seconds
ptsd done took 316.2838931083679 seconds
relationships done took 1053.020528793335 seconds

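Because the requested number of posts may not be reached, it can help to check how many posts actually ended up in the output file. The sketch below assumes each saved post occupies exactly one line of the file; the helper `count_posts` is hypothetical, and the demonstration uses a temporary file rather than the real scraped data.

```python
import os
import tempfile


def count_posts(filename):
    """Count non-empty lines, i.e. saved posts, in the output file."""
    if not os.path.exists(filename):
        return 0
    with open(filename, encoding='utf-8') as f:
        return sum(1 for line in f if line.strip())


# Self-contained demonstration on a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, 'demo.txt')
    before = count_posts(demo)
    with open(demo, 'a', encoding='utf-8') as f:
        f.write('first post\nsecond post\n')
    added = count_posts(demo) - before
    print(added)  # 2
```

In the scraping loop, calling `count_posts(FILE)` before and after `scrape_subreddit` would give the number of posts collected for that subreddit, which can then be compared with `N`.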