# Working with Reddit

## Lecture objectives
1. Demonstrate how to scrape Reddit data using their API

We reviewed topic modeling in the previous lecture. Here and in the next lecture, we'll focus on another common Natural Language Processing tool: sentiment analysis. In short, sentiment analysis tries to understand whether a snippet of text (e.g. a tweet, a review, or a sentence from an article) is positive, negative, or neutral.

We'll apply sentiment analysis to Reddit data on public transportation, which we'll retrieve using the [PRAW](https://praw.readthedocs.io/en/stable/) library.

If you want to access the Reddit data yourself, you'll need to:

(1) sign up for a Reddit account (free)

(2) create a client id and client secret. It's also free, and takes about 5 minutes. [Follow the second part of these instructions.](https://cs205uiuc.github.io/guidebook/python/reddit-api.html) You can get to the apps tab here: https://www.reddit.com/prefs/apps.

In earlier versions of this course, I used Twitter. However, academic access to the Twitter API is no longer free. For a thorough treatment of obtaining, analyzing, and interpreting Twitter data, check out [*Twitter as Data*](https://www.cambridge.org/core/elements/twitter-as-data/27B3DE20C22E12E162BFB173C5EB2592) by Prof. Zachary Steinert-Threlkeld here in the Luskin School of Public Affairs.

## Using the Reddit API
The `praw` library provides easy access to Reddit. You can enter your credentials here, or just follow along for the time being.

In [None]:
import praw

# enter your own client_id and client_secret
client_id = 'YOUR_ID_HERE'
client_secret = 'YOUR_SECRET_HERE'
# this identifies you, but can be any string
user_agent = 'scraper by u/adammb_ucla'

reddit = praw.Reddit(
    client_id=client_id,
    client_secret=client_secret,
    user_agent=user_agent,
)
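Typing a secret directly into a notebook makes it easy to share by accident. One alternative, sketched here, is to read the credentials from environment variables. The variable names (`REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, `REDDIT_USER_AGENT`) and the helper `load_credentials` are assumptions made for this example; they are not something PRAW requires.

```python
import os

def load_credentials(env=os.environ):
    """Return keyword arguments for praw.Reddit, read from environment variables.

    The variable names are hypothetical; use whatever names you set in your shell.
    """
    required = ('REDDIT_CLIENT_ID', 'REDDIT_CLIENT_SECRET')
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError('Missing environment variables: ' + ', '.join(missing))
    return {
        'client_id': env['REDDIT_CLIENT_ID'],
        'client_secret': env['REDDIT_CLIENT_SECRET'],
        # the user agent can be any descriptive string
        'user_agent': env.get('REDDIT_USER_AGENT', 'scraper by u/YOUR_USERNAME'),
    }

# Usage, once the variables are set in your environment:
# reddit = praw.Reddit(**load_credentials())
```

The dictionary keys match the keyword arguments passed to `praw.Reddit` above, so the result can be unpacked straight into that call.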

Now we have a `reddit` object, which has several methods for retrieving data.

For example, we can get new posts and loop over them. Here we'll get the latest 10 from the Urban Planning subreddit.

In [None]:
for submission in reddit.subreddit('urbanplanning').new(limit=10):
    print(submission.title)

The `submission` object provides access to the more detailed post information.

In [None]:
submission?

Let's look at the comments on the last submission. (After the loop finishes, `submission` still refers to the last post it printed.)

In [None]:
submission.num_comments

In [None]:
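# Remove "load more comments" placeholders, which have no body attribute
submission.comments.replace_more(limit=0)
for c in submission.comments:
    print(c.body)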

Each comment `c` also has further attributes. We used `body` to get the text of the comment, but there are also timestamps, author details, and so on.

In [None]:
c?
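To make those attributes easier to work with, we could copy a few of them into a plain dictionary. This is a minimal sketch: `comment_to_record` is a hypothetical helper, and it assumes the comment has the `body`, `author`, `score`, and `created_utc` attributes that PRAW comments provide.

```python
from datetime import datetime, timezone

def comment_to_record(comment):
    """Collect a few fields from a comment into a plain dictionary."""
    # author is None when the account has been deleted
    author = comment.author.name if comment.author is not None else None
    return {
        'body': comment.body,
        'author': author,
        'score': comment.score,
        # created_utc is a Unix timestamp (seconds since 1970, in UTC)
        'created': datetime.fromtimestamp(comment.created_utc, tz=timezone.utc),
    }

# Usage with the last comment from the loop above:
# comment_to_record(c)
```

A list of these dictionaries can be passed to `pd.DataFrame()` if you would rather work with a table.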

Let's get the titles and top-level comments from the latest 50 posts in each of three transit subreddits: LA Metro, BART, and NYC rail.

We'll define a function that takes a subreddit name and returns the post titles and comments in a single list.

In [None]:
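def get_reddit(subreddit_name):
    """Return the titles and top-level comments of the latest 50 posts."""
    r_list = []
    for submission in reddit.subreddit(subreddit_name).new(limit=50):
        r_list.append(submission.title)
        # Remove "load more comments" placeholders, which have no body attribute
        submission.comments.replace_more(limit=0)
        r_list += [c.body for c in submission.comments]

    print('Retrieved {} titles and comments for {}'.format(len(r_list), subreddit_name))
    return r_list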

la_metro = get_reddit('LAMetro')

Let's do the same for the other two agencies.

In [None]:
bart = get_reddit('Bart')
nyc_rail = get_reddit('Nycrail')

Now let's save these comments to a file. We'll use a pickle, which, as we've seen in earlier modules, can save most Python objects in their original format. (We could also have looped over the list of comments and saved them as text.)

In [None]:
import pickle
with open('../data/reddit/la_metro.pickle', 'wb') as f:
    pickle.dump(la_metro, f)
with open('../data/reddit/bart.pickle', 'wb') as f:
    pickle.dump(bart, f)
with open('../data/reddit/nyc_rail.pickle', 'wb') as f:
    pickle.dump(nyc_rail, f)
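Here is one way the text-based alternative might look, along with reading a pickle back in. Because a comment can contain line breaks, this sketch uses JSON rather than one comment per line. The function names and file paths are hypothetical.

```python
import json
import pickle

def save_texts_json(texts, path):
    """Write a list of strings to a human-readable JSON file."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump(texts, f, ensure_ascii=False, indent=1)

def load_texts_json(path):
    """Read the list of strings back from a JSON file."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)

def load_pickle(path):
    """Read a pickled object back into memory."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# Usage (paths are examples only):
# save_texts_json(la_metro, '../data/reddit/la_metro.json')
# la_metro = load_pickle('../data/reddit/la_metro.pickle')
```

JSON files can be opened in any text editor and read from other languages. Pickles are more convenient for arbitrary Python objects, but only load pickles from sources you trust, because unpickling can run arbitrary code.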

We'll pick up these data in the next lecture and see how to analyze the sentiment of the Reddit posts.

But we have only scratched the surface of the PRAW library. Explore the documentation for examples of how to filter your results (e.g. you could search for all posts within a subreddit that mention "rent" or "eviction"), access the number of upvotes, and more.
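As a starting point, the search idea might be sketched like this. It relies on the `search` method of a PRAW subreddit and the `score` attribute of a submission; the helper name `search_subreddit` is hypothetical, and the `'rent OR eviction'` query string assumes that Reddit's search accepts `OR` between terms, so check the results before relying on it.

```python
def search_subreddit(reddit, subreddit_name, query, limit=25):
    """Return (score, title) pairs for posts in a subreddit that match a query."""
    results = []
    matches = reddit.subreddit(subreddit_name).search(query, sort='new', limit=limit)
    for submission in matches:
        # score is the post's net votes as reported by Reddit
        results.append((submission.score, submission.title))
    # Highest-scoring posts first
    return sorted(results, reverse=True)

# Usage with the reddit object created earlier:
# search_subreddit(reddit, 'urbanplanning', 'rent OR eviction', limit=10)
```

Passing `reddit` as an argument, rather than relying on the global variable, makes the function easier to reuse in another notebook.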

<div class="alert alert-block alert-info">
<h3>Key Takeaways</h3>
<ul>
  <li>Reddit has a powerful API that is relatively easy to use.</li>
  <li>Reddit is not representative. Whether that matters depends on your particular project and use case.</li>
</ul>
</div>