Interestingly, as long as we're only *getting* information from Reddit, and not posting information to it, the Reddit API doesn't require authentication. So we don't have to find and store any keys or tokens. Let's go ahead and import PRAW, the Python Reddit API Wrapper.

In [1]:
import praw

While we don't have to authenticate, it is considered basic courtesy to at least identify yourself in some fashion when using Reddit's API. When we build our API object, we pass it a parameter called user_agent, which is simply a string identifying your bot (it will show up with this name in Reddit access logs). It's best practice to put your email address or your reddit username in the user_agent string so you can be contacted if necessary. 

We call ``praw.Reddit`` (a class, so calling it builds a new object) and store the result in a variable. That variable is our API access to Reddit.

In [2]:
r = praw.Reddit(user_agent="IU Social Media Mining by vmalic@indiana.edu")

# Getting a User and Their Submissions

We get a reddit user by calling the method ``get_redditor`` on the API and supplying a user name. I'm going to use a username that you can find in the PRAW documentation. 

In [3]:
user = r.get_redditor("MattDamon_")

Each reddit user (or redditor) has a small amount of information associated with them. Reddit is more content-centered, in contrast to Facebook, which is more user-centered. Therefore, there actually isn't a lot of info directly associated with a given user. In fact, all you can really find out about a user is their name (which you already know, if you're using the function ``get_redditor``).

In [4]:
print(user.name) # The user name

MattDamon_


Instead, if you're mining Reddit, you'll be more interested in the content associated with the user: the submissions they've made, the comments they've made, the items they've upvoted or downvoted, etc. 

Once we have a user, we can use a set of functions to get *that set of content associated with the user*.

For example, ``get_upvoted`` gets all the submissions this user has upvoted. ``get_submitted`` gets all the submissions this user has made. However! Note that when you call these methods, you get a strange object called a ``generator``. You *can't* look inside a generator directly the way you can with a list. The usual way to access the items inside one is with a ``for`` loop.

Generators may strike you as odd at first, but you'll run into generators frequently as they're often a more efficient way to code things. If a normal list has 1000 items in it, all 1000 items take up space in memory at once. With a generator, you only have to summon one object at a time, which eases memory use.

In most practical situations, a generator is all you need: for each item you encounter in the sequence, you can process it somehow and store only the information you need in another variable, like a list you've initialized as empty.
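
Here is a minimal sketch of that pattern in plain Python, with no Reddit access involved. The function ``fake_submissions`` and its dictionary items are hypothetical stand-ins for a PRAW listing, made up purely for illustration.

```python
def fake_submissions(n):
    """Hypothetical stand-in for a PRAW listing: yields one item at a time."""
    for i in range(n):
        # Each item is built only when the loop asks for it.
        yield {"title": "post %d" % i, "score": i * 10}

listing = fake_submissions(1000)

# Process each item as it arrives and keep only the piece we care about.
scores = []
for item in listing:
    scores.append(item["score"])

print(len(scores), scores[:3])

# A generator is used up after one pass: a second pass finds nothing left.
leftover = list(listing)
print(leftover)
```

The last two lines show a common surprise: once you have looped through a generator, it is empty, so you need to request a fresh one if you want to go through the items again.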

One problem you may have with generators is that if you're iterating through a generator of objects you may not be sure what methods or attributes are available to you. There are two ways to fix this situation. First, go to the API documentation: it'll tell you what's available. Second, you can call a generator with a small number of items and apply the ``list`` function to it to force-convert the generator to a list.

Note that all the ``get_`` methods in PRAW take an argument called ``limit``, which indicates how many results you want. The default value is 25. 

In [5]:
submissions = user.get_submitted(limit=5)

print(type(submissions))

#Convert the generator into a list

submissions = list(submissions)
print(type(submissions))

<class 'generator'>
<class 'list'>


In [6]:
#Now we can inspect the items

submission0 = submissions[0]
print(dir(submission0))

['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__unicode__', '__weakref__', '_api_link', '_comment_sort', '_comments', '_comments_by_id', '_extract_more_comments', '_get_json_dict', '_has_fetched', '_info_url', '_insert_comment', '_methods', '_orphaned', '_params', '_populate', '_post_populate', '_replaced_more', '_underscore_names', '_uniq', '_update_comments', 'add_comment', 'approve', 'approved_by', 'archived', 'author', 'author_flair_css_class', 'author_flair_text', 'banned_by', 'clear_vote', 'clicked', 'comments', 'created', 'created_utc', 'delete', 'distinguish', 'distinguished', 'domain', 'downs', 'downvote', 'edit', 'edited', 'from_api_response', 'from_id', 'from_json', 'from_url', 'fullna
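
The output of ``dir`` is long because it includes every private and double-underscore name. One way to make it easier to read is to keep only the names that don't start with an underscore. The sketch below uses a hypothetical ``FakeSubmission`` class as a stand-in, so the attribute names it prints are illustrative and are not the real attributes of a PRAW submission.

```python
class FakeSubmission:
    """Hypothetical stand-in for an object pulled out of a listing."""

    def __init__(self):
        self.title = "An example title"
        self.score = 42
        self._has_fetched = True  # a private name, filtered out below

    def upvote(self):
        pass


item = FakeSubmission()

# Keep only the names that do not start with an underscore.
public_names = [name for name in dir(item) if not name.startswith("_")]
print(public_names)
```

The same list comprehension should work on any Python object, including ``submission0`` above.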

Once you're familiar with an object's methods and attributes, from that point forward you don't need to force the generator into a list. You can just use the generator with a for loop and process the objects inside the loop. For example, now I've learned all submissions have an attribute ``score`` that tells me what the score of the submission is (the number of upvotes minus downvotes). Now I'll iterate through this user's submissions using a generator and collect all the scores. 

In [7]:
scores = []

for submission in user.get_submitted(limit=50):
    scores.append(submission.score)

In [8]:
print(scores)

[5214, 3503]


Let's instead look at this user's comments and get the score for each comment.

In [9]:
user_comments_with_scores = []

for comment in user.get_comments(limit=50):
    user_comments_with_scores.append((comment.body, comment.score))
    

``user_comments_with_scores`` is a list of tuples. In each tuple, the first item is the text of the comment, the second item is the score.

Here's the first tuple:

In [18]:
user_comments_with_scores[0]

("I think the answer to that is clear and I'll let you fill in the blanks.",
 1918)

We've got the text, and we know its total score, so it's possible to get a bunch of comments and also a numeric value representing how well received each comment was.
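
Once the data is in this shape, ranking comments by how well they were received takes one call to ``sorted``. This is a small sketch on made-up data: ``sample_comments`` is a hypothetical list in the same (text, score) format, not real output from Reddit.

```python
# Made-up (text, score) pairs in the same shape as user_comments_with_scores.
sample_comments = [
    ("first comment", 12),
    ("second comment", 1918),
    ("third comment", -4),
]

# Sort by the second item of each tuple (the score), highest first.
by_score = sorted(sample_comments, key=lambda pair: pair[1], reverse=True)

best_text, best_score = by_score[0]
worst_text, worst_score = by_score[-1]
print(best_text, best_score)
print(worst_text, worst_score)
```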

# Pulling Content from Subreddits


What really makes Reddit a unique target for Social Media Mining is the concept of *subreddits*. A subreddit, as the name implies, is a section of Reddit that is dedicated to a certain topic or theme. Any user can make any subreddit about almost anything, so long as the topic does not violate Reddit's terms of service. 

For those of us using data mining algorithms, one of the major advantages of the idea of subreddits is that posts and comments on Reddit are *already* divided into useful categories. For example, the subreddit ``The_Donald`` is a subreddit for supporters of Trump, and the subreddit ``hillaryclinton`` is for supporters of Clinton. If you're interested in how supporters of different parties use language differently, you could pull thousands of comments from these subreddits and have a set of data ready for machine learning, with some carrying the label ``trump_supporter`` and the others carrying the label ``clinton_supporter``.
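
To make the labeling idea concrete, here is a rough sketch of how two piles of comments could be turned into a labeled data set. The comment texts are placeholders and the variable names are hypothetical. In practice the two lists would be filled with comment bodies pulled from each subreddit.

```python
# Placeholder texts standing in for comments pulled from each subreddit.
trump_comments = ["placeholder comment A", "placeholder comment B"]
clinton_comments = ["placeholder comment C"]

# Pair every text with the label of the subreddit it came from.
labeled = [(text, "trump_supporter") for text in trump_comments]
labeled += [(text, "clinton_supporter") for text in clinton_comments]

# Split into two parallel lists: one of texts, one of labels.
texts = [text for text, label in labeled]
labels = [label for text, label in labeled]
print(labels)
```

Keeping the texts and labels in parallel lists is only one possible layout, chosen here because many machine learning libraries accept their inputs and targets as two separate sequences.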

Furthermore, the data on Reddit becomes more interesting when examined in conjunction with Reddit's voting system. Any submission or comment can be given a positive vote (an upvote) or a negative vote (a downvote). Presumably, that which gets upvoted at ``The_Donald`` would be quite different from things that get upvoted at ``hillaryclinton``. 

Let's learn how to pull content from subreddits. At first, I was using the Trump and Clinton subreddits as examples, but that got ugly fast - there was a lot of offensive material. So instead, I'll just use the subreddits for each of the parties. 

We can pull a subreddit as an object into our Python environment using the method ``get_subreddit``. Remember, we saved our Reddit API access in the variable ``r``.

In [11]:
subreddit_republican = r.get_subreddit("Republican")
subreddit_democrat = r.get_subreddit("democrats")

Let's get the most recent comments from each of these subreddits, using the function ``get_comments``. Remember, these return generators so we have to iterate through them with a for loop.

In [12]:
comments_republican = []
comments_democrat = []

for c in subreddit_republican.get_comments(limit=10):
    comments_republican.append((c.body, c.score))
    
for c in subreddit_democrat.get_comments(limit=10):
    comments_democrat.append((c.body, c.score))

Here are the first comments in the Republican and Democrat subreddits:

In [13]:
print(comments_republican[0])
print("*"*50)
print(comments_democrat[0])

("I don't follow that logic flow.\n\nWat?", 1)
**************************************************
("No. She has pneumonia. It's not a death sentence. She will recover and be fine. I just wish her campaign didn't hide stuff... AGAIN. I really don't know what this campaign is thinking. It's getting scary.", 1)


What if we want to see what comments are actually popular in each forum? Note that the ``get_comments`` function retrieves the newest comments, so most of the scores for these comments are 1, the default value for a newly created comment. 

We'll have to think differently about how to get comments that have been around long enough to get a big score (either positive or negative). 

Subreddits are composed of **submissions**, which in turn can have **comments**: people discussing the submission. Both submissions and comments can be voted on.

Why don't we get the highest-rated submissions from the last week from each subreddit, and *then* get comments from those submissions? If something is highly rated, it's visible, and a more visible submission is more likely to generate a more lively discussion.

A subreddit object has a method called ``get_top_from_week`` that will return the highest-rated submissions from the last week. As you can imagine, there are also methods like ``get_top_from_hour`` and ``get_top_from_month``.

Let's get the highest rated submissions from each subreddit for the last week. 

In [14]:
r_top_submissions = []
d_top_submissions = []

for s in subreddit_republican.get_top_from_week(limit=10):
    r_top_submissions.append((s.title, s.score)) # Getting submission title and score
    
for s in subreddit_democrat.get_top_from_week(limit=10):
    d_top_submissions.append((s.title, s.score))

In [15]:
print(r_top_submissions[0])
print(d_top_submissions[0])

('Video: Hillary Clinton having Medical Episode in New York Today', 70)
('Obama Just Blasted NBC\'s Matt Lauer For Letting Trump Lie Through His Teeth | "I think the most important thing for the public and the press is to just listen what he says and follow up, and ask questions about what appear to be either contradictory or uninformed or outright bad ideas."', 158)


Now, instead of getting the submissions, let's get their comments. 

In [16]:
r_comments = []
d_comments = []

for s in subreddit_republican.get_top_from_week(limit=10):
    for c in s.comments:
        r_comments.append((c.body, c.score))
    
for s in subreddit_democrat.get_top_from_week(limit=10):
    for c in s.comments:
        d_comments.append((c.body, c.score))

We now have comments from the top submissions in each of the subreddits. Because those submissions have been around long enough to attract votes, many of these comments have scores other than 1.

In [17]:
print(r_comments[0])
print("*"*50)
print(d_comments[0])

('Holy fuck! Look at her leaning against that post, then severely stagger a few steps towards the van being held up by her medical handler, as the Secret Service try to block the view.\n\n\nEDIT #1: The MSM is saying that she "overheated" early this morning at a 9/11 function in NYC. FYI, the weather record states it was around 75°F. If she can\'t handle 75°F, she better stay out of Phoenix, it\'s 93°F right now at 10:00AM.', 21)
**************************************************
("We're going to do something and it's going to be great is NOT an answer to ANY question.", 3)
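
With scored comments from both subreddits in hand, a natural next step is to summarize each group so the two can be compared. The sketch below is one way to do that. The data is made up, and ``summarize`` is a hypothetical helper rather than part of PRAW. The sample lists only mimic the (text, score) shape of ``r_comments`` and ``d_comments``.

```python
from statistics import mean

# Made-up (text, score) pairs in the same shape as r_comments and d_comments.
sample_r = [("comment one", 21), ("comment two", 3), ("comment three", -2)]
sample_d = [("comment four", 3), ("comment five", 15)]


def summarize(comments):
    """Return the mean score and the single highest-scoring comment."""
    average = mean(score for text, score in comments)
    best = max(comments, key=lambda pair: pair[1])
    return average, best


r_mean, r_best = summarize(sample_r)
d_mean, d_best = summarize(sample_d)
print(r_mean, r_best)
print(d_mean, d_best)
```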
