<h1>Getting headlines with Reddit</h1>
Reddit is a rich source of text data, and this notebook shows how to collect post headlines from it with Python.

Simply put, Reddit is a message board where users submit links. What differentiates it from a real-time information network like Twitter is that the stream of content is curated by the community. Items of value are "upvoted," and those deemed unworthy are "downvoted." This determines a post's position on the site, and items that hit the front page are seen by hundreds of thousands of people, which in turn sends a large amount of traffic to the linked website.

<h3>Installing the PRAW library</h3>
PRAW (the Python Reddit API Wrapper) wraps the Reddit API and makes extracting information much easier.
The first step is to install the library: <b>you need to run this cell only once!</b>

In [5]:
!pip install praw



In [6]:
import praw
import json

# Replace the placeholders with the credentials of your own Reddit app.
# Invalid or missing credentials make the first API request fail with a 401 response.
reddit = praw.Reddit(client_id='<your_client_id>',
                     client_secret='<your_client_secret>',
                     user_agent='<your_user_name>')
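
Hard-coding credentials in a notebook makes them easy to leak. As a minimal sketch, assuming the three values are stored in environment variables, the keyword arguments for `praw.Reddit` could be collected as follows. The function name and the variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, and `REDDIT_USER_AGENT` are hypothetical choices for this illustration, not something PRAW requires.

```python
import os


def load_reddit_credentials(env=None):
    """Collect the keyword arguments for praw.Reddit from environment variables."""
    if env is None:
        env = os.environ
    # Map each praw.Reddit keyword to a (hypothetical) environment variable name.
    required = {
        'client_id': 'REDDIT_CLIENT_ID',
        'client_secret': 'REDDIT_CLIENT_SECRET',
        'user_agent': 'REDDIT_USER_AGENT',
    }
    missing = [var for var in required.values() if not env.get(var)]
    if missing:
        raise KeyError('Missing environment variables: ' + ', '.join(missing))
    return {key: env[var] for key, var in required.items()}


# Usage in the notebook (not executed here):
# reddit = praw.Reddit(**load_reddit_credentials())
```

Failing early with the list of missing variables is usually easier to debug than a 401 response on the first request.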

<h3>Upvoting and Downvoting entries</h3>
When you're logged in to Reddit, you'll be able to upvote and downvote items to help determine their rank. You get one vote per item, but you can change it after it's logged.

The number appearing between the up and down arrows is the submission's score: the number of upvotes minus the number of downvotes. According to Reddit's FAQ, these numbers are "fuzzed" to prevent spam and abuse.
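
To make the arithmetic concrete, here is a small sketch that estimates the separate upvote and downvote counts from a score. It assumes that the upvote ratio (read later in this notebook as `sub.upvote_ratio`) equals upvotes divided by total votes; the function name `estimate_votes` is hypothetical. Because Reddit fuzzes the numbers, the result is only an approximation.

```python
def estimate_votes(score, upvote_ratio):
    """Estimate (upvotes, downvotes) from a score and an upvote ratio.

    Assumes score = ups - downs and upvote_ratio = ups / (ups + downs).
    Returns None when the ratio is exactly 0.5, because the total number
    of votes cannot be determined in that case.
    """
    if upvote_ratio == 0.5:
        return None
    total = score / (2 * upvote_ratio - 1)
    ups = round(total * upvote_ratio)
    downs = round(total * (1 - upvote_ratio))
    return ups, downs


# Hypothetical post: 90 upvotes and 10 downvotes give a score of 80 and a ratio of 0.9.
print(estimate_votes(80, 0.9))  # (90, 10)
```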

"On average, the difference in votes is accurate, but the fuzzing is — well, fuzzy," says Erik Martin, Reddit's general manager. "At any given moment, the difference may fluctuate very slightly, but over time the average difference is accurate."

You may also notice that posts with the highest score do not always rank at the top. This is due to Reddit's time decay algorithm. Posts on the front page are more visible and therefore have a higher chance of being upvoted, but the site wouldn't be valuable if the same content remained on the front page all day.

"The decay means that a 12-hour-old post must have 10 times as many points as a brand new post to appear at similar ranks," explains Martin. "This also means any given story has roughly a 24-hour max lifespan on any user's front page." This allows newer content to surface at the top of the heap.

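The quoted rule can be illustrated with a toy model: if the ranking value grows with the base-10 logarithm of the points and drops by one unit for every 12 hours of age, then a tenfold increase in points exactly offsets 12 hours. This is a sketch built only from the quote above, not Reddit's actual ranking code, and `toy_rank` is a hypothetical name.

```python
import math


def toy_rank(points, age_hours):
    """Toy ranking value: log10 of the points minus one unit per 12 hours of age."""
    return math.log10(max(points, 1)) - age_hours / 12


new_post = toy_rank(points=50, age_hours=0)
old_post = toy_rank(points=500, age_hours=12)

# A 12-hour-old post with 10 times the points ranks about the same as a new one.
print(math.isclose(new_post, old_post))  # True
```
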
In [7]:
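def toJson(sub):
    """Convert a PRAW submission into a plain dict of the fields we need."""
    score = sub.score          # number of upvotes minus number of downvotes
    upvote = sub.upvote_ratio  # share of all votes that are upvotes

    # Estimate the downvotes: total votes = score / (2 * ratio - 1),
    # downvotes = total * (1 - ratio). Undefined when the ratio is exactly 0.5.
    if upvote != 0.5:
        downvote = round(score * (1 - upvote) / (2 * upvote - 1))
    else:
        downvote = None

    return {'sid': sub.id,
            'title': sub.title,
            'created': sub.created_utc,
            'upvote': upvote,
            'downvote': downvote,
            'num_comments': sub.num_comments,
            'score': score}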

def getHeadlines(category, maxh):
    headlines = reddit.subreddit(category).new(limit=maxh)
    result = list()
    for headline in headlines:
        result.append(toJson(headline))
    return result

In [8]:
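maxH = 50  # maximum number of headlines to fetch
# 'Sports+Art' combines the two subreddits into a single listing
headlines = getHeadlines('Sports+Art', maxH)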

ResponseException: received 401 HTTP response
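
The category string 'Sports+Art' used above joins two subreddit names with a plus sign, which PRAW accepts as a combined subreddit. A minimal sketch of building such a string from a list of names follows; the helper `combine_subreddits` is hypothetical and only does string handling, so it needs no Reddit connection.

```python
def combine_subreddits(names):
    """Join subreddit names with '+', dropping blanks and any leading 'r/'."""
    cleaned = []
    for name in names:
        name = name.strip()
        if name.lower().startswith('r/'):
            name = name[2:]
        if name:
            cleaned.append(name)
    if not cleaned:
        raise ValueError('At least one subreddit name is required')
    return '+'.join(cleaned)


print(combine_subreddits(['Sports', ' r/Art ']))  # Sports+Art
```

The result can be passed as the `category` argument of `getHeadlines`.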

In [9]:
import pandas as pd
import datetime as dt

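def get_date(created):
    # created_utc is a Unix timestamp in UTC, so build a timezone-aware UTC datetime
    return dt.datetime.fromtimestamp(created, tz=dt.timezone.utc)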

In [None]:
hdata = pd.DataFrame(headlines)

_timestamp = hdata["created"].apply(get_date)
hdata = hdata.assign(timestamp = _timestamp)

hdata

<h3>Headline Channels</h3>

The categories used above are subreddit names. The following pages list available subreddits:

<ul>
  <li><a href="http://redditlist.com/" target=_blank>http://redditlist.com/</a></li>
  <li><a href="https://www.reddit.com/r/ListOfSubreddits/wiki/listofsubreddits" target=_blank>https://www.reddit.com/r/ListOfSubreddits/wiki/listofsubreddits</a></li>
</ul>