## Table of contents
* [Authentication](# authentication)
* [Extraction](#extraction)

### 1.Authentication  

PRAW is shortform for **P**ython **R**eddit **A**PI **W**rapper  

**Prerequisites**   
*Python Knowledge*:  
 	You need to know at least a little Python to use PRAW; it’s a Python wrapper after all. PRAW supports Python 2.7, and Python 3.3 to 3.6.
    
*Reddit Knowledge: *   
 	A basic understanding of how reddit.com works is a must. In the event you are not already familiar with Reddit start with their FAQ.  
    
*Reddit Account:*  
A Reddit account is required to access Reddit’s API. Create one at reddit.com. 

*Client ID & Client Secret:  *  
 	These two values are needed to access Reddit’s API as a script application (see Authenticating via OAuth for other application types). If you don’t already have a client ID and client secret, follow Reddit’s First Steps Guide to create them.  
    
*User Agent:*  
A user agent is a unique identifier that helps Reddit determine the source of network requests. To use Reddit’s API, you need a unique and descriptive user agent.
    
[Source](https://praw.readthedocs.io/en/latest/getting_started/quick_start.html)


After creating the credentials (client_id, client_secret, username, password), the next step is to login into reddit API using praw package

In [1]:
import praw
reddit = praw.Reddit(user_agent='Comment Extraction (by /u/USERNAME)',
                     client_id='********', client_secret="***********",
                     username='********', password='**********')

Here, we are creating an authorized reddit instance and not only the read-only instance as authorized instance has less restrictions in terms of data retrieval. The credentials are masked due to privacy and security reasons but you can find the guide to create them in a short time [here]().

### Data extraction
Here in this section, we send the requests to the reddit api and get the headers and body of the comments in 2 separate lists

But we need all the comments including the nested ones. Below code contains another while loop that goes into the comments section and gets the replies as well for those comments.  
All in all, it first goes into the subreddit and then into the comment and gets all the replies and iterates the same process, till it reaches the end of the line in the sub reddit

In [8]:
comm_list = []
header_list = []
i = 0
for submission in reddit.subreddit('cordcutters').hot(limit=2):
    submission.comments.replace_more(limit=None)
    comment_queue = submission.comments[:]  # Seed with top-level
    while comment_queue:
        header_list.append(submission.title)
        comment = comment_queue.pop(0)
        comm_list.append(comment.body)
        t = []
        t.extend(comment.replies)
        while t:
            header_list.append(submission.title)
            reply = t.pop(0)
            comm_list.append(reply.body)

In [2]:
comm_list = []
header_list = []
i = 0
for submission in reddit.subreddit('cordcutters').hot(limit=2):
    submission.comments.replace_more(limit=None)
    comment_queue = submission.comments[:]  # Seed with top-level

In [3]:
comment_queue

[Comment(id='ed5ssfg'),
 Comment(id='ed64a72'),
 Comment(id='edth3nc'),
 Comment(id='ed680cg'),
 Comment(id='ed699q2'),
 Comment(id='ed80ce8'),
 Comment(id='edau9st'),
 Comment(id='edcx477'),
 Comment(id='ee0fp3g'),
 Comment(id='ed5qrvh')]

In [4]:
comm_list = []
header_list = []
i = 0
for submission in reddit.subreddit('cordcutters').hot(limit=2):
    submission.comments.replace_more(limit=None)
    comment_queue = submission.comments[:]  # Seed with top-level
    while comment_queue:
        header_list.append(submission.title)
        comment = comment_queue.pop(0)
        comm_list.append(comment.body)
        t = []
        t.extend(comment.replies)

In [5]:
while comment_queue:
    header_list.append(submission.title)
    comment = comment_queue.pop(0)
    comm_list.append(comment.body)
    t = []
    t.extend(comment.replies)
    while t:
        header_list.append(submission.title)
        reply = t.pop(0)
        comm_list.append(reply.body)

In [7]:
comm_list

['Any plans to expand beyond your Chicago headquarters?',
 'Everyone, thank you very much for your time and interest - and thanks, sincerely, for watching.  Any ideas / questions / criticisms, please email us at [info@watchstadium.com](mailto:info@watchstadium.com)  \\- have a great day, everyone! \n\nAll the best,\n\nJason',
 'As a college football fan, I really enjoy your addition of Brett McMurphy to Stadium.  Is there anything in the works for an insider type show, Podcast, or other with Brett?',
 'I tried to bring your streaming channel into my hotel groups but was told there would be a large carrier fee to do so. Why is that when your content is free to access? ',
 "Previously received question from u/ASM360\n\nGood afternoon, I was wondering why Stadium is delivered in HD for Pluto Tv, via the Stadium app, Xumo, and even the Roku Channel app; however via OTA it's 480 and un-formatted (so small that it does not even fit the screen)?",
 'Any plans to join up with YouTubeTV? I real

Here, a dataframe has been created by concatenating the headers and comments from the sub reddit

In [9]:
import pandas as pd
df = pd.DataFrame(header_list)
df['comm_list'] = comm_list
df.head()

Unnamed: 0,0,comm_list
0,"I'm Jason Coyle, CEO of Stadium, the first 24/...",Any plans to expand beyond your Chicago headqu...
1,"I'm Jason Coyle, CEO of Stadium, the first 24/...","We have a sales office in NYC and Det., and we..."
2,"I'm Jason Coyle, CEO of Stadium, the first 24/...","Everyone, thank you very much for your time an..."
3,"I'm Jason Coyle, CEO of Stadium, the first 24/...",Thank you for your time!
4,"I'm Jason Coyle, CEO of Stadium, the first 24/...","As a college football fan, I really enjoy your..."


From the raw data, we can see that there are \n tags at the end of every sentence. After some cleaning, we finally export the data to a csv file

In [10]:
df.columns = ['header','comments']
df['comments'] = df['comments'].apply(lambda x : x.replace('\n',''))
# df.to_csv('cordcutter_comments1.csv',index = False)

In [None]:
df = pd.DataFrame(header_list)
df['comm_list'] = comm_list
df.columns = ['header','comments']
df['comments'] = df['comments'].apply(lambda x : x.replace('\n',''))
df.to_csv('cordcutter_comments.csv',index = False)

In [11]:
df.head()

Unnamed: 0,header,comments
0,"I'm Jason Coyle, CEO of Stadium, the first 24/...",Any plans to expand beyond your Chicago headqu...
1,"I'm Jason Coyle, CEO of Stadium, the first 24/...","We have a sales office in NYC and Det., and we..."
2,"I'm Jason Coyle, CEO of Stadium, the first 24/...","Everyone, thank you very much for your time an..."
3,"I'm Jason Coyle, CEO of Stadium, the first 24/...",Thank you for your time!
4,"I'm Jason Coyle, CEO of Stadium, the first 24/...","As a college football fan, I really enjoy your..."


In [33]:
df.loc[673,'comments']

"If you get downto -96 power you'll still get the same sensitivity.  So down to KXLY (probably).  You'll want to raise the antenna as high as possible, so it can see over those edges."