Reddit Data Collection:
===
This part answers the following questions:
* How can we collect data from Reddit?
* Which features should we extract from a subreddit post? (This will shape our later decisions when building a Reddit Flair Detector and deploying it as a web service.)
* What should we consider when choosing an API to scrape Reddit?



## 1. PRAW API
PRAW, an acronym for "Python Reddit API Wrapper", is a Python package that allows simple access to Reddit's API. An API is a gateway through which a user can access and modify a website's data.
PRAW has some features that make it easy for beginners to use:
  * a simple interface
  * little to no prior skill required

However, it has some limitations:
  * you can fetch only up to 1000 submissions (posts) from Reddit
  * PRAW has a [lazy object](https://praw.readthedocs.org/en/stable/pages/lazy-loading.html) model, so it makes no more requests than it needs to. When fetching comments, this means it makes a new request for each username, which can take hours.
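
To make the idea of lazy loading concrete, here is a toy sketch. It is not PRAW's actual implementation; the `LazyPost` class and `fake_fetch` function are hypothetical names used only for illustration. The object performs its (simulated) request only when an attribute is first read, and then reuses the cached result.

```python
class LazyPost:
    """Toy stand-in for a lazily loaded object (not PRAW's real class)."""

    def __init__(self, post_id, fetch):
        self.post_id = post_id
        self._fetch = fetch   # callable that performs the (slow) request
        self._data = None     # nothing has been loaded yet

    def __getattr__(self, name):
        # Only called for attributes not set on the instance, e.g. `title`.
        # The first such access triggers the fetch; later ones use the cache.
        if self._data is None:
            self._data = self._fetch(self.post_id)
        try:
            return self._data[name]
        except KeyError:
            raise AttributeError(name)


request_log = []

def fake_fetch(post_id):
    """Simulated network call that records every request it receives."""
    request_log.append(post_id)
    return {"title": f"Post {post_id}", "score": 1}


post = LazyPost("abc123", fake_fetch)   # creating the object sends no request
requests_before = len(request_log)
title = post.title                      # first attribute access: one request
score = post.score                      # served from the cached data
requests_after = len(request_log)
```

The same pattern explains why a loop that touches many lazily loaded objects can end up sending one request per object.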

Beyond that, PRAW requires the user to create a Reddit account, sign up as a developer, create an application, and then provide credentials to obtain data.

Enough talking; let's get started on how to use it.

### How to use:
The very first thing you'll need to do is "Create an App" within Reddit to get the OAuth2 keys to access the API. 

Go to [this page](https://www.reddit.com/prefs/apps) and click the "create app" or "create another app" button at the bottom left.

<img src="https://miro.medium.com/max/1400/1*GQ8IREDENnkCRQT3VS55mQ.png">

This will open a form where you need to fill in a name, description, and redirect URI. For the redirect URI, choose `http://localhost:8080`, as described in the excellent [PRAW documentation](https://praw.readthedocs.io/en/latest/getting_started/authentication.html#script-application).

<img src="https://miro.medium.com/max/1400/1*ssLYczSLGzfm6SPM7mWzBg.png">

After pressing "create app", a new application will appear. Here you can find the authentication information needed to create the `praw.Reddit` instance.

<img src="https://miro.medium.com/max/1400/1*khszOCCaCtqZ6jM19uhpiQ.png">

Now, you can use this information to authenticate yourself.

In [0]:
# PRAW can be installed using python package installer
!pip install praw

Collecting praw
  Downloading https://files.pythonhosted.org/packages/5c/39/17251486951815d4514e4a3f179d4f3e7af5f7b1ce8eaba5a3ea61bc91f2/praw-7.0.0-py3-none-any.whl (143kB)

In [0]:
# import library
import praw

# Before it can be used to scrape data, we need to authenticate ourselves.
# Use the information from the section above to fill in these values.
# We create a Reddit instance and provide it with client_id,
# client_secret, user_agent, username and password.
reddit = praw.Reddit(client_id='PERSONAL_USE_SCRIPT_14_CHARS',
                     client_secret='SECRET_KEY_27_CHARS',
                     user_agent='YOUR_APP_NAME',
                     username='YOUR_REDDIT_USER_NAME',
                     password='YOUR_REDDIT_LOGIN_PASSWORD')

# get posts from the india subreddit
india_subreddit = reddit.subreddit('india')
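
Hard-coding credentials in a notebook makes them easy to leak when the notebook is shared. One possible alternative, sketched below, is to read them from environment variables. The variable names (`REDDIT_CLIENT_ID` and so on) and the helper `load_reddit_credentials` are assumptions made for this example, not something PRAW defines.

```python
import os

# Keyword arguments that the praw.Reddit call above expects.
REQUIRED_KEYS = ("client_id", "client_secret", "user_agent", "username", "password")


def load_reddit_credentials(environ=os.environ, prefix="REDDIT_"):
    """Collect praw.Reddit keyword arguments from environment variables.

    The variable names (prefix + upper-cased key) are a convention assumed
    for this sketch.
    """
    creds = {}
    missing = []
    for key in REQUIRED_KEYS:
        name = prefix + key.upper()
        value = environ.get(name)
        if value:
            creds[key] = value
        else:
            missing.append(name)
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return creds


# Demonstration with a fake environment so the sketch runs without secrets.
fake_env = {"REDDIT_" + key.upper(): "value-for-" + key for key in REQUIRED_KEYS}
creds = load_reddit_credentials(fake_env)

# With real variables set, the instance could then be created like this:
# reddit = praw.Reddit(**load_reddit_credentials())
```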

If you visit [r/india](https://www.reddit.com/r/india/), you will see that posts are tagged with flairs such as: AskIndia, Non-Political, Scheduled, Photography, Science/Technology, Politics, Business/Finance, Policy/Economy, Sports, Food, Coronavirus.

However, machine learning algorithms and neural networks need a very large number of observations (ideally millions) to generalize, so I have restricted the data to the following flairs:
**AskIndia, Non-Political, Politics, Policy/Economy, Sports, Food, Science/Technology, Business/Finance, Photography**

*These are the most common flairs that existed a year ago and still exist now in the [r/india](https://www.reddit.com/r/india/) subreddit.*

You can observe that there are 24 attributes of a subreddit post, as discussed in the [PRAW documentation](https://praw.readthedocs.io/en/latest/code_overview/models/submission.html). For the Reddit flair detection task, I have only considered the ones that might be related to the flair of the post.

**Note: This is a hypothesis about data collection that we will try to verify in the EDA section.**

These are given in the following table:

| Attribute | Description |
| --- | --- |
| author | Author of that post (Some people only post political content, some post Science/Technology content). |
| comments | Comments on the post (useful as text for language modelling). |
| created_utc | Time the submission was created, represented in Unix time. (This might matter if, for example, people post religious content in the morning and political content in the afternoon.) |
| link_flair_text | The link flair's text content (the flair of the post), or None if not flaired. |
| num_comments | The number of comments on that post. |
| over_18 | Whether or not the submission has been marked as NSFW. |
| score | The number of upvotes for the submission. |
| selftext | The submission selftext - an empty string if a link post. |
| title | The title of the post. |
| url | The URL the submission links to, or the permalink if a selfpost. |
| comments_authors | Authors of the comments, appended together. |
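
The `created_utc` hypothesis in the table depends on turning a Unix timestamp into a time of day. A minimal sketch of that conversion follows; the helper name `posting_hour_utc` is hypothetical, and the sample value is just an example timestamp of the kind PRAW returns.

```python
from datetime import datetime, timezone


def posting_hour_utc(created_utc):
    """Return the hour of day (0-23, in UTC) for a Unix timestamp."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc).hour


sample_created_utc = 1586713000.0            # example Unix timestamp
hour = posting_hour_utc(sample_created_utc)  # 17, i.e. 17:36:40 UTC
```

Note that this gives the hour in UTC; a local-time feature (for example, Indian Standard Time) would need an explicit offset.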

In [0]:
%%time
# to calculate time taken to scrape reddit

# create an empty python list to append posts from subreddit
posts = []

# a python list of all the flairs to collect posts from corresponding tags
# These will be the labels in classification.
flairs = ['AskIndia','Non-Political','Politics','Policy/Economy','Sports','Food','Science/Technology','Business/Finance','Photography']

# iterate through each flair
for flair in flairs:

  # collect relevant posts by searching in the subreddit (at most 100 per flair)
  relevant_posts = india_subreddit.search(f"flair_name:{flair}", limit=100)

  # iterate through each post
  for submission in relevant_posts:
    # load the full comment tree; this is what triggers the extra requests
    submission.comments.replace_more(limit=None)
    comment=''
    authors=''
    count=0
    for top_level_comment in submission.comments:
      # join all comments on post
      comment = comment + ' ' + top_level_comment.body
      # join all authors of comments on post
      authors = authors + ' ' + str(top_level_comment.author)

      count+= 1

      # stop after 10 top-level comments
      if (count>=10):
        break

    posts.append([submission.author, submission.created_utc, submission.link_flair_text, submission.num_comments, submission.score, submission.over_18, submission.selftext, submission.title, submission.url, comment, authors])

In [0]:
import pandas as pd # for data preprocessing and manipulation

# transform the list of rows into a pandas dataframe for easier preprocessing
data = pd.DataFrame(posts, columns = ['author', 'created_utc', 'link_flair_text', 'num_comments', 'score', 'over_18', 'selftext', 'title', 'url', 'comment', 'authors'])

In [0]:
# a look at data
data.head()

| Unnamed: 0 | author | created_utc | link_flair_text | num_comments | score | over_18 | selftext | title | url | comment | authors |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | sanand_satwik | 1586713000.0 | AskIndia | 134 | 1047 | False | Hi....It's really tough time for everyone. I r... | Lost my Job, Sick Mother and Paralysed Dad, In... | https://www.reddit.com/r/india/comments/g014wc... | I'm a freelancer. Don't listen to the idiots ... | hashedram diabapp xataari Aashayrao sarcrasti... |
| 1 | TWO-WHEELER-MAFIA | 1586419000.0 | AskIndia | 204 | 648 | False | We have floods, terrorist attacks, famines due... | Why does the government come with a begging bo... | https://www.reddit.com/r/india/comments/fxofyu... | I don't understand why they don't use money f... | Kinky-Monk ak32009 fools_eye None DwncstSheep... |
| 2 | GauGau24 | 1587355000.0 | AskIndia | 115 | 158 | False | I don't think we've spend so much time with fa... | People stuck with their family during the lock... | https://www.reddit.com/r/india/comments/g4lrhm... | yesterday we had a major fight. (me and my wi... | Best-Economist Srthak_ ppccbba tb33296 damnji... |
| 3 | Oomada9 | 1587672000.0 | AskIndia | 110 | 103 | False | Does caste still exist in India? Do people sti... | How prominent is the caste system in India now... | https://www.reddit.com/r/india/comments/g6tldd... | Very. \n\n\nHad a very good friend who was i... | merlin318 Vpee26 ppccbba Cierno Buns4Funz Sel... |
| 4 | indianoogler | 1586178000.0 | AskIndia | 206 | 266 | False | The corona virus has given me some time to thi... | Men who are 30+ and have decided not to get ma... | https://www.reddit.com/r/india/comments/fvy95j... | Plan your finances. Work and enjoy in your ow... | RedDevil-84 khushraho kingof-potatos congrats... |


In [0]:
# for further preprocessing, transform it into csv file as it saves time 
# and we don't have to go through same process again.
data.to_csv('data.csv', index=None)

In [0]:
# only works on Google Colab
from google.colab import files
# download data on personal computer
files.download('data.csv')

### Might want to consider:
If you are planning to use the PRAW API for Reddit data collection, keep in mind that it sends new requests to fetch comments. The block above took **7 hours to collect 1000 posts from 10 flairs (with 10 comments from each post)**. One way to overcome this problem is to use multiprocessing or Pushshift's API.
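
As a rough illustration of the parallel option, the sketch below spreads the per-flair work across workers. It uses a thread pool from `concurrent.futures` rather than the `multiprocessing` module, which is a substitution chosen here because the work is network-bound. The `collect_flair` worker is a placeholder that returns dummy data; a real version would need to create its own `praw.Reddit` instance per worker and would still be subject to Reddit's rate limits, so any speed-up is not guaranteed.

```python
from concurrent.futures import ThreadPoolExecutor

flairs = ['AskIndia', 'Non-Political', 'Politics']


def collect_flair(flair):
    """Placeholder worker (hypothetical).

    A real version would build its own praw.Reddit instance and run the
    search-and-comments loop for this single flair.
    """
    return [f"{flair}-post-{i}" for i in range(3)]


def collect_all(flairs, max_workers=3):
    """Run collect_flair for every flair concurrently, keyed by flair."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input flairs
        results = pool.map(collect_flair, flairs)
        return dict(zip(flairs, results))


posts_by_flair = collect_all(flairs)
```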

## 2. Pushshift's API
PRAW is the main Reddit API wrapper used for extracting data from the site using Python. However, it has a few limitations: it does not let you extract submissions between specific dates, and you can only extract 1000 submissions from a subreddit. Pushshift's API for accessing Reddit's data overcomes these inconveniences.


**The Pushshift API**

Pushshift is a big-data storage and analytics project started and maintained by Jason Baumgartner. The Pushshift API serves a copy of Reddit objects. Currently, data is copied into Pushshift at the time it is posted to Reddit. Therefore, scores and other metadata, such as edits to a submission's selftext or a comment's body field, may not reflect what is displayed by Reddit. A future version of the API will update data at timed intervals.

It also has some great features. You can:
  * access subreddit data without even needing Reddit credentials
  * analyze large quantities of Reddit data
  * grab data for a specific date range in the past
  * search for comments
  * aggregate data

### How to use
We can access the Pushshift API by building a URL with the relevant parameters, without even needing Reddit credentials.

Without parameters, this is the base URL you'll use to access Reddit data: https://api.pushshift.io/reddit/search/

With parameters, this URL accesses the india subreddit between two dates written as Unix timestamps and searches for all submissions that contain the keyword **coronavirus**: https://api.pushshift.io/reddit/search/submission/?q=coronavirus&after=1514764800&before=1517443200&subreddit=india

So, this will be the template into which we can insert dates and a keyword to search for specific posts: `'https://api.pushshift.io/reddit/search/submission/?title='+str(query)+'&size=1000&after='+str(after)+'&before='+str(before)+'&subreddit='+str(sub)`
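
As an alternative to concatenating strings by hand, the same URL can be assembled with the standard library's `urllib.parse.urlencode`, which also escapes special characters in the keyword. This is only a sketch; the function name `build_search_url` is an assumption made for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.pushshift.io/reddit/search/submission/"


def build_search_url(query, after, before, sub, size=1000):
    """Build a Pushshift submission-search URL from its parameters."""
    params = {
        "title": query,      # keyword to look for in post titles
        "size": size,        # number of posts requested per call
        "after": after,      # start of the interval (Unix timestamp)
        "before": before,    # end of the interval (Unix timestamp)
        "subreddit": sub,
    }
    return BASE_URL + "?" + urlencode(params)


url = build_search_url("state", 1546300800, 1577836800, "india")
```

A keyword containing spaces or `&` would break the hand-built string, while `urlencode` encodes it safely.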

**To make sure we are getting everything for the specific time period**

We can create a method for building time period search intervals and add logic to request more posts: before each new request, we take the creation timestamp of the last post received and use it as the new start of the interval.

**NOTE: This method has a downside: it can fetch duplicates.**
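
One simple way to guard against such duplicates is to remember the post ids already seen. The sketch below assumes each submission is a dictionary with an `id` key, as in the Pushshift responses; the helper name `drop_duplicates_by_id` is hypothetical. Storing posts in a dictionary keyed by id has a similar effect.

```python
def drop_duplicates_by_id(submissions):
    """Return submissions with repeated ids removed, keeping first occurrences."""
    seen = set()
    unique = []
    for subm in submissions:
        if subm["id"] in seen:
            continue          # already collected in an earlier page
        seen.add(subm["id"])
        unique.append(subm)
    return unique


# Two overlapping "pages" of made-up results
page_1 = [{"id": "a1", "title": "first"}, {"id": "b2", "title": "second"}]
page_2 = [{"id": "b2", "title": "second"}, {"id": "c3", "title": "third"}]
unique_posts = drop_duplicates_by_id(page_1 + page_2)
```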

In [0]:
# Import the relevant modules
import pandas as pd # for reading csv file and data manipulation
import requests # to access the Pushshift API through building an URL
import json # to transform web page data into json format
import csv # to upload into a CSV for further analysis
import time # for transformation of timestamps
import datetime # for transformation of timestamps

In [0]:
# Subreddit to query
sub = 'india'
# before and after dates
before = "1577836800" #01/01/2020 @ 12:00am (UTC)
after = "1546300800"  #01/01/2019 @ 12:00am (UTC)
query = "state" # to search in subreddit
subCount = 0
subStats = {}
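
The two timestamps above were typed in by hand. As a small sketch, they can also be derived from calendar dates with the standard library; the helper name `to_unix` is an assumption for this example.

```python
from datetime import datetime, timezone


def to_unix(year, month, day):
    """Unix timestamp for midnight UTC on the given date."""
    return int(datetime(year, month, day, tzinfo=timezone.utc).timestamp())


after_ts = to_unix(2019, 1, 1)    # 1546300800
before_ts = to_unix(2020, 1, 1)   # 1577836800
```

Passing `tzinfo=timezone.utc` matters: without it, the result would depend on the local timezone of the machine running the notebook.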

In [0]:
# We can access the Pushshift API through building an URL with the relevant
# parameters without even needing Reddit credentials
url = 'https://api.pushshift.io/reddit/search/submission/?title='+str(query)+'&size=1000&after='+str(after)+'&before='+str(before)+'&subreddit='+str(sub)
# to know how does the data look like click below pushshift URL with
# the parameters to see for yourself
url

'https://api.pushshift.io/reddit/search/submission/?title=state&size=1000&after=1546300800&before=1577836800&subreddit=india'

In [0]:
r = requests.get(url) # Requests module is used to access the URL  
data = json.loads(r.text) # with the JSON module collecting the text version of the page
dat = data['data']
dat[0] # a look at first post

{'author': 'TheEternalGentleman',
 'author_flair_css_class': None,
 'author_flair_richtext': [],
 'author_flair_text': None,
 'author_flair_type': 'text',
 'author_fullname': 't2_13flz4',
 'author_patreon_flair': False,
 'can_mod_post': False,
 'contest_mode': False,
 'created_utc': 1546328809,
 'domain': 'newindianexpress.com',
 'full_link': 'https://www.reddit.com/r/india/comments/abgc0m/in_about_2_hours_3_million_women_take_to_the/',
 'gildings': {'gid_1': 0, 'gid_2': 0, 'gid_3': 0},
 'id': 'abgc0m',
 'is_crosspostable': True,
 'is_meta': False,
 'is_original_content': False,
 'is_reddit_media_domain': False,
 'is_robot_indexable': True,
 'is_self': False,
 'is_video': False,
 'link_flair_background_color': '#ddbd37',
 'link_flair_css_class': 'Politics',
 'link_flair_richtext': [{'e': 'text', 't': 'Politics'}],
 'link_flair_template_id': '77f04f12-7ea0-11e3-ac66-22000a0b8292',
 'link_flair_text': 'Politics',
 'link_flair_text_color': 'dark',
 'link_flair_type': 'richtext',
 'locked'

In [0]:
def getPushshiftData(query, after, before, sub):
    '''
    Function to transform web page data into json
    '''
    url = 'https://api.pushshift.io/reddit/search/submission/?title='+str(query)+'&size=1000&after='+str(after)+'&before='+str(before)+'&subreddit='+str(sub)
    print(url)
    r = requests.get(url) # Requests module is used to access the URL
    data = json.loads(r.text) # with the JSON module collecting the text version of the page
    return data['data']
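
The function above assumes every request succeeds. Network calls can fail temporarily, so one option is to wrap the call in a small retry helper. The sketch below is generic and uses only the standard library; `with_retries` and `flaky_call` are hypothetical names, and the failure is simulated rather than coming from Pushshift.

```python
import time


def with_retries(func, attempts=3, delay=1.0, sleep=time.sleep):
    """Call func(); on an exception, wait and retry up to `attempts` times.

    The wait doubles after each failure (delay, 2*delay, 4*delay, ...).
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as error:
            last_error = error
            if attempt < attempts - 1:
                sleep(delay * (2 ** attempt))
    raise last_error


# Simulated call that fails twice before succeeding
calls = {"count": 0}

def flaky_call():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return ["post"]


# sleep is replaced by a no-op so the demonstration runs instantly
result = with_retries(flaky_call, attempts=3, sleep=lambda seconds: None)

# With the notebook's function, usage might look like:
# data = with_retries(lambda: getPushshiftData(query, after, before, sub))
```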

In [0]:
# similarly we can build another function to extract key data points:
def collectSubData(subm):
    subData = list() #list to store data points
    title = subm['title'] # title of the subreddit post
    url = subm['url'] # url associated with it
    try:
        flair = subm['link_flair_text'] # flair if given which it belongs to
    except KeyError:
        flair = "NaN"  # else none
    author = subm['author'] # author of post
    sub_id = subm['id'] # unique identifier of post
    score = subm['score'] # number of upvotes
    created = datetime.datetime.fromtimestamp(subm['created_utc']) #1520561700.0
    numComms = subm['num_comments'] # number of comments
    permalink = subm['permalink'] # permalink of post
    
    subData.append((sub_id,title,url,author,score,created,numComms,permalink,flair)) # adding it a single python list
    subStats[sub_id] = subData

In [0]:
# now we can run code and loop until all submission are collected from 
# a subreddit
data = getPushshiftData(query, after, before, sub)
# Will run until all posts have been gathered 
# from the 'after' date up until before date
while len(data) > 0:
    for submission in data:
        collectSubData(submission)
        subCount+=1
    # Calls getPushshiftData() with the created date of the last submission
    print(len(data))
    print(str(datetime.datetime.fromtimestamp(data[-1]['created_utc'])))
    after = data[-1]['created_utc']
    data = getPushshiftData(query, after, before, sub)
    
print(len(data))

https://api.pushshift.io/reddit/search/submission/?title=state&size=1000&after=1546300800&before=1577836800&subreddit=india
1000
2019-09-06 14:25:28
https://api.pushshift.io/reddit/search/submission/?title=state&size=1000&after=1567779928&before=1577836800&subreddit=india
506
2019-12-31 12:57:44
https://api.pushshift.io/reddit/search/submission/?title=state&size=1000&after=1577797064&before=1577836800&subreddit=india
0
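
The paging logic of the loop above can be isolated into a small generator, sketched here against a fake in-memory data source instead of the real API. The names `paginate`, `fake_fetch`, and `FAKE_POSTS` are made up for this example, and the fake source treats both bounds as exclusive, which is an assumption of the sketch.

```python
def paginate(fetch, after, before):
    """Yield submissions page by page until `fetch` returns an empty page."""
    while True:
        page = fetch(after, before)
        if not page:
            return
        yield from page
        # continue from the creation time of the last post received
        after = page[-1]["created_utc"]


# Fake data source: five posts with increasing timestamps
FAKE_POSTS = [{"id": str(i), "created_utc": 100 + i} for i in range(5)]

def fake_fetch(after, before, size=2):
    """Return at most `size` posts created strictly between the two bounds."""
    matches = [p for p in FAKE_POSTS if after < p["created_utc"] < before]
    return matches[:size]


collected = list(paginate(fake_fetch, after=99, before=200))
```

With a page size of 2, the generator makes four calls here: three that return posts and a final one that returns an empty page and ends the loop.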


In [0]:
# sanity check to make sure we have our data for further analysis
print(str(len(subStats)) + " submissions have added to list")
print("1st entry is:")
print(list(subStats.values())[0][0][1] + " created: " + str(list(subStats.values())[0][0][5]))
print("Last entry is:")
print(list(subStats.values())[-1][0][1] + " created: " + str(list(subStats.values())[-1][0][5]))

1506 submissions have added to list
1st entry is:
In about 2 hours, 3 million women take to the streets of Kerala, to form a human wall around 620 km long from tip to toe of the state. created: 2019-01-01 07:46:49
Last entry is:
Pakistan uses terrorism as tool of state policy: Army Chief Naravane created: 2019-12-31 12:57:44


In [0]:
# upload to csv file
def updateSubs_file():
    '''
    function to store data into a csv file
    '''
    upload_count = 0
    # directory of google colaboratory
    location = "\\Reddit Data\\"
    print("input filename of submission file, please add .csv")
    filename = input()
    file = location + filename
    with open(file, 'w', newline='', encoding='utf-8') as file: 
        a = csv.writer(file, delimiter=',')
        headers = ["Post ID","Title","Url","Author","Score","Publish Date","Total No. of Comments","Permalink","Flair"]
        a.writerow(headers)
        for sub in subStats:
            a.writerow(subStats[sub][0])
            upload_count+=1
            
        print(str(upload_count) + " submissions have been uploaded") # print updates while writing into csv file
updateSubs_file()

input filename of submission file, please add .csv
submission.csv
1506 submissions have been uploaded


Approach:
---
In this notebook, I have explored different ways to scrape Reddit, each with its own limitations. The aim of this part is to explore different libraries for collecting data, recognize their limitations, and choose one suitable for our task, i.e., Reddit flair detection.

Our task requires building a Reddit flair detection classifier and deploying it to Heroku. Since we are going to test different machine learning models and deep learning algorithms, it will be beneficial to have millions of observations in the dataset. Therefore, I have collected **two datasets**: one from the Reddit API and another from Pushshift's API.
The Pushshift dataset covers **1st January 2019 to 1st January 2020** and contains the title and body of Reddit posts.

In [0]:
import pandas as pd
import requests
import json
import csv
import time
import datetime

def getPushshiftData(after, before):
    sub = "india"
    url = 'https://api.pushshift.io/reddit/search/submission/?size=1000&after='+str(after)+'&before='+str(before)+'&subreddit='+str(sub)
    print(url)
    r = requests.get(url)
    data = json.loads(r.text)
    return data['data']

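def collectSubData(subm):
    # keep only the fields needed for flair classification
    sub_id = subm['id']
    title = subm['title']

    # some posts have no selftext; fall back to an empty string
    body = subm.get('selftext', "")

    # posts without a flair cannot serve as labelled examples, so skip them
    if 'link_flair_text' not in subm:
        return
    flair = subm['link_flair_text']

    # store one (title, body, flair) tuple per post, keyed by post id
    subStats[sub_id] = [(title, body, flair)]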

        
#before and after dates
before = "1577836800" #01/01/2020 @ 12:00am (UTC)
after = "1546300800"  #01/01/2019 @ 12:00am (UTC)
subCount = 0
subStats = {}

In [0]:
data = getPushshiftData(after, before)

# Will run until all posts have been gathered 
# from the 'after' date up until before date
while len(data) > 0:
    for submission in data:
        collectSubData(submission)
        subCount+=1
    # Calls getPushshiftData() with the created date of the last submission
    print(len(data))
    print(str(datetime.datetime.fromtimestamp(data[-1]['created_utc'])))
    after = data[-1]['created_utc']
    data = getPushshiftData(after, before)
    
print(len(data))

subStats = list(subStats.values())

https://api.pushshift.io/reddit/search/submission/?size=1000&after=1546300800&before=1577836800&subreddit=india
1000
2019-01-03 02:43:12
https://api.pushshift.io/reddit/search/submission/?size=1000&after=1546483392&before=1577836800&subreddit=india
1000
2019-01-04 17:25:08
https://api.pushshift.io/reddit/search/submission/?size=1000&after=1546622708&before=1577836800&subreddit=india
1000
2019-01-07 05:46:27
https://api.pushshift.io/reddit/search/submission/?size=1000&after=1546839987&before=1577836800&subreddit=india
1000
2019-01-09 03:16:49
https://api.pushshift.io/reddit/search/submission/?size=1000&after=1547003809&before=1577836800&subreddit=india
1000
2019-01-10 17:53:53
https://api.pushshift.io/reddit/search/submission/?size=1000&after=1547142833&before=1577836800&subreddit=india
1000
2019-01-12 21:01:08
https://api.pushshift.io/reddit/search/submission/?size=1000&after=1547326868&before=1577836800&subreddit=india
1000
2019-01-15 06:07:07
https://api.pushshift.io/reddit/search/su

In [0]:
allowed_tags = ['AskIndia','Non-Political','Politics','Policy/Economy','Sports','Food','Science/Technology','Business/Finance','Photography']

extracted_dict = {
    "TITLE":[], 
    "BODY":[], 
    "FLAIR":[], 
}

for sub in subStats:
    if sub[0][2] not in allowed_tags:
        continue
    else:
      extracted_dict["FLAIR"].append(sub[0][2])
    
    extracted_dict["TITLE"].append(sub[0][0].replace(","," "))
    extracted_dict["BODY"].append(sub[0][1].replace(","," "))
    
    

pandas_data = pd.DataFrame(extracted_dict)
#print(pandas_data)

pandas_data.to_csv('data_1.csv', index=False)
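
The cell above replaces commas with spaces before saving. As a possible alternative, commas can be left in place, because CSV writers quote any field that contains the delimiter (pandas' `to_csv` also does this by default). The sketch below shows the round trip with the standard `csv` module; the sample row is made up for illustration.

```python
import csv
import io

# Illustrative row whose title and body both contain commas
rows = [("Lost my Job, Sick Mother", "Hi, everyone", "AskIndia")]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["TITLE", "BODY", "FLAIR"])
writer.writerows(rows)

# Read the text back to confirm the commas survived inside their fields
buffer.seek(0)
read_back = list(csv.reader(buffer))
```

Keeping the original punctuation may matter later if the text is used for language modelling.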

In [0]:
from google.colab import files
files.download('data_1.csv')

**In this part, we have collected two datasets: one containing random posts from 9 flairs and another containing a year of data for 9 flairs. In EDA, we will explore both of them to find patterns useful for machine learning.**