# Reddit Flair Classification
## Scraping from r/india
The Reddit posts are scraped with the Pushshift API and saved to a CSV file. Posts belonging to the categories listed below are then extracted by randomly sampling from the pool of posts in that CSV.

Flair categories considered (the Non-Political category was considered earlier but dropped because it is too vague: "Non-Political" can be anything that is not political):
1. 'Business/Finance'
2. 'Policy/Economy'
3. 'Photography'
4. 'Politics'
5. 'Sports'
6. '[R]eddiquette'
7. 'Food'
8. 'Science/Technology'
9. 'AskIndia'
10. 'CAA-NRC'
11. 'Coronavirus'

To scrape the Reddit posts I used the Pushshift API. PRAW could also have been used, and scraping with it is much simpler, but PRAW doesn't allow crawling more than 1000 posts, so I resorted to Pushshift.

The Pushshift API returns JSON from which the required fields can be extracted.
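
As a minimal sketch of that extraction, the snippet below parses a hand-written JSON string shaped like a Pushshift response (a top-level `data` list of submission dictionaries) and pulls out a few fields. The sample values are invented for illustration, and `extract_fields` is a hypothetical helper, not part of the notebook.

```python
import json

# Hand-written sample shaped like a Pushshift response; all values are made up.
sample_response = '''
{"data": [
  {"id": "abc123", "title": "Example post", "created_utc": 1546878300,
   "link_flair_text": "Politics", "selftext": ""},
  {"id": "def456", "title": "Post without flair", "created_utc": 1546878400}
]}
'''

def extract_fields(submission):
    """Return (id, title, flair, selftext); missing optional keys get defaults."""
    return (
        submission["id"],
        submission["title"],
        submission.get("link_flair_text", "NaN"),
        submission.get("selftext", ""),
    )

rows = [extract_fields(s) for s in json.loads(sample_response)["data"]]
for row in rows:
    print(row)
```

Using `dict.get` with a default plays the same role as the `try`/`except KeyError` blocks in the notebook: posts without a flair or self text still produce a complete row.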



### Importing Libraries

In [1]:
import numpy as np
import praw
import pandas as pd
import requests
import json
import csv
import time
import datetime



### Functions
The functions below perform elementary tasks.
getPushshiftData() builds the request URL, fetches the JSON, and returns the list of submission dictionaries stored under its 'data' key for further extraction.
collectSubData() extracts the fields of interest from one of those JSON-like dictionaries by key and stores them in subStats.

In [None]:
def getPushshiftData(sub, after, before):
    """
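    sub    -> subreddit name
    after  -> unix timestamp; lower bound on the post creation time
    before -> unix timestamp; upper bound on the post creation time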
    """
    url = 'https://api.pushshift.io/reddit/search/submission/?after='+str(after)+'&before='+str(before)+'&subreddit='+str(sub)
    print(url)
    r = requests.get(url)
    data = json.loads(r.text)
    return data['data']

def collectSubData(subm):
    subData = list() #list to store data points
    title = subm['title']
    url = subm['url']
    try:
        flair = subm['link_flair_text']
    except KeyError:
        flair = "NaN" 
    try:
        selftext = subm['selftext']
    except KeyError:
        selftext = ""
    author = subm['author']
    sub_id = subm['id']
    score = subm['score']
    created = datetime.datetime.fromtimestamp(subm['created_utc']) #1520561700.0
    numComms = subm['num_comments']
    permalink = subm['permalink']
    
    subData.append((sub_id,title,url,author,score,created,numComms,permalink,flair, selftext))
    subStats[sub_id] = subData
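
The URL in getPushshiftData() is assembled by string concatenation, so an empty `before` still ends up in the query string. A possible alternative, sketched here with the standard library's `urlencode`, leaves empty parameters out. `build_pushshift_url` is a hypothetical name, and dropping empty parameters is an assumption about the intended behaviour.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.pushshift.io/reddit/search/submission/"

def build_pushshift_url(sub, after, before=""):
    """Build the search URL, leaving out any parameter that is empty."""
    params = {"after": after, "before": before, "subreddit": sub}
    # Keep only the parameters that actually carry a value
    params = {k: v for k, v in params.items() if v not in ("", None)}
    return BASE_URL + "?" + urlencode(params)

print(build_pushshift_url("india", "1546878243"))
print(build_pushshift_url("india", "1546878243", "1552285999"))
```

`urlencode` also escapes characters that are not URL-safe, which the concatenated version does not do.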

Variable initializations 

In [18]:
#Subreddit to query
sub='india'
#before and after dates

after = "1546878243"  # Unix timestamp for 7 January 2019
before = ""  # left empty (no end date set)
subCount = 0
subStats = {}

### Main scraping code

Each API call returns 25 Reddit posts created after the Unix timestamp passed as the 'after' argument. We therefore repeat the call, each time moving 'after' to the creation time of the last post received, until the while condition is false, i.e. all posts before the 'before' timestamp have been fetched.

In [19]:
# Will run until all posts have been gathered
# from the 'after' date up until the 'before' date
data = getPushshiftData(sub,after,before)
while len(data) > 0:
    for submission in data:
        collectSubData(submission)
        subCount+=1
    # Calls getPushshiftData() with the created date of the last submission
    print(len(data))
    print(str(datetime.datetime.fromtimestamp(data[-1]['created_utc'])))
    after = data[-1]['created_utc']
    data = getPushshiftData(sub,after,before) #initializing data with the last timestamp of previous call
    
print(len(data))

https://api.pushshift.io/reddit/search/submission/?after=1546878243&subreddit=india
25
2019-01-07 22:38:18
https://api.pushshift.io/reddit/search/submission/?after=1546880898&subreddit=india
25
2019-01-07 23:58:39
https://api.pushshift.io/reddit/search/submission/?after=1546885719&subreddit=india
25
2019-01-08 02:04:42
https://api.pushshift.io/reddit/search/submission/?after=1546893282&subreddit=india
25
2019-01-08 08:27:06
https://api.pushshift.io/reddit/search/submission/?after=1546916226&subreddit=india
25
2019-01-08 09:33:26
https://api.pushshift.io/reddit/search/submission/?after=1546920206&subreddit=india
25
2019-01-08 10:06:22
https://api.pushshift.io/reddit/search/submission/?after=1546922182&subreddit=india
25
2019-01-08 10:54:43
https://api.pushshift.io/reddit/search/submission/?after=1546925083&subreddit=india
25
2019-01-08 11:21:48
https://api.pushshift.io/reddit/search/submission/?after=1546926708&subreddit=india
25
2019-01-08 12:00:14
https://api.pushshift.io/reddit/searc

25
2019-01-11 15:38:22
https://api.pushshift.io/reddit/search/submission/?after=1547201302&subreddit=india
25
2019-01-11 16:08:43
https://api.pushshift.io/reddit/search/submission/?after=1547203123&subreddit=india
25
2019-01-11 17:08:20
https://api.pushshift.io/reddit/search/submission/?after=1547206700&subreddit=india
25
2019-01-11 17:58:54
https://api.pushshift.io/reddit/search/submission/?after=1547209734&subreddit=india
25
2019-01-11 19:23:22
https://api.pushshift.io/reddit/search/submission/?after=1547214802&subreddit=india
25
2019-01-11 19:53:23
https://api.pushshift.io/reddit/search/submission/?after=1547216603&subreddit=india
25
2019-01-11 20:43:37
https://api.pushshift.io/reddit/search/submission/?after=1547219617&subreddit=india
25
2019-01-11 21:27:49
https://api.pushshift.io/reddit/search/submission/?after=1547222269&subreddit=india
25
2019-01-11 22:18:51
https://api.pushshift.io/reddit/search/submission/?after=1547225331&subreddit=india
25
2019-01-11 23:13:46
https://api.pu

25
2019-01-15 18:27:58
https://api.pushshift.io/reddit/search/submission/?after=1547557078&subreddit=india
25
2019-01-15 19:23:07
https://api.pushshift.io/reddit/search/submission/?after=1547560387&subreddit=india
25
2019-01-15 20:10:25
https://api.pushshift.io/reddit/search/submission/?after=1547563225&subreddit=india
25
2019-01-15 21:13:56
https://api.pushshift.io/reddit/search/submission/?after=1547567036&subreddit=india
25
2019-01-15 22:36:24
https://api.pushshift.io/reddit/search/submission/?after=1547571984&subreddit=india
25
2019-01-15 23:49:58
https://api.pushshift.io/reddit/search/submission/?after=1547576398&subreddit=india
25
2019-01-16 00:57:14
https://api.pushshift.io/reddit/search/submission/?after=1547580434&subreddit=india
25
2019-01-16 07:54:10
https://api.pushshift.io/reddit/search/submission/?after=1547605450&subreddit=india
25
2019-01-16 09:23:52
https://api.pushshift.io/reddit/search/submission/?after=1547610832&subreddit=india
25
2019-01-16 10:31:07
https://api.pu

25
2019-01-19 20:41:12
https://api.pushshift.io/reddit/search/submission/?after=1547910672&subreddit=india
25
2019-01-19 21:35:37
https://api.pushshift.io/reddit/search/submission/?after=1547913937&subreddit=india
25
2019-01-19 23:41:22
https://api.pushshift.io/reddit/search/submission/?after=1547921482&subreddit=india
25
2019-01-20 04:54:01
https://api.pushshift.io/reddit/search/submission/?after=1547940241&subreddit=india
25
2019-01-20 08:23:35
https://api.pushshift.io/reddit/search/submission/?after=1547952815&subreddit=india
25
2019-01-20 10:15:37
https://api.pushshift.io/reddit/search/submission/?after=1547959537&subreddit=india
25
2019-01-20 11:18:34
https://api.pushshift.io/reddit/search/submission/?after=1547963314&subreddit=india
25
2019-01-20 12:27:48
https://api.pushshift.io/reddit/search/submission/?after=1547967468&subreddit=india
25
2019-01-20 13:21:41
https://api.pushshift.io/reddit/search/submission/?after=1547970701&subreddit=india
25
2019-01-20 14:21:18
https://api.pu

25
2019-01-23 23:07:56
https://api.pushshift.io/reddit/search/submission/?after=1548265076&subreddit=india
25
2019-01-24 00:53:30
https://api.pushshift.io/reddit/search/submission/?after=1548271410&subreddit=india
25
2019-01-24 03:42:25
https://api.pushshift.io/reddit/search/submission/?after=1548281545&subreddit=india
25
2019-01-24 08:03:23
https://api.pushshift.io/reddit/search/submission/?after=1548297203&subreddit=india
25
2019-01-24 09:25:56
https://api.pushshift.io/reddit/search/submission/?after=1548302156&subreddit=india
25
2019-01-24 10:12:06
https://api.pushshift.io/reddit/search/submission/?after=1548304926&subreddit=india
25
2019-01-24 10:59:34
https://api.pushshift.io/reddit/search/submission/?after=1548307774&subreddit=india
25
2019-01-24 11:37:20
https://api.pushshift.io/reddit/search/submission/?after=1548310040&subreddit=india
25
2019-01-24 12:34:30
https://api.pushshift.io/reddit/search/submission/?after=1548313470&subreddit=india
25
2019-01-24 13:29:27
https://api.pu

25
2019-01-28 11:52:22
https://api.pushshift.io/reddit/search/submission/?after=1548656542&subreddit=india
25
2019-01-28 12:32:59
https://api.pushshift.io/reddit/search/submission/?after=1548658979&subreddit=india
25
2019-01-28 13:14:22
https://api.pushshift.io/reddit/search/submission/?after=1548661462&subreddit=india
25
2019-01-28 14:07:33
https://api.pushshift.io/reddit/search/submission/?after=1548664653&subreddit=india
25
2019-01-28 14:57:38
https://api.pushshift.io/reddit/search/submission/?after=1548667658&subreddit=india
25
2019-01-28 15:40:43
https://api.pushshift.io/reddit/search/submission/?after=1548670243&subreddit=india
25
2019-01-28 16:36:12
https://api.pushshift.io/reddit/search/submission/?after=1548673572&subreddit=india
25
2019-01-28 17:35:23
https://api.pushshift.io/reddit/search/submission/?after=1548677123&subreddit=india
25
2019-01-28 18:19:26
https://api.pushshift.io/reddit/search/submission/?after=1548679766&subreddit=india
25
2019-01-28 18:58:04
https://api.pu

25
2019-01-31 23:42:54
https://api.pushshift.io/reddit/search/submission/?after=1548958374&subreddit=india
25
2019-02-01 01:35:25
https://api.pushshift.io/reddit/search/submission/?after=1548965125&subreddit=india
25
2019-02-01 06:34:44
https://api.pushshift.io/reddit/search/submission/?after=1548983084&subreddit=india
25
2019-02-01 08:15:48
https://api.pushshift.io/reddit/search/submission/?after=1548989148&subreddit=india
25
2019-02-01 09:22:59
https://api.pushshift.io/reddit/search/submission/?after=1548993179&subreddit=india
25
2019-02-01 10:15:08
https://api.pushshift.io/reddit/search/submission/?after=1548996308&subreddit=india
25
2019-02-01 11:22:28
https://api.pushshift.io/reddit/search/submission/?after=1549000348&subreddit=india
25
2019-02-01 11:54:12
https://api.pushshift.io/reddit/search/submission/?after=1549002252&subreddit=india
25
2019-02-01 12:41:24
https://api.pushshift.io/reddit/search/submission/?after=1549005084&subreddit=india
25
2019-02-01 13:14:19
https://api.pu

25
2019-02-04 22:15:10
https://api.pushshift.io/reddit/search/submission/?after=1549298710&subreddit=india
25
2019-02-04 23:30:37
https://api.pushshift.io/reddit/search/submission/?after=1549303237&subreddit=india
25
2019-02-05 00:57:18
https://api.pushshift.io/reddit/search/submission/?after=1549308438&subreddit=india
25
2019-02-05 06:21:27
https://api.pushshift.io/reddit/search/submission/?after=1549327887&subreddit=india
25
2019-02-05 08:21:40
https://api.pushshift.io/reddit/search/submission/?after=1549335100&subreddit=india
25
2019-02-05 09:14:43
https://api.pushshift.io/reddit/search/submission/?after=1549338283&subreddit=india
25
2019-02-05 10:12:48
https://api.pushshift.io/reddit/search/submission/?after=1549341768&subreddit=india
25
2019-02-05 10:56:25
https://api.pushshift.io/reddit/search/submission/?after=1549344385&subreddit=india
25
2019-02-05 11:43:09
https://api.pushshift.io/reddit/search/submission/?after=1549347189&subreddit=india
25
2019-02-05 12:29:14
https://api.pu

25
2019-02-08 16:55:03
https://api.pushshift.io/reddit/search/submission/?after=1549625103&subreddit=india
25
2019-02-08 17:32:50
https://api.pushshift.io/reddit/search/submission/?after=1549627370&subreddit=india
25
2019-02-08 18:13:17
https://api.pushshift.io/reddit/search/submission/?after=1549629797&subreddit=india
25
2019-02-08 19:02:27
https://api.pushshift.io/reddit/search/submission/?after=1549632747&subreddit=india
25
2019-02-08 19:42:34
https://api.pushshift.io/reddit/search/submission/?after=1549635154&subreddit=india
25
2019-02-08 20:18:28
https://api.pushshift.io/reddit/search/submission/?after=1549637308&subreddit=india
25
2019-02-08 21:28:08
https://api.pushshift.io/reddit/search/submission/?after=1549641488&subreddit=india
25
2019-02-08 22:19:22
https://api.pushshift.io/reddit/search/submission/?after=1549644562&subreddit=india
25
2019-02-08 23:13:50
https://api.pushshift.io/reddit/search/submission/?after=1549647830&subreddit=india
25
2019-02-09 00:51:08
https://api.pu

25
2019-02-12 17:31:58
https://api.pushshift.io/reddit/search/submission/?after=1549972918&subreddit=india
25
2019-02-12 18:48:37
https://api.pushshift.io/reddit/search/submission/?after=1549977517&subreddit=india
25
2019-02-12 19:56:05
https://api.pushshift.io/reddit/search/submission/?after=1549981565&subreddit=india
25
2019-02-12 20:56:31
https://api.pushshift.io/reddit/search/submission/?after=1549985191&subreddit=india
25
2019-02-12 21:35:48
https://api.pushshift.io/reddit/search/submission/?after=1549987548&subreddit=india
25
2019-02-12 22:28:31
https://api.pushshift.io/reddit/search/submission/?after=1549990711&subreddit=india
25
2019-02-12 23:35:09
https://api.pushshift.io/reddit/search/submission/?after=1549994709&subreddit=india
25
2019-02-13 01:37:46
https://api.pushshift.io/reddit/search/submission/?after=1550002066&subreddit=india
25
2019-02-13 06:50:22
https://api.pushshift.io/reddit/search/submission/?after=1550020822&subreddit=india
25
2019-02-13 08:41:09
https://api.pu

25
2019-02-16 06:42:49
https://api.pushshift.io/reddit/search/submission/?after=1550279569&subreddit=india
25
2019-02-16 08:49:15
https://api.pushshift.io/reddit/search/submission/?after=1550287155&subreddit=india
25
2019-02-16 09:40:39
https://api.pushshift.io/reddit/search/submission/?after=1550290239&subreddit=india
25
2019-02-16 10:19:15
https://api.pushshift.io/reddit/search/submission/?after=1550292555&subreddit=india
25
2019-02-16 10:58:58
https://api.pushshift.io/reddit/search/submission/?after=1550294938&subreddit=india
25
2019-02-16 11:36:51
https://api.pushshift.io/reddit/search/submission/?after=1550297211&subreddit=india
25
2019-02-16 12:36:09
https://api.pushshift.io/reddit/search/submission/?after=1550300769&subreddit=india
25
2019-02-16 13:14:02
https://api.pushshift.io/reddit/search/submission/?after=1550303042&subreddit=india
25
2019-02-16 13:49:19
https://api.pushshift.io/reddit/search/submission/?after=1550305159&subreddit=india
25
2019-02-16 14:44:50
https://api.pu

25
2019-02-19 17:11:51
https://api.pushshift.io/reddit/search/submission/?after=1550576511&subreddit=india
25
2019-02-19 17:47:12
https://api.pushshift.io/reddit/search/submission/?after=1550578632&subreddit=india
25
2019-02-19 18:52:35
https://api.pushshift.io/reddit/search/submission/?after=1550582555&subreddit=india
25
2019-02-19 19:55:58
https://api.pushshift.io/reddit/search/submission/?after=1550586358&subreddit=india
25
2019-02-19 20:44:06
https://api.pushshift.io/reddit/search/submission/?after=1550589246&subreddit=india
25
2019-02-19 21:22:05
https://api.pushshift.io/reddit/search/submission/?after=1550591525&subreddit=india
25
2019-02-19 22:08:02
https://api.pushshift.io/reddit/search/submission/?after=1550594282&subreddit=india
25
2019-02-19 22:49:00
https://api.pushshift.io/reddit/search/submission/?after=1550596740&subreddit=india
25
2019-02-19 23:57:56
https://api.pushshift.io/reddit/search/submission/?after=1550600876&subreddit=india
25
2019-02-20 01:36:38
https://api.pu

25
2019-02-22 23:43:37
https://api.pushshift.io/reddit/search/submission/?after=1550859217&subreddit=india
25
2019-02-23 01:11:10
https://api.pushshift.io/reddit/search/submission/?after=1550864470&subreddit=india
25
2019-02-23 06:06:40
https://api.pushshift.io/reddit/search/submission/?after=1550882200&subreddit=india
25
2019-02-23 08:28:19
https://api.pushshift.io/reddit/search/submission/?after=1550890699&subreddit=india
25
2019-02-23 09:29:34
https://api.pushshift.io/reddit/search/submission/?after=1550894374&subreddit=india
25
2019-02-23 10:46:44
https://api.pushshift.io/reddit/search/submission/?after=1550899004&subreddit=india
25
2019-02-23 11:35:04
https://api.pushshift.io/reddit/search/submission/?after=1550901904&subreddit=india
25
2019-02-23 12:17:20
https://api.pushshift.io/reddit/search/submission/?after=1550904440&subreddit=india
25
2019-02-23 13:09:12
https://api.pushshift.io/reddit/search/submission/?after=1550907552&subreddit=india
25
2019-02-23 13:51:35
https://api.pu

25
2019-02-26 16:30:11
https://api.pushshift.io/reddit/search/submission/?after=1551178811&subreddit=india
25
2019-02-26 17:08:25
https://api.pushshift.io/reddit/search/submission/?after=1551181105&subreddit=india
25
2019-02-26 17:57:33
https://api.pushshift.io/reddit/search/submission/?after=1551184053&subreddit=india
25
2019-02-26 18:32:49
https://api.pushshift.io/reddit/search/submission/?after=1551186169&subreddit=india
25
2019-02-26 19:17:32
https://api.pushshift.io/reddit/search/submission/?after=1551188852&subreddit=india
25
2019-02-26 19:46:34
https://api.pushshift.io/reddit/search/submission/?after=1551190594&subreddit=india
25
2019-02-26 20:21:42
https://api.pushshift.io/reddit/search/submission/?after=1551192702&subreddit=india
25
2019-02-26 20:56:56
https://api.pushshift.io/reddit/search/submission/?after=1551194816&subreddit=india
25
2019-02-26 21:19:30
https://api.pushshift.io/reddit/search/submission/?after=1551196170&subreddit=india
25
2019-02-26 22:04:26
https://api.pu

25
2019-02-28 22:22:12
https://api.pushshift.io/reddit/search/submission/?after=1551372732&subreddit=india
25
2019-02-28 23:02:07
https://api.pushshift.io/reddit/search/submission/?after=1551375127&subreddit=india
25
2019-02-28 23:28:51
https://api.pushshift.io/reddit/search/submission/?after=1551376731&subreddit=india
25
2019-03-01 00:32:55
https://api.pushshift.io/reddit/search/submission/?after=1551380575&subreddit=india
25
2019-03-01 01:14:16
https://api.pushshift.io/reddit/search/submission/?after=1551383056&subreddit=india
25
2019-03-01 03:28:52
https://api.pushshift.io/reddit/search/submission/?after=1551391132&subreddit=india
25
2019-03-01 06:27:53
https://api.pushshift.io/reddit/search/submission/?after=1551401873&subreddit=india
25
2019-03-01 07:56:05
https://api.pushshift.io/reddit/search/submission/?after=1551407165&subreddit=india
25
2019-03-01 09:00:12
https://api.pushshift.io/reddit/search/submission/?after=1551411012&subreddit=india
25
2019-03-01 09:35:34
https://api.pu

25
2019-03-04 02:19:06
https://api.pushshift.io/reddit/search/submission/?after=1551646146&subreddit=india
25
2019-03-04 05:56:39
https://api.pushshift.io/reddit/search/submission/?after=1551659199&subreddit=india
25
2019-03-04 08:09:18
https://api.pushshift.io/reddit/search/submission/?after=1551667158&subreddit=india
25
2019-03-04 09:14:06
https://api.pushshift.io/reddit/search/submission/?after=1551671046&subreddit=india
25
2019-03-04 10:07:11
https://api.pushshift.io/reddit/search/submission/?after=1551674231&subreddit=india
25
2019-03-04 10:39:53
https://api.pushshift.io/reddit/search/submission/?after=1551676193&subreddit=india
25
2019-03-04 11:14:24
https://api.pushshift.io/reddit/search/submission/?after=1551678264&subreddit=india
25
2019-03-04 11:44:17
https://api.pushshift.io/reddit/search/submission/?after=1551680057&subreddit=india
25
2019-03-04 12:22:32
https://api.pushshift.io/reddit/search/submission/?after=1551682352&subreddit=india
25
2019-03-04 13:00:17
https://api.pu

25
2019-03-07 03:39:42
https://api.pushshift.io/reddit/search/submission/?after=1551910182&subreddit=india
25
2019-03-07 07:41:36
https://api.pushshift.io/reddit/search/submission/?after=1551924696&subreddit=india
25
2019-03-07 08:57:15
https://api.pushshift.io/reddit/search/submission/?after=1551929235&subreddit=india
25
2019-03-07 09:46:44
https://api.pushshift.io/reddit/search/submission/?after=1551932204&subreddit=india
25
2019-03-07 10:26:29
https://api.pushshift.io/reddit/search/submission/?after=1551934589&subreddit=india
25
2019-03-07 10:59:05
https://api.pushshift.io/reddit/search/submission/?after=1551936545&subreddit=india
25
2019-03-07 11:44:11
https://api.pushshift.io/reddit/search/submission/?after=1551939251&subreddit=india
25
2019-03-07 12:26:10
https://api.pushshift.io/reddit/search/submission/?after=1551941770&subreddit=india
25
2019-03-07 12:56:32
https://api.pushshift.io/reddit/search/submission/?after=1551943592&subreddit=india
25
2019-03-07 13:36:23
https://api.pu

25
2019-03-10 16:55:13
https://api.pushshift.io/reddit/search/submission/?after=1552217113&subreddit=india
25
2019-03-10 17:59:02
https://api.pushshift.io/reddit/search/submission/?after=1552220942&subreddit=india
25
2019-03-10 18:52:44
https://api.pushshift.io/reddit/search/submission/?after=1552224164&subreddit=india
25
2019-03-10 19:38:12
https://api.pushshift.io/reddit/search/submission/?after=1552226892&subreddit=india
25
2019-03-10 20:39:28
https://api.pushshift.io/reddit/search/submission/?after=1552230568&subreddit=india
25
2019-03-10 21:53:26
https://api.pushshift.io/reddit/search/submission/?after=1552235006&subreddit=india
25
2019-03-10 22:33:25
https://api.pushshift.io/reddit/search/submission/?after=1552237405&subreddit=india
25
2019-03-10 23:28:06
https://api.pushshift.io/reddit/search/submission/?after=1552240686&subreddit=india
25
2019-03-11 00:28:32
https://api.pushshift.io/reddit/search/submission/?after=1552244312&subreddit=india
25
2019-03-11 03:32:02
https://api.pu

KeyboardInterrupt: 
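
The run above was stopped by hand. A bounded variant of the same pagination loop could look like the sketch below. It is not the notebook's code: `paginate`, `max_pages`, `pause`, and the `fake_fetch` stand-in for the real API call are all assumptions made so the example runs without network access.

```python
import time

def paginate(fetch, after, max_pages=1000, pause=0.0):
    """Collect posts page by page until a page is empty or max_pages is hit.

    fetch(after) is assumed to return a list of dicts with a 'created_utc'
    key, ordered oldest first.
    """
    collected = []
    for _ in range(max_pages):
        page = fetch(after)
        if not page:
            break
        collected.extend(page)
        after = page[-1]["created_utc"]  # next call starts after the last post seen
        time.sleep(pause)  # optional delay between requests
    return collected, after

# Stand-in for the real API: 60 fake posts, served 25 at a time.
FAKE_POSTS = [{"id": str(i), "created_utc": 1546878243 + 60 * (i + 1)} for i in range(60)]

def fake_fetch(after):
    newer = [p for p in FAKE_POSTS if p["created_utc"] > int(after)]
    return newer[:25]

posts, last_seen = paginate(fake_fetch, "1546878243")
print(len(posts), last_seen)
```

Returning the last timestamp alongside the posts makes it easy to record where a later crawl should start.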

After crawling through the posts, the code below can be run to find the first and last posts saved along with their timestamps, so that if more posts are required later, crawling can resume from the last post saved.

In [20]:
print(str(len(subStats)) + " submissions have added to list")
print("1st entry is:")
print(list(subStats.values())[0][0][1] + " created: " + str(list(subStats.values())[0][0][5]))
print("Last entry is:")
print(list(subStats.values())[-1][0][1] + " created: " + str(list(subStats.values())[-1][0][5]))

33100 submissions have added to list
1st entry is:
A ‘Not so crowded’ Local Train created: 2019-01-07 21:55:37
Last entry is:
[P] Tamil Nadu: Actor Vijayakanth’s Desiya Murpokku Dravida Kazhagam joins AIADMK-BJP alliance created: 2019-03-11 12:03:19
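
To resume a crawl from the last saved post, its creation time has to be turned back into a Unix timestamp for the 'after' argument. The sketch below shows one way to do that. `resume_after` is a hypothetical helper, and it assumes the saved string came from `str(datetime.fromtimestamp(ts))` on a machine in the same time zone, as in collectSubData().

```python
from datetime import datetime

def resume_after(created_str):
    """Convert a saved 'created' string back to a Unix timestamp.

    Assumes the string is a naive local time in 'YYYY-MM-DD HH:MM:SS' form.
    """
    created = datetime.strptime(created_str, "%Y-%m-%d %H:%M:%S")
    return int(created.timestamp())

# Round trip with an arbitrary timestamp
ts = 1552285999
saved = str(datetime.fromtimestamp(ts))
print(saved, "->", resume_after(saved))
```

Storing the raw `created_utc` value next to the formatted date would avoid the time zone assumption altogether.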


In [21]:
import os

def updateSubs_file():
    upload_count = 0
    location = "./"
    print("input filename of submission file, please add .csv")
    filename = input()
    path = location + filename
    # The file is opened in append mode, so write the header only
    # when the file is new or empty
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f, delimiter=',')
        headers = ["id","title","url","author","score","publish_date","num_comment","permalink","flair", "selftext"]
        if write_header:
            writer.writerow(headers)
        for sub in subStats:
            writer.writerow(subStats[sub][0])
            upload_count+=1

        print(str(upload_count) + " submissions have been uploaded")

In [22]:
updateSubs_file()

input filename of submission file, please add .csv
scrapped_reddit.csv
33100 submissions have been uploaded


### Extracting relevant posts
The saved CSV is loaded as a DataFrame to extract the data points that carry the flairs listed above.

In [2]:

data = pd.read_csv("./scrapped.csv")

In [3]:
def CountFrequency(my_list): 
  
    # Creating an empty dictionary  
    freq = {} 
    for item in my_list: 
        if (item in freq): 
            freq[item] += 1
        else: 
            freq[item] = 1
  
    return freq
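
For comparison, the standard library's `collections.Counter` does the same counting and can sort by frequency directly. This is only a sketch of an alternative; the flair values below are made up for illustration.

```python
from collections import Counter

# Made-up flair values for illustration
flairs_seen = ["Politics", "Food", "Politics", "AskIndia", "Politics", "Food"]

freq = Counter(flairs_seen)
# most_common() returns (value, count) pairs sorted by descending count
print(freq.most_common())
```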

In [4]:
unique = set(list(data['flair']))
print(unique)

{nan, 'Photography', 'Not in English.', 'Non-Political', 'AskIndia', '| Not Original/Relevant Title | | Repost |', '| Not Original/Relevant Title |', '| Witch-hunting/Targeting User | Meta.', 'Low Quality/Non OC Meme', '| Not specific to India |', '| Stickied Topic |', '| Self-promotion |', '| Repost |', 'AMA', 'Science/Technology', 'Business/Finance', 'Sports', '| Not Original/Relevant Title | | Social Media Rules |', '| Not specific to India | Low Quality/Non OC Meme', 'Megathread', 'Shitpost', 'Meta.', 'Coronavirus', 'PARTAYYY AGAIN :D', 'Personal/Unverified Twitter.', '| Custom (Informed OP) |', 'Food', '| Low-effort Self Post | | Repost |', 'Not Appropriate Subreddit', 'Post link Directly', '| [OLD] Content |', 'Meta', 'Science &amp; Technology', 'Verified', '| Low-effort Self Post | Post link Directly', '| Personal/Unverified Social Media |', 'CAA-NRC', 'Unverified', 'Dead Link', '[R]eddiquette', 'Low-effort self-post.', '| Unverified Content / Disreputed Source |', 'Meta. | Cust

In [5]:
freq = CountFrequency(list(data['flair']))
sorted_freq = {k: v for k, v in sorted(freq.items(), key=lambda item: item[1], reverse=True)}
print(sorted_freq)


{nan: 108986, 'Politics': 34490, 'Non-Political': 29435, 'AskIndia': 19358, 'Business/Finance': 7812, 'Coronavirus': 6057, 'Science/Technology': 5637, 'Policy/Economy': 5106, '[R]eddiquette': 3643, 'Photography': 3588, 'Sports': 2137, 'Food': 1606, 'CAA-NRC': 1070, 'Demonetization': 841, 'All CAPS.': 787, 'Not in English.': 780, 'Scheduled': 729, 'Low-effort self-post.': 564, 'CAA-NRC-NPR': 72, '| Repost |': 38, '| Not Original/Relevant Title |': 33, '| Not specific to India |': 26, '| Unverified Content / Disreputed Source |': 26, '| Low-effort Self Post |': 21, 'Low Quality/Non OC Meme': 19, 'Shitpost': 10, 'Meta.': 10, '| Not in English |': 9, '| Image Rule Violation |': 8, 'Science &amp; Technology': 7, '| Personal/Unverified Social Media |': 7, 'AMA': 6, '| Social Media Rules |': 6, 'Personal/Unverified Twitter.': 5, '| Self-promotion |': 5, 'Unverified': 4, 'Not Appropriate Subreddit': 4, '| Custom (Informed OP) |': 4, 'Post link Directly': 4, '| Not in English | | Not Original/R

In [6]:
data = data.replace('CAA-NRC-NPR', 'CAA-NRC')

In [7]:
freq = CountFrequency(list(data['flair']))
sorted_freq = {k: v for k, v in sorted(freq.items(), key=lambda item: item[1], reverse=True)}
print(sorted_freq)


{nan: 108986, 'Politics': 34490, 'Non-Political': 29435, 'AskIndia': 19358, 'Business/Finance': 7812, 'Coronavirus': 6057, 'Science/Technology': 5637, 'Policy/Economy': 5106, '[R]eddiquette': 3643, 'Photography': 3588, 'Sports': 2137, 'Food': 1606, 'CAA-NRC': 1142, 'Demonetization': 841, 'All CAPS.': 787, 'Not in English.': 780, 'Scheduled': 729, 'Low-effort self-post.': 564, '| Repost |': 38, '| Not Original/Relevant Title |': 33, '| Not specific to India |': 26, '| Unverified Content / Disreputed Source |': 26, '| Low-effort Self Post |': 21, 'Low Quality/Non OC Meme': 19, 'Shitpost': 10, 'Meta.': 10, '| Not in English |': 9, '| Image Rule Violation |': 8, 'Science &amp; Technology': 7, '| Personal/Unverified Social Media |': 7, 'AMA': 6, '| Social Media Rules |': 6, 'Personal/Unverified Twitter.': 5, '| Self-promotion |': 5, 'Unverified': 4, 'Not Appropriate Subreddit': 4, '| Custom (Informed OP) |': 4, 'Post link Directly': 4, '| Not in English | | Not Original/Relevant Title |': 3

In [8]:
flairs = ['Business/Finance', 'Policy/Economy', 'Photography', 'Politics', 'Sports', '[R]eddiquette', 'Food', 'Science/Technology', 'AskIndia','CAA-NRC', 'Coronavirus']

Building a balanced dataset by randomly sampling a maximum of n (defined below) posts per flair from the scraped CSV.

In [9]:
n = 2000

np.random.seed(42)
keep = []
flairs = [flair for flair in flairs if not str(flair) == 'nan']
for flair in flairs:
    l = len(data[data['flair'] == flair])
    if l > n:
        l = n
    idx = list(data[data['flair'] == flair]['id'])
    c = np.random.choice(idx, l, replace=False)
    for i in c:
        keep.append(i)

print (len(keep))

20748
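
The total can be checked by hand: each flair contributes the smaller of its post count and n. The sketch below repeats that arithmetic with the counts copied from the frequency output printed earlier (after merging CAA-NRC-NPR into CAA-NRC).

```python
# Counts copied from the frequency output printed earlier in the notebook
counts = {
    'Business/Finance': 7812, 'Policy/Economy': 5106, 'Photography': 3588,
    'Politics': 34490, 'Sports': 2137, '[R]eddiquette': 3643, 'Food': 1606,
    'Science/Technology': 5637, 'AskIndia': 19358, 'CAA-NRC': 1142,
    'Coronavirus': 6057,
}
n = 2000

# Each flair is capped at n posts; smaller flairs keep all of their posts
per_flair = {flair: min(count, n) for flair, count in counts.items()}
total = sum(per_flair.values())
print(total)  # 20748
```

Nine flairs reach the cap of 2000, while Food (1606) and CAA-NRC (1142) fall short of it, so the dataset is capped rather than perfectly balanced.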


In [10]:
data = data[data['id'].isin(keep)]

In [11]:
data.to_csv("scrapped_reddit_flared_2000_new_classes.csv",index=False)

In [12]:
data.head()

Unnamed: 0,id,title,url,author,score,publish_date,num_comment,permalink,flair,selftext
4,abd9dz,"Allow banks to hold passports of loan-takers, ...",https://www.reddit.com/r/india/comments/abd9dz...,askquestionsdude,1,2019-01-01 00:45:17,0,/r/india/comments/abd9dz/allow_banks_to_hold_p...,Business/Finance,[removed]
6,abde0g,Tamil Nadu to usher in New Year on green note ...,https://www.livemint.com/Politics/95rPBVTWmQqM...,askquestionsdude,1,2019-01-01 01:00:15,3,/r/india/comments/abde0g/tamil_nadu_to_usher_i...,Policy/Economy,
23,abe9wz,"Worst of the NPA crisis is over, says RBI report",https://www.livemint.com/Industry/LohId3yWEeQ1...,harddisc,1,2019-01-01 02:56:12,0,/r/india/comments/abe9wz/worst_of_the_npa_cris...,Policy/Economy,
41,aber36,Ravi Shastri's comments about O'Keefe,https://www.reddit.com/r/india/comments/aber36...,trowaaaay,1,2019-01-01 04:03:40,3,/r/india/comments/aber36/ravi_shastris_comment...,Sports,"Kerry O' Keefe joked that Mayank Agarwal ""appa..."
42,abervj,A picture I clicked at the pillar rocks in Kod...,https://i.redd.it/s2gh1xg4kq721.jpg,Daiguren_Hyorinmaru_,1,2019-01-01 04:07:10,52,/r/india/comments/abervj/a_picture_i_clicked_a...,Photography,


## The code below this heading covers methods I tried out in rough experimental runs

I thought of extracting the comments, for which I have the code below, but later dropped the idea: comments are variable, some posts have comments and some don't, and some comments get deleted later. The comments therefore shouldn't affect the flair of the original post and shouldn't be a factor in deciding it.

### PRAW credentials for creating a Reddit object and later using the submission method to extract comments if required

In [17]:
#enter PRAW credentials
client_id = ""
client_secret = ""
user_agent = ""
username = ""
password = ""


In [18]:
reddit = praw.Reddit(client_id = client_id,
                    client_secret = client_secret,
                    user_agent = user_agent,
                    username = username,
                    password = password)


The code below randomly samples a maximum of 10 comments for each post.

In [18]:
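import time
start = time.time()
np.random.seed(42)

all_comments = []  # one list of comment bodies per post
for i, row in data.iterrows():
    comments = []
    num_comm = 10

    submission = reddit.submission(id=row['id'])
    # Drop unexpanded "load more comments" stubs, which have no body
    submission.comments.replace_more(limit=0)
    l = len(submission.comments)

    if l > 0:
        if l < 10:
            num_comm = l
        r = np.random.choice(l, num_comm, replace=False)
        for j in r:
            comments.append(submission.comments[j].body)

    all_comments.append(comments)

# Assigning to `row` inside iterrows() does not modify `data`,
# so the column is set once after the loop
data["comments"] = all_comments

print ((time.time()-start)/60)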

KeyboardInterrupt: 

In [6]:
freq['Coronavirus']

6057

In [5]:
data.head()

Unnamed: 0,id,title,url,author,score,publish_date,num_comment,permalink,flair,selftext
4,abd9dz,"Allow banks to hold passports of loan-takers, ...",https://www.reddit.com/r/india/comments/abd9dz...,askquestionsdude,1,2019-01-01 00:45:17,0,/r/india/comments/abd9dz/allow_banks_to_hold_p...,Business/Finance,[removed]
5,abd9r1,"Allow banks to hold passports of loan-takers, ...",https://m.timesofindia.com/city/chennai/madras...,askquestionsdude,1,2019-01-01 00:46:27,13,/r/india/comments/abd9r1/allow_banks_to_hold_p...,Business/Finance,
6,abde0g,Tamil Nadu to usher in New Year on green note ...,https://www.livemint.com/Politics/95rPBVTWmQqM...,askquestionsdude,1,2019-01-01 01:00:15,3,/r/india/comments/abde0g/tamil_nadu_to_usher_i...,Policy/Economy,
9,abdjri,Some anniversaries in 2019,https://www.reddit.com/r/india/comments/abdjri...,SlightKnife,1,2019-01-01 01:18:48,3,/r/india/comments/abdjri/some_anniversaries_in...,Non-Political,150 year ago:\n\nBirth of Mahatma Gandhi\n\nDe...
22,abe9l1,[NP] PSA: Get vaccinated for rabies if you can.,https://www.reddit.com/r/india/comments/abe9l1...,dangling_asshole,1,2019-01-01 02:54:47,15,/r/india/comments/abe9l1/np_psa_get_vaccinated...,Non-Political,There's been a global shortage of human and eq...


In [22]:
data = SubredditScraper('india', lim=5000000, mode='w', sort='top').get_posts()

SubredditScraper instance created with values sub = india, sort = top, lim = 5000000, mode = w
csv = new_india_posts.csv
After set_sort(), sort = top and sub = india
csv_loaded = 0
Collecting information from r/india.
988 posts collected and saved to new_india_posts.csv


In [17]:
data['comments'] = ""

In [19]:
data.head()

Unnamed: 0,id,title,url,author,score,publish_date,num_comment,permalink,flair,selftext,comments
4,abd9dz,"Allow banks to hold passports of loan-takers, ...",https://www.reddit.com/r/india/comments/abd9dz...,askquestionsdude,1,2019-01-01 00:45:17,0,/r/india/comments/abd9dz/allow_banks_to_hold_p...,Business/Finance,[removed],
5,abd9r1,"Allow banks to hold passports of loan-takers, ...",https://m.timesofindia.com/city/chennai/madras...,askquestionsdude,1,2019-01-01 00:46:27,13,/r/india/comments/abd9r1/allow_banks_to_hold_p...,Business/Finance,,
6,abde0g,Tamil Nadu to usher in New Year on green note ...,https://www.livemint.com/Politics/95rPBVTWmQqM...,askquestionsdude,1,2019-01-01 01:00:15,3,/r/india/comments/abde0g/tamil_nadu_to_usher_i...,Policy/Economy,,
9,abdjri,Some anniversaries in 2019,https://www.reddit.com/r/india/comments/abdjri...,SlightKnife,1,2019-01-01 01:18:48,3,/r/india/comments/abdjri/some_anniversaries_in...,Non-Political,150 year ago:\n\nBirth of Mahatma Gandhi\n\nDe...,
22,abe9l1,[NP] PSA: Get vaccinated for rabies if you can.,https://www.reddit.com/r/india/comments/abe9l1...,dangling_asshole,1,2019-01-01 02:54:47,15,/r/india/comments/abe9l1/np_psa_get_vaccinated...,Non-Political,There's been a global shortage of human and eq...,


In [7]:
data.to_csv("cleaned_scrapped_reddit.csv",index=False)

In [9]:
a = list(data['flair'])

In [11]:
unique = set(a)

In [12]:
unique

{'AskIndia',
 'Business/Finance',
 'Food',
 'Non-Political',
 'Photography',
 'Policy/Economy',
 'Politics',
 'Scheduled',
 'Science/Technology',
 'Sports',
 '[R]eddiquette'}
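
The set above contains flairs such as 'Non-Political' and 'Scheduled' that are not among the categories listed at the top of this notebook. As a rough sketch of how the flair list `a` could be restricted to those categories and counted, assuming the flairs are available as a plain Python list (`TARGET_FLAIRS` and `count_target_flairs` are hypothetical names, not part of the original notebook):

```python
from collections import Counter

# Categories copied from the list at the top of the notebook.
TARGET_FLAIRS = {
    'Business/Finance', 'Policy/Economy', 'Photography', 'Politics',
    'Sports', '[R]eddiquette', 'Food', 'Science/Technology',
    'AskIndia', 'CAA-NRC', 'Coronavirus',
}


def count_target_flairs(flairs):
    """Count posts per flair, ignoring flairs outside TARGET_FLAIRS."""
    return Counter(f for f in flairs if f in TARGET_FLAIRS)


# Toy input standing in for list(data['flair']).
sample = ['AskIndia', 'Non-Political', 'Politics', 'Scheduled', 'AskIndia']
counts = count_target_flairs(sample)
print(counts)
```

The per-flair counts give a quick view of class balance before any random sampling is done.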

In [4]:
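from os.path import isfile
from time import sleep


# Assumes a module-level `reddit` PRAW instance created earlier in the notebook.
class SubredditScraper: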

    def __init__(self, sub, sort='new', lim=900, mode='w'):
        self.sub = sub
        self.sort = sort
        self.lim = lim
        self.mode = mode

        print(
            f'SubredditScraper instance created with values '
            f'sub = {sub}, sort = {sort}, lim = {lim}, mode = {mode}')

    def set_sort(self):
        if self.sort == 'new':
            return self.sort, reddit.subreddit(self.sub).new(limit=self.lim)
        elif self.sort == 'top':
            return self.sort, reddit.subreddit(self.sub).top(limit=self.lim)
        elif self.sort == 'hot':
            return self.sort, reddit.subreddit(self.sub).hot(limit=self.lim)
        elif self.sort == 'controversial':
            return self.sort, reddit.subreddit(self.sub).controversial(limit=self.lim)
        elif self.sort == 'gilded':
            return self.sort, reddit.subreddit(self.sub).gilded(limit=self.lim)
        else:
            self.sort = 'hot'
            print('Sort method was not recognized, defaulting to hot.')
            return self.sort, reddit.subreddit(self.sub).hot(limit=self.lim)

    def get_posts(self):
        """Get unique posts from a specified subreddit."""

        sub_dict = {
            'selftext': [], 'title': [], 'id': [], 'sorted_by': [],
            'num_comments': [], 'score': [], 'ups': [], 'downs': [],
            'link_flair_text': [], 'url': [] }
        csv = f'new_{self.sub}_posts.csv'

        # Attempt to specify a sorting method.
        sort, subreddit = self.set_sort()

        # Set csv_loaded to True if csv exists since you can't evaluate the
        # truth value of a DataFrame.
        df, csv_loaded = (pd.read_csv(csv), 1) if isfile(csv) else ('', 0)

        print(f'csv = {csv}')
        print(f'After set_sort(), sort = {sort} and sub = {self.sub}')
        print(f'csv_loaded = {csv_loaded}')

        print(f'Collecting information from r/{self.sub}.')
        # Ids already stored in the csv; empty when no csv was loaded.
        known_ids = set(df.id) if csv_loaded else set()
        for post in subreddit:

            # A post is unique when its id is not already in the csv.
            # With no csv loaded, every post counts as unique.
            unique_id = post.id not in known_ids

            # Save any unique post to sub_dict.
            if unique_id:
                sub_dict['selftext'].append(post.selftext)
                sub_dict['title'].append(post.title)
                sub_dict['id'].append(post.id)
                sub_dict['sorted_by'].append(sort)
                sub_dict['num_comments'].append(post.num_comments)
                sub_dict['score'].append(post.score)
                sub_dict['ups'].append(post.ups)
                sub_dict['downs'].append(post.downs)
                sub_dict['link_flair_text'].append(post.link_flair_text)
                sub_dict['url'].append(post.url)
            sleep(0.1)

        # pprint(sub_dict)
        new_df = pd.DataFrame(sub_dict)

        # Add new_df to df if df exists then save it to a csv.
        if isinstance(df, pd.DataFrame) and self.mode == 'w':
            pd.concat([df, new_df], axis=0, sort=0).to_csv(csv, index=False)
            print(
                f'{len(new_df)} new posts collected and added to {csv}')
        elif self.mode == 'w':
            new_df.to_csv(csv, index=False)
            print(f'{len(new_df)} posts collected and saved to {csv}')
        else:
            print(
                f'{len(new_df)} posts were collected but they were not '
                f'added to {csv} because mode was set to "{self.mode}"')
            
        return new_df