# **Data collection**

To collect the data needed to classify posts, I explored PRAW (the Python Reddit API Wrapper) and Google BigQuery's RedditPosts dataset. PRAW lacked the ability to filter posts by flair. The BigQuery dataset was a strong option, but it only contained posts from December 2015 to December 2019 and therefore lacked the posts tagged 'Coronavirus' in r/india. As a result, I decided to use the pushshift.io API to acquire the latest posts from r/india. I considered a total of six flairs, based on my understanding of which posts and flairs were popular on Reddit at the time:

```
1.   Coronavirus
2.   Science/Technology
3.   Politics
4.   Non-Political
5.   AskIndia
6.   Policy/Economy
```

In [0]:
import pandas as pd
import numpy as np
import requests
import calendar
import time

#### **Total posts per flair**

To prevent class imbalance, each flair is capped at a maximum of 6000 posts. This number was chosen based on the flair with the fewest posts, which in this case was Coronavirus with around 6200 posts.

In [0]:
POSTS_PER_CLASS = 6000
flairs = ['Politics', 'Coronavirus', 'Non-Political', 'Policy/Economy', 'AskIndia', 'Science/Technology']
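
The capping idea can be illustrated on toy data. This is a minimal sketch, not the notebook's own code: the helper name `cap_per_class` and the toy rows are hypothetical, and only the column name `link_flair_text` comes from the dataset used here. Classes above the cap are randomly downsampled; smaller classes are kept whole.

```python
import pandas as pd

def cap_per_class(df, label_col, cap, seed=0):
    """Downsample each class to at most `cap` rows; smaller classes are kept whole."""
    parts = []
    for _, group in df.groupby(label_col, sort=False):
        if len(group) > cap:
            # Random sample without replacement, seeded for reproducibility.
            group = group.sample(n=cap, random_state=seed)
        parts.append(group)
    return pd.concat(parts, ignore_index=True)

# Toy data: three flairs with unequal counts (illustrative values only).
toy = pd.DataFrame({
    'link_flair_text': ['Politics'] * 5 + ['Coronavirus'] * 3 + ['AskIndia'] * 4,
    'id': [f'p{i}' for i in range(12)],
})

balanced = cap_per_class(toy, 'link_flair_text', cap=3)
counts = balanced['link_flair_text'].value_counts().to_dict()
print(counts)
```

Setting the cap at or below the size of the smallest class, as done with `POSTS_PER_CLASS` here, leaves every class with the same number of rows.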

#### **Collecting the data**

The pushshift API allows the user to download more than 1000 posts from a subreddit and to filter posts by the date they were posted. For each flair, I ran a loop that collected posts backwards from the current time until at least 6000 posts with that flair had been collected. Finally, 6000 posts were sampled from all the posts collected for each flair.
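
The backward pagination pattern described here can be sketched independently of the network. In the sketch below, `collect_backwards`, `fake_fetch`, and `FAKE_POSTS` are hypothetical names introduced for illustration; `fake_fetch` merely stands in for an API call that returns posts older than a given timestamp, newest first. It is a simplified model of the loop, not the exact code used in this notebook.

```python
def collect_backwards(fetch_page, start_time, target, max_idle=5):
    """Walk backwards in time, page by page, until `target` unique posts are collected.

    `fetch_page(before)` is a hypothetical callable returning a list of post
    dicts (each with 'id' and 'created_utc'), newest first, created before `before`.
    """
    seen_ids = set()   # set membership is O(1), unlike scanning a list of ids
    collected = []
    before = start_time
    idle = 0           # pages that contributed no new post
    while len(collected) < target and idle < max_idle:
        page = fetch_page(before)
        if not page:
            break      # no older posts left
        added = 0
        for post in page:
            if post['id'] not in seen_ids:
                seen_ids.add(post['id'])
                collected.append(post)
                added += 1
        if added == 0:
            idle += 1
        # Continue from the oldest post on this page.
        before = page[-1]['created_utc']
    return collected

# Stand-in for the API: 25 fake posts with timestamps 1000, 999, ..., 976.
FAKE_POSTS = [{'id': f'id{t}', 'created_utc': t} for t in range(1000, 975, -1)]

def fake_fetch(before, limit=10):
    older = [p for p in FAKE_POSTS if p['created_utc'] < before]
    return older[:limit]

posts = collect_backwards(fake_fetch, start_time=1001, target=15)
print(len(posts))
```

Because whole pages are consumed, the loop can overshoot the target, which is why a final sampling step is still needed. One caveat of moving the cursor to the oldest timestamp on the page: if several posts share that exact timestamp across a page boundary, a strict "before" filter may skip some of them.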

In [0]:
dataset_df = pd.DataFrame(columns=['created_utc', 'url', 'num_comments', 'selftext', 'title', 'over_18', 'link_flair_text', 'id', 'permalink'])

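for flair in flairs:
  print("Getting data for", flair)
  # Start from the current UTC time and page backwards through older posts.
  epoch_time = calendar.timegm(time.gmtime())
  idle_attempts = 0  # number of pages that contained no post with this flair
  posts = 0          # number of posts collected for this flair so far
  data_dict = {'created_utc': [], 'url': [], 'num_comments': [], 'selftext': [], 'title': [], 'over_18': [], 'link_flair_text': [],
               'id': [], 'permalink': []}
  while posts < POSTS_PER_CLASS:
    flag = False  # set to True if this page contains at least one post with the flair
    URL = 'https://api.pushshift.io/reddit/submission/search/'
    # Request up to 1000 r/india submissions created before epoch_time, newest first.
    PARAMS = {'subreddit': 'india', 'before': epoch_time, 'sort': 'desc', 'limit': 1000}
    r = requests.get(url = URL, params = PARAMS)
    ret = r.json()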
    if 'data' in ret and len(ret['data']) > 0:
      data = ret['data']
      for ele in data:
        if 'link_flair_text' in ele and ele['link_flair_text'] == flair:
          flag = True
          if ele['id'] not in data_dict['id']:
            data_dict['created_utc'].append(ele['created_utc'])
            data_dict['url'].append(ele['url'])
            data_dict['num_comments'].append(ele['num_comments'])
            data_dict['selftext'].append(ele['selftext'] if 'selftext' in ele else "")
            data_dict['title'].append(ele['title'])
            data_dict['over_18'].append(ele['over_18'])
            data_dict['link_flair_text'].append(ele['link_flair_text'])
            data_dict['id'].append(ele['id'])
            data_dict['permalink'].append(ele['permalink'])
            posts += 1
            if posts%1000 == 0:
              print("Posts:", posts)
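      # A page with no matching flair counts as an idle attempt.
      if flag is False:
        idle_attempts += 1
        print("idle attempts:", idle_attempts)
      if idle_attempts > 5:
        print("More than 5 idle attempts, breaking loop")
        break
      # Continue from the oldest post on this page.
      epoch_time = data[-1]['created_utc']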
    else:
      break
  dict_df = pd.DataFrame.from_dict(data_dict)
  print("Data for flair: " + flair + " " + str(len(dict_df)))
  # Downsample to the cap if more posts than needed were collected.
  if len(dict_df) > POSTS_PER_CLASS:
    dict_df = dict_df.sample(n = POSTS_PER_CLASS)
  dataset_df = pd.concat([dataset_df, dict_df], ignore_index=True)
  # Note: this prints the cumulative size of dataset_df, not the per-flair count.
  print("Data for flair after sampling: " + flair + " " + str(len(dataset_df)))

Getting data for Politics
Posts: 1000
Posts: 2000
Posts: 3000
Posts: 4000
Posts: 5000
Posts: 6000
Data for flair: Politics 6216
Data for flair after sampling: Politics 6000
Getting data for Coronavirus
Posts: 1000
Posts: 2000
Posts: 3000
Posts: 4000
Posts: 5000
Posts: 6000
Data for flair: Coronavirus 6020
Data for flair after sampling: Coronavirus 12000
Getting data for Non-Political
Posts: 1000
Posts: 2000
Posts: 3000
Posts: 4000
Posts: 5000
Posts: 6000
Data for flair: Non-Political 6048
Data for flair after sampling: Non-Political 18000
Getting data for Policy/Economy
Posts: 1000
Posts: 2000
Posts: 3000
Posts: 4000
Posts: 5000
Posts: 6000
Data for flair: Policy/Economy 6014
Data for flair after sampling: Policy/Economy 24000
Getting data for AskIndia
Posts: 1000
Posts: 2000
Posts: 3000
Posts: 4000
Posts: 5000
Posts: 6000
Data for flair: AskIndia 6059
Data for flair after sampling: AskIndia 30000
Getting data for Science/Technology
Posts: 1000
Posts: 2000
Posts: 3000
Posts: 4000
Posts

In [0]:
dataset_df.head()

| | created_utc | url | num_comments | selftext | title | over_18 | link_flair_text | id | permalink |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1582112360 | https://v.redd.it/djadxew3avh41 | 23 | | Standing ovation and a huge round of applause ... | False | Politics | f69phh | /r/india/comments/f69phh/standing_ovation_and_... |
| 1 | 1583239508 | https://www.youtube.com/watch?v=_EIDILUGKaQ | 0 | | Narendra Modi Giving Up Social Media \| Delhi s... | False | Politics | fcu1ck | /r/india/comments/fcu1ck/narendra_modi_giving_... |
| 2 | 1583032675 | https://www.reddit.com/r/india/comments/fbo27x... | 398 | Those fuckers killing, maiming or hurting othe... | I am Hindu. | False | Politics | fbo27x | /r/india/comments/fbo27x/i_am_hindu/ |
| 3 | 1579961368 | https://www.instagram.com/tv/B7tEQtmgZC0/?igsh... | 1 | | Indian Prime Minister's advice makes fair and ... | False | Politics | etr6v9 | /r/india/comments/etr6v9/indian_prime_minister... |
| 4 | 1581027603 | https://i.redd.it/4cgfvu5iodf41.png | 0 | | I know it is a bit too harsh, but is BJP causi... | False | Politics | f008vi | /r/india/comments/f008vi/i_know_it_is_a_bit_to... |


The collected posts were saved as a CSV file.

In [0]:
dataset_df.to_csv('/content/drive/My Drive/Reddit_dataset/dataset_final.csv')
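
One detail worth noting when saving: by default, `to_csv` writes the DataFrame index as an unnamed first column, and `read_csv` then loads it back as `Unnamed: 0`, which is likely where the `Unnamed: 0` column in the preview above comes from. The sketch below uses a tiny made-up frame and in-memory buffers (assumptions for illustration, not the notebook's data or paths) to show the difference that `index=False` makes.

```python
import io
import pandas as pd

# Tiny illustrative frame; the real dataset has more columns.
df = pd.DataFrame({'id': ['f69phh', 'fcu1ck'],
                   'link_flair_text': ['Politics', 'Politics']})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
reloaded_default = pd.read_csv(with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_clean = pd.read_csv(without_index)

print(list(reloaded_default.columns))
print(list(reloaded_clean.columns))
```

Passing `index=False` when saving (or `index_col=0` when loading) avoids carrying the extra column into later steps.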