For recent data, we use PRAW to fetch submissions. However, the number of submissions PRAW can return is limited, and the Reddit API removed its timestamp-based search, so for data from previous years we use the Reddit dataset made available by Jason Michael Baumgartner on Google BigQuery.
Two approaches are therefore used to collect the data.

Part I:

Import the libraries needed for Reddit data collection:
1. praw - "Python Reddit API Wrapper", a Python package that allows simple access to Reddit's API
2. pandas - open source data analysis and manipulation tool
3. numpy - numerical computing library, used later for random sampling
4. matplotlib and seaborn - to visualise the data


In [15]:
import praw
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Create a Reddit instance to access the data

In [2]:
reddit=praw.Reddit(client_id='',
                   client_secret='',
                   username='',
                   password='',
                   user_agent=''
                   )
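
The credentials above are left blank on purpose. One way to keep them out of the notebook entirely is to read them from environment variables. This is a minimal sketch, not part of the original workflow: the variable names (`REDDIT_CLIENT_ID` and so on) and the helper `load_reddit_credentials` are hypothetical.

```python
import os

# Hypothetical environment variable names, one per praw.Reddit keyword argument.
ENV_TO_KWARG = {
    "REDDIT_CLIENT_ID": "client_id",
    "REDDIT_CLIENT_SECRET": "client_secret",
    "REDDIT_USERNAME": "username",
    "REDDIT_PASSWORD": "password",
    "REDDIT_USER_AGENT": "user_agent",
}

def load_reddit_credentials(environ=None):
    """Return a dict of praw.Reddit keyword arguments read from the environment."""
    environ = os.environ if environ is None else environ
    # Fail early with a readable message if anything is unset or empty.
    missing = [name for name in ENV_TO_KWARG if not environ.get(name)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {kwarg: environ[name] for name, kwarg in ENV_TO_KWARG.items()}

# Usage (requires praw and the variables above to be set):
# reddit = praw.Reddit(**load_reddit_credentials())
```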

Get the subreddit r/india and create a list to hold all the flairs needed. (The allowed flairs are listed at https://www.reddit.com/r/india/wiki/rules )

In [3]:
subreddit=reddit.subreddit('india')

flair_list=['AskIndia','Non-Political','Scheduled','Photography','Science/Technology','Politics','Business/Finance','Policy/Economy','Sports','Food']

Inspect the attributes of a submission from the subreddit to see which ones can be used as features.

In [4]:
submissions=subreddit.search(flair_list[0],limit=1)
for submission in submissions:
    print(dir(submission))

['STR_FIELD', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_chunk', '_comments_by_id', '_fetch', '_fetch_data', '_fetch_info', '_fetched', '_kind', '_reddit', '_reset_attributes', '_safely_add_arguments', '_url_parts', '_vote', 'all_awardings', 'allow_live_comments', 'approved_at_utc', 'approved_by', 'archived', 'author', 'author_flair_background_color', 'author_flair_css_class', 'author_flair_richtext', 'author_flair_template_id', 'author_flair_text', 'author_flair_text_color', 'author_flair_type', 'author_fullname', 'author_patreon_flair', 'author_premium', 'awarders', 'banned_at_utc', 'banned_by', 'can_gild', 'can_mod_post', 'category', 'clear_vote', 'clicked', 'comment_lim
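
Most of the names printed by `dir()` are dunder or private members that are of no use as features. A small sketch of how the list could be narrowed to public names; `public_attributes` and the stand-in class `FakeSubmission` are hypothetical and only there so the example runs without network access.

```python
def public_attributes(obj):
    """Return the names from dir(obj) that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]

class FakeSubmission:
    """Stand-in for a PRAW submission, used only for illustration."""
    def __init__(self):
        self.title = "example title"
        self.score = 1
        self._fetched = True  # private, filtered out below

print(public_attributes(FakeSubmission()))
```

With a real submission, `public_attributes(submission)` would leave candidates such as `title`, `score` and `domain` while dropping the `__...__` and `_...` entries.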

Create a pandas DataFrame to hold the attributes we want to save and use for further investigation.

In [5]:
df=pd.DataFrame(columns=['flair','title','author','text','url','comments','score','domain'])

Loop over the flairs, fetch the information for each one and store it in the DataFrame (at most 200 submissions per flair).

In [14]:
rows=[]
for flair in flair_list:
    list_of_submission=subreddit.search(flair,limit=200)
    for submission in list_of_submission:
        if not submission.stickied:
            # Expand the comment tree, dropping unresolved "load more" stubs.
            submission.comments.replace_more(limit=0)
            comment_list=submission.comments.list()
            # One comment body per line.
            comments='\n'.join(comment.body for comment in comment_list)
            rows.append({'flair':flair,'title':submission.title,'author':submission.author,'text':submission.selftext,'url':submission.url,'comments':comments,'score':submission.score,'domain':submission.domain})
# DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows.
df=pd.DataFrame(rows,columns=df.columns)
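
Note that `subreddit.search(flair, ...)` runs a plain text search, so a result only has to mention the flair name somewhere, and the loop then labels it with the search term. A sketch of one way to tighten this, assuming Reddit's search syntax accepts a `flair:"..."` filter (worth verifying on a handful of results before relying on it); `flair_query` is a hypothetical helper.

```python
def flair_query(flair):
    """Build a search string restricted to one link flair (assumed syntax)."""
    # Quote the flair so names containing '/' or '-' stay a single term.
    return 'flair:"{}"'.format(flair)

example_flairs = ['AskIndia', 'Non-Political', 'Science/Technology']
for flair in example_flairs:
    print(flair_query(flair))

# Usage with PRAW (not executed here):
# for submission in subreddit.search(flair_query(flair), limit=200):
#     label = submission.link_flair_text  # the flair Reddit actually recorded
```

Storing `submission.link_flair_text` instead of the search term would also guard against mislabelled rows.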

Save the resulting DataFrame to a CSV file for use later in the project.

In [None]:
#df.to_csv(r'reddit_flair.csv',index=False)
df.to_csv(r'reddit_flair3.csv',index=False)

(This cell was run twice, a few days apart, to pick up more recent data. Each run was saved to a different file, which is why one of the lines is commented out.)
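
An alternative to commenting lines in and out is to put the collection date in the file name. This is only a sketch: `dated_filename` is a hypothetical helper, the date shown is arbitrary, and the later cells that read `reddit_flair.csv` and `reddit_flair3.csv` would have to use the generated names instead.

```python
import datetime

def dated_filename(prefix, day=None, ext="csv"):
    """Return a file name of the form '<prefix>_<YYYY-MM-DD>.<ext>'."""
    day = datetime.date.today() if day is None else day
    return "{}_{}.{}".format(prefix, day.isoformat(), ext)

# Arbitrary example date, used only to show the format.
print(dated_filename("reddit_flair", datetime.date(2020, 4, 12)))

# Usage (not executed here):
# df.to_csv(dated_filename("reddit_flair"), index=False)
```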

Part II:

Use Google BigQuery to get data from the dataset.


In [10]:
import numpy as np
import pandas as pd
from google.cloud import bigquery
import praw
import os
import datetime
os.environ["GOOGLE_APPLICATION_CREDENTIALS"]=r""


reddit=praw.Reddit(client_id='',
                   client_secret='',
                   username='',
                   password='',
                   user_agent=''
                   )

client=bigquery.Client()


QUERY_POSTS=(
'SELECT * '
'FROM `fh-bigquery.reddit_posts.201*` '
'WHERE subreddit = "india" AND link_flair_text IN ("AskIndia","Non-Political","Scheduled","Photography","Science/Technology","Politics","Business/Finance","Policy/Economy","Sports","Food") '
'LIMIT 100000'
)

query_job = client.query(QUERY_POSTS)
query = query_job.result().to_dataframe()


# Save the raw query result for the next step.
query.to_csv(r'reddit_flair2.csv',index=False)
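
The flair names are written out twice, once in `flair_list` and once inside the SQL string, so the two can drift apart. A sketch of building the query text from the list instead; `build_posts_query` is a hypothetical helper, and plain string formatting is acceptable here only because the flair names are a fixed list defined in the notebook (for untrusted input, BigQuery's query parameters would be the safer route).

```python
def build_posts_query(flairs, limit=100000):
    """Assemble the posts query text from a list of flair names."""
    # Each flair becomes a double-quoted SQL string literal.
    quoted = ",".join('"{}"'.format(f) for f in flairs)
    return (
        "SELECT * "
        "FROM `fh-bigquery.reddit_posts.201*` "
        'WHERE subreddit = "india" AND link_flair_text IN ({}) '
        "LIMIT {}"
    ).format(quoted, int(limit))

print(build_posts_query(["AskIndia", "Food"], limit=10))

# Usage (not executed here):
# query_job = client.query(build_posts_query(flair_list))
```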




Keep only the features we need from the dataset, then randomly sample at most 2000 posts per flair.

In [None]:
df2=pd.read_csv('reddit_flair2.csv')
df3=df2[['link_flair_text','title','author','selftext','url','id','score','domain']]

keep = []
flairs = ['AskIndia','Non-Political','Scheduled','Photography','Science/Technology','Politics','Business/Finance','Policy/Economy','Sports','Food']
for flair in flairs:
    # ids of all posts carrying this flair
    idx = list(df3[df3['link_flair_text'] == flair]['id'])
    # sample without replacement, capped at 2000 posts per flair
    l = min(len(idx), 2000)
    keep.extend(np.random.choice(idx, l, replace=False))

# .copy() so that adding a 'comments' column later does not write to a view of df3
df4 = df3[df3['id'].isin(keep)].copy()
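
The same per-flair cap can be expressed without an explicit loop. The sketch below shuffles the rows once and then keeps the first `cap` rows of each group; `cap_per_flair`, the seed and the toy data are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def cap_per_flair(frame, flair_col, cap, seed=0):
    """Return at most `cap` randomly chosen rows for each value of `flair_col`."""
    shuffled = frame.sample(frac=1, random_state=seed)  # shuffle all rows
    # head(cap) keeps the first `cap` rows of every group, in shuffled order.
    return shuffled.groupby(flair_col).head(cap)

toy = pd.DataFrame({
    "link_flair_text": ["Food"] * 5 + ["Sports"] * 2,
    "id": list("abcdefg"),
})
capped = cap_per_flair(toy, "link_flair_text", cap=3)
print(capped["link_flair_text"].value_counts().to_dict())
```

Fixing the seed makes the sample reproducible between runs, which the `np.random.choice` version does not guarantee unless a seed is set beforehand.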

The snippet above only retrieves the posts. We also need their comments, which we fetch with PRAW.

In [None]:
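def getcomments(id_num):
    # Look the submission up by its id and drop unresolved "load more" stubs.
    submission=reddit.submission(id=id_num)
    submission.comments.replace_more(limit=0)
    sub_comments=''
    # Iterating the comment forest directly yields top-level comments only.
    for i,comment in enumerate(submission.comments):
        sub_comments+=comment.body
        # Stop after index 10, i.e. at most 11 top-level comments per post.
        if i==10:
            break
    return sub_comments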
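
The loop in `getcomments` breaks after index 10, so it keeps up to 11 comments, and it concatenates the bodies with no separator, so the last word of one comment runs into the first word of the next. A sketch of the same idea with an explicit limit and a newline separator; `join_first_comments` and the stand-in `FakeComment` class are hypothetical.

```python
from itertools import islice

def join_first_comments(comments, limit=10, sep="\n"):
    """Join the bodies of the first `limit` comments, separated by `sep`."""
    return sep.join(comment.body for comment in islice(comments, limit))

class FakeComment:
    """Stand-in for a PRAW comment, used only for illustration."""
    def __init__(self, body):
        self.body = body

fake_comments = [FakeComment("comment {}".format(i)) for i in range(15)]
text = join_first_comments(fake_comments, limit=10)
print(len(text.split("\n")))
```

Inside `getcomments`, the loop could then be replaced by `return join_first_comments(submission.comments)`.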

In [None]:
df4['comments']=df4['id'].apply(getcomments)
df4[['id','comments']].head()
print('done')

df4.to_csv('out.csv')
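
Applying `getcomments` to every id makes one or more network requests per post, so a single failing request would abort the whole run. A hedged sketch of a wrapper that records the failure and carries on; `with_fallback` is a hypothetical helper, and catching every exception is a deliberate simplification.

```python
def with_fallback(func, fallback=""):
    """Wrap `func` so that any exception yields `fallback` instead of propagating."""
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as exc:  # broad on purpose: network and API errors vary
            print("skipped {!r}: {}".format(args, exc))
            return fallback
    return wrapper

def flaky(value):
    """Toy function that fails for one particular input."""
    if value == "bad":
        raise ValueError("boom")
    return value.upper()

safe = with_fallback(flaky)
print([safe(v) for v in ["ok", "bad", "fine"]])

# Usage (not executed here):
# df4['comments'] = df4['id'].apply(with_fallback(getcomments))
```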

The data is now collected and stored. The recent data is in reddit_flair.csv and reddit_flair3.csv, and the data from previous years is in out.csv.

Now we combine all of these files and save the result in final_db_2.csv.

In [None]:
df1_1=pd.read_csv('reddit_flair3.csv')
df1_2=pd.read_csv('reddit_flair.csv')

In [None]:
df1=pd.concat([df1_1,df1_2],ignore_index=True)

In [None]:
df2=pd.read_csv('out.csv')
df2.drop('Unnamed: 0',axis=1,inplace=True)
df2.drop('id',axis=1,inplace=True)
df2.rename(columns={"selftext": "text","link_flair_text":"flair"},inplace=True)

In [None]:
df_final=pd.concat([df1,df2],ignore_index=True)
df_final.to_csv('final_db_2.csv',index=False)
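
`pd.concat` aligns frames by column name and fills anything that does not line up with NaN, so a missed rename goes unnoticed. A small sketch of a check that could be run before concatenating; `column_mismatch` and the toy frames are hypothetical.

```python
import pandas as pd

def column_mismatch(left, right):
    """Return (only_in_left, only_in_right) as sorted lists of column names."""
    left_cols, right_cols = set(left.columns), set(right.columns)
    return sorted(left_cols - right_cols), sorted(right_cols - left_cols)

# Toy frames: the second one still uses the BigQuery column name 'selftext'.
recent = pd.DataFrame(columns=["flair", "title", "text"])
older = pd.DataFrame(columns=["flair", "title", "selftext"])
print(column_mismatch(recent, older))

# Usage (not executed here):
# assert column_mismatch(df1, df2) == ([], [])
```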

The EDA in the next notebook shows that the dataset is heavily imbalanced, so the under-represented flairs are oversampled here and the result is stored in corrected_dataset.csv.

In [None]:
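df=pd.read_csv('final_db_2.csv')
# Add 550 extra 'Scheduled' rows, 600 extra 'Food' rows and a full copy of the 'Photography' rows.
# sample() draws without replacement here, so each flair needs at least that many rows.
df_final=pd.concat([
    df,
    df[df['flair']=='Scheduled'].sample(n=550),
    df[df['flair']=='Food'].sample(n=600),
    df[df['flair']=='Photography']
])
# Add a further 500 'Photography' rows, sampled from the already doubled set.
df_final=pd.concat([
    df_final,
    df_final[df_final['flair']=='Photography'].sample(n=500),
])
df_final.to_csv('corrected_dataset.csv')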
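
The oversampling above is written out flair by flair. A generic version could look like the sketch below, which samples with replacement so that the requested number of extra rows may exceed the rows available; `oversample`, the seed and the toy data are illustrative assumptions rather than the method used in this notebook.

```python
import pandas as pd

def oversample(frame, label_col, extra, seed=0):
    """Append `extra[label]` resampled rows for each label listed in `extra`."""
    parts = [frame]
    for label, n in extra.items():
        rows = frame[frame[label_col] == label]
        # replace=True lets n exceed the number of available rows.
        parts.append(rows.sample(n=n, replace=True, random_state=seed))
    return pd.concat(parts, ignore_index=True)

toy = pd.DataFrame({
    "flair": ["Food", "Food", "Politics"],
    "title": ["a", "b", "c"],
})
balanced = oversample(toy, "flair", {"Food": 4})
print(balanced["flair"].value_counts().to_dict())
```

For this notebook the call would be along the lines of `oversample(df, 'flair', {'Scheduled': 550, 'Food': 600})`, with the 'Photography' counts chosen to match the two-step duplication above.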