# Reddit Data Collection

Import the pandas library:

In [0]:
import pandas as pd 

PRAW (the Python Reddit API Wrapper) is a Python wrapper for the Reddit API that lets us collect data from Reddit.

**Getting Started**

Install PRAW using pip:

In [0]:
pip install praw

Collecting praw
  Downloading https://files.pythonhosted.org/packages/25/c0/b9714b4fb164368843b41482a3cac11938021871adf99bf5aaa3980b0182/praw-6.5.1-py3-none-any.whl (134kB)

Import PRAW:

In [0]:
import praw

Before collecting any data, we need to authenticate. This is done by creating a `praw.Reddit` instance and passing it a `client_id`, a `client_secret` and a `user_agent`.

In [0]:
#@title Store the `client_id`, `client_secret` and a `user_agent` in a hidden cell
client_id = '1jhEvPDMQYI5ZQ'
client_secret = 'OGids43hs9-E-e6iS9t1JCsW3Es'
user_agent = 'Reddit Webscapping'

In [0]:
reddit = praw.Reddit(client_id=client_id, client_secret=client_secret, user_agent=user_agent)  # initialize the Reddit instance
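
Hard-coding credentials in a notebook makes them easy to leak when the notebook is shared. A minimal sketch of one alternative is shown below, assuming the three values are stored in environment variables. The variable names (`REDDIT_CLIENT_ID` and so on) and the helper `load_reddit_credentials` are hypothetical and chosen only for this example.

```python
import os


def load_reddit_credentials(env=os.environ):
    """Return keyword arguments for praw.Reddit, read from environment variables."""
    # Hypothetical variable names, chosen for this example only.
    names = {
        "client_id": "REDDIT_CLIENT_ID",
        "client_secret": "REDDIT_CLIENT_SECRET",
        "user_agent": "REDDIT_USER_AGENT",
    }
    # Fail early with a clear message if any value is absent or empty.
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}
```

With the variables set, the instance could then be created with `reddit = praw.Reddit(**load_reddit_credentials())`.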

**Get subreddit data**

The following information is collected for each Reddit post:
* `id`: post id
* `title`: title of the post
* `url`: URL of the post
* `score`: score of the post
* `created_at`: when the post was created (Unix timestamp)
* `body`: the text of the post
* `author`: author of the post
* `nups`: number of upvotes of the post
* `ncomments`: number of comments collected for the post
* `comments`: text of the comments on the post
* `flair`: flair of the post


Define the dataframe that will store all the collected information of the Reddit posts. 

In [0]:
# define the dataframe columns
column_names = ['id', 'title', 'url', 'score', 'created_at', 'body', 'author', 'nups', 'ncomments','comments', 'flair'] 

# instantiate the dataframe
df = pd.DataFrame(columns=column_names)
df

Unnamed: 0,id,title,url,score,created_at,body,author,nups,ncomments,comments,flair


List the flairs to collect. Each flair name is used as the search query and is stored as the label of the posts it returns, which will help in classifying the posts later.

In [0]:
flairs = ["AskIndia", "Non-Political", "[R]eddiquette", 
          "Scheduled", "Photography", "Science/Technology",
          "Politics", "Business/Finance", "Policy/Economy",
          "Sports", "Food", "AMA"]

Use the `praw.Reddit` instance to collect data from [r/India](https://www.reddit.com/r/india/).  

In [0]:
subreddit_india = reddit.subreddit('India')
subreddit_india

Subreddit(display_name='India')

The posts are collected by searching r/India for each flair name in the `flairs` list. Up to 100 search results per flair are collected and stored for analysis.
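
Because `search` treats the flair name as a query term, a result may carry a different flair than the one searched for. The sketch below shows one way to check this, assuming each PRAW submission exposes its own flair text as `link_flair_text`. The helper `has_flair` and the stand-in result objects are hypothetical and exist only so the example runs without network access.

```python
from types import SimpleNamespace


def has_flair(post, flair):
    """Return True when the post's own flair text equals the searched flair name."""
    # link_flair_text is None for posts that have no flair at all.
    own_flair = post.link_flair_text or ""
    return own_flair.strip().lower() == flair.strip().lower()


# Stand-in objects that mimic the one attribute used above.
results = [
    SimpleNamespace(id="a1", link_flair_text="AskIndia"),
    SimpleNamespace(id="b2", link_flair_text="Politics"),
    SimpleNamespace(id="c3", link_flair_text=None),
]

matching = [post.id for post in results if has_flair(post, "AskIndia")]
print(matching)  # ['a1']
```

Whether to drop mismatched posts or keep them with their real flair is a design choice; the collection loop in this notebook keeps every search result and labels it with the searched flair.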

In [0]:
rows = []  # one dict per post; DataFrame.append was removed in pandas 2.0
for flair in flairs:
    # search r/India for the flair name; returns at most 100 results per flair
    posts = subreddit_india.search(flair, limit=100)
    for post in posts:
        # drop the unexpanded "load more comments" placeholders,
        # then flatten the whole comment tree (top-level comments and replies)
        post.comments.replace_more(limit=0)
        all_comments = post.comments.list()

        rows.append({'id': post.id, 'title': post.title, 'url': post.url, 'score': post.score,
                     'created_at': post.created, 'body': post.selftext, 'author': post.author,
                     'nups': post.ups, 'ncomments': len(all_comments),
                     # comment bodies are concatenated without a separator
                     'comments': ''.join(c.body for c in all_comments),
                     'flair': flair})

# build the dataframe once, keeping the column order defined earlier
df = pd.DataFrame(rows, columns=column_names)
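
The per-post extraction in the loop above can also be pulled into a small helper and exercised offline. The following is a sketch only: `post_to_row` and the `FakeComments` stand-in are hypothetical names, the stand-in mimics just the `replace_more` and `list` calls used above, and, unlike the loop, the helper joins comment bodies with a space and stores the author as a string.

```python
from types import SimpleNamespace


class FakeComments:
    """Hypothetical stand-in for a PRAW comment forest."""

    def __init__(self, bodies):
        self._bodies = bodies

    def replace_more(self, limit=0):
        return []  # nothing to expand in the stand-in

    def list(self):
        return [SimpleNamespace(body=body) for body in self._bodies]


def post_to_row(post, flair):
    """Build one dataframe row (as a dict) from a submission-like object."""
    post.comments.replace_more(limit=0)
    comments = post.comments.list()
    return {
        "id": post.id, "title": post.title, "url": post.url,
        "score": post.score, "created_at": post.created,
        "body": post.selftext, "author": str(post.author),
        "nups": post.ups, "ncomments": len(comments),
        # assumption: a space separator keeps adjacent comments readable
        "comments": " ".join(c.body for c in comments),
        "flair": flair,
    }


fake_post = SimpleNamespace(
    id="x1", title="Example", url="https://example.invalid", score=3,
    created=0.0, selftext="text", author="someone", ups=3,
    comments=FakeComments(["first", "second"]),
)
row = post_to_row(fake_post, "AskIndia")
print(row["ncomments"], row["comments"])  # 2 first second
```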

Print the first 5 rows of the dataframe:

In [0]:
df.head(5)

Unnamed: 0,id,title,url,score,created_at,body,author,nups,ncomments,comments,flair
0,fwjdqr,4 days ago we had pending orders of 100 millio...,https://www.reddit.com/r/india/comments/fwjdqr...,91,1586290000.0,> We are getting frantic calls from our pharma...,india_ko_vanakkam,91,5,"Modi has Stockholm syndromeTo be fair, the evi...",AskIndia
1,fizkkk,Randians who were big time users of dating app...,https://www.reddit.com/r/india/comments/fizkkk...,20,1584298000.0,I'd my own stint with these apps(a couple of m...,__knockknockturnal__,20,19,Someone matched with me just to tell me that I...,AskIndia
2,f25vx0,What does r/India thinks about the Flat Earthers?,https://www.reddit.com/r/india/comments/f25vx0...,5,1581441000.0,"I encountered a Foreigner in IG who says "" Rou...",Dev1003,5,31,I haven't found a Indian yet who believes eart...,AskIndia
3,dtvliq,People who left their 9 to 5 jobs to pursue a ...,https://www.reddit.com/r/india/comments/dtvliq...,46,1573333000.0,Couldn't add AskIndia flair from the mobile br...,c0mrade34,46,36,"An Engineer, doing advertisement shoots since ...",AskIndia
4,1s57oi,Need feedback for Insurance Policy that I took...,https://www.reddit.com/r/india/comments/1s57oi...,1,1386254000.0,**Re-posting here because of lack of activity ...,dhavalcoholic,1,1,"Dear Policy Holder(Dhavalcoholic),\n \nWe requ...",AskIndia
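
The `created_at` values above are Unix timestamps, that is, seconds since 1970-01-01. A minimal sketch of converting one to a readable date follows, assuming the value should be interpreted as UTC. The helper `epoch_to_utc` is a hypothetical name, and the input is the rounded value displayed in the first row, so only the date is meaningful.

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


stamp = epoch_to_utc(1586290000.0)  # rounded value shown in row 0
print(stamp.date().isoformat())  # 2020-04-07
```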


Shape of the dataframe: 

In [0]:
df.shape

(1118, 11)

**Save the dataframe**

Save the collected data in CSV format to Google Drive.

First, mount Google Drive in the Colaboratory runtime:

In [0]:
from google.colab import drive 
drive.mount('drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at drive


In [0]:
df.to_csv('data.csv')
!cp data.csv /content/drive/My\ Drive/Reddit\ Flare\ Detector