## **REDDIT DATA CRAWLER**
* This data crawler uses [Pushshift](https://github.com/pushshift/api) to scrape JSON from Reddit's r/india and dump the data to a CSV file.
* Checkpoints are supported: progress can be saved to and loaded from pickle files (implementation below).
* All warnings, exceptions, and HTTP request results are recorded in a log file.
* By providing four parameters you can extract any number of posts over any period of time.
* A script version is also available, which lets you create a CSV of the data in a single line of code.
* To keep hot topics from biasing the data, posts were taken over a wide period: 2 years of data have been scraped.
* A total of ***365,000*** submissions were scraped from the subreddit, of which only ***48,380*** are still available on Reddit and hence part of the dataset.
* A committed version of this notebook is available on [Kaggle](https://www.kaggle.com/someshsingh22/redditcrawlertest?scriptVersionId=31577144).
* Scraping 365K submissions over 2 years spent ***2190*** s in wait time between queries, in addition to the actual runtime.

### **IMPORTS**

In [1]:
import requests #For querying Pushshift to fetch JSON from Reddit
import time #For generating timestamps
import json #To parse JSON responses from Pushshift
import logging #Log file for errors and progress
import datetime #To convert UNIX timestamps to dates
import pickle #To save and load checkpoints
from tqdm.notebook import tqdm as tqdm_notebook #Progress bar; tqdm.tqdm_notebook is deprecated
import matplotlib.pyplot as plt #For analytics
import seaborn as sns #For analytics
import pandas as pd #To convert extracted data to CSV


### **Create Logger for maintaining log file for errors and progress**

In [0]:
#Create and configure logger 
logging.basicConfig(filename="crawler.log", 
                    format='%(asctime)s %(message)s', 
                    filemode='w',
                    level=logging.DEBUG)

### **Reddit Crawler Class**
The class takes four parameters for extraction of data


1.   size : Number of results to request per query (advised $ 100 \leq size \leq 500 $ to avoid being blocked).
2.   start : UNIX timestamp from which scraping begins; the crawler walks backwards in time from here.
3.   difference : The leap of time between two queries, i.e. the time range (in days) covered by each query.
4.   sleep : The delay in seconds between successive queries to Pushshift (advised to keep $ \geq 1 $).

Invalid parameters are replaced with defaults and logged as warnings; all runtime errors, along with the associated URLs, are logged in `crawler.log`.
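
To make the interplay of `start` and `difference` concrete, here is a minimal sketch of how successive query windows could be derived. The helper name `window_urls` is hypothetical and not part of the crawler class; the URL template is copied from the class, and the timestamp is only an example value.

```python
import itertools

# URL template copied from the Crawler class
URL = 'https://api.pushshift.io/reddit/search/submission/?subreddit=india&size={}&{}'
SECONDS_PER_DAY = 24 * 3600

def window_urls(start, difference=7, size=250):
    """Yield one URL per time window, walking backwards from `start` (hypothetical helper)."""
    current = int(start)
    step = difference * SECONDS_PER_DAY  # convert days to seconds
    while True:
        window = 'before={}&after={}'.format(current, current - step)
        yield URL.format(size, window)
        current -= step  # the next window ends where this one began

# First two one-day windows before an example timestamp
urls = list(itertools.islice(window_urls(1523347657, difference=1, size=500), 2))
for url in urls:
    print(url)
```

Each window ends where the previous one began, so consecutive queries cover the timeline without gaps.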

In [0]:
class Crawler:
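  def __init__(self, size=250, start=None, difference=7, sleep=1):
    #Default start to "now" at call time; a time.time() default is evaluated only once, when the class is defined
    if start is None:
      start = time.time()
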
    #Data Collected by the crawler
    self.data={
        'Author' : [],
        'Title'  : [],
        'Flair'  : [],
        'Text'   : [],
    }

    self.stats=[] #Number of posts collected on every leap of time
    self.sleep, self.size, self.start, self.difference = sleep, size, start, difference #Init the parameters
    self.__validate__() #Validate parameters and log warnings
    self.difference=self.difference*24*3600 #convert difference from days to seconds
    self.current=self.start #set timer for query to time of init
    self.url = 'https://api.pushshift.io/reddit/search/submission/?subreddit=india&size={}&{}' #URL format for Pushshift
    self.url_generator=self._url_generator() #create a URL generator

  #Validator for Web Crawler
  def __validate__(self):
    #validate sleep value to be a number >= 1
    try :
      assert self.sleep >= 1 and (isinstance(self.sleep,int) or isinstance(self.sleep,float))
    except:
      logging.warning("Invalid sleep value, may cause DNS server blocking, set to 1")
      self.sleep=1

    #validate query size to be an int between 100 and 500
    try :
      assert self.size <= 500 and self.size>= 100 and isinstance(self.size,int)
    except:
      logging.warning("Invalid query size, may cause DNS server blocking, set to 500")
      self.size=500

    #Validate start to be a valid present/past timestamp
    try :
      self.start=int(self.start)
      assert self.start <= time.time() and isinstance(self.start,int)
    except:
      logging.warning("Invalid start time, being set to current")
      self.start=int(time.time())
    
    #Validate the difference to be a valid positive number
    try :
      assert isinstance(self.difference,int) and self.difference > 0
    except:
      logging.warning("Invalid difference, setting to a week")
      self.difference=7

  #URL Generator for Pushshift
  def _url_generator(self):
    while True:
      timestamp='before={}&after={}'.format(self.current, self.current-self.difference) #Get timestamp
      yield self.url.format(self.size,timestamp) #get URL
      self.current-=self.difference #Update Current time

  #Process JSON output from Pushshift into data
  def process_json(self,jsons):
    for post in jsons: #named post so the json module is not shadowed
      self.data['Author'].append(post.get('author')) #.get returns None for missing keys
      self.data['Title'].append(post.get('title'))
      self.data['Text'].append(post.get('selftext'))
      self.data['Flair'].append(post.get('link_flair_text'))

  #Query and update the data from Web Crawler
  def query(self):
    loc_url=next(self.url_generator) #generate URL
    try:
      r = requests.get(loc_url) #get JSON
      data = json.loads(r.text) #load JSON to dict format

      #filter out deleted/removed posts and posts without text
      jsons=[post for post in data['data'] if 'selftext' in post and post['selftext'] not in ("[removed]", "[deleted]", "")]
      
      #process jsons to dict
      self.process_json(jsons=jsons)

      #append the datestamp and number of valid posts fetched in this leap
      self.stats.append((datetime.datetime.fromtimestamp(self.current).strftime('%Y-%m-%d'),len(jsons)))
      logging.info("Query Successful at {}".format(loc_url))
    except Exception as err: #a bare except would also swallow KeyboardInterrupt
      logging.error("Query Failed at {}: {}".format(loc_url, err))
    finally:
      #delay after every query, including failed ones, to avoid being blocked
      time.sleep(self.sleep)

  #save progress as a pickle checkpoint so the dataset can be built over several sessions
  def save(self,pre=""):
    path=pre+'{}.pkl'.format(self.current) #checkpoint is named after the current timestamp
    with open(path, 'wb') as f:
      pickle.dump([self.data, self.stats], f)
    logging.info("Pickle dumped to {}".format(path))

  #to load from a previous checkpoint
  def load(self,js):
    with open(js,'rb') as f:
      self.data, self.stats = pickle.load(f)
    self.current = int(js.split('/')[-1][:-4]) #init current stamp to the stamp in the file name
    self.url_generator=self._url_generator() #restart the URL generator from the loaded stamp

  #dump data to csv and stats to pkl
  def dump(self,pre=""):
    self.csv=pd.DataFrame(self.data) #one column per key of self.data
    self.csv.to_csv(pre+'raw_data.csv',index=False) #dump .csv
    with open(pre+'stats.pkl', 'wb') as f:
      pickle.dump(self.stats, f)
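
The filtering and field extraction performed by `query` and `process_json` can be tried offline. The sketch below is illustrative only: the helper names `keep_post` and `to_row` are hypothetical, and the payload is made up rather than real Reddit data.

```python
# Values that mark a submission whose text is no longer available
UNAVAILABLE = ("[removed]", "[deleted]", "")

def keep_post(post):
    """Keep only posts that still carry text, as the crawler's filter does."""
    return 'selftext' in post and post['selftext'] not in UNAVAILABLE

def to_row(post):
    """Pick the four fields the crawler stores; missing keys become None."""
    return {
        'Author': post.get('author'),
        'Title': post.get('title'),
        'Flair': post.get('link_flair_text'),
        'Text': post.get('selftext'),
    }

# Made-up payload for illustration only, not real Reddit data
payload = {'data': [
    {'author': 'user_a', 'title': 'First post', 'selftext': 'Some text'},
    {'author': 'user_b', 'title': 'Second post', 'selftext': '[removed]'},
    {'author': 'user_c', 'title': 'Link post'},
]}

rows = [to_row(post) for post in payload['data'] if keep_post(post)]
print(rows)
```

Only the first post survives: the second was removed and the third has no `selftext` at all.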

### **Functionality and use case :**
* Initialize the `Crawler` class, optionally setting any of the four parameters.
* To scrape, call `Object.query()` in a loop, once per leap of time.
* To load a checkpoint: `Object.load(path_to_pkl_file.pkl)`.
* To continue scraping after loading, loop over `Object.query()` exactly as before.
* To save a checkpoint: `Object.save(pre=prefix_target_directory)`. The file is named `UNIX_TIMESTAMP.pkl`, so its date is never lost and checkpoints keep their natural order.
* To dump the data to a CSV and the stats to a pickle file: `Object.dump(pre=prefix_target_directory)`.
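
Because checkpoints are named after their UNIX timestamp and the crawler walks backwards in time, the checkpoint with the smallest timestamp holds the most progress. A minimal sketch of picking it, assuming a directory that contains only such `.pkl` files; the helper name `furthest_checkpoint` is hypothetical and the checkpoint contents are empty placeholders:

```python
import os
import pickle
import tempfile

def furthest_checkpoint(directory):
    """Return the path of the checkpoint with the smallest timestamp, or None (hypothetical helper)."""
    stamps = [int(name[:-4]) for name in os.listdir(directory)
              if name.endswith('.pkl') and name[:-4].isdigit()]
    if not stamps:
        return None
    return os.path.join(directory, '{}.pkl'.format(min(stamps)))

# Demo with two empty placeholder checkpoints in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    empty = {'Author': [], 'Title': [], 'Flair': [], 'Text': []}
    for stamp in (1523347657, 1523261257):
        with open(os.path.join(tmp, '{}.pkl'.format(stamp)), 'wb') as f:
            pickle.dump([empty, []], f)  # same [data, stats] layout that save() writes

    resume_from = furthest_checkpoint(tmp)
    with open(resume_from, 'rb') as f:
        data, stats = pickle.load(f)
    resume_name = os.path.basename(resume_from)

print(resume_name)
```

The returned path could then be passed to `Object.load(...)` to resume scraping.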

In [0]:
# initialize your crawler
red=Crawler(size=500, difference=1, sleep=3)

#Uncomment to scrape from beginning
#for i in tqdm_notebook(range(3)):
#  red.query()

# load checkpoints
red.load('checkpoints/1523347657.pkl')

In [5]:
#continue scraping: with difference=1, each call to query() scrapes one more day of posts
for i in tqdm_notebook(range(3)):
  red.query()

In [0]:
red.dump(pre="data/") #dump your data and stats
red.save("checkpoints/") #save your checkpoints
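
The dumped stats are a list of `(date, number_of_posts)` tuples, one per query. As a rough sketch of a first analysis step, the counts could be aggregated per month; the helper name `posts_per_month` is hypothetical and the sample values are made up.

```python
from collections import Counter

def posts_per_month(stats):
    """Sum post counts per 'YYYY-MM' from (date_string, count) tuples (hypothetical helper)."""
    totals = Counter()
    for day, count in stats:
        totals[day[:7]] += count  # 'YYYY-MM-DD' -> 'YYYY-MM'
    return dict(sorted(totals.items()))

# Made-up sample in the same shape as the crawler's stats
sample_stats = [('2018-04-10', 12), ('2018-04-09', 7), ('2018-03-31', 5)]
monthly = posts_per_month(sample_stats)
print(monthly)
```

For real data, the list would first be read back from `stats.pkl` with `pickle.load`.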