## Science or Science Fiction?
### Can we spot the difference? 
---
#### Problem Statement

1. Can a predictive model be built that classifies a Reddit submission as science or science fiction with an accuracy above 70%?
2. Which model is best at predicting the class of the content?
3. Which pre-processing methods and estimator parameters work best for predicting the class of the content?

#### Data
Data was gathered from two subreddits: r/askscience (18.9m members) and r/scifi (1.6m members).

In [1]:
import pandas as pd
import numpy as np

import requests
from time import sleep

In [2]:
# display all columns and rows in the Jupyter notebook

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', 100)

# https://towardsdatascience.com/how-to-show-all-columns-rows-of-a-pandas-dataframe-c49d4507fcf

In [3]:
# web location of data

url = 'https://api.pushshift.io/reddit/search/submission'
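
The collection cells below page backwards through time using a Unix epoch timestamp (`before_time`). As a small sketch, the standard library can convert such a value to a readable UTC datetime and back, which is an alternative to an online converter. The variable names `as_datetime` and `as_epoch` are illustrative and not part of the original notebook.

```python
from datetime import datetime, timezone

# epoch value used as the starting point in the collection cells
before_time = 1587328870

# epoch seconds -> timezone-aware UTC datetime
as_datetime = datetime.fromtimestamp(before_time, tz=timezone.utc)
print(as_datetime.isoformat())

# UTC datetime -> epoch seconds
as_epoch = int(as_datetime.timestamp())
print(as_epoch)
```

Passing `tz=timezone.utc` keeps the result independent of the local time zone of the machine running the notebook.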

In [5]:
# for loop: get data from subreddit: askscience
# set current time as before_time (https://www.epochconverter.com/)
before_time = 1587328870

# set data to None 
data = None

# if askscience.csv already exists: read in file as data
try:
    data = pd.read_csv('./data/askscience.csv')
# if not: keep data as None and let the loop below collect it
except FileNotFoundError:
    pass

# for loop - up to 1000 requests, 500 submissions per request, from r/askscience
# get only selftext, title, subreddit name, epoch number (created_utc)
# for id purposes (and in case we wanted to link submissions to comments later) get:
        # subreddit_id, id, and permalink

for i in range(1000):
    try:
        r = requests.get(
            url,
            params={
                'subreddit': 'askscience',
                'lang': True,
                'size': 500,
                'filter': ['selftext', 'title', 'subreddit', 'created_utc', 'subreddit_id', 'id', 'permalink'],
                'before': before_time
            }
        )

        # decode json object
        # create dataframe from decoded data as df
        df = pd.DataFrame(r.json()['data'])

        # if data is empty - overwrite with df
        if data is None:
            data = df
        # if data not empty any more - add new data to dataframe
        else: 
            data = pd.concat([data, df], axis=0)

        # overwrite askscience.csv with new data
        data.to_csv('./data/askscience.csv', index=False)

        # look up the earliest timestamp of submissions
        # overwrite before_time with new earliest timestamp (to get data from before that time in next iteration of the loop)
        before_time = data['created_utc'].min()

    # if a request or the decoding step raises an error
    # print 'Loop restarted.' (after the 2-second sleep the next iteration runs 'try' again)
        # idea of try-except to run code for a longer time without errors borrowed from Janos Sallai
    # catch Exception rather than everything, so the loop can still be interrupted manually
    except Exception:
        print('Loop restarted.')
        
    # let 2 seconds elapse between iterations
    sleep(2)
    print(i)

0
1
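
The paging logic in the cell above (request a batch, then move `before_time` to the oldest timestamp seen) can be separated from the HTTP call, which makes it easy to check without touching the API. The following is a minimal sketch under that assumption: `collect_before` and `fake_fetch` are hypothetical names, and stopping on an empty page is an addition that the original loop does not make.

```python
from typing import Callable, Dict, List


def collect_before(fetch_page: Callable[[int], List[Dict]],
                   before: int,
                   max_pages: int = 1000) -> List[Dict]:
    """Page backwards in time until max_pages is reached or a page is empty."""
    rows: List[Dict] = []
    for _ in range(max_pages):
        page = fetch_page(before)
        if not page:
            # nothing older is available, so further requests would be wasted
            break
        rows.extend(page)
        # the next request asks only for submissions older than the oldest one seen
        before = min(row['created_utc'] for row in page)
    return rows


# Stand-in for the API call (hypothetical): serves fake submissions
# with timestamps 10 down to 1, two per page, newest first.
def fake_fetch(before: int) -> List[Dict]:
    older = [t for t in range(10, 0, -1) if t < before]
    return [{'id': str(t), 'created_utc': t} for t in older[:2]]


rows = collect_before(fake_fetch, before=11)
print(len(rows))
```

In the notebook, `fetch_page` would wrap the `requests.get` call and return `r.json()['data']`, with the `sleep(2)` pause kept between requests.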


In [4]:
# for loop: get data from subreddit: scifi
# set current time as before_time (https://www.epochconverter.com/)
before_time = 1587328870

# set data to None 
data = None

# if scifi_new.csv already exists: read in file as data
try:
    data = pd.read_csv('./data/scifi_new.csv')
# if not: keep data as None and let the loop below collect it
except FileNotFoundError:
    pass

# if a saved file was loaded above:
# in order to get earlier submissions than the ones we already have
# set before_time to the smallest created_utc in the existing file
# (the None check keeps a first run, with no saved file, from failing here)
if data is not None and before_time > data['created_utc'].min():
    before_time = data['created_utc'].min()
        
# for-loop - get 500 submissions at a time from r/scifi
# get only selftext, title, subreddit name, epoch number (created_utc)
# for id purposes (and in case we wanted to link submissions to comments later) get:
        # subreddit_id, id, and permalink
for i in range(1000):

    r = requests.get(
        url,
        params={
            'subreddit': 'scifi',
            'lang': True,
            'size': 500,
            'filter': ['selftext', 'title', 'subreddit', 'created_utc', 'subreddit_id', 'id', 'permalink'],
            'before': before_time
        }
    )

    # decode json object
    # create dataframe from decoded data as df
    df = pd.DataFrame(r.json()['data'])

    # if data is empty - overwrite with df
    if data is None:
        data = df
    # if data not empty any more - add new data to dataframe
    else: 
        data = pd.concat([data, df], axis=0)

    # overwrite scifi_new.csv with new data
    data.to_csv('./data/scifi_new.csv', index=False)

    # look up the earliest timestamp of submissions
    # overwrite before_time with new earliest timestamp (to get data from before that time in next iteration of the loop)
    before_time = data['created_utc'].min()
        
    # let 2 seconds elapse between iterations
    sleep(2)
    print(i)

dataframe has 96635 rows
0
dataframe has 96635 rows
1
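
Because both collection cells append every new batch to whatever is already saved, restarting a run from the same `before_time` can fetch some submissions twice. One way to guard against that, sketched here on toy data rather than real submissions, is to drop repeated `id` values after concatenating. The frames `existing` and `new_batch` are made up for illustration.

```python
import pandas as pd

# two overlapping batches, as could happen when a run is restarted
# from the same before_time (toy data, not real submissions)
existing = pd.DataFrame({'id': ['a1', 'a2', 'a3'],
                         'created_utc': [300, 200, 100]})
new_batch = pd.DataFrame({'id': ['a3', 'a4'],
                          'created_utc': [100, 50]})

combined = pd.concat([existing, new_batch], axis=0, ignore_index=True)

# keep the first copy of each submission id
deduped = combined.drop_duplicates(subset='id').reset_index(drop=True)

print(len(combined), len(deduped))
```

Deduplicating on `id` rather than on all columns is a design choice: it still removes a repeated submission if another field, such as `selftext`, differs between two fetches.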


The collected data is saved in [scifi_new.csv](./data/scifi_new.csv) and [askscience.csv](./data/askscience.csv).