# Scrape Data from Reddit and Combine the Scraped Files
This notebook scrapes posts from the r/dadjokes and r/antijokes subreddits and combines them into CSV files. The final CSV files keep only the columns needed for NLP analysis and for training a machine learning model to classify posts as dad jokes or anti-jokes.

In [1]:
import requests
import pandas as pd
import time
import random
import glob
from datetime import date, datetime

## Scraping Reddit Pages

In [2]:
# create a reddit scraper function
# the function takes a topic (subreddit) and scrapes up to n pages of posts per category
def scrap_reddit(topic, n=100):
    
    i = 1
    
    # scrape the hot, top and rising listings to gather more posts
    # other listings such as new are skipped as they might not be representative of the subreddit
    categories = ['hot', 'top', 'rising']

    for category in categories:
        
        posts = []
        after = None

        url = f'https://www.reddit.com/r/{topic}/{category}.json'
        
        # print the progress - which category it is running
        print(f'Starting No. {i} Run - Category: {category}')
        print('------------------------------------------------')

        # use a separate loop variable so the page limit n is not overwritten between categories
        for page in range(n):
            if page == 0:
                current_url = url
            else:
                if after is not None:
                    current_url = url + '?after=' + after
                else:
                    print(f'Stopping after {page} pages, reached the end of the listing')
                    print(current_url)
                    break
            
            res = requests.get(current_url, headers={'User-agent': 'Pony Inc 45812'})

            if res.status_code != 200:
                print('Status error', res.status_code)
                break

            current_dict = res.json()
            current_posts = [p['data'] for p in current_dict['data']['children']]
            posts.extend(current_posts)
            previous = after
            after = current_dict['data']['after']
            
            if after == previous:
                print(f'The next url will be the same as current url - {current_url}')
                break
            
            # sleep for a random duration to look more 'natural'
            sleep_duration = random.randint(2, 5)
            time.sleep(sleep_duration)
            
            # print the progress - how many pages have been scraped
            print(f'{category.capitalize()} Category: Page {page+1} Completed')
        
        # convert the scraped data to a dataframe and save it to csv
        # named by topic_date_category for easy reference
        df = pd.DataFrame(posts)
        today = date.today().strftime("%d_%b")
        df.to_csv(f'../datasets/scrapped/{topic}_{today}_{category}.csv', index=False)
        

        print(f'{i}/{len(categories)} Completed Category: {category}')
        print('------------------------------------------------')
        print('------------------------------------------------\n')
        
        i += 1
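
The pagination above depends on the `after` token that each listing response returns. As a minimal sketch of that URL logic, with no network request involved, the helper below builds the first-page and next-page URLs. The name `build_listing_url` and the token `t3_abc123` are hypothetical and used only for illustration; they are not part of the notebook.

```python
from urllib.parse import urlencode


def build_listing_url(topic, category, after=None):
    """Return the JSON listing URL for a subreddit category.

    When an `after` token is given, it is appended as a query parameter
    so the request continues from the previous page.
    """
    base = f'https://www.reddit.com/r/{topic}/{category}.json'
    if after is None:
        return base
    query = urlencode({'after': after})
    return f'{base}?{query}'


# first page, then a follow-up page using a made-up token
first_url = build_listing_url('dadjokes', 'hot')
next_url = build_listing_url('dadjokes', 'hot', after='t3_abc123')
print(first_url)
print(next_url)
```

Keeping the URL construction in a small pure function makes it possible to check the pagination logic without hitting Reddit.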

In [None]:
# scrape data for r/dadjokes

# scrap_reddit('dadjokes', n = 100)

In [None]:
# scrape data for r/antijokes

# scrap_reddit('antijokes', n = 100)

## Combining the Scraped Data
The scraped data sit in separate CSV files, one per topic, date and category. This step combines the files for each topic, drops posts that appear in more than one file, and keeps only the columns needed for data analysis.
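
The combine-and-deduplicate idea can be sketched on small in-memory dataframes before it is applied to the real files. The rows below are made up for illustration, and the extra `score` column only stands in for the many fields that get dropped.

```python
import pandas as pd

# two stand-in frames (made-up rows) that mimic overlapping scrape results
hot = pd.DataFrame({
    'subreddit': ['dadjokes', 'dadjokes'],
    'title': ['joke A', 'joke B'],
    'selftext': ['punchline A', 'punchline B'],
    'url': ['url_a', 'url_b'],
    'score': [10, 20],
})
top = pd.DataFrame({
    'subreddit': ['dadjokes', 'dadjokes'],
    'title': ['joke B', 'joke C'],
    'selftext': ['punchline B', 'punchline C'],
    'url': ['url_b', 'url_c'],
    'score': [25, 30],
})

features_kept = ['subreddit', 'title', 'selftext', 'url']

# stack the selected columns, then drop posts with the same title and text
combined = pd.concat([hot[features_kept], top[features_kept]], ignore_index=True)
combined = combined.drop_duplicates(['title', 'selftext']).reset_index(drop=True)

print(combined)
```

Here "joke B" appears in both frames, so only three rows remain after deduplication.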

In [3]:
# define a function to create the dataframe from the scraped csv files
def create_df(topic):
    
    # find the csv files whose names start with the topic
    list_of_csvfiles = glob.glob(f'../datasets/scrapped/{topic}*.csv')
    
    # list the columns to keep - title, post (selftext) and subreddit (target)
    # url is also kept so that the actual post can be looked up if needed
    features_kept = ['subreddit', 'title', 'selftext', 'url']
    
    # read each csv, keep only the selected columns, and combine into 1 dataframe
    frames = []
    for csv_file in list_of_csvfiles:
        df = pd.read_csv(csv_file)
        frames.append(df[features_kept])
        print(f'Completed: {csv_file}')
    
    topic_df = pd.concat(frames)
        
        
    # drop any rows that have the same title and post         
    topic_df.drop_duplicates(['title', 'selftext'], inplace = True)
    
    # rename the columns for better understanding
    topic_df.rename(columns={
        'title': 'original_title',
        'selftext': 'original_post',
    }, inplace=True)

    return topic_df

In [4]:
# define a function to save a dataframe into a csv with the time that it is created
def save_to_csv(df, topic):
    now = datetime.now().strftime("%d_%b_%H%M")
    df.to_csv(f'../datasets/{topic}_{now}_Combined.csv', index=False)
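
To see what file name this produces, the sketch below applies the same format string to a fixed datetime instead of `datetime.now()`. The helper name `combined_csv_path` and the chosen date are hypothetical, and the month abbreviation from `%b` depends on the active locale.

```python
from datetime import datetime


def combined_csv_path(topic, when):
    """Build the output path with the same pattern as save_to_csv, for a given datetime."""
    stamp = when.strftime("%d_%b_%H%M")
    return f'../datasets/{topic}_{stamp}_Combined.csv'


# a made-up timestamp: 29 November 2020, 14:05
example_path = combined_csv_path('dadjokes', datetime(2020, 11, 29, 14, 5))
print(example_path)
```

Because the timestamp has minute resolution, two saves for the same topic within the same minute write to the same file.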

In [5]:
# create the dataframe for r/antijokes
anti_df = create_df('antijokes')

Completed: ../datasets/scrapped\antijokes_25_Nov_hot.csv
Completed: ../datasets/scrapped\antijokes_27_Nov_hot.csv
Completed: ../datasets/scrapped\antijokes_27_Nov_top.csv
Completed: ../datasets/scrapped\antijokes_29_Nov_hot.csv
Completed: ../datasets/scrapped\antijokes_29_Nov_top.csv


In [6]:
# check that the dataframe is correct
anti_df.reset_index(drop = True, inplace = True)

anti_df.head()

| | subreddit | original_title | original_post | url |
|---|---|---|---|---|
| 0 | AntiJokes | You know what they say about black guys in bed | they are in a bed | https://www.reddit.com/r/AntiJokes/comments/k0... |
| 1 | AntiJokes | What's an octopus' favorite month? | Despite being an extraordinarily brilliant spe... | https://www.reddit.com/r/AntiJokes/comments/k0... |
| 2 | AntiJokes | What do you call a melted snowman? | Water | https://www.reddit.com/r/AntiJokes/comments/k0... |
| 3 | AntiJokes | What did the ice cream say to the old man | Jesus fuck I just want an upvote I don’t even ... | https://www.reddit.com/r/AntiJokes/comments/jz... |
| 4 | AntiJokes | A bartender walks into a bar | He gets working | https://www.reddit.com/r/AntiJokes/comments/k0... |


In [7]:
# save the dataframe to csv
# save_to_csv(anti_df, 'antijokes')

In [8]:
# create the dataframe for r/dadjokes
dad_df = create_df('dadjokes')

Completed: ../datasets/scrapped\dadjokes_25_Nov_hot.csv
Completed: ../datasets/scrapped\dadjokes_27_Nov_hot.csv
Completed: ../datasets/scrapped\dadjokes_27_Nov_rising.csv
Completed: ../datasets/scrapped\dadjokes_27_Nov_top.csv
Completed: ../datasets/scrapped\dadjokes_29_Nov_hot.csv
Completed: ../datasets/scrapped\dadjokes_29_Nov_rising.csv
Completed: ../datasets/scrapped\dadjokes_29_Nov_top.csv


In [9]:
# check that the dataframe is correct
dad_df.reset_index(drop = True, inplace = True)

dad_df.head()

| | subreddit | original_title | original_post | url |
|---|---|---|---|---|
| 0 | dadjokes | We just found out my Grandpa is addicted to Vi... | No one is taking it harder than my Grandma. | https://www.reddit.com/r/dadjokes/comments/k0i... |
| 1 | dadjokes | What do you call a line of men waiting to get ... | A barberqueue | https://www.reddit.com/r/dadjokes/comments/k04... |
| 2 | dadjokes | Why is dark written with a K not a C? | Because you can't C in the dark | https://www.reddit.com/r/dadjokes/comments/k0j... |
| 3 | dadjokes | I watched Bohemian Rhapsody three times in a r... | It must be the high Mercury content. | https://www.reddit.com/r/dadjokes/comments/k0g... |
| 4 | dadjokes | I recently decided to learn sign language... | So that I can tell jokes nobody has ever heard. | https://www.reddit.com/r/dadjokes/comments/k04... |


In [10]:
# save the dataframe to csv
# save_to_csv(dad_df, 'dadjokes')