# Part 1: Scraping, Cleaning, & Formatting

## Imports

In [None]:
import time
import pandas             as pd
import requests           as re
from IPython.display      import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))

## Table Of Contents

-----

1. [Scraping The Reddit API](#Scraping-The-Reddit-API)
    - [URLs & The User_Agent](#URLs-&-The-User_Agent)
    - [Requests](#Requests)
    - [Saving The Data](#Saving-The-Data)
    - [Scraping](#Scraping)
    - [Conversion To Dataframes](#Conversion-To-Dataframes)


2. [Formatting](#Formatting)
    - [Column Extraction](#Column-Extraction)
    - [Creating New .csv Files](#Creating-New-.csv-Files)

3. [Cleaning](#Cleaning)
    - [Reading In The Data](#Reading-In-The-Data)
    - [Removing Duplicates](#Removing-Duplicates)

## Scraping The Reddit API

Reddit keeps raw data of all posts in a JSON format.  The documentation for its API can be found [here](https://www.reddit.com/dev/api/).

Before we can start working with Reddit's API data, we have to set up a request for the data.

### URLs & The User_Agent

Because we will be classifying posts from [r/Cooking](https://www.reddit.com/r/Cooking/) and [r/AskCulinary](https://www.reddit.com/r/AskCulinary), we need to have a URL and a `user_agent` for both subreddits.

In [None]:
# For r/Cooking

cooking_url = "http://reddit.com/r/Cooking.json"   
user_agent  = {"user-agent": "andrew_bergman"}                        

# For r/AskCulinary

askculinary_url = "http://reddit.com/r/AskCulinary.json"
user_agent      = {"user-agent": "andrew_bergman"}

### Requests

Now that we have the URLs and the `user_agent`, we can set up the requests.  Each one is a simple HTTP GET request made through the `requests` library.  We also print the status code of each response; a code of 200 means the request succeeded.

In [None]:
# The r/Cooking request

cooking_request  = re.get(url = cooking_url, headers = user_agent)

# The r/AskCulinary request

askculinary_request  = re.get(url = askculinary_url, headers = user_agent)

# Print status codes

print(f"The r/Cooking status code is    : {cooking_request.status_code}")
print(f"The r/AskCulinary status code is: {askculinary_request.status_code}")
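
A status code other than 200 usually points to a problem such as rate limiting.  As a minimal sketch, the helper below turns a numeric code into a readable label using the standard library's `http.HTTPStatus`.  The name `describe_status` is our own and is not part of `requests`.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a label such as '200 OK' for a numeric HTTP status code."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # Codes the standard library does not recognize
        return f"{code} Unknown"

print(describe_status(200))
print(describe_status(429))
```

It could be called as `describe_status(cooking_request.status_code)` in place of printing the bare number.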

### Saving The Data

Now that we have two working requests, we can save the data to variables.  To do that, we parse each JSON response into a Python dictionary with `.json()` and assign the result to a variable.


Because we already have some older posts, we will be adding the new pulls to our old ones.

In [None]:
# Saving the new r/Cooking data

new_cooking_data     = cooking_request.json()

# Saving the new r/AskCulinary data

new_askculinary_data = askculinary_request.json()


# Checking to make sure I got 25 posts from my first pull

print(f'The initial r/Cooking request returned    : {len(new_cooking_data["data"]["children"])}')
print(f'The initial r/AskCulinary request returned: {len(new_askculinary_data["data"]["children"])}')

In [None]:
# Looking at the `after` tokens from both pulls (each marks where the next page starts)

print(f'The r/Cooking `after` token is    : {new_cooking_data["data"]["after"]}')
print(f'The r/AskCulinary `after` token is: {new_askculinary_data["data"]["after"]}')
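
To make the shape of the response concrete, here is a trimmed-down, made-up listing with the same nesting as the keys accessed above, along with a hypothetical helper (`next_page_url`) that shows how the `after` value feeds into the URL for the next page.  The post IDs and titles are invented for illustration.

```python
# A made-up listing with the same nesting as the real response
sample_listing = {
    "kind": "Listing",
    "data": {
        "after": "t3_abc123",  # hypothetical token pointing at the next page
        "children": [
            {"kind": "t3", "data": {"id": "abc122", "title": "First post"}},
            {"kind": "t3", "data": {"id": "abc123", "title": "Second post"}},
        ],
    },
}

def next_page_url(base_url, listing):
    """Build the URL for the next page, or return base_url if there is no token."""
    after = listing["data"]["after"]
    if after is None:
        return base_url
    return f"{base_url}?after={after}"

print(next_page_url("http://reddit.com/r/Cooking.json", sample_listing))
```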

### Scraping

The Reddit API allows for 1,000 posts to be scraped per subreddit per day.  In total we will have roughly 2,000 new posts in addition to the older scraped data.

To make the scraping easier, we use a `for` loop to request the API 40 times per subreddit.  Each request returns 25 posts, so 40 requests cover the 1,000-post limit.

In [None]:
# Scraping r/Cooking

# Creating an empty list to save the scrapes to
new_cooking_posts = []

# Setting it to `None` for use in the loop
after         = None

for pull in range(40):
    
    # Tells us which pull is running in case of errors
    print(f"Pull Attempt {pull + 1}")
    
    if after is None:
        
        # Sets up the initial loop
        new_url = cooking_url
        
    else:
        
        # Allows for the creation of the next pull
        new_url = cooking_url + "?after=" + after
        
    # Resetting the request    
    request = re.get(url = new_url, headers = user_agent)
    
    # Only works if the status is good
    if request.status_code == 200:
        # creates a new dictionary & then appends it to the empty list
        new_cooking_data = request.json()
        new_cooking_posts.extend(new_cooking_data["data"]["children"])
        
        # Sets a new after value
        after = new_cooking_data["data"]["after"]
        
    else:
        print(f"An Error Has Occurred.  Error Code {request.status_code}")
        break
        
    # Setting a sleep time prevents me from being interpreted as a bot
    time.sleep(2)

In [None]:
# For r/AskCulinary


new_askculinary_posts = []
after             = None

for pull in range(40):
    print(f"Pull Attempt {pull + 1}")
    if after is None:
        new_url = askculinary_url
    else:
        new_url = askculinary_url + "?after=" + after
    new_askculinary_request = re.get(url = new_url, headers = user_agent)
    if new_askculinary_request.status_code == 200:
        new_askculinary_data = new_askculinary_request.json()
        new_askculinary_posts.extend(new_askculinary_data["data"]["children"])
        after = new_askculinary_data["data"]["after"]
    else:
        print(f"An Error Has Occurred.  Error Code {new_askculinary_request.status_code}")
        break
    time.sleep(2)
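
The two loops above differ only in the URL and the variable names, so the same logic could be folded into one function.  The sketch below is one possible shape, not the code that produced the data: `scrape_listing` and its `fetch` parameter are hypothetical names, and `fetch` is any callable that takes a URL and returns a `(status_code, parsed_json)` pair.  Unlike the loops above, it also stops when `after` comes back as `None`, which avoids starting again from the first page.

```python
import time

def scrape_listing(base_url, fetch, max_pulls=40, pause=2.0):
    """Collect posts page by page using the `after` token."""
    posts = []
    after = None
    for pull in range(max_pulls):
        url = base_url if after is None else f"{base_url}?after={after}"
        status, payload = fetch(url)
        if status != 200:
            print(f"Pull {pull + 1} failed with status {status}")
            break
        posts.extend(payload["data"]["children"])
        after = payload["data"]["after"]
        if after is None:
            # No further pages: stop rather than restart from the first page
            break
        time.sleep(pause)
    return posts

# Hypothetical in-memory pages used in place of the live API
fake_pages = {
    "demo.json": (200, {"data": {"children": [{"data": {"id": "a"}}], "after": "t3_a"}}),
    "demo.json?after=t3_a": (200, {"data": {"children": [{"data": {"id": "b"}}], "after": None}}),
}
demo_posts = scrape_listing("demo.json", fake_pages.get, pause=0)
print(len(demo_posts))
```

With the live API, `fetch` could wrap `re.get(url = url, headers = user_agent)` and return the response's `status_code` together with its `.json()` result.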

### Conversion To Dataframes

Now that we have 1,000 posts from each subreddit, we save them as pandas dataframes because the data is easier to manipulate in that format.

In [None]:
# For r/Cooking posts

new_cooking_data     = pd.DataFrame(new_cooking_posts)

# For r/AskCulinary posts

new_askculinary_data = pd.DataFrame(new_askculinary_posts)

## Formatting

Even though the scraped data is now in a dataframe, we cannot work with it yet because each post's data sits in a single cell as a dictionary.  To work with the data, we have to extract certain key-value pairs from each dictionary and turn them into features in the modified dataframe.

### Column Extraction


The full dictionary for each post has a lot of key-value pairs, but we only need four: `id`, `author`, `title`, and `selftext`.

While many subreddits are image- or video-based, we are lucky in that r/Cooking and r/AskCulinary are primarily text-based communities: in addition to the title, each post has a body of text written by the author.
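
As a rough illustration of the extraction, the sketch below pulls the four keys out of a list of raw posts in plain Python.  The helper name `extract_fields` and the sample post are made up for this example, and the extra `score` key only shows that unrequested keys are dropped.

```python
def extract_fields(children, fields=("id", "author", "title", "selftext")):
    """Return one flat dictionary per post containing only the requested keys."""
    rows = []
    for child in children:
        post = child["data"]
        # .get() leaves a missing key as None instead of raising KeyError
        rows.append({field: post.get(field) for field in fields})
    return rows

# A made-up post with the same nesting as a scraped one
sample_children = [
    {"kind": "t3", "data": {"id": "abc122", "author": "user_a",
                            "title": "Knife care", "selftext": "How often?",
                            "score": 10}},
]
print(extract_fields(sample_children))
```

The resulting list of dictionaries can be passed straight to `pd.DataFrame` to get one column per key.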

In [None]:
# For the r/Cooking data

# Using list comprehension to create new columns

cooking_id    = [post['id']       for post in new_cooking_data['data']]
cooking_auth  = [post['author']   for post in new_cooking_data['data']]
cooking_title = [post['title']    for post in new_cooking_data['data']]
cooking_self  = [post['selftext'] for post in new_cooking_data['data']]

# Creating new columns and setting them equal to the list comprehension results

new_cooking_data["id"]       = cooking_id
new_cooking_data["title"]    = cooking_title
new_cooking_data["selftext"] = cooking_self
new_cooking_data["author"]   = cooking_auth
new_cooking_data["source"]   = "cooking"

In [None]:
# For the r/AskCulinary data:

# Using list comprehension to create new columns

askcul_id    = [post['id']       for post in new_askculinary_data['data']]
askcul_auth  = [post['author']   for post in new_askculinary_data['data']]
askcul_title = [post['title']    for post in new_askculinary_data['data']]
askcul_self  = [post['selftext'] for post in new_askculinary_data['data']]

# Creating new columns and setting them equal to the list comprehension results

new_askculinary_data["id"]            = askcul_id
new_askculinary_data["title"]         = askcul_title
new_askculinary_data["selftext"]      = askcul_self
new_askculinary_data["author"]        = askcul_auth
new_askculinary_data["source"]        = "askculinary"

In [None]:
# Checking to make sure that the two are dataframes

print(f"The r/Cooking data is a    : {type(new_cooking_data)}")
print(f"The r/AskCulinary data is a: {type(new_askculinary_data)}")

In [None]:
# Checking the format of the r/Cooking data

new_cooking_data.head()

In [None]:
# Checking the format of the r/AskCulinary data

new_askculinary_data.head()

### Creating New .csv Files

In [None]:
# Creating a r/Cooking .csv file

new_cooking_data.to_csv("../Data/new_cooking_df.csv")

# Creating a r/AskCulinary .csv file

new_askculinary_data.to_csv("../Data/new_askculinary_df.csv")

## Cleaning

Now that we have the Reddit data in the format we want, we can start cleaning it.  The process involves removing unnecessary columns, removing punctuation and non-alphanumeric characters, and removing null values.

### Reading In The Data

In [None]:
old_cooking_data = pd.read_csv("../Data/cooking_df.csv")
new_cooking_data = pd.read_csv("")

old_askculinary_data = pd.read_csv("../Data/askculinary_df.csv")
new_askculinary_data = pd.read_csv("")

### Removing Duplicates
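
Because the new pulls are added to the older ones, the same post can appear in both.  One way this step could look, assuming each post is identified by its `id` column, is to stack the two dataframes and keep the first row for each `id`.  The frames below are small, made-up stand-ins for the real data.

```python
import pandas as pd

# Small made-up frames standing in for the older and newer pulls
old_posts = pd.DataFrame({"id": ["a1", "a2"], "title": ["Old one", "Old two"]})
new_posts = pd.DataFrame({"id": ["a2", "a3"], "title": ["Old two", "New three"]})

# Stack the two pulls, then keep the first row seen for each post id
combined = pd.concat([old_posts, new_posts], ignore_index=True)
deduped  = combined.drop_duplicates(subset="id", keep="first").reset_index(drop=True)

print(deduped["id"].tolist())
```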