# Part 1: Scraping & Preprocessing

## Imports

In [1]:
import time
import pandas             as pd
import requests           as re
from IPython.display      import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))

## Table Of Contents


- [Scraping The Reddit API](#Scraping-The-Reddit-API)
    - [URLs & The User_Agent](#URLs-&-The-User_Agent)
    - [Requests](#Requests)
    - [Saving The Data](#Saving-The-Data)
    - [Scraping](#Scraping)
    - [Conversion To Dataframes](#Conversion-To-Dataframes)


- [Formatting](#Formatting)
    - [Column Extraction](#Column-Extraction)
    - [Creating New .csv Files](#Creating-New-.csv-Files)


- [Data Cleaning](#Data-Cleaning)
    - [Reading In The Data](#Reading-In-The-Data)
    - [Concatenating The Dataframes](#Concatenating-The-Dataframes)
    - [Cleaning](#Cleaning)
    - [Creating A Modeling Dataframe](#Creating-A-Modeling-Dataframe)

## Scraping The Reddit API

Reddit keeps raw data of all posts in a JSON format.  The documentation for its API can be found [here](https://www.reddit.com/dev/api/).

Before we can start working with Reddit's API data, we have to set up a request for the data.

### URLs & The User_Agent

Because we will be classifying posts from [r/Cooking](https://www.reddit.com/r/Cooking/) and [r/AskCulinary](https://www.reddit.com/r/AskCulinary), we need to have a URL and a `user_agent` for both subreddits.

In [2]:
# The same `user_agent` header is sent with both requests

user_agent      = {"user-agent": "andrew_bergman"}

# For r/Cooking

cooking_url     = "http://reddit.com/r/Cooking.json"

# For r/AskCulinary

askculinary_url = "http://reddit.com/r/AskCulinary.json"

### Requests

Now that we have the URLs and the `user_agent`, we can set up the requests.  Each one is a simple HTTP GET request made through the `requests` library; we print the status codes and expect 200 for both.

In [3]:
# The r/Cooking request

cooking_request      = re.get(url     = cooking_url, 
                              headers = user_agent)

# The r/AskCulinary request

askculinary_request  = re.get(url     = askculinary_url, 
                              headers = user_agent)

# Print status codes

print(f"The r/Cooking status code is    : {cooking_request.status_code}")
print(f"The r/AskCulinary status code is: {askculinary_request.status_code}")

The r/Cooking status code is    : 200
The r/AskCulinary status code is: 200
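
When a request does not come back with 200, the bare number is easier to act on once it is decoded; a 429, for example, means the server is rate-limiting the client. As a minimal sketch, the standard library's `http.HTTPStatus` can supply the reason phrase. The helper name `describe_status` is hypothetical and not part of this notebook.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a readable label such as '429 Too Many Requests'."""
    try:
        return f"{code} {HTTPStatus(code).phrase}"
    except ValueError:
        # Codes outside the standard registry have no reason phrase
        return f"{code} (unrecognised status)"


for code in (200, 429, 503):
    print(describe_status(code))
```

In this notebook it could be called as `describe_status(cooking_request.status_code)`.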


### Saving The Data

Now that we have two working requests, we can save the data in variables.  To do that, we parse the JSON in each response into a Python dictionary and keep that dictionary.


Because we already have some older posts, we will be adding the new pulls to our old ones.

In [4]:
# Saving the new r/Cooking data

new_cooking_data     = cooking_request.json()

# Saving the new r/AskCulinary data

new_askculinary_data = askculinary_request.json()


# Checking how many posts each first pull returned (25 per page, plus any stickied posts)

print(f'The initial r/Cooking request returned    : {len(new_cooking_data["data"]["children"])}')
print(f'The initial r/AskCulinary request returned: {len(new_askculinary_data["data"]["children"])}')

The initial r/Cooking request returned    : 25
The initial r/AskCulinary request returned: 27


In [5]:
# Looking at the `after` values, which mark where the next pull should start

print(f'The r/Cooking ID is    : {new_cooking_data["data"]["after"]}')
print(f'The r/AskCulinary ID is: {new_askculinary_data["data"]["after"]}')

The r/Cooking ID is    : t3_d04ht7
The r/AskCulinary ID is: t3_czuzx0
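
The lookups above depend on the shape of a listing response. The toy dictionary below mirrors only the keys this notebook reads (`data`, `children`, `after`, `kind`); every value in it is invented for illustration.

```python
# A hand-written stand-in for a listing response (all values are made up)
toy_listing = {
    "kind": "Listing",
    "data": {
        "after": "t3_abc125",
        "children": [
            {"kind": "t3", "data": {"id": "abc123", "title": "First post"}},
            {"kind": "t3", "data": {"id": "abc125", "title": "Second post"}},
        ],
    },
}

posts = toy_listing["data"]["children"]   # one dictionary per post
after = toy_listing["data"]["after"]      # token that points at the next page

print(len(posts))
print(after)
print([post["data"]["title"] for post in posts])
```

The `after` token is the `kind` prefix (`t3`) joined to the `id` of the last post on the page, which is why it can be passed back to request the page that follows.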


### Scraping

The Reddit API allows for 1,000 posts to be scraped per subreddit per day.  In total we will have roughly 2,000 posts in addition to the older scraped data.

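To make the scraping easier, we use a `for` loop to request 40 pages per subreddit.  Each page holds about 25 posts (40 × 25 = 1,000), and the `after` value from each response tells the next request where to continue.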

In [6]:
# Scraping r/Cooking

# Creating an empty list to save the scraped posts to
new_cooking_posts = []

# Setting it to `None` for use in the loop
after         = None

for pull in range(40):
    
    # Tells us which pull is running in case of errors
    print(f"Pull Attempt {pull + 1}")
    
    if after is None:
        
        # Sets up the initial loop
        new_url = cooking_url
        
    else:
        
        # Allows for the creation of the next pull
        new_url = cooking_url + "?after=" + after
        
    # Resetting the request    
    request = re.get(url = new_url, headers = user_agent)
    
    # Only works if the status is good
    if request.status_code == 200:
        # Parses the response into a dictionary & then adds its posts to the list
        new_cooking_data = request.json()
        new_cooking_posts.extend(new_cooking_data["data"]["children"])
        
        # Sets a new after value
        after = new_cooking_data["data"]["after"]
        
    else:
        print(f"An Error Has Occurred.  Error Code {request.status_code}")
        break
        
    # Setting a sleep time prevents me from being interpreted as a bot
    time.sleep(2)

Pull Attempt 1
Pull Attempt 2
Pull Attempt 3
Pull Attempt 4
Pull Attempt 5
Pull Attempt 6
Pull Attempt 7
Pull Attempt 8
Pull Attempt 9
Pull Attempt 10
Pull Attempt 11
Pull Attempt 12
Pull Attempt 13
Pull Attempt 14
Pull Attempt 15
Pull Attempt 16
Pull Attempt 17
Pull Attempt 18
Pull Attempt 19
Pull Attempt 20
Pull Attempt 21
Pull Attempt 22
Pull Attempt 23
Pull Attempt 24
Pull Attempt 25
Pull Attempt 26
Pull Attempt 27
Pull Attempt 28
Pull Attempt 29
Pull Attempt 30
Pull Attempt 31
Pull Attempt 32
Pull Attempt 33
Pull Attempt 34
Pull Attempt 35
Pull Attempt 36
Pull Attempt 37
Pull Attempt 38
Pull Attempt 39
Pull Attempt 40
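
The loop above builds each URL by string concatenation. A sketch of the same step with `urllib.parse.urlencode` follows. `build_page_url` is a hypothetical helper, and the `limit` parameter is an assumption: this notebook never sends it, although Reddit's listing documentation describes it with a maximum of 100.

```python
from typing import Optional
from urllib.parse import urlencode


def build_page_url(base_url: str,
                   after: Optional[str] = None,
                   limit: Optional[int] = None) -> str:
    """Append only the query parameters that were actually supplied."""
    params = {}
    if after is not None:
        params["after"] = after
    if limit is not None:
        params["limit"] = limit   # assumption: not used anywhere in this notebook
    return f"{base_url}?{urlencode(params)}" if params else base_url


print(build_page_url("http://reddit.com/r/Cooking.json"))
print(build_page_url("http://reddit.com/r/Cooking.json", after="t3_d04ht7"))
```

Letting `urlencode` assemble the query string keeps the URL valid if more parameters are added later.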


[Top](#Table-Of-Contents)

In [7]:
# For r/AskCulinary

new_askculinary_posts = []
after             = None

for pull in range(40):
    print(f"Pull Attempt {pull + 1}")
    if after is None:
        new_url = askculinary_url
    else:
        new_url = askculinary_url + "?after=" + after
    new_askculinary_request = re.get(url = new_url, headers = user_agent)
    if new_askculinary_request.status_code == 200:
        new_askculinary_data = new_askculinary_request.json()
        new_askculinary_posts.extend(new_askculinary_data["data"]["children"])
        after = new_askculinary_data["data"]["after"]
    else:
        print(f"An Error Has Occurred.  Error Code {new_askculinary_request.status_code}")
        break
    time.sleep(2)

Pull Attempt 1
Pull Attempt 2
Pull Attempt 3
Pull Attempt 4
Pull Attempt 5
Pull Attempt 6
Pull Attempt 7
Pull Attempt 8
Pull Attempt 9
Pull Attempt 10
Pull Attempt 11
Pull Attempt 12
Pull Attempt 13
Pull Attempt 14
Pull Attempt 15
Pull Attempt 16
Pull Attempt 17
Pull Attempt 18
Pull Attempt 19
Pull Attempt 20
Pull Attempt 21
Pull Attempt 22
Pull Attempt 23
Pull Attempt 24
Pull Attempt 25
Pull Attempt 26
Pull Attempt 27
Pull Attempt 28
Pull Attempt 29
Pull Attempt 30
Pull Attempt 31
Pull Attempt 32
Pull Attempt 33
Pull Attempt 34
Pull Attempt 35
Pull Attempt 36
Pull Attempt 37
Pull Attempt 38
Pull Attempt 39
Pull Attempt 40
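
The two loops above differ only in the URL they start from, so they could share one function. The sketch below is one way to do that. `scrape_listing` and its `fetch` parameter are hypothetical names; `fetch` stands in for the `requests` call so the example runs offline against two fake pages.

```python
import time
from typing import Callable, Dict, List, Optional


def scrape_listing(base_url: str,
                   fetch: Callable[[str], Optional[dict]],
                   pulls: int = 40,
                   pause: float = 2.0) -> List[dict]:
    """Follow `after` tokens for up to `pulls` pages and collect the posts.

    `fetch(url)` must return the parsed listing, or None on failure.
    """
    posts: List[dict] = []
    after = None
    for _ in range(pulls):
        url = base_url if after is None else f"{base_url}?after={after}"
        listing = fetch(url)
        if listing is None:
            break
        posts.extend(listing["data"]["children"])
        after = listing["data"]["after"]
        if after is None:      # no further page to request
            break
        time.sleep(pause)
    return posts


# Offline demonstration with two fake pages
fake_pages: Dict[str, dict] = {
    "base": {"data": {"children": [{"kind": "t3"}], "after": "t3_x"}},
    "base?after=t3_x": {"data": {"children": [{"kind": "t3"}], "after": None}},
}
collected = scrape_listing("base", fake_pages.get, pause=0)
print(len(collected))
```

Stopping when `after` comes back as `None` is a design choice of this sketch; the loops above would instead request the first page again. For live use, `fetch` could wrap `re.get(url, headers = user_agent)` and return `.json()` only when the status code is 200.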


### Conversion To Dataframes

Now that we have roughly 1,000 posts from each subreddit, we save them as Pandas dataframes because they are easier to manipulate in that format.  Each post's dictionary is stored whole in a single cell of the `data` column, but we will still be able to extract the fields we want.

In [8]:
# For r/Cooking posts

new_cooking_data     = pd.DataFrame(new_cooking_posts)

# For r/AskCulinary posts

new_askculinary_data = pd.DataFrame(new_askculinary_posts)

[Top](#Table-Of-Contents)

## Formatting

As mentioned above, the entirety of each post is stored in a single cell of the dataframe, so we have to extract the data we want.  To do that, we use list comprehensions and assign each resulting list to a new column.

### Column Extraction


The full dictionary for each post has a lot of key-value pairs, but we only need four: `id`, `author`, `title`, and `selftext`.

Many subreddits are image or video based, but r/Cooking and r/AskCulinary are primarily text-based communities: in addition to the title, each post usually has a body of text written by the author.

In [9]:
# For the r/Cooking data

# Using list comprehension to create new columns

cooking_id    = [post['id'] for post in new_cooking_data['data']]
cooking_auth  = [post['author'] for post in new_cooking_data['data']]
cooking_title = [post['title'] for post in new_cooking_data['data']]
cooking_self  = [post['selftext'] for post in new_cooking_data['data']]

# Creating new columns and setting them equal to the list comprehension results

new_cooking_data["id"]       = cooking_id
new_cooking_data["title"]    = cooking_title
new_cooking_data["selftext"] = cooking_self
new_cooking_data["author"]   = cooking_auth
new_cooking_data["source"]   = "cooking"

In [10]:
# For the r/AskCulinary data:

# Using list comprehension to create new columns

askcul_id    = [post['id'] for post in new_askculinary_data['data']]
askcul_auth  = [post['author'] for post in new_askculinary_data['data']]
askcul_title = [post['title'] for post in new_askculinary_data['data']]
askcul_self  = [post['selftext'] for post in new_askculinary_data['data']]

# Creating new columns and setting them equal to the list comprehension results

new_askculinary_data["id"]       = askcul_id
new_askculinary_data["title"]    = askcul_title
new_askculinary_data["selftext"] = askcul_self
new_askculinary_data["author"]   = askcul_auth
new_askculinary_data["source"]   = "askculinary"
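
The comprehensions above all follow one pattern, so a loop over the field names is a possible alternative. This sketch runs on a toy dataframe; the posts in it and the name `fields` are illustrative and do not come from the scraped data.

```python
import pandas as pd

# Toy stand-in for the scraped frame: one raw post dictionary per row
toy = pd.DataFrame({
    "kind": ["t3", "t3"],
    "data": [
        {"id": "abc123", "author": "user_a", "title": "Title A", "selftext": "Body A"},
        {"id": "abc124", "author": "user_b", "title": "Title B", "selftext": ""},
    ],
})

fields = ["id", "author", "title", "selftext"]
for field in fields:
    # Pull one field out of every post dictionary
    toy[field] = [post[field] for post in toy["data"]]
toy["source"] = "cooking"

print(toy[fields + ["source"]])
```

Adding another field then only requires extending `fields`.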

In [11]:
# Checking to make sure that the two are dataframes

print(f"The r/Cooking data is a    : {type(new_cooking_data)}")
print(f"The r/AskCulinary data is a: {type(new_askculinary_data)}")

The r/Cooking data is a    : <class 'pandas.core.frame.DataFrame'>
The r/AskCulinary data is a: <class 'pandas.core.frame.DataFrame'>


In [12]:
# Checking the format of the r/Cooking data

new_cooking_data.head()

Unnamed: 0,kind,data,id,title,selftext,author,source
0,t3,"{'approved_at_utc': None, 'subreddit': 'Cookin...",czue8a,My mom just boiled potatoes for 45 mins for ma...,It's water now.,CashewBaby,cooking
1,t3,"{'approved_at_utc': None, 'subreddit': 'Cookin...",czz7xx,"Recipe: The Original Hot &amp; Sour Soup, Wuxi...",So today I wanted to show you how to make Hot ...,mthmchris,cooking
2,t3,"{'approved_at_utc': None, 'subreddit': 'Cookin...",d05rzq,What do you look for in 'food/recipe blogging'?,Hey there!\n\nI hope this relates enough to r/...,ijnyh,cooking
3,t3,"{'approved_at_utc': None, 'subreddit': 'Cookin...",d00t78,[New to cooking] When I eat out at a restauran...,"I cook my chicken in a skillet, on med-high he...",-angry-dude,cooking
4,t3,"{'approved_at_utc': None, 'subreddit': 'Cookin...",czkuk6,What's a home-cooked meal you've never seen in...,"Originally had this as a comment, but for me i...",faitswulff,cooking


In [13]:
# Checking the format of the r/AskCulinary data

new_askculinary_data.head()

Unnamed: 0,kind,data,id,title,selftext,author,source
0,t3,"{'approved_at_utc': None, 'subreddit': 'AskCul...",61bcic,/r/AskCulinary best practices guide,"/r/AskCulinary is well over 100,000 subscriber...",bigtcm,askculinary
1,t3,"{'approved_at_utc': None, 'subreddit': 'AskCul...",cyupiv,Weekly discussion: homemade pastas / noodles,Fresh pasta/noodles aren't the most approachab...,albino-rhino,askculinary
2,t3,"{'approved_at_utc': None, 'subreddit': 'AskCul...",d0152m,Was recently in Italy and most tiramisu looks ...,,PumbaPLS,askculinary
3,t3,"{'approved_at_utc': None, 'subreddit': 'AskCul...",d041lu,Can i deep fry duck like chicken?,I want to try and deep fry Duck in the same wa...,HTKTHEPRODUCER,askculinary
4,t3,"{'approved_at_utc': None, 'subreddit': 'AskCul...",czrsd2,Maple syrup is a Spring food. So why do we ass...,Was doing some research but couldn't find hist...,shiva14b,askculinary


The first and second rows of the r/AskCulinary dataframe are stickied posts which have to be removed.

In [14]:
new_askculinary_data.drop([0,1], 
                          inplace = True)

In [15]:
new_askculinary_data.head(2)

Unnamed: 0,kind,data,id,title,selftext,author,source
2,t3,"{'approved_at_utc': None, 'subreddit': 'AskCul...",d0152m,Was recently in Italy and most tiramisu looks ...,,PumbaPLS,askculinary
3,t3,"{'approved_at_utc': None, 'subreddit': 'AskCul...",d041lu,Can i deep fry duck like chicken?,I want to try and deep fry Duck in the same wa...,HTKTHEPRODUCER,askculinary
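
Dropping rows 0 and 1 works here because the stickied posts came first in this pull. A flag-based filter is a possible alternative, sketched below on invented posts. It assumes each post dictionary carries a boolean `stickied` key; that is an assumption about the response, since this notebook never reads that key.

```python
# Toy posts in the same {"kind": ..., "data": {...}} shape as the scrape
toy_posts = [
    {"kind": "t3", "data": {"id": "aaa111", "stickied": True}},
    {"kind": "t3", "data": {"id": "bbb222", "stickied": False}},
    {"kind": "t3", "data": {"id": "ccc333"}},  # key missing: treated as not stickied
]

regular_posts = [
    post for post in toy_posts
    if not post["data"].get("stickied", False)
]

print([post["data"]["id"] for post in regular_posts])
```

Filtering on a flag does not depend on where the stickied posts sit in the list or on how many there are.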


### Creating New .csv Files

In [16]:
# Creating a r/Cooking .csv file

new_cooking_data.to_csv("../Data/new_cooking_df.csv")

# Creating a r/AskCulinary .csv file

new_askculinary_data.to_csv("../Data/new_askculinary_df.csv")
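
By default `to_csv` also writes the index as an unnamed first column, which `read_csv` then reports as `Unnamed: 0`. The sketch below shows both behaviours on a toy frame using an in-memory buffer; `round_trip` is a hypothetical helper written only for this demonstration.

```python
import io

import pandas as pd

toy = pd.DataFrame({"id": ["abc123", "abc124"], "source": ["cooking", "cooking"]})


def round_trip(frame: pd.DataFrame, **to_csv_kwargs) -> pd.DataFrame:
    """Write `frame` to an in-memory .csv and read it straight back."""
    buffer = io.StringIO()
    frame.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return pd.read_csv(buffer)


default_columns = round_trip(toy).columns.tolist()
no_index_columns = round_trip(toy, index=False).columns.tolist()

print(default_columns)     # includes the saved index
print(no_index_columns)    # only the real columns
```

Passing `index = False` when saving is one way to keep the extra column out of the file.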

[Top](#Table-Of-Contents)

## Data Cleaning

### Reading In The Data

In [17]:
# The r/Cooking data

old_cooking_data = pd.read_csv("../Data/cooking_df.csv")
new_cooking_data = pd.read_csv("../Data/new_cooking_df.csv")

# The r/AskCulinary data

old_askculinary_data = pd.read_csv("../Data/askculinary_df.csv")
new_askculinary_data = pd.read_csv("../Data/new_askculinary_df.csv")

In [18]:
# Checking the shapes of the .csv files
 
print(f"The shape of the new r/Cooking dataframe is    : {new_cooking_data.shape}")
print(f"The shape of the new r/Askculinary dataframe is: {new_askculinary_data.shape}")

print(f"The shape of the old r/Cooking dataframe is    : {old_cooking_data.shape}")
print(f"The shape of the old r/Askculinary dataframe is: {old_askculinary_data.shape}")

The shape of the new r/Cooking dataframe is    : (980, 8)
The shape of the new r/Askculinary dataframe is: (995, 8)
The shape of the old r/Cooking dataframe is    : (989, 8)
The shape of the old r/Askculinary dataframe is: (999, 8)


In [19]:
# Checking the head of the old r/Cooking data

old_cooking_data.head()

Unnamed: 0.1,Unnamed: 0,data,kind,id,title,selftext,author,source
0,0,"{'approved_at_utc': None, 'subreddit': 'Cookin...",t3,cbl354,Does anyone else immediately distrust a recipe...,Edit: if anyone else tries to tell me they can...,bobs_aspergers,cooking
1,1,"{'approved_at_utc': None, 'subreddit': 'Cookin...",t3,cbuvkb,Best potato salad recipe??,,coolbeanbeans,cooking
2,2,"{'approved_at_utc': None, 'subreddit': 'Cookin...",t3,cbtuhn,Mortar &amp; Pestle questions,1.) Is marble dust safe to ingest? I ground sa...,Swigart,cooking
3,3,"{'approved_at_utc': None, 'subreddit': 'Cookin...",t3,cbmrz3,Weekly menu-setting has changed my life,"I’ve always enjoyed cooking, but in the past s...",chuy1530,cooking
4,4,"{'approved_at_utc': None, 'subreddit': 'Cookin...",t3,cb5pvy,This guy in India has a cooking channel where ...,I just stumbled upon this guy's channel today ...,Svargas05,cooking


In [20]:
# Checking the head of the old r/AskCulinary data

old_askculinary_data.head()

Unnamed: 0.1,Unnamed: 0,data,kind,id,title,selftext,author,source
0,0,"{'approved_at_utc': None, 'subreddit': 'AskCul...",t3,61bcic,/r/AskCulinary best practices guide,"/r/AskCulinary is well over 100,000 subscriber...",bigtcm,askculinary
1,1,"{'approved_at_utc': None, 'subreddit': 'AskCul...",t3,cap1n7,Weekly discussion: Melons,It is hot everywhere in the northern hemispher...,albino-rhino,askculinary
2,2,"{'approved_at_utc': None, 'subreddit': 'AskCul...",t3,cbm3b1,Everyone who has went to culinary school - spe...,"I’m about to start a dual degree (JD/MA), the ...",Link_the_Fox,askculinary
3,3,"{'approved_at_utc': None, 'subreddit': 'AskCul...",t3,cbqq5d,How is “Stage” pronounced?,"I always read it as the word stage, like a sta...",andykndr,askculinary
4,4,"{'approved_at_utc': None, 'subreddit': 'AskCul...",t3,cbfibu,"Still working on ""cheater"" ramen and now need ...","So adding gelatin to the ""cheater"" ramen was t...",gingernuts13,askculinary


The first two rows of the old r/AskCulinary dataframe are stickied posts and have to be removed.

In [21]:
old_askculinary_data.drop([0,1], 
                          inplace = True)

In [22]:
old_askculinary_data.head()

Unnamed: 0.1,Unnamed: 0,data,kind,id,title,selftext,author,source
2,2,"{'approved_at_utc': None, 'subreddit': 'AskCul...",t3,cbm3b1,Everyone who has went to culinary school - spe...,"I’m about to start a dual degree (JD/MA), the ...",Link_the_Fox,askculinary
3,3,"{'approved_at_utc': None, 'subreddit': 'AskCul...",t3,cbqq5d,How is “Stage” pronounced?,"I always read it as the word stage, like a sta...",andykndr,askculinary
4,4,"{'approved_at_utc': None, 'subreddit': 'AskCul...",t3,cbfibu,"Still working on ""cheater"" ramen and now need ...","So adding gelatin to the ""cheater"" ramen was t...",gingernuts13,askculinary
5,5,"{'approved_at_utc': None, 'subreddit': 'AskCul...",t3,cbnam6,What did I do wrong with my chickpea curry?,I attempted to make the recipe shown at the en...,robsc_16,askculinary
6,6,"{'approved_at_utc': None, 'subreddit': 'AskCul...",t3,cbisi4,Cookbooks on Spices... based on science,Does anyone have a cookbook on spices that inc...,espierc,askculinary


### Concatenating The Dataframes

In [23]:
# Concatenating the cooking dataframes together

cooking_combined     = pd.concat(objs = [old_cooking_data, 
                                         new_cooking_data],
                                 axis = 0, 
                                 sort = False)

# Concatenating the askculinary dataframes together

askculinary_combined = pd.concat(objs = [old_askculinary_data, 
                                         new_askculinary_data],
                                 axis = 0, 
                                 sort = False)

In [24]:
# Making sure there are no cross-posts

cooking_combined      = cooking_combined[~cooking_combined["title"].isin(askculinary_combined["title"])]
askculinary_combined  = askculinary_combined[~askculinary_combined["title"].isin(cooking_combined["title"])]
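
Because the second filter above runs against the already-filtered `cooking_combined`, it compares against a frame that no longer holds the shared titles. If the intent is to drop cross-posts from both sides, one option is to compute both masks before filtering either frame. A sketch on toy frames with made-up titles:

```python
import pandas as pd

cooking = pd.DataFrame({"title": ["Shared title", "Only in cooking"]})
askculinary = pd.DataFrame({"title": ["Shared title", "Only in askculinary"]})

# Build both masks first so neither depends on the other's filtering
cooking_is_cross = cooking["title"].isin(askculinary["title"])
askculinary_is_cross = askculinary["title"].isin(cooking["title"])

cooking_unique = cooking[~cooking_is_cross]
askculinary_unique = askculinary[~askculinary_is_cross]

print(cooking_unique["title"].tolist())
print(askculinary_unique["title"].tolist())
```

With this ordering the shared title disappears from both frames rather than from the first one only.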

In [25]:
# Concatenating the two combined dataframes together

combined_data = pd.concat(objs = [cooking_combined, 
                                  askculinary_combined],
                          axis = 0, 
                          sort = False)

In [26]:
# Checking the shape of the dataframe

combined_data.shape

(3950, 8)

[Top](#Table-Of-Contents)

### Cleaning

#### Dropping Null Values

Unfortunately, we have to drop all rows with null values because there is no way for us to fill in the missing data.

In [27]:
combined_data.dropna(inplace = True)

# Checking the length of the dataframe

print(f"The dataframe now has {combined_data.shape[0]} rows.")

The dataframe now has 3682 rows.
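
Before dropping, it can help to see which columns hold the nulls. The sketch below uses a toy .csv with invented rows. Note that an empty `selftext` field in a .csv file is read back as `NaN` by default, so a post with a title but no body counts as a row with a null.

```python
import io

import pandas as pd

# Toy .csv text: the second post has a title but no body
csv_text = "id,title,selftext\nabc123,Title A,Body A\nabc124,Title B,\n"
toy = pd.read_csv(io.StringIO(csv_text))

null_counts = toy.isna().sum()       # number of nulls in each column
print(null_counts)

cleaned = toy.dropna()               # drops any row with at least one null
print(f"{len(toy)} rows before, {len(cleaned)} rows after")
```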


#### Removing Duplicate IDs

There is a chance that the same post was scraped more than once, for example if it appears in both the old and the new pulls, so by removing duplicated `id`s we make sure each post goes into the model only once.

In [28]:
combined_data.drop_duplicates("id",
                              keep    = "first",
                              inplace = True)

# Checking the length of the dataframe

print(f"The dataframe now has {combined_data.shape[0]} rows.")

The dataframe now has 3249 rows.


#### Removing Unnecessary Columns

We no longer need the `data` and `kind` columns because we extracted the features we need.  Additionally, `Unnamed: 0` is a relic of saving the dataframes to .csv files together with their index.

In [29]:
combined_data.columns

Index(['Unnamed: 0', 'data', 'kind', 'id', 'title', 'selftext', 'author',
       'source'],
      dtype='object')

In [30]:
# Dropping the three columns

combined_data.drop(["Unnamed: 0", "data", "kind"],
                   axis    = 1,
                   inplace = True)

In [31]:
# Making sure the columns were dropped successfully

combined_data.columns

Index(['id', 'title', 'selftext', 'author', 'source'], dtype='object')

#### Combining `title` & `selftext`

We decided to combine the two text columns into one to make the modeling process a little easier.  To do that, we simply join `title` and `selftext` with a space.  Although we ran a regular expression earlier, we decided to run another one here to make sure every character other than letters and spaces was removed.

In [32]:
combined_data["text"] = combined_data["title"] + " " + combined_data["selftext"]
combined_data["text"] = combined_data["text"].str.replace("[^a-zA-Z ]", "", regex = True)
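
To see what the pattern `[^a-zA-Z ]` keeps and removes, here is a small sketch with the standard-library `re` module on an invented string. Since this notebook aliases `requests` as `re`, the module is imported under the hypothetical name `regex` to avoid a clash.

```python
import re as regex  # alias avoids clashing with the notebook's `requests as re`

sample = "Hot &amp; Sour Soup\nAdd 2 tbsp vinegar!"

# Everything except ASCII letters and spaces is deleted, not replaced
cleaned = regex.sub("[^a-zA-Z ]", "", sample)
print(repr(cleaned))
```

Two side effects are visible in the result: deleting the newline joins `Soup` and `Add` into one token, and the HTML entity `&amp;` leaves the stray word `amp` behind. Replacing the unwanted characters with a space instead of an empty string is one possible way to keep neighbouring words apart.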

Now that we have combined the two columns, we can drop the two originals.

In [33]:
combined_data.drop(["title", "selftext"], 
                   axis    = 1,
                   inplace = True)

#### Mapping The `source` Column

The models we will construct need a numeric target, so we encode the `source` column as binary values: 1 for r/Cooking and 0 for r/AskCulinary.

In [34]:
combined_data["target"] = combined_data["source"].apply(lambda x: 1 if x == "cooking" else 0)
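
The same encoding can be written as a dictionary lookup, which has the side benefit of turning any unexpected label into a null instead of silently mapping it to 0. A sketch on toy labels; `label_map` is a hypothetical name.

```python
import pandas as pd

toy = pd.DataFrame({"source": ["cooking", "askculinary", "cooking"]})

label_map = {"cooking": 1, "askculinary": 0}
toy["target"] = toy["source"].map(label_map)

# A label missing from label_map would become NaN and fail this check
assert toy["target"].notna().all()
print(toy["target"].tolist())
```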

### Creating A Modeling Dataframe

In [35]:
combined_data.to_csv("../Data/model_data.csv", index = False)

[Top](#Table-Of-Contents)