# Initial Data Collection 

### Problem Statement

I work for [TourismOhio](https://ohio.org/home/about-us), where our mission statement is to “aggressively position Ohio as a relevant travel destination and support Ohio’s tourism industry to drive economic prosperity throughout the state.” We saw a 21% drop in visitor spending in 2020, but rebounded in 2021, and we are now trying to continue the growth momentum with more relevant ads and offers.

To do this, I’ve first been asked to scrape social media sites and find out what people are saying about our state and why they may or may not visit so that we can adapt our advertisements to our target audiences. I am starting on Reddit, with what I believe to be two relevant subreddit pages:
- Ohio: Created on October 4, 2008, with 342k members
- IHateOhio: Created on September 28, 2018, with 4.6k members

Before I can analyze the messages or make any recommendations, I need to be able to pull posts from each of these subreddits into an aggregated data frame. My goal is to build a classification model that can predict, with at least 85% accuracy, which subreddit each post is from.

In [1]:
# Imports

import pandas as pd
import numpy as np
import requests 
import time

In [2]:
hate_url = 'https://api.pushshift.io/reddit/search/submission?subreddit=ihateohio'

love_url = 'https://api.pushshift.io/reddit/search/submission?subreddit=ohio'

res = requests.get(hate_url)

# Testing status 
res.status_code

200

In [3]:
# Defining a function to pull from the API several times and return a full dataframe

def api_pull(url):
    '''Pull self (text) posts over several look-back windows and return one combined dataframe.'''
    # Look-back windows passed to the 'after' parameter
    days = ['30d', '60d', '90d', '200d', '365d', '500d', '730d', '900d', '1095d', '1460d']

    # Empty list to collect one dataframe per request
    list_df = []

    # Iterating through the days list and appending each df to the df list
    for i in days:
        params = {'size': 500, 'after': i, 'is_self': True, 'selftext:not': '[removed]'}
        res = requests.get(url, params=params)
        data = res.json()
        posts = data['data']
        df = pd.DataFrame(posts)
        list_df.append(df)
        time.sleep(2.4) # Pause between requests
    list_df = pd.concat(list_df, axis=0, ignore_index=True)

    return list_df

After quite a bit of trial and error with the 'days' list values and whether to use 'before' or 'after' as the parameter, this function produced the most robust dataframes for both subreddits.
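To see what each request in the loop asks for, the parameter dictionary can be encoded into a query string by hand. This is a minimal sketch using only the standard library's `urllib.parse.urlencode`. It approximates the URL that `requests` builds from `params` and is not a capture of a real request; the helper name `build_query_url` is hypothetical.

```python
from urllib.parse import urlencode

hate_url = 'https://api.pushshift.io/reddit/search/submission?subreddit=ihateohio'

def build_query_url(url, after):
    # Hypothetical helper: same parameters as api_pull, for one look-back window
    params = {'size': 500, 'after': after, 'is_self': True, 'selftext:not': '[removed]'}
    # The base URL already has a query string, so extra parameters are joined with '&'
    return url + '&' + urlencode(params)

example_url = build_query_url(hate_url, '30d')
print(example_url)
```

Note that the Python boolean is sent as the literal text `True`, and the colon and square brackets in the last parameter are percent-encoded.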

In [4]:
# Using the function to create the dataframes for each subreddit

love_df = pd.DataFrame(api_pull(love_url))

print(love_df.shape)
love_df.head()

(2464, 78)


Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,...,banned_by,removed_by_category,edited,poll_data,steward_reports,updated_utc,og_description,og_title,link_flair_css_class,link_flair_text
0,[],False,barelycriminal,,[],,text,t2_mnrikfq0,False,False,...,,,,,,,,,,
1,[],False,DreamsAndBoxes,,[],,text,t2_cfk50px5,False,False,...,,,,,,,,,,
2,[],True,magsbunni,,[],,text,t2_ozc0ugna,False,False,...,,,,,,,,,,
3,[],False,Ohiowelder,,[],,text,t2_b0m3oyra,False,False,...,,,,,,,,,,
4,[],False,titanup1993,,[],,text,t2_6plqhogn,False,False,...,,,,,,,,,,


In [5]:
hate_df = pd.DataFrame(api_pull(hate_url))

print(hate_df.shape)
hate_df.head()

(968, 78)


Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,...,preview,poll_data,removed_by_category,banned_by,steward_reports,updated_utc,og_description,og_title,gilded,author_cakeday
0,[],False,assfuck1911,,[],,text,t2_147l8en8,False,False,...,,,,,,,,,,
1,[],False,Molotov_YouTube,,[],,text,t2_7rdlbmvn,False,False,...,,,,,,,,,,
2,[],False,CurrentSingleStatus,,[],,text,t2_opgnn5az,False,False,...,,,,,,,,,,
3,[],False,CurrentSingleStatus,,[],,text,t2_opgnn5az,False,False,...,,,,,,,,,,
4,[],False,TO2010SGC,,[],,text,t2_p8oa5os6,False,False,...,,,,,,,,,,


Immediately we can see a repeated author in the first five rows of hate_df, which suggests the dataframe may contain duplicate posts. I'm going to use the 'id' column (a unique identifier for each post) to check for and remove duplicates from the dataframes.
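As a quick illustration of that check, here is a minimal sketch on a toy dataframe. The ids and authors are made up; only the column name 'id' comes from the real data.

```python
import pandas as pd

# Toy stand-in for a pulled dataframe; the ids and authors are made up
toy_df = pd.DataFrame({
    'id': ['a1', 'b2', 'a1', 'c3', 'a1'],
    'author': ['ann', 'bob', 'ann', 'cat', 'ann'],
})

# True for every row whose id already appeared earlier in the frame
is_repeat = toy_df.duplicated(subset='id', keep='first')

n_rows = len(toy_df)
n_unique = toy_df['id'].nunique()
n_repeats = int(is_repeat.sum())

print(n_rows, n_unique, n_repeats)
```

If `n_rows` and `n_unique` differ, the frame holds repeated posts, which is the same signal that `value_counts()` gives when any count is above 1.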

In [6]:
love_df['id'].value_counts()

x2q162    1
jkveyl    1
jkb4y4    1
jkf686    1
jkfmfm    1
         ..
tr06tx    1
tr0y1w    1
tr5few    1
trm3w9    1
aqmcpu    1
Name: id, Length: 2464, dtype: int64

In [7]:
hate_df['id'].value_counts()

x6zmjg    9
xrlo0t    9
xrn5i2    9
xr4qfa    9
we3qto    8
         ..
bynork    1
bysm12    1
bytb4e    1
bytn8n    1
daptva    1
Name: id, Length: 284, dtype: int64

In [9]:
hate_df.loc[hate_df['id'] == 'xrlo0t']

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,...,preview,poll_data,removed_by_category,banned_by,steward_reports,updated_utc,og_description,og_title,gilded,author_cakeday
2,[],False,CurrentSingleStatus,,[],,text,t2_opgnn5az,False,False,...,,,,,,,,,,
8,[],False,CurrentSingleStatus,,[],,text,t2_opgnn5az,False,False,...,,,,,,,,,,
21,[],False,CurrentSingleStatus,,[],,text,t2_opgnn5az,False,False,...,,,,,,,,,,
53,[],False,CurrentSingleStatus,,[],,text,t2_opgnn5az,False,False,...,,,,,,,,,,
105,[],False,CurrentSingleStatus,,[],,text,t2_opgnn5az,False,False,...,,,,,,,,,,
183,[],False,CurrentSingleStatus,,[],,text,t2_opgnn5az,False,False,...,,,,,,,,,,
312,[],False,CurrentSingleStatus,,[],,text,t2_opgnn5az,False,False,...,,,,,,,,,,
498,[],False,CurrentSingleStatus,,[],,text,t2_opgnn5az,False,False,...,,,,,,,,,,
716,[],False,CurrentSingleStatus,,[],,text,t2_opgnn5az,False,False,...,,,,,,,,,,


Although hate_df has 968 rows, only 284 of them are unique posts. Interestingly, this is not true for love_df, where all 2,464 ids are unique.
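One plausible reason for the repeats, offered as an assumption rather than something confirmed here, is that the `after` windows are nested: any post newer than 30 days is also newer than 60 days, 90 days, and so on. The sketch below simulates that with made-up post ages. It ignores the API's sort order and `size` cap, so it is a simplified model and not a description of how Pushshift behaves.

```python
# Made-up posts: id -> age in days (not real data)
post_ages = {'p1': 5, 'p2': 40, 'p3': 75, 'p4': 300}

# Look-back windows in days, a shortened version of the 'days' list
windows = [30, 60, 90, 365]

rows = []
for window in windows:
    # Simplified model: a request returns every post newer than the cutoff
    rows.extend(pid for pid, age in post_ages.items() if age <= window)

total_rows = len(rows)
unique_ids = len(set(rows))
times_seen = {pid: rows.count(pid) for pid in post_ages}

print(total_rows, unique_ids, times_seen)
```

In this model the newest post is returned once per window, which resembles the way some ids appear up to nine times in hate_df.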

Because I want a more equal distribution of observations between the two subreddits, I'm going to slightly modify the api_pull function to gather more data from the IHateOhio page. The new function sets 'is_self' to False, so it pulls link posts rather than text posts, and the resulting dataframe will not have any selftext values.

In [10]:
def sans_selftext(url):
    '''Pull non-self (link) posts over several look-back windows and return one combined dataframe.'''
    # Look-back windows passed to the 'after' parameter
    days = ['30d', '60d', '90d', '200d', '365d', '500d', '730d', '900d', '1095d', '1460d']

    # Empty list to collect one dataframe per request
    list_df = []

    # Iterating through the days list and appending each df to the df list
    for i in days:
        params = {'size': 500, 'after': i, 'is_self': False, 'selftext:not': '[removed]'}
        res = requests.get(url, params=params)
        data = res.json()
        posts = data['data']
        df = pd.DataFrame(posts)
        list_df.append(df)
        time.sleep(2.4) # Pause between requests
    list_df = pd.concat(list_df, axis=0, ignore_index=True)

    return list_df

In [11]:
no_subt_hate = pd.DataFrame(sans_selftext(hate_url))

print(no_subt_hate.shape)
no_subt_hate.head()

(1701, 82)


Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,...,media_embed,secure_media,secure_media_embed,author_cakeday,removed_by_category,gallery_data,is_gallery,media_metadata,steward_reports,updated_utc
0,[],False,yeet4memes,,[],,text,t2_6i96d33a,False,False,...,,,,,,,,,,
1,[],False,Ow_wow,,"[{'e': 'text', 't': 'Ohio Hater'}]",Ohio Hater,richtext,t2_irzit,False,False,...,,,,,,,,,,
2,[],False,space_gnomke,,[],,text,t2_1hukfzwf,False,False,...,,,,,,,,,,
3,[],False,phaggut69,,[],,text,t2_3ebdkb46,False,False,...,,,,,,,,,,
4,[],False,NickTAB16,,[],,text,t2_skw9vqqn,False,False,...,,,,,,,,,,


In [12]:
no_subt_hate['id'].value_counts()

x70vyf    5
xlduv1    5
xcejcb    5
we0job    5
xr9lew    5
         ..
bvgrlw    1
bukx2y    1
buji95    1
bmybnr    1
d9c83p    1
Name: id, Length: 622, dtype: int64

We can see that there are duplicates in this dataframe as well, but fewer than in our original hate_df. I'll now remove the duplicate entries from both dataframes, then concatenate the two.

In [13]:
print(hate_df.shape)
print(no_subt_hate.shape)
hate_df.drop_duplicates(subset='id', keep='first', inplace=True)
no_subt_hate.drop_duplicates(subset='id', keep='first', inplace=True)
print(hate_df.shape)
print(no_subt_hate.shape)

(968, 78)
(1701, 82)
(284, 78)
(622, 82)


In [14]:
hate_df = pd.concat([hate_df, no_subt_hate], ignore_index=True)

print(hate_df.shape)
hate_df.head(3)

(906, 87)


Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,...,author_cakeday,crosspost_parent,crosspost_parent_list,url_overridden_by_dest,media,media_embed,secure_media,secure_media_embed,gallery_data,is_gallery
0,[],False,assfuck1911,,[],,text,t2_147l8en8,False,False,...,,,,,,,,,,
1,[],False,Molotov_YouTube,,[],,text,t2_7rdlbmvn,False,False,...,,,,,,,,,,
2,[],False,CurrentSingleStatus,,[],,text,t2_opgnn5az,False,False,...,,,,,,,,,,
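The combined frame has 87 columns even though the inputs had 78 and 82, because `pd.concat` keeps the union of columns by default and fills the gaps with NaN. A minimal sketch with two toy frames (made-up values, with column names chosen only to echo the real ones):

```python
import pandas as pd

# Two toy frames with partly different columns (made-up values)
text_posts = pd.DataFrame({'id': ['a1'], 'title': ['t1'], 'selftext': ['body']})
link_posts = pd.DataFrame({'id': ['b2'], 'title': ['t2'], 'media': ['clip']})

# Default join='outer' keeps every column that appears in either frame
combined = pd.concat([text_posts, link_posts], ignore_index=True)

print(combined.shape)
print(combined.isna().sum())
```

Columns that exist in only one input come out as NaN for the rows of the other, which is worth keeping in mind before modeling on any of those columns.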


In [15]:
hate_df['id'].value_counts()

x6zmjg    1
kqrukk    1
k5olwo    1
k60ahe    1
k6fjv9    1
         ..
we0job    1
wf6gth    1
wfa25h    1
wh2r1n    1
d9c83p    1
Name: id, Length: 906, dtype: int64

We're left with 906 unique rows of data for the IHateOhio subreddit.

To balance the number of observations, I'll undersample by randomly dropping 1,500 of the 2,464 rows (about 61%) from the love dataframe before combining the two into one full df.

In [17]:
np.random.seed(42)

num_to_drop = 1500
drop_index = np.random.choice(love_df.index, num_to_drop, replace=False)
love_df = love_df.drop(drop_index)

print(love_df.shape)
love_df.head(3)

(964, 78)


Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,...,banned_by,removed_by_category,edited,poll_data,steward_reports,updated_utc,og_description,og_title,link_flair_css_class,link_flair_text
0,[],False,barelycriminal,,[],,text,t2_mnrikfq0,False,False,...,,,,,,,,,,
1,[],False,DreamsAndBoxes,,[],,text,t2_cfk50px5,False,False,...,,,,,,,,,,
3,[],False,Ohiowelder,,[],,text,t2_b0m3oyra,False,False,...,,,,,,,,,,
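An equivalent way to undersample is to state how many rows to keep instead of how many to drop. This is a minimal sketch with `DataFrame.sample` on a toy frame. It will not select the same rows as the `np.random.choice` approach above, so treat it as an alternative and not a reproduction.

```python
import pandas as pd

# Toy frame standing in for love_df (made-up ids)
toy_love = pd.DataFrame({'id': [f'post{i}' for i in range(10)]})

# Keep 4 rows at random instead of choosing 6 rows to drop
kept = toy_love.sample(n=4, random_state=42)

# The same random_state gives the same rows on a second call
kept_again = toy_love.sample(n=4, random_state=42)

print(kept.shape)
```

Passing `random_state` plays the same role as `np.random.seed(42)` in the cell above: it makes the selection repeatable.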


In [19]:
# Saving a combined dataframe with all columns to use in future notebooks as needed

reddit_all = pd.concat([love_df, hate_df], ignore_index=True)

print(reddit_all.shape)

reddit_all.to_csv('../datasets/reddit_all_columns_df.csv')

(1870, 89)
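To put the 85% accuracy goal in context, it helps to know what a model would score by always guessing the larger class. The sketch below uses the row counts printed in this notebook; the helper name `majority_share` is hypothetical.

```python
# Row counts taken from the shapes printed in this notebook
love_before, love_after, hate_rows = 2464, 964, 906

def majority_share(*counts):
    # Accuracy of always guessing the largest class
    return max(counts) / sum(counts)

baseline_before = majority_share(love_before, hate_rows)
baseline_after = majority_share(love_after, hate_rows)

print(round(baseline_before, 3), round(baseline_after, 3))
```

Undersampling moves the majority-class share from roughly 0.73 to roughly 0.52, so accuracy on the combined dataframe is measured against a baseline close to a coin flip.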
