### Problem Statement

As a data scientist working on the next presidential campaign, I need to understand what the voting base cares about, whether Republican or Democrat, so that we can build a successful and flexible campaign. In this exercise I will collect user posts from the Democrat and Republican subreddits and use NLP to identify which topics are top of mind, then use sentiment analysis to gauge how each voter base feels about those topics.

#### Data Collection
In this notebook I will call the Pushshift API for Reddit to download approximately 10k submissions from each of the Democrat and Republican subreddits. Before storing the data for use in the subsequent notebooks, I will drop submissions that were deleted or removed, whether by a moderator or by the automod filter.
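
The collection cells below repeat the same request-and-append pattern for each subreddit. As a minimal sketch of that pattern, assuming each record carries a `created_utc` field and that `before` returns only older posts, the paging could be factored into one helper. The names `collect_submissions`, `fetch_page`, and `fetch_pushshift` are hypothetical and not part of this notebook; `fetch_page` is any callable that takes a params dict and returns a list of records.

```python
import pandas as pd

PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission"


def fetch_pushshift(params, url=PUSHSHIFT_URL):
    """Hypothetical fetcher: one GET request, returning the list under 'data'."""
    import requests  # imported lazily so the paging logic can be tried offline

    resp = requests.get(url, params=params, timeout=30)
    resp.raise_for_status()  # fail loudly instead of parsing an error page
    return resp.json()["data"]


def collect_submissions(fetch_page, subreddit, pages=40, size=250):
    """Page backwards in time, using the oldest timestamp seen as the cursor."""
    frames = []
    before = None
    for _ in range(pages):
        params = {"subreddit": subreddit, "size": size}
        if before is not None:
            params["before"] = before
        records = fetch_page(params)
        if not records:  # assumption: an empty page means no older posts remain
            break
        frames.append(pd.DataFrame(records))
        before = int(min(r["created_utc"] for r in records))
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

With network access, `collect_submissions(fetch_pushshift, 'democrats')` would mirror the 40 requests of 250 posts made below, and passing a stub for `fetch_page` lets the loop be checked without calling the API. Unlike the cells below, the sketch resets the index so every row label is unique.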

In [1]:
import requests
import pandas as pd
import numpy as np

#Not best practice but I am ignoring warnings for the api collection
import warnings
warnings.filterwarnings("ignore")

## Democrat Subreddit Collection

In [2]:
# Data collection from the democrats subreddit
url = 'https://api.pushshift.io/reddit/search/submission'

# Pull the first 250 posts
params1 = {
    'subreddit': 'democrats',
    'size': 250
}
req = requests.get(url, params=params1)
dem_data = req.json()
dems_df = pd.DataFrame(dem_data['data'])

# Loop to pull ~10k posts in batches of 250 (limit set by the API)
for i in range(1, 40):
    params2 = {
        'subreddit': 'democrats',
        'size': 250,
        # Scalar timestamp of the last (oldest) post collected so far
        'before': int(dems_df['created_utc'].iloc[-1])
    }

    req2 = requests.get(url, params=params2)
    dem_data2 = req2.json()
    # DataFrame.append was removed in pandas 2.0, so concatenate each new batch
    dems_df = pd.concat([dems_df, pd.DataFrame(dem_data2['data'])])

In [3]:
#Total of 9989 submissions and 83 columns captured
dems_df.shape

(9989, 83)

In [4]:
dems_df.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,...,author_flair_background_color,author_flair_text_color,author_flair_template_id,call_to_action,category,distinguished,media_metadata,banned_by,author_cakeday,suggested_sort
0,[],False,NewYorkPainter,,[],,text,t2_ljveak5f,False,False,...,,,,,,,,,,
1,[],False,coffeespeaking,,[],,text,t2_u42v3,False,False,...,,,,,,,,,,
2,[],False,KofCrypto0720,,[],,text,t2_4sfmoujf,False,False,...,,,,,,,,,,
3,[],False,Far_Refrigerator5490,,[],,text,t2_hiept6gc,False,False,...,,,,,,,,,,
4,[],False,LolAtAllOfThis,,[],,text,t2_45njwvci,False,False,...,,,,,,,,,,


## Republican Subreddit Collection

In [5]:
# Data collection from the Republican subreddit
url = 'https://api.pushshift.io/reddit/search/submission'

# Pull the first 250 posts
params1 = {
    'subreddit': 'Republican',
    'size': 250
}
req = requests.get(url, params=params1)
rep_data = req.json()
rep_df = pd.DataFrame(rep_data['data'])

# Loop to pull ~10k posts in batches of 250 (limit set by the API)
for i in range(1, 40):
    params2 = {
        'subreddit': 'Republican',
        'size': 250,
        # Scalar timestamp of the last (oldest) post collected so far
        'before': int(rep_df['created_utc'].iloc[-1])
    }

    req2 = requests.get(url, params=params2)
    rep_data2 = req2.json()
    # DataFrame.append was removed in pandas 2.0, so concatenate each new batch
    rep_df = pd.concat([rep_df, pd.DataFrame(rep_data2['data'])])

In [6]:
#9999 submissions with 81 columns captured
rep_df.shape

(9999, 81)

### Feature Selection
After reviewing the two subreddits, it's clear that posts generally include only a title and no selftext. As such, I will use the title field to analyze which topics each group is focused on. I will also keep the author and num_comments fields to better understand which topics followers engage with most.
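
One way to quantify the title-only observation is to compute the share of posts whose selftext is empty. The snippet below is a rough sketch on toy data, not the notebook's actual check; the helper name `selftext_summary` is hypothetical, and treating `[removed]` and `[deleted]` as placeholder text rather than real content is an assumption about how Reddit marks such posts.

```python
import pandas as pd


def selftext_summary(df):
    """Return the fraction of posts with empty, placeholder, or real selftext."""
    text = df["selftext"].fillna("").str.strip()
    empty = text.eq("")
    # Assumption: these markers stand in for removed or deleted bodies
    placeholder = text.isin(["[removed]", "[deleted]"])
    return pd.Series({
        "empty": empty.mean(),
        "placeholder": placeholder.mean(),
        "has_text": (~empty & ~placeholder).mean(),
    })


# Toy frame standing in for dems_df or rep_df
toy_posts = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "selftext": ["", "[removed]", "An actual post body", None],
})
summary = selftext_summary(toy_posts)
```

A high `empty` share on the real frames would support relying on `title` alone for the topic analysis.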

#### Democrats

In [7]:
dems_df.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_is_blocked',
       'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post',
       'contest_mode', 'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_created_from_ads_ui', 'is_crosspostable', 'is_meta',
       'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable',
       'is_self', 'is_video', 'link_flair_background_color',
       'link_flair_richtext', 'link_flair_text', 'link_flair_text_color',
       'link_flair_type', 'locked', 'media_only', 'no_follow', 'num_comments',
       'num_crossposts', 'over_18', 'parent_whitelist_status', 'permalink',
       'pinned', 'post_hint', 'preview', 'pwls', 'retrieved_on', 'score',
       'selftext', 'send_replies', 'spoiler', 'stickied', 'subreddit',
       'subreddit_id', 'subreddit_subscribers', 'subreddit_type

In [8]:
# Most submissions (9700+) contain only a title and no selftext.
dems_df[dems_df['selftext'] == ''][['title','selftext']]

Unnamed: 0,title,selftext
0,Emails show Trump lawyers mocked his wealth — ...,
1,Herschel Walker’s Son Goes Nuclear on His ‘Lyi...,
2,Need help to voting on South Florida (Miami-Da...,
3,Which decision in your opinion was worse The F...,
4,'Literally don't understand what that means': ...,
...,...,...
244,Leader Schumer promises Senate vote on filibus...,
245,"NY AG subpoenas Ivanka Trump and Donald Trump,...",
246,I’m guessing for the same reason Trump didn’t ...,
247,Leader Schumer announces vote on changing fili...,


In [9]:
# About half of the submissions were removed from the subreddit; I only want to use posts that are still live
dems_df[dems_df['removed_by_category'].notnull()][['subreddit', 'title', 'author', 'removed_by_category']]

Unnamed: 0,subreddit,title,author,removed_by_category
2,democrats,Need help to voting on South Florida (Miami-Da...,KofCrypto0720,automod_filtered
3,democrats,Which decision in your opinion was worse The F...,Far_Refrigerator5490,moderator
8,democrats,1800 Call A Crackhead,ImaginationFree6807,moderator
9,democrats,Just A Reminder,Souled_Out,automod_filtered
11,democrats,How likely is it that Republicans would win th...,clothingarticle1,moderator
...,...,...,...,...
235,democrats,Apple has become the world's first $3 trillion...,imll99,moderator
236,democrats,"This Thursday, Jan. 6, Declaration for America...",bugleweed,moderator
238,democrats,"Do you live in FL, and like the idea of roofto...",OpportunityFlorida,moderator
246,democrats,I’m guessing for the same reason Trump didn’t ...,enabledvet,automod_filtered


In [10]:
demdata = dems_df[dems_df['removed_by_category'].isnull()][['subreddit', 'title', 'author', 'num_comments']]

In [11]:
demdata.shape

(4591, 4)
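
The filter-then-select step above is applied to both subreddits, so it could be wrapped in a small helper. This is a sketch under assumptions rather than the notebook's own code: `keep_live_posts` is a hypothetical name, and the guard for a missing `removed_by_category` column assumes the API may omit that key when no post in a pull was removed.

```python
import pandas as pd

KEEP_COLUMNS = ["subreddit", "title", "author", "num_comments"]


def keep_live_posts(df, columns=KEEP_COLUMNS):
    """Drop posts flagged as removed and keep only the analysis columns."""
    if "removed_by_category" in df.columns:
        # A null value means no removal reason was recorded for the post
        df = df[df["removed_by_category"].isna()]
    return df[columns].reset_index(drop=True)


# Toy frame standing in for dems_df or rep_df
toy_posts = pd.DataFrame({
    "subreddit": ["democrats"] * 3,
    "title": ["kept post", "filtered post", "another kept post"],
    "author": ["user_a", "user_b", "user_c"],
    "num_comments": [12, 0, 3],
    "removed_by_category": [None, "moderator", None],
})
live = keep_live_posts(toy_posts)
```

The sketch also resets the index, which the cell above does not, so row labels stay unique after batches are combined.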

#### Republicans

In [12]:
rep_df.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext',
       'author_flair_template_id', 'author_flair_text',
       'author_flair_text_color', 'author_flair_type', 'author_fullname',
       'author_is_blocked', 'author_patreon_flair', 'author_premium',
       'awarders', 'can_mod_post', 'contest_mode', 'created_utc', 'domain',
       'full_link', 'gildings', 'id', 'is_created_from_ads_ui',
       'is_crosspostable', 'is_meta', 'is_original_content',
       'is_reddit_media_domain', 'is_robot_indexable', 'is_self', 'is_video',
       'link_flair_background_color', 'link_flair_richtext',
       'link_flair_text_color', 'link_flair_type', 'locked', 'media_only',
       'no_follow', 'num_comments', 'num_crossposts', 'over_18',
       'parent_whitelist_status', 'permalink', 'pinned', 'post_hint',
       'preview', 'pwls', 'retrieved_on', 'score', 'selftext', 'send_replies',
       'spoiler', 'stickied', 'subreddit', 'subreddit_id

In [13]:
# Most submissions (8500+) contain only a title and no selftext.
rep_df[rep_df['selftext'] == ''][['title','selftext']]

Unnamed: 0,title,selftext
0,'Extremely concerning': Elon Musk on child por...,
1,5-Year-Old Left In Tears On His Birthday After...,
2,New York Times. Shot. Chaser.,
3,Florida Gov. DeSantis Says Majority Of Post-Hu...,
4,Louden County Virginia Man Arrested On Child S...,
...,...,...
243,Ginni Thomas pushed White House to pursue Trum...,
245,What If the Fishy 'Big Blue Shift' to Democrat...,
246,What a shit show circus we have running the co...,
248,True The Vote Presents Stunning Evidence of Vo...,


In [18]:
repdata = rep_df[rep_df['removed_by_category'].isnull()][['subreddit', 'title', 'author', 'num_comments']]

In [20]:
repdata.shape

(5456, 4)

#### Store the Data for the next notebook

In [21]:
%store demdata
%store repdata

Stored 'demdata' (DataFrame)
Stored 'repdata' (DataFrame)
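
The `%store` magic keeps both frames in IPython's storage, and the next notebook can restore them with `%store -r demdata repdata`. If the data also needs to be read outside IPython, a file-based round trip is one alternative. The sketch below assumes pickle files in a directory of your choosing; `save_frames` and `load_frames` are hypothetical helpers, and the toy frame stands in for the real data.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_frames(frames, directory):
    """Write each DataFrame in the dict to <directory>/<name>.pkl."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    for name, df in frames.items():
        df.to_pickle(directory / f"{name}.pkl")


def load_frames(directory, names):
    """Read the named DataFrames back from <directory>."""
    return {name: pd.read_pickle(Path(directory) / f"{name}.pkl") for name in names}


# Round trip with a toy frame in a temporary directory
toy = pd.DataFrame({"title": ["example post"], "num_comments": [5]})
with tempfile.TemporaryDirectory() as tmp:
    save_frames({"demdata": toy}, tmp)
    restored = load_frames(tmp, ["demdata"])
```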
