# Can The Wisdom of Reddit's Crowd Help Us Clarify the Difference Between Data Science and Analytics?

by Graham Lim

# Problem Statement: 

"Can we create a Natural Language Processing model, using either Multinomial Naive Bayes or Support Vector Machines, to:

   * accurately predict whether a post is from r/DataScience or r/Analytics,
   * use the better-performing model's keywords to identify the conceptual and technical differences between Data Science and Analytics, as reflected in the two subreddits, and
   * make concrete educational or professional recommendations to students and professionals interested in either topic?"

We evaluate success on predictive model accuracy for the first requirement, and on the quality and distinctiveness of the keywords for the other two requirements.

The stakeholders are our fellow data science peers and students, as well as professionals looking to improve their own skill sets.

# 1. Web Scraping 

### I put web scraping in a separate file so that restarting the kernel does not rerun the API scrape and return a different set of posts.
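One way to make the "scrape once, reuse afterwards" idea explicit is to cache the raw posts on disk. The following is a minimal sketch, not part of the original notebook: `load_or_scrape`, `cache_path`, and `scrape_fn` are hypothetical names introduced only for illustration.

```python
import json
from pathlib import Path


def load_or_scrape(cache_path, scrape_fn):
    """Return cached posts if the file exists; otherwise scrape once and save.

    scrape_fn is a hypothetical zero-argument callable that returns a
    JSON-serializable list of post dictionaries.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Reuse the earlier scrape so a kernel restart does not change the data.
        with cache_path.open("r", encoding="utf-8") as f:
            return json.load(f)

    posts = scrape_fn()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("w", encoding="utf-8") as f:
        json.dump(posts, f)
    return posts
```

With a helper like this, the scrape only touches the API when the cache file is missing, so every later run sees the same posts.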

In [1]:
#imports for scraping and interval delaying with requests per second

import requests
import pandas as pd
import json
import time
import random
from random import randint

import urllib.request as ur
#additionally we import this to help us read the json pages - https://docs.python.org/3/library/urllib.request.html
#https://www.educative.io/edpresso/what-is-the-urllib-module-in-python-3

In [2]:
#we're going to scrape the /r/datascience and /r/analytics subreddits. 
#assigning them as .json pages


url1 = 'https://www.reddit.com/r/datascience.json'

url2 = 'https://www.reddit.com/r/analytics.json'

We will start by scraping the /r/datascience subreddit, with the goal of turning it into a dataframe.

In [3]:
#changing reddit access user agent so that access isn't denied when we "get" URLs

header= {"User-Agent": "Graham L"}

res = requests.get(url1, headers = header)

In [4]:
#checking if it worked - 200 means success:

res.status_code

200

In [5]:
ds_json = res.json() #we assign the retrieved json file for /r/datascience as ds_json

In [6]:
ds_json #checking out what our first retrieved json file looks like

{'kind': 'Listing',
 'data': {'modhash': '',
  'dist': 26,
  'children': [{'kind': 't3',
    'data': {'approved_at_utc': None,
     'subreddit': 'datascience',
     'selftext': 'There have been quite a few changes in the subreddit over the past couple of years, both behind the scenes and more visibly.  As such, the mod team thought it would be a good time to discuss our vision for the subreddit and get feedback on these changes and our moderation more generally.\n\n**Our Vision**\n\nTo some extent, our vision for the subreddit is perhaps easier to define by what we aren\'t trying to be, rather than what we want to be:\n\n* We aren\'t trying to be a place for academic/technical discussions, since subreddits like r/MachineLearning, r/AskStatistics, and r/Python already cover those areas more specifically\n* We aren\'t trying to be a place for learning about, transitioning into, or getting a job in data science, since there are countless other blogs and websites discussing how to do that\

In [7]:
#all looks in order; let's find out the keys since the curly braces imply that the data is a dictionary.

ds_json.keys()

dict_keys(['kind', 'data'])

In [8]:
#we will keep exploring the dictionary and deeper sub-keys to identify where in this nested dictionary our post content is.
#guidance provided by GA's reddit API scraping tutorial in Week 4

ds_json['data']['children']

[{'kind': 't3',
  'data': {'approved_at_utc': None,
   'subreddit': 'datascience',
   'selftext': 'There have been quite a few changes in the subreddit over the past couple of years, both behind the scenes and more visibly.  As such, the mod team thought it would be a good time to discuss our vision for the subreddit and get feedback on these changes and our moderation more generally.\n\n**Our Vision**\n\nTo some extent, our vision for the subreddit is perhaps easier to define by what we aren\'t trying to be, rather than what we want to be:\n\n* We aren\'t trying to be a place for academic/technical discussions, since subreddits like r/MachineLearning, r/AskStatistics, and r/Python already cover those areas more specifically\n* We aren\'t trying to be a place for learning about, transitioning into, or getting a job in data science, since there are countless other blogs and websites discussing how to do that\n* We aren\'t trying to be a place for people or companies to promote themselve

GA already provided a starter guide to help us scrape from Reddit's API; we will incorporate it:
https://www.youtube.com/watch?v=5Y3ZE26Ciuk

In [9]:
ds_json['data']['children'][0]

#the post data sits under the 'children' subkey. we will write a function based on GA's tutorial.

{'kind': 't3',
 'data': {'approved_at_utc': None,
  'subreddit': 'datascience',
  'selftext': 'There have been quite a few changes in the subreddit over the past couple of years, both behind the scenes and more visibly.  As such, the mod team thought it would be a good time to discuss our vision for the subreddit and get feedback on these changes and our moderation more generally.\n\n**Our Vision**\n\nTo some extent, our vision for the subreddit is perhaps easier to define by what we aren\'t trying to be, rather than what we want to be:\n\n* We aren\'t trying to be a place for academic/technical discussions, since subreddits like r/MachineLearning, r/AskStatistics, and r/Python already cover those areas more specifically\n* We aren\'t trying to be a place for learning about, transitioning into, or getting a job in data science, since there are countless other blogs and websites discussing how to do that\n* We aren\'t trying to be a place for people or companies to promote themselves, e
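To make the nesting explored above concrete, here is a small sketch that pulls the post dictionaries out of a listing. The `sample_listing` values are made up for illustration and `extract_posts` is a hypothetical helper; only the key layout (`data` → `children` → `data`) mirrors what the cells above show.

```python
# A tiny stand-in for the structure returned by the .json endpoint (values are made up).
sample_listing = {
    "kind": "Listing",
    "data": {
        "after": "t3_example2",
        "children": [
            {"kind": "t3", "data": {"name": "t3_example1", "title": "First post", "selftext": "Body one"}},
            {"kind": "t3", "data": {"name": "t3_example2", "title": "Second post", "selftext": ""}},
        ],
    },
}


def extract_posts(listing):
    """Return the inner 'data' dict of every child in a listing."""
    return [child["data"] for child in listing["data"]["children"]]


posts = extract_posts(sample_listing)
titles = [post["title"] for post in posts]
print(titles)
```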

In [12]:
def scraper(url, num_scrapes, raw_list): #the function appends scraped posts to the "raw_list" we pass in.
    after = None #pagination token; None requests the first page
    for i in range(num_scrapes):
        if i == 0 or (i + 1) % 5 == 0: #progress message on the 1st and every 5th scrape
            print("{} of {} scrapes done".format(i + 1, num_scrapes))

        if after is None:
            params = {}
        else:
            params = {"after": after}
        res = requests.get(url, params=params, headers=header) #uses the global "header" defined above
        if res.status_code == 200:
            the_json = res.json()
            raw_list.extend(the_json["data"]["children"])
            #note: if "after" comes back as None, the next request has no params and fetches the first page again
            after = the_json["data"]["after"]
        else:
            print(res.status_code)
            break
        time.sleep(randint(1,8)) #this random 1-8 second pause avoids overloading the server with requests

    print("Total posts scraped: {}".format(len(raw_list)))
    print("Unique posts scraped: {}".format(len({p["data"]["name"] for p in raw_list})))

In [21]:
ds_data = [] #we define the empty list as our output list. we'll convert this later into a pandas dataframe.

scraper(url1, 60, ds_data)

1 of 60 scrapes done
5 of 60 scrapes done
10 of 60 scrapes done
15 of 60 scrapes done
20 of 60 scrapes done
25 of 60 scrapes done
30 of 60 scrapes done
35 of 60 scrapes done
40 of 60 scrapes done
45 of 60 scrapes done
50 of 60 scrapes done
55 of 60 scrapes done
60 of 60 scrapes done
Total posts scraped: 1456
Unique posts scraped: 628
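The run above collected 1456 posts, of which only 628 were unique. One plausible explanation (an assumption, not verified here) is that the listing ran out partway through, `after` came back as `None`, and the loop started again from the first page. A minimal sketch of a loop that stops at that point instead is shown below; `paginate` and `fetch_page` are hypothetical names, and the fake pages exist only so the loop can be exercised without network access.

```python
def paginate(fetch_page, max_pages):
    """Collect children page by page, stopping when the listing runs out.

    fetch_page is a hypothetical callable: it takes the current `after`
    token (or None) and returns a dict shaped like the listing JSON.
    """
    collected = []
    after = None
    for _ in range(max_pages):
        listing = fetch_page(after)
        collected.extend(listing["data"]["children"])
        after = listing["data"]["after"]
        if after is None:
            # No further pages: stop instead of requesting the first page again.
            break
    return collected


# Fake two-page listing used only to exercise the loop offline.
fake_pages = {
    None: {"data": {"children": [{"data": {"name": "t3_a"}}, {"data": {"name": "t3_b"}}], "after": "t3_b"}},
    "t3_b": {"data": {"children": [{"data": {"name": "t3_c"}}], "after": None}},
}

posts = paginate(lambda after: fake_pages[after], max_pages=60)
names = [p["data"]["name"] for p in posts]
print(len(names), len(set(names)))
```

In the real scraper, `fetch_page` would wrap the `requests.get` call and the `time.sleep` pause; the deduplication step that follows remains a useful safety net either way.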


In [26]:
#we want unique posts only so as to avoid duplicates. we can write a for loop to create our named, unique data.

ds_data_unique = []

named_list = []

for i in range(len(ds_data)):
    if ds_data[i]["data"]["name"] not in named_list:
        ds_data_unique.append(ds_data[i]["data"])
        named_list.append(ds_data[i]["data"]["name"])
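The loop above checks membership in a list, which rescans `named_list` for every post. For a few thousand posts that is fine; as an alternative, a set-based version does the same job with constant-time lookups. This is a sketch with a hypothetical helper name (`unique_posts`) and made-up sample data.

```python
def unique_posts(raw_children):
    """Keep the first occurrence of each post, keyed on its 'name' field."""
    seen = set()
    unique = []
    for child in raw_children:
        post = child["data"]
        if post["name"] not in seen:
            seen.add(post["name"])
            unique.append(post)
    return unique


# Made-up sample with one repeated post.
raw = [
    {"data": {"name": "t3_a", "title": "A"}},
    {"data": {"name": "t3_b", "title": "B"}},
    {"data": {"name": "t3_a", "title": "A"}},  # duplicate of the first entry
]
deduped = unique_posts(raw)
print([post["name"] for post in deduped])
```

Order is preserved, so the result matches what the list-based loop would produce.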


In [27]:
ds_df = pd.DataFrame(ds_data_unique)

In [28]:
ds_df.head() #the dataframe loaded up correctly for our /r/datascience subreddit.

| | approved_at_utc | subreddit | selftext | author_fullname | saved | mod_reason_title | gilded | clicked | title | link_flair_richtext | ... | num_crossposts | media | is_video | post_hint | preview | poll_data | crosspost_parent_list | crosspost_parent | author_cakeday | media_metadata |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | | datascience | There have been quite a few changes in the sub... | t2_6kl7i | False | | 0 | False | [META] State of the Subreddit - 2020 | [] | ... | 0 | | False | | | | | | | |
| 1 | | datascience | Let's say during interview, you talk about you... | t2_33bizuj | False | | 0 | False | Is it okay to discuss the results of a model a... | [] | ... | 0 | | False | | | | | | | |
| 2 | | datascience | For starters, I’m a summer intern doing some d... | t2_4jo7xlko | False | | 0 | False | Presenting Your Value | [] | ... | 0 | | False | | | | | | | |
| 3 | | datascience | | t2_2t86pj0o | False | | 0 | False | Do you need to be good at maths areas like cal... | [] | ... | 0 | | False | | | | | | | |
| 4 | | datascience | Hi all,\n\nI'm very new to the data science co... | t2_nlcob | False | | 0 | False | Real Estate Market Selection | [] | ... | 0 | | False | | | | | | | |


In [36]:
ds_df.shape

#628 rows of individual posts by 112 columns

(628, 112)


We are now going to do the same thing with `/r/analytics`. Once again, the goal is to create a dataframe.


In [30]:
res2 = requests.get(url2, headers = header)

In [31]:
#changing reddit access user agent so that access isn't denied when we "get" URLs

header= {"User-Agent": "Graham L"}

res2 = requests.get(url2, headers = header)

In [32]:
#checking if it worked - 200 means success:

res2.status_code

200

In [33]:
analytics_json = res2.json() #we assign the retrieved json file for /r/analytics as analytics_json

In [35]:
analytics_json #checking out what our 2nd retrieved json file looks like

{'kind': 'Listing',
 'data': {'modhash': '',
  'dist': 27,
  'children': [{'kind': 't3',
    'data': {'approved_at_utc': None,
     'subreddit': 'analytics',
     'selftext': 'Have a question regarding interviewing, career advice, certifications?  Please include country, years of experience, vertical market, and size of business if applicable.\n\n*Have suggestions? [Click this link to share them](https://reddit.com/message/compose?to=/r/analytics&amp;message=Suggestion...)*',
     'author_fullname': 't2_6l4z3',
     'saved': False,
     'mod_reason_title': None,
     'gilded': 0,
     'clicked': False,
     'title': 'Monthly Career Advice Thread - June 2020',
     'link_flair_richtext': [],
     'subreddit_name_prefixed': 'r/analytics',
     'hidden': False,
     'pwls': 6,
     'link_flair_css_class': None,
     'downs': 0,
     'thumbnail_height': None,
     'top_awarded_type': None,
     'hide_score': False,
     'name': 't3_gum80v',
     'quarantine': False,
     'link_flair_text_c

In [38]:
#once again, all the data starts after the 'children' subkey. 

#we rerun our scraper function for this subreddit too.

analytics_data=[]

scraper(url2, 60, analytics_data)

1 of 60 scrapes done
5 of 60 scrapes done
10 of 60 scrapes done
15 of 60 scrapes done
20 of 60 scrapes done
25 of 60 scrapes done
30 of 60 scrapes done
35 of 60 scrapes done
40 of 60 scrapes done
45 of 60 scrapes done
50 of 60 scrapes done
55 of 60 scrapes done
60 of 60 scrapes done
Total posts scraped: 1466
Unique posts scraped: 682


In [39]:
#again we want unique posts only so as to avoid duplicates. we can write a for loop to create our named, unique data.
analytics_data_unique = []

named_list_2 = []

for i in range(len(analytics_data)):
    if analytics_data[i]["data"]["name"] not in named_list_2:
        analytics_data_unique.append(analytics_data[i]["data"])
        named_list_2.append(analytics_data[i]["data"]["name"])

In [40]:
analytics_df = pd.DataFrame(analytics_data_unique)

In [41]:
analytics_df.head() #the dataframe loaded up correctly for our /r/analytics subreddit.

| | approved_at_utc | subreddit | selftext | author_fullname | saved | mod_reason_title | gilded | clicked | title | link_flair_richtext | ... | created_utc | num_crossposts | media | is_video | link_flair_template_id | post_hint | preview | crosspost_parent_list | crosspost_parent | author_cakeday |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | | analytics | Have a question regarding interviewing, career... | t2_6l4z3 | False | | 0 | False | Monthly Career Advice Thread - June 2020 | [] | ... | 1591024000.0 | 0 | | False | | | | | | |
| 1 | | analytics | Share your current marketing openings in the c... | t2_6l4z3 | False | | 0 | False | Monthly Job Openings - June 2020 | [] | ... | 1591467000.0 | 0 | | False | | | | | | |
| 2 | | analytics | I am a mid-twenties female data analyst with a... | t2_tjfxh | False | | 0 | False | How to progress as a data analyst while dealin... | [] | ... | 1592821000.0 | 0 | | False | | | | | | |
| 3 | | analytics | \n\nHi Everyone!\n\nQuick Run Down of experie... | t2_mjf3txf | False | | 0 | False | New to Data Analytics - Career Advice Needed | [] | ... | 1592832000.0 | 0 | | False | | | | | | |
| 4 | | analytics | I have a unique situation here I think and cou... | t2_12zemhb1 | False | | 0 | False | Best practices to setup Google Analytics? | [] | ... | 1592837000.0 | 0 | | False | | | | | | |


In [42]:
analytics_df.shape

(682, 110)

We now have 2 dataframes whose column counts differ by 2 (112 vs 110), with 628 and 682 unique rows respectively. We will check which columns are mismatched with a list comprehension over the column names.

In [43]:
analytics_columns = analytics_df.columns #the list comprehension below shows any columns in the /r/analytics dataframe that are unique to that subreddit

ds_columns = ds_df.columns

unmatched_columns = [columns for columns in analytics_columns if columns not in ds_columns]

unmatched_columns 

[]

In [44]:
#an empty list is no news which is good news. no unique columns in analytics 
#let's double check the other way round to make sure there are no unique columns in /r/datascience dataframe

more_unmatched_columns = [columns for columns in ds_columns if columns not in analytics_columns]

more_unmatched_columns

['poll_data', 'media_metadata']
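The two checks above can also be expressed as set differences, which gives both directions in one call. This is a small sketch on plain lists of column names; `column_differences` is a hypothetical helper, and the sample lists are shortened stand-ins for the real column sets. Because `set()` accepts any iterable of names, passing `ds_df.columns` and `analytics_df.columns` should work the same way.

```python
def column_differences(left_columns, right_columns):
    """Return (names only in left, names only in right), each sorted."""
    left, right = set(left_columns), set(right_columns)
    return sorted(left - right), sorted(right - left)


# Shortened stand-ins for the real column lists.
ds_cols = ["title", "selftext", "poll_data", "media_metadata"]
analytics_cols = ["title", "selftext"]

only_in_ds, only_in_analytics = column_differences(ds_cols, analytics_cols)
print(only_in_ds, only_in_analytics)
```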

In [45]:
#okay, 2 columns exist only in the /r/datascience dataframe. pd.concat would still run (filling the gaps with NaN),
#but we drop them so both dataframes share identical columns; they hold no meaningful data for us.

ds_df.drop(['poll_data', 'media_metadata'], axis=1, inplace=True)

In [46]:
#all columns match up in both dataframes; the column names are identical. we can therefore concatenate them.

all_dataframes = [ds_df, analytics_df]

combined=pd.concat(all_dataframes)

In [47]:
combined.shape

(1310, 110)
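Since the problem statement judges the classifier on accuracy, it helps to know what a trivial model would score on this data. The sketch below computes the majority-class baseline from the row counts reported above; it is an added sanity check, not a step from the original notebook.

```python
n_datascience = 628  # unique r/datascience posts, from ds_df.shape
n_analytics = 682    # unique r/analytics posts, from analytics_df.shape

total = n_datascience + n_analytics

# Accuracy of always predicting the larger class.
baseline_accuracy = max(n_datascience, n_analytics) / total

print(total, round(baseline_accuracy, 3))  # prints: 1310 0.521
```

Any model worth keeping should clear this roughly 52% mark by a comfortable margin.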

We have successfully combined our 2 dataframes into one, which will be exported below for cleaning in our next notebook:

In [53]:
combined.to_csv('../project_3/data/combined.csv')