# Data Acquisition: Reddit Scraping

In this exercise, we will search for a query (e.g., "data science") on the old Reddit interface (https://www.old.reddit.com/). We will then take the URL of the search results page (e.g., https://old.reddit.com/search?q=data+science) and scrape the returned posts. We use the old Reddit interface because its HTML tags are easier to work with. We will focus on extracting the title, author, author's profile, subreddit, tag, timestamp, number of votes, and number of comments.
<img src="../images/reddit_search.png" />



* You are free to use your own query string. 
* On the search page, a set of subreddits are shown. Ignore these subreddits and focus on extracting Reddit posts. 
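
If you use your own query string, the search URL can be built rather than typed by hand. The following is a minimal sketch using only the standard library; the names `BASE_SEARCH_URL` and `build_search_url` are hypothetical helpers, not part of the exercise.

```python
from urllib.parse import urlencode

# Base address of the old Reddit search page (taken from the example URL above)
BASE_SEARCH_URL = "https://old.reddit.com/search"


def build_search_url(query: str) -> str:
    """Return the old Reddit search URL for a free-text query."""
    # urlencode escapes reserved characters and turns spaces into '+'
    return f"{BASE_SEARCH_URL}?{urlencode({'q': query})}"


print(build_search_url("data science"))
# https://old.reddit.com/search?q=data+science
```

Encoding the query this way keeps characters such as `&` or `#` from being misread as part of the URL structure.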



**Activity 1:** Fetch the page and create a soup object using the Beautiful Soup library.

In [29]:
# Your code for activity 1 goes here..
#---------------------------------------

# Import the library used to query a website
import requests
# Import the Beautiful Soup library to parse
# the HTML returned from the website
from bs4 import BeautifulSoup

# Import pandas to convert lists to a data frame
import pandas as pd
# Import numpy (used later for missing values)
import numpy as np


# Send a custom User-Agent header that identifies our script;
# Reddit may throttle or reject requests that use a generic default User-Agent
headers = {'User-Agent': 'MyAPP/1.0'}


# Specify the URL
url = "https://old.reddit.com/search?q=data+science"

# Request the URL and store the HTTP response in the variable 'response'
response = requests.get(url, headers=headers)

# Parse the HTML in 'response' with Python's built-in parser
# and store the result as a BeautifulSoup object
soup = BeautifulSoup(response.text, "html.parser")



In [30]:
attrs = {'class': 'search-result'}

**Activity 2:** Extract the titles and URLs of the retrieved posts from the soup and print them.

In [31]:
titles_html = soup.find_all("a", class_='search-title may-blank')

titles = []
urls = []

for each in titles_html: 
    titles.append(str(each.get_text()))
    
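# Skip the first three matches; on this page they appear to belong to the
# subreddit results listed above the posts, not to the posts themselves
titles = titles[3:]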
    
titles

['Who are your data science heroes?',
 'Do you use OOP in your daily data science work?',
 'College Professor in Data Science Course Just Said That Functional Programming Is Better Than OOP, Does He Have a Point?',
 'Ethics of a data science project I am undertaking',
 'The Key Word in Data Science is Science, not Data',
 "I built an interactive map to help people self-teaching Data Science online. It's like a skill tree for Data Science!",
 'What data science skills do you see as in-demand given evolution of data science field in last few years?',
 'So I, a data science noob, ran sentiment analysis on as much BTS MVs on r/kpop I could find...',
 'How I use Data Science to Trade Options Around Earnings',
 'Why do people look down on data science work and “computer” work in general?',
 'Data Science for the Good of Society: are there realistic employment options?',
 'I am interested in creating a group of new comers and intermediate Data science and ML practitioners just to help each ot

In [32]:
for each in titles_html: 
    urls.append(str(each.get('href')))
    
urls = urls[3:]
urls

['https://old.reddit.com/r/datascience/comments/pq44jp/who_are_your_data_science_heroes/',
 'https://old.reddit.com/r/datascience/comments/pkw92b/do_you_use_oop_in_your_daily_data_science_work/',
 'https://old.reddit.com/r/datascience/comments/ppntvz/college_professor_in_data_science_course_just/',
 'https://old.reddit.com/r/datascience/comments/prrou1/ethics_of_a_data_science_project_i_am_undertaking/',
 'https://old.reddit.com/r/datascience/comments/p7hpd9/the_key_word_in_data_science_is_science_not_data/',
 'https://old.reddit.com/r/learndatascience/comments/pjplux/i_built_an_interactive_map_to_help_people/',
 'https://old.reddit.com/r/datascience/comments/pj3dls/what_data_science_skills_do_you_see_as_indemand/',
 'https://old.reddit.com/r/kpopthoughts/comments/plyyes/so_i_a_data_science_noob_ran_sentiment_analysis/',
 'https://old.reddit.com/r/wallstreetbets/comments/psyjv5/how_i_use_data_science_to_trade_options_around/',
 'https://old.reddit.com/r/datascience/comments/pmb7a3/why_

**Activity 3:** Extract the author ids and their profile links from the retrieved posts and print them.

In [33]:
authors_html = soup.find_all("span", class_='search-author')

authors = []

for each in authors_html: 
    authors.append(str(each.get_text()))
    
# Drop the leading "by " (3 characters) so only the username remains
authors = [e[3:] for e in authors]

authors

['GravityAI',
 'rightheart',
 'Illustrious_Ice_5022',
 'productive_guy123',
 'yoi12321',
 'InstinctiveDoubt',
 'svyas',
 'palebabbu',
 'nema31lebowski',
 'ogretronz',
 'saindoja',
 'yaakarsh1011',
 'saik2363',
 'VictorChen1',
 'MisterInvicta',
 'hyperxenophiliac',
 'TheLSales',
 'fu11m3ta1',
 'SnooPaintings5866',
 'kribz666',
 'bingingwithdata',
 'Kokubo-ubo']
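
The slice `e[3:]` assumes every author string starts with a three-character prefix. A slightly more defensive sketch removes the prefix only when it is actually present. The helper name `strip_prefix` and the sample strings below are illustrative assumptions, not scraped output.

```python
def strip_prefix(text: str, prefix: str) -> str:
    """Remove `prefix` from the start of `text` if present; otherwise return `text` unchanged."""
    text = text.strip()
    if text.startswith(prefix):
        return text[len(prefix):]
    return text


# Hypothetical raw strings in the shape the author spans seem to have
raw_authors = ["by GravityAI", "by rightheart", "[deleted]"]
authors_clean = [strip_prefix(a, "by ") for a in raw_authors]
print(authors_clean)
# ['GravityAI', 'rightheart', '[deleted]']
```

The same helper could be applied to the submission times in Activity 4 instead of a fixed-width slice.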

In [34]:
import re

links_html = soup.find_all(class_=re.compile('author may-blank id-.*'))
    
links = []

for each in links_html: 
    links.append(str(each.get('href')))
    
links

['https://old.reddit.com/user/GravityAI',
 'https://old.reddit.com/user/rightheart',
 'https://old.reddit.com/user/Illustrious_Ice_5022',
 'https://old.reddit.com/user/productive_guy123',
 'https://old.reddit.com/user/yoi12321',
 'https://old.reddit.com/user/InstinctiveDoubt',
 'https://old.reddit.com/user/svyas',
 'https://old.reddit.com/user/palebabbu',
 'https://old.reddit.com/user/nema31lebowski',
 'https://old.reddit.com/user/ogretronz',
 'https://old.reddit.com/user/saindoja',
 'https://old.reddit.com/user/yaakarsh1011',
 'https://old.reddit.com/user/saik2363',
 'https://old.reddit.com/user/VictorChen1',
 'https://old.reddit.com/user/MisterInvicta',
 'https://old.reddit.com/user/hyperxenophiliac',
 'https://old.reddit.com/user/TheLSales',
 'https://old.reddit.com/user/fu11m3ta1',
 'https://old.reddit.com/user/SnooPaintings5866',
 'https://old.reddit.com/user/kribz666',
 'https://old.reddit.com/user/bingingwithdata',
 'https://old.reddit.com/user/Kokubo-ubo']

**Activity 4:** Extract the submission time of the retrieved posts and print them.

In [35]:
times_html = soup.find_all("span", class_='search-time')

times = []

for each in times_html: 
    times.append(str(each.get_text()))
    
times = times[3:]

# Drop the leading "submitted " (10 characters) so only the relative time remains
times = [e[10:] for e in times]

times

['6 days ago',
 '14 days ago',
 '6 days ago',
 '3 days ago',
 '1 month ago',
 '16 days ago',
 '17 days ago',
 '12 days ago',
 '1 day ago',
 '12 days ago',
 '1 month ago',
 '1 month ago',
 '6 days ago',
 '6 days ago',
 '2 days ago',
 '2 days ago',
 '9 days ago',
 '4 days ago',
 '4 days ago',
 '3 days ago',
 '5 days ago',
 '3 days ago']
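
Relative strings such as `'6 days ago'` are hard to sort or compare. The sketch below converts them to approximate `timedelta` values using only the standard library. The function name `parse_relative_age` is hypothetical, and the unit lengths (a month as 30 days, a year as 365 days) are rough assumptions; only "day" and "month" appear in the output above, and the other units are included as guesses.

```python
import re
from datetime import timedelta

# Approximate unit lengths in seconds; months and years are not fixed-length
UNIT_SECONDS = {
    "minute": 60,
    "hour": 3600,
    "day": 86400,
    "month": 30 * 86400,
    "year": 365 * 86400,
}


def parse_relative_age(text):
    """Convert text like '6 days ago' to an approximate timedelta, or None if unrecognized."""
    match = re.search(r"(\d+)\s+(minute|hour|day|month|year)s?\s+ago", text)
    if match is None:
        return None
    amount, unit = int(match.group(1)), match.group(2)
    return timedelta(seconds=amount * UNIT_SECONDS[unit])


print(parse_relative_age("6 days ago"))
print(parse_relative_age("1 month ago"))
```

Subtracting the result from the time of the request gives an estimated submission time, accurate only to the granularity of the original string.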

**Activity 5:** Extract the subreddits of the retrieved posts and print them

In [36]:
subreddits_html = soup.find_all("a", class_='search-subreddit-link may-blank')

subreddits = []

for each in subreddits_html: 
    subreddits.append(str(each.get_text()))
    
subreddits = subreddits[3:]
    
subreddits

['r/datascience',
 'r/datascience',
 'r/datascience',
 'r/datascience',
 'r/datascience',
 'r/learndatascience',
 'r/datascience',
 'r/kpopthoughts',
 'r/wallstreetbets',
 'r/datascience',
 'r/datascience',
 'r/datascience',
 'r/datascience',
 'r/Notion',
 'r/canoo',
 'r/rstats',
 'r/AerospaceEngineering',
 'r/analytics',
 'r/learnmachinelearning',
 'r/smallstreetbets',
 'r/FreeKarma4U',
 'r/datascience']

**Activity 6:** Extract the associated tag(s) of the retrieved posts and print them

In [38]:
tags_html = soup.find_all("span", class_="linkflairlabel")

tags = []

for each in tags_html: 
    tags.append(str(each.get_text()))
    
tags.insert(15, np.nan) # this post has no tag
tags.insert(20, np.nan) # this post has no tag

tags

['Fun/Trivia',
 'Discussion',
 'Discussion',
 'Discussion',
 'Discussion',
 'Resources',
 'Discussion',
 'Boy Groups',
 'Discussion',
 'Discussion',
 'Career',
 'Networking',
 'Networking',
 'Showcase',
 'New Hires',
 nan,
 'Career',
 'Question',
 'Tutorial',
 'Discussion',
 nan,
 'Career']
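
The hard-coded positions in `tags.insert(15, ...)` and `tags.insert(20, ...)` only hold for this particular results page. One alternative is to look up the tag inside each post's container, so a missing tag shows up as a missing value in the right row. The sketch below runs on a small hand-written HTML sample, not on real Reddit output; the container class `search-result-link` is an assumption about the page markup, and `extract_title_and_tag` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Hand-written sample modeled on the class names used in this notebook
SAMPLE_HTML = """
<div class="search-result search-result-link">
  <a class="search-title may-blank" href="https://old.reddit.com/r/example/comments/aaa/first/">First post</a>
  <span class="linkflairlabel">Discussion</span>
</div>
<div class="search-result search-result-link">
  <a class="search-title may-blank" href="https://old.reddit.com/r/example/comments/bbb/second/">Second post</a>
</div>
"""


def extract_title_and_tag(result):
    """Return the title and tag found inside one result container (None when absent)."""
    title_link = result.find("a", class_="search-title")
    flair = result.find("span", class_="linkflairlabel")
    return {
        "Title": title_link.get_text(strip=True) if title_link else None,
        "Tag": flair.get_text(strip=True) if flair else None,
    }


sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = [
    extract_title_and_tag(result)
    for result in sample_soup.find_all("div", class_="search-result-link")
]
print(rows)
# [{'Title': 'First post', 'Tag': 'Discussion'}, {'Title': 'Second post', 'Tag': None}]
```

Because every field is read from the same container, the lists stay aligned without manual inserts, and pandas treats `None` as a missing value when the rows are turned into a DataFrame.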

**Activity 7:** Extract the points of the retrieved posts and print them

In [39]:
points_html = soup.find_all("span", class_='search-score')

points = []

for each in points_html: 
    points.append(str(each.get_text()))
    
# Drop the trailing " points" (7 characters) so only the number remains
points = [e[:-7] for e in points]

points

['194',
 '210',
 '110',
 '73',
 '1,208',
 '746',
 '194',
 '183',
 '42',
 '52',
 '245',
 '337',
 '156',
 '161',
 '56',
 '24',
 '22',
 '33',
 '211',
 '98',
 '7',
 '33']

**Activity 8:** Extract the number of comments of the retrieved posts and print them.

In [40]:
comments_html = soup.find_all("a", class_='search-comments may-blank')

comments = []

for each in comments_html: 
    comments.append(str(each.get_text()))
    
# Drop the trailing " comments" (9 characters) so only the number remains
comments = [e[:-9] for e in comments]
    
comments

['119',
 '158',
 '128',
 '53',
 '156',
 '54',
 '95',
 '50',
 '31',
 '63',
 '152',
 '238',
 '19',
 '17',
 '17',
 '28',
 '40',
 '24',
 '9',
 '12',
 '44',
 '14']
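
The slices `e[:-7]` and `e[:-9]` leave the counts as strings (note `'1,208'` in the points) and would cut off a digit for singular labels such as "1 point" or "1 comment". A minimal sketch that extracts the number with a regular expression and converts it to an integer is shown below; `parse_count` is a hypothetical helper name and the sample strings are illustrative.

```python
import re


def parse_count(text):
    """Extract the first integer from text such as '1,208 points'; return None if there is none."""
    # Match an optional minus sign, then digits with optional thousands separators
    match = re.search(r"-?\d[\d,]*", text)
    if match is None:
        return None
    return int(match.group().replace(",", ""))


samples = ["1,208 points", "1 point", "158 comments", "1 comment"]
print([parse_count(s) for s in samples])
# [1208, 1, 158, 1]
```

Storing the counts as integers makes it possible to sort or aggregate the `Points` and `Comments` columns later.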

**Activity 9:** Using the nine features above, create a DataFrame for the retrieved posts and print the first 10 entries.

In [43]:
# Build the DataFrame with one column per extracted feature
reddit_df = pd.DataFrame({'Title': titles,
                          'URL': urls,
                          'Author': authors,
                          'Profile': links,
                          'Time': times,
                          'Subreddit': subreddits,
                          'Tag': tags,
                          'Points': points,
                          'Comments': comments
                          })

reddit_df.head(10)

| | Title | URL | Author | Profile | Time | Subreddit | Tag | Points | Comments |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Who are your data science heroes? | https://old.reddit.com/r/datascience/comments/... | GravityAI | https://old.reddit.com/user/GravityAI | 6 days ago | r/datascience | Fun/Trivia | 194 | 119 |
| 1 | Do you use OOP in your daily data science work? | https://old.reddit.com/r/datascience/comments/... | rightheart | https://old.reddit.com/user/rightheart | 14 days ago | r/datascience | Discussion | 210 | 158 |
| 2 | College Professor in Data Science Course Just ... | https://old.reddit.com/r/datascience/comments/... | Illustrious_Ice_5022 | https://old.reddit.com/user/Illustrious_Ice_5022 | 6 days ago | r/datascience | Discussion | 110 | 128 |
| 3 | Ethics of a data science project I am undertaking | https://old.reddit.com/r/datascience/comments/... | productive_guy123 | https://old.reddit.com/user/productive_guy123 | 3 days ago | r/datascience | Discussion | 73 | 53 |
| 4 | The Key Word in Data Science is Science, not Data | https://old.reddit.com/r/datascience/comments/... | yoi12321 | https://old.reddit.com/user/yoi12321 | 1 month ago | r/datascience | Discussion | 1208 | 156 |
| 5 | I built an interactive map to help people self... | https://old.reddit.com/r/learndatascience/comm... | InstinctiveDoubt | https://old.reddit.com/user/InstinctiveDoubt | 16 days ago | r/learndatascience | Resources | 746 | 54 |
| 6 | What data science skills do you see as in-dema... | https://old.reddit.com/r/datascience/comments/... | svyas | https://old.reddit.com/user/svyas | 17 days ago | r/datascience | Discussion | 194 | 95 |
| 7 | So I, a data science noob, ran sentiment analy... | https://old.reddit.com/r/kpopthoughts/comments... | palebabbu | https://old.reddit.com/user/palebabbu | 12 days ago | r/kpopthoughts | Boy Groups | 183 | 50 |
| 8 | How I use Data Science to Trade Options Around... | https://old.reddit.com/r/wallstreetbets/commen... | nema31lebowski | https://old.reddit.com/user/nema31lebowski | 1 day ago | r/wallstreetbets | Discussion | 42 | 31 |
| 9 | Why do people look down on data science work a... | https://old.reddit.com/r/datascience/comments/... | ogretronz | https://old.reddit.com/user/ogretronz | 12 days ago | r/datascience | Discussion | 52 | 63 |


**Activity 10:** Save the retrieved posts in a JSON file.

In [44]:
reddit_df.to_json('reddit.json')
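
Called without an `orient` argument, `DataFrame.to_json` writes a column-oriented object (each column maps row labels to values). If one JSON object per post is easier to consume, `orient="records"` is an option. The sketch below uses a small made-up DataFrame, not the scraped data, and returns the JSON as a string so that no file is written.

```python
import json

import numpy as np
import pandas as pd

# Made-up sample rows standing in for the scraped posts
sample_df = pd.DataFrame({
    "Title": ["Who are your data science heroes?", "Example post without a tag"],
    "Tag": ["Fun/Trivia", np.nan],
    "Points": [194, 24],
})

# With no path argument, to_json returns the JSON text instead of writing a file
records_json = sample_df.to_json(orient="records", indent=2)

# Parse the text back to confirm the structure: a list with one dict per row
records = json.loads(records_json)
print(records[1])
```

Missing values such as `np.nan` are written as JSON `null`, so they come back as `None` when the file is parsed with the standard `json` module.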

# Save your notebook, then `File > Close and Halt`