# Reddit Front Page (PH) Web Scraping and Data Visualization
This program mines the popular posts on the Reddit front page as shown to users in the Philippines. The web scraping is divided into two parts:

1. (No API) Listing a number (say 100) of popular posts in the Philippines.
2. (API) Gathering the important details of every post, including title, subreddit, category, upvotes, comments, etc.

After scraping, the data is saved to a flat JSON file, which is visualized below.

### Required modules

In [90]:
%pip install bs4
%pip install requests
%pip install lxml
%pip install markdown



In [94]:
import requests 
import json
from bs4 import BeautifulSoup
import urllib.parse
import time
import logging
from markdown import markdown
import concurrent.futures

## WEB SCRAPING

### (a) List out the posts
The goal is to collect the links to the posts on reddit.com/r/popular/?geo_filter=PH. Unfortunately, there is no API for this yet, so I wrote a `send_request` function that sends an HTTP request up to 3 times. On a timeout, it pauses the program for 10 seconds before retrying; any other request error stops the program.

In [75]:
# Send HTTP request
def send_request(link, **kwargs):
    # Try to send request 3 times
    for attempt in range(3):
        try:            
            logging.debug("Sending request number %s of %s on %s", attempt + 1, 3, link)
            r = requests.get(link, **kwargs)
        except requests.exceptions.Timeout as e:
            logging.warning("Timeout ERROR: Attempt number %s of %s failed. Sleeping for 10 seconds. %s", attempt + 1, 3, e)
            time.sleep(10)
            continue
        except requests.exceptions.RequestException as e:
            # Something went wrong with the request. Covers other unspecified errors like ConnectionError and TooManyRedirects
            logging.error("ERROR: Request failed. %s. Now exiting the application", e)
            raise SystemExit()

        # If the request proceeds without error, break the loop and return the response
        break
    else:
        # Exit the app if all retries are exhausted
        logging.error("ERROR: All %s attempts of HTTP request on link %s failed. Now exiting the application", 3, link)
        raise SystemExit()

    # If the code continues to run, it means that the request successfully went through.
    logging.debug("SUCCESS: Request granted with status code of %s", r.status_code)
    return r
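
The `for ... else` construct in `send_request` is easy to misread: the `else` branch runs only when the loop finishes without hitting `break`, that is, when every attempt timed out. The sketch below isolates that control flow with a stand-in for the network call. `fetch_with_retries`, `make_flaky_fetch`, and the zero-second pause are hypothetical names and values used only for illustration.

```python
import time


def fetch_with_retries(fetch, attempts=3, pause=0.0):
    """Call `fetch` until it succeeds or `attempts` tries are used up."""
    for attempt in range(1, attempts + 1):
        try:
            result = fetch()
        except TimeoutError:
            # Same idea as the Timeout branch above: wait, then try again
            time.sleep(pause)
            continue
        # No exception: leave the loop, which skips the `else` branch
        break
    else:
        # Reached only when the loop never hit `break`
        raise RuntimeError(f"all {attempts} attempts failed")
    return result


def make_flaky_fetch(failures):
    """Build a stand-in request that times out `failures` times, then succeeds."""
    state = {"calls": 0}

    def flaky_fetch():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise TimeoutError("simulated timeout")
        return f"ok after {state['calls']} calls"

    return flaky_fetch


print(fetch_with_retries(make_flaky_fetch(2)))  # succeeds on the third call
```

With five simulated timeouts instead of two, the loop runs out of attempts and the `else` branch raises.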

##### Proxy Rotation
I also added proxy rotation to my requests to keep my IP from getting banned after repeated requests. Reddit heavily limits bot requests that do not go through its official API. I used the scraperapi.com service for this, which gives 1000 free calls every month. The API key for `scraperapi` is stored in a local JSON file (credentials.json) alongside other sensitive information, such as the Reddit username, password, secret token, and client ID, that will be used in the second part of the web scraping.

In [76]:
# Credentials in local JSON file
#   username = Reddit account username | password = Reddit account password
#   client_id and secret_token: from https://www.reddit.com/prefs/apps after you create a script app
#   proxy_api: the API key you get after creating a free account on scraperapi.com
    
with open("credentials.json", "r") as f:
    credentials = json.load(f)

## Use Proxy rotation since we are not yet using the API in this section
params = {
    'api_key': credentials["proxy_api"],
    'url': "https://old.reddit.com/r/popular/?geo_filter=PH"
}

post_links = set()

## Scrape the first 4 pages of Reddit Front-page (PH region). 
for page in range(4):
    # Send the request
    r = send_request('http://api.scraperapi.com', params = params)
    
    # Parse the response
    html = BeautifulSoup(r.text, "lxml")
    
    for post in html.select('#siteTable :not(.promoted):has(div.entry.unvoted)'):
        post_link = post.get("data-permalink")
        
        # Skip elements without a permalink; the set already ignores duplicates
        if post_link:
            post_links.add(post_link)
    
    logging.info("Posts in page %s gathered. Proceeding to page %s", page + 1, page + 2)
    # Change the link to the next page; stop if there is no "next" button
    next_button = html.select_one(".next-button a")
    if next_button is None:
        break
    params['url'] = next_button["href"]
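
To make the proxy call above concrete: `requests` encodes the `params` dictionary into the query string, so the Reddit address travels as a percent-encoded `url` parameter of the scraperapi endpoint. The following is a minimal standard-library sketch of that encoding. `build_proxy_url` and the `DEMO_KEY` value are hypothetical, and the exact string `requests` produces may differ in minor details.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

PROXY_ENDPOINT = "http://api.scraperapi.com"  # endpoint used in the cell above


def build_proxy_url(api_key, target_url):
    """Return the proxy endpoint with the target page passed as a query parameter."""
    query = urlencode({"api_key": api_key, "url": target_url})
    return f"{PROXY_ENDPOINT}/?{query}"


# DEMO_KEY is a placeholder, not a real API key
proxied = build_proxy_url("DEMO_KEY", "https://old.reddit.com/r/popular/?geo_filter=PH")
print(proxied)

# Decoding the query string recovers the original target URL
decoded = parse_qs(urlsplit(proxied).query)
print(decoded["url"][0])
```

This is also why updating `params['url']` is enough to move to the next page: only the encoded target changes, while the endpoint and the API key stay the same.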

### (b) Use Reddit API to get post details
Now we have the links to the popular posts on the Reddit PH front page. The next step is to send a request to each of them to get the details we need. This time we are using the Reddit API, so we no longer need proxy rotation. The credentials from the JSON file are used to get an OAuth token from Reddit, which authorizes the requests. The token is then added to the header of every HTTP request to the Reddit API.

In [77]:
   
auth = requests.auth.HTTPBasicAuth(credentials['client_id'], credentials['secret_token'])
account = {
    "grant_type": "password",
    "username": credentials["username"],
    "password": credentials["password"]
}

# App info
headers = {'User-Agent': 'Reddit Front Page Scrape Bot V1.0 by darren-sm'}

# Request for oauth token
r = requests.post('https://www.reddit.com/api/v1/access_token', data = account, auth = auth, headers = headers)

# Add token to the header
token = r.json()['access_token']
headers['Authorization'] =  f"bearer {token}"
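
If the token request fails, `r.json()['access_token']` stops with a bare `KeyError`. A small guard can surface the reason instead. This is a sketch that assumes a failed token request comes back as a JSON object carrying an `error` field rather than an `access_token`. `extract_token` and `bearer_header` are hypothetical helpers, and the sample payloads are made up for illustration.

```python
def extract_token(payload):
    """Return the access token from a decoded token response, or raise with the reason."""
    token = payload.get("access_token")
    if not token:
        # Assumed failure shape: a JSON object with an "error" field
        reason = payload.get("error", "no access_token in response")
        raise ValueError(f"Could not get an OAuth token: {reason}")
    return token


def bearer_header(token):
    """Build the Authorization header value in the same format as the cell above."""
    return f"bearer {token}"


# Made-up success payload
print(bearer_header(extract_token({"access_token": "abc123", "token_type": "bearer"})))

# Made-up failure payload
try:
    extract_token({"error": "invalid_grant"})
except ValueError as exc:
    print(exc)
```

In the notebook, this would be used as `token = extract_token(r.json())`.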

##### Process the Data from the Reddit Posts

In [86]:
# Get the comments in a thread
def get_comments(comment_list):
    for comment in comment_list:
        data = comment["data"]
        body = data.get("body")
        if body:    
            html = markdown(body)
            # Strip the HTML tags and keep only the text; name the parser explicitly
            yield ''.join(BeautifulSoup(html, "lxml").find_all(string=True))
        
        replies = data.get("replies")
        if isinstance(replies, dict):
            yield from get_comments(replies["data"]["children"])
            
def get_post_data(endpoint):
    link = "https://oauth.reddit.com" + endpoint
    r = send_request(link, headers = headers)
    
    # Parse the response once: element 0 holds the post, element 1 holds the comments
    listing = r.json()
    post = listing[0]['data']['children'][0]['data']

    # Basic Post Data
    post_data = {
        "title": post['title'],
        "text": post['selftext'],
        "subreddit": post['subreddit'],
        "upvotes": post['score'],
        "date": post['created'],
        "over_18": post['over_18'],
        # Post Category: posts without a `post_hint` are treated as text posts
        "category": post.get('post_hint') or "text"
    }

    # Post comments (200 max)
    post_data["comments"] = list(get_comments(listing[1]["data"]["children"]))
    
    return post_data
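
The recursion in `get_comments` is easier to follow on a small input. The sketch below walks a hand-made comment tree with the same `data` / `body` / `replies` shape that the function reads, but skips the markdown-to-text step so it runs without third-party packages. `iter_bodies` and the sample tree are hypothetical; real responses contain many more fields.

```python
def iter_bodies(children):
    """Yield comment bodies depth-first: each comment, then its replies."""
    for child in children:
        data = child["data"]
        body = data.get("body")
        if body:
            yield body
        replies = data.get("replies")
        # Only a dict holds nested replies; anything else (here an empty string) is skipped
        if isinstance(replies, dict):
            yield from iter_bodies(replies["data"]["children"])


sample = [
    {"data": {"body": "top-level A", "replies": {"data": {"children": [
        {"data": {"body": "reply to A", "replies": ""}},
    ]}}}},
    {"data": {"body": "top-level B", "replies": ""}},
    {"data": {"replies": ""}},  # an entry without a body yields nothing
]

print(list(iter_bodies(sample)))
# ['top-level A', 'reply to A', 'top-level B']
```

Because the function is a generator, the nested replies are flattened lazily, and `list(...)` in `get_post_data` collects them in reading order.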

##### Multi-threading
Make the program faster by sending the requests to Reddit concurrently from a pool of threads. Reddit also limits how many requests we can send per minute. `send_request` retries a request that times out, but it does not check the response status code, so a rate-limit response is not retried.

In [95]:
front_page_data = []

# Use multi-threading to send requests
with concurrent.futures.ThreadPoolExecutor() as executor:
    for result in executor.map(get_post_data, post_links):
        front_page_data.append(result)
        
# Save in one gigantic JSON file
with open("front-page.json", 'w') as f:
    json.dump(front_page_data, f)
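
`ThreadPoolExecutor.map` returns results in the order of its input, even when the calls finish out of order, and `max_workers` caps how many requests are in flight at once. The sketch below shows both with a stand-in for `get_post_data`. `fake_get_post_data`, the endpoint strings, the delays, and the worker count of 4 are hypothetical choices, not values from the notebook.

```python
import concurrent.futures
import time

endpoints = ["/r/a/1/", "/r/b/2/", "/r/c/3/", "/r/d/4/"]

# The first input gets the longest delay, so it finishes last
delays = {endpoint: 0.04 - 0.01 * i for i, endpoint in enumerate(endpoints)}


def fake_get_post_data(endpoint):
    """Stand-in for get_post_data: sleep briefly instead of calling Reddit."""
    time.sleep(delays[endpoint])
    return {"endpoint": endpoint}


# At most 4 calls run at the same time
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(fake_get_post_data, endpoints))

# Results line up with the inputs, not with the completion order
print([result["endpoint"] for result in results])
```

Because `post_links` is a set, the input order itself is arbitrary in the notebook; the point is only that each result lines up with its input.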

## DATA VISUALIZATION
Our data is now ready! The next step is the data visualization.

#### 1. Word Cloud
Word cloud based on the comments of Reddit users

#### 2. 