# Project 3

## Part 0: Scraping and Parsing

This section illustrates how I scraped and parsed Reddit for this project.

All scraping and parsing was done using a custom-made script, `scraper.py`, located in the code directory. This script exports two classes:

`RedditReader`: A class for scraping Reddit using Selenium. Because Reddit loads its post data dynamically in chunks, the scraper has to emulate a human user and scroll down the page as far as necessary to load the posts it scrapes. The goal here was to do this without relying on a much easier-to-use API.
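The scrolling idea described above can be sketched roughly as follows. This is not the code in `scraper.py`; `scroll_page` and its parameters are hypothetical names, and the function assumes only that it receives a Selenium-style driver object with an `execute_script` method.

```python
import time


def scroll_page(driver, times, pause=0.0):
    """Scroll a Selenium-style driver to the bottom of the page `times` times.

    Hypothetical helper, not part of scraper.py. Returns the document height
    reported after each scroll, which shows whether new content was loaded.
    """
    heights = []
    for _ in range(times):
        # Jump to the current bottom of the document, which triggers the
        # dynamic loading of the next chunk of posts.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Give the page time to fetch and render the new chunk.
        time.sleep(pause)
        heights.append(driver.execute_script("return document.body.scrollHeight"))
    return heights
```

If the returned heights stop increasing, the page has probably stopped loading new posts, so a caller could use that as a signal to stop scrolling early.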

`RedditParser`: A class for parsing the Reddit page source for the necessary information.
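As a rough illustration of this kind of parsing, the sketch below pulls a few fields out of saved page source with `bs4` and collects them in a `pandas` DataFrame. It is not the actual `RedditParser`: the function name `parse_posts`, the sample markup, and the CSS selectors are all made up for the example and do not reflect Reddit's real page structure.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; Reddit's real HTML differs.
SAMPLE_HTML = """
<div class="post" data-score="12">
  <a class="title" href="/r/example/1">First post</a>
</div>
<div class="post" data-score="3">
  <a class="title" href="/r/example/2">Second post</a>
</div>
"""


def parse_posts(page_source):
    """Extract title, URL, and score for each post in the page source."""
    soup = BeautifulSoup(page_source, "html.parser")
    records = []
    for post in soup.select("div.post"):
        link = post.select_one("a.title")
        if link is None:
            # Skip containers that do not hold a titled post.
            continue
        records.append({
            "title": link.get_text(strip=True),
            "url": link.get("href"),
            "score": int(post.get("data-score", 0)),
        })
    # Fixing the column order keeps the frame well-formed even with no posts.
    return pd.DataFrame(records, columns=["title", "url", "score"])


posts = parse_posts(SAMPLE_HTML)
```

Keeping the parser separate from the scraper means saved page source can be re-parsed later without opening a browser again.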

The `scraper.py` module uses the following third-party packages:

- `selenium` (requires Chrome, since ChromeDriver is currently hardcoded in the script)
- `bs4`
- `pandas`

The actual scraping and parsing for this project were performed in IPython and from the command line. I replicate my steps below.

### 0. Imports and Preliminaries

```python
# import modules
from scraper import RedditReader, RedditParser
import time
```

### 1. Scraping

**Note:** The code in this section runs an example scrape using the `scraper.py` module. As previously mentioned, the actual scraping was performed from the command line and in IPython. My executable code to automate this is located at the bottom of `scraper.py`.

The code below is shown as a Markdown code block rather than an executable cell, to prevent possible errors due to configuration (see above). To run it, remove the Markdown code fences and convert the cell to a code cell.

```python
number_of_scrolls = 10 # adjust to taste

with RedditReader() as rr:
    rr.set_sleep_time(4) # give time for page load
    
    # get URL
    rr.get() # automatically sleeps for above time
    rr.set_sleep_time(8) # give time for scrolling
    
    for _ in range(number_of_scrolls):
        # scroll the page number_of_scrolls times to get a subset of data
        rr.scroll() # automatically scrolls to bottom of scrollable area
        
    # done with scrolling - write to disk
    rr.write_page_source() # by default timestamps file and writes to scraped dir
    
    # selenium is automatically closed by virtue of the 'with' call
```
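The automatic cleanup mentioned in the last comment relies on Python's context-manager protocol. A minimal sketch of that pattern is shown below, assuming a driver object with a `quit()` method as Selenium's WebDriver provides. `BrowserSession` and `driver_factory` are hypothetical names, and the real `RedditReader` may be implemented differently.

```python
class BrowserSession:
    """Hypothetical stand-in showing how a `with` block can close a browser."""

    def __init__(self, driver_factory):
        # driver_factory is any zero-argument callable that returns a driver,
        # for example selenium.webdriver.Chrome.
        self._driver_factory = driver_factory
        self.driver = None

    def __enter__(self):
        # Start the browser only when the `with` block is entered.
        self.driver = self._driver_factory()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs on normal exit and on exceptions, so the browser never lingers.
        if self.driver is not None:
            self.driver.quit()
            self.driver = None
        # Returning False lets any exception from the block propagate.
        return False
```

With this pattern, an error halfway through a long scroll still shuts the browser down, which matters when a scrape runs unattended.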

### 2. Parsing