# Project 3

## Part 0: Scraping and Parsing

This section illustrates how I scraped and parsed Reddit for this project.

All scraping and parsing was done using two custom-made scripts, `scraper.py` and `parser.py`, located in the code directory. These scripts export the following classes:

`RedditReader`: A class for scraping Reddit using Selenium. Because Reddit loads its post data dynamically in chunks, the scraper has to emulate a human user and scroll down the page as far as necessary to load the posts. The goal was to collect the data without relying on an API, even though an API would have been much easier to use.

`RedditParser`: A class for parsing the Reddit page source for the necessary information.
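
The scrolling approach described for `RedditReader` can be sketched roughly as follows. This is a minimal illustration, not the code in `scraper.py`: the function name `scroll_page` and its parameters are hypothetical, and it only assumes a Selenium-style driver object that exposes `execute_script`.

```python
import time


def scroll_page(driver, n_scrolls, pause_seconds=2.0):
    """Scroll to the bottom of the page up to n_scrolls times.

    Returns the page height recorded after each scroll. `driver` is
    assumed to behave like a Selenium WebDriver (hypothetical helper,
    not taken from scraper.py).
    """
    heights = []
    for _ in range(n_scrolls):
        # jump to the current bottom so the next chunk of posts is requested
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause_seconds)  # give the new chunk time to load
        heights.append(driver.execute_script("return document.body.scrollHeight"))
        # stop early once the page height no longer grows
        if len(heights) >= 2 and heights[-1] == heights[-2]:
            break
    return heights
```

The early exit is a design choice added for the sketch: if two consecutive scrolls report the same height, no new posts were loaded, so further scrolling would only waste time.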

The `scraper.py` module uses the following third-party packages:

- `selenium` (requires Chrome as ChromeDriver is hardcoded in the script for now)

The `parser.py` module uses the following third-party packages:

- `bs4`

The actual scraping and parsing for this project was performed in ipython and from the command line. I will replicate my steps below.

### 0. Imports and Preliminaries

In [11]:
# import modules
from scraper import RedditReader
from parser import RedditParser
import time
import os

import pandas as pd

### 1. Scraping

**Note:** The code in this section will run an example scrape using the `scraper.py` module. As previously mentioned, the actual scraping was performed from the command line and in ipython. My executable code to automate this is located at the bottom of `scraper.py`.

The code below is left as a markdown block rather than a runnable cell, to avoid errors on machines without the required Chrome and ChromeDriver setup (see above). To run it, remove the markdown code fence and convert the block to a code cell.

```python
number_of_scrolls = 10 # adjust to taste

with RedditReader() as rr:
    rr.set_sleep_time(4) # give time for page load
    
    # get URL
    rr.get() # automatically sleeps for above time
    rr.set_sleep_time(8) # give time for scrolling
    
    for _ in range(number_of_scrolls):
        # scroll the page number_of_scrolls times to get a subset of data
        rr.scroll() # automatically scrolls to bottom of scrollable area
        
    # done with scrolling - write to disk
    rr.write_page_source() # by default timestamps file and writes to scraped dir
    
    # selenium is automatically closed by virtue of the 'with' call
```
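
The automatic cleanup at the end of the `with` block relies on Python's context manager protocol. The sketch below shows the general pattern under stated assumptions: `BrowserSession` and `driver_factory` are hypothetical names, and this is not the actual `RedditReader` implementation. It only assumes a driver object with `get`, `quit`, and `page_source`, which Selenium drivers provide.

```python
import time


class BrowserSession:
    """Hypothetical wrapper illustrating the 'with' pattern (not RedditReader)."""

    def __init__(self, driver_factory, sleep_time=1.0):
        self._driver_factory = driver_factory  # e.g. selenium.webdriver.Chrome
        self.sleep_time = sleep_time
        self.driver = None

    def __enter__(self):
        # start the browser only when the 'with' block is entered
        self.driver = self._driver_factory()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # always shut the browser down, even if the block raised
        self.driver.quit()
        return False  # do not swallow exceptions

    def get(self, url):
        self.driver.get(url)
        time.sleep(self.sleep_time)  # wait for the page to load

    def page_source(self):
        return self.driver.page_source
```

With Selenium installed, something like `with BrowserSession(webdriver.Chrome) as session:` would be the intended usage. Because `__exit__` always calls `quit()`, a failed scrape does not leave a stray browser process behind.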

### 2. Parsing

In [4]:
# get scrapes
scrape_dir = '../scrapes/'
scrape_files = os.listdir(scrape_dir)
scrape_files[:5]

['scrape_20220830_172503.txt',
 'scrape_20220830_190718.txt',
 'scrape_20220831_110027.txt',
 'scrape_20220829_164748.txt',
 'scrape_20220831_101422.txt']
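
Note that `os.listdir` returns files in arbitrary order, as the listing above shows. If chronological order matters, the timestamp can be recovered from each filename. The sketch below assumes the `scrape_YYYYMMDD_HHMMSS.txt` pattern seen in the listing; the helper name `scrape_timestamp` is hypothetical.

```python
from datetime import datetime


def scrape_timestamp(filename):
    """Parse the timestamp embedded in a scrape filename.

    Assumes the 'scrape_YYYYMMDD_HHMMSS.txt' pattern seen in the listing.
    """
    stem = filename.rsplit('.', 1)[0]  # drop the file extension
    return datetime.strptime(stem, 'scrape_%Y%m%d_%H%M%S')


files = ['scrape_20220830_172503.txt',
         'scrape_20220830_190718.txt',
         'scrape_20220831_110027.txt',
         'scrape_20220829_164748.txt',
         'scrape_20220831_101422.txt']

# oldest scrape first
files_sorted = sorted(files, key=scrape_timestamp)
```

Because the timestamp fields are fixed-width, a plain `sorted(files)` gives the same order here. Parsing is mainly useful when the actual `datetime` values are needed.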

In [22]:
# for each scrape, parse and collect relevant data, then add to table
info = list()
for file in scrape_files:
    # process contents using custom script (see above)
    # file is read as part of class instantiation
    rp = RedditParser(scrape_dir + file)
    info.append(rp.go())
    print(f'processed {scrape_dir + file}')

processed ../scrapes/scrape_20220830_172503.txt
processed ../scrapes/scrape_20220830_190718.txt
processed ../scrapes/scrape_20220831_110027.txt
processed ../scrapes/scrape_20220829_164748.txt
processed ../scrapes/scrape_20220831_101422.txt
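
To give a rough idea of what a parsing step like this involves, here is a deliberately simplified sketch. It is not the `RedditParser` code: it uses the standard library's `html.parser` instead of `bs4` so that it runs without extra dependencies, and the choice of `<h3>` as the title element, along with the names `TitleCollector` and `parse_titles`, are assumptions made only for illustration.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the text of every <h3> element (hypothetical stand-in for post titles)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._buffer = []
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == 'h3':
            self._in_title = True
            self._buffer = []

    def handle_data(self, data):
        # keep text only while inside a title element
        if self._in_title:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == 'h3' and self._in_title:
            self.titles.append(''.join(self._buffer).strip())
            self._in_title = False


def parse_titles(page_source):
    """Return a dict of lists, a shape that pd.DataFrame accepts directly."""
    collector = TitleCollector()
    collector.feed(page_source)
    collector.close()
    return {'title': collector.titles}


sample = ('<div><h3>CHANI app issues?</h3><p>body</p></div>'
          '<div><h3>Second <b>post</b></h3></div>')
result = parse_titles(sample)
```

With `bs4`, the equivalent lookup would typically be a `find_all` call on the parsed soup. The selectors needed for the real Reddit page source are not shown here.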


In [23]:
# convert dict/arrays to dataframes and concat to a single big dataframe
# (collect the frames first and concat once, instead of growing a frame in the loop)
frames = []

for ix, v in enumerate(info):
    frames.append(pd.DataFrame(v))
    print(f'converted {ix} of {len(info)}')

container_df = pd.concat(frames)
df = container_df.drop_duplicates('uid')

converted 0 of 5
converted 1 of 5
converted 2 of 5
converted 3 of 5
converted 4 of 5
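
Because the scrapes overlap in time, the same post can appear in several files, which is why duplicates are dropped on `uid`. The toy example below (made-up data, not from the project) shows the behavior: `drop_duplicates` keeps the first occurrence of each `uid` by default, so the row from the earliest-processed batch wins.

```python
import pandas as pd

# two hypothetical parse results that share one post ('b2')
batch_a = {'uid': ['a1', 'b2'], 'title': ['first', 'second'], 'comments': [3, 0]}
batch_b = {'uid': ['b2', 'c3'], 'title': ['second', 'third'], 'comments': [5, 1]}

frames = [pd.DataFrame(batch) for batch in (batch_a, batch_b)]
combined = pd.concat(frames, ignore_index=True)  # 4 rows, 'b2' appears twice

# keep='first' is the default: the 'b2' row from batch_a is retained
deduped = combined.drop_duplicates('uid')
```

One consequence is that fields that change between scrapes, such as the comment count, reflect whichever file was processed first, not necessarily the most recent scrape.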


In [24]:
# optionally delete unused variables to save memory (left commented out here)
# del info
# del container_df

In [30]:
# check number of rows
df.shape

(2187, 5)

In [31]:
# check general look of table
df.head()

| | uid | time | title | body-text | comments |
|---|---|---|---|---|---|
| 0 | 214SaturnReturnMEGATHREAD-we'vebeengettingalot... | 2022-02-01 | Saturn Return MEGATHREAD - we've been getting ... | | 330 |
| 1 | 51MERCURYRXINFOGRAPHIC:Taurus/Gemini,Apr-Jun20220 | 2022-06-01 | MERCURY RX INFOGRAPHIC: Taurus/Gemini, Apr-Jun... | | 22 |
| 2 | 17CHANIappissues?221 | 2022-08-30 | CHANI app issues? | I just downloaded the CHANI app to try out and... | 4 |
| 3 | 124Makeeverymilecountnextseason—getthelatestcr... | | Make every mile count next season—get the late... | | 0 |
| 4 | 86IsMercuryinAquariusinthe6thHouseaspowerfulas... | 2022-08-30 | Is Mercury in Aquarius in the 6th House as pow... | Not new to the deeper parts of astrology but t... | 8 |


In [29]:
# rearrange columns
df = df[['uid','time','title','body-text','comments']]
df.head()

| | uid | time | title | body-text | comments |
|---|---|---|---|---|---|
| 0 | 214SaturnReturnMEGATHREAD-we'vebeengettingalot... | 2022-02-01 | Saturn Return MEGATHREAD - we've been getting ... | | 330 |
| 1 | 51MERCURYRXINFOGRAPHIC:Taurus/Gemini,Apr-Jun20220 | 2022-06-01 | MERCURY RX INFOGRAPHIC: Taurus/Gemini, Apr-Jun... | | 22 |
| 2 | 17CHANIappissues?221 | 2022-08-30 | CHANI app issues? | I just downloaded the CHANI app to try out and... | 4 |
| 3 | 124Makeeverymilecountnextseason—getthelatestcr... | | Make every mile count next season—get the late... | | 0 |
| 4 | 86IsMercuryinAquariusinthe6thHouseaspowerfulas... | 2022-08-30 | Is Mercury in Aquarius in the 6th House as pow... | Not new to the deeper parts of astrology but t... | 8 |
