# Project 3

## Part 0: Scraping and Parsing

This section illustrates how I scraped and parsed Reddit for this project.

Most scraping and parsing was done using two custom-made scripts, `scraper.py` and `parser.py`, located in the code directory. These scripts export the following classes:

`RedditReader`: A class for scraping Reddit using Selenium. Because Reddit loads its post data dynamically in chunks, the scraper has to emulate a human user and scroll down the page as far as necessary to load the posts. The goal here was to avoid relying on the much easier-to-use API.

`RedditParser`: A class for parsing the Reddit page source for the necessary information.

The `scraper.py` module uses the following third-party packages:

- `selenium` (requires Chrome as ChromeDriver is hardcoded in the script for now)
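
The scroll-until-loaded approach that `RedditReader` relies on can be sketched roughly as follows. This is not the code in `scraper.py`; the function name `scroll_to_bottom` and its parameters are hypothetical, and the only Selenium call assumed is the WebDriver method `execute_script`.

```python
import time


def scroll_to_bottom(driver, times=1, pause=8.0):
    """Scroll the page `times` times, pausing after each scroll.

    `driver` is any object exposing Selenium's `execute_script` method
    (for example a Chrome WebDriver). Returns the page height reported
    after each scroll, which shows whether new content was loaded.
    """
    heights = []
    for _ in range(times):
        # Jump to the current bottom of the document, which prompts
        # the page to request its next chunk of posts.
        driver.execute_script(
            "window.scrollTo(0, document.body.scrollHeight);"
        )
        time.sleep(pause)  # give the new chunk time to load
        heights.append(
            driver.execute_script("return document.body.scrollHeight;")
        )
    return heights
```

With a real browser session, the page would first be opened with `driver.get(...)`, and `driver.page_source` would be read once scrolling is finished.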

The `parser.py` module uses the following third-party packages:

- `bs4`
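
As a rough illustration of the kind of extraction `bs4` makes possible, here is a minimal sketch. The sample markup, the tag and class names, and the `parse_posts` function are all made up for this example; the real Reddit page source and the logic in `parser.py` are considerably more involved.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: real Reddit pages use different, more complex tags.
SAMPLE_HTML = """
<div class="post">
  <h3>Astrology and cognitive dissonance</h3>
  <p class="body">Open to anyone who wouldn't mind sharing.</p>
</div>
<div class="post">
  <h3>Thousands of uncharted planets</h3>
</div>
"""


def parse_posts(html):
    """Return one dict per post with its title and optional body text."""
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for node in soup.find_all("div", class_="post"):
        title = node.find("h3")
        body = node.find("p", class_="body")
        posts.append({
            "title": title.get_text(strip=True) if title else None,
            # Link or media posts may have no body text at all.
            "body-text": body.get_text(strip=True) if body else None,
        })
    return posts


posts = parse_posts(SAMPLE_HTML)
```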

The actual scraping done for this project was performed in IPython and on the command line. I replicate my steps below.

### 0. Imports and Preliminaries

In [1]:
# import modules
from scraper import RedditReader
from parser import RedditParser
import os

import pandas as pd

### 1. Scraping

**Note:** The code in this section will run an example scrape using the `scraper.py` module. As previously mentioned, the actual scraping was performed from the command line and in ipython. My executable code to automate this is located at the bottom of `scraper.py` and in the `multiscraper.py` script.

The code below is shown as a Markdown code block rather than a code cell to prevent possible errors due to configuration (see above). To run it, remove the Markdown code fences and convert it to a code cell.

```python
number_of_scrolls = 10 # adjust to taste

with RedditReader() as rr:
    rr.set_sleep_time(4) # give time for page load
    
    # get URL
    rr.get() # automatically sleeps for above time
    rr.set_sleep_time(8) # give time for scrolling
    
    for i in range(number_of_scrolls):
        # scroll the page number_of_scrolls times to get a subset of data
        rr.scroll() # automatically scrolls to bottom of scrollable area
        
    # done with scrolling - write to disk
    rr.write_page_source() # by default timestamps file and writes to scraped dir
    
    # selenium is automatically closed by virtue of the 'with' call
```

### 2. Parsing

In [2]:
# get main subreddit scrapes
scrape_dir = '../scrapes/'
scrape_files = os.listdir(scrape_dir)
print(scrape_files[:5])

['scrape_20220905_210656.txt', 'search', 'scrape_20220905_204314.txt', 'scrape_20220830_172503.txt', 'page']


In [3]:
# for each scrape, parse and collect relevant data, then add to a list
info = list()
for file in scrape_files:
    if not file.endswith('.txt'): continue
    # process contents using custom script (see above)
    # file is read as part of class instantiation
    rp = RedditParser(scrape_dir + file)
    info.append(rp.go())
    print(f'processed {scrape_dir}{file}')

processed ../scrapes/scrape_20220905_210656.txt
processed ../scrapes/scrape_20220905_204314.txt


KeyboardInterrupt: 

In [None]:
# convert dict/arrays to dataframes and concat to a single big dataframe
container_df = pd.DataFrame()

for ix, v in enumerate(info):
    container_df = pd.concat([container_df, pd.DataFrame(v)])
    print(f'converted {ix + 1} of {len(info)}')

# drop duplicate rows (determined by uid created as part of the parser)
df = container_df.drop_duplicates('uid')
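
An alternative to growing the frame inside the loop is to build every frame first and concatenate once, which avoids repeatedly copying the accumulated rows. The sketch below assumes each parser result is a dict of equal-length arrays, as the comment above suggests; the `uid` and `title` values are invented stand-ins.

```python
import pandas as pd

# Stand-ins for the per-file results collected in `info`; values are made up.
info = [
    {"uid": ["a1", "b2"], "title": ["first", "second"]},
    {"uid": ["b2", "c3"], "title": ["second", "third"]},
]

# Build every frame first, then concatenate in a single call.
frames = [pd.DataFrame(v) for v in info]
combined = pd.concat(frames, ignore_index=True)

# Keep only the first row seen for each uid.
deduped = combined.drop_duplicates("uid")
print(deduped.shape)
```

Here the post with uid `b2` appears in both scrapes, so four rows go in and three come out.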

In [None]:
# delete unused variables to save memory
del info
del container_df

In [None]:
# we have additional post scrapes to parse
post_scrape_dir = scrape_dir + 'page/'
post_scrape_files = os.listdir(post_scrape_dir)
print(post_scrape_files[:5])

In [None]:
# parse each post scrape
info = list()
count = 0
for file in post_scrape_files:
    rp = RedditParser(post_scrape_dir + file, tags_file='tags_post.json')
    info.append(rp.go())
    count += 1
    #print(f'processed {post_scrape_dir}{file}')
    if not count % 50: print('.', end='') # 7k+ files - shorten output

In [None]:
container_df = pd.DataFrame()
count = 0

for ix, v in enumerate(info):
    container_df = pd.concat([container_df, pd.DataFrame(v)])
    count += 1
    #print(f'converted {str(ix)} of {len(info)}')
    if not count % 50: print('.', end='') # shorten output

In [None]:
container_df.head()

In [None]:
container_df.shape

In [11]:
# combine all rows
df = pd.concat([df, container_df]).drop_duplicates('uid')
df.shape

(8718, 6)

In [None]:
# TODO: test for and drop duplicate titles

### 3. Verify and Write To Disk

In [12]:
# reset index
df = df.reset_index(drop=True)

In [13]:
# check number of rows
df.shape

(8718, 6)

In [14]:
# check general look of table
df.head()

| | title | time | comments | body-text | media | uid |
|---|---|---|---|---|---|---|
| 0 | Newbie questions about ascendants and borders | 2022-09-05 | 0 | I'm new to actually learning astrology, not ju... | | 45Newbiequestionsaboutascendantsandborders601 |
| 1 | Thousands of uncharted planets at your fingert... | | 0 | | True | 100Thousandsofunchartedplanetsatyourfingertips... |
| 2 | Astrology and cognitive dissonance | 2022-09-05 | 1 | Open to anyone who wouldn't mind sharing a rec... | | 34Astrologyandcognitivedissonance323 |
| 3 | what do y’all think of persona charts? | 2022-09-05 | 0 | I feel a bit skeptical of them, since I feel l... | | 38whatdoy’allthinkofpersonacharts?180 |
| 4 | RESOURCE REQUEST: Videos (or articles) with ti... | 2022-09-05 | 2 | I think my problem is that I don’t know the pr... | | 160RESOURCEREQUEST:Videos(orarticles)withtips/... |


In [15]:
# rearrange columns
df = df[['uid','time','title','body-text','media','comments']]
df.head()

| | uid | time | title | body-text | media | comments |
|---|---|---|---|---|---|---|
| 0 | 45Newbiequestionsaboutascendantsandborders601 | 2022-09-05 | Newbie questions about ascendants and borders | I'm new to actually learning astrology, not ju... | | 0 |
| 1 | 100Thousandsofunchartedplanetsatyourfingertips... | | Thousands of uncharted planets at your fingert... | | True | 0 |
| 2 | 34Astrologyandcognitivedissonance323 | 2022-09-05 | Astrology and cognitive dissonance | Open to anyone who wouldn't mind sharing a rec... | | 1 |
| 3 | 38whatdoy’allthinkofpersonacharts?180 | 2022-09-05 | what do y’all think of persona charts? | I feel a bit skeptical of them, since I feel l... | | 0 |
| 4 | 160RESOURCEREQUEST:Videos(orarticles)withtips/... | 2022-09-05 | RESOURCE REQUEST: Videos (or articles) with ti... | I think my problem is that I don’t know the pr... | | 2 |


In [16]:
# will export to json here because that seems more intuitive to me for this
# kind of long string data
df.to_json('../data/scrapes.json', orient='index')
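
To make the shape of the exported file concrete, the following small sketch shows what `orient='index'` produces for a toy frame. The frame and its values are invented for illustration; only the column names echo the real data.

```python
import json

import pandas as pd

# Toy frame with made-up values and a default integer index.
toy = pd.DataFrame({"uid": ["a1", "b2"], "comments": [0, 2]})

# orient='index' keys the JSON by row label, with one object per row.
as_text = toy.to_json(orient="index")
as_dict = json.loads(as_text)
print(as_dict)
# {'0': {'uid': 'a1', 'comments': 0}, '1': {'uid': 'b2', 'comments': 2}}
```

Each post therefore ends up as its own JSON object keyed by its row number, which keeps long text fields grouped with the rest of that post's record.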

In [17]:
# END