# Project 3

## Part 0: Scraping and Parsing

This section illustrates how I scraped and parsed Reddit for this project.

Most scraping and parsing was done using two custom-made scripts, `scraper.py` and `parser.py`, located in the code directory. These scripts export the following classes:

`RedditReader`: A class for scraping Reddit using Selenium. Because Reddit loads its post data dynamically in chunks, the scraper has to emulate a human user and scroll down the page as far as necessary to load the posts. The goal was to scrape without relying on the much easier-to-use API.

`RedditParser`: A class for parsing the Reddit page source for the necessary information.
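
The scrolling approach described for `RedditReader` can be sketched roughly as follows. This is a minimal illustration, not the code in `scraper.py`: the function name `scroll_to_bottom` and its parameters are assumptions, and it relies only on a driver object exposing Selenium's `execute_script` method. It also assumes the whole document scrolls, and it stops early once the page height stops growing.

```python
import time


def scroll_to_bottom(driver, max_scrolls=10, pause=8.0):
    """Scroll a dynamically loading page; return the number of scrolls performed.

    `driver` is assumed to be a Selenium WebDriver (or anything exposing
    `execute_script`); the name and defaults here are illustrative only.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    for _ in range(max_scrolls):
        # jump to the current bottom so the page requests its next chunk of posts
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        scrolls += 1
        time.sleep(pause)  # give the new chunk time to load
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so stop early
        last_height = new_height
    return scrolls
```

Comparing heights before and after each scroll avoids waiting through scrolls that load nothing, at the cost of stopping too soon if a chunk takes longer than `pause` seconds to arrive.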

The `scraper.py` module uses the following third-party packages:

- `selenium` (requires Chrome, since ChromeDriver is currently hardcoded in the script)

The `parser.py` module uses the following third-party packages:

- `bs4`

The actual scraping for this project was performed in IPython and from the command line. I replicate my steps below.

### 0. Imports and Preliminaries

In [1]:
# import modules
from scraper import RedditReader
from parser import RedditParser
import os

import pandas as pd

### 1. Scraping

**Note:** The code in this section runs an example scrape using the `scraper.py` module. As previously mentioned, the actual scraping was performed from the command line and in IPython. The executable code that automates it is located at the bottom of `scraper.py` and in the `multiscraper.py` script.

The code below is kept in a markdown code fence rather than a code cell to prevent errors caused by local configuration (see above). To run it, remove the fence and convert it to a code cell.

```python
number_of_scrolls = 10 # adjust to taste

with RedditReader() as rr:
    rr.set_sleep_time(4) # give time for page load
    
    # get URL
    rr.get() # automatically sleeps for above time
    rr.set_sleep_time(8) # give time for scrolling
    
    for _ in range(number_of_scrolls):
        # scroll repeatedly to load a subset of the data
        rr.scroll() # automatically scrolls to bottom of scrollable area
        
    # done with scrolling - write to disk
    rr.write_page_source() # by default timestamps file and writes to scraped dir
    
    # selenium is automatically closed by virtue of the 'with' call
```
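
The `with` statement above works because the class follows the context-manager protocol. A minimal sketch of that pattern is shown below. The names `ReaderSketch` and `driver_factory`, and the `url` argument to `get`, are hypothetical and do not reflect the actual `RedditReader` internals; the only Selenium calls assumed are the driver's `get` and `quit` methods.

```python
import time


class ReaderSketch:
    """Illustrative context manager; not the actual RedditReader implementation."""

    def __init__(self, driver_factory, sleep_time=4):
        # driver_factory is any zero-argument callable returning a driver,
        # for example selenium.webdriver.Chrome (hypothetical wiring)
        self._driver_factory = driver_factory
        self.sleep_time = sleep_time
        self.driver = None

    def __enter__(self):
        self.driver = self._driver_factory()  # start the browser on entry
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.driver.quit()  # always close the browser, even after an error
        return False  # do not swallow exceptions raised inside the block

    def get(self, url):
        self.driver.get(url)
        time.sleep(self.sleep_time)  # wait for the initial page load
```

Putting the cleanup in `__exit__` means a failed scrape does not leave a stray browser process behind.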

### 2. Parsing

In [2]:
# get main subreddit scrapes
scrape_dir = '../scrapes/'
scrape_files = os.listdir(scrape_dir)
print(scrape_files[:5])

['scrape_20220905_210656.txt', 'search', 'scrape_20220905_204314.txt', 'scrape_20220830_172503.txt', 'page']
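
Since `os.listdir` returns subdirectories (`search`, `page`) alongside the scrape files, one option is to select files by pattern and order them by the timestamp embedded in the name. The sketch below assumes the `scrape_YYYYMMDD_HHMMSS.txt` naming seen in the output above; the helper name `list_scrapes` is hypothetical.

```python
from datetime import datetime
from pathlib import Path


def list_scrapes(directory, prefix="scrape_"):
    """Return (timestamp, path) pairs for scrape files, oldest first."""
    found = []
    for path in Path(directory).glob(f"{prefix}*.txt"):
        if not path.is_file():
            continue  # skip directories that happen to match the pattern
        stamp = path.stem[len(prefix):]  # e.g. '20220905_210656'
        try:
            when = datetime.strptime(stamp, "%Y%m%d_%H%M%S")
        except ValueError:
            continue  # ignore files that do not follow the naming scheme
        found.append((when, path))
    return sorted(found)
```

For example, `for when, path in list_scrapes('../scrapes/'):` would iterate over the top-level scrapes in chronological order without touching the `page` or `search` subdirectories.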


In [3]:
# for each scrape, parse and collect relevant data, then add to a list
info = list()
for file in scrape_files:
    if not file.endswith('.txt'): continue
    # process contents using custom script (see above)
    # file is read as part of class instantiation
    rp = RedditParser(scrape_dir + file)
    info.append(rp.go())
    print(f'processed {scrape_dir}{file}')

processed ../scrapes/scrape_20220905_210656.txt
processed ../scrapes/scrape_20220905_204314.txt
processed ../scrapes/scrape_20220830_172503.txt
processed ../scrapes/scrape_20220830_190718.txt
processed ../scrapes/scrape_20220831_110027.txt
processed ../scrapes/scrape_20220829_164748.txt
processed ../scrapes/scrape_20220831_101422.txt
processed ../scrapes/scrape_20220905_194303.txt


In [4]:
# convert dict/arrays to dataframes and concat to a single big dataframe
container_df = pd.DataFrame()

for ix, v in enumerate(info):
    container_df = pd.concat([container_df, pd.DataFrame(v)])
    print(f'converted {ix} of {len(info)}')

# drop duplicate rows (determined by uid created as part of the parser)
df = container_df.drop_duplicates('uid')

converted 0 of 8
converted 1 of 8
converted 2 of 8
converted 3 of 8
converted 4 of 8
converted 5 of 8
converted 6 of 8
converted 7 of 8
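
Calling `pd.concat` inside a loop copies the accumulated frame on every iteration, which becomes slow with many inputs. An alternative sketch builds all frames first and concatenates once, then deduplicates on `uid`. The sample records are made up for illustration and only mimic the assumed dict-of-lists shape of the parser output.

```python
import pandas as pd

# made-up stand-ins for the parser results (assumed dict-of-lists shape)
info = [
    {"uid": ["a1", "b2"], "title": ["first post", "second post"]},
    {"uid": ["b2", "c3"], "title": ["second post", "third post"]},
]

# build every frame up front, then concatenate in a single call
frames = [pd.DataFrame(v) for v in info]
combined = pd.concat(frames, ignore_index=True)

# keep the first occurrence of each uid
deduped = combined.drop_duplicates(subset="uid").reset_index(drop=True)
print(deduped)
```

With only a handful of scrape files the difference is negligible, but it matters more for the several thousand post-level scrapes parsed later.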


In [5]:
# delete unused variables to save memory
del info
del container_df

In [26]:
# we have additional post scrapes to parse
post_scrape_dir = scrape_dir + 'page/'
post_scrape_files = os.listdir(post_scrape_dir)
print(post_scrape_files[:5])

['scrape_page_20220906_091432.txt', 'scrape_page_20220905_222246.txt', 'scrape_page_20220906_073832.txt', 'scrape_page_20220906_042333.txt', 'scrape_page_20220906_090702.txt']


In [33]:
# parse each post scrape
info = list()
count = 0
for file in post_scrape_files:
    rp = RedditParser(post_scrape_dir + file, tags_file='tags_post.json')
    info.append(rp.go())
    count += 1
    #print(f'processed {post_scrape_dir}{file}')
    if not count % 50: print('.', end='') # 7k+ files - shorten output

...............................................................................................................................................

In [35]:
container_df = pd.DataFrame()
count = 0

for ix, v in enumerate(info):
    container_df = pd.concat([container_df, pd.DataFrame(v)])
    count += 1
    #print(f'converted {str(ix)} of {len(info)}')
    if not count % 50: print('.', end='') # shorten output

...............................................................................................................................................

In [36]:
container_df.head()

| | title | time | comments | body-text | media | uid |
|---|---|---|---|---|---|---|
| 0 | A morbid /showerthought for scorpio season onl... | 2020-09-16 | | If I were to die by some sort of sudden accide... | | 75Amorbid/showerthoughtforscorpioseasononlyast... |
| 0 | Scorpio risings, how do you interpret and/or r... | 2021-09-10 | | Do you feel like you are subconsciously projec... | | 88Scorpiorisings,howdoyouinterpretand/orrespon... |
| 0 | Air Signs and Emotional Reactions | 2021-09-11 | | This is possibly an unpopular opinion but I'd ... | | 33AirSignsandEmotionalReactions666 |
| 0 | Libra moon and Taurus Moon a surprisingly good... | 2021-09-11 | | Normally people think that air and earth aren’... | | 53LibramoonandTaurusMoonasurprisinglygoodmatch... |
| 0 | jupiter in gemini meaning? | 2021-09-11 | | is having a placement in detriment more diffic... | | 26jupiteringeminimeaning?50 |


In [37]:
container_df.shape

(7150, 6)

In [38]:
# combine all rows
df = pd.concat([df, container_df]).drop_duplicates('uid')
df.shape

(8718, 6)

### 3. Verify and Write to Disk

In [6]:
# reset index
df = df.reset_index(drop=True)

In [7]:
# check number of rows
df.shape

(2187, 6)

In [8]:
# check general look of table
df.head()

| | title | time | comments | body-text | media | uid |
|---|---|---|---|---|---|---|
| 0 | Saturn Return MEGATHREAD - we've been getting ... | 2022-02-01 | 330 | | | 214SaturnReturnMEGATHREAD-we'vebeengettingalot... |
| 1 | MERCURY RX INFOGRAPHIC: Taurus/Gemini, Apr-Jun... | 2022-06-01 | 22 | | | 51MERCURYRXINFOGRAPHIC:Taurus/Gemini,Apr-Jun20220 |
| 2 | CHANI app issues? | 2022-08-30 | 4 | I just downloaded the CHANI app to try out and... | | 17CHANIappissues?221 |
| 3 | Make every mile count next season—get the late... | | 0 | | True | 124Makeeverymilecountnextseason—getthelatestcr... |
| 4 | Is Mercury in Aquarius in the 6th House as pow... | 2022-08-30 | 8 | Not new to the deeper parts of astrology but t... | | 86IsMercuryinAquariusinthe6thHouseaspowerfulas... |


In [9]:
# rearrange columns
df = df[['uid','time','title','body-text','media','comments']]
df.head()

| | uid | time | title | body-text | media | comments |
|---|---|---|---|---|---|---|
| 0 | 214SaturnReturnMEGATHREAD-we'vebeengettingalot... | 2022-02-01 | Saturn Return MEGATHREAD - we've been getting ... | | | 330 |
| 1 | 51MERCURYRXINFOGRAPHIC:Taurus/Gemini,Apr-Jun20220 | 2022-06-01 | MERCURY RX INFOGRAPHIC: Taurus/Gemini, Apr-Jun... | | | 22 |
| 2 | 17CHANIappissues?221 | 2022-08-30 | CHANI app issues? | I just downloaded the CHANI app to try out and... | | 4 |
| 3 | 124Makeeverymilecountnextseason—getthelatestcr... | | Make every mile count next season—get the late... | | True | 0 |
| 4 | 86IsMercuryinAquariusinthe6thHouseaspowerfulas... | 2022-08-30 | Is Mercury in Aquarius in the 6th House as pow... | Not new to the deeper parts of astrology but t... | | 8 |


In [10]:
# will export to json here because that seems more intuitive to me for this
# kind of long string data
df.to_json('../data/scrapes.json', orient='index')
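
With `orient='index'`, pandas writes a single JSON object keyed by row label, where each value maps column names to cell values. A minimal sketch of reading such a layout back with the standard library follows. The sample string is hand-written to mimic that layout and is not actual output from this project; the second record is a placeholder.

```python
import json

# hand-written sample in the {row label: {column: value}} layout
sample = """
{
  "0": {"uid": "17CHANIappissues?221", "time": "2022-08-30",
        "title": "CHANI app issues?", "comments": 4},
  "1": {"uid": "example-uid", "time": null,
        "title": "placeholder title", "comments": 0}
}
"""

records = json.loads(sample)

# row labels come back as strings, so sort numerically if order matters
for label in sorted(records, key=int):
    row = records[label]
    print(label, row["title"], row["comments"])
```

Missing values are stored as `null` and load as `None`, so downstream code should handle them explicitly.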

In [11]:
# END