# Scraping MoonBoard Problems
This notebook scrapes MoonBoard problems from the MoonBoard site using an automated clicking routine built with Selenium.

The scraping process produces four (4) pickle files:
1. problems_dict.pickle
2. failed_uids_dict.pickle
3. problems_dict_holds.pickle
4. moonboard_data.pickle

These items map onto three phases of data mining:

**Phase 1: Get all URLs leading to specific problems**

*Produces: Item(s) 1*
* Accessing all problems in the MoonBoard problems repository requires clicking through every page on the site
* On each page, a set of problems is shown in a scrollable UI element
* Each problem within this scrollable UI element has a URL leading to a unique webpage that displays a problem and related metadata
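
The page-by-page collection described above can be sketched without a browser. This is a minimal illustration, not the helper module's implementation: `collect_problem_urls` and the `get_page` callable are hypothetical names, with `get_page` standing in for the Selenium code that reads one page of the scrollable list.

```python
def collect_problem_urls(get_page, max_pages=-1):
    """Walk result pages in order and merge their {uid: url} entries.

    `get_page(i)` is a hypothetical callable that returns a dict for page `i`,
    or None once there are no more pages. `max_pages=-1` means "all pages".
    """
    urls = {}
    page = 0
    while max_pages < 0 or page < max_pages:
        entries = get_page(page)
        if entries is None:
            break  # ran past the last page
        urls.update(entries)
        page += 1
    return urls
```

Passing a small `max_pages` is a convenient way to dry-run the loop before committing to a full crawl.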

**Phase 2: Access each problem's page and extract metadata**

*Produces: Item(s) 2, 3*
* After Phase 1, we have a dictionary that maps each unique problem to its corresponding webpage via URL
* We access each unique webpage and extract metadata into **Item (3)**
* Every unsuccessful access attempt is stored in **Item (2)**
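
The success/failure bookkeeping in Phase 2 might look roughly like the sketch below. It is an assumption about the general shape, not the actual `scrape_problems` helper: `scrape_all` and the `extract` callable are hypothetical, with `extract` standing in for the Selenium page access.

```python
def scrape_all(url_by_uid, extract, results=None, failed=None):
    """Visit each URL once, collecting metadata and recording failures.

    `extract` is a hypothetical callable (url -> metadata dict) that may
    raise when a page cannot be accessed or parsed.
    """
    results = {} if results is None else results
    failed = {} if failed is None else failed
    for uid, url in url_by_uid.items():
        if uid in results:
            continue  # already scraped in an earlier run
        try:
            results[uid] = extract(url)
            failed.pop(uid, None)  # a retry succeeded, so clear the old failure
        except Exception as exc:
            failed[uid] = repr(exc)  # keep the reason for later inspection
    return results, failed
```

Keeping failures in their own dictionary lets a later run retry only those UIDs instead of revisiting every page.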

**Phase 3: Format schema for neural network**

*Produces: Item(s) 4*
* After Phase 2, we have a dictionary of MoonBoard problems and associated metadata
* Phase 3 processes this scraped data into a schema that is consistent and suitable for input to neural network training

## Setup

In [None]:
import os
import shutil
import time

from moonboard_helper import *

In [None]:
# Load credentials
with open('./credentials.txt') as f:
    flines = f.readlines()

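# Each line has the form "key - value"; split on the first '-' only so values may contain hyphens
cred_dict = {k.strip(): v.strip() for k, v in (s.split('-', 1) for s in flines if '-' in s)}
print(sorted(cred_dict))  # show key names only, to avoid echoing the password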
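
For reuse outside the notebook, the credentials parsing could be wrapped in a small function. This is a sketch assuming the file uses one `key - value` pair per line, as the cell above implies; `parse_credentials` is a hypothetical name, not part of `moonboard_helper`.

```python
def parse_credentials(lines):
    """Parse "key - value" lines into a dict, splitting on the first '-' only."""
    creds = {}
    for line in lines:
        key, sep, value = line.partition('-')
        if not sep or not key.strip():
            continue  # skip blank or malformed lines
        creds[key.strip()] = value.strip()
    return creds
```

Because only the first `-` separates key from value, paths such as `./data/problems-dict.pickle` survive intact.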

In [None]:
username = cred_dict['username']
password = cred_dict['password']
driver_path = cred_dict['driver_path']
save_path = cred_dict['save_path']
save_path_holds = cred_dict['save_path_holds']
save_path_failed = cred_dict['save_path_failed']
save_path_final = cred_dict['save_path_final']

moonboard_url = 'https://moonboard.com/'

## Phase 1: Preliminary Scraping (URLs)

In [None]:
# Load browser and log in to MoonBoard
browser = load_browser(driver_path)
loginMoonBoard(browser, moonboard_url, username, password)
time.sleep(2)

In [None]:
# Get problems view
click_view_problems(browser)
click_holdsetup(browser)

In [None]:
# Process all pages (num_pages=-1 gets all pages)
if not os.path.exists(save_path):
    problems_dict = process_all_pages(browser, save_path, num_pages=-1)
    save_pickle(problems_dict, save_path)
else:
    problems_dict = load_pickle(save_path)
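
The load-if-present, otherwise-build-and-save pattern used in this cell can be factored out as below. This is a sketch only: the `save_pickle` and `load_pickle` bodies are assumed stand-ins for the helpers of the same name in `moonboard_helper`, and `load_or_build` is a hypothetical addition.

```python
import os
import pickle


def save_pickle(obj, path):
    """Assumed behavior of the helper: serialize `obj` to `path`."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_pickle(path):
    """Assumed behavior of the helper: deserialize the object stored at `path`."""
    with open(path, "rb") as f:
        return pickle.load(f)


def load_or_build(path, build):
    """Return the cached object at `path`, or call `build()` and cache its result."""
    if os.path.exists(path):
        return load_pickle(path)
    obj = build()
    save_pickle(obj, path)
    return obj
```

With this in place, re-running the notebook skips the expensive scrape whenever the pickle already exists; deleting the file forces a fresh run.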

In [None]:
# Number of scraped problems
print('Number of problems:', len(problems_dict))

## Phase 2: Secondary Scraping (Problems)

In [None]:
# On the first run, start the holds dictionary as a copy of the Phase 1 problems dictionary
if not os.path.exists(save_path_holds):
    shutil.copyfile(save_path, save_path_holds)

holds_dict = load_pickle(save_path_holds)

In [None]:
# Failed uids
if not os.path.exists(save_path_failed):
    print('Creating failed uids dictionary...')
    failed_uids_dict = {}
    save_pickle(failed_uids_dict, save_path_failed)
else:
    print('Loading failed uids dictionary...')
    failed_uids_dict = load_pickle(save_path_failed)
    print('Number of failed Uids:', len(failed_uids_dict))

In [None]:
# Scrape specific problems
holds_dict, failed_uids_dict = scrape_problems(
    browser, 
    holds_dict, 
    save_path_holds, 
    failed_uids_dict, 
    save_path_failed
)

In [None]:
# Close the browser and end the WebDriver session
browser.quit()

## Phase 3: Schema Organization

In [None]:
# Format mined problems
final_dict = cast_to_basic_schema(holds_dict)
save_pickle(final_dict, save_path_final)