## WSL Web Scraper

#### This notebook scrapes the WSL (World Surf League) website to collect every wave scored from 2016 to 2018 and stores the data in a separate file for later analysis. It relies on both BeautifulSoup and Selenium to scrape all the necessary data.

#### The web scraping is multi-processed, which lets the scraping finish faster.
<br>

##### By: Connor Secen

In [1]:
# imports required
import time
import requests
import multiprocessing
import csv
import pandas as pd

# imports for webscraping
from bs4 import BeautifulSoup, Comment

from selenium import webdriver
import selenium.webdriver.support.expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

The scraping itself takes the most time and the most code in this project. A single function, `scraper`, handles all of the web scraping. Each process calls this function with a different URL, corresponding to a specific year.
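
The function reads the year from the URL by fixed character positions (`url[39:43]`), which only works while the URL prefix keeps its exact length. As a minimal sketch, assuming the year is always the path segment right after `events` (as in the three URLs used in this notebook), it could instead be pulled out with `urllib.parse`. The helper name `year_from_url` is hypothetical and not part of the notebook.

```python
from urllib.parse import urlparse


def year_from_url(url):
    """Return the four-digit year that follows 'events' in a WSL events URL."""
    # Split the path into its non-empty segments, e.g. ['events', '2016', 'mct']
    segments = [s for s in urlparse(url).path.split('/') if s]
    # Assumption: the segment after 'events' is the year
    year = segments[segments.index('events') + 1]
    if not (len(year) == 4 and year.isdigit()):
        raise ValueError('no year found in ' + url)
    return year


print(year_from_url('https://www.worldsurfleague.com/events/2016/mct'))
```

For the URLs in this notebook the result is the same string that the fixed slice produces, so the rest of `scraper` would not need to change.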

BeautifulSoup and Selenium are used together in this project because both have their limitations. BeautifulSoup is well suited to parsing HTML and collecting the contents of a webpage, which is exactly how it is used here. However, because the WSL webpage shows the information we need inside modals, BeautifulSoup cannot access it directly. This is where Selenium comes in: its webdriver controls a browser session, so it can navigate the site and open each modal, after which BeautifulSoup can parse the resulting page source.

In [2]:
def scraper(url, filename):
    year = url[39:43]   # get the year of the events (characters 39-42 of the url)
    
    roundNames = ['Round 1', 'Round 2', 'Round 3', 'Round 4', 'Round 5', 'Quarterfinals', 'Semifinals', 'Finals']
    
    r = requests.get(url)   # send request to the starting url
    soup = BeautifulSoup(r.content, 'html.parser')
    eventsRaw = soup.findAll('span', attrs={'class':'tour-event-name'})
    events = [e.text for e in eventsRaw]   # collect all names of the events that happened in the corresponding year
    
    driver = webdriver.Chrome('/usr/bin/chromedriver')   # create an instance of a selenium web driver
    driver.get(url)
    
    # loop through every event for the year and collect every wave 
    for event in events:
        if event == 'Surf Ranch Pro':
            continue
            
        #print('Starting new event')
        # wait for the page to load, then locate the link to the next event and scroll it into view
        time.sleep(5)
        element = driver.find_element(By.XPATH, '//a[@title="' + event + '"]')
        driver.execute_script("arguments[0].scrollIntoView(true);", element)
        driver.execute_script("window.scrollBy(0, -100)")
        
        # navigate to the results page for the current event
        time.sleep(5)
        element.click()
        time.sleep(5)
        driver.find_element(By.LINK_TEXT, 'Results').click()
        time.sleep(5)
        
        rounds = driver.find_elements(By.XPATH, "//a[@data-request-name='postEventWatch']")
        
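        if year == '2018' and event == 'Billabong Pipe Masters':
            rounds = rounds[4:]
            # use the round list without 'Round 5' for this event
            # (note: the shortened list also applies to any events processed after it)
            roundNames = ['Round 1', 'Round 2', 'Round 3', 'Round 4', 'Quarterfinals', 'Semifinals', 'Finals']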
        
        driver.execute_script("arguments[0].scrollIntoView(true);", rounds[0])
        driver.execute_script("window.scrollBy(0, -100)")
        time.sleep(5)
        #driver.execute_script("window.scrollTo(0, 150)")
        for idx1, r in enumerate(rounds):
            #print('Starting on the round')
            rounds2 = driver.find_elements(By.XPATH, "//a[@data-request-name='postEventWatch']")
            
            if (year == '2018' and event == 'Billabong Pipe Masters'):
                rounds2 = rounds2[4:]
            
            driver.execute_script("arguments[0].scrollIntoView(true);", rounds2[idx1])
            driver.execute_script("window.scrollBy(0, -100)")
            rounds2[idx1].click()
            time.sleep(5)

        
            viewDetails = driver.find_elements(By.CLASS_NAME, 'hot-heat__details-link')   # find all view details buttons
            firstPassDone = False
            # loop through all modals on the current page and collect all the wave scores
            for i, x in enumerate(viewDetails):
                
                #print('starting to collect heat scores')
                x.click()   # open the scores for the heat
                time.sleep(5)

                source = driver.page_source # get page content
                soup2 = BeautifulSoup(source, 'html.parser')

                # collect raw scores, how many athletes are in the heat, and the shortened names of the athletes
                rawScores = soup2.findAll(['span', 'div'], attrs={'class':['wave-score', 'wave--empty']})
                rawCount = soup2.findAll('div', attrs={'class':'hot-heat__athletes'})
                rawAthletes = soup2.findAll('div', attrs ={'class':'hot-heat-athlete__name--short'})

                scores = [s.text for s in rawScores]   # clean the scores into just the text
                # rawCount holds one element per heat. The first subscript selects the first of those elements,
                # '.attrs' exposes its attributes, and ['class'] gives its list of class names. The second
                # subscript ([1]) picks the second class name, which ends in the number of surfers, and [-1]
                # takes that last character, which is then converted to an int
                count = int(rawCount[0].attrs['class'][1][-1])
                athletes = [a.text for a in rawAthletes]   # clean the athletes names into just text
                
                for x in range(0, count):
                    
                    score = scores[x::count]
                    csvData = [year, athletes[x], event, roundNames[idx1], score]
                    
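                    # newline='' stops the csv module from writing extra blank lines on some platforms;
                    # the with block closes the file, so no explicit close() is needed
                    with open(filename, 'a', newline='') as csvFile:
                        writer = csv.writer(csvFile)
                        writer.writerow(csvData)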
                    

                time.sleep(5)
                driver.find_element(By.XPATH, "//*[@class='close pass']").click()   # close the current modal
                time.sleep(5)
                #print('finished collecting heat scores')
                
                index = i + 1
                if index % 4 == 0 and firstPassDone:
                    driver.execute_script("window.scrollBy(0,500)")
                    #print("scrolling down")
                firstPassDone = True
            #print('finishing the round')

        # return to the list of events by going back twice
        driver.back()
        time.sleep(5)
        driver.back()
        time.sleep(2)
        #print('Finishing event')
    time.sleep(5)    
    driver.close()
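
The slice `scores[x::count]` in `scraper` relies on the scraped score cells arriving interleaved by athlete: the first wave of each surfer in turn, then the second wave of each, and so on. A small sketch with made-up scores (the numbers and the helper name `split_scores` are illustrative only) shows how the stride separates them:

```python
def split_scores(scores, count):
    """Group an interleaved score list into one list per athlete."""
    # scores[i::count] takes every count-th item, starting at athlete i
    return [scores[i::count] for i in range(count)]


# Made-up data for a two-surfer heat: cells alternate between the two athletes,
# and an empty string stands in for a wave slot that was not used
scores = ['7.50', '6.00', '3.20', '4.10', '8.00', '']
for i, waves in enumerate(split_scores(scores, 2)):
    print(i, waves)
```

If the page ever listed all of one athlete's waves before the next athlete's, this stride would mix the surfers together, so the ordering is worth checking against a known heat.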

The `main` function of the program calls the `scraper` function from above. Because 3 years of data need to be collected, a separate process scrapes each year, which makes the scraping more efficient. There is one process per year of data being collected.

Each process writes to a separate csv file. Once every process has completed, all 3 files are concatenated into a single file.
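
The merge step keeps one header row and appends the data rows of each per-year file. The notebook does this with pandas; the sketch below shows the same idea using only the standard `csv` module, in case pandas is not available. The helper name `merge_csv_files`, the file names, and the sample rows are all hypothetical.

```python
import csv
import os
import tempfile


def merge_csv_files(paths, out_path):
    """Concatenate CSV files that share a header, writing the header once."""
    with open(out_path, 'w', newline='') as out_file:
        writer = csv.writer(out_file)
        for n, path in enumerate(paths):
            with open(path, newline='') as in_file:
                reader = csv.reader(in_file)
                header = next(reader, None)   # first row of every file is the header
                if n == 0 and header is not None:
                    writer.writerow(header)
                writer.writerows(reader)      # remaining rows are data


# Demo with throwaway files in a temporary directory
tmp_dir = tempfile.mkdtemp()
parts = []
for year, surfer in [('2016', 'Surfer A'), ('2017', 'Surfer B')]:
    path = os.path.join(tmp_dir, year + '.csv')
    with open(path, 'w', newline='') as f:
        w = csv.writer(f)
        w.writerow(['Year', 'Name'])
        w.writerow([year, surfer])
    parts.append(path)

merged_path = os.path.join(tmp_dir, 'merged.csv')
merge_csv_files(parts, merged_path)
with open(merged_path, newline='') as f:
    merged_rows = list(csv.reader(f))
print(merged_rows)
```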

In [None]:
def main():
    
    filenames = ['2016.csv', '2017.csv', '2018.csv']
    structure = ['Year', 'Name', 'Event', 'Round', 'Scores']
    
    for file in filenames:
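        # newline='' is recommended for files handled by the csv module;
        # the with block closes the file, so no explicit close() is needed
        with open(file, 'w', newline='') as csvFile:
            writer = csv.writer(csvFile)
            writer.writerow(structure)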
    
    # create the urls
    url1 = 'https://www.worldsurfleague.com/events/2016/mct'
    url2 = 'https://www.worldsurfleague.com/events/2017/mct'
    url3 = 'https://www.worldsurfleague.com/events/2018/mct'
    
    # create the processes
    p1 = multiprocessing.Process(target=scraper, args=(url1, filenames[0], )) 
    p2 = multiprocessing.Process(target=scraper, args=(url2, filenames[1], ))
    p3 = multiprocessing.Process(target=scraper, args=(url3, filenames[2], ))
    
    # start the processes
    p1.start()
    p2.start()
    p3.start()
    
    
    # wait for all processes to finish
    p1.join()
    p2.join()
    p3.join()
    
    # combine all files in the list
    combined_csv = pd.concat([pd.read_csv(f) for f in filenames])
    # export to csv
    combined_csv.to_csv('combined_csv.csv', index=False, encoding='utf-8-sig')

    print('done')

In [None]:
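# guard the entry point so child processes do not re-run main() on platforms that spawn new interpreters
if __name__ == '__main__':
    main()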
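
Each row stores an athlete's wave scores as a Python list, and `csv.writer` writes non-string values using `str()`, so the `Scores` column holds text such as `['7.50', '3.20']` rather than numbers. For the later analysis, that text has to be turned back into a list. A minimal sketch, assuming unused wave slots come through as empty strings (an assumption about the scraped page, not something the notebook states), with a hypothetical helper `parse_scores` and made-up row values:

```python
import ast
import csv
import io


def parse_scores(cell):
    """Convert the text of a Scores cell back into a list of floats."""
    raw = ast.literal_eval(cell)                  # safely evaluates the list literal
    return [float(s) for s in raw if s.strip()]   # skip blank wave slots


# Round trip through the csv module with a made-up row, mirroring how scraper writes rows
buffer = io.StringIO()
csv.writer(buffer).writerow(['2016', 'Surfer A', 'Some Event', 'Round 1', ['7.50', '3.20', '']])
buffer.seek(0)
row = next(csv.reader(buffer))

print(row[4])                 # the cell is a string, not a list
print(parse_scores(row[4]))   # numeric scores recovered from that string
```

`ast.literal_eval` is used rather than `eval` because it only accepts Python literals, which is enough for a list of strings.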