# Web Scraping Ultimate Season Data

Before we start with code, a recap of the goal: we are hoping to make some spreadsheets to summarize game data for major Ultimate tournaments. To do this, we plan to use the website Ultiarchive to get a list of major tournaments, and use the corresponding links to the USAU website to get the actual game information.

For example, if we wanted to get the 2019 Club Open National Championship information, we would:

    1. Visit https://ultiarchive.com/years/2019
    2. Find the club open page here: https://ultiarchive.com/tournaments/national-championships-club/years/2019/divisions/club-open 
    3. Visit the USAU link at the top of the page to get to the USAU website page for the tournament https://play.usaultimate.org/events/USA-Ultimate-National-Championships-2019/schedule/Men/Club-Men/
    4. Use this page to get game data.


First of all, why do it this way? The Ultiarchive list is great because it only includes USAU sanctioned tournaments, which are only a small subset of all the tournaments on the USAU website. It *is* possible to find them on USAU directly, but it's difficult since it relies on the built-in event search engine. Using Ultiarchive we have a pretty simple (if a bit long) pipeline to get to the tournament data we want.

Second: the Ultiarchive pages also have the tournament results. Why not just pull the results from there? The reason is that the Ultiarchive pages are populated with JavaScript, which, as we will see below, is difficult to work with. The USAU pages are hard-coded HTML, so we can use BeautifulSoup on the source code to get the information we need.

###### A general note: the way this code is designed/structured is just what I've figured out through a combination of Google and trial and error. I'm sure there are better ways to do everything here, but this works at least!

Finally, for people unfamiliar with HTML structure, here are the basics: every piece of information on a webpage is stored in a tag, which is either standalone (`<tag>`) or marked with an open tag `<tag>` and a close tag `</tag>`. Information can be stored in the tag itself (as attributes) or in child tags. For example, we might have:


`<div class='center'>`
`<img src="image.jpg">` Some text.
`</div>`

where the `<div>` tag has a `class` attribute and also has an `<img>` child with a `src` attribute. The `<div>` tag is a container, so it has the close tag too (`</div>`). If we wanted to get the text inside the tag, we would access the `.text` attribute of the `<div>` tag.

Our job here will essentially be to look at the source code and figure out how the page is structured with these tags in order to figure out which ones we need to get data from and which we can ignore. This is what BeautifulSoup does! See below.
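
To make the tag/attribute/child vocabulary concrete, here is a minimal sketch that parses the small snippet above with BeautifulSoup. The variable names (`snippet_soup`, `div_tag`, `img_tag`) are just for illustration and are not used later in the notebook.

```python
from bs4 import BeautifulSoup

# The small HTML snippet from the text above.
html = """<div class='center'><img src="image.jpg"> Some text.</div>"""

snippet_soup = BeautifulSoup(html, "html.parser")

div_tag = snippet_soup.find("div")   # first <div> tag in the document
img_tag = div_tag.find("img")        # first <img> tag inside that <div>

div_class = div_tag["class"]         # attributes are read like dictionary keys
img_src = img_tag["src"]
div_text = div_tag.text.strip()      # text content of the tag and its children

print(div_class, img_src, div_text)
```

Note that `class` comes back as a list, since an HTML tag can have several classes at once. That will matter when we search by class later.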

## 0 Basics

In [22]:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup, Comment
# BeautifulSoup (hereafter BS) and requests are the two main packages for normal web scraping.
# requests fetches the page source, and BeautifulSoup parses it so that
# you can actually work with it.
import urllib3
import urllib.request
import matplotlib.pyplot as plt
import seaborn as sns
# Seaborn is a plotting package that can make things look nicer than Matplotlib, but is most useful
# since it works nicely with Pandas dataframes. I'm including it here as a standard import, at time
# of writing I'm not sure if it will be used or not. It's a nice thing to have though!

import itertools as it
import more_itertools as mit
# Not sure if we'll need either of these, but again, standard imports.

import json
import re
# This is the regex package. For those unfamiliar, regex is a "language" for
# very specific text searching and extraction. Check out https://regexone.com/
# for a tutorial / explanation. 
import io
import math
import time
import cProfile
from pprint import pprint
# Will be useful for demonstration
from tqdm.notebook import tqdm
# Will make nice progress bars

import selenium
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
# I am using Chrome here, there are other supported browsers out there too.
# I think Firefox is actually the more common one for this

chrome_options = Options()
chrome_options.add_argument("--headless")
# Headless stops a browser from being visibly opened on your computer.

driver = webdriver.Chrome(options=chrome_options, 
              executable_path=r"C:\Users\alexs\Projects\lls\chromedriver.exe")

# You will have to download the chromedriver.exe file and change this path to
# point at it.

## 0.1 The general strategy

Here is the basic step-by-step process we will follow for reading a season:

    0. Input a year and competition division (open, women's, mixed)
    1. Visit the Ultiarchive page for that season and get a list of links corresponding to each tournament
    2. For each tournament, parse that page to visit the USAU page.
    3. Parse the USAU page to get each game type (pool play or bracket play)
    4. For each game type, parse the page to get each game subtype (pool A/B/..., 1st place/3rd place/... bracket)
    5. For each game subtype, get each bracket position (quarter/semi/finals)
    6. For each bracket position, get the corresponding game.
    7. For each game, parse the game to get the information we want
    8. Put the game information together into a Pandas dataframe row
   
We will have functions for each of these steps, and other utility functions to handle the lower-level processes. Because of this, it can sometimes be a bit tough to understand how all of these pieces fit together, but keeping this step-by-step picture in the back of your mind should help. 

## 1 Dealing with Javascript pages

Most of the fun stuff will be with BeautifulSoup as we will see below, but there is a wrinkle as mentioned above: Ultiarchive.com is a JavaScript-based website, so we can't use any of the normal BeautifulSoup techniques to scrape it. Below is the entire source of the webpage for 2019 tournaments:

We can see that there isn't actually any information there, just some calls to JavaScript at the bottom. Requests and BS cannot get the result of running that JavaScript, so we use a package called Selenium instead. Selenium more or less uses a regular browser to access the page, so the JavaScript gets loaded normally.

In [39]:
def jspage_to_soup(url, scroll=True):
    """Visit a JavaScript page and convert it into a soup."""
    # This is all selenium code!
    chrome_options = Options()
    chrome_options.add_argument("--headless")

    # Headless stops a browser from being visibly opened on your computer.

    driver = webdriver.Chrome(options=chrome_options, 
                  executable_path=r"C:\Users\alexs\Projects\lls\chromedriver.exe")

    # You will have to download the chromedriver.exe file and change this path to
    # point at it.
   
    driver.get(url)
    # Visit the page with the driver!

    time.sleep(1.5)
    # !!! We need to give the page time to load!
    if scroll:
        for i in range(20):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);\
                                  var lenOfPage=document.body.scrollHeight;\
                                  return lenOfPage;")
            time.sleep(1)
            # This process literally scrolls the page all the way down to make sure that all of the content
            # gets read and stored. We won't always need to do this (see below), so we add an option to control it.
    
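    data = driver.page_source.encode("utf-8")
    # Takes the driver response and grabs the page source code including the javascript. This puts things in
    # a form that BS can handle.
    driver.quit()
    # Close the browser now that we have the source; otherwise every call leaves a Chrome process running.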
    
    soup = BeautifulSoup(data, 'html.parser')
    # Makes the soup object!
    # Note: a soup object is essentially a version of the source code that has the nested tag HTML structure the
    # same way as if you look at a page in a browser inspector.
    return soup

To reiterate: Selenium actually loads the webpage using a normal browser, so we need to explicitly give it time to load!

## 2 Getting the tournament links (fun with HTML)

Recall that at this step we want to find links to the Ultiarchive pages for **USAU** tournaments. That USAU part is a big caveat: we need to ignore some of the tournaments on this page!


At this point, we want to describe the process of figuring out how to actually get that information out of the soup that we generated above. To start with: what does that soup object actually contain? Let's take a look at the tournaments from 2019 (https://ultiarchive.com/years/2019):

In [24]:
# (this will take a while to load!)
soup = jspage_to_soup("https://ultiarchive.com/years/2019")
soup

<html lang="en"><head>
<meta charset="utf-8"/>
<title>Tournaments</title>
<base href="/"/>
<meta content="width=device-width,initial-scale=1,maximum-scale=5,minimum-scale=1" name="viewport"/>
<meta content="Past, current, and future ultimate firsbee tournaments" name="Description"/>
<link href="favicon.png" rel="icon" type="image/x-icon"/>
<link href="manifest.webmanifest" rel="manifest"/>
<link href="https://firestore.googleapis.com" rel="preconnect"/>
<meta content="#337ab7" name="theme-color"/>
<link href="https://fonts.googleapis.com/css?family=Roboto:300,400,500" rel="stylesheet"/>
<link href="https://fonts.googleapis.com/icon?family=Material+Icons" rel="stylesheet"/>
<link href="https://cdnjs.cloudflare.com/ajax/libs/flag-icon-css/2.9.0/css/flag-icon.min.css" rel="stylesheet"/>
<link href="styles.98a99bf1902d6c02d16b.css" rel="stylesheet"/><script charset="utf-8" src="6-es2015.19e90bb1f4121cbac358.js"></script><style>.light-theme[_ngcontent-jlt-c25]{color:#333}.light-theme[_ngcon

This is a lot. To make sense of all of this code, we just have to carefully look for the information we want and figure out how it is stored. In our case, we see that the rows of the table are stored inside of `<tr>` tags. Below we have bolded the opening and closing tag of such a row:

`/college-womens">College Women's</a><!-- --><!-- --></td></tr>`**`<tr _ngcontent-jua-c26="" class="ng-star-inserted">`**`<td _ngcontent-jua-c26=""> Jan 25<!-- --></td><td _ngcontent-jua-c26=""><span _ngcontent-jua-c26="" class="tournament-type usau">usau</span> Santa Barbara Invite </td><td _ngcontent-jua-c26=""><!-- --><a _ngcontent-jua-c26="" class="ng-star-inserted" href="/tournaments/santa-barbara-invite-college/years/2019/divisions/college-open">College Open</a><!-- -->, <!-- --><!-- --><a _ngcontent-jua-c26="" class="ng-star-inserted" href="/tournaments/santa-barbara-invite-college/years/2019/divisions/college-womens">College Women's</a><!-- --><!-- --></td>`**`</tr>`**`<tr _ngcontent-jua-c26="" class="ng-star-inserted"><td _ngcontent-`

Looking inside the tag, we can see that the information we want is there! Using BS, we can search for all instances of just those tags:

In [25]:
soup.find_all('tr') # searches for all tags named 'tr' (table rows).

[<tr _ngcontent-jlt-c26="" class="ng-star-inserted"><td _ngcontent-jlt-c26=""> Nov 15<!-- --></td><td _ngcontent-jlt-c26=""><span _ngcontent-jlt-c26="" class="tournament-type usau">usau</span> Southwest College Regionals </td><td _ngcontent-jlt-c26=""><!-- --><a _ngcontent-jlt-c26="" class="ng-star-inserted" href="/tournaments/southwest-college-regionals-college/years/2019/divisions/college-mixed">College Mixed</a><!-- --><!-- --></td></tr>,
 <tr _ngcontent-jlt-c26="" class="ng-star-inserted"><td _ngcontent-jlt-c26=""> Nov 9<!-- --></td><td _ngcontent-jlt-c26=""><span _ngcontent-jlt-c26="" class="tournament-type usau">usau</span> North College Regionals </td><td _ngcontent-jlt-c26=""><!-- --><a _ngcontent-jlt-c26="" class="ng-star-inserted" href="/tournaments/north-college-regionals-college/years/2019/divisions/college-mixed">College Mixed</a><!-- --><!-- --></td></tr>,
 <tr _ngcontent-jlt-c26="" class="ng-star-inserted"><td _ngcontent-jlt-c26=""> Nov 8<!-- --></td><td _ngcontent-jlt-c

This is a lot better! Now we only have the table information instead of everything from the page. If we visit the page in a browser (link above), we can start to make sense of this: it looks like we have one `<td>` child tag for each table entry, so one for each date, tournament name/organizer, and all competition divisions. We don't need the dates, but we do need the tournament name/organizer and competition division cells.

Let's look at one of the rows (one `tr` tag). In particular, we'll look at the children individually. To do this, we will call the `.children` attribute of the tag to get a generator containing all child tags. We'll convert this to a list to see what's going on:

In [361]:
tr_tag = soup.find_all('tr')[-4]
# We'll be using this particular tag for a running example.
pprint(tr_tag)
print()
pprint(list(tr_tag.children)) # again, .children returns a generator that we need to convert to a list to view.

<tr _ngcontent-jlt-c26="" class="ng-star-inserted"><td _ngcontent-jlt-c26=""> Jan 25<!-- --></td><td _ngcontent-jlt-c26=""><span _ngcontent-jlt-c26="" class="tournament-type usau">usau</span> Santa Barbara Invite </td><td _ngcontent-jlt-c26=""><!-- --><a _ngcontent-jlt-c26="" class="ng-star-inserted" href="/tournaments/santa-barbara-invite-college/years/2019/divisions/college-open">College Open</a><!-- -->, <!-- --><!-- --><a _ngcontent-jlt-c26="" class="ng-star-inserted" href="/tournaments/santa-barbara-invite-college/years/2019/divisions/college-womens">College Women's</a><!-- --><!-- --></td></tr>

[<td _ngcontent-jlt-c26=""> Jan 25<!-- --></td>,
 <td _ngcontent-jlt-c26=""><span _ngcontent-jlt-c26="" class="tournament-type usau">usau</span> Santa Barbara Invite </td>,
 <td _ngcontent-jlt-c26=""><!-- --><a _ngcontent-jlt-c26="" class="ng-star-inserted" href="/tournaments/santa-barbara-invite-college/years/2019/divisions/college-open">College Open</a><!-- -->, <!-- --><!-- --><a _ngco

To get the tournament type and links, we need to extract the bolded parts from the tags:

`[<td _ngcontent-jua-c26=""> Jan 25<!-- --></td>,
 <td _ngcontent-jua-c26=""><span _ngcontent-jua-c26="" class="tournament-type usau">`**`usau`**`</span> Santa Barbara Invite </td>,
 <td _ngcontent-jua-c26=""><!-- --><a _ngcontent-jua-c26="" class="ng-star-inserted" `**`href="/tournaments/santa-barbara-invite-college/years/2019/divisions/college-open"`**`>College Open</a><!-- -->, <!-- --><!-- --><a _ngcontent-jua-c26="" class="ng-star-inserted" `**`href="/tournaments/santa-barbara-invite-college/years/2019/divisions/college-womens"`**`>College Women's</a><!-- --><!-- --></td>]`

Looking at which tags those are, we see that the tournament type tag has `class="tournament-type"`, and the link tags have the `href` attribute. We can do a search for those properties explicitly with BS.

We'll eventually need to filter the links based on the tournament types, so we will find the types and the links separately.

In [360]:
tournament_type = tr_tag.find(class_="tournament-type") # searches by CSS class (class_ has an underscore because class is a Python keyword)
links = tr_tag.find_all(href=True) # searches for tags that have an href attribute

pprint(tournament_type)
print()
pprint(links)

<span _ngcontent-jlt-c26="" class="tournament-type usau">usau</span>

[<a _ngcontent-jlt-c26="" class="ng-star-inserted" href="/tournaments/santa-barbara-invite-college/years/2019/divisions/college-open">College Open</a>,
 <a _ngcontent-jlt-c26="" class="ng-star-inserted" href="/tournaments/santa-barbara-invite-college/years/2019/divisions/college-womens">College Women's</a>]


Great! The last thing we need to do is extract the information from the individual tags. For the tournament type, since that information is just the text content of the tag, we call `.text`. To get the `href` attribute from the tag, we index the tag with `'href'` as we would a dictionary.

In [28]:
tournament_type = tr_tag.find(class_="tournament-type").text

links = [tag['href'] for tag in tr_tag.find_all(href=True)]

pprint(tournament_type)
pprint(links)

'usau'
['/tournaments/santa-barbara-invite-college/years/2019/divisions/college-open',
 '/tournaments/santa-barbara-invite-college/years/2019/divisions/college-womens']


We now know how to get the information we want out of a tag. We'll write a function to extract the information from an individual `tr` tag, including filtering for the division we want and adding the url prefix:

In [463]:
def extract_tournament_info(tr_tag, division='club-open'):
    """Extract the tournament type and ultiarchive link from a tag."""
    tournament_type = tr_tag.find(class_="tournament-type").text
    links = ["https://ultiarchive.com" + tag['href'] for tag in tr_tag.find_all(href=True) if division in tag['href']]
    # Makes links to each tournament as long as the link contains the right division.
    return tournament_type, links

Now we can iterate through each of the `tr` tags in the page:

In [464]:
def soup_to_list_of_tournaments(soup, division='club-open'):
    """Turn an ultiarchive soup into a list of all ultiarchive tournament links."""
    all_links = []
    for tr_tag in soup.find_all("tr"):
        tournament_type, links = extract_tournament_info(tr_tag, division)
        if tournament_type == "usau":
            all_links.extend(links)
    return all_links
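
One thing to watch for: `extract_tournament_info` calls `.text` on the result of `find`, which raises an error if a row has no `tournament-type` span. That didn't happen for the 2019 page, but here is a hedged sketch of a more defensive version. The function name `extract_tournament_info_safe` and the sample rows (including the header row) are made up for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

def extract_tournament_info_safe(tr_tag, division="club-open"):
    """Like extract_tournament_info, but returns (None, []) for rows with no type span."""
    type_tag = tr_tag.find(class_="tournament-type")
    if type_tag is None:
        # Nothing to classify in this row, so report no type and no links.
        return None, []
    links = ["https://ultiarchive.com" + a["href"]
             for a in tr_tag.find_all("a", href=True)
             if division in a["href"]]
    return type_tag.text.strip(), links

# Made-up rows that imitate the structure seen above (NOT copied from the site).
sample_html = """
<table>
<tr><th>Date</th><th>Tournament</th><th>Divisions</th></tr>
<tr><td>Jan 25</td>
    <td><span class="tournament-type usau">usau</span> Santa Barbara Invite</td>
    <td><a href="/tournaments/santa-barbara-invite-college/years/2019/divisions/college-open">College Open</a></td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, "html.parser")
results = [extract_tournament_info_safe(tr, division="college-open")
           for tr in sample_soup.find_all("tr")]
print(results)
```

Since the filter in `soup_to_list_of_tournaments` only keeps rows whose type equals `"usau"`, a row that returns `None` would simply be skipped.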

In [31]:
club_open_tournaments = soup_to_list_of_tournaments(soup)
print(len(club_open_tournaments))
pprint(club_open_tournaments)

72
['https://ultiarchive.com/tournaments/red-tide-ultimate-clambake-club/years/2019/divisions/club-open',
 'https://ultiarchive.com/tournaments/national-championships-club/years/2019/divisions/club-open',
 'https://ultiarchive.com/tournaments/southwest-regionals-club/years/2019/divisions/club-open',
 'https://ultiarchive.com/tournaments/southeast-regionals-club/years/2019/divisions/club-open',
 'https://ultiarchive.com/tournaments/south-central-regionals-club/years/2019/divisions/club-open',
 'https://ultiarchive.com/tournaments/northwest-regionals-club/years/2019/divisions/club-open',
 'https://ultiarchive.com/tournaments/northeast-regionals-club/years/2019/divisions/club-open',
 'https://ultiarchive.com/tournaments/north-central-regionals-club/years/2019/divisions/club-open',
 'https://ultiarchive.com/tournaments/mid-atlantic-regionals-club/years/2019/divisions/club-open',
 'https://ultiarchive.com/tournaments/great-lakes-regionals-club/years/2019/divisions/club-open',
 'https://ulti

If we go back in our browser and search for 'club open', we'll see that we get 75 matches. Three of those are non-USAU tournaments, which leaves the 72 links we found. So we're getting what we want!

If we want to start with just a year and get links out, we can write the following function:

In [32]:
def year_to_tournaments(year=2019, division='club-open'):
    """ Gets links to each ultiarchive tournament page for a given year"""
    url = fr"https://ultiarchive.com/years/{year}"
    # This is the general form for the ultiarchive pages
    soup = jspage_to_soup(url)
    links = soup_to_list_of_tournaments(soup, division=division)
    return links


Next we can look at actually reading tournaments.

## 3 Reading tournaments

### 3.1 Getting to the USAU website

To start, we need to find the USAU link at the top of the Ultiarchive tournament page. This is another JavaScript page so we'll use the `jspage_to_soup()` function defined above. Since the result we want is at the top of the page, we can set `scroll=False` to save some time. 

For this section, we'll use the 2019 Club Open National Championships as our example, since it's the largest tournament and has the most data to wrangle on the website. There are some other complications that come with smaller tournaments which we will address at the end. Let's take a look:

*Note: Nationals is NOT a representative tournament, in part because it is far better (and more consistently) documented than most other tournaments. This will be important later.*

In [33]:
ultiarchive_link = club_open_tournaments[1]
print(ultiarchive_link)
ultiarchive_soup = jspage_to_soup(ultiarchive_link, scroll=False) # this will still take a couple seconds, but not 20
ultiarchive_soup.find_all(href=True)

https://ultiarchive.com/tournaments/national-championships-club/years/2019/divisions/club-open


[<base href="/"/>,
 <link href="favicon.png" rel="icon" type="image/x-icon"/>,
 <link href="manifest.webmanifest" rel="manifest"/>,
 <link href="https://firestore.googleapis.com" rel="preconnect"/>,
 <link href="https://fonts.googleapis.com/css?family=Roboto:300,400,500" rel="stylesheet"/>,
 <link href="https://fonts.googleapis.com/icon?family=Material+Icons" rel="stylesheet"/>,
 <link href="https://cdnjs.cloudflare.com/ajax/libs/flag-icon-css/2.9.0/css/flag-icon.min.css" rel="stylesheet"/>,
 <link href="styles.98a99bf1902d6c02d16b.css" rel="stylesheet"/>,
 <a _ngcontent-fdf-c14="" href="/">Ultiarchive</a>,
 <a _ngcontent-fdf-c23="" class="ng-star-inserted" href="https://play.usaultimate.org/events/USA-Ultimate-National-Championships-2019/schedule/Men/Club-Men/" rel="noopener" target="_blank"> USAU <i _ngcontent-fdf-c23="" class="material-icons">open_in_new</i></a>,
 <a _ngcontent-fdf-c29="" class="ng-star-inserted" href="/teams/sockeye-club-open/years/2019">Sockeye</a>,
 <a _ngcontent

If we look around 10 lines down we see the `play.usaultimate.org` link. If we want to get just this tag, we can search for matching attributes. We'll try the `rel="noopener"` attribute:

In [34]:
ultiarchive_soup.find(href=True, rel="noopener")

<a _ngcontent-fdf-c23="" class="ng-star-inserted" href="https://play.usaultimate.org/events/USA-Ultimate-National-Championships-2019/schedule/Men/Club-Men/" rel="noopener" target="_blank"> USAU <i _ngcontent-fdf-c23="" class="material-icons">open_in_new</i></a>

We can extract the link exactly as we did above.

In [35]:
def ulti_to_usau(url):
    """Converts an Ultiarchive tournament link to the corresponding USAU tournament link."""
    soup = jspage_to_soup(url, scroll=False)
    return soup.find(href=True, rel="noopener")["href"]

In [36]:
usau_link = ulti_to_usau(ultiarchive_link)
usau_link

'https://play.usaultimate.org/events/USA-Ultimate-National-Championships-2019/schedule/Men/Club-Men/'

Finally, we'll update the function we wrote above to take us directly to the USAU links instead of the Ultiarchive links.

In [37]:
def year_to_tournaments(year=2019, division='club-open'):
    """Gets the USAU link for each ultiarchive tournament in a given year and division."""
    url = fr"https://ultiarchive.com/years/{year}"
    # This is the general form for the ultiarchive pages
    soup = jspage_to_soup(url)
    links = soup_to_list_of_tournaments(soup, division=division)
    print(links[:10])
    print("Found links!")
    usau_links = []
    for link in tqdm(links, position=0):
        print(link)
        usau_links.append(ulti_to_usau(link))
    return usau_links

In [38]:
### THIS WILL TAKE A LONG TIME!
usau_links = year_to_tournaments()


['https://ultiarchive.com/tournaments/red-tide-ultimate-clambake-club/years/2019/divisions/club-open', 'https://ultiarchive.com/tournaments/national-championships-club/years/2019/divisions/club-open', 'https://ultiarchive.com/tournaments/southwest-regionals-club/years/2019/divisions/club-open', 'https://ultiarchive.com/tournaments/southeast-regionals-club/years/2019/divisions/club-open', 'https://ultiarchive.com/tournaments/south-central-regionals-club/years/2019/divisions/club-open', 'https://ultiarchive.com/tournaments/northwest-regionals-club/years/2019/divisions/club-open', 'https://ultiarchive.com/tournaments/northeast-regionals-club/years/2019/divisions/club-open', 'https://ultiarchive.com/tournaments/north-central-regionals-club/years/2019/divisions/club-open', 'https://ultiarchive.com/tournaments/mid-atlantic-regionals-club/years/2019/divisions/club-open', 'https://ultiarchive.com/tournaments/great-lakes-regionals-club/years/2019/divisions/club-open']
Found links!


  0%|          | 0/72 [00:00<?, ?it/s]

https://ultiarchive.com/tournaments/red-tide-ultimate-clambake-club/years/2019/divisions/club-open
https://ultiarchive.com/tournaments/national-championships-club/years/2019/divisions/club-open
https://ultiarchive.com/tournaments/southwest-regionals-club/years/2019/divisions/club-open
https://ultiarchive.com/tournaments/southeast-regionals-club/years/2019/divisions/club-open
https://ultiarchive.com/tournaments/south-central-regionals-club/years/2019/divisions/club-open
https://ultiarchive.com/tournaments/northwest-regionals-club/years/2019/divisions/club-open
https://ultiarchive.com/tournaments/northeast-regionals-club/years/2019/divisions/club-open
https://ultiarchive.com/tournaments/north-central-regionals-club/years/2019/divisions/club-open
https://ultiarchive.com/tournaments/mid-atlantic-regionals-club/years/2019/divisions/club-open
https://ultiarchive.com/tournaments/great-lakes-regionals-club/years/2019/divisions/club-open
https://ultiarchive.com/tournaments/nor-cal-sectionals-cl

TypeError: 'NoneType' object is not subscriptable

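The run above stopped with `TypeError: 'NoneType' object is not subscriptable`. The most likely cause is an Ultiarchive page with no USAU link, in which case `soup.find(...)` in `ulti_to_usau` returns `None` and indexing it with `["href"]` fails. We already have the Nationals link from the earlier cell, though, so we've got what we need to move on to actually parsing the tournament.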
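
For the full-season loop, one way around a `NoneType` error would be to check the result of `find` before indexing it. The sketch below is an assumption about how that could look, not something that was run against the live site: the helper name `find_usau_link` is hypothetical, and the two tiny HTML strings are cut-down stand-ins for real Ultiarchive pages.

```python
from bs4 import BeautifulSoup

def find_usau_link(soup):
    """Return the USAU href from an Ultiarchive tournament soup, or None if absent."""
    tag = soup.find("a", href=True, rel="noopener")
    if tag is None:
        # No matching link on this page, so let the caller decide what to do.
        return None
    return tag["href"]

# Stand-in pages: one with the USAU link (copied from the output above), one without.
with_link = BeautifulSoup(
    '<a href="/">Ultiarchive</a>'
    '<a href="https://play.usaultimate.org/events/USA-Ultimate-National-Championships-2019/schedule/Men/Club-Men/"'
    ' rel="noopener" target="_blank"> USAU </a>',
    "html.parser")
without_link = BeautifulSoup('<a href="/">Ultiarchive</a>', "html.parser")

print(find_usau_link(with_link))
print(find_usau_link(without_link))
```

Inside `year_to_tournaments`, the loop could then append only the links that are not `None` and print the Ultiarchive URL of any page that was skipped, so that those tournaments can be checked by hand.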

### 3.2 What data do we need?

The good news is that everything for the USAU website is hard-coded HTML, so it will be simpler to extract the data we need. The bad news is that the page structure is a lot more complicated! Take a look at https://play.usaultimate.org/events/USA-Ultimate-National-Championships-2019/schedule/Men/Club-Men/ (note: Men and Open are the same).

We should also specify here what information we want to track. We also need to start thinking about how we want to structure things for Pandas (dataframe/spreadsheet creation), so we'll add column names here as well for reference. On a game-by-game basis, we want to get:

    TN: Tournament name
    GT: Game type (pool play or bracket play)
    GST: Game subtype (pool number, 1st place bracket, crossover play)
    BP: Bracket position (None for pool play, else quarterfinals / semifinals / etc.)
    D: Date
    T: Time
    F: Field number
    H: Home team
    HSd: Home seed
    HS: Home score
    A: Away team
    ASd: Away seed
    AS: Away score
    
This is all of the raw data that we can get. We might want more columns (e.g. score differential), but in terms of creating a raw data set it is likely best to add those later.
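
As a rough sketch of how one game could be stored with these column names, we can build each row as a plain dictionary. The helper `make_game_row` and the values in the example are placeholders invented for illustration; they are not real results and not necessarily how the final code will look.

```python
# Column abbreviations from the list above, in the same order.
COLUMNS = ["TN", "GT", "GST", "BP", "D", "T", "F",
           "H", "HSd", "HS", "A", "ASd", "AS"]

def make_game_row(**values):
    """Build one row as a dict with every column present (missing ones are None)."""
    unknown = set(values) - set(COLUMNS)
    if unknown:
        raise KeyError(f"Unknown columns: {sorted(unknown)}")
    return {col: values.get(col) for col in COLUMNS}

# Placeholder values only, NOT real results.
row = make_game_row(TN="Example Tournament", GT="Pool", GST="Pool A",
                    H="Team X", HS=15, A="Team Y", AS=11)
print(row)
```

A list of such dictionaries can be turned into a dataframe with `pd.DataFrame(rows, columns=COLUMNS)`, and derived columns like score differential (`HS - AS`) can be computed from the dataframe afterwards.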


### 3.3 Getting the tournament name

With all that said, let's take a look at the source code! We will start by trying to figure out how to get a list of games out of the soup, as well as the name of the tournament.

In [42]:
usau_data = requests.get(usau_link)
usau_soup = BeautifulSoup(usau_data.text, 'html.parser')
usau_soup


<!DOCTYPE html>

<!--[if lt IE 7]>      <html class="html no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html class="html no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html class="html no-js lt-ie9"> <![endif]-->
<!--[if IE 9 ]>        <html class="html no-js ie9"> <![endif]-->
<!--[if gt IE 9]><!--> <html class="html no-js"> <!--<![endif]-->
<head id="head"><meta content="SiteStartup Description" name="description"/><meta content="SiteStartup,Keywords" name="keywords"/><title>
	Competition Schedules and Results | Play USA Ultimate
</title><meta charset="utf-8"/><meta content="width=device-width" name="viewport"/><meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/><link href="/cms/includes/style-new.v2.min.css" rel="stylesheet"/>
<script src="/cms/includes/modernizr-1.7.min.js"></script>
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.8.2/jquery.js"></script>
<script>!window.jQuery && document.write(unescape('%3Cscript src="/cms/i

That is, as always, a lot. To start with, let's just try to find the name of the tournament. If we visit the webpage again, the tournament name is found in the small navigation bar: `Home > USA Ultimate National Championships > Schedules & Standings`. Checking the source code (in Chrome, we can use Inspect Element for this), we see that we need:

`v class="column" id="main_col"> <div class="headingBox"> <div class="breadcrumbs_area"> <div class="breadcrumbs"> <a href="/events/tournament/">Home</a><span class="separator">&gt;</span>`**`<a href="/events/USA-Ultimate-National-Championships-2019/">USA Ultimate National Championships</a><span class="separator">`**`&gt;</span>Schedules &amp; Standings    </div></div></di...`

So we can start by looking for the `breadcrumbs` class:

In [44]:
usau_soup.find(class_="breadcrumbs")

<div class="breadcrumbs">
<a href="/events/tournament/">Home</a><span class="separator">&gt;</span><a href="/events/USA-Ultimate-National-Championships-2019/">USA Ultimate National Championships</a><span class="separator">&gt;</span>Schedules &amp; Standings
    </div>

We'll always have a "Home" entry before the tournament name, so we can look at the children of this tag and pick the correct list item, then get the text from there.

In [45]:
pprint(list(usau_soup.find(class_="breadcrumbs").children))

print()

tournament_name = list(usau_soup.find(class_="breadcrumbs").children)[3].text
print(tournament_name)

['\n',
 <a href="/events/tournament/">Home</a>,
 <span class="separator">&gt;</span>,
 <a href="/events/USA-Ultimate-National-Championships-2019/">USA Ultimate National Championships</a>,
 <span class="separator">&gt;</span>,
 'Schedules & Standings\r\n    ']

USA Ultimate National Championships


So that's one step down. We'll make this into a function for clarity.

In [465]:
def get_tournament_name(soup):
    """Get the name of a tournament from a soup."""
    return list(soup.find(class_="breadcrumbs").children)[3].text
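
The index `[3]` depends on the exact whitespace and separator children inside the breadcrumbs tag. A possible alternative, sketched here under the assumption that the tournament name is always the second link in the breadcrumbs, is to look only at the `<a>` tags. The function name `get_tournament_name_from_links` is hypothetical; the HTML is the breadcrumbs output shown above.

```python
from bs4 import BeautifulSoup

def get_tournament_name_from_links(soup):
    """Get the tournament name as the text of the second link in the breadcrumbs."""
    crumbs = soup.find(class_="breadcrumbs")
    links = crumbs.find_all("a")   # only <a> tags, ignoring separators and whitespace
    return links[1].get_text(strip=True)

breadcrumbs_html = """<div class="breadcrumbs">
<a href="/events/tournament/">Home</a><span class="separator">&gt;</span><a href="/events/USA-Ultimate-National-Championships-2019/">USA Ultimate National Championships</a><span class="separator">&gt;</span>Schedules &amp; Standings
    </div>"""
crumb_soup = BeautifulSoup(breadcrumbs_html, "html.parser")
name = get_tournament_name_from_links(crumb_soup)
print(name)
```

Both versions give the same answer for Nationals. Which one holds up better on other tournament pages is something we would need to check.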

### 3.4 How do we actually get the game types we want?

This is the part where things get difficult. To understand what is going on, we first need to have a good understanding of how we *want* the data to look. For the USAU National Championships, we have the following game types, subtypes, and bracket positions:

| Game type | Game Subtype | Bracket Position |
| --- | --- | --- |
| Pool | | |
| | Pool A | |
| | Pool B | |
| | Pool C | |
| | Pool D | |
| Bracket | | |
| | 1st Place Bracket | |
| | | Pre-Quarterfinals |
| | | Quarterfinals | 
| | | Semifinals | 
| | | Finals |
| | 5th Place Bracket | |
| | | 5th Place Semifinals |
| | | 5th Place Finals | 
| | 7th Place Bracket (Pro-Flight Play-In (A)| |
| | | 9th Place Quarterfinals |
| | | 9th Place Semifinals |
| | | Pro Flight Play-In / 7th Place (tie) | 
| | 7th Place Bracket - Pro Flight Play-In (B) | |
| | | 9th Place Quarterfinals |
| | | 9th Place Semifinals |
| | | Pro Flight Play-In / 7th Place (tie) |
| | 11th Place Bracket | |
| | | 11th Place |
| | 13th Place Bracket | | 
| | | 13th Place Semifinals |
| | | 13th Place Finals |
| | 15th Place Bracket | |
| | | 15th Place |
| | Crossover Round (4 Seeds From Pool Play) | |

In an ideal world, we would be able to scrape this data as follows:

    1. Get containers for pool and bracket play.
    2. For each of those, get the subtype containers (Pool A/B/C/D, 1st place bracket, etc.).
    3. For each subtype container, get containers for each bracket position (or return same container if pool).
    4. From each bracket position container, get each game.

Unfortunately, there are a few problems:

- We don't actually have separate containers for pool and bracket play. Instead, the tabs we see (Pool Play, Championship Bracket, Consolation Bracket, etc.) are the containers that hold the subtypes we want.
- There are different data structures for pool play games as opposed to bracket games, so we have to handle each of these cases differently.
    - One important problem is that the crossover round is formatted as pool play, so we can't use whatever label we generate for the game type to tell us which kind of game to expect. 
    
    In other words, we might otherwise want to find the game type first and use that as part of the game reading process to figure out whether to expect the pool formatting or the bracket formatting, but the crossover round will mean that that approach won't work.
    
- **THIS STRUCTURE DOESN'T WORK FOR OTHER TOURNAMENTS!** Check out:
        
    [Great Lakes Men's Club Regional Championship](https://play.usaultimate.org/events/Great-Lakes-Mens-Club-Regional-Championship-2021/schedule/Men/Club-Men/)
    
    [Southeast Club Men's Regional Championship](https://play.usaultimate.org/events/Southeast-Club-Mens-Regional-Championship-2021/schedule/Men/Club-Men/)
    
    There are a lot of different formats here, and we need to make sure we handle things appropriately. In the latter example, we would probably want to treat the "Consolations 13th-15th pool" as bracket play even though it is labeled as pool play and has pool play game formatting, since it is being played for final placements.
    
    
  
  
Overall, we will try to handle things as follows:

    1. Get all SUBTYPE containers (these are more convenient to find)
    2. For each subtype container, assign a game type of either *Pool* or *Bracket* based on the name of the container (e.g. if the name contains the word "pool" and does not contain a placement ordinal ("9th", for instance), call it a pool play container).
    3. For each subtype container, get the bracket position if possible. If none exists, just use the subtype container (for pool play).
    4. For each bracket position container, read the games within. Determine which kind of data structure to read based on the contents of the container.

Our hope is that this approach will be the most resistant to different tournament structures and the inconsistent labeling in the data.  

## 4.4 Game types, subtypes, and bracket positions

We can find the subtype containers by class. Sections stored in pool format use `class="global_table scores_table"`, and sections stored in bracket format use `class="mod_slide alt_slide"`:

In [47]:
for subtype_container in usau_soup.find_all(class_="global_table scores_table"): 
    # "Things that are stored as pool play, but may or may not ACTUALLY be pool play:"
    name = subtype_container.find("th", colspan="8").text
    print(name)

for subtype_container in usau_soup.find_all(class_="mod_slide alt_slide"):
    # "Things that are stored as bracket play, but may or may not ACTUALLY be bracket play:"
    name = subtype_container.find(href="#").text
    print(name)


Pool A Schedule & Scores
Pool B Schedule & Scores
Pool C Schedule & Scores
Pool D Schedule & Scores
Crossover Round (4 Seeds from Pool Play)
1st Place Bracket
5th Place Bracket
7th Place Bracket - Pro Flight Play-In (A)
7th Place Bracket - Pro Flight Play-In (B)
11th Place Bracket
13th Place Bracket
15th Place Bracket


Let's wrap this into a function:

In [466]:
def get_subtype_container_lists(soup):
    """Gets list of all subtype containers from a tournament soup."""
    containers = []
    
    for subtype_container in soup.find_all(class_="global_table scores_table"):
        # Things that are stored as pool play, but may or may not ACTUALLY be pool play:
        name = subtype_container.find("th", colspan="8").text
        containers.append([subtype_container, name])

    for subtype_container in soup.find_all(class_="mod_slide alt_slide"):
        # Things that are stored as bracket play, but may or may not ACTUALLY be bracket play:
        name = subtype_container.find(href="#").text
        containers.append([subtype_container, name])
    
    return containers

pprint([container[1] for container in get_subtype_container_lists(usau_soup)])
subtype_names = [container[1] for container in get_subtype_container_lists(usau_soup)]

['Pool A Schedule & Scores',
 'Pool B Schedule & Scores',
 'Pool C Schedule & Scores',
 'Pool D Schedule & Scores',
 'Crossover Round (4 Seeds from Pool Play)',
 '1st Place Bracket',
 '5th Place Bracket',
 '7th Place Bracket - Pro Flight Play-In (A)',
 '7th Place Bracket - Pro Flight Play-In (B)',
 '11th Place Bracket',
 '13th Place Bracket',
 '15th Place Bracket']


Now we need to take these containers and figure out whether each one is pool or bracket play. We will use the following conditions for pool play (and say everything else is bracket):

*A container name represents pool play if...*
- It does NOT contain the word "crossover"
- It does NOT contain an ordinal ("1st", "2nd", "3rd", etc.)
- It DOES contain the word "pool"

We can determine this from a string as follows. We'll immediately write this as a function.

In [467]:
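def subtype_name_to_type(subtype_name):
    """Determines whether a given subtype container name corresponds to pool or bracket play."""
    name = subtype_name.lower()
    # An ordinal is a number followed by its suffix ("1st", "13th"). Checking for the bare
    # suffixes alone would misfire on ordinary words such as "round" or "north".
    has_ordinal = re.search(r"\d+(?:st|nd|rd|th)\b", name) is not None
    if "pool" in name and "crossover" not in name and not has_ordinal:
        return "Pool"
    else:
        return "Bracket"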

def subtype_lists_to_type_lists(subtype_lists):
    """Adds game types to list of subtype containers."""
    for lst in subtype_lists:
        lst.insert(1, subtype_name_to_type(lst[1]))
    return subtype_lists

def get_type_container_lists(soup):
    """Gets list of subtype containers with types from a tournament soup."""
    subtype_lists = get_subtype_container_lists(soup)
    return subtype_lists_to_type_lists(subtype_lists)

pprint([container[1:] for container in get_type_container_lists(usau_soup)])
type_subtype_lists = get_type_container_lists(usau_soup)

[['Pool', 'Pool A Schedule & Scores'],
 ['Pool', 'Pool B Schedule & Scores'],
 ['Pool', 'Pool C Schedule & Scores'],
 ['Pool', 'Pool D Schedule & Scores'],
 ['Bracket', 'Crossover Round (4 Seeds from Pool Play)'],
 ['Bracket', '1st Place Bracket'],
 ['Bracket', '5th Place Bracket'],
 ['Bracket', '7th Place Bracket - Pro Flight Play-In (A)'],
 ['Bracket', '7th Place Bracket - Pro Flight Play-In (B)'],
 ['Bracket', '11th Place Bracket'],
 ['Bracket', '13th Place Bracket'],
 ['Bracket', '15th Place Bracket']]


Looking good so far! Now we need to get the bracket position containers.

In [468]:
def type_subtype_lists_to_bracket_position_lists(type_subtype_lists):
    """Gets a list of bracket position containers from list of subtype containers."""
    containers = []
    
    for lst in type_subtype_lists:
        
        positions = lst[0].find_all(class_="bracket_col")
        
        # Find the bracket position containers inside of a given subtype container. If there are none, 
        # just return the original container with None for the bracket position.
        if not positions:
            containers.append([*lst, None])
            continue
        # If there are bracket position containers, iterate through them and make a new list for each.
        for bracket_position in positions:
            pos_name = bracket_position.find("h4", class_="col_title").text
            containers.append([bracket_position, *lst[1:], pos_name])

    return containers

pprint([container[1:] for container in type_subtype_lists_to_bracket_position_lists(type_subtype_lists)])
bracket_position_lists = type_subtype_lists_to_bracket_position_lists(type_subtype_lists)

[['Pool', 'Pool A Schedule & Scores', None],
 ['Pool', 'Pool B Schedule & Scores', None],
 ['Pool', 'Pool C Schedule & Scores', None],
 ['Pool', 'Pool D Schedule & Scores', None],
 ['Bracket', 'Crossover Round (4 Seeds from Pool Play)', None],
 ['Bracket', '1st Place Bracket', 'Finals'],
 ['Bracket', '1st Place Bracket', 'Semifinals'],
 ['Bracket', '1st Place Bracket', 'Quarterfinals'],
 ['Bracket', '1st Place Bracket', 'Pre-Quarterfinals'],
 ['Bracket', '5th Place Bracket', '5th Place Finals'],
 ['Bracket', '5th Place Bracket', '5th Place Semifinals'],
 ['Bracket',
  '7th Place Bracket - Pro Flight Play-In (A)',
  'Pro Flight Play-In / 7th Place (tie)'],
 ['Bracket',
  '7th Place Bracket - Pro Flight Play-In (A)',
  '9th Place Semifinals'],
 ['Bracket',
  '7th Place Bracket - Pro Flight Play-In (A)',
  '9th Place Quarterfinals'],
 ['Bracket',
  '7th Place Bracket - Pro Flight Play-In (B)',
  'Pro Flight Play-In / 7th Place (tie)'],
 ['Bracket',
  '7th Place Bracket - Pro Flight Play-I

Finally, we need to get each game inside the bracket position containers. To do this, we'll search for the two relevant attributes: `data-game` and `data-relation`.
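
As a toy illustration of searching by attribute presence, here is a minimal sketch on made-up markup. The HTML string below is hypothetical (it is not copied from the USAU site) and only assumes that pool-style games carry a `data-game` attribute while bracket-style games carry a `data-relation` attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only: one pool-style row (marked with
# data-game) and one bracket-style box (marked with data-relation).
toy_html = """
<table>
  <tr data-game="101"><td>Pool game</td></tr>
  <tr><td>Header row without a game attribute</td></tr>
</table>
<div data-relation="202">Bracket game</div>
"""

toy_soup = BeautifulSoup(toy_html, "html.parser")

# Passing True as the value matches any tag that has the attribute at all,
# whatever its value is.
pool_style = toy_soup.find_all(attrs={"data-game": True})
bracket_style = toy_soup.find_all(attrs={"data-relation": True})

# find_all returns a list-like ResultSet, so the two results can be concatenated.
toy_games = pool_style + bracket_style
print([tag.name for tag in toy_games])
```

The row without either attribute is skipped, which is why this kind of search is a convenient way to pick out only the game elements.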

In [469]:
def bracket_position_lists_to_games(bracket_position_lists):
    """Gets list of games from list of bracket position containers."""
    games = []
    for bracket_position in bracket_position_lists:
        
        for game in bracket_position[0].find_all(attrs={"data-game":True}) + bracket_position[0].find_all(attrs={"data-relation":True}):
            games.append([game, *bracket_position[1:]])
    
    return games

games_list = bracket_position_lists_to_games(bracket_position_lists)
pprint([lst[1:] for lst in games_list])

[['Pool', 'Pool A Schedule & Scores', None],
 ['Pool', 'Pool A Schedule & Scores', None],
 ['Pool', 'Pool A Schedule & Scores', None],
 ['Pool', 'Pool A Schedule & Scores', None],
 ['Pool', 'Pool A Schedule & Scores', None],
 ['Pool', 'Pool A Schedule & Scores', None],
 ['Pool', 'Pool B Schedule & Scores', None],
 ['Pool', 'Pool B Schedule & Scores', None],
 ['Pool', 'Pool B Schedule & Scores', None],
 ['Pool', 'Pool B Schedule & Scores', None],
 ['Pool', 'Pool B Schedule & Scores', None],
 ['Pool', 'Pool B Schedule & Scores', None],
 ['Pool', 'Pool C Schedule & Scores', None],
 ['Pool', 'Pool C Schedule & Scores', None],
 ['Pool', 'Pool C Schedule & Scores', None],
 ['Pool', 'Pool C Schedule & Scores', None],
 ['Pool', 'Pool C Schedule & Scores', None],
 ['Pool', 'Pool C Schedule & Scores', None],
 ['Pool', 'Pool D Schedule & Scores', None],
 ['Pool', 'Pool D Schedule & Scores', None],
 ['Pool', 'Pool D Schedule & Scores', None],
 ['Pool', 'Pool D Schedule & Scores', None],
 ['Pool', 

We can verify that we have the expected games in this list. Now we need to actually parse the games!

## 4.5 Parsing the games

For Pandas, we will want to take each game and return a list containing the values that we want in each column. This list will end up being one row of the resulting dataframe. To recap, we wanted to find:


    TN: Tournament name
    GT: Game type (pool play or bracket play)
    GST: Game subtype (pool number, 1st place bracket, crossover play)
    BP: Bracket position (None for pool play, else quarterfinals / semifinals / etc.)
    D: Date
    T: Time
    F: Field number
    H: Home team
    HSd: Home seed
    HS: Home score
    A: Away team
    ASd: Away seed
    AS: Away score

We have taken care of the first four items already, since they are not contained in the actual game containers. Thankfully, it's not so hard to get the remaining items. For the most part, we can just call attributes. See the appendix for an explanation of the Regex.

One more note: when we talk about building a dataframe row, we really mean building a nested list where each entry is a list containing the values of a given row. Pandas dataframe construction is much faster all at once than iteratively, so storing things as nested lists until all the information is gathered is better. See https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html for a snippet on this.
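
To make the nested-list idea concrete, here is a small sketch with made-up rows (the tournament, team names, and scores are placeholders, and only a few of the columns are used). Each inner list is one row, and the dataframe is built once at the end. Note also that `DataFrame.append` was removed in pandas 2.0, so collecting rows first is the forward-compatible route as well.

```python
import pandas as pd

toy_columns = ["TN", "GT", "H", "HS", "A", "AS"]

# Made-up results used purely for illustration.
toy_results = [
    ("Team A", 15, "Team B", 10),
    ("Team C", 13, "Team D", 12),
]

toy_rows = []
for home, home_score, away, away_score in toy_results:
    # Appending to a plain Python list is cheap; no dataframe exists yet.
    toy_rows.append(["Example Open", "Pool", home, home_score, away, away_score])

# Build the dataframe in a single step once every row has been gathered.
toy_df = pd.DataFrame(toy_rows, columns=toy_columns)
print(toy_df)
```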

In [470]:
def get_team_name_and_seed(raw_string):
    """Converts a raw team string to team name and seed number."""
    try:
        return re.search(r"\n?([^\(]*) \((\d*)", raw_string).groups()
    except AttributeError:
        # re.search returned None, so the string is not in "name (seed)" form.
        # Sometimes there is no listed name for the team, just a seed. To handle this,
        # we're just calling the team "Unknown (seed)", and including the seed number as usual.
        seed_only = re.match(r"\n?\((\d+)\)", raw_string)
        if seed_only:
            stripped = raw_string.replace('\n', '') # removes newline characters
            # Capture the whole seed; a greedy ".*" in front would keep only its last digit.
            return f"Unknown {stripped}", seed_only.group(1)
        else:
            print(repr(raw_string))
            return raw_string, None

def get_teams_and_results(game_container):
    """Gets the team names, seeds, and scores from a game container."""
    home = game_container.find(attrs={"data-type":"game-team-home"}).text

    H, HSd = get_team_name_and_seed(home)
    HS = game_container.find(attrs={"data-type":"game-score-home"}).text

    away = game_container.find(attrs={"data-type":"game-team-away"}).text

    A, ASd = get_team_name_and_seed(away)
    AS = game_container.find(attrs={"data-type":"game-score-away"}).text
    
    return H,HSd,HS,A,ASd,AS

def game_to_df_row(game_list, TN, yr): # (Tournament Name), Year
    """Converts a game list to a single dataframe row (a list of column values)."""
    game_container, GT, GST, BP = game_list
    
    if "data-game" in game_container.attrs: #HANDLES POOL FORMATTING
        raw_date = game_container.find(attrs={"data-type":"game-date"}).text
        py_date = time.strptime(raw_date, "%a %m/%d") 
        # Uses the time package to interpret a string as a date. The string "%a %m/%d"
        # tells Python to expect the format "<3-letter day abbreviation> <month in digits>/<day in digits>",
        # for example, "Thu 9/25".
        D = time.strftime(f"%m/%d/{yr}",py_date)
        # Converts the time to a uniform string format, for us, "month/day/year". Since we don't
        # get the year directly from the container, we pass it in as an argument to the function like
        # the tournament name.
        
        try:
            T = game_container.find(attrs={"data-type":"game-time"}).text
        except AttributeError:
            T = None
        try:
            F = game_container.find(attrs={"data-type":"game-field"}).text
        except AttributeError:
            F = None
            
        H,HSd,HS,A,ASd,AS = get_teams_and_results(game_container)
    
    else:
        raw_date = game_container.find(class_="date", recursive=True).text
        try:
            py_date = time.strptime(raw_date, "%m/%d/%Y %I:%M %p") 
            D = time.strftime(f"%m/%d/{yr}",py_date)
            T = time.strftime("%I:%M %p",py_date)
        except ValueError:
            # The full date-and-time format did not match, so parse only the date portion.
            py_date = time.strptime(raw_date[:raw_date.find(" ")], "%m/%d/%Y") 
            D = time.strftime(f"%m/%d/{yr}",py_date)
            T = None
        try:
            F = game_container.find(class_="location").text
        except AttributeError:
            F = None
       
        H,HSd,HS,A,ASd,AS = get_teams_and_results(game_container)
    

    
    return [TN,GT,GST,BP,D,T,F,H,HSd, HS,A, ASd,AS]


game_row_list = [game_to_df_row(game, "USA Ultimate National Championships", 2019) for game in games_list]
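
The two date formats handled above can be tried in isolation. The sketch below is a slight variation rather than the exact code in `game_to_df_row`: the helper names are hypothetical, the sample strings are assumptions about what the page text looks like, and the pool helper appends the year before parsing instead of substituting it afterwards.

```python
import time

def normalize_pool_date(raw_date, year):
    """Parse a pool-style date such as "Thu 10/24" and return "MM/DD/YYYY"."""
    # Appending the year before parsing keeps strptime from assuming 1900,
    # which matters for dates like 2/29 that only exist in leap years.
    parsed = time.strptime(f"{raw_date} {year}", "%a %m/%d %Y")
    return time.strftime("%m/%d/%Y", parsed)

def normalize_bracket_datetime(raw_datetime):
    """Parse a bracket-style stamp such as "10/26/2019 1:30 PM" into (date, time)."""
    parsed = time.strptime(raw_datetime, "%m/%d/%Y %I:%M %p")
    return time.strftime("%m/%d/%Y", parsed), time.strftime("%I:%M %p", parsed)

print(normalize_pool_date("Thu 10/24", 2019))
print(normalize_bracket_datetime("10/26/2019 1:30 PM"))
```

Both helpers return strings in the same "month/day/year" layout, so pool and bracket rows end up directly comparable in the final dataframe.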

We can finally put these pieces together to read a tournament from a soup:

In [471]:
def read_tournament_from_soup(soup, year):
    """Get a list of game rows from a given tournament soup."""
    tournament_name = get_tournament_name(soup)
    print(tournament_name)
    type_subtype_lists = get_type_container_lists(soup)
    bracket_position_lists = type_subtype_lists_to_bracket_position_lists(type_subtype_lists)
    games_list = bracket_position_lists_to_games(bracket_position_lists)
    game_rows = [game_to_df_row(game, tournament_name, year) for game in games_list]
    return game_rows

At long last, we can actually make a dataframe for each tournament!

In [161]:
game_row_list = read_tournament_from_soup(usau_soup, 2019)
pd.DataFrame(game_row_list, columns=["TN","GT","GST","BP","D","T","F","H","HSd", "HS","A", "ASd","AS"])

USA Ultimate National Championships


Unnamed: 0,TN,GT,GST,BP,D,T,F,H,HSd,HS,A,ASd,AS
0,USA Ultimate National Championships,Pool,Pool A Schedule & Scores,,10/24/2019,9:00 AM,11,Sockeye,1,15,SoCal Condors,12,10
1,USA Ultimate National Championships,Pool,Pool A Schedule & Scores,,10/24/2019,9:00 AM,12,DiG,8,15,Furious George,13,8
2,USA Ultimate National Championships,Pool,Pool A Schedule & Scores,,10/24/2019,11:15 AM,11,Sockeye,1,15,Furious George,13,8
3,USA Ultimate National Championships,Pool,Pool A Schedule & Scores,,10/24/2019,11:15 AM,12,DiG,8,12,SoCal Condors,12,11
4,USA Ultimate National Championships,Pool,Pool A Schedule & Scores,,10/24/2019,1:30 PM,11,Sockeye,1,13,DiG,8,12
5,USA Ultimate National Championships,Pool,Pool A Schedule & Scores,,10/24/2019,1:30 PM,5,SoCal Condors,12,14,Furious George,13,11
6,USA Ultimate National Championships,Pool,Pool B Schedule & Scores,,10/24/2019,11:15 AM,9,PoNY,2,15,Pittsburgh Temper,11,8
7,USA Ultimate National Championships,Pool,Pool B Schedule & Scores,,10/24/2019,11:15 AM,10,Sub Zero,7,15,Johnny Bravo,14,12
8,USA Ultimate National Championships,Pool,Pool B Schedule & Scores,,10/24/2019,1:30 PM,9,PoNY,2,14,Johnny Bravo,14,6
9,USA Ultimate National Championships,Pool,Pool B Schedule & Scores,,10/24/2019,1:30 PM,10,Sub Zero,7,15,Pittsburgh Temper,11,13


Let's put it all together. Since we've been making all of these functions as we go, this is pretty simple!

In [472]:
def read_year(year=2019, division="club-open", links=None):
    """Read a year into a dataframe of games."""
    if links is None:
        print("Getting tournaments...")
        links = year_to_tournaments(year, division)
    game_rows_full = []
    print("Reading tournaments...")
    for link in tqdm(links, position=0):
        tournament_data = requests.get(link)
        tournament_soup = BeautifulSoup(tournament_data.text, 'html.parser')
        game_rows = read_tournament_from_soup(tournament_soup, year)
        game_rows_full.extend(game_rows)
    df = pd.DataFrame(game_rows_full, columns=["TN","GT","GST","BP","D","T","F","H","HSd", "HS","A", "ASd","AS"])
    return df

In [171]:
df = read_year(2019, links=usau_links)
df


  0%|          | 0/72 [00:00<?, ?it/s]

31st Annual Red Tide Ultimate Clambake
<built-in function repr>
<built-in function repr>
<built-in function repr>
<built-in function repr>
<built-in function repr>
<built-in function repr>
USA Ultimate National Championships
Southwest Club Men's Regional Championship 2019
Southeast Club Men's Regional Championship 2019
South Central Club Men's Regional Championship 2019
Northwest Club Men's Regional Championship 2019
Northeast Club Men's Regional Championship 2019
North Central Club Men's Regional Championship 2019
Mid-Atlantic Men's Club Regional Championship 2019
Great Lakes Men's Club Regional Championship
Nor Cal Men's Club Sectional Championship 2019
West Plains Men's Club Sectional Championship 2019
West New England Men's Club Sectional Championship 2019
Washington Men's Club Sectional Championship 2019
Upstate New York Men's Club Sectional Championship 2019
Texas Men's Club Sectional Championship 2019
So Cal Men's Club Sectional Championship 2019
Rocky Mountain Men's Club Sectio

Unnamed: 0,TN,GT,GST,BP,D,T,F,H,HSd,HS,A,ASd,AS
0,31st Annual Red Tide Ultimate Clambake,Pool,Pool A Schedule & Scores,,10/26/2019,11:30 AM,W05,yEUTH-ANd-Age-SIA,1,15,Unknown \n(7)\n,7,6
1,31st Annual Red Tide Ultimate Clambake,Pool,Pool A Schedule & Scores,,10/26/2019,11:30 AM,W17,Unknown \n(6)\n,6,0,Bowdoin,12,0
2,31st Annual Red Tide Ultimate Clambake,Pool,Pool A Schedule & Scores,,10/26/2019,1:00 PM,W05,yEUTH-ANd-Age-SIA,1,0,Bowdoin,12,0
3,31st Annual Red Tide Ultimate Clambake,Pool,Pool A Schedule & Scores,,10/26/2019,1:00 PM,W17,Unknown \n(6)\n,6,0,Unknown \n(7)\n,7,0
4,31st Annual Red Tide Ultimate Clambake,Pool,Pool A Schedule & Scores,,10/26/2019,2:30 PM,W05,yEUTH-ANd-Age-SIA,1,0,Unknown \n(6)\n,6,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2600,ATL Classic 2019,Bracket,1st Place Bracket,Finals,06/16/2019,01:30 PM,GSP #7,Freaks,1,11,El Niño,2,12
2601,ATL Classic 2019,Bracket,1st Place Bracket,Semis,06/16/2019,10:10 AM,GSP #10,Freaks,1,11,Bullet,3,10
2602,ATL Classic 2019,Bracket,1st Place Bracket,Semis,06/16/2019,10:10 AM,GSP #11,El Niño,2,12,ATLiens,9,11
2603,ATL Classic 2019,Bracket,1st Place Bracket,Pre-Semis,06/16/2019,08:30 AM,GSP #10,Bullet,3,10,Ironmen,4,8


In [216]:
df.to_csv("2019_club_open.csv")
odf = df.copy()

And there we go! That's everything!

# 5 A bit of fun with rankings

Since this is slightly beyond scope we won't get too deep into this, but here is the code that converts this dataframe into a set of rankings.

In [475]:
def get_teams(df):
    """Get a list of all teams in dataframe."""
    teams = df[["H","A"]].melt()["value"].unique()
    index_to_name = dict(enumerate(teams))
    name_to_index = dict([(index_to_name[i], i) for i in index_to_name])
    return teams, index_to_name, name_to_index

def clean_df(df):
    """Cleans dataframe to remove undesirable rows."""
    df = df.fillna("None")
    
    df["HS"] = df["HS"].replace(["W","L","F"], np.nan)
    df["AS"] = df["AS"].replace(["W","L","F"], np.nan)
    # We ignore games that don't have listed score differentials.
    df.dropna(inplace=True)
    df["HS"] = df["HS"].astype(int)
    df["AS"] = df["AS"].astype(int)
    
    df = df[~((df["HS"]==0) & (df["AS"] == 0))]
    # ignore games that were not played (both scores 0)
    df = df.reset_index(drop=True)
    return df

def make_incidence_matrix(df, zerosum=False):
    """Makes incidence matrix and score vector from dataframe."""
    teams, index_to_name, name_to_index = get_teams(df)
    m = len(df)
    n = len(teams)
    A = np.zeros((m,n))
    b = np.zeros((m,1))
    
    for i, row in df.iterrows():
        home = name_to_index[row["H"]]
        away = name_to_index[row["A"]]
        A[i, home] = 1
        A[i, away] = -1
        b[i] = row["HS"] - row["AS"]
        b[i] = b[i] + np.sign(b[i])
        ### GIVES THE WINNING TEAM ONE EXTRA POINT!
    return A,b

def get_sort(ratings, index_to_name):
    """Sorts a rating vector into (team name, rating) tuples."""
    team_ratings = [(index_to_name[i], j.item()) for i,j in enumerate(ratings)]
    return sorted(team_ratings, key=itemgetter(1), reverse=True)  
 
def display_sort_list(sort_list):
    """Displays ranked list of teams from sorted tuple list."""
    for i, tup in enumerate(sort_list):
        print(f"{i+1}. {tup[0]} | {tup[1]*100:.0f}")
        # Ratings multiplied by 100 for readability
    
def find_team(df, team):
    "Utility function: returns all games that a given team played in."
    return df[np.any(df[["H","A"]] == team, axis=1)]

In [440]:
df = clean_df(df)
teams, index_to_name, name_to_index = get_teams(df)
A,b = make_incidence_matrix(df)
x = np.linalg.lstsq(A,b,rcond=None)[0]
sort_list = get_sort(x, index_to_name)
display_sort_list(sort_list)

1. Sockeye | 1604
2. Truck Stop | 1599
3. Ring of Fire | 1517
4. PoNY | 1476
5. Revolver | 1423
6. Chicago Machine | 1399
7. Sub Zero | 1349
8. DiG | 1312
9. Nomadic Tribe | 1300
10. GOAT | 1248
11. SoCal Condors | 1235
12. Pittsburgh Temper | 1212
13. Furious George | 1196
14. Bunka Shutter Buzz Bullets | 1171
15. Doublewide | 1163
16. Chain Lightning | 1134
17. Johnny Bravo | 1116
18. Sprout | 1114
19. Yogosbo | 1016
20. Rhino Slam! | 1013
21. Voodoo | 975
22. Brickyard | 908
23. Patrol | 875
24. General Strike | 867
25. Vault | 848
26. CLE Smokestack | 838
27. H.I.P | 788
28. NexGen All-Star Tour | 772
29. Prairie Fire | 753
30. Nain Rouge | 742
31. Clutch | 740
32. yEUTH-ANd-Age-SIA | 733
33. Brickhouse | 731
34. Johnny Encore | 728
35. Freaks | 702
36. Black Market I | 698
37. Nitro | 680
38. Mad Men | 678
39. Blueprint | 678
40. Lost Boys | 665
41. MKE | 660
42. Garden State Ultimate | 653
43. Phoenix | 636
44. Blackfish | 614
45. CITYWIDE Special | 613
46. Tanasi | 605
47. Slugf
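
For readers curious about what the ranking code is doing, here is a toy version on three made-up teams. As I read the functions above, each game becomes one row of a matrix (+1 for the home team, -1 for the away team), the target is the score margin plus one bonus point for the winner, and the ratings are the least-squares solution. The team names and scores below are invented for illustration.

```python
import numpy as np

# Hypothetical toy results: (home, away, home score, away score).
toy_games = [("A", "B", 15, 10), ("B", "C", 15, 12), ("A", "C", 15, 8)]

toy_teams = sorted({team for game in toy_games for team in game[:2]})
team_index = {team: i for i, team in enumerate(toy_teams)}

A = np.zeros((len(toy_games), len(toy_teams)))
b = np.zeros(len(toy_games))
for row, (home, away, home_score, away_score) in enumerate(toy_games):
    A[row, team_index[home]] = 1
    A[row, team_index[away]] = -1
    margin = home_score - away_score
    # Same adjustment as above: the winner gets one extra point of margin.
    b[row] = margin + np.sign(margin)

# Only rating differences are determined by the games, so lstsq returns the
# minimum-norm solution, whose ratings sum to zero when all teams are connected.
toy_ratings, *_ = np.linalg.lstsq(A, b, rcond=None)
toy_ranking = sorted(zip(toy_teams, toy_ratings), key=lambda pair: pair[1], reverse=True)
for place, (team, rating) in enumerate(toy_ranking, start=1):
    print(f"{place}. {team} | {rating:.2f}")
```

A rating difference between two teams can be read as the expected adjusted margin if they played each other.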

# A.1 Regex explanation

Recall that our goal was to take a team string (e.g. "Sockeye (1)") and separate it into a team name and seed. Since there are a lot of weird team names, we use regex to handle this. Below is the original function:

In [None]:
def get_team_name_and_seed(raw_string):
    """Converts a raw team string to team name and seed number."""
    try:
        return re.search(r"\n?([^\(]*) \((\d*)", raw_string).groups()
    except AttributeError:
        # re.search returned None, so the string is not in "name (seed)" form.
        # Sometimes there is no listed name for the team, just a seed. To handle this,
        # we're just calling the team "Unknown (seed)", and including the seed number as usual.
        seed_only = re.match(r"\n?\((\d+)\)", raw_string)
        if seed_only:
            stripped = raw_string.replace('\n', '') # removes newline characters
            # Capture the whole seed; a greedy ".*" in front would keep only its last digit.
            return f"Unknown {stripped}", seed_only.group(1)
        else:
            print(repr(raw_string))
            return raw_string, None

We'll start with the `re.search` call. Its search pattern is built from the following pieces:

`\n` : match the newline character

`?` : optional (0 or 1 occurrences)

`(` : opens a *capturing group*: when a match is found, this tells the regex engine which group of characters to return.

`[]` : matches any one character from a specific set

`^` : negation when it comes first inside the brackets (match all BUT these characters)

`\(` : a literal open parenthesis (the backslash escapes it so that it is not read as the start of a capturing group)

`*` : any number of occurrences, including zero

`)` : closes the capturing group

So: `[^\(]*` means "Match any number of characters that are NOT an open parenthesis"

(a single space) : a literal space character

`\d` : any digit character

Now in larger terms, the whole pattern could be phrased as:

"Find text that may start with a newline, then has some number of non-parenthesis characters, a space, an open parenthesis, and any digits that follow it. Capture only the non-parenthesis characters (the team name) and the digits (the seed)."

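To see the pattern in action, here is a short sketch that runs it on a few sample strings. The samples are assumptions about what the raw team text looks like (a name followed by a seed in parentheses, sometimes wrapped in newlines, and sometimes a seed with no name).

```python
import re

TEAM_PATTERN = r"\n?([^\(]*) \((\d*)"

samples = ["\nSockeye (1)\n", "SoCal Condors (12)", "\n(7)\n"]

parsed_samples = {}
for raw in samples:
    match = re.search(TEAM_PATTERN, raw)
    # A string with no team name never contains the " (" sequence the pattern
    # requires, so re.search returns None and the fallback branch is needed.
    parsed_samples[raw] = match.groups() if match else None
    print(repr(raw), "->", parsed_samples[raw])
```

The first capture group backs off by one character so that the space before the parenthesis is left out of the team name.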

 


# A.2 Combined code

In [476]:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup, Comment
# BeautifulSoup (hereafter BS) and requests are the two main packages for normal web scraping.
# requests fetches the page source, and BeautifulSoup lets
# you actually work with it.
import urllib3
import urllib.request
import matplotlib.pyplot as plt
import seaborn as sns
# Seaborn is a plotting package that can make things look nicer than Matplotlib, but is most useful
# since it works nicely with Pandas dataframes. I'm including it here as a standard import, at time
# of writing I'm not sure if it will be used or not. It's a nice thing to have though!

import itertools as it
import more_itertools as mit
# Not sure if we'll need either of these, but again, standard imports.

import json
import re
# This is the regex package. For those unfamiliar, regex is a "language" for
# very specific text searching and extraction. Check out https://regexone.com/
# for a tutorial / explanation. 
import io
import math
import time
import cProfile
from pprint import pprint
# Will be useful for demonstration
from tqdm.notebook import tqdm
# Will make nice progress bars

import selenium
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
# I am using Chrome here, there are other supported browsers out there too.
# I think Firefox is actually the more common one for this

chrome_options = Options()
chrome_options.add_argument("--headless")
# Headless stops a browser from being visibly opened on your computer.

driver = webdriver.Chrome(options=chrome_options, 
              executable_path=r"C:\Users\alexs\Projects\lls\chromedriver.exe")

# You will have to download the chromedriver.exe file and change this path to
# point at it.

def jspage_to_soup(url, scroll=True):
    """Visit a JavaScript page and convert it into a soup."""
    # This is all selenium code!
    chrome_options = Options()
    chrome_options.add_argument("--headless")

    # Headless stops a browser from being visibly opened on your computer.

    driver = webdriver.Chrome(options=chrome_options, 
                  executable_path=r"C:\Users\alexs\Projects\lls\chromedriver.exe")

    # You will have to download the chromedriver.exe file and change this path to
    # point at it.
   
    driver.get(url)
    # Visit the page with the driver!

    time.sleep(1.5)
    # !!! We need to give the page time to load!
    if scroll:
        for i in range(20):
            driver.execute_script("window.scrollTo(0, document.body.scrollHeight);\
                                  var lenOfPage=document.body.scrollHeight;\
                                  return lenOfPage;")
            time.sleep(1)
            # This process literally scrolls the page all the way down to make sure that all of the content
            # gets read and stored. We won't always need to do this (see below), so we add an option to control it.
    
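    data = driver.page_source.encode("utf-8")
    # Takes the driver response and grabs the page source code including the javascript. This puts things in
    # a form that BS can handle.
    driver.quit()
    # Closes this browser instance; otherwise every call leaves a headless Chrome process running.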
    
    soup = BeautifulSoup(data, 'html.parser')
    # Makes the soup object!
    # Note: a soup object is essentially a version of the source code that has the nested tag HTML structure the
    # same way as if you look at a page in a browser inspector.
    return soup

def extract_tournament_info(tr_tag, division='club-open'):
    """Extract the tournament type and ultiarchive link from a tag."""
    tournament_type = tr_tag.find(class_="tournament-type").text
    links = ["https://ultiarchive.com" + tag['href'] for tag in tr_tag.find_all(href=True) if division in tag['href']]
    # Makes links to each tournament as long as the link contains the right division.
    return tournament_type, links

def soup_to_list_of_tournaments(soup, division='club-open'):
    """Turn an ultiarchive soup into a list of all ultiarchive tournament links."""
    all_links = []
    for tr_tag in soup.find_all("tr"):
        tournament_type, links = extract_tournament_info(tr_tag, division)
        if tournament_type == "usau":
            all_links.extend(links)
    return all_links

def ulti_to_usau(url):
    """Converts an Ultiarchive tournament link to the corresponding USAU tournament link."""
    soup = jspage_to_soup(url, scroll=False)
    return soup.find(href=True, rel="noopener")["href"]

def year_to_tournaments(year=2019, division='club-open'):
    """ Gets links to each ultiarchive tournament page for a given year"""
    url = fr"https://ultiarchive.com/years/{year}"
    # This is the general form for the ultiarchive pages
    soup = jspage_to_soup(url)
    links = soup_to_list_of_tournaments(soup, division=division)
    print(links[:10])
    print("Found links!")
    usau_links = []
    for link in tqdm(links, position=0):
        print(link)
        usau_links.append(ulti_to_usau(link))
    return usau_links

def get_tournament_name(soup):
    """Get the name of a tournament from a soup."""
    return list(soup.find(class_="breadcrumbs").children)[3].text

def get_subtype_container_lists(soup):
    """Gets list of all subtype containers from a tournament soup."""
    containers = []
    
    for subtype_container in soup.find_all(class_="global_table scores_table"):
        # Things that are stored as pool play, but may or may not ACTUALLY be pool play:
        name = subtype_container.find("th", colspan="8").text
        containers.append([subtype_container, name])

    for subtype_container in soup.find_all(class_="mod_slide alt_slide"):
        # Things that are stored as bracket play, but may or may not ACTUALLY be bracket play:
        name = subtype_container.find(href="#").text
        containers.append([subtype_container, name])
    
    return containers

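def subtype_name_to_type(subtype_name):
    """Determines whether a given subtype container name corresponds to pool or bracket play."""
    name = subtype_name.lower()
    # An ordinal is a number followed by its suffix ("1st", "13th"). Checking for the bare
    # suffixes alone would misfire on ordinary words such as "round" or "north".
    has_ordinal = re.search(r"\d+(?:st|nd|rd|th)\b", name) is not None
    if "pool" in name and "crossover" not in name and not has_ordinal:
        return "Pool"
    else:
        return "Bracket"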

def subtype_lists_to_type_lists(subtype_lists):
    """Adds game types to list of subtype containers."""
    for lst in subtype_lists:
        lst.insert(1, subtype_name_to_type(lst[1]))
    return subtype_lists

def get_type_container_lists(soup):
    """Gets list of subtype containers with types from a tournament soup."""
    subtype_lists = get_subtype_container_lists(soup)
    return subtype_lists_to_type_lists(subtype_lists)

def type_subtype_lists_to_bracket_position_lists(type_subtype_lists):
    """Gets a list of bracket position containers from list of subtype containers."""
    containers = []
    
    for lst in type_subtype_lists:
        
        positions = lst[0].find_all(class_="bracket_col")
        
        # Find the bracket position containers inside of a given subtype container. If there are none, 
        # just return the original container with None for the bracket position.
        if not positions:
            containers.append([*lst, None])
            continue
        # If there are bracket position containers, iterate through them and make a new list for each.
        for bracket_position in positions:
            pos_name = bracket_position.find("h4", class_="col_title").text
            containers.append([bracket_position, *lst[1:], pos_name])

    return containers

def bracket_position_lists_to_games(bracket_position_lists):
    """Gets list of games from list of bracket position containers."""
    games = []
    for bracket_position in bracket_position_lists:
        
        for game in bracket_position[0].find_all(attrs={"data-game":True}) + bracket_position[0].find_all(attrs={"data-relation":True}):
            games.append([game, *bracket_position[1:]])
    
    return games

def get_team_name_and_seed(raw_string):
    """Converts a raw team string to team name and seed number."""
    match = re.search(r"\n?([^\(]*) \((\d*)", raw_string)
    if match:
        return match.groups()
    # Sometimes there is no listed name for the team, just a seed. To handle this,
    # we're just calling the team "Unknown (seed)", and including the seed number as usual.
    seed_only = re.match(r"\n?\((\d+)\)", raw_string)
    if seed_only:
        stripped = raw_string.replace('\n', '') # removes newline characters
        # Match the parentheses explicitly so that multi-digit seeds are captured whole.
        return f"Unknown {stripped}", seed_only.group(1)
    print(repr(raw_string))
    return raw_string, None
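
# A minimal sketch of what the name/seed pattern above does, written as a standalone
# helper so it can be tried on its own. `_demo_split_name_and_seed` is a hypothetical
# name and the sample strings are made up, not taken from a real USAU page.
import re

def _demo_split_name_and_seed(raw_string):
    """Splits 'Name (seed)' into (name, seed); returns (raw_string, None) if there is no match."""
    match = re.search(r"\n?([^\(]*) \((\d*)", raw_string)
    return match.groups() if match else (raw_string, None)

# Example: _demo_split_name_and_seed("\nRevolver (1)") gives ("Revolver", "1");
# note that the seed comes back as a string, not an int.
# A seed-only string such as "(12)" does not match, because the pattern requires
# a space before the opening parenthesis.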

def get_teams_and_results(game_container):
    """Gets the team names, seeds, and scores from a game container."""
    home = game_container.find(attrs={"data-type":"game-team-home"}).text

    H, HSd = get_team_name_and_seed(home)
    HS = game_container.find(attrs={"data-type":"game-score-home"}).text

    away = game_container.find(attrs={"data-type":"game-team-away"}).text

    A, ASd = get_team_name_and_seed(away)
    AS = game_container.find(attrs={"data-type":"game-score-away"}).text
    
    return H,HSd,HS,A,ASd,AS

def game_to_df_row(game_list, TN, yr): # TN = tournament name, yr = year
    """Converts a game list to a dataframe row (as a list)."""
    game_container, GT, GST, BP = game_list
    
    if "data-game" in game_container.attrs: #HANDLES POOL FORMATTING
        raw_date = game_container.find(attrs={"data-type":"game-date"}).text
        py_date = time.strptime(raw_date, "%a %m/%d") 
        # Uses the time package to interpret a string as a date. The string "%a %m/%d"
        # tells Python to expect the format "<3-letter day abbreviation> <month in digits>/<day in digits>",
        # for example, "Thu 9/25".
        D = time.strftime(f"%m/%d/{yr}",py_date)
        # Converts the time to a uniform string format, for us, "month/day/year". Since we don't
        # get the year directly from the container, we pass it in as an argument to the function like
        # the tournament name.
        
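        try:
            T = game_container.find(attrs={"data-type":"game-time"}).text
        except AttributeError: # find() returned None: no time listed
            T = None
        try:
            F = game_container.find(attrs={"data-type":"game-field"}).text
        except AttributeError: # find() returned None: no field listed
            F = None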
            
        H,HSd,HS,A,ASd,AS = get_teams_and_results(game_container)
    
    else:
        raw_date = game_container.find(class_="date", recursive=True).text
        try:
            py_date = time.strptime(raw_date, "%m/%d/%Y %I:%M %p") 
            D = time.strftime(f"%m/%d/{yr}",py_date)
            T = time.strftime("%I:%M %p",py_date)
        except ValueError:
            # No time listed (or an unexpected time format): keep only the date part.
            py_date = time.strptime(raw_date.split()[0], "%m/%d/%Y")
            D = time.strftime(f"%m/%d/{yr}",py_date)
            T = None
        try:
            F = game_container.find(class_="location").text
        except AttributeError: # find() returned None: no location listed
            F = None
       
        H,HSd,HS,A,ASd,AS = get_teams_and_results(game_container)
    

    
    return [TN,GT,GST,BP,D,T,F,H,HSd, HS,A, ASd,AS]
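
# A minimal sketch of the two date formats handled above, assuming pool games look like
# "Thu 9/25" and bracket games look like "9/25/2019 10:30 AM" (both sample strings are
# made up). `_demo_parse_game_date` is a hypothetical helper, not part of the pipeline.
import time

def _demo_parse_game_date(raw_date, yr):
    """Returns (date, time) strings, with time set to None when none is listed."""
    try:
        py_date = time.strptime(raw_date, "%a %m/%d")  # pool format, no clock time
        return time.strftime(f"%m/%d/{yr}", py_date), None
    except ValueError:
        py_date = time.strptime(raw_date, "%m/%d/%Y %I:%M %p")  # bracket format
        return time.strftime(f"%m/%d/{yr}", py_date), time.strftime("%I:%M %p", py_date)

# Example: _demo_parse_game_date("Thu 9/25", 2019) gives ("09/25/2019", None).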

def read_tournament_from_soup(soup, year):
    """Get a list of game rows from a given tournament soup."""
    tournament_name = get_tournament_name(soup)
    print(tournament_name)
    type_subtype_lists = get_type_container_lists(soup)
    bracket_position_lists = type_subtype_lists_to_bracket_position_lists(type_subtype_lists)
    games_list = bracket_position_lists_to_games(bracket_position_lists)
    game_rows = [game_to_df_row(game, tournament_name, year) for game in games_list]
    return game_rows

def read_year(year=2019, division="club-open", links=None):
    """Read a year into a dataframe of games."""
    if links is None:
        print("Getting tournaments...")
        links = year_to_tournaments(year, division)
    game_rows_full = []
    print("Reading tournaments...")
    for link in tqdm(links, position=0):
        tournament_data = requests.get(link)
        tournament_soup = BeautifulSoup(tournament_data.text, 'html.parser')
        game_rows = read_tournament_from_soup(tournament_soup, year)
        game_rows_full.extend(game_rows)
    df = pd.DataFrame(game_rows_full, columns=["TN","GT","GST","BP","D","T","F","H","HSd", "HS","A", "ASd","AS"])
    return df

def get_teams(df):
    """Get a list of all teams in dataframe."""
    teams = df[["H","A"]].melt()["value"].unique()
    index_to_name = dict(enumerate(teams))
    name_to_index = {name: i for i, name in index_to_name.items()}
    return teams, index_to_name, name_to_index

def clean_df(df):
    """Cleans dataframe to remove undesirable rows."""
    df = df.fillna("None")
    
    df["HS"] = df["HS"].replace(["W","L","F"], np.nan)
    df["AS"] = df["AS"].replace(["W","L","F"], np.nan)
    # We ignore games that don't have listed score differentials.
    df.dropna(inplace=True)
    df["HS"] = df["HS"].astype(int)
    df["AS"] = df["AS"].astype(int)
    
    df = df[~((df["HS"]==0) & (df["AS"] == 0))]
    # ignore games that were not played (both scores 0)
    df = df.reset_index(drop=True)
    return df

def make_incidence_matrix(df, zerosum=False):
    """Makes incidence matrix and score vector from dataframe.

    Expects a 0..m-1 index (as returned by clean_df). The zerosum argument is currently unused.
    """
    teams, index_to_name, name_to_index = get_teams(df)
    m = len(df)
    n = len(teams)
    A = np.zeros((m,n))
    b = np.zeros((m,1))
    
    for i, row in df.iterrows():
        home = name_to_index[row["H"]]
        away = name_to_index[row["A"]]
        A[i, home] = 1
        A[i, away] = -1
        b[i] = row["HS"] - row["AS"]
        b[i] = b[i] + np.sign(b[i])
        ### GIVES THE WINNING TEAM ONE EXTRA POINT!
    return A,b
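
# A minimal, dependency-free sketch of the encoding built above: one row per game with
# +1 for the home team and -1 for the away team, and a margin padded by one point in the
# winner's favor. `_demo_incidence_rows` and the sample games are hypothetical; teams are
# sorted alphabetically here for simplicity, which is not how get_teams orders them.
def _demo_incidence_rows(games):
    """games is a list of (home, away, home_score, away_score) tuples."""
    teams = sorted({team for game in games for team in game[:2]})
    name_to_index = {name: i for i, name in enumerate(teams)}
    rows, margins = [], []
    for home, away, home_score, away_score in games:
        row = [0] * len(teams)
        row[name_to_index[home]] = 1
        row[name_to_index[away]] = -1
        diff = home_score - away_score
        sign = (diff > 0) - (diff < 0)  # plays the role of np.sign for integers
        rows.append(row)
        margins.append(diff + sign)
    return teams, rows, margins

# Example: _demo_incidence_rows([("A", "B", 15, 10), ("B", "C", 12, 13)]) gives
# teams ["A", "B", "C"], rows [[1, -1, 0], [0, 1, -1]] and margins [6, -2].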

def get_sort(ratings, index_to_name):
    """Sorts a rating vector into (team name, rating) tuples."""
    team_ratings = [(index_to_name[i], j.item()) for i,j in enumerate(ratings)]
    return sorted(team_ratings, key=itemgetter(1), reverse=True)  
 
def display_sort_list(sort_list):
    """Displays ranked list of teams from sorted tuple list."""
    for i, tup in enumerate(sort_list):
        print(f"{i+1}. {tup[0]} | {tup[1]*100:.0f}")
        # Ratings multiplied by 100 for readability
    
def find_team(df, team):
    """Utility function: returns all games that a given team played in."""
    return df[np.any(df[["H","A"]] == team, axis=1)]