# `BeautifulSoup`

This homework builds on what we covered in class and gives you some practice problems.

## Continued Learning

We'll start this notebook by extending the material from lecture.

In [1]:
from bs4 import BeautifulSoup
from urllib.request import urlopen

### Reading Data from Tables and Comments

Many of the tasks we've considered so far have focused on grabbing text data. Another invaluable data science use case of `BeautifulSoup` is to easily grab data presented in online tables.

Let's consider the following page on Michael Jordan's career from basketball-reference, <a href="https://www.basketball-reference.com/players/j/jordami01.html">https://www.basketball-reference.com/players/j/jordami01.html</a>.

How can we grab the statistics listed in his Per Game stats table? Let's find out.

First we'll make a soup object.

In [2]:
## First let's make a soup object
url = "https://www.basketball-reference.com/players/j/jordami01.html"
html = urlopen(url)
soup = BeautifulSoup(html,'html')

Using your web developer tools you can find out that the Per Game table is kept in an HTML `table` with the `id = "per_game"`. We can use the `find` function to grab this table.

In [3]:
## Now let's search the soup for the table we want
per_game = soup.find('table',{'id':"per_game"})

## print the results
print(per_game.prettify())

<table class="row_summable sortable stats_table" data-cols-to-freeze="1,3" id="per_game">
 <caption>
  Per Game Table
 </caption>
 <colgroup>
  <col/>
  <col/>
  <col/>
  <col/>
  <col/>
  <col/>
  <col/>
  <col/>
  <col/>
  <col/>
  <col/>
  <col/>
  <col/>
  <col/>
  <col/>
  <col/>
  <col/>
  <col/>
  <col/>
  <col/>
  <col/>
  <col/>
  <col/>
  <col/>
  <col/>
  <col/>
  <col/>
  <col/>
  <col/>
  <col/>
 </colgroup>
 <thead>
  <tr>
   <th aria-label="If listed as single number, the year the season ended.★ - Indicates All-Star for league.Only on regular season tables." class="poptip sort_default_asc center" data-stat="season" data-tip="If listed as single number, the year the season ended.&lt;br&gt;★ - Indicates All-Star for league.&lt;br&gt;Only on regular season tables." scope="col">
    Season
   </th>
   <th aria-label="Player's age on February 1 of the season" class="poptip sort_default_asc center" data-stat="age" data-tip="Player's age on February 1 of the season" scope="col"

Let's do something easy first, find the list of seasons in the table.

On the way we'll learn something about HTML tables.

A table can have both a `thead` and `tbody`. `thead` is the header of the table, i.e. it contains information about the table and the column names.

In [4]:
print(per_game.thead.prettify())

<thead>
 <tr>
  <th aria-label="If listed as single number, the year the season ended.★ - Indicates All-Star for league.Only on regular season tables." class="poptip sort_default_asc center" data-stat="season" data-tip="If listed as single number, the year the season ended.&lt;br&gt;★ - Indicates All-Star for league.&lt;br&gt;Only on regular season tables." scope="col">
   Season
  </th>
  <th aria-label="Player's age on February 1 of the season" class="poptip sort_default_asc center" data-stat="age" data-tip="Player's age on February 1 of the season" scope="col">
   Age
  </th>
  <th aria-label="Team" class="poptip sort_default_asc center" data-stat="team_id" data-tip="Team" scope="col">
   Tm
  </th>
  <th aria-label="League" class="poptip sort_default_asc center" data-stat="lg_id" data-tip="League" scope="col">
   Lg
  </th>
  <th aria-label="Position" class="poptip sort_default_asc center" data-stat="pos" data-tip="Position" scope="col">
   Pos
  </th>
  <th aria-label="Games" clas

`tbody` contains the body of the table.

In [5]:
print(per_game.tbody)

<tbody><tr class="full_table" id="per_game.1985"><th class="left" data-stat="season" scope="row"><a href="/players/j/jordami01/gamelog/1985/">1984-85</a><span class="sr_star"></span></th><td class="center" data-stat="age">21</td><td class="left" data-stat="team_id"><a href="/teams/CHI/1985.html">CHI</a></td><td class="left" data-stat="lg_id"><a href="/leagues/NBA_1985.html">NBA</a></td><td class="center" data-stat="pos">SG</td><td class="right" data-stat="g"><strong>82</strong></td><td class="right" data-stat="gs">82</td><td class="right" data-stat="mp_per_g">38.3</td><td class="right" data-stat="fg_per_g">10.2</td><td class="right" data-stat="fga_per_g">19.8</td><td class="right" data-stat="fg_pct">.515</td><td class="right" data-stat="fg3_per_g">0.1</td><td class="right" data-stat="fg3a_per_g">0.6</td><td class="right" data-stat="fg3_pct">.173</td><td class="right" data-stat="fg2_per_g">10.1</td><td class="right" data-stat="fg2a_per_g">19.2</td><td class="right" data-stat="fg2_pct">.

You may notice the `tr`, `th`, and `td` HTML elements in the `tbody` above. `tr` creates a new row in the table, `th` is a header cell (here, the leading entry of a row or column), and `td` is a regular, non-leading data cell. basketball-reference is nice in the sense that it gives its table cells a `data-stat` attribute, which makes searching its tables quite easy.
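
To make the roles of `thead`, `tbody`, `tr`, `th`, and `td` concrete, here is a minimal sketch on a small, hand-written table. The HTML string, the `toy_stats` id, and the variable names are made up for illustration; only the `data-stat` idea is borrowed from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table that mimics the thead/tbody layout described above
toy_html = """
<table id="toy_stats">
  <thead>
    <tr><th data-stat="season">Season</th><th data-stat="age">Age</th><th data-stat="team_id">Tm</th></tr>
  </thead>
  <tbody>
    <tr><th data-stat="season">1984-85</th><td data-stat="age">21</td><td data-stat="team_id">CHI</td></tr>
    <tr><th data-stat="season">1985-86</th><td data-stat="age">22</td><td data-stat="team_id">CHI</td></tr>
  </tbody>
</table>
"""

toy_soup = BeautifulSoup(toy_html, "html.parser")
toy_table = toy_soup.find("table", {"id": "toy_stats"})

# Column names live in the th cells of thead
columns = [th.text for th in toy_table.thead.find_all("th")]

# Each tr in tbody is one row; both th and td cells carry a data-stat attribute
rows = []
for tr in toy_table.tbody.find_all("tr"):
    rows.append({cell["data-stat"]: cell.text for cell in tr.find_all(["th", "td"])})

print(columns)
print(rows[0])
```

Keying each cell by its `data-stat` value means you can look a statistic up by name instead of counting column positions.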

Now we can get a list of the seasons.

Here's how we can get the season from the first row.

In [6]:
## We look at the body of the table
## then find all the rows
## then take the text from the th entry.
per_game.tbody.find_all('tr')[0].th.text

'1984-85'

In [7]:
## Now you try writing a script to get all the seasons.
## Do you get all of the seasons listed in the table?


## Sample Answer
for tr in per_game.tbody.find_all('tr'):
    if tr.th:
        print(tr.th.text)
    else:
        print(tr.td.text)




1984-85
1985-86
1986-87
1987-88
1988-89
1989-90
1990-91
1991-92
1992-93
1993-94
1994-95
1995-96
1996-97
1997-98
1998-99
1999-00
2000-01
2001-02
2002-03
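
The `if tr.th` check in the sample answer matters because not every row has to start with a `th` cell. The sketch below shows the same fallback on a hand-written `tbody`; the HTML and the note row are hypothetical and only illustrate the pattern.

```python
from bs4 import BeautifulSoup

# Hypothetical tbody: the middle row has no th cell, only a td spanning the row
toy_body = BeautifulSoup("""
<table><tbody>
  <tr><th data-stat="season">1992-93</th><td data-stat="age">29</td></tr>
  <tr><td colspan="2">Did not play (illustrative note row)</td></tr>
  <tr><th data-stat="season">1994-95</th><td data-stat="age">31</td></tr>
</tbody></table>
""", "html.parser")

labels = []
for tr in toy_body.tbody.find_all("tr"):
    # tr.th is None when the row has no th child, so fall back to the first td
    first_cell = tr.th if tr.th is not None else tr.td
    labels.append(first_cell.text)

print(labels)
```

Without the fallback, `tr.th.text` would raise an `AttributeError` on the middle row, because `tr.th` is `None` there.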


Hopefully the last example showed you that scraping data from the world wide web can be a bit messy. Let's see how you can get around another messy issue.

Let's try getting another table from His Airness's stats page.

The Adjusted Shooting table looks interesting. Use the developer tools to see where it is stored, then search the soup for it. Are you able to find it in the soup?

In [8]:
## Your Code

## Sample Answer
print(soup.find('table',{'id':"adj_shooting"}))

None


Did you struggle to find the table you were interested in?

This website is designed so that this table doesn't load until someone scrolls down to it or clicks on the link at the top of the page. In the HTML we retrieved from the website, the table is stored inside a comment, so `find` cannot see it as a `table` element. We can grab comments like so.

In [9]:
## We import Comment from bs4
from bs4 import Comment

## then we find all of the comments in the soup
comments = soup.find_all(string=lambda text: isinstance(text, Comment))

## an example comment
print(comments[50])



<div class="table_container" id="div_game_highs">
    
    <table class="sortable stats_table" id="game_highs" data-cols-to-freeze="1,3">
    <caption> Game Highs Table</caption>
    
   <colgroup><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col></colgroup>
   <thead>
      
      <tr class="over_header">
         <th aria-label="" data-stat="highs" colspan="23" class=" over_header center" >Game Highs</th>
      </tr>
            
      <tr>
         <th aria-label="If listed as single number, the year the season ended.&#x2605; - Indicates All-Star for league.Only on regular season tables." data-stat="season" scope="col" class=" poptip sort_default_asc center" data-tip="If listed as single number, the year the season ended.<br>&#x2605; - Indicates All-Star for league.<br>Only on regular season tables." data-over-header="Game Highs" >Season</th>
         <th aria-label="Player's age on February 1 of the season" data-sta

In [10]:
## then we just search the soup for the comment we're
## most interested in
for comment in comments:
    if "adj_shooting" in comment:
        table = comment

In [11]:
table

'\n\n<div class="table_container" id="div_adj_shooting">\n    \n    <table class="row_summable sortable stats_table" id="adj_shooting" data-cols-to-freeze="1,3">\n    <caption>Adjusted Shooting Table</caption>\n    \n   <colgroup><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col><col></colgroup>\n   <thead>\n      \n      <tr class="over_header">\n         <th aria-label="" data-stat="_blank" colspan="7" class=" over_header center" >&nbsp;</th><th></th>\n         <th aria-label="" data-stat="_basic" colspan="8" class=" over_header center" >Jordan Shooting %</th><th></th>\n         <th aria-label="" data-stat="_league" colspan="8" class=" over_header center" >League Shooting %</th><th></th>\n         <th aria-label="" data-stat="_adj" colspan="8" class=" over_header center" >League-Adjusted</th><th></th>\n         <th aria-label="" data-stat="_add" colspan="2

Now that you have the table try to search it for Jordan's FG percentage each season.

Are you able to get them, or do you encounter errors?

In [12]:
## Your Code

## Sample Solution
table.tbody.find_all('tr')

AttributeError: 'Comment' object has no attribute 'tbody'

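You should have received an error. Why? Because a `Comment` is just a string of text, not a parsed soup object, so it has no `tbody` attribute. Let's rectify that now by parsing the comment's text into its own soup.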

In [None]:
table = BeautifulSoup(table,'html')

In [None]:
table.tbody.find_all('td',{'data-stat':"fg_pct"})

There we go! Now you try, find the League's FG percentage so we can compare.

In [None]:
## Your Code

## Sample Answer
table.tbody.find_all('td',{'data-stat':"lg_fg_pct"})
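
The whole comment workflow can be tried offline on a small example. This is a sketch under stated assumptions: the page string, the `hidden_stats` id, and the variable names are hypothetical, and the single data row is only there to give the search something to find.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical page: the table of interest is wrapped in an HTML comment
toy_page = """
<div id="visible"><p>Regular content</p></div>
<!--
<table id="hidden_stats">
  <tbody>
    <tr><th data-stat="season">1984-85</th><td data-stat="fg_pct">.515</td></tr>
  </tbody>
</table>
-->
"""

toy_page_soup = BeautifulSoup(toy_page, "html.parser")

# The commented-out table is not part of the parsed tree, so this is None
direct_hit = toy_page_soup.find("table", {"id": "hidden_stats"})
print(direct_hit)

# Collect the Comment strings, keep the one mentioning the id, and parse it again
toy_comments = toy_page_soup.find_all(string=lambda text: isinstance(text, Comment))
hidden = next(c for c in toy_comments if "hidden_stats" in c)
hidden_soup = BeautifulSoup(hidden, "html.parser")

fg_pcts = [td.text for td in hidden_soup.tbody.find_all("td", {"data-stat": "fg_pct"})]
print(fg_pcts)
```

The second call to `BeautifulSoup` is the key step: it turns the comment's text into a tree that supports `tbody`, `find`, and `find_all`.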

Great! Now you have even more knowledge on `BeautifulSoup`. Time for some Practice Problems.

## Practice Problems

#### Easier

1. Write a script to get the scores from all of the Cleveland Browns games from this site, <a href="https://www.pro-football-reference.com/teams/cle/2019.htm">https://www.pro-football-reference.com/teams/cle/2019.htm</a>
2. Try to scrape the beer names, and beer types from the following link, <a href="https://untappd.com/w/lineage-brewing/193720/beer">https://untappd.com/w/lineage-brewing/193720/beer</a>. <i>Note you may get an error when doing this, if so try to understand what the error is and why it happened.</i>
3. Write a script to get the title, author, and publishing date from the first article from 538's politics section, <a href="https://fivethirtyeight.com/politics/features/">https://fivethirtyeight.com/politics/features/</a>.

#### More Difficult

4. Write a script to get the scores from every Cleveland Browns game from 2000 to 2019 from <a href="https://www.pro-football-reference.com/">pro-football-reference.com</a>. Store them as a csv file with columns: year, game_num, opposing_team, browns_score, opp_score
5. Write a script to get the title, author, and publishing date from the first 5 pages of articles from 538's politics section, <a href="https://fivethirtyeight.com/politics/features/">https://fivethirtyeight.com/politics/features/</a>. 

In [None]:
## Your Code

## 1. SAMPLE SOLUTION

url = "https://www.pro-football-reference.com/teams/cle/2019.htm"
html = urlopen(url)
soup = BeautifulSoup(html,'html')

## using the web inspector we can see the scores are stored in
## the games table
## within pts_off and pts_def data-stat tds
browns_scores = []
opp_scores = []

for tr in soup.find('table',{'id':'games'}).tbody.find_all('tr'):
    if tr.find('td',{'data-stat':"pts_off"}).text != '':
        browns_scores.append(tr.find('td',{'data-stat':"pts_off"}).text)
        opp_scores.append(tr.find('td',{'data-stat':"pts_def"}).text)
        
browns_scores

In [None]:
## Your Code

## 2 SAMPLE SOLUTION
## At one point this question was supposed to give you an access error 
## because untappd blocked access to users to keep people from taking 
## their data to build beer apps. However, now it seems you can get access 
## without a problem. So you should be able to copy and paste the code from class.
url = "https://untappd.com/w/lineage-brewing/193720/beer"
html = urlopen(url)
soup = BeautifulSoup(html,'html')

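beers = []
styles = []
for beer in soup.find_all('div',{'class':"beer-item"}):
    beers.append(beer.find('p',{'class':"name"}).text)
    ## ASSUMPTION: the beer type sits in a p tag with class "style";
    ## check this with your web inspector before relying on it
    style = beer.find('p',{'class':"style"})
    styles.append(style.text if style else None)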
    
beers
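
If a site does refuse the request, `urlopen` raises an exception rather than returning a page. The sketch below shows one way to catch and inspect that error so you can see the status code. `fetch_html`, `user_agent`, and `timeout` are hypothetical names for this example, and whether a given site accepts or rejects a request is up to that site.

```python
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError


def fetch_html(url, user_agent=None, timeout=10):
    """Return the page body as bytes, or None if the request fails.

    fetch_html, user_agent, and timeout are hypothetical names for this sketch.
    """
    headers = {"User-Agent": user_agent} if user_agent else {}
    request = Request(url, headers=headers)
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.read()
    except HTTPError as err:
        # The server answered, but with an error status such as 403 or 404
        print("HTTP error", err.code, err.reason, "for", url)
    except URLError as err:
        # No usable answer at all: DNS failure, refused connection, bad scheme
        print("Could not reach", url, "because", err.reason)
    return None


# Building a Request does not touch the network, so it is safe to inspect
example = Request("https://example.com/", headers={"User-Agent": "my-notebook/0.1"})
print(example.get_header("User-agent"))
```

`HTTPError` is caught before `URLError` because it is a subclass of `URLError`; reversing the order would hide the status code.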

In [None]:
## Your Code

## 3. Sample Solution

html = urlopen("https://fivethirtyeight.com/politics/features/")
soup = BeautifulSoup(html,'html')

a = soup.find_all('a',{'class':"post-thumbnail"})[0]
post_url = a['href']
post_html = urlopen(post_url)
post_soup = BeautifulSoup(post_html,'html')

print(post_soup.find('h1',{'class':"article-title article-title-single entry-title"})
                            .text.replace("\n","").replace("\t",""))
print(post_soup.find('p',{'class':"single-metadata single-byline vcard"}).text.replace("By ",""))
print(post_soup.find('p',{'class':"topic single-topic"}).time.text)

In [None]:
## Your Code

## 4. Sample Solution
import pandas as pd
import matplotlib.pyplot as plt

seasons = []
browns_scores = []
opp_scores = []

for season in range(2000,2020):
    url = "https://www.pro-football-reference.com/teams/cle/" + str(season) + ".htm"
    html = urlopen(url)
    soup = BeautifulSoup(html,'html')
    for tr in soup.find('table',{'id':'games'}).tbody.find_all('tr'):
        if tr.find('td',{'data-stat':"pts_off"}).text != '':
            seasons.append(str(season))
            browns_scores.append(int(tr.find('td',{'data-stat':"pts_off"}).text))
            opp_scores.append(int(tr.find('td',{'data-stat':"pts_def"}).text))
            
scores = pd.DataFrame({'Season':seasons,'Browns_Score':browns_scores,'Opponent_Score':opp_scores})

plt.figure(figsize=(14,10))

plt.plot(range(2000,2020),scores.groupby('Season').Browns_Score.mean(),'o',color = 'brown',label="Browns")
plt.plot(range(2000,2020),scores.groupby('Season').Opponent_Score.mean(),'o',color = 'black',label="Opponents")

plt.xticks(range(2000,2020))

plt.xlabel("Season", fontsize=16)
plt.ylabel("Average Score", fontsize=16)

plt.text(2000,27.5,"The Browns were Bad :(",fontsize=16)
plt.legend()

plt.show()
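
Problem 4 also asks for the results to be stored as a csv file. Here is a minimal sketch of that step using the standard library's `csv` module. The column names come from the problem statement; the two rows are made-up placeholders, not real scores, and `write_games` is a hypothetical helper.

```python
import csv
import io

# Column names come from the problem statement; the rows are made-up placeholders
fieldnames = ["year", "game_num", "opposing_team", "browns_score", "opp_score"]
games = [
    {"year": 2019, "game_num": 1, "opposing_team": "Team A", "browns_score": 10, "opp_score": 7},
    {"year": 2019, "game_num": 2, "opposing_team": "Team B", "browns_score": 3, "opp_score": 14},
]


def write_games(rows, handle):
    """Write one header line, then one line per game, to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Write to an in-memory buffer so the result can be inspected without a file
buffer = io.StringIO()
write_games(games, buffer)
csv_text = buffer.getvalue()
print(csv_text)

# To write a real file instead:
# with open("browns_scores.csv", "w", newline="") as f:
#     write_games(games, f)
```

If you keep the results in a `pandas` DataFrame as in the sample solution, `scores.to_csv("browns_scores.csv", index=False)` would be the equivalent step.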

In [None]:
## Your Code

## 5. Sample Solution
articles = []
authors = []
dates = []


for i in range(1,6):
    print("Working on page",i)
    page_url = "https://fivethirtyeight.com/politics/features/page/" + str(i) + "/"
    page_html = urlopen(page_url)
    page_soup = BeautifulSoup(page_html,'html')
    
    for a in page_soup.find_all('a',{'class':"post-thumbnail"}):
        post_url = a['href']
        post_html = urlopen(post_url)
        post_soup = BeautifulSoup(post_html,'html')
        
        articles.append(post_soup.find('h1',{'class':"article-title article-title-single entry-title"})
                            .text.replace("\n","").replace("\t",""))
        authors.append(post_soup.find('p',{'class':"single-metadata single-byline vcard"}).text.replace("By ",""))
        dates.append(post_soup.find('p',{'class':"topic single-topic"}).time.text)
        
# the first four articles      
articles[:4]

This notebook was written for the Erdős Institute Cőde Data Science Boot Camp by Matthew Osborne, Ph.D., 2021.

Redistribution of the material contained in this repository is conditional on acknowledgement of Matthew Tyler Osborne, Ph.D.'s original authorship and sponsorship of the Erdős Institute as subject to the license (see License.md).