### In this notebook, we work through a more advanced example: a website whose data sits in an HTML table spread across several pages.

### Goals of this scraping exercise:
- Extract the table data successfully from every page
- Optionally save the data to a CSV file

In [1]:
from bs4 import BeautifulSoup as bs
import requests

In [4]:
site = requests.get('https://www.scrapethissite.com/pages/forms/')

In [6]:
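# Name the parser explicitly so results do not depend on which parsers are installed
soup = bs(site.content, 'html.parser')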
print(soup.prettify())



<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Hockey Teams: Forms, Searching and Pagination | Scrape This Site | A public sandbox for learning web scraping
  </title>
  <link href="/static/images/scraper-icon.png" rel="icon" type="image/png"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <meta content="Browse through a database of NHL team stats since 1990. Practice building a scraper that handles common website interface components." name="description"/>
  <link crossorigin="anonymous" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css" integrity="sha256-MfvZlkHCEqatNoGiOXveE8FIwMzZg4W85qfrfIFBfYc= sha512-dTfge/zgoMYpP7QbHy4gWMEGsbsdZeCXz7irItjcC3sPUFtf0kuFbDz/ixG7ArTxmDjLXDmezHubeNikyKGVyQ==" rel="stylesheet"/>
  <link href="https://fonts.googleapis.com/css?family=Lato:400,700" rel="stylesheet" type="text/css"/>
  <link href="/static/css/styles.css" rel="stylesheet" type="text/css"/>
  <meta con

In [13]:
print(soup.title.get_text())

Hockey Teams: Forms, Searching and Pagination | Scrape This Site | A public sandbox for learning web scraping


In [26]:
# get intro text for site

lead = soup.find('p',attrs={'class': 'lead'})
print(lead.get_text(strip=True))

# alternative: the same lookup written as a CSS selector
lead = soup.select_one('p.lead')
print(lead.get_text(strip=True))

Browse through a database of NHL team stats since 1990. Practice building a scraper that handles common website interface components.
                            Take a look at how pagination and search elements change the URL as your browse. Build a web scraper that can conduct searches and paginate through the results.
Browse through a database of NHL team stats since 1990. Practice building a scraper that handles common website interface components.
                            Take a look at how pagination and search elements change the URL as your browse. Build a web scraper that can conduct searches and paginate through the results.


### Creating a list to store the data in a structured format

In [45]:
data = []

In [48]:
teams = soup.select('table tr.team')

for team in teams:
  name = team.select_one('td.name')
  year = team.select_one('td.year')
  wins = team.select_one('td.wins')
  losses = team.select_one('td.losses')
  goals_for = team.select_one('td.gf')
  goals_against = team.select_one('td.ga')
  print(name.get_text(strip=True), year.get_text(strip=True), wins.get_text(strip=True), losses.get_text(strip=True))

  data.append({
    'name': name.get_text(strip=True),
    'year': int(year.get_text(strip=True)),
    'wins': int(wins.get_text(strip=True)),
    'losses': int(losses.get_text(strip=True)),
    'goals_for': int(goals_for.get_text(strip=True)),
    'goals_against': int(goals_against.get_text(strip=True)),
  })

Boston Bruins 1990 44 24
Buffalo Sabres 1990 31 30
Calgary Flames 1990 46 26
Chicago Blackhawks 1990 49 23
Detroit Red Wings 1990 34 38
Edmonton Oilers 1990 37 37
Hartford Whalers 1990 31 38
Los Angeles Kings 1990 46 24
Minnesota North Stars 1990 27 39
Montreal Canadiens 1990 39 30
New Jersey Devils 1990 32 33
New York Islanders 1990 25 45
New York Rangers 1990 36 31
Philadelphia Flyers 1990 33 37
Pittsburgh Penguins 1990 41 33
Quebec Nordiques 1990 16 50
St. Louis Blues 1990 47 22
Toronto Maple Leafs 1990 23 46
Vancouver Canucks 1990 28 43
Washington Capitals 1990 37 36
Winnipeg Jets 1990 26 43
Boston Bruins 1991 36 32
Buffalo Sabres 1991 31 37
Calgary Flames 1991 31 37
Chicago Blackhawks 1991 36 29


In [49]:
data

[{'name': 'Boston Bruins', 'year': 1990, 'wins': 44, 'losses': 24},
 {'name': 'Buffalo Sabres', 'year': 1990, 'wins': 31, 'losses': 30},
 {'name': 'Calgary Flames', 'year': 1990, 'wins': 46, 'losses': 26},
 {'name': 'Chicago Blackhawks', 'year': 1990, 'wins': 49, 'losses': 23},
 {'name': 'Detroit Red Wings', 'year': 1990, 'wins': 34, 'losses': 38},
 {'name': 'Edmonton Oilers', 'year': 1990, 'wins': 37, 'losses': 37},
 {'name': 'Hartford Whalers', 'year': 1990, 'wins': 31, 'losses': 38},
 {'name': 'Los Angeles Kings', 'year': 1990, 'wins': 46, 'losses': 24},
 {'name': 'Minnesota North Stars', 'year': 1990, 'wins': 27, 'losses': 39},
 {'name': 'Montreal Canadiens', 'year': 1990, 'wins': 39, 'losses': 30},
 {'name': 'New Jersey Devils', 'year': 1990, 'wins': 32, 'losses': 33},
 {'name': 'New York Islanders', 'year': 1990, 'wins': 25, 'losses': 45},
 {'name': 'New York Rangers', 'year': 1990, 'wins': 36, 'losses': 31},
 {'name': 'Philadelphia Flyers', 'year': 1990, 'wins': 33, 'losses': 37
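
The loop above repeats the same cell lookup for every column, and `int()` raises an error if a cell is missing or blank. The following is a minimal sketch of one way to factor that out, assuming the `td.name`, `td.year`, `td.wins`, `td.losses`, `td.gf` and `td.ga` classes used above. `FIELDS`, `to_int` and `parse_team_row` are hypothetical names introduced here for illustration, not part of BeautifulSoup, and the sample row reuses values that appear in this notebook's output.

```python
from bs4 import BeautifulSoup

# Output key -> CSS class of the matching <td>, as used in the loop above.
FIELDS = {
    'name': 'name',
    'year': 'year',
    'wins': 'wins',
    'losses': 'losses',
    'goals_for': 'gf',
    'goals_against': 'ga',
}

def to_int(text):
    """Convert cell text to int; return None for blank or non-numeric text."""
    try:
        return int(text)
    except ValueError:
        return None

def parse_team_row(row):
    """Hypothetical helper: turn one <tr class="team"> element into a dict."""
    record = {}
    for key, css_class in FIELDS.items():
        cell = row.select_one(f'td.{css_class}')
        text = cell.get_text(strip=True) if cell is not None else ''
        # The team name stays a string; every other column is numeric.
        record[key] = text if key == 'name' else to_int(text)
    return record

# Inline sample so the sketch runs without a network request.
sample_html = """
<table>
  <tr class="team">
    <td class="name"> Detroit Red Wings </td>
    <td class="year"> 1995 </td>
    <td class="wins"> 62 </td>
    <td class="losses"> 13 </td>
    <td class="gf"> 325 </td>
    <td class="ga"> 181 </td>
  </tr>
</table>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_rows = [parse_team_row(tr) for tr in sample_soup.select('table tr.team')]
print(sample_rows)
```

With a helper like this, the body of the scraping loop could shrink to `data.append(parse_team_row(team))`, and a missing cell would show up as `None` in the data instead of stopping the run.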

## Now that we have an idea of the data, we can loop over all 25 pages and extract the data from each one.

In [51]:
import pandas as pd

NUM_PAGES = 25
data = []

# Start at page 1: range(NUM_PAGES + 1) would also request page_num=0
for i in range(1, NUM_PAGES + 1):
  site = requests.get(f'https://www.scrapethissite.com/pages/forms/?page_num={i}')
  soup = bs(site.content, 'html.parser')

  teams = soup.select('table tr.team')

  for team in teams:
    name = team.select_one('td.name')
    year = team.select_one('td.year')
    wins = team.select_one('td.wins')
    losses = team.select_one('td.losses')
    goals_for = team.select_one('td.gf')
    goals_against = team.select_one('td.ga')
    print(name.get_text(strip=True), year.get_text(strip=True), wins.get_text(strip=True), losses.get_text(strip=True))

    data.append({
      'name': name.get_text(strip=True),
      'year': int(year.get_text(strip=True)),
      'wins': int(wins.get_text(strip=True)),
      'losses': int(losses.get_text(strip=True)),
      'goals_for': int(goals_for.get_text(strip=True)),
      'goals_against': int(goals_against.get_text(strip=True)),
    })

df = pd.DataFrame(data)


Boston Bruins 1990 44 24
Buffalo Sabres 1990 31 30
Calgary Flames 1990 46 26
Chicago Blackhawks 1990 49 23
Detroit Red Wings 1990 34 38
Edmonton Oilers 1990 37 37
Hartford Whalers 1990 31 38
Los Angeles Kings 1990 46 24
Minnesota North Stars 1990 27 39
Montreal Canadiens 1990 39 30
New Jersey Devils 1990 32 33
New York Islanders 1990 25 45
New York Rangers 1990 36 31
Philadelphia Flyers 1990 33 37
Pittsburgh Penguins 1990 41 33
Quebec Nordiques 1990 16 50
St. Louis Blues 1990 47 22
Toronto Maple Leafs 1990 23 46
Vancouver Canucks 1990 28 43
Washington Capitals 1990 37 36
Winnipeg Jets 1990 26 43
Boston Bruins 1991 36 32
Buffalo Sabres 1991 31 37
Calgary Flames 1991 31 37
Chicago Blackhawks 1991 36 29
Detroit Red Wings 1991 43 25
Edmonton Oilers 1991 36 34
Hartford Whalers 1991 26 41
Los Angeles Kings 1991 35 31
Minnesota North Stars 1991 32 42
Montreal Canadiens 1991 41 28
New Jersey Devils 1991 38 31
New York Islanders 1991 34 35
New York Rangers 1991 50 25
Philadelphia Flyers 1991 32
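
The loop above hard-codes the number of pages. A possible alternative, sketched below, is to keep requesting pages until one comes back with no rows. This assumes that an out-of-range page returns a table without any `tr.team` rows, which should be checked against the site before relying on it. `scrape_all_pages`, `fetch_rows` and `max_pages` are hypothetical names; `fetch_rows` stands in for the request-and-parse step so the sketch runs offline.

```python
def scrape_all_pages(fetch_rows, max_pages=100):
    """Collect rows from page 1 onward until a page comes back empty.

    fetch_rows is a hypothetical callable: page number -> list of row dicts.
    max_pages is a safety cap so an unexpected response cannot loop forever.
    """
    all_rows = []
    for page in range(1, max_pages + 1):
        rows = fetch_rows(page)
        if not rows:
            break  # first empty page: assume we are past the last one
        all_rows.extend(rows)
    return all_rows

# Stand-in for the real site: two pages of data, everything else empty.
fake_site = {
    1: [{'name': 'Boston Bruins', 'year': 1990}],
    2: [{'name': 'Buffalo Sabres', 'year': 1990}],
}
rows = scrape_all_pages(lambda page: fake_site.get(page, []))
print(len(rows))
```

In the notebook, `fetch_rows` would wrap the `requests.get` call and the `soup.select('table tr.team')` parsing from the cell above.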

In [55]:
df['win_percentage'] = df['wins'] / (df['wins'] + df['losses'])

# Sort by win percentage
df_sorted = df.sort_values(by='win_percentage', ascending=False)

In [56]:
df_sorted.head(10)

                    name  year  wins  losses  goals_for  goals_against  win_percentage
126    Detroit Red Wings  1995    62      13        325            181        0.826667
382    Detroit Red Wings  2005    58      16        305            209        0.783784
521  Washington Capitals  2009    54      15        318            233        0.782609
260   Colorado Avalanche  2000    52      16        270            192        0.764706
292    Detroit Red Wings  2001    51      17        251            187        0.750000
99     Detroit Red Wings  1994    33      11        180            117        0.750000
486      San Jose Sharks  2008    53      18        257            204        0.746479
550    Vancouver Canucks  2010    54      19        262            185        0.739726
464        Boston Bruins  2008    53      19        274            196        0.736111
321         Dallas Stars  2002    46      17        245            169        0.730159
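
The `win_percentage` column is wins divided by wins plus losses, so it only counts games recorded in those two columns. As a quick check of the arithmetic, here is the same formula in plain Python, applied to the first row of the table above. `win_percentage` as a function is a hypothetical helper, and returning `None` when there are no decided games is a choice made for this sketch.

```python
def win_percentage(wins, losses):
    """Share of decided games that were won; None if no games were decided."""
    decided = wins + losses
    if decided == 0:
        return None
    return wins / decided

# Detroit Red Wings, 1995: 62 wins and 13 losses.
pct = win_percentage(62, 13)
print(round(pct, 6))  # 0.826667
```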


In [61]:

top_teams_per_year = df.loc[df.groupby('year')['wins'].idxmax()]
# sorting by year just to keep things neat
top_teams_per_year = top_teams_per_year.sort_values(by='year')

print(top_teams_per_year[['year', 'name', 'wins']])

     year                 name  wins
3    1990   Chicago Blackhawks    49
33   1991     New York Rangers    50
58   1992  Pittsburgh Penguins    56
81   1993     New York Rangers    52
99   1994    Detroit Red Wings    33
126  1995    Detroit Red Wings    62
150  1996   Colorado Avalanche    49
178  1997         Dallas Stars    49
204  1998         Dallas Stars    51
247  1999      St. Louis Blues    51
260  2000   Colorado Avalanche    52
292  2001    Detroit Red Wings    51
332  2002      Ottawa Senators    52
352  2003    Detroit Red Wings    48
382  2005    Detroit Red Wings    58
405  2006       Buffalo Sabres    53
442  2007    Detroit Red Wings    54
464  2008        Boston Bruins    53
521  2009  Washington Capitals    54
550  2010    Vancouver Canucks    54
570  2011     New York Rangers    51
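
`df.groupby('year')['wins'].idxmax()` returns, for each year, the index label of the row with the most wins, and `df.loc[...]` then pulls those rows out. To make that logic explicit, here is a rough plain-Python equivalent over a list of dicts. `top_team_per_year` is a hypothetical helper, and the sample records are a small subset of the rows printed earlier in this notebook.

```python
def top_team_per_year(records):
    """Return {year: record}, keeping the record with the most wins per year.

    On a tie the earliest record is kept, which mirrors idxmax returning
    the first occurrence of the maximum.
    """
    best = {}
    for record in records:
        year = record['year']
        if year not in best or record['wins'] > best[year]['wins']:
            best[year] = record
    return best

sample = [
    {'name': 'Boston Bruins', 'year': 1990, 'wins': 44},
    {'name': 'Chicago Blackhawks', 'year': 1990, 'wins': 49},
    {'name': 'Boston Bruins', 'year': 1991, 'wins': 36},
    {'name': 'New York Rangers', 'year': 1991, 'wins': 50},
]
best = top_team_per_year(sample)
for year in sorted(best):
    print(year, best[year]['name'], best[year]['wins'])
```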


In [62]:
df.to_csv('nhl_data.csv', index=False)
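
If pandas is not available, the same list of dicts can be written with the standard library's `csv` module. The sketch below writes to an in-memory buffer instead of `nhl_data.csv` so that it leaves no file behind. `rows_to_csv` is a hypothetical helper name, and the sample row is taken from the output above.

```python
import csv
import io

def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text using only the standard library."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

rows = [{'name': 'Detroit Red Wings', 'year': 1995, 'wins': 62, 'losses': 13}]
csv_text = rows_to_csv(rows, ['name', 'year', 'wins', 'losses'])
print(csv_text)

# Reading the text back: every value comes back as a string.
read_back = list(csv.DictReader(io.StringIO(csv_text)))
print(read_back[0]['year'])
```

Note that CSV stores no type information, so numeric columns need converting again after a plain `csv` read.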