## Hockey Scraping Activity
Use Selenium to open the page: https://www.scrapethissite.com/pages/forms/

Automate entry into the search box for "Detroit Red Wings"

Scrape the table headers and data and create a pandas dataframe

Answer: How many wins did the Detroit Red Wings have from 1990-2011?

In [1]:
import pandas as pd
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

Here's an analogy for how the two Selenium objects relate:

Think of the service object as a car engine. It runs independently and handles the technical details, but it doesn't actually drive the car.
The webdriver object is like the driver's seat and steering wheel. It works through the engine (the service object) to control the car (the browser) and navigate the road (web pages).

In summary:

The service object starts and manages the browser driver process (ChromeDriver for Chrome, geckodriver for Firefox).
The webdriver object lets you interact with and control the browser.
Both work together to enable web automation with Selenium.

In [2]:
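# Option 1 (Chrome): prepare a ChromeDriver service instance from a local executable
# chrome_service = Service(executable_path="./chromedriver", log_path=None)
# driver = webdriver.Chrome(service=chrome_service)

# Option 2 (Chrome): if you cannot get a ChromeDriver that matches your Chrome
# version, webdriver_manager can download one as a workaround
# from webdriver_manager.chrome import ChromeDriverManager
# driver = webdriver.Chrome(service=Service(executable_path=ChromeDriverManager().install(), log_path=None))

# Option 3 (Firefox): used for the rest of this notebook
from selenium import webdriver
from selenium.webdriver.firefox.service import Service  # replaces the Chrome Service imported in cell 1

# Service object for a local geckodriver executable
firefox_service = Service(executable_path="./geckodriver", log_path=None)

# No service is passed here, so Selenium locates a driver on its own.
# To use the local geckodriver instead: webdriver.Firefox(service=firefox_service)
driver = webdriver.Firefox()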

The version of firefox cannot be detected. Trying with latest driver version


In [3]:
driver.get('https://www.scrapethissite.com/pages/forms/')

In [4]:
# Locate the search box by its absolute XPath
xpath = '/html/body/div/section/div/div[4]/div/form/input[1]'

search_box = driver.find_element(by='xpath', value=xpath)
search_box

<selenium.webdriver.remote.webelement.WebElement (session="3e0036bb-8343-4524-bc5e-6742dfe7a2df", element="ba78e71e-bb2d-418b-8020-5bc934160627")>

In [5]:
# Enter detroit red wings in the search box
search_box.send_keys('detroit red wings')

In [6]:
# Submit your entry
search_box.submit()

In [7]:
html = driver.page_source
html

'<html lang="en"><head>\n    <meta charset="utf-8">\n    <title>Hockey Teams: Forms, Searching and Pagination | Scrape This Site | A public sandbox for learning web scraping</title>\n    <link rel="icon" type="image/png" href="/static/images/scraper-icon.png">\n\n    <meta name="viewport" content="width=device-width, initial-scale=1.0">\n    <meta name="description" content="Browse through a database of NHL team stats since 1990. Practice building a scraper that handles common website interface components.">\n\n    <link href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css" rel="stylesheet" integrity="sha256-MfvZlkHCEqatNoGiOXveE8FIwMzZg4W85qfrfIFBfYc= sha512-dTfge/zgoMYpP7QbHy4gWMEGsbsdZeCXz7irItjcC3sPUFtf0kuFbDz/ixG7ArTxmDjLXDmezHubeNikyKGVyQ==" crossorigin="anonymous">\n    <link href="https://fonts.googleapis.com/css?family=Lato:400,700" rel="stylesheet" type="text/css">\n    <link rel="stylesheet" type="text/css" href="/static/css/styles.css">\n\n    \n<meta

In [8]:
# Make a soup object from the HTML; naming the parser avoids BeautifulSoup guessing one
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify()[:500])

<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Hockey Teams: Forms, Searching and Pagination | Scrape This Site | A public sandbox for learning web scraping
  </title>
  <link href="/static/images/scraper-icon.png" rel="icon" type="image/png"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <meta content="Browse through a database of NHL team stats since 1990. Practice building a scraper that handles common website interface components." name="descri


In [9]:
table = soup.find('table')
table

<table class="table">
<tbody><tr>
<th>
                            Team Name
                        </th>
<th>
                            Year
                        </th>
<th>
                            Wins
                        </th>
<th>
                            Losses
                        </th>
<th>
                            OT Losses
                        </th>
<th>
                            Win %
                        </th>
<th>
                            Goals For (GF)
                        </th>
<th>
                            Goals Against (GA)
                        </th>
<th>
                            + / -
                        </th>
</tr>
<tr class="team">
<td class="name">
                            Boston Bruins
                        </td>
<td class="year">
                            1990
                        </td>
<td class="wins">
                            44
                        </td>
<td class="losses">
                      

In [10]:
rows = table.find_all('tr')
len(rows)

26

In [11]:
# The first row is the header row
header_row = rows[0]
header_row

<tr>
<th>
                            Team Name
                        </th>
<th>
                            Year
                        </th>
<th>
                            Wins
                        </th>
<th>
                            Losses
                        </th>
<th>
                            OT Losses
                        </th>
<th>
                            Win %
                        </th>
<th>
                            Goals For (GF)
                        </th>
<th>
                            Goals Against (GA)
                        </th>
<th>
                            + / -
                        </th>
</tr>

In [12]:
# Extract the column names from the header row
column_names = header_row.find_all('th')
column_names

[<th>
                             Team Name
                         </th>,
 <th>
                             Year
                         </th>,
 <th>
                             Wins
                         </th>,
 <th>
                             Losses
                         </th>,
 <th>
                             OT Losses
                         </th>,
 <th>
                             Win %
                         </th>,
 <th>
                             Goals For (GF)
                         </th>,
 <th>
                             Goals Against (GA)
                         </th>,
 <th>
                             + / -
                         </th>]

In [13]:
# Make a list of column names with text only
list_of_column_names = [column_name.text.strip() for column_name in column_names]
list_of_column_names

['Team Name',
 'Year',
 'Wins',
 'Losses',
 'OT Losses',
 'Win %',
 'Goals For (GF)',
 'Goals Against (GA)',
 '+ / -']

In [14]:
# Now get the data for a single row
data_row_1 = rows[1].find_all('td')
data_row_1


[<td class="name">
                             Boston Bruins
                         </td>,
 <td class="year">
                             1990
                         </td>,
 <td class="wins">
                             44
                         </td>,
 <td class="losses">
                             24
                         </td>,
 <td class="ot-losses">
 </td>,
 <td class="pct text-success">
                             0.55
                         </td>,
 <td class="gf">
                             299
                         </td>,
 <td class="ga">
                             264
                         </td>,
 <td class="diff text-success">
                             35
                         </td>]

In [15]:
# Get text only
data_row_1_text = [cell.text.strip() for cell in data_row_1]
data_row_1_text

['Boston Bruins', '1990', '44', '24', '', '0.55', '299', '264', '35']

In [16]:
# Now we will create a loop to extract the data from each row

# Instantiate an empty list
row_data = []

# Now make a loop to repeat for each row
for data_row in rows[1:]: # Start at index 1 to skip the header row
    data_text = [cell.text.strip() for cell in data_row.find_all("td")]
    row_data.append(data_text)

# display result
row_data

[['Boston Bruins', '1990', '44', '24', '', '0.55', '299', '264', '35'],
 ['Buffalo Sabres', '1990', '31', '30', '', '0.388', '292', '278', '14'],
 ['Calgary Flames', '1990', '46', '26', '', '0.575', '344', '263', '81'],
 ['Chicago Blackhawks', '1990', '49', '23', '', '0.613', '284', '211', '73'],
 ['Detroit Red Wings', '1990', '34', '38', '', '0.425', '273', '298', '-25'],
 ['Edmonton Oilers', '1990', '37', '37', '', '0.463', '272', '272', '0'],
 ['Hartford Whalers', '1990', '31', '38', '', '0.388', '238', '276', '-38'],
 ['Los Angeles Kings', '1990', '46', '24', '', '0.575', '340', '254', '86'],
 ['Minnesota North Stars',
  '1990',
  '27',
  '39',
  '',
  '0.338',
  '256',
  '266',
  '-10'],
 ['Montreal Canadiens', '1990', '39', '30', '', '0.487', '273', '249', '24'],
 ['New Jersey Devils', '1990', '32', '33', '', '0.4', '272', '264', '8'],
 ['New York Islanders', '1990', '25', '45', '', '0.312', '223', '290', '-67'],
 ['New York Rangers', '1990', '36', '31', '', '0.45', '297', '265',

In [17]:
# Make a dataframe from the row data and the list of column names
df = pd.DataFrame(row_data, columns=list_of_column_names)
df.head()

            Team Name  Year  Wins  Losses  OT Losses  Win %  Goals For (GF)  Goals Against (GA)  + / -
0       Boston Bruins  1990    44      24              0.55             299                 264     35
1      Buffalo Sabres  1990    31      30             0.388             292                 278     14
2      Calgary Flames  1990    46      26             0.575             344                 263     81
3  Chicago Blackhawks  1990    49      23             0.613             284                 211     73
4   Detroit Red Wings  1990    34      38             0.425             273                 298    -25
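
Cells 10 through 17 can be folded into one reusable function. The sketch below is one way to do that, not the only one. `table_to_dataframe` and `sample_html` are illustrative names, and the sample table is hand-written in the same shape as the scraped one so the block runs without a browser.

```python
import pandas as pd
from bs4 import BeautifulSoup


def table_to_dataframe(html):
    """Parse the first <table> in `html` into a DataFrame of strings."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        raise ValueError("no <table> element found")

    rows = table.find_all("tr")
    # The first row holds the <th> header cells; the rest hold <td> data cells
    headers = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    data = [
        [td.get_text(strip=True) for td in row.find_all("td")]
        for row in rows[1:]
    ]
    return pd.DataFrame(data, columns=headers)


# Hand-written sample in the same shape as the scraped table
sample_html = """
<table class="table">
  <tr><th> Team Name </th><th> Year </th><th> Wins </th></tr>
  <tr class="team">
    <td class="name"> Detroit Red Wings </td>
    <td class="year"> 1990 </td>
    <td class="wins"> 34 </td>
  </tr>
</table>
"""

sample_df = table_to_dataframe(sample_html)
print(sample_df)
```

In the notebook itself the call would be `df = table_to_dataframe(html)`, with `html` taken from `driver.page_source`.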


In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Team Name           25 non-null     object
 1   Year                25 non-null     object
 2   Wins                25 non-null     object
 3   Losses              25 non-null     object
 4   OT Losses           25 non-null     object
 5   Win %               25 non-null     object
 6   Goals For (GF)      25 non-null     object
 7   Goals Against (GA)  25 non-null     object
 8   + / -               25 non-null     object
dtypes: object(9)
memory usage: 1.9+ KB


In [19]:
df['Wins'] = pd.to_numeric(df['Wins'], errors='coerce')
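
Only `Wins` is converted above, so the other numeric columns remain strings. Below is a sketch of converting them all at once, using two rows copied from the `row_data` output. The `numeric_columns` list is an assumption about which columns should be numeric.

```python
import pandas as pd

columns = ["Team Name", "Year", "Wins", "Losses", "OT Losses", "Win %",
           "Goals For (GF)", "Goals Against (GA)", "+ / -"]
# Two rows copied from the scraped row_data; every value is a string
rows = [
    ["Boston Bruins", "1990", "44", "24", "", "0.55", "299", "264", "35"],
    ["Detroit Red Wings", "1990", "34", "38", "", "0.425", "273", "298", "-25"],
]
df = pd.DataFrame(rows, columns=columns)

# Assumption: every column except the team name should be numeric
numeric_columns = [name for name in columns if name != "Team Name"]

# errors="coerce" turns text that cannot be parsed, such as the empty
# OT Losses cells, into NaN instead of raising an error
df[numeric_columns] = df[numeric_columns].apply(pd.to_numeric, errors="coerce")

print(df.dtypes)
```

Coercing is convenient here because the empty `OT Losses` cells would otherwise raise an error, but it also hides genuinely malformed values, so checking `df.isna().sum()` afterwards is a reasonable precaution.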

In [20]:
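# What is the total number of wins for Detroit from 1990-2011?
# Note: this sums every row in df, so it answers the question only when df
# contains Detroit Red Wings rows alone. The df.head() output above lists other teams.
df['Wins'].sum()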

862
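
Because `df.head()` shows teams other than Detroit, `df['Wins'].sum()` adds up every scraped row, not just Detroit's seasons. A hedged sketch of filtering before summing follows, using three rows copied from the `row_data` output. `total_wins` is a hypothetical helper, and a real run would pass the full DataFrame.

```python
import pandas as pd

# Three rows copied from the scraped row_data; a real run would use the full df
df = pd.DataFrame(
    [
        ["Boston Bruins", "1990", "44"],
        ["Chicago Blackhawks", "1990", "49"],
        ["Detroit Red Wings", "1990", "34"],
    ],
    columns=["Team Name", "Year", "Wins"],
)
df["Year"] = pd.to_numeric(df["Year"], errors="coerce")
df["Wins"] = pd.to_numeric(df["Wins"], errors="coerce")


def total_wins(frame, team, first_year, last_year):
    """Sum Wins for one team over an inclusive range of years."""
    mask = (frame["Team Name"] == team) & frame["Year"].between(first_year, last_year)
    return int(frame.loc[mask, "Wins"].sum())


detroit_wins = total_wins(df, "Detroit Red Wings", 1990, 2011)
all_wins = int(df["Wins"].sum())
print(detroit_wins, all_wins)  # prints: 34 127
```

On this three-row sample the filtered total differs from the unfiltered one, which is the same gap that affects the notebook's answer.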