## Hockey Scraping Activity
Use Selenium to go to the home page: https://www.scrapethissite.com/pages/forms/

Automate entry into the search box for "Detroit Red Wings"

Scrape the table headers and data and create a pandas dataframe

Answer: How many wins did the Detroit Red Wings have from 1990-2011?

In [22]:
import pandas as pd
import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

The `Service` and `webdriver` imports play different roles. Here's an analogy:

Think of the service object as a car engine. It runs independently and manages the technical details, but it doesn't drive the car.
The webdriver object is like the driver's seat and steering wheel. It works through the engine (service object) to control the car (browser) and navigate the road (webpages).

In summary:

The service object manages the technical aspects of running ChromeDriver.
The webdriver object lets you interact with and control the Chrome browser.
Both work together to enable web automation with Selenium.

In [23]:
# Prepare the ChromeDriver service instance
# chrome_executable = Service(executable_path="./chromedriver", log_path=None)

# driver = webdriver.Chrome()

# If you cannot get a ChromeDriver version that matches your Chrome browser, you can use this workaround

from webdriver_manager.chrome import ChromeDriverManager
# Initiate the driver object

driver = webdriver.Chrome(service=Service(executable_path=ChromeDriverManager().install(), log_path=None))

In [24]:
driver.get('https://www.scrapethissite.com/pages/forms/')

In [25]:
# Locate the search box (e.g. with the browser's developer tools) and copy its XPath
xpath = '/html/body/div/section/div/div[4]/div/form/input[1]'

search_box = driver.find_element(by='xpath', value=xpath)
search_box

<selenium.webdriver.remote.webelement.WebElement (session="83a66e0669f363db7b9deb0c2809d0d7", element="E29374B155E80467909E3DBEE6E99DDC_element_15")>

In [26]:
# Enter detroit red wings in the search box
search_box.send_keys('detroit red wings')

In [27]:
# Submit your entry
search_box.submit()

In [28]:
html = driver.page_source
html

'<html lang="en"><head>\n    <meta charset="utf-8">\n    <title>Hockey Teams: Forms, Searching and Pagination | Scrape This Site | A public sandbox for learning web scraping</title>\n    <link rel="icon" type="image/png" href="/static/images/scraper-icon.png">\n\n    <meta name="viewport" content="width=device-width, initial-scale=1.0">\n    <meta name="description" content="Browse through a database of NHL team stats since 1990. Practice building a scraper that handles common website interface components.">\n\n    <link href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css" rel="stylesheet" integrity="sha256-MfvZlkHCEqatNoGiOXveE8FIwMzZg4W85qfrfIFBfYc= sha512-dTfge/zgoMYpP7QbHy4gWMEGsbsdZeCXz7irItjcC3sPUFtf0kuFbDz/ixG7ArTxmDjLXDmezHubeNikyKGVyQ==" crossorigin="anonymous">\n    <link href="https://fonts.googleapis.com/css?family=Lato:400,700" rel="stylesheet" type="text/css">\n    <link rel="stylesheet" type="text/css" href="/static/css/styles.css">\n\n    \n<meta

In [9]:
# Make a soup object from the HTML; naming the parser avoids BeautifulSoup's "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')
print(soup.prettify()[:500])

<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Hockey Teams: Forms, Searching and Pagination | Scrape This Site | A public sandbox for learning web scraping
  </title>
  <link href="/static/images/scraper-icon.png" rel="icon" type="image/png"/>
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <meta content="Browse through a database of NHL team stats since 1990. Practice building a scraper that handles common website interface components." name="descri


In [10]:
table = soup.find('table')
table

<table class="table">
<tbody><tr>
<th>
                            Team Name
                        </th>
<th>
                            Year
                        </th>
<th>
                            Wins
                        </th>
<th>
                            Losses
                        </th>
<th>
                            OT Losses
                        </th>
<th>
                            Win %
                        </th>
<th>
                            Goals For (GF)
                        </th>
<th>
                            Goals Against (GA)
                        </th>
<th>
                            + / -
                        </th>
</tr>
<tr class="team">
<td class="name">
                            Detroit Red Wings
                        </td>
<td class="year">
                            1990
                        </td>
<td class="wins">
                            34
                        </td>
<td class="losses">
                  

In [11]:
rows = table.find_all('tr')
len(rows)

22

In [12]:
# The first row is the header row
header_row = rows[0]
header_row

<tr>
<th>
                            Team Name
                        </th>
<th>
                            Year
                        </th>
<th>
                            Wins
                        </th>
<th>
                            Losses
                        </th>
<th>
                            OT Losses
                        </th>
<th>
                            Win %
                        </th>
<th>
                            Goals For (GF)
                        </th>
<th>
                            Goals Against (GA)
                        </th>
<th>
                            + / -
                        </th>
</tr>

In [13]:
# Extract the column names from the header row
column_names = header_row.find_all('th')
column_names

[<th>
                             Team Name
                         </th>,
 <th>
                             Year
                         </th>,
 <th>
                             Wins
                         </th>,
 <th>
                             Losses
                         </th>,
 <th>
                             OT Losses
                         </th>,
 <th>
                             Win %
                         </th>,
 <th>
                             Goals For (GF)
                         </th>,
 <th>
                             Goals Against (GA)
                         </th>,
 <th>
                             + / -
                         </th>]

In [14]:
# Make a list of column names with text only
list_of_column_names = [column_name.text.strip() for column_name in column_names]
list_of_column_names

['Team Name',
 'Year',
 'Wins',
 'Losses',
 'OT Losses',
 'Win %',
 'Goals For (GF)',
 'Goals Against (GA)',
 '+ / -']
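
The header cells carry a lot of surrounding whitespace, which is why the text is stripped. The same step can be tried without a browser on a small hand-written fragment; `sample_html` below is made up to mimic the page's markup and is not the real page. `get_text(strip=True)` is used here as an equivalent of `.text.strip()` for cells that contain a single piece of text.

```python
from bs4 import BeautifulSoup

# A small hand-written fragment that mimics the page's header row.
sample_html = """
<table class="table">
  <tr>
    <th>
        Team Name
    </th>
    <th>
        Year
    </th>
    <th>
        Wins
    </th>
  </tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Strip the newlines and indentation around each header label.
headers = [th.get_text(strip=True) for th in sample_soup.find_all("th")]
print(headers)
```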

In [15]:
# Now get the data for a single row
data_row_1 = rows[1].find_all('td')
data_row_1


[<td class="name">
                             Detroit Red Wings
                         </td>,
 <td class="year">
                             1990
                         </td>,
 <td class="wins">
                             34
                         </td>,
 <td class="losses">
                             38
                         </td>,
 <td class="ot-losses">
 </td>,
 <td class="pct text-danger">
                             0.425
                         </td>,
 <td class="gf">
                             273
                         </td>,
 <td class="ga">
                             298
                         </td>,
 <td class="diff text-danger">
                             -25
                         </td>]

In [16]:
# Get text only
data_row_1_text = [cell.text.strip() for cell in data_row_1]
data_row_1_text

['Detroit Red Wings', '1990', '34', '38', '', '0.425', '273', '298', '-25']

In [17]:
# Now we will create a loop to extract the data from each row

# Instantiate an empty list
row_data = []

# Now make a loop to repeat for each row
for data_row in rows[1:]: # Start at index 1 to skip the header row
    data_text = [cell.text.strip() for cell in data_row.find_all("td")]
    row_data.append(data_text)

# display result
row_data

[['Detroit Red Wings', '1990', '34', '38', '', '0.425', '273', '298', '-25'],
 ['Detroit Red Wings', '1991', '43', '25', '', '0.537', '320', '256', '64'],
 ['Detroit Red Wings', '1992', '47', '28', '', '0.56', '369', '280', '89'],
 ['Detroit Red Wings', '1993', '46', '30', '', '0.548', '356', '275', '81'],
 ['Detroit Red Wings', '1994', '33', '11', '', '0.688', '180', '117', '63'],
 ['Detroit Red Wings', '1995', '62', '13', '', '0.756', '325', '181', '144'],
 ['Detroit Red Wings', '1996', '38', '26', '', '0.463', '253', '197', '56'],
 ['Detroit Red Wings', '1997', '44', '23', '', '0.537', '250', '196', '54'],
 ['Detroit Red Wings', '1998', '43', '32', '', '0.524', '245', '202', '43'],
 ['Detroit Red Wings', '1999', '48', '22', '2', '0.585', '278', '210', '68'],
 ['Detroit Red Wings', '2000', '49', '20', '4', '0.598', '253', '202', '51'],
 ['Detroit Red Wings', '2001', '51', '17', '4', '0.622', '251', '187', '64'],
 ['Detroit Red Wings', '2002', '48', '20', '4', '0.585', '269', '203', '
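
The row loop can also be wrapped in a small function so it is reusable on any table. This is a sketch under stated assumptions: `parse_rows` is a hypothetical name, and `sample_html` is a hand-written fragment that reuses two seasons from the output above. Because a header row contains only `<th>` cells, it yields no `<td>` text and is skipped without slicing.

```python
from bs4 import BeautifulSoup


def parse_rows(table):
    """Return the stripped text of every <td> cell, one list per data row."""
    parsed = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:  # header rows contain <th> only, so they yield no cells
            parsed.append(cells)
    return parsed


# Hand-written fragment; the empty OT Losses cell mirrors the 1990 season.
sample_html = """
<table>
  <tr><th>Team Name</th><th>Year</th><th>Wins</th><th>OT Losses</th></tr>
  <tr class="team"><td class="name"> Detroit Red Wings </td><td class="year"> 1990 </td>
      <td class="wins"> 34 </td><td class="ot-losses"></td></tr>
  <tr class="team"><td class="name"> Detroit Red Wings </td><td class="year"> 1999 </td>
      <td class="wins"> 48 </td><td class="ot-losses"> 2 </td></tr>
</table>
"""
sample_table = BeautifulSoup(sample_html, "html.parser").find("table")
sample_rows = parse_rows(sample_table)
print(sample_rows)
```

Empty cells come back as empty strings, which matters later when the columns are converted to numbers.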

In [18]:
# Make a dataframe from the row data and the list of column names
df = pd.DataFrame(row_data, columns=list_of_column_names)
df.head()

           Team Name  Year  Wins  Losses  OT Losses  Win %  Goals For (GF)  Goals Against (GA)  + / -
0  Detroit Red Wings  1990    34      38             0.425             273                 298    -25
1  Detroit Red Wings  1991    43      25             0.537             320                 256     64
2  Detroit Red Wings  1992    47      28              0.56             369                 280     89
3  Detroit Red Wings  1993    46      30             0.548             356                 275     81
4  Detroit Red Wings  1994    33      11             0.688             180                 117     63


In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21 entries, 0 to 20
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Team Name           21 non-null     object
 1   Year                21 non-null     object
 2   Wins                21 non-null     object
 3   Losses              21 non-null     object
 4   OT Losses           21 non-null     object
 5   Win %               21 non-null     object
 6   Goals For (GF)      21 non-null     object
 7   Goals Against (GA)  21 non-null     object
 8   + / -               21 non-null     object
dtypes: object(9)
memory usage: 1.6+ KB


In [20]:
df['Wins'] = pd.to_numeric(df['Wins'], errors='coerce')
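
Every scraped column arrives as text, so the same conversion may be useful for more than `Wins`. The sketch below applies `pd.to_numeric` to several columns at once on a small hand-built frame (`sample` and `numeric_cols` are illustrative names; the values are taken from the 1990 and 1999 rows above). With `errors='coerce'`, the empty `OT Losses` string becomes `NaN` instead of raising an error.

```python
import pandas as pd

# Hand-built frame that mimics the scraped data: every value is a string.
sample = pd.DataFrame(
    [["1990", "34", ""], ["1999", "48", "2"]],
    columns=["Year", "Wins", "OT Losses"],
)

# Convert each listed column; values that cannot be parsed become NaN.
numeric_cols = ["Year", "Wins", "OT Losses"]
sample[numeric_cols] = sample[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(sample.dtypes)
print(sample)
```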

In [21]:
# What is the total number of wins for Detroit from 1990-2011?
df['Wins'].sum()

986
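
The sum above relies on the search returning only the seasons of interest. If the table held other years as well, the range could be stated explicitly with a filter. A minimal sketch, assuming `Year` and `Wins` have been converted to numbers; `wins_between` is a hypothetical helper and `sample` holds only the first three seasons from the output above.

```python
import pandas as pd

# First three seasons from the scraped output, converted to numbers.
sample = pd.DataFrame(
    {"Year": ["1990", "1991", "1992"], "Wins": ["34", "43", "47"]}
).apply(pd.to_numeric)


def wins_between(frame, first_year, last_year):
    """Sum the Wins column for seasons in [first_year, last_year]."""
    mask = frame["Year"].between(first_year, last_year)  # inclusive on both ends
    return int(frame.loc[mask, "Wins"].sum())


print(wins_between(sample, 1990, 1991))  # 34 + 43 = 77
```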