## Introduction:

This is a beginner project to help me understand the basics of web scraping, as explained by Alex Freberg.
In my next project, I will use this understanding to scrape multiple pages of the website, then transform the data and load it into a database (ETL).

In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
# get the url
url = 'https://www.scrapethissite.com/pages/forms/'

In [4]:
# send a request to the url to ensure the page is working
page = requests.get(url)
page

<Response [200]>

A response of 200 means the request succeeded and the server returned the contents of the page.
Any other status code (for example 404) means we did not get the page we need.
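To make the meaning of a status code concrete, here is a minimal sketch that uses only Python's standard library. The helper name `describe_status` is made up for illustration and is not part of `requests`. In the notebook, the integer to check is available as `page.status_code`, and `requests` also offers `page.raise_for_status()`, which raises an exception for 4xx and 5xx responses.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short label for an HTTP status code (hypothetical helper)."""
    status = HTTPStatus(code)
    # Codes from 200 to 299 signal success; anything else is treated
    # here as "not a success" to keep the sketch simple.
    outcome = "success" if 200 <= code < 300 else "not a success"
    return f"{code} {status.phrase} ({outcome})"


for code in (200, 404, 500):
    print(describe_status(code))
```

A check like this before parsing avoids feeding an error page to BeautifulSoup by mistake.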

In [7]:
# parse the information on the page in html
soup = BeautifulSoup(page.text, 'html')
soup

<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8"/>
<title>Hockey Teams: Forms, Searching and Pagination | Scrape This Site | A public sandbox for learning web scraping</title>
<link href="/static/images/scraper-icon.png" rel="icon" type="image/png"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="Browse through a database of NHL team stats since 1990. Practice building a scraper that handles common website interface components." name="description"/>
<link crossorigin="anonymous" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.5/css/bootstrap.min.css" integrity="sha256-MfvZlkHCEqatNoGiOXveE8FIwMzZg4W85qfrfIFBfYc= sha512-dTfge/zgoMYpP7QbHy4gWMEGsbsdZeCXz7irItjcC3sPUFtf0kuFbDz/ixG7ArTxmDjLXDmezHubeNikyKGVyQ==" rel="stylesheet"/>
<link href="https://fonts.googleapis.com/css?family=Lato:400,700" rel="stylesheet" type="text/css"/>
<link href="/static/css/styles.css" rel="stylesheet" type="text/css"/>
<meta content="noindex" name="robot

The output above is the full HTML of the page. Inspecting it lets us see which tags to
use for the scraping.

In [10]:
# after inspecting the page, let us call the particular tag we want from it
soup.find('table')

<table class="table">
<tr>
<th>
                            Team Name
                        </th>
<th>
                            Year
                        </th>
<th>
                            Wins
                        </th>
<th>
                            Losses
                        </th>
<th>
                            OT Losses
                        </th>
<th>
                            Win %
                        </th>
<th>
                            Goals For (GF)
                        </th>
<th>
                            Goals Against (GA)
                        </th>
<th>
                            + / -
                        </th>
</tr>
<tr class="team">
<td class="name">
                            Boston Bruins
                        </td>
<td class="year">
                            1990
                        </td>
<td class="wins">
                            44
                        </td>
<td class="losses">
                            2

From this, we can see that our column names are within the 'th' tags, the rows are in the 'tr' tags while the 
actual data for each row are in the 'td' tags.
These are the tags we are interested in.
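The roles of the three tags can be tried out on a tiny table before working with the real page. The sketch below uses a made-up HTML string (`sample_html`) that only mirrors the layout of the site, and it names the parser explicitly as `"html.parser"`, Python's built-in one. All variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# A made-up two-row table that mirrors the layout of the real page.
sample_html = """
<table class="table">
  <tr><th>Team Name</th><th>Year</th></tr>
  <tr class="team"><td class="name">Boston Bruins</td><td class="year">1990</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_table = sample_soup.find("table")

# 'th' cells hold the column names.
headers = [th.text.strip() for th in sample_table.find_all("th")]

# Each 'tr' is one row; the first row is the header row.
rows = sample_table.find_all("tr")

# 'td' cells hold the data of a row; the header row has none.
header_row_cells = rows[0].find_all("td")
first_data_row = [td.text.strip() for td in rows[1].find_all("td")]

print(headers)
print(len(rows))
print(header_row_cells)
print(first_data_row)
```

Note that the header row contains no 'td' cells, which matters later when the rows are loaded into a dataframe.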

In [18]:
# we can also select the table by its class and store it in a variable
table = soup.find('table', class_='table')
table

<table class="table">
<tr>
<th>
                            Team Name
                        </th>
<th>
                            Year
                        </th>
<th>
                            Wins
                        </th>
<th>
                            Losses
                        </th>
<th>
                            OT Losses
                        </th>
<th>
                            Win %
                        </th>
<th>
                            Goals For (GF)
                        </th>
<th>
                            Goals Against (GA)
                        </th>
<th>
                            + / -
                        </th>
</tr>
<tr class="team">
<td class="name">
                            Boston Bruins
                        </td>
<td class="year">
                            1990
                        </td>
<td class="wins">
                            44
                        </td>
<td class="losses">
                            2

In [21]:
# getting the column names for the table
table_col_names = table.find_all('th')
table_col_names

[<th>
                             Team Name
                         </th>,
 <th>
                             Year
                         </th>,
 <th>
                             Wins
                         </th>,
 <th>
                             Losses
                         </th>,
 <th>
                             OT Losses
                         </th>,
 <th>
                             Win %
                         </th>,
 <th>
                             Goals For (GF)
                         </th>,
 <th>
                             Goals Against (GA)
                         </th>,
 <th>
                             + / -
                         </th>]

In [23]:
# the result above comes in a list, so we can loop through it
# using list comprehension
# for every name in table_col_names, return the name of the cols in text and not html
col_names = [name.text for name in table_col_names]
col_names

['\n                            Team Name\n                        ',
 '\n                            Year\n                        ',
 '\n                            Wins\n                        ',
 '\n                            Losses\n                        ',
 '\n                            OT Losses\n                        ',
 '\n                            Win %\n                        ',
 '\n                            Goals For (GF)\n                        ',
 '\n                            Goals Against (GA)\n                        ',
 '\n                            + / -\n                        ']

We have extracted our column names, but they still contain unwanted newlines and spaces.
We will need to apply the strip() method to remove them.

In [24]:
# apply the strip function
col_names.strip()

AttributeError: 'list' object has no attribute 'strip'

This gave us an error because strip() is a string method, so it cannot be called
on a list. It has to be called on each string inside the list.
The solution? We will put it directly in our list comprehension.

In [25]:
col_names = [name.text.strip() for name in table_col_names]
col_names

['Team Name',
 'Year',
 'Wins',
 'Losses',
 'OT Losses',
 'Win %',
 'Goals For (GF)',
 'Goals Against (GA)',
 '+ / -']
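The difference between calling strip() on a list and on each string in it can be reproduced without any scraping. This is a small sketch with made-up values (`raw_names`), not the actual data from the page.

```python
# Made-up strings padded with newlines and spaces, like the scraped headers.
raw_names = ["\n   Team Name\n  ", "\n   Year\n  "]

# A list has no strip() method, so this raises AttributeError.
try:
    raw_names.strip()
except AttributeError as error:
    print(error)

# Each element is a string, so strip() works element by element.
clean_names = [name.strip() for name in raw_names]
print(clean_names)
```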

In [26]:
# import pandas so we can start putting our data together in a dataframe
import pandas as pd

In [27]:
# set the column names for the dataframe
df = pd.DataFrame(columns = col_names)
df

| | Team Name | Year | Wins | Losses | OT Losses | Win % | Goals For (GF) | Goals Against (GA) | + / - |
|---|---|---|---|---|---|---|---|---|---|


In [35]:
# we need to extract the 'td' tags that are within the 'tr' tags
# first find the 'tr' tags
row_info = table.find_all('tr')
row_info

[<tr>
 <th>
                             Team Name
                         </th>
 <th>
                             Year
                         </th>
 <th>
                             Wins
                         </th>
 <th>
                             Losses
                         </th>
 <th>
                             OT Losses
                         </th>
 <th>
                             Win %
                         </th>
 <th>
                             Goals For (GF)
                         </th>
 <th>
                             Goals Against (GA)
                         </th>
 <th>
                             + / -
                         </th>
 </tr>,
 <tr class="team">
 <td class="name">
                             Boston Bruins
                         </td>
 <td class="year">
                             1990
                         </td>
 <td class="wins">
                             44
                         </td>
 <td class="losses">
          

In [38]:
# then extract the 'td' tags and put them in a list
for row in row_info:
    row_data = row.find_all('td')
    print(row_data)


[]
[<td class="name">
                            Boston Bruins
                        </td>, <td class="year">
                            1990
                        </td>, <td class="wins">
                            44
                        </td>, <td class="losses">
                            24
                        </td>, <td class="ot-losses">
</td>, <td class="pct text-success">
                            0.55
                        </td>, <td class="gf">
                            299
                        </td>, <td class="ga">
                            264
                        </td>, <td class="diff text-success">
                            35
                        </td>]
[<td class="name">
                            Buffalo Sabres
                        </td>, <td class="year">
                            1990
                        </td>, <td class="wins">
                            31
                        </td>, <td class="losses">
           

We have succeeded in getting all the 'td' data, but it is still raw HTML and not ready for a dataframe.
We need to extend the code above so that it loops through each row and puts the text of its cells in a list.

In [39]:
# continue with the code by putting the row_data into a list 
for row in row_info:
    row_data = row.find_all('td')
    observations = [data.text.strip() for data in row_data]
    print(observations)

[]
['Boston Bruins', '1990', '44', '24', '', '0.55', '299', '264', '35']
['Buffalo Sabres', '1990', '31', '30', '', '0.388', '292', '278', '14']
['Calgary Flames', '1990', '46', '26', '', '0.575', '344', '263', '81']
['Chicago Blackhawks', '1990', '49', '23', '', '0.613', '284', '211', '73']
['Detroit Red Wings', '1990', '34', '38', '', '0.425', '273', '298', '-25']
['Edmonton Oilers', '1990', '37', '37', '', '0.463', '272', '272', '0']
['Hartford Whalers', '1990', '31', '38', '', '0.388', '238', '276', '-38']
['Los Angeles Kings', '1990', '46', '24', '', '0.575', '340', '254', '86']
['Minnesota North Stars', '1990', '27', '39', '', '0.338', '256', '266', '-10']
['Montreal Canadiens', '1990', '39', '30', '', '0.487', '273', '249', '24']
['New Jersey Devils', '1990', '32', '33', '', '0.4', '272', '264', '8']
['New York Islanders', '1990', '25', '45', '', '0.312', '223', '290', '-67']
['New York Rangers', '1990', '36', '31', '', '0.45', '297', '265', '32']
['Philadelphia Flyers', '1990',

Each row is now a list of observations. We still need to modify our code so that it appends each list
to the dataframe (df) that we created above.

In [40]:
# append into our dataframe df using index location - loc
for row in row_info:
    row_data = row.find_all('td')
    observations = [data.text.strip() for data in row_data]
    
    # the number of rows currently in the dataframe is the next free index label
    length = len(df)
    
    # use loc to add the observations as a new row at that index label
    df.loc[length] = observations

ValueError: cannot set a row with mismatched columns

We had an error. What could the problem be?
If you look at the list of observations that we printed, you will notice that the
first one is an empty list, because the header row contains 'th' tags and no 'td' tags.
An empty list cannot fill nine columns, so we have to exclude it.

In [41]:
# skip the first row (the header row), which produces the empty list
for row in row_info[1:]:
    row_data = row.find_all('td')
    observations = [data.text.strip() for data in row_data]
    
    # the number of rows currently in the dataframe is the next free index label
    length = len(df)
    
    # use loc to add the observations as a new row at that index label
    df.loc[length] = observations
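As an alternative to appending one row at a time with loc, the rows can be collected first and the dataframe built in a single step. The sketch below is only an illustration with made-up rows (`scraped_rows`) and three columns instead of nine. It filters out any row whose length does not match the columns, which also removes the empty list that comes from the header row.

```python
import pandas as pd

columns = ["Team Name", "Year", "Wins"]

# Made-up scraped rows; the empty list stands in for the header row,
# which contains no 'td' cells.
scraped_rows = [
    [],
    ["Boston Bruins", "1990", "44"],
    ["Buffalo Sabres", "1990", "31"],
]

# Keep only rows that have exactly one value per column.
complete_rows = [row for row in scraped_rows if len(row) == len(columns)]

# Build the dataframe from all rows at once.
sample_df = pd.DataFrame(complete_rows, columns=columns)
print(sample_df.shape)
```

Filtering by length rather than slicing with `[1:]` does not depend on the header being the first row, though slicing is perfectly fine for this page.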

In [45]:
# display the first 10 rows of our dataframe
df.head(10)

| | Team Name | Year | Wins | Losses | OT Losses | Win % | Goals For (GF) | Goals Against (GA) | + / - |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Boston Bruins | 1990 | 44 | 24 | | 0.55 | 299 | 264 | 35 |
| 1 | Buffalo Sabres | 1990 | 31 | 30 | | 0.388 | 292 | 278 | 14 |
| 2 | Calgary Flames | 1990 | 46 | 26 | | 0.575 | 344 | 263 | 81 |
| 3 | Chicago Blackhawks | 1990 | 49 | 23 | | 0.613 | 284 | 211 | 73 |
| 4 | Detroit Red Wings | 1990 | 34 | 38 | | 0.425 | 273 | 298 | -25 |
| 5 | Edmonton Oilers | 1990 | 37 | 37 | | 0.463 | 272 | 272 | 0 |
| 6 | Hartford Whalers | 1990 | 31 | 38 | | 0.388 | 238 | 276 | -38 |
| 7 | Los Angeles Kings | 1990 | 46 | 24 | | 0.575 | 340 | 254 | 86 |
| 8 | Minnesota North Stars | 1990 | 27 | 39 | | 0.338 | 256 | 266 | -10 |
| 9 | Montreal Canadiens | 1990 | 39 | 30 | | 0.487 | 273 | 249 | 24 |


In [46]:
df.shape

(25, 9)

There you have it! Our dataframe.

### Conclusion:
We have been able to scrape data from our web page by learning the basics.
We will continue the next time with scraping multiple pages from this same website.


Thank you!