# Web Scraping Wikipedia Tables using BeautifulSoup and Python

Wiki:
https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_FBS_football_stadiums

#### Collecting web page data

In [1]:
# Install requests and beautifulsoup4
# $ pip install requests
# $ pip install beautifulsoup4

In [2]:
# Import the modules installed above
import requests
from bs4 import BeautifulSoup
# urllib.request is the standard-library alternative to requests; it is not used below
from urllib import request

In [3]:
# Fetch the web page with the get() function from the requests library
url = "https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_FBS_football_stadiums"
page = requests.get(url)

In [4]:
# page is a requests.models.Response object when fetched with requests.get()
# (urllib.request.urlopen() would instead return an http.client.HTTPResponse object)
type(page)

requests.models.Response

A list of HTTP response status codes:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Status
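
To make a status code easier to read in a log or printout, the standard library's `http.HTTPStatus` enum can translate the number into its reason phrase. The helper name `describe_status` is an assumption made for this sketch; it is not part of requests.

```python
from http import HTTPStatus


def describe_status(code: int) -> str:
    """Return a short human-readable label for an HTTP status code."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # Codes that the enum does not define end up here.
        return f"{code} (unknown status code)"
    return f"{status.value} {status.phrase}"


print(describe_status(200))  # 200 OK
print(describe_status(404))  # 404 Not Found
```

The same idea could be applied to `page.status_code` from the cell above.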

In [5]:
# It is always good to check the HTTP response status code returned by requests
print(page.status_code)   # This should print 200

200


In [6]:
# The raw bytes of the page are available through the content attribute of the Response object
print(page.content[:800])

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of NCAA Division I FBS football stadiums - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRequestId":"XiQY6QpAAEIAAHQIpZgAAAEO","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_NCAA_Division_I_FBS_football_stadiums","wgTitle":"List of NCAA Divisi'
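
The `b'...'` prefix in the output above shows that `page.content` is a bytes object, not a string. The sketch below illustrates the difference with a made-up snippet standing in for the downloaded page; the variable names are assumptions for illustration only.

```python
# Stand-in for page.content: text encoded as UTF-8 bytes.
raw = "<title>Hawaiʻi stadiums</title>".encode("utf-8")

# Decoding gives a str, roughly what the Response object's text attribute provides.
text = raw.decode("utf-8")

print(type(raw).__name__, type(text).__name__)  # bytes str
# Non-ASCII characters take more than one byte, so the lengths differ.
print(len(raw), len(text))
```

BeautifulSoup accepts either form, which is why the next cell can pass `page.content` directly.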


In [7]:
# Create a BeautifulSoup object and use its prettify() method,
# which prints the HTML indented, similar to what the browser's inspector shows.
soup = BeautifulSoup(page.content, 'html.parser') # the input to BeautifulSoup should be a str or bytes object
print(soup.prettify()[:1000])

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of NCAA Division I FBS football stadiums - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRequestId":"XiQY6QpAAEIAAHQIpZgAAAEO","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_NCAA_Division_I_FBS_football_stadiums","wgTitle":"List of NCAA Division I FBS football stadiums","wgCurRevisionId":930475985,"wgRevisionId":930475985,"wgArticleId":9897546,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserG

So far, we know that our table is in a "table" tag with the class "wikitable".   
So, first we will extract the matching tables using the find_all method of the bs4 object.   
This method returns a bs4.element.ResultSet, which behaves like a list of bs4.element.Tag objects.  

In [8]:
# Find all tables on the page whose class attribute is 'wikitable sortable'
tables = soup.find_all('table', {'class':'wikitable sortable'})
# Arguments: the tag name ('table') and a dictionary of attributes to match (class)

In [9]:
# find_all() returns a bs4.element.ResultSet (a list of bs4.element.Tag objects), even if only one table matches
# find() would instead return a single bs4.element.Tag object
type(tables)

bs4.element.ResultSet
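
The difference between `find` and `find_all` can be checked on a small, made-up HTML string. This is a minimal sketch, assuming `beautifulsoup4` is installed; the HTML and variable names are invented for illustration.

```python
from bs4 import BeautifulSoup

html = """
<table class="wikitable sortable"><tr><th>Stadium</th></tr></table>
<table class="wikitable"><tr><th>Other</th></tr></table>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# find() returns the first match as a Tag (or None if nothing matches).
first = demo_soup.find("table", {"class": "wikitable"})
# find_all() returns every match in a ResultSet, which behaves like a list.
every = demo_soup.find_all("table", {"class": "wikitable"})
# A string with two class names matches that exact class attribute value.
exact = demo_soup.find_all("table", {"class": "wikitable sortable"})

print(type(first).__name__)   # Tag
print(type(every).__name__)   # ResultSet
print(len(every), len(exact)) # 2 1
```

Searching for a single class name matches any table that carries it, while the two-name string only matches tables whose class attribute is written exactly that way.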

#### < table > tag:
The < table > tag defines an HTML table.  
An HTML table consists of the < table > element and one or more < tr >, < th >, and < td > elements.  
The < tr > element defines a table row,  
the < th > element defines a table header,  
and the < td > element defines a table cell.  

A more complex HTML table may also include < caption >, < col >, < colgroup >, < thead >, < tfoot >, and < tbody > elements.  

##### Example:   
< table>  
....< tr>  
........< th>Month< /th>  
........< th>Savings< /th>  
....< /tr>  
....< tr>  
........< td>January< /td>  
........< td>100< /td>  
....< /tr>  
....< tr>  
........< td>February< /td>  
........< td>80< /td>  
....< /tr>  
< /table>  
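
The example table above can be parsed the same way the Wikipedia table is parsed later on. The following is a small sketch, assuming `beautifulsoup4` is installed; the variable names are chosen for illustration.

```python
from bs4 import BeautifulSoup

example_html = """
<table>
  <tr><th>Month</th><th>Savings</th></tr>
  <tr><td>January</td><td>100</td></tr>
  <tr><td>February</td><td>80</td></tr>
</table>
"""
example = BeautifulSoup(example_html, "html.parser").find("table")

# Header cells live in <th> tags.
header = [th.get_text(strip=True) for th in example.find_all("th")]

# Data cells live in <td> tags; rows without any <td> (the header row) are skipped.
body = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in example.find_all("tr")
    if tr.find_all("td")
]

print(header)  # ['Month', 'Savings']
print(body)    # [['January', '100'], ['February', '80']]
```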

In [10]:
# Search through the tables for the one with the headings we want.
for table in tables:
    # <th> tags hold the header cells; find_all returns them as a list
    ths = table.find_all('th')
    # collect the stripped text of every header cell in this table
    headings = [th.text.strip() for th in ths]
    # stop at the first table whose headings 1 to 6 are the columns we want
    if headings[1:7] == ['Stadium', 'City', 'State', 'Team', 
                        'Conference', 'Capacity']:
        break

In [11]:
headings

['Image',
 'Stadium',
 'City',
 'State',
 'Team',
 'Conference',
 'Capacity',
 'Record1',
 'Built',
 'Expanded 2',
 'Surface']
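
One caveat of the search loop: if no table matches, `table` and `headings` are simply left pointing at the last table examined. A possible way to make that case explicit is a helper that returns `None` when nothing matches. This is a sketch only; `find_table_by_headings` and the demo HTML are hypothetical and not part of the original notebook.

```python
from bs4 import BeautifulSoup


def find_table_by_headings(tables, wanted, start=1):
    """Return the first table whose headings[start:start + len(wanted)] equal
    `wanted`, or None when no table matches (hypothetical helper)."""
    for candidate in tables:
        heads = [th.get_text(strip=True) for th in candidate.find_all("th")]
        if heads[start:start + len(wanted)] == wanted:
            return candidate
    return None


demo_html = """
<table id="a"><tr><th>Rank</th><th>Name</th></tr></table>
<table id="b"><tr><th>Image</th><th>Stadium</th><th>City</th></tr></table>
"""
demo_tables = BeautifulSoup(demo_html, "html.parser").find_all("table")

match = find_table_by_headings(demo_tables, ["Stadium", "City"])
missing = find_table_by_headings(demo_tables, ["Surface"])

print(match["id"] if match is not None else "no match")  # b
print(missing)                                           # None
```

With the real page, the `tables` result set and the six wanted headings from the cell above would be passed in, and a `None` result would signal that the page layout has changed.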

In [12]:
#### Checking ####
# Extract the columns we want.
# Use the table selected by the previous for loop; each <tr> tag defines a table row
for trs in table.find_all('tr'):
    # find all the <td> tags: each defines a table cell
    tds = trs.find_all('td')
    if not tds:     # skip rows that have no <td> cells, such as the header row
        continue
    # get the text from cells 1 to 6 of the row and print it as a list
    print([td.text.strip() for td in tds[1:7]])

['Aggie Memorial Stadium', 'Las Cruces', 'NM', 'New Mexico State', 'Independent', '30,343[1]']
['Alamodome', 'San Antonio', 'TX', 'UTSA', 'C-USA', '65,000']
['Alaska Airlines Field at Husky Stadium', 'Seattle', 'WA', 'Washington', 'Pac-12', '70,500[2]']
['Albertsons Stadium', 'Boise', 'ID', 'Boise State', 'Mountain West', '36,387[3]']
['Allen E. Paulson Stadium', 'Statesboro', 'GA', 'Georgia Southern', 'Sun Belt', '25,000']
['Aloha Stadium', 'Honolulu', 'HI', 'Hawaiʻi', 'Mountain West', '50,000[5]']
['Alumni Stadium', 'Chestnut Hill', 'MA', 'Boston College', 'ACC', '44,500[6]']
['Amon G. Carter Stadium', 'Fort Worth', 'TX', 'TCU', 'Big 12', '45,000[7]']
['Apogee Stadium', 'Denton', 'TX', 'North Texas', 'C-USA', '30,850']
['Arizona Stadium', 'Tucson', 'AZ', 'Arizona', 'Pac-12', '56,029[9]']
['Arthur L. Williams Stadium', 'Lynchburg', 'VA', 'Liberty', 'Independent (2018)', '25,000[11]']
['Autzen Stadium', 'Eugene', 'OR', 'Oregon', 'Pac-12', '54,000']
['Bagwell Field at Dowdy–Ficklen Stad

In [13]:
# Create an empty list
rows=list()
# Extract the columns we want.
# Use the table selected by the earlier for loop; each <tr> tag defines a table row
for trs in table.find_all('tr'):
    # find all the <td> tags: each defines a table cell
    tds = trs.find_all('td')
    if not tds:     # skip rows that have no <td> cells, such as the header row
        continue
    # get the text from cells 1 to 6 of the row and append it to rows as a list
    rows.append([td.text.strip() for td in tds[1:7]])   

In [14]:
rows

[['Aggie Memorial Stadium',
  'Las Cruces',
  'NM',
  'New Mexico State',
  'Independent',
  '30,343[1]'],
 ['Alamodome', 'San Antonio', 'TX', 'UTSA', 'C-USA', '65,000'],
 ['Alaska Airlines Field at Husky Stadium',
  'Seattle',
  'WA',
  'Washington',
  'Pac-12',
  '70,500[2]'],
 ['Albertsons Stadium',
  'Boise',
  'ID',
  'Boise State',
  'Mountain West',
  '36,387[3]'],
 ['Allen E. Paulson Stadium',
  'Statesboro',
  'GA',
  'Georgia Southern',
  'Sun Belt',
  '25,000'],
 ['Aloha Stadium', 'Honolulu', 'HI', 'Hawaiʻi', 'Mountain West', '50,000[5]'],
 ['Alumni Stadium',
  'Chestnut Hill',
  'MA',
  'Boston College',
  'ACC',
  '44,500[6]'],
 ['Amon G. Carter Stadium', 'Fort Worth', 'TX', 'TCU', 'Big 12', '45,000[7]'],
 ['Apogee Stadium', 'Denton', 'TX', 'North Texas', 'C-USA', '30,850'],
 ['Arizona Stadium', 'Tucson', 'AZ', 'Arizona', 'Pac-12', '56,029[9]'],
 ['Arthur L. Williams Stadium',
  'Lynchburg',
  'VA',
  'Liberty',
  'Independent (2018)',
  '25,000[11]'],
 ['Autzen Stadium', 

In [15]:
import pandas as pd
# Build a DataFrame from the rows, naming the columns with the matching headings
df=pd.DataFrame(rows, columns=headings[1:7])

In [16]:
df

Unnamed: 0,Stadium,City,State,Team,Conference,Capacity
0,Aggie Memorial Stadium,Las Cruces,NM,New Mexico State,Independent,"30,343[1]"
1,Alamodome,San Antonio,TX,UTSA,C-USA,65000
2,Alaska Airlines Field at Husky Stadium,Seattle,WA,Washington,Pac-12,"70,500[2]"
3,Albertsons Stadium,Boise,ID,Boise State,Mountain West,"36,387[3]"
4,Allen E. Paulson Stadium,Statesboro,GA,Georgia Southern,Sun Belt,25000
...,...,...,...,...,...,...
125,Veterans Memorial Stadium at Larry Blakeney Field,Troy,AL,Troy,Sun Belt,30402
126,Waldo Stadium,Kalamazoo,MI,Western Michigan,MAC,"30,200[153]"
127,Warren McGuirk Alumni Stadium,Hadley,MA,UMass,Independent,17000
128,Wayne Day Family Field at Carter–Finley Stadium,Raleigh,NC,NC State,ACC,"57,583[155]"


In [17]:
def normalize_string(text):
    """
    Expects: a string that may contain a footnote marker in '[]'; ex: 70,500[2]
    Modifies: keeps only the text before the first '['
    Returns: a string; ex: 70,500
    """   
    if '[' in text:
        index=text.find('[')
        result=text[:index]
    else:
        result=text
    return result
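
As an alternative to slicing at the first `[`, the footnote markers could be removed with a regular expression. This is a sketch, not the notebook's method: `strip_footnotes` is a hypothetical name, and unlike `normalize_string` it removes only the bracketed markers and keeps any text that follows them.

```python
import re

# Matches one bracketed footnote marker such as [2] or [11].
FOOTNOTE = re.compile(r"\[[^\]]*\]")


def strip_footnotes(text: str) -> str:
    """Remove every '[...]' marker from text and trim surrounding whitespace."""
    return FOOTNOTE.sub("", text).strip()


print(strip_footnotes("70,500[2]"))   # 70,500
print(strip_footnotes("25,000[11]"))  # 25,000
print(strip_footnotes("65,000"))      # 65,000
```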

In [18]:
df['Capacity']=df['Capacity'].apply(normalize_string)

In [19]:
df

Unnamed: 0,Stadium,City,State,Team,Conference,Capacity
0,Aggie Memorial Stadium,Las Cruces,NM,New Mexico State,Independent,30343
1,Alamodome,San Antonio,TX,UTSA,C-USA,65000
2,Alaska Airlines Field at Husky Stadium,Seattle,WA,Washington,Pac-12,70500
3,Albertsons Stadium,Boise,ID,Boise State,Mountain West,36387
4,Allen E. Paulson Stadium,Statesboro,GA,Georgia Southern,Sun Belt,25000
...,...,...,...,...,...,...
125,Veterans Memorial Stadium at Larry Blakeney Field,Troy,AL,Troy,Sun Belt,30402
126,Waldo Stadium,Kalamazoo,MI,Western Michigan,MAC,30200
127,Warren McGuirk Alumni Stadium,Hadley,MA,UMass,Independent,17000
128,Wayne Day Family Field at Carter–Finley Stadium,Raleigh,NC,NC State,ACC,57583


In [20]:
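def normalize_number(strings):
    """
    Expects: a string that may contain '$' or ','; '--' marks a missing value
    Modifies: removes '$' and ','; converts '--' (or an empty string) to 0
    Returns: an integer
    """
    # Values without a comma (ex: '9000') must still be converted, not set to 0
    strings=strings.replace('$','').replace(',','').strip()
    if strings in ('', '--'):
        return 0
    return int(strings)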

In [21]:
df['Capacity']=df['Capacity'].apply(normalize_number)
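
The two cleaning steps can also be written with pandas' vectorized string methods instead of `apply`. The following is a sketch under the assumption that every value is a number with optional commas and footnote markers; the sample Series is made up from values seen above plus one comma-free value.

```python
import pandas as pd

capacity = pd.Series(["30,343[1]", "65,000", "70,500[2]", "9000"], name="Capacity")

cleaned = (
    capacity
    .str.replace(r"\[.*?\]", "", regex=True)  # drop footnote markers such as [1]
    .str.replace(",", "", regex=False)        # drop thousands separators
    .astype(int)                              # convert the remaining digits to integers
)

print(cleaned.tolist())  # [30343, 65000, 70500, 9000]
```

Unlike `normalize_number`, this chain raises an error on placeholder values such as `--`, so it fits best when the column is known to be fully numeric.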

In [22]:
# Write the DataFrame to a CSV file, leaving out the index column
df.to_csv('football_stadium.csv', index=False, encoding='utf-8')
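
The effect of `index=False` can be seen by writing a small frame to an in-memory buffer and reading it back. This is a sketch with a two-row sample built from values shown earlier; the `round_trip` helper is hypothetical, and `io.StringIO` stands in for the real file.

```python
import io

import pandas as pd

sample = pd.DataFrame({
    "Stadium": ["Alamodome", "Aloha Stadium"],
    "State": ["TX", "HI"],
    "Capacity": [65000, 50000],
})


def round_trip(frame: pd.DataFrame, index: bool) -> pd.DataFrame:
    """Write frame to an in-memory CSV and read it back (hypothetical helper)."""
    buffer = io.StringIO()
    frame.to_csv(buffer, index=index)
    buffer.seek(0)
    return pd.read_csv(buffer)


without_index = round_trip(sample, index=False)
with_index = round_trip(sample, index=True)

print(without_index.columns.tolist())  # ['Stadium', 'State', 'Capacity']
print(with_index.columns.tolist())     # ['Unnamed: 0', 'Stadium', 'State', 'Capacity']
```

When the index is written, it comes back as an extra `Unnamed: 0` column on reload, which is why the cell above passes `index=False`.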