# Web Scraping Wikipedia Tables using BeautifulSoup and Python

Wiki:
https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_institutions

#### Collecting web page data

In [1]:
# Install requests and beautifulsoup4
# $ pip install requests
# $ pip install beautifulsoup4

In [2]:
# Import the installed modules
import requests
from bs4 import BeautifulSoup
# urllib is part of the standard library; it is an alternative to requests and is not used below
from urllib import request

In [3]:
# Fetch the web page with the get() function of the requests library
url = "https://en.wikipedia.org/wiki/List_of_NCAA_Division_I_institutions"
page = requests.get(url)

In [4]:
# requests.get() returns a requests.models.Response object
# (urllib.request.urlopen() would instead return an http.client.HTTPResponse, whose read() method gives bytes)
type(page)

requests.models.Response

A list of HTTP response status codes:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Status

In [5]:
# It is always good to check the HTTP response status code before parsing
print(page.status_code)   # 200 means the request succeeded

200
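
Rather than reading the printed status code by eye, the check can be made programmatic with `raise_for_status()`. The following is a minimal sketch, assuming a hypothetical helper named `fetch_html` and an arbitrary 10-second timeout, neither of which appears in the original notebook. The demonstration at the end builds an empty `requests.Response` by hand so that it runs without network access.

```python
import requests


def fetch_html(url, timeout=10):
    """Hypothetical helper: download a page and return its body as bytes.

    Raises requests.HTTPError for 4xx/5xx responses instead of
    silently returning an error page.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.content


# Offline demonstration: a hand-built response with an error status.
fake = requests.Response()
fake.status_code = 404
try:
    fake.raise_for_status()
    error_raised = False
except requests.HTTPError:
    error_raised = True

print(error_raised)
```

With a helper like this, a failed download stops the notebook at the fetch step rather than producing an empty table later on.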


In [6]:
# page.content holds the raw response body as bytes (content is an attribute, not a method)
print(page.content[:800])

b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>List of NCAA Division I institutions - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRequestId":"XiM@4QpAADkAAEgmAtwAAAAE","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_NCAA_Division_I_institutions","wgTitle":"List of NCAA Division I institutions"'


In [7]:
# Create a BeautifulSoup object and use its prettify() method.
# prettify() returns the markup as an indented string, similar to what the browser's inspector shows.
soup = BeautifulSoup(page.content, 'html.parser') # BeautifulSoup accepts a str or bytes object as input
print(soup.prettify()[:1000])

<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   List of NCAA Division I institutions - Wikipedia
  </title>
  <script>
   document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRequestId":"XiM@4QpAADkAAEgmAtwAAAAE","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_NCAA_Division_I_institutions","wgTitle":"List of NCAA Division I institutions","wgCurRevisionId":933196664,"wgRevisionId":933196664,"wgArticleId":3604099,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories"
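
As a small illustration of the input types, `BeautifulSoup` can be given either the raw bytes or an already decoded string. This sketch uses a short hand-written HTML snippet (an assumption standing in for the downloaded page) so it runs offline.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for page.content (bytes)
raw_bytes = (
    b"<html><head>"
    b"<title>List of NCAA Division I institutions - Wikipedia</title>"
    b"</head><body></body></html>"
)

soup_from_bytes = BeautifulSoup(raw_bytes, "html.parser")
soup_from_str = BeautifulSoup(raw_bytes.decode("utf-8"), "html.parser")

# Both objects expose the same parsed document
title_from_bytes = soup_from_bytes.title.string
title_from_str = soup_from_str.title.string
print(title_from_bytes)
print(title_from_bytes == title_from_str)
```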

From inspecting the page, we know that our table is in the tag "table" with class "wikitable".   
So, first we will extract every matching table using the find_all method of the bs4 object.   
This method returns a bs4.element.ResultSet, which behaves like a list of bs4.element.Tag objects.  

In [8]:
# Find every <table> tag on the page whose class is 'wikitable'
# (the attribute filter is passed as a dictionary)
tables = soup.find_all('table', {'class':'wikitable'})

In [9]:
# find() returns a single bs4.element.Tag object (the first matching table)
# find_all() returns a bs4.element.ResultSet object (all matching tables)
type(tables)

bs4.element.ResultSet
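
The difference between `find` and `find_all` can be checked on a small hand-written document. This is a sketch with made-up tables, not the Wikipedia page; it also shows that the class filter matches a tag carrying several classes.

```python
from bs4 import BeautifulSoup
from bs4.element import ResultSet, Tag

html = """
<table class="wikitable"><tr><th>A</th></tr></table>
<table class="wikitable sortable"><tr><th>B</th></tr></table>
<table class="plain"><tr><th>C</th></tr></table>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# find() stops at the first match and returns a single Tag
first_table = demo_soup.find("table", {"class": "wikitable"})
# find_all() collects every match into a ResultSet (a list subclass)
all_tables = demo_soup.find_all("table", {"class": "wikitable"})

print(isinstance(first_table, Tag))
print(isinstance(all_tables, ResultSet))
print(len(all_tables))
```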

####  < table > tag:
The < table > tag defines an HTML table.  
An HTML table consists of the < table > element and one or more < tr >, < th >, and < td > elements.  
The < tr > element defines a table row,  
The < th > element defines a table header,  
and the < td > element defines a table cell.  

A more complex HTML table may also include < caption >, < col >, < colgroup >, < thead >, < tfoot >, and < tbody > elements.  

##### Example:   
< table >  
....< tr >  
........< th >Month< /th >  
........< th >Savings< /th >  
....< /tr >  
....< tr >  
........< td >January< /td >  
........< td >100< /td >  
....< /tr >  
....< tr >  
........< td >February< /td >  
........< td >80< /td >  
....< /tr >  
< /table >  
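
To make the roles of the three tags concrete, the example table above can be parsed with BeautifulSoup. This is a minimal sketch; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

example_html = """
<table>
  <tr><th>Month</th><th>Savings</th></tr>
  <tr><td> January</td><td> 100</td></tr>
  <tr><td>February</td><td> 80</td></tr>
</table>
"""
example_table = BeautifulSoup(example_html, "html.parser").find("table")

# <th> cells carry the headers
header_cells = [th.get_text(strip=True) for th in example_table.find_all("th")]

# each <tr> is one row; rows without <td> cells (the header row) are skipped
data_rows = []
for tr in example_table.find_all("tr"):
    tds = tr.find_all("td")
    if tds:
        data_rows.append([td.get_text(strip=True) for td in tds])

print(header_cells)  # ['Month', 'Savings']
print(data_rows)     # [['January', '100'], ['February', '80']]
```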

In [10]:
# Search through the tables for the one with the headings we want.
for table in tables:
    # Collect every <th> tag in this table; <th> marks a header cell.
    # Note: this includes row headers (the school names), not only column headers.
    ths = table.find_all('th')
    # Get the stripped text of each <th> tag
    headings = [th.text.strip() for th in ths]
    # Stop at the first table whose first six headings match the ones we want;
    # after the break, table and headings refer to that table.
    if headings[:6] == ['School', 'Common Name', 'Team', 'City', 
                        'State', 'Type']:
        break

In [11]:
headings

['School',
 'Common Name',
 'Team',
 'City',
 'State',
 'Type',
 'Primary Conference',
 'Abilene Christian University',
 'University of Akron',
 'University of Alabama',
 'Alabama Agricultural and Mechanical University',
 'University of Alabama at Birmingham',
 'Alabama State University',
 'University at Albany, SUNY',
 'Alcorn State University',
 'American University',
 'Appalachian State University',
 'University of Arizona',
 'Arizona State University',
 'University of Arkansas',
 'University of Arkansas at Little Rock',
 'University of Arkansas at Pine Bluff',
 'Arkansas State University',
 'Auburn University',
 'Austin Peay State University',
 'Ball State University',
 'Baylor University',
 'Belmont University',
 'Bethune-Cookman University',
 'Binghamton University',
 'Boise State University',
 'Boston College',
 'Boston University',
 'Bowling Green State University',
 'Bradley University',
 'Brigham Young University',
 'Brown University',
 'Bryant University',
 'Bucknell Univers
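
The output above mixes column headings with school names because `find_all('th')` returns every header cell in the table, including the one at the start of each data row. One way to keep the two apart is sketched below on a small hand-written table; it assumes the column headings sit in the first `<tr>` and each later row starts with a single `<th>`, which may not hold for every Wikipedia table.

```python
from bs4 import BeautifulSoup

sample_html = """
<table class="wikitable">
  <tr><th>School</th><th>Common Name</th><th>Team</th></tr>
  <tr><th>University of Akron</th><td>Akron</td><td>Zips</td></tr>
  <tr><th>Yale University</th><td>Yale</td><td>Bulldogs</td></tr>
</table>
"""
sample_table = BeautifulSoup(sample_html, "html.parser").find("table")
table_rows = sample_table.find_all("tr")

# Column headings: the <th> cells of the first row only
column_headers = [th.get_text(strip=True) for th in table_rows[0].find_all("th")]

# Row headings: the first <th> of every later row that has one
row_headers = [
    tr.find("th").get_text(strip=True)
    for tr in table_rows[1:]
    if tr.find("th") is not None
]

print(column_headers)  # ['School', 'Common Name', 'Team']
print(row_headers)     # ['University of Akron', 'Yale University']
```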

In [12]:
#### Checking #### 
# Preview the columns we want.
# Use the table kept from the previous for loop and find its <tr> tags; each <tr> defines a table row.
for trs in table.find_all('tr'):
    # find all the <td> tags: each defines a table cell
    tds = trs.find_all('td')
    if not tds:     # skip rows with no <td> cells, such as the header row
        continue
    # get the text of the first five <td> cells as a list
    print([td.text.strip() for td in tds[0:5]])

['Abilene Christian', 'Wildcats', 'Abilene', 'Texas', 'Private/Churches of Christ']
['Akron', 'Zips', 'Akron', 'Ohio', 'State']
['Alabama', 'Crimson Tide', 'Tuscaloosa', 'Alabama', 'State']
['Alabama A&M', 'Bulldogs and Lady Bulldogs', 'Huntsville', 'Alabama', 'State']
['UAB', 'Blazers', 'Birmingham', 'Alabama', 'State']
['Alabama State', 'Hornets and Lady Hornets', 'Montgomery', 'Alabama', 'State']
['Albany', 'Great Danes', 'Albany', 'New York', 'State']
['Alcorn State', 'Braves and Lady Braves', 'Lorman', 'Mississippi', 'State']
['American', 'Eagles', 'Washington', 'District of Columbia', 'Private/Methodist']
['Appalachian State', 'Mountaineers', 'Boone', 'North Carolina', 'State']
['Arizona', 'Wildcats', 'Tucson', 'Arizona', 'State']
['Arizona State', 'Sun Devils', 'Tempe', 'Arizona', 'State']
['Arkansas', 'Razorbacks and Razorback women[A 1]', 'Fayetteville', 'Arkansas', 'State']
['Little Rock[A 2]', 'Trojans', 'Little Rock', 'Arkansas', 'State']
['Arkansas–Pine Bluff', 'Golden Lio

In [13]:
# Create an empty list to hold one entry per table row
rows = []
# Extract the columns we want.
# Use the table kept from the earlier for loop and find its <tr> tags; each <tr> defines a table row.
for trs in table.find_all('tr'):
    # find all the <td> tags: each defines a table cell
    tds = trs.find_all('td')
    if not tds:     # skip rows with no <td> cells, such as the header row
        continue
    # get the text of the first five <td> cells and append it as one row
    rows.append([td.text.strip() for td in tds[0:5]])

In [14]:
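import pandas as pd
# The School column sits in <th> cells, so the five <td> values per row line up with headings[1:6]
df = pd.DataFrame(rows, columns=headings[1:6])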

In [15]:
df

Unnamed: 0,Common Name,Team,City,State,Type
0,Abilene Christian,Wildcats,Abilene,Texas,Private/Churches of Christ
1,Akron,Zips,Akron,Ohio,State
2,Alabama,Crimson Tide,Tuscaloosa,Alabama,State
3,Alabama A&M,Bulldogs and Lady Bulldogs,Huntsville,Alabama,State
4,UAB,Blazers,Birmingham,Alabama,State
...,...,...,...,...,...
345,Wright State,Raiders,Fairborn,Ohio,State
346,Wyoming,Cowboys and Cowgirls,Laramie,Wyoming,State
347,Xavier,Musketeers,Cincinnati,Ohio,Private/Catholic
348,Yale,Bulldogs,New Haven,Connecticut,Private/Non-Sectarian


In [16]:
headings[:6]

['School', 'Common Name', 'Team', 'City', 'State', 'Type']

In [17]:
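# headings[7:] holds the row headers (full school names) that follow the seven column headings.
# This assumes exactly one <th> per data row, in the same order as rows.
df = df.assign(School=headings[7:])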

In [18]:
# Reorder the columns so that School comes first
df = df[['School', 'Common Name', 'Team', 'City', 'State', 'Type']]

In [19]:
df

Unnamed: 0,School,Common Name,Team,City,State,Type
0,Abilene Christian University,Abilene Christian,Wildcats,Abilene,Texas,Private/Churches of Christ
1,University of Akron,Akron,Zips,Akron,Ohio,State
2,University of Alabama,Alabama,Crimson Tide,Tuscaloosa,Alabama,State
3,Alabama Agricultural and Mechanical University,Alabama A&M,Bulldogs and Lady Bulldogs,Huntsville,Alabama,State
4,University of Alabama at Birmingham,UAB,Blazers,Birmingham,Alabama,State
...,...,...,...,...,...,...
345,Wright State University,Wright State,Raiders,Fairborn,Ohio,State
346,University of Wyoming,Wyoming,Cowboys and Cowgirls,Laramie,Wyoming,State
347,Xavier University,Xavier,Musketeers,Cincinnati,Ohio,Private/Catholic
348,Yale University,Yale,Bulldogs,New Haven,Connecticut,Private/Non-Sectarian
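
Attaching the School column by position works only while the list of school names and the list of rows stay the same length and order. A possible alternative, sketched here on a hand-written table rather than the real page, collects the `<th>` and `<td>` cells of each row together so that every record carries its own school name.

```python
import pandas as pd
from bs4 import BeautifulSoup

sample_html = """
<table class="wikitable">
  <tr><th>School</th><th>Common Name</th><th>Team</th></tr>
  <tr><th>Xavier University</th><td>Xavier</td><td>Musketeers</td></tr>
  <tr><th>Yale University</th><td>Yale</td><td>Bulldogs</td></tr>
</table>
"""
sample_table = BeautifulSoup(sample_html, "html.parser").find("table")
table_rows = sample_table.find_all("tr")

# Column names come from the first row
column_names = [th.get_text(strip=True) for th in table_rows[0].find_all("th")]

# Passing a list of tag names returns <th> and <td> cells in document order
records = []
for tr in table_rows[1:]:
    cells = tr.find_all(["th", "td"])
    records.append([cell.get_text(strip=True) for cell in cells])

sample_df = pd.DataFrame(records, columns=column_names)
print(sample_df)
```

Because each record is built from a single `<tr>`, a missing or extra row cannot shift the school names against the other columns.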


In [20]:
def normalize_string(text):
    """
    Expects: a string that may contain a footnote marker in '[]'; ex: Saint Francis (PA)[A 37]
    Does: keeps only the text before the first '['
    Returns: a string; ex: Saint Francis (PA)
    """   
    if '[' in text:
        index=text.find('[')
        result=text[:index]
    else:
        result=text
    return result
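
For comparison, the same cleanup can be written with a regular expression. This is a sketch of an alternative, not the notebook's method; `strip_footnotes` is a hypothetical name. Unlike `normalize_string`, which drops everything after the first `[`, it removes only the bracketed markers and keeps any text that follows them.

```python
import re

# Matches one bracketed footnote marker such as "[A 37]"
FOOTNOTE_MARKER = re.compile(r"\[[^\]]*\]")


def strip_footnotes(text):
    """Hypothetical helper: remove every bracketed marker from text."""
    return FOOTNOTE_MARKER.sub("", text).strip()


print(strip_footnotes("Saint Francis (PA)[A 37]"))             # Saint Francis (PA)
print(strip_footnotes("Razorbacks and Razorback women[A 1]"))  # Razorbacks and Razorback women
print(strip_footnotes("Yale"))                                 # Yale
```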

In [21]:
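# Apply normalize_string to every cell (pandas 2.1+ deprecates applymap in favor of DataFrame.map)
df = df.applymap(normalize_string)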

In [22]:
# Write the DataFrame to a CSV file, omitting the index column
df.to_csv('divisionI.csv', index=False)
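
A quick way to confirm that the export behaves as expected is to read the CSV back. The sketch below uses a one-row stand-in DataFrame and an in-memory buffer instead of `divisionI.csv`, so it leaves no file on disk.

```python
import io

import pandas as pd

# One-row stand-in for the scraped DataFrame
sample_df = pd.DataFrame({"School": ["Yale University"], "Team": ["Bulldogs"]})

# Write to an in-memory text buffer rather than a file
buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)

# Rewind and read the CSV back into a new DataFrame
buffer.seek(0)
restored = pd.read_csv(buffer)

print(list(restored.columns))  # ['School', 'Team']
print(restored.shape)          # (1, 2)
```

With `index=False` the row index is not written, so the restored frame has no extra "Unnamed: 0" column.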