# Web Scraping: Part II - Scraping the Clemson Tigers Roster

```
Prepared for CPSC4300/6300 Section 001 -- Applied Data Science
Xizhou Feng
Clemson University
Fall 2020
```

# Overview

In this notebook, we use Brendan Martin's blog post, *Ultimate Guide to Web Scraping with Python Part 1: Requests and BeautifulSoup* (https://www.learndatasci.com/tutorials/ultimate-guide-web-scraping-w-python-requests-and-beautifulsoup/), as a template.

# Examine the Structure of the Clemson Tigers Roster

Open the [ESPN's Clemson Tigers Roster webpage](https://www.espn.com/college-football/team/roster/_/id/228/clemson-tigers "ESPN's Clemson Tigers Roster") in your browser and examine its contents and structure using the "Developer Tools" of your browser.

You should see a level 1 heading **Clemson Tigers Roster** followed by three tables titled "Offense", "Defense", and "Special Teams".

# Download Web Pages from the Website

As Brendan points out in his blog post, every time you load a web page you are making a request to a server, and buggy code can submit thousands of requests per second to a website, which can harm the site and its owner. With this in mind, we want to program our scrapers carefully to avoid crashing sites and causing damage.

Normally, you should follow these two guidelines:

+ Always check the website's policy on web crawling by robots (the `robots.txt` file at the root of the site).
+ Each time we scrape a website, we should try to make only one request per page.
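
One way to keep a buggy loop from flooding a server is to enforce a minimum delay between requests. The sketch below is a minimal illustration of that idea, not part of the source blog: `Throttle` is a hypothetical helper, and the interval used in the demo is an arbitrary value chosen so the example runs quickly.

```python
import time


class Throttle:
    """Block until at least `min_interval` seconds have passed since the last call."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# Demo: three simulated "requests" spaced at least 0.05 seconds apart.
throttle = Throttle(min_interval=0.05)
for page in ["page-1", "page-2", "page-3"]:
    throttle.wait()
    print("would request", page)
```

In a real scraper, `throttle.wait()` would be called immediately before each HTTP request, with an interval of a second or more.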

## Check the Site's Robots Policy

Open http://www.espn.com/robots.txt in your browser and skim the contents.

```
# robots.txt for www.espn.com

User-agent: *
Disallow: */_/group/
Disallow: */_/scoreboard/
Disallow: */_/week/
Disallow: */_/year/20
Disallow: */admin/
Disallow: */boxscore?/
Disallow: */cat/
... more stuff ..
```

The good news is that the team roster page is not among the disallowed paths, so we are allowed to scrape it from the ESPN site.
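
As a rough cross-check, the sketch below tests a URL against the `Disallow` patterns from the excerpt. It is a simplified matcher written for illustration, not a full robots.txt implementation: the helper names `pattern_to_regex` and `is_disallowed` are hypothetical, only the rules shown in the excerpt are included, and `*` is treated as a wildcard while every other character is matched literally.

```python
import re
from urllib.parse import urlparse

# Only the rules visible in the excerpt above (the real file has more).
DISALLOW = [
    "*/_/group/",
    "*/_/scoreboard/",
    "*/_/week/",
    "*/_/year/20",
    "*/admin/",
    "*/boxscore?/",
    "*/cat/",
]


def pattern_to_regex(pattern):
    """Turn a robots.txt path pattern into a regex anchored at the path start."""
    literal_parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".*".join(literal_parts))


def is_disallowed(url, patterns=DISALLOW):
    """Return True if the URL's path (plus query string) matches any pattern."""
    parsed = urlparse(url)
    target = parsed.path + ("?" + parsed.query if parsed.query else "")
    return any(pattern_to_regex(p).match(target) for p in patterns)


roster_url = "https://www.espn.com/college-football/team/roster/_/id/228/clemson-tigers"
# A made-up URL containing '/_/group/', used only to exercise a Disallow rule.
group_url = "https://www.espn.com/college-football/scoreboard/_/group/80"

print(is_disallowed(roster_url))
print(is_disallowed(group_url))
```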

## Make Web Requests

When you interact with a website, you communicate with its web server using the HTTP protocol. HTTP specifies a set of methods, such as `GET` and `POST`, for talking to the web server. The `requests` library is the de facto standard for making HTTP requests in Python; it hides the complexities of making requests behind a simple API. If you are not familiar with the concepts behind HTTP requests, you may read the guide by Alex Ronquillo (https://realpython.com/python-requests/).

The code below shows how we make a request to get the Clemson Tigers Roster web page.

In [2]:
import requests
from requests.exceptions import HTTPError

url = 'https://www.espn.com/college-football/team/roster/_/id/228/clemson-tigers'

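try:
    # A timeout keeps the call from hanging forever if the server does not answer.
    response = requests.get(url, timeout=10)
    # requests.get() does not raise on 4xx/5xx responses by itself;
    # raise_for_status() turns them into an HTTPError.
    response.raise_for_status()
except HTTPError as http_err:
    print('HTTP error occurred: {}'.format(http_err))
except Exception as err:
    print('Other error occurred: {}'.format(err))
else:
    print('Requesting {} is successful'.format(url))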

Requesting https://www.espn.com/college-football/team/roster/_/id/228/clemson-tigers is successful
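
Every HTTP response carries a numeric status code, and codes in the 200 range indicate success. As a small illustration using only the standard library's `http.HTTPStatus` (the helper name `describe_status` is hypothetical), the sketch below turns a code into a readable description:

```python
from http import HTTPStatus

# Broad meaning of each status-code class, keyed by the first digit.
CATEGORIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code):
    """Return a string such as '200 OK (success)' for a standard HTTP status code."""
    phrase = HTTPStatus(code).phrase   # raises ValueError for non-standard codes
    category = CATEGORIES[code // 100]
    return "{} {} ({})".format(code, phrase, category)


print(describe_status(200))
print(describe_status(404))
```

With a real response object, the code to pass in would be `response.status_code`.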


In [3]:
print(type(response))
print(response)

<class 'requests.models.Response'>
<Response [200]>


You can list the attributes and methods of a response object using the `dir` function, or inspect its `__dict__` and `__attrs__` attributes.

In [4]:
dir(response)

['__attrs__',
 '__bool__',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__enter__',
 '__eq__',
 '__exit__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__nonzero__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_content',
 '_content_consumed',
 '_next',
 'apparent_encoding',
 'close',
 'connection',
 'content',
 'cookies',
 'elapsed',
 'encoding',
 'headers',
 'history',
 'is_permanent_redirect',
 'is_redirect',
 'iter_content',
 'iter_lines',
 'json',
 'links',
 'next',
 'ok',
 'raise_for_status',
 'raw',
 'reason',
 'request',
 'status_code',
 'text',
 'url']

In [5]:
response.__dict__.keys()

dict_keys(['_content', '_content_consumed', '_next', 'status_code', 'headers', 'raw', 'url', 'encoding', 'history', 'reason', 'cookies', 'elapsed', 'request', 'connection'])

In [6]:
response.__attrs__

['_content',
 'status_code',
 'headers',
 'url',
 'history',
 'encoding',
 'reason',
 'cookies',
 'elapsed',
 'request']
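
Most of the names returned by `dir` are special (double-underscore) or private members. Below is a minimal sketch for narrowing the list to the public names, shown on a stand-in class so that it runs without a network request (`Demo` and `public_names` are hypothetical names):

```python
class Demo:
    """A tiny stand-in object with one public attribute, one private attribute, and one method."""

    def __init__(self):
        self.status_code = 200
        self._content = b""

    def json(self):
        return {}


def public_names(obj):
    """Return the names from dir(obj) that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith("_")]


print(public_names(Demo()))
```

Calling `public_names(response)` on the response object from the earlier cell should list names such as `status_code`, `headers`, and `text`.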

In [7]:
response.content

b'\n        <!doctype html>\n        <html lang="en">\n            <head>\n                <meta charSet="utf-8" />\n\n                <!-- ESPNFITT | 87b4cdf71558 | 4076 | 955e0eb0ecd8a0301101a5f70bd91686dd88caf8 | Thu, 27 Aug 2020 07:20:44 GMT -->\n                \n                <title data-react-helmet="true">Clemson Tigers Roster | ESPN</title>\n                <meta data-react-helmet="true" name="description" content="Visit ESPN to view the Clemson Tigers team roster for the current season"/><meta data-react-helmet="true" name="keywords" content="College Football, Football, Clemson Tigers, Roster"/><meta data-react-helmet="true" property="fb:app_id" content="116656161708917"/><meta data-react-helmet="true" property="og:site_name" content="ESPN"/><meta data-react-helmet="true" property="og:url" content="https://www.espn.com/college-football/team/roster/_/id/228/clemson-tigers"/><meta data-react-helmet="true" property="og:title" content="Clemson Tigers Roster | ESPN"/><meta data-
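
The leading `b'` shows that `response.content` is a `bytes` object rather than a `str`. Below is a short sketch of converting between the two, using a hand-written snippet in place of the real page and assuming UTF-8, which matches the page's `<meta charSet="utf-8" />` tag:

```python
# A tiny stand-in for response.content (the real page is much longer).
raw = b'<title>Clemson Tigers Roster | ESPN</title>'

# Decode the bytes into text; UTF-8 is assumed here.
text = raw.decode('utf-8')

print(type(raw).__name__, type(text).__name__)
# Encoding the text again gives back the original bytes.
print(text.encode('utf-8') == raw)
```

The `requests` response also exposes a decoded form directly as `response.text`.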

## Save the Webpage to a Local File

To avoid requesting the same page repeatedly, we can save the web page to a local file.

In [8]:
def save_html(html, path):
    with open(path, 'wb') as f:
        f.write(html)
        
save_html(response.content, 'clemson_tiger_roster.html')

On Linux, we can check the contents of a text file using tools like `cat`, `head`, `tail`, `less`, and `wc`. In a notebook, you can run a Linux command by putting a `!` character in front of it. For example:

In [9]:
!ls -al *.html

-rw------- 1 xizhouf cuuser 374085 Aug 27 03:20 clemson_tiger_roster.html


In [10]:
!head -n 5 clemson_tiger_roster.html


        <!doctype html>
        <html lang="en">
            <head>
                <meta charSet="utf-8" />


## Read HTML File

In [11]:
def read_html(path):
    with open(path, 'rb') as f:
        return f.read()
    
html = read_html('clemson_tiger_roster.html')

You can check the type and the contents of the `html` object by printing both.

In [12]:
print(type(html))
print(html)

<class 'bytes'>
b'\n        <!doctype html>\n        <html lang="en">\n            <head>\n                <meta charSet="utf-8" />\n\n                <!-- ESPNFITT | 87b4cdf71558 | 4076 | 955e0eb0ecd8a0301101a5f70bd91686dd88caf8 | Thu, 27 Aug 2020 07:20:44 GMT -->\n                \n                <title data-react-helmet="true">Clemson Tigers Roster | ESPN</title>\n                <meta data-react-helmet="true" name="description" content="Visit ESPN to view the Clemson Tigers team roster for the current season"/><meta data-react-helmet="true" name="keywords" content="College Football, Football, Clemson Tigers, Roster"/><meta data-react-helmet="true" property="fb:app_id" content="116656161708917"/><meta data-react-helmet="true" property="og:site_name" content="ESPN"/><meta data-react-helmet="true" property="og:url" content="https://www.espn.com/college-football/team/roster/_/id/228/clemson-tigers"/><meta data-react-helmet="true" property="og:title" content="Clemson Tigers Roster | ES

The `html` object has the same data type (`bytes`) and the same value as `response.content`.
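
Saving and reading can be combined into a single helper that only contacts the server when no local copy exists, which is one way to follow the "one request per page" guideline. This is a sketch under assumptions: `download` and `load_page` are hypothetical names, and the 10-second timeout is an arbitrary choice.

```python
from pathlib import Path


def download(url):
    """Fetch a page over HTTP and return its body as bytes."""
    import requests  # imported here so cached reads work even without requests installed

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.content


def load_page(url, path, fetch=download):
    """Return the page from the local file if it exists, otherwise fetch and save it."""
    cache = Path(path)
    if cache.exists():
        return cache.read_bytes()
    html = fetch(url)
    cache.write_bytes(html)
    return html


# Example call (commented out so that running this cell makes no request):
# html = load_page(url, 'clemson_tiger_roster.html')
```

The `fetch` parameter exists so that the download step can be swapped out, for example with a fake function when testing.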

# Using Beautiful Soup to Extract the Clemson Tigers Roster

## Make the Soup

In [13]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')

## Find the Headings

In [14]:
soup.find_all('h1')

[<h1 class="ClubhouseHeader__Name ttu flex items-start n2"><span class="ClubhouseHeader__Rank clr-gray-05 pr2 n5">1</span><span class="flex flex-wrap"><span class="db pr3 nowrap fw-bold">Clemson</span><span class="db">Tigers</span></span></h1>,
 <h1 class="headline headline__h1 dib">Clemson Tigers Roster</h1>]

In [15]:
h1 = soup.find('h1', string='Clemson Tigers Roster')
h1

<h1 class="headline headline__h1 dib">Clemson Tigers Roster</h1>

In [16]:
tables = soup.find_all('table')

In [17]:
type(tables)

bs4.element.ResultSet

In [18]:
len(tables)

3

In [19]:
soup.find_all("caption")

[]

## Find the Offense Players Table

In [20]:
soup.find_all("caption", string="Offense")

[]

In [21]:
soup.find_all("caption", string="Offense")[0].find_next("table")

IndexError: list index out of range
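
The `IndexError` arises because `find_all` returned an empty list, so indexing it with `[0]` fails: the downloaded page has no `<caption>` elements. One possible workaround, sketched here on a small hand-written HTML fragment, is to search for the title text itself and then move to the next table. The markup below (a `div` holding the title before each table) is an assumption for illustration and is not taken from the ESPN page; `find_table_by_title` is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in markup; the real page's structure may differ.
SAMPLE = """
<div><div class="title">Offense</div>
<table><thead><tr><th>Name</th><th>POS</th></tr></thead>
<tbody><tr><td>Player A</td><td>QB</td></tr></tbody></table></div>
<div><div class="title">Defense</div>
<table><thead><tr><th>Name</th><th>POS</th></tr></thead>
<tbody><tr><td>Player B</td><td>LB</td></tr></tbody></table></div>
"""


def find_table_by_title(soup, title):
    """Return the first <table> after the text node equal to `title`, or None."""
    node = soup.find(string=title)
    if node is None:
        return None   # no such title: the caller can handle this without an IndexError
    return node.find_next("table")


sample_soup = BeautifulSoup(SAMPLE, "html.parser")
defense = find_table_by_title(sample_soup, "Defense")
print([td.get_text() for td in defense.find_all("td")])
print(find_table_by_title(sample_soup, "Kickers"))
```

Returning `None` for a missing title lets the calling code test the result before using it, instead of failing on an empty list.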

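In [None]:
# The page has no <caption> elements, so the caption lookup fails. Assuming the
# three tables appear in page order (Offense first), take the first table instead.
t = tables[0]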

### Extract the table's column names

In [None]:
t.thead

In [None]:
t.thead
type(t.thead)

In [None]:
# enumerate() supplies the running column number, starting at 1.
for i, c in enumerate(t.find_all('th'), start=1):
    print("{}\t{}".format(i, c.get_text()))

In [None]:
headers = [c.get_text() for c in t.find_all('th')]
headers

### Extract the players' data

In [None]:
def is_player_row(tag):
    # A player row is a <tr> that contains data cells but no nested table.
    return bool(tag.name == 'tr'
                and not tag.find("table")
                and tag.find(class_="Table2__td"))

# Print each player row as a numbered list of cell texts.
for i, c in enumerate(t.tbody.find_all(is_player_row), start=1):
    cols = [td.get_text() for td in c.find_all('td')]
    print("{}\t{}".format(i, cols))

## Save the Extracted Data

We could replace the `print` call in the snippet above with a file object's `write` method. Here, however, we first collect the data in a list and then write the list to a file.

In [None]:
# Collect the cell texts of every player row into a list of lists.
players = []
for c in t.tbody.find_all(is_player_row):
    cols = [td.get_text() for td in c.find_all('td')]
    players.append(cols)

In [None]:
players
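
Each entry in `players` is a plain list, so the meaning of a value depends on its position. If named access is more convenient, each row can be paired with the column names using `zip`. The sample values below are made up for illustration and are not scraped data; `rows_to_records` is a hypothetical helper.

```python
# Made-up column names and rows standing in for `headers` and `players`.
sample_headers = ["Name", "POS", "WT"]
sample_rows = [
    ["Player A", "QB", "220 lbs"],
    ["Player B", "WR", "195 lbs"],
]


def rows_to_records(headers, rows):
    """Pair every row with the column names, giving one dict per player."""
    return [dict(zip(headers, row)) for row in rows]


records = rows_to_records(sample_headers, sample_rows)
print(records[0]["POS"])
```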

In [None]:
def write_csv(headers, rows, path):
    with open(path, "w") as f:
        f.write(','.join(headers[1:]))
        f.write('\n')
        for row in rows:
            f.write(','.join(row[1:]))
            f.write('\n')
write_csv(headers, players, "clemson_tigers_roster.csv")

In [None]:
!head -n 4 clemson_tigers_roster.csv

In [None]:
def write_csv(headers, rows, path):
    with open(path, "w") as f:
        # Skip the first column of the header and of every row.
        f.write(','.join(headers[1:]))
        f.write('\n')
        for row in rows:
            # Replace commas inside a field so they are not read as column separators.
            row = [c.replace(',', ' ') for c in row]
            f.write(','.join(row[1:]))
            f.write('\n')
write_csv(headers, players, "clemson_tigers_roster.csv")
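
Replacing commas keeps the file parseable but changes the data. An alternative, sketched below, is the standard library's `csv` module, whose writer quotes any field that contains a comma. Unlike `write_csv` above, the hypothetical `write_csv_quoted` keeps every column, and the sample row is made up for the demonstration.

```python
import csv
import os
import tempfile


def write_csv_quoted(headers, rows, path):
    """Write headers and rows to `path`, quoting fields where needed."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(headers)
        writer.writerows(rows)


def read_csv_rows(path):
    """Read the file back as a list of rows (each a list of strings)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.reader(f))


# Made-up data with a comma inside one field.
demo_headers = ["Name", "Birthplace"]
demo_rows = [["Player A", "Clemson, SC"]]

demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
write_csv_quoted(demo_headers, demo_rows, demo_path)
print(read_csv_rows(demo_path))
```

`pd.read_csv` understands this quoting by default, so a file written this way can be loaded without further changes.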

# Perform Simple Data Analysis Using pandas

## Read the Data into a pandas DataFrame

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
roster = pd.read_csv("clemson_tigers_roster.csv")

## Get Some Basic Descriptive Statistics

In [None]:
roster.head(4)

In [None]:
roster.describe()
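
For a DataFrame with mixed column types, `describe()` summarizes only the numeric columns by default, so columns read as text are left out. If the weight and height columns hold strings, they would need converting first. The sketch below assumes formats such as `220 lbs` and `6' 3"`, which the notebook output does not confirm; the helper names are hypothetical.

```python
import re


def weight_to_lbs(text):
    """Extract the leading integer from a string such as '220 lbs', or return None."""
    match = re.match(r"\s*(\d+)", text)
    return int(match.group(1)) if match else None


def height_to_inches(text):
    """Convert a feet-and-inches string to total inches, or return None."""
    match = re.match(r"\s*(\d+)'\s*(\d+)\"", text)
    if not match:
        return None
    feet, inches = match.groups()
    return int(feet) * 12 + int(inches)


print(weight_to_lbs("220 lbs"))
print(height_to_inches("6' 3\""))
```

With pandas, such a function could be applied to a whole column, for example `roster['WT'].map(weight_to_lbs)`, provided the column actually has that format.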

## Plot the Data

In [None]:
plt.style.use('seaborn-whitegrid')
plt.scatter(roster['POS'], roster['WT'])
plt.title("Position vs Weight")
plt.xlabel("POS")
plt.ylabel("WT")