# An Introduction to Web Scraping

How to use Python to extract (scrape) content from web pages.
In this tutorial we will learn how to extract simple text elements, images, and tables from different web pages.

New 3rd party packages we will use:

* **requests**: http://docs.python-requests.org/en/latest/user/quickstart/
  * The requests package can crawl (load) webpages and download (scrape) their contents
* **Beautiful Soup**: http://www.crummy.com/software/BeautifulSoup/bs4/doc/
  * The Beautiful Soup package can transform scraped web content into an object that can be parsed and analyzed

In [1]:
# "Regex" is short for regular expression.
# Regexes are strings that describe text patterns and can be used for
# almost any kind of pattern matching.
# The re package contains the Python functions used for regex pattern
# matching.
import re

import requests

# Beautiful Soup version 4.x
import bs4

# ipython notebook-specific library to display images and other media inline
from IPython.display import display, Image

# ipython notebook-specific library to render HTML code
from IPython.display import HTML

### Making a request

In [2]:
response = requests.get("http://bootcamp-form.herokuapp.com/") # Returns a Response object

print(response) # Response status code
print(response.status_code)
print(response.url)

<Response [200]>
200
http://bootcamp-form.herokuapp.com/


### Most common HTTP status codes:
* 200, **OK**. The request was successful.
* 303, **See Other**. The page redirects to another URL. Your web browser fetches the new URL automatically, and so does `requests` by default (pass `allow_redirects=False` to turn this off).
* 401, **Unauthorized**. The URL requires authentication (e.g. a password), which was not provided or was incorrect.
* 404, **Not Found**. The URL does not exist.
* 500, **Internal Server Error**. The server is having _unexpected_ problems and the web page is down.
* 503, **Service Unavailable**. The web page is down, likely for server maintenance.

More codes: http://en.wikipedia.org/wiki/List_of_HTTP_status_codes
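
The names of these codes can also be looked up from Python. Below is a minimal sketch using the standard library's `http.HTTPStatus`; the helper name `describe_status` is made up for this illustration and is not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short label such as '404 Not Found' for an HTTP status code."""
    # HTTPStatus raises ValueError for codes it does not know
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"


# The codes listed above
for code in (200, 303, 401, 404, 500, 503):
    print(describe_status(code))
```

In a scraper, the same lookup could be applied to `response.status_code` to log a readable message when a request does not return 200.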

In [3]:
# The raw HTML content, i.e., the web page's source code,
# is accessible through the Response.text attribute
print(response.text)

<html>
  <head>    <title>Big Data Initiative</title>
    <!-- Bootstrap Files -->
    <link href="bootstrap/css/bootstrap.min.css" rel="stylesheet" />
    
    <!-- Custom CSS Files -->
    <link href="bootstrap/css/ooga.css" rel="stylesheet" />
    
    <!-- Le Javascript -->
    <script type="text/javascript" src="bootstrap/js/bootstrap.min.js"></script>
    <script type="text/javascript" src="bootstrap/js/jquery.min.js"></script>
  </head>
  <body>
    <nav class="navbar navbar-default" role="navigation">
      <div class="container-fluid">
        <div class = "collapse navbar-collapse" id = "navbar-menu">
          <ul class="nav navbar-nav">
          	<li class = "active"><a class = "menu-button" href="index.php"> Main </a></li>
          	<li class = "active"><a class = "menu-button" href = "student_add.php">Register</a></li>
  		    <li class = "active"><a class = "menu-button" href="student_id.php">Check My Enrollment </a></li>
  		  </ul>
  		  <ul class="nav navbar-nav nav

---
## Detour: A (very brief) intro to HTML

HTML is a markup language for describing web documents. It stands for **H**yper **T**ext **M**arkup **L**anguage. HTML, together with CSS (**C**ascading **S**tyle **S**heets, for _styling_ web documents) and JavaScript (for making web documents _interactive_), is used to construct web pages.

HTML documents are built using a series of HTML _tags_. Each tag describes a different type of content. Web pages are built by putting together different tags.

General HTML tag structure:

```html
<tagname tag_attribute1="attribute1value1 attribute1value2" tag_attribute2="attribute2value1">tag contents</tagname>
```
* Tags (usually) have both a start (or opening) tag, `<tagname>`, and an end (or closing) tag, `</tagname>`.
* Tags can also have attributes, which are declared _inside_ the opening tag.
* The actual tag _content_ goes between the opening and closing tags.
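
Beautiful Soup, which this tutorial uses later on, exposes the same three pieces (name, attributes, content) on every tag it parses. Here is a small sketch; the `<a>` tag is a made-up example, and `"html.parser"` is Python's built-in parser.

```python
import bs4

# A hypothetical tag, made up for this illustration
snippet = '<a class="menu-button big" href="index.php">Main</a>'
tag = bs4.BeautifulSoup(snippet, "html.parser").a

print(tag.name)      # the tag name
print(tag["href"])   # an attribute with a single value
print(tag["class"])  # class can hold several values, so it comes back as a list
print(tag.text)      # the content between the opening and closing tags
```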

Tags can be contained (nested) inside other tags, which defines relationships between them:

```html
<parent>
  <brother></brother>
  <sister>
    <grandson></grandson>
  </sister>
</parent>
```

* `<parent>` is the _parent_ tag of `<brother>` and `<sister>`
* `<brother>` and `<sister>` are the _children_ or _direct descendant_ tags of `<parent>`
* `<brother>`, `<sister>`, and `<grandson>` are the _descendant_ tags of `<parent>`
* `<brother>` and `<sister>` are _sibling_ tags
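
These relationships can be checked with Beautiful Soup, which is introduced later in this tutorial. The sketch below parses the same family of made-up tags, written on one line so that no whitespace strings appear between them.

```python
import bs4

family = "<parent><brother></brother><sister><grandson></grandson></sister></parent>"
soup = bs4.BeautifulSoup(family, "html.parser")

# soup.parent would be the soup's own parent, so look the tag up by name instead
parent = soup.find("parent")

children = [child.name for child in parent.children]      # direct descendants only
descendants = [tag.name for tag in parent.descendants]    # every nested tag
sibling = parent.brother.next_sibling.name                # the tag right after <brother>
grandson_parent = parent.sister.grandson.parent.name      # one level up from <grandson>

print(children)
print(descendants)
print(sibling, grandson_parent)
```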

Here's a very simple web document:

```html
<!DOCTYPE html>
<html>
  <head>
    <title>Page Title</title> 
  </head>

  <body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
  </body>
</html> 
```

When you access any URL, your browser (Chrome, Firefox, Safari, IE, etc.) is actually reading a document such as this one and using the tags in the document to decide how to render the page for you.

In [4]:
first_html = """
<!DOCTYPE html>
<html>
  <head>
    <title>Page Title</title>
  </head>
  
  <body>
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
  </body>

</html> 
"""
HTML(first_html)

### Let's look at what the different tags mean:

```html
<!-- This is how you write a comment in HTML. Comments will not appear in the browser -->

<!-- This line simply identifies the document type to be HTML-->
<!DOCTYPE html>
<!-- Content between <html> and </html> tags define everything about the document-->
<html>
  <!-- Tags inside the <head> provide information about the document -->
  <head>
    <!-- Like the <title> tag which provides a title that appears in the browser's title and tab bars -->
    <title>Page Title</title>
  </head>
  
  <!-- Anything inside the <body> tags describes visible page content -->
  <body>
    <!-- The <h1> tag defines a header. The number sets the header's level. -->
    <!-- There are 6 levels of headers: <h1> to <h6> -->
    <!-- The higher the number, the smaller the font used to display it. -->
    <h1>My First Heading</h1>
    <!-- The <p> represents a paragraph.-->
    <p>My first paragraph.</p>
  </body>
</html>
```

### Other important HTML tags

**Different levels of headers**

```html
<h1>This is heading 1</h1>
<h2>This is heading 2</h2>
<h3>This is heading 3</h3>
<h4>This is heading 4</h4>
<h5>This is heading 5</h5>
<h6>This is heading 6</h6> 
```

**Links**
```html
<a href="http://www.website.com">Click to go to website.com</a>
```

**Images**
```html
<!-- Notice that the image tag has no closing tag and no content outside the opening tag -->
<img src="smiley.gif">
```

**Lists**
```html
<ul>
  <li>One Element</li>
  <li>Another Element</li>
</ul>

<ol>
  <li>First Ordered Element</li>
  <li>Second Ordered Element</li>
</ol>
```

**Tables**
```html
<table>
  <!-- An HTML table is defined as a series of rows (<tr>) -->
  <!-- The individual cell (<td>) contents are nested inside rows -->
  <tr>
    <!-- The <th> tag defines column headers -->
    <th>First Header</th>
    <th>Second Header</th>
  </tr>
  <tr>
    <td>Row 2, Col 1</td>
    <td>Row 2, Col 2</td>
  </tr>
  <tr>
    <td>Row 3, Col 1</td>
    <td>Row 3, Col 2</td>
  </tr>
</table>
```

In [5]:
more_tags = """
<html>
<head>
  <title>More HTML Tags</title>
</head>
<body>
  <h1>This is heading 1</h1>
  <h2>This is heading 2</h2>
  <h3>This is heading 3</h3>
  <h4>This is heading 4</h4>
  <h5>This is heading 5</h5>
  <h6>This is heading 6</h6>

  <br>
  
  <a href="http://www.website.com">Click to go to website.com</a>

  <p><img src="smiley.gif"></p>

  <ul>
    <li>One Element</li>
    <li>Another Element</li>
  </ul>

  <ol>
    <li>First Ordered Element</li>
    <li>Second Ordered Element</li>
  </ol>

  <table>
    <!-- An HTML table is defined as a series of rows (<tr>) -->
    <!-- The individual cell (<td>) contents are nested inside rows -->
    <tr>
      <!-- The <th> tag defines column headers -->
      <th>First Header</th>
      <th>Second Header</th>
    </tr>
    <tr>
      <td>Row 2, Col 1</td>
      <td>Row 2, Col 2</td>
    </tr>
    <tr>
      <td>Row 3, Col 1</td>
      <td>Row 3, Col 2</td>
    </tr>
  </table>
</body>
</html>
"""

HTML(more_tags)

First Header,Second Header
"Row 2, Col 1","Row 2, Col 2"
"Row 3, Col 1","Row 3, Col 2"


For more about HTML: http://www.w3schools.com/html/html_intro.asp

---
### Back to web scraping

In [6]:
response = requests.get("http://bootcamp-form.herokuapp.com/")

### Selecting the text from the Bootcamp's front page

In [8]:
# First we turn the document into a "soup"
# Naming the parser explicitly ("html.parser" ships with Python) avoids a warning
soup = bs4.BeautifulSoup(response.text, "html.parser")

# The text from the front page is declared as headers of different levels
bootcamp_text = soup.find_all("h1")        # finds all <h1> tags in the document

# Remaining pieces of text
bootcamp_text.extend( soup.find_all("h3") )
bootcamp_text.extend( soup.find_all("h4") )

for header in bootcamp_text:
    print(header.text)

Big Data Initiative
Welcome to the registration page for the NU's Spring 2015 Programming Bootcamp 
The bootcamp is organized by  Luis Amaral , Professor of Chemical & Biological Engineering and co-Director of the Northwestern Institute on Complex Systems
The bootcamp is free of charge


In [9]:
# Beautiful Soup converts HTML tags into its own "Tag" objects
print(type(bootcamp_text[0]))

<class 'bs4.element.Tag'>


In [10]:
# These objects have several useful attributes
print(bootcamp_text[0].text)
print(bootcamp_text[0].name)

Big Data Initiative
h1


### Selecting the logos from the Bootcamp's front page

In [11]:
logos = soup.find_all("img")
print(logos)

[<img alt="Amaral Lab" class="amaral_logo" src="amaral-logo-white.png" style="height:30px"/>, <img alt="NICO" class="nico_logo" src="nico_logo.gif"/>]


In [12]:
# The html tag's attributes are also stored
print(logos[0].attrs)
print(logos[0]["class"])
print(logos[0]["src"])

{'src': 'amaral-logo-white.png', 'alt': 'Amaral Lab', 'class': ['amaral_logo'], 'style': 'height:30px'}
['amaral_logo']
amaral-logo-white.png
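
Not every tag carries every attribute. The sketch below uses the second logo from the output above, which has no `style` attribute, to contrast dictionary-style access with `Tag.get`.

```python
import bs4

# The NICO logo tag, copied from the find_all("img") output above
html = '<img alt="NICO" class="nico_logo" src="nico_logo.gif"/>'
logo = bs4.BeautifulSoup(html, "html.parser").img

print(logo["src"])                # present, so plain indexing works
print(logo.get("style"))          # missing: .get() returns None instead of raising
print(logo.get("style", "none"))  # or a default value of our choosing

try:
    logo["style"]                 # missing: indexing raises KeyError
except KeyError:
    print("no style attribute")
```

`get` is usually the safer choice when looping over many tags whose attributes may differ.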


In [13]:
# You can also use Beautiful Soup to find one specific element,
# and specify attributes to make the search more precise
nico_logo = soup.find("img", class_="nico_logo")
print(nico_logo)

# Or equivalently
# print(soup.find("img", alt="NICO"))
# print(soup.find("img", src="nico_logo.gif"))

# The "src" attribute represents a relative path of the image from the current URL
# To get the actual image we must prepend the web page's URL
display( Image(url=response.url + "/" + nico_logo["src"]) )

<img alt="NICO" class="nico_logo" src="nico_logo.gif"/>
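
Plain string concatenation depends on whether the page URL already ends in a slash (here it does, so the result contains `//`). A more general option is `urllib.parse.urljoin` from the standard library. The sketch below is an illustration only; the helper name `absolute_url` and the `example.com` address are made up.

```python
from urllib.parse import urljoin


def absolute_url(page_url, src):
    """Resolve a (possibly relative) src attribute against the page's URL."""
    return urljoin(page_url, src)


page_url = "http://bootcamp-form.herokuapp.com/"

# A relative src is resolved against the page URL without doubling the slash
print(absolute_url(page_url, "nico_logo.gif"))

# The page's own file name is dropped before joining
print(absolute_url(page_url + "index.php", "nico_logo.gif"))

# A src that is already absolute is returned unchanged
print(absolute_url(page_url, "http://example.com/logo.png"))
```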


### Selecting elements by their position in a web document (or page)

In [14]:
# let's look at the Bootcamp page again
print(soup.prettify())     # prettify() adds indentation to the HTML

<html>
 <head>
  <title>
   Big Data Initiative
  </title>
  <!-- Bootstrap Files -->
  <link href="bootstrap/css/bootstrap.min.css" rel="stylesheet"/>
  <!-- Custom CSS Files -->
  <link href="bootstrap/css/ooga.css" rel="stylesheet"/>
  <!-- Le Javascript -->
  <script src="bootstrap/js/bootstrap.min.js" type="text/javascript">
  </script>
  <script src="bootstrap/js/jquery.min.js" type="text/javascript">
  </script>
 </head>
 <body>
  <nav class="navbar navbar-default" role="navigation">
   <div class="container-fluid">
    <div class="collapse navbar-collapse" id="navbar-menu">
     <ul class="nav navbar-nav">
      <li class="active">
       <a class="menu-button" href="index.php">
        Main
       </a>
      </li>
      <li class="active">
       <a class="menu-button" href="student_add.php">
        Register
       </a>
      </li>
      <li class="active">
       <a class="menu-button" href="student_id.php">
        Check My Enrollment
       </a>
      </li>
     </ul>
    

Notice how the images are nested inside `<a>` tags which in turn are nested inside `<li>` tags, nested inside `<ul>`:

```html
<ul class="nav navbar-nav navbar-right">
  <li class="active">
    <a class="menu-button" href="http://amaral-lab.org/">
      <img alt="Amaral Lab" class="amaral_logo" src="amaral-logo-white.png" style="height:30px"/>
        Amaral Lab
    </a>
  </li>
  <li class="active">
    <a class="menu-button" href="http://www.nico.northwestern.edu/index.html">
      <img alt="NICO" class="nico_logo" src="nico_logo.gif"/>
    </a>
  </li>
</ul>
```

In [15]:
amaral_logo = soup.find("img", class_="amaral_logo")

# You can navigate from one Tag to any of its relatives
print(amaral_logo.parent.prettify())

print()

print(amaral_logo.parent.parent.parent.prettify())

<a class="menu-button" href="http://amaral-lab.org/">
 <img alt="Amaral Lab" class="amaral_logo" src="amaral-logo-white.png" style="height:30px"/>
 Amaral Lab
</a>

<ul class="nav navbar-nav navbar-right">
 <li class="active">
  <a class="menu-button" href="http://amaral-lab.org/">
   <img alt="Amaral Lab" class="amaral_logo" src="amaral-logo-white.png" style="height:30px"/>
   Amaral Lab
  </a>
 </li>
 <li class="active">
  <a class="menu-button" href="http://www.nico.northwestern.edu/index.html">
   <img alt="NICO" class="nico_logo" src="nico_logo.gif"/>
  </a>
 </li>
</ul>



In [16]:
# You can even navigate using tag names
amaral_logo.parent.parent.parent.li.next_sibling

'\n'

In [17]:
# .next_sibling returned a newline string above, so we step twice to reach the next <li>
nico_li = amaral_logo.parent.parent.parent.li.next_sibling.next_sibling

print(nico_li.prettify())

<li class="active">
 <a class="menu-button" href="http://www.nico.northwestern.edu/index.html">
  <img alt="NICO" class="nico_logo" src="nico_logo.gif"/>
 </a>
</li>



Now we can finally extract the other logo:

In [18]:
nico_logo = nico_li.a.img

display(Image(url=response.url + "/" + nico_logo["src"]))
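
The two `.next_sibling` hops used above were needed because the newline between the `<li>` tags counts as a sibling too. Beautiful Soup's `find_next_sibling` skips such strings and returns the next matching tag. The sketch below uses a simplified list, loosely based on the navigation bar shown earlier, rather than the real page.

```python
import bs4

# Simplified markup for illustration; the real page has more attributes and images
html = """<ul>
<li class="active"><a href="http://amaral-lab.org/">Amaral Lab</a></li>
<li class="active"><a href="http://www.nico.northwestern.edu/index.html">NICO</a></li>
</ul>"""
soup = bs4.BeautifulSoup(html, "html.parser")
first_li = soup.ul.li

# The immediate sibling is the newline between the two <li> tags
print(repr(first_li.next_sibling))

# find_next_sibling ignores strings and returns the next <li> Tag
second_li = first_li.find_next_sibling("li")
print(second_li.a["href"])
```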

## Scraping Exercises

## Soccer Premier League [scores](http://en.wikipedia.org/wiki/1992%E2%80%9393_FA_Premier_League#League_table)

Write a function `get_league_table` that returns the HTML `<table>...</table>` element with the final league table found at a given URL. Then use the provided `html_table_to_list` to convert the table to a Python list.

In [None]:
def html_table_to_list(html_table):
    """
    Takes an html <table>...</table> BeautifulSoup element
    and converts it to an equivalent python list.
    """
    table_rows = html_table.find_all("tr")
    
    # Record whether the first row contains header cells (<th>).
    # Note: this flag is not used in the rest of this function.
    has_headers = bool(table_rows[0].find_all("th"))
    
    table_list = []
    for row in table_rows:
        table_list.append(
            # Because table cells can have other tags inside them,
            # it is easier to get all the text inside the row
            # and manually remove any newline characters.
            # Note that the newlines are from the html code itself.
            
            # "\xa0" is a non-breaking space (the HTML entity &nbsp;),
            # which we replace with a regular space
            row.text.replace("\xa0", " ").strip("\n").split("\n")
        )
    
    return table_list
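
As an alternative to splitting each row's text on newlines, the cells can be read one by one. This is only a sketch under simple assumptions (no nested tables, no cells spanning several columns); the name `table_to_rows` is made up and is not the helper provided above.

```python
import bs4


def table_to_rows(html_table):
    """Return the table as a list of rows, each row a list of cell strings."""
    rows = []
    for row in html_table.find_all("tr"):
        # Header (<th>) and data (<td>) cells, in document order
        cells = row.find_all(["th", "td"])
        rows.append(
            [cell.get_text(strip=True).replace("\xa0", " ") for cell in cells]
        )
    return rows


# The small table from the HTML detour, shortened to two rows
example = """
<table>
  <tr><th>First Header</th><th>Second Header</th></tr>
  <tr><td>Row 2, Col 1</td><td>Row 2, Col 2</td></tr>
</table>
"""
table = bs4.BeautifulSoup(example, "html.parser").table
print(table_to_rows(table))
```

Reading cell by cell keeps a cell together even when its text contains a line break, at the cost of a slightly longer function.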

In [None]:
def get_league_table(url):
    """
    Searches `url` for the html table with the league results
    and returns it as a Beautiful Soup Tag object
    """
    # Get the text from the url

    # Turn it into a "soup"

    
    # Hint: Check the source code from wikipedia.
    # Does the league scores element have any
    # attributes we can use to find it?
    # What about the parent of the league scores element??
    
    
    

In [None]:
# Test your code here

# 1992 Premier league scores
root_url = "http://en.wikipedia.org/wiki/1992%E2%80%9393_FA_Premier_League"

html_table = get_league_table(root_url)
html_table_to_list(html_table)


## Scraping song lyrics from [AZLyrics](http://www.azlyrics.com)

Create a function `get_song_lyrics` that scrapes the lyrics of a song from its page, given the song's URL.
You can use the provided `get_artist_songs` to get a list of links to an artist's song pages and as inspiration for how to write your scraper.

In [None]:
root_url = "http://www.azlyrics.com"

def get_artist_songs(artist_name):
    """
    Given an artist's name, crawls AZLyrics.com for that artist's songs.
    Returns a list of links to each song's lyrics' page.
    """
    # This line removes any strange characters (e.g. @#$%^&*, etc)
    # and white spaces from the artist's name and converts it to
    # lower case
    # (a raw string keeps the backslashes in the pattern intact)
    artist_name = re.sub(r"[\s\W]+", "", artist_name).lower()
    
    # artist page url is of the form:
    # http://www.azlyrics.com/[Artist Initial]/[Artist Name].html
    artist_url = "/" + artist_name[0] + "/" + artist_name + ".html"
    
    response = requests.get(root_url + artist_url)
    soup = bs4.BeautifulSoup(response.text, "html.parser")
    
    songs = []
    song_elements = soup.find("div", id="listAlbum")
    for song_link in song_elements.find_all("a", target="_blank"):
        songs.append(song_link.attrs.get("href")[3:])
    
    return songs


def get_song_lyrics(song_url):
    """
    Given a song's url, crawls AZLyrics for that song's lyrics
    and returns it as a tuple of strings:
    (song title, song lyrics)
    """
    pass
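
To see what the name clean-up in `get_artist_songs` produces, the regex step can be tried on its own. The sketch below isolates it in a hypothetical helper, `artist_page_path`; whether the resulting paths actually exist on the site is not checked here.

```python
import re


def artist_page_path(artist_name):
    """Build the relative artist-page path the same way get_artist_songs does."""
    # Drop whitespace and non-word characters, then lower-case the rest
    cleaned = re.sub(r"[\s\W]+", "", artist_name).lower()
    return "/" + cleaned[0] + "/" + cleaned + ".html"


print(artist_page_path("Guns N' Roses"))
print(artist_page_path("AC/DC"))
```

Note that `\W` does not match underscores or digits, so those characters stay in the name.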

In [None]:
# Test your code here

# Pick an artist
artist_name = ""
artist_songs = get_artist_songs(artist_name)

#lyrics = get_song_lyrics(artist_songs[])

# print(artist_name + "\n")
# print(lyrics[0])
# print("="*len(lyrics[0]))
# print(lyrics[1])