<DIV ALIGN=CENTER>

# Introduction to Social Media: Web
## Professor Robert J. Brunner
  
</DIV>  
-----


## Introduction

With the explosive growth of the World Wide Web, many data sets have been _published_ by simply posting them online. In some cases, these data can be accessed directly via an API. In many other cases, however, especially for small, hand-crafted data sets, the data are presented as HTML-styled text. As a result, a data scientist is expected to be able to obtain and process web-accessible data. In this IPython Notebook, we first acquire web-accessible data programmatically and then process it to generate new results.

-----

## Structured Text Parsing

To parse structured text, like an XML or an HTML document, we can use the Python [Beautiful Soup][bs] library. This library uses an XML/HTML parser to build a DOM tree, and then provides traversal methods to access and modify the DOM for a specific document. Beautiful Soup is extremely popular because of the ease with which it allows web scraping; for example, you can pull data out of an HTML table. It is more powerful than this, however, as it allows you to easily parse and manipulate any XML document.

To use Beautiful Soup, we first import the library and then create a `BeautifulSoup` object that provides access to the parsed data. Document elements, like `body` or `table`, are accessed directly from the parsed tree, and element attributes or data can be easily extracted, deleted, or replaced. If required, new data can also be added to an existing document, allowing for the dynamic creation of a new document. These capabilities are demonstrated in the following code cells.
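
As a minimal sketch of these operations, the following example works on a small, hypothetical HTML string (not the course webpage) and assumes the standard library `html.parser` backend rather than `lxml`. The names `demo_doc` and `demo_soup` are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML document used only for illustration
demo_doc = """
<html><body>
<h1 id="title">Courses</h1>
<table id="demo"><tr><td>INFO 102</td></tr></table>
</body></html>
"""

demo_soup = BeautifulSoup(demo_doc, 'html.parser')

# Access an element directly from the parsed tree
heading = demo_soup.body.h1

# Extract an attribute value and the element text
heading_id = heading['id']
heading_text = heading.get_text()

# Replace the element text
heading.string = 'Spring Courses'

# Delete an attribute
del heading['id']

# Add a new element to the existing document
new_cell = demo_soup.new_tag('td')
new_cell.string = 'Little Bits to Big Ideas'
demo_soup.find('table').tr.append(new_cell)

# Collect the text of every table cell after the modification
cells = [td.get_text() for td in demo_soup.find_all('td')]
print(cells)
```

The same calls should apply to a document parsed with `'lxml'`, although different parsers can build slightly different trees for malformed HTML.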

-----
[bs]: http://www.crummy.com/software/BeautifulSoup/

In [1]:
# Webpage we wish to explore
url = 'https://courses.illinois.edu/schedule/2016/spring/INFO'

In [2]:
from IPython.display import HTML
HTML('<iframe src={0} width=800 height=400>     </iframe>'.format(url))

In [3]:
# Grab the webpage as HTML

import requests
page = requests.get(url)

html = page.content

-----

## Student Activity

In the preceding cells, we used the `requests` library to programmatically obtain a webpage. In the rest of this Notebook, we will use this webpage and the webpages it links to in order to build an application that processes web-accessible data. Now that you have run the Notebook, try changing the URL to a different page, either from the Course Explorer or from another website. Ensure you can retrieve the web data, and learn to explore other types of web-accessible data.
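
When trying other URLs, it can help to confirm that the request succeeded before parsing the result. The following is a minimal sketch, assuming a hypothetical helper named `fetch_html` (not part of the original cells) and an arbitrary ten second timeout. It relies on the `raise_for_status` method of the `requests` response object, which raises an exception for HTTP error codes.

```python
import requests

def fetch_html(page_url, timeout=10):
    """Return the raw page content, or raise if the request failed.

    `fetch_html` and the default timeout are illustrative assumptions.
    """
    response = requests.get(page_url, timeout=timeout)
    # Raises requests.HTTPError for 4xx and 5xx status codes
    response.raise_for_status()
    return response.content

# Example use (requires network access):
# html = fetch_html('https://courses.illinois.edu/schedule/2016/spring/INFO')
```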

We now turn to the BeautifulSoup parsing library to process the web-accessible data. We first extract part of the text content.

-----

In [4]:
# We use BeautifulSoup version 4
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

# Now let's print out the start of the HTML file
print(soup.prettify()[:80])

<!DOCTYPE html>
<html lang="en">
 <head>
  <title>
   Course Explorer
  </title>


-----

Most modern web browsers can provide a parsed view of a webpage, as shown below. You can use Google to learn how to do this with your specific web browser.

![image](images/html-table.png)
  
As shown in the preceding image, the webpage is structured with specific components that are hierarchically arranged. The specific example shown in the image is an HTML table that contains the course information. In this case, we want to programmatically find this table and subsequently extract the table data. We can use BeautifulSoup to do all of this. First, we find all tables and display the beginning of the first table (there is only one in this case) to look for identifying attributes, such as the table `id`, that make the table easier to find.

-----

In [5]:
tables = soup.find_all('table')

print('Document contains {0} HTML Table(s).'.format(len(tables)))

Document contains 1 HTML Table(s).


In [6]:
print(tables[0].prettify()[:185])

<table class="table table-striped table-bordered table-condensed" id="default-dt">
 <thead>
  <tr>
   <th>
    COURSE NUMBER
   </th>
   <th>
    COURSE TITLE
   </th>
  </tr>
 </thead>


----

We can use BeautifulSoup to find the table directly now by using the `id` attribute. This approach is recommended, since these HTML pages are almost assuredly generated by a program. As a result, the tables holding our data of interest will probably be constructed in the same manner, enabling our application to operate on any course webpage.

Once we have found the correct table, we can grab every row with data (via the `tr` element) before extracting the data (via the `td` elements). BeautifulSoup allows us to find every occurrence and then iterate through the results. Finally, we generate and display a text table containing the relevant information.
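
The row-by-row extraction can be sketched on a small stand-in table. The HTML below is a hand-written fragment modeled on the structure shown above, and the names `demo_page`, `demo_body`, and `courses` are illustrative assumptions. The sketch uses `get_text(strip=True)` as one possible way to pull trimmed text from a cell.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the course table (illustration only)
demo_page = """
<table id="default-dt">
 <thead><tr><th>COURSE NUMBER</th><th>COURSE TITLE</th></tr></thead>
 <tbody>
  <tr><td> INFO 102 </td>
      <td><a href="/schedule/2016/spring/INFO/102"> Little Bits to Big Ideas </a></td></tr>
  <tr><td> INFO 199 </td>
      <td><a href="/schedule/2016/spring/INFO/199"> Undergraduate Open Seminar </a></td></tr>
 </tbody>
</table>
"""

# Locate the table by its id attribute, then step into its body
demo_body = BeautifulSoup(demo_page, 'html.parser').find(id='default-dt').tbody

courses = []
for row in demo_body.find_all('tr'):
    cells = row.find_all('td')
    number = cells[0].get_text(strip=True)   # course number text
    title = cells[1].a.get_text(strip=True)  # link text holds the title
    link = cells[1].a['href']                # relative link to the course page
    courses.append((number, title, link))

for number, title, link in courses:
    print('{0:8s}: {1:s}'.format(number, title))
```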

----

In [7]:
tb = soup.find(id='default-dt').tbody

print(tb.prettify()[:290])

<tbody>
 <tr>
  <td>
   INFO 102
  </td>
  <td>
   <a href="/schedule/2016/spring/INFO/102">
    Little Bits to Big Ideas
   </a>
  </td>
 </tr>
 <tr>
  <td>
   INFO 199
  </td>
  <td>
   <a href="/schedule/2016/spring/INFO/199">
    Undergraduate Open Seminar
   </a>
  </td>
 </tr>
 <tr>



In [8]:
print('{0:8s}: {1:s}'.format('Course', 'Description'))
print(40*'-')
for row in tb.find_all('tr'):
    tdx = [val for val in row.find_all('td')]
    course = tdx[0].contents[0].strip()
    name = tdx[1].a.contents[0].strip()
    print('{0:8s}: {1:s}'.format(course, name))

Course  : Description
----------------------------------------
INFO 102: Little Bits to Big Ideas
INFO 199: Undergraduate Open Seminar
INFO 202: Social Aspects Info Tech
INFO 303: Writing Across Media
INFO 325: Social Media and Global Change
INFO 326: New Media, Culture & Society
INFO 390: Special Topics
INFO 399: Individual Study
INFO 490: Special Topics
INFO 491: Ugrad Bioinformatics Seminar
INFO 500: Orientation Seminar
INFO 510: Research Practicum
INFO 590: Advanced Special Topics
INFO 591: Grad Bioinformatics Seminar
INFO 597: Individual Study
INFO 599: Thesis Research


-----

### Processing Web Hierarchies

Section information is listed on a separate page, for example, [INFO 102][is], which provides information about the location, instructor, and date/time for each section. A portion of this webpage is shown below:

![INFO 102 Sections](images/info102-sections.png)

-----

Looking at this webpage, one would think the data are once again in a table. Thus to grab the data, we can employ the same approach we used for scraping the courses. Note that over the next few code cells we use a single course, INFO 102, along with its webpage to explore how to best scrape this information.

-----
[is]: https://courses.illinois.edu/schedule/2016/spring/INFO/102

In [9]:
# URL for our course's webpage
course_url = 'https://courses.illinois.edu/schedule/2016/spring/INFO/102'

# Grab HTML from webpage
html = requests.get(course_url).content

#Parse HTML
new_soup = BeautifulSoup(html, 'lxml')

In [10]:
# Grab the tables
tables = new_soup.find_all('table')

print('Document contains {0} HTML Table(s).'.format(len(tables)))

Document contains 2 HTML Table(s).


In [11]:
# Two tables were found; display the first table's header and body content
print(20*'-')
print(tables[0].thead)
print(20*'-')
print(tables[0].tbody)

--------------------
<thead>
<tr>
<th>Course</th>
<th>Section</th>
<th>CRN</th>
<th>Date</th>
<th>Day</th>
<th>Start Time</th>
<th>End Time</th>
<th>Room</th>
<th>Exam Type</th>
</tr>
</thead>
--------------------
None


-----

The body of the first table in this webpage is empty (`tbody` is `None`), so how do we grab the data that is displayed? In this case, a more careful perusal of the webpage source (either via a browser tool or by displaying the entire parsed HTML) shows that the data are encoded as a JSON document in a JavaScript variable. The JavaScript runs when the page is rendered in a browser and fills in the table. For example, the following screenshot shows this JavaScript variable:

![JavaScript JSON variable](images/info102-javascript.png)

-----

To obtain these data, we first need to parse the webpage to grab the `script` element. Next, we parse the value of the `sectionDataObj` JavaScript variable to extract the JSON data. Finally, we pull the relevant information from the JSON data.
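
One way to sketch these steps end to end on a stand-in string is shown below. It is an alternative to the regular expression developed in the following cells, not the approach they use. It locates the variable assignment and then lets the standard library's `json.JSONDecoder.raw_decode` parse the whole array. The string `script_demo` and its second CRN value are made up for illustration.

```python
import json
import re

# Stand-in for the contents of the script element (illustration only)
script_demo = r'''
 var sectionDataObj = [{"crn":"63226","section":"<div class=\"app-meeting\">AB1<\/div>"},
                       {"crn":"63227","section":"<div class=\"app-meeting\">AB2<\/div>"}];
'''

# Find where the value assigned to the JavaScript variable begins
start = re.search(r'sectionDataObj\s*=\s*', script_demo)

# raw_decode parses one JSON value starting at the given index and
# ignores whatever follows it (here, the trailing semicolon)
sections_demo, end_index = json.JSONDecoder().raw_decode(script_demo, start.end())

for section in sections_demo:
    print(section['crn'], section['section'])
```

Because the JSON parser tracks string boundaries itself, this sketch does not depend on which characters appear inside the values.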

-----

In [12]:
script_tag = new_soup.find(type='text/javascript')

print(script_tag.prettify()[:1076])

<script type="text/javascript">
 var sectionDataObj = [{"status":"<span class=\"hide\">5<\/span><span class=\"sr-only\">section unknown<\/span><img src=\"\/static\/images\/sectionUnknown.png\" title=\"Unknown\" alt=\"Unknown\"\/>","crn":"63226","type":"<div class=\"app-meeting\">Laboratory<\/div>","section":"<div class=\"app-meeting\">AB1<\/div>","time":"<span class=\"hide\">1600<\/span><div class=\"app-meeting\">04:00PM - 05:50PM<\/div>","day":"<div class=\"app-meeting\">T      <\/div>","location":"<div class=\"app-meeting\">G7 Foreign Languages Building<\/div>","instructor":"<div class=\"app-meeting\">Padua, D<br \/><\/div>","availability":"UNKNOWN","credit":null,"sectionTitle":null,"sectionDescription":null,"courseDescription":null,"sectionDegreeNotes":null,"courseDegreeNotes":"Quant Reasoning I course.","specialApproval":"","approvalCode":null,"sectionFee":"","sectionDateRange":null,"courseDateRange":null,"partOfTerm":"1","info":"Registration will be restricted to officially declar

-----

Now that we have the `script` element, we can extract the relevant data from its contents. To do this, we first grab the element's contents and use a regular expression to find the JSON data. Since each JSON object is enclosed between `{` and `}`, we build a regular expression that captures a group of characters `( ... )` between these curly braces `(\{ ... \})`. The simplest approach is to match one or more characters other than the closing curly brace by using `[^}]+`. We escape the opening and closing curly braces, since they would otherwise be parsed differently by the regular expression engine. In the end, our final regular expression is `r'(\{[^}]+\})'`. Note that this pattern assumes no value contains a `}` character.
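
To see how this pattern behaves, the following sketch applies it to a short, made-up string holding two flat JSON objects, and then to a string whose value contains a closing brace. The names `regex_demo_txt` and `pattern_demo` are illustrative only.

```python
import re

# Made-up script text: two flat JSON objects inside a JavaScript array
regex_demo_txt = 'var sectionDataObj = [{"crn":"63226","day":"T"},{"crn":"63227","day":"W"}];'

pattern_demo = re.compile(r'(\{[^}]+\})')

# Each match runs from an opening brace to the next closing brace
objects = [m.group(0) for m in pattern_demo.finditer(regex_demo_txt)]
print(objects)

# A closing brace inside a value ends the match early
truncated = pattern_demo.search('{"info":"see {note} below"}').group(0)
print(truncated)
```

The second result stops at the first `}` it meets, so the pattern is only safe when the JSON values themselves contain no closing braces.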

-----

In [13]:
script_txt = script_tag.contents[0]

import re

pattern = re.compile(r'(\{[^}]+\})')

match = re.search(pattern, script_txt)

if match:
    print(match.group(0))

{"status":"<span class=\"hide\">5<\/span><span class=\"sr-only\">section unknown<\/span><img src=\"\/static\/images\/sectionUnknown.png\" title=\"Unknown\" alt=\"Unknown\"\/>","crn":"63226","type":"<div class=\"app-meeting\">Laboratory<\/div>","section":"<div class=\"app-meeting\">AB1<\/div>","time":"<span class=\"hide\">1600<\/span><div class=\"app-meeting\">04:00PM - 05:50PM<\/div>","day":"<div class=\"app-meeting\">T      <\/div>","location":"<div class=\"app-meeting\">G7 Foreign Languages Building<\/div>","instructor":"<div class=\"app-meeting\">Padua, D<br \/><\/div>","availability":"UNKNOWN","credit":null,"sectionTitle":null,"sectionDescription":null,"courseDescription":null,"sectionDegreeNotes":null,"courseDegreeNotes":"Quant Reasoning I course.","specialApproval":"","approvalCode":null,"sectionFee":"","sectionDateRange":null,"courseDateRange":null,"partOfTerm":"1","info":"Registration will be restricted to officially declared Informatics minors until Nov. 16th at noon.","corequ

-----

The data extracted by the regular expression group is a string that holds the JSON data. To extract the relevant information, we need to parse this string as JSON, which in Python yields a dictionary. We can then access each element by using a dictionary key. For example, the following code cells parse the JSON string, print out the keys, and extract the `instructor` value.
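
A small sketch of this conversion, using a shortened, hand-written JSON string in the same style as the data above (the name `raw_demo` is illustrative), shows how `json.loads` handles the escaped markup and the `null` values:

```python
import json

# Shortened stand-in for one matched JSON object; a raw string keeps
# the backslash escapes exactly as they appear in the page source
raw_demo = r'{"crn":"63226","section":"<div class=\"app-meeting\">AB1<\/div>","credit":null}'

record = json.loads(raw_demo)

print(type(record))        # a Python dictionary
print(list(record.keys()))
print(record['section'])   # escapes \" and \/ are decoded
print(record['credit'])    # JSON null becomes None
```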

-----

In [14]:
import json

json_txt = json.loads(match.group(0))

print(json_txt.keys())

dict_keys(['instructor', 'status', 'sectionDateRange', 'sectionFee', 'crn', 'availability', 'courseDegreeNotes', 'time', 'info', 'day', 'sectionDescription', 'courseDescription', 'type', 'location', 'section', 'courseDateRange', 'credit', 'restricted', 'corequest', 'approvalCode', 'specialApproval', 'sectionDegreeNotes', 'partOfTerm', 'sectionTitle'])


In [15]:
print(json_txt['instructor'])

<div class="app-meeting">Padua, D<br /></div>


-----

Unfortunately, we are not yet done, since our JSON document contains HTML markup. We could use BeautifulSoup to parse the HTML `div` element, but in this case, it is easier to use the `lxml` library directly to extract the text contents of the HTML element. We can use `lxml` to parse the HTML fragment and apply an [_XPath_][wx] expression to pull the text from the element. In this case, we tell the parser to find any `div` element by using the `//div` pattern, and we select the text contents of this element by appending the `text()` node test. Combined, we have our XPath expression: `xpath('//div/text()')`. We demonstrate this in the following code cell, where we extract the text contents and display the result.
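
For comparison, the BeautifulSoup route mentioned above might look like the following sketch. The helper name `div_text` and its `default` argument are assumptions made for illustration. Note that `get_text` gathers text from all descendants of the `div`, whereas `//div/text()` selects only the text nodes directly inside it.

```python
from bs4 import BeautifulSoup

def div_text(fragment, default='N/A'):
    """Return the stripped text of the first div in an HTML fragment.

    `div_text` is a hypothetical helper, not part of the original cells.
    """
    div = BeautifulSoup(fragment, 'html.parser').div
    if div is None:
        return default
    text = div.get_text(strip=True)
    return text if text else default

print(div_text('<div class="app-meeting">Padua, D<br /></div>'))
print(div_text('<div class="app-meeting">T      </div>'))
```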

-----
[wx]: https://en.wikipedia.org/wiki/XPath

In [16]:
from lxml import html

content = html.fromstring(json_txt['instructor']).xpath('//div/text()')
print('Course Instructor: {0}'.format(content[0]))

Course Instructor: Padua, D


In [17]:
import json
from lxml import html

for match in re.finditer(pattern, script_txt):
    data = match.group(0)
    json_txt = json.loads(data)
    
    print(html.fromstring(json_txt['section']).xpath('//div/text()'))
    print(html.fromstring(json_txt['day']).xpath('//div/text()'))
    print(html.fromstring(json_txt['time']).xpath('//div/text()'))  
    print(html.fromstring(json_txt['instructor']).xpath('//div/text()'))
    print(html.fromstring(json_txt['location']).xpath('//div/text()'))
    print()

['AB1']
['T      ']
['04:00PM - 05:50PM']
['Padua, D']
['G7 Foreign Languages Building']

['AB2']
['W      ']
['04:00PM - 05:50PM']
['Padua, D']
['G7 Foreign Languages Building']

['AB3']
['R      ']
['04:00PM - 05:50PM']
['Padua, D']
['G7 Foreign Languages Building']

['AL1']
['MWF    ']
['09:00AM - 09:50AM']
['Padua, D']
['0216 Siebel Center for Comp Sci']



-----

## Web Processing Application

We can now put everything together to build a complete web parsing application. In this case, given a root page (course listings for a given department), we can traverse the course webpage hierarchy to build a list of courses, and their associated metadata such as location, days of the week, meeting times, and instructors. This same approach can be used to _scrape_ and parse data from other websites. Before doing so, however, you should always check:

1. the website's _terms of use_, which may limit what you can do with any data obtained from the website, and
2. the availability of an API for accessing data from the website, as is the case for Twitter.
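
When traversing from the root page to each course page, the relative `href` values found in the course table have to be combined with the site address. A minimal sketch, assuming the standard library's `urllib.parse.urljoin` is acceptable for this step, is shown below. The names `base`, `href`, and `course_link` are illustrative only.

```python
from urllib.parse import urljoin

base = 'https://courses.illinois.edu/'
href = '/schedule/2016/spring/INFO/102'   # relative link as found in the table

# Plain concatenation keeps both slashes
naive_link = base + href

# urljoin resolves the relative link against the base address
course_link = urljoin(base, href)

print(naive_link)
print(course_link)
```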

-----

In [18]:
import re
import json
from lxml import html

# Extract div element text
def parse_xml(jt, key_str):
    
    html_txt  = html.fromstring(jt[key_str])
    value = html_txt.xpath('//div/text()')
    
    if value:
        return value[0]
    else: 
        return 'N/A'

# Parse course webpage
def get_section_info(url):
    html = requests.get(url).content
    new_soup = BeautifulSoup(html, 'lxml')
    script_tag = new_soup.find(type='text/javascript')
    script_txt = script_tag.contents[0]
    
    for match in re.finditer(pattern, script_txt):
        data = match.group(0)
        json_txt = json.loads(data)
        
        sct = parse_xml(json_txt, 'section')
        day = parse_xml(json_txt, 'day')
        tme = parse_xml(json_txt, 'time')
        ins = parse_xml(json_txt, 'instructor')
        loc = parse_xml(json_txt, 'location')
        
        print('  {0:4s} {1:6s} {2:17s} {3:20s} {4}'.format(sct, day, tme, ins, loc))

# Main parsing loop        
print('{0:8s}: {1:s}'.format('Course', 'Description'))
for row in tb.find_all('tr'):
    tdx = [val for val in row.find_all('td')]
    course = tdx[0].contents[0].strip()
    # The href already begins with '/', so no separator is added here
    c_url = 'https://courses.illinois.edu{0}'.format(tdx[1].a['href'])
    name = tdx[1].a.contents[0].strip()
    print(85*'-')
    print('{0:8s}: {1:s}'.format(course, name))
    get_section_info(c_url)

Course  : Description
-------------------------------------------------------------------------------------
INFO 102: Little Bits to Big Ideas
  AB1  T       04:00PM - 05:50PM Padua, D             G7 Foreign Languages Building
  AB2  W       04:00PM - 05:50PM Padua, D             G7 Foreign Languages Building
  AB3  R       04:00PM - 05:50PM Padua, D             G7 Foreign Languages Building
  AL1  MWF     09:00AM - 09:50AM Padua, D             0216 Siebel Center for Comp Sci
-------------------------------------------------------------------------------------
INFO 199: Undergraduate Open Seminar
  N/A  n.a    ARRANGED          N/A                  n.a
-------------------------------------------------------------------------------------
INFO 202: Social Aspects Info Tech
  AD1  R       02:00PM - 02:50PM Liu, C               127 English Building
  AD2  R       03:00PM - 03:50PM Takazawa, A          1030 Foreign Languages Building
  AD3  R       04:00PM - 04:50PM Takazawa, A          102

-----

## Student Activity

In the preceding cells, we parsed a course web page to build a list of courses and their relevant metadata. Now that you have run the Notebook, go back and make the following changes to see how the results change.

1. Change the department from `INFO` to a different value, such as `ASTR` or `CS`. Does the application still work correctly?
2. Try generating an HTML-styled version of the output. Can you display the result in this Notebook?
3. Can you modify the application to include links to the course description in the course explorer?
4. Modify the application to build a custom course schedule for a given set of courses.

-----