# Advanced Data Parsing | XML / HTML / KML / etc.
---

Beautiful Soup is a powerful Python library for parsing HTML and XML. The Beautiful Soup website provides the best description of its use case:  

"You didn't write that awful page. You're just trying to get some data out of it. Beautiful Soup is here to help. Since 2004, it's been saving programmers hours or days of work on quick-turnaround screen scraping projects."  

We will use Beautiful Soup to extract the data we want from various XML-like files that may also contain data we don't need, and as a convenient way to work with data that is in an otherwise unruly format.

A deep understanding of Beautiful Soup takes both hands-on experience and a full read of the documentation, which can be found here:  

https://www.crummy.com/software/BeautifulSoup/bs4/doc/


## Scraping HTML
--- 

We will start by looking at how to explore an HTML page to extract an unordered list. 

Then we will look at parsing a simple HTML table into a pandas data frame.  

Finally, we will begin an example that uses Beautiful Soup to scrape data from an HTML table on a web page, and leave finishing it as an exercise. The example includes code that downloads the HTML page and pulls some information out of the table on that page. 


### An Unordered List  
---

This example contains a very simple list, showing some of the initial basic syntax needed for Beautiful Soup.

Source: http://www.w3schools.com/html/html_lists.asp

In [None]:
from bs4 import BeautifulSoup

# Defining an HTML String
html = """
<ul style="list-style-type:circle">
  <li>Coffee</li>
  <li>Tea</li>
  <li>Milk</li>
</ul>"""

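# Parse the HTML string into a BeautifulSoup object.
# NOTE: "html" asks for whichever HTML parser is installed (typically lxml
# when available), so the exact clean-up can differ between machines.
soup = BeautifulSoup(html, "html")
# If you uncomment the print below, you will see that
# Beautiful Soup automatically tries to fix
# some of the HTML for us.

# print(soup)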

drinks = []

for item in soup.find_all('li'):
    drinks.append(item.text)
    
print(drinks)
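
The same extraction can be written in a few other ways. The sketch below is only an illustration, not part of the original exercise: it assumes the built-in `html.parser` (chosen here so the result does not depend on which parsers are installed) and shows `find_all`, the CSS-selector method `select`, and dictionary-style access to a tag attribute. The names `drinks_css` and `list_style` are made up for this example.

```python
from bs4 import BeautifulSoup

html = """
<ul style="list-style-type:circle">
  <li>Coffee</li>
  <li>Tea</li>
  <li>Milk</li>
</ul>"""

# Assumption: pin the parser that ships with Python for reproducible results.
soup = BeautifulSoup(html, "html.parser")

# find_all returns every matching tag; get_text(strip=True) trims whitespace.
drinks = [li.get_text(strip=True) for li in soup.find_all("li")]

# select takes a CSS selector; here, <li> tags that are direct children of <ul>.
drinks_css = [li.get_text(strip=True) for li in soup.select("ul > li")]

# Tag attributes are available through dictionary-style access.
list_style = soup.ul["style"]

print(drinks)
print(drinks_css)
print(list_style)
```

With `html.parser` the fragment is left as it is; other parsers may wrap it in `<html>` and `<body>` tags, which is the "fixing" mentioned in the cell above.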

### A Simple Table
---
The next example pulls the data out of a fairly simple HTML table. We first collect the data into a dictionary of column lists, which is a shape pandas accepts directly, and only build the data frame in the final line of the cell.

Table Source: http://www.w3schools.com/html/html_tables.asp

<span style="background:yellow;">**NOTE**: As you read through this example please un-comment and re-comment various `print()` statements.</span>   
Keep re-running the cell as you do to study the stages of the data and objects.

In [None]:
from bs4 import BeautifulSoup
from pprint import pprint
import pandas as pd

# The simulated HTML page containing the table
html = """
<!DOCTYPE html>
<html>
<head>
<style>
table {
    font-family: arial, sans-serif;
    border-collapse: collapse;
    width: 100%;
}

td, th {
    border: 1px solid #dddddd;
    text-align: left;
    padding: 8px;
}

tr:nth-child(even) {
    background-color: #dddddd;
}
</style>
</head>
<body>

<table>
  <tr>
    <th>Company</th>
    <th>Contact</th>
    <th>Country</th>
  </tr>
  <tr>
    <td>Alfreds Futterkiste</td>
    <td>Maria Anders</td>
    <td>Germany</td>
  </tr>
  <tr>
    <td>Centro comercial Moctezuma</td>
    <td>Francisco Chang</td>
    <td>Mexico</td>
  </tr>
  <tr>
    <td>Ernst Handel</td>
    <td>Roland Mendel</td>
    <td>Austria</td>
  </tr>
  <tr>
    <td>Island Trading</td>
    <td>Helen Bennett</td>
    <td>UK</td>
  </tr>
  <tr>
    <td>Laughing Bacchus Winecellars</td>
    <td>Yoshi Tannamuri</td>
    <td>Canada</td>
  </tr>
  <tr>
    <td>Magazzini Alimentari Riuniti</td>
    <td>Giovanni Rovelli</td>
    <td>Italy</td>
  </tr>
</table>

</body>
</html>"""

soup = BeautifulSoup(html, "html")
# print(soup.table) # Just to show how easy it is to extract the table from the full page. 

table = soup.table
# Pull out the column names
# We know we want the first row, thus we can do the following:

col_names = []
for th_cell in table.tr.find_all('th'):
    # print(th_cell.string)
    col_names.append(th_cell.string)

# print(col_names) # To see the column names we extracted

# This is a dictionary comprehension.
# It creates a dictionary of key:value pairs.
# The keys are the column names we extracted earlier,
# and each value starts out as an empty list.
data = {key: [] for key in col_names}

# print(data) # To see the dictionary we just made.

for row in table.find_all('tr')[1:]:  # Pull all of the rows, and disregard the first (header) one.
    data_cells = row.find_all('td')
    data['Company'].append(data_cells[0].string)
    data['Contact'].append(data_cells[1].string)
    data['Country'].append(data_cells[2].string)

# pprint(data) # Pretty print the dictionary.

pd.DataFrame(data) #Now we have a data frame to use!
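
The row loop above names each column by hand. As a hedged alternative, the sketch below pairs each cell with its column name using `zip`, so the same loop works for any number of columns. It uses a shortened copy of the table and assumes the built-in `html.parser`; the resulting `data` dictionary can be passed to `pd.DataFrame` exactly as in the cell above.

```python
from bs4 import BeautifulSoup

# A shortened copy of the table, for illustration only.
html = """
<table>
  <tr><th>Company</th><th>Contact</th><th>Country</th></tr>
  <tr><td>Alfreds Futterkiste</td><td>Maria Anders</td><td>Germany</td></tr>
  <tr><td>Island Trading</td><td>Helen Bennett</td><td>UK</td></tr>
</table>"""

table = BeautifulSoup(html, "html.parser").table
rows = table.find_all("tr")

# The first row holds the column names.
col_names = [th.get_text(strip=True) for th in rows[0].find_all("th")]

# One empty list per column.
data = {name: [] for name in col_names}

# Pair every cell with its column name instead of hard-coding the keys.
for row in rows[1:]:
    cells = row.find_all("td")
    for name, cell in zip(col_names, cells):
        data[name].append(cell.get_text(strip=True))

print(data)
```

Note that `zip` stops at the shorter of its inputs, so a row with fewer cells than there are column names would leave the columns with unequal lengths; tables like that need extra handling.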

### A Live Web Page with Real Data
---

For the last, and fairly intensive, challenge, you should complete the code below. 

This code snippet downloads a page from a live website. The page contains a table that we ultimately want to use in a data frame. The code needed to download the HTML has been provided for you.  

Your challenge is to read through the provided code and experiment with it, uncommenting and changing lines to explore Beautiful Soup and the data we are using. I have left my thoughts and various "development" statements in the code, so you can see what a "typical" development process is like for this sort of task. The ultimate goal is to get the data from the HTML table into a pandas data frame so that we can easily manipulate it there.  

This is a relatively intensive and challenging task, so give yourself some time and don't get discouraged. 


For those new to inspecting web pages: [How to Inspect a Web Page Source for Scraper Development](https://web.dsa.missouri.edu/static/PDF/AnalyzingHTMLwithTheWebInspector.pdf)


<span style="background:yellow;">**NOTE**: Again, as you read through this example, please un-comment and re-comment various `print()` statements.</span>   
Keep re-running the cell as you do to study the stages of the data and objects.


<span style="background:yellow;">**NOTE**: We created an easier version of the web page to use for parsing. To try the more difficult version, switch to the commented-out `url` line in the code below.</span>


In [None]:
import requests
from bs4 import BeautifulSoup
from pprint import pprint

# --------- BEGIN GET THE HTML CONTENT ------------
# ----The below url is an easier version of an actual webpage ---- #
url = 'https://indigo.sgn.missouri.edu/static/mirror_sites/www.basketball-reference.com/draft/basketballdata.html'
# ----The below url is an actual webpage but the table is more difficult to parse ---- #
#url = 'https://indigo.sgn.missouri.edu/static/mirror_sites/www.basketball-reference.com/draft/NBA_2015.html'
r = requests.get(url)

# check the status_code to make sure we retrieved the HTML successfully.
# print(r.status_code)
# print(r.content) # Uncomment this line to see the raw source.

soup = BeautifulSoup(r.content, "html")
# print(soup) # uncomment this line to see how BS4 cleans up the HTML for us. 
# --------- END GET THE HTML CONTENT ------------

# --------- BEGIN BS4 DATA EXPLORATION ------------

# We know from quick inspection of the webpage in-browser that 
# the table we want has an id of 'stats'

table = soup.find(id='stats')
# Since there is only one table on the page,
# all of the expressions below are equivalent:
# soup.table == soup.find('table') == soup.find(id='stats')

# We use this to explore the table.
# The code retrieves the second row from the table.
header_row = table.find_all('tr')[1]
#print(header_row)

# After reviewing the above row, you'll notice that the "aria-label" attribute is
# a good candidate for DataFrame column names. So, let's pull those into a list.

names = []

# With exploration, we discover that we want the first 20 aria-labels exactly as they are
# (matching limit=20 below), but will need to process the data-tip attribute for the others.

# NOTE: This is not the simplest way to do this.
# I chose to use the limit parameter to show its use,
# since this section is on Beautiful Soup.
# Adding all of the cells and then slicing might be clearer to some.

for each in header_row.find_all('th', limit=20):
    # print(each.attrs['aria-label']) # to see the label to verify we grabbed what we wanted.
    names.append(each.attrs['aria-label'])

# print(names) # Again, just confirming we got what we wanted.

# We find all of the header cells again,
# and then we use a "slice" to grab only the ones we
# have not already processed (these carry a data-tip attribute).

for each in header_row.find_all('th')[20:]:
    # This is a good time to point out that data carpentry is usually
    # not pretty. If not for Beautiful Soup, the *entire* script would be 
    # like this.
    # print(each.attrs['data-tip'])
    names.append(each.attrs['data-tip'].split('><')[0][3:-3]) 

# print(names) # If we want to verify that the names were extracted properly


# --------- BEGIN PULLING DATA OUT OF TABLE INTO DICTIONARY ------------
# --- Finish the code below this line ---
pprint(names)
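
The header-name logic above can be tried offline on a small stand-in table. Everything in the HTML below is hypothetical: the real page has more columns, and the `data-tip` format shown (`<b>Name</b><br>...`) is only a guess that is consistent with the `split('><')[0][3:-3]` slicing used above. Instead of counting to 20, this sketch checks which attribute each cell carries, and it parses the `data-tip` markup with Beautiful Soup rather than with string slicing.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: two header rows, four columns.
# The data-tip values are entity-escaped markup, an assumed format.
html = """
<table id="stats">
  <tr><th colspan="2"></th><th colspan="2">Per Game</th></tr>
  <tr>
    <th aria-label="Rank">Rk</th>
    <th aria-label="Player">Player</th>
    <th data-tip="&lt;b&gt;Minutes Played&lt;/b&gt;&lt;br&gt;Per game">MP</th>
    <th data-tip="&lt;b&gt;Points&lt;/b&gt;&lt;br&gt;Per game">PTS</th>
  </tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

# As in the cell above, the column headers live in the second row.
header_row = soup.find(id="stats").find_all("tr")[1]
cells = header_row.find_all("th")

names = []
for th in cells:
    if th.has_attr("aria-label"):
        # Use the label exactly as it is.
        names.append(th["aria-label"])
    else:
        # The tip is itself a small piece of HTML, so parse it and
        # take the text of its first <b> tag.
        tip = BeautifulSoup(th["data-tip"], "html.parser")
        names.append(tip.b.get_text(strip=True))

# The string-slicing approach from the cell above, for comparison.
sliced = [th["data-tip"].split("><")[0][3:-3]
          for th in cells if th.has_attr("data-tip")]

print(names)
print(sliced)
```

Branching on the attribute avoids a hard-coded position, at the cost of assuming that every header cell carries one of the two attributes.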

## Save Your Notebook