<div style="background-color:dodgerblue;height:100px;"><br>
<center><h1><font style="color:white">The Path to Self-Discovery: Life Is Messy</font></h1></center>
</div>

# Finding Structure in Semi-Structured Data

In [None]:
from IPython.display import HTML

The search result data we have been given comes in HTML format.  HTML does have structure, but it is not as clean or easy to work with as formats such as CSV or JSON.  So, let's start by working with a sample set to understand this structure.

In [None]:
HTML(filename='My Activity - Sample.html')

Rendered as an HTML page, this looks nice and organized, but it looks a lot messier if we just read the raw text.

In [None]:
open('./MyActivity - Copy.html', encoding='iso8859-1').read()

## Exercise 1
### Welcome to your first exercise!  Remember, if you get stuck, use those around you.  And Google is your friend!

Find the HTML tag we can use to separate out our records.  Hint: this is easier to do if you just open up the browser and <i>inspect</i> what you're looking at.

<a href='./MyActivity - Copy.html'>Click here to view page</a>

# Introducing BeautifulSoup

<a href="https://www.crummy.com/software/BeautifulSoup/"><i>From the Beautiful Soup website.</i></a>

Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping. Three features make it powerful:

1. Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. It doesn't take much code to write an application.
2. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't detect one. Then you just have to specify the original encoding.
3. Beautiful Soup sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility.

BeautifulSoup starts with a very simple step: Let's make some soup!

In [None]:
from bs4 import BeautifulSoup
# Open the file in a `with` block so the handle is closed once the soup is built.
with open('./My Activity - Sample.html', 'r', encoding='iso8859-1') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')
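
The third feature quoted above, swappable parsers, can be sketched as follows. This is only an illustration, not part of the lesson's code: `make_soup` and its `preferred` parameter are hypothetical names, and the sketch assumes `lxml` may or may not be installed. `FeatureNotFound` is the exception BeautifulSoup raises when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Hypothetical helper: return (soup, parser_name) for the first available parser."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one in the list.
            continue
    raise RuntimeError("none of the requested parsers is installed")


# Made-up markup, just to show the call.
sample_soup, parser_used = make_soup("<html><head><title>My Activity</title></head></html>")
print(parser_used, sample_soup.title.string)
```

Since `html.parser` ships with Python, keeping it last in the list means the helper should always find something to fall back on.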

BeautifulSoup allows us to grab elements of an HTML document in a variety of ways.

We can grab items by their tag:

In [None]:
soup.title

In [None]:
soup.title.name

In [None]:
soup.title.string

In [None]:
soup.title.parent.name

In [None]:
soup.p

In [None]:
soup.p['class']

In [None]:
soup.a

In [None]:
soup.find('p','mdl-typography--title')

In [None]:
soup.find_all('a')

## Exercise 2
Use BeautifulSoup functions to do the following:
1.  Count the number of divs in the HTML document.
2.  Find the first link element in the document (`<a>`).
3.  Find the actual link in that link tag.

# Pick up the Parsing!
Another neat option BeautifulSoup gives you is a method to spiffy up the appearance of your HTML so that it is easier to read:

<i> Now let's prettify!</i>  See how the prettify command makes your HTML much easier to read.

In [None]:
print(soup.body.prettify())

Now let's recall from our first exercise that it is a `<div>` tag with the class `"outer-cell"` that will help us find each of our records.  Then we will have to parse them.  Let's do this with the first instance of this element.

In [None]:
record1 = soup.body.find("div", "outer-cell")
print(record1.prettify())

Now let's take a prettier look at just this piece:

In [None]:
HTML(data=str(record1))

There are several pieces of information we want to extract from this data: the activity (Visited), the URL (https://productforums...), the date (Jan 31...), and the products used (Search).  Let's see how we can extract each of these.

Our first few pieces of data are contained in a `<div>` element with the class "content-cell".

In [None]:
section1 = record1.find("div", "content-cell")
print(section1.prettify())

We see that each piece of data is located in the text between tags.  Thankfully, BeautifulSoup has an easy way to pull these out:

`.strings`

In [None]:
print(list(section1.strings))
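
BeautifulSoup also offers `.stripped_strings`, which yields the same text nodes with surrounding whitespace trimmed. The sketch below compares the two on made-up markup that only loosely imitates a content cell; `sample_html` and `sample_cell` are hypothetical names, and the URL and date are invented.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modeled on one content cell.
sample_html = (
    '<div class="content-cell">Visited\xa0'
    '<a href="https://example.com/">https://example.com/</a><br>\n'
    '  Jan 31, 2017, 10:00:00 PM\n</div>'
)
sample_cell = BeautifulSoup(sample_html, "html.parser").div

raw_strings = list(sample_cell.strings)             # text nodes exactly as stored
clean_strings = list(sample_cell.stripped_strings)  # same nodes, whitespace trimmed

print(raw_strings)
print(clean_strings)
```

Whether to trim up front or clean each field by hand is a judgment call; doing it by hand, as this lesson does, makes it easier to see what the raw data really contains.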

Thanks to Google (Search.. it's everywhere!), I know that that weird `\xa0` is a non-breaking space, stored as the single byte `0xA0` in the latin-1 encoding, so I can just remove it.

In [None]:
activity_type, target, datestamp = list(section1.strings)
activity_type = activity_type.replace('\xa0', '')
print('Activity Type:', activity_type)
print('Target:', target)
print('Datestamp:', datestamp)
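
If you would rather check what `\xa0` is than take a search result's word for it, the standard library can tell you. A small sketch (the variable name `nbsp` is just for illustration):

```python
import unicodedata

nbsp = '\xa0'

print(unicodedata.name(nbsp))     # official Unicode name of the character
print(nbsp.encode('iso8859-1'))   # the single byte it occupies in latin-1
print(nbsp.isspace())             # whether Python counts it as whitespace

# Because it counts as whitespace, str.strip() removes it at either end of a string.
print(repr('Visited\xa0'.strip()))
```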

The next set of data we want to grab is our product information.  Looking at our HTML content, we can see that it too is in a `"content-cell"`, but one that also has the class `"mdl-typography--caption"`.

In [None]:
record1.find("div", "content-cell", "mdl-typography--caption")

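Shoot!  That gives me my first `content-cell` again.  What we find here is that `find` does not treat the third positional argument as a second class: it is read as the `recursive` flag, so only `"content-cell"` is matched and we get neither an 'or' nor an 'and' of the two classes.

One way to require both classes (the 'and' criterion) is to use a CSS selector.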

In [None]:
record1.select("div.content-cell.mdl-typography--caption")
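
The difference between the two calls is easier to see on a tiny, made-up record. In this sketch `sample_html` and `sample_record` are hypothetical names and the markup is invented; the one fact it relies on is that the third positional parameter of `find` is `recursive`, not a second class.

```python
from bs4 import BeautifulSoup

# Hypothetical record with two content cells; only the second has the caption class.
sample_html = (
    '<div class="outer-cell">'
    '<div class="content-cell mdl-typography--body-1">first cell</div>'
    '<div class="content-cell mdl-typography--caption">second cell</div>'
    '</div>'
)
sample_record = BeautifulSoup(sample_html, "html.parser").div

# The third positional argument lands in `recursive`, so only "content-cell" is matched.
by_find = sample_record.find("div", "content-cell", "mdl-typography--caption")

# The CSS selector requires both classes on the same <div>.
by_select = sample_record.select("div.content-cell.mdl-typography--caption")

print(by_find.get_text())
print([cell.get_text() for cell in by_select])
```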

In [None]:
section2 = record1.select("div.content-cell.mdl-typography--caption")[0]
print(list(section2.strings))

In [None]:
product = list(section2.strings)[1].replace('\u2003','')

Now let's look at it all together!

In [None]:
HTML(data=str(record1))

In [None]:
print('Activity Type:', activity_type)
print('Target:', target)
print('Datestamp:', datestamp)
print('Product:', product)

We did it!  We were able to extract each piece of data from the HTML.  Now let's put this into a function so that we can apply it more easily.

In [None]:
def parse_record(element):
    section1 = element.find("div", "content-cell")
    activity_type, target, datestamp = list(section1.strings)
    activity_type = activity_type.replace('\xa0', '').strip()
    target = target.strip()
    datestamp = datestamp.strip()
    
    section2 = element.select("div.content-cell.mdl-typography--caption")[0]
    product = list(section2.strings)[1].replace('\u2003','').strip()
    
    data = {
        'Activity Type': activity_type,
        'Target': target,
        'Datestamp': datestamp,
        'Product': product
    }
    
    return data

parse_record(record1)

## Exercise 3a
Use BeautifulSoup to find all of the records.  

For each record, use the `parse_record` function to add the resulting dictionary to a list of dictionaries, `my_records`.

If your solution is correct, the final two `print` statements in the cell below should say `True`.

In [None]:
from operator import itemgetter

my_records = []

#DO YOUR WORK HERE
    
print('---Tests---')
print(len(my_records)==3)
print(sorted(my_records, key=itemgetter('Datestamp'))[0]['Datestamp']=='Feb 8, 2017, 12:32:36 AM')

## <u> Challenge Exercise</u>:  Exercise 3b
Looking at our records, you will see that one of the records has location data in it, which we have not extracted in our function.  Examine the record and modify the `parse_record` function so that it will extract the location data as well.  Below is just that HTML for you to examine.

Hint: If you know how to use the `re` library in Python, I'd try employing the power of `re.search` using the pattern: `'\?q=([^\s]+?),([^\s]+)'`
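
To see what the hinted pattern captures, here is a sketch on a made-up link. The URL is hypothetical, and reading the two captured values as a latitude and a longitude is an assumption; check the real record to confirm what it holds.

```python
import re

# Hypothetical link of the kind the hint targets: "?q=<first value>,<second value>".
sample_href = "https://www.google.com/maps?q=47.6062,-122.3321"

match = re.search(r'\?q=([^\s]+?),([^\s]+)', sample_href)
if match:
    # Assumed meaning of the two groups; the pattern itself only splits on the comma.
    latitude, longitude = match.groups()
    print(latitude, longitude)
```

Note that `re.search` returns `None` when the pattern is absent, which will be the case for records without location data.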

In [None]:
location_cell = soup.body.find_all("div", "outer-cell")[-1]
HTML(data=str(location_cell))

In [None]:
print(location_cell.prettify())

In [None]:

# DO YOUR WORK HERE


# Section Conclusion

In this lesson, you learned how to:
* Load an HTML file into BeautifulSoup
* Use BeautifulSoup to search an HTML document
* Use BeautifulSoup to extract data from an HTML document

### It is important to remember that our parsing of this data is based on an assumption that there is a repeated structure to each record.  But, as you saw before, we missed a data element when we only looked at the first record.  Assumptions are necessary to data analysis, but always be ready for your assumptions to be broken.
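
One way to stay ready for broken assumptions is to check the shape of a record before unpacking it. The following is a minimal sketch on made-up markup, not the lesson's own method: `parse_record_checked` is a hypothetical name, and it only covers the first content cell.

```python
from bs4 import BeautifulSoup


def parse_record_checked(element):
    """Hypothetical variant: return a dict, or None when the cell has an unexpected shape."""
    cell = element.find("div", "content-cell")
    strings = list(cell.stripped_strings) if cell is not None else []
    if len(strings) != 3:
        # The "three strings per cell" assumption does not hold for this record.
        return None
    activity_type, target, datestamp = strings
    return {'Activity Type': activity_type, 'Target': target, 'Datestamp': datestamp}


# Made-up records: the second one has no link, so its cell holds only two strings.
sample_html = (
    '<div class="outer-cell"><div class="content-cell">Visited\xa0'
    '<a href="https://example.com/">https://example.com/</a><br>'
    'Jan 31, 2017, 10:00:00 PM</div></div>'
    '<div class="outer-cell"><div class="content-cell">Searched<br>'
    'Feb 1, 2017, 9:00:00 AM</div></div>'
)
sample_records = BeautifulSoup(sample_html, "html.parser").find_all("div", "outer-cell")
results = [parse_record_checked(record) for record in sample_records]
print(results)
```

Returning `None` instead of raising lets you count and inspect the records that did not fit, rather than stopping at the first surprise.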