# Looking Back at the Past

In order to move forward, we need to remember where we have been.

When we left off, we had built a list of records from our sample data.

**As we said in the last lesson, all of this parsing rests on assumptions about the file's layout. The function below has been changed to account for some new conditions.**

In [3]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('./My Activity - Sample.html', 'r', encoding='iso8859-1').read(), 'html.parser')

In [52]:
import re

# Raw string so the regex escapes (\? and \s) are not treated as string escapes
location_pattern = r'\?q=([^\s]+?),([^\s]+)'

def parse_record(element):
    # Parse the search target info: the first cell holds the activity type
    # ("Visited" or "Searched for"), then the target, then the datestamp.
    section1 = element.find("div", "content-cell")
    
    # `string=` is the current name for the older `text=` argument
    if section1.find(string=re.compile('Visited|Searched')):
        section1_strings = list(section1.strings)
        activity_type = None
        target = None
        datestamp = None
        
        if len(section1_strings)==3:
            activity_type, target, datestamp = section1_strings
        elif len(section1_strings)==2:
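            # Split only on the first non-breaking space so a target that
            # itself contains one does not break the unpacking
            activity_type, target = section1_strings[0].split('\xa0', 1)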
            datestamp = section1_strings[-1]
        else:
            print('---Unknown----')
            print(section1_strings)  # show the unexpected layout for debugging
            return None
        activity_type = activity_type.replace('\xa0', '').strip()

        section2 = element.select("div.content-cell.mdl-typography--caption")[0]
        section_strings = list(section2.strings)
        product = section_strings[1].replace('\u2003','').strip()
        lat = None
        lon = None

        if 'Locations:' in section_strings:
            location_url = section2.find("a",{"href":re.compile(r"https://google.com/maps\?q=")}).text
            match = re.search(location_pattern, location_url)
            if match:
                lat, lon = match.groups()

        data = {
            'Activity Type': activity_type,
            'Target': target,
            'Datestamp': datestamp,
            'Product': product,
            'Lat':lat,
            'Lon':lon
        }

        return data
    
    else:
        return None

my_records = []

for record in soup.find_all('div', 'outer-cell'):
    new_record = parse_record(record)
    if new_record:
        my_records.append(new_record)
    
for record in my_records:
    print(record)

{'Target': 'https://productforums.google.com/forum/', 'Lon': None, 'Activity Type': 'Visited', 'Product': 'Search', 'Lat': None, 'Datestamp': 'Jan 31, 2018, 10:54:50 PM'}
{'Target': 'http://www.adobe.com/creativecloud.html', 'Lon': None, 'Activity Type': 'Visited', 'Product': 'Search', 'Lat': None, 'Datestamp': 'Feb 8, 2017, 12:32:39 AM'}
{'Target': 'adobe creative cloud', 'Lon': '-80.186310', 'Activity Type': 'Searched for', 'Product': 'Search', 'Lat': '25.800819', 'Datestamp': 'Feb 8, 2017, 12:32:36 AM'}
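
To make two of the assumptions inside `parse_record` easier to see, here is a minimal sketch that runs them on their own. The `sample_url` and `combined` strings are made up for illustration (modelled on the sample output above), not taken from the real file.

```python
import re

# Same pattern as in the lesson, written as a raw string.
location_pattern = r'\?q=([^\s]+?),([^\s]+)'

# Hypothetical link text, modelled on the coordinates in the sample output.
sample_url = 'https://google.com/maps?q=25.800819,-80.186310'

match = re.search(location_pattern, sample_url)
lat, lon = match.groups() if match else (None, None)
print(lat, lon)

# Hypothetical two-string case: the activity type and the target share one
# string, separated by a non-breaking space (\xa0).
combined = 'Searched for\xa0adobe creative cloud'
activity_type, target = combined.split('\xa0')
print(activity_type, '|', target)
```

If either assumption stops holding (a different URL shape, or a different separator), the parser returns `None` for the coordinates or fails to unpack, which is a good hint about where to look first.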


# Exercise 1

We ran this function over a sample dataset, but now we need to run it over the whole shebang! Using everything you learned in the last lesson (and cheating with the example above), parse the records out of the full data file. You can use your own data here, or the dummy dataset we provided.

In [63]:
full_records = []

# YOUR WORK HERE


In [64]:
new_soup = BeautifulSoup(open('./MyActivity.html', 'r', encoding='iso8859-1').read(), 'html.parser')

for record in new_soup.find_all('div', 'outer-cell'):
    new_record = parse_record(record)
    if new_record:
        full_records.append(new_record)

In [65]:
print(len(full_records))
print(full_records[0:10])

14774
[{'Target': 'https://productforums.google.com/forum/', 'Lon': None, 'Activity Type': 'Visited', 'Product': 'Search', 'Lat': None, 'Datestamp': 'Jan 31, 2018, 10:54:50 PM'}, {'Target': 'download "my activity" google', 'Lon': '-80.186813', 'Activity Type': 'Searched for', 'Product': 'Search', 'Lat': '25.794518', 'Datestamp': 'Jan 31, 2018, 10:54:45 PM'}, {'Target': 'https://support.google.com/accounts/answer?dupm_fake_answer_id=162744&hl=en', 'Lon': None, 'Activity Type': 'Visited', 'Product': 'Search', 'Lat': None, 'Datestamp': 'Jan 31, 2018, 10:53:48 PM'}, {'Target': 'google download my activity', 'Lon': '-80.186813', 'Activity Type': 'Searched for', 'Product': 'Search', 'Lat': '25.794518', 'Datestamp': 'Jan 31, 2018, 10:53:40 PM'}, {'Target': 'https://myactivity.google.com/', 'Lon': None, 'Activity Type': 'Visited', 'Product': 'Search', 'Lat': None, 'Datestamp': 'Jan 31, 2018, 10:53:04 PM'}, {'Target': 'google view my browsing history data', 'Lon': '-80.186813', 'Activity Type':

Now we have a lot more records to deal with, and they certainly aren't easy to read or navigate. What we really want is a tidy, matrix-like structure with rows and columns, so that we can look at a single record or at a single field across all records. A list of dictionaries is not quite that. So what do we do?

In [66]:
full_records.count(None)

0

# Introducing Pandas

In [67]:
import pandas as pd

In [68]:
df = pd.DataFrame(full_records)
df.head()

| | Activity Type | Datestamp | Lat | Lon | Product | Target |
|---|---|---|---|---|---|---|
| 0 | Visited | Jan 31, 2018, 10:54:50 PM | | | Search | https://productforums.google.com/forum/ |
| 1 | Searched for | Jan 31, 2018, 10:54:45 PM | 25.794518 | -80.186813 | Search | download "my activity" google |
| 2 | Visited | Jan 31, 2018, 10:53:48 PM | | | Search | https://support.google.com/accounts/answer?dup... |
| 3 | Searched for | Jan 31, 2018, 10:53:40 PM | 25.794518 | -80.186813 | Search | google download my activity |
| 4 | Visited | Jan 31, 2018, 10:53:04 PM | | | Search | https://myactivity.google.com/ |
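
To see what `pd.DataFrame` does with a list of dictionaries without loading the full file, here is a small sketch using two hand-written records. The `toy_records` list is hypothetical; it only mimics the shape of what `parse_record` returns.

```python
import pandas as pd

# Hypothetical records, shaped like the dictionaries parse_record returns.
toy_records = [
    {'Activity Type': 'Visited', 'Target': 'https://myactivity.google.com/',
     'Datestamp': 'Jan 31, 2018, 10:53:04 PM', 'Product': 'Search',
     'Lat': None, 'Lon': None},
    {'Activity Type': 'Searched for', 'Target': 'google download my activity',
     'Datestamp': 'Jan 31, 2018, 10:53:40 PM', 'Product': 'Search',
     'Lat': '25.794518', 'Lon': '-80.186813'},
]

# Each dictionary becomes a row; each key becomes a column.
toy_df = pd.DataFrame(toy_records)

print(toy_df.shape)             # (number of rows, number of columns)
print(list(toy_df.columns))     # column names come from the dictionary keys
print(toy_df['Activity Type'])  # select one column across all rows
print(toy_df.iloc[1])           # select one row by position
```

The column order you see may depend on your pandas version, so select columns by name rather than by position.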
