# Level 2 - Beautiful Soup

---

# The Mission

Your company `SpiderLegion` has just signed a contract with an Analytics Company called `DashItUp`.
`DashItUp` is well known for its dashboarding capabilities, specializing in monitoring website metrics such as views, content shares, new users, and database errors!

While dashboards are nice, `DashItUp` now wants to spend some time on a new `summarize` feature.
`DashItUp` wants to run web crawlers against their dashboards to fetch the `key metrics` and print them off as a single report.

## Key Metrics

* User Count
* Any _system errors_, how recent?
    * System errors can be one of the following: `Database error`, `CPU overload`, or `Out of memory`
* Bounce Rate
* Top and bottom countries by utility
* Most recent user names with links to their profiles
* Name of the user that owns the dashboard

`DashItUp` has _many_ websites that use the same template (they all look the same). 
They believe that if you can write a web crawler for one, they should be able to apply the same code to the other dashboards they own to get similar results.

---

## Fetch The Website Contents

`DashItUp` was kind enough to give us a website to test against.
The website content can be found in a file called `website.html` in the `assets` folder.
We already have some code that is responsible for opening that file, reading it, and saving the contents to a variable called `website_contents`.

(The source HTML is based on the Analytics Template from https://www.w3schools.com/w3css/w3css_templates.asp)

In [None]:
with open("../assets/website.html") as website_file:
    website_contents = website_file.read()
    
website_contents

What a jumbled mess!
It is nearly impossible to understand what is going on here without some hardcore `HTML` knowledge.
Unless we visualize it!

In [None]:
# In Jupyter, you can visualize raw HTML using these two functions!
# It is essentially "embedding" the website content within the notebook.
# Note: import from IPython.display; the IPython.core.display path is deprecated.
from IPython.display import display, HTML

display(HTML(website_contents))

---

## Get to Work!

### Import the tools needed

In [None]:
from bs4 import BeautifulSoup
import pandas as pd

---

## Create the Soup!

In [None]:
# code here

soup = BeautifulSoup(website_contents, "html.parser")
soup.title

---

## User Count

In [None]:
# code here

# The answer we are looking for lives within the 
# "da-dashboardCards" section of the website, so target that first
dashboard_cards_soup = soup.find("div", attrs={"class": "da-dashboardCards"})

dashboard_cards_soup

In [None]:
# grab all of the direct children
# (find_all is the current spelling; findAll is a legacy alias)
dashboard_cards = dashboard_cards_soup.find_all(recursive=False)

dashboard_cards

In [None]:
# The Users card is the LAST child. Since the result is a list, we can index it with -1!
user_card = dashboard_cards[-1]

user_card

In [None]:
user_card.find("h3", attrs={"class": "da-dashboardCardMetric"}).text
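
The value above is still text. If the report needs a number, the same steps can be sketched end to end as follows. The markup here is a hypothetical, simplified stand-in for the real cards (the card values and the thousands separator are assumptions), so treat it as an illustration rather than the template's actual structure.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the dashboard cards markup.
sample_html = """
<div class="da-dashboardCards">
  <div><h3 class="da-dashboardCardMetric">52</h3><h4>Messages</h4></div>
  <div><h3 class="da-dashboardCardMetric">99</h3><h4>Views</h4></div>
  <div><h3 class="da-dashboardCardMetric">1,250</h3><h4>Users</h4></div>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Direct children only, so nested tags inside each card are not returned.
cards = sample_soup.find("div", class_="da-dashboardCards").find_all(recursive=False)

# Take the metric text from the last card and strip surrounding whitespace.
raw_count = cards[-1].find("h3", class_="da-dashboardCardMetric").get_text(strip=True)

# Remove a possible thousands separator (an assumption) before converting.
user_count = int(raw_count.replace(",", ""))

print(user_count)
```

Relying on "the last card" is fragile if the card order ever changes, so matching on the card's label could be a safer choice on other dashboards.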

---

## Any _system errors_, how recent?
System errors can be one of the following: 

* `Database error`
* `CPU overload`
* `Out of memory`

In [None]:
# code here

# The content we want lives within the "da-feeds" section of the website,
# so we can target that first!
feeds = soup.find("div", attrs={"class": "da-feeds"})

feeds

In [None]:
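# Since the result lives in a table, we can use pandas to extract it
# and put it in a dataframe for us!
# Newer pandas versions deprecate passing a raw HTML string to read_html,
# so we wrap the markup in a StringIO object first.
from io import StringIO

import pandas as pd

dataframes = pd.read_html(StringIO(str(feeds)))
dataframes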

In [None]:
feed_dataframe = dataframes[0]
feed_dataframe

In [None]:
# It may be easier if we clean up this dataframe a little first
feed_dataframe.columns = ["icon", "message", "minutes"]
feed_dataframe = feed_dataframe.drop(columns=["icon"])

feed_dataframe

In [None]:
# now, all we need to do is filter it down!
error_messages = [
    "Database error.",
    "CPU overload.",
    "Out of memory."
]

error_dataframe = feed_dataframe[feed_dataframe["message"].isin(error_messages)]
error_dataframe
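
The filter above shows which errors occurred, but not which one is the most recent. A minimal sketch of that extra step follows, assuming the `minutes` column holds text such as `15 mins`. The sample rows and the `minutes_ago` column name are made up for illustration.

```python
import pandas as pd

# Hypothetical rows shaped like the cleaned feed table; the "N mins" format is an assumption.
sample_feed = pd.DataFrame({
    "message": ["New record, over 90 views.", "Database error.", "CPU overload."],
    "minutes": ["10 mins", "15 mins", "35 mins"],
})

error_messages = ["Database error.", "CPU overload.", "Out of memory."]

# .copy() avoids modifying a view of the original dataframe.
sample_errors = sample_feed[sample_feed["message"].isin(error_messages)].copy()

# Pull the leading number out of the text so rows can be compared numerically.
sample_errors["minutes_ago"] = (
    sample_errors["minutes"].str.extract(r"(\d+)", expand=False).astype(int)
)

# The smallest "minutes ago" value belongs to the most recent error.
most_recent_error = sample_errors.sort_values("minutes_ago").iloc[0]

print(most_recent_error["message"], most_recent_error["minutes_ago"])
```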

---

## Bounce Rate

In [None]:
# code here

# This one is a bit simpler since the element has an "id" attribute
# We can use that ("da-bounceRateStat") to target the element directly
soup.find(id="da-bounceRateStat").text

In [None]:
# If you want to get rid of the newlines and the percent sign..
soup.find(id="da-bounceRateStat").text \
    .replace("\n", "") \
    .replace("%", "")
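
If the bounce rate is needed as a number rather than a string, the cleanup can be sketched as below. The markup and the `75%` value are hypothetical; only the `da-bounceRateStat` id comes from the page.

```python
from bs4 import BeautifulSoup

# Hypothetical element; the surrounding newlines mimic what the raw text may look like.
sample_soup = BeautifulSoup('<h4 id="da-bounceRateStat">\n75%\n</h4>', "html.parser")

# get_text(strip=True) trims the surrounding whitespace, including newlines.
raw_bounce_rate = sample_soup.find(id="da-bounceRateStat").get_text(strip=True)

# Drop the trailing percent sign and convert to a float.
bounce_rate = float(raw_bounce_rate.rstrip("%"))

print(bounce_rate)
```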

---

## Top and bottom countries by utility

In [None]:
# code here

# This one will be similar to the feeds! Pandas to the rescue.
# First, let's target the HTML that includes our table
country_utility_soup = soup.find(attrs={"class": "da-countryUtility"})
country_utility_soup

In [None]:
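# Wrap the markup in StringIO; newer pandas versions deprecate raw HTML strings here.
from io import StringIO

country_utility_tables = pd.read_html(StringIO(str(country_utility_soup)))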
country_utility_table = country_utility_tables[0]
country_utility_table

In [None]:
# Now, we want to grab the top and bottom country by utility
# Let's see what the datatypes are!
country_utility_table.dtypes

In [None]:
# Ok, they are "object"/"text" types..
# We want to handle that before sorting.
country_utility_table["Utility"] = country_utility_table["Utility"] \
    .str.replace("%", "") \
    .astype("float64")

country_utility_table.dtypes

In [None]:
# Now, we can sort and reset the index
country_utility_table = country_utility_table \
    .sort_values(by="Utility", ascending=False) \
    .reset_index(drop=True)

country_utility_table

In [None]:
# Grab the top..
country_utility_table.head(1)

In [None]:
# Grab the bottom
country_utility_table.tail(1)
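
Sorting works well when the whole ranking is of interest. When only the two extremes are needed, `idxmax` and `idxmin` are one possible alternative. The sketch below uses made-up values, and the `Country` column name is an assumption; only `Utility` is taken from the table above.

```python
import pandas as pd

# Hypothetical rows; the values are made up and "Country" is an assumed column name.
sample_utility = pd.DataFrame({
    "Country": ["United States", "Spain", "France"],
    "Utility": ["65.3%", "15.7%", "5.6%"],
})

# Same cleanup idea as above: drop the percent sign, then convert to float.
sample_utility["Utility"] = (
    sample_utility["Utility"].str.rstrip("%").astype("float64")
)

# idxmax / idxmin return the index labels of the largest and smallest values.
top_country = sample_utility.loc[sample_utility["Utility"].idxmax()]
bottom_country = sample_utility.loc[sample_utility["Utility"].idxmin()]

print(top_country["Country"], bottom_country["Country"])
```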

---

## Most recent user names with links to their profiles

In [None]:
# code here

# This one is a little trickier since it is not structured as a table, 
# and we are also trying to grab 2 things!
# This sounds like a job for the trusty for-loop

# Start by narrowing down the HTML to the area of interest, "da-recentUsers"
recent_users = soup.find(attrs={"class": "da-recentUsers"})

print(recent_users.prettify())

In [None]:
# Luckily, we can see a pattern here! 
# Let's assess the structure
"""

<div class="... da-recentUsers">
  ...
  <ul ...>  --------------------------------- start of the loop
    <li>  ----------------------------------- element within loop
      <a href="PROFILE URL HERE!!"> --------- element url!!
        ...
        <span ...>PROFILE NAME HERE!</span> - element name!!
      </a>
    </li>
    <li>...</li> ---------------------------- next element
    <li>...</li> ---------------------------- next element
  </ul>
</div>

"""

# SO!! The <ul> element makes up the "base" of our structure.
# Every child element within (<li>) represents a different recent user.
# Therefore, if we loop over the elements of the <ul> element,  
# we should be able to extract the URL and name per user.
print()

In [None]:
# Let's start by just trying to loop over the <li> elements, 
# then we can build off of that.
for list_element in recent_users.find("ul").find_all("li"):
    print("\n", list_element.prettify())

In [None]:
# Ok, now let's try to grab the href from the <a> tag
for list_element in recent_users.find("ul").find_all("li"):
    a_tag = list_element.find("a")
    print(a_tag["href"])

In [None]:
# Let's try to grab the user name from the <span> tag
for list_element in recent_users.find("ul").find_all("li"):
    span_tag = list_element.find("span")
    print(span_tag.text)

In [None]:
# Now, put them together:
# grab both the user name and the profile URL in the same loop
for list_element in recent_users.find("ul").find_all("li"):
    a_tag = list_element.find("a")
    span_tag = list_element.find("span")
    
    print(span_tag.text, a_tag["href"])

In [None]:
# And finally, add them to a list to be used later!
recent_user_info = []

for list_element in recent_users.find("ul").find_all("li"):
    a_tag = list_element.find("a")
    span_tag = list_element.find("span")
    
    recent_user_info.append([span_tag.text, a_tag["href"]])
    
recent_user_info  
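
A list of lists works, but named fields can be easier to use later in a report. The sketch below is one possible variation: it stores each user as a dictionary and resolves relative links against a base URL. The markup, the user names, and `base_url` are all hypothetical.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup and base URL, used only for illustration.
sample_html = """
<div class="da-recentUsers">
  <ul>
    <li><a href="/users/mike"><span>Mike</span></a></li>
    <li><a href="/users/jill"><span>Jill</span></a></li>
  </ul>
</div>
"""
base_url = "https://dashboard.example.com"

sample_soup = BeautifulSoup(sample_html, "html.parser")
user_list = sample_soup.find(class_="da-recentUsers").find("ul")

# One dictionary per <li>: the name from the <span>, the link from the <a>.
recent = [
    {
        "name": item.find("span").get_text(strip=True),
        # urljoin turns a relative href into an absolute URL.
        "profile_url": urljoin(base_url, item.find("a")["href"]),
    }
    for item in user_list.find_all("li")
]

for user in recent:
    print(user["name"], user["profile_url"])
```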

---

## Name of the user that owns the dashboard

In [None]:
# code here

# This is a trick question! Try opening the "website.html" in your browser
# and see the "responsiveness" of the website.
# When the website gets below a certain width, the "menu" gets hidden!
# We can't see it in the notebook..
# But that doesn't mean it is not there.

# The info that we need lives within the "da-welcomeMenu" element
welcome_menu = soup.find(attrs={"class": "da-welcomeMenu"})
welcome_menu

In [None]:
welcome_menu.find("strong").text
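
Since the same code is meant to run against many dashboards, it may help to guard against elements that are missing: `find` returns `None` when nothing matches, and calling `.text` on `None` raises an error. The helper below, `text_or_none`, is a hypothetical name and only a sketch of one way to handle that. The sample markup is made up.

```python
from bs4 import BeautifulSoup


def text_or_none(soup, css_class, tag_name):
    """Return the stripped text of the first `tag_name` inside the element
    with class `css_class`, or None if either element is missing."""
    container = soup.find(class_=css_class)
    if container is None:
        return None
    target = container.find(tag_name)
    return target.get_text(strip=True) if target is not None else None


# Hypothetical pages: one with the welcome menu, one without it.
with_menu = BeautifulSoup(
    '<div class="da-welcomeMenu">Welcome, <strong>Mike</strong></div>',
    "html.parser",
)
without_menu = BeautifulSoup("<div></div>", "html.parser")

owner_name = text_or_none(with_menu, "da-welcomeMenu", "strong")
missing_owner = text_or_none(without_menu, "da-welcomeMenu", "strong")

print(owner_name, missing_owner)
```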