# Lab and Homework 5
## The Web as a Platform I: HTML and Web Scraping

**In class**: *Thursday, March 5, 2020*  
**Homework Due** : *5 PM, Thursday, March 12, 2020*

# Learning Goals

To date we've covered the backend design of a distributed architecture: databases, RPC, REST APIs, etc. This week and next we switch to the frontend. Specifically, we will learn about a suite of technologies, such as HTML, the DOM and JavaScript, that make the web browser a ubiquitous platform.

The focus this week is on HTML and tools for parsing and displaying web content. We'll discuss browser architecture and the DOM in class but the emphasis in this lab is on scraping web content to extract and process data.

The goals are:

* Learn about HTML and its syntactic structure.
* Learn about the utility, tools and techniques of data scraping.
* Learn how HTML files are parsed into a form that is amenable to easy analysis and extraction.
* Brush up on pandas DataFrames, particularly indexing and aggregation.

This notebook consists of two parts. 

The first part introduces basic concepts. We'll explore these concepts during class. The second part is for homework, which builds upon and extends the things you learned in class.

# Web Scraping

Acquiring data is the first step to doing anything useful in Data Science.

Unfortunately, the required data often isn't readily available in an easy-to-read zipfile or database, ready to be exploited. You generally have to find it, get it and shape it to your needs.

Fortunately, the web is a rich source of information and we will use web scraping to get the data we need.

# Problem Statement

Let's say you're doing research on corporate governance in the Fortune 100 and want to determine compensation rates of executives and their potential conflicts. 

How might you go about getting the data for the companies in the list? If you have a subscription to Bloomberg, you could perhaps download a CSV. But if you don't, your best bet is to scrape public pages on the web.

In this lab and homework we'll learn how to scrape Yahoo Finance pages to create our own data set ready for analysis. We'll do the scraping in class and the analysis for homework.

# Scraping Summarized

In principle, web scraping is simple and involves the following steps:

1. **Inspect** the web page. This will give you a sense of the overall structure of the page and where the relevant information resides. To do this, you will open the web page in a browser and then view the page *source*. We explain how below.

1. **Retrieve** a web page over HTTP as text. This text will be formatted as HTML, the language of the World Wide Web. Your browser interprets the content of an HTML file and renders it to screen. (This is a gross over-simplification. Many sites today are *dynamic* apps written in Javascript. They retrieve data over an API and render it programmatically on screen. Still, a lot of information resides in *static* HTML and the techniques described here remain quite useful.)

1. **Parse** the retrieved HTML text into a form that can be easily scanned and operated upon. Fortunately, we don't have to do this ourselves. Great and powerful Python libraries such as [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) and [Scrapy](https://scrapy.org) already exist. In this unit, we'll use Beautiful Soup. But I encourage you to take a look at Scrapy. It's in many ways more powerful and flexible.

1. **Search** the parsed HTML for the information we're interested in.

1. **Pull** the relevant data out of the HTML, reformat it into a form that meets our needs, and save it for later analysis. In this lab, we'll insert our data into a [Pandas](https://pandas.pydata.org) DataFrame. In future units, we'll likely save to a database.
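
The steps above can be sketched end to end on a tiny example. This is only an illustration under stated assumptions: the HTML string, the `companies` id, the `sym`/`name` classes and the company names are all made up for the sketch, and the retrieve step is simulated with an inline string so the code runs without network access.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Step 2 (Retrieve) is simulated here: in practice this text would come
# from requests.get(url).text. The markup below is a made-up stand-in page.
html_text = """
<html><body>
  <h1>Example Companies</h1>
  <ul id="companies">
    <li><span class="sym">AAA</span> <span class="name">Alpha Corp</span></li>
    <li><span class="sym">BBB</span> <span class="name">Beta Inc</span></li>
  </ul>
</body></html>
"""

# Step 3: parse the text into a tree
toy_soup = BeautifulSoup(html_text, 'html.parser')

# Step 4: search the tree for the elements of interest
items = toy_soup.find('ul', id='companies').find_all('li')

# Step 5: pull out the data and load it into a DataFrame
records = [
    (li.find('span', class_='sym').text.strip(),
     li.find('span', class_='name').text.strip())
    for li in items
]
toy_df = pd.DataFrame(records, columns=['Symbol', 'Name'])
print(toy_df)
```

Step 1 (inspecting the page) is what tells you which tags, ids and classes to search for in step 4; on a real page those names come from the page source, not from this sketch.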



# Part I: In Class Lab

For the in-class part of this unit, we are going to do the following:

1. Retrieve the stock symbols for the companies in the S&P 100. To do this, we'll scrape a Wikipedia page.
2. With the S&P symbols in hand, we'll retrieve the key executives of each company by scraping its profile page on Yahoo Finance. You will use the skills developed in scraping the Wikipedia page to do this.
3. You will next use your Pandas chops to compute the average age of each company's executives. Easy peasy!

## Imports

These are the packages we'll use in this lab. Please use the [Installation]() notebook to set up all the modules we'll be using this semester. We'll add new packages in the Installation notebook as they are needed.

In [None]:
#Packages to install

# pretty printer
import pprint

# set up the pretty printer
pp = pprint.PrettyPrinter(indent=4)

# BeautifulSoup for scraping
from bs4 import BeautifulSoup

# for making HTTP requests
import requests

# Pandas/numpy for data manipulation
import pandas as pd
import numpy as np

## Retrieve the S&P 100 Stock Symbols

In order to investigate the board of directors at each company in the S&P 100, we're going to need their stock symbols. Here we'll scrape the [Wikipedia page](https://en.wikipedia.org/wiki/S%26P_100) for the S&P 100.

You should load the [Wiki](https://en.wikipedia.org/wiki/S%26P_100) page in a browser and study its HTML source. (One easy way to do this is to use the developer tools built into your browser. See [here](https://developers.google.com/web/ilt/pwa/tools-for-pwa-developers) for instructions on opening the developer console on your particular browser.)

The listing of S&P companies is in a `<table>` element. We'll use BeautifulSoup to parse the HTML into a parse tree, a hierarchical representation of the page, which makes it much easier to scan for the elements we want. BeautifulSoup will scan the parse tree, find the table in question, and scan each row for the company and symbol. We'll put the extracted data in a Pandas DataFrame for later analysis and manipulation.

In [None]:
# The URL for the Wikipedia page we're scraping
WIKI_URL = 'https://en.wikipedia.org/wiki/S%26P_100'

# Retrieve the page
wiki_page = requests.get(WIKI_URL).text
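
A request can fail or return an error page, and `.text` will happily hand back whatever came over the wire. A slightly more defensive retrieval might look like the sketch below. The helper name `fetch_html`, the timeout value and the `User-Agent` string are assumptions made for illustration, not part of the lab code.

```python
import requests

def fetch_html(url, timeout=10):
    """Return the body of `url` as text, raising an error on HTTP failure."""
    # The User-Agent string is a placeholder; substitute your own.
    headers = {'User-Agent': 'lab5-scraper/0.1 (student exercise)'}
    response = requests.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of silently
    # treating an error page as the content we wanted.
    response.raise_for_status()
    return response.text

# Usage (requires network access):
# wiki_page = fetch_html('https://en.wikipedia.org/wiki/S%26P_100')
```

Failing loudly at the retrieval step makes later parsing errors much easier to diagnose.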



We now have the wiki page in the variable `wiki_page`. We could print it directly, but the output would be unstructured and therefore quite hard to read.

Let's use Beautiful Soup to read the text into a parse tree and then render the parse tree to screen like this:

In [None]:
# parse the HTML text into a tree
soup = BeautifulSoup(wiki_page, 'html.parser')

# print the tree to screen
print(soup.prettify())

Whoa!!! There's a lot going on here. 

But as you poke around, you'll find that the ticker symbols are in an HTML `<table>`:

```html
<table class="wikitable sortable">
       <tbody>
        <tr>
         <th>
          Symbol
         </th>
         <th>
          Name
         </th>
        </tr>
        <tr>
         <td>
          AAPL
         </td>
         <td>
          <a href="/wiki/Apple_Inc." title="Apple Inc.">
           Apple Inc.
          </a>
         </td>
        </tr>
```

The body of the table has two columns. The symbol is in the left column and the company name on the right. 

Let's use BeautifulSoup to extract the table from the parse tree. 

We're looking for the table with CSS classes `wikitable` and `sortable`.

In [None]:
# extract the table containing the S&P companies
sandp_table = soup.find('table', {"class" : "wikitable sortable"})
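
One caveat about this lookup: passing the string `"wikitable sortable"` matches the `class` attribute as an exact string, so it depends on the order in which the page lists its classes. A CSS selector is an alternative that matches any table carrying both classes. The sketch below demonstrates the difference on a made-up snippet (not the real Wikipedia markup) whose classes appear in the opposite order.

```python
from bs4 import BeautifulSoup

# made-up markup: same two classes, listed in the opposite order
snippet = '<table class="sortable wikitable"><tr><td>AAPL</td></tr></table>'
demo_soup = BeautifulSoup(snippet, 'html.parser')

# exact-string match on the class attribute: order matters, so nothing is found
by_string = demo_soup.find('table', {'class': 'wikitable sortable'})

# CSS selector: matches a table that has both classes, in any order
by_selector = demo_soup.select_one('table.wikitable.sortable')

print(by_string)                 # None
print(by_selector is not None)   # True
```

Either approach works on a page whose class order matches; the selector form is simply less sensitive to small changes in the markup.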



With our table of S&P companies in hand, we can traverse it row by row to retrieve each company and its symbol. Here's how:

In [None]:
# snps will hold a list of tuples of the form (Symbol, Name)
snps = []

# scan the table for each row ('tr' is the HTML tag for a table row)
for row in sandp_table.find_all('tr'):
    
    # scan the row for table cells ('td' is the tag for table data)
    cols = row.find_all('td')
    
    if len(cols) == 2: # skip the header row
        snps.append((cols[0].text.strip(), cols[1].text.strip()))

# convert the array of tuples into a Pandas DataFrame        
snps_df = pd.DataFrame(snps, columns=['Symbol', 'Name'])

snps_df


## Lab Problem 1

We now have the S&P 100 stock symbols. But we're far from done. What we want is information about the executives of each company, which we'll gather in this lab.

To do this, iterate over the `snps_df` DataFrame created above and scrape the Yahoo Finance profile page of each company for its key executives.

As one example, here's the Yahoo Finance page for [Apple](https://finance.yahoo.com/quote/AAPL/profile?p=AAPL) (AAPL). Open the page in a browser and study the HTML source to find where the Key Executive information is located. Then flesh out the skeleton code below.

Organize each row of the table into the following columns:

1. **Symbol**: the ticker symbol of the company.
1. **Name**: the name of the executive.
1. **Title**: Title and role
1. **Pay**: Compensation, usually in Millions
1. **Age**: Executive age encoded as an integer (we want to run aggregate functions on the age). Yahoo gives the date of birth; you will have to convert it to an age.

Create a DataFrame called `df` with the above column names.
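
For the age conversion, one possible helper is sketched below. It assumes the scraped cell holds a four-digit birth year as text and that missing values show up as something non-numeric such as `'N/A'`; both are assumptions to check against the actual page source. The function name `year_born_to_age` and the `as_of_year` parameter are hypothetical.

```python
import datetime

def year_born_to_age(year_text, as_of_year=None):
    """Convert a birth-year string such as '1960' to an integer age.

    Returns None when the text is not a usable year (e.g. 'N/A' or empty).
    """
    if as_of_year is None:
        as_of_year = datetime.date.today().year
    text = year_text.strip()
    if not text.isdigit():
        return None
    return as_of_year - int(text)

print(year_born_to_age('1960', as_of_year=2020))  # 60
print(year_born_to_age('N/A', as_of_year=2020))   # None
```

If some ages come back as `None`, the resulting column will not be purely integer. Dropping those rows, or using pandas' nullable `Int64` dtype, are two ways to handle that.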

In [None]:

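# URL template for a company's Yahoo Finance profile page;
# fill in the ticker symbol with BASE_URL.format(sym)
BASE_URL = 'https://finance.yahoo.com/quote/{0}/profile?p={0}'

symbol_array = snps_df['Symbol'].values

# execs will hold a list of tuples, one for each executive
execs = []

# for simplicity only look at the first five companies in class
for sym in symbol_array[:5]:
# for (index, co) in snps_df.iterrows():
#    sym = co['Symbol']
    url = BASE_URL.format(sym)
    # your code here: retrieve and parse the page at url, then append
    # one (Symbol, Name, Title, Pay, Age) tuple per executive to execs

# df = pd.DataFrame(execs, ...)

# display df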



## Lab Problem 2

Notice that the `Symbol` column in the `df` DataFrame above has multiple rows for each company, one row for each executive.

We want to retrieve and aggregate over individual companies. One way to do this is to create a Pandas *multiindex* on `Symbol` and `Name`. Create the index in place (`inplace=True`), i.e., modify the `df` DataFrame itself instead of returning a new one.
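
As a minimal sketch of what an in-place multiindex looks like, here is the idea on a toy DataFrame. The symbols, names and ages are placeholders invented for the example, not scraped data.

```python
import pandas as pd

# toy data with placeholder values; not real executives
toy = pd.DataFrame({
    'Symbol': ['AAA', 'AAA', 'BBB'],
    'Name':   ['Exec One', 'Exec Two', 'Exec Three'],
    'Age':    [50, 60, 40],
})

# inplace=True modifies `toy` itself, so the call returns None
result = toy.set_index(['Symbol', 'Name'], inplace=True)

print(result)                  # None
print(list(toy.index.names))   # ['Symbol', 'Name']
print(toy)
```

Note that assigning the result of an in-place call back to the variable (`toy = toy.set_index(..., inplace=True)`) would replace the DataFrame with `None`, a common slip.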

In [None]:
# Lab Problem 2
# Create a multiindex as described above

df

## Lab Problem 3

Use the indexed `df` to retrieve the entries for `GOOG`.

Hint: use `df.loc`
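
To illustrate how `.loc` behaves on a two-level index, here is a small sketch on toy data (placeholder symbols and names, not the scraped `df`).

```python
import pandas as pd

toy = pd.DataFrame(
    {'Age': [50, 60, 40]},
    index=pd.MultiIndex.from_tuples(
        [('AAA', 'Exec One'), ('AAA', 'Exec Two'), ('BBB', 'Exec Three')],
        names=['Symbol', 'Name'],
    ),
)

# a single label selects on the first index level and drops that level
aaa_rows = toy.loc['AAA']

# a tuple selects one specific (Symbol, Name) row; the second argument picks a column
one_age = toy.loc[('BBB', 'Exec Three'), 'Age']

print(aaa_rows)
print(one_age)   # 40
```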

In [None]:
# Lab problem 3

# your code here

## Lab Problem 4

Compute the mean age of each executive team in the S&P 100. Which company has the oldest board? Which the youngest?
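
One way to aggregate over an index level is `groupby(level=...)`, followed by `idxmax` and `idxmin` to find the extremes. The sketch below uses toy placeholder data to show the mechanics; it is one possible approach, not the only one.

```python
import pandas as pd

toy = pd.DataFrame(
    {'Age': [50, 60, 40]},
    index=pd.MultiIndex.from_tuples(
        [('AAA', 'Exec One'), ('AAA', 'Exec Two'), ('BBB', 'Exec Three')],
        names=['Symbol', 'Name'],
    ),
)

# group rows by the 'Symbol' index level and average the ages in each group
mean_age = toy.groupby(level='Symbol')['Age'].mean()

# idxmax/idxmin return the index label (here, the symbol) of the extreme value
oldest = mean_age.idxmax()
youngest = mean_age.idxmin()

print(mean_age)
print(oldest, youngest)   # AAA BBB
```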

In [None]:
# Lab problem 4
# compute and print the mean age of each board
# your code here

In [None]:
# Find the boards with the maximum and minimum average ages

# Homework

*40 points total*

In the homework you will build upon the lab work to retrieve data about each executive and then perform some basic aggregations.

## Homework Problem 1

*20 points*

Retrieve the total compensation of each executive and put the results in a DataFrame of the following columns:

1. **Symbol**: The company stock symbol.
1. **Name**: Executive name.
1. **Total**: Total yearly compensation for the member.

To do this problem you'll use the `link` attribute in the `df` DataFrame from the lab.

You should decide how to index the DataFrame to best utilize it for subsequent problems.

In [None]:
## Homework problem 1
compensation_table = None

# Your code here

compensation_table

## Homework Problem 2

*20 points*

Compute the mean compensation for each company and put the results in a DataFrame with the following columns:

1. **Symbol**: The company stock symbol
1. **Compensation**: Mean executive compensation
1. **Age**: Mean executive age

Notice that you're asked to include the mean age. This suggests that you will `join` two tables together. How will you index these two tables in order to compute the result elegantly and simply?
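
As a hint at the mechanics, `DataFrame.join` aligns rows on index labels rather than on row position, so two tables indexed by the same key combine cleanly. The sketch below uses placeholder numbers and hypothetical variable names (`mean_comp`, `mean_age`, `combined`) purely for illustration.

```python
import pandas as pd

# placeholder per-company means, both indexed by Symbol
mean_comp = pd.DataFrame({'Compensation': [2.5e6, 1.0e6]},
                         index=pd.Index(['AAA', 'BBB'], name='Symbol'))

# note the rows are deliberately listed in a different order here
mean_age = pd.DataFrame({'Age': [55.0, 40.0]},
                        index=pd.Index(['BBB', 'AAA'], name='Symbol'))

# join matches rows by index label, so the differing row order does not matter
combined = mean_comp.join(mean_age)
print(combined)
```

Because the match is by label, `AAA` picks up the age stored under `AAA` in the second table even though it is the second row there.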

In [None]:
# Homework problem 2
company_compensation_table = None

# Your code here

company_compensation_table