# Session 6:  Web Scraping

## GET Requests

In [None]:
import pandas as pd
import numpy as np
import unicodedata

import matplotlib.pyplot as plt
import matplotlib.ticker
%matplotlib inline

import requests
from bs4 import BeautifulSoup

from IPython.display import display, HTML

As you have likely noticed when browsing the web, searching on a website often sends you to a page with your search term in the URL.  For example, if we search Google for 'python', we are sent to `https://www.google.com/search?q=python`.  The URL has two parts: the address of the page we are requesting, `https://www.google.com/search`, followed by the data we sent to Google, `?q=python`.  After the `?` comes any number of variables passed as `VARIABLE=VALUE` pairs, separated by `&`.  Sending data in the URL like this is how a GET request passes information.
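To make the structure of such a URL concrete, the short sketch below splits the Google example into its page address and its query string using Python's standard `urllib.parse` module.  The variable names are illustrative only.

```python
from urllib.parse import urlsplit, parse_qs

example_url = 'https://www.google.com/search?q=python'

parts = urlsplit(example_url)

# Everything before the '?' identifies the page being requested
page = parts.scheme + '://' + parts.netloc + parts.path

# Everything after the '?' is the query string of VARIABLE=VALUE pairs
query = parse_qs(parts.query)

print(page)   # https://www.google.com/search
print(query)  # {'q': ['python']}
```

Each value comes back as a list because the same variable may appear more than once in a query string.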

### Example

In the following example, our goal is to scrape EDGAR 13F filings.  Given a company CIK, we wish to retrieve all of its available disclosures.  To begin, we will pull just the most recent disclosure for a single company.

We begin by stitching together the URL as a string, manually inserting the values into the GET request ourselves.

In [None]:
symbol = '0000102909'  # the company's CIK

base_url = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=' + symbol +'&type=13F-HR&output=xml'

base_url
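String concatenation works, but it is easy to mistype a `&` or `=`.  As a hedged alternative, the sketch below builds the same URL from a dictionary with the standard library's `urllib.parse.urlencode`.  The names `params` and `built_url` are illustrative and not part of the original example.

```python
from urllib.parse import urlencode

symbol = '0000102909'

# Each key/value pair becomes one VARIABLE=VALUE entry in the query string
params = {
    'action': 'getcompany',
    'CIK': symbol,
    'type': '13F-HR',
    'output': 'xml',
}

built_url = 'https://www.sec.gov/cgi-bin/browse-edgar?' + urlencode(params)

print(built_url)
```

`requests.get()` also accepts such a dictionary directly through its `params` argument, in which case it appends the query string for us.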

We then use the `requests` package to fetch the page for us.

In [None]:
r = requests.get(base_url)

We can check if the request was successful by looking at the status code.  `200` indicates success, whereas `404` would indicate that the page was not found.

In [None]:
r.status_code
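Status codes are grouped by their first digit: 2xx means success, 4xx means the request was at fault (for example `404`), and 5xx means the server failed.  The sketch below uses the standard library's `http.HTTPStatus` and a hypothetical helper named `describe_status` to turn a numeric code into a readable label.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short label for an HTTP status code (illustrative helper)."""
    phrase = HTTPStatus(code).phrase   # raises ValueError for unknown codes
    if 200 <= code < 300:
        kind = 'success'
    elif 400 <= code < 500:
        kind = 'client error'
    elif 500 <= code < 600:
        kind = 'server error'
    else:
        kind = 'other'
    return f'{code} {phrase} ({kind})'

print(describe_status(200))  # 200 OK (success)
print(describe_status(404))  # 404 Not Found (client error)
```

With `requests`, calling `r.raise_for_status()` raises an exception for 4xx and 5xx responses, which is a convenient way to stop early when a fetch fails.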

The contents of the file are returned as one long, messy string.

In [None]:
print(r.text)

`BeautifulSoup` is designed to take this mess and turn it into a format that is easy to navigate.  Here we downloaded an xml file, but the package works with html files as well.  (JSON responses are not parsed by `BeautifulSoup`; use `r.json()` for those.)

Both file types have a nested structure, so we can drill down into the file by treating the tags as attributes.  For example, `soup.companyinfo` will return everything between `<companyinfo>` and `</companyinfo>`.  Note that the `'lxml'` parser used below treats the file as html and lowercases every tag name, so we always refer to tags in lowercase here.

In [None]:
soup = BeautifulSoup(r.text, 'lxml')

soup.companyinfo

We can then keep going with further references.

In [None]:
print(soup.companyinfo.city)
print(soup.companyinfo.city.text)

However, this notation only picks the first instance of a given tag.  If tags are repeated, we can put them all in a list with `.find_all()`.

In [None]:
links = soup.find_all('filinghref')
links
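The difference between attribute access and `.find_all()` is easier to see on a small document.  The sketch below parses a made-up string shaped loosely like the EDGAR response (the tag contents and URLs are invented, not real data).  It uses the built-in `'html.parser'` so that no extra parser needs to be installed; like the `'lxml'` parser above, it lowercases tag names.

```python
from bs4 import BeautifulSoup

# A made-up document, not a real EDGAR response
sample = """
<companyFilings>
  <companyInfo>
    <city>SPRINGFIELD</city>
    <state>IL</state>
  </companyInfo>
  <results>
    <filing><filingHREF>https://example.com/filing-1-index.htm</filingHREF></filing>
    <filing><filingHREF>https://example.com/filing-2-index.htm</filingHREF></filing>
  </results>
</companyFilings>
"""

# An html parser lowercases tag names, so we search for 'filinghref'
demo = BeautifulSoup(sample, 'html.parser')

# Attribute access returns only the first matching tag
first_city = demo.companyinfo.city.text

# find_all returns every matching tag, in document order
hrefs = [tag.text for tag in demo.find_all('filinghref')]

print(first_city)
print(hrefs)
```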

Now we have a list of all the search results.  Each link is a separate 13F filing for our company.  We now wish to load one of the results.  

In [None]:
url = links[0].text

r = requests.get(url)

soup = BeautifulSoup(r.text, 'lxml')

The page returned is a list of the different documents in the filing.  What we want is the largest xml file of the available options.  We therefore loop over each row of each table and find the biggest one that is xml.

In [None]:
# Track the largest xml file seen across all tables
running_max = 0
xml_file = None

# Loop over each table
for table in soup.find_all('table'):

    # Loop over each row in table
    for row in table.find_all('tr'):

        # Create a list of the text in each cell on the row, with whitespace removed
        cols = [ele.text.strip() for ele in row.find_all('td')]

        # Skip header rows, short rows, and rows whose first cell is empty
        # or whose last cell (the file size) is not a number
        if len(cols) < 3 or not cols[0] or not cols[-1].isdigit():
            continue

        # Locate the largest xml file
        size = int(cols[-1])
        if size > running_max and cols[2].endswith('xml'):
            xml_file = cols[2]
            running_max = size
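Once the rows are plain lists of strings, the same selection can be written more compactly.  The sketch below runs on made-up rows (the file names and sizes are hypothetical, not taken from a real filing) and uses `max()` with a `key` function instead of a running maximum.

```python
# Hypothetical rows: sequence, description, document, type, size
rows = [
    ['1', 'PRIMARY DOCUMENT', 'primary_doc.html', '13F-HR', ''],
    ['1', 'PRIMARY DOCUMENT', 'primary_doc.xml', '13F-HR', '1500'],
    ['2', 'INFORMATION TABLE', 'infotable.html', 'INFORMATION TABLE', ''],
    ['2', 'INFORMATION TABLE', 'infotable.xml', 'INFORMATION TABLE', '98000'],
    ['', 'Complete submission text file', 'submission.txt', '', '120000'],
]

# Keep only rows that name an xml document and report a numeric size
candidates = [
    row for row in rows
    if len(row) >= 3 and row[2].endswith('xml') and row[-1].isdigit()
]

# Pick the row with the largest size; None if nothing qualified
largest = max(candidates, key=lambda row: int(row[-1]), default=None)
largest_name = largest[2] if largest else None

print(largest_name)  # infotable.xml
```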

We then fetch the xml file we found.

In [None]:
base_url = url[:url.rfind('/')]+'/'+xml_file
r = requests.get(base_url)

soup = BeautifulSoup(r.text, 'xml')
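Slicing at the last `/` works because the xml file sits in the same folder as the index page.  A hedged alternative is the standard library's `urllib.parse.urljoin`, which resolves a file name relative to a page URL the way a browser would.  The URL and file name below are placeholders standing in for `url` and `xml_file`.

```python
from urllib.parse import urljoin

# Placeholder values, not a real filing
index_url = 'https://example.com/Archives/data/123/000123/0001-index.htm'
file_name = 'infotable.xml'

# Manual approach: cut the URL at its last '/' and append the file name
manual = index_url[:index_url.rfind('/')] + '/' + file_name

# urljoin replaces the last path segment with the relative file name
joined = urljoin(index_url, file_name)

print(joined)
print(manual == joined)  # True
```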

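Each holding in the xml file is stored in an `infoTable` element.  (The `'xml'` parser used above preserves the case of tag names, unlike the `'lxml'` html parser.)  We loop over all the holdings and store the data in a dataframe.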

In [None]:
holdings = soup.find_all('infoTable')

myList = []

for holding in holdings:
    myList.append([holding.cusip.string, holding.value.string, holding.sshPrnamt.string])

# Outputs CUSIP, Value in USD 000's, and Number of Shares
df = pd.DataFrame(myList, columns=['cusip', 'value', 'sshPrnamt'])

df.head(10)

We could then use our above code to loop over each filing, or over multiple companies.  However, be very respectful of the site you are scraping.  Be sure to include a ~1 second delay between page requests, or you risk bringing down the site or being blocked.
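One way to build in that delay is sketched below.  The function name `fetch_all` and its parameters are hypothetical; the `fetch` and `sleep` arguments are passed in so the example can run here with stand-ins instead of real network calls.

```python
import time

def fetch_all(urls, fetch, delay=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing `delay` seconds between calls."""
    pages = []
    for i, u in enumerate(urls):
        if i > 0:
            sleep(delay)   # wait before every request except the first
        pages.append(fetch(u))
    return pages

# Stand-ins so the example runs without touching the network
waits = []
pages = fetch_all(
    ['page-1', 'page-2', 'page-3'],
    fetch=lambda u: 'contents of ' + u,
    sleep=waits.append,
)

print(pages)
print(waits)   # [1.0, 1.0]
```

In real use we would pass something like `fetch=lambda u: requests.get(u).text` and leave `sleep` at its default, so the loop actually pauses.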

## POST Requests

Suppose I wish to scrape a page that is only available to users who have logged in.  Here, we will try a page from WRDS.

In [None]:
target_url = "https://wrds-web.wharton.upenn.edu/wrds/search/variableSearch.cfm"

r = requests.get(target_url)
print(r.status_code)
print(r.text)

HTML isn't easy to read, but Jupyter lets us render the output.

In [None]:
display(HTML(r.text))

We see that we don't get an error code, but the page that loads does say 'You must be logged-in to access that page'.  
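Because the status code alone does not reveal whether we were let in, one option is to look for the warning text in the page itself.  The helper below is a hypothetical sketch; the marker string is copied from the message quoted above and may not match the page's exact wording.

```python
def looks_logged_out(html, marker='You must be logged-in'):
    """Return True if the page text contains the login warning (illustrative)."""
    return marker.lower() in html.lower()

# Made-up snippets standing in for r.text
blocked = '<p>You must be logged-in to access that page.</p>'
allowed = '<h1>Variable Search</h1>'

print(looks_logged_out(blocked))  # True
print(looks_logged_out(allowed))  # False
```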

### Logging in

Not all sites send the necessary request information as GET requests (in the URL).  Sometimes the data is sent separately, in the body of the request rather than in the URL.  This is known as a POST request, and it is how sensitive information like login details is sent.

How do we know what this data looks like?  In Chrome, navigate to the login page of the site we wish to access.  Open developer tools with `Ctrl+Shift+I`.  Click on the `Network` tab and check the `Preserve log` box.  Now log in.

You will see a number of files load in the log.  These are all the html files, scripts, images, and stylesheets that the displayed page loaded.  Scroll to the top and click on the first file that loaded.  Click the `Headers` tab and look at the bottom for `Form Data` (recent versions of Chrome show it under a separate `Payload` tab).  These are the contents of the request that we need to replicate.

Below, we construct a dictionary with the data we wish to send in the request.  We will use this to sign into WRDS.

In [None]:
login = {
    'username': 'FILL',
    'password': 'FILL'
}
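Before sending such a dictionary, it may help to see what it turns into.  The sketch below builds (but never sends) a POST request with the standard library, using placeholder credentials and a placeholder URL rather than the real WRDS form.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Placeholder form fields and URL (hypothetical, not the real WRDS form)
form = {'username': 'alice', 'password': 's3cret'}
form_url = 'https://example.com/login'

# The dictionary is encoded just like a query string...
body = urlencode(form).encode('utf-8')

# ...but it travels in the request body, so the URL itself stays clean
req = Request(form_url, data=body, method='POST')

print(req.get_method())   # POST
print(req.full_url)       # https://example.com/login
print(req.data)           # b'username=alice&password=s3cret'
```

The `VARIABLE=VALUE` pairs are the same as in a GET request; only their location in the request changes.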

In [None]:
login_url = "https://wrds-web.wharton.upenn.edu/wrds/index.cfm"

Now we use `requests.post()` to send along our POST data.

In [None]:
t = requests.post(login_url, data=login)

Normally, when you sign into a web page, the website will return a 'cookie' to your computer that identifies you as logged in.  Then, every time you click on a new link, your browser will send this cookie back to the website to indicate you have permission.  

In [None]:
t.cookies.get_dict()

If we were to again try `requests.get(target_url)`, we would get the same error as before.  Even though we signed in, we did not send back our cookie.
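The cookie round trip can be illustrated without any network traffic.  The sketch below uses the standard library's `http.cookies.SimpleCookie` on a made-up `Set-Cookie` header; the cookie name and value are invented for illustration.

```python
from http.cookies import SimpleCookie

# A made-up Set-Cookie header such as a server might send after login
set_cookie = 'sessionid=abc123; Path=/; HttpOnly'

# The client parses the header and remembers the name/value pair
jar = SimpleCookie()
jar.load(set_cookie)
stored = {name: morsel.value for name, morsel in jar.items()}

# On the next request the client sends the pair back in a Cookie header
cookie_header = '; '.join(f'{name}={value}' for name, value in stored.items())

print(stored)          # {'sessionid': 'abc123'}
print(cookie_header)   # sessionid=abc123
```

A plain `requests.get()` call skips that last step, which is why the site does not recognise us.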

### Sessions

Here, we open a `Session()` so that `requests` can keep track of this cookie for us.  We use the `with` statement to open the session: it defines a new variable `s` as a `requests.Session()` that is available within the block of code.  Any time we make a GET or POST request through `s` in this block, the session sends along our cookies.  Once we exit the block, the session closes its connections, and the cookies are no longer sent with requests made outside the session.

In [None]:
# Use 'with' to ensure the session context is closed after use.
with requests.Session() as s:
    # Submitting credentials to login page
    p = s.post(login_url, data=login)

    # An authorised request.
    r = s.get(target_url)
    display(HTML(r.text))


We see that the above code successfully loaded the search page.  Now what if we want to run a search on this page?  If you try it, you will see that the URL does not change when search results are loaded.  Again, the search info was sent via POST.  We can again look at the Header information to see what was sent.

An example search is shown below.

In [None]:
search = {
    'search_term' : 'prcc',
    'libraries_to_search' : ['129', '137']
}
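Note that `libraries_to_search` holds a list.  When `requests` encodes form data, a list value is sent as one `KEY=VALUE` pair per item.  The sketch below mirrors that encoding with the standard library's `urlencode` so we can inspect the result; it is an illustration and not the code `requests` runs internally.

```python
from urllib.parse import urlencode

search = {
    'search_term': 'prcc',
    'libraries_to_search': ['129', '137'],
}

# doseq=True writes one KEY=VALUE pair per list item
encoded = urlencode(search, doseq=True)

print(encoded)
# search_term=prcc&libraries_to_search=129&libraries_to_search=137
```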

In [None]:
# Use 'with' to ensure the session context is closed after use.
with requests.Session() as s:
    # Submitting credentials to login page
    p = s.post(login_url, data=login)

    # An authorised request.
    r = s.post(target_url, data=search)
    
display(HTML(r.text))

If we wanted to extract the information from the page, we would again parse it with `BeautifulSoup` and drill down into the fields we wanted by referencing tags.

In [None]:
results = BeautifulSoup(r.text, 'html.parser')
results.find_all('table')[0]

# SQL References

For SQL practice, I recommend [SQLZOO](http://sqlzoo.net/).  Installing and setting up SQL is rather difficult, so this site lets you practice in a mock environment that they host.  If you want to see the solutions, add `?answer=1` to the URL (a GET request!).

# Other References

[Cheat Sheets](https://www.datacamp.com/community/data-science-cheatsheets)