# Python Webscraping with XML Data Querying

#### Christian Rivera
- Github: https://github.com/crivera2013

**The purpose of this lesson is to teach how to use Python to scrape files and text from the internet.  This has three main uses:**
1. **Automating previously manual tasks.**
2. **Parsing large amounts of semi-structured data.**
3. **Getting used to data stored in Hash Table/Dictionary/JSON/NoSQL format (each "row" in a dataset has a dynamic set of column attributes that can be independent of all the other "rows" in the set).**

For this lesson, we will be using the **Requests**, **Pandas**, and **BeautifulSoup** libraries.

![alt text](https://www.explainxkcd.com/wiki/images/1/1a/api.png)

# Intro to HTML for Webscraping

### Load the packages

**Requests** allows Python to retrieve information from websites and other computers using HTTP calls.  We saw this before with APIs; however, if the website is not an API, **requests.get** returns the full HTML file of the webpage.

**BeautifulSoup** is a python package that enables us to search through the HTML file for the information we want.


[**Requests/BeautifulSoup Tutorial**](https://www.dataquest.io/blog/web-scraping-tutorial-python/)

In [None]:
import requests
from bs4 import BeautifulSoup

<html>
    <head>
        <title>Beginner Beautiful Soup HTML Example </title>
    </head>
    <body>
        <h2> Double Click on this box to see html code!</h2>
        <p class="bold-paragraph">
            Double Click This Box!
            <a href="https://www.dataquest.io/blog/web-scraping-tutorial-python/" id="learn-link">Link to tutorial resource
            </a>
        </p>
        <p class="bold-paragraph extra-large">
            Seriously, Double Click The Box!
            <a href="http://www.smbc-comics.com/comic/2013-02-01" class="extra-large">I'm an 'a' tag and I create links
            </a>
        </p>
        <div id="my-first-div">
            <img src="https://www.explainxkcd.com/wiki/images/1/1a/api.png" width=150px>
            <p id='get-me' style="text-align: center;"> I'm a nested p tag with a centered style attribute nested within a div tag
            </p>
        </div>
    </body>
</html>

In [None]:
# a string can span multiple lines if it starts and ends with triple quotes (""")
sample_html="""<html>
    <head>
        <title>Beginner Beautiful Soup HTML Example </title>
    </head>
    <body>
        <h2> Double Click on this box to see html code!</h2>
        <p class="bold-paragraph">
            Double Click This Box!
            <a href="https://www.dataquest.io/blog/web-scraping-tutorial-python/" id="learn-link">Link to tutorial resource
            </a>
        </p>
        <p class="bold-paragraph extra-large">
            Seriously, Double Click The Box!
            <a href="http://www.smbc-comics.com/comic/2013-02-01" class="extra-large">I'm an 'a' tag and I create links
            </a>
        </p>
        <div id="my-first-div">
            <img src="https://cdn-images-1.medium.com/max/1200/1*Qi2ta02wgA4-otNFDaFrRw.png" width=150px>
            <p id='get-me' style="text-align: center;"> I'm a nested p tag with a centered style attribute nested within a div tag
            </p>
        </div>
    </body>
</html>
"""

#### Instead of using Requests to pull API data, you have it pull regular HTML data as a string.  You then parse that string into a tree of tags so that you can access its data much as you would access data in dictionaries like dog1 and dog2.

HTML files consist of tags, nested tags and attributes/text/data stored in each of those tags.  We've seen something similar before with the python dictionaries / JSON objects attribute-value pairs.  Our sample HTML above has this structure:

<div style="font-weight:bold">
<ul>
  <li>head
    <ul>
      <li>title</li>
    </ul>
  </li>
  <li>body
  	<ul>
      <li>h2</li>
      <li>p
      	<ul>
        	<li>a</li>
      	</ul>
      </li>
      <li>p
      	<ul>
        	<li>a</li>
      	</ul>
      </li>
      <li>div
      	<ul>
        	<li>img</li>
            <li>p</li>
      	</ul>
      </li>
    </ul>
  </li>
</ul>
</div>

[List of different types of HTML tags and their uses](https://developer.mozilla.org/en-US/docs/Web/HTML/Element)

### Now let's use BeautifulSoup

In [None]:
# parse the HTML string into a BeautifulSoup tree using the lxml parser

soup = BeautifulSoup(sample_html,'lxml')
print(type(soup))
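
To make the nesting described earlier visible, here is a minimal sketch that walks a parsed document and prints one indented line per tag. The shortened `html` string and the helper name `tag_tree` are illustrative assumptions, and it uses Python's built-in `html.parser` instead of `lxml` only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical, shortened version of the sample page
html = """<html><head><title>Example</title></head>
<body><h2>Heading</h2><p class="bold-paragraph">Text <a href="#">link</a></p></body></html>"""

# 'html.parser' ships with Python; the lesson itself uses 'lxml'
soup = BeautifulSoup(html, "html.parser")

def tag_tree(node, depth=0):
    """Return one indented line per tag found below `node`."""
    lines = []
    for child in node.children:
        if isinstance(child, Tag):  # skip plain text between tags
            lines.append("  " * depth + child.name)
            lines.extend(tag_tree(child, depth + 1))
    return lines

tree_lines = tag_tree(soup)
print("\n".join(tree_lines))
```

Each level of indentation in the printed result corresponds to one level of nesting in the HTML.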


#### Get the metadata title of the sample webpage.

BeautifulSoup uses dot notation to navigate nested tags.

In [None]:
print(soup.head.title)
print("\n")
print(soup.head.title.text)

In [None]:
# how about the body?
soup.body
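
One detail worth knowing about dot notation: it returns only the first matching tag, and it returns `None` when no such tag exists. The short snippet below is a sketch on a made-up two-paragraph page (parsed with the built-in `html.parser`) that shows both cases.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two paragraphs and no table
html = "<html><body><p>first</p><p>second</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

first_p = soup.body.p          # same result as soup.body.find('p')
print(first_p.text)            # first

missing = soup.body.table      # there is no <table> in this snippet
print(missing)                 # None

# Chaining onto a missing tag raises AttributeError, so check for None first
table_text = missing.text if missing is not None else ""
```

Checking for `None` before chaining further avoids an `AttributeError` when a page does not contain the tag you expect.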

## But what if we want data that is nested deep within an html file?

- **soup.find( tag_name, { attribute-key-value-pairs } )** returns the first matching tag.

- **soup.find_all( tag_name, { attribute-key-value-pairs } )** returns a list of all matching tags.

In [None]:
# find the first p tag in the document
p_tag = soup.find('p')
print(p_tag)

In [None]:
# return the text of that p tag
print(p_tag.text)

In [None]:
# find all the p_tags and put them in a list
p_tags = soup.find_all('p')
print("# of p tags: %s" %len(p_tags))
print(p_tags[0].text)

In [None]:
# find the p tag with the class: "bold-paragraph extra-large"
soup.find('p',{'class':"bold-paragraph extra-large"})
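
Because a tag can carry several classes, it helps to see how class matching behaves. The sketch below uses a made-up two-paragraph snippet and the built-in `html.parser`; it compares matching on one class name, on the full attribute string, and with a CSS selector through `select`.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet with the same class names as the sample page
html = """
<p class="bold-paragraph">one</p>
<p class="bold-paragraph extra-large">two</p>
"""
soup = BeautifulSoup(html, "html.parser")

# A single class name matches every tag that carries that class
by_one_class = soup.find_all("p", {"class": "bold-paragraph"})

# The full string matches only that exact attribute value, in that order
exact = soup.find_all("p", {"class": "bold-paragraph extra-large"})
reversed_order = soup.find_all("p", {"class": "extra-large bold-paragraph"})

# A CSS selector requires both classes but ignores their order
by_selector = soup.select("p.extra-large.bold-paragraph")

print(len(by_one_class), len(exact), len(reversed_order), len(by_selector))
```

If the order of the classes on a page is not guaranteed, the CSS selector form is usually the safer choice.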

In [None]:
# find the a tag with the id: "learn-link"

a_tag = soup.find('a',{'id':'learn-link'})
a_tag

In [None]:
# extract the url from that a_tag

# attributes of a tag are accessed with dictionary notation

a_tag['href']
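
As a small illustration of attribute access, the sketch below uses a made-up `a` tag to show the `attrs` dictionary, the list returned for the multi-valued `class` attribute, and `get`, which returns `None` for a missing attribute where square brackets would raise a `KeyError`.

```python
from bs4 import BeautifulSoup

# Hypothetical link; the URL and class names are placeholders
html = '<a href="https://example.com/page" id="learn-link" class="big blue">Link</a>'
a_tag = BeautifulSoup(html, "html.parser").a

print(a_tag.attrs["id"])        # learn-link
print(a_tag["class"])           # ['big', 'blue']  (class is multi-valued)
print(a_tag.get("title"))       # None, whereas a_tag['title'] raises KeyError
```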

## With the basics of BeautifulSoup down, let's do some scraping

### Example #1: Basic XML file

In your directory there is a file called "**cd_catalog.xml**", which we will read in and parse using BeautifulSoup.

In [None]:
# the with block closes the file automatically after reading
with open("cd_catalog.xml", "r") as infile:
    contents = infile.read()
soup = BeautifulSoup(contents, 'xml')
print(soup)

In [None]:
# find all the Child tags of a given CD

for child in soup.find('CD').find_all():
    print(child.name)

In [None]:
# find the PRICE of the first CD listed

soup.find('CD').PRICE.text

In [None]:
#extract all the artists
for record in soup.find_all('CD'):
    print(record.ARTIST.text)
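
The loops above can be combined to turn each record into a dictionary. This is only a sketch: the inline two-record catalog is made up (the `TITLE` tag and all values are assumptions, not the contents of cd_catalog.xml), and the `xml` parser assumes `lxml` is installed, as in the rest of this lesson.

```python
from bs4 import BeautifulSoup

# Hypothetical two-record catalog; the second CD has no PRICE tag
xml = """<CATALOG>
  <CD><TITLE>Album A</TITLE><ARTIST>Artist A</ARTIST><PRICE>10.90</PRICE></CD>
  <CD><TITLE>Album B</TITLE><ARTIST>Artist B</ARTIST></CD>
</CATALOG>"""

soup = BeautifulSoup(xml, "xml")  # the 'xml' parser needs lxml installed

records = []
for cd in soup.find_all("CD"):
    # recursive=False keeps only the direct children of this CD
    records.append({child.name: child.text for child in cd.find_all(recursive=False)})

print(records)
```

Note that the two dictionaries have different keys, which is the "dynamic set of column attributes" idea from the introduction. Passing `records` to `pd.DataFrame` would give one column per tag name, with `NaN` where a record lacks that tag.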

### Example #2: Basic Webscrape

**Real Example**:  Extracting transaction fee data from the Brazilian Stock Exchange

In [None]:
import pandas as pd
import requests
from bs4 import BeautifulSoup


In [None]:
# use requests.get to pull the HTML results 
url = "http://www.bmfbovespa.com.br/en_us/services/fee-schedules/listed-equities-and-derivatives/equities/equities-and-investment-funds-fees/spot/"

response = requests.get(url)

Let's look at a 1000-byte slice of **response.content**.

In [None]:
response.content[2000:3000]

So ugly.... How do we parse this file and get the data we want?  Google Chrome and BeautifulSoup do the trick.

**Copy the url above and go to the URL in your chrome browser.  Right click on the portion of the page with the data you want, and click "inspect" from the dropdown menu.**

## Right Click on webpage ->  then click "Inspect" 


![alt Text](http://3qilabs.com/wp-content/uploads/2014/08/Screen-Shot-2014-08-28-at-2.20.05-PM.png)

Google Chrome allows you to see the location of the data you want and the tag it is associated with.

Tags (shown in purple) are reference marks and an HTML file contains tags within tags within tags.  By making our way down this tree and searching for tags with specific attributes like "class" or "id", we can specify exactly the tag we are looking for and access the data found within.

First we convert our **response.content** into a tree/tag format using **BeautifulSoup**.




In [None]:
#                       # raw HTML,      # parser to use
webpage = BeautifulSoup(response.content,'lxml')


The data we want is in a table, and HTML marks tables with **table** tags, so let's find all the table tags.

In [None]:
tables = webpage.find_all('table')
print("Number of tables: %s" % (len(tables)))

**The table we want is the second table (index 1), so we specify that.**

In [None]:
secondTable = tables[1]
secondTable

This second table has a bunch of **td** tags that align **right** so we find all of them.

In [None]:
table_values =  secondTable.find_all('td', align = 'right')

print("Number of td tags: %s \n" % len(table_values))
print(table_values)
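
When you need the whole table and not one cell, a common pattern is to loop over the `tr` rows and collect the text of each cell. The sketch below does this on a made-up table; its layout and values are assumptions and do not reflect the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical table; the real page's values and layout may differ
html = """<table>
  <tr><th>Type</th><th>Fee</th></tr>
  <tr><td>A</td><td align="right">0.01%</td></tr>
  <tr><td>B</td><td align="right">0.02%</td></tr>
</table>"""

table = BeautifulSoup(html, "html.parser").find("table")

rows = []
for tr in table.find_all("tr"):
    # a list of names matches either header cells or data cells
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

print(rows)
```

The result is a list of lists, one inner list per table row, which is easy to index by row and column.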

There are 8 **td** values (2 rows with 4 columns) and we want the last one in row 1.  So that index is **3** since we start at zero.

In [None]:
desired_value = table_values[3]

**We extract the text from the cell**

In [None]:
desired_value.text


**Finally we clean up the value to remove the percent sign and to convert into a float**

In [None]:
final_value = desired_value.text[:-1]
final_value = float(final_value)
print(final_value)
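
Dropping the last character only works when the percent sign really is the last character of the text. A slightly more forgiving version might look like the sketch below; the helper name `percent_to_float` and the sample string are illustrative assumptions.

```python
def percent_to_float(text):
    """Convert a string such as ' 0.0325% ' to the float 0.0325."""
    # remove surrounding whitespace, then a trailing percent sign if present
    cleaned = text.strip().rstrip("%").strip()
    return float(cleaned)

example_value = percent_to_float(" 0.0325% \n")
print(example_value)   # 0.0325
```

The same helper also accepts text that has no percent sign at all, such as `"1.5"`.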

#### A trivial but still relevant example of a manual task that can be automated using python.  Let's try something more complicated.



# Webscraping SEC 10-K Filings

There is a Python package for this specific use case, but let's try it ourselves using **Requests** and **BeautifulSoup** to show how it can be done.

You can search for [10-Ks manually](https://www.sec.gov/cgi-bin/browse-edgar) and access them for free.  Checking the URL of the results, we notice that after the initial URL there is a **?** (question mark) followed by a series of arguments and inputs, including the company name and the type of form.  This suggests a structure for how to query data.  Let's make it work for Nvidia.

In [None]:
url = 'https://www.sec.gov/cgi-bin/browse-edgar'

params = {
    'CIK':'NVDA',
    'action': 'getcompany',
    'type':'10-K'
        }
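
To see how a dictionary of parameters maps onto the part of the URL after the question mark, here is a small sketch using the standard library's `urlencode`. It only builds the string and makes no network call; Requests performs an equivalent encoding when you pass `params`.

```python
from urllib.parse import urlencode

url = "https://www.sec.gov/cgi-bin/browse-edgar"
params = {"CIK": "NVDA", "action": "getcompany", "type": "10-K"}

# key=value pairs are joined with '&' and appended after '?'
full_url = url + "?" + urlencode(params)
print(full_url)
# https://www.sec.gov/cgi-bin/browse-edgar?CIK=NVDA&action=getcompany&type=10-K
```

After a real request, the `url` attribute of the response object shows the final URL that Requests built.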

In [None]:
webpage = requests.get(url, params=params)

soup = BeautifulSoup(webpage.content,'lxml')

print(soup.title.text)

Observing the HTML from Google Chrome, we can see that Table 3 has the links to **Documents**.  We want that link.  Table 3 has class name '**tableFile2**' which we can use as a search parameter.

In [None]:
table =  soup.find('table', {'class':'tableFile2'})
table

#### We want the 2nd 'tr' row (index 1)

In [None]:
row = table.find_all('tr')[1]

In [None]:
row

In [None]:
cells = row.find_all('td')

#### Now let's get the info we want out of those cells like 'date', 'file number' and the link to the actual document

In [None]:
filing_type = cells[0].text
filing_date = cells[3].contents[0]
file_number = cells[4].a.contents[0]
doc_url = cells[1].a['href']

In [None]:
print(filing_type)
print(filing_date)
print(file_number)
print(doc_url)

#### Now that we have the url we want, we scrape that page too.

In [None]:
url = 'https://www.sec.gov'+doc_url

resultsPage = requests.get(url)
print(url)
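
String concatenation works here because the `href` starts with a slash. As a hedged alternative, the standard library's `urljoin` resolves a link against the page it came from and also handles links that are already absolute. The paths below are placeholders, not real filing URLs.

```python
from urllib.parse import urljoin

page_url = "https://www.sec.gov/cgi-bin/browse-edgar"

# Hypothetical href values of the kinds a page may contain
from_root = urljoin(page_url, "/Archives/example/index.htm")
already_absolute = urljoin(page_url, "https://example.com/other")

print(from_root)         # https://www.sec.gov/Archives/example/index.htm
print(already_absolute)  # https://example.com/other
```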

In [None]:
# parse the filing index page, which lists the documents in the filing
soup2 = BeautifulSoup(resultsPage.content,'lxml')

#### And we dig down into the first data row (index 1) of the first table to find the link in the 'Document' column.

In [None]:
row = soup2.find('table').find_all('tr')[1].find_all('td')

In [None]:
description = row[1].contents[0]
final_url = row[2].a['href']

In [None]:
print(description)
print(final_url)

#### Finally, we procure the SEC 10-K file

In [None]:
url = 'https://www.sec.gov'+final_url

resultsPage = requests.get(url)
print(url)

#### We pull all of the text out of the BeautifulSoup data:

In [None]:
tenK = BeautifulSoup(resultsPage.content,'lxml')
all_text = tenK.text
print(all_text)

### So we got the 10-K text we wanted.  Let's wrap all of that in a function so that we can do it for any company we search for.

In [None]:
def secTenKScraper(companies, uid, password):
    
    proxyDict = { "http"  : 'http://'+uid+":"+password+"@http-proxy.vanguard.com:80", 
              "https" : 'http://'+uid+":"+password+"@http-proxy.vanguard.com:80" }
    
    results = {'fails':[]}
    
    for company in companies:
        try:
            url = 'https://www.sec.gov/cgi-bin/browse-edgar'
            params = {'CIK': company, 'action':'getcompany', 'type':'10-K'}
            
            webpage = requests.get(url, params=params,proxies=proxyDict)
    
            soup = BeautifulSoup(webpage.content,'lxml')
            cells =  soup.find('table', {'class':'tableFile2'}).find_all('tr')[1].find_all('td')
    
            filing_date = cells[3].contents[0]
            file_number = cells[4].a.contents[0]
            doc_url = cells[1].a['href']
        
            url = 'https://www.sec.gov'+doc_url
            resultsPage = requests.get(url, proxies=proxyDict)
            soup2 = BeautifulSoup(resultsPage.content,'lxml')
            row = soup2.find('table').find_all('tr')[1].find_all('td')
            final_url = row[2].a['href']
        
            url = 'https://www.sec.gov'+final_url
            resultsPage = requests.get(url, proxies=proxyDict)
        
            tenK = BeautifulSoup(resultsPage.content,'lxml')
            all_text = tenK.text
            
            print(company)
                    
            results[company] = {
                'file_date':filing_date,
                'file_number':file_number,
                'url' : url,
                '10-K' : all_text,
                'html': resultsPage
                            }
        except Exception:  # record the company and move on to the next one
            results['fails'].append(company)
    
    return results
        
    

In [None]:
filings = secTenKScraper(['MSFT','FB','WMT'],uid,password)
print(filings.keys())
print(filings['fails'])

In [None]:
print(filings['FB']['10-K'])

#### With this function, you can now scrape any SEC 10-K filing you want, increasing the amount of data you can use for analysis.
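
The scraper relies on a try/except block to cope with pages that lack the expected table. A complementary approach, sketched below on made-up HTML, is to check each lookup for `None` before going deeper. The helper name `first_data_cells` and both sample strings are assumptions for illustration only.

```python
from bs4 import BeautifulSoup

def first_data_cells(html, table_class):
    """Return the <td> cells of the first data row, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": table_class})
    if table is None:
        return None                     # page has no such table
    rows = table.find_all("tr")
    if len(rows) < 2:
        return None                     # header only, no data rows
    return rows[1].find_all("td")

# Hypothetical pages: one with results, one without
good = '<table class="tableFile2"><tr><th>Filings</th></tr><tr><td>10-K</td></tr></table>'
empty = "<p>No matching filings.</p>"

print(first_data_cells(good, "tableFile2")[0].text)   # 10-K
print(first_data_cells(empty, "tableFile2"))          # None
```

Returning `None` makes it possible to tell "no filings found" apart from other failures, such as a network error.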

# Your Turn

## Scrape the NASDAQ website and retrieve the "Today's High/Low" value and the "P/E Ratio" for the stock AAPL



In [None]:
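base_url = "https://www.nasdaq.com/symbol/"

symbol = "AAPL"

# proxyDict only exists inside secTenKScraper, so rebuild it here (same uid and password as above)
proxyDict = { "http"  : 'http://'+uid+":"+password+"@http-proxy.vanguard.com:80",
              "https" : 'http://'+uid+":"+password+"@http-proxy.vanguard.com:80" }

results = requests.get(base_url + symbol, proxies=proxyDict)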


In [None]:
# Convert to BeautifulSoup data type here





In [None]:
# Use these blocks to try out code.  
# Add blocks as you need to: 
# "+" button in the top left corner



