# Introduction to Webscraping in Python
Webscraping is the process of automating information collection from the internet. Think back to the Iris DataFrame that we used in the Pandas introduction. That data had to be collected by hand, with its creator measuring each aspect of the flower and recording the result. Through webscraping we can programmatically collect information from webpages and eliminate the bulk of manual data collection.

## Introduction
All websites are structured using HTML code to arrange and design the elements on the page. You can see this structure if you use your browser's Inspect Element tool on a page. For this walkthrough we are going to use the requests and Beautiful Soup libraries. The requests library lets you download a website's content, which is then passed to Beautiful Soup. Beautiful Soup (BSoup) lets you parse the HTML tree collected by the requests library. For an example, visit this website: 
https://www.tdcj.texas.gov/death_row/dr_executed_offenders.html

If you use Inspect Element on the page, a panel will open (usually on the right-hand side) showing the HTML tree. This code relates everything on the webpage to everything else. Let's start by requesting a website and looking at its raw HTML. 

In [1]:
### NEW CODE ###
!pip install bs4
!pip install requests
from bs4 import BeautifulSoup
import requests

url = 'https://www.tdcj.texas.gov/death_row/dr_executed_offenders.html'
website_proper = requests.get(url)
soup = BeautifulSoup(website_proper.text, 'html.parser')



We can see here that we had to add ".text" to the end of our "website_proper" object. requests.get returns a Response object, while BeautifulSoup expects the page's markup as a string, so passing the Response object directly would raise an error. Note that .text is an attribute rather than a function, so it is written without parentheses. It will be used again later on Beautiful Soup tags, where it likewise extracts text. We can now verify that we have the website's "soup" by printing the soup variable in the cell below. 

In [2]:
### NEW CODE ###
print(soup)

<!DOCTYPE html>

<html lang="en-US"><!-- InstanceBegin template="/Templates/generic_inside.dwt" codeOutsideHTMLIsLocked="false" -->
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<!-- stylesheet: global -->
<link href="/stylesheets/global.css" rel="stylesheet"/>
<!-- stylesheet: page-specific -->
<link href="/stylesheets/content.css" rel="stylesheet"/>
<link href="/stylesheets/menu_style.css" rel="stylesheet"/>
<!-- InstanceBeginEditable name="stylesheets" -->
<!-- InstanceEndEditable -->
<!-- jQuery library (if CDN fails, use local copy) -->
<script src="/javascripts/jquery.min.js" type="text/javascript"></script><!-- ver 3.5.1 -->
<!-- javascripts -->
<script src="/javascripts/google_analytics.js" type="text/javascript"></script>
<script src="/javascripts/google_analytics_all.js" type="text/javascript"></script>
<script src="/javascripts/google_search.js" type="text/javascript"></script>
<script src="/javascripts/google_translate.

So we can collect a website's soup using the requests and BSoup libraries. We have done this for the Texas Death Row website, but let's try something slightly less visually structured. Why not the finance department homepage for McCombs? 

In [3]:
### NEW CODE ###
url = 'https://www.mccombs.utexas.edu/Departments/Finance'
website_proper = requests.get(url)
soup = BeautifulSoup(website_proper.text, 'html.parser')
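
The same two steps (download, then parse) have now appeared twice, so they could be wrapped in a small helper. The sketch below is only an illustration: `fetch_soup` and its `timeout` default are hypothetical names chosen here, not part of the original walkthrough. It also checks the response status before parsing, and the last lines show that BeautifulSoup itself only needs a string of markup, so it can be tried without a network connection.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and return its parsed soup (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()   # raise an exception on 4xx/5xx status codes
    return BeautifulSoup(response.text, "html.parser")


# BeautifulSoup only needs a string of markup, so parsing can be tried offline.
demo = BeautifulSoup("<p>Hello, <b>soup</b></p>", "html.parser")
print(type(demo).__name__)   # BeautifulSoup
print(demo.get_text())       # Hello, soup
```

With a helper like this, a cell could call `soup = fetch_soup(url)` instead of repeating the two lines each time.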

Once we have a soup object for a website we can perform a number of operations on it. It is important to note that the "soup" variable above is not just a string; it is a BSoup object that stores a large amount of data about the page, which we are going to learn to access. Let's start by getting the title of the webpage. 

In [4]:
### NEW CODE ###
soup.title

<title>
	Department of Finance | Departments | McCombs Business 
</title>

Pretty simple. Now let's try getting specific sections of the website. Say we are interested in the information associated with each distinct section of the finance department: undergraduate, masters, and Ph.D. In this case it would be easy to find the information just by reading, but imagine a website with thousands of subsections. We can use BSoup to iterate through the website and find each distinct section. There are a variety of ways to do this, but one of the more common approaches is through "tags". 

Tags are the first way that we relate different parts of a webpage to each other. For example, the main heading of a webpage typically has a tag of "h1". Subheaders are labeled "h2" and so on. On the McCombs website the h1 tag is not the title, but the h2 and h3 tags follow typical conventions. Let's start by getting the first "h2" tag on this website.

In [5]:
### NEW CODE ###
search_tag = 'h2'
soup.find(search_tag)

<h2 class="headlineOrange" id="main_0_primarycontentrowone_0_HeadlineBar_headlineBar">
    Overview
    
    
    
</h2>

You might notice that the block of code above returns only one h2 header, even though the page contains more than one. This is because the ".find" method returns only the first element with the given tag. To get every match we need to use ".find_all" instead. Let's use it to find all of the h3 tags on this website. 

In [6]:
### NEW CODE ###
search_tag = 'h3'
soup.find_all(search_tag)

[<h3>Undergraduate</h3>, <h3>Graduate</h3>, <h3>Doctoral</h3>]

You can see that this returns a list of h3 elements. The ".find_all" method always returns a list (an empty one if nothing matches), so we need to iterate over it to inspect each element individually. 

In [7]:
search_tag = 'h3'
h3_tags = soup.find_all(search_tag)

### NEW CODE ###
for tag in h3_tags: 
    print(tag.text)

Undergraduate
Graduate
Doctoral
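
To make the difference between the two methods concrete, here is a minimal offline sketch. The HTML string is a made-up stand-in for the McCombs page and `soup_demo` is a name chosen for this example, so running it does not overwrite the `soup` variable used in the walkthrough. It also shows what each method returns when nothing matches, and how `get_text(strip=True)` trims the whitespace around a heading.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a real page
html = """
<h2>  Overview
</h2>
<h3>Undergraduate</h3>
<h3>Graduate</h3>
"""
soup_demo = BeautifulSoup(html, "html.parser")

first_h3 = soup_demo.find("h3")          # a single Tag (the first match)
all_h3 = soup_demo.find_all("h3")        # a list of every matching Tag
missing = soup_demo.find("h4")           # None when nothing matches
none_found = soup_demo.find_all("h4")    # an empty list when nothing matches

print(first_h3.text)                     # Undergraduate
print([tag.text for tag in all_h3])      # ['Undergraduate', 'Graduate']
print(missing, len(none_found))          # None 0

# get_text(strip=True) removes the whitespace surrounding the heading text
print(repr(soup_demo.find("h2").get_text(strip=True)))   # 'Overview'
```

Because `.find` can return `None`, it is worth checking the result before asking for `.text` on a page whose structure you have not inspected.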


Above we iterated over each element and used the ".text" attribute of each tag so that we get only the tag's text and not the full HTML that appears in the printed list above. We now have the three section headers we care about. Now is a good time to discuss their relationships in the document again. 

We would call these three h3 tags "siblings" because they all exist at the same level. There is no "difference" between "Undergraduate" and "Graduate" besides the actual text displayed on the page. If you use Inspect Element on the page again, you will see that all of the h3 headings fall under an h2 heading. This makes sense because they are sub-subsections, so in the page outline we would describe the h2 heading as the parent of these h3 headings. On the McCombs website this is the "Overview" section. There could also be multiple h2 headings, which we would consider siblings of each other, and these h2 headings in turn fall under an h1 heading. Keep in mind that in the HTML tree itself, a tag's parent is whatever element encloses it; heading tags usually sit side by side inside a shared container such as a "div" rather than inside one another.

To recap, in the outline we can start with an h1 heading, which has h2 headings beneath it (siblings of each other). Each of these h2 headings has its own h3 headings beneath it (also siblings of each other), and each h3 heading can have its own children in turn. 

This is how a website forms a "tree": theoretically you could start at the very bottom of the tree (a single word or paragraph) and navigate your way to the top of the website to get the title. We will now demonstrate the commands that allow you to navigate through a website's tree. 

In [8]:
search_tag = 'h3'
h3_tags = soup.find_all(search_tag)

### NEW CODE ### 
single_tag = h3_tags[0]   # this lets us look at just one tag at a time 
print(single_tag.parent)  # now lets look at the parent for this h3 tag

<div class="content-basic tag-block col-xs-12 col-sm-8" id="Overview">
<h2 class="headlineOrange" id="main_0_primarycontentrowone_0_HeadlineBar_headlineBar">
    Overview
    
    
    
</h2>
<a id=""></a>
<h2>Academic Programs</h2>
<p>Department of Finance faculty members teach finance and real estate courses in the three primary academic programs offered in the M<span class="minyC">c</span>Combs School of Business: the Bachelor of Business Administration (BBA), the Master of Business Administration (MBA), and the Doctor of Philosophy (Ph.D.), and in the Business Foundations certificate program.</p>
<h3>Undergraduate</h3>
<p> Students can major in finance as part of a BBA. The department offers one required course for all BBA students (FIN 357 - Business Finance), and a variety of undergraduate finance elective courses covering investments, money and capital markets, corporate, international, energy and real estate finance. Students can choose a general finance major, or focus th

In [11]:
search_tag = 'h3'
h3_tags = soup.find_all(search_tag)
single_tag = h3_tags[0] 

### NEW CODE ###
tag_children = single_tag.children
print(tag_children)  # .children is an iterator, not a list, so printing it shows only the object; we need to iterate over it

for individual_tag in tag_children: 
    print(individual_tag.text)   # fails here: the only child of this h3 is its plain text (a NavigableString), not a tag

<list_iterator object at 0x00000215DFABA5B0>


AttributeError: 'NavigableString' object has no attribute 'text'
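
The error above arises because the only child of that h3 tag is its text, which Beautiful Soup stores as a NavigableString rather than a Tag; in the version that produced this output, such strings have no ".text" attribute. A minimal offline sketch of one way around this follows, using a made-up HTML snippet loosely modeled on the page. It treats both kinds of children uniformly with `str()`, then shows two other navigation moves: sideways with `find_next_sibling` and upwards with `.parents`.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up snippet loosely modeled on the structure printed earlier
html = (
    "<div id='Overview'>"
    "<h3>Undergraduate</h3><p>BBA <b>finance</b> major.</p>"
    "<h3>Graduate</h3><p>MBA courses.</p>"
    "</div>"
)
tree = BeautifulSoup(html, "html.parser")
first_h3 = tree.find("h3")

# Children may be Tags or plain strings (NavigableString); str() works for both.
for child in first_h3.children:
    kind = "tag" if isinstance(child, Tag) else "text"
    print(kind, str(child))                      # text Undergraduate

# Sideways: the paragraph that follows each h3 at the same level.
for heading in tree.find_all("h3"):
    paragraph = heading.find_next_sibling("p")
    print(heading.get_text(), "->", paragraph.get_text())

# Upwards: the name of every ancestor, starting with the closest one.
print([ancestor.name for ancestor in first_h3.parents])
```

Pairing each heading with its following sibling is often the practical way to collect "the text of each section" when headings and paragraphs share one container.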

## Advancing Into Data Science 

We can now try to collect some more structured data using tags. Let's look at the Texas Death Row website again. 

In [12]:
url = 'https://www.tdcj.texas.gov/death_row/dr_executed_offenders.html'
website_proper = requests.get(url)
soup = BeautifulSoup(website_proper.text, 'html.parser')

We now have the basic soup for the Texas Death Row website and can use tags to navigate through its tree. If we look at the tags here, we will notice that there are many siblings. For example, each row (each execution record) is tagged "tr" and each cell in that row is tagged "td". Let's start by collecting all of the rows for this website. 

In [13]:
### NEW CODE ###
web_rows = soup.find_all('tr')
for individual_row in web_rows:
    cell = individual_row.find_all('td')
    print(cell)

[]
[<td style="text-align: center">573</td>, <td style="text-align: center"><a href="dr_info/rhoadesrick.html" title="Inmate Information for Rick Rhoades">Inmate Information</a></td>, <td style="text-align: center"><a href="dr_info/rhoadesricklast.html" title="Last Statement of Rick Rhoades">Last Statement</a></td>, <td style="text-align: center">Rhoades</td>, <td style="text-align: center">Rick</td>, <td style="text-align: center">999049</td>, <td style="text-align: center">57</td>, <td style="text-align: center">9/28/2021</td>, <td style="text-align: center">White</td>, <td style="text-align: center"> Harris</td>]
[<td style="text-align: center">572</td>, <td style="text-align: center"><a href="dr_info/hummeljohn.html" title="Inmate Information for John Hummel">Inmate Information</a></td>, <td style="text-align: center"><a href="dr_info/hummeljohnlast.html" title="Last Statement of John Hummel">Last Statement</a></td>, <td style="text-align: center">Hummel</td>, <td style="text-align

Above we can see that we collected a list of "tr" elements for the website. Each one of these is still a BSoup object with all of its properties. This can be seen in the for loop, where for each row in the list of rows we access all of its "td" elements, or cells. The first row prints an empty list because it is the header row, whose cells are tagged "th" rather than "td". This means that we are able to find each cell along with the row it belongs to. 
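
Because each cell is still a tag, we can also reach inside it, for example to pull the link out of a "Link" cell. The sketch below is an offline illustration: `sample` is a cut-down copy of one row from the output above, and the dictionary keys are names invented for this example. It reads the `href` attribute with square brackets and uses `urljoin` from the standard library to turn the relative link into a full URL.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://www.tdcj.texas.gov/death_row/dr_executed_offenders.html"

# Cut-down sample of the table, with only three of the ten columns
sample = """
<table>
  <tr><th>Execution</th><th>Link</th><th>Last Name</th></tr>
  <tr><td>573</td><td><a href="dr_info/rhoadesrick.html">Inmate Information</a></td><td>Rhoades</td></tr>
</table>
"""
table_soup = BeautifulSoup(sample, "html.parser")

records = []
for row in table_soup.find_all("tr"):
    cells = row.find_all("td")
    if not cells:                  # the header row has th cells only
        continue
    link = cells[1].find("a")      # the anchor tag inside the second cell
    records.append({
        "execution": cells[0].get_text(strip=True),
        "last_name": cells[2].get_text(strip=True),
        "info_url": urljoin(base_url, link["href"]),   # relative -> absolute
    })

print(records)
```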

We are now going to try and create our own dataframe of this website (essentially just copy and paste). To do this we need one last piece of information which is the column headers.

In [14]:
### NEW CODE ###
web_headers = soup.find_all('th')  # we know that the headers are tagged with "th" from using inspect element

for individual_header in web_headers: 
    print(individual_header.text) 

Execution
Link
Link
Last Name
First Name
TDCJNumber
Age
Date
Race
County


We can now create our own dataframe using the Pandas skills we learned last week. 

In [19]:
import pandas as pd
url = 'https://www.tdcj.texas.gov/death_row/dr_executed_offenders.html'
website_proper = requests.get(url)
soup = BeautifulSoup(website_proper.text, 'html.parser')

headers = []
web_headers = soup.find_all('th')

for individual_header in web_headers: 
    headers.append(individual_header.text)

execution_df = pd.DataFrame(columns = headers)

web_rows = soup.find_all('tr')
for individual_row in web_rows:
    cells = individual_row.find_all('td')
    row_cells = []
    for cell in cells: 
        row_cells.append(cell.string)
    print(row_cells)
    

[]
['573', 'Inmate Information', 'Last Statement', 'Rhoades', 'Rick', '999049', '57', '9/28/2021', 'White', ' Harris']
['572', 'Inmate Information', 'Last Statement', 'Hummel', 'John', '999567', '45', '6/30/2021', 'White', ' Tarrant']
['571', 'Inmate Information', 'Last Statement', 'Jones', 'Quintin', '999379', '41', '5/19/2021', 'Black', ' Tarrant']
['570', 'Inmate Information', 'Last Statement', 'Wardlow', 'Billy', '999137', '45', '7/8/2020', 'White', ' Titus']
['569', 'Inmate Information', 'Last Statement', 'Ochoa', 'Abel', '999450', '47', '2/6/2020', 'Hispanic', ' Dallas']
['568', 'Inmate Information', 'Last Statement', 'Gardner', 'John', '999516', '64', '1/15/2020', 'White', ' Collin']
['567', 'Inmate Information', 'Last Statement', 'Runnels', 'Travis', '999505', '46', '12/11/2019', 'Black', 'Potter']
['566', 'Inmate Information', 'Last Statement', 'Hall', 'Justen', '999497', '38', '11/6/2019', 'White ', 'El Paso']
['565', 'Inmate Information', 'Last Statement', 'Sparks', 'Robert'
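
The cell above prints each row but leaves `execution_df` empty. One possible way to finish the job is sketched below on a small offline sample (a few columns of two rows copied from the output above), so it runs without the network. The names `sample`, `sample_soup`, and `sample_df` are chosen for this example only. It uses `get_text(strip=True)` instead of `.string`, which also trims stray spaces such as the one in " Harris" and avoids the `None` that `.string` returns when a cell holds more than one child.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Small sample in the same shape as the real table
sample = """
<table>
  <tr><th>Execution</th><th>Last Name</th><th>First Name</th><th>Age</th><th>County</th></tr>
  <tr><td>573</td><td>Rhoades</td><td>Rick</td><td>57</td><td> Harris</td></tr>
  <tr><td>572</td><td>Hummel</td><td>John</td><td>45</td><td> Tarrant</td></tr>
</table>
"""
sample_soup = BeautifulSoup(sample, "html.parser")

headers = [th.get_text(strip=True) for th in sample_soup.find_all("th")]

rows = []
for tr in sample_soup.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:                      # skip the header row, which yields []
        rows.append(cells)

# Build the DataFrame in one step from the collected rows
sample_df = pd.DataFrame(rows, columns=headers)
sample_df["Age"] = sample_df["Age"].astype(int)   # scraped values start as strings
print(sample_df)
```

Collecting the rows in a list and building the DataFrame once at the end is generally simpler than adding rows to an empty DataFrame one at a time.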

So now we have everything needed to recreate a dataframe from a website: the column headers and the cells of each row. Let's look at some other aspects of BSoup before moving on. Sometimes websites have a large number of items with common tags such as "p" or "tr". Well-built websites do a good job of separating these elements from each other using other identifiers. Let's look at another website for an example. 

Here we are looking at the McCombs course catalog. Let's say we are most interested in getting the tables from this page. Using Inspect Element we find that they are all tagged "table". 

In [20]:
url = 'https://catalog.utexas.edu/undergraduate/business/minor-and-certificate-programs/'
website_proper = requests.get(url)
soup = BeautifulSoup(website_proper.text, 'html.parser')

soup.find_all('table')

[<table class="sc_courselist" width="100%"><colgroup><col class="codecol"/><col class="titlecol"/><col class="hourscol"/></colgroup><tbody><tr><th colspan="2">Requirements</th><th>Hours</th></tr><tr class="even"><td class="codecol"><a class="bubblelink code" href="/search/?P=B%20A%20324" onclick="return showCourse(this, 'B A 324');" title="B A 324">B A 324</a></td><td>Business Communication: Oral and Written</td><td class="hourscol">3</td></tr>
 <tr class="even"><td class="orclass">or <a class="bubblelink code" href="/search/?P=B%20A%20324H" onclick="return showCourse(this, 'B A 324H');" title="B A 324H">B A 324H</a></td><td colspan="2"> Business Communication: Oral and Written: Honors</td></tr>
 <tr class="even"><td class="orclass">or <a class="bubblelink code" href="/search/?P=COM%20324H" onclick="return showCourse(this, 'COM 324H');" title="COM 324H">COM 324H</a></td><td colspan="2"> Introduction to Business Communication: Honors</td></tr>
 <tr class="even"><td class="orclass">or <a

So now we have the tables, but suppose there were several other tables on the webpage that did not include the information we wanted. BSoup also lets you filter the tree by attributes other than the tag name. Some examples are "class" and "id". This allows you to search by tag and then narrow down the matching tags based on their other attributes. 

In [21]:
### NEW CODE ###
soup.find_all('table', attrs = {'class': 'sc_courselist', 
                                'width': '100%'})

[<table class="sc_courselist" width="100%"><colgroup><col class="codecol"/><col class="titlecol"/><col class="hourscol"/></colgroup><tbody><tr><th colspan="2">Requirements</th><th>Hours</th></tr><tr class="even"><td class="codecol"><a class="bubblelink code" href="/search/?P=B%20A%20324" onclick="return showCourse(this, 'B A 324');" title="B A 324">B A 324</a></td><td>Business Communication: Oral and Written</td><td class="hourscol">3</td></tr>
 <tr class="even"><td class="orclass">or <a class="bubblelink code" href="/search/?P=B%20A%20324H" onclick="return showCourse(this, 'B A 324H');" title="B A 324H">B A 324H</a></td><td colspan="2"> Business Communication: Oral and Written: Honors</td></tr>
 <tr class="even"><td class="orclass">or <a class="bubblelink code" href="/search/?P=COM%20324H" onclick="return showCourse(this, 'COM 324H');" title="COM 324H">COM 324H</a></td><td colspan="2"> Introduction to Business Communication: Honors</td></tr>
 <tr class="even"><td class="orclass">or <a

Above we asked for all of the elements on the webpage that were tagged "table", whose class was "sc_courselist", and whose width was "100%". Hopefully you can see how powerful this will be for other webpages in the future. As they grow increasingly complex, you are able to narrow down your search conditions without reaching for other tools such as regular expressions. 
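
For comparison, here is a small offline sketch of a few equivalent ways to filter by attribute. The HTML is made up for illustration: the class "sc_plangrid", the extra class "wide", and the id "minor" are invented, and only "sc_courselist" comes from the real page. It shows the `attrs` dictionary used above, the `class_` keyword (with a trailing underscore because `class` is a reserved word in Python), the `id` keyword, and a CSS selector through `select`.

```python
from bs4 import BeautifulSoup

# Made-up page with three tables; only the class "sc_courselist" is from the real site
html = """
<table class="sc_courselist" width="100%"><tr><td>B A 324</td></tr></table>
<table class="sc_plangrid"><tr><td>First Year</td></tr></table>
<table class="sc_courselist wide" id="minor"><tr><td>FIN 357</td></tr></table>
"""
catalog = BeautifulSoup(html, "html.parser")

by_attrs = catalog.find_all("table", attrs={"class": "sc_courselist"})
by_keyword = catalog.find_all("table", class_="sc_courselist")
by_id = catalog.find("table", id="minor")
by_css = catalog.select("table.sc_courselist[width='100%']")   # CSS selector syntax

print(len(by_attrs), len(by_keyword), len(by_css))   # 2 2 1
print(by_id.get_text(strip=True))                    # FIN 357
```

Note that a class search matches any tag that carries that class, even alongside other classes, which is why the third table is included in the first two results.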

## More Advanced Webscraping Using Selenium
We are now going to explore ways to find elements on webpages using a variety of features in Selenium and Beautiful Soup. In the last webscraping walkthrough you were introduced to Selenium and the way it operates. 

Remember that Selenium is an automated web driver, which essentially means that it acts like a "person" clicking around on the screen, except that the person is a computer whose actions are coded.

To run this program you need to have the Chrome Driver installed: https://sites.google.com/chromium.org/driver/

In [3]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.keys import Keys

driver_path = './chromedriver.exe'  # driver path; assuming the driver is in the same folder as this code
                                    # on macOS or Linux the file is named chromedriver (no .exe)

options = webdriver.ChromeOptions()
# Selenium 4 passes the driver path through a Service object instead of executable_path
driver = webdriver.Chrome(service = Service(driver_path), options = options)
url = 'https://www.sothebys.com/en/buy/auction/2018/impressionist-modern-art-online'  # declares the URL
driver.get(url)  # opens the declared URL in a separate window using Selenium

Above we have the basic code to open a website in Selenium. We are now going to go through some of the keys that you can send. If you go to the URL you will notice that the entire page is not shown when it first loads: you have to scroll to the bottom of the page, and Sotheby's loads more items as you scroll down. 

In [4]:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver_path = './chromedriver.exe'  # driver path; assuming the driver is in the same folder as this code
options = webdriver.ChromeOptions()
# Selenium 4 passes the driver path through a Service object instead of executable_path
driver = webdriver.Chrome(service = Service(driver_path), options = options)
url = 'https://www.sothebys.com/en/buy/auction/2018/impressionist-modern-art-online'  # declares the URL
driver.get(url)  # opens the declared URL in a separate window using Selenium

# NEW CODE
# find_element(By.TAG_NAME, ...) replaces the older find_element_by_tag_name
driver.find_element(By.TAG_NAME, 'body').send_keys(Keys.END)  # finds the "body" element and presses End to jump to the bottom

There are a few ways to get to the bottom of a page. My preferred method is to find the "body" element on a page and then send it the End key. Intuitively this makes sense because the body element contains the data of the page that we care about scraping (as opposed to the footer found at the bottom of every webpage, with links like Social Media or Contact Us).

There are two different types of page loading. The first is infinite scrolling, where the page automatically loads the next set of items once you scroll to the bottom. Alternatively, there are websites that require you to click a button to load more items. 

In [6]:
from selenium import webdriver
from selenium.common.exceptions import WebDriverException
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time

driver_path = './chromedriver.exe'  # driver path; assuming the driver is in the same folder as this code
options = webdriver.ChromeOptions()
# Selenium 4 passes the driver path through a Service object instead of executable_path
driver = webdriver.Chrome(service = Service(driver_path), options = options)
url = 'https://www.sothebys.com/en/buy/auction/2018/impressionist-modern-art-online'  # declares the URL
driver.get(url)  # opens the declared URL in a separate window using Selenium

# NEW CODE
lastHeight = driver.execute_script("return document.body.scrollHeight")  # page height before scrolling
while True:
    driver.find_element(By.TAG_NAME, 'body').send_keys(Keys.END)  # jump to the bottom of the page
    time.sleep(15)  # wait for new items to load

    try:
        # click the "load more" button if the page has one
        load_more = driver.find_element(By.XPATH, '//*[@id="__next"]/div/div[4]/div/div[3]/div/div/div/div[2]/div[1]/div/div/div/div[3]/div/div[2]/ul/li[5]/button')
        load_more.click()
        
        time.sleep(10)
        
        driver.find_element(By.TAG_NAME, 'body').send_keys(Keys.END)
        
    except WebDriverException:  # button missing or not clickable; keep scrolling
        pass
    newHeight = driver.execute_script("return document.body.scrollHeight")
    if newHeight == lastHeight:  # the page stopped growing, so everything is loaded
        break
    lastHeight = newHeight

WebDriverException: Message: chrome not reachable
  (Session info: chrome=95.0.4638.54)
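
The stopping rule in the loop above (keep scrolling until the page height stops changing) can be separated from the browser so that it is easier to reason about. The sketch below is one way to do that, not part of the original walkthrough: `scroll_until_stable`, `get_height`, `scroll_down`, `pause`, and `max_rounds` are all hypothetical names. The two callables stand in for the Selenium calls, and `max_rounds` is an added safety limit for pages that never stop growing.

```python
import time


def scroll_until_stable(get_height, scroll_down, pause=0.0, max_rounds=50):
    """Scroll until the page height stops growing; return the number of scrolls.

    get_height and scroll_down are callables, so the loop can be tried without
    a browser. max_rounds guards against pages that never stop growing.
    """
    last_height = get_height()
    for rounds in range(1, max_rounds + 1):
        scroll_down()
        time.sleep(pause)              # give the page time to load new items
        new_height = get_height()
        if new_height == last_height:  # nothing new was loaded
            return rounds
        last_height = new_height
    return max_rounds


# Offline stand-in for a page that grows twice and then stops.
heights = iter([1000, 2000, 3000, 3000])
rounds = scroll_until_stable(lambda: next(heights), lambda: None)
print(rounds)   # 3
```

With a live driver, `get_height` could be `lambda: driver.execute_script("return document.body.scrollHeight")` and `scroll_down` could send `Keys.END` to the body element, with `pause` set to however long the site needs to load.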
