# Web Scraping

Web scraping means using a program to download and process content from the web.

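A few of the modules:
- webbrowser: opens a browser window at a given URL
- requests: downloads files and web pages
- Beautiful Soup (bs4): parses HTML
- selenium: launches and controls a web browser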

### Webbrowser

In [1]:
# opens a browser window at the given URL
import webbrowser
webbrowser.open('http://inventwithpython.com/')

True

### Requests

In [2]:
# download a document from a web page
import requests
res = requests.get('http://automatetheboringstuff.com/files/rj.txt')
type(res)
# check that the request succeeded
print(res.status_code == requests.codes.ok)
print(len(res.text))
# print the first 250 characters
print(res.text[:250])

True
174130
ï»¿The Project Gutenberg EBook of Romeo and Juliet, by William Shakespeare

This eBook is for the use of anyone anywhere at no cost and with
almost no restrictions whatsoever.  You may copy it, give it away or
re-use it under the terms of the Project


In [3]:
# Checking for errors
# raise_for_status() raises an exception if the download failed (here, page not found)
import requests
res=requests.get('http://inventwithpython.com/page_that_does_not_exist')
try:
    res.raise_for_status()
except Exception as exc:
    print('There was a problem: %s' % (exc))

There was a problem: 404 Client Error: Not Found for url: http://inventwithpython.com/page_that_does_not_exist
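A narrower variant of this check catches `requests.exceptions.HTTPError`, the exception that `raise_for_status()` raises for 4xx and 5xx responses, rather than every `Exception`. The sketch below builds a `Response` object by hand so that it runs without a network connection; the helper name `describe_failure` and the hand-built response are assumptions made for illustration, not part of the original notes.

```python
import requests


def describe_failure(res):
    """Return an error message if the response is a 4xx/5xx, else None."""
    try:
        res.raise_for_status()
    except requests.exceptions.HTTPError as exc:
        return 'There was a problem: %s' % exc
    return None


# Hand-built response standing in for a real download (assumption for the demo).
fake = requests.Response()
fake.status_code = 404
fake.reason = 'Not Found'
fake.url = 'http://inventwithpython.com/page_that_does_not_exist'

print(describe_failure(fake))
```

Catching only `HTTPError` lets unrelated problems, such as a typo in the code, surface as ordinary tracebacks.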


In [4]:
# gets a file from a website and writes it to a .txt file on the hard drive
import requests
res = requests.get('https://automatetheboringstuff.com/files/rj.txt')
# raise_for_status() returns None on success, so there is nothing to print
res.raise_for_status()
# open a new file in binary mode; the with block closes it when the loop finishes
with open('C:\\Users\\albg1\\OneDrive\\Documents\\Coding\\Python\\PythonNotes\\RomeoAndJuliet.txt', 'wb') as playFile:
    # iter_content() yields the response body one chunk at a time;
    # a chunk is a number of bytes, here 100,000 (usually a good size)
    for chunk in res.iter_content(100000):
        playFile.write(chunk)
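The chunked write can be tried without downloading anything. In this sketch `iter_chunks` is a hypothetical stand-in for `res.iter_content(100000)`, `save_chunks` is an illustrative helper name, and the file goes to a temporary directory rather than the path used in the cell above.

```python
import os
import tempfile


def iter_chunks(data, chunk_size):
    """Yield successive chunk_size-byte slices of data (stand-in for iter_content)."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]


def save_chunks(chunks, path):
    """Write each chunk to path in binary mode and return the bytes written."""
    written = 0
    # The with statement closes the file even if a write fails.
    with open(path, 'wb') as out_file:
        for chunk in chunks:
            written += out_file.write(chunk)
    return written


payload = b'x' * 250000  # pretend download of 250,000 bytes
target = os.path.join(tempfile.mkdtemp(), 'RomeoAndJuliet.txt')
total = save_chunks(iter_chunks(payload, 100000), target)
print(total)
```

Writing chunk by chunk keeps memory use bounded by the chunk size, which matters more as the downloaded file grows.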


### HTML Resources

http://htmldog.com/guides/html/beginner/
    
http://www.codecademy.com/tracks/web/
    
https://developer.mozilla.org/en-US/learn/html/

- An HTML file is a plain text file with the .html extension. The text in these files is surrounded by tags, which are words enclosed in angle brackets.
- The tags tell the browser how to format the web page.
- A starting tag and a closing tag can enclose some text to form an element.
- The text (or inner HTML) is the content between the starting and closing tags.
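To make the tag, element, and inner-text vocabulary concrete, the sketch below feeds one small piece of HTML to `html.parser.HTMLParser` from the standard library and records what the parser sees, in order. The class name `TagLogger` is made up for this example.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record each starting tag, piece of text, and closing tag in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))

    def handle_data(self, data):
        self.events.append(('text', data))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))


logger = TagLogger()
logger.feed('<strong>Hello</strong> world!')
logger.close()  # flush any text still buffered by the parser
for event in logger.events:
    print(event)
```

The `strong` element is everything from the starting tag to the closing tag, and `Hello` is its inner text; ` world!` lies outside the element.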

### Viewing the Source HTML of a Web Page and Opening Browser's Developer Tools

- You will need to look at the HTML source of any web page that your program will work with.
- To do this, right-click the web page and select View Source or View Page Source.
- In Chrome or Internet Explorer you can press F12 to open the developer tools.
- Don't use regular expressions to parse HTML; it's better to use a module built for the job, such as Beautiful Soup.
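One way to see the problem with regular expressions: HTML allows many spellings of the same element, and a pattern written for one spelling silently misses the others. The snippet below is a small illustration under made-up markup (the `.example` URLs are placeholders), comparing a simple regex with Beautiful Soup using Python's built-in `html.parser`.

```python
import re

import bs4

# Two links: the second has an extra attribute and single-quoted href.
html = ('<p><a href="http://a.example">one</a> '
        '<a class="ext" href=\'http://b.example\'>two</a></p>')

# A regex tuned to the first spelling only finds the first link.
regex_hits = re.findall(r'<a href="(.*?)">', html)

# A parser understands both spellings.
soup = bs4.BeautifulSoup(html, 'html.parser')
parsed_hits = [a.get('href') for a in soup.select('a')]

print(regex_hits)
print(parsed_hits)
```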

### Using the Developer Tools to Find HTML Elements

- Once a program has downloaded a web page using the requests module, you will have the HTML content as a single string value.
- Next, figure out which part of the HTML corresponds to the information you are interested in. The browser's developer tools can show this: right-click the item on the page and choose Inspect Element.
- Once you know what you are looking for, Beautiful Soup can help you find it in the string.

### Parsing HTML with the Beautiful Soup Module

- Beautiful Soup is better suited than regular expressions for extracting information from an HTML page.
- The package is installed as beautifulsoup4 and imported as bs4.

Example HTML:

In [5]:
#get html page from internet
import requests, bs4
res=requests.get('http://nostarch.com')
res.raise_for_status()
# name the parser explicitly so bs4 does not guess one and emit a warning
noStarchSoup=bs4.BeautifulSoup(res.text, 'lxml')
type(noStarchSoup)
print(noStarchSoup)

<!DOCTYPE html>
<html dir="ltr" lang="en">
<head>
<script src="/cdn-cgi/apps/head/_Yd33iVQmzx1XZrSaZuiVTLpv7Y.js"></script><link href="https://www.w3.org/1999/xhtml/vocab" rel="profile"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="https://nostarch.com/sites/default/files/favicon.ico" rel="shortcut icon" type="image/vnd.microsoft.icon"/>
<meta content="Drupal 7 (http://drupal.org)" name="generator"/>
<link href="https://nostarch.com/" rel="canonical"/>
<link href="https://nostarch.com/" rel="shortlink"/>
<title>No Starch Press | "The finest in geek entertainment"</title>
<link href="https://nostarch.com/sites/default/files/css/css_lQaZfjVpwP_oGNqdtWCSpJT1EMqXdMiU84ekLLxQnc4.css" media="all" rel="stylesheet" type="text/css"/>
<link href="https://nostarch.com/sites/default/files/css/css_iJE8OMtNhvOQPbQGg8OqRmpr7AhRCfmCisQy8q7fFhk.css" media="all" rel="stylesheet" type="text/css"/>





In [6]:
# get html page from hard drive
import bs4
exampleFile=open('C:\\Users\\albg1\\OneDrive\\Documents\\Coding\\Python\\PythonNotes\\example.html')
# name the parser explicitly so bs4 does not have to guess one
exampleSoup=bs4.BeautifulSoup(exampleFile, 'lxml')
type(exampleSoup)
print(exampleSoup)

<html><body><p>!-- This is the example.html example file. --&gt;

</p><title>The Website Title</title>
<p>Download my <strong>Python</strong> book from <a href="http://
inventwithpython.com">my website</a>.</p>
<p class="slogan">Learn Python the easy way!</p>
<p>By <span id="author">Al Sweigart</span></p>
</body></html>


### Finding an Element with the Select() Method

- You can retrieve web page elements from a BeautifulSoup object by calling its select() method and passing a CSS selector string for the elements you are looking for. select() returns a list of every match.
- The various selector patterns can be combined to make sophisticated matches.
- Some examples:
    - soup.select('div') all elements named div
    - soup.select('#author') the element with an id attribute of author
    - soup.select('.notice') all elements that use a CSS class attribute named notice
    - soup.select('div span') all elements named span that are inside a div
    - soup.select('div > span') all elements named span that are directly inside a div, with no other elements in between
    - soup.select('input[name]') all elements named input that have a name attribute with any value
    - soup.select('input[type="button"]') all elements named input that have an attribute named type with the value button
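The selector patterns listed above can be tried on a small inline document. The markup below is invented for this sketch, and `html.parser` (Python's built-in parser) is used so that nothing beyond bs4 is required.

```python
import bs4

html = '''
<div class="notice">
  <span id="author">Al Sweigart</span>
  <p><span>inner span</span></p>
</div>
<form>
  <input name="email" type="text">
  <input type="button" value="Go">
</form>
'''
soup = bs4.BeautifulSoup(html, 'html.parser')

selectors = ['div', '#author', '.notice', 'div span', 'div > span',
             'input[name]', 'input[type="button"]']
# Count how many elements each selector matches.
counts = {sel: len(soup.select(sel)) for sel in selectors}
for sel, count in counts.items():
    print('%-22s %d match(es)' % (sel, count))
```

Note the difference between `'div span'`, which also matches the span nested inside the paragraph, and `'div > span'`, which matches only the span that is a direct child of the div.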

In [7]:
import bs4
# open the html file
exampleFile=open('C:\\Users\\albg1\\OneDrive\\Documents\\Coding\\Python\\PythonNotes\\example.html')
# pass the file's contents to Beautiful Soup, naming the parser explicitly
exampleSoup=bs4.BeautifulSoup(exampleFile.read(), 'lxml')
# select the element whose id is author
elems=exampleSoup.select('#author')
print(type(elems))
print(len(elems))
print(type(elems[0]))
print(elems[0].getText())
print(str(elems[0]))
print(elems[0].attrs)

pElems=exampleSoup.select('p')
# first match
print(str(pElems[0]))
print(pElems[0].getText())
# second match
print(str(pElems[1]))
print(pElems[1].getText())
# third match
print(str(pElems[2]))
print(pElems[2].getText())
# all matches
print(str(pElems))

<class 'list'>
1
<class 'bs4.element.Tag'>
Al Sweigart
<span id="author">Al Sweigart</span>
{'id': 'author'}
<p>!-- This is the example.html example file. --&gt;

</p>
!-- This is the example.html example file. -->


<p>Download my <strong>Python</strong> book from <a href="http://
inventwithpython.com">my website</a>.</p>
Download my Python book from my website.
<p class="slogan">Learn Python the easy way!</p>
Learn Python the easy way!
[<p>!-- This is the example.html example file. --&gt;\n\n</p>, <p>Download my <strong>Python</strong> book from <a href="http://\ninventwithpython.com">my website</a>.</p>, <p class="slogan">Learn Python the easy way!</p>, <p>By <span id="author">Al Sweigart</span></p>]


### Getting Data from an Element's Attributes

In [8]:
import bs4
soup=bs4.BeautifulSoup(open('C:\\Users\\albg1\\OneDrive\\Documents\\Coding\\Python\\PythonNotes\\example.html'), 'lxml')
spanElem=soup.select('span')[0]
print(str(spanElem))
print(spanElem.get('id'))
# get() returns None when the attribute does not exist
print(spanElem.get('some_nonexistent_addr') is None)
print(spanElem.attrs)

<span id="author">Al Sweigart</span>
author
True
{'id': 'author'}
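A few more details of `get()` and `attrs` show up on an element with several attributes. The markup below is invented for this sketch, and `html.parser` is used as the parser.

```python
import bs4

markup = '<a class="ext big" href="http://inventwithpython.com">my website</a>'
link = bs4.BeautifulSoup(markup, 'html.parser').select('a')[0]

href = link.get('href')                   # value of an existing attribute
title = link.get('title')                 # missing attribute: None
fallback = link.get('title', 'untitled')  # missing attribute with a default
classes = link.get('class')               # class is multi-valued, so a list

print(href)
print(title)
print(fallback)
print(classes)
print(link.attrs)
```

Because `get()` returns `None` (or the given default) for a missing attribute, it is safer than `link['title']`, which raises `KeyError` when the attribute is absent.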


### Selenium-Controlled Browser

In [9]:
#open a chrome browser
from selenium import webdriver
browser=webdriver.Chrome()
print(type(browser))
browser.get('http://inventwithpython.com')

<class 'selenium.webdriver.chrome.webdriver.WebDriver'>


### Finding Elements on the Page

WebDriver Methods (find_element_by_* returns the first match; find_elements_by_* returns a list of all matches)
- browser.find_element(s)_by_class_name(name)   Elements that use the CSS class name
- browser.find_element(s)_by_css_selector(selector)  Elements that match the CSS selector
- browser.find_element(s)_by_id(id)   Elements with a matching id attribute value
- browser.find_element(s)_by_link_text(text)   <a> elements that completely match the text provided
- browser.find_element(s)_by_partial_link_text(text)   <a> elements that contain the text provided
- browser.find_element(s)_by_name(name)   Elements with a matching name attribute value
- browser.find_element(s)_by_tag_name(name)   Elements with a matching tag name (case insensitive; an <a> element is matched by 'a' and 'A')
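These helper methods match the Selenium release these notes were written with. They were deprecated in Selenium 4 and removed in 4.3, where the same lookups go through `find_element(by, value)` and `find_elements(by, value)` with a locator strategy from `selenium.webdriver.common.by.By`. The sketch below maps each old suffix to its strategy string; `find_element_by` and `RecordingBrowser` are hypothetical names, and the stub stands in for a real WebDriver so the mapping can be run without launching a browser.

```python
# Locator strategy strings. These are the values behind Selenium's By
# constants (By.CLASS_NAME is the string "class name", and so on).
STRATEGY_FOR_SUFFIX = {
    'class_name': 'class name',
    'css_selector': 'css selector',
    'id': 'id',
    'link_text': 'link text',
    'partial_link_text': 'partial link text',
    'name': 'name',
    'tag_name': 'tag name',
}


def find_element_by(browser, suffix, value):
    """Hypothetical shim: do find_element_by_<suffix>(value) via find_element()."""
    return browser.find_element(STRATEGY_FOR_SUFFIX[suffix], value)


class RecordingBrowser:
    """Stand-in for a WebDriver that just reports the lookup it was asked for."""

    def find_element(self, by, value):
        return (by, value)


print(find_element_by(RecordingBrowser(), 'class_name', 'jumbotron'))
```

With a real driver the direct spelling is `browser.find_element(By.CLASS_NAME, 'jumbotron')` after `from selenium.webdriver.common.by import By`.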


WebElement Attributes and Methods
- tag_name   the tag name, such as 'a' for an <a> element
- get_attribute(name)  the value of the element's attribute called name
- text    the text within the element, such as 'hello' in <span>hello</span>
- clear()    for text field or text area elements, clears the text typed into it
- is_displayed()   returns True if the element is visible; otherwise returns False
- is_enabled()    for input elements, returns True if the element is enabled; otherwise returns False
- is_selected()   for checkbox or radio button elements, returns True if the element is selected; otherwise returns False
- location    a dictionary with keys 'x' and 'y' for the position of the element in the page

In [10]:
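# find an element on a web page by its CSS class name
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
browser=webdriver.Chrome()
browser.get('http://inventwithpython.com')
try:
    elem=browser.find_element_by_class_name('jumbotron')
    print('Found <%s> element with that class name!' % (elem.tag_name))
# catch only the "no match" error instead of using a bare except
except NoSuchElementException:
    print('Was not able to find an element with that name.')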

Found <div> element with that class name!


### Clicking the Page

In [11]:
from selenium import webdriver
browser=webdriver.Chrome()
browser.get('http://inventwithpython.com')
# sets link elem
linkElem=browser.find_element_by_link_text('Making Games with Python & Pygame')
print(type(linkElem))
# clicks on linkElem
linkElem.click()

<class 'selenium.webdriver.remote.webelement.WebElement'>


### Filling Out and Submitting Forms

In [12]:
from selenium import webdriver
browser=webdriver.Chrome()
browser.get('https://mail.yahoo.com')
# find box for input
emailElem=browser.find_element_by_id('login-username')
# write input
emailElem.send_keys('not_my_real_email')
# submit
emailElem.submit()

WebDriverException: Message: unknown error: call function result missing 'value'
  (Session info: chrome=65.0.3325.181)
  (Driver info: chromedriver=2.34.522940 (1a76f96f66e3ca7b8e57d503b4dd3bccfba87af1),platform=Windows NT 10.0.16299 x86_64)

This exception usually points to a version mismatch rather than a problem in the code: chromedriver 2.34 predates Chrome 65, and updating chromedriver to a release that supports the installed Chrome normally clears it.


### Sending Special Keys

Commonly used variables in the selenium.webdriver.common.keys module
- Keys.DOWN, Keys.UP, Keys.LEFT, Keys.RIGHT ---- the keyboard arrow keys
- Keys.ENTER, Keys.RETURN ---- the ENTER and RETURN keys
- Keys.HOME, Keys.END, Keys.PAGE_DOWN, Keys.PAGE_UP ---- the HOME, END, PAGE DOWN, and PAGE UP keys
- Keys.ESCAPE, Keys.BACK_SPACE, Keys.DELETE ---- the ESC, BACKSPACE, and DELETE keys
- Keys.F1, Keys.F2, ..., Keys.F12 ---- the F1 to F12 keys
- Keys.TAB ---- the TAB key

In [14]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
browser=webdriver.Chrome()
browser.get('http://nostarch.com')
htmlElem=browser.find_element_by_tag_name('html')
# scroll to the bottom of the page
htmlElem.send_keys(Keys.END)
# scroll back to the top of the page
htmlElem.send_keys(Keys.HOME)

WebDriverException: Message: unknown error: call function result missing 'value'
  (Session info: chrome=65.0.3325.181)
  (Driver info: chromedriver=2.34.522940 (1a76f96f66e3ca7b8e57d503b4dd3bccfba87af1),platform=Windows NT 10.0.16299 x86_64)
