Web Scraping with Python
======
### StatLab Workshop
February 24th, 2017

## OUTLINE:
1. What is web scraping?

2. Tools for Scraping

3. Data Management Considerations

4. Scraping an HTML Table

5. Scraping all hyperlinks

6. Scraping all text

7. Scraping multiple pages/creating a loop

8. Resources

## What is Web Scraping?
* Fetching and extracting data from websites using software or bots. 
* Useful when there is no API for fetching data
* No direct 'Download' of data
* Data trapped on older websites

## What sites can be scraped?
* Any website can be scraped
* Basic:
  1. Simple HTML pages
  2. Textual data, HTML tables, hyperlinks
* Advanced:
  1. JavaScript/AJAX
  2. Password protected
  3. Unstructured Data
  4. Interactive/visualizations

## What are we scraping?
* The HTML behind every web page
* [right-click] "View Source"
* A scraper will not interpret content the way a browser does

![title](siteexample.png)

### Copyright & Fair Use
* Check the site for license
> As per Clause 8, dealing with Restrictions on Content and Use of the Services, of the Twitter Terms of Service, as part of your free access, you are expressly prohibited to access, tamper with, or use non-public areas of the Twitter services; or probe, scan, or test the vulnerability of any system or network or breach or circumvent any security or authentication measures; or access or search or attempt to access or search the Services by any means (automated or otherwise) like scraping without the prior consent of Twitter.
* Ask the site owner
* Check with the Licensing & Copyright Librarian (Joan Emmet)
* Electronic Resources @ Yale:
![Appropriate Use of Yale Electronic Resources](AppropriateUse.png)
See [Resources for Text & Data Mining](http://guides.library.yale.edu/c.php?g=547554&p=3757053 "Resources for text & Data Mining")

### Tools for Web Scraping
* Python:
  * BeautifulSoup --> Modifying, parsing, and searching HTML or XML
  * Selenium --> Testing websites; useful for scraping sites with lots of JavaScript, interactivity, or log-ins
  * Scrapy --> Create spiders to crawl the web
  * Portia --> Based on Scrapy, but with a GUI! Runs in your browser
* OpenRefine
* R
* Excel --> NodeXL

## Data Management
![final.doc](final.gif)

### Data Management
* Decide on a naming convention & stick to it!
  * yyyymmdd_ProjectName_FileDescription.csv
  * 20170218_EPAStudy_MasterCodebook.xml
* Use Python's datetime module to timestamp the filenames of all outputs
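
The naming convention above can be generated rather than typed by hand. Below is a minimal sketch, assuming a hypothetical helper called `build_filename` (it is not part of the workshop code) that follows the yyyymmdd_ProjectName_FileDescription pattern:

```python
from datetime import date

def build_filename(project, description, extension, day=None):
    """Return a name of the form yyyymmdd_ProjectName_FileDescription.ext."""
    # build_filename is a hypothetical helper; 'day' defaults to today's date
    day = day or date.today()
    return "{}_{}_{}.{}".format(day.strftime("%Y%m%d"), project, description, extension)

# A fixed date is passed here so the result matches the example above
print(build_filename("EPAStudy", "MasterCodebook", "xml", day=date(2017, 2, 18)))
```

Leaving out the `day` argument stamps the file with the current date, which keeps repeated scrapes from overwriting each other.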

## Scraping an HTML Table
* We'll be scraping the Yale University Library homepage for a list of libraries and their hours of operation. 
* If you open the library site, you can see a small table with the names of library locations and their hours for today.


![homepage](homepage.png)

### 'View Source'
![viewsource](YULHomepage.png)
* When inspecting the HTML code behind the *table* element, we see that each library location name is stored in a *td* element with the class="hours-col-loc"  
* The hours of operation are stored in the *td* element directly following the location 
* Now that we've identified where the data we want to scrape lives on the page, we can begin to write our code.

### Importing Modules
* The first step in any Python script is identifying which libraries we are using and importing them. For this tutorial, we will be using four Python libraries: csv, BeautifulSoup (imported from the bs4 package), requests, and datetime.

In [1]:
import csv
import requests
from bs4 import BeautifulSoup
import datetime

### Making a GET Request
* We access the HTML of any website by making a GET request for that site. 
* Generally you can copy & paste the full URL from your browser directly into your python code. 
* First we save the results of our GET request to the variable r, before printing those results to the screen.

In [2]:
url = "http://web.library.yale.edu"
r = requests.get(url)
print(r)

<Response [200]>
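
A response code of 200 means the request succeeded, but a script that runs unattended may want to stop when it does not. Below is a minimal sketch, assuming a hypothetical wrapper called `fetch_html` (not part of the workshop code); the `timeout` value and the optional `session` argument are illustrative choices:

```python
import requests

def fetch_html(url, timeout=10, session=None):
    """Return the HTML of a page, raising an exception on an HTTP error."""
    # 'session' is an optional, hypothetical hook: pass a requests.Session
    # to reuse connections; by default the top-level requests module is used.
    http = session if session is not None else requests
    r = http.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    r.raise_for_status()
    return r.text

# Example (makes a real network request, so it is left commented out):
# html = fetch_html("http://web.library.yale.edu")
```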


### Making the soup
* The python library BeautifulSoup allows us to parse our website's HTML before filtering or searching through the results. 
* Our code: 
  * Loads the text of the response (r.text) into BeautifulSoup  
  * Identifies 'html.parser' as the method for parsing the site's HTML. 
  * When we print the variable soup, we see the site's raw HTML, the same HTML that we see when we 'View Source' through the browser. 

In [3]:
soup = BeautifulSoup(r.text, 'html.parser')
print(soup)
#print(soup.prettify())

<!DOCTYPE html>

<!--[if lt IE 7 ]> <html lang="en" dir="ltr" class="ie6"> <![endif]-->
<!--[if IE 7 ]>    <html lang="en" dir="ltr" class="ie7"> <![endif]-->
<!--[if IE 8 ]>    <html lang="en" dir="ltr" class="ie8"> <![endif]-->
<!--[if IE 9 ]>    <html lang="en" dir="ltr" class="ie9"> <![endif]-->
<!--[if (gt IE 9)|!(IE)]><!--> <html dir="ltr" lang="en"> <!--<![endif]-->
<head>
<meta content="IE=edge" http-equiv="X-UA-Compatible">
<!--

  GGGGGGGGGGGG      GGGGGGGGGGG               fGGGGGG
    ;GGGGG.             GGGi                     GGGG
      CGGGG:           GGG                       GGGG
       lGGGGt         GGL                        GGGG
        .GGGGC       GG:                         GGGG
          GGGGG    .GG.        ;CGGGGGGL         GGGG          .LGGGGGGGL
           GGGGG  iGG        GGG:   ,GGGG        GGGG        tGGf     ;GGGC
            LGGGGfGG        GGGG     CGGG;       GGGG       GGGL       GGGGt
             lGGGGL                  CGGG;       GGGG      C

### Straining the Soup
* We have the HTML for the entire page, so we must filter these results to reach just the elements we want to scrape.
* As we recall, the info we wish to scrape lives in a *td* element with *class="hours-col-loc"*
* Use the *find_all* function built in to BeautifulSoup

In [4]:
tableData = soup.find_all('td', class_="hours-col-loc")
print(tableData)

[<td class="hours-col-loc">Bass</td>, <td class="hours-col-loc">Beinecke</td>, <td class="hours-col-loc">CSSSI</td>, <td class="hours-col-loc">Divinity</td>, <td class="hours-col-loc">Haas Arts</td>, <td class="hours-col-loc">Law</td>, <td class="hours-col-loc">Medical</td>, <td class="hours-col-loc">Sterling</td>]
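
The same filter can also be written as a CSS selector. The sketch below runs offline on a small, hypothetical HTML sample modeled on the table described above, and compares `find_all` with BeautifulSoup's `select` method:

```python
from bs4 import BeautifulSoup

# Hypothetical sample HTML, modeled on the table described above
sample_html = """
<table>
  <tr><td class="hours-col-loc">Bass</td><td>8:30am - 9:45pm</td></tr>
  <tr><td class="hours-col-loc">Beinecke</td><td>9am - 5pm</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# find_all with a tag name and a class filter
by_find_all = sample_soup.find_all("td", class_="hours-col-loc")
# select with the equivalent CSS selector
by_select = sample_soup.select("td.hours-col-loc")

names_find_all = [td.text for td in by_find_all]
names_select = [td.text for td in by_select]
print(names_find_all)
print(names_select)
```

Both calls pick out the same two cells here; which one to use is largely a matter of preference.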


* Now we have a list of HTML elements, but we want the specific text inside the HTML. 
* Before we can begin pulling out this text, we need to create a loop in our code. This loop does three things: 
  * First it selects a *td* element in our tableData list and saves it as the variable td 
  * Then the .text attribute pulls out only the string of text inside the *td* element (library location) and stores the value in a variable called libraryName
  * Lastly it prints the result. 
* Our loop repeats the process again for the next *td* element in our list. We should see the name of each library printed to the screen in its own line.

In [5]:
for td in tableData:
    libraryName = td.text
    print(libraryName)

Bass
Beinecke
CSSSI
Divinity
Haas Arts
Law
Medical
Sterling


* Next we need to get the hours that each library is open. 
* This data lives in an adjacent *td* element which is referred to as a sibling. 
* BeautifulSoup allows us to move over to the neighboring element with an attribute called .next_sibling.
  * We create a variable called hours 
  * Then we set it equal to td.next_sibling.text. Our code will then print this variable. 
* If we run the code now, we should see a list alternating between a library's name and the hours of operation. 

In [6]:
for td in tableData:
    libraryName = td.text
    print(libraryName)    
    hours = td.next_sibling.text
    print(hours)

Bass
8:30am - 9:45pm
Beinecke
9am - 5pm
CSSSI
8:30am - 7pm
Divinity
8:30am - 4:50pm
Haas Arts
8:30am - 5pm
Law
8am - 10pm
Medical
7:30am - 10pm
Sterling
8:30am - 4:45pm
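
One caveat: `.next_sibling` returns the very next node, which can be a string of whitespace when the page's source has line breaks between cells. Below is a sketch of an alternative, `find_next_sibling`, run on a hypothetical HTML sample (not the live library page) with line breaks between the cells:

```python
from bs4 import BeautifulSoup

# Hypothetical sample HTML with line breaks between the cells
sample_html = """
<table>
  <tr>
    <td class="hours-col-loc">Bass</td>
    <td>8:30am - 9:45pm</td>
  </tr>
  <tr>
    <td class="hours-col-loc">Law</td>
    <td>8am - 10pm</td>
  </tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

pairs = []
for td in sample_soup.find_all("td", class_="hours-col-loc"):
    # next_sibling here would be the whitespace between the two cells;
    # find_next_sibling("td") skips ahead to the next td element instead
    hours_cell = td.find_next_sibling("td")
    pairs.append((td.get_text(strip=True), hours_cell.get_text(strip=True)))

print(pairs)
```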


### Exporting the Data
* Now that we have filtered our results and reached the data we want to scrape from the site, we need to export our results so they can be analyzed. 
* We can save our results to a CSV file, which can be opened in Excel or any text editor. 
  * First, we open a CSV file with our Python script. The open command opens the file named in the first set of quotes. If that file doesn't exist, Python creates it; if it does exist, the "w" mode overwrites it. 
  * Next, we create a csv.writer object and store it in a variable (w). This allows us to begin writing data to the CSV file. (The file is closed and saved automatically when the with block ends.) 
  * Then, we must decide what to save to the CSV file

In [7]:
with open("yaleLibraryHours.csv", "w", newline="") as ourCSVdata:
    w = csv.writer(ourCSVdata)
    for td in tableData:
        libraryName = td.text
        print(libraryName)    
        hours = td.next_sibling.text
        print(hours)

Bass
8:30am - 9:45pm
Beinecke
9am - 5pm
CSSSI
8:30am - 7pm
Divinity
8:30am - 4:50pm
Haas Arts
8:30am - 5pm
Law
8am - 10pm
Medical
7:30am - 10pm
Sterling
8:30am - 4:45pm


* Before we can write our data to the CSV, we should create headers for our columns: 
  * We save the value of our headers as a list called header. The values in this list represent one row of our final CSV file. 
  * The function writerow will write the header list to our CSV file. 
  * After running the code, we will notice a new file called  "yaleLibraryHours.csv" is now in the same folder as our python code. We can open the file and we will see our headers. 

In [8]:
with open("yaleLibraryHours.csv", "w", newline="") as ourCSVdata:
    w = csv.writer(ourCSVdata)
    
    header = ['Library', 'Hours']
    w.writerow(header)
   
    for td in tableData:
        libraryName = td.text
        print(libraryName)    
        hours = td.next_sibling.text
        print(hours)

Bass
8:30am - 9:45pm
Beinecke
9am - 5pm
CSSSI
8:30am - 7pm
Divinity
8:30am - 4:50pm
Haas Arts
8:30am - 5pm
Law
8am - 10pm
Medical
7:30am - 10pm
Sterling
8:30am - 4:45pm


* We can now add our scraped data to our CSV file. To do this we must write a row of data during each iteration of the for loop.
  * First, we create an empty list called row. 
  * Then we add our data to the list using row.append 
  * Finally, we write the list to our CSV file using the function w.writerow(row). 

In [9]:
with open("yaleLibraryHours.csv", "w", newline="") as ourCSVdata:
    w = csv.writer(ourCSVdata)
    
    header = ['Library', 'Hours']
    w.writerow(header)
   
    for td in tableData:
        row = []
        libraryName = td.text
        print(libraryName)
        row.append(libraryName)
        hours = td.next_sibling.text
        print(hours)
        row.append(hours)
        print(row)
        w.writerow(row)

Bass
8:30am - 9:45pm
['Bass', '8:30am - 9:45pm']
Beinecke
9am - 5pm
['Beinecke', '9am - 5pm']
CSSSI
8:30am - 7pm
['CSSSI', '8:30am - 7pm']
Divinity
8:30am - 4:50pm
['Divinity', '8:30am - 4:50pm']
Haas Arts
8:30am - 5pm
['Haas Arts', '8:30am - 5pm']
Law
8am - 10pm
['Law', '8am - 10pm']
Medical
7:30am - 10pm
['Medical', '7:30am - 10pm']
Sterling
8:30am - 4:45pm
['Sterling', '8:30am - 4:45pm']
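
Reading the file back is a quick way to confirm the export worked. Below is a minimal sketch that uses two sample rows and a temporary file (the file name `hours_check.csv` is hypothetical) instead of the workshop's CSV:

```python
import csv
import os
import tempfile

rows = [["Bass", "8:30am - 9:45pm"], ["Beinecke", "9am - 5pm"]]

# Write to a temporary folder so the workshop's CSV file is left untouched
path = os.path.join(tempfile.mkdtemp(), "hours_check.csv")

# newline="" is what the csv module's documentation recommends for CSV files
with open(path, "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["Library", "Hours"])
    w.writerows(rows)  # writes every row in one call

# Read the file back, using the header row for the dictionary keys
with open(path, newline="") as f:
    records = list(csv.DictReader(f))

print(records[0]["Library"], records[0]["Hours"])
```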


### Scraping all links from a site
* We want to scrape the URL and title from each 'a' tag that links to an external page
* Make sure your modules are imported
* Most web scraping scripts begin the same way: request the page, then parse it

In [10]:
url = "http://web.library.yale.edu"
r = requests.get(url)
print(r)

soup = BeautifulSoup(r.text, 'html.parser')
#print(soup)


<Response [200]>


### Find all links & Print
* Using the 'find_all' function built in to BeautifulSoup
* Links are always located in the 'a' tag

In [11]:
links = soup.find_all('a')
print(links)

[<a class="element-invisible element-focusable" href="#main-content">Skip to main content</a>, <a href="/" title="Return to the Yale University Library home page"><span>Yale University Library</span></a>, <a href="http://orbis.library.yale.edu/vwebv/myAccount" title="">Your Library Account</a>, <a href="http://ask.library.yale.edu/" title="">Ask Yale Library</a>, <a href="http://schedule.yale.edu/" title="">Reserve Rooms</a>, <a href="/places/to-study" title="">Places to Study</a>, <a addthis:userid="yalelibrary" class="addthis_button_facebook_follow"></a>, <a addthis:userid="yalelibrary" class="addthis_button_twitter_follow"></a>, <a addthis:userid="yalelibrary" class="addthis_button_instagram_follow"></a>, <a class="addthis_button_email"></a>, <a href="http://search.library.yale.edu">Quicksearch</a>, <a href="http://orbis.library.yale.edu/vwebv/" title="Records for approximately 13 million volumes located across the University Library system.">Search Library Catalog (Orbis)</a>, <a h

*  Open a CSV file to record data
*  Define headers
*  Loop through all the links

In [12]:
with open("yaleLibraryLinks.csv", "w", newline="") as ourCSVlinks:
    w = csv.writer(ourCSVlinks)
    
    header = ['address', 'text']
    w.writerow(header)
    
    for link in links:
        print(link)
    

<a class="element-invisible element-focusable" href="#main-content">Skip to main content</a>
<a href="/" title="Return to the Yale University Library home page"><span>Yale University Library</span></a>
<a href="http://orbis.library.yale.edu/vwebv/myAccount" title="">Your Library Account</a>
<a href="http://ask.library.yale.edu/" title="">Ask Yale Library</a>
<a href="http://schedule.yale.edu/" title="">Reserve Rooms</a>
<a href="/places/to-study" title="">Places to Study</a>
<a addthis:userid="yalelibrary" class="addthis_button_facebook_follow"></a>
<a addthis:userid="yalelibrary" class="addthis_button_twitter_follow"></a>
<a addthis:userid="yalelibrary" class="addthis_button_instagram_follow"></a>
<a class="addthis_button_email"></a>
<a href="http://search.library.yale.edu">Quicksearch</a>
<a href="http://orbis.library.yale.edu/vwebv/" title="Records for approximately 13 million volumes located across the University Library system.">Search Library Catalog (Orbis)</a>
<a href="http://m

### Filter through 'a' elements
* Set the variable row2 to an empty list []
* Add variables for 'href' and 'text'; these are the URL and title of each link
* Use a try/except block to skip 'a' tags with no 'href'
* Use href.startswith("http") to filter out internal-facing (relative) links

In [13]:
with open("yaleLibraryLinks.csv", "w", newline="") as ourCSVlinks:
    w = csv.writer(ourCSVlinks)
    
    header = ['address', 'text']
    w.writerow(header)
    
    for link in links:
        row2 = []
        try:
            # Raises KeyError when the 'a' tag has no href attribute
            href = link['href']
        except KeyError:
            continue
        # Keep only absolute URLs; relative (internal) links are skipped
        if href.startswith("http"):
            print(href)
            text = link.text
            row2.append(href)
            row2.append(text)
            print(text)
            w.writerow(row2)
    

http://orbis.library.yale.edu/vwebv/myAccount
Your Library Account
http://ask.library.yale.edu/
Ask Yale Library
http://schedule.yale.edu/
Reserve Rooms
http://search.library.yale.edu
Quicksearch
http://orbis.library.yale.edu/vwebv/
Search Library Catalog (Orbis)
http://morris.law.yale.edu/
Search Law Library Catalog (MORRIS)
https://resources.library.yale.edu/cas/borrowdirect.aspx
Search Borrow Direct
http://firstsearch.oclc.org/dbname=WorldCat;autho=100157622;FSIP
Search WorldCat
http://yale.summon.serialssolutions.com/
Search Articles+
http://web.library.yale.edu/digital-collections
Search Digital Collections
http://findingaids.library.yale.edu/
Search Finding Aid Database
http://guides.library.yale.edu/
Subject Guides
http://guides.library.yale.edu/databases
Find Databases by Title
http://wa4py6yj8t.search.serialssolutions.com
Find eJournals by Title
http://guides.library.yale.edu/specialcollections
Guide to Using Special Collections
http://guides.library.yale.edu/research-help
Res
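
Checking for "http" keeps every absolute URL, including ones that point back to the same site. Below is a sketch of a different technique that uses the standard library's `urljoin` and `urlparse` to compare host names. The helper `external_links` and the sample HTML are hypothetical:

```python
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

# Hypothetical sample HTML, modeled on the links printed above
sample_html = """
<a href="#main-content">Skip to main content</a>
<a href="/places/to-study">Places to Study</a>
<a href="http://ask.library.yale.edu/">Ask Yale Library</a>
<a class="addthis_button_email"></a>
"""

def external_links(html, base_url):
    """Return (url, text) pairs for web links that point to a different host."""
    base_host = urlparse(base_url).netloc
    found = []
    # href=True matches only 'a' tags that have an href attribute
    for link in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        absolute = urljoin(base_url, link["href"])  # resolve relative links
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc != base_host:
            found.append((absolute, link.get_text(strip=True)))
    return found

sample_links = external_links(sample_html, "http://web.library.yale.edu")
print(sample_links)
```

Note that this treats subdomains such as ask.library.yale.edu as external, while absolute links back to web.library.yale.edu are dropped, so the results differ from the startswith approach.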

## Scraping all text
* get_text() returns all of the text on a single web page; as the output below shows, this can include the JavaScript embedded in the page

In [14]:
url = "https://en.wikipedia.org/wiki/Main_Page"
r = requests.get(url)
print(r)

soup = BeautifulSoup(r.text, 'html.parser')

txt = soup.get_text()
print(txt)

    

<Response [200]>




Wikipedia, the free encyclopedia
document.documentElement.className = document.documentElement.className.replace( /(^|\s)client-nojs(\s|$)/, "$1client-js$2" );
(window.RLQ=window.RLQ||[]).push(function(){mw.config.set({"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":false,"wgNamespaceNumber":0,"wgPageName":"Main_Page","wgTitle":"Main Page","wgCurRevisionId":762066868,"wgRevisionId":762066868,"wgArticleId":15580374,"wgIsArticle":true,"wgIsRedirect":false,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":[],"wgBreakFrames":false,"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRelevantPageName":"Main_Page",
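
The output above is mostly JavaScript rather than readable text. One common remedy is to delete the script and style elements before calling `get_text`. Below is a sketch on a small, hypothetical sample page (newer versions of BeautifulSoup may already leave some of this out, so treat it as a belt-and-braces step):

```python
from bs4 import BeautifulSoup

# Hypothetical sample page containing a script and a style block
sample_html = """
<html><head><title>Wikipedia, the free encyclopedia</title>
<script>document.documentElement.className = "client-js";</script>
<style>body { color: black; }</style></head>
<body><p>Welcome to Wikipedia</p></body></html>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Remove every script and style element, including its contents
for tag in sample_soup(["script", "style"]):
    tag.decompose()

# Join the remaining pieces of text with newlines, trimming whitespace
clean_text = sample_soup.get_text(separator="\n", strip=True)
print(clean_text)
```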

### Scraping Paginated Results
* Data spread across multiple pages on a single site
* URLs that follow a pattern through each search page are the easiest to scrape

In [15]:
r = requests.get("http://orbis.library.yale.edu/vwebv/search?searchArg=python&searchCode=GKEY%5E*&searchType=0&recCount=50&recPointer=0")
print(r)

<Response [200]>


In [16]:
r = requests.get("http://orbis.library.yale.edu/vwebv/search?searchArg=raspberry&searchCode=GKEY%5E*&searchType=0&recCount=50&recPointer=0")
print(r)

soup = BeautifulSoup(r.text, 'html.parser')
#print(soup)

searchResults = soup.find_all('div', class_='resultListTextCell')
#print(searchResults)
for item in searchResults:
    row3 = []
    line1 = item.a.text
    row3.append(line1)
    print(row3)


<Response [200]>
['Raspberry Island Light Station : Apostle Islands National Lakeshore, Bayfield, Wisconsin / by David H. Wallace.']
['Building a home security system with Raspberry Pi : build your own sophisticated modular home security system using the popular Raspberry Pi board / Matthew Poole.']
['Exploring the Raspberry Pi 2 with C++ / Warren Gay.']
['Learn to program with Minecraft : transform your world with the power of Python / by Craig Richardson.']
['Raspberry Pi robotic blueprints : utilize the powerful ingredients of Raspberry Pi to bring to life amazing robots that can act, draw, and have fun with laser tag / Richard Grimmett.']
['Python playground : geeky projects for the curious programmer / Mahesh Venkitachalam.']
["Maker's guide to the zombie apocalypse : defend your base with simple circuits, Arduino, and Raspberry Pi / Simon Monk."]
['Raspberry Pi projects for dummies / Mike Cook, Jonathan Evans, Brock Craft.']
['Raspberry Pi LED blueprints : design, build, and test

### Looping through pages
* We can continue to loop through the search results by changing the URL after each loop
* The 'recPointer' variable will increase by 50 each loop. 
* What's wrong with this code?

In [17]:
import csv
import requests
from bs4 import BeautifulSoup
import datetime

recPointer = 0

while recPointer <= 150:
	url = "http://orbis.library.yale.edu/vwebv/search?searchArg=raspberry&searchCode=GKEY%5E*&searchType=0&recCount=50&recPointer="+str(recPointer)
	r = requests.get(url)
	print(r)
	recPointer += 50
	soup = BeautifulSoup(r.text, 'html.parser')
	searchResults = soup.find_all('div', class_="resultListTextCell")

	for item in searchResults:
		row3 = []
		line1 = item.a.text
		row3.append(line1)
		print(row3)


<Response [200]>
['Raspberry Island Light Station : Apostle Islands National Lakeshore, Bayfield, Wisconsin / by David H. Wallace.']
['Building a home security system with Raspberry Pi : build your own sophisticated modular home security system using the popular Raspberry Pi board / Matthew Poole.']
['Exploring the Raspberry Pi 2 with C++ / Warren Gay.']
['Learn to program with Minecraft : transform your world with the power of Python / by Craig Richardson.']
['Raspberry Pi robotic blueprints : utilize the powerful ingredients of Raspberry Pi to bring to life amazing robots that can act, draw, and have fun with laser tag / Richard Grimmett.']
['Python playground : geeky projects for the curious programmer / Mahesh Venkitachalam.']
["Maker's guide to the zombie apocalypse : defend your base with simple circuits, Arduino, and Raspberry Pi / Simon Monk."]
['Raspberry Pi projects for dummies / Mike Cook, Jonathan Evans, Brock Craft.']
['Raspberry Pi LED blueprints : design, build, and test
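
Below is a sketch of one way to make a loop like this stop on its own and pause between requests. It is not necessarily the answer the workshop had in mind. The function `scrape_titles` and its `fetch` argument (any function that takes a URL and returns HTML) are hypothetical, as are the `max_pages` and `delay` defaults:

```python
import time
from bs4 import BeautifulSoup

BASE = ("http://orbis.library.yale.edu/vwebv/search?searchArg=raspberry"
        "&searchCode=GKEY%5E*&searchType=0&recCount=50&recPointer=")

def scrape_titles(fetch, page_size=50, max_pages=4, delay=1.0):
    """Collect result titles page by page until a page comes back empty."""
    titles = []
    for page in range(max_pages):
        html = fetch(BASE + str(page * page_size))
        soup = BeautifulSoup(html, "html.parser")
        cells = soup.find_all("div", class_="resultListTextCell")
        if not cells:
            break  # no results on this page, so stop requesting more
        for cell in cells:
            if cell.a is not None:  # skip cells without a link
                titles.append(cell.a.get_text(strip=True))
        time.sleep(delay)  # pause between requests to be polite to the server
    return titles

# Example with real requests (left commented out; needs 'import requests'):
# titles = scrape_titles(lambda url: requests.get(url).text)
```

Passing the fetching function in as an argument also makes it possible to try the loop on saved HTML before pointing it at a live site.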

## Using datetime for file management

In [18]:
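from datetime import datetime

# Timestamp in the form yyyymmdd_HHMMSS, e.g. 20170224_120943
datestring = datetime.now().strftime('%Y%m%d_%H%M%S')
print(datestring)
# Open and immediately close the file to create an empty, timestamped CSV
with open("myfile_" + datestring + ".csv", 'w') as f:
    pass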

20170224_120943


### Resources
* "Web Scraping with Python" by Richard Lawson
http://hdl.handle.net/10079/bibid/12646583
* StackOverflow https://stackoverflow.com/

### DataRescue New Haven
![datarescuenhv](datarescuenhv.jpg)
http://web.library.yale.edu/news/2017/02/datarescue-new-haven-yale-march-4th