# Web Data Scraping | BAIS 6100

**Instructor: Qihang Lin**

A huge amount of data is publicly available on the Internet and can serve various business interests. To harvest that data effectively, you'll need to become skilled at web scraping.

**Applications of web scraping**:

 - Monitor competitors' new products and prices
 - Monitor consumer sentiment and brand reputation
 - Analyze news articles
 - ......

## XML and HTML
* XML stands for "eXtensible Markup Language", a format that represents data using a tree structure.
* HTML and XML files have very similar formats. 
    * HTML is used to build a webpage.
    * XML is used to describe data.
* Scraping text data from the web requires processing an HTML file.

## Element, Tag, Text, and Attribute 
* In an XML or HTML file, an **element** is a piece of data enclosed by a pair of **tags** in angle brackets: `<>` and `</>`

<img src="https://myweb.uiowa.edu/qihlin/teaching/XML0.png">

    * This element's tag is "movie"
    * Start tag: <movie>
    * End tag: </movie>
    * Text: Good Will Hunting
    
* Attributes of an element are inside the start tag. 

<img src="https://myweb.uiowa.edu/qihlin/teaching/XML1.png">

    * attribute names: mins and lang.
    * attribute values: "126" and "en" (must be in quotes).

* An element can contain other elements as its content.

<img src="https://myweb.uiowa.edu/qihlin/teaching/XML3.png">

    * `movie` contains four elements: `title`, `director`, `year`, `genre`
    * `director` further contains two elements: `first_name` and `last_name`
    * In this example, there is no text immediately belonging to "movie" 

* In general, XML represents data using a tree structure. 

<img src="https://myweb.uiowa.edu/qihlin/teaching/XML4.png">

    * There is a unique root element. 
    * An element might not have any text or attributes, but it always has a tag name.
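
The element, tag, text, and attribute vocabulary above can be tried out directly in Python. The sketch below is only an illustration: the XML string is made up to resemble the movie example described above (the director's name, year, and genre values are assumptions, since the figures are not reproduced here), and it uses the standard library's `xml.etree.ElementTree` so that it runs without installing anything.

```python
import xml.etree.ElementTree as ET

# Illustrative XML modeled on the movie example; the values are assumptions.
xml_source = """
<movie mins="126" lang="en">
    <title>Good Will Hunting</title>
    <director>
        <first_name>Gus</first_name>
        <last_name>Van Sant</last_name>
    </director>
    <year>1997</year>
    <genre>Drama</genre>
</movie>
"""

movie = ET.fromstring(xml_source.strip())    # the root element of the tree

print(movie.tag)                             # tag name of the root element
print(movie.attrib)                          # attributes as a dictionary
child_tags = [child.tag for child in movie]  # iterating an element yields its children
print(child_tags)

title = movie.find("title")                  # first child whose tag is "title"
print(title.text)                            # text contained by <title>

director = movie.find("director")
full_name = " ".join(part.text for part in director)
print(full_name)
```

Note that `movie` itself has no meaningful text of its own: all of the data lives in its attributes and in its descendants.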

* In HTML, empty elements do not need an end tag. For example, a line break is written as **\<br\>**, with no **\</br\>**.

* In XML, empty elements can be represented by a single self-closing tag like **\<br /\>**.


* Before the root element, an XML file may have a "prolog" made of declarations whose tags look like `<? ... ?>` or `<! ... >`. 

<img src="https://myweb.uiowa.edu/qihlin/teaching/XML5.png">

## HTML 
* HTML is similar to XML but uses only predefined tags: `<html>`, `<body>`, `<p>`, `<b>`, `<div>`, `<span>`......

<img src="https://myweb.uiowa.edu/qihlin/teaching/HTML0.png">

* The HTML file above is displayed in a browser as follows:
<img src="https://myweb.uiowa.edu/qihlin/teaching/HTML1.png">

## HTML Parser

Both XML and HTML files can be processed by a **parser** and converted into a tree-structured Python object. The parser we will use is **lxml**. 

We will apply the parser to this page: https://myweb.uiowa.edu/qihlin/Seminars_old.htm

In [1]:
#!pip3 install --upgrade requests
#!pip3 install --upgrade lxml
import requests  #The library to download html file from internet
from lxml import html      #XML/HTML parser

#We use the following webpage as an example
URL = 'https://myweb.uiowa.edu/qihlin/Seminars_old.htm'
page = requests.get(URL) #Download the HTML file through URL.
root = html.fromstring(page.content) #Parse the HTML source code and create the tree object.



**root** here is the root element of the tree.   

In [2]:
#Return the text immediately contained by an element 
root.text
#It is empty because the root node does not immediately contain any text data. 

In [4]:
#Return all texts contained by an element or any of its descendant.
root.text_content()

'Management Sciences Seminars.IowaGold {\tcolor: #FF0;\tfont-size: x-large;\ttext-align: center;}.Semester {\tfont-size: large;}      Department of Management Sciences Seminar Series        Fall 2014        Date    Time    Room    Speaker    Affiliation    Talk Title and Abstract          8/29/2014    2:30pm-3:20pm    C107 PBB    Benjamin Rogers    The University of Iowa, ITS - Research Services    Introduction to High Performance Computing System in the University of Iowa        9/12/2014    1:30pm-2:20pm    W207 PBB    Dirk Mattfeld    Technische Universitaet Braunschweig, Carl-Friedrich Gauss Department, Decision Support Group    Data modeling and optimization for tactical planning in bike sharing systems          11/6/2014    2:30pm-3:20pm    S326 PBB    Wenjun Wang;Guanglin Xu;Stacy Voccia    The University of Iowa, Department of Management Sciences    INFORMS PRACTICE SESSION        11/7/2014    2:30pm-3:20pm    C107 PBB    Bill Schmarzo    EMC, Chief Technology Officer, Enterpri

In [5]:
#Return the attributes of an element as a dictionary
root.attrib

{'xmlns': 'http://www.w3.org/1999/xhtml'}

In [6]:
#Return the value of a particular attribute of an element
root.attrib["xmlns"]

'http://www.w3.org/1999/xhtml'
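
To see the difference between `.text`, `.text_content()`, and `.attrib` without downloading anything, here is a minimal sketch on a tiny made-up page (the HTML string and its `id` value are assumptions used only for illustration).

```python
from lxml import html

# A tiny, made-up page used only for illustration.
source = """
<html lang="en">
  <body>
    <p id="intro">Seminars in <b>Fall 2014</b> are listed below.</p>
  </body>
</html>
"""

root = html.fromstring(source)   # parse the source and return the <html> element
p = root.xpath('//p')[0]         # the first <p> element anywhere in the tree

print(root.tag)                  # tag name of the root element
print(dict(root.attrib))         # attributes of <html> as a plain dictionary
print(repr(p.text))              # only the text that comes before the first child of <p>
print(p.text_content())          # text of <p> together with all of its descendants
print(p.get("id"))               # same as p.attrib["id"]
print(p.get("class"))            # .get() returns None when the attribute is missing
```

Here `p.text` stops at the `<b>` child, which is why `.text` can look incomplete or empty even when the element visibly contains text in the browser.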

## XPath

* The data we want might be in an element in a deep location in the tree.
* XPath is a language to describe the location of an element in an XML/HTML file using the tags and attributes of that element and/or its ancestors.
* Here are some examples. If you are interested in learning more about XPath, see https://www.scrapingbee.com/blog/practical-xpath-for-web-scraping/

<img src="https://myweb.uiowa.edu/qihlin/teaching/XPath0.png" width="500">
<img src="https://myweb.uiowa.edu/qihlin/teaching/XPath1.png" width="500">
<img src="https://myweb.uiowa.edu/qihlin/teaching/XPath2.png" width="500">
<img src="https://myweb.uiowa.edu/qihlin/teaching/XPath3.png" width="500">
<img src="https://myweb.uiowa.edu/qihlin/teaching/XPath4.png" width="500">
<img src="https://myweb.uiowa.edu/qihlin/teaching/XPath5.png" width="500">
<img src="https://myweb.uiowa.edu/qihlin/teaching/XPath6.png" width="500">
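
As a minimal sketch of a few common XPath expressions, the following runs them with lxml on a small made-up XML document. The `library` and `movie` data are illustrative assumptions, not the content of the figures above.

```python
from lxml import etree

# Made-up XML used only to try out XPath expressions.
xml_source = """
<library>
  <movie lang="en"><title>Good Will Hunting</title><year>1997</year></movie>
  <movie lang="fr"><title>Amelie</title><year>2001</year></movie>
  <movie lang="en"><title>Up</title><year>2009</year></movie>
</library>
"""
root = etree.fromstring(xml_source.strip())

# Absolute path from the root; text() returns the text instead of the element.
all_titles = root.xpath('/library/movie/title/text()')

# A child index in brackets picks one sibling; XPath counts from 1.
second = root.xpath('/library/movie[2]/title/text()')

# "//" matches elements at any depth in the tree.
years = root.xpath('//year/text()')

# [@name="value"] keeps only elements with that attribute value.
english = root.xpath('//movie[@lang="en"]/title/text()')

# "@name" at the end of a path returns attribute values.
langs = root.xpath('//movie/@lang')

for name, result in [('all_titles', all_titles), ('second', second),
                     ('years', years), ('english', english), ('langs', langs)]:
    print(name, result)
```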

## Identify XPath Using Browser 

It is not easy to write an XPath to the element we want just by reading the HTML source code. Instead, we use the browser's inspector to see which element contains the required data. 

This can be done in two right clicks in Chrome:

 1. Right click the text you want to get and select "Inspect". You will see the HTML source code behind that text in your browser.
   
 2. Right click the element highlighted in the source code and select "Copy" and then "Copy XPath" or "Copy Full XPath". A unique path to that element has been copied.


Suppose we want to extract the titles of the presentations in the following website.
https://myweb.uiowa.edu/qihlin/Seminars_old.htm

According to the inspector, the unique XPath to a seminar's title "**INFORMS PRACTICE SESSION**" is:

**/html/body/table/tbody/tr[6]/td[6]/a**

  
How to interpret this path:

1. "**html**" is the tag of the root element. 

2. The title above is in an element tagged "**a**", and this element is a child of "**td**", which is a child of "**tr**", and so on. 

3. "[6]" in "tr[6]" and "[6]" in "td[6]" are **child indexes**, counted from 1. "tr[6]" means the 6th "tr" child of "tbody". Similarly, "td[6]" means the 6th "td" child of "tr[6]".

Similarly, the unique XPath to title "**The Big Data MBA: Moving from Monitoring to Monetization**" is:

**/html/body/table/tbody/tr[7]/td[6]/a**

The unique XPath to title "**Matheuristics for routing problems**" is:

**/html/body/table/tbody/tr[8]/td[6]/a**

Now, suppose we want to collect all titles on this page, what path can we use? 

Let's examine the pattern in the three paths above. Only the child index of element "tr" ([6], [7], [8]) changes with the title, while the child index of element "td" ([6]) does not change. Hence, we can tell that the XPath to all titles should be 
**/html/body/table/tbody/tr/td[6]/a**

Then we use this path and **xpath** method to return the elements where the presentations' titles are contained.

In [7]:
#Return all elements in a path like /html/body/table/tbody/tr/td/a
root.xpath('/html/body/table/tbody/tr/td[6]/a')

[<Element a at 0x7fafc5825bd0>,
 <Element a at 0x7fafc5825b30>,
 <Element a at 0x7fafc5825400>,
 <Element a at 0x7fafc5825b80>,
 <Element a at 0x7fafdfed6220>]

We can use list comprehension to convert elements to text data.

In [8]:
titles=[s.text for s in root.xpath('/html/body/table/tbody/tr/td[6]/a')]
titles

['Introduction to High Performance Computing System in the University of Iowa',
 'Data modeling and optimization for tactical planning in bike sharing systems',
 'INFORMS PRACTICE SESSION',
 'The Big Data MBA: Moving from Monitoring to Monetization',
 'Matheuristics for routing problems']


You may notice that there is a talk titled "**Updated Graduate Handbook**" on https://myweb.uiowa.edu/qihlin/Seminars_old.htm 
Why did our code not find this title? If you inspect its XPath using the method above, you will see that its path is:

**/html/body/table/tbody/tr[12]/td[6]**

The difference here is that this title is not in an element **"a"** but in an element **"td"**. However, if we remove "/a" from the path, we will miss the five titles we were able to collect earlier. See:


In [9]:
titles=[s.text for s in root.xpath('/html/body/table/tbody/tr/td[6]')]
titles

[None, None, None, None, None, None, None, 'Updated Graduate Handbook']

That is because those five titles are not the immediate text data of "td[6]" but the text data of the child "a" of "td[6]".
In this case, we will need to use "text_content()" to extract the text of all descendants of "td[6]". See:

In [10]:
titles=[s.text_content() for s in root.xpath('/html/body/table/tbody/tr/td[6]')]
titles

['Talk Title and Abstract',
 'Introduction to High Performance Computing System in the University of Iowa',
 'Data modeling and optimization for tactical planning in bike sharing systems',
 'INFORMS PRACTICE SESSION',
 'The Big Data MBA: Moving from Monitoring to Monetization',
 'Matheuristics for routing problems',
 'Talk Title and Abstract',
 'Updated Graduate Handbook']

By the way, some initial segment of the XPath can be replaced by "//" (relative path) as long as we still get the data we want. 

In [11]:
#A shorter version
titles=[s.text_content() for s in root.xpath('//tr/td[6]')]
titles

['Talk Title and Abstract',
 'Introduction to High Performance Computing System in the University of Iowa',
 'Data modeling and optimization for tactical planning in bike sharing systems',
 'INFORMS PRACTICE SESSION',
 'The Big Data MBA: Moving from Monitoring to Monetization',
 'Matheuristics for routing problems',
 'Talk Title and Abstract',
 'Updated Graduate Handbook']
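
The output above still contains the column header 'Talk Title and Abstract'. The following sketch reproduces the situation on a small made-up table (the speakers, file names, and two-column layout are assumptions, so the cell index is `td[2]` rather than `td[6]`) and shows one simple way to drop header cells after extraction.

```python
from lxml import html

# A made-up table that mimics the seminar page: a header row, rows whose
# title sits inside a link, and one row whose title is plain text.
source = """
<html><body><table>
  <tr><td>Speaker</td><td>Talk Title and Abstract</td></tr>
  <tr><td>A. Smith</td><td><a href="smith.pdf">Routing problems</a></td></tr>
  <tr><td>B. Jones</td><td><a href="jones.pdf">Bike sharing</a></td></tr>
  <tr><td>Staff</td><td>Updated Graduate Handbook</td></tr>
</table></body></html>
"""
root = html.fromstring(source)
cells = root.xpath('//tr/td[2]')        # the second cell of every row

# .text only sees text placed directly in <td>, so linked titles come back as None.
direct_text = [c.text for c in cells]

# .text_content() also collects the text of descendants such as <a>.
all_text = [c.text_content().strip() for c in cells]

# Drop the header cells by comparing against the known header string.
header = 'Talk Title and Abstract'
titles = [t for t in all_text if t != header]

print(direct_text)
print(titles)
```

Filtering in Python keeps the XPath simple. If the header rows can be told apart by their markup, a more specific XPath would be an alternative.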

## Obtain the Attributes of Element

Sometimes, we are interested in the attributes of an element. For example, we want to extract the hyperlink behind each title on the web page  https://myweb.uiowa.edu/qihlin/Seminars_old.htm  

After inspecting the HTML code, we realize that each hyperlink is the value of attribute **"href"** of an element **"a"**. For example, the element **"a"** corresponding to **"Introduction to High Performance Computing System in the University of Iowa"** is 

\<a href="Ben Rogers Fall 2014 (Seminar).pdf">Introduction to High Performance Computing System in the University of Iowa</a\>

This means we can read **.attrib["href"]** on each element **"a"** to get the hyperlink. See below:

In [12]:
hyperlinks = [s.attrib["href"] for s in root.xpath('//tr/td[6]/a')]
hyperlinks

['Ben Rogers Fall 2014 (Seminar).pdf',
 'Dirk Mattfeld Fall 2014 (Seminar).pdf',
 'MIS PhD Students - Fall 2014-1.doc',
 'Bill Schmarzo Fall 2014 (Seminar).pdf',
 'Claudia Archetti Fall 2014 (Seminar).pdf']
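
The hyperlinks above are relative file names, so they cannot be opened on their own. One possible way to turn them into full URLs is sketched below with the standard library. It assumes every link is a plain relative file name like the ones in the output above (an absolute link would need to skip `quote()`), and it has not been checked that the resulting URLs actually resolve on the server.

```python
from urllib.parse import quote, urljoin

page_url = 'https://myweb.uiowa.edu/qihlin/Seminars_old.htm'

# Two of the relative links extracted above.
hyperlinks = ['Ben Rogers Fall 2014 (Seminar).pdf',
              'Dirk Mattfeld Fall 2014 (Seminar).pdf']

# quote() percent-encodes spaces and other unsafe characters in the file name;
# urljoin() resolves the relative link against the URL of the page it came from.
absolute_links = [urljoin(page_url, quote(link)) for link in hyperlinks]

for link in absolute_links:
    print(link)
```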

## Web Data Scraping in General

We summarize the steps of scraping a single webpage:

* Request the webpage and create a root element using a parser.
* Locate a particular piece of data you want from the webpage.
* Use a browser to identify the unique XPath (with child indexes) to that data.
* Identify and remove the child indexes that vary with the data you want, so as to obtain the XPath to all of that data. 
* Use .xpath() to get all elements from the identified path
* Extract text or attribute with list comprehension.
* If the data extracted is different from your expectation, modify the path.

**This procedure only works for a static page and might not work well for a dynamic page.**

## Example: Yelp Review and Star Rating

Suppose we want to obtain all 400+ review texts and star ratings from https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city. There are only ten reviews per page, so we have to collect data from multiple pages. To do so, we have to track how the URL changes when we flip pages. Then, we can use a loop to scrape pages with different URLs.

As a practice, we will start with scraping data from one page and then do it for all pages. 

### Scrape One Page

In [12]:
URL = 'https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city'
page = requests.get(URL) 
root = html.fromstring(page.content)

Using the browser's inspector, we obtain the following XPaths of the text of the first three reviews: 

**//*[@id="main-content"]/div[2]/section[2]/div[2]/div/ul/li[1]/div/div[3]/p/span**

**//*[@id="main-content"]/div[2]/section[2]/div[2]/div/ul/li[2]/div/div[4]/p/span**

**//*[@id="main-content"]/div[2]/section[2]/div[2]/div/ul/li[3]/div/div[3]/p/span**

Here, **//*[@id="main-content"]** selects all elements anywhere in the tree with any tags as long as they have an attribute called "id" whose value is 'main-content'.

By comparison, we can tell that the indexes of **li** and the last **div** change with reviews. Hence, we remove the indexes of **li** and the last **div** in the path and use it to access all reviews. See below:

In [13]:
mypath='//*[@id="main-content"]/div[2]/section[2]/div[2]/div/ul/li/div/div/p/span'
reviews = [s.text_content() for s in root.xpath(mypath)]
print(len(reviews)) # There are ten reviews each page. See if any review is missing.
reviews

10


["I'm going to preface this by saying I saw the comedy movie Cedar Rapids (The Pullman is located in Iowa City) and I didn't have high expectations. Now, the best part is Iowa City blew me away... hip, fun, young and some incredible restaurants.I made my way with the help of my friend to the Pullman. WOW... this is a place to get excited about. First the service is outstanding... we sat at the counter, which I recommend. The food is nothing you would expect.... gourmet versions of some of your favorites.I had the Croque Madame... which you never see in the states. Its a french ham and cheese with béchamel sauce with a egg. It was really heaven... some of the best food I've had. For some reason, their fries were some of the best I've had as well... Pullman what do you do with these fries?The bartender and bar manager were like a well oiled machine... fun, friendly, engaging and efficient.The dessert which was some sort of chocolate mousse was oh so good.I'd say this is the best place to
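
As an aside, the effect of the `//*[@id="main-content"]` prefix and of dropping child indexes can be tried on a small made-up page. Everything in the HTML string below (the `sidebar` block, the nesting, the review texts) is an assumption for illustration and is much simpler than the real Yelp page.

```python
from lxml import html

# Made-up page: a sidebar and a main block that share the same inner structure.
source = """
<html><body>
  <div id="sidebar"><ul>
    <li><div><div><p><span>Ad text</span></p></div></div></li>
  </ul></div>
  <div id="main-content"><ul>
    <li><div><div><p><span>First review</span></p></div></div></li>
    <li><div><div>Photo</div><div><p><span>Second review</span></p></div></div></li>
  </ul></div>
</body></html>
"""
root = html.fromstring(source)

def texts(path):
    """Return the text content of every element matched by an XPath."""
    return [s.text_content() for s in root.xpath(path)]

# With child indexes, the path points at exactly one review.
first = texts('//*[@id="main-content"]/ul/li[1]/div/div[1]/p/span')

# Dropping the indexes that vary (li and the last div) matches every review.
with_id = texts('//*[@id="main-content"]/ul/li/div/div/p/span')

# Without the id condition, the same structure in the sidebar is matched too.
without_id = texts('//ul/li/div/div/p/span')

print(first)
print(with_id)
print(without_id)
```

Anchoring the path at an element with a distinctive `id` keeps the path short while still excluding look-alike structures elsewhere on the page.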

Using the browser's inspector, we obtain the following XPaths of the star ratings of the first three reviews: 

**//*[@id="main-content"]/div[2]/section[2]/div[2]/div/ul/li[1]/div/div[2]/div/div[1]/span/div**

**//*[@id="main-content"]/div[2]/section[2]/div[2]/div/ul/li[2]/div/div[2]/div/div[1]/span/div**

**//*[@id="main-content"]/div[2]/section[2]/div[2]/div/ul/li[3]/div/div[2]/div/div[1]/span/div**

By comparison, we can tell that only the index of **li** changes with star ratings. Hence, we remove the index of **li** in the path and use it to access all ratings. See below:

In [14]:
mypath='//*[@id="main-content"]/div[2]/section[2]/div[2]/div/ul/li/div/div[2]/div/div[1]/span/div'
ratings = [s.attrib['aria-label'] for s in root.xpath(mypath)]
ratings = [int(s.split()[0]) for s in ratings]
print(len(ratings)) # There are ten ratings on each page. See if any rating is missing.
ratings

10


[5, 5, 5, 5, 5, 5, 5, 1, 4, 5]
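
The conversion `int(s.split()[0])` above relies on every label starting with a whole number. A slightly more defensive variant is sketched below. The helper name `parse_rating` and the sample labels are hypothetical, and the label format (a leading number such as "5 star rating") is an assumption based on how the code above splits the string.

```python
import re

def parse_rating(label):
    """Return the leading number of a label such as '5 star rating', or None."""
    match = re.match(r'\s*(\d+(?:\.\d+)?)', label or '')
    return float(match.group(1)) if match else None

# Hypothetical labels: a whole number, a half star, an empty label, a missing one.
labels = ['5 star rating', '4.5 star rating', '', None]
parsed = [parse_rating(s) for s in labels]
print(parsed)
```

Returning `None` instead of raising an error keeps one malformed label from stopping a long scraping run; the missing values can be inspected afterwards.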

### Scrape Multiple Pages 

We have to use a for loop to scrape all reviews and ratings from https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city. 


A few things to keep in mind:

1. Identify some patterns of URLs.
    * Not every segment of URL is necessary to open a page 
        - https://www.google.com/search?q=Python&ie=utf-8&oe=utf-8
        - https://www.google.com/search?q=Python
    * Different URLs may open the same page
        - https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city
        - https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city?start=0
    * Track how the URL changes when you flip pages in a website
        - https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city?start=10
        - https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city?start=20
        - https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city?start=500


2. Web servers protect themselves by blocking suspicious or overly frequent requests.
    * Use `time.sleep()` to pause between requests so that you do not scrape too frequently and get banned.


3. Do some research to make sure that you're not violating any copyright laws or terms of service.
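
The first two points can be sketched with the standard library alone. The helper names `page_urls` and `polite_delay` are hypothetical, and adding random jitter to the pause is an optional choice made here, not something the steps above require.

```python
import random
from urllib.parse import urlencode

def page_urls(base_url, n_pages, page_size=10):
    """Build one URL per page by varying the 'start' query parameter."""
    return [base_url + '?' + urlencode({'start': page_size * i})
            for i in range(n_pages)]

def polite_delay(base=2.0, jitter=1.0):
    """Seconds to wait before the next request: a base pause plus random jitter."""
    return base + random.uniform(0, jitter)

urls = page_urls('https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city', 3)
for url in urls:
    print(url)
    # In a real loop: download and parse the page here,
    # then call time.sleep(polite_delay()) before the next request.
```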

In [15]:
import pandas as pd
import time 
from IPython.display import clear_output  #For monitoring the progress in the notebook

baseURL="https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city?start="
path1='//*[@id="main-content"]/div[2]/section[2]/div[2]/div/ul/li/div/div/p/span'
path2='//*[@id="main-content"]/div[2]/section[2]/div[2]/div/ul/li/div/div[2]/div/div[1]/span/div'
reviews_all=[]  #A list to receive all reviews 
ratings_all=[]  #A list to receive all ratings 
for i in range(5):    #To speed up the code, here we only do 5 loops instead of 55 as you saw in the video
    URL = baseURL+str(10*i)  #Create the URL of each page
    page = requests.get(URL) 
    root = html.fromstring(page.content)
    reviews = [s.text_content() for s in root.xpath(path1)]
    ratings = [s.attrib['aria-label'] for s in root.xpath(path2)]
    ratings = [int(s.split()[0]) for s in ratings]
    reviews_all.extend(reviews)  
    ratings_all.extend(ratings)
    time.sleep(2)    #Pause the code for 2 seconds to avoid getting banned
    clear_output()   #Clean the URL previously printed in notebook
    print(URL)
    if len(reviews)==0:  #Stop when a page returns no reviews
        break

#Save data to a csv file.
df = pd.DataFrame({'Review': reviews_all, 'Rating': ratings_all}) 
df.to_csv("Pullman.csv",index=False)
df.head()

https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city?start=40


| | Review | Rating |
|---|---|---|
| 0 | I'm going to preface this by saying I saw the ... | 5 |
| 1 | We ended up eating here three times in one wee... | 5 |
| 2 | I LOVE PULLMAN.I've been here numerous times w... | 5 |
| 3 | Happened upon this lovely establishment when i... | 5 |
| 4 | In town for several days, we wanted to check o... | 5 |
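
One caveat with collecting reviews and ratings through two separate paths: the two lists are matched only by position. If a review on some page lacks a rating (or a text), the lists go out of step, and `pd.DataFrame` raises an error when their lengths differ. A possible alternative, sketched below on a made-up page (the HTML structure, the `reviews` id, and the texts are all assumptions, far simpler than the real Yelp markup), is to loop over each `li` and query inside it with a relative path.

```python
from lxml import html

# Made-up page: the second review has no rating element.
source = """
<html><body><ul id="reviews">
  <li><div aria-label="5 star rating"></div><p><span>Great food.</span></p></li>
  <li><p><span>No rating shown.</span></p></li>
  <li><div aria-label="1 star rating"></div><p><span>Slow service.</span></p></li>
</ul></body></html>
"""
root = html.fromstring(source)

rows = []
for li in root.xpath('//ul[@id="reviews"]/li'):
    # A path starting with "." is evaluated relative to this li only.
    texts = li.xpath('./p/span')
    labels = li.xpath('./div/@aria-label')
    review = texts[0].text_content() if texts else None
    rating = int(labels[0].split()[0]) if labels else None
    rows.append({'Review': review, 'Rating': rating})

for row in rows:
    print(row)
```

Because each row is built from a single `li`, a missing piece becomes `None` in that row instead of shifting every later rating. The resulting list of dictionaries can be passed to `pd.DataFrame(rows)`.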


