# 1. Legalities and Ethics

In an ideal world, web scraping would not be necessary and each website would provide an API to share their data in a structured format.

Is web scraping legal?
If the data is going to be republished, then the type of data scraped is important.
Court cases so far suggest that when the scraped data constitutes facts (such as business locations and telephone listings), it can be republished. However, if the data is original (such as opinions and reviews), it most likely cannot be republished for copyright reasons.

There are three basic types of intellectual property: trademarks
(indicated by a ™ or ® symbol), copyrights (the ubiquitous ©), and patents
(sometimes indicated by text noting that the invention is patent protected or a patent
number, but often by nothing at all).

# 2. Background research
Before diving into crawling a website, we should develop an understanding about the scale and structure of our target website.
1. Checking robots.txt: for example http://example.webscraping.com/robots.txt
2. Examining the Sitemap: Sitemap files are provided by websites to help crawlers locate their updated content
without needing to crawl every web page.
3. Estimating the size of a website: A quick way to estimate the size of a website is to check the results of Google's
crawler (site:example.webscraping.com)
4. Identifying the technology used by a website: The type of technology used to build a website will affect how we crawl it. A useful tool to check the kind of technologies a website is built with is the builtwith module:

        import builtwith
    
        builtwith.parse('http://example.webscraping.com')
    
    
5. Finding the owner of a website: For some websites, it may matter to us who the owner is. For example, if the owner is known to block web crawlers, then it would be wise to be more conservative in our download rate. The owner can be looked up with the whois module:

        import whois
    
        print(whois.whois('appspot.com'))
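
The robots.txt check from step 1 can also be done programmatically. The following is a minimal sketch using the standard library's `urllib.robotparser`; the robots.txt content and the user agent names `BadCrawler` and `GoodCrawler` are hypothetical and inlined so that the example runs without network access.

```python
from urllib import robotparser

# Hypothetical robots.txt content (an assumption for illustration only)
ROBOTS_TXT = """\
User-agent: BadCrawler
Disallow: /

User-agent: *
Crawl-delay: 5
Disallow: /trap
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())  # parse() accepts an iterable of lines

url = 'http://example.webscraping.com/view/1'
print(rp.can_fetch('BadCrawler', url))    # this agent is banned from the whole site
print(rp.can_fetch('GoodCrawler', url))   # other agents fall under the '*' rules
print(rp.can_fetch('GoodCrawler', 'http://example.webscraping.com/trap'))
print(rp.crawl_delay('GoodCrawler'))      # suggested delay between requests, in seconds
```

For a live site, `rp.set_url(...)` followed by `rp.read()` would fetch the file instead of parsing an inline string.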


# 3. Crawling a website
In order to scrape a website, we first need to download its web pages containing the data of interest—a process known as crawling. Three common approaches to crawling a website:
1. Crawling a sitemap
2. Iterating the database IDs of each web page
3. Following web page links

In [3]:
# Downloading a web page
import requests
from requests.adapters import HTTPAdapter
from requests.exceptions import HTTPError, RequestException

def download(url, retriesNum=2, user_agent='WebCrawler'):
    # retry failed connection attempts up to retriesNum times
    adapter = HTTPAdapter(max_retries=retriesNum)
    session = requests.Session()
    session.mount('http://', adapter)
    session.mount('https://', adapter)
    headers = {'User-agent': user_agent}

    print('Downloading:', url)

    try:
        response = session.get(url, headers=headers)
        response.raise_for_status()
    except HTTPError as http_err:
        print(f'HTTP error occurred: {http_err}')
    except RequestException as err:
        # covers connection errors, timeouts, and other request failures
        print(f'Other error occurred: {err}')
    else:
        print('Success!')
        return response.text

    # the download failed, so there is no HTML to return
    return None

In [10]:
# 1. Sitemap crawler
import re

def crawl_sitemap(url):
    sitemap = download(url)  # download the sitemap file
    if sitemap is None:
        return  # the sitemap could not be downloaded
    links = re.findall('<loc>(.*?)</loc>', sitemap)  # extract the sitemap links
    for link in links:  # download each link
        html = download(link)

In [11]:
# 2.ID iteration crawler

import itertools

max_errors = 5 # maximum number of consecutive download errors allowed
num_errors = 0 # current number of consecutive download errors

for page in itertools.count(1):
    url = 'http://example.webscraping.com/view/-%d' % page
    html = download(url)
    if html is None:
        num_errors += 1
        if num_errors == max_errors:
            break
    else:
        num_errors = 0 # success: can scrape the result    

In [1]:
# 3.Link crawler
'''
So far, we have implemented two simple crawlers that take advantage of the
structure of our sample website to download all the countries. These techniques
should be used when available, because they minimize the required amount of web
pages to download. However, for other websites, we need to make our crawler act
more like a typical user and follow links to reach the content of interest.
'''

import re
import urllib.parse

def link_crawler(seed_url, link_regex):
    crawl_queue = [seed_url]
    seen = set(crawl_queue)  # keep track of which URLs have been seen before
    while crawl_queue:
        url = crawl_queue.pop()
        html = download(url)
        if html is None:
            continue  # skip pages that failed to download
        for link in get_links(html):
            # convert relative links to absolute URLs before filtering
            link = urllib.parse.urljoin(seed_url, link)
            # filter for links matching our regular expression
            if re.search(link_regex, link) and link not in seen:
                seen.add(link)
                crawl_queue.append(link)

def get_links(html):
    # Return a list of links from html
    webpage_regex = re.compile(r'<a[^>]+href=["\'](.*?)["\']', re.IGNORECASE)
    return webpage_regex.findall(html)

In [None]:
link_crawler('http://example.webscraping.com', 'example.webscraping.com/(index|view)/')

In [None]:
# If we crawl a website too fast, we risk being blocked or overloading the server.
# To minimize these risks, we can throttle our crawl by waiting for a delay between downloads.
import datetime
import time
import urllib.parse

class Throttle:

    def __init__(self, delay):
        # amount of delay between downloads for each domain
        self.delay = delay
        # timestamp of when a domain was last accessed
        self.domains = {}

    def wait(self, url):
        domain = urllib.parse.urlparse(url).netloc
        last_accessed = self.domains.get(domain)
        if self.delay > 0 and last_accessed is not None:
            elapsed = (datetime.datetime.now() - last_accessed).total_seconds()
            sleep_secs = self.delay - elapsed
            if sleep_secs > 0:
                # domain has been accessed recently so need to sleep
                time.sleep(sleep_secs)
        # update the last accessed time
        self.domains[domain] = datetime.datetime.now()

# We can add throttling to the crawler by calling throttle.wait() before every download:
# throttle = Throttle(delay)
# throttle.wait(url)
# html = download(url)

# 4. Beautiful Soup & RegEx

Now, we need to make this crawler achieve something by extracting data from each web page, which is known as scraping.
We will walk through three approaches to extract data from a web page using regular expressions, Beautiful Soup and lxml. 

### 1. Regular Expression

1. a*: matches zero or more occurrences of a, such as an empty string, a, or aaa.
2. a+: matches one or more occurrences of a, such as a or aaaaa.
3. []: matches any character in the brackets; for example, [A-Z]* can match APPLE.
4. (): groups a subexpression; everything in the parentheses is evaluated first.
5. {m,n}: matches between m and n repetitions; for example, a{2,3}b matches aab or aaab.
6. [^]: matches any single character that is not in the brackets; for example, [^A-Z]* can match apple.
7. | : or; for example, b(a|e)d matches bad or bed.
8. . : matches any single character (including symbols, numbers, a space, etc.); for example, c.r matches car or c#r.
9. ^ : indicates that a character or subexpression occurs at the beginning of a string; for example, ^a matches apple or a.
10. \ : an escape character (this allows you to use special characters as their literal meanings).
11. dollar sign : often used at the end of a regular expression, it means “match this up to the end of the string.” Without it, every regular expression has a de facto “.*” at the end of it, accepting strings where only the first part of the string matches. This can be thought of as analogous to the ^ symbol; for example, [A-Z]*[a-z]*$ matches APall.
12. ?!: “does not contain.”

A regular expression for a simple email address:
 
    [A-Za-z0-9\._+]+@[A-Za-z]+\.(com|org|edu|net)
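
The email pattern above can be tried out with Python's `re` module. This is a small sketch; the helper name `is_simple_email` and the sample addresses are made up for illustration.

```python
import re

# Pattern copied from the text above
EMAIL_RE = re.compile(r'[A-Za-z0-9\._+]+@[A-Za-z]+\.(com|org|edu|net)')

def is_simple_email(text):
    """Return True when the whole string matches the pattern."""
    return EMAIL_RE.fullmatch(text) is not None

# Hypothetical sample addresses
samples = ['jane.doe+news@example.com', 'user@example.io', 'no-at-sign.org', 'a@b.edu']
for sample in samples:
    print(sample, is_simple_email(sample))
```

Note that the pattern is deliberately simple: it rejects domains with hyphens, digits, or subdomains, and any top-level domain outside the four listed.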

In [None]:
import re

# Example country page on the sample website (assumed here so that url is defined)
url = 'http://example.webscraping.com/default/view/Australia-1'
html = download(url)
re.findall('<td class="w2p_fw">(.*?)</td>', html)

# Consider if this table is changed so that the population data is no longer available in the second row.
# Including the parent <tr> element, which has an ID, makes the match more specific:
re.findall('<tr id="places_area__row"><td class="w2p_fl">'
           '<label for="places_area" id="places_area__label">Area: </label></td>'
           '<td class="w2p_fw">(.*?)</td>', html)

# There are many other ways the web page could be updated in a way that still breaks the regular expression.
# This version tolerates extra whitespace and either kind of quote:
re.findall(r'<tr id="places_area__row">.*?<td\s*class=["\']w2p_fw["\']>(.*?)</td>', html)

# From this example, it is clear that regular expressions provide a quick way to scrape data
# but are too brittle and will easily break when a web page is updated.
# Fortunately, there are better solutions.

### 2. Beautiful Soup

In [3]:
from bs4 import BeautifulSoup
import requests
source = requests.get('http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=63917421733088077').text
soup = BeautifulSoup(source, 'html.parser')# Other kinds of parsers: lxml, html5lib

In [None]:
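# to handle errors
import requests
from bs4 import BeautifulSoup

def getTitle(url):
    try:
        response = requests.get(url)
        response.raise_for_status()
    except requests.exceptions.RequestException:
        return None  # the page could not be downloaded
    try:
        bs = BeautifulSoup(response.text, 'html.parser')
        title = bs.body.h1
    except AttributeError:
        return None  # the page has no <body> tag
    return title

title = getTitle('http://www.pythonscraping.com/pages/page1.html')
if title is None:
    print('Title could not be found')
else:
    print(title)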

There are four objects in this library:
1. BeautifulSoup objects: Instances seen in previous code examples as the variable bs
2. Tag objects: Retrieved in lists by calling find_all, or retrieved individually by calling find, on a BeautifulSoup object
3. NavigableString objects: Used to represent text within tags, rather than the tags themselves
4. Comment objects: Used to find HTML comments in comment tags.

#### find and find_all

In [8]:
from bs4 import BeautifulSoup
import requests
source = requests.get('http://www.pythonscraping.com')
bs = BeautifulSoup(source.text, 'html.parser')
# bs.find_all(tagName, tagAttributes)
nameList = bs.find_all('div', {'class':'five columns'})
for name in nameList:
    print(name.get_text())




Buy WSwP Directly from O'Reilly:

 


Navigation

Blog
 





In [None]:
# find_all(tag, attributes, recursive, text, limit, keywords)
# find(tag, attributes, recursive, text, keywords)

# tag:
bs.find_all(['h1','h2','h3','h4','h5','h6'])

# attributes:
bs.find_all('span', {'class':{'green', 'red'}})

# recursive: The recursive argument is a boolean. How deeply into the document do you want to
# go? If recursive is set to True, the find_all function looks into children, and children’s children, 
# for tags that match your parameters. If it is False, it will look only at the top-level tags in your document. 

# limit: The limit argument, of course, is used only in the find_all method; find is equivalent 
# to the same find_all call, with a limit of 1. You might set this if you’re interested
# only in retrieving the first x items from the page

# keywords: This argument allows you to select tags that contain a particular attribute or set of attributes.
bs.find_all(id='title', class_='text')

#### Navigating Trees
The find_all function is responsible for finding tags based on their name and
attributes. But what if you need to find a tag based on its location in a document?
That’s where tree navigation comes in handy. You looked at navigating a
BeautifulSoup tree in a single direction:

        bs.tag.subTag.anotherSubTag

Now let’s look at navigating up, across, and diagonally through HTML trees. 

In [None]:
# If you want to find only descendants that are children, you can use the .children attribute:
for child in bs.find('table',{'id':'giftList'}).children:
    print(child)

In [None]:
# The BeautifulSoup next_siblings attribute makes it trivial to collect data from tables,
# especially ones with title rows:
for sibling in bs.find('table', {'id':'giftList'}).tr.next_siblings:
    print(sibling)

In [None]:
#Occasionally, however, you can find yourself in odd situations 
#that require BeautifulSoup’s parent-finding functions, .parent and .parents. 

#### Accessing Attributes
Often in web scraping you’re not looking for the content of a tag; you’re
looking for its attributes. This becomes especially useful for tags such as a, where the
URL it is pointing to is contained within the href attribute; or the img tag, where the
target image is contained within the src attribute.

    myImgTag.attrs['src']
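
As a small illustration of attribute access, the sketch below parses an inline HTML fragment. The fragment is made up for this example (it borrows the `giftList` table ID used in the cells above), so no download is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, inlined for illustration
html = '''
<table id="giftList">
  <tr><th>Item</th><th>Image</th></tr>
  <tr><td><a href="/gifts/1">Vase</a></td><td><img src="/img/gifts/img1.jpg"></td></tr>
  <tr><td><a href="/gifts/2">Lamp</a></td><td><img src="/img/gifts/img2.jpg"></td></tr>
</table>
'''

bs = BeautifulSoup(html, 'html.parser')

# attrs is a plain dict of the tag's attributes
first_img = bs.find('img')
print(first_img.attrs['src'])

# Indexing the tag directly is equivalent to going through attrs
links = [a['href'] for a in bs.find_all('a')]
print(links)

# .get() returns None instead of raising KeyError when the attribute is missing
print(first_img.get('alt'))
```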


### 3. Lxml

Lxml is a Python wrapper around the libxml2 XML parsing library, which is written in C. This makes it faster than Beautiful Soup but also harder to install on some
computers.

# Chapter 3: Caching Downloads

To support caching, the download function developed in Chapter 1 needs to be modified to check the cache before downloading a URL. We also need to move throttling inside this function so that we throttle only when a download is made, and not when loading from the cache. To avoid the need to pass various parameters for every download, we will take this opportunity to refactor the download function into a class, so that parameters can be set once in the constructor and reused multiple times. Here is the updated implementation to support this:

In [None]:
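import random
import requests

class Downloader:
    def __init__(self, delay=5, user_agent='WebCrawler', proxies=None, num_retries=1, cache=None):
        self.throttle = Throttle(delay)
        self.user_agent = user_agent
        self.proxies = proxies
        self.num_retries = num_retries
        self.cache = cache

    # the cache is checked before downloading
    def __call__(self, url):
        result = None
        if self.cache is not None:
            try:
                result = self.cache[url]
            except KeyError:
                # url is not available in cache
                pass
            else:
                code = result['code']
                if self.num_retries > 0 and code is not None and 500 <= code < 600:
                    # server error so ignore result from cache and re-download
                    result = None
        if result is None:
            # result was not loaded from cache so still need to download
            self.throttle.wait(url)
            proxy = random.choice(self.proxies) if self.proxies else None
            headers = {'User-agent': self.user_agent}
            result = self.download(url, headers, proxy, self.num_retries)
            if self.cache is not None:
                # save result to cache
                self.cache[url] = result
        return result['html']

    def download(self, url, headers, proxy, num_retries, data=None):
        '''The download method of this class is the same as the previous download function, except now it
        returns the HTTP status code along with the downloaded HTML so that error codes
        can be stored in the cache. The body below is a minimal sketch using requests.'''
        print('Downloading:', url)
        proxies = {'http': proxy, 'https': proxy} if proxy else None
        try:
            response = requests.request('POST' if data else 'GET', url,
                                        headers=headers, proxies=proxies, data=data)
            html, code = response.text, response.status_code
        except requests.exceptions.RequestException as err:
            print('Download error:', err)
            html, code = '', None
        if num_retries > 0 and code is not None and 500 <= code < 600:
            # retry 5XX HTTP errors
            return self.download(url, headers, proxy, num_retries - 1, data)
        return {'html': html, 'code': code}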

The link crawler also needs to be slightly updated to support caching by adding the
cache parameter, removing the throttle, and replacing the download function with
the new class, as shown in the following code:

In [None]:
def link_crawler(..., cache=None):
    crawl_queue = [seed_url]
    seen = {seed_url: 0}
    num_urls = 0
    rp = get_robots(seed_url)
    D = Downloader(delay=delay, user_agent=user_agent, proxies=proxies, num_retries=num_retries, cache=cache)
    while crawl_queue:
        url = crawl_queue.pop()
        depth = seen[url]
        # check url passes robots.txt restrictions
        if rp.can_fetch(user_agent, url):
            html = D(url)
            links = []
            ...

### Disk Cache

To cache downloads, we will first try the obvious solution and save web pages to the filesystem. To do this, we will need a way to map URLs to a safe cross-platform filename, because different operating systems and filesystems impose different limitations. To keep our file path safe across these filesystems, it needs to be restricted to numbers, letters, and basic punctuation, with all other characters replaced by an underscore, as shown in the following code:

In [1]:
import re
url = 'http://example.webscraping.com/default/view/Australia-1'
re.sub(r'[^/0-9a-zA-Z\-.,;_ ]', '_', url)

'http_//example.webscraping.com/default/view/Australia-1'

Additionally, the filename and the parent directories need to be restricted to 255 characters to meet the length limitations:

In [None]:
filename = '/'.join(segment[:255] for segment in filename.split('/'))

There is also an edge case that needs to be considered, where the URL path ends
with a slash (/), and the empty string after this slash would be an invalid filename.
However, removing this slash to use the parent for the filename would prevent
saving other URLs. The solution our disk cache will use is appending index.html to the filename
when the URL path ends with a slash.

In [None]:
from urllib.parse import urlsplit  # Python 3 location of the old urlparse module
components = urlsplit('http://example.webscraping.com/index/')
path = components.path
if not path:
    path = '/index.html'
elif path.endswith('/'):
    path += 'index.html'
filename = components.netloc + path + components.query

Together, using this logic to map a URL to a
filename will form the main part of the disk cache.

In [None]:
import os
import re
import pickle  # The pickle module is used to convert the input to bytes, which are then saved to disk.
from urllib.parse import urlsplit

class DiskCache:
    def __init__(self, cache_dir='cache', max_length=255):
        self.cache_dir = cache_dir
        self.max_length = max_length

    def url_to_path(self, url):
        '''Create file system path for this URL
        '''
        components = urlsplit(url)
        # append index.html to empty paths
        path = components.path
        if not path:
            path = '/index.html'
        elif path.endswith('/'):
            path += 'index.html'
        filename = components.netloc + path + components.query
        # replace invalid characters
        filename = re.sub(r'[^/0-9a-zA-Z\-.,;_ ]', '_', filename)
        # restrict maximum number of characters in each path segment
        filename = '/'.join(segment[:self.max_length] for segment in
            filename.split('/'))
        return os.path.join(self.cache_dir, filename)

    def __getitem__(self, url):
        """Load data from disk for this URL
        """
        path = self.url_to_path(url)
        if os.path.exists(path):
            with open(path, 'rb') as fp:
                return pickle.load(fp)
        else:
            # URL has not yet been cached
            raise KeyError(url + ' does not exist')

    def __setitem__(self, url, result):
        """Save data to disk for this url
        """
        path = self.url_to_path(url)
        folder = os.path.dirname(path)
        if not os.path.exists(folder):
            os.makedirs(folder)
        with open(path, 'wb') as fp:
            fp.write(pickle.dumps(result))

In [None]:
# Drawbacks
# we applied various restrictions to map the URL to a safe filename, 
# but an unfortunate consequence of this is that some URLs will map to the same filename.
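
The collision drawback can be seen directly. The sketch below repeats the URL-to-filename rules described in this section as a standalone function (the name `url_to_filename` is chosen here for illustration) and applies it to two hypothetical URLs that differ only in a character the mapping replaces.

```python
import re
from urllib.parse import urlsplit

def url_to_filename(url, max_length=255):
    """Map a URL to a filename using the same rules as the disk cache."""
    components = urlsplit(url)
    path = components.path
    if not path:
        path = '/index.html'
    elif path.endswith('/'):
        path += 'index.html'
    filename = components.netloc + path + components.query
    # replace every character outside the safe set with an underscore
    filename = re.sub(r'[^/0-9a-zA-Z\-.,;_ ]', '_', filename)
    return '/'.join(segment[:max_length] for segment in filename.split('/'))

# Hypothetical URLs that differ only in an unsafe character
url_a = 'http://example.webscraping.com/search?term=a*b'
url_b = 'http://example.webscraping.com/search?term=a$b'

print(url_to_filename(url_a))
print(url_to_filename(url_b))
print(url_to_filename(url_a) == url_to_filename(url_b))
```

Both URLs map to the same path, so whichever page is cached second would overwrite the first.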

### Database Cache


NoSQL stands for Not Only SQL and is a relatively new approach to database design.
The traditional relational model uses a fixed schema and splits the data into
tables. However, with large datasets, the data is too big for a single server and needs
to be scaled across multiple servers. This does not fit well with the relational model
because, when querying multiple tables, the data will not necessarily be available on
the same server. NoSQL databases, on the other hand, are generally schemaless and
designed from the start to shard seamlessly across servers. There have been multiple
approaches to achieve this that fit under the NoSQL umbrella. There are column data
stores, such as HBase; key-value stores, such as Redis; document-oriented databases,
such as MongoDB; and graph databases, such as Neo4j.


# Chapter 4: Concurrent Downloading

# Chapter 5: Dynamic Content

According to the United Nations Global Audit of Web Accessibility, 73 percent of leading websites rely on JavaScript, a client-side language that is ubiquitous in modern web pages, for important functionality. It can be used to collect information for user tracking,
submit forms without reloading the page, embed multimedia, and even power entire
online games.

    <script>
    alert("This creates a pop-up using JavaScript");
    </script>


The consequence of this is that for many web pages the content that is displayed in our web browser is not available in the original HTML, and the scraping techniques covered so far will not work. There are two approaches to scraping data from dynamic JavaScript-dependent websites:

• Reverse engineering JavaScript

• Rendering JavaScript

AJAX stands for Asynchronous JavaScript and XML and was
coined in 2005 to describe the features available across web
browsers that made dynamic web applications possible. These features allowed JavaScript to make
HTTP requests to a remote server and receive responses, which
meant that a web application could send and receive data. The
traditional way to communicate between client and server was
to refresh the entire web page, which resulted in a poor user
experience and wasted bandwidth when only a small amount of
data needed to be transmitted.


### Reverse engineering a dynamic web page

The data is loaded dynamically with JavaScript. To scrape this data, we need to understand how the web
page loads this data, a process known as reverse engineering.

In [None]:
import json
import string

D = Downloader()  # the Downloader class defined in the caching chapter
template_url = 'http://example.webscraping.com/ajax/search.json?page={}&page_size=10&search_term={}'
countries = set()
for letter in string.ascii_lowercase:
    page = 0
    while True:
        html = D(template_url.format(page, letter))
        try:
            ajax = json.loads(html)
        except ValueError as e:
            print(e)
            ajax = None
        else:
            for record in ajax['records']:
                countries.add(record['country'])
        page += 1
        if ajax is None or page >= ajax['num_pages']:
            break
# write the collected country names to a file, one per line
with open('countries.txt', 'w') as fp:
    fp.write('\n'.join(sorted(countries)))
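
The JSON handling in the crawler above can be tried offline. The sketch below parses a hypothetical response body that contains only the fields the crawler reads (`records`, `num_pages`, and `country`); a real response may contain more fields, and the function name `parse_search_page` is chosen here for illustration.

```python
import json

# Hypothetical response body (an assumption, shaped after the fields used above)
sample_response = '''
{
  "records": [
    {"country": "Afghanistan", "id": 1},
    {"country": "Albania", "id": 2}
  ],
  "num_pages": 3
}
'''

def parse_search_page(body):
    """Return (country names, number of pages), or (set(), 0) on invalid JSON."""
    try:
        ajax = json.loads(body)
    except ValueError:
        # json.JSONDecodeError is a subclass of ValueError
        return set(), 0
    countries = {record['country'] for record in ajax['records']}
    return countries, ajax['num_pages']

print(parse_search_page(sample_response))
print(parse_search_page('<html>not json</html>'))
```

Separating the parsing from the download loop makes it possible to test the parsing on saved responses without touching the network.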

### Rendering a dynamic web page

Some websites will be very complex and difficult to understand,
even with a tool like Firebug. For example, if the website has been built with Google
Web Toolkit (GWT), the resulting JavaScript code will be machine-generated and
minified. This generated JavaScript code can be cleaned with a tool such as JS
beautifier, but the result will be verbose and the original variable names will be
lost, so it is difficult to work with. With enough effort, any website can be reverse
engineered. However, this effort can be avoided by instead using a browser rendering
engine, which is the part of the web browser that parses HTML, applies the CSS
formatting, and executes JavaScript to display a web page as we expect. In this
section, the WebKit rendering engine will be used, which has a convenient Python
interface through the Qt framework.

In [None]:
# PyQt or PySide
# The following snippet can be used to import whichever Qt binding is installed:
try:
    from PySide.QtGui import *
    from PySide.QtCore import *
    from PySide.QtWebKit import *
except ImportError:
    from PyQt4.QtGui import *
    from PyQt4.QtCore import *
    from PyQt4.QtWebKit import *

In [None]:
# AJAX search
import lxml.html

url = 'http://example.webscraping.com/search'

# Instantiate the QApplication object, which the Qt framework requires to be created
# before other Qt objects so that it can perform various initializations.
app = QApplication([])
# Create a QWebView object, which is a container for the web documents.
webview = QWebView()
# Create a QEventLoop object, which will be used to create a local event loop.
loop = QEventLoop()
# Connect the loadFinished callback of the QWebView object to the quit method of QEventLoop
# so that when a web page finishes loading, the event loop will be stopped.
webview.loadFinished.connect(loop.quit)
# Pass the URL to load to QWebView. PyQt requires that this URL string is wrapped
# by a QUrl object, while for PySide, this is optional.
webview.load(QUrl(url))
# The QWebView load method is asynchronous, so execution passes immediately to the next line
# while the web page is loading. We want to wait until the page is loaded, so loop.exec_()
# is called to start the event loop, which exits when loading completes.
loop.exec_()
# Call the QWebView GUI show() method so that the render window is displayed, which is useful for debugging.
webview.show()
# Create a reference to the frame to make the following lines shorter.
frame = webview.page().mainFrame()
# The QWebFrame class has many useful methods to interact with a web page.
# The following two lines use CSS patterns to locate an element in the frame, and then set the search parameters.
frame.findFirstElement('#search_term').setAttribute('value', '.')
frame.findFirstElement('#page_size option:checked').setPlainText('1000')
# Submit the form with the evaluateJavaScript() method to simulate a click event.
frame.findFirstElement('#search').evaluateJavaScript('this.click()')
app.exec_()
# Extract the resulting HTML of the loaded web page.
html = webview.page().mainFrame().toHtml()
tree = lxml.html.fromstring(html)
tree.cssselect('#result')[0].text_content()

The final part of implementing our WebKit crawler is scraping the search results,
which turns out to be the most difficult part because it is not obvious when the AJAX
event is complete and the country data is ready. There are three possible approaches
to deal with this:

• Wait a set amount of time and hope that the AJAX event is complete by then

• Override Qt's network manager to track when the URL requests are complete

• Poll the web page for the expected content to appear

The first option is the simplest to implement but is inefficient, because if a safe timeout
is set, then usually a lot more time is spent waiting than necessary. Also, when the
network is slower than usual, a fixed timeout could fail. The second option is more
efficient but cannot be applied when the delay is from the client side rather than server
side—for example, if the download is complete, but a button needs to be pressed
before the content is displayed. The third option is more reliable and straightforward to
implement, though there is the minor drawback of wasting CPU cycles when checking
whether the content has loaded yet.
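
The third approach, polling, can be sketched independently of any browser library. In the sketch below, `wait_for` and `content_ready` are hypothetical names; in a real crawler the condition would check the rendered page for the expected content, whereas here a counter stands in for it so the example runs on its own.

```python
import time

def wait_for(condition, timeout=10.0, interval=0.1):
    """Poll condition() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within %s seconds' % timeout)
        # sleep between checks to limit wasted CPU cycles
        time.sleep(interval)

# Stand-in for "has the content appeared yet?": becomes true on the third check
checks = {'count': 0}

def content_ready():
    checks['count'] += 1
    return checks['count'] >= 3

wait_for(content_ready, timeout=2.0, interval=0.01)
print(checks['count'])
```

The `interval` parameter is the trade-off mentioned above: a shorter interval reacts sooner but checks more often.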

### Selenium

With the WebKit library used in the preceding example, we have full control to
customize the browser renderer to behave as we need it to. If this level of flexibility
is not needed, a good alternative is Selenium, which provides an API to automate the
web browser.

In [None]:
from selenium import webdriver
driver = webdriver.Firefox() # When this command is run, an empty browser window will pop up.

In [None]:
# To load a web page in the chosen web browser, the get() method is called:
driver.get('http://example.webscraping.com/search')

In [None]:
# To set which element to select, the ID of the search textbox can be used.
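# (Selenium 4 removed find_element_by_id in favor of find_element with a By locator.)
from selenium.webdriver.common.by import By
driver.find_element(By.ID, 'search_term').send_keys('.')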

In [None]:
# To return all results in a single search, we want to set the page size to 1000. 
# However, this is not straightforward because Selenium is designed to interact with the browser,
# rather than to modify the web page content. 
# To get around this limitation, we can use JavaScript to set the select box content:
from selenium.webdriver.common.by import By
js = "document.getElementById('page_size').options[1].text = '1000'"
driver.execute_script(js)
driver.find_element(By.ID, 'search').click()

In [None]:
# Now we need to wait for the AJAX request to complete before loading the results, which was the hardest part of the script in the previous WebKit implementation.
# Fortunately, Selenium provides a simple solution to this problem by setting a timeout with the implicitly_wait() method:
driver.implicitly_wait(30)

In [None]:
from selenium.webdriver.common.by import By
links = driver.find_elements(By.CSS_SELECTOR, '#results a')
countries = [link.text for link in links]

In [None]:
driver.close()

# Chapter 6: Interacting With Forms

# Chapter 7: Solving CAPTCHA

# Chapter 8: Scrapy

Scrapy is a popular web scraping framework that comes with many high-level
functions to make scraping websites easier. We will cover Portia, which is an application based on Scrapy that
allows you to scrape a website through a point and click interface.

We will use the following commands in this chapter:

• startproject: Creates a new project

• genspider: Generates a new spider from a template

• crawl: Runs a spider

• shell: Starts the interactive scraping console

In [1]:
import scrapy