Web scraping presents unique organisational challenges: on one website the h1 tag contains the title of an article, while on another it contains the title of the web page.

You may be asked to scrape product prices from different websites, with the ultimate aim of comparing prices for the same product.

Large, scalable crawlers tend to fall into one of several patterns. Learning these patterns, and the situations they apply to, can improve the maintainability and robustness of your crawlers.

We'll focus on web crawlers that collect a limited number of "types" of data from websites and store these data types as Python objects that read from and write to a database.

### Planning and Defining Objects.
It is easy to fall into the trap of trying to track all, or almost all, of the product-related fields that appear across multiple sites. This results in messy, hard-to-read datasets: the paradox of choice.

One good, scalable approach when deciding which data to collect is to ignore the websites and ask yourself, "What do I need?" For example, suppose you want to compare product prices among multiple stores and track those prices over time. You then need just enough information to uniquely identify the product across the stores, and that's it:
- Product title
- Manufacturer
- Product ID no. (if available/relevant)

More detailed information, e.g. price and reviews, is specific to a particular store and is therefore stored separately.

Ask yourself:
- Will this information help with the project goals? Is it essential, or merely nice to have?
- If it may help in the future, how difficult will it be to go back and collect the data later?
- Does it make logical sense to store the data within a particular object?

If so:
- Is this data sparse or dense, i.e. is it available for every product on every site, or only for some (e.g. colour)?
- How large is the data?
- If the data is large, will I need to retrieve it constantly or only on occasion?
- How variable is the type of data?

Suppose you plan to do some meta-analysis of product attributes and prices, such as the number of pages a book has. You notice that this data is sparse, so it may make sense to create a product type that looks like this:

- Product title
- Manufacturer
- Product ID
- Attributes (optional list or dict)

and an attribute type that looks like this:
- attribute name
- attribute value
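
The two types above could be sketched in Python roughly as follows. This is a minimal illustration rather than a prescribed design: the class names, the `get_attribute` helper, and the sample values are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Attribute:
    """One name/value pair describing a product, e.g. ("pages", "306")."""
    name: str
    value: str


@dataclass
class Product:
    """Store-independent description of a product."""
    title: str
    manufacturer: str
    product_id: Optional[str] = None  # not every product has an ID
    attributes: List[Attribute] = field(default_factory=list)  # optional, may stay empty

    def get_attribute(self, name: str) -> Optional[str]:
        """Return the value of the named attribute, or None if it was never collected."""
        for attribute in self.attributes:
            if attribute.name == name:
                return attribute.value
        return None


# Hypothetical sample data: a book with one sparse attribute, a shirt with none
book = Product("Example Book", "Example Press", product_id="EX-0001")
book.attributes.append(Attribute("pages", "306"))

shirt = Product("Plain T-Shirt", "Example Apparel")

print(book.get_attribute("pages"))   # 306
print(shirt.get_attribute("pages"))  # None
```

Because attributes live in a list rather than in fixed fields, a product with no extra data simply carries an empty list.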

This gives you the flexibility to add new attributes over time without requiring a new data schema or a code rewrite. When deciding how to store these attributes in the database, you can write JSON to an attribute field or store each attribute in a separate table with a product ID.
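
To make the two storage options concrete, here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The table and column names are assumptions chosen for the example, and a real project would pick one option rather than both.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (
        id INTEGER PRIMARY KEY,
        title TEXT NOT NULL,
        manufacturer TEXT NOT NULL,
        attributes_json TEXT            -- option 1: one JSON document per product
    );
    CREATE TABLE attributes (           -- option 2: one row per attribute
        product_id INTEGER REFERENCES products(id),
        name TEXT NOT NULL,
        value TEXT NOT NULL
    );
""")

# Hypothetical sample attributes
attrs = {"pages": "306", "colour": "blue"}

cur = conn.execute(
    "INSERT INTO products (title, manufacturer, attributes_json) VALUES (?, ?, ?)",
    ("Example Book", "Example Press", json.dumps(attrs)),
)
product_id = cur.lastrowid
conn.executemany(
    "INSERT INTO attributes (product_id, name, value) VALUES (?, ?, ?)",
    [(product_id, name, value) for name, value in attrs.items()],
)

# Option 1: read the JSON back and decode it in Python
row = conn.execute(
    "SELECT attributes_json FROM products WHERE id = ?", (product_id,)
).fetchone()
from_json = json.loads(row[0])

# Option 2: query a single attribute directly in SQL
pages = conn.execute(
    "SELECT value FROM attributes WHERE product_id = ? AND name = ?",
    (product_id, "pages"),
).fetchone()[0]

print(from_json["colour"], pages)
```

The JSON field is simpler to write, while the separate table makes it easier to filter or aggregate on a single attribute in SQL.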

To keep track of the prices found for each product, you'll need the following:
- Product ID
- Store ID
- Price 
- Date/Timestamp price was found at.
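
As a rough sketch, a price record with these four fields, plus a helper that finds the most recent price per store, might look like this. The names `PriceObservation` and `latest_prices` and all sample values are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal
from typing import Dict, Iterable


@dataclass(frozen=True)
class PriceObservation:
    product_id: str
    store_id: str
    price: Decimal      # Decimal avoids floating-point rounding on money
    found_at: datetime  # when the price was seen


def latest_prices(observations: Iterable[PriceObservation],
                  product_id: str) -> Dict[str, PriceObservation]:
    """Return the most recent observation per store for one product."""
    latest: Dict[str, PriceObservation] = {}
    for obs in observations:
        if obs.product_id != product_id:
            continue
        current = latest.get(obs.store_id)
        if current is None or obs.found_at > current.found_at:
            latest[obs.store_id] = obs
    return latest


# Hypothetical sample data: two stores, one of which lowered its price later
jan = datetime(2024, 1, 1, tzinfo=timezone.utc)
feb = datetime(2024, 2, 1, tzinfo=timezone.utc)
observations = [
    PriceObservation("EX-0001", "store-a", Decimal("19.99"), jan),
    PriceObservation("EX-0001", "store-b", Decimal("18.49"), jan),
    PriceObservation("EX-0001", "store-a", Decimal("17.99"), feb),
]

latest = latest_prices(observations, "EX-0001")
cheapest = min(latest.values(), key=lambda obs: obs.price)
print(cheapest.store_id, cheapest.price)  # store-a 17.99
```

Keeping every observation, instead of overwriting the previous price, is what makes tracking prices over time possible.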

What if the product's attributes modify the price of the product?
For instance, some stores might charge more for a large T-shirt than for a small one. For this, you may consider splitting the single T-shirt into separate product listings for each size, or creating a new item type to store information about instances of a product, containing these fields:
- Product ID
- Instance type (e.g. size of shirt)

and each price would look like:
- Product instance ID
- Store ID
- Price
- Date/Timestamp price was found at.
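
A hedged sketch of the instance-based variant follows. `ProductInstance`, `InstancePrice`, the `prices_by_instance_type` helper, and the sample sizes and prices are hypothetical names and values used only to show how prices hang off instances rather than products.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal
from typing import Dict, List


@dataclass(frozen=True)
class ProductInstance:
    instance_id: str
    product_id: str
    instance_type: str  # e.g. "size=L"


@dataclass(frozen=True)
class InstancePrice:
    instance_id: str    # points at a ProductInstance, not at the product
    store_id: str
    price: Decimal
    found_at: datetime


def prices_by_instance_type(product_id: str,
                            instances: List[ProductInstance],
                            prices: List[InstancePrice]) -> Dict[str, List[Decimal]]:
    """Group every recorded price for one product by instance type."""
    type_of = {i.instance_id: i.instance_type
               for i in instances if i.product_id == product_id}
    grouped: Dict[str, List[Decimal]] = {}
    for price in prices:
        if price.instance_id in type_of:
            grouped.setdefault(type_of[price.instance_id], []).append(price.price)
    return grouped


# Hypothetical sample data: one shirt sold in two sizes at one store
seen = datetime(2024, 1, 1, tzinfo=timezone.utc)
instances = [
    ProductInstance("shirt-S", "shirt", "size=S"),
    ProductInstance("shirt-L", "shirt", "size=L"),
]
prices = [
    InstancePrice("shirt-S", "store-a", Decimal("9.99"), seen),
    InstancePrice("shirt-L", "store-a", Decimal("11.99"), seen),
]

grouped = prices_by_instance_type("shirt", instances, prices)
print(grouped)
```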

These basic questions, and the logic used when designing your Python objects, apply in almost every situation.


The data model is the underlying foundation of all the code that uses it. It is vital to give serious thought and planning to what you need to collect and how to store it.



### Dealing with different website layouts
Google is able to extract useful data from a wide variety of websites with no upfront knowledge of the structure of those websites.

Humans can readily identify the title and main content of a page, but it is far more difficult to get a bot to do the same thing.

Fortunately, in most cases of web crawling, you are not collecting data from sites you have never seen before, but from a few, or a few dozen, websites that are preselected by a human. This means that you don't need complicated algorithms or machine learning to detect which text on the page "looks most like a title" or which is probably the "main content". `You determine what these elements are manually.`

The most obvious approach is to write a separate web crawler or page parser for each website. Each might take in a URL, string, or BeautifulSoup object and return a Python object for the thing that was scraped.

The following is an example of a `Content` class (representing a piece of content on a website, such as a news article) and two scraper functions that each take in a URL, fetch the page as a BeautifulSoup object, and return an instance of `Content`:

In [7]:
import requests
from bs4 import BeautifulSoup


class Content:
    """A piece of content (such as a news article) scraped from a page."""
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body


def getPage(url):
    """Fetch a URL and return its HTML parsed as a BeautifulSoup object."""
    req = requests.get(url)
    return BeautifulSoup(req.text, 'html.parser')


def scrapeNYTimes(url):
    bs = getPage(url)
    title = bs.find("h1").text
    # The article body is spread across several <p class="story-content"> tags
    lines = bs.find_all("p", {"class": "story-content"})
    body = '\n'.join([line.text for line in lines])
    return Content(url, title, body)


def scrapeBrookings(url):
    bs = getPage(url)
    title = bs.find("h1").text
    # Attribute filters are passed as a dictionary of key/value pairs
    body = bs.find("div", {"class": "post-body"}).text
    return Content(url, title, body)

url = 'https://www.brookings.edu/blog/future-development/2018/01/26/delivering-inclusive-urban-access-3-uncomfortable-truths/'

content = scrapeBrookings(url)
print('Title : {}'.format(content.title))
print('URL : {}\n'.format(content.url))
print(content.body)

url = 'https://www.nytimes.com/2018/01/25/opinion/sunday/silicon-valley-immortality.html'
content = scrapeNYTimes(url)
print('Title: {}'.format(content.title))
print('URL: {}\n'.format(content.url))
print(content.body)

Title : Delivering inclusive urban access: 3 uncomfortable truths
URL : https://www.brookings.edu/blog/future-development/2018/01/26/delivering-inclusive-urban-access-3-uncomfortable-truths/


The past few decades have been filled with a deep optimism about the role of cities and suburbs across the world. These engines of economic growth host a majority of world population, are major drivers of economic innovation, and have created pathways to opportunities for untold amounts of people.

Jeffrey Gutman
Nonresident Senior Fellow - Global Economy and Development

Adie Tomer
Fellow - Metropolitan Policy Program

 Twitter
AdieTomer

But all is not well within our so-called Urban Century. Rapid urbanization, rising gentrification, concentrated poverty, and shortages of basic infrastructure have combined to create spatial inequity in cities and suburbs across the globe. The challenges of housing, moving, and employing so many people have led to longer travel times, rising ho

Title: The Men Who Want to Live Forever
URL: https://www.nytimes.com/2018/01/25/opinion/sunday/silicon-valley-immortality.html




As you write scraper functions for additional news sites, you will notice a pattern. Every site's parsing function does essentially the same thing:
- Selects the title element and extracts the text for the title.
- Selects the main content of the article.
- Selects other content items as needed.
- Returns a `Content` object instantiated with the strings found previously.

The only real site-dependent variables here are the CSS selectors used to obtain each piece of information. BeautifulSoup's `find` and `find_all` functions take two arguments, a tag string and a dictionary of key/value attributes, so you can pass these in as parameters that define the structure of the site itself and the location of the target data.

To make things more convenient, rather than dealing with all of these tag arguments and key/value pairs, you can use the BeautifulSoup `select` function with a single CSS selector string.
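
As a quick illustration of the difference, the snippet below runs the same query both ways against a small, made-up HTML fragment (the markup is an assumption for the example, not taken from any real site). It relies on the third-party `beautifulsoup4` package.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment used only for this comparison
html = """
<html><body>
  <h1>Example headline</h1>
  <div class="post-body"><p>First paragraph.</p><p>Second paragraph.</p></div>
</body></html>
"""
bs = BeautifulSoup(html, "html.parser")

# Tag name plus a dictionary of attributes
by_find = bs.find_all("div", {"class": "post-body"})

# The same query expressed as one CSS selector string
by_select = bs.select("div.post-body")

find_text = [elem.get_text() for elem in by_find]
select_text = [elem.get_text() for elem in by_select]
print(find_text == select_text)

title = bs.select("h1")[0].get_text()
print(title)
```

Because the selector is a single string, it can be stored as plain data alongside the site's name and URL.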

In [2]:
class Content:
    """
    Common base class for all articles/pages
    """
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body
    def print(self):
        """
        Flexible printing function controls output
        """
        print("URL: {}".format(self.url))
        print("TITLE: {}".format(self.title))
        print("BODY:\n{}".format(self.body))
class Website:
    """
    Contains information about website structure
    """
    def __init__(self, name, url, titleTag, bodyTag):
        self.name = name
        self.url = url
        self.titleTag = titleTag
        self.bodyTag = bodyTag

The `Website` class does not store information collected from individual pages; it stores instructions for how to collect that data.

It doesn't store the title "My Page Title"; it stores the tag string `h1` that indicates where titles can be found. This is why the class is called `Website` and not `Content`.

Using the above `Content` and `Website` classes, you can then write a `Crawler` to scrape the title and content of any URL provided for a web page from a given website:


In [13]:
import requests
from bs4 import BeautifulSoup

class Crawler:
    def getPage(self,url):
        try:
            req = requests.get(url)
        except requests.exceptions.RequestException:
            return None
        return BeautifulSoup(req.text,'html.parser')
    def safeGet(self,pageObj, selector):
        '''
        Utility function used to get a content string from a BeautifulSoup
        object and a selector. Returns an empty string if no object is found
        for the given selector.
        '''
        selectedElems = pageObj.select(selector)
        if selectedElems is not None and len(selectedElems) > 0 :
            return '\n'.join([elem.get_text() for elem in selectedElems])
        return ''
    def parse(self,site,url):
        """
        Extract content from a given page URL
        """
        bs = self.getPage(url)
        if bs is not None:
            title = self.safeGet(bs,site.titleTag)
            body = self.safeGet(bs, site.bodyTag)
            if title != '' and body != '':
                content = Content(url, title, body)
                #content.print()
                print(content)

In [14]:
# Define the website objects and kick off the process:
crawler = Crawler()
siteData = [
['O\'Reilly Media', 'http://oreilly.com',
'h1', 'section#product-description'],
['Reuters', 'http://reuters.com', 'h1',
'div.StandardArticleBody_body_1gnLA'],
['Brookings', 'http://www.brookings.edu',
'h1', 'div.post-body'],
['New York Times', 'http://nytimes.com',
'h1', 'p.story-content']
]

websites = []
for row in siteData:
    websites.append(Website(row[0],row[1],row[2],row[3]))

crawler.parse(websites[0], 'http://shop.oreilly.com/product/'\
'0636920028154.do')
crawler.parse(websites[1], 'http://www.reuters.com/article/'\
'us-usa-epa-pruitt-idUSKBN19W2D0')
crawler.parse(websites[2], 'https://www.brookings.edu/blog/'\
'techtank/2016/03/01/idea-to-retire-old-methods-of-policy-education/')
crawler.parse(websites[3], 'https://www.nytimes.com/2018/01/'\
'28/business/energy-environment/oil-boom.html')


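<__main__.Content object at 0x7f616bb47c50>

Only one line of output appears, and it is the default object representation rather than the article text. `parse` prints only when a page loads and both the title and body selectors match, so three of the four pages produced nothing. And because it calls `print(content)` instead of `content.print()`, the one success is shown as a bare `Content` object. Swap in the commented-out `content.print()` call to see the URL, title, and body.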


At first glance, this new method might not seem remarkably simpler than writing a new Python function for each new website, but imagine what happens when you go from a system with 4 website sources to one with 20 or 200.
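
One way to picture that scaling: since each site is now described by four strings, the site list can live in a data file instead of in code. The sketch below assumes a CSV layout whose column names match the `Website` constructor; the `load_websites` helper and the inline CSV text are illustrative, and `Website` is repeated here only to keep the example self-contained.

```python
import csv
import io


class Website:
    """Same shape as the Website class above: how to find data on a site."""
    def __init__(self, name, url, titleTag, bodyTag):
        self.name = name
        self.url = url
        self.titleTag = titleTag
        self.bodyTag = bodyTag


# In practice this text would come from a file; it is inlined for the example
SITE_CSV = """name,url,titleTag,bodyTag
Brookings,http://www.brookings.edu,h1,div.post-body
New York Times,http://nytimes.com,h1,p.story-content
"""


def load_websites(csv_text):
    """Build one Website per CSV row; column names must match the constructor."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [Website(**row) for row in reader]


websites = load_websites(SITE_CSV)
for site in websites:
    print(site.name, site.titleTag, site.bodyTag)
```

Adding a new source then means adding one row, with no new parsing function.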

Of course, the downside is that you are giving up a certain amount of flexibility. In
the first example, each website gets its own free-form function to select and parse
HTML however necessary, in order to get the end result. In the second example, each
website needs to have a certain structure in which fields are guaranteed to exist, data
must be clean coming out of the field, and each target field must have a unique and
reliable CSS selector.
However, I believe that the power and relative flexibility of this approach more than
makes up for its real or perceived shortcomings.

The next section covers specific
applications and expansions of this basic template so that you can, for example, deal with missing fields, collect different types of data, crawl only through specific parts of a website, and store more-complex information about pages.

### Structuring Crawlers

Effective scraping requires methods for crawling through websites and finding new pages in an automated way.

This section shows how to incorporate these methods into a well-structured and expandable website crawler that can gather links and discover data automatically. We'll go through three basic web crawler structures that apply to the majority of situations. Ryan hopes these structures inspire you to create elegant and robust crawler designs.

These structures are:
- Crawling through search
- Crawling through links
- Crawling multiple page types

#### Crawling sites through search.

Crawling a website through its search function is one of the easiest approaches.

Although the process of searching a website for a keyword or topic and collecting a list of search results may seem difficult, it is easy in practice because:

- Most sites retrieve a list of search results for a particular topic by passing that topic as a string through a parameter in the URL, for example `http://example.com?search=myTopic`. The first part of this URL, `http://example.com?search=`, can be saved as a property of the website object, and the topic simply appended to it.
- After searching, most sites present the results as an easily identifiable list of links, usually with a convenient surrounding tag such as `<span class="result">`, the exact format of which can also be stored as a property of the website object.
- Each result link is either a relative URL (`/articles/page.html`) or an absolute URL (`http://example.com/articles/page.html`). Which of the two to expect can also be stored as a property of the website object.
- Once you have located and normalized the URLs on the search page, you have reduced the problem to the example in the previous section: extracting data from a page, given a website format.
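
The first and third points can be sketched with the standard library alone. The function names and the `absolute_url` flag below are hypothetical; they stand in for whatever properties the website object ends up storing.

```python
from urllib.parse import quote_plus, urljoin


def build_search_url(search_base, topic):
    """Append the URL-encoded topic to the site's stored search prefix."""
    return search_base + quote_plus(topic)


def normalize_result_url(site_url, link, absolute_url):
    """Return a full URL for a search result link.

    absolute_url says whether this site's result links are already
    absolute; relative links are resolved against the site's base URL.
    """
    if absolute_url:
        return link
    return urljoin(site_url, link)


search_url = build_search_url("http://example.com?search=", "web scraping")
print(search_url)  # http://example.com?search=web+scraping

full_url = normalize_result_url("http://example.com", "/articles/page.html", False)
print(full_url)    # http://example.com/articles/page.html
```

`quote_plus` keeps topics containing spaces or punctuation from producing a malformed URL.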
    


##### Implementation
The `Content` class is much the same as before. It keeps the `url` property to track where the content was found and adds a `topic` property to record the search term that led to it.

The `Website` class has a few new properties. `SearchUrl` defines where you should go to get search results if you append the topic you are looking for. `resultingListing` defines the box that holds info about

In [7]:
class Content:
    """Common base class for all articles/pages"""
    def __init__(self, topic, url, title, body):
        self.topic = topic
        self.title = title
        self.body = body
        self.url = url
    def print(self):
        """
        Flexible printing function controls output
        """
        print("New article found for topic: {}".format(self.topic))
        print("TITLE: {}".format(self.title))
        print("BODY:\n{}".format(self.body))
        print("URL: {}".format(self.url))