Scraping presents unique organisational challenges: one website's h1 tag may contain the title of an article, while another's contains the web page title.

You may be asked to scrape product prices from different websites, with the ultimate aim of comparing prices for the same product.

Large, scalable crawlers tend to fall into one of several patterns. Learning these patterns and the situations they apply to can improve the maintainability and robustness of your crawlers.

We'll focus on web crawlers that collect a limited number of "types" of data from websites and store these data types as Python objects that read from and write to a database.

### Planning and Defining Objects
It is easy to fall into the trap of trying to track all, or almost all, of the product-related fields that appear across multiple sites. This results in messy, hard-to-read datasets: the paradox of choice.

One good, scalable approach when deciding which data to collect is to ignore the websites and ask yourself, 'What do I need?' For example, suppose you want to compare product prices among multiple stores and track those prices over time. In that case you need just enough information to uniquely identify the product across the stores, and that's it:
- Product title
- Manufacturer
- Product ID no. (if available/relevant)

The more detailed information, e.g. price, reviews, etc., would be specific to a particular store and is thus stored separately.
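As a minimal sketch of this idea, the store-independent product could be modelled as a small dataclass. The class name, field names, and sample values below are illustrative assumptions, not part of any particular library or site.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Product:
    """Store-independent identity of a product (field names are illustrative)."""
    title: str
    manufacturer: str
    product_id: Optional[str] = None  # only if available/relevant


# The same product found on two different stores compares equal,
# because store-specific details such as price are not part of it.
from_store_a = Product("Widget 3000", "Acme", "ACME-3000")
from_store_b = Product("Widget 3000", "Acme", "ACME-3000")
print(from_store_a == from_store_b)  # True
```

Making the dataclass frozen also makes it hashable, so products can be deduplicated in a set or used as dictionary keys.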

Ask yourself:
- Will this information help with the project goals? Is it essential, or merely nice to have?

- If it may help in the future, how difficult will it be to go back and collect the data at a later time?

- Does it make logical sense to store the data within a particular object?

If so:
- Is this data sparse or dense, i.e. is it available on every site and for every product, or only for some (e.g. colour)?

- How large is the data?

- If the data is large, will I need to retrieve it constantly or only on occasion?

- How variable is the type of data?

Let's say you plan to do some meta-analysis around product attributes and prices, for example the number of pages a book has. You notice that this data is sparse, so it may make sense to create a product type that looks like this:

- Product title
- Manufacturer
- Product ID
- Attributes (optional list or dict)

and an attribute type that looks like this:
- Attribute name
- Attribute value

This gives you the flexibility to add new attributes over time without requiring a new data schema or rewriting code. When deciding how to store these attributes in the database, you can write them as JSON to an attribute field, or store each attribute in a separate table with a product ID.
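The two storage options can be sketched with the standard-library `sqlite3` module. This is only an illustration under assumed names: the table names, column names, and the sample book are hypothetical, and a real project would pick one layout rather than both.

```python
import json
import sqlite3
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Product:
    title: str
    manufacturer: str
    product_id: Optional[str] = None
    attributes: Dict[str, str] = field(default_factory=dict)  # sparse, optional data


book = Product("Example Book", "Example Press", "EX-001",
               attributes={"pages": "320", "cover": "paperback"})

conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only

# Option 1: serialize all attributes into one JSON text column.
conn.execute("CREATE TABLE product_json (product_id TEXT PRIMARY KEY, attributes TEXT)")
conn.execute("INSERT INTO product_json VALUES (?, ?)",
             (book.product_id, json.dumps(book.attributes)))

# Option 2: one row per attribute, linked to the product by its ID.
conn.execute("CREATE TABLE attribute (product_id TEXT, name TEXT, value TEXT)")
conn.executemany("INSERT INTO attribute VALUES (?, ?, ?)",
                 [(book.product_id, name, value)
                  for name, value in book.attributes.items()])

# Read the attributes back from each layout.
stored = conn.execute("SELECT attributes FROM product_json WHERE product_id = ?",
                      (book.product_id,)).fetchone()[0]
from_json = json.loads(stored)
from_rows = dict(conn.execute("SELECT name, value FROM attribute WHERE product_id = ?",
                              (book.product_id,)).fetchall())
print(from_json == from_rows == book.attributes)  # True
```

The JSON column is simpler to write, while the separate table makes it easier to query by attribute name or value in SQL.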

To keep track of the prices found for each product, you'll need the following:
- Product ID
- Store ID
- Price
- Date/timestamp the price was found at
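A price observation with these four fields might be sketched as follows. The class name, the helper `price_history`, and the sample records are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class PriceRecord:
    product_id: str
    store_id: str
    price: float        # decimal.Decimal may be preferable for real currency values
    found_at: datetime  # when the crawler saw this price


def price_history(records, product_id, store_id):
    """Return one product's prices at one store, oldest first."""
    matching = [r for r in records
                if r.product_id == product_id and r.store_id == store_id]
    return sorted(matching, key=lambda r: r.found_at)


records = [
    PriceRecord("EX-001", "store-b", 11.50, datetime(2018, 1, 2)),
    PriceRecord("EX-001", "store-a", 12.99, datetime(2018, 1, 8)),
    PriceRecord("EX-001", "store-a", 13.49, datetime(2018, 1, 1)),
]
history = price_history(records, "EX-001", "store-a")
print([r.price for r in history])  # [13.49, 12.99]
```

Because every observation is a separate record, tracking a price over time is just a matter of filtering and sorting by timestamp.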

What if a product's attributes modify its price?
For instance, some stores might charge more for large T-shirts than for smaller ones. In that case you might split the single T-shirt into separate product listings for each size, or create a new item type to store information about instances of a product, containing these fields:
- Product ID
- Instance type (e.g. size of shirt)

and each price would look like this:
- Product instance ID
- Store ID
- Price
- Date/timestamp the price was found at
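One way to sketch the product-instance variant is shown below. The `instance_id` field is an assumption (the price needs something to refer to), and the class names, helper function, and sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ProductInstance:
    instance_id: str    # assumed identifier that prices refer to
    product_id: str
    instance_type: str  # e.g. the size of a shirt


@dataclass(frozen=True)
class InstancePrice:
    instance_id: str
    store_id: str
    price: float
    found_at: datetime


def prices_by_instance_type(product_id, instances, prices):
    """Map each instance type of a product to the prices recorded for it."""
    type_by_id = {i.instance_id: i.instance_type
                  for i in instances if i.product_id == product_id}
    result = {}
    for p in prices:
        if p.instance_id in type_by_id:
            result.setdefault(type_by_id[p.instance_id], []).append(p.price)
    return result


instances = [
    ProductInstance("TS-1-S", "TS-1", "size=S"),
    ProductInstance("TS-1-L", "TS-1", "size=L"),
]
prices = [
    InstancePrice("TS-1-S", "store-a", 9.99, datetime(2018, 1, 1)),
    InstancePrice("TS-1-L", "store-a", 11.99, datetime(2018, 1, 1)),
    InstancePrice("OTHER-1", "store-a", 5.00, datetime(2018, 1, 1)),
]
print(prices_by_instance_type("TS-1", instances, prices))
```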

These basic questions, and the logic used when designing your Python objects, apply in almost every situation.


The data model is the underlying foundation of all the code that uses it. It is vital to give serious thought and planning to what you need to collect and how to store it.



### Dealing with different website layouts
Google is able to extract useful data about a variety of websites, having no upfront knowledge about the website structure itself.

Humans can readily identify the title and main content of a page; it is much more difficult to get a bot to do the same thing.

Fortunately, in most cases of web crawling, the data is not being collected from sites you have never seen before, but from a few, or a few dozen, websites that are preselected by humans. This means that you don't need to use complicated algorithms or machine learning to detect which text on the page "looks most like a title" or which is probably the "main content". `You determine what these elements are manually.`

The most obvious approach is to write a separate web crawler or page parser for each website. Each might take in a URL, string, or BeautifulSoup object and return a Python object for the thing that was scraped.

The following is an example of a Content class (representing a piece of content on a website, such as a news article) and two scraper functions that take in a URL, fetch the page as a BeautifulSoup object, and return an instance of Content:

In [None]:
import requests
from bs4 import BeautifulSoup


class Content:
    """A piece of content on a website, such as a news article."""
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body


def getPage(url):
    """Download a page and parse it into a BeautifulSoup object."""
    req = requests.get(url)
    return BeautifulSoup(req.text, 'html.parser')


def scrapeNYTimes(url):
    bs = getPage(url)
    title = bs.find('h1').text
    # The article body is spread over several <p class="story-content"> tags
    lines = bs.find_all('p', {'class': 'story-content'})
    body = '\n'.join([line.text for line in lines])
    return Content(url, title, body)


def scrapeBrookings(url):
    bs = getPage(url)
    title = bs.find('h1').text
    # The attribute filter must be a dict ({'class': ...}), not a set
    body = bs.find('div', {'class': 'post-body'}).text
    return Content(url, title, body)


url = 'https://www.brookings.edu/blog/future-development/2018/01/26/delivering-inclusive-urban-access-3-uncomfortable-truths/'
content = scrapeBrookings(url)
print('Title: {}'.format(content.title))
print('URL: {}\n'.format(content.url))
print(content.body)

url = 'https://www.nytimes.com/2018/01/25/opinion/sunday/silicon-valley-immortality.html'
content = scrapeNYTimes(url)
print('Title: {}'.format(content.title))
print('URL: {}\n'.format(content.url))
print(content.body)

As you add scraper functions for additional news sites, you will notice a pattern. Every site's parsing function does essentially the same thing:
- Selects the title element and extracts the text for the title.
- Selects the main content of the article.
- Selects other content items as needed.
- Returns a Content object instantiated with the strings found previously.
    
The only real site-dependent variables here are the CSS selectors used to obtain each piece of information. BeautifulSoup’s find and find_all functions take two arguments, a tag string and a dictionary of key/value attributes, so you can pass these in as parameters that define the structure of the site itself and the location of the target data.
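A rough sketch of that idea follows: the tag strings and attribute dictionaries move into a small per-site configuration object, and a single function does the parsing. `SiteSelectors` and `extract_title_and_body` are hypothetical names, and `bs` is assumed to be a BeautifulSoup object such as the one returned by `getPage`.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SiteSelectors:
    """Hypothetical per-site configuration: where the title and body live."""
    title_tag: str
    body_tag: str
    title_attrs: Dict[str, str] = field(default_factory=dict)
    body_attrs: Dict[str, str] = field(default_factory=dict)


def extract_title_and_body(bs, site):
    """Pull the title and body text out of a parsed page.

    `bs` is assumed to be a BeautifulSoup object (or anything exposing
    compatible find and find_all methods).
    """
    title_el = bs.find(site.title_tag, site.title_attrs)
    body_els = bs.find_all(site.body_tag, site.body_attrs)
    # Fall back to an empty title instead of failing when nothing matches.
    title = title_el.text if title_el is not None else ''
    body = '\n'.join(el.text for el in body_els)
    return title, body


# Tag and attribute values taken from the two scraper functions above.
NYTIMES = SiteSelectors('h1', 'p', body_attrs={'class': 'story-content'})
BROOKINGS = SiteSelectors('h1', 'div', body_attrs={'class': 'post-body'})
```

With a page fetched by `getPage`, a call might look like `title, body = extract_title_and_body(getPage(url), BROOKINGS)`. Supporting a new site then means adding one more `SiteSelectors` value rather than writing another function.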