# Chapter 4

In chapter 4 of "Web Scraping with Python" you learned about the following subjects:
* Planning and Defining Objects
* Handling different layouts
* Crawling through search
* Crawling through links
* Defining sites through different mechanisms

The following cells let you practice the topics listed above. For any suggestions, contact *gabriel.vasconcelos@usp.br*

Use the website https://faithful-ray-costume.cyclic.app/ and the other requested sites to complete this notebook.

In [2]:
# Import BeautifulSoup and other libraries you find useful
from bs4 import BeautifulSoup
import requests

### a.
Build a flexible crawler based on the search method.

In [3]:
class Content:
    def __init__(self, title, author, price, cover):
        self.title = title
        self.author = author
        self.price = price
        self.cover = cover
        
    def __str__(self):
        return f'''==== Book ====
        Title: {self.title}
        Author: {self.author}
        Price: {self.price}
        Cover: {self.cover}
        '''

In [4]:
class Website:
    def __init__(self, name, url, searchUrl, resultListing,
                 resultUrl, absoluteUrl, titleTag, authorTag,
                 priceTag, coverTag):
        self.name = name
        self.url = url
        self.searchUrl = searchUrl
        self.resultListing = resultListing
        self.resultUrl = resultUrl
        self.absoluteUrl = absoluteUrl
        self.titleTag = titleTag
        self.authorTag = authorTag
        self.priceTag = priceTag
        self.coverTag = coverTag
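
`Website` takes ten positional arguments, so swapping two selectors by accident is easy and hard to spot. As a minimal sketch (not part of the original exercise), the same fields could be declared with a dataclass and filled in by keyword. The name `SiteConfig` is hypothetical; the field names and values mirror the ones used in this notebook.

```python
from dataclasses import dataclass


@dataclass
class SiteConfig:
    """Hypothetical keyword-friendly stand-in for the Website class."""
    name: str
    url: str
    searchUrl: str
    resultListing: str
    resultUrl: str
    absoluteUrl: bool
    titleTag: str
    authorTag: str
    priceTag: str
    coverTag: str


# Each selector is named at the call site, so a misplaced value stands out
livraria = SiteConfig(
    name='Livraria',
    url='https://faithful-ray-costume.cyclic.app',
    searchUrl='https://faithful-ray-costume.cyclic.app/?search=',
    resultListing='.book-list',
    resultUrl='a',
    absoluteUrl=False,
    titleTag='.book-div h1',
    authorTag='.book-title span',
    priceTag='.basic b span',
    coverTag='img',
)
print(livraria.titleTag)
```

Because the attribute names match those of `Website`, an object like this could be passed to `Crawler.search` without further changes.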

In [21]:
class Crawler:
    def getPage(self, url):
        try:
            req = requests.get(url)
        except requests.exceptions.RequestException:
            print("It wasn't possible to access this URL")
            return None
        return BeautifulSoup(req.text, 'html.parser')
    
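    def safeGet(self, pageObj, selector):
        # Return the first match's text (or src for images), or "" if nothing matches
        childObj = pageObj.select(selector)
        if childObj:
            if childObj[0].name == 'img':
                # .get avoids a KeyError when the img tag has no src attribute
                return childObj[0].attrs.get('src', '')
            return childObj[0].get_text()
        return ""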
    
    def search(self, topic, site: Website):
        bs = self.getPage(site.searchUrl + topic)
        if bs is None:
            return
        searchResults = bs.select(site.resultListing)
        if not searchResults:
            print("No result listing found for this search")
            return
        # Only direct children of the listing are treated as result links
        for result in searchResults[0].find_all(site.resultUrl, recursive=False):
            # Skip a result without an href instead of aborting the whole search
            url = result.attrs.get("href")
            if url is None:
                print("URL not found in crawler")
                continue
            if site.absoluteUrl:
                bs = self.getPage(url)
            else:
                bs = self.getPage(site.url + url)
            if bs is None:
                print("Something was wrong with that page or URL. Skipping!")
                continue
            title = self.safeGet(bs, site.titleTag)
            author = self.safeGet(bs, site.authorTag)
            price = self.safeGet(bs, site.priceTag)
            cover = self.safeGet(bs, site.coverTag)
            content = Content(title, author, price, cover)
            print(content)
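
The `search` method builds URLs by plain string concatenation, which assumes the search term needs no escaping and that relative links always start with `/`. A hedged sketch of an alternative using only the standard library: `quote_plus` escapes the search term and `urljoin` resolves both relative and absolute links, so a flag like `absoluteUrl` would not be needed. The helper names and the `/book/1` path are hypothetical.

```python
from urllib.parse import quote_plus, urljoin


def build_search_url(search_url, topic):
    # Escape spaces and special characters in the search term
    return search_url + quote_plus(topic)


def resolve_link(base_url, href):
    # urljoin returns href unchanged when it is already absolute
    return urljoin(base_url, href)


base = 'https://faithful-ray-costume.cyclic.app'
print(build_search_url(base + '/?search=', 'data science'))
print(resolve_link(base, '/book/1'))                     # hypothetical relative link
print(resolve_link(base, 'https://example.com/book/1'))  # already absolute
```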
            

### b.

Test it against these websites:
* https://scraping-cap4.herokuapp.com/
* https://www.amazon.com.br/
* https://www3.livrariacultura.com.br/

Obtain the following data:
* Book title
* Book author
* Book price
* Book cover
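
Before pointing the crawler at a live site, it can help to check the CSS selectors for these four fields against a small piece of HTML. The sketch below is only an illustration: `SAMPLE_HTML` is made-up markup shaped to match the selectors used later in this notebook, not the real page, and `first_match` is a hypothetical standalone version of the lookup logic.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, shaped to match the selectors; not the real page
SAMPLE_HTML = """
<div class="book-div">
  <h1>Example Title</h1>
  <p class="book-title">by <span>Example Author</span></p>
  <p class="basic"><b><span>R$ 10,00</span></b></p>
  <img src="https://example.com/cover.jpg">
</div>
"""


def first_match(page, selector):
    # Return the first match's text, or its src for images, or "" if absent
    matches = page.select(selector)
    if not matches:
        return ""
    node = matches[0]
    if node.name == "img":
        return node.get("src", "")
    return node.get_text(strip=True)


page = BeautifulSoup(SAMPLE_HTML, "html.parser")
extracted = {
    "title": first_match(page, ".book-div h1"),
    "author": first_match(page, ".book-title span"),
    "price": first_match(page, ".basic b span"),
    "cover": first_match(page, "img"),
}
print(extracted)
```

A selector that matches nothing yields an empty string, which makes a wrong selector easy to spot in the printed dictionary.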

In [8]:
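# Quick check of argument unpacking: * spreads a list into positional
# arguments, the same mechanism used by Website(*site) further down
def soma(a, b):
    return a + b

soma(*[1, 2])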

3

In [23]:
# Code below

crawler = Crawler()

siteData = [
    ['Livraria', 'https://faithful-ray-costume.cyclic.app', 'https://faithful-ray-costume.cyclic.app/?search=',
     '.book-list', 'a', False, '.book-div h1', '.book-title span', '.basic b span', 'img']
]

sites = [Website(*site) for site in siteData]

topics = ['estat', 'python']

for site in sites:
    for topic in topics:
        crawler.search(topic, site)

==== Book ====
        Title: Estatística básica
        Author: Pedro A. Morettin
        Price: R$ 104,90
        Cover: https://m.media-amazon.com/images/I/41pOrXotc-L._SY344_BO1,204,203,200_QL70_ML2_.jpg
        
==== Book ====
        Title: Data Science Do Zero: Noções Fundamentais com Python
        Author: Joel Grus
        Price: R$ 59,90
        Cover: https://m.media-amazon.com/images/I/51psvxQpAbS._SY344_BO1,204,203,200_QL70_ML2_.jpg
        
==== Book ====
        Title: Web Scraping com Python: Coletando Mais Dados da web Moderna
        Author: Ryan Mitchell
        Price: R$ 72,17
        Cover: https://m.media-amazon.com/images/I/41gx74x+gdL._SY344_BO1,204,203,200_.jpg
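
The prices above are scraped as display strings such as `R$ 104,90`. If they need to be compared or summed, one option is to convert them to `Decimal`. This is a minimal sketch, assuming the Brazilian format seen in the output (`R$` prefix, `.` as thousands separator, `,` as decimal separator); `parse_brl_price` is a hypothetical helper, not part of the crawler.

```python
from decimal import Decimal


def parse_brl_price(text):
    # Drop the currency symbol and surrounding whitespace
    number = text.replace('R$', '').strip()
    # Remove thousands separators, then switch the decimal comma to a dot
    number = number.replace('.', '').replace(',', '.')
    return Decimal(number)


prices = ['R$ 104,90', 'R$ 59,90', 'R$ 72,17']
total = sum(parse_brl_price(p) for p in prices)
print(total)
```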
        
