# <span style="color:#0066ff">Practical Exercises</span>

**1. Collect the following data from the page: https://books.toscrape.com**
- Catalog:
    + Classics
    + Science Fiction
    + Humor
    + Business

- Collect the following data for each book:
    + Book name
    + Price in pounds
    + Customer rating
    + In-stock availability

**2. Deliverable:**

- Write a plan for each of the business questions, containing:
    - Output: a mock-up of the final table and chart.
    - Process: the sequence of steps, ordered by execution logic.
    - Input: the link to the data sources.

- A CSV file with all the information from all the catalogs.

# <span style="color:#ff6600">Planning</span>

- As output, we want a table with the following structure:

book_name | price | rate | in_stock | category | date
----------|-------|------|----------|----------|------
The Secret Garden | 15.08 | 4 | Yes | Classics | 2022-01-20

- **Process**:
    + Import the packages needed for web scraping, cleaning, and organizing the data.
    + Visit each category and extract the required information.
    + Data source: https://books.toscrape.com

# <span style="color:#ff6600">Execution</span>

## <span style="color:#4da6ff">0. Packages</span>

In [1]:
import requests
import pandas as pd
import numpy as np
from datetime import datetime
import re
from bs4 import BeautifulSoup

## <span style="color:#4da6ff">1. Data extraction</span>

In [2]:
# API Requests
url = 'https://books.toscrape.com'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5)AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
page = requests.get( url, headers=headers )
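
The request above does not check whether the download succeeded. A minimal sketch of a more defensive fetch follows, assuming a hypothetical helper name `fetch_html` and an arbitrary 10-second timeout (neither is in the original notebook). The optional `session` argument is only there so the helper can be exercised without network access.

```python
import requests

# Same User-Agent idea as the cell above; the exact string is not important.
HEADERS = {'User-Agent': 'Mozilla/5.0'}


def fetch_html(url, session=None, timeout=10):
    """Download a page and return its HTML, raising on HTTP errors.

    `fetch_html`, `session` and the 10-second timeout are illustrative choices.
    """
    http = session if session is not None else requests
    response = http.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return response.text
```

A call such as `fetch_html('https://books.toscrape.com')` would then return the text that the notebook passes to `BeautifulSoup`.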

In [3]:
# Beautiful Soup object
soup = BeautifulSoup(page.text, 'html.parser')

In [4]:
soup

<!DOCTYPE html>

<!--[if lt IE 7]>      <html lang="en-us" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]>         <html lang="en-us" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]>         <html lang="en-us" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-us"> <!--<![endif]-->
<head>
<title>
    All products | Books to Scrape - Sandbox
</title>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta content="24th Jun 2016 09:29" name="created"/>
<meta content="" name="description"/>
<meta content="width=device-width" name="viewport"/>
<meta content="NOARCHIVE,NOCACHE" name="robots"/>
<!-- Le HTML5 shim, for IE6-8 support of HTML elements -->
<!--[if lt IE 9]>
        <script src="//html5shim.googlecode.com/svn/trunk/html5.js"></script>
        <![endif]-->
<link href="static/oscar/favicon.ico" rel="shortcut icon"/>
<link href="static/oscar/css/styles.css" rel="stylesheet" type="text/css"/>
<link href="s

In [5]:
soup.find_all('a')

[<a href="index.html">Books to Scrape</a>,
 <a href="index.html">Home</a>,
 <a href="catalogue/category/books_1/index.html">
                             
                                 Books
                             
                         </a>,
 <a href="catalogue/category/books/travel_2/index.html">
                             
                                 Travel
                             
                         </a>,
 <a href="catalogue/category/books/mystery_3/index.html">
                             
                                 Mystery
                             
                         </a>,
 <a href="catalogue/category/books/historical-fiction_4/index.html">
                             
                                 Historical Fiction
                             
                         </a>,
 <a href="catalogue/category/books/sequential-art_5/index.html">
                             
                                 Sequential Art
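
The anchors listed above carry relative links and padded text. The sketch below shows one way to turn them into absolute category URLs, assuming the link text matches the category name exactly once whitespace is stripped. `category_urls` and `sample_html` are hypothetical names, and the sample is a short stand-in modelled on the output above rather than the full page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = 'https://books.toscrape.com/'


def category_urls(html, wanted, base_url=BASE_URL):
    """Map each wanted category name to its absolute URL."""
    soup = BeautifulSoup(html, 'html.parser')
    urls = {}
    for link in soup.find_all('a', href=True):
        name = link.get_text(strip=True)  # drop the surrounding whitespace
        if name in wanted:
            urls[name] = urljoin(base_url, link['href'])
    return urls


# Tiny stand-in for the landing page, modelled on the anchors printed above.
sample_html = '''
<a href="index.html">Home</a>
<a href="catalogue/category/books/travel_2/index.html">
    Travel
</a>
<a href="catalogue/category/books/mystery_3/index.html">
    Mystery
</a>
'''

print(category_urls(sample_html, {'Travel', 'Mystery'}))
```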
            

### <span style="color:#4da6ff">1.1. Classics</span>

In [66]:
# API Requests
url = 'https://books.toscrape.com/catalogue/category/books/classics_6/index.html'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5)AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
page = requests.get( url, headers=headers )

soup = BeautifulSoup(page.text, 'html.parser')

products = soup.find_all('li', class_='col-xs-6 col-sm-4 col-md-3 col-lg-3')
# products
#len(products)

# ================= product_name =================
product_name = [p.find('h3').find('a').get_text() for p in products]
# product_name


# ================= product_price =================
price_init = [p.find('p', class_="price_color").get_text() for p in products]
aux_price = [re.findall(r'\d+', price) for price in price_init]
price = [p[0] + '.' + p[1] for p in aux_price]
price = [float(x) for x in price]
# price


# ================= in_stock =================
# Strip leading and trailing whitespace
def remove_first_end_spaces(string):
    return string.strip()

in_stock = [remove_first_end_spaces(p.find('p', class_='instock availability').get_text().replace("\n", " ")) for p in products]
# in_stock

# ================= rating =================
rating = [p.find('p')['class'][1] for p in products]
#rating

# ================= creating data frame =================
data_classics = pd.DataFrame([product_name , price, in_stock, rating]).T
data_classics.columns = ['title', 'price', 'stock', 'rating']

# genre
data_classics['genre'] = 'Classics'

# date
data_classics['date'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

# data_classics.head()
data_classics.shape

(19, 6)
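
The planned table stores `rate` as a number and `in_stock` as Yes/No, whereas the scraping code yields a class name such as `Four` and the text `In stock`. A minimal sketch of that conversion follows, assuming the rating is always one of the five English words from `One` to `Five`. `RATING_WORDS`, `rating_to_int`, and `stock_to_flag` are hypothetical names.

```python
RATING_WORDS = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}


def rating_to_int(word):
    """Convert a rating class name such as 'Four' to 4 (None if unknown)."""
    return RATING_WORDS.get(word)


def stock_to_flag(text):
    """Return 'Yes' when the availability text says the book is in stock."""
    return 'Yes' if 'in stock' in text.lower() else 'No'


print(rating_to_int('Four'))      # 4
print(stock_to_flag('In stock'))  # Yes
```

Applied to the frame, this could look like `data_classics['rating'].map(RATING_WORDS)`.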

### <span style="color:#4da6ff">1.2. Science Fiction</span>

In [70]:
# API Requests
url = 'https://books.toscrape.com/catalogue/category/books/science-fiction_16/index.html'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5)AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
page = requests.get( url, headers=headers )

soup = BeautifulSoup(page.text, 'html.parser')

sci_fi = soup.find_all('li', class_='col-xs-6 col-sm-4 col-md-3 col-lg-3')

# ================= product_name =================
product_name = [p.find('h3').find('a').get_text() for p in sci_fi]
# product_name


# ================= product_price =================
price_init = [p.find('p', class_="price_color").get_text() for p in sci_fi]
aux_price = [re.findall(r'\d+', price) for price in price_init]
price = [p[0] + '.' + p[1] for p in aux_price]
price = [float(x) for x in price]
# price


# ================= in_stock =================
# Strip leading and trailing whitespace
def remove_first_end_spaces(string):
    return string.strip()

in_stock = [remove_first_end_spaces(p.find('p', class_='instock availability').get_text().replace("\n", " ")) for p in sci_fi]
# in_stock

# ================= rating =================
rating = [p.find('p')['class'][1] for p in sci_fi]
#rating

# ================= creating data frame =================
data_sci_fi = pd.DataFrame([product_name , price, in_stock, rating]).T
data_sci_fi.columns = ['title', 'price', 'stock', 'rating']

# genre
data_sci_fi['genre'] = 'Scientific Fiction'  # note: the site's category name is "Science Fiction"

# date
data_sci_fi['date'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

data_sci_fi.head()

|    | title | price | stock | rating | genre | date |
|----|-------|-------|-------|--------|-------|------|
| 0 | Mesaerion: The Best Science ... | 37.59 | In stock | One | Scientific Fiction | 2022-01-25 21:44:22 |
| 1 | Join | 35.67 | In stock | Five | Scientific Fiction | 2022-01-25 21:44:22 |
| 2 | William Shakespeare's Star Wars: ... | 43.3 | In stock | Four | Scientific Fiction | 2022-01-25 21:44:22 |
| 3 | The Project | 10.65 | In stock | One | Scientific Fiction | 2022-01-25 21:44:22 |
| 4 | Soft Apocalypse | 26.12 | In stock | Two | Scientific Fiction | 2022-01-25 21:44:22 |
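
The price cleanup in the cell above splits the text into digit groups and joins the first two with a dot. A shorter sketch matches the decimal number directly, assuming every price contains a number with a decimal point (as in `£37.59`). `parse_price` is a hypothetical helper name, and the second sample string only illustrates that characters before the number are ignored.

```python
import re

PRICE_PATTERN = re.compile(r'\d+\.\d+')


def parse_price(text):
    """Extract the first decimal number from a price string as a float."""
    match = PRICE_PATTERN.search(text)
    if match is None:
        raise ValueError(f'no price found in {text!r}')
    return float(match.group())


print(parse_price('£37.59'))   # 37.59
print(parse_price('Â£43.30'))  # 43.3
```

Raising on a missing match is a design choice for this sketch: a silent `NaN` would hide a change in the page markup.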


### <span style="color:#4da6ff">1.3. Humor</span>

In [71]:
# API Requests
url = 'https://books.toscrape.com/catalogue/category/books/humor_30/index.html'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5)AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
page = requests.get( url, headers=headers )

soup = BeautifulSoup(page.text, 'html.parser')

humor = soup.find_all('li', class_='col-xs-6 col-sm-4 col-md-3 col-lg-3')

# ================= product_name =================
product_name = [p.find('h3').find('a').get_text() for p in humor]
# product_name


# ================= product_price =================
price_init = [p.find('p', class_="price_color").get_text() for p in humor]
aux_price = [re.findall(r'\d+', price) for price in price_init]
price = [p[0] + '.' + p[1] for p in aux_price]
price = [float(x) for x in price]
# price


# ================= in_stock =================
# Strip leading and trailing whitespace
def remove_first_end_spaces(string):
    return string.strip()

in_stock = [remove_first_end_spaces(p.find('p', class_='instock availability').get_text().replace("\n", " ")) for p in humor]
# in_stock

# ================= rating =================
rating = [p.find('p')['class'][1] for p in humor]
#rating

# ================= creating data frame =================
data_humor = pd.DataFrame([product_name , price, in_stock, rating]).T
data_humor.columns = ['title', 'price', 'stock', 'rating']

# genre
data_humor['genre'] = 'Humor'

# date
data_humor['date'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

data_humor.head()

|    | title | price | stock | rating | genre | date |
|----|-------|-------|-------|--------|-------|------|
| 0 | The Long Haul (Diary ... | 44.07 | In stock | One | Humor | 2022-01-25 21:45:36 |
| 1 | Old School (Diary of ... | 11.83 | In stock | Five | Humor | 2022-01-25 21:45:36 |
| 2 | I Know What I'm ... | 25.98 | In stock | Four | Humor | 2022-01-25 21:45:36 |
| 3 | Hyperbole and a Half: ... | 14.75 | In stock | Five | Humor | 2022-01-25 21:45:36 |
| 4 | Dress Your Family in ... | 43.68 | In stock | Three | Humor | 2022-01-25 21:45:36 |
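
The Classics, Science Fiction, and Humor cells repeat the same parsing steps with only the URL and genre changed. The sketch below folds those steps into one function, assuming the markup the cells above rely on: `li` product items, an `h3 > a` title, `p.price_color`, `p.instock.availability`, and a first `p` whose second class is the rating. `parse_books` and the sample HTML are hypothetical; the sample is hand-written, not copied from the site.

```python
import re

from bs4 import BeautifulSoup


def parse_books(html, genre):
    """Return one dict per book found in a category page."""
    soup = BeautifulSoup(html, 'html.parser')
    books = []
    for item in soup.find_all('li', class_='col-xs-6 col-sm-4 col-md-3 col-lg-3'):
        price_text = item.find('p', class_='price_color').get_text()
        books.append({
            'title': item.find('h3').find('a').get_text(),
            'price': float(re.search(r'\d+\.\d+', price_text).group()),
            'stock': item.find('p', class_='instock availability').get_text().strip(),
            'rating': item.find('p')['class'][1],  # second class of the first <p>
            'genre': genre,
        })
    return books


# Hand-written stand-in for one product entry.
sample_html = '''
<ol>
  <li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
    <p class="star-rating Four"></p>
    <h3><a href="#">The Secret Garden</a></h3>
    <p class="price_color">£15.08</p>
    <p class="instock availability">
        In stock
    </p>
  </li>
</ol>
'''

print(parse_books(sample_html, 'Classics'))
```

Passing the result to `pd.DataFrame(...)` would give a frame with the same columns as the cells above, apart from `date`, which can be added afterwards in the same way.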


### <span style="color:#4da6ff">1.4. Business</span>

In [72]:
# API Requests
url = 'https://books.toscrape.com/catalogue/category/books/business_35/index.html'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5)AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
page = requests.get( url, headers=headers )

soup = BeautifulSoup(page.text, 'html.parser')

business = soup.find_all('li', class_='col-xs-6 col-sm-4 col-md-3 col-lg-3')

# ================= product_name =================
product_name = [p.find('h3').find('a').get_text() for p in business]
# product_name


# ================= product_price =================
price_init = [p.find('p', class_="price_color").get_text() for p in business]
aux_price = [re.findall(r'\d+', price) for price in price_init]
price = [p[0] + '.' + p[1] for p in aux_price]
price = [float(x) for x in price]
# price


# ================= in_stock =================
# Strip leading and trailing whitespace
def remove_first_end_spaces(string):
    return string.strip()

in_stock = [remove_first_end_spaces(p.find('p', class_='instock availability').get_text().replace("\n", " ")) for p in business]
# in_stock

# ================= rating =================
rating = [p.find('p')['class'][1] for p in business]
#rating

# ================= creating data frame =================
data_business = pd.DataFrame([product_name , price, in_stock, rating]).T
data_business.columns = ['title', 'price', 'stock', 'rating']

# genre
data_business['genre'] = 'Business'

# date
data_business['date'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

data_business.head()

|    | title | price | stock | rating | genre | date |
|----|-------|-------|-------|--------|-------|------|
| 0 | The Dirty Little Secrets ... | 33.34 | In stock | Four | Business | 2022-01-25 21:46:45 |
| 1 | The Third Wave: An ... | 12.61 | In stock | Five | Business | 2022-01-25 21:46:45 |
| 2 | The 10% Entrepreneur: Live ... | 27.55 | In stock | Three | Business | 2022-01-25 21:46:45 |
| 3 | Shoe Dog: A Memoir ... | 23.99 | In stock | Two | Business | 2022-01-25 21:46:45 |
| 4 | Made to Stick: Why ... | 38.85 | In stock | Five | Business | 2022-01-25 21:46:45 |


### <span style="color:#4da6ff">1.5. Putting it all together</span>


In [85]:
data_books = pd.concat([data_classics, data_sci_fi, data_humor, data_business], axis=0, ignore_index=True)
data_books.sample(10)

|    | title | price | stock | rating | genre | date |
|----|-------|-------|-------|--------|-------|------|
| 24 | Sleeping Giants (Themis Files ... | 48.74 | In stock | One | Scientific Fiction | 2022-01-25 21:44:22 |
| 11 | Beowulf | 38.35 | In stock | Two | Classics | 2022-01-25 15:21:33 |
| 25 | Arena | 21.36 | In stock | Four | Scientific Fiction | 2022-01-25 21:44:22 |
| 45 | The Dirty Little Secrets ... | 33.34 | In stock | Four | Business | 2022-01-25 21:46:45 |
| 36 | Old School (Diary of ... | 11.83 | In stock | Five | Humor | 2022-01-25 21:45:36 |
| 19 | Mesaerion: The Best Science ... | 37.59 | In stock | One | Scientific Fiction | 2022-01-25 21:44:22 |
| 37 | I Know What I'm ... | 25.98 | In stock | Four | Humor | 2022-01-25 21:45:36 |
| 3 | The Hound of the ... | 14.82 | In stock | Two | Classics | 2022-01-25 15:21:33 |
| 55 | The Lean Startup: How ... | 33.92 | In stock | Three | Business | 2022-01-25 21:46:45 |
| 33 | The Last Girl (The ... | 36.26 | In stock | Two | Scientific Fiction | 2022-01-25 21:44:22 |
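
The deliverable asks for a CSV with every category. A minimal sketch of that last step follows, assuming a small stand-in frame in place of the full `data_books` and a hypothetical file name `books.csv`. Passing `index=False` keeps the row index out of the file, so no extra unnamed column shows up when it is read back.

```python
import io

import pandas as pd

# Stand-in for data_books, using two rows from the sample above.
data_books = pd.DataFrame({
    'title': ['Beowulf', 'Arena'],
    'price': [38.35, 21.36],
    'stock': ['In stock', 'In stock'],
    'rating': ['Two', 'Four'],
    'genre': ['Classics', 'Scientific Fiction'],
})

# Written to an in-memory buffer here; pass a path such as 'books.csv' instead.
buffer = io.StringIO()
data_books.to_csv(buffer, index=False)

# Reading it back gives the same columns, without an extra index column.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored.columns.tolist())
```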
