### Beautiful Soup - exercise

1. Collect data from https://books.toscrape.com for the following categories:
* Classics
* Science Fiction
* Humor
* Business

Information to be collected:
* title
* price
* rating/reviews
* availability in stock

Report format:

Written plan for each business question covering:
* deliverable result (output): draft of final table or graph/chart
* process: steps in logical order of execution
* input: link to the data source

Final .csv file with all information

### Written plan

The learning objective of this exercise is to practice Python programming with the Beautiful Soup
library by scraping book information from the website https://books.toscrape.com
    
Deliverable result: 
A '.csv' file containing a table with 6 columns: title, price, rate, availability in stock, category, and date/time of scraping.

Process:
- create a pattern dataframe containing the 6 columns listed above 
- for each category (Classics, Science Fiction, Humor, and Business), create a Beautiful Soup object to extract the
information from its webpage, and build a dataframe to concatenate with the pattern dataframe. Missing information is filled with 'NA'.
- concatenate all dataframes into a final dataframe
- export the final dataframe to a '.csv' file
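
The process above can be sketched without any network access. This is a minimal illustration, not the notebook's actual code: the helper name `combine_categories` and the two sample rows are made up, and the column names are assumed to match the ones used in the code cells.

```python
import pandas as pd

# Fixed column layout of the deliverable (the "pattern")
COLUMNS = ["category", "title", "price", "rate", "in_stock", "datetime"]


def combine_categories(frames):
    """Concatenate per-category frames onto the fixed column layout."""
    # reindex adds any missing column (filled with NaN) and fixes the column order
    aligned = [frame.reindex(columns=COLUMNS) for frame in frames]
    combined = pd.concat(aligned, ignore_index=True).astype(object)
    # Replace every missing value with the string 'NA', as stated in the plan
    return combined.where(combined.notna(), "NA")


# Made-up rows, only to show the shape of the result
classics = pd.DataFrame([{"category": "Classics", "title": "Book A", "price": 10.0}])
humor = pd.DataFrame([{"category": "Humor", "title": "Book B", "rate": 3}])

final = combine_categories([classics, humor])
print(final)
```

Aligning every frame to the same column list before concatenating keeps the final table at 6 columns even when a page lacks one of the fields.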

Input:

Webpages for each category as follows:

Classics: http://books.toscrape.com/catalogue/category/books/classics_6/index.html

Science Fiction: http://books.toscrape.com/catalogue/category/books/science-fiction_16/index.html

Humor: http://books.toscrape.com/catalogue/category/books/humor_30/index.html

Business: http://books.toscrape.com/catalogue/category/books/business_35/index.html
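
One possible way to organise these inputs is a mapping from category name to URL, so that names travel with the data instead of list positions. The sketch below also shows how a relative book link found on a category page could be resolved with the standard library's `urljoin`. The dictionary keys, the helper `absolute_book_url`, and the book slug `example-book_1` are illustrative assumptions.

```python
from urllib.parse import urljoin

# Category name -> category page URL (URLs taken from the list above)
CATEGORY_URLS = {
    "classics": "http://books.toscrape.com/catalogue/category/books/classics_6/index.html",
    "science_fiction": "http://books.toscrape.com/catalogue/category/books/science-fiction_16/index.html",
    "humor": "http://books.toscrape.com/catalogue/category/books/humor_30/index.html",
    "business": "http://books.toscrape.com/catalogue/category/books/business_35/index.html",
}


def absolute_book_url(category_url, relative_href):
    """Resolve a relative book link against the category page it came from."""
    return urljoin(category_url, relative_href)


# Hypothetical relative link, shaped like '../../../<book-slug>/index.html'
example = absolute_book_url(
    CATEGORY_URLS["classics"], "../../../example-book_1/index.html"
)
print(example)
```

Compared with slicing off a fixed number of characters, `urljoin` does not depend on how many `../` segments the link contains.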


In [1]:
from bs4 import BeautifulSoup
from datetime import datetime
import requests
import pandas as pd
import numpy as np
import re

In [2]:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:97.0) Gecko/20100101 Firefox/97.0'}
# user-agent from http://developers.whatismybrowser.com/

# empty dataframe
df_details = pd.DataFrame()

# pattern dataframe
df_pattern = pd.DataFrame(columns=['category', 'title', 'price', 'rate', 'in_stock', 'datetime'])  # flat list: a nested list would create a MultiIndex

In [3]:
categories = ['http://books.toscrape.com/catalogue/category/books/classics_6/index.html',
              'http://books.toscrape.com/catalogue/category/books/science-fiction_16/index.html',
              'http://books.toscrape.com/catalogue/category/books/humor_30/index.html',
              'http://books.toscrape.com/catalogue/category/books/business_35/index.html']

In [4]:
data = pd.DataFrame()

for category in range(len(categories)):

    # ================== Category - general page ========================

    url = categories[category]

    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.text, 'html.parser')

    # Category name
    category_name_scrapy = soup.find('div', class_='page-header action')
    category_name = category_name_scrapy.find('h1').get_text()

    # Books list
    books_list = soup.find('ol', class_='row')
    books = books_list.find_all('div', class_='image_container')

    links_with_text = []
    for a in books_list.find_all('a', href=True): 
        if a.text: 
            links_with_text.append(a['href'])

    for b in range(len(links_with_text)):
        links_with_text[b] = 'http://books.toscrape.com/catalogue/' + links_with_text[b][9:]

    # ================== Category - books individual pages ========================

    df_category = pd.DataFrame()

    for c in range(len(links_with_text)):

        book_details_list = {"category": category, "title": {}, "price": {}, "rate": {},  "in_stock": {}, "datetime": {}}

        # HTTP request for the individual book page
        book_url = links_with_text[c]
        page = requests.get(book_url, headers=headers)

        # Beautiful Soup object
        soup = BeautifulSoup(page.text, 'html.parser')

        book_details = soup.find('article', class_='product_page')
        book_details_list['title'] = book_details.find('img').get('alt')
        book_details_list['price'] = book_details.find('p', class_='price_color').get_text()[-5:]
        book_details_list['rate'] = str(book_details.find('p', class_=['star-rating One', 'star-rating Two', 'star-rating Three', 
                                             'star-rating Four', 'star-rating Five']))
        book_details_list['in_stock'] = book_details.find('p', class_='instock availability').get_text()[-18:-7]
        book_details_list['datetime'] = datetime.now()

        df_book = pd.DataFrame(list(book_details_list.items())).T
        df_book.columns = df_book.iloc[0]   # rename column names
        df_book = df_book.iloc[1:]   # delete first row

        df_category = pd.concat([df_category, df_book], axis=0).reset_index(drop=True)  
        
    # final dataframe: append each category once, after all its books are collected
    data = pd.concat([data, df_category], axis=0).reset_index(drop=True)
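
The field extraction in the inner loop relies on fixed character offsets. As a hedged alternative, the sketch below parses a small hand-written HTML fragment with class-based lookups and regular expressions. The fragment only imitates the markup that the selectors above imply (`product_page`, `price_color`, `instock availability`, `star-rating`), and its values and the helper name `parse_book` are made up. The title is read from an `<h1>` here rather than from the image `alt` attribute.

```python
import re

from bs4 import BeautifulSoup

# Hand-written fragment imitating a product page (assumed structure, made-up values)
SAMPLE_HTML = """
<article class="product_page">
  <div class="product_main">
    <h1>Example Book</h1>
    <p class="price_color">£51.77</p>
    <p class="instock availability">
      <i class="icon-ok"></i>
      In stock (19 available)
    </p>
    <p class="star-rating Three"></p>
  </div>
</article>
"""


def parse_book(html):
    """Extract title, price, rating word and stock count from one product page."""
    soup = BeautifulSoup(html, "html.parser")
    article = soup.find("article", class_="product_page")

    price_text = article.find("p", class_="price_color").get_text(strip=True)
    stock_text = article.find("p", class_="availability").get_text(strip=True)

    # 'class' is multi-valued, so Beautiful Soup returns it as a list of names
    rating_tag = article.find("p", class_="star-rating")
    rating_word = next(c for c in rating_tag["class"] if c != "star-rating")

    stock_match = re.search(r"\d+", stock_text)
    return {
        "title": article.find("h1").get_text(strip=True),
        "price": float(re.search(r"\d+\.\d+", price_text).group()),
        "rate": rating_word,
        "in_stock": int(stock_match.group()) if stock_match else 0,
    }


book = parse_book(SAMPLE_HTML)
print(book)
```

Matching on a single class name (`star-rating`, `availability`) tolerates extra classes on the tag, and the regular expressions do not break if the surrounding whitespace or the number of digits changes.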

### Cleaning and Formatting data

In [5]:
data['category'] = data['category'].replace([0, 1, 2, 3],['classics', 'science_fiction', 'humor', 'business'])

data['rate'] = data['rate'].apply(lambda x: re.search('(One|Two|Three|Four|Five)', x).group(1) 
                                  if pd.notnull(x) else 0)
data['rate'] = data['rate'].replace(['One', 'Two', 'Three', 'Four', 'Five'], [1, 2, 3, 4, 5])

# r'\d+' captures the whole number of copies; '\d' alone would keep only the first digit
data['in_stock'] = data['in_stock'].apply(lambda x: int(re.search(r'\d+', x).group(0)) if pd.notnull(x) else 0)

data = data.drop_duplicates().reset_index(drop=True)
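
The same cleaning can also be written with vectorised string methods, which return a missing value instead of raising an error when a pattern is not found. This is a sketch on made-up input strings shaped like the raw `rate` and `in_stock` values. The second row stands for a book where nothing could be extracted.

```python
import pandas as pd

RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

# Made-up raw values; the second row simulates a page with no rating or stock count
raw = pd.DataFrame({
    "rate": ['<p class="star-rating Three"></p>', "None"],
    "in_stock": ["(19 availab", "In stock"],
})

# str.extract yields NaN where the pattern does not match
rating_word = raw["rate"].str.extract(r"(One|Two|Three|Four|Five)", expand=False)
stock_digits = raw["in_stock"].str.extract(r"(\d+)", expand=False)

clean = pd.DataFrame({
    "rate": rating_word.map(RATING_WORDS).fillna(0).astype(int),
    "in_stock": stock_digits.fillna("0").astype(int),
})
print(clean)
```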

### Creating the deliverable .csv file

In [6]:
data.to_csv('books.csv', index=False)  # returns None when a path is given, so there is nothing to assign
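
A quick way to check the deliverable is to read the file back and confirm that the columns and types survive the round trip. The sketch below uses a one-row made-up table and a temporary file named `books_check.csv` (both assumptions), so it does not touch `books.csv`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# One made-up row with the six columns of the deliverable
sample = pd.DataFrame({
    "category": ["classics"],
    "title": ["Example Book"],
    "price": [51.77],
    "rate": [3],
    "in_stock": [19],
    "datetime": [pd.Timestamp("2022-03-01 12:00:00")],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "books_check.csv"
    sample.to_csv(csv_path, index=False)
    # parse_dates restores the timestamp column instead of leaving it as text
    restored = pd.read_csv(csv_path, parse_dates=["datetime"])

print(restored.dtypes)
```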