# Scraping the Amazon Website for Book Information

<img src="books.jpg"
     alt="Books"
     style="height: 400px; margin-right: 150px;" 
     align="center"/>

<h3>Alexandre Rosseto Lemos</h3>
<h3>Date: June, 2021</h3>

## Motivation

Last year, as in many years before, I set some goals for the year that was about to start. One of them was to read a certain number of books or, more precisely, a certain number of book pages: 5000.

As soon as I started reading, I began to think of projects that could help me track my progress toward that goal. At first, I thought I could fill out an Excel spreadsheet and then load it into Power BI Desktop to create some visualizations and gain some insight into what I was reading.

That was easy to do: I buy all my books for my Kindle, and Amazon's web page has all the information I could possibly need about them, so it was only a matter of copying the information from the book page to the spreadsheet.

Then I began to think about how I could challenge myself and automate this process using only the book's name and language as inputs. So I started learning about web scraping and web crawling with Python.

## Problem Overview
This project uses web crawling and web scraping to retrieve information about books from the Amazon Kindle Store. It also uses Power BI as a visualization tool.

Using Python libraries, I obtained all the necessary information for each book (such as number of pages, author, and category) from its Amazon web page and stored it in an MS SQL database.

Then, using Power BI, I created visualizations to help me gain insight into what I was reading.

In this project, I used the following Python libraries:

- selenium
- time
- bs4
- lxml
- pandas
- pymssql

At the end of this document, I provide links to the official documentation pages of each library.

## Using Selenium to get to the correct book page

#### Selenium is a web crawling library, which means I can use it to simulate a person navigating the web.

In this project, I used it to:
- Open Amazon's web page (using <a href="https://chromedriver.chromium.org/downloads">Google Chrome Web Driver</a>)
- Select the Kindle Store in the Search category
- Enter the name of the book and click the "Search" button

Then, when the new page loaded, I used Selenium to select the first book in the list and click on the link, redirecting me to the book's page.

A book can have multiple versions, and each version has its own information, so it is important to obtain the information about the one I actually bought. To do so, I pick the first book that appears after searching for it by name, because I bought the versions at the top of the "Most relevant" list, which is the default sort order Amazon uses when displaying search results. In the code, this is the first result whose title exactly matches the given name.

To locate the desired elements on the page, I manually inspected its front-end elements and identified the ones needed to reach the book's page.
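
The selection step described above can be illustrated without a browser. The sketch below assumes the result titles have already been read from the page into a list. `first_matching_index` and the sample titles are hypothetical, used only for illustration, and returning `None` when nothing matches is a choice made in this sketch rather than something the project code does.

```python
from typing import List, Optional


def first_matching_index(titles: List[str], book_name: str) -> Optional[int]:
    """Return the position of the first title equal to book_name, or None."""
    for index, title in enumerate(titles):
        # Results are ordered by relevance, so the first exact match wins.
        # Whitespace is stripped as a precaution (an assumption of this sketch).
        if title.strip() == book_name:
            return index
    return None


# Hypothetical result titles, in the order they would appear on the page
titles = [
    "1984 (edição especial)",
    "1984",
    "1984",
]
print(first_matching_index(titles, "1984"))          # 1
print(first_matching_index(titles, "Missing book"))  # None
```

Bounding the search by the length of the list means a book that is not on the results page produces `None` instead of an index error.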

#### bs4 and lxml are libraries that make it easier to work with the information obtained from the HTML of web pages.

In [1]:
# Importing libraries
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
import time
import bs4
import lxml
import pandas as pd

### Functions created for the code

In [135]:
def create_dict(book_inf, init_date, final_date):
    '''
    Info:
        This function transforms the list of information into a dictionary
    -------------
    Input:
        book_inf: List of information about the book (type: list)
        init_date: Date when the reading began (type: string)
        final_date: Date when the reading ended (type: string)
    -------------
    Output:
        dict_info: Dictionary with the information about the book (type: dictionary)
    '''
    
    # Adding the dates where I began and finished reading the book
    dict_info = {}
    dict_info['Inicio leitura'] = init_date
    dict_info['Fim leitura'] = final_date
    
    # Index of the categories of the book
    i = 1
    for line in book_inf:
        
        # Lines describing the categories the book belongs to have only one element, so the key
        # has to be created here
        try:
            dict_info[line[0]] = line[1]
        except IndexError:
            
            # Splitting the information into two fields: the category field and the category rank field
            cat_num, rank_num = 'Categoria ' + str(i), 'Ranking na Categoria ' + str(i)       # Creating the keys
            aux_split = line[0].split(' em ')     # Separating the rank information from the category information
            aux_split2 = aux_split[0].split(' ')  # Separating to get only the number
            dict_info[cat_num] = aux_split[1]     # Adding the category information to the dictionary
            
            for element in aux_split2:
                try:
                    dict_info[rank_num] = int(element)  # Adding the rank information to the dictionary
                    i += 1    # A book can belong to many categories
                except ValueError:
                    pass
    
    return dict_info
    
def string_op(info_list):
    '''
    Info:
        This function performs string operations to clean and organize the information obtained from the book's web page
    -------------
    Input:
        info_list: List of information about the book (type: list)
    -------------
    Output:
        info_list: The same list, cleaned and formatted in place (type: list)
    '''
        
    # First and second operations
    # First operation: I only want the rank of the book in the Kindle Store, and usually this information is mixed with
    # other things. So I only take what is before the first '(' in the string. Also, I just want the numeric value of the field.
    # Second operation: I only want the numeric value of the field that has the number of pages
    for rank_info in info_list:
        if rank_info[0] == 'Ranking dos mais vendidos':    # Searching for the correct field
            rank_info[1] = rank_info[1].split('(')[0].split(' ')[1]      # Replacing the information
        if rank_info[0] == 'Número de páginas':
            rank_info[1] = rank_info[1].split(' ')[0]

    # Third operation: the star rating comes mixed with the number of ratings given by the customers.
    # To split the two pieces of information, I edit the string and then split it.
    # Note: this information is always in the last position of the list
    star_infos = info_list[-1][1].replace('estrelas', 'estrelas|').split('|')
    stars_users = star_infos[0].split(' de ')[0]    # Number of stars given by the users
    total_stars = star_infos[0].split(' de ')[1].split(' ')[0]         # Maximum number of stars possible
    info_list[-1][1] = stars_users                                     # Replacing the mixed information with the number of stars
    info_list.append(['Avaliacao Maxima', total_stars])                # Adding the maximum score possible
    info_list.append(['Classificacoes', star_infos[1].split(' ')[0]])  # Adding the number of ratings given by the customers
    
    
    
    return info_list

def get_book_info(book_name, lang):

    '''
    Info:
        This function uses web crawling and web scraping to navigate to the book's Amazon web page and obtain the desired
        information about it.
    -------------
    Input:
        - book_name: Name of the book to get the information for (type: str)
        - lang: Language of the book (type: str)
    -------------
    Output:
        - book_info: List containing the information about the book (type: list)
    '''
    
    # The 'headless' argument prevents the Chrome window from popping up (currently disabled below)
    op = webdriver.ChromeOptions()
    #op.add_argument('headless')

    # Passing the path to the chromedriver.exe file and the url to the Amazon page
    CHROME_DRIVER_PATH = "C:/Users/xande/Documents/WebDriver/chromedriver.exe"
    
    amazon_url = "https://www.amazon.com.br/"

    # Start of the web crawl
    driver = webdriver.Chrome(CHROME_DRIVER_PATH,
                              options=op
                              )
    driver.get(amazon_url)

    # Selecting the 'Loja Kindle' option to use in the search.
    category = Select(driver.find_element_by_id("searchDropdownBox"))    # Searching for the DropDown box       
    category.select_by_visible_text("Loja Kindle")                       # Selecting the desired section

    # Passing the name of the book to the search bar
    search_var = book_name + ' ' + lang                                   # Adding the language of the book
    searchbar = driver.find_element_by_id("twotabsearchtextbox")         # Searching for the Search bar by id
    searchbar.send_keys(search_var, Keys.TAB, Keys.ENTER)                # Passing the book's title and pressing the search button

    # Wait for the new page to load
    time.sleep(3)

    # Obtaining all the books in the page
    books = driver.find_elements_by_xpath("//a[@class = 'a-link-normal a-text-normal']")

    # Looking for the book whose title matches the given name
    unmatching_names = True
    ind = 0  # Book index in the list
    while unmatching_names:

        # Retrieving the name of the book in the web page
        web_name = books[ind].find_elements_by_xpath(".//span")
        book_web_name = web_name[0].text

        # If the names match, select the book
        if book_web_name == book_name:
            unmatching_names = False
            books[ind].click()    # Selecting it

        # If the names don't match, search for the next element in the list
        else:
            ind += 1
    
    # Wait for the new page to load
    time.sleep(4)

    # Obtaining the html code for the book page
    book_page = driver.page_source
    
    # Closing Google Chrome (currently disabled)
    #driver.quit()

    # Using the bs4 and lxml libraries to work with the page source code
    soup = bs4.BeautifulSoup(book_page,"lxml")

    # Obtaining the desired information from the page
    # It is located in the div element with the id = 'detailBullets_feature_div'
    book_info = soup.select("div#detailBullets_feature_div")

    # Obtaining the price of the book
    price_ele = soup.find_all('span', {'id': "kindle-price"})
    book_price = price_ele[0].text.replace('\n', '').split('$')[1]

    list_book_info = [['Livro', book_name], ['Preço (R$)', book_price]]   # Starting the list of information with the book name and price
    for line in book_info[0].select('li'):
        list_book_info.append(line.text.replace('\n','').split(':'))
        
    formatted_book_info = string_op(list_book_info)
    
    return formatted_book_info
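
To make the third operation in `string_op` more concrete, here is a small standalone sketch that parses a rating string with a regular expression instead of chained `split` calls. The sample string is an assumption about what the page text looks like once line breaks are removed, and `parse_star_rating` is a hypothetical helper, not part of the project code. It also assumes Brazilian number formatting (comma as decimal mark, dot as thousands separator).

```python
import re
from typing import Tuple


def parse_star_rating(text: str) -> Tuple[float, int, int]:
    """Split a rating string into (stars, maximum stars, number of ratings)."""
    match = re.search(r"([\d.,]+)\s+de\s+([\d.,]+)\s+estrelas\s*([\d.]+)", text)
    if match is None:
        raise ValueError(f"unexpected rating format: {text!r}")
    stars = float(match.group(1).replace(",", "."))   # '4,6' -> 4.6
    max_stars = int(match.group(2))                   # '5' -> 5
    n_ratings = int(match.group(3).replace(".", ""))  # '1.036' -> 1036
    return stars, max_stars, n_ratings


# Assumed shape of the text after the line breaks have been removed
sample = "4,6 de 5 estrelas1.036 classificações"
print(parse_star_rating(sample))  # (4.6, 5, 1036)
```

Converting the values to numbers at this point may help avoid counts such as `1.036` being read as decimals further down the pipeline.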

### Running the code

In [136]:
# List of books I am currently reading or have read this year
books = [('O andar do bêbado: Como o acaso determina nossas vidas', 'Português', '06/02/2021', '02/03/2021'), 
         ('A revolução dos bichos', 'Português', '02/03/2021', '04/03/2021'),
         ('1984', 'Português', '29/03/2021', '17/05/2021'),
         ('101 Perguntas e Respostas para Investidores Iniciantes', 'Português','10/01/2021', '04/02/2021'),
         ('101 Perguntas E Respostas Sobre Tributação Em Renda Variável: Tire suas dúvidas sobre tributação para Bolsa de Valores', 'Português','09/03/2021', ''),
         ('Rápido e devagar: Duas formas de pensar', 'Português','18/05/2021', ''),
        ]

# Obtaining the information about the books and storing it in a pandas DataFrame
# (the rows are collected in a list first because DataFrame.append was removed in pandas 2.0)
rows = []
for book in books:
    book_info = get_book_info(book[0], book[1])
    rows.append(create_dict(book_info, book[2], book[3]))
books_info = pd.DataFrame(rows)
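
`create_dict`, called in the loop above, has to handle lines that carry a category ranking but no key of their own. The sketch below isolates that step under the assumption that such a line looks like `Nº 2 em Probabilidade e Estatística`. The sample lines and the helper name `parse_category_rank` are hypothetical.

```python
from typing import Optional, Tuple


def parse_category_rank(text: str) -> Tuple[str, Optional[int]]:
    """Split '<prefix> <rank> em <category>' into (category, rank)."""
    # Split only on the first ' em ' so category names containing it stay whole
    rank_part, category = text.split(" em ", 1)
    rank = None
    for token in rank_part.split():
        digits = token.replace(".", "")  # drop thousands separators, e.g. '1.692'
        if digits.isdigit():
            rank = int(digits)
            break
    return category.strip(), rank


# Assumed line format; the categories and ranks mirror the first row of the results
lines = [
    "Nº 2 em Probabilidade e Estatística",
    "Nº 9 em Ciências, Matemática e Tecnologia",
]

info = {}
for position, line in enumerate(lines, start=1):
    category, rank = parse_category_rank(line)
    info[f"Categoria {position}"] = category
    info[f"Ranking na Categoria {position}"] = rank
print(info)
```

Numbering the keys with `enumerate` keeps one `Categoria N` / `Ranking na Categoria N` pair per line, even when a line has no recognizable rank.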


In [137]:
books_info

| | ASIN | Avaliacao Maxima | Avaliações dos clientes | Categoria 1 | Categoria 2 | Classificacoes | Configuração de fonte | Dicas de vocabulário | Editora | Fim leitura | ... | Livro | Número de páginas | Preço (R$) | Ranking dos mais vendidos | Ranking na Categoria 1 | Ranking na Categoria 2 | Tamanho do arquivo | Leitor de tela | Categoria 3 | Ranking na Categoria 3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B008FPZPRA | 5 | 46 | Probabilidade e Estatística | Ciências, Matemática e Tecnologia | 1.036 | Habilitado | Não habilitado | Zahar; 1ª edição (1 julho 2009) | 02/03/2021 | ... | O andar do bêbado: Como o acaso determina noss... | 270 | 1967 | 1692 | 2.0 | 9.0 | 787 KB | | | |
| 1 | B009WWDBX0 | 5 | 49 | Ficção clássica | Clássicos de Ficção | 17.436 | Habilitado | Não habilitado | Companhia das Letras; 1ª edição (10 janeiro 2007) | 04/03/2021 | ... | A revolução dos bichos | 156 | 1745 | 526 | 10.0 | 38.0 | 1285 KB | | | |
| 2 | B009XE662U | 5 | 49 | Ficção clássica | Clássicos de Ficção | 13.513 | Habilitado | Não habilitado | Companhia das Letras; 1ª edição (21 julho 2009) | 17/05/2021 | ... | 1984 | 482 | 2990 | 983 | 15.0 | 55.0 | 2686 KB | Compatível | | |
| 3 | B07L8NR3DF | 5 | 47 | Equipes | Investir | 4.085 | Habilitado | Não habilitado | Suno Research (9 dezembro 2018) | 04/02/2021 | ... | 101 Perguntas e Respostas para Investidores In... | 109 | 2000 | 1233 | 9.0 | 22.0 | 218 KB | Compatível | Negócios e economia | 46.0 |
| 4 | B07Q4SN446 | 5 | 45 | Financeiro | Investir | 648.0 | Habilitado | Não habilitado | | | ... | 101 Perguntas E Respostas Sobre Tributação Em ... | 99 | 2000 | 2890 | 1.0 | 42.0 | 3335 KB | | Negócios e economia | 113.0 |
| 5 | B00A3D1A44 | 5 | 47 | Tomar de Decisões e Resolução de Problemas | Negócios e economia | 4.351 | Habilitado | Não habilitado | Objetiva; 1ª edição (1 agosto 2012) | | ... | Rápido e devagar: Duas formas de pensar | 641 | 2990 | 359 | 6.0 | 13.0 | 3023 KB | | | |
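
The overview mentions storing the results in an MS SQL database, but that step is not shown here. As a rough sketch of one way it could be done, the helper below builds a parameterized `INSERT` statement from one row dictionary. The table name `books`, the helpers `quote_identifier` and `build_insert`, and the choice of columns are all hypothetical. The `%s` placeholders follow the style pymssql accepts in `cursor.execute`, and the connection itself is left out because its details are not given in this document.

```python
from typing import Any, Dict, List, Tuple


def quote_identifier(name: str) -> str:
    """Quote a T-SQL identifier with brackets, escaping any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def build_insert(table: str, row: Dict[str, Any]) -> Tuple[str, Tuple[Any, ...]]:
    """Return (sql, params) for inserting one row using %s placeholders."""
    columns: List[str] = list(row)
    column_sql = ", ".join(quote_identifier(col) for col in columns)
    placeholders = ", ".join(["%s"] * len(columns))
    sql = f"INSERT INTO {quote_identifier(table)} ({column_sql}) VALUES ({placeholders})"
    return sql, tuple(row[col] for col in columns)


# A few columns taken from the results for the book "1984"
row = {"Livro": "1984", "ASIN": "B009XE662U", "Número de páginas": 482}
sql, params = build_insert("books", row)
print(sql)
print(params)

# With an open pymssql connection `conn` (not created here), the statement
# could then be executed roughly as:
#     cursor = conn.cursor()
#     cursor.execute(sql, params)
#     conn.commit()
```

Bracket quoting matters here because several column names contain spaces and symbols, and passing the values as parameters keeps them out of the SQL string itself.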


## Library documentation/main pages:
- <a href="https://selenium-python.readthedocs.io/index.html">Selenium</a>
- <a href="https://pandas.pydata.org">Pandas</a>
- <a href="https://docs.python.org/3/library/time.html">Time</a>
- <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/">Bs4</a>
- <a href="https://lxml.de/">Lxml</a>
- <a href="https://pypi.org/project/pymssql/">PyMSSQL</a>