# Beautiful Soup Dataset Project

***Goal:*** Use Python and the BeautifulSoup package to collect data (genre, price, stock availability, rating, title, and purchase link) for each book on [this test website](https://books.toscrape.com/).

[BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)    
BeautifulSoup uses HTML tags to pull the necessary information from a specific webpage.

## Web Scraping Best Practices:

Not every website allows web scraping. Keep the following best practices in mind:

- **Iterative:** Keep your code as iterative and dynamic as possible, and avoid hard-coding static values. This helps when a website changes the number of items on a page while keeping the same structure.
- **Compliant with Robots.txt and Terms & Conditions:** Don’t breach the implied contract, limits, permissions, or prohibitions on web scraping found in the terms and conditions and/or the robots.txt file.
- **Don’t Overburden the Website:** Querying a website excessively interferes with its normal processes and slows down its performance. Make sure your queries aren’t excessive.
- **Use an API:** If a site offers its data through an API, obtain the data that way rather than by scraping (even if there is a fee involved).
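
The robots.txt and rate-limiting points can also be checked in code. The sketch below uses the standard library's `urllib.robotparser` on an in-memory robots.txt so that it runs without network access. The rules, the `my-scraper` user agent, the `example.com` URLs, and the one-second default delay are illustrative assumptions, not values taken from any real site.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt rules supplied as a string."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, urls, user_agent="my-scraper"):
    """Return only the URLs that the parsed rules permit for user_agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def polite_pause(seconds=1.0):
    """Sleep between requests so the site is not queried excessively."""
    time.sleep(seconds)


parser = build_parser(SAMPLE_ROBOTS_TXT)
candidates = [
    "https://example.com/catalogue/page-1.html",
    "https://example.com/private/admin.html",
]
permitted = allowed_urls(parser, candidates)
```

For a live site, the parser would instead be pointed at the site's own robots.txt with `set_url()` and `read()`, and `polite_pause()` would be called between requests.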

***Since a test website is being used to scrape this specific data, we will focus only on the first criterion.***

# Project Code
   
## Import Necessary Packages for Project   
   
We will be importing the following packages/modules for the following reasons:
- **requests:** allows us to send a GET request to the HTTP address and access a specific webpage
- **BeautifulSoup:** allows us to extract and navigate through data using HTML tags in a webpage
- **pandas:** allows us to create/format/clean our dataset for easy analysis
- **re:** allows us to use Regular Expressions within Python and search for strings that match a certain pattern

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re

## Pulling data from webpage (rating, title, link, price, stock)

With the code below we are looking to pull the following easily obtainable information for each book:
- rating
- title 
- book link
- price
- stock availability

Since there is a "Books" category that contains all available books on the website, we will only be iterating over each page number within the "Books" category -- as opposed to iterating over both genre and page number.

In [3]:
#create dynamic URL for looping through book data
url_template = "https://books.toscrape.com/catalogue/page-{}.html"

#create list of page numbers to iterate over in the dynamic URL
page_nums = list(range(1, 51))

#initialize the aggregated list of book entries
book_data = []

for page_num in page_nums:
    url = url_template.format(page_num)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")

    #use .extend so each page's books are added as individual elements, not as one nested list
    book_data.extend(soup.find_all('article', attrs={'class': 'product_pod'}))

# use aggregated book data to pull necessary data
# raw strings keep regex escapes such as \s and \w from being read as string escapes
title_data = re.findall(r'title=(.*?)>', str(book_data))
price_data = re.findall(r'<p class="price_color">(.*)</p>', str(book_data))
starrating_data = re.findall(r'<p class="star-rating (\w*)">', str(book_data))
instock_data = re.findall(r'<p class="instock availability">\s*<i class="icon-ok"></i>\s*(.*)\s*</p>', str(book_data))
booklink_data = re.findall(r'<div class="image_container">\s*<a href="(.*)"><img.*/></a>\s*</div>', str(book_data))
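
The regular expressions above run over the stringified HTML. As an alternative sketch, the same fields can be read directly from the parsed tags. The sample markup below is hand-written and abridged to resemble one book entry (it is an assumption about the page structure, not a copy of it), and `parse_book` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Abridged, hand-written sample resembling one book entry; not fetched from the site.
SAMPLE_HTML = """
<article class="product_pod">
  <div class="image_container">
    <a href="a-light-in-the-attic_1000/index.html"><img src="x.jpg" alt="A Light in the Attic"/></a>
  </div>
  <p class="star-rating Three"></p>
  <h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
  <div class="product_price">
    <p class="price_color">£51.77</p>
    <p class="instock availability"><i class="icon-ok"></i> In stock</p>
  </div>
</article>
"""


def parse_book(article):
    """Extract one book's fields from a product_pod <article> tag."""
    title_link = article.h3.a
    rating_classes = article.find("p", class_="star-rating")["class"]
    return {
        "title": title_link["title"],
        "link": title_link["href"],
        "price": article.find("p", class_="price_color").get_text(strip=True),
        # The class list looks like ["star-rating", "Three"]; keep the rating word.
        "rating": [c for c in rating_classes if c != "star-rating"][0],
        "instock": article.find("p", class_="availability").get_text(strip=True),
    }


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
books = [parse_book(a) for a in soup.find_all("article", class_="product_pod")]
```

Reading attributes from tags keeps each book's fields together in one record, so a missing field cannot shift the values of the books that follow it.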

We now convert the existing lists of data into a pandas DataFrame for easier data manipulation and cleansing, renaming the data columns as needed.

In [4]:
#first create a pandas Series from each list of data
title_series = pd.Series(title_data)
price_series = pd.Series(price_data)
rating_series = pd.Series(starrating_data)
stock_series = pd.Series(instock_data)
link_series = pd.Series(booklink_data)

#dictionary to map new columns names to series -- rename column names
frame = {"title": title_series, "price(EUR)": price_series, "rating": rating_series, "instock": stock_series, "link": link_series}

#create book dataframe
book_df = pd.DataFrame(frame)

## Pulling data from a webpage pt 2 (genre)

Since genre is not included in the HTML for each book on the listing pages, we will have to pull this data from the book directory of each individual genre.

We will accomplish this by creating a dictionary with each genre as the key and a list of the books in that genre as the value (the genre directory on the website makes it easy to iterate over genres). This dictionary can then be inverted and used to create a new column in our dataset.

**Obtaining the list of genre links and genre labels to iterate over and use as keys in the dictionary**

In [5]:
index_url = "https://books.toscrape.com/"
response = requests.get(index_url)
soup = BeautifulSoup(response.text, "html.parser")

#subsection of category data
categories_data = soup.find_all('ul', attrs={'class': 'nav nav-list'})

#make list of URLs to parse through and make list of genres
#raw strings keep \s from being read as a string escape
nav_links = re.findall(r'<a\shref="(.*?)">', str(categories_data))
categories = re.findall(r'<a\shref=".*?">\s*(.*)\s*</a>', str(categories_data))

# drop the first entry, which is the all-inclusive "Books" category
nav_links = nav_links[1:]
categories = categories[1:]

#remove the trailing "index.html" from all strings so that a page name can be appended in format
nav_links = [link[:-len("index.html")] for link in nav_links]

#add leading forward slash to nav_links so that each can be appended to the domain
nav_links = ["/" + link for link in nav_links]
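
As an alternative to trimming and re-assembling the link strings by hand, the standard library's `urllib.parse.urljoin` can resolve a relative link against a base URL. This is a minimal sketch; `relative_link` is an example value in the form of the navigation links, and the variable names are illustrative.

```python
from urllib.parse import urljoin

BASE_URL = "https://books.toscrape.com/"

# Example relative link in the form found in the category navigation.
relative_link = "catalogue/category/books/travel_2/index.html"

# Resolve the relative link against the site root.
category_index_url = urljoin(BASE_URL, relative_link)

# A later page of the same category replaces only the final path segment.
category_page_2_url = urljoin(category_index_url, "page-2.html")
```

Because `urljoin` handles the slashes and the final path segment itself, no slicing by a fixed character count is needed.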

In [6]:
# add initial part of URL for the nav_links
indexurl_template = "https://books.toscrape.com{category_link}index.html"
url_template = "https://books.toscrape.com{category_link}page-{num}.html"

#pattern that matches only when the results header reports a range ("showing X to Y")
more_pages_pattern = r'<form .*>\s*<div style="display:none">\s*</div>\s*<strong>[0-9]*</strong> results.*showing <strong>[0-9]*</strong> to <strong>([0-9]*)</strong>.*\s*</form>'

#initialize genre dictionary
genre_dict = {}

#iterate over every category instead of a hard-coded count
for i in range(len(nav_links)):
    page_num = 1
    #refresh title data variable for each genre
    title_list = []

    #use url for index.html -- first (or only) page of the genre
    url = indexurl_template.format(category_link = nav_links[i])
    response = requests.get(url)

    soup = BeautifulSoup(response.text, "html.parser")
    book_data = soup.find_all('article', attrs={'class': 'product_pod'})
    header_data = soup.find_all('form', attrs={'method':'get', 'class':'form-horizontal'})
    more_pages = re.findall(more_pages_pattern, str(header_data))

    genre_titles = re.findall(r'title=(.*?)>', str(book_data))
    title_list.extend(genre_titles)

    while True:

        page_num += 1

        #end loop and save to dictionary if there are no additional results
        if more_pages == []:
            genre_dict[categories[i]] = title_list
            break

        #continue loop and update/pull data if there are additional results
        else:
            #reset url for genres with more than 1 page
            url = url_template.format(category_link = nav_links[i], num = page_num)
            response = requests.get(url)
            soup = BeautifulSoup(response.text, "html.parser")
            book_data = soup.find_all('article', attrs={'class': 'product_pod'})
            genre_titles = re.findall(r'title=(.*?)>', str(book_data))

            #use .extend instead of .append because .append will add the entire list instead of list elements
            title_list.extend(genre_titles)

            header_data = soup.find_all('form', attrs={'method':'get', 'class':'form-horizontal'})
            more_pages = re.findall(more_pages_pattern, str(header_data))
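
Another way to page through a genre is to follow the pager's "next" link until none remains. The sketch below assumes the pager marks that link with `<li class="next">`. The helper `collect_titles`, the `fetch` parameter, and the two hand-written pages under `example.com` are hypothetical and exist only so the example runs without network access.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def collect_titles(start_url, fetch):
    """Follow "next" links from start_url and gather every book title.

    fetch is any callable that takes a URL and returns that page's HTML.
    """
    titles = []
    url = start_url
    while url is not None:
        soup = BeautifulSoup(fetch(url), "html.parser")
        for article in soup.find_all("article", class_="product_pod"):
            titles.append(article.h3.a["title"])
        next_link = soup.select_one("li.next > a")
        # Stop when the pager has no "next" entry; otherwise resolve its relative href.
        url = urljoin(url, next_link["href"]) if next_link else None
    return titles


# Two hand-written pages standing in for a two-page genre (hypothetical URLs).
FAKE_PAGES = {
    "https://example.com/genre/index.html": """
        <article class="product_pod"><h3><a href="a/index.html" title="Book A">Book A</a></h3></article>
        <ul class="pager"><li class="next"><a href="page-2.html">next</a></li></ul>
    """,
    "https://example.com/genre/page-2.html": """
        <article class="product_pod"><h3><a href="b/index.html" title="Book B">Book B</a></h3></article>
        <ul class="pager"><li class="previous"><a href="index.html">previous</a></li></ul>
    """,
}

titles = collect_titles("https://example.com/genre/index.html", FAKE_PAGES.__getitem__)
```

Against a live site, `fetch` could be `lambda u: requests.get(u).text`. Following the link means the loop requests only pages that exist and needs no page counter.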


**Inverting the dictionary into the following format:**
```python
# title_dict = {title: genre}
```

In [7]:
title_dict = {}

#each title becomes a key that points back to its genre
for genre, titles in genre_dict.items():
    for title in titles:
        title_dict[title] = genre
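
Inverting a `{genre: [titles]}` dictionary silently keeps only one genre for any title that appears more than once. The sketch below, using a hypothetical helper `invert_genre_dict` and made-up sample data, performs the same inversion while also reporting such repeated titles.

```python
def invert_genre_dict(genre_dict):
    """Turn {genre: [titles]} into {title: genre} and report repeated titles."""
    title_dict = {}
    repeated = set()
    for genre, titles in genre_dict.items():
        for title in titles:
            if title in title_dict:
                repeated.add(title)
            # A later genre overwrites an earlier one for the same title.
            title_dict[title] = genre
    return title_dict, repeated


# Made-up sample in which "Book A" is listed under two genres.
sample_genre_dict = {"Poetry": ["Book A", "Book B"], "Fiction": ["Book C", "Book A"]}
sample_title_dict, repeated_titles = invert_genre_dict(sample_genre_dict)
```

If `repeated_titles` is non-empty for the scraped data, the title alone is not a unique key and the book link may be a safer one.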

**map titles to genre values to create new genre column**

In [8]:
book_df['genre'] = book_df['title'].map(title_dict)
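
`Series.map` with a dictionary returns `NaN` for any value that is not a key, so it is worth checking for titles that received no genre. A minimal sketch on made-up data (the `sample_` names are illustrative):

```python
import pandas as pd

sample_df = pd.DataFrame({"title": ["Book A", "Book B", "Book C"]})
sample_title_dict = {"Book A": "Poetry", "Book B": "Fiction"}

sample_df["genre"] = sample_df["title"].map(sample_title_dict)

# Titles missing from the dictionary come back as NaN in the genre column.
unmatched = sample_df.loc[sample_df["genre"].isna(), "title"].tolist()
```

An empty `unmatched` list on the real data would confirm that every scraped title was found in a genre directory.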

## Cleansing Dataset    

We now cleanse the dataset and prepare it for analysis. We want to accomplish the following for each column:
- **title:** remove the surrounding quotation marks that were pulled in during the scraping process
- **price(EUR):** remove any non-numeric characters (the site lists prices with a pound sign, despite the EUR in the column name) and change the column data type to float so that numeric analysis can be carried out
- **rating:** turn the string values into numeric values so that numeric analysis can be carried out
- **instock:** turn this column into a boolean data type so that boolean analysis can be carried out
- **link:** reformat the links so that the entire URL is included (not only the relative link)

**Note: We will first copy the dataset to the cleansed_df variable so that the original dataset is not altered.**

In [16]:
#copy dataset
cleansed_df = book_df.copy()

#removing quotation marks from the title data
#need to use vectorized string slicing instead of .replace() bc titles could contain quotation marks themselves
cleansed_df["title"] = book_df["title"].str[1:-1]

#keep only digits and the decimal point so the result does not depend on how the currency prefix was decoded
cleansed_df["price(EUR)"] = book_df["price(EUR)"].str.replace(r"[^\d.]", "", regex=True).astype(float)

#converting rating to numeric data so that numeric operations can be used
rating_dict = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
cleansed_df["rating"] = cleansed_df["rating"].map(rating_dict)

#converting instock to boolean
cleansed_df["instock"] = cleansed_df["instock"].str.contains("In stock")

#converting sub link to entire link
base_url = "https://books.toscrape.com/catalogue/"
cleansed_df["link"] = base_url + cleansed_df["link"]
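
To check the cleansing steps without re-running the scrape, they can be applied to a small hand-written sample. The rows below are made up to mimic the raw scraped format, including a mis-decoded currency prefix. The price step here strips everything except digits and the decimal point, which is an assumption about how to make that step tolerant of the prefix.

```python
import pandas as pd

# Two hand-written rows in the raw scraped format (values are illustrative).
raw = pd.DataFrame({
    "title": ['"Book A"', '"Book "B""'],
    "price(EUR)": ["£51.77", "Â£9.50"],
    "rating": ["Three", "One"],
    "instock": ["In stock", "Out of stock"],
    "link": ["book-a_1/index.html", "book-b_2/index.html"],
})

clean = raw.copy()

# Slice off only the outer quotation marks; inner ones belong to the title.
clean["title"] = raw["title"].str[1:-1]

# Keep only digits and the decimal point, whatever prefix was scraped.
clean["price(EUR)"] = raw["price(EUR)"].str.replace(r"[^\d.]", "", regex=True).astype(float)

# Map rating words to integers.
clean["rating"] = raw["rating"].map({"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5})

# True only where the availability text says "In stock".
clean["instock"] = raw["instock"].str.contains("In stock")

# Prefix the relative link with the catalogue base URL.
clean["link"] = "https://books.toscrape.com/catalogue/" + raw["link"]
```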

## Final Dataset

**cleansed_df is our final dataset**

We will now run the following final checks to verify that the correct changes have been made:
- print first x rows
- print last x rows
- look at basic information of our DataFrame using df.info()

In [17]:
cleansed_df[:20]

| | title | price(EUR) | rating | instock | link | genre |
|---|---|---|---|---|---|---|
| 0 | A Light in the Attic | 51.77 | 3 | True | https://books.toscrape.com/catalogue/a-light-i... | Poetry |
| 1 | Tipping the Velvet | 53.74 | 1 | True | https://books.toscrape.com/catalogue/tipping-t... | Historical Fiction |
| 2 | Soumission | 50.1 | 1 | True | https://books.toscrape.com/catalogue/soumissio... | Fiction |
| 3 | Sharp Objects | 47.82 | 4 | True | https://books.toscrape.com/catalogue/sharp-obj... | Mystery |
| 4 | Sapiens: A Brief History of Humankind | 54.23 | 5 | True | https://books.toscrape.com/catalogue/sapiens-a... | History |
| 5 | The Requiem Red | 22.65 | 1 | True | https://books.toscrape.com/catalogue/the-requi... | Young Adult |
| 6 | The Dirty Little Secrets of Getting Your Dream... | 33.34 | 4 | True | https://books.toscrape.com/catalogue/the-dirty... | Business |
| 7 | The Coming Woman: A Novel Based on the Life of... | 17.93 | 3 | True | https://books.toscrape.com/catalogue/the-comin... | Default |
| 8 | The Boys in the Boat: Nine Americans and Their... | 22.6 | 4 | True | https://books.toscrape.com/catalogue/the-boys-... | Default |
| 9 | The Black Maria | 52.15 | 1 | True | https://books.toscrape.com/catalogue/the-black... | Poetry |


In [19]:
cleansed_df[-20:]

| | title | price(EUR) | rating | instock | link | genre |
|---|---|---|---|---|---|---|
| 980 | Frankenstein | 38.0 | 2 | True | https://books.toscrape.com/catalogue/frankenst... | Default |
| 981 | Forever Rockers (The Rocker #12) | 28.8 | 3 | True | https://books.toscrape.com/catalogue/forever-r... | Music |
| 982 | Fighting Fate (Fighting #6) | 39.24 | 3 | True | https://books.toscrape.com/catalogue/fighting-... | Romance |
| 983 | Emma | 32.93 | 2 | True | https://books.toscrape.com/catalogue/emma_17/i... | Classics |
| 984 | Eat, Pray, Love | 51.32 | 3 | True | https://books.toscrape.com/catalogue/eat-pray-... | Nonfiction |
| 985 | Deep Under (Walker Security #1) | 47.09 | 5 | True | https://books.toscrape.com/catalogue/deep-unde... | Romance |
| 986 | Choosing Our Religion: The Spiritual Lives of ... | 28.42 | 4 | True | https://books.toscrape.com/catalogue/choosing-... | Religion |
| 987 | Charlie and the Chocolate Factory (Charlie Buc... | 22.85 | 3 | True | https://books.toscrape.com/catalogue/charlie-a... | Childrens |
| 988 | Charity's Cross (Charles Towne Belles #4) | 41.24 | 1 | True | https://books.toscrape.com/catalogue/charitys-... | Romance |
| 989 | Bright Lines | 39.07 | 5 | True | https://books.toscrape.com/catalogue/bright-li... | Fiction |


In [18]:
cleansed_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   title       1000 non-null   object 
 1   price(EUR)  1000 non-null   float64
 2   rating      1000 non-null   int64  
 3   instock     1000 non-null   bool   
 4   link        1000 non-null   object 
 5   genre       1000 non-null   object 
dtypes: bool(1), float64(1), int64(1), object(3)
memory usage: 40.2+ KB
