# Exercise: Scrape a Real Book Category Page into a DataFrame

We’ll scrape the **Travel** books category from [Books to Scrape](https://books.toscrape.com), a sandbox site made specifically for practicing scraping.

This time, we will use **Selenium**.

## Goal:  
Turn the list of books into a **pandas DataFrame** with one row per book.

Target URL (for this exercise):

```text
https://books.toscrape.com/catalogue/category/books/travel_2/index.html
```

### Preliminary questions
While inspecting the page elements, answer these questions (some were already answered in yesterday's exercise):

- What are the tag and class name of the book cards?
- Where are the links to the other book categories located?

In [3]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

import pandas as pd
from urllib.parse import urljoin


In [4]:
URL = "https://books.toscrape.com/catalogue/category/books/travel_2/index.html"
BASE = "https://books.toscrape.com/"

driver = ### open Chrome using webdriver

### navigate to the requested URL using driver.get

wait = WebDriverWait(driver, 10)


In [None]:
## Display all the links on the current page that lead to other categories
## Hint: use driver.find_elements with By.CSS_SELECTOR, after identifying where the category links are located

In [None]:
# Click on the corresponding link to open the chosen category

category = "Sports and Games"

category_link = driver.find_element('''Your Code Here''')
category_link.click()

In [None]:
# Wait until at least one book card is present
wait.until('''Your Code Here''')

In [5]:
# Grab all book cards
book_cards = driver.find_elements('''Your code Here''')
len(book_cards)


11

In [None]:
## Quit Google Chrome (run this cell only after the extraction loop below: the book cards need a live browser session)

In [6]:
rows = []

for card in book_cards:
    # Title + relative link are in h3 > a
    a_tag = card.find_element('''Your Code Here''')
    title = a_tag.get_attribute("title").strip()
    href = a_tag.get_attribute("href")  # Selenium often returns absolute already

    # Price is in p.price_color
    price_text = card.find_element('''Your Code Here''').text.strip()  # "£45.17"
    price = float(price_text.replace("£", "").replace("Â", ""))  # safety against encoding weirdness

    # Availability is in p.instock.availability
    availability = card.find_element('''Your Code Here''').text.strip()

    # urljoin leaves an absolute href unchanged; a relative href is resolved against the page URL
    product_url = urljoin(URL, href)

    rows.append({
        "title": title,
        "price_gbp": price,
        "availability": availability,
        "product_url": product_url
    })

rows[:3]


[{'title': "It's Only the Himalayas",
  'price_gbp': 45.17,
  'availability': 'In stock',
  'product_url': 'https://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html'},
 {'title': 'Full Moon over Noah’s Ark: An Odyssey to Mount Ararat and Beyond',
  'price_gbp': 49.43,
  'availability': 'In stock',
  'product_url': 'https://books.toscrape.com/catalogue/full-moon-over-noahs-ark-an-odyssey-to-mount-ararat-and-beyond_811/index.html'},
 {'title': 'See America: A Celebration of Our National Parks & Treasured Sites',
  'price_gbp': 48.87,
  'availability': 'In stock',
  'product_url': 'https://books.toscrape.com/catalogue/see-america-a-celebration-of-our-national-parks-treasured-sites_732/index.html'}]
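The price clean-up and URL-joining steps of the loop can be tried without a browser. The sketch below is only an illustration: `parse_price` is a hypothetical helper (not part of the exercise), and the relative href is an assumption about how the link might look in the raw HTML of a category page.

```python
import re
from urllib.parse import urljoin


def parse_price(price_text):
    """Return the first decimal number found in a price label as a float."""
    match = re.search(r"\d+(?:\.\d+)?", price_text)
    if match is None:
        raise ValueError(f"no number found in {price_text!r}")
    return float(match.group())


# Page the link was found on (the target URL of this exercise)
PAGE_URL = "https://books.toscrape.com/catalogue/category/books/travel_2/index.html"

# Assumed relative href, for illustration only
relative_href = "../../../its-only-the-himalayas_981/index.html"

# The regex ignores the currency symbol and any stray encoding character
clean_price = parse_price("Â£45.17")

# A relative href is resolved against the page it came from
product_url = urljoin(PAGE_URL, relative_href)

# An absolute URL passes through urljoin unchanged
same_url = urljoin(PAGE_URL, product_url)

print(clean_price)
print(product_url)
```

Extracting the number with a regular expression avoids listing every unwanted character in a chain of `.replace()` calls.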

In [7]:
df = pd.DataFrame(rows)
df


| | title | price_gbp | availability | product_url |
|---|---|---|---|---|
| 0 | It's Only the Himalayas | 45.17 | In stock | https://books.toscrape.com/catalogue/its-only-... |
| 1 | Full Moon over Noah’s Ark: An Odyssey to Mount... | 49.43 | In stock | https://books.toscrape.com/catalogue/full-moon... |
| 2 | See America: A Celebration of Our National Par... | 48.87 | In stock | https://books.toscrape.com/catalogue/see-ameri... |
| 3 | Vagabonding: An Uncommon Guide to the Art of L... | 36.94 | In stock | https://books.toscrape.com/catalogue/vagabondi... |
| 4 | Under the Tuscan Sun | 37.33 | In stock | https://books.toscrape.com/catalogue/under-the... |
| 5 | A Summer In Europe | 44.34 | In stock | https://books.toscrape.com/catalogue/a-summer-... |
| 6 | The Great Railway Bazaar | 30.54 | In stock | https://books.toscrape.com/catalogue/the-great... |
| 7 | A Year in Provence (Provence #1) | 56.88 | In stock | https://books.toscrape.com/catalogue/a-year-in... |
| 8 | The Road to Little Dribbling: Adventures of an... | 23.21 | In stock | https://books.toscrape.com/catalogue/the-road-... |
| 9 | Neither Here nor There: Travels in Europe | 38.95 | In stock | https://books.toscrape.com/catalogue/neither-h... |


Going further:

* Using a for loop, extract the name, price, availability, and product URL of every book on the first page of each category.
* Add a category column.
* Be careful not to produce a list of lists: use rows.extend() instead of rows.append().
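The difference between `extend()` and `append()` can be seen on plain Python lists, without any scraping. In this sketch, `scrape_category` is a hypothetical stand-in for your own scraping function, and the second category contains placeholder data only.

```python
def scrape_category(category, cards):
    """Hypothetical stand-in for a scraping function: returns one dict per book."""
    return [
        {"category": category, "title": title, "price_gbp": price}
        for title, price in cards
    ]


# Stand-in input: Travel titles come from the output above, the rest is placeholder data
pages = {
    "Travel": [("It's Only the Himalayas", 45.17), ("Under the Tuscan Sun", 37.33)],
    "Placeholder category": [("Placeholder title", 10.0)],
}

rows = []    # flat list of dicts, ready for pd.DataFrame(rows)
nested = []  # list of lists, which is usually not what you want

for category, cards in pages.items():
    page_rows = scrape_category(category, cards)
    rows.extend(page_rows)    # adds each dict individually
    nested.append(page_rows)  # adds the whole list as a single element

print(len(rows), len(nested))
```

With `extend()`, every element of `rows` is a dict describing one book, so the list can be passed directly to `pd.DataFrame`.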

In [14]:
### Hint: turn your script into a function that you can call for each link in the list of category links

Now that you have the DataFrame, answer these questions again:
* the average price of the books
* the row of the 5th most expensive book
* plot a graph showing the price per book
* plot a graph showing the distribution of books among the different prices
* plot a graph showing the average price of books per category
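As a starting point for the numeric questions, here is a minimal sketch on a small hand-built DataFrame. The Travel rows reuse titles and prices from the output above; the two placeholder rows and their category are invented only so that the group-by has more than one group.

```python
import pandas as pd

df = pd.DataFrame({
    "title": [
        "It's Only the Himalayas",
        "Under the Tuscan Sun",
        "A Summer In Europe",
        "The Great Railway Bazaar",
        "A Year in Provence (Provence #1)",
        "Placeholder book 1",  # placeholder row
        "Placeholder book 2",  # placeholder row
    ],
    "price_gbp": [45.17, 37.33, 44.34, 30.54, 56.88, 10.00, 20.00],
    "category": ["Travel"] * 5 + ["Placeholder category"] * 2,
})

# Average price over all books
average_price = df["price_gbp"].mean()

# Keep the 5 most expensive books (sorted high to low) and take the last one
fifth_most_expensive = df.nlargest(5, "price_gbp").iloc[-1]

# Average price per category, as a Series indexed by category name
avg_by_category = df.groupby("category")["price_gbp"].mean()

print(average_price)
print(fifth_most_expensive)
print(avg_by_category)
```

If matplotlib is installed, a Series such as `avg_by_category` can be drawn with `avg_by_category.plot(kind="bar")`, and the price distribution with `df["price_gbp"].plot(kind="hist")`.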

In [15]:
### Your code here

### Using iloc

* Retrieve the first row of the DataFrame.

* Retrieve only the price_gbp column using its column index.

* Display the first 5 rows of the DataFrame.
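The three `iloc` tasks rely on position only, never on labels. The sketch below uses a small sample with the same column order as the DataFrame built earlier (title, then price_gbp, then availability); the sample itself is hand-built from the output above.

```python
import pandas as pd

df = pd.DataFrame({
    "title": [
        "It's Only the Himalayas",
        "Under the Tuscan Sun",
        "A Summer In Europe",
        "The Great Railway Bazaar",
        "A Year in Provence (Provence #1)",
        "Neither Here nor There: Travels in Europe",
    ],
    "price_gbp": [45.17, 37.33, 44.34, 30.54, 56.88, 38.95],
    "availability": ["In stock"] * 6,
})

# First row: position 0 on the row axis, returned as a Series
first_row = df.iloc[0]

# All rows, column at position 1 (price_gbp in this column order)
price_column = df.iloc[:, 1]

# Rows at positions 0 to 4 (the end of the slice is excluded)
first_five = df.iloc[:5]

print(first_row)
print(price_column)
print(first_five)
```

If you are unsure of a column's position, `df.columns.get_loc("price_gbp")` returns it.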

### Using loc

* Display all books belonging to the Travel category.

* Display the title and price of books whose availability contains "In stock".

Hint: boolean condition + column list

* Set title as the index, find the row of the book "A Year in Provence (Provence #1)", then reset the index.
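One possible way to approach the `loc` tasks is sketched below on a tiny sample. The two Travel rows come from the output above; the third row, its category, and its "Out of stock" value are placeholders added so that the filters have something to exclude.

```python
import pandas as pd

df = pd.DataFrame({
    "title": [
        "It's Only the Himalayas",
        "A Year in Provence (Provence #1)",
        "Placeholder book",  # placeholder row
    ],
    "price_gbp": [45.17, 56.88, 10.00],
    "availability": ["In stock", "In stock", "Out of stock"],
    "category": ["Travel", "Travel", "Placeholder category"],
})

# Boolean condition only: keep every column of the matching rows
travel_books = df.loc[df["category"] == "Travel"]

# Boolean condition + column list: filter rows and pick columns in one call
in_stock = df.loc[df["availability"].str.contains("In stock"), ["title", "price_gbp"]]

# With title as the index, loc looks rows up by title
indexed = df.set_index("title")
provence_row = indexed.loc["A Year in Provence (Provence #1)"]

# reset_index turns the title index back into an ordinary column
restored = indexed.reset_index()

print(travel_books)
print(in_stock)
print(provence_row)
```

`set_index` and `reset_index` return new DataFrames here, so `df` itself is left untouched.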

### Using loc or iloc, depending on the case

* Display all rows where price_gbp > 40.

* Display title, price, availability for books in the category "Sports and Games" with price < 30.

* Display the last three rows regardless of labels.

### Sorting and using iloc

* Find the 10th most expensive book

* Display the full row corresponding to the 10th most expensive book.
(Hint: indexing starts at 0)
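The hint about zero-based indexing can be made concrete with a small helper. `nth_most_expensive` is a hypothetical function name, and the sample below has fewer than ten rows, so the demonstration asks for the 3rd most expensive book instead of the 10th.

```python
import pandas as pd


def nth_most_expensive(df, n):
    """Return the full row of the n-th most expensive book (n starts at 1)."""
    ranked = df.sort_values("price_gbp", ascending=False).reset_index(drop=True)
    # Positions start at 0, so the n-th row sits at position n - 1
    return ranked.iloc[n - 1]


# Sample built from the first rows of the output above
df = pd.DataFrame({
    "title": [
        "It's Only the Himalayas",
        "Full Moon over Noah’s Ark: An Odyssey to Mount Ararat and Beyond",
        "See America: A Celebration of Our National Parks & Treasured Sites",
        "Under the Tuscan Sun",
    ],
    "price_gbp": [45.17, 49.43, 48.87, 37.33],
})

third = nth_most_expensive(df, 3)
print(third)
```

On the full DataFrame, the same call with `n=10` selects position 9 after sorting.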

In [16]:
### Your code here