## Book Data from Goodreads
### A Web Scraping Project

#### In this exercise, we will scrape the Goodreads website to collect information on books. Specifically, we will target the list 'Books That Everyone Should Read At Least Once'. The scraping is done in two steps:

#### Step 1: Get the book name and book URL from all pages of the above-mentioned list. Store this information in a dictionary and then convert it to a CSV for persistence.

#### Step 2: Get book-specific information from each URL, build a pandas DataFrame with this data, and save the entire dataset to CSV for persistence.

#### Let's begin!

In [6]:
# Import required libraries

from bs4 import BeautifulSoup as bs
import urllib.request
import pandas as pd
import numpy as np
import time
import lxml  # Parser backend used by BeautifulSoup; if the import fails, install it with: !pip install lxml

import warnings
warnings.filterwarnings('ignore')

from requests import get
import os

In [2]:
# Required Variables

BASE_URL = 'https://www.goodreads.com'
LIST_URL = 'https://www.goodreads.com/list/show/264.Books_That_Everyone_Should_Read_At_Least_Once'

hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      'Referer': 'https://cssspritegenerator.com',
      'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
      'Accept-Encoding': 'none',
      'Accept-Language': 'en-US,en;q=0.8'}

num_pages = 2

books = {'Title': [], 'URL': []}

In [3]:
books

{'Title': [], 'URL': []}
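
A minimal sketch of how the list-page URLs could be generated from `LIST_URL` and `num_pages`, assuming pages are selected with a `?page=N` query parameter and are numbered from 1. The helper name `list_page_urls` is hypothetical, and here `num_pages` is treated as an inclusive page count.

```python
# Same list URL as in the variables cell above.
LIST_URL = 'https://www.goodreads.com/list/show/264.Books_That_Everyone_Should_Read_At_Least_Once'


def list_page_urls(list_url, num_pages):
    """Return the URLs for pages 1..num_pages (inclusive) of a list.

    Hypothetical helper: assumes the site paginates with ?page=N.
    """
    return [f'{list_url}?page={page}' for page in range(1, num_pages + 1)]


urls = list_page_urls(LIST_URL, 2)
for url in urls:
    print(url)
```

Building the URLs up front keeps the pagination rule in one place and makes it easy to check how many pages will be requested before any network call is made.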

In [4]:
for i in range(1, num_pages):  # range() excludes num_pages, so pages 1..num_pages-1 are read
    print(f'Reading Page {i}')
    list_page_url = f'{LIST_URL}?page={i}'
    list_page = get(list_page_url, headers=hdr)
    list_soup = bs(list_page.content, 'lxml')
    book_table = list_soup.find('table', attrs={'class': 'tableList'})
    rows = book_table.find_all('tr')
    for row in rows:
        link = row.find('a', attrs={'class': 'bookTitle'})
        if link is None:  # Skip rows that do not contain a book link
            continue
        title = link.find('span').text
        books['Title'].append(title)
        # The href is relative, so prepend the site root
        books['URL'].append(BASE_URL + link.attrs['href'])

books_df = pd.DataFrame.from_dict(books)
books_df.head()

Reading Page 1


,Title,URL
0,To Kill a Mockingbird,https://www.goodreads.com/book/show/2657.To_Ki...
1,Harry Potter and the Sorcerer's Stone (Harry P...,https://www.goodreads.com/book/show/3.Harry_Po...
2,1984,https://www.goodreads.com/book/show/40961427-1984
3,Pride and Prejudice,https://www.goodreads.com/book/show/1885.Pride...
4,The Diary of a Young Girl,https://www.goodreads.com/book/show/48855.The_...


In [5]:
test_url = 'https://www.goodreads.com/book/show/2657.To_Kill_a_Mockingbird'

In [9]:
if not os.path.exists('book_data_test.csv'):
    
    book_data = pd.DataFrame(columns=[
        
        'img_url',
        'book_title',
        'book_authors',
        'book_format',
        'book_pages',
        'avg_rating',
        'rating_count',
        'review_count'
    ])
    
    # index=False avoids a stray 'Unnamed: 0' column when the file is read back
    book_data.to_csv('book_data_test.csv', index=False)

book_data = pd.read_csv('book_data_test.csv')

In [10]:
book_data.head()

,img_url,book_title,book_authors,book_format,book_pages,avg_rating,rating_count,review_count


In [13]:
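try:
    # Send the same headers as for the list pages
    book_page = get(test_url, headers=hdr)
    book_soup = bs(book_page.content, 'lxml')
    #print(book_soup.prettify())
    
    book = dict()
    
    # Cover image URL; fall back to an empty string if the tag is missing
    image_url = book_soup.find('img', attrs={'id': 'coverImage'})
    
    if image_url:
        book['img_url'] = image_url.attrs['src']
    else:
        book['img_url'] = ''
    
    # Book title; the id value must be a string, not a bare name
    title = book_soup.find('h1', attrs={'id': 'bookTitle'})
    if title:
        book['book_title'] = title.text.strip()
    else:
        book['book_title'] = ''
    
except Exception as e:
    # Report what went wrong instead of hiding the error
    print(f'Failed to scrape {test_url}: {e}')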
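
The extraction logic in the cell above can be tried without a network request by running it on an HTML string. The following is a minimal sketch under stated assumptions: the function name `parse_book_page` and the sample markup are hypothetical, only the `coverImage` and `bookTitle` ids come from the notebook, and the built-in `html.parser` is used in place of `lxml` so the sketch runs without extra parser dependencies.

```python
from bs4 import BeautifulSoup


def parse_book_page(html):
    """Extract the cover image URL and title from a book page.

    Hypothetical helper: missing elements yield empty strings
    rather than raising an exception.
    """
    soup = BeautifulSoup(html, 'html.parser')
    cover = soup.find('img', attrs={'id': 'coverImage'})
    title = soup.find('h1', attrs={'id': 'bookTitle'})
    return {
        'img_url': cover.get('src', '') if cover else '',
        'book_title': title.get_text(strip=True) if title else '',
    }


# Made-up markup that only mimics the two ids targeted above.
sample_html = """
<html><body>
  <img id="coverImage" src="https://example.com/cover.jpg">
  <h1 id="bookTitle">
    To Kill a Mockingbird
  </h1>
</body></html>
"""

book = parse_book_page(sample_html)
missing = parse_book_page('<html><body></body></html>')
print(book)
print(missing)
```

Keeping the parsing separate from the HTTP request makes each field's fallback easy to check, and the same function could later be called once per book URL.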