# Predicting Genre from Book Descriptions
Ashley-Lauren Mighty  
December 2017

---

### Data
I collected the descriptions and genres for roughly 4,500 books via the Goodreads API & direct web scraping. The descriptions are the summaries written by the author & printed on the back or inside cover of a novel. When a user adds a book to their "want-to-read" or "currently-reading" list, they can also add it to a shelf that matches how they would describe its genre. The genre collected by the web scrape is the shelf on which the most users placed that book.
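
To make the "most popular shelf" rule concrete, here is a minimal sketch of picking a genre from per-shelf user counts. The scrape in this notebook reads the genre straight off the book page, so this is only an illustration of the rule; the function name `primary_genre` and the shelf counts are hypothetical.

```python
from collections import Counter

def primary_genre(shelf_counts):
    """Return the shelf name with the highest user count, or None if empty."""
    if not shelf_counts:
        return None
    # most_common(1) returns a one-item list holding the (name, count) pair
    # with the largest count
    return Counter(shelf_counts).most_common(1)[0][0]

# Hypothetical shelf counts for a single book (illustrative numbers only)
example_shelves = {"fantasy": 5200, "young-adult": 3100, "fiction": 2900}
print(primary_genre(example_shelves))  # prints "fantasy"
```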

### Goal
The goal of this project is to accurately predict the user-defined genre of a book based on its description. Success will be judged on the percentage of books whose genre is predicted correctly, as opposed to overall model accuracy.

In [1]:
import requests
import json
import urllib.request
from bs4 import BeautifulSoup
from goodreads import client
import time

import random
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
gc = client.GoodreadsClient('R8u5IJwBQSAdpvht7x8JqQ', 
                            'zejQqLJdQjMXj5387pOiaqiKORA7jRiXcizB3VTlAA')

# Data Collection
---
Use a mixture of web scraping, the Goodreads API & Kaggle datasets to create a dataframe that contains a list of random books with their primary author, average rating, description, ID & e-book status.

In [3]:
books = pd.read_csv('CSVs/kaggle_books.csv')
books = books[['best_book_id', 'original_publication_year', 'original_title', 'image_url']]
books = books.rename(columns={'best_book_id': 'book_id',
                              'original_publication_year': 'year',
                              'original_title': 'title'})
books.head()

|   | book_id | year | title | image_url |
|---|---------|------|-------|-----------|
| 0 | 2767052 | 2008.0 | The Hunger Games | https://images.gr-assets.com/books/1447303603m... |
| 1 | 3 | 1997.0 | Harry Potter and the Philosopher's Stone | https://images.gr-assets.com/books/1474154022m... |
| 2 | 41865 | 2005.0 | Twilight | https://images.gr-assets.com/books/1361039443m... |
| 3 | 2657 | 1960.0 | To Kill a Mockingbird | https://images.gr-assets.com/books/1361975680m... |
| 4 | 4671 | 1925.0 | The Great Gatsby | https://images.gr-assets.com/books/1490528560m... |


In [4]:
#export kaggle book list to csv
book_list_full = books.copy()
book_list_full.to_csv('CSVs/book_list_full.csv', index=False)
#read in csv
book_list_full = pd.read_csv('CSVs/book_list_full.csv')
book_list_full.head()

|   | book_id | year | title | image_url |
|---|---------|------|-------|-----------|
| 0 | 2767052 | 2008.0 | The Hunger Games | https://images.gr-assets.com/books/1447303603m... |
| 1 | 3 | 1997.0 | Harry Potter and the Philosopher's Stone | https://images.gr-assets.com/books/1474154022m... |
| 2 | 41865 | 2005.0 | Twilight | https://images.gr-assets.com/books/1361039443m... |
| 3 | 2657 | 1960.0 | To Kill a Mockingbird | https://images.gr-assets.com/books/1361975680m... |
| 4 | 4671 | 1925.0 | The Great Gatsby | https://images.gr-assets.com/books/1490528560m... |


In [6]:
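book_info_full = []
for i in book_list_full['book_id']:
    # Scrape the book page: the top genre shelf is read from the HTML
    url = 'https://www.goodreads.com/book/show/{}'.format(i)
    res = requests.get(url)
    soup = BeautifulSoup(res.content, 'lxml')
    book_dict = {}
    # Look the book up through the API
    try:
        book = gc.book(i)
    except Exception:
        # Keep an empty row so results stay aligned with book_list_full,
        # and never reuse the previous iteration's `book`
        book_info_full.append(book_dict)
        continue
    time.sleep(2)
    book_dict['author'] = book.authors[0]
    time.sleep(2)
    book_dict['avg_rating'] = book.average_rating
    time.sleep(2)
    book_dict['description'] = book.description
    time.sleep(2)
    book_dict['is_ebook'] = book.is_ebook
    time.sleep(2)
    try:
        book_dict['genre'] = soup.find('a', {'class':'actionLinkLite bookPageGenreLink'}).text
    except AttributeError:
        # soup.find returned None: no genre link on this page
        pass
    time.sleep(2)
    book_info_full.append(book_dict)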

ConnectionError: HTTPSConnectionPool(host='www.goodreads.com', port=443): Max retries exceeded with url: /book/show/15062217 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x1a0fa51f60>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known',))
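
The `ConnectionError` above came from a failed connection attempt, which may be transient. One possible mitigation, not what this notebook ran, is to wrap each request in a retry with exponential backoff. The helper below is a minimal sketch; `call_with_retries` and its parameters are hypothetical names introduced for illustration.

```python
import time

def call_with_retries(func, *args, retries=3, base_delay=1.0,
                      exceptions=(Exception,), sleep=time.sleep):
    """Call func(*args), retrying on the given exceptions with exponential backoff."""
    for attempt in range(retries + 1):
        try:
            return func(*args)
        except exceptions:
            if attempt == retries:
                raise  # out of retries: surface the original error
            # Wait base_delay, then 2x, then 4x, ... before trying again
            sleep(base_delay * 2 ** attempt)
```

In the scraping loop this could be used as `res = call_with_retries(requests.get, url, exceptions=(requests.exceptions.ConnectionError,))`, so that a brief network dropout delays the run instead of ending it.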

In [7]:
book_info_full = pd.DataFrame(book_info_full)

### The web scrape died approximately halfway through. Because of time constraints, I saved the data collected so far instead of trying to retrieve the entire dataset.
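
A long scrape like this one can be made resumable by writing each record to disk as soon as it is fetched and skipping IDs that are already saved on the next run. The sketch below shows that idea under stated assumptions: `fetch` stands in for a hypothetical per-book scraper that returns a dict, and the `book_id` column is an addition that the original `book_dict` does not contain.

```python
import csv
import os

# Column names mirror book_dict, plus book_id (an assumption added here)
FIELDS = ["book_id", "author", "avg_rating", "description", "is_ebook", "genre"]

def load_done_ids(path):
    """Return the set of book_ids (as strings) already in the checkpoint file."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="", encoding="utf-8") as f:
        return {row["book_id"] for row in csv.DictReader(f)}

def append_row(path, row):
    """Append one record, writing the header only when the file is new or empty."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, restval="")
        if is_new:
            writer.writeheader()
        writer.writerow(row)

def scrape_all(book_ids, fetch, path):
    """Fetch every id not yet in the checkpoint and save it immediately."""
    done = load_done_ids(path)
    for book_id in book_ids:
        if str(book_id) in done:
            continue  # already saved by an earlier run
        row = fetch(book_id)  # hypothetical scraper returning a dict
        row["book_id"] = book_id
        append_row(path, row)
```

With this layout a crash loses at most the book being fetched, and rerunning `scrape_all` with the same path picks up where the previous run stopped.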

In [16]:
book_list45 = books[:4539].copy()
book_list45.to_csv('CSVs/book_list45.csv', index=False)

In [11]:
# save the scraped output to csv
# the scrape crashed after approximately 4,500 books
book_info_full.to_csv('CSVs/book_info45.csv', index=False)

# Next Up: [Exploratory Data Analysis](Exploratory Data Analysis.ipynb)