
# Web Scraping

---


by Michal Shaffer

**Using web scraping to extract and manipulate data from the Goodreads Choice Awards 2018 fiction list**


---
Goodreads.com holds a yearly [choice awards event](https://www.goodreads.com/choiceawards/best-fiction-books-2018) where users vote on their favorite new books. After the event ends, Goodreads publishes the list of nominees per genre along with the total vote count that each nominee received.

#####For this project the following tasks will be completed:


1.   Set up the webpage url and prepare to get the data
2.   Scrape the raw HTML data from the webpage
3.   Parse the HTML and extract the data we want
4.   Manipulate extracted data

##1) Initial set-up and preparation

To get any data from our page, we first have to do some initial setup.

First, we'll need to download the page's HTML, so we'll use the [`requests`](https://2.python-requests.org/en/master/) library. Let's import that first:



In [0]:
import requests

##2) Scraping the data from the page

Next, let's set up our page information and get the html dump of data.

In [0]:
genre = 'fiction' #we are looking at the fiction list for this project
url = f'https://www.goodreads.com/choiceawards/best-{genre}-books-2018'
page_html = requests.get(url).text #this gets the html text data and saves it into our variable
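Because the URL is built from the `genre` variable, the same pattern could be reused for other categories. Here is a minimal sketch, assuming other genre pages follow the same URL scheme as the fiction page. Only the `fiction` slug is confirmed in this project; `build_awards_url` and its `year` parameter are hypothetical names added for illustration.

```python
BASE_URL = 'https://www.goodreads.com/choiceawards'

def build_awards_url(genre, year=2018):
    """Return the choice awards URL for a genre slug such as 'fiction'."""
    # Assumption: every genre page follows the 'best-<genre>-books-<year>' pattern.
    return f'{BASE_URL}/best-{genre}-books-{year}'

url = build_awards_url('fiction')
print(url)  # https://www.goodreads.com/choiceawards/best-fiction-books-2018
```

Keeping the pattern in one function means a typo in the URL only has to be fixed in one place.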

##3a) Parsing through our data

HTML is difficult to parse with plain Python string tools, so we need the help of another library. One very useful library is [`beautifulsoup4`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). (Click the link to view its documentation.)

Let's import that library first:

In [0]:
from bs4 import BeautifulSoup

Now, let's utilize the new library by parsing the html data into a more usable form:

In [0]:
soup_data = BeautifulSoup(page_html, 'lxml')

Great! You can print out all the newly-formatted html with

> `print(soup_data)`

but I don't want to clutter the page, so let's leave that out for now.

After looking through our `soup_data`, I found the area of the page where the list of book nominees is displayed and made note of the HTML tags and class names.

![](https://i.imgur.com/9uX1YdN.png)

Using this information, let's navigate to that section now.


In [0]:
poll_contents = soup_data.find('div', class_='pollContents') #first get to the main block of content
poll_item = poll_contents.find_all('div', class_='inlineblock pollAnswer resultShown') #from there, get to the list of books
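To see how class-based navigation behaves without hitting the live site, here is a small sketch on inline HTML. The markup is a hypothetical, heavily simplified stand-in that only reuses the class names noted above; the real page has far more nesting. It uses the built-in `html.parser` so that `lxml` is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup that reuses the class names from the page
sample_html = """
<div class="pollContents">
  <div class="inlineblock pollAnswer resultShown">
    <strong class="uitext result">55,300 votes</strong>
  </div>
  <div class="inlineblock pollAnswer resultShown">
    <strong class="uitext result">41,826 votes</strong>
  </div>
</div>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# class_ with a single name matches any tag whose class list includes it
container = soup.find('div', class_='pollContents')
answers = container.find_all('div', class_='pollAnswer')

# class_ with a space-separated string matches the exact class attribute value
vote_texts = [a.find('strong', class_='uitext result').get_text(strip=True)
              for a in answers]
print(vote_texts)  # ['55,300 votes', '41,826 votes']
```

Matching on the single class `pollAnswer` is a little more forgiving than matching the full three-class string, which stops working if the site reorders or adds classes.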

##3b) Extracting our data

Great! Now I want to go through each 'pollAnswer' to get the data for each nominated book. 
However, I can already see that the data is not in the format that I want to save it in. So before we run through it, let's make a few quick functions to help us slice and dice some of the string data we'll be getting:

In [0]:
#gets the characters between the first occurrence of start and the last occurrence of end
def between(string, start, end):
  start_pos = string.find(start) #find the first occurrence of the given start
  end_pos = string.rfind(end) #find the last occurrence of the given end
  new_start_pos = start_pos + len(start) #move the starting position past the start marker itself
  return string[new_start_pos:end_pos] #return everything between the two positions

#gets the characters before the first occurrence of a given end char or string
def before(string, end):
  end_pos = string.find(end) #find the first occurrence of the given end
  return string[0:end_pos] #return everything before that position

#gets the characters after the last occurrence of a given start char or string
def after(string, start):
  start_pos = string.rfind(start) #find the last occurrence of the given start
  new_start_pos = start_pos + len(start) #move the starting position past the start marker itself
  return string[new_start_pos:] #return everything after that position
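A quick way to check these helpers is to run them on a small string. The sketch below repeats the three functions in compact form so it runs on its own; `sample_tag` is a made-up string shaped like the tag text the next step expects, not real page data.

```python
def between(string, start, end):
    start_pos = string.find(start) + len(start)  # just past the first start marker
    end_pos = string.rfind(end)                  # last occurrence of the end marker
    return string[start_pos:end_pos]

def before(string, end):
    return string[:string.find(end)]

def after(string, start):
    return string[string.rfind(start) + len(start):]

# Hypothetical tag text for illustration only
sample_tag = ('<img alt="Still Me by Jojo Moyes" '
              'src="https://example.com/cover.jpg" '
              'title="Still Me by Jojo Moyes"/>')

title = between(sample_tag, 'title="', ' by ')
author = between(sample_tag, ' by ', '" src')
image = between(sample_tag, 'src="', '" title')
file_name = after(image, '/')
print(title, '|', author, '|', image, '|', file_name)

# Pitfall: find() returns -1 when the marker is missing, so the slice
# silently drops the last character instead of raising an error.
truncated = before('no marker here', 'votes')
print(truncated)  # no marker her
```

The last call shows why it is worth checking the results: a missing marker does not fail loudly, it just returns slightly wrong text.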

Great! Now, let's go through each book nominee item on the page and extract the data we want. 

I will be putting our extracted data into a nested list which will be easy to manipulate later and very easy to read.

Our list items will each include the book title, author, number of votes, and image url. It will look something like this:

>`[['title', 'author', votes, 'image url'], ['Harry Potter number 1', 'JK Rowling', 999999, 'http://www.someurl.com/image.jpg'], ['The Way of Kings','Brandon Sanderson', 234523, 'anotherimage.com/picture.png']]`

In [0]:
#create an empty list for our data
book_list = []

#begin looping through each poll item
for book in poll_item:
  title_tag = book.find('a', class_='pollAnswer__bookLink') #navigate to the html tag holding the title information
  tag_text = str(title_tag) #work with the tag as a plain string
  title = between(tag_text, 'title=\"', ' by ') #extract the book title with the between function
  author = between(tag_text, ' by ', '\" src') #extract the author with the between function
  votes = book.find('strong', class_='uitext result').text #get the vote text from its location
  vote = before(votes, "votes").strip() #keep the number before "votes" and strip the whitespace
  image = between(tag_text, 'src=\"', '\" title') #extract the image url with the between function
  #add the data as a list to our book list
  book_list.append([title, author, int(vote.replace(',','')), image])
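The vote text goes through three small steps before it becomes a number: cut off the word "votes", strip whitespace, and remove the thousands separators. As a sketch, those steps could be wrapped in one function that also complains when the text is not what we expect. `parse_votes` is a hypothetical helper, and it uses `str.partition` in place of the `before` function so that it runs on its own.

```python
def parse_votes(text):
    """Convert text such as '55,300 votes' to the integer 55300."""
    number_part = text.partition('votes')[0].strip()  # keep what precedes 'votes'
    digits = number_part.replace(',', '')             # drop thousands separators
    if not digits.isdigit():
        raise ValueError(f'unexpected vote text: {text!r}')
    return int(digits)

print(parse_votes('55,300 votes'))   # 55300
print(parse_votes(' 1,641 votes '))  # 1641
```

Raising an error on unexpected text makes a change in the page layout show up immediately, rather than as a confusing `int()` failure deep in the loop.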

Cool! Let's have a look at our list of extracted data!

In [0]:
print("List of book data:")
for book in book_list:
  print (book)

List of book data:
['Still Me', 'Jojo Moyes', 55300, 'https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1502818159l/35791968.jpg']
['An American Marriage', 'Tayari Jones', 41826, 'https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1491493625l/33590210._SY475_.jpg']
['Us Against You', 'Fredrik Backman', 38981, 'https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1516019348l/36373463._SY475_.jpg']
['An Absolutely Remarkable Thing', 'Hank Green', 24363, 'https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1520448919l/38186611.jpg']
['Killing Commendatore', 'Haruki Murakami', 23695, 'https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1527854255l/38820047.jpg']
['There There', 'Tommy Orange', 18614, 'https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1512071034l/36692478.jpg']
['All We Ever Wanted', 'Emily Giffin', 17490, 'https://i.gr-assets.com/images/S/compressed.photo.goodrea

Awesome, we have a usable list of nicely formatted data.


##4) Manipulating our data

Let's try some data manipulation.

I'll show you how to get the total number of votes for all the books together, and the average number of votes per book.

Let's first separate our votes data into its own list using a list comprehension:

In [0]:
all_votes = [book[2] for book in book_list] 
print("List of votes:" , all_votes)

List of votes: [55300, 41826, 38981, 24363, 23695, 18614, 17490, 15427, 7343, 5554, 5045, 4405, 3126, 1973, 1641]


Great! Now we can easily get the calculations we want using some simple python tools:

In [0]:
total_votes = sum(all_votes) #sum() function sums up our list
avg_votes = total_votes/len(all_votes) #divide the sum by the length of the list to get the average
print("Total votes:" , total_votes, ". Average vote: ", avg_votes)

Total votes: 264783 . Average vote:  17652.2
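Python's standard `statistics` module offers the same average plus a few related measures. Here is a short sketch using the vote counts printed above; the "top share" figure is an extra illustration, not something the awards page reports.

```python
import statistics

# Vote counts copied from the printed list of votes
all_votes = [55300, 41826, 38981, 24363, 23695, 18614, 17490, 15427,
             7343, 5554, 5045, 4405, 3126, 1973, 1641]

mean_votes = statistics.mean(all_votes)      # same value as sum / len
median_votes = statistics.median(all_votes)  # middle value of the sorted list
top_share = max(all_votes) / sum(all_votes)  # winner's fraction of all votes

print(f'Mean: {mean_votes:.1f}, median: {median_votes}, top share: {top_share:.1%}')
# Mean: 17652.2, median: 15427, top share: 20.9%
```

The median is lower than the mean here because a few very popular books pull the average up.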


That was easy!

Let's try one more thing with our scraped data.

The book nominees show on the website in the order of highest votes to lowest, showing their ranking in the choice awards competition. Let's try to switch the order, and see the lowest voted books first!

To accomplish this, we'll use python's `sorted()` function:

In [0]:
#sorted() returns a new list in ascending order and leaves book_list unchanged
#key tells sorted() what value to sort by
#our lambda function sets the key to index 2 of each sublist, which is the vote count
ascending_book_list = sorted(book_list, key = lambda book: book[2])

print("Book list by ascending votes")
for book in ascending_book_list:
  print(book)

Book list by ascending votes
['The Book of Essie', 'Meghan MacLean Weir', 1641, 'https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1515873936l/36723245.jpg']
['How to Walk Away', 'Katherine Center', 1973, 'https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1507579931l/36249638.jpg']
["You Think It, I'll Say It", 'Curtis Sittenfeld', 3126, 'https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1509832627l/35952871.jpg']
['Everything Here Is Beautiful', 'Mira T. Lee', 4405, 'https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1492886846l/34262106.jpg']
['My Year of Rest and Relaxation', 'Ottessa Moshfegh', 5045, 'https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1513259517l/36203391._SX318_.jpg']
['A Spark of Light', 'Jodi Picoult', 5554, 'https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1524516474l/39072220.jpg']
['The Female Persuasion', 'Meg Wolitzer', 7343, 'https://i.gr-
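Two small variations on the same idea are `operator.itemgetter`, which can stand in for the lambda, and `reverse=True`, which flips the order back to highest first. This is a sketch on three rows taken from the output above, with the image URL left out for brevity.

```python
from operator import itemgetter

# Three rows from the output above (title, author, votes); image URL omitted
books = [
    ['How to Walk Away', 'Katherine Center', 1973],
    ['The Book of Essie', 'Meghan MacLean Weir', 1641],
    ["You Think It, I'll Say It", 'Curtis Sittenfeld', 3126],
]

ascending = sorted(books, key=itemgetter(2))                 # same as lambda book: book[2]
descending = sorted(books, key=itemgetter(2), reverse=True)  # highest votes first
lowest_two = ascending[:2]                                   # slice for a "bottom N" list

print([book[0] for book in ascending])
print([book[0] for book in descending])
```

Because `sorted()` builds a new list each time, `books` keeps its original order and can be sorted again in a different way.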

## Signing off

Web scraping is a powerful, if controversial, tool. It lets us pull information from websites that may not offer API access to the data we want, and then manipulate that data to suit our needs.

Thank you
