### This notebook searches GoodReads through its API. The code looks for books matching the search phrase "data science".

Do not share this notebook if keys are in it.

Do not push the api_keys.json file to GitHub. Leave it on the local machine to keep the keys secret.

https://www.goodreads.com/api
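One way to follow the advice above is to read the key from an environment variable first and fall back to the local file, and to mask the key whenever it has to be displayed. This is a minimal sketch: the environment variable name `GOODREADS_API_KEY` and the helpers `load_api_key` and `mask` are assumptions made for illustration, and the file layout (`{"key": "..."}`) follows how the notebook reads api_keys.json.

```python
import json
import os
from pathlib import Path


def load_api_key(path="api_keys.json", env_var="GOODREADS_API_KEY"):
    """Return the API key from an environment variable or a local JSON file.

    `env_var` is a hypothetical name chosen for this sketch; the JSON file
    is expected to look like {"key": "..."}.
    """
    key = os.environ.get(env_var)
    if key:
        return key
    key_file = Path(path)
    if not key_file.exists():
        raise FileNotFoundError(f"{path} not found and {env_var} is not set")
    with key_file.open("r") as f:
        return json.load(f)["key"]


def mask(key, visible=4):
    """Keep only the first few characters so the key can be shown safely."""
    return key[:visible] + "*" * max(len(key) - visible, 0)
```

Adding api_keys.json to the repository's .gitignore file is the usual way to keep it from being pushed by accident.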

In [1]:
pwd

In [2]:
import requests 
import json

In [3]:
# open the file on the local machine with the GoodReads API key
with open('api_keys.json', 'r') as f:
    keys = json.load(f)

In [4]:
#use this if you want to see what the key value is - do not print when sharing on GitHub
# keys['key']

In [5]:
# Set the parameters based on the API documentation.
# We want to find books whose listing information includes "data science".

base_url = 'https://goodreads.com'
search_domain = '/search/index.xml'

parameters = {
    'q':'data science',
    'page': 1,
    'key': keys['key']
}
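To see what the request will look like, the parameters can be turned into a query string by hand. The sketch below uses only the standard library and a placeholder key (`YOUR_KEY` is not a real key); `requests` builds an equivalent URL from the `params` argument.

```python
from urllib.parse import urlencode

base_url = "https://goodreads.com"
search_domain = "/search/index.xml"

# "YOUR_KEY" is a placeholder used for illustration only.
parameters = {"q": "data science", "page": 1, "key": "YOUR_KEY"}

# Spaces in the query become "+" and the pairs are joined with "&".
full_url = base_url + search_domain + "?" + urlencode(parameters)
print(full_url)
# https://goodreads.com/search/index.xml?q=data+science&page=1&key=YOUR_KEY
```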

In [6]:
#confirm the code is able to access the site

r = requests.get(base_url + search_domain, params = parameters)
r.status_code

200

In [7]:
#view the results of the API call

r.text

'<?xml version="1.0" encoding="UTF-8"?>\n<GoodreadsResponse>\n  <Request>\n    <authentication>true</authentication>\n      <key><![CDATA[REDACTED]]></key>\n    <method><![CDATA[search_index]]></method>\n  </Request>\n  <search>\n  <query><![CDATA[data science]]></query>\n    <results-start>1</results-start>\n    <results-end>20</results-end>\n    <total-results>2114</total-results>\n    <source>Goodreads</source>\n    <query-time-seconds>0.20</query-time-seconds>\n    <results>\n        <work>\n  <id type="integer">24087179</id>\n  <books_count type="integer">9</books_count>\n  <ratings_count type="integer">436</ratings_count>\n  <text_reviews_count type="integer">43</text_reviews_count>\n  <original_publication_year type="integer">2013</original_publication_year>\n  <original_publication_month type="integer">1</original_publication_month>\n  <original_publication_day type="integer">1</original_publication_day>\n  <average_rating>3.74</average_rating>\n  <best_book type=
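The raw response echoes the API key back inside a `<key>` element, so displaying `r.text` can leak it. A possible safeguard, sketched here with a hypothetical helper `redact_key`, is to mask that element before showing the text. The sample string is a shortened stand-in for a real response.

```python
import re


def redact_key(xml_text, placeholder="REDACTED"):
    """Replace the contents of the <key> CDATA section with a placeholder."""
    pattern = r"(<key>\s*<!\[CDATA\[).*?(\]\]>\s*</key>)"
    return re.sub(
        pattern,
        lambda m: m.group(1) + placeholder + m.group(2),
        xml_text,
        flags=re.DOTALL,
    )


# Shortened, made-up response text used only to demonstrate the helper.
sample = "<Request><key><![CDATA[secret123]]></key></Request>"
print(redact_key(sample))
# <Request><key><![CDATA[REDACTED]]></key></Request>
```

In the notebook this would be called as `redact_key(r.text)` instead of displaying `r.text` directly.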

In [8]:
import xmltodict

Review the results of the API call:
- Convert the XML to a dictionary
- Drill down into the dictionary to find the keys of the data we are looking for

In [9]:
results = xmltodict.parse(r.text)
results

OrderedDict([('GoodreadsResponse',
              OrderedDict([('Request',
                            OrderedDict([('authentication', 'true'),
                                         ('key', 'REDACTED'),
                                         ('method', 'search_index')])),
                           ('search',
                            OrderedDict([('query', 'data science'),
                                         ('results-start', '1'),
                                         ('results-end', '20'),
                                         ('total-results', '2114'),
                                         ('source', 'Goodreads'),
                                         ('query-time-seconds', '0.20'),
                                         ('results',
                                          OrderedDict([('work',
                                                        [OrderedDict([('id',
                                                                       Or

In [10]:
type(results)

collections.OrderedDict

In [11]:
results.keys()

odict_keys(['GoodreadsResponse'])

In [12]:
results['GoodreadsResponse']

OrderedDict([('Request',
              OrderedDict([('authentication', 'true'),
                           ('key', 'REDACTED'),
                           ('method', 'search_index')])),
             ('search',
              OrderedDict([('query', 'data science'),
                           ('results-start', '1'),
                           ('results-end', '20'),
                           ('total-results', '2114'),
                           ('source', 'Goodreads'),
                           ('query-time-seconds', '0.20'),
                           ('results',
                            OrderedDict([('work',
                                          [OrderedDict([('id',
                                                         OrderedDict([('@type',
                                                                       'integer'),
                                                                      ('#text',
                                                             

In [13]:
results['GoodreadsResponse'].keys()

odict_keys(['Request', 'search'])

In [14]:
results['GoodreadsResponse']['search']

OrderedDict([('query', 'data science'),
             ('results-start', '1'),
             ('results-end', '20'),
             ('total-results', '2114'),
             ('source', 'Goodreads'),
             ('query-time-seconds', '0.20'),
             ('results',
              OrderedDict([('work',
                            [OrderedDict([('id',
                                           OrderedDict([('@type', 'integer'),
                                                        ('#text',
                                                         '24087179')])),
                                          ('books_count',
                                           OrderedDict([('@type', 'integer'),
                                                        ('#text', '9')])),
                                          ('ratings_count',
                                           OrderedDict([('@type', 'integer'),
                                                        ('#text', '436')])),
           

In [15]:
results['GoodreadsResponse']['search'].keys()

odict_keys(['query', 'results-start', 'results-end', 'total-results', 'source', 'query-time-seconds', 'results'])

In [16]:
results['GoodreadsResponse']['search']['results']

OrderedDict([('work',
              [OrderedDict([('id',
                             OrderedDict([('@type', 'integer'),
                                          ('#text', '24087179')])),
                            ('books_count',
                             OrderedDict([('@type', 'integer'),
                                          ('#text', '9')])),
                            ('ratings_count',
                             OrderedDict([('@type', 'integer'),
                                          ('#text', '436')])),
                            ('text_reviews_count',
                             OrderedDict([('@type', 'integer'),
                                          ('#text', '43')])),
                            ('original_publication_year',
                             OrderedDict([('@type', 'integer'),
                                          ('#text', '2013')])),
                            ('original_publication_month',
                             OrderedDict([('@ty

In [17]:
results['GoodreadsResponse']['search']['results'].keys()

odict_keys(['work'])
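Drilling down one key at a time works well for exploration. Once the path is known, a small helper can follow it in one call and return a default when a level is missing. This is a sketch; `dig` is a hypothetical helper, and the sample dictionary is a trimmed imitation of the structure shown above.

```python
def dig(data, *path, default=None):
    """Follow each key or index in `path` through nested dicts and lists."""
    current = data
    for step in path:
        try:
            current = current[step]
        except (KeyError, IndexError, TypeError):
            # Missing key, index out of range, or a level that is not a container.
            return default
    return current


# Trimmed stand-in for the parsed response.
sample = {
    "GoodreadsResponse": {
        "search": {
            "results": {
                "work": [{"best_book": {"title": "Doing Data Science"}}]
            }
        }
    }
}

title = dig(sample, "GoodreadsResponse", "search", "results", "work", 0, "best_book", "title")
missing = dig(sample, "GoodreadsResponse", "search", "no_such_key")
print(title)    # Doing Data Science
print(missing)  # None
```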

In [18]:
results['GoodreadsResponse']['search']['results']['work']

[OrderedDict([('id',
               OrderedDict([('@type', 'integer'), ('#text', '24087179')])),
              ('books_count',
               OrderedDict([('@type', 'integer'), ('#text', '9')])),
              ('ratings_count',
               OrderedDict([('@type', 'integer'), ('#text', '436')])),
              ('text_reviews_count',
               OrderedDict([('@type', 'integer'), ('#text', '43')])),
              ('original_publication_year',
               OrderedDict([('@type', 'integer'), ('#text', '2013')])),
              ('original_publication_month',
               OrderedDict([('@type', 'integer'), ('#text', '1')])),
              ('original_publication_day',
               OrderedDict([('@type', 'integer'), ('#text', '1')])),
              ('average_rating', '3.74'),
              ('best_book',
               OrderedDict([('@type', 'Book'),
                            ('id',
                             OrderedDict([('@type', 'integer'),
                                    

In [19]:
len(results['GoodreadsResponse']['search']['results']['work'])

20

In [20]:
# You can see there are 20 books on this first page
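With 20 books per page and the `total-results` value of 2114 reported in the response, the number of pages the search spans can be estimated. This is a sketch that assumes every page holds 20 results, as observed on the first page; whether the API actually serves every page is not confirmed here.

```python
import math


def page_count(total_results, per_page=20):
    """Number of pages needed to hold `total_results` items."""
    return math.ceil(total_results / per_page)


print(page_count(2114))  # 106
```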

In [21]:
results['GoodreadsResponse']['search']['results']['work'][0].keys()

odict_keys(['id', 'books_count', 'ratings_count', 'text_reviews_count', 'original_publication_year', 'original_publication_month', 'original_publication_day', 'average_rating', 'best_book'])
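Fields that carried a `type="integer"` attribute in the XML come back as small dictionaries with `@type` and `#text` keys, and the number itself is still a string. The sketch below, using a hypothetical helper `coerce`, flattens those entries into plain Python values. The sample values are taken from the first work shown above.

```python
def coerce(value):
    """Turn {'@type': 'integer', '#text': '436'} into 436; pass other values through."""
    if isinstance(value, dict) and "#text" in value:
        text = value["#text"]
        if value.get("@type") == "integer":
            return int(text)
        return text
    return value


work = {
    "id": {"@type": "integer", "#text": "24087179"},
    "ratings_count": {"@type": "integer", "#text": "436"},
    "average_rating": "3.74",
}

clean = {name: coerce(value) for name, value in work.items()}
print(clean)
# {'id': 24087179, 'ratings_count': 436, 'average_rating': '3.74'}
```

Nested entries such as `best_book` are left untouched by this helper and would need the same treatment applied one level down.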

In [22]:
#details about the first book

results['GoodreadsResponse']['search']['results']['work'][0]['best_book']

OrderedDict([('@type', 'Book'),
             ('id',
              OrderedDict([('@type', 'integer'), ('#text', '17346997')])),
             ('title', 'Doing Data Science'),
             ('author',
              OrderedDict([('id',
                            OrderedDict([('@type', 'integer'),
                                         ('#text', '5930481')])),
                           ('name', 'Rachel Schutt')])),
             ('image_url',
              'https://images.gr-assets.com/books/1411927798m/17346997.jpg'),
             ('small_image_url',
              'https://images.gr-assets.com/books/1411927798s/17346997.jpg')])

In [23]:
results['GoodreadsResponse']['search']['results']['work'][0]['best_book'].keys()

odict_keys(['@type', 'id', 'title', 'author', 'image_url', 'small_image_url'])

In [24]:
title = results['GoodreadsResponse']['search']['results']['work'][0]['best_book']['title']

In [25]:
author = results['GoodreadsResponse']['search']['results']['work'][0]['best_book']['author']

In [26]:
page_len = len(results['GoodreadsResponse']['search']['results']['work'])
data_dict = {}
for book_idx in range(page_len):
    # index with book_idx so every field in a row comes from the same book
    best_book = results['GoodreadsResponse']['search']['results']['work'][book_idx]['best_book']
    data_dict[book_idx] = {
        'book_id': best_book['id']['#text'],
        'book_author': best_book['author']['name'],
        'book_title': best_book['title'],
    }

In [27]:
import numpy as np
import pandas as pd

In [28]:
pd.DataFrame.from_dict(data_dict, orient='index')

Unnamed: 0,book_id,book_author,book_title
0,17346997,Rachel Schutt,Doing Data Science
1,17682206,John W. Foreman,Data Smart: Using Data Science to Transform In...
2,25407018,Joel Grus,Data Science from Scratch: First Principles wi...
3,25008399,Lillian Pierson,Data Science For Dummies
4,36722689,John D. Kelleher,Data Science
5,12700492,D.J. Patil,Building Data Science Teams
6,17912916,Foster Provost,Data Science for Business: What you need to kn...
7,13638556,Mike Loukides,What Is Data Science?
8,25358081,Roger D. Peng,R Programming for Data Science
9,34213247,Annalyn Ng,Numsense! Data Science for the Layman: No Math...
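A quick sanity check on a table like this is to confirm that no `book_id` appears more than once, since a repeated id usually points to an indexing mistake in the loop that built the rows. The sketch below uses a hypothetical helper `duplicated_values` and a few illustrative rows with a deliberately repeated id.

```python
from collections import Counter


def duplicated_values(rows, field):
    """Return the values of `field` that occur in more than one row."""
    counts = Counter(row[field] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)


# Illustrative rows; the first two share an id on purpose.
rows = [
    {"book_id": "17346997", "book_title": "Doing Data Science"},
    {"book_id": "17346997", "book_title": "Data Science For Dummies"},
    {"book_id": "12700492", "book_title": "Building Data Science Teams"},
]

print(duplicated_values(rows, "book_id"))  # ['17346997']
```

With the notebook's dictionary, the rows would be `data_dict.values()`.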


Create a for loop that goes through 39 pages of results and collects the data science related books on those pages.

In [29]:
# time.sleep(1) pauses for 1 second between requests so we don't overload the server we are hitting
import time

base_url = 'https://www.goodreads.com/'
search_endpoint = 'search/index.xml'
parameters = {
    'q': 'data science',
    'page': 1,
    'key': keys['key']
}

data_dictionary = {}
current_id = 0

for request_num in range(1,40):
    print("Hey you! I'm on page {}".format(request_num))
    parameters['page'] = request_num
    time.sleep(1)
    r = requests.get(base_url + search_endpoint,params=parameters)
    if r.status_code != 200:
        print(r.status_code)
        break
    results = xmltodict.parse(r.text)
    books = results['GoodreadsResponse']['search']['results']['work']
    for book in books:
        data_dictionary[current_id] = {
            'book_id': book['best_book']['id']['#text'],
            'book_title':  book['best_book']['title'],
            'book_author': book['best_book']['author']['name']
        }
        current_id += 1

Hey you! I'm on page 1
Hey you! I'm on page 2
Hey you! I'm on page 3
Hey you! I'm on page 4
Hey you! I'm on page 5
Hey you! I'm on page 6
Hey you! I'm on page 7
Hey you! I'm on page 8
Hey you! I'm on page 9
Hey you! I'm on page 10
Hey you! I'm on page 11
Hey you! I'm on page 12
Hey you! I'm on page 13
Hey you! I'm on page 14
Hey you! I'm on page 15
Hey you! I'm on page 16
Hey you! I'm on page 17
Hey you! I'm on page 18
Hey you! I'm on page 19
Hey you! I'm on page 20
Hey you! I'm on page 21
Hey you! I'm on page 22
Hey you! I'm on page 23
Hey you! I'm on page 24
Hey you! I'm on page 25
Hey you! I'm on page 26
Hey you! I'm on page 27
Hey you! I'm on page 28
Hey you! I'm on page 29
Hey you! I'm on page 30
Hey you! I'm on page 31
Hey you! I'm on page 32
Hey you! I'm on page 33
Hey you! I'm on page 34
Hey you! I'm on page 35
Hey you! I'm on page 36
Hey you! I'm on page 37
Hey you! I'm on page 38
Hey you! I'm on page 39
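The loop above assumes every page contains a list of several works. xmltodict typically returns a single dictionary when an element occurs only once and `None` when an element is empty, so a last page with one book or no books could break the loop. The sketch below normalizes both cases. The names `as_list`, `extract_books`, and `collect_pages` are hypothetical, and `fetch_page` stands for any callable that returns the parsed dictionary for a page (or `None` on failure), which keeps the logic testable without network access.

```python
def as_list(value):
    """Normalize xmltodict output: None -> [], single dict -> [dict], list -> list."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


def extract_books(parsed):
    """Pull id, title, and author name out of one parsed search page."""
    results = parsed["GoodreadsResponse"]["search"]["results"] or {}
    rows = []
    for work in as_list(results.get("work")):
        best = work["best_book"]
        rows.append({
            "book_id": best["id"]["#text"],
            "book_title": best["title"],
            "book_author": best["author"]["name"],
        })
    return rows


def collect_pages(fetch_page, first_page=1, last_page=39):
    """Gather rows page by page, stopping at a failed or empty page."""
    rows = []
    for page in range(first_page, last_page + 1):
        parsed = fetch_page(page)  # hypothetical callable supplied by the caller
        if parsed is None:
            break
        page_rows = extract_books(parsed)
        if not page_rows:
            break
        rows.extend(page_rows)
    return rows
```

In the notebook, `fetch_page` could wrap the `time.sleep(1)`, `requests.get`, status check, and `xmltodict.parse` steps from the cell above.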


In [30]:
pd.DataFrame.from_dict(data_dictionary, orient='index')

Unnamed: 0,book_id,book_title,book_author
0,17346997,Doing Data Science,Rachel Schutt
1,17682206,Data Smart: Using Data Science to Transform In...,John W. Foreman
2,25407018,Data Science from Scratch: First Principles wi...,Joel Grus
3,25008399,Data Science For Dummies,Lillian Pierson
4,36722689,Data Science,John D. Kelleher
5,12700492,Building Data Science Teams,D.J. Patil
6,17912916,Data Science for Business: What you need to kn...,Foster Provost
7,13638556,What Is Data Science?,Mike Loukides
8,25358081,R Programming for Data Science,Roger D. Peng
9,34213247,Numsense! Data Science for the Layman: No Math...,Annalyn Ng
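Once the books are collected, the dictionary can be written out as CSV for later use. The sketch below uses only the standard library and a hypothetical helper `rows_to_csv`; with pandas, calling `to_csv` on the DataFrame would serve the same purpose. The two sample rows are taken from the table above.

```python
import csv
import io


def rows_to_csv(data_dictionary, fieldnames=None):
    """Serialize {row_number: {column: value}} to CSV text, ordered by row number."""
    if fieldnames is None:
        fieldnames = ["book_id", "book_title", "book_author"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for _, row in sorted(data_dictionary.items()):
        writer.writerow(row)
    return buffer.getvalue()


sample = {
    0: {"book_id": "17346997", "book_title": "Doing Data Science", "book_author": "Rachel Schutt"},
    1: {"book_id": "13638556", "book_title": "What Is Data Science?", "book_author": "Mike Loukides"},
}

csv_text = rows_to_csv(sample)
print(csv_text)
```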
