# AIM 5001 Week 12 Assignment
## Working with HTML, XML, JSON and Web APIs

### Part I: Working with HTML, XML, and JSON

* JSON <br>
Source file https://raw.githubusercontent.com/chenkecoco1/AIM-5001/master/books.json

In [1]:
# import required libraries
import pandas as pd

In [2]:
# read the json file as dataframe with pd.read_json directly
bookjson = pd.read_json('https://raw.githubusercontent.com/chenkecoco1/AIM-5001/master/books.json')
# show the dataframe
bookjson

| | title | ISBN-13 | author | publisher | publish date |
|---|---|---|---|---|---|
| 0 | Lichen Biology | 9780521692168 | Thomas H. Nash III | Cambridge University Press | 06/24/2008 |
| 1 | Lichens of North America | 9780300082494 | [Irwin M. Brodo, Sylvia Duran Sharnoff, Stephe... | Yale University Press | 09/10/2001 |
| 2 | Field Guide to the Lichens of Great Smoky Moun... | 9781621905141 | [Erin Tripp, James Lendemer] | University of Tennessee Press | 04/29/2020 |


* HTML <br>
Source file https://raw.githubusercontent.com/chenkecoco1/AIM-5001/master/books.html

In [3]:
# read the html file with pd.read_html
bookhtml = pd.read_html('https://raw.githubusercontent.com/chenkecoco1/AIM-5001/master/books.html')
# pd.read_html returns a list of dataframes (one per table in the page); show the list
bookhtml

[                                               title  \
 0                                     Lichen Biology   
 1                           Lichens of North America   
 2  Field Guide to the Lichens of Great Smoky Moun...   
 
                        publisher publish_date        ISBN-13  \
 0     Cambridge University Press   06/24/2008  9780521692168   
 1          Yale University Press   09/10/2001  9780300082494   
 2  University of Tennessee Press   04/29/2020  9781621905141   
 
                                               author  
 0                                 Thomas H. Nash III  
 1  Irwin M. Brodo, Sylvia Duran Sharnoff, Stephen...  
 2                         Erin Tripp, James Lendemer  ]

In [4]:
# the list holds a single table; take it out as a dataframe and show it
bookhtml_df = bookhtml[0]
bookhtml_df

| | title | publisher | publish_date | ISBN-13 | author |
|---|---|---|---|---|---|
| 0 | Lichen Biology | Cambridge University Press | 06/24/2008 | 9780521692168 | Thomas H. Nash III |
| 1 | Lichens of North America | Yale University Press | 09/10/2001 | 9780300082494 | Irwin M. Brodo, Sylvia Duran Sharnoff, Stephen... |
| 2 | Field Guide to the Lichens of Great Smoky Moun... | University of Tennessee Press | 04/29/2020 | 9781621905141 | Erin Tripp, James Lendemer |


* XML <br>
Source file https://raw.githubusercontent.com/chenkecoco1/AIM-5001/master/books.xml <br>
Try findtext with tutorial https://towardsdatascience.com/parsing-xml-data-in-python-da26288280e1

In [5]:
# import related libraries
import urllib.request
from lxml import objectify

# download the file to a temporary location; urlretrieve returns (local path, headers)
path, headers = urllib.request.urlretrieve('https://raw.githubusercontent.com/chenkecoco1/AIM-5001/master/books.xml')

# parse the xml data (passing the path lets lxml open and close the file itself)
tree = objectify.parse(path)
root = tree.getroot()

# define an empty list that will be used to store the parsed data
data = []
# loop over each <book> element and collect its fields
for a in root.book:
    title = a.findtext("title")
    isbn = a.findtext("ISBN-13")
    publisher = a.findtext("publisher")
    publish_date = a.findtext("publish_date")
    author = a.findtext("author")

    data.append({"title": title, "ISBN-13": isbn, "publish_date": publish_date,
                 "publisher": publisher, "author": author})

# build the dataframe; the columns argument sets the column order
bookxml = pd.DataFrame(data, columns=["title", "ISBN-13", "author", "publisher", "publish_date"])
bookxml

| | title | ISBN-13 | author | publisher | publish_date |
|---|---|---|---|---|---|
| 0 | Lichen Biology | 9780521692168 | Thomas H. Nash III | Cambridge University Press | 06/24/2008 |
| 1 | Lichens of North America | 9780300082494 | Irwin M. Brodo, Sylvia Duran Sharnoff, Stephe... | Yale University Press | 09/10/2001 |
| 2 | Field Guide to the Lichens of Great Smoky Mou... | 9781621905141 | Erin Tripp, James Lendemer | University of Tennessee Press | 04/29/2020 |
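
The `findtext` pattern can also be tried without downloading anything. The sketch below uses the standard library's `xml.etree.ElementTree` instead of `lxml.objectify`, on a small inline string. The root element name `books` and the reduced set of fields are assumptions made for illustration, not a copy of books.xml.

```python
import xml.etree.ElementTree as ET

# Inline sample shaped like one <book> record; the <books> root name is assumed.
xml_text = """
<books>
  <book>
    <title>Lichen Biology</title>
    <ISBN-13>9780521692168</ISBN-13>
    <author>Thomas H. Nash III</author>
  </book>
</books>
"""

root = ET.fromstring(xml_text)

rows = []
for book in root.findall("book"):
    rows.append({
        "title": book.findtext("title"),
        # findtext returns the element text as a string, not a number
        "ISBN-13": book.findtext("ISBN-13"),
        # the sample has no <publisher>, so the default value is returned
        "publisher": book.findtext("publisher", default="unknown"),
    })

print(rows)
```

Two things are worth noticing: every value comes back as a string, and a missing tag yields `None` unless a `default` is given.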


The three dataframes hold the same data, with a few differences:
* a. JSON dataframe: when a book has more than one author, the names are stored as a list, so the list brackets appear in the `author` column. The date column is also named `publish date`, while the other two dataframes use `publish_date`.
* b. HTML dataframe: the column order follows the order of the columns in the source table.
* c. XML dataframe: the column order is set by the `columns` argument when the dataframe is created, so it can be changed freely.
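
Difference (a) can be removed by joining author lists into a single string. The following is a minimal sketch using only the standard library. The two records are cut down to the values visible in the output above, and `authors_as_text` is a hypothetical helper name.

```python
import json

# Two records shaped like the books.json rows shown above: a single author is a
# plain string, several authors are a list (inferred from the dataframe output).
raw = """
[
  {"ISBN-13": 9780521692168, "author": "Thomas H. Nash III"},
  {"ISBN-13": 9781621905141, "author": ["Erin Tripp", "James Lendemer"]}
]
"""


def authors_as_text(value):
    """Return the authors as one comma-separated string for str or list input."""
    if isinstance(value, list):
        return ", ".join(value)
    return value


records = json.loads(raw)
for record in records:
    record["author"] = authors_as_text(record["author"])

print(records)
```

On the dataframe itself, the same helper could presumably be applied with `bookjson['author'].apply(authors_as_text)`, which would make the JSON `author` column match the HTML and XML versions.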

### Part II: Working with Web APIs

In [6]:
# import useful libraries
import requests

# request the Stephen King book reviews from the nytimes.com Books API (the URL carries the API key)
api = requests.get('https://api.nytimes.com/svc/books/v3/reviews.json?author=Stephen+King&api-key=wNAx8SCrJT2LWzsTkAkpGSlvtmvPLAdJ')
# parse the json response into a Python dictionary
bookapi = api.json()
# inspect the response
bookapi

{'status': 'OK',
 'copyright': 'Copyright (c) 2020 The New York Times Company.  All Rights Reserved.',
 'num_results': 66,
 'results': [{'url': 'http://www.nytimes.com/2011/11/13/books/review/11-22-63-by-stephen-king-book-review.html',
   'publication_dt': '2011-11-13',
   'byline': 'ERROL MORRIS',
   'book_title': '11/22/63',
   'book_author': 'Stephen King',
   'summary': 'Stephen King’s time traveler tries to undo some painful history.',
   'uuid': '00000000-0000-0000-0000-000000000000',
   'uri': 'nyt://book/00000000-0000-0000-0000-000000000000',
   'isbn13': ['9780307951434',
    '9780606351461',
    '9781442344280',
    '9781442344303',
    '9781442391635',
    '9781444727326',
    '9781451627282',
    '9781451627299',
    '9781451627305',
    '9781451651645',
    '9781501120602',
    '9781594135590']},
  {'url': 'http://www.nytimes.com/2011/10/31/books/stephen-kings-11-23-63-review.html',
   'publication_dt': '2011-10-31',
   'byline': 'JANET MASLIN',
   'book_title': '11/22/63'
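
The request URL above is written out by hand, with the key embedded in the notebook. One possible alternative is sketched below: the query string is built with `urllib.parse.urlencode` and the key is read from an environment variable. The variable name `NYT_API_KEY`, the placeholder `YOUR_KEY`, and the helper `build_review_url` are all assumptions for illustration.

```python
import os
from urllib.parse import urlencode

BASE_URL = "https://api.nytimes.com/svc/books/v3/reviews.json"


def build_review_url(author, api_key):
    """Build the reviews URL for one author; urlencode writes spaces as '+'."""
    query = urlencode({"author": author, "api-key": api_key})
    return f"{BASE_URL}?{query}"


# NYT_API_KEY is a hypothetical variable name; a placeholder is used if it is unset.
api_key = os.environ.get("NYT_API_KEY", "YOUR_KEY")
url = build_review_url("Stephen King", api_key)
print(url)
```

The resulting `url` can be passed to `requests.get` as before. `requests.get` also accepts a `params` dictionary, which does the same encoding, and calling `raise_for_status()` on the response before `.json()` surfaces HTTP errors early.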

In [7]:
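# create a dataframe directly from the response: the top-level fields (status, copyright,
# num_results) repeat on every row and each review stays nested in the 'results' column
step1 = pd.DataFrame(bookapi)
# see general info of data
step1.head()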

| | status | copyright | num_results | results |
|---|---|---|---|---|
| 0 | OK | Copyright (c) 2020 The New York Times Company.... | 66 | {'url': 'http://www.nytimes.com/2011/11/13/boo... |
| 1 | OK | Copyright (c) 2020 The New York Times Company.... | 66 | {'url': 'http://www.nytimes.com/2011/10/31/boo... |
| 2 | OK | Copyright (c) 2020 The New York Times Company.... | 66 | {'url': 'http://www.nytimes.com/2004/01/04/boo... |
| 3 | OK | Copyright (c) 2020 The New York Times Company.... | 66 | {'url': 'http://www.nytimes.com/1993/10/24/boo... |
| 4 | OK | Copyright (c) 2020 The New York Times Company.... | 66 | {'url': 'http://www.nytimes.com/2001/11/04/boo... |


In [8]:
# build the dataframe from the 'results' list instead and keep only the useful columns
bookSK = pd.DataFrame(bookapi['results'])[['book_title', 'publication_dt', 'byline', 'isbn13', 'summary']]
# show the last five Stephen King book reviews
bookSK.tail()

| | book_title | publication_dt | byline | isbn13 | summary |
|---|---|---|---|---|---|
| 61 | End of Watch | 2016-06-12 | DENISE MINA | [9781410489906, 9781501129742, 9781501134142, ... | A retired police detective sees the return of ... |
| 62 | The Outsider | 2018-05-22 | VICTOR LAVALLE | [9781501180989] | “The Outsider” starts out as a routine police ... |
| 63 | Elevation | 2018-10-26 | GILBERT CRUZ | [9781982102319] | King’s slim new novel, “Elevation,” returns us... |
| 64 | The Institute | 2019-09-08 | DWIGHT GARNER | [9781982110567] | In his latest, King tells the story of an inst... |
| 65 | The Institute | 2019-09-10 | LAURA MILLER | [9781982110567] | The terror doesn’t come from ghosts or fiends ... |
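
Each row in this output is one review, not one book: "The Institute" appears twice, with two different reviewers. A small sketch of grouping reviewers by title is given below. It uses plain Python and three records restricted to values visible in the output above. The name `reviewers_by_title` is hypothetical.

```python
from collections import defaultdict

# Three reviews shaped like entries of bookapi['results'], limited to fields
# and values that are visible in the output above.
results = [
    {"book_title": "The Outsider", "publication_dt": "2018-05-22",
     "byline": "VICTOR LAVALLE", "isbn13": ["9781501180989"]},
    {"book_title": "The Institute", "publication_dt": "2019-09-08",
     "byline": "DWIGHT GARNER", "isbn13": ["9781982110567"]},
    {"book_title": "The Institute", "publication_dt": "2019-09-10",
     "byline": "LAURA MILLER", "isbn13": ["9781982110567"]},
]

# Collect the reviewers of each title, keeping the order of the reviews.
reviewers_by_title = defaultdict(list)
for review in results:
    reviewers_by_title[review["book_title"]].append(review["byline"])

for title, reviewers in reviewers_by_title.items():
    print(title, "-", len(reviewers), "review(s):", ", ".join(reviewers))
```

With the dataframe, a comparable grouping could be written as `bookSK.groupby('book_title')['byline'].apply(list)`.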
