# Live Assignment 3
## DS 6001: Practice and Application of Data Science
### Drew Haynes (rbc6wr)

In this live coding session we will work with three datasets that are available on the internet without API keys or other credentials (we'll cover APIs that require credentials next week).

The goal is to examine the structure of the data, decide which parts are metadata and which belong in the dataframe we are trying to build, and use the tools available to us to convert the data to a pandas dataframe in our Python environment.

We'll work on the following three examples together:

- Cocktail recipes from The Cocktail DB: http://www.thecocktaildb.com/api/json/v1/1/filter.php?c=Cocktail
- The published works of J.K. Rowling from OpenLibrary.org: https://openlibrary.org/authors/OL23919A/works.json
- Data on the Wikipedia pages returned by a search for the term "Virginia": https://en.wikipedia.org/w/api.php?action=query&list=search&srsearch=Virginia&format=json&srlimit=500
  (This search gets the first 500 hits, but there are 210391 results. If time permits, we will use the sroffset parameter described in the API documentation to get the full list.)

In [6]:
import numpy as np
import pandas as pd
import json
import requests

### Example 1

In [18]:
url = "https://www.thecocktaildb.com/api/json/v1/1/filter.php?c=Cocktail"
r = requests.get(url, headers = {'User-agent': 'rbc6wr@virginia.edu'})
r 
# 200 means the request succeeded. In general:
# 2xx are successful transactions
# 3xx are redirects
# 4xx are client errors (your fault)
# 5xx are server errors (their fault)


<Response [200]>
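The status-code ranges listed in the comments above can be turned into a small lookup. This is only a sketch: `describe_status` is a hypothetical helper name, not part of `requests`. In practice, calling `r.raise_for_status()` on the response is a common way to stop early on 4xx and 5xx codes.

```python
def describe_status(code):
    """Return a coarse label for an HTTP status code (hypothetical helper)."""
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "other"


# Example: label a few common codes
labels = {code: describe_status(code) for code in (200, 301, 404, 503)}
print(labels)
```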

In [9]:
r.text

'{"drinks":[{"strDrink":"155 Belmont","strDrinkThumb":"https:\\/\\/www.thecocktaildb.com\\/images\\/media\\/drink\\/yqvvqs1475667388.jpg","idDrink":"15346"},{"strDrink":"57 Chevy with a White License Plate","strDrinkThumb":"https:\\/\\/www.thecocktaildb.com\\/images\\/media\\/drink\\/qyyvtu1468878544.jpg","idDrink":"14029"},{"strDrink":"747 Drink","strDrinkThumb":"https:\\/\\/www.thecocktaildb.com\\/images\\/media\\/drink\\/i9suxb1582474926.jpg","idDrink":"178318"},{"strDrink":"9 1\\/2 Weeks","strDrinkThumb":"https:\\/\\/www.thecocktaildb.com\\/images\\/media\\/drink\\/xvwusr1472669302.jpg","idDrink":"16108"},{"strDrink":"A Gilligan\'s Island","strDrinkThumb":"https:\\/\\/www.thecocktaildb.com\\/images\\/media\\/drink\\/wysqut1461867176.jpg","idDrink":"16943"},{"strDrink":"A True Amaretto Sour","strDrinkThumb":"https:\\/\\/www.thecocktaildb.com\\/images\\/media\\/drink\\/rptuxy1472669372.jpg","idDrink":"17005"},{"strDrink":"A.D.M. (After Dinner Mint)","strDrinkThumb":"https:\\/\\/www.t

In [15]:
# json.loads / json.dumps work with strings (the 's' stands for "string");
# json.load / json.dump work with file objects instead
my_json = json.loads(r.text)
my_json['drinks'][0]

{'strDrink': '155 Belmont',
 'strDrinkThumb': 'https://www.thecocktaildb.com/images/media/drink/yqvvqs1475667388.jpg',
 'idDrink': '15346'}
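The difference between the four `json` functions can be seen without touching the network. The sketch below uses a made-up one-drink string in the same shape as the response, and `io.StringIO` stands in for a real file. With `requests`, `r.json()` is a shortcut for parsing the response body.

```python
import io
import json

# A tiny stand-in for r.text (illustrative data, same shape as the response)
text = '{"drinks": [{"strDrink": "Mojito", "idDrink": "11000"}]}'

from_string = json.loads(text)             # parse a str
from_file = json.load(io.StringIO(text))   # parse a file-like object

as_string = json.dumps(from_string)        # serialize to a str
buffer = io.StringIO()
json.dump(from_string, buffer)             # serialize into a file-like object

print(from_string == from_file)            # both routes give the same dict
print(buffer.getvalue() == as_string)      # both routes give the same text
```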

In [17]:
pd.json_normalize(my_json, record_path = ['drinks']) 
# Alternatively, pass the list of records directly:
pd.json_normalize(my_json['drinks']) 

Unnamed: 0,strDrink,strDrinkThumb,idDrink
0,155 Belmont,https://www.thecocktaildb.com/images/media/dri...,15346
1,57 Chevy with a White License Plate,https://www.thecocktaildb.com/images/media/dri...,14029
2,747 Drink,https://www.thecocktaildb.com/images/media/dri...,178318
3,9 1/2 Weeks,https://www.thecocktaildb.com/images/media/dri...,16108
4,A Gilligan's Island,https://www.thecocktaildb.com/images/media/dri...,16943
...,...,...,...
95,Michelada,https://www.thecocktaildb.com/images/media/dri...,178343
96,Midnight Mint,https://www.thecocktaildb.com/images/media/dri...,14842
97,Mojito,https://www.thecocktaildb.com/images/media/dri...,11000
98,Mojito Extra,https://www.thecocktaildb.com/images/media/dri...,15841
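To check that the two `json_normalize` calls are interchangeable here, the following offline sketch compares them on a two-record `sample` dictionary trimmed from the output above. Note that `idDrink` arrives as a string, so the last line shows one possible way to convert it to an integer if that is wanted.

```python
import pandas as pd

# Trimmed stand-in for my_json (two records, thumbnail column omitted)
sample = {
    "drinks": [
        {"strDrink": "Mojito", "idDrink": "11000"},
        {"strDrink": "Michelada", "idDrink": "178343"},
    ]
}

via_record_path = pd.json_normalize(sample, record_path=["drinks"])
via_list = pd.json_normalize(sample["drinks"])

# Both approaches should yield the same dataframe
same = via_record_path.equals(via_list)
print(same)

# idDrink is stored as text in the JSON; cast it if numeric IDs are needed
ids = via_list["idDrink"].astype(int).tolist()
```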


### Example 2

In [57]:
url = 'https://openlibrary.org/authors/OL23919A/works.json'

In [58]:
r = requests.get(url, headers = {'User-agent': 'rbc6wr@virginia.edu'})
r 

<Response [200]>

In [59]:
my_json = json.loads(r.text) 
dict(list(my_json.items())[:2])

{'links': {'self': '/authors/OL23919A/works.json',
  'author': '/authors/OL23919A',
  'next': '/authors/OL23919A/works.json?offset=50'},
 'size': 296}
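The `links.next` value suggests the 296 works are served in pages. A minimal sketch of following that link is below. It assumes the last page simply has no `next` key, which the output above does not confirm. `collect_entries` and the `fetch` callable are hypothetical names; `fetch` is injected so the logic can be tried offline.

```python
def collect_entries(first_page, fetch):
    """Gather 'entries' from first_page and every page reachable via links.next.

    fetch: callable taking the relative path in links.next and returning
    the parsed JSON (a dict) for that page.
    """
    entries = list(first_page.get("entries", []))
    page = first_page
    # Assumption: the final page has no 'next' key under 'links'
    while "next" in page.get("links", {}):
        page = fetch(page["links"]["next"])
        entries.extend(page.get("entries", []))
    return entries


# Offline demonstration with two fake pages
fake_pages = {
    "/works.json?offset=2": {"links": {}, "entries": [{"title": "C"}]},
}
first = {
    "links": {"next": "/works.json?offset=2"},
    "entries": [{"title": "A"}, {"title": "B"}],
}
all_entries = collect_entries(first, fake_pages.__getitem__)
print([e["title"] for e in all_entries])
```

With `requests`, `fetch` could be something like `lambda path: requests.get("https://openlibrary.org" + path).json()`, since the `next` value shown above is a relative path.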

In [60]:
pd.json_normalize(my_json, record_path = ['entries']).head(3) 

Unnamed: 0,description,title,covers,subject_places,subjects,subject_people,key,authors,subject_times,latest_revision,...,type.key,created.type,created.value,last_modified.type,last_modified.value,subtitle,description.type,description.value,links,excerpts
0,The Eighth Story. Nineteen Years Later. Based ...,Harry Potter and the Cursed Child,"[8763851, 9326661, 10551120]",[London],"[Drama, Fantasy, Magic, Juvenile Drama, Good a...","[Harry Potter, Hermine Granger, Ron Weasly]",/works/OL17360811W,"[{'type': {'key': '/type/author_role'}, 'autho...",[1996],26,...,/type/work,/type/datetime,2016-08-11T18:54:46.688344,/type/datetime,2022-01-20T23:00:57.123004,,,,,
1,,Garri Potter I Prokliatoe Ditia,,,,,/works/OL25352276W,"[{'type': {'key': '/type/author_role'}, 'autho...",,3,...,/type/work,/type/datetime,2021-09-29T00:46:55.304886,/type/datetime,2022-01-20T23:00:57.123004,,,,,
2,,Harry Potter et l'Enfant Maudit,[12023661],,,,/works/OL25434568W,"[{'type': {'key': '/type/author_role'}, 'autho...",,3,...,/type/work,/type/datetime,2021-09-30T07:26:12.434275,/type/datetime,2022-01-20T23:00:57.123004,,,,,
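If some of the top-level metadata should travel with each record, `json_normalize` accepts a `meta` argument alongside `record_path`. The sketch below uses a trimmed `sample` that mirrors the `links` / `size` / `entries` layout shown above. Whether to carry this metadata into the dataframe is a judgment call.

```python
import pandas as pd

# Trimmed stand-in for my_json: two entries plus the top-level metadata
sample = {
    "links": {"self": "/authors/OL23919A/works.json"},
    "size": 296,
    "entries": [
        {"title": "Harry Potter and the Cursed Child", "key": "/works/OL17360811W"},
        {"title": "Harry Potter et l'Enfant Maudit", "key": "/works/OL25434568W"},
    ],
}

df = pd.json_normalize(
    sample,
    record_path=["entries"],
    meta=["size", ["links", "self"]],  # nested meta paths are given as lists
)
print(df.columns.tolist())
```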


In [61]:
# Only getting the first author
my_json['entries'][0]['authors'][0]['author']['key']

'/authors/OL5231739A'

In [62]:
# list comprehension
x = [1,2,3,4,5]
[j**2 - 1 for j in x] # j is a generic element of the list 'x'

[0, 3, 8, 15, 24]

In [67]:
[b['authors'][0]['author']['key'] for b in my_json['entries']][:10]

['/authors/OL5231739A',
 '/authors/OL5231739A',
 '/authors/OL23919A',
 '/authors/OL23919A',
 '/authors/OL23919A',
 '/authors/OL23919A',
 '/authors/OL23919A',
 '/authors/OL23919A',
 '/authors/OL23919A',
 '/authors/OL23919A']
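The comprehension above raises an error if any entry has no `authors` list. A more defensive variant is sketched below. `first_author_key` is a hypothetical helper, and the sample entries are made up to cover the missing and empty cases.

```python
def first_author_key(entry):
    """Return the first author's key, or None if the entry has no authors."""
    authors = entry.get("authors") or []
    if not authors:
        return None
    return authors[0].get("author", {}).get("key")


# Illustrative entries: one normal, one without 'authors', one with an empty list
entries = [
    {"authors": [{"author": {"key": "/authors/OL23919A"}}]},
    {"title": "no authors field"},
    {"authors": []},
]
keys = [first_author_key(e) for e in entries]
print(keys)
```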

### Example 3

In [46]:
url = 'https://en.wikipedia.org/w/api.php?action=query&list=search&srsearch=Virginia&format=json&srlimit=500'

In [47]:
r = requests.get(url, headers = {'User-agent': 'rbc6wr@virginia.edu'})
r 

<Response [200]>

In [52]:
my_json = json.loads(r.text) 
dict(list(my_json.items())[:2])

{'batchcomplete': '', 'continue': {'sroffset': 500, 'continue': '-||'}}

In [53]:
pd.json_normalize(my_json, record_path = ['query','search']).head(3) 

Unnamed: 0,ns,title,pageid,size,wordcount,snippet,timestamp
0,0,Virginia,32432,287333,24894,"<span class=""searchmatch"">Virginia</span> (/və...",2022-02-08T14:38:38Z
1,0,"Virginia Beach, Virginia",91239,106993,9468,"<span class=""searchmatch"">Virginia</span> Beac...",2022-01-30T18:28:40Z
2,0,West Virginia,32905,183397,17295,"West <span class=""searchmatch"">Virginia</span>...",2022-02-07T15:47:00Z


In [69]:
# srlimit = maximum number of results returned per request
# sroffset = index of the first result to return (0 for the first hit)
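Going past the first 500 hits means repeating the request with the values from the `continue` block of the previous response. The sketch below merges that block into the next request's parameters. `search_all`, `fetch`, and `max_pages` are hypothetical names introduced for illustration, and `fetch` is injected so the loop can be tested without calling Wikipedia.

```python
def search_all(fetch, term, limit=500, max_pages=None):
    """Collect search hits page by page using the API's 'continue' block.

    fetch: callable taking a dict of query parameters and returning parsed JSON.
    max_pages: optional cap on the number of requests (None means no cap).
    """
    params = {
        "action": "query",
        "list": "search",
        "srsearch": term,
        "format": "json",
        "srlimit": limit,
    }
    results = []
    pages = 0
    while True:
        data = fetch(params)
        results.extend(data["query"]["search"])
        pages += 1
        if "continue" not in data:
            break
        if max_pages is not None and pages >= max_pages:
            break
        # Carry sroffset (and the 'continue' token) into the next request
        params = {**params, **data["continue"]}
    return results


# Offline demonstration: a fake API that serves six hits, two per page
def fake_fetch(params):
    offset = params.get("sroffset", 0)
    data = {"query": {"search": [{"title": f"t{offset}"}, {"title": f"t{offset + 1}"}]}}
    if offset < 4:
        data["continue"] = {"sroffset": offset + 2, "continue": "-||"}
    return data


hits = search_all(fake_fetch, "Virginia", limit=2)
print([h["title"] for h in hits])
```

With `requests`, `fetch` could be `lambda p: requests.get("https://en.wikipedia.org/w/api.php", params=p).json()`. Setting `max_pages` is a reasonable precaution when the total is as large as the 210391 results mentioned earlier.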