
# Week 1 - Structured Data Retrieval

(Be sure to save this to Drive before starting)

We are going to spend the bulk of our time this week retrieving data from an API. The data arrives in a structured format called JSON, which Python reads into dictionaries and lists.

## A brief recap

There is a chance you haven't touched Python since last class. Before proceeding it might be a good idea to take a look at your notebooks from last time and the course material on the Library Juice Academy Course site.

Take a look at the next code cell and try to guess what it will print before you run it to exercise your Python skills a bit.

In [None]:
# average temp for the month in degrees C
# in a dictionary variable

yearly_weather = {
    "January": -10,
    "February": -15,
    "March": 5,
    "April": 10,
    "May": 15,
    "June": 20,
    "July": 30,
    "August": 35,
    "September": 30,
    "October": 10,
    "November": 5,
    "December": 0
}


#some loops and conditionals
print("Below 0 months")
for month in yearly_weather:
  if yearly_weather[month] < 0:
    print(month)

print("Above 0 months")
for month in yearly_weather:
  if yearly_weather[month] >= 0:
    print(month)

print("---")

# making our code a bit interactive by asking what cutoff temp to use
cutoff = input("What temperature do you want to divide on? ")

print("---")

print("Above and equal to "+ cutoff + " months")
for month in yearly_weather:
  if yearly_weather[month] >= int(cutoff):
    print(month)

print("Below "+ cutoff + " months")
for month in yearly_weather:
  if yearly_weather[month] < int(cutoff):
    print(month)



## But First, indexing

One fundamental skill we should spend a bit of time on is indexing inside a data structure. This is a pretty old-school thing, but it is worth learning all the same. We'll see that it comes in handy when selecting subsets of lists. In Python, picking out a range of items this way is called slicing, and it uses the `slice` operator (`:`).

This is useful for us because we will be loading up big `lists` of data and sometimes we only want to look at a bit of that data.

In [None]:

countries = ["Brazil","Spain","Italy","United Kingdom","Germany","France"]

#Prints the first item
print(countries[0])

#Prints the last item
#Using a negative number will start the indexing at the end
print(countries[-1])

#Prints the third and fourth items
#Using a : will tell Python you want multiple items
#this is called the slice operator
print(countries[2:4])

#Prints the last two items
#Combination of negative values and the colon (slice)
print(countries[-2:])
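
A couple of slice details are easy to trip over, so here is a small sketch using a made-up `letters` list (it is not part of the course data). It shows that the stop index is not included in the result, and that an optional third number sets the step.

```python
# Hypothetical list used only for illustration
letters = ["a", "b", "c", "d", "e", "f"]

# The start index is included, the stop index is NOT included
first_three = letters[0:3]

# Leaving out the start means "from the beginning"
also_first_three = letters[:3]

# A third number is the step: here, every second item
every_other = letters[::2]

print(first_three)
print(also_first_three)
print(every_other)
```

This is why `countries[2:4]` gives two items rather than three: it returns the items at index 2 and index 3 only.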



**Q1** Print the 3rd item in the `countries` list


In [None]:
#Q1
countries[]

**Q2** Print the dogs in the following list, using the slice operator.


In [None]:
#Q2

animals = ["Garfield","Odie","Clifford","Nemo"]
print(animals[])

**Q3** Print your favorite dessert to the screen (assuming your favorite is in this list!).


In [None]:
#Q3
desserts = ["Pie","Cake","Ice Cream","Sherbet"]
print(desserts[])


## Grabbing information

We can get Python to retrieve information from the web by using the [requests]() library, which fetches a web page for us. Once we have the page, we can do interesting things with its contents as well as with the metadata of the request.

We'll be using a [CollectionBuilder](https://collectionbuilder.github.io/) exhibit for this week's material. It is the one I use for the [AI in Libraries, for Skeptics](https://libraryjuiceacademy.com/shop/course/331-ai-and-libraries-for-skeptics/) class.

Run the next set of cells to see this in action.

In [None]:
#Our New Library!
import requests


#Grab the HTML of the CollectionBuilder page we will look at later
#It will be put in a variable called response
response = requests.get("https://elibtronic.github.io/AIL_Database/")

print("Page has been retrieved!")

In [None]:
#Prints the response code from the request
#200 means the request was successful
#List of codes: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
print(response.status_code)
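
If you are curious what a particular status code means, Python's built-in `http` module can look up the standard short description. This is just a small sketch for exploring codes. The list of codes below is my own pick, not something returned by the site.

```python
from http import HTTPStatus

# A few status codes you are likely to run into (chosen for illustration)
codes = [200, 404, 500]

for code in codes:
  # .phrase is the short standard description of the code
  print(code, HTTPStatus(code).phrase)
```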

In [None]:
#Prints the HTTP headers
#List of headers: https://en.wikipedia.org/wiki/List_of_HTTP_header_fields
print(response.headers)

In [None]:
#Print the content of the webpage
#This will spit out a bunch of unformatted HTML text so be ready
print(response.text)

## Total information from a CollectionBuilder site

Now that we have learned to grab random webpages into Python we are going to move towards retrieving structured information. Instead of a random webpage we are going to grab information from an API.

Further, we are going to be working with a structured data format called JSON.
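
To make that a bit more concrete, here is a minimal sketch of how JSON text becomes ordinary Python data. The JSON string is made up for illustration. It only loosely imitates the shape of what an API might send back, and the field names are assumptions.

```python
import json

# A made-up JSON string (hypothetical field names, not the real site data)
raw_text = '{"objects": [{"title": "First reading", "year": 2023}, {"title": "Second reading", "year": 2024}]}'

# json.loads turns JSON text into Python dictionaries and lists
data = json.loads(raw_text)

print(type(data))
print(type(data["objects"]))
print(data["objects"][0]["title"])
```

The outer curly braces become a dictionary, the square brackets become a list, and each item inside the list is another dictionary.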

In [None]:
#Our library to work with JSON
import json

#Our old friend pandas
import pandas as pd


## Collection Builder

CollectionBuilder is a great platform for turning a spreadsheet of information into a full-fledged website. I use it for lots of different things; for example, I created a [database](https://elibtronic.github.io/AIL_Database/) of readings for my other Library Juice class. Most importantly for us, however, CollectionBuilder also creates an _API_ for every site. So you can look at the site dressed up as HTML, but also as data in a format called JSON.

CollectionBuilder sites by default present links to several different JSON components of the site. For the form below I've added a link to the _Metadata_ JSON url for my site.

In [None]:
# @title URL of JSON File for complete CollectionBuilder Metadata {"run":"auto","display-mode":"form"}

json_cb_url = "https://elibtronic.github.io/AIL_Database/assets/data/metadata.json" # @param {"type":"string","placeholder":"https://elibtronic.github.io/AIL_Database/assets/data/metadata.json"}
# @markdown (Click _Show Code_ to see everything that is going on here)

response_cb = requests.get(json_cb_url)
#all of the data from the site
total_cb_data = response_cb.json()
#just the 'objects' indexed in the site
objects_cb_data = total_cb_data['objects']
print("Data fetched and JSON constructed!")


### Printing all items in a JSON object

Run the next cell to see all of the data used by the site in nicely indented output. You can open the CollectionBuilder site's [data page](https://elibtronic.github.io/AIL_Database/data.html) in another tab to see the same data presented as a webpage instead of plain text.

In [None]:
#Will just print out the 'objects' in the JSON object, probably more useful
#Don't worry if this looks like mush, we'll focus in on some exact data
#in the upcoming sections
print(json.dumps(objects_cb_data,indent=2))

### Picking a specific item in the JSON output

We can use the skills we developed in the indexing section to select certain items from our JSON data. You can see that each item looks like a dictionary: it has keys and values.



In [None]:
#The first item in our JSON data, selected by its index
data_index = 0
print(json.dumps(objects_cb_data[data_index],indent=2))

In [None]:
# All the keys in the first item, selected by its index
for item in objects_cb_data[data_index].keys():
  print(item)


In [None]:
#Print all the values in the first item, selected by its index
for item in objects_cb_data[data_index].values():
  print(item)
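
Once you know a key's name, you can also pull out a single value rather than looping over all of them. The sketch below uses a made-up `record` dictionary. The field names are assumptions and may not match the keys in `objects_cb_data`.

```python
# A made-up record; the real field names in objects_cb_data may differ
record = {"title": "An example reading", "creator": "A. Author"}

# Square brackets look up a value by its key
print(record["title"])

# .get() returns a fallback instead of raising an error when the key is missing
print(record.get("date", "no date listed"))
```

Using `.get()` is handy with real-world data, where not every item is guaranteed to have every field filled in.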

**Q4** Print the contents of the 10th item in the JSON data in the cell below

In [None]:
#Q4
data_index =
print(json.dumps(objects_cb_data[data_index],indent=2))

**Q5**

Run the cell below and try to grab a random item from the data....



In [None]:
#Q5

#we pull in a function from the random library
from random import randrange
#we create a random integer somewhere in the range of 0 to length of objects
random_entry = randrange(len(objects_cb_data))
#Now we print the JSON data at that random index
print(json.dumps(objects_cb_data[random_entry],indent=2))

**Q5 continued**

In the cell below copy and paste the link that will take you to the web display of the random item displayed above

In [None]:
#Q5 continued
reference_url = ""

print("You can view the web display of this item here: " + reference_url)


## Subject information from a CollectionBuilder site

One of the API _end points_ from CollectionBuilder is a listing of the subjects in the dataset. Up until this point we were looking at the complete list of metadata in the site.

The following few cells look just at the subject JSON. Try looking at the [Subjects JSON URL](https://elibtronic.github.io/AIL_Database/assets/data/subjects.json) using just your browser.

In [None]:
# @title URL of JSON file for subjects only in CollectionBuilder {"run":"auto","display-mode":"form"}

json_subjects_url = "https://elibtronic.github.io/AIL_Database/assets/data/subjects.json" # @param {"type":"string","placeholder":"https://elibtronic.github.io/AIL_Database/assets/data/subjects.json"}
# @markdown Feel free to change this to a different Collection Builder Subject JSON URL. (Heads up: You'll be doing this for your homework this week)


response_subjects = requests.get(json_subjects_url)
total_subject_data = response_subjects.json()
subject_cb_data = total_subject_data['subjects']


In [None]:
# print just the first item in the subjects JSON object
subject_index = 0
print(json.dumps(subject_cb_data[subject_index],indent=2))


## Subject Counts

We can now grab all of the subject names and the corresponding counts of how often they were used, and print them to the screen. We'll be using a Python `dictionary` as part of this work.

In [None]:
#A dictionary variable that will hold our key value combination of
#subject name and count
subject_frequency = {}

#Change JSON data into a dictionary
for item in subject_cb_data:
  subject_frequency[item['subject']] = item['count']


for key,value in subject_frequency.items():
  print("Subject label: "+ str(key) + ", shows up " + str(value) + " times in the data")


**Q6** Which subject shows up the most in the data?

In [None]:
#Q6
most_subject = ""
print("The subject that comes up the most is " + most_subject)


## Moral of the story

JSON is just one way to represent data that has keys and values, so we can load JSON data into a pandas dataframe, much like a spreadsheet. Run the cells below to see a couple of dataframes built from this site's data.

In the next cell we try to put each JSON entry into a new record in our dataframe. It doesn't look that nice.

In [None]:
#That doesn't look nice!
subjects_df = pd.read_json(json_cb_url)
subjects_df


## 'Normalizing' first

We can normalize the JSON before we turn it into a dataframe. That way pandas knows that each key should become a column and each item should become a row.

Run the next cell to see the data changed into a _proper_ dataframe.

In [None]:
#We'll normalize to get Pandas to try to 'shape' the data
normalized_df = pd.json_normalize(subjects_df['objects'])
normalized_df
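
To get a feel for what 'normalizing' means, here is a rough plain-Python sketch of the idea: every key becomes a column name, and keys nested inside another dictionary get a dotted name. The `flatten` function and the `nested` record are made up for illustration. This is not how pandas implements it internally, only an approximation of the result for a single item.

```python
def flatten(record, prefix=""):
  """Flatten nested dictionaries into one level with dotted keys."""
  flat = {}
  for key, value in record.items():
    name = prefix + key
    if isinstance(value, dict):
      # Go one level deeper and carry the parent key along as a prefix
      flat.update(flatten(value, name + "."))
    else:
      flat[name] = value
  return flat

# A made-up nested record (hypothetical field names)
nested = {"title": "An example reading", "source": {"name": "A Journal", "year": 2024}}

flat_row = flatten(nested)
print(flat_row)
```

Each flattened dictionary corresponds to one row of the dataframe, with its keys as the column headings.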


## Homework Time

Head over to the homework to analyze a CollectionBuilder site of your choosing!