<a href="https://colab.research.google.com/github/elibtronic/lja_advanced_python_for_librarians/blob/main/Week_1_Workalong.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Week 1 - Structured Data Retrieval


## But First, indexing

One fundamental skill we should spend a bit of time working on is indexing inside of a data structure. This is a pretty old-school technique, but it is worth learning. We'll see that it comes in handy when selecting subsets of lists. In Python, selecting a subset this way uses the `slice` operator.

This is useful for us because we will be loading up big `lists` of data and sometimes we only want to look at a bit of that data.

In [None]:

countries = ["Brazil","Spain","Italy","United Kingdom","Germany","France"]

#Prints the first item
print(countries[0])

#Prints the last item
#Using a negative number will start the indexing at the end
print(countries[-1])

#Prints the third and fourth items
#Using a : will tell Python you want multiple items
print(countries[2:4])

#Prints the last two items
#Combination of negative values and the colon
print(countries[-2:])
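
One detail that is easy to trip over: a slice includes its start position but stops just before its end position, which is why `countries[2:4]` returns the items at positions 2 and 3. The short sketch below reuses the same list to show this, along with an optional third value (the step) that the cell above does not use. The variable names are only for illustration.

```python
countries = ["Brazil", "Spain", "Italy", "United Kingdom", "Germany", "France"]

# The end position is excluded, so this returns positions 2 and 3 only
middle = countries[2:4]
print(middle)

# Leaving out the start means "from the beginning"
first_three = countries[:3]
print(first_three)

# A third value is the step: here, every second item
every_other = countries[::2]
print(every_other)

# A slice that runs past the end of the list does not raise an error
past_end = countries[4:100]
print(past_end)
```

Asking for a single position that does not exist, such as `countries[100]`, does raise an `IndexError`, so slices are the more forgiving of the two.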



###Q?


In [None]:
#Q

###Q?


In [None]:
#Q

###Q?


In [None]:
#Q


## Grabbing information

We can get Python to retrieve information from the web by using the `requests` library, which will grab a web page for us. Once we have the response, we can do interesting things with its contents as well as with the metadata of the request.

Later on we will also bring in `urlparse`, which picks a URL apart into pieces such as its domain name.
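
As a minimal sketch of what `urlparse` gives back, the cell below splits the address of the CollectionBuilder page into its parts. It works on the URL text only, so nothing is downloaded. The names `page_url` and `parts` are just for this illustration.

```python
from urllib.parse import urlparse

page_url = "https://elibtronic.github.io/AIL_Database/"
parts = urlparse(page_url)

# The protocol used to reach the page
print(parts.scheme)

# The domain name (called the "network location")
print(parts.netloc)

# Everything after the domain name
print(parts.path)
```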

In [None]:
import requests

In [None]:
#Grab the HTML of the CollectionBuilder page we will look at later
response = requests.get("https://elibtronic.github.io/AIL_Database/")

In [None]:
#Prints the response code from the request
print(response.status_code)

In [None]:
#Prints the HTTP headers of the response
print(response.headers)

In [None]:
#Print the content of the webpage
print(response.text)

## Total information from a CollectionBuilder site

Now that we have learned to pull a web page into Python, we are going to move towards retrieving structured information. Instead of a web page meant for human readers, we are going to grab information from an API.

Further, we are going to be working with a structured data format called JSON.
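
To give a feel for the format before we load real data, here is a small sketch using a made-up JSON string (it is not taken from the CollectionBuilder site). `json.loads` turns JSON text into ordinary Python dictionaries and lists, and `json.dumps` turns them back into text.

```python
import json

# A made-up JSON string, for illustration only
raw_text = '{"title": "Sample reading", "year": 2024, "tags": ["ai", "libraries"]}'

# JSON objects become dicts, JSON arrays become lists
record = json.loads(raw_text)
print(type(record))
print(record["title"])
print(record["tags"][0])

# Going the other way: Python data back to nicely indented JSON text
pretty = json.dumps(record, indent=2)
print(pretty)
```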

In [None]:
#Our library to work with JSON
import json

#Our old friend pandas
import pandas as pd


## CollectionBuilder

CollectionBuilder is a great platform for turning a spreadsheet of information into a full-fledged website. I use it for lots of different things, most notably to create a [database](https://elibtronic.github.io/AIL_Database/) of readings for my Library Juice class [AI in Libraries, For Skeptics](https://libraryjuiceacademy.com/shop/course/331-ai-and-libraries-for-skeptics/). Most importantly for us, CollectionBuilder also generates an API for every site it creates. So you can look at the site dressed up as HTML, but also as data in a format called JSON.

In [None]:
# @title URL of JSON File for complete CollectionBuilder Metadata {"run":"auto","display-mode":"form"}

json_cb_url = "https://elibtronic.github.io/AIL_Database/assets/data/metadata.json" # @param {"type":"string","placeholder":"https://elibtronic.github.io/AIL_Database/assets/data/metadata.json"}



In [None]:
response_cb = requests.get(json_cb_url)
total_cb_data = response_cb.json()
objects_cb_data = total_cb_data['objects']


### Printing all items in a JSON object

In [None]:
#Will print the whole JSON object
print(json.dumps(total_cb_data,indent=2))

In [None]:
#Will just print out the 'objects' in the JSON object, probably more useful
print(json.dumps(objects_cb_data,indent=2))

### Picking a specific item in JSON

And in any Python data structure, for that matter.

In [None]:
list_of_letters = ['a','b','c','d','e']

#print item at the '1' position
print(list_of_letters[1])


In [None]:
#The first item in our list of objects, selected by its index
print(json.dumps(objects_cb_data[0],indent=2))

In [None]:
# All the keys in the first item, selected by its index
for item in objects_cb_data[0].keys():
  print(item)


In [None]:
#Print all the values in the first item, selected by its index
for item in objects_cb_data[0].values():
  print(item)
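
Keys and values can also be read together rather than in two separate loops. The sketch below uses a small made-up record standing in for `objects_cb_data[0]`. The field names and values are assumptions for illustration, not the real contents of the dataset.

```python
# A made-up record standing in for objects_cb_data[0]
sample_item = {
  "objectid": "demo_001",
  "title": "Sample reading",
  "URL": "https://example.org/reading",
}

# .items() yields each key together with its value
for key, value in sample_item.items():
  print(key, "->", value)

# .get() returns a fallback instead of raising a KeyError when a key is missing
subject = sample_item.get("subject", "not recorded")
print(subject)
```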


### The URLs found in the data set

In [None]:
#print the URLs in the dataset
for item in objects_cb_data:
  print(item['URL'])


### A frequency list of URLs found in the data

To perform a little bit of analysis, we are going to grab all of the URLs from the dataset, extract their domain names using the `urlparse` function, and print the domains to the screen from most to least frequent.

In [None]:
from urllib.parse import urlparse
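
One possible way to build that frequency list is sketched below. So that it runs on its own, it uses a few made-up records in place of `objects_cb_data`. The `sample_objects` list and its URLs are assumptions for illustration, and `Counter` is just one convenient way to do the counting.

```python
from collections import Counter
from urllib.parse import urlparse

# Made-up records standing in for objects_cb_data
sample_objects = [
  {"URL": "https://example.org/a"},
  {"URL": "https://example.org/b"},
  {"URL": "https://example.com/c"},
  {"URL": ""},
]

domain_counts = Counter()
for item in sample_objects:
  # Treat a missing or empty URL as an empty string
  domain = urlparse(item.get("URL") or "").netloc
  if domain:  # skip records with no usable URL
    domain_counts[domain] += 1

# most_common() returns (domain, count) pairs, highest count first
for domain, count in domain_counts.most_common():
  print(domain, count)
```

To try this on the real data, swap `sample_objects` for `objects_cb_data` in the loop.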


## Subject information from a CollectionBuilder site

One of the API _endpoints_ from CollectionBuilder is a listing of the subjects in the dataset. The following few cells look at just that.

In [None]:
# @title URL of JSON file for subjects only in CollectionBuilder {"run":"auto","display-mode":"form"}
json_subjects_url = "https://elibtronic.github.io/AIL_Database/assets/data/subjects.json" # @param {"type":"string","placeholder":"https://elibtronic.github.io/AIL_Database/assets/data/subjects.json"}




In [None]:
response_subjects = requests.get(json_subjects_url)
total_subject_data = response_subjects.json()
subject_cb_data = total_subject_data['subjects']

In [None]:
# print just the first item in the subjects JSON object
print(json.dumps(subject_cb_data[0],indent=2))


## Frequency Listing

In [None]:


#print a frequency count of subjects
subject_frequency = {}


#Build a dict that maps each subject to its count
for item in subject_cb_data:
  subject_frequency[item['subject']] = item['count']

#Print the dict after sorting it
for info_pair in sorted(subject_frequency.items(), key=lambda x: x[1],reverse=True):
  print(info_pair[0], info_pair[1])


## Moral of the story

JSON is just one way to represent data that has keys and values. So we can put JSON data into a pandas DataFrame, just like a spreadsheet.

In [None]:
#That doesn't look nice!
subjects_df = pd.read_json(json_cb_url)
subjects_df

In [None]:
#We'll normalize to get Pandas to try to 'shape' the data
normalized_df = pd.json_normalize(subjects_df['objects'])
normalized_df
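
To see on a small scale what normalizing does, here is a sketch with two made-up records (the field names are assumptions, not the real dataset). Each record becomes a row, each key becomes a column, and a nested key is flattened into a dotted column name.

```python
import pandas as pd

# Made-up records, one of the fields is itself a small dict
sample_records = [
  {"objectid": "demo_001", "title": "First reading", "source": {"domain": "example.org"}},
  {"objectid": "demo_002", "title": "Second reading", "source": {"domain": "example.com"}},
]

# One row per record; the nested 'source' dict becomes the column 'source.domain'
sample_df = pd.json_normalize(sample_records)
print(sample_df)
```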