<div class="alert alert-block alert-success"><h3>IFN619 - Data Analytics for Information Professionals</h3></div>

## Module 1B Workshop :: Data Wrangling from APIs

1. From unstructured to structured
2. Sourcing from APIs
3. Analysing with APIs
4. 30 min hackathon

### [1] From unstructured to structured

- What is structured data?
- What is unstructured data?
- What is semi-structured data?
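
As a rough illustration of the three terms, here is one way the same kind of information could look in each form. The sample review, the JSON keys and the `Row` type are all made up for this sketch; they are not taken from the workshop data.

```python
import json
from collections import namedtuple

# Unstructured: free text; any structure is only implied by the wording.
unstructured = "Great headphones, five stars. Battery died after a week though."

# Semi-structured: self-describing keys, but no fixed schema is enforced.
semi_structured = json.loads('{"rating": 5, "text": "Great headphones", "tags": ["audio"]}')

# Structured: every record has the same named fields in the same order.
Row = namedtuple("Row", ["rating", "text"])
structured = [Row(5, "Great headphones"), Row(1, "Battery died after a week")]

print(type(unstructured).__name__, sorted(semi_structured), structured[0]._fields)
```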

With the following code, we transform unstructured data into structured data.

But first, load the libraries used in this notebook...

In [None]:
# Libraries used by this notebook
from urllib import request, response
from IPython.display import display, HTML, IFrame
from collections import namedtuple
import json
import re
import pandas as pd
from tapclipy import tap_connect
from tapclipy import textvis

We start by loading the data from the file system. In this case it is a text file of 50 Amazon reviews.

In [None]:
file = open(???)
rawtext = file.read()
file.close()

In [None]:
rawtext

An easy first step in structuring the data is to split the string into a list of strings.

In [None]:
reviews = rawtext.split(???)
if reviews[-1]=='':
    del reviews[-1] #Remove last empty item
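
The cell above only removes an empty item at the very end. If empty items could also appear in the middle, a list comprehension can drop all of them in one pass. This is a small sketch on an invented string with an invented `;` delimiter, not the workshop file.

```python
# Invented sample: records separated by ';', with stray empty items.
sample_raw = "first record;;second record;third record;"

# Keep only items that still contain something after stripping whitespace.
records = [item.strip() for item in sample_raw.split(";") if item.strip()]
print(records)
```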

In [None]:
reviews

Now we structure each review further by extracting the sentiment and the subject.

In [None]:
def getSentimentLabel(text):
    match = re.search(r"(?<=__label__)[0-9]+",text)
    value = match.group(0)
    if value=='1':
        return ???
    elif value=='2':
        return ???

def getSubject(text):
    split = re.split(r"(?<=__label__)[0-9]+",text)
    return split[1].strip()
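
To see what the regular expression in these two functions is doing, it can help to try it on a single string. The sample line below is invented; it only assumes the `__label__<digits> <subject>` shape that the functions above expect.

```python
import re

# Invented line in the "__label__<n> <subject>" shape assumed above.
sample = "__label__2 Great value"

# (?<=__label__) is a lookbehind: it requires the prefix but leaves it out of the match.
match = re.search(r"(?<=__label__)[0-9]+", sample)
label_digits = match.group(0) if match else None  # re.search returns None when nothing matches

# Splitting on the same pattern removes the digits; the subject is what follows them.
parts = re.split(r"(?<=__label__)[0-9]+", sample)
subject = parts[1].strip()

print(label_digits, "|", subject)
```

Note that `getSentimentLabel` calls `match.group(0)` directly, so a line without a label would raise an `AttributeError`; the `if match else None` guard is one way to handle that case.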

Now that we have the bits, we can store them in our own custom data structure `Review`, based on a `namedtuple`. We also create a function to parse the reviews into this data structure.

In [None]:
Review = namedtuple('review',['label','subject','text'])
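
A quick sketch of what a `namedtuple` gives us, using a separate `ExampleReview` type so that it does not overwrite `Review`. The field values are invented for illustration.

```python
from collections import namedtuple

# Same field names as Review above; the type name and values are made up for this sketch.
ExampleReview = namedtuple("ExampleReview", ["label", "subject", "text"])
example = ExampleReview(label="positive", subject="Great value", text="Works as described.")

print(example.subject)    # access by field name
print(example[0])         # positional access still works, as with any tuple
print(example._asdict())  # convert to a dict, e.g. for JSON or a DataFrame
```

Because a `namedtuple` is immutable, a parsed review cannot be changed by accident later in the notebook.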

In [None]:
def parseReview(text):
    textSplit = text.split(???)
    text = textSplit[???]  
    subject = getSubject(textSplit[???])
    label = getSentimentLabel(textSplit[???])
    return Review(label,subject,text)

In [None]:
structuredReviews = list(map(parseReview,reviews))
structuredReviews

We have structured data now, but it is difficult to explore as it is not in a format that is easy for humans to read. Let's fix that...

In [None]:
def reviewsToHtml(reviewList):
    def pTag(review): #function that wraps review in tags
        return '<p><b class="'+review.label+'">'+review.subject+"</b>: "+review.text+"</p>"
    paras = map(pTag,reviewList) #Apply the wrapping function to the list
    return HTML(''.join(paras)) #Join the paragraphs together and return as HTML

structReviewsHtml = reviewsToHtml(structuredReviews)
css = HTML("""<style>
.positive { color: green; }
.negative { color: red; }
</style>""")

In [None]:
display(css,structReviewsHtml)
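
One thing `reviewsToHtml` does not do is escape the review content, so a review containing `<` or `&` would be read as markup rather than text. Below is a minimal sketch of an escaped version of the wrapping step, using `html.escape` from the standard library. The function name `review_paragraph` and the sample values are assumptions made for this example.

```python
from html import escape

def review_paragraph(label, subject, text):
    """Build one <p> element, escaping the content so it cannot be read as markup."""
    return f'<p><b class="{escape(label)}">{escape(subject)}</b>: {escape(text)}</p>'

print(review_paragraph("positive", "Great <value>", "Works & works"))
```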

**DISCUSSION**
- We did this for 50 reviews. How many reviews could we do this task on?
- What other structuring could we do to the data?
- In what way/s might we have *corrupted* the data?

### [2] Sourcing from APIs

Much of the data available to us as information professionals is not conveniently stored in text files on our local machines.

Increasingly data is being made open via Application Programming Interfaces (APIs).

In the following section, we explore what an API is.

First some functions to help us. Using functions, we can avoid typing the same (or very similar) code over and over again.

In [None]:
# Functions to fetch string/json from an API

def fetch_string_from_api(url):
    req = request.Request(url)
    resp = request.urlopen(req)
    return resp.read().decode('utf8')

def fetch_json_from_api(url):
    body = fetch_string_from_api(url)
    return json.loads(body)
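
The two helpers above assume every request succeeds. A real API can be slow, unavailable, or return something that is not JSON. The sketch below shows one possible defensive variant. The name `fetch_json_or_none`, the 10 second default timeout, and the choice to return `None` on failure are all assumptions for this example rather than part of the workshop code.

```python
import json
from urllib import request

def fetch_json_or_none(url, timeout=10):
    """Fetch and parse JSON from url; return None if the request or the parsing fails."""
    try:
        with request.urlopen(request.Request(url), timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf8"))
    except (OSError, ValueError) as exc:
        # OSError covers URLError, HTTPError and timeouts;
        # ValueError covers malformed URLs, bad encodings and invalid JSON.
        print(f"Could not fetch {url}: {exc}")
        return None
```

Returning `None` keeps a notebook running when one call fails, at the cost of having to check the result before using it.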

In [None]:
#Fetch the data for the latest xkcd comic
xkcd_url = 'http://xkcd.com/info.0.json'
xkcd_json = ???(xkcd_url)
print(xkcd_json)

In [None]:
comicUrl = xkcd_json.get(???)
print(comicUrl)

In [None]:
display(HTML('<img src="'+???+'"/>'))

### Sourcing data through multiple calls

Often, a single call to one API is not sufficient to get the data we need. In many instances, we need to make a call, analyse the results to find something, make another call, and repeat...
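
The call, analyse, call again pattern can be sketched without touching a network by using a dictionary as a stand-in for an API. Everything here (the `FAKE_API` paths, the field names, the `fake_fetch` helper) is invented for illustration and does not describe any real service.

```python
# Stand-in for a real API: maps a URL path to the JSON it would return.
FAKE_API = {
    "authors": [{"id": 1, "name": "Author A"}, {"id": 2, "name": "Author B"}],
    "author/2/books": [{"id": 10, "title": "Book X"}, {"id": 11, "title": "Book Y"}],
}

def fake_fetch(path):
    """Pretend to call the API by looking the path up in the dictionary."""
    return FAKE_API[path]

# Call 1: list everything, then analyse the result to find the id we need.
authors = fake_fetch("authors")
author_id = next(a["id"] for a in authors if a["name"] == "Author B")

# Call 2: use that id to build the next request.
books = fake_fetch(f"author/{author_id}/books")
print([b["title"] for b in books])
```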

In [None]:
musicdemons_url='https://musicdemons.com/api/v1/'
artistsResp = fetch_json_from_api(musicdemons_url+???)
artistsResp

In [None]:
# Why did that take so long?
len(???)

Can I make this easier to read?

In [None]:
artists_df = pd.DataFrame.from_dict(artistsResp)
artists_df

- What anomalies do you see in the data?
- How might these cause us problems down the track?

Sometimes it is easier to work with data if we look at a smaller subset...

In [None]:
artists_df.loc[(artists_df['year_started'] <= ???) & (artists_df['year_started'] >= ???)]
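
Here is a small sketch of how this kind of filter behaves, on an invented DataFrame. Only the column name `year_started` comes from the cell above; the rows and the year range are made up.

```python
import pandas as pd

# Invented records, including one with a missing year.
toy_df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "year_started": [1965, None, 1982, 1999],
})

# Each comparison yields a boolean Series; & combines them element-wise.
# The parentheses are required because & binds more tightly than <= and >=.
mask = (toy_df["year_started"] >= 1980) & (toy_df["year_started"] <= 1999)
subset = toy_df.loc[mask]

# Rows with a missing year compare as False, so they silently drop out of the subset.
print(subset)
print(toy_df["year_started"].isna().sum(), "row(s) had no year")
```

This is one example of how an anomaly in the data can cause trouble later: the missing value raises no error, it just disappears from the result.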

or even just one instance in the data...

In [None]:
vm = artists_df.loc[artists_df['name']== ???]
vm

In [None]:
artistId = vm.get(???).values[0]
artistId

Now we can make another call to the API and get more data

In [None]:
songs_url = musicdemons_url+'artist/'+str(???)+'/songs'
songsJson = fetch_json_from_api(songs_url)
songsJson

We may want to select a song by its title often, so we create a function for it.

In [None]:
def get_song_by_title(title):
    return [song for song in songsJson if title.lower() in song['text'].lower()][0]
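
`get_song_by_title` reads the global `songsJson` and raises an `IndexError` when no song matches. A possible alternative, sketched here on invented data, takes the list as a parameter and returns `None` instead. The function name, the `field` parameter and the toy records are assumptions for this example.

```python
def find_first_by_title(songs, title, field="title"):
    """Return the first song whose `field` contains `title` (case-insensitive), or None."""
    needle = title.lower()
    # `or ""` guards against records where the field is missing or None.
    return next((s for s in songs if needle in (s.get(field) or "").lower()), None)

toy_songs = [{"id": 1, "title": "First Song"}, {"id": 2, "title": "Second Song"}]
print(find_first_by_title(toy_songs, "second"))
print(find_first_by_title(toy_songs, "missing"))
```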

In [None]:
song = get_song_by_title(???)
song['id']

Now that we have a song, we can get other data (which may even come from other APIs).

In [None]:
HTML('<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/'+song['youtube_id']+'?rel=0" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>')

In [None]:
lyrics_url = musicdemons_url+'song/'+str(???)+'/lyrics'
lyrics = fetch_string_from_api(lyrics_url)
lyrics

- What kind of data do we end up with?
- What issues might we need to deal with?

### [3] Analysing with APIs

APIs don't only provide data, they can provide *services* as well.

Take a look at [TAP](http://tap.infosci-apps.qut.edu.au) - What service does it provide?

We can use TAP to do some analysis on our lyrics

In [None]:
# Create TAP Connection
tap = tap_connect.Connect('http://tap.hi2lab.io')
tap.fetch_schema()
fx = textvis.Effects()

In [None]:
# TAP expects sentences, so change each line of the lyrics to a sentence.
lyric_sents = lyrics.replace(???,'. ')
lyric_sents

- At this point, what are we doing to the data?
- What are the risks of doing this?
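
A plain string replacement treats every line break the same way, so blank lines and lines that already end in punctuation can produce odd results such as `!. ` or `. . `. The sketch below is one possible, slightly more careful approach. The helper name `lines_to_sentences` and the toy lyrics are invented for this example.

```python
def lines_to_sentences(text):
    """Join non-empty lines into one string, adding a full stop where a line lacks end punctuation."""
    sentences = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines (e.g. verse breaks)
        if line[-1] not in ".!?":
            line += "."
        sentences.append(line)
    return " ".join(sentences)

toy_lyrics = "First line\nSecond line!\n\nThird line"
print(lines_to_sentences(toy_lyrics))
```

Either way we are changing the data: the verse breaks are lost and every line is now claimed to be a sentence, which may not match how the lyrics were written.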

I'm interested in whether the lyrics exhibit any features that are common to reflective writing, so I'm going to use TAP's `reflectExpressions` query.

In [None]:
query = tap.query(???)
analytics = tap.analyse_text(query, ???)
analytics

- What main features can we see in the analytics?

Once again, we need to make features in the data easy to see, so that we can make good decisions and **ask the right questions!**

In [None]:
# dictionary of css rules we want to apply to our data.
customStyle = {
    "pertains": {
        "background-color": "red",
        "color": "white"
    },
    "selfpossessive": {
        "background-color": "blue",
        "color": "white"
    },
    "definite": {
        "background-color": "green",
        "color": "white"
    },
    "keyterm": {
        "background-color": "yellow",
        "color": "black"
    }
}

style = fx.make_css(customStyle)

print(style)
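
To make the idea of turning a dictionary into CSS concrete, here is a rough stand-alone sketch. It is not how `make_css` is implemented in `tapclipy`, and its output format may differ; the function name `dict_to_css` and the toy dictionary are made up for illustration.

```python
def dict_to_css(rules):
    """Render {class_name: {property: value}} as CSS class rules inside a <style> tag."""
    blocks = []
    for class_name, props in rules.items():
        body = " ".join(f"{prop}: {value};" for prop, value in props.items())
        blocks.append(f".{class_name} {{ {body} }}")
    return "<style>\n" + "\n".join(blocks) + "\n</style>"

toy_style = {"keyterm": {"background-color": "yellow", "color": "black"}}
print(dict_to_css(toy_style))
```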

In [None]:
# Mark up the text with HTML tags

markedup = fx.make_reflect_html(???)
markedup

In [None]:
display(HTML(fx.markup(???, ???)))

What if we want to find significant words in the data?

In [None]:
# First, structure the data so that the song is a list of verses
verses = lyrics.split(???)
verses

In [None]:
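from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

cv=CountVectorizer(max_df=0.3,stop_words=['it','for','up','of','are','be','all','and','is','has','how','in','to','on'])
word_count_vector=cv.fit_transform(verses)

# cv.vocabulary_ maps each term to its column index, not to a count,
# so total the counts per column and rank the terms by that total.
term_totals = np.asarray(word_count_vector.sum(axis=0)).ravel()
sorted_by_count = sorted(cv.vocabulary_.items(), key=lambda kv: term_totals[kv[1]], reverse=True)
keyterms = [t[0] for t in sorted_by_count[:30]]
keyterms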

Display this in a way that is easy to interpret

In [None]:
keyterm_lyrics = '<br/>'.join(verses)
for keyterm in keyterms:
    replacement = '<span class="keyterm"> '+keyterm+'</span>'
    keyterm_lyrics = keyterm_lyrics.replace(' '+keyterm,replacement)
#display(HTML(keyterm_lyrics))
display(HTML(fx.markup(keyterm_lyrics, style)))
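
Repeated `str.replace` calls have two side effects worth knowing about: a key term can match the start of a longer word, and a later key term can match inside the `<span>` markup inserted by an earlier one. One possible alternative is a single regular-expression pass with word boundaries. The sketch below uses an invented helper name `highlight_terms` and invented sample text.

```python
import re

def highlight_terms(text, terms, css_class="keyterm"):
    """Wrap whole-word, case-insensitive matches of any term in a <span> in a single pass."""
    if not terms:
        return text
    # re.escape keeps any punctuation in a term from being read as regex syntax.
    alternatives = "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))
    pattern = re.compile(rf"\b({alternatives})\b", re.IGNORECASE)
    return pattern.sub(rf'<span class="{css_class}">\1</span>', text)

print(highlight_terms("The star and the starlight", ["star", "light"]))
```

Because the text is scanned once, the inserted markup is never searched again, and `\b` stops `star` from matching inside `starlight`.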

### [4] 30 min hackathon

You've put together a team that is providing data analytics services to application developers. **You have your first client!** 

Before handing over the $$$, your new client wants to know that your team can deliver, so you've been given a task to show your skills with a complete data analytics cycle within 30 minutes.

The rules are:
- You must work as a team
- You need to describe a realistic scenario that might be feasible for the client
- You can pick any data source that is open
- You need to do something interesting with the data that fits with the scenario
- You need to visualise the data in some way

Resources:
- [Any API](https://any-api.com)
- [toddmotto public APIs](https://github.com/toddmotto/public-apis)