5.3 Exercise
=======================================
### First:  `pip install rdflib`

## WordNet Hypernyms

In [None]:
from nltk.corpus import wordnet as wn

In [None]:
def get_hypernyms(s):
    # Print this synset, then recurse into each of its hypernyms.
    print(s)
    for hypernym in s.hypernyms():
        get_hypernyms(hypernym)

In [None]:
get_hypernyms(wn.synset('giraffe.n.01'))

## RDFLib Tutorial

_adapted from http://semanticweb.org/wiki/Getting_data_from_the_Semantic_Web_

We will parse some RDF data about a number of people from DBpedia. To parse RDF with rdflib, you create a Graph, which starts out as an empty container for data. You can add as much data to the container as you like and then filter out just the parts you want.

First we should import the Graph class from the rdflib package and create a Graph instance.

In [None]:
from rdflib import Graph, URIRef

g = Graph()

The `g` variable now holds an empty graph.

Now we should load some data from the web. The graph object has a method called 'parse' which allows you to give it a file name from your local system or an HTTP URI, as well as an optional format, and it will try to load data from that source. We'll load in data about Elvis Presley.

In [None]:
g.parse("http://dbpedia.org/resource/Elvis_Presley")

This will pause for a second or so to load the data from the web.  

We can confirm that some data was loaded by checking how many statements are in the graph object:

In [None]:
len(g)

At the time of writing, `len(g)` returned 2007. This number will change as both the parsers that DBpedia uses and the Wikipedia page change.

RDF as a data format is basically a graph that is built up of 'triple' statements. These are made up of subjects, predicates and objects, like simple sentences in a natural language like English. The graph having 2,007 statements is a bit like it having 2,007 individual sentences, but not necessarily about the same thing. Those sentences are of the form:

- Elvis Presley is a rock-and-roll singer.
- Elvis Presley was born in the United States.
- Elvis Presley was born on 8 January 1935.

RDF has sentences like this translated into a machine-readable structure. The subjects are URIs, as are the predicates (like 'is a', 'was born in' etc.) and the 'objects' of the sentence are either URIs of other resources or they are 'literals', blobs of data.  

RDF literals are basically strings. Other datatypes exist but are implemented as a type restriction on a string. So, for instance, integers or floats or dates are just strings with a little tag on them saying "by the way, this is an integer (or a float or a date or whatever)".  

So let's retrieve the birth and death dates from the graph. The first thing we need to know is the URIs of the properties. On DBpedia, the URIs used for this are:

- http://dbpedia.org/ontology/birthDate
- http://dbpedia.org/ontology/deathDate  

To retrieve the birth date, we use the graph's `subject_objects` method. It takes a predicate as a URIRef (an object that wraps a URI) and returns a generator of (subject, object) pairs, one for each statement with that predicate. You can then use a for-loop to iterate over the results:

In [None]:
# Each stmt is a (subject, object) pair for the birthDate predicate.
for stmt in g.subject_objects(URIRef("http://dbpedia.org/ontology/birthDate")):
    print(stmt)

Each result is a (subject, object) tuple. You can access the data inside it as you would with any tuple, and you can call `str()` on the URIRef and Literal objects to get their string representation.

In [None]:
for stmt in g.subject_objects(URIRef("http://dbpedia.org/ontology/birthDate")):
    print("the person represented by {} was born on {}".format(*stmt))

Here is another example using influences:

In [None]:
for stmt in g.subject_objects(URIRef("http://dbpedia.org/ontology/influencedBy")):
    print("{} was influenced by {}".format(*stmt))

An RDF graph doesn't have to be all on one topic. It could freely have 'sentences' about Elvis Presley, Bondi Beach, Barack Obama, the Moon, Camembert, your pet cat, a news article on the trial of a Nazi war criminal, triangles, some particular species of whale, a television programme, and anything else that is a "thing". Let's iterate through the subjects of the `influencedBy` statements and add the data about each one to our graph:

In [None]:
# Collect the subjects first so the graph is not modified while we iterate over it.
for s in set(g.subjects(URIRef("http://dbpedia.org/ontology/influencedBy"))):
    print("parsing {}".format(s))
    g.parse(s)

We can now run our birth date call on the lot of them:

In [None]:
for stmt in g.subject_objects(URIRef("http://dbpedia.org/ontology/birthDate")):
    print("the person represented by {} was born on {}".format(*stmt))

Note that `http://dbpedia.org/resource/Jack_Ketchum` appears twice because two birth dates are listed for him. This sort of thing is unfortunately common, particularly with human-generated data.
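
One way to spot such cases is to group the values by subject and keep the subjects that have more than one. The sketch below is plain Python and uses made-up pairs shaped like the output of `subject_objects`; the URIs and dates are placeholders, not values taken from DBpedia.

```python
from collections import defaultdict

# Hypothetical (subject, object) pairs; placeholders, not DBpedia values.
pairs = [
    ("http://example.org/Person_A", "1900-01-01"),
    ("http://example.org/Person_B", "1910-05-05"),
    ("http://example.org/Person_B", "1910-05-06"),
]

# Group the distinct values seen for each subject.
values_by_subject = defaultdict(set)
for subject, value in pairs:
    values_by_subject[subject].add(value)

# Keep only subjects that carry more than one distinct value.
conflicts = {s: sorted(v) for s, v in values_by_subject.items() if len(v) > 1}
print(conflicts)
```

With a real graph, the list of pairs could be built from the strings of each pair yielded by `g.subject_objects(...)`.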

### Exercise
Use `rdflib` to find the two schools of which the president of the University of New Haven is an alumnus.

*Hint*:
- _University of New Haven_ resource: `http://dbpedia.org/resource/University_of_New_Haven`
- _president_ property: `http://dbpedia.org/property/president`
- _alumnus_ property: `http://dbpedia.org/property/alumnus`

[Read the docs](http://rdflib.readthedocs.org/en/latest/index.html) for more info on how to use RDFLib.

#### Optional:
- Search `dbpedia` for relationships between named entities you discovered in the New York Times articles yesterday.

#### Advanced:
- Find relationships between entities in `dbpedia` that are also reflected in the abstract  
(*e.g.* "...Born in Tupelo, Mississippi, Presley..." <-> `(Elvis_Presley, birthPlace, Tupelo,_Mississippi)`)  
Use this to bootstrap patterns for discovering other relationships.
- Use `nltk.sem.extract_rels` to find potential relationships between named entities in your NYT corpus.
- You may discover that `nltk.ne_chunk` is not so good at correctly recognizing named entities. How would you improve it to assist in your relationship extraction?