## Words are our Power

You are a member of House Saryn, masters of the spoken and written word. Your house is responsible for long-distance communication across Caeros (using your magical *speaking stones*), guaranteeing contracts and written agreements, and mediating disputes.

Your house knows the power of words. In fact, you have access to one of the largest libraries in the world, with copies of almost all of the written records known to man. To access it, execute the following cell.

In [4]:
import nltk ## All of these imports will get wrapped
from nltk.corpus import PlaintextCorpusReader

corpus_root = '../../corpus/'
library = PlaintextCorpusReader(corpus_root, '.*', encoding='utf-8')

You can gather information about words and relationships between words using this library. For example,

In [5]:
fdist = nltk.FreqDist(w.lower() for w in library.words())

In [6]:
fdist.most_common(100)

[(u'the', 22008),
 (u',', 12982),
 (u'of', 11082),
 (u'.', 9870),
 (u'and', 7968),
 (u'to', 5908),
 (u'a', 4624),
 (u'in', 3794),
 (u'as', 2042),
 (u"'", 1977),
 (u'that', 1798),
 (u'for', 1574),
 (u'is', 1502),
 (u'with', 1492),
 (u'their', 1344),
 (u'are', 1304),
 (u'from', 1276),
 (u'or', 1074),
 (u'have', 1040),
 (u's', 1022),
 (u'house', 1020),
 (u'has', 998),
 (u'it', 944),
 (u'on', 924),
 (u'they', 890),
 (u'-', 826),
 (u'his', 788),
 (u'by', 786),
 (u'but', 774),
 (u'(', 766),
 (u'its', 746),
 (u'an', 728),
 (u'be', 726),
 (u'this', 712),
 (u'can', 644),
 (u'other', 628),
 (u':', 608),
 (u'who', 580),
 (u'all', 570),
 (u'most', 558),
 (u'city', 552),
 (u'into', 544),
 (u'one', 536),
 (u'more', 536),
 (u'her', 534),
 (u'he', 522),
 (u'war', 510),
 (u'was', 492),
 (u'these', 490),
 (u'some', 458),
 (u'not', 450),
 (u'nation', 436),
 (u'at', 424),
 (u'while', 396),
 (u';', 386),
 (u'when', 382),
 (u'nations', 370),
 (u'galifar', 368),
 (u'great', 366),
 (u'them', 362),
 (u'breland

This lists the most common words which appear in our library.
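To see what a frequency distribution is doing, the same counting can be sketched in plain Python on a handful of tokens. The list `sample_words` below is a hypothetical stand-in for `library.words()`, not the real library. In NLTK 3, `FreqDist` is built on `collections.Counter`, so `most_common` should behave the same way there.

```python
from collections import Counter

# Hypothetical stand-in for library.words(): a short list of tokens.
sample_words = ["The", "Wardens", "protect", "travelers", "from", "bandits", ",",
                "rabid", "beasts", ",", "and", "the", "aberrations", "that",
                "lurk", "in", "the", "shadows", "."]

# Lowercase each token so that "The" and "the" are counted together.
counts = Counter(w.lower() for w in sample_words)

# most_common(n) returns the n highest counts as (token, count) pairs.
top_two = counts.most_common(2)
print(top_two)
```

Note that punctuation marks are counted as tokens too, which is why `,` and `.` rank so highly in the library's list.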

(TODO: insert one other simple demo here, maybe wordnet)

But you have not been summoned here for these simple tasks, which might befit a librarian.

No, we __wish to unlock the secrets of consciousness__.

As you know, years ago, House Cadon created the first forgelings. Using the magic of the eldritch tongue, Cadon infused these constructs with intelligence and self-awareness. After the Last War ended, though, the Treaty of Starhaven required House Cadon to destroy all of its Creation Forges, and all of the forgelings were granted their freedom. Since then, all research into the creation of intelligent machines has been forbidden.

House Saryn wishes to change that. We realize that the secret to true intelligence is the understanding of language, and using our mastery of words, we will rediscover the secret of creating awareness. This, we believe, is the secret to survival in the face of the aberrant invasion -- although we cannot reveal the reason for this to you just yet.

We first need our intelligence to understand the structure of sentences.

In [12]:
# str.join works in both Python 2 and Python 3 (string.join exists only in Python 2)
my_sentence = ' '.join(library.sents()[500])
print(my_sentence)

The Wardens protect travelers from bandits , rabid beasts , and the aberrations that lurk in the shadows .


(Note: the above cell fails when I try to rerun it locally. Check whether it works on the Binder server.)

Our first step to creating true intelligence is understanding how to __tag__ a sentence with parts of speech -- our creations will not be very intelligent if they cannot tell, when a person says "fly," whether they mean an insect or an action one does with a magical carpet.

We would like our creation to be able to take a sentence and label it as follows:

In [19]:
#print nltk.pos_tag(str(my_sentence)) ## Putting in sentence by hand to test
sentence = "The Wardens protect travelers from bandits, rabid beasts, and the aberrations that lurk in the shadows."
print(nltk.pos_tag(nltk.word_tokenize(sentence), tagset="universal"))

[('The', u'DET'), ('Wardens', u'NOUN'), ('protect', u'NOUN'), ('travelers', u'NOUN'), ('from', u'ADP'), ('bandits', u'NOUN'), (',', u'.'), ('rabid', u'NOUN'), ('beasts', u'NOUN'), (',', u'.'), ('and', u'CONJ'), ('the', u'DET'), ('aberrations', u'NOUN'), ('that', u'DET'), ('lurk', u'VERB'), ('in', u'ADP'), ('the', u'DET'), ('shadows', u'NOUN'), ('.', u'.')]


Each of the words in the sentence has been matched with a label identifying its part of speech -- for example, "Wardens" is tagged as "NOUN" and "lurk" is tagged as "VERB." However, note that our labeler is not perfect. It incorrectly identified "protect" as a noun, even though it is used as a verb.

It falls upon you to determine a way to perform this labeling successfully.

One way to do this, which is not very successful, is simply to guess that all words have the same part of speech. __In normal conversation, the most common part of speech is the noun, followed by the verb__.

__Challenge__. Design a function, `dumb_guess`, which simply guesses that every word has the same part of speech. Using the above information, which part of speech should your function guess to maximize accuracy?

In [29]:
def dumb_guess(input_word):
    ## Your code here
    pass  # placeholder so the cell runs; replace it with your guess

(Note to cf: they should have chosen to tag everything as nouns)

We will investigate the performance of this function by __scoring__ it, that is, we will take a list of words with their correctly-labeled parts of speech and see what percentage of these your function gets correct.
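Scoring can be sketched in a few lines of plain Python. The names `score`, `gold`, and `always_adp` are hypothetical and exist only for this illustration; the tiny hand-labeled list stands in for a real tagged corpus, and the sample tagger is deliberately a poor one.

```python
def score(tagger, tagged_words):
    """Return the fraction of words for which tagger(word) matches the correct tag."""
    correct = sum(1 for word, tag in tagged_words if tagger(word) == tag)
    return float(correct) / len(tagged_words)

# Hypothetical hand-labeled sample (universal tagset), for illustration only.
gold = [("the", "DET"), ("wardens", "NOUN"), ("protect", "VERB"),
        ("travelers", "NOUN"), ("from", "ADP"), ("bandits", "NOUN")]

def always_adp(word):
    # A deliberately poor tagger: it calls every word an adposition.
    return "ADP"

# Only "from" is labeled ADP, so one word out of six is scored as correct.
print(score(always_adp, gold))
```

Any function that takes a word and returns a tag can be passed to `score` in the same way.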

In [30]:
default_tagger = nltk.DefaultTagger('NN') ## TODO: hide all of this in a wrapper
from nltk.corpus import brown
brown_tagged_sents = brown.tagged_sents(categories='news')
default_tagger.evaluate(brown_tagged_sents)

0.13089484257215028

Our function only gets about 13% of the words correct.

Let's create a more sophisticated guessing rule. Using our library, we will provide you with the 100 most common words and their parts of speech. Try the following strategy: if a given word is in this list, guess the part of speech that this word is usually used as; otherwise, default to the function you defined above.
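The strategy just described can be sketched as a lookup table with a fallback. This is a minimal illustration under stated assumptions, not the notebook's final implementation: `build_lookup`, `make_lookup_tagger`, `fallback_guess`, and the tiny `training` list are all hypothetical, and the real exercise would use the 100 most common words from the library.

```python
from collections import Counter, defaultdict

def build_lookup(tagged_words, n):
    """Map each of the n most common words to the tag it carries most often."""
    word_counts = Counter(word for word, _ in tagged_words)
    tag_counts = defaultdict(Counter)
    for word, tag in tagged_words:
        tag_counts[word][tag] += 1
    return {word: tag_counts[word].most_common(1)[0][0]
            for word, _ in word_counts.most_common(n)}

def make_lookup_tagger(lookup, fallback):
    """Return a tagger that consults the table first, then the fallback."""
    def tagger(word):
        if word in lookup:
            return lookup[word]
        return fallback(word)
    return tagger

def fallback_guess(word):
    # Stand-in for the single-guess function from the previous challenge.
    return "NOUN"

# Hypothetical tagged sample, for illustration only.
training = [("the", "DET"), ("the", "DET"), ("the", "DET"),
            ("fly", "VERB"), ("fly", "VERB"), ("fly", "NOUN"),
            ("of", "ADP"), ("of", "ADP"), ("city", "NOUN")]

lookup = build_lookup(training, 3)
tagger = make_lookup_tagger(lookup, fallback_guess)
print(tagger("fly"))          # in the table: uses its most frequent tag
print(tagger("aberrations"))  # not in the table: uses the fallback
```

The table only helps for words it contains, so the fallback still decides the tag for every rarer word.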

(TODO: write this up carefully)

Good, this function has improved the accuracy of our tagger to almost 25%. Using the techniques that you've developed, our House Saryn scribes can determine how to further improve this accuracy.

Given your performance and loyalty to the cause, it is time to reveal a secret to you.

__We do not wish to create intelligent machines as weapons to stop the aberrant invasion. We want to create them as vessels for our own consciousness.__

House Cadon made a foolish mistake when they created the forgelings: they allowed their creations to exist independently and have free will. We cannot repeat their error; make no mistake, the aberrants will surely devour all life, and there is nothing we can do about that. Our only chance of survival is to implant our consciousness into beings which the aberrants cannot consume. __And since the aberrants destroy only organic matter, if we can transplant our minds into bodies of metal and stone, we will be safe.__

To do this, we must first re-create constructs that are capable of consciousness, as House Cadon did over 30 years ago. These constructs must be able to do more than identify parts of speech: they need to understand and react to all forms of natural language.

(Note to cf: this part will be a text classification exercise, *either* classifying common vs. Khalashi or else classifying the type of common document based on features. Depends on whether I can find a large enough corpus of Khalashi documents. Assume for the moment that they will do language classification and use Khalashi -- really Na'vi -- for testing.)

TODO: Using Na'vi for Kaluus completely changes the sound of the language. Should probably rename.

We will create a function which takes as input a sample of Nan or Common text, and then guesses which language the sample is written in. We could do this the way you predicted parts of speech above -- by observing some patterns and then writing the rules by hand -- but instead we will use __machine learning__ to do this. This technique involves selecting a collection of __features__ that you think are important for deciding which language the sample belongs to, and then feeding these into a machine which tries to guess the rule for you.

(TODO: insert a few representative examples of Kaluus text here. Pick out one feature for them and show them how to encode it. Show the performance of a classifier, say naive Bayes, using this feature. Challenge them to select more features. Once they do, have them test it with naive bayes and with a decision tree. Ask them to decide which is better, using a richer performance metric e.g. confusion matrix).

Here is an example of text in the Nan language (in fact, it is a piece of poetry written by a Nan'fya scholar).

>Maria kxamlä na’rìng kä, Eywa tìhawnu sivi! 

>Maria kxamlä na’rìng kä a ayzìsìto kea ayrìk tsar ke lalmu

>Yeysu sì Maria!

>Peul tok peyä txe’lanit, Eywa tìhawnu sivi!

>Hì’ia ’evil tìsrawluke terok txe’lanit Mariayä

>Yeysu sì Maria!

>Ayutralur lu frrnesyul, Eywa tìhawnu sivi!

>Ayutral lu frrnesyul tengkrr zamerunge ’evit na’rìngmì

>Yeysu sì Maria!

Let's identify some features. We want to define a function that returns a collection of attributes of some word, which will help us determine whether the word is Nan or Common.

Perhaps you noticed from the above sample that a lot of Nan words end in vowels (Yeysu, sì, hì’ia...). We can try defining a feature which simply returns the last letter of a word.

In [20]:
def word_features(word):
    return {'last_letter' : word[-1]}

Our function takes in a word and returns a __dictionary__, which links the name of our feature ("last_letter") to the value of that feature for the input word.
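To show how such a dictionary could be fed to a learning machine, here is a rough sketch of a tiny naive Bayes classifier in plain Python. Everything in it is an assumption made for illustration: the `train` and `classify` helpers are hypothetical, the training words are simply copied from the poem and the Common sentence earlier in this notebook, and the add-one smoothing assumes 26 possible letters. NLTK's own `nltk.NaiveBayesClassifier.train` also takes a list of (feature dictionary, label) pairs, but its internals may differ from this sketch.

```python
from collections import Counter, defaultdict

def word_features(word):
    return {'last_letter': word[-1]}

# Hypothetical training sample: words copied from the texts in this notebook.
labeled_words = (
    [(w, 'Nan') for w in ['Maria', 'Eywa', 'sivi', 'Yeysu',
                          'lalmu', 'zamerunge', 'tok', 'terok']] +
    [(w, 'Common') for w in ['Wardens', 'protect', 'travelers', 'from',
                             'bandits', 'beasts', 'that', 'lurk']]
)

def train(labeled):
    """Count how often each feature value appears with each label."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)  # (label, feature name) -> value counts
    for word, label in labeled:
        label_counts[label] += 1
        for name, value in word_features(word).items():
            feature_counts[(label, name)][value] += 1
    return label_counts, feature_counts

def classify(word, label_counts, feature_counts, num_values=26):
    """Pick the label with the largest P(label) * product of P(value | label)."""
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = float(count) / total  # prior probability of the label
        for name, value in word_features(word).items():
            seen = feature_counts[(label, name)][value]
            # Add-one smoothing (assumes num_values possible values per feature).
            score *= (seen + 1.0) / (count + num_values)
        scores[label] = score
    return max(scores, key=scores.get)

label_counts, feature_counts = train(labeled_words)
print(classify('lu', label_counts, feature_counts))
print(classify('shadows', label_counts, feature_counts))
```

With only sixteen training words the guesses are fragile; the point is that the rule comes from counting feature values per language rather than from being written by hand.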

You can define a function which adds extra features by separating them with commas, as follows.

In [21]:
def word_features(word):
    # Feature names must be quoted strings; word[0] is the first letter.
    return {'last_letter': word[-1],
            'first_letter': word[0],
            'my_awesome_feature': 5}

__Challenge__. Design a function, `word_features`, which selects some additional features that you think are desirable in deciding whether a word is Nan or Common.

(note to cf: they should think of obvious features like the first and last letter, perhaps the length and the fraction of vowels)

TODO: want sentiment analysis to go here for logical progression of algorithm complexity, but also want to analyze students' responses to the election question before the election actually occurs. Think more on this.

TODO: write this section. Use the nltk.sentiment library; goal is to combine polarity (positive, neutral, negative) with the subjectivity/objectivity classifier to optimize a certain function, but I have to decide what that is.