![](http://i67.tinypic.com/2jcbwcw.png)

## NLTK code examples
Python code examples to mirror lecture material

**Author List**: Sam Choi

**Original Sources**: http://nltk.org

**License**: Feel free to do whatever you want with this code

Let's begin by importing NLTK and a couple of datasets. We'll import corpora previously downloaded through NLTK (for example with `nltk.download('gutenberg')`) and start exploring NLTK's Gutenberg corpus - a small selection of texts from Project Gutenberg, an archive of over 50,000 ebooks.

*for more info on accessing corpora: http://www.nltk.org/book/ch02.html*

In [None]:
import nltk
from nltk.corpus import gutenberg, shakespeare


## Project Gutenberg

Different corpora have different information available. Let's explore some of the functions that we highlighted using Jane Austen's *Emma*.  

sents() tokenizes a text file into a list of sentences - each sentence is itself a list containing the words of that sentence.

In [None]:
sentences = gutenberg.sents('austen-emma.txt')

print("Sentence: " + str(sentences[1500]) + "\n")
print("Number of sentences: " + str(len(sentences)))


Likewise, words() tokenizes a text file into a single flat list of words.

In [None]:
words = gutenberg.words('austen-emma.txt')
print("Word: " + str(words[50000]) + "\n")
print("Number of words: " + str(len(words)))
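To make the difference in shape concrete, here is a minimal sketch on hand-written toy data (not taken from the corpus). `toy_sents` mimics the nested lists returned by `sents()`, and flattening it gives a flat list with the same shape as the result of `words()`. The names `toy_sents`, `toy_words` and `average_length` are made up for this illustration.

```python
from itertools import chain

# Hand-written stand-in for the shape of gutenberg.sents(...): a list of token lists.
toy_sents = [
    ["Emma", "was", "handsome", "."],
    ["She", "was", "clever", ",", "and", "rich", "."],
]

# Flattening the sentences yields one flat list, the shape of gutenberg.words(...).
toy_words = list(chain.from_iterable(toy_sents))

# With both shapes available, simple statistics are one line away.
average_length = len(toy_words) / len(toy_sents)

print("Sentences:", len(toy_sents))
print("Words:", len(toy_words))
print("Average tokens per sentence:", average_length)
```

Note that punctuation marks count as tokens here, just as they do in the lists NLTK returns.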


## Breakout

1. Download and import the gutenberg corpus
2. Calculate the number of sentences and words in each text of the corpus
3. Find the text with the most words
4. Find the text with the most sentences


In [None]:
# import nltk
# from nltk.corpus import gutenberg

all_files = gutenberg.fileids()
for file in all_files:
    print(file)

# print(all_files)
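If you get stuck on the breakout, the following is one possible approach, not the only one. It is a sketch that assumes the corpus reader exposes `fileids()`, `words()` and `sents()` as used earlier; the helper names `summarize` and `largest` are made up for this example.

```python
def summarize(reader):
    """Map each file id to a (word_count, sentence_count) pair."""
    return {
        fileid: (len(reader.words(fileid)), len(reader.sents(fileid)))
        for fileid in reader.fileids()
    }


def largest(summary, index):
    """Return the file id with the largest count at position index.

    index 0 compares word counts, index 1 compares sentence counts.
    """
    return max(summary, key=lambda fileid: summary[fileid][index])


# Possible usage once the corpus is downloaded and imported:
# summary = summarize(gutenberg)
# print("Most words:", largest(summary, 0))
# print("Most sentences:", largest(summary, 1))
```

Collecting the counts in a dictionary first means each file is only tokenized once, even though we ask two different questions about it.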



## Shakespeare

Here we'll begin playing around with the Shakespeare corpus - a corpus containing a set of Shakespeare's plays.

In [None]:
shakespeare.fileids()


We've printed a list of file names for each play in the Shakespeare corpus. Let's compare the lengths of each of these plays.

In [None]:
for play in shakespeare.fileids():
    words = shakespeare.words(play)
    print(play + ": " + str(len(words)))


In [None]:
# load one play as an XML element tree
play = shakespeare.xml('r_and_j.xml')
# the first child element holds the play's title
print(play[0].tag + ": " + play[0].text)


The plays in the Shakespeare corpus are stored as XML, so each part of a play is wrapped in a named element. The path 'PERSONAE/PERSONA' selects the PERSONA elements that sit directly inside the PERSONAE element; each one names a character in the play. Let's use this to list the characters in the play, and count how many roles are available.

In [None]:
people = []
for person in play.findall("PERSONAE/PERSONA"):
    people.append(person.text)

for person in people:
    print(person)

print("\n")
print("Cast size: " + str(len(people)))
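The path passed to `findall` decides which elements are matched. The sketch below uses a tiny hand-written XML document (not a real play) to show the difference between matching direct children only and matching at any depth. The `PGROUP` grouping element is an assumption made for this illustration, so check the real files before relying on it.

```python
import xml.etree.ElementTree as ET

# Hand-written toy document; PGROUP is an assumed grouping element
# used here only to show how nesting affects the search.
toy_xml = """
<PLAY>
  <TITLE>A Toy Play</TITLE>
  <PERSONAE>
    <TITLE>Dramatis Personae</TITLE>
    <PERSONA>FIRST LORD</PERSONA>
    <PGROUP>
      <PERSONA>SECOND LORD</PERSONA>
      <PERSONA>THIRD LORD</PERSONA>
    </PGROUP>
  </PERSONAE>
</PLAY>
"""

toy_play = ET.fromstring(toy_xml)

# "PERSONAE/PERSONA" only matches PERSONA elements directly under PERSONAE.
direct = [p.text for p in toy_play.findall("PERSONAE/PERSONA")]

# ".//PERSONA" matches PERSONA elements at any depth below the root.
nested = [p.text for p in toy_play.findall(".//PERSONA")]

print("Direct children:", direct)
print("Any depth:", nested)
```

If a cast count looks smaller than expected, a nested grouping like this is one thing worth checking for.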


## Twitter

Now let's explore another interesting corpus called twitter_samples: a sample of several thousand tweets from Twitter.

Two of the files in twitter_samples, positive_tweets.json and negative_tweets.json, hold tweets that have already been labelled as positive or negative. We'll use this fact to make some basic comparisons between positive and negative tweets.

*for more info on twitter_samples: http://www.nltk.org/howto/twitter.html*

In [None]:
from nltk.corpus import twitter_samples

happy = twitter_samples.tokenized('positive_tweets.json')
sad = twitter_samples.tokenized('negative_tweets.json')

print("Happy tweets: " + str(len(happy)))
print("Sad tweets: " + str(len(sad)))


Let's compare how often a word that might be used in a tweet, "omg", shows up in each set:

In [None]:
sad_omg = []
for tweet in sad:
    if "omg" in tweet:
        sad_omg.append(tweet)

print("sad omg count: " + str(len(sad_omg)))

happy_omg = []
for tweet in happy:
    if "omg" in tweet:
        happy_omg.append(tweet)

print("Happy Tweets: " + str(happy_omg))

for count, tweet in enumerate(happy_omg, start=1):
    tw = " ".join(tweet)
    print("Tweet " + str(count) + ": " + tw + "\n")

print("happy omg count: " + str(len(happy_omg)))


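Based on the counts above, sad tweets in the twitter_samples corpus seem to contain the token "omg" more frequently than happy tweets. Keep in mind that `"omg" in tweet` only matches the exact lowercase token, so variants such as "OMG" are not counted.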
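Raw counts are only comparable when both sets contain the same number of tweets. A slightly more careful comparison divides by the size of each set and, optionally, ignores case. The sketch below runs on hand-written toy tweets; `share_containing` and the toy lists are names invented for this example.

```python
def share_containing(tweets, token, ignore_case=True):
    """Return the fraction of tokenized tweets that contain token."""
    if not tweets:
        return 0.0
    if ignore_case:
        token = token.lower()
        hits = sum(1 for tweet in tweets if token in (t.lower() for t in tweet))
    else:
        hits = sum(1 for tweet in tweets if token in tweet)
    return hits / len(tweets)


# Hand-written toy data in the same shape as twitter_samples.tokenized(...).
toy_sad = [["OMG", "no"], ["so", "tired"], ["omg", "why"], ["ugh"]]
toy_happy = [["omg", "yay"], ["great", "day"]]

print("sad share:", share_containing(toy_sad, "omg"))
print("happy share:", share_containing(toy_happy, "omg"))
print("sad share (case-sensitive):", share_containing(toy_sad, "omg", ignore_case=False))
```

The same function could be called with the `sad` and `happy` lists from the cell above in place of the toy data.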

This of course is a very basic example of analyzing twitter data - NLTK provides a very powerful set of tools that can be used for many other applications.

## Classifying Tweets  

Now that we've seen what's possible with NLTK and the twitter_samples corpus, let's create our own metric for classifying positive and negative tweets (we'll keep it relatively simple).


We've covered two of the files in the twitter_samples corpus, but let's print all of the filenames to see what else is available.

In [None]:
print(twitter_samples.fileids())


It looks like the third file is called 'tweets.20150430-223406.json'. This happens to be a collection of tweets from 4/30/2015 that have yet to be classified. This means that for all intents and purposes, these tweets are pretty much a random sampling of the twitterverse.

Let's go ahead and unpack the tweets in this file like we did for the positive/negative tweets.

In [None]:
random_tweets = twitter_samples.tokenized('tweets.20150430-223406.json')
print("Random tweets: " + str(len(random_tweets)))


It looks like our sample contains 20,000 unclassified tweets. In order to classify them, let's create two lists of keywords/phrases that we'll be searching for.

In [None]:
happy_indicators = []   # fill with happy strings
sad_indicators = []     # fill with sad strings

# ex:
# happy_indicators = ["amazing", "awesome", "yay", ":)", ":-)", ":o)", ":D", "=)", "=D"]
# sad_indicators = ["no", "bad", "terrible", "$#@%", ":(", ":-(", ":o(", "=("]


For now, we'll stick to these simple indicators. Can you think of any problems that might arise if we used this model in the real world?

*hint: things that are easy for humans to understand, but difficult for computers*

In [None]:
happy_tweets = []
sad_tweets = []

for tweet in random_tweets:            # each tweet is a list of tokens
    # add the tweet once if any happy indicator appears among its tokens
    if any(word in tweet for word in happy_indicators):
        happy_tweets.append(tweet)
    # repeat for sad; a tweet with both kinds of indicator lands in both lists
    if any(word in tweet for word in sad_indicators):
        sad_tweets.append(tweet)

print("Happy tweets: " + str(len(happy_tweets)))
print("Sad tweets: " + str(len(sad_tweets)))
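Keeping two separate lists means a tweet can be counted as both happy and sad, and tweets with no indicators are dropped silently. As one possible refinement, the sketch below gives each tweet exactly one label. The function `classify`, the four label names and the toy data are all invented for this example and are not part of NLTK.

```python
def classify(tweet, happy_words, sad_words):
    """Label one tokenized tweet as 'happy', 'sad', 'mixed', or 'neutral'."""
    tokens = set(tweet)
    has_happy = bool(tokens & set(happy_words))
    has_sad = bool(tokens & set(sad_words))
    if has_happy and has_sad:
        return "mixed"
    if has_happy:
        return "happy"
    if has_sad:
        return "sad"
    return "neutral"


# Hand-written toy data in the same shape as the tokenized tweets above.
toy_happy_words = ["yay", ":)"]
toy_sad_words = ["bad", ":("]
toy_tweets = [
    ["yay", "friday", ":)"],
    ["bad", "day"],
    ["yay", "but", ":("],
    ["hello"],
]

labels = [classify(t, toy_happy_words, toy_sad_words) for t in toy_tweets]
print(labels)
```

This is still plain keyword matching, so it shares the weaknesses raised in the question above; it only makes the overlap between the two lists visible.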