# Introduction to text analysis in Python

#### This tutorial provides: 

1. A very brief introduction to Python programming.
2. An introduction to natural language processing with NLTK. 
3. An example of a measure of text similarity. 
4. An example of supervised classification using NLTK. 
5. An example of clustering with NLTK. 
 


#### How does the notebook work? 

The tutorial runs in the IPython Notebook (IPython stands for "Interactive Python"), which runs in the browser. You won't need to install Python or any of the libraries and modules we will be working with today on your computer. Instead, we are using Wakari, a service that allows you to run IPython notebooks in the cloud.  

The notebook is divided into cells. Some of these cells contain text (instructions, extra information) and others are Python input fields, where you can write and execute code. 

- To navigate between cells you can click on the cell with your mouse, or use the up and down arrows on your keyboard. 
- If you double click on a text cell you can edit it (you can add your own notes, for example). 
- To write code in an input field simply click on the field and start typing. 
- To execute code in a cell click the "play" button in the tool bar above. The output will appear below the cell. You can also run the code in a cell by pressing Control-Shift-Enter on a PC or Command-Shift-Return on a Mac.

When we finish, you will be able to save all of your work to a Python script file and keep it for your records. 

#### Resources used
Examples are based almost entirely on the NLTK book: 
**Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit**
by Steven Bird, Ewan Klein and Edward Loper (free online: http://www.nltk.org/book/ )


## 1. Intro to Python

In [None]:
print 'Hello world!'

In [None]:
1+1

In [1]:
a=1+1
a

2

#### Working with strings of characters (text). 

Copy and paste the content of the following cells into the input cells below them. Execute the code to see how it works.  

In [None]:
monty = "Monty Python's Flying Circus. " 
# monty is a string variable. Lines that start with a # are comments and are not executed. 
monty

In [None]:
monty*2 + " Plus just the last word:" + monty[-8:]


In [None]:
monty.find('Python') #finds position of substring within string

In [None]:
monty.upper() +' and '+ monty.lower() # turn to upper or lower case. 

In [None]:
monty.replace('y', 'x') # replace letter y in the string with letter x. 

#### Lists
Unlike strings, which hold only characters, lists can hold elements of any type. 
Run the code below. It creates three list variables. 

In [None]:
list1 = ['Monty', 'Python']
list2 = ['and', 'the', 'Holy', 'Grail']
list3= [1, 2, 3]

In [None]:
len(list2)

In [None]:
list1[1]

In [None]:
list1 + list2 + list3 # adding the three lists together

In [None]:
list2.append("1975") # adding an element to a list
list2

In [None]:
sorted(list1 + list2) # sorting the elements of the two combined lists

In [None]:
# Join the two string elements of the list by a single space.
# The result is a string. 
' '.join(['Monty', 'Python']) 

#### Regular expressions. 

Regular expressions are a huge topic. You will need to learn more about them if you plan on working regularly with text. 

In [None]:
import re


Find and count all vowels.  

In [None]:
word = 'supercalifragilisticexpialidocious'
len(re.findall(r'[aeiou]', word))
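
A few more patterns, as a rough sketch of what `re` can do beyond counting vowels. The sample sentence and the variable names are made up for illustration.

```python
import re

sentence = "The knights were singing and dancing, then riding to Camelot in 1975."

# Runs of letters only: a crude tokenizer that drops punctuation and digits.
words = re.findall(r'[A-Za-z]+', sentence)

# Whole words that end in "ing".
ing_words = re.findall(r'\b\w+ing\b', sentence)

# Replace every run of digits with a placeholder.
masked = re.sub(r'\d+', '<YEAR>', sentence)
```

`re.findall` returns a list of matches, and `re.sub` returns a new string with the replacements applied.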

## 2. NLP with NLTK

#### Import NLTK corpora

Import text4 from the NLTK book examples: the Inaugural Addresses. 

In [None]:
import nltk

**CAREFUL:** The command below opens a downloader. 

In the bottom row, when prompted, you need to type: **d book**

Then let it run; it might take some time. 

In [None]:
nltk.download()

When done, to exit the downloader type: **q**

We will now import a text in the Inaugural Addresses collection: 

In [None]:
from nltk.book import text4

#### Operating on every element. List comprehension.
Read the code below, think about what it is supposed to do, then run it to see what it does. 

In [None]:
len(set([word.lower() for word in text4 if len(word)>5]))

In [None]:
[element.upper() for element in text4[0:5]]
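
The comprehension `len(set([word.lower() for word in text4 if len(word)>5]))` packs three steps into one line. Here is a sketch of the same logic written as an explicit loop, using a short made-up token list in place of `text4`:

```python
tokens = ['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', 'and', 'citizens']

# Step by step: keep words longer than 5 characters, lowercase them,
# and collect the distinct values in a set.
long_lower = set()
for word in tokens:
    if len(word) > 5:
        long_lower.add(word.lower())

# The equivalent one-line comprehension.
one_liner = set([word.lower() for word in tokens if len(word) > 5])
```

Both versions produce the same set, so `len()` of either gives the number of distinct long words, ignoring case.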

In [None]:
for word in text4[0:5]:
    if len(word)<5 and word.endswith('e'):
        print word, ' is short and ends with e '
    elif word.istitle():
        print word, ' is a titlecase word '
    else:
        print word, ' is just another word '

**Exercise:** Search within the first 100 words of text4 and display those that are longer than 8 letters and end in a "g". 

#### Words in context

Search for a word in the text and display the results together with their context:


In [None]:
text4.concordance("America")

What other words appear in a similar range of contexts? 

In [None]:
text4.similar("citizen")

Examine just the contexts that are shared by two or more words:

In [None]:
text4.common_contexts(["war", "freedom"])

Location of a word in the text: how many words from the beginning does it appear? 

This positional information can be displayed using a dispersion plot. 

You need NumPy and Matplotlib. 


In [None]:
# Start pylab inline mode, so figures will appear in the notebook
%pylab inline

In [None]:
import numpy, matplotlib
from nltk.draw.dispersion import dispersion_plot
dispersion_plot(text4, ["citizens", "democracy", "freedom", "war", "America", "vote"])

#### Counting
The length of a text is the number of tokens it contains, counting both words and punctuation symbols. 

In [None]:
len(text4)

Count how often a word occurs in a text:


In [None]:
text4.count("democracy")

How many distinct words do the inaugural addresses contain? 
The vocabulary of a text is just the set of tokens that it uses. 

In [None]:
n_types = len(set(text4)) # number of distinct tokens (types)
# Average number of times each type is used. float() avoids integer division in Python 2.
float(len(text4)) / n_types

#### Define functions: 

What do you think they do? 


In [None]:
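from __future__ import division # must run before the definitions so that / is true division in Python 2
def lexical_diversity(text):
    return len(set(text))/len(text)

In [None]:
def percentage(count, total):
    return 100 * count/total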

Then use the defined functions:

In [None]:
from __future__ import division #to get precise (float) division in Python 2.x. In Python 3.0 you get it automatically. 
lexical_diversity(text4)

In [None]:
percentage(text4.count('the'), len(text4)) 
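
A worked example on a toy token list, so the arithmetic can be checked by hand. The token list and variable names are made up for illustration, and `float()` is used so that the division behaves the same on Python 2 and Python 3.

```python
tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the', 'end']

n_tokens = len(tokens)       # 8 tokens in total
n_types = len(set(tokens))   # 6 distinct tokens

# Lexical diversity: distinct tokens divided by all tokens.
diversity = float(n_types) / n_tokens

# Share of the token 'the', first as a fraction and then as a percentage.
the_fraction = float(tokens.count('the')) / n_tokens
the_percent = 100 * the_fraction
```

Here `diversity` is 6/8 and `the_fraction` is 3/8, which matches counting the tokens by eye.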

#### Simple statistics

Counting Words Appearing in a Text (a frequency distribution). 


In [None]:
from nltk import FreqDist
fdist1 = FreqDist(text4)
fdist1

In [None]:
vocabulary1 = fdist1.keys() # list of all the distinct types in the text
vocabulary1[:3] # look at first 3

- words that occur only once, called hapaxes: 

In [None]:
fdist1.hapaxes()[:20]

 
- words that meet a condition, for example long words
    

In [None]:
V = set(text4)
long_words = [w for w in V if len(w) > 20]
sorted(long_words)


- words that characterize a text (are relatively long, and occur frequently)

In [None]:
fdist = FreqDist(text4)
sorted([w for w in set(text4) if len(w) > 12 and fdist[w] > 7])
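
To see what a frequency distribution computes, here is a rough standard-library analogue built on `collections.Counter`. It is only a sketch on a made-up token list, not NLTK's `FreqDist`, and the variable names are illustrative.

```python
from collections import Counter

tokens = ['we', 'the', 'people', 'of', 'the', 'united', 'states',
          'of', 'the', 'constitution', 'of', 'the', 'people']

counts = Counter(tokens)

# The two most frequent tokens with their counts.
top_two = counts.most_common(2)

# Hapaxes: tokens that occur exactly once.
hapaxes = sorted(w for w in counts if counts[w] == 1)

# Tokens that are both fairly long and repeated.
long_and_frequent = sorted(w for w in counts if len(w) > 5 and counts[w] > 1)
```

The last two expressions follow the same pattern as the cells above: a condition on each word, applied over the set of distinct words.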

#### Conditional frequency distributions

Working with the inaugural corpus:

In [None]:
from nltk.corpus import inaugural
inaugural.fileids()[:2]
[fileid[:4] for fileid in inaugural.fileids()] # Get the first 4 characters of the file IDs

How are the words "America" and "war" used over time?

In [None]:
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'war']
    if w.lower().startswith(target))
cfd.plot()
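
Conceptually, a conditional frequency distribution is one counter per condition. Below is a minimal sketch with `defaultdict` and `Counter` on hand-made (year, token) pairs. The pairs are invented for illustration and are not drawn from the inaugural corpus.

```python
from collections import Counter, defaultdict

# (year, token) pairs standing in for the words of each address.
observations = [
    ('1789', 'America'), ('1789', 'war'), ('1789', 'peace'),
    ('1861', 'war'), ('1861', 'warfare'), ('1861', 'American'),
    ('1945', 'war'), ('1945', 'Americans'), ('1945', 'America'),
]
targets = ['america', 'war']

# Condition = target prefix, sample = year.
cfd = defaultdict(Counter)
for year, token in observations:
    for target in targets:
        if token.lower().startswith(target):
            cfd[target][year] += 1
```

Because the test is `startswith`, "American" and "warfare" are counted under their prefixes, just as in the corpus example.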

Working with the news category of the Brown corpus:


In [None]:
from nltk.corpus import brown
news_words=brown.words(categories="news") 
print(news_words) # get the first words in the corpus

In [None]:
freq= nltk.FreqDist(news_words)
freq.plot(30) # frequency of most commonly used words in the corpus

How are different modal verbs used in different genres? 

In [None]:
from nltk import FreqDist
verbs=["should", "may", "can"]
genres=["news", "government", "romance"]
for g in genres:
    words=brown.words(categories=g)
    freq=FreqDist([w.lower() for w in words if w.lower() in verbs])
    print g, freq


#### Stopwords

What percentage of the words in a corpus are NOT stopwords? 


In [None]:
from nltk.corpus import stopwords
stopwords.words('english')

In [None]:
def content_fraction(text):
    # A set makes the membership test fast on a large corpus.
    stopwords = set(nltk.corpus.stopwords.words('english'))
    content = [w for w in text if w.lower() not in stopwords]
    return len(content) / len(text)


In [None]:
print content_fraction(nltk.corpus.inaugural.words())
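
Note that a tokenized corpus also contains punctuation tokens, which are not stopwords and therefore count as content. Here is a sketch of a variant that ignores non-alphabetic tokens, run on a made-up token list with a tiny hand-picked stopword set. Both are assumptions for illustration, and NLTK's English stopword list is much longer.

```python
toy_stopwords = set(['the', 'of', 'and', 'to', 'a', 'in', 'is'])

def content_fraction_alpha(tokens, stopwords):
    # Keep alphabetic tokens only, so punctuation does not inflate the result.
    words = [w for w in tokens if w.isalpha()]
    content = [w for w in words if w.lower() not in stopwords]
    return float(len(content)) / len(words)

tokens = ['The', 'people', 'of', 'the', 'United', 'States', ',', 'is', 'free', '.']

# Punctuation counted as content: 6 of 10 tokens.
with_punctuation = float(len([w for w in tokens
                              if w.lower() not in toy_stopwords])) / len(tokens)

# Punctuation ignored: 4 of 8 words.
without_punctuation = content_fraction_alpha(tokens, toy_stopwords)
```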

## Importing and accessing your own text

Useful libraries: 

In [None]:
import nltk, re, pprint
from urllib import urlopen

### User input


In [None]:
s = raw_input("Enter some text: ")

### Online articles
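Getting text out of HTML is a sufficiently common task that NLTK 2 provides a helper function, nltk.clean_html(), which takes an HTML string and returns raw text. In NLTK 3 this function is no longer implemented; its error message points to BeautifulSoup's get_text() instead.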

In [None]:
url = "http://www.bbc.co.uk/news/education-24367153"
html = urlopen(url).read()
raw = nltk.clean_html(html)  
raw[:60]

In [None]:
raw = nltk.clean_html(html)
tokens = nltk.word_tokenize(raw)
tokens[:15]
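
For environments where `nltk.clean_html()` is not available, the sketch below strips tags with the standard library's `html.parser` (the Python 3 module name). It is a rough stand-in, not NLTK's implementation. The `TextExtractor` class, the `html_to_text` helper and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ('script', 'style'):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ('script', 'style') and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return ' '.join(parser.chunks)

sample = ("<html><head><style>p {color: red}</style></head>"
          "<body><h1>News</h1><p>Schools <b>reopen</b> today.</p></body></html>")
raw_text = html_to_text(sample)
```

The resulting string can then be passed to `nltk.word_tokenize()` in the same way as the `raw` variable above.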

### Online books

In [None]:
url="http://shakespeare.mit.edu/hamlet/full.html"
html = urlopen(url).read()    
raw = nltk.clean_html(html)  
print raw[:300]

In [None]:
tokens = nltk.word_tokenize(raw)
type(tokens)
tokens[50:70]

**Exercise:** Find the type of the variable tokens and its length, then display tokens 80 to 100. 

In [None]:
text = nltk.Text(tokens)
text.collocations()

### Local files
Upload the following file to Wakari by clicking the 'Import from web' icon in the upper left corner: https://dl.dropboxusercontent.com/u/11117852/UK_natl_2010_en_Lab.txt . The file is then saved in your account and you can use it in the analysis. 

In [None]:
# The with block closes the file once it has been read.
with open("UK_natl_2010_en_Lab.txt", 'r') as f:
    raw = f.read()
print raw[:100]

Tokenize - divide into tokens:

In [None]:
tokens = nltk.word_tokenize(raw)
tokens[:10]

Normalize - convert to lower case

In [None]:
lower_case=set(w.lower() for w in tokens)
print len(lower_case)

Stemming - strip off affixes

In [None]:
porter = nltk.PorterStemmer()
a=[porter.stem(t) for t in tokens]
a[60:90]

In [None]:
lancaster = nltk.LancasterStemmer()
a=[lancaster.stem(t) for t in tokens]
a[60:90]

Lemmatizing - reduce each word to its dictionary form (lemma)

In [None]:
wnl = nltk.WordNetLemmatizer()
a=[wnl.lemmatize(t) for t in tokens]
a[60:90]

Sentence segmentation:

In [None]:
sent_tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')
sents = sent_tokenizer.tokenize(raw)
pprint.pprint(sents[182:185])

Writing output to file. The file is created in your Wakari directory. Here, we are writing each sentence on a separate line. 

In [None]:
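sentence = set(sents)  # drop duplicate sentences
# The with block closes the file, so everything is flushed to disk.
with open('output.txt', 'w') as output_file:
    for sent in sorted(sentence):
        output_file.write(sent + "\n")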

## Text similarity

We can use either NLTK or scikit-learn for this. Here we use scikit-learn. 

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

Calculate tf-idf:

In [None]:
vect = TfidfVectorizer(min_df=1)
tfidf = vect.fit_transform(["New Year's Eve in New York",
                            "New Year's Eve in London",
                            "York is closer to London than to New York",
                            "London is closer to Bucharest than to New York"])

Calculate cosine similarity:

In [None]:
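# Rows are L2-normalized by default, so the dot product of two rows is their cosine similarity.
cosine = (tfidf * tfidf.T).toarray()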
print cosine
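
To make the calculation concrete, here is a pure-Python sketch of tf-idf weighting followed by cosine similarity on the same four sentences. It assumes the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` and L2 normalization, which are scikit-learn's documented defaults, plus a simplified tokenizer. It is not guaranteed to reproduce the `TfidfVectorizer` output exactly.

```python
import math
import re
from collections import Counter

docs = ["New Year's Eve in New York",
        "New Year's Eve in London",
        "York is closer to London than to New York",
        "London is closer to Bucharest than to New York"]

def tokenize(text):
    # Lowercase and keep runs of two or more word characters (a simplification).
    return re.findall(r'\w\w+', text.lower())

tokenized = [tokenize(d) for d in docs]
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_vector(tokens):
    tf = Counter(tokens)
    weights = dict((t, tf[t] * (math.log((1.0 + n_docs) / (1 + df[t])) + 1))
                   for t in tf)
    # Scale to unit length (L2 normalization).
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return dict((t, w / norm) for t, w in weights.items())

vectors = [tfidf_vector(tokens) for tokens in tokenized]

def cosine(u, v):
    # Vectors are already unit length, so the dot product is the cosine.
    return sum(w * v.get(t, 0.0) for t, w in u.items())

similarity = [[cosine(u, v) for v in vectors] for u in vectors]
```

Each document has similarity 1 with itself, and the two "New Year's Eve" sentences come out far more similar to each other than either is to the Bucharest sentence.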

## Trained classification with NLTK

#### Names-gender identification example

In [None]:
from nltk.corpus import names
import random

Select relevant features. Here, the last letter of the name. 

In [None]:
def gender_features(word):
    return {'last_letter': word[-1]}

What is the feature for the name Shrek? 

In [None]:
gender_features('Shrek')

What is the feature for your own name? 

In [None]:
gender_features('iulia')

Train and test data: 

In [None]:
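# Use a new variable name so the names corpus reader is not overwritten and the cell can be re-run.
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])

Shuffle the data and extract features:

In [None]:
random.shuffle(labeled_names)
# Eager version: builds every feature dictionary up front.
featuresets = [(gender_features(n), g) for (n, g) in labeled_names]
# apply_features builds them lazily, which saves memory on large corpora.
from nltk.classify import apply_features

Divide data into training and test sets:

In [None]:
# The first 500 shuffled names are held out for testing, the next 500 are used for training.
train_set = apply_features(gender_features, labeled_names[500:1000])
test_set = apply_features(gender_features, labeled_names[:500])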

Use a Naive Bayes Classifier:

In [None]:
classifier = nltk.NaiveBayesClassifier.train(train_set)

Classify the test set and evaluate performance

In [None]:
print nltk.classify.accuracy(classifier, test_set)


What are the most informative features?

In [None]:
classifier.show_most_informative_features(5)
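
To show what the classifier does with the `last_letter` feature, here is a from-scratch sketch of Naive Bayes with a single feature. The tiny name list is made up, and add-one smoothing is used for simplicity. That smoothing is an assumption of this sketch, not a description of NLTK's estimator.

```python
from collections import Counter, defaultdict

def gender_features(word):
    return {'last_letter': word[-1]}

# A made-up toy training set, standing in for the NLTK names corpus.
train = [('Anna', 'female'), ('Maria', 'female'), ('Julia', 'female'),
         ('Emma', 'female'), ('Alice', 'female'),
         ('John', 'male'), ('Mark', 'male'), ('Peter', 'male'),
         ('Luca', 'male'), ('Derek', 'male')]

label_counts = Counter(label for _, label in train)
letter_counts = defaultdict(Counter)   # label -> last letter -> count
for name, label in train:
    letter_counts[label][gender_features(name)['last_letter']] += 1

# All last letters seen in training, across both labels.
letters = set(l for c in letter_counts.values() for l in c)

def classify(name):
    letter = gender_features(name)['last_letter']
    scores = {}
    for label in label_counts:
        prior = float(label_counts[label]) / len(train)
        # Add-one smoothing, with one extra slot reserved for unseen letters.
        likelihood = (letter_counts[label][letter] + 1.0) / (
            label_counts[label] + len(letters) + 1)
        scores[label] = prior * likelihood
    return max(scores, key=scores.get)
```

With this toy data, a name ending in "a" scores higher as female and a name ending in "k" scores higher as male, which is the kind of pattern `show_most_informative_features` reports.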

Use the algorithm to classify new data:

In [None]:
classifier.classify(gender_features('iulia'))


In [None]:
classifier.classify(gender_features('cioroianu'))

#### Exercise: 
- What other features could be relevant?
- Repeat the classification with the first letter of the name as the relevant feature. 
- Compare the accuracy and the most informative features. 
- Test it on your first and middle names. 
- Write all your code in the cell below. 

## Clustering

The example below is based on this one: https://gist.github.com/xim/1279283 (by Morten Neergaard)

In [None]:
import numpy
from nltk.cluster import KMeansClusterer, GAAClusterer, euclidean_distance
import nltk.corpus
import nltk.stem
stemmer_func = nltk.stem.snowball.SnowballStemmer("english").stem
stopwords = set(nltk.corpus.stopwords.words('english'))

Define normalize function

In [None]:
def normalize_word(word):
    return stemmer_func(word.lower())

Define feature selection function

In [None]:
def get_words(titles):
    words = set()
    for title in titles:  # iterate over the argument, not the global job_titles
        for word in title.split():
            words.add(normalize_word(word))
    return list(words)

Define vector space function

In [None]:
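def vectorspaced(title):
    # Uses the global vocabulary `words`, which is built by get_words() in a later cell.
    title_components = [normalize_word(word) for word in title.split()]
    # One 0/1 entry per vocabulary word: 1 if it is in the title and is not a stopword.
    return numpy.array([
        word in title_components and word not in stopwords
        for word in words], numpy.short)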
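
The vector space function turns each title into a 0/1 vector with one position per vocabulary word. Here is a sketch of the same idea in plain Python lists, without stemming or stopword removal, on made-up titles. The names `vocabulary` and `to_vector` are illustrative.

```python
titles = ['Senior Software Engineer', 'Software Developer', 'Sales Manager']

# Vocabulary: every distinct lowercased word, in a fixed (sorted) order.
vocabulary = sorted(set(w.lower() for t in titles for w in t.split()))

def to_vector(title):
    present = set(w.lower() for w in title.split())
    # 1 if the vocabulary word occurs in the title, else 0.
    return [1 if word in present else 0 for word in vocabulary]

vectors = [to_vector(t) for t in titles]
```

Titles that share words get vectors with 1s in the same positions, which is what lets a distance measure group them.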

Upload the example file. The file is here: https://dl.dropboxusercontent.com/u/11117852/example_jobs.txt . 

In [None]:
title_file = open("example_jobs.txt", 'r')

Get features

In [None]:
job_titles = [line.strip() for line in title_file.readlines()]
words = get_words(job_titles)
words[0:10]

K-Means clustering: 

In [None]:
cluster = KMeansClusterer(7, euclidean_distance)
cluster.cluster([vectorspaced(title) for title in job_titles if title])
classified_examples = [cluster.classify(vectorspaced(title)) for title in job_titles]

Print results:

In [None]:
for cluster_id, title in sorted(zip(classified_examples, job_titles)):
    print cluster_id, title
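
For intuition about what `KMeansClusterer` does, the sketch below runs the basic k-means loop (assign each point to the nearest mean, then recompute the means) on a handful of made-up 2-D points. It uses fixed starting means and a fixed number of iterations, so it is a simplification and not NLTK's implementation.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    n = float(len(points))
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, means, iterations=10):
    for _ in range(iterations):
        # Assignment step: index of the nearest mean for every point.
        labels = [min(range(len(means)),
                      key=lambda k: squared_distance(p, means[k]))
                  for p in points]
        # Update step: move each mean to the centre of its members.
        for k in range(len(means)):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:
                means[k] = mean(members)
    return labels, means

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, means = kmeans(points, means=[(0, 0), (10, 10)])
```

The job-title example works the same way, except that each point is a 0/1 word vector with one dimension per vocabulary word instead of two coordinates.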

**Exercise:** Modify the number of clusters and see how the results change. 

**Exercise:** Modify the script above to implement group-average agglomerative clustering, with n classes, instead of K-means clustering. The corresponding NLTK function is: GAAClusterer(n).  