# Introduction to Data Science with Python 
## General Assembly
## Natural Language Processing (NLP)

Make sure you have installed nltk and downloaded the following corpora:

* punkt
* gutenberg
* wordnet (used in Lab Part 3)
* stopwords (used in Lab Part 4)
* movie_reviews (used in Lab Part 7)



## Lab Part 1

###Tokenization

What:  Separate text into units such as sentences or words

Why:   Gives structure to previously unstructured text

Notes: Relatively easy with English language text, not easy with some languages
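To make the idea concrete, here is a minimal sketch of sentence and word tokenization using only regular expressions. The helper names `naive_sent_tokenize` and `naive_word_tokenize` are made up for this illustration and are not part of NLTK; NLTK's own tokenizers handle many more cases (abbreviations such as "Mr.", contractions, quotes) and will give different results on real text.

```python
import re

def naive_sent_tokenize(text):
    # Split after '.', '?' or '!' when followed by whitespace.
    # This breaks on abbreviations such as "Mr.", which real tokenizers handle.
    return [s for s in re.split(r'(?<=[.?!])\s+', text.strip()) if s]

def naive_word_tokenize(sentence):
    # Runs of word characters (with an optional internal apostrophe) become
    # tokens; every other non-space character becomes a token of its own.
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

text = "Alice was tired. Where was the Rabbit? She didn't know!"
sentences = naive_sent_tokenize(text)
words = [naive_word_tokenize(s) for s in sentences]
print(sentences)
print(words[2])
```

Even this toy version shows why tokenization is "relatively easy" for English: whitespace and punctuation carry most of the structure.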


"corpus" = collection of documents

"corpora" = plural form of corpus


In [1]:
# Import the NLTK library, and use nltk.corpus.gutenberg.fileids() to
# find the filenames for Jane Austen's Emma and Lewis Carroll's Alice in
# Wonderland
import nltk
nltk.corpus.gutenberg.fileids()

[u'austen-emma.txt',
 u'austen-persuasion.txt',
 u'austen-sense.txt',
 u'bible-kjv.txt',
 u'blake-poems.txt',
 u'bryant-stories.txt',
 u'burgess-busterbrown.txt',
 u'carroll-alice.txt',
 u'chesterton-ball.txt',
 u'chesterton-brown.txt',
 u'chesterton-thursday.txt',
 u'edgeworth-parents.txt',
 u'melville-moby_dick.txt',
 u'milton-paradise.txt',
 u'shakespeare-caesar.txt',
 u'shakespeare-hamlet.txt',
 u'shakespeare-macbeth.txt',
 u'whitman-leaves.txt']

In [2]:
# Break these novels up into sentences. Put these sentence lists into
# a list so that you can use it later
alice = nltk.corpus.gutenberg.raw('carroll-alice.txt')
emma = nltk.corpus.gutenberg.raw('austen-emma.txt')
alice_sent = nltk.sent_tokenize(alice)
emma_sent = nltk.sent_tokenize(emma)

In [3]:
# Count the number of sentences in each novel.
print(len(alice_sent))
print(len(emma_sent))

1625
7493


In [4]:
# Break each sentence up into words. You will end up with a 
# list of lists of words for Emma and another one for Alice in
# Wonderland
def sentence_words(sent):
    # tokenize one sentence and drop sentence-final punctuation tokens
    return [x for x in nltk.word_tokenize(sent) if x not in ('.', '?', '!')]

# build one word list per sentence (assigning inside a for loop would
# keep only the words of the last sentence)
alice_words = [sentence_words(sent) for sent in alice_sent]
emma_words = [sentence_words(sent) for sent in emma_sent]

In [5]:
# Count the number of words in each sentence
numb_of_words_alice = [len(sent) for sent in alice_words]
numb_of_words_emma = [len(sent) for sent in emma_words]

In [6]:
# Which novel has more average words per sentence?
print("Average words per sentence for Alice: {}".format(sum(numb_of_words_alice) / float(len(numb_of_words_alice))))
print("Average words per sentence for Emma: {}".format(sum(numb_of_words_emma) / float(len(numb_of_words_emma))))
# Given their target audience, is this what you would expect?

In [7]:
# Create a flat list (i.e. not a list of lists) of words in
# the two novels
def flatten(x):
    result = []
    for sent in x:
        for word in sent:
            result.append(word)
    return result

alice_flat = flatten(alice_words)
emma_flat = flatten(emma_words)

In [8]:
# For each novel, construct a set of all the distinct words used
# (upper-cased so that "The" and "the" count as the same word)
def set_of_words(x):
    result = set()
    for word in x:
        result.add(word.upper())
    return result
    
alice_word_set = set_of_words(alice_flat)
emma_word_set = set_of_words(emma_flat)

# Sanity check: flattening should preserve the total word count,
# so these two numbers should be equal
print(sum(numb_of_words_alice))
print(len(alice_flat))

In [9]:
# Calculate the lexical diversity of each novel (distinct words / word count)
lexdiv_alice = len(alice_word_set) / float(len(alice_flat))
lexdiv_emma = len(emma_word_set) / float(len(emma_flat))
print(lexdiv_alice)
print(lexdiv_emma)


In [10]:
# (Optional, only for the very keen)
# Repeat the above analysis for all the Gutenberg samples
# Create a dataframe with the names of the novels, when they were written,
# whether they were for children, the lexical diversity and the average sentence length.
# Can you use logistic regression to predict the audience, based on the content?
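For the optional exercise, the two content features can be wrapped in small helpers and applied to each Gutenberg sample in turn. The sketch below is one possible shape, shown on a toy list of lists instead of a real novel; the function names and the `toy` data are invented for the illustration.

```python
def average_sentence_length(sentences):
    # sentences is a list of lists of words, one inner list per sentence
    return sum(len(sent) for sent in sentences) / float(len(sentences))

def lexical_diversity(sentences):
    # distinct words / total words, ignoring case
    words = [word.upper() for sent in sentences for word in sent]
    return len(set(words)) / float(len(words))

# Toy stand-in for a tokenized novel
toy = [['The', 'cat', 'sat'], ['the', 'cat', 'ran', 'away']]
features = {
    'avg_sentence_length': average_sentence_length(toy),
    'lexical_diversity': lexical_diversity(toy),
}
print(features)
```

One dictionary like `features` per novel can then be collected into the rows of a dataframe.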

## Lab Part 2



In [11]:
# Make nltk.Text objects from the two novels
alice_text = nltk.Text(nltk.word_tokenize(alice))
emma_text = nltk.Text(nltk.word_tokenize(emma))

In [12]:
# Does Jane Austen ever mention the word 'young' in Emma? What about Lewis Carroll?
emma_text.concordance('young')
alice_text.concordance('young')

Displaying 25 of 192 matches:
marriage . A worthy employment for a young lady 's mind ! But if , which I rath
e . '' `` Mr. Elton is a very pretty young man , to be sure , and a very good y
g man , to be sure , and a very good young man , and I have a great regard for 
hildren of their own , nor any other young creature of equal kindred to care fo
is fond report of him as a very fine young man had made Highbury feel a sort of
formed a very favourable idea of the young man ; and such a pleasing attention 
Mr. Knightley ; and by Mr. Elton , a young man living alone without liking it ,
ee of popularity for a woman neither young , handsome , rich , nor married . Mi
nciples and new systems -- and where young ladies for enormous pay might be scr
was no wonder that a train of twenty young couple now walked after her to churc
 a long visit in the country to some young ladies who had been at school there 
f Harriet Smith 's being exactly the young friend she wanted -- exactly the som
was a sing
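A concordance is essentially a "keyword in context" listing. The following plain-Python sketch shows the idea on a handful of tokens; the `concordance` helper here is hypothetical and counts context in tokens, whereas NLTK formats its output by character width, so the two will not line up exactly.

```python
def concordance(tokens, target, width=3):
    # Return (left context, word, right context) for every occurrence of
    # target, with up to `width` tokens of context on each side.
    target = target.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = ' '.join(tokens[max(0, i - width):i])
            right = ' '.join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

tokens = 'a very pretty young man , and a very good young man'.split()
for left, word, right in concordance(tokens, 'young'):
    print('{:>20} {} {}'.format(left, word, right))
```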

In [13]:
# What are the common contexts for these words?
# common_contexts() prints its results itself and returns None,
# so there is no need to wrap it in print
emma_text.common_contexts(['young'])
alice_text.common_contexts(['young'])

accomplished_woman worthy_man the_farmer the_man unexceptionable_man
a_man so_as amiable_man pretty_woman the_are a_man's too_: pert_lawyer
of_person alarming_man ,_cox a_woman too_; the_woman of_men
the_man here_lady the_crab the_lady this_lady


In [14]:
# Where does the word 'cat' appear in Alice in Wonderland?
alice_text.dispersion_plot(['cat'])
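A dispersion plot is a picture of the token positions at which a word occurs. As a rough sketch, those positions can be computed directly; `word_offsets` and its `ignore_case` flag are this example's own names, and the toy sentence stands in for the novel.

```python
def word_offsets(tokens, target, ignore_case=False):
    # Positions (token indices) at which target occurs in tokens.
    if ignore_case:
        target = target.lower()
        return [i for i, tok in enumerate(tokens) if tok.lower() == target]
    return [i for i, tok in enumerate(tokens) if tok == target]

tokens = 'the Cat grinned and the cat vanished but the grin of the Cat stayed'.split()
print(word_offsets(tokens, 'cat'))
print(word_offsets(tokens, 'cat', ignore_case=True))
```

The difference between the two calls is worth keeping in mind when a plot looks emptier than expected: a capitalized "Cat" is a different token from "cat".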

## Lab Part 3

###Stemming
What:  Reduce a word to its base/stem form

Why:   Often makes sense to treat multiple word forms the same way

Notes: Uses a "simple" and fast rule-based approach
       Output can be undesirable for irregular words
       Stemmed words are usually not shown to users (used for analysis/indexing)
       Some search engines treat words with the same stem as synonyms
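To see what "rule-based" means here, consider a toy suffix stripper. This is a deliberately crude sketch and not the Snowball algorithm; the `toy_stem` name and its suffix list are invented for the illustration.

```python
def toy_stem(word):
    # Strip the first matching suffix, keeping at least three characters.
    for suffix in ('ing', 'ed', 'es', 's', 'e'):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for w in ['charge', 'charging', 'charged', 'went']:
    print('{} -> {}'.format(w, toy_stem(w)))
```

The three forms of "charge" collapse to the same stem, but the irregular form "went" is left alone and is never connected to "go", which is the kind of undesirable output the notes mention.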

In [33]:
# Create an English stemmer that uses the Snowball technique
stemmer = nltk.stem.snowball.SnowballStemmer("english")


In [36]:
# Stem the following words: charge, charging, charged
print(stemmer.stem("charge"))
print(stemmer.stem("charging"))
print(stemmer.stem("charged"))

charg
charg
charg


In [39]:
# Can you stem "words" with punctuation in them? Or which have no letters?
print(stemmer.stem("doesn't"))
print(stemmer.stem("342"))

doesn't
342


In [59]:
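# Create a new list of words from the novels by dropping out spurious non-words.
# You might find word_is_just_letters() helpful
import re

def word_is_just_letters(word):
    # Note: the pattern is anchored only at the start, so any token that
    # begins with a letter passes (including "n't"). Use '^[a-zA-Z]+$'
    # to require that every character is a letter.
    return re.search('^[a-zA-Z]+', word)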

alice_words = nltk.word_tokenize(alice)
alice_words2 = []
for word in alice_words:
    if word_is_just_letters(word):
        alice_words2.append(word)

In [60]:
# Stem all those words
alice_words_stemmed = []
for word in alice_words2:
    stem = stemmer.stem(word)
    alice_words_stemmed.append(stem)

In [63]:
# create two collections.Counter objects (one for each novel)
# so that you can easily count word stems. If you give
# the stemmed lists as an argument to constructor, 
# you can use .most_common(25) to get the top 25 tokens
from collections import Counter
word_counter = Counter(alice_words_stemmed)
word_counter.most_common(25)


[(u'the', 1616),
 (u'and', 810),
 (u'to', 720),
 (u'a', 620),
 (u'it', 597),
 (u'she', 545),
 (u'of', 499),
 (u'said', 462),
 (u'alic', 397),
 (u'was', 366),
 (u'i', 364),
 (u'in', 359),
 (u'you', 356),
 (u'that', 284),
 (u'as', 256),
 (u'her', 252),
 (u'at', 209),
 (u"n't", 204),
 (u'on', 191),
 (u'had', 184),
 (u'with', 179),
 (u'all', 178),
 (u'be', 167),
 (u'for', 146),
 (u'so', 144)]

###Lemmatization / synset
What:  Derive the canonical form ('lemma') of a word
    
Why:   Can be better than stemming because it reduces words to a real dictionary ('normal') form.
    
Notes: Uses a dictionary-based approach (slower than stemming)
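The contrast with stemming can be sketched with a lookup table. The `LEMMAS` dictionary below is a tiny hypothetical stand-in; a real lemmatizer relies on a full lexical database such as WordNet and usually needs the part of speech as well.

```python
# Hypothetical mini-dictionary mapping word forms to their lemma
LEMMAS = {'wolves': 'wolf', 'went': 'go', 'mice': 'mouse', 'dogs': 'dog'}

def toy_lemmatize(word):
    # Look the word up; fall back to the word itself when it is unknown
    return LEMMAS.get(word.lower(), word)

for w in ['wolves', 'went', 'Mice', 'rabbit']:
    print('{} -> {}'.format(w, toy_lemmatize(w)))
```

Unlike a rule-based stemmer, the lookup handles irregular forms such as "went", but only for words that are in the dictionary.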
    

In [65]:
# What synsets does 'dog' belong to?
nltk.corpus.wordnet.synsets("dog")

[Synset('dog.n.01'),
 Synset('frump.n.01'),
 Synset('dog.n.03'),
 Synset('cad.n.01'),
 Synset('frank.n.02'),
 Synset('pawl.n.01'),
 Synset('andiron.n.01'),
 Synset('chase.v.01')]

In [68]:
# Which synset is the one you were thinking of?
nltk.corpus.wordnet.synsets("dog")[0].definition()

u'a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds'

In [72]:
# What is its hypernym?
nltk.corpus.wordnet.synsets("dog")[0].hypernyms()

[Synset('canine.n.02'), Synset('domestic_animal.n.01')]

In [73]:
# What about wolves? What synsets does it belong to?
nltk.corpus.wordnet.synsets("wolves")

[Synset('wolf.n.01'),
 Synset('wolf.n.02'),
 Synset('wolf.n.03'),
 Synset('wolf.n.04'),
 Synset('beast.n.02')]

In [75]:
# How closely related are those concepts (dogs and wolves)?
dog = nltk.corpus.wordnet.synsets("dog")[0]
wolves = nltk.corpus.wordnet.synsets("wolves")[0]
dog.path_similarity(wolves)

0.3333333333333333

In [77]:
# How closely related are the concepts 'dog' and 'novel'?
novel = nltk.corpus.wordnet.synsets('novel')[0]
dog.path_similarity(novel)

0.0625
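Path similarity can be read as 1 / (shortest path length + 1) between two concepts in the hypernym ("is a kind of") graph. The sketch below works under that assumption on a small hand-written taxonomy; the `HYPERNYMS` table is hypothetical and far simpler than WordNet.

```python
from collections import deque

# Hypothetical, hand-written fragment of a hypernym taxonomy
HYPERNYMS = {
    'dog': ['canine'],
    'wolf': ['canine'],
    'canine': ['carnivore'],
    'cat': ['feline'],
    'feline': ['carnivore'],
    'carnivore': [],
}

def neighbours(node):
    # Edges can be walked up (to hypernyms) and down (to hyponyms)
    up = HYPERNYMS.get(node, [])
    down = [child for child, parents in HYPERNYMS.items() if node in parents]
    return up + down

def path_similarity(a, b):
    # Breadth-first search for the shortest path, then 1 / (distance + 1)
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return 1.0 / (dist + 1)
        for nxt in neighbours(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no connecting path

print(path_similarity('dog', 'wolf'))
print(path_similarity('dog', 'cat'))
```

In this toy graph "dog" and "wolf" are two steps apart (via "canine"), which gives 1/3, the same value reported for the WordNet synsets above.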

## Lab Part 3 Part of speech tagging

Other:
- Analysing data with the Alchemy API
- Further Reading

###Part of Speech Tagging

What:  Determine the part of speech of a word
    
Why:   This can inform other methods and models such as Named Entity Recognition
    
Notes: http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

In [27]:
# Use nltk.pos_tag to parse a sentence


In [28]:
# (Optional for the enthusiastic)
# What verbs did Jane Austen use a lot of?
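One way to approach the optional question, sketched on hand-written data: `nltk.pos_tag` returns a list of (word, tag) pairs, and in the Penn Treebank tag set linked above every verb tag starts with "VB". The `tagged` list and the `count_verbs` helper are invented for this illustration and were not produced by running the tagger.

```python
from collections import Counter

# Hand-written (word, tag) pairs in the shape that a POS tagger returns
tagged = [('Emma', 'NNP'), ('was', 'VBD'), ('walking', 'VBG'), ('and', 'CC'),
          ('she', 'PRP'), ('was', 'VBD'), ('thinking', 'VBG'), ('.', '.')]

def count_verbs(tagged_tokens):
    # Penn Treebank verb tags: VB, VBD, VBG, VBN, VBP, VBZ
    return Counter(word.lower() for word, tag in tagged_tokens
                   if tag.startswith('VB'))

verb_counts = count_verbs(tagged)
print(verb_counts.most_common(1))
```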

## Lab Part 4
###Stopword Removal

What:  Remove common words that will likely appear in any text
    
Why:   They don't tell you much about your text

In [29]:
# most of the top 25 stemmed tokens are "worthless"
word_counter.most_common(25)

In [None]:
# view the list of stopwords
stopwords = nltk.corpus.stopwords.words('english')
sorted(stopwords)

In [None]:
##################
### Exercise  ####
##################


# Create a variable called stemmed_stops which is the 
# stemmed version of each stopword in stopwords
# Use the stemmer we used up above!

# Then create a list called stemmed_tokens_no_stop that 
# contains only the tokens in alice_words_stemmed that aren't in 
# stemmed_stops

# Show the 25 most common stemmed non stop word tokens
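The filtering step itself is small. Here is a sketch on toy data, where `toy_stops` stands in for the stemmed stopword list and `toy_tokens` for the stemmed tokens; both are made up for the example.

```python
from collections import Counter

# Tiny hand-written stop list and token list, for illustration only
toy_stops = {'the', 'and', 'to', 'a', 'it', 'she', 'of'}
toy_tokens = ['the', 'rabbit', 'and', 'the', 'queen', 'said',
              'to', 'alic', 'the', 'rabbit']

def remove_stops(tokens, stops):
    # Keep only the tokens that are not stopwords
    return [tok for tok in tokens if tok not in stops]

no_stop = remove_stops(toy_tokens, toy_stops)
print(Counter(no_stop).most_common(2))
```

Using a set for the stopwords keeps each membership test fast. The stopwords need to be stemmed with the same stemmer as the tokens, otherwise the stemmed and unstemmed forms will not match.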

## Lab Part 5
###Named Entity Recognition

What:  Automatically extract the names of people, places, organizations, etc.

Why:   Can help you to identify "important" words

Notes: Training NER classifier requires a lot of annotated training data
       Should be trained on data relevant to your task
       Stanford NER classifier is the "gold standard"

In [None]:
sentence = 'Ian is an instructor for General Assembly'

tokenized = nltk.word_tokenize(sentence)

tokenized

In [None]:
tagged = nltk.pos_tag(tokenized)

tagged


In [None]:
chunks = nltk.ne_chunk(tagged)

chunks


In [None]:
def extract_entities(text):
    entities = []
    # tokenize into sentences
    for sentence in nltk.sent_tokenize(text):
        # tokenize sentences into words
        # add part-of-speech tags
        # use NLTK's NER classifier
        chunks = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
        # parse the results
        entities.extend([chunk for chunk in chunks if hasattr(chunk, 'label')])
    return entities

for entity in extract_entities('Ian is an instructor for General Assembly'):
    print('[' + entity.label() + '] ' + ' '.join(c[0] for c in entity.leaves()))

## Lab Part 6
###Term Frequency - Inverse Document Frequency (TF-IDF)

What:  Computes the "relative frequency" with which a word appears in a document,
           compared to its frequency across all documents

Why:   More useful than "term frequency" for identifying "important" words in
           each document (high frequency in that document, low frequency in
           other documents)

Notes: Used for search engine scoring, text summarization, document clustering

How: 
    TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document).
    IDF(t) = log_e(Total number of documents / Number of documents with term t in it).
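The two formulas can be applied by hand to three short sentences. This is a minimal sketch of the textbook definition above, with helper names chosen for the example. scikit-learn's `TfidfVectorizer` smooths the IDF and normalizes each row by default, so its numbers will differ from these.

```python
import math

sample = ['Bob likes sports', 'Bob hates sports', 'Bob likes likes trees']
docs = [doc.lower().split() for doc in sample]

def tf(term, doc):
    # times the term appears in the document / total terms in the document
    return doc.count(term) / float(len(doc))

def idf(term, docs):
    # assumes the term occurs in at least one document
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / float(containing))

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

for term in ['bob', 'likes', 'trees']:
    print('{}: {:.3f}'.format(term, tfidf(term, docs[2], docs)))
```

"bob" appears in every document, so its IDF is log(1) = 0 and it scores zero however often it occurs. "trees" appears in only one document, so it scores highest in the third sentence even though "likes" occurs there twice.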

In [None]:
sample = ['Bob likes sports', 'Bob hates sports', 'Bob likes likes trees']

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()


In [None]:
# Each row represents a sentence
# Each column represents a word
vect.fit_transform(sample).toarray()
vect.get_feature_names()


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
tfidf.fit_transform(sample).toarray()
tfidf.get_feature_names()


In [None]:
# the IDF of each word
idf = tfidf.idf_
print(dict(zip(tfidf.get_feature_names(), idf)))


In [None]:
###############
## Exercise ###
###############


# for each sentence in sample, find the most "interesting" words
# by ordering their tfidf scores in descending order


## Lab Part 7

###LDA - Latent Dirichlet Allocation

What:  Way of automatically discovering topics from sentences

Why:   Much quicker than manually creating and identifying topic clusters

In [None]:
import lda

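# Instantiate a count vectorizer with two additional parameters
# NOTE: `sentences` is not defined earlier in this notebook; it must be a
# list of strings (for example alice_sent from Lab Part 1)
vect = CountVectorizer(stop_words='english', ngram_range=(1, 3))
sentences_train = vect.fit_transform(sentences)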


In [None]:
# Instantiate an LDA model
model = lda.LDA(n_topics=10, n_iter=500)
model.fit(sentences_train) # Fit the model 
n_top_words = 10
topic_word = model.topic_word_
for i, topic_dist in enumerate(topic_word):
    # take the n_top_words most probable words for this topic, highest first
    topic_words = np.array(vect.get_feature_names())[np.argsort(topic_dist)][:-(n_top_words + 1):-1]
    print('Topic {}: {}'.format(i, ', '.join(topic_words)))


In [None]:
# EXAMPLE: Automatically summarize a document


# corpus of 2000 movie reviews
from nltk.corpus import movie_reviews
reviews = [movie_reviews.raw(filename) for filename in movie_reviews.fileids()]


In [None]:
# create document-term matrix
tfidf = TfidfVectorizer(stop_words='english')
dtm = tfidf.fit_transform(reviews)
features = tfidf.get_feature_names()

In [None]:
import numpy as np


In [None]:
# find the most and least "interesting" sentences in a randomly selected review
def summarize():
    
    # choose a random movie review    
    review_id = np.random.randint(0, len(reviews))
    review_text = reviews[review_id]

    # we are going to score each sentence in the review for "interesting-ness"
    sent_scores = []
    # tokenize document into sentences
    for sentence in nltk.sent_tokenize(review_text):
        # exclude short sentences
        if len(sentence) > 6:
            score = 0
            token_count = 0
            # tokenize sentence into words
            tokens = nltk.word_tokenize(sentence)
            # compute sentence "score" by summing TFIDF for each word
            for token in tokens:
                if token in features:
                    score += dtm[review_id, features.index(token)]
                    token_count += 1
            # divide score by the number of matched tokens
            # (plus one, which avoids dividing by zero)
            sent_scores.append((score / float(token_count + 1), sentence))

    # lowest scoring sentences
    print('\nLOWEST:\n')
    for sent_score in sorted(sent_scores)[:3]:
        print(sent_score[1])

    # highest scoring sentences
    print('\nHIGHEST:\n')
    for sent_score in sorted(sent_scores, reverse=True)[:3]:
        print(sent_score[1])

# try it out!
summarize()
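The scoring rule inside `summarize()` can be looked at in isolation. The sketch below uses a small hypothetical `weights` dictionary in place of the TF-IDF matrix, and `score_sentence` is a name chosen for this example.

```python
def score_sentence(tokens, weights):
    # Sum the weight of every known token, then divide by (matches + 1),
    # mirroring the smoothing used in summarize()
    matched = [weights[t] for t in tokens if t in weights]
    return sum(matched) / float(len(matched) + 1)

# Hypothetical TF-IDF weights for one review
weights = {'plot': 0.1, 'dinosaur': 0.6, 'spectacular': 0.5}
sents = [['the', 'plot', 'is', 'thin'],
         ['a', 'spectacular', 'dinosaur', 'chase']]

ranked = sorted(sents, key=lambda s: score_sentence(s, weights), reverse=True)
print(' '.join(ranked[0]))
```

Sentences built from rare, review-specific words rise to the top, while sentences made of common words (or of words the vectorizer never saw) sink to the bottom.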

## Lab Part 8

In [None]:
# TextBlob Demo: "Simplified Text Processing"
# Installation: pip install textblob
! pip install textblob

In [None]:
from textblob import TextBlob, Word

In [None]:
# identify words and noun phrases
blob = TextBlob('Greg and Thamali are instructors for General Assembly')
blob.words
blob.noun_phrases

In [None]:
# sentiment analysis
blob = TextBlob('I hate this horrible movie. This movie is not very good.')
blob.sentences
blob.sentiment.polarity
[sent.sentiment.polarity for sent in blob.sentences]

In [None]:
# sentiment subjectivity
TextBlob("I am a cool person").sentiment.subjectivity # Pretty subjective
TextBlob("I am a person").sentiment.subjectivity # Pretty objective
# different scores for essentially the same sentence
print(TextBlob('Greg and Thamali are instructors for General Assembly in Sydney').sentiment.subjectivity)



In [None]:
# singularize and pluralize
blob = TextBlob('Put away the dishes.')
[word.singularize() for word in blob.words]

In [None]:
[word.pluralize() for word in blob.words]


In [None]:
# spelling correction
blob = TextBlob('15 minuets late')
blob.correct()

In [None]:
# spellcheck
Word('parot').spellcheck()


In [None]:
# definitions
Word('bank').define()
Word('bank').define('v')

In [None]:
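# translation and language identification
# Note: both calls rely on an external translation service, so they need
# network access; newer TextBlob releases have deprecated and removed them
blob = TextBlob('Welcome to the classroom.')
blob.translate(to='es')
blob = TextBlob('Hola amigos')
blob.detect_language()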