# Analyzing a Sentence

Let's start small. What can we learn about a sentence?

In [None]:
#
# Preamble
#

# Import our core libraries
import nltk
import pprint
import re
import itertools

# Read in the documents we will use
from nltk.corpus import reuters
coffee = reuters.open('test/19570').read()
gold = reuters.open('test/16589').read()

text = gold

# Segmentation and Tokenization

The first step is to get a sentence out of the document we loaded. The general process of extracting _meaningful units_ of text from a larger text is called __segmentation__.

When applied to finding smaller word-like units from text, the process is often referred to as __tokenization__ (and you may find these terms used interchangeably). 
        
A _meaningful unit_ is anything that is useful for your application: _sections_ of a document, _paragraphs_, _words_ or, in our case, _sentences_.
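
As a rough illustration of segmenting the same text at different granularities, here is a minimal sketch using only plain string methods. The `document` string is made up for this example and is not taken from the Reuters corpus.

```python
# A tiny made-up document; the text is illustrative only.
document = "Gold rose today. Traders were surprised.\n\nCoffee fell. Nobody noticed."

# Paragraph-level units: split on blank lines.
paragraphs = document.split("\n\n")

# Word-level units: a naive split on whitespace
# (punctuation stays attached to the neighbouring word).
words = document.split()

print(paragraphs)
print(words)
```

Which of these units is the "right" one depends entirely on what your application needs.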

## Sentence Segmentation

In [None]:
# Simple sentence segmentation

#
# Using string.split
#
sentences = text.split('.')
sentences[0:10]

# Q1: Are there any issues with this approach?



In [None]:
#
# Using re.split
#
sentences = re.split(r'[\.\?!]', text)
sentences[0:10]

# Q2: How about now? What other corner cases and caveats might 
#    we want to stay aware of?
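
One way to explore Q2 is to try both approaches on a small string that contains an abbreviation and a decimal number. The sketch below is only an illustration: the `sample` text is invented, and the lookbehind pattern is one possible refinement, not a complete solution.

```python
import re

# Invented sample with an abbreviation ("Mr.") and a decimal ("$3.50").
sample = "Mr. Jones paid $3.50 for coffee. Was it worth it? Yes!"

# The character-class split breaks on every '.', '?' and '!',
# and throws the punctuation away.
naive = re.split(r'[\.\?!]', sample)

# Splitting only on whitespace that follows punctuation keeps the
# punctuation and the decimal intact, but still breaks after "Mr.".
lookbehind = re.split(r'(?<=[.?!])\s+', sample)

print(naive)
print(lookbehind)
```

Neither version handles the abbreviation, which is the kind of corner case a trained sentence tokenizer is meant to deal with.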

In [None]:
# 
# Using NLTK.
# 

sentences = nltk.sent_tokenize(text)
sentences[0:10]

# Notice that this retains the punctuation in the sentences. 

In [None]:
# NLTK will also handle cases like Mr. Jones correctly.
with_salutations = """The quick brown fox jumped over Mr. Jones.
                      Much to the disapproval of the dogs."""
nltk.sent_tokenize(with_salutations)

## Word Tokenization

We can apply the same approach to finding word-like units of text. All of the methods above could be used, but similar caveats apply, so let's jump straight to using NLTK for tokenization.
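
To make those caveats concrete, here is a small sketch of the two simple approaches applied to words. The `sample` sentence and the regular expression are assumptions chosen for illustration.

```python
import re

# Invented sample sentence.
sample = "Mr. Jones didn't pay $3.50."

# Whitespace split: punctuation stays glued to the neighbouring word.
by_space = sample.split()

# Regex: runs of word characters, or any single non-space symbol.
# This separates punctuation, but also breaks up "didn't" and "$3.50".
by_regex = re.findall(r"\w+|[^\w\s]", sample)

print(by_space)
print(by_regex)
```

Both results are usable for some purposes, but neither treats contractions, abbreviations or amounts in a consistent way.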

In [None]:
# We first need to tokenize our sentence into smaller units, in our case words.
with_salutations = """The quick brown fox jumped over Mr. Jones.
                      Much to the disapproval of the dogs."""
words = nltk.word_tokenize(with_salutations)

# Let's print each token in this list along with its length.
# Notice that Python is whitespace sensitive, with indentation
# indicating block structure.
for word in words:
    word_len = len(word)
    print(word + ' : ' + str(word_len))


### __Your Turn!__

Plot the lengths of all the sentences in the __gold__ article.

In [None]:
# Exercise 1
# Plot the lengths of all the sentences in the __gold__ article.
# First print them and then see if you can make an ascii bar chart of them.
# It could look something like this...
#
# ****
# ******************
# ***********
#

gold = reuters.open('test/16589').read()

# Step one. Find all the lengths and print them to the console/notebook.

# Step two (extra credit). Make a horizontal ascii bar graph of the numbers


# Parts of Speech

What else can we find out about this sentence? We can look for specific patterns we have already identified, but we can also try to deduce what type of word each token is, e.g. whether it is a noun or a verb.

This activity is known as __Part of Speech Tagging__ _(POS Tagging)_. A number of algorithms have been developed to do this; some are statistical and others are rule based.
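
To give a feel for what "rule based" can mean, here is a toy tagger built from a handful of suffix rules. Everything in it (the `RULES` list, the `rule_tag` function and the rules themselves) is hypothetical and far cruder than any real tagger; it is only meant to show the idea.

```python
import re

# Hypothetical rules for illustration only; the first matching rule wins.
RULES = [
    (r".*ing$", "VBG"),     # gerunds, e.g. "sleeping"
    (r".*ed$", "VBD"),      # simple past, e.g. "jumped"
    (r"^[A-Z].*$", "NNP"),  # capitalised words, e.g. "Jones"
    (r".*s$", "NNS"),       # plural nouns, e.g. "dogs"
    (r".*", "NN"),          # default: tag everything else as a noun
]

def rule_tag(tokens):
    """Tag each token with the tag of the first rule whose pattern matches."""
    tagged = []
    for token in tokens:
        for pattern, tag in RULES:
            if re.match(pattern, token):
                tagged.append((token, tag))
                break
    return tagged

result = rule_tag(["Jones", "jumped", "over", "sleeping", "dogs"])
print(result)
```

Note that "over" falls through to the default rule and is tagged as a noun, which is wrong. Rules like these run out of steam quickly, which is one reason statistical taggers are widely used.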

Let's look at how we can do this in NLTK.


In [None]:
# Let's look at the first sentence of the article
sentence = sentences[0]
print(sentence)

In [None]:
# We first need to tokenize our sentence into smaller units, in our case words.
words = nltk.word_tokenize(sentence)
words

In [None]:
# Note that the result includes punctuation as tokens.

In [None]:
# Now that we have the words we can use nltk's POS tagger

tagged = nltk.pos_tag(words)
tagged

Tags like NN and NNP come from a standardized set of tags known as the __Penn Treebank POS Tags__.

You can see the full list here https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
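
Related tags in this set share a prefix (noun tags start with NN, verb tags with VB), which makes it easy to group them into coarser categories. The sketch below assumes a hand-written list of `(word, tag)` pairs; it is not real tagger output.

```python
from collections import Counter

# Hand-written (word, tag) pairs for illustration; not real tagger output.
tagged_example = [
    ("Gold", "NNP"), ("prices", "NNS"), ("rose", "VBD"),
    ("sharply", "RB"), ("in", "IN"), ("London", "NNP"), (".", "."),
]

# Coarse category = first two characters of the tag (NN, VB, RB, ...).
coarse = Counter(tag[:2] for _, tag in tagged_example)
print(coarse)
```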

In [None]:
# Let's extract all the proper nouns (NNP) from all the sentences in the document

# Returns the (word, tag) pairs from the words input param whose tag matches the given tag.
def get_tagged(words, tag):
    tagged = nltk.pos_tag(words)

    matches = []
    for tup in tagged:
        if tup[1] == tag:
            matches.append(tup)
    return matches

# This is the same function as above, but uses a List Comprehension
def get_tagged(words, tag):
    tagged = nltk.pos_tag(words)
    return [tup for tup in tagged if tup[1] == tag]

# Will word tokenize a list of sentences and return a list of token lists
def tokenize(sentences):
    return [nltk.word_tokenize(sentence) for sentence in sentences]

In [None]:
tokenized = tokenize(sentences)

In [None]:
nnps = [get_tagged(tokens, 'NNP') for tokens in tokenized]
nnps

As you can see, POS tagging takes a little bit of time. It can also yield imperfect results. However, it can be an interesting approach to finding out more about a text.

It is also useful to keep in mind that there are actually a number of different algorithms and models that can be used for POS tagging. 

See http://www.nltk.org/book/ch05.html and http://www.nltk.org/api/nltk.tag.html for a reference to what is in NLTK.

### __Your Turn!__

Plot the number of verbs (VB, VBD, VBG, VBN, VBP, VBZ) across all sentences in the gold article. Which sentence has the most verbs?

In [None]:
# Exercise 2
# Count the number of verbs (tags starting with VB) in each sentence of the gold article
# You may want to make some helper functions.


gold = reuters.open('test/16589').read()


# Step 1. Tokenize the article to sentences

# Step 2. For each sentence tokenize to words and store that list

# Step 3. POS tag all the tokens in each sentence and filter out non-verb
#         tokens.

# Step 4. Print out the counts for each sentence. Which sentence
#         has the most verbs?


# Feature Selection

One way to think about what we have just been doing is that we are finding _'things'_ that give some information about our sentences. Features could be anything countable, whether it is the amount of money mentioned in a sentence or the number of characters. A big part of text analysis, particularly the statistical and pattern-based approaches we will be looking at, is feature selection and extraction.
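
As a minimal sketch of what feature extraction can look like in code, the function below turns a sentence into a small dictionary of counts. The function name `extract_features`, the choice of features and the money pattern are all assumptions made for this example.

```python
import re

def extract_features(sentence):
    """Return a few countable features of a sentence (illustrative choices)."""
    return {
        "num_chars": len(sentence),
        "num_words": len(sentence.split()),
        # Dollar amounts such as $3.50 or $1,200
        "num_money": len(re.findall(r"\$\d[\d,]*(?:\.\d+)?", sentence)),
    }

features = extract_features("Gold rose $3.50 to $412 an ounce.")
print(features)
```

Applying a function like this to every sentence gives one row of numbers per sentence, which is a convenient starting point for plotting or comparison.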

Q3: What other features could you think to extract from a sentence?


# Examples

We can also see how even simple-seeming features can be used to good effect in building visualizations.

[Listening Post](https://vimeo.com/3885443) by Mark Hansen and Ben Rubin is an installation that displays sentences gathered from the internet that contain phrases like "I am" or "I love."

[We Feel Fine](http://wefeelfine.org/) by Jonathan Harris and Sep Kamvar begins by searching for blog posts that contain phrases like "I am feeling", "I feel" ... [Regex]

[Stereotropes](http://stereotropes.bocoup.com/) by Bocoup uses POS tagging to extract adjectives from text to then visualize descriptions of characters in film. [POS Tagging]
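
Phrase matching of the kind mentioned above can be approximated with a short regular expression. This is only a sketch under our own assumptions: the `candidates` list is invented, and the pattern is not taken from any of these projects.

```python
import re

# Invented sentences, not gathered from any real source.
candidates = [
    "I feel great about this.",
    "The market closed higher.",
    "Today I am feeling a little tired.",
]

# Match "I feel" or "I am feeling" as whole words, ignoring case.
pattern = re.compile(r"\bI (?:feel|am feeling)\b", re.IGNORECASE)

matches = [s for s in candidates if pattern.search(s)]
print(matches)
```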