In [1]:
import nltk

### 1. Tokenizing text to sentences 

We can tokenize a block of text into individual sentences using the following: 

In [2]:
# Tokenize text into sentences 
from nltk.tokenize import sent_tokenize
paragraph = "Hello world! It's good to see you.  Thanks for stopping by!"
sent_tokenize(paragraph)

['Hello world!', "It's good to see you.", 'Thanks for stopping by!']

### 2. Tokenizing sentences into words

We can also split a sentence into a list of individual words. Punctuation marks come back as separate tokens:

In [4]:
# Tokenize sentence into words
from nltk.tokenize import word_tokenize
word_tokenize("Hello world! It's me")

['Hello', 'world', '!', 'It', "'s", 'me']

As we can see, `word_tokenize()` splits contractions, so "It's" comes back as two tokens, `It` and `'s`. If we want to keep contractions whole, we need to handle them ourselves or use a different tokenizer.
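One possible way to handle this is to glue the split-off pieces back onto the preceding token after tokenizing. The sketch below is plain Python and operates on a token list like the one shown above. The `merge_contractions` helper and the `CLITICS` set are hypothetical names introduced here for illustration, and the set is not exhaustive. Note that this also rejoins possessives such as `John` + `'s`.

```python
# Pieces that word_tokenize() typically splits off (an assumed, non-exhaustive list).
CLITICS = {"'s", "'re", "'ve", "'ll", "'d", "'m", "n't"}

def merge_contractions(tokens):
    """Glue clitic tokens back onto the token that precedes them."""
    merged = []
    for token in tokens:
        if merged and token.lower() in CLITICS:
            merged[-1] += token  # append the clitic to the previous token
        else:
            merged.append(token)
    return merged

tokens = ['Hello', 'world', '!', 'It', "'s", 'me']
print(merge_contractions(tokens))
# ['Hello', 'world', '!', "It's", 'me']
```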

### 3. Tokenizing sentences with regular expressions 

Regular expressions can get messy very quickly, so it is probably better to use them only when the previous two methods don't work satisfactorily:

In [5]:
# Use regular expressions to tokenize 
from nltk.tokenize import RegexpTokenizer 
# Raw string so that \w is passed to the regex engine unchanged
tokenizer = RegexpTokenizer(r"[\w']+")
tokenizer.tokenize("Can't is a contraction.")

["Can't", 'is', 'a', 'contraction']
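`RegexpTokenizer` can also treat the pattern as the separator rather than as the token, by passing `gaps=True`. As a small sketch, splitting on whitespace keeps contractions whole but leaves punctuation attached to the neighboring word:

```python
from nltk.tokenize import RegexpTokenizer

# With gaps=True the pattern describes what lies BETWEEN tokens.
whitespace_tokenizer = RegexpTokenizer(r"\s+", gaps=True)
tokens = whitespace_tokenizer.tokenize("Can't is a contraction.")
print(tokens)
# ["Can't", 'is', 'a', 'contraction.']
```

Which variant fits better depends on whether trailing punctuation should stay with the word.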

### 4. Training a sentence tokenizer 

We can train a `PunktSentenceTokenizer` on our own text when we want sentences split differently from the way NLTK's default tokenizer splits them.
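As a minimal sketch, passing raw text to the `PunktSentenceTokenizer` constructor trains it on that text. The `training_text` below is a made-up stand-in; a real training set should be a large body of raw text from the target domain, and results on a sample this small are not representative, so no output is shown.

```python
from nltk.tokenize import PunktSentenceTokenizer

# Hypothetical stand-in for a real training corpus.
training_text = (
    "Dr. Lee joined the lab in May. She works with Dr. Cho on optics. "
    "Dr. Cho wrote the first report. Dr. Lee reviewed it in June. "
)

# Passing raw text to the constructor trains the tokenizer on it.
custom_tokenizer = PunktSentenceTokenizer(training_text)

sentences = custom_tokenizer.tokenize("Dr. Lee is here. The talk starts at noon.")
print(sentences)
```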

### 5. Filtering stopwords 

Stopwords are common words that contribute little to the meaning of a sentence, like "the" or "a". NLTK comes with a `stopwords` corpus that contains word lists for many languages. Here's an example:

In [6]:
# Filter for stopwords 
from nltk.corpus import stopwords
english_stops = set(stopwords.words('english'))
words = ["Can't", 'is', 'a', 'contraction']
[word for word in words if word not in english_stops]

["Can't", 'contraction']
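The membership test above is case-sensitive, while the entries in NLTK's English stopword list are lowercase, so a capitalized stopword such as "The" would slip through. A minimal sketch of a case-insensitive filter follows. The small `demo_stops` set is a hypothetical stand-in for `english_stops` so that the snippet runs without the corpus installed.

```python
# Hypothetical stand-in for set(stopwords.words('english')).
demo_stops = {"the", "is", "a", "of"}

words = ["The", "cookbook", "is", "a", "collection", "of", "recipes"]

# Lowercase each word only for the lookup; keep the original casing in the result.
filtered = [word for word in words if word.lower() not in demo_stops]
print(filtered)
# ['cookbook', 'collection', 'recipes']
```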

### 6. Looking up Synsets for a word in WordNet 

WordNet is a lexical database of English that groups words into sets of synonyms called synsets. We can use NLTK to look up words in WordNet. Here's an example:

In [7]:
# Look up word in WordNet 
from nltk.corpus import wordnet 
syn = wordnet.synsets('cookbook')[0]
print(syn.name())
print(syn.definition())

cookbook.n.01
a book of recipes and cooking directions


### 7. Finding word collocations 

Collocations are sequences of two or more words that tend to appear together frequently. We can find them with the following example:

In [9]:
# Load Feynman lectures text (assumes feynman.txt has been added to the webtext corpus directory)
from nltk.corpus import webtext
from nltk.collocations import BigramCollocationFinder 
from nltk.metrics import BigramAssocMeasures 
words = [w.lower() for w in webtext.words('feynman.txt')]
bcf = BigramCollocationFinder.from_words(words)
bcf.nbest(BigramAssocMeasures.likelihood_ratio, 4)

[('â', '\x80\x9c'), ('1â', '\x80\x93'), (',', 'and'), (',', 'but')]

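This isn't too useful: the top bigrams are punctuation and mis-decoded quotation marks and dashes. Let's filter out stopwords and any token shorter than three characters, which also removes most punctuation: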

In [11]:
# Remove punctuation and stopwords from Feynman text 
from nltk.corpus import stopwords 
stopset = set(stopwords.words('english'))
filter_stops = lambda w: len(w) < 3 or w in stopset
bcf.apply_word_filter(filter_stops)
bcf.nbest(BigramAssocMeasures.likelihood_ratio, 10)

[('carbon', 'dioxide'),
 ('coming', 'back'),
 ('absolute', 'zero'),
 ('carbon', 'monoxide'),
 ('stuck', 'together'),
 ('billion', 'times'),
 ('stick', 'together'),
 ('tennis', 'balls'),
 ('water', 'vapor'),
 ('perpetual', 'motion')]
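Besides filtering by word, the finder can also drop bigrams that occur too rarely to be meaningful, using `apply_freq_filter()`. The sketch below uses a made-up token list in place of a real corpus and scores by raw frequency rather than likelihood ratio, just to keep the result easy to check by hand.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Made-up token list standing in for a real corpus.
tokens = "new york is big new york is busy i love new york".split()

finder = BigramCollocationFinder.from_words(tokens)
# Discard bigrams that occur fewer than 3 times.
finder.apply_freq_filter(3)
top = finder.nbest(BigramAssocMeasures.raw_freq, 5)
print(top)
# [('new', 'york')]
```

Here `('new', 'york')` occurs three times, while `('york', 'is')` occurs only twice and is filtered out.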