
# Session 2: Loading text, tokenisation, tagging, dictionaries and ngrams

In [1]:
import nltk

In [2]:
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


**Welcome back!**

So, what did we learn yesterday? A brief recap:

* The **IPython** Notebook
* **Python**: syntax, variables, functions, etc.

Today's focus will be on **developing more advanced NLTK skills** and using these skills to **investigate our own data**. 

*Any questions or anything before we dive in?*

## Dirty data

Now that we're going beyond NLTK example data, we're bound to run into dirty data.

A common part of corpus building is corpus cleaning. Reasons for cleaning include:

1. Avoiding breaking the code with unexpected input
2. Ensuring that searches match as many examples as possible
3. Increasing readability and the accuracy of taggers, stemmers, parsers, etc.

The level and kind of cleaning depends on your data, the aims of your project and where you are in your research. In the case of very clean data (lucky you!), there may be little that needs to be done. With messy data, you may need to go as far as correcting variant spellings (for example in online conversation or very old books).
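As a rough illustration of that last point, the sketch below replaces a few variant spellings using a hand-made lookup table. The spellings (*publick*, *stroling*, *cloathing*) come from the text used later in this session; the table and the function name `normalise_spelling` are assumptions made for this example and are not part of NLTK.

```python
# Hypothetical lookup table: variant spelling -> modern spelling.
VARIANT_SPELLINGS = {
    "publick": "public",
    "stroling": "strolling",
    "cloathing": "clothing",
}

def normalise_spelling(tokens, variants=VARIANT_SPELLINGS):
    """Return a new list with known variant spellings replaced.

    The lookup ignores case; tokens that are not in the table are
    returned unchanged.
    """
    return [variants.get(token.lower(), token) for token in tokens]

sample_tokens = ["beneficial", "to", "the", "publick"]
print(normalise_spelling(sample_tokens))
```

A table like this only catches the variants you already know about, so it suits a small corpus better than a large one.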

If you need help with data cleaning, we offer training in [OpenRefine](https://github.com/yuandra/2016-02-01-data-acquisition-cleaning/blob/gh-pages/open-refine-01-intro.md).

### Discussion

*What are the characteristics of clean and messy data? Any personal experiences? Discuss with your neighbours.* 

It will be important to bear these characteristics in mind once you start building your own datasets and corpora. 

## Uploading text files

First of all, let's load in our text.

Search for Project Gutenberg and download a book as a plain text file.

I chose [A Modest Proposal](https://www.gutenberg.org/ebooks/1080)

We can also look at file contents within the IPython Notebook itself:

In [3]:
import os
from nltk import word_tokenize
from nltk.text import Text

**Challenge!**

1. Find a .txt file from Project Gutenberg or elsewhere and upload it to the Jupyter Notebook. 
2. Use `word_tokenize` to break up the text data. 
3. Print the first 100 tokens.

In [7]:
text_path = "a_modest_proposal.txt"

In [8]:
# open the file, read it into one long string, and close it automatically
with open(text_path, "r", encoding="UTF-8") as file:
    text = file.read()
print(text)

A MODEST PROPOSAL

For preventing the children of poor people in Ireland, from being a
burden on their parents or country, and for making them beneficial to
the publick.

by Dr. Jonathan Swift


1729



It is a melancholy object to those, who walk through this great town,
or travel in the country, when they see the streets, the roads and
cabbin-doors crowded with beggars of the female sex, followed by three,
four, or six children, all in rags, and importuning every passenger for
an alms. These mothers instead of being able to work for their honest
livelihood, are forced to employ all their time in stroling to beg
sustenance for their helpless infants who, as they grow up, either turn
thieves for want of work, or leave their dear native country, to fight
for the Pretender in Spain, or sell themselves to the Barbadoes.

I think it is agreed by all parties, that this prodigious number of
children in the arms, or on the backs, or at the heels of their mothers,
and frequently of their fathe

In [19]:
token_text = word_tokenize(text)
print(token_text[:100])

['A', 'MODEST', 'PROPOSAL', 'For', 'preventing', 'the', 'children', 'of', 'poor', 'people', 'in', 'Ireland', ',', 'from', 'being', 'a', 'burden', 'on', 'their', 'parents', 'or', 'country', ',', 'and', 'for', 'making', 'them', 'beneficial', 'to', 'the', 'publick', '.', 'by', 'Dr.', 'Jonathan', 'Swift', '1729', 'It', 'is', 'a', 'melancholy', 'object', 'to', 'those', ',', 'who', 'walk', 'through', 'this', 'great', 'town', ',', 'or', 'travel', 'in', 'the', 'country', ',', 'when', 'they', 'see', 'the', 'streets', ',', 'the', 'roads', 'and', 'cabbin-doors', 'crowded', 'with', 'beggars', 'of', 'the', 'female', 'sex', ',', 'followed', 'by', 'three', ',', 'four', ',', 'or', 'six', 'children', ',', 'all', 'in', 'rags', ',', 'and', 'importuning', 'every', 'passenger', 'for', 'an', 'alms', '.', 'These', 'mothers']


The books we were working with yesterday had already had some processing done on them so that we could use NLTK to find features of the language. Remember that Python regards a text file as a single long string of characters. The first thing to do is to start breaking the text up into sentences and words.

Breaking a text into tokens lets us do the sort of word counting that we were doing yesterday. We can do some more interesting linguistic analysis if we use part-of-speech (POS) tagging. NLTK supports a number of different POS tagsets; the simplest one is called 'universal', and we'll use that here.

In [15]:
sentence = "They refuse to permit us the refuse permit"
token_sen = word_tokenize(sentence)
print(token_sen)
tagged_sen = nltk.pos_tag(token_sen, tagset="universal")
print(tagged_sen)

['They', 'refuse', 'to', 'permit', 'us', 'the', 'refuse', 'permit']
[('They', 'PRON'), ('refuse', 'VERB'), ('to', 'PRT'), ('permit', 'VERB'), ('us', 'PRON'), ('the', 'DET'), ('refuse', 'NOUN'), ('permit', 'NOUN')]


Part of Speech tagging creates a list of pairs (Python tuples): it associates each word with its tag, and each pair is shown above in round brackets.
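Because each item in the tagged list is a pair, a loop can unpack the word and the tag into separate variables. The short sketch below reuses the tagged sentence printed above (typed in by hand so that it runs without the tagger) and keeps only the words tagged as nouns.

```python
# Tagged sentence copied from the output above.
tagged_sen = [('They', 'PRON'), ('refuse', 'VERB'), ('to', 'PRT'),
              ('permit', 'VERB'), ('us', 'PRON'), ('the', 'DET'),
              ('refuse', 'NOUN'), ('permit', 'NOUN')]

# Unpack each (word, tag) pair and keep the words whose tag is NOUN.
nouns = [word for (word, tag) in tagged_sen if tag == 'NOUN']
print(nouns)
```

The same pattern, with `'VERB'` in place of `'NOUN'`, picks out the verbs.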

In [25]:
tag_fdist_sen = nltk.FreqDist(tag for (word,tag) in tagged_sen)

In [26]:
tag_fdist_sen.most_common()

[('PRON', 2), ('VERB', 2), ('NOUN', 2), ('PRT', 1), ('DET', 1)]

**Challenge!**

Use Part of Speech tagging to tag the text that we have just tokenised, then do the following:
* Find the most common parts of speech
* Find the most common verbs and create a frequency distribution graph of your result
* Find the 10 most common nouns in the text

*Hint: to find the most common verbs and nouns, you will need to create a list that contains only the verbs or only the nouns from the text. Use a for loop to create your list. Then create a frequency distribution.*

In [22]:
tagged_text = nltk.pos_tag(token_text, tagset="universal")
print(tagged_text[:10])

[('A', 'DET'), ('MODEST', 'NOUN'), ('PROPOSAL', 'NOUN'), ('For', 'ADP'), ('preventing', 'VERB'), ('the', 'DET'), ('children', 'NOUN'), ('of', 'ADP'), ('poor', 'ADJ'), ('people', 'NOUN')]


In [24]:
tag_fdist_text = nltk.FreqDist(tag for (word,tag) in tagged_text)
tag_fdist_text.most_common(10)

[('NOUN', 766),
 ('VERB', 611),
 ('ADP', 510),
 ('.', 486),
 ('DET', 395),
 ('ADJ', 296),
 ('PRON', 279),
 ('ADV', 222),
 ('CONJ', 168),
 ('PRT', 121)]

In [30]:
verblist_text = []
for (word,tag) in tagged_text:
    if tag == 'VERB':
        verblist_text.append(word)
print(len(verblist_text))
verb_fdist = nltk.FreqDist(verblist_text)
print(verb_fdist.most_common(10))

611
[('be', 51), ('will', 36), ('have', 29), ('are', 22), ('is', 21), ('would', 18), ('being', 16), ('can', 15), ('may', 11), ('make', 8)]


In [38]:
nounlist_text = []
for (word,tag) in tagged_text:
    if tag == 'NOUN':
        nounlist_text.append(word)
print(len(nounlist_text))
noun_fdist = nltk.FreqDist(nounlist_text)
print(noun_fdist.most_common(20))

766
[('children', 19), ('kingdom', 14), ('country', 13), ('number', 11), ('thousand', 10), ('child', 9), ('year', 9), ('parents', 8), ('years', 8), ('shillings', 7), ('food', 6), ('pounds', 6), ('time', 5), ('infants', 5), ('want', 5), ('work', 5), ('nation', 5), ('charge', 5), ('breeders', 5), ('people', 4)]


**Extension - COLLOCATIONS**
There are a few things to note about a result like this. If your file still contains the Project Gutenberg header and licence text, *Project* and *Gutenberg* may be returned as two different, very frequent nouns. Because we're humans, not computers, we know it's likely that they often occur together. We could test for bigrams (words that typically occur side by side) to see if this is the case. 

In order to perform this test, we must first convert our list of tokens into an NLTK Text. We can then use specific NLTK functions on the text.

In [34]:
print(type(token_text))
nltk_text = nltk.Text(token_text)
print(type(nltk_text))
nltk_text.collocations()

<class 'list'>
<class 'nltk.text.Text'>
per annum; years old; twenty thousand; year old; hundred thousand; man
talk; fine gentlemen; eight shillings; ten shillings; therefore
humbly; solar year; already computed; fifty thousand; thousand
carcasses; thousand couple; would become; three pounds; poor people;
thousand children; two shillings


In [46]:
nltk_text.concordance('breeders', width=100)
nltk_text.similar('breeders')

Displaying 5 of 5 matches:
t two hundred thousand couple whose wives are breeders ; from which number I subtract thirty thousan
e will remain an hundred and seventy thousand breeders . I again subtract fifty thousand , for those
 the publick , because they soon would become breeders themselves : And besides , it is not improbab
 we are yearly over-run , being the principal breeders of the nation , as well as our most dangerous
wth and manufacture . Fourthly , The constant breeders , besides the gain of eight shillings sterlin



In [47]:
nltk_text.concordance('parents')
nltk_text.similar('parents')

Displaying 8 of 8 matches:
land , from being a burden on their parents or country , and for making them be
 at a certain age , who are born of parents in effect as little able to support
nstead of being a charge upon their parents , or the parish , or wanting food a
nd twenty thousand children of poor parents annually born . The question theref
n not turn to account either to the parents or kingdom , the charge of nutrimen
y have already devoured most of the parents , seem to have the best title to th
nd these to be disposed of by their parents if alive , or otherwise by their ne
swer , that they will first ask the parents of these mortals , whether they wou
children time want number arms backs heels nation age value charge
parish rest cloathing souls county art fruits sale persons


### Some linguistics...

*Functional linguistics* is a research area concerned with how *realised language* (lexis and grammar) works to achieve meaningful social functions.

One functional linguistic theory is *Systemic Functional Linguistics*, developed by Michael Halliday (Prof. Emeritus at University of Sydney).

Central to the theory is a division between **experiential meanings** and **interpersonal meanings**.

* Experiential meanings communicate what happened to whom, under what circumstances.
* Interpersonal meanings negotiate identities and role relationships between speakers.

Halliday argues that these two kinds of meaning are realised **simultaneously** through different parts of English grammar.

* Experiential meanings are made through **transitivity choices**.
* Interpersonal meanings are made through **mood choices**.


Transitivity choices include fitting together configurations of:

* Participants (*a man, green bikes*)
* Processes (*sleep, has always been, is considering*)
* Circumstances (*on the weekend*, *in Australia*)

Mood features of a language include:

* Mood types (*declarative, interrogative, imperative*)
* Modality (*would, can, might*)
* Lexical density: the number of words per clause, the ratio of content to non-content words, etc.

Lexical density is usually a good indicator of the general tone of texts. The language of academia, for example, often has a high ratio of nouns to verbs. We can approximate an academic tone simply by making nominally dense clauses: 

      The consideration of interest is the potential for a participant of a certain demographic to be in Group A or Group B.

Notice how not only are there many nouns (*consideration*, *interest*, *potential*, etc.), but that the verbs are very simple (*is*, *to be*).

In comparison, informal speech is characterised by smaller clauses, and thus more verbs.

      A: Did you feel like dropping by?
      B: I thought I did, but now I don't think I want to.

Here, we have only a few, simple nouns (*you*, *I*), with more expressive verbs (*feel*, *dropping by*, *think*, *want*).

> **Note**: SFL argues that through *grammatical metaphor*, one linguistic feature can stand in for another. *Would you please shut the door?* is an interrogative, but it functions as a command. *invitation* is a nominalisation of a process, *invite*. We don't have time to deal with these kinds of realisations, unfortunately.

In *A Modest Proposal*, the tagger found more nouns than verbs (766 to 611), and the most frequent verbs are generally quite simple ones: forms of *to be* and *to have* (*be*, *have*, *are*, *is*, *being*) account for five of the ten most common verbs, alongside modals such as *will* and *would*. This tentatively places Swift's text towards the formal end of the spectrum.
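One rough way to put a number on this kind of observation is the ratio of nouns to verbs. The sketch below uses the tag counts printed earlier for our text (766 tokens tagged NOUN and 611 tagged VERB); in the notebook you could read the same numbers from `tag_fdist_text`. Treat the result as an approximation only: as the verb list above shows, the universal tagset counts modals such as *will* and *would* as verbs.

```python
# Counts copied from the tag frequency output earlier in this session.
tag_counts = {'NOUN': 766, 'VERB': 611}

# A ratio above 1 means the text has more nouns than verbs.
noun_verb_ratio = tag_counts['NOUN'] / tag_counts['VERB']
print(round(noun_verb_ratio, 2))
```

Comparing this ratio across several texts is more informative than looking at a single value.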

## Recap
So far today we have:
* Imported text into NLTK
* Tokenised raw text into words
* Tagged words as parts of speech
* Converted a list into NLTK Text for further analysis

## Stopwords
Yesterday, when we did our frequency counts of the books in the NLTK library, we noticed that a lot of space was taken up by little words like 'and', 'of' and 'the', which don't add a lot to our understanding of a text. These are called 'stop words'. It will help our analysis if we exclude them.

In [60]:
fdist_text = nltk.FreqDist(nltk_text)
fdist_text.most_common(20)

[(',', 363),
 ('the', 165),
 ('of', 128),
 ('and', 110),
 ('to', 107),
 ('a', 83),
 ('in', 70),
 ('.', 66),
 ('I', 55),
 ('be', 51),
 ('for', 41),
 ('that', 38),
 ('their', 37),
 ('will', 36),
 ('as', 35),
 ('or', 34),
 ('by', 34),
 ('have', 29),
 ('at', 28),
 ('they', 27)]

In [59]:
print(len(nltk_text))
print(len(set(nltk_text)))

3912
1123


In [58]:
# find words, not punctuation
text = [item for item in nltk_text if item.isalpha()]
print(len(text))
# capitalisation doesn't matter
vocab = [word.lower() for word in text]
print(len(set(vocab)))

3403
1070


In [64]:
# stopword list in nltk documentation
from nltk.corpus import stopwords
ignored_words = nltk.corpus.stopwords.words("english")
unstopped = [word for word in vocab if word not in ignored_words]
print(unstopped[:10])
fdist_unstop = nltk.FreqDist(unstopped)
fdist_unstop.most_common(20)

['modest', 'proposal', 'preventing', 'children', 'poor', 'people', 'ireland', 'burden', 'parents', 'country']


[('children', 19),
 ('would', 18),
 ('kingdom', 15),
 ('one', 15),
 ('thousand', 15),
 ('country', 13),
 ('upon', 13),
 ('number', 11),
 ('may', 11),
 ('great', 10),
 ('therefore', 10),
 ('many', 9),
 ('child', 9),
 ('year', 9),
 ('parents', 8),
 ('well', 8),
 ('years', 8),
 ('two', 8),
 ('old', 8),
 ('hundred', 8)]

The list we have now is probably more interesting if we want to get a sense of the key issues in the text. Note that we're working with a very small sample here. This sort of analysis is much more useful over really big corpora.

*Note: We could have condensed these steps into a single line of code that looked like this:*

        unstopped = [word.lower() for word in nltk_text if word.isalpha() and word.lower() not in ignored_words]

## Collocation
We've just used collocation to test a hypothesis about the most common nouns in the text we were investigating. Collocation can be quite a powerful tool for finding features of language.

First, let's look for bigrams in the whole list of tokens:
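A minimal sketch of this step: pair each token with the one that follows it, then count the pairs with `collections.Counter` from the standard library. The short `sample_tokens` list stands in for the full `token_text`, and the variable names are chosen for this example. NLTK's `nltk.bigrams()` produces the same kind of pairs.

```python
from collections import Counter

# The opening tokens of the text, copied from the tokenised output above.
sample_tokens = ['A', 'MODEST', 'PROPOSAL', 'For', 'preventing', 'the',
                 'children', 'of', 'poor', 'people', 'in', 'Ireland', ',']

# Pair every token with its right-hand neighbour.
bigram_list = list(zip(sample_tokens, sample_tokens[1:]))
bigram_counts = Counter(bigram_list)

print(bigram_list[:3])
print(bigram_counts.most_common(3))
```

On the full token list, the most frequent pairs are likely to be dominated by punctuation and short function words, since those are the most frequent individual tokens.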

That doesn't tell us much. Let's try again with `unstopped`, our list of tokens with the punctuation and stopwords removed:
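A minimal sketch of this second attempt: pair each word in the filtered list with the word after it and count the pairs. Here `sample_unstopped` is copied from the first ten items of `unstopped` as printed in the stopwords section; in the notebook you would pass the whole list.

```python
from collections import Counter

# First ten items of `unstopped`, copied from the stopwords section above.
sample_unstopped = ['modest', 'proposal', 'preventing', 'children', 'poor',
                    'people', 'ireland', 'burden', 'parents', 'country']

# Count each pair of neighbouring words in the filtered list.
unstopped_bigrams = Counter(zip(sample_unstopped, sample_unstopped[1:]))
print(unstopped_bigrams.most_common(3))
```

One caveat: after filtering, two words can become neighbours even though a stopword separated them in the original text (*people* and *ireland* here were originally *people in Ireland*), so these pairs are not always true adjacent bigrams.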

As well as identifying collocations (words that appear near each other), we can also look for n-grams or clusters: runs of *n* words that appear immediately adjacent to each other. Repeated n-grams are a good way to get a sense of what a text is about. First, let's see how n-grams are created:
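As a sketch of the mechanics, the helper below slides a window of *n* tokens along a list and collects each window as a tuple. The name `make_ngrams` is made up for this illustration; NLTK provides `nltk.ngrams()` for the same job. The example sentence is the one we tagged earlier.

```python
def make_ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple."""
    # Line up the list with copies of itself shifted by 1, 2, ... n-1 places,
    # then zip the copies together to form the windows.
    return list(zip(*(tokens[i:] for i in range(n))))

sentence_tokens = ['They', 'refuse', 'to', 'permit', 'us', 'the', 'refuse', 'permit']
trigram_list = make_ngrams(sentence_tokens, 3)
for trigram in trigram_list:
    print(trigram)
```

A list of 8 tokens gives 6 trigrams, because the window can start at any of the first 6 positions.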

There are a lot of trigrams in the sentence, and they don't tell us much. It's when n-grams are repeated that they start to get interesting, but before we write the code for that we need to have some knowledge of dictionaries...

### Building a dictionary

We've already worked with strings and lists. Another kind of data structure in Python is a dictionary.
Here is how a simple dictionary works:
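A small sketch, using a few of the noun counts we found earlier as sample data (the variable names are chosen for this example):

```python
# A dictionary maps keys (here, words) to values (here, counts).
word_counts = {'children': 19, 'kingdom': 14, 'country': 13}

print(word_counts['children'])   # look up a value by its key
word_counts['number'] = 11       # add a new key-value pair
word_counts['children'] += 1     # update an existing value
print(word_counts)

# A list, by contrast, is written with square brackets and indexed by position.
word_list = ['children', 'kingdom', 'country']
print(word_list[0])
```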

The point of dictionaries is to store a key (the word) and a value (the count). When you ask for the key, you get its value.

Notice that you use curly braces for dictionaries, but square brackets for lists.

### Finding duplicate ngrams

This last bit of code is more advanced. Don't worry if you forget what every line means. If you are interested in getting more comfortable with Python, come to our [Python](https://github.com/resbaz/2015-12-14-Python-for-Researchers) course.
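A minimal sketch of the idea, not necessarily the only way to do it: build a dictionary whose keys are trigrams and whose values are counts, then keep the trigrams that appear more than once. The sample tokens are invented for this example (loosely modelled on collocations such as *hundred thousand* found earlier); in the notebook you would use your own token list.

```python
# Invented sample; in the notebook you would use your own token list.
tokens = ['a', 'hundred', 'thousand', 'children', 'and',
          'a', 'hundred', 'thousand', 'couple']

trigram_counts = {}
for i in range(len(tokens) - 2):
    trigram = tuple(tokens[i:i + 3])
    # .get() returns 0 the first time we meet a trigram, then we add 1.
    trigram_counts[trigram] = trigram_counts.get(trigram, 0) + 1

# Keep only the trigrams seen more than once.
repeated = {gram: count for gram, count in trigram_counts.items() if count > 1}
print(repeated)
```

Tuples are used for the keys because dictionary keys must be immutable, so a list of three words would not work.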

# Web scraping using Beautiful Soup

The most important skill for using NLTK in your life as a researcher is going to be working with your own texts. First, let's look at reading in text files directly from the web.
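A minimal sketch of that first step, using `urlopen` from the standard library. `urlopen(...).read()` returns raw bytes, so the helper decodes them into a string; the helper name `read_text_from_url` is made up for this example. So that the sketch runs without a network connection, the demonstration writes a small temporary file and reads it back through a `file://` URL; in the notebook you would pass the web address of a plain-text file instead.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen

def read_text_from_url(url, encoding="utf-8"):
    """Fetch a plain-text resource and decode its bytes into a string."""
    with urlopen(url) as response:
        raw = response.read()      # bytes, not yet a string
    return raw.decode(encoding)

# Offline demonstration: a temporary local file stands in for a web page.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample.txt"
    sample_path.write_text("A MODEST PROPOSAL\n", encoding="utf-8")
    demo_text = read_text_from_url(sample_path.as_uri())

print(type(demo_text))
print(demo_text)
```

The encoding is an assumption: UTF-8 is common, but some older files use a different one and will need a different `encoding` value.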

Of course, a lot of the text you're going to want to work with won't be in handy text files already. That's where a Python library called Beautiful Soup comes in.

*Note*: the ! is a way of accessing command line functions from the notebook. We could also do this in the terminal (without the !). 

In [77]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

In [84]:
url = "http://en.wikipedia.org/wiki/Smog"

In [85]:
raw = urlopen(url).read()
print(type(raw))

<class 'bytes'>


Beautiful Soup parses the raw HTML (one long sequence of bytes) into its constituent parts, creating a `BeautifulSoup` object.

In [86]:
soup = BeautifulSoup(raw, 'html.parser')
print(type(soup))

<class 'bs4.BeautifulSoup'>


In [87]:
texts = []
for para in soup.find_all('p'):
    text = para.text
    texts.append(text)
print(texts[:10])

['Smog is a type of air pollutant. The word "smog" was coined in the early 20th century as a portmanteau of the words smoke and fog to refer to smoky fog, its opacity, and odour.[1] The word was then intended to refer to what was sometimes known as pea soup fog, a familiar and serious problem in London from the 19th century to the mid 20th century. This kind of visible air pollution is composed of nitrogen oxides, sulfur oxides, ozone, smoke or particulates among others (less visible pollutants include carbon monoxide, CFCs and radioactive sources).[citation needed] Human-made smog is derived from coal emissions, vehicular emissions, industrial emissions, forest and agricultural fires and photochemical reactions of these emissions.', 'Modern smog, as found for example in Los Angeles, is a type of air pollution derived from vehicular emission from internal combustion engines and industrial fumes that react in the atmosphere with sunlight to form secondary pollutants that also combine wi

In [90]:
import re
# raw string, so the backslashes reach the regex engine unchanged
regex = re.compile(r'\[[0-9]*\]')
joined_texts = '\n'.join(texts)
joined_texts = re.sub(regex, '', joined_texts)
print(joined_texts[:1000])

Smog is a type of air pollutant. The word "smog" was coined in the early 20th century as a portmanteau of the words smoke and fog to refer to smoky fog, its opacity, and odour. The word was then intended to refer to what was sometimes known as pea soup fog, a familiar and serious problem in London from the 19th century to the mid 20th century. This kind of visible air pollution is composed of nitrogen oxides, sulfur oxides, ozone, smoke or particulates among others (less visible pollutants include carbon monoxide, CFCs and radioactive sources).[citation needed] Human-made smog is derived from coal emissions, vehicular emissions, industrial emissions, forest and agricultural fires and photochemical reactions of these emissions.
Modern smog, as found for example in Los Angeles, is a type of air pollution derived from vehicular emission from internal combustion engines and industrial fumes that react in the atmosphere with sunlight to form secondary pollutants that also combine with the p
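Notice that `[citation needed]` survives in the output above, because the pattern only matches square brackets that contain digits. If you also want to drop that marker, one option is to extend the pattern with an alternative, as in the sketch below. The name `marker_regex` and the shortened sample string are for illustration only.

```python
import re

# Matches numbered references such as [1] or [23], and also the
# literal marker [citation needed].
marker_regex = re.compile(r'\[(?:[0-9]+|citation needed)\]')

sample = ('its opacity, and odour.[1] and radioactive sources).'
          '[citation needed] Human-made smog')
cleaned = marker_regex.sub('', sample)
print(cleaned)
```

Other pages may use different bracketed markers, so it is worth printing a sample of your own text before deciding what to remove.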

In order to work on the text, the first step is to tokenise it into words.

In [92]:
wordlist = nltk.word_tokenize(joined_texts)
wordlist[:8]

['Smog', 'is', 'a', 'type', 'of', 'air', 'pollutant', '.']

For some other types of analysis, we'll need to create an NLTK text object

In [96]:
smog_text = nltk.Text(wordlist)
print(type(smog_text))

<class 'nltk.text.Text'>


And once we've done all that work creating clean text, it's a good idea to save it for later.

In [97]:
smog_text.concordance('smog')

Displaying 25 of 47 matches:
                                     Smog is a type of air pollutant . The wor
                                     smog '' was coined in the early 20th cent
s ) . [ citation needed ] Human-made smog is derived from coal emissions , veh
eactions of these emissions . Modern smog , as found for example in Los Angele
mary emissions to form photochemical smog . In certain other cities , such as 
rtain other cities , such as Delhi , smog severity is often aggravated by stub
fe or death . Coinage of the term `` smog '' is generally attributed to Dr. He
 clouds of smoke that contributes to smog . Air pollution from this source has
 major ingredient in the creation of smog in some large cities . The major cul
 ozone , and particles that comprise smog . Photochemical smog is the chemical
s that comprise smog . Photochemical smog is the chemical reaction of sunlight
active and oxidizing . Photochemical smog is therefore considered to be a prob
 reactions involved in 

In [98]:
%cd
! mkdir smog
%cd smog


/Users/cynthiiee
/Users/cynthiiee/smog


In [99]:
NLTK_file = open("NLTK-smog.txt", "w", encoding="UTF-8")
NLTK_file.write(str(wordlist))
NLTK_file.close()

37487

In [101]:
text_file = open("Smog-text.txt", "w", encoding="UTF-8")
text_file.write(joined_texts)
text_file.close()

Now have a look at the two files you've created in the file management system. Open them. How is the nltk file different from the .txt file?
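Writing `str(wordlist)` stores the printed form of the list, which is awkward to turn back into a list later. If you want to reload your tokens in another session, one option (an alternative to the cell above, not what it does) is the standard library's `json` module. The file name `smog-tokens.json` is only an example, and the sketch writes to a temporary folder so that it does not touch your working directory.

```python
import json
import tempfile
from pathlib import Path

# A few tokens copied from the output above, standing in for the full list.
wordlist = ['Smog', 'is', 'a', 'type', 'of', 'air', 'pollutant', '.']

with tempfile.TemporaryDirectory() as tmp_dir:
    token_path = Path(tmp_dir) / "smog-tokens.json"

    # Save the list of tokens as JSON.
    with open(token_path, "w", encoding="UTF-8") as token_file:
        json.dump(wordlist, token_file)

    # Load it back: the result is a real Python list again.
    with open(token_path, "r", encoding="UTF-8") as token_file:
        reloaded = json.load(token_file)

print(reloaded == wordlist)
```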

**Challenge!**
* Find a webpage of interest to your studies and use Beautiful Soup to extract the text
* Tokenise the text
* Find the most common words in your text (Extension: remove the stop words)
* Find trigrams in your text 
* Save your text to a text file

*Hint*: feel free to collude with your neighbours and please copy and paste our previous code! Copying and pasting are essential skills of developers, as well as googling error messages (seriously!). If you don't believe me, ask a computer scientist. 

In [102]:
url = "https://en.wikipedia.org/wiki/Handstand"
raw = urlopen(url).read()
print(type(raw))
soup = BeautifulSoup(raw, 'html.parser')
print(type(soup))

<class 'bytes'>
<class 'bs4.BeautifulSoup'>


In [103]:
texts = []
for para in soup.find_all('p'):
    text = para.text
    texts.append(text)
print(texts[:10])

['A handstand is the act of supporting the body in a stable, inverted vertical position by balancing on the hands. In a basic handstand the body is held straight with arms and legs fully extended, with hands spaced approximately shoulder-width apart. There are many variations of handstands, but in all cases a handstand performer must possess adequate balance and upper body strength.', 'Handstands are performed in many athletic activities, including acro dance, cheerleading, circus, yoga, and gymnastics. Some variation of handstand is performed on every gymnastic apparatus, and many tumbling skills pass through a handstand position during their execution. Breakdancers incorporate handstands in freezes and kicks. Armstand dives—a category found in competitive platform diving—are dives that begin with a handstand. In games or contests, swimmers perform underwater handstands with their legs and feet extended above the water.', 'Handstands are known by various other names. In yoga, the hand

In [104]:
regex = re.compile(r'\[[0-9]*\]')
joined_texts2 = '\n'.join(texts)
joined_texts2 = re.sub(regex, '', joined_texts2)
print(joined_texts2[:1000])

A handstand is the act of supporting the body in a stable, inverted vertical position by balancing on the hands. In a basic handstand the body is held straight with arms and legs fully extended, with hands spaced approximately shoulder-width apart. There are many variations of handstands, but in all cases a handstand performer must possess adequate balance and upper body strength.
Handstands are performed in many athletic activities, including acro dance, cheerleading, circus, yoga, and gymnastics. Some variation of handstand is performed on every gymnastic apparatus, and many tumbling skills pass through a handstand position during their execution. Breakdancers incorporate handstands in freezes and kicks. Armstand dives—a category found in competitive platform diving—are dives that begin with a handstand. In games or contests, swimmers perform underwater handstands with their legs and feet extended above the water.
Handstands are known by various other names. In yoga, the handstand is

In [105]:
wordlist = nltk.word_tokenize(joined_texts2)
wordlist[:8]

['A', 'handstand', 'is', 'the', 'act', 'of', 'supporting', 'the']

In [108]:
fdist = nltk.FreqDist(wordlist)
fdist.most_common(10)

[(',', 20),
 ('.', 18),
 ('the', 15),
 ('is', 11),
 ('handstand', 10),
 ('in', 10),
 ('and', 10),
 ('body', 7),
 ('a', 7),
 ('In', 7)]

In [111]:
# find words, not punctuation
text = [item for item in wordlist if item.isalpha()]
print(len(text))
# capitalisation doesn't matter
vocab = [word.lower() for word in text]
print(len(set(vocab)))
fdist_vocab = nltk.FreqDist(vocab)
fdist_vocab.most_common(10)

314
169


[('in', 17),
 ('the', 15),
 ('handstand', 11),
 ('is', 11),
 ('and', 10),
 ('a', 8),
 ('body', 7),
 ('are', 7),
 ('handstands', 7),
 ('of', 6)]

In [112]:
unstopped = [word for word in vocab if word not in ignored_words]
print(unstopped[:10])
fdist_unstop = nltk.FreqDist(unstopped)
fdist_unstop.most_common(10)

['handstand', 'act', 'supporting', 'body', 'stable', 'inverted', 'vertical', 'position', 'balancing', 'hands']


[('handstand', 11),
 ('body', 7),
 ('handstands', 7),
 ('many', 4),
 ('inverted', 3),
 ('basic', 3),
 ('extended', 3),
 ('cases', 3),
 ('balance', 3),
 ('performed', 3)]

In [113]:
hand_text = nltk.Text(wordlist)
hand_text.concordance('handstand')

Displaying 11 of 11 matches:
                                   handstand is the act of supporting the body 
alancing on the hands . In a basic handstand the body is held straight with arm
of handstands , but in all cases a handstand performer must possess adequate ba
and gymnastics . Some variation of handstand is performed on every gymnastic ap
any tumbling skills pass through a handstand position during their execution . 
diving—are dives that begin with a handstand . In games or contests , swimmers 
arious other names . In yoga , the handstand is known as Adho Mukha Vrksasana t
ed bananeira . There are two basic handstand styles in modern gymnastics : curv
le . In many cases ( e.g. , when a handstand is being performed in conjunction 
tands have these characteristics : Handstand `` freezes '' are common in breakd
t subject to formal rules . Common handstand variations include : Blood pressur
