# More with `nltk`
For the homework, you likely used nltk's sentence tokenizing function. However, nltk can do a lot more than just tokenizing sentences.

We're going to talk about a few key functions that may be of use for your final projects. Part-of-speech tagging may be especially useful for disambiguating words.

# Tag the parts-of-speech of my words
`nltk` makes it easy for us to tag our texts with their parts of speech.

This can be useful if you only want to look at one version of a homograph.

## e.g. race (v.) vs. race (n.)
Let's take a real example from one of our groups: In the homework assignment on collocation, people found that their results were made ambiguous by multiple senses of words with the same spelling. "Race" as in track-and-field is not as interesting to a group studying ethnicity as "race," as in the socially constructed phenomenon of grouping people by various characteristics.

We can use part-of-speech tagging to resolve these ambiguities, and find out which parts-of-speech the collocates of our keywords have too. 

# Getting started
We're going to begin by importing `nltk`:

In [1]:
import nltk
#nltk.download('book') uncomment this if you are having trouble getting this running

Then, we're going to get our text. For today, we're going to be using James Weldon Johnson's passing narrative, *Autobiography of an Ex-Colored Man* (1912).

In [501]:
import os
fn = '1912_johnson_ex-colored.txt'
fn in os.listdir()

False

In [2]:
johnson = '/Users/e/Downloads/1912_johnson_ex-colored.txt'
text = open(johnson).read()

In [3]:
text[:50]

'The Autobiography of an Ex-Colored Man\n\nJames Weld'

### Contractions
Now, we need to tokenize our text. We can use the `tokenize` function we wrote.

I've added a feature to deal with the contraction problem we discussed last time, specifically with regard to Obama's State of the Union addresses, and his distinctive words "don" and "t."

In [12]:
contractions = {"'em": '_em',
 "'ll": '_ll',
 "'til": '_til',
 "'tis": '_tis',
 "'twas": '_twas',
 "'tween": '_tween',
 "'twere": '_twere',
 "'twill": '_twill',
 "'twixt": '_twixt',
 "'twould": '_twould',
 "'un": '_un',
 "'ve": '_ve',
 "ain't": 'ain_t',
 "amn't": 'amn_t',
 "an'a": 'an_a',
 "an't": 'an_t',
 "anybody'd": 'anybody_d',
 "ar'n't": 'ar_n_t',
 "aren't": 'aren_t',
 "b'hoy": 'b_hoy',
 "br'er": 'br_er',
 "can't": 'can_t',
 "ch'in": 'ch_in',
 "couldn't": 'couldn_t',
 "d'": 'd_',
 "daren't": 'daren_t',
 "dasn't": 'dasn_t',
 "dassn't": 'dassn_t',
 "didn't": 'didn_t',
 "doesn't": 'doesn_t',
 "don't": 'don_t',
 "don'ts": 'don_ts',
 "e'en": 'e_en',
 "e'er": 'e_er',
 "h'm": 'h_m',
 "ha'": 'ha_',
 "ha'nt": 'ha_nt',
 "hadn't": 'hadn_t',
 "hain't": 'hain_t',
 "han't": 'han_t',
 "hasn't": 'hasn_t',
 "haven't": 'haven_t',
 "he'd": 'he_d',
 "he'll": 'he_ll',
 "he's": 'he_s',
 "her'n": 'her_n',
 "his'n": 'his_n',
 "howe'er": 'howe_er',
 "i'd": 'i_d',
 "i'll": 'i_ll',
 "i'm": 'i_m',
 "i've": 'i_ve',
 "in't": 'in_t',
 "isn't": 'isn_t',
 "it'd": 'it_d',
 "it'll": 'it_ll',
 "lor'": 'lor_',
 "ma'am": 'ma_am',
 "mayn't": 'mayn_t',
 "mightn't": 'mightn_t',
 "mustn't": 'mustn_t',
 "n'gana": 'n_gana',
 "ne'er": 'ne_er',
 "needn't": 'needn_t',
 "nobody'd": 'nobody_d',
 "o'": 'o_',
 "o'clock": 'o_clock',
 "o'er": 'o_er',
 "o'ertop": 'o_ertop',
 "oughtn't": 'oughtn_t',
 "our'n": 'our_n',
 "qur'an": 'qur_an',
 "shan't": 'shan_t',
 "she'd": 'she_d',
 "she'll": 'she_ll',
 "she's": 'she_s',
 "shouldn't": 'shouldn_t',
 "somebody'll": 'somebody_ll',
 "someone'll": 'someone_ll',
 "t'": 't_',
 "t'other": 't_other',
 "that'd": 'that_d',
 "that'll": 'that_ll',
 "there'd": 'there_d',
 "there'll": 'there_ll',
 "they'd": 'they_d',
 "they'll": 'they_ll',
 "they're": 'they_re',
 "they've": 'they_ve',
 "this'll": 'this_ll',
 "tho'": 'tho_',
 "thro'": 'thro_',
 "today'll": 'today_ll',
 "wa'": 'wa_',
 "wasn't": 'wasn_t',
 "we'd": 'we_d',
 "we'll": 'we_ll',
 "we're": 'we_re',
 "we've": 'we_ve',
 "weren't": 'weren_t',
 "what'd": 'what_d',
 "what'll": 'what_ll',
 "what're": 'what_re',
 "what've": 'what_ve',
 "what's": 'what_s',
 "whate'er": 'whate_er',
 "whatsoe'er": 'whatsoe_er',
 "when'd": 'when_d',
 "when'll": 'when_ll',
 "when're": 'when_re',
 "when's": 'when_s',
 "whene'er": 'whene_er',
 "whensoe'er": 'whensoe_er',
 "where'd": 'where_d',
 "where'er": 'where_er',
 "where'll": 'where_ll',
 "where're": 'where_re',
 "where's": 'where_s',
 "where've": 'where_ve',
 "wheresoe'er": 'wheresoe_er',
 "who'd": 'who_d',
 "who'll": 'who_ll',
 "who're": 'who_re',
 "who's": 'who_s',
 "who've": 'who_ve',
 "why'll": 'why_ll',
 "why're": 'why_re',
 "why's": 'why_s',
 "won't": 'won_t',
 "wouldn't": 'wouldn_t',
 "you'd": 'you_d',
 "you'll": 'you_ll',
 "you're": 'you_re',
 "you've": 'you_ve',
 "your'n": 'your_n'}

This is a new function to collapse contractions using the dictionary above:

In [13]:
def collapse_contractions(text, contractions = contractions):
    text = text.lower()
    
    for key,value in contractions.items():
        if key in text:
            text = text.replace(key,value) # the key is the form with the apostrophe; the value does not have it
    
    return text

Did it work?

In [14]:
'ain_t' in collapse_contractions(text)

True
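The function above only collapses the contractions listed in the dictionary. As a rough sketch of a different technique (not the function we use in the rest of this notebook), a single regular expression can replace any apostrophe that sits between two letters. The name `collapse_internal_apostrophes` and the sample sentence are made up for illustration. The trade-off is that forms with a leading or trailing apostrophe, such as `'tis` or `tho'`, are left untouched.

```python
import re

def collapse_internal_apostrophes(text):
    """Hypothetical helper: lowercase the text, then turn letter'letter into letter_letter."""
    text = text.lower()
    # the lookbehind and lookahead require a letter on both sides of the apostrophe,
    # so quotation marks around a word are not affected
    return re.sub(r"(?<=[a-z])'(?=[a-z])", "_", text)

sample = "Wouldn't it be great if we could've caught 'quoted' words like don't?"
collapsed = collapse_internal_apostrophes(sample)
print(collapsed)
```

Whether this is preferable depends on your corpus: it needs no dictionary, but it also collapses possessives like `cat's`.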

In [15]:
import string
import re

def tokenize(text, keep_punct = False):
    # NEW
    puncts = list(string.punctuation)
    puncts.remove('_') # keep _ characters

    if keep_punct is True:
        for punct in puncts:
            text = text.replace(punct, ' ' + punct + ' ')
    else:
        for punct in puncts:
            text = text.replace(punct, ' ')
    
    # this replaces *any* amount of whitespace with a single space using regular expressions
    text = re.sub(r'\s+', ' ', text) # raw string so that \s reaches the regex engine unchanged
    
    result = []
    
    for x in text.lower().split(' '):
        if x.isalpha():
            result.append(x)
        else:
            word = []
            for y in x: # for every character
                if y.isalpha() or y == '_': # retain our underscores
                    word.append(y)
            if len(word) > 0:
                result.append(''.join(word))
                
    return result

In [16]:
example = """Wouldn't it be great if we could've caught all of those shortened words like don't?"""

In [11]:
tokenize(collapse_contractions(example))

['wouldn_t',
 'it',
 'be',
 'great',
 'if',
 'we',
 'could_ve',
 'caught',
 'all',
 'of',
 'those',
 'shortened',
 'words',
 'like',
 'don_t']

## Tokenizing Johnson's novel
We're going to split *Autobiography* up by sentences to make sure that we keep our text grouped semantically:

In [18]:
orig_sents = nltk.sent_tokenize(text)

In [19]:
orig_sents[49]

'I never saw her read one of them.'

In [20]:
# let's preserve contractions
text = collapse_contractions(text)

In [22]:
sents = nltk.sent_tokenize(text)

In [23]:
sents[49]

'i never saw her read one of them.'

Now, we're going to tokenize every sentence in our list of sentences:

In [24]:
tokenized_sents = []

for sent in sents:
    tokenized_sents.append(tokenize(sent))

In [25]:
tokenized_sents[49]

['i', 'never', 'saw', 'her', 'read', 'one', 'of', 'them']

## POS tagging sentences

Let's try running `nltk`'s `pos_tag` function on our tokens in each sentence. The 'universal' tagset gives an easy-to-read description of the parts of speech we're looking at.

(It's also a perfect commentary on the universalizing impulses of computation that Risam critiques in the reading for today.)

In [None]:
#nltk.download('universal_tagset')

In [27]:
# pos is shorthand for 'parts of speech'
nltk.pos_tag(tokenized_sents[49], tagset = 'universal')

[('i', 'NOUN'),
 ('never', 'ADV'),
 ('saw', 'VERB'),
 ('her', 'PRON'),
 ('read', 'VERB'),
 ('one', 'NUM'),
 ('of', 'ADP'),
 ('them', 'PRON')]

Here's the table describing the meaning and examples of nltk's 'universal' tagset:

| Tag  | Meaning             | English Examples                       |
|------|---------------------|----------------------------------------|
| ADJ  | adjective           | new, good, high, special, big, local   |
| ADP  | adposition          | on, of, at, with, by, into, under      |
| ADV  | adverb              | really, already, still, early, now     |
| CONJ | conjunction         | and, or, but, if, while, although      |
| DET  | determiner, article | the, a, some, most, every, no, which   |
| NOUN | noun                | year, home, costs, time, Africa        |
| NUM  | numeral             | twenty-four, fourth, 1991, 14:24       |
| PRT  | particle            | at, on, out, over, per, that, up, with |
| PRON | pronoun             | he, their, her, its, my, I, us         |
| VERB | verb                | is, say, told, given, playing, would   |
| .    | punctuation marks   | . , ; !                                |
| X    | other               | ersatz, esprit, dunno, gr8, univeristy |

If you run `pos_tag` without the "universal" tagset, you will get results keyed to the Penn-Treebank tags. These contain more information, but also a greater number of possible categories.

In [36]:
nltk.pos_tag(tokenized_sents[49])

[('i', 'NN'),
 ('never', 'RB'),
 ('saw', 'VBD'),
 ('her', 'PRP'),
 ('read', 'VB'),
 ('one', 'CD'),
 ('of', 'IN'),
 ('them', 'PRP')]

And here are the Penn-Treebank tags:

| Tag  | Description                               | Example                    |
|------|-------------------------------------------|----------------------------|
| CC   | conjunction, coordinating                 | and, or, but               |
| CD   | cardinal number                           | five, three, 13%           |
| DT   | determiner                                | the, a, these              |
| EX   | existential there                         | there were six boys        |
| FW   | foreign word                              | mais                       |
| IN   | conjunction, subordinating or preposition | of, on, before, unless     |
| JJ   | adjective                                 | nice, easy                 |
| JJR  | adjective, comparative                    | nicer, easier              |
| JJS  | adjective, superlative                    | nicest, easiest            |
| LS   | list item marker                          |                            |
| MD   | verb, modal auxiliary                     | may, should                |
| NN   | noun, singular or mass                    | tiger, chair, laughter     |
| NNS  | noun, plural                              | tigers, chairs, insects    |
| NNP  | noun, proper singular                     | Germany, God, Alice        |
| NNPS | noun, proper plural                       | we met two Christmases ago |
| PDT  | predeterminer                             | both his children          |
| POS  | possessive ending                         | 's                         |
| PRP  | pronoun, personal                         | me, you, it                |
| PRP$ | pronoun, possessive                       | my, your, our              |
| RB   | adverb                                    | extremely, loudly, hard    |
| RBR  | adverb, comparative                       | better                     |
| RBS  | adverb, superlative                       | best                       |
| RP   | adverb, particle                          | about, off, up             |
| SYM  | symbol                                    | %                          |
| TO   | infinitival to                            | what to do?                |
| UH   | interjection                              | oh, oops, gosh             |
| VB   | verb, base form                           | think                      |
| VBZ  | verb, 3rd person singular present         | she thinks                 |
| VBP  | verb, non-3rd person singular present     | I think                    |
| VBD  | verb, past tense                          | they thought               |
| VBN  | verb, past participle                     | a sunken ship              |
| VBG  | verb, gerund or present participle        | thinking is fun            |
| WDT  | wh-determiner                             | which, whatever, whichever |
| WP   | wh-pronoun, personal                      | what, who, whom            |
| WP$  | wh-pronoun, possessive                    | whose, whosever            |
| WRB  | wh-adverb                                 | where, when                |
| .    | punctuation mark, sentence closer         | .;?*                       |
| ,    | punctuation mark, comma                   | ,                          |
| :    | punctuation mark, colon                   | :                          |
| (    | contextual separator, left paren          | (                          |
| )    | contextual separator, right paren         | )                          |

Penn-Treebank gives you information about tense (past, present, etc.) which may be useful depending on your purposes.
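To make the tense point concrete, here is a small sketch that separates all verbs from past-tense verbs. It does not call `nltk`; the list `tagged` is simply copied from the Penn-Treebank output above, and the variable names are my own. It relies on the fact that every verb tag in the table begins with `VB`.

```python
# Tagged sentence copied from the Penn-Treebank output above
tagged = [('i', 'NN'), ('never', 'RB'), ('saw', 'VBD'), ('her', 'PRP'),
          ('read', 'VB'), ('one', 'CD'), ('of', 'IN'), ('them', 'PRP')]

all_verbs = []
past_tense_verbs = []

for tup in tagged:
    if tup[1].startswith('VB'):  # VB, VBD, VBG, VBN, VBP, VBZ
        all_verbs.append(tup[0])
    if tup[1] == 'VBD':          # past tense only
        past_tense_verbs.append(tup[0])

print(all_verbs)         # ['saw', 'read']
print(past_tense_verbs)  # ['saw']
```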

## What sort of data is this result?
`nltk`'s part-of-speech tagger returns a **list of tuples**. We have encountered tuples before: they are immutable objects contained in `()` and separated by `,`. They can contain any sort of data. Here's an example:

In [28]:
('a','b')

('a', 'b')

In [29]:
(1, 'a')

(1, 'a')

They use the same `[]` indexing that we have gotten so accustomed to with lists and data frames:

In [30]:
(1, 'a')[1]

'a'

Each of these items in the list contains the word from the text in the first position, and its part of speech in the second.

In [31]:
result = nltk.pos_tag(tokenized_sents[49])

In [32]:
result

[('i', 'NN'),
 ('never', 'RB'),
 ('saw', 'VBD'),
 ('her', 'PRP'),
 ('read', 'VB'),
 ('one', 'CD'),
 ('of', 'IN'),
 ('them', 'PRP')]

In [526]:
result[0]

('i', 'NN')

That also means that we can do the following commands to get either **piece** of the tuple, which will be useful:

In [33]:
result[0]

('i', 'NN')

In [527]:
result[0][0]

'i'

In [528]:
result[0][1]

'NN'

What we're doing above is getting the tuple via `result[0]` and then asking for position `[0]` or `[1]` from that tuple.

FYI, if you call out of range, you do get an error:

In [34]:
len(result[0])

2

In [35]:
result[0][2]

IndexError: tuple index out of range
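Because every tuple here has exactly two items, Python also lets us unpack them into two names instead of indexing with `[0]` and `[1]`. The sketch below assumes a `result` list like the one above (copied in by hand so the cell runs on its own); `word`, `tag`, `words`, and `tags` are just names I chose.

```python
# Copied from the pos_tag output above so this cell runs without nltk
result = [('i', 'NN'), ('never', 'RB'), ('saw', 'VBD'), ('her', 'PRP'),
          ('read', 'VB'), ('one', 'CD'), ('of', 'IN'), ('them', 'PRP')]

# unpack each tuple into two variables as we loop
for word, tag in result:
    print(word, '->', tag)

# zip(*result) regroups the pairs into one tuple of words and one tuple of tags
words, tags = zip(*result)
print(words)
print(tags)
```

Unpacking tends to read more clearly than `tup[0]` and `tup[1]`, but both approaches do the same thing.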

## How can we use this?
Let's stick with our example: Can we find all of the sentences in *Autobiography of an Ex-Colored Man* where "race" is a noun?

In [530]:
orig_sents[49]

'I never saw her read one of them.'

In [37]:
tokenized_sents[49]

['i', 'never', 'saw', 'her', 'read', 'one', 'of', 'them']

First, we're going to `pos_tag` all of our sentences:

In [38]:
tagged_sents = []

for sent in tokenized_sents:
    pos = nltk.pos_tag(sent, tagset='universal')
    tagged_sents.append(pos)

In [39]:
tagged_sents[49]

[('i', 'NOUN'),
 ('never', 'ADV'),
 ('saw', 'VERB'),
 ('her', 'PRON'),
 ('read', 'VERB'),
 ('one', 'NUM'),
 ('of', 'ADP'),
 ('them', 'PRON')]

In [42]:
for i,sent in enumerate(tagged_sents[:10]): # using enumerate to count where we are in our sentences, just doing 10
    for tup in sent:
        if tup[0] == 'race' and tup[1] == 'NOUN':
            print(sents[i]) # printing the untokenized sentence from sents to make results easier to read
            print('-'*80)

the autobiography of an ex-colored man

james weldon johnson

boston: sherman, french & company, 1912
copyright, 1912




preface

this vivid and startlingly new picture of conditions brought about by the race question in the united states makes no special plea for the negro, but shows in a dispassionate, though sympathetic, manner conditions as they actually exist between the whites and blacks to-day.
--------------------------------------------------------------------------------
this is because writers, in nearly every instance, have treated the colored american as a whole; each has taken some one group of the race to prove his case.
--------------------------------------------------------------------------------
not before has a composite and proportionate presentation of the entire race, embracing all of its various groups and elements, showing their relations with each other and to the whites, been made.
----------------------------------------------------------------------------

Are there cases in which "race" is not used as a noun?

In [43]:
n_instances = 0

for i,sent in enumerate(tagged_sents):
    for tup in sent:
        if tup[0] == 'race' and tup[1] != 'NOUN':
            n_instances += 1 # checking to see how many times it occurs
            print(sents[i])
            print('-'*80)
    
print(n_instances)

0
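The two loops above differ only in the comparison they make, so they could be wrapped in one function. This is a minimal sketch: `sentences_with` is a hypothetical helper, and the two toy sentences and their tags were written by hand for illustration (they are not tagger output and do not come from the novel).

```python
# Hand-written toy data: NOT produced by nltk, just shaped like its output
toy_sents = ["the race question divides the country.",
             "they race home after school."]
toy_tagged = [
    [('the', 'DET'), ('race', 'NOUN'), ('question', 'NOUN'),
     ('divides', 'VERB'), ('the', 'DET'), ('country', 'NOUN')],
    [('they', 'PRON'), ('race', 'VERB'), ('home', 'ADV'),
     ('after', 'ADP'), ('school', 'NOUN')],
]

def sentences_with(word, tag, tagged_sents, sents):
    """Return every sentence in which `word` carries the part-of-speech `tag`."""
    matches = []
    for i, sent in enumerate(tagged_sents):
        for tup in sent:
            if tup[0] == word and tup[1] == tag:
                matches.append(sents[i])
                break  # one hit is enough; avoids listing the same sentence twice
    return matches

noun_hits = sentences_with('race', 'NOUN', toy_tagged, toy_sents)
verb_hits = sentences_with('race', 'VERB', toy_tagged, toy_sents)
print(noun_hits)
print(verb_hits)
```

Unlike the loops above, this version returns each sentence at most once, even if the word appears in it several times.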


What about other homographs such as "like"? It can be a verb, a preposition, a noun...

For this case, I'm going to create a list where I will store the parts-of-speech associated with our target word, `like`, and count the results:

In [44]:
likes = []

for sent in tagged_sents:
    for tup in sent:
        if tup[0] == 'like':
            likes.append(tup[1])
    

In [47]:
d = {}
for x in likes:
    if x not in d:
        d[x] = 1
    else:
        d[x] += 1

In [48]:
d

{'ADP': 52, 'VERB': 1, 'ADJ': 4}

In [49]:
from collections import Counter
Counter(likes)

Counter({'ADP': 52, 'VERB': 1, 'ADJ': 4})
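If you have several homographs to check, one option is to keep a separate `Counter` for each word. The sketch below uses `defaultdict(Counter)` from the standard library; the toy sentences and their tags are hand-written for illustration and are not tagger output.

```python
from collections import Counter, defaultdict

# Hand-written toy data shaped like the output of pos_tag with the universal tagset
toy_tagged = [
    [('i', 'PRON'), ('like', 'VERB'), ('it', 'PRON')],
    [('he', 'PRON'), ('ran', 'VERB'), ('like', 'ADP'), ('the', 'DET'), ('wind', 'NOUN')],
    [('the', 'DET'), ('race', 'NOUN'), ('felt', 'VERB'), ('like', 'ADP'),
     ('a', 'DET'), ('dream', 'NOUN')],
]

targets = {'like', 'race'}         # the words we want to inspect
tag_counts = defaultdict(Counter)  # one Counter per target word

for sent in toy_tagged:
    for word, tag in sent:
        if word in targets:
            tag_counts[word][tag] += 1

print(tag_counts['like'])  # ADP appears twice, VERB once
print(tag_counts['race'])  # NOUN appears once
```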

## How about finding sentences with specific grammatical features?
Let's say we wanted to see every sentence in this novel with a past-tense verb, comparative adverbs like "better," and maybe personal wh-pronouns like "who."

This could be a way to see how the protagonist compares his new situation as an "ex-colored man" to his expectations:

In [50]:
# first need to re-tag with the Penn-Treebank set
tagged_sents = []

for sent in tokenized_sents:
    pos = nltk.pos_tag(sent)
    tagged_sents.append(pos)

In [51]:
tagged_sents[49]

[('i', 'NN'),
 ('never', 'RB'),
 ('saw', 'VBD'),
 ('her', 'PRP'),
 ('read', 'VB'),
 ('one', 'CD'),
 ('of', 'IN'),
 ('them', 'PRP')]

The tags for the elements we want are `'VBD'`, `'RBR'`, and `'WP'`:

In [55]:
for i,sent in enumerate(tagged_sents):
    pos = [] # initializing a list to store parts of speech
    for tup in sent:
        pos.append(tup[1])
    if 'VBD' in pos and 'RBR' in pos and 'WP' in pos:
        print(sents[i]) # again, printing the untokenized sentence for ease of reading
        print('-'*80)

i knew later that these letters contained money and, what was to her, more than money.
--------------------------------------------------------------------------------
this was the first word missed, and it seemed to me that some of the scholars were about to lose their senses; some were dancing up and down on one foot with a hand above their heads, the fingers working furiously, and joy beaming all over their faces; others stood still, their hands raised not so high, their fingers working less rapidly, and their faces expressing not quite so much happiness; there were still others who did not move nor raise their hands, but stood with great wrinkles on their foreheads, looking very thoughtful.
--------------------------------------------------------------------------------
i could see that her skin was almost brown, that her hair was not so soft as mine, and that she did differ in some way from the other ladies who came to the house; yet, even so, i could see that she was very beautif

Many of these results are exactly what we were looking for: comparisons between the protagonist's situation and his expectations about the world.

Of course you could do the above analysis using any combination of parts-of-speech.
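To try other combinations without rewriting the `if` line each time, the check can be expressed with sets. This is only a sketch: `has_all_tags` is a hypothetical helper, and `sent_49` is copied by hand from the tagged output above.

```python
def has_all_tags(tagged_sent, required_tags):
    """Return True if every tag in required_tags occurs somewhere in tagged_sent."""
    tags_present = {tag for word, tag in tagged_sent}
    return set(required_tags).issubset(tags_present)

# Copied from the Penn-Treebank output for sentence 49 above
sent_49 = [('i', 'NN'), ('never', 'RB'), ('saw', 'VBD'), ('her', 'PRP'),
           ('read', 'VB'), ('one', 'CD'), ('of', 'IN'), ('them', 'PRP')]

print(has_all_tags(sent_49, ['VBD', 'PRP']))        # True
print(has_all_tags(sent_49, ['VBD', 'RBR', 'WP']))  # False: no RBR or WP here
```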

## Counting parts of speech

Finally, what if we simply wanted to see the most frequent examples of certain parts of speech?

Let's look at the adjectives Johnson uses. The tag for these is `JJ`; comparative adjectives like "better" have their own tag, `JJR`:

In [56]:
from collections import Counter
d = Counter() # initializing a counter object to make collecting instances easier

In [58]:
for i,sent in enumerate(tagged_sents):
    for tup in sent:
        if tup[1] == 'JJ':
            d[tup[0]] += 1

Then, we can put this data into a familiar Pandas object to sort it:

In [59]:
import pandas as pd
pd.Series(d).sort_values(ascending = False)[:20]

i          203
white      101
great       97
new         83
other       83
colored     75
several     69
first       68
little      67
good        58
many        55
much        52
same        51
old         48
such        47
black       44
few         43
young       42
whole       32
red         31
dtype: int64

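I'm very surprised to see `i` at the top of this list. One likely reason is that we lowercased the text before tagging, so the tagger never sees the capitalized pronoun `I`; notice that it also tagged `i` as a noun (`NN`) in our example sentence.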

However, many of the other words seem to be both accurately tagged and intellectually useful: It makes sense that the narrator's primary description of experiences would be as `white`.

Of course, you could also use these methods to count *all* of the parts of speech in a given text, or across your entire corpus.
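Counting every part of speech follows the same pattern, except that we count the tag instead of the word. In this sketch, `tagged_sample` holds just one sentence (copied by hand from the output above) so that the cell runs on its own; with the real data you would loop over `tagged_sents` instead.

```python
from collections import Counter

# One tagged sentence copied from above; stands in for the full tagged_sents list
tagged_sample = [
    [('i', 'NN'), ('never', 'RB'), ('saw', 'VBD'), ('her', 'PRP'),
     ('read', 'VB'), ('one', 'CD'), ('of', 'IN'), ('them', 'PRP')],
]

tag_totals = Counter()
for sent in tagged_sample:
    for word, tag in sent:
        tag_totals[tag] += 1  # count the tag, not the word

n_tokens = sum(tag_totals.values())
print(tag_totals)
print(tag_totals['PRP'] / n_tokens)  # share of tokens tagged as personal pronouns
```

Dividing by the total number of tokens makes the counts comparable across texts of different lengths.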

# Named-entity recognition
We can use NLTK to identify named persons, places, and things in our texts through a process called named-entity recognition.

In [60]:
johnson = '/Users/e/Downloads/1912_johnson_ex-colored.txt'
text = open(johnson).read()

In [61]:
# We start with raw sentences:
sents = nltk.sent_tokenize(text)
sents[:3]

['The Autobiography of an Ex-Colored Man\n\nJames Weldon Johnson\n\nBoston: Sherman, French & Company, 1912\nCopyright, 1912\n\n\n\n\nPREFACE\n\nThis vivid and startlingly new picture of conditions brought about by the race question in the United States makes no special plea for the Negro, but shows in a dispassionate, though sympathetic, manner conditions as they actually exist between the whites and blacks to-day.',
 'Special pleas have already been made for and against the Negro in hundreds of books, but in these books either his virtues or his vices have been exaggerated.',
 'This is because writers, in nearly every instance, have treated the colored American as a whole; each has taken some one group of the race to prove his case.']

In [62]:
# let's make a function for this
def ner_sents(sent_list):
    """
    Using NLTK, this function takes a list of sentences, identifies the named entities in it,
    and returns a list of dictionaries, with one dictionary per named entity,
    where each dictionary looks like this:
    
    {
        'type': 'PERSON',
        'entity': 'Harry Potter',
        'sent_num': 7,
        'sent': 'And why, Snape, is Harry Potter still alive, when you have had him at your mercy for five years?'
    }
    """
    
    # set empty list for output
    output_list = []
    
    # loop over each sentence
    for sent_num, sent in enumerate(sent_list):        
        # we need to get the words
        sent_words = nltk.word_tokenize(sent)
        
        # parts of speech
        sent_pos = nltk.pos_tag(sent_words)
        
        # then "chunk" the results using ne_chunk
        chunks = nltk.ne_chunk(sent_pos)
        
        # loop over chunks...
        for chunk in chunks:
            # if the chunk has a 'label' attribute (i.e. the entity is labeled)
            if hasattr(chunk,'label'):
                
                # get the label
                label = chunk.label()
                
                # get the words in the chunk
                chunk_words = []
                for word,pos in chunk:
                    chunk_words.append(word)
                
                # make a string of the words
                chunk_words_str = ' '.join(chunk_words)
                
                # make a result dictionary
                result_dict = {}
                
                # add NER info
                result_dict['type'] = label
                result_dict['entity'] = chunk_words_str
                
                ### add sentence info
                result_dict['sent_num'] = sent_num
                # add a string of the sentence
                result_dict['sent'] = sent
                
                # add result dictionary to output list
                output_list.append(result_dict)
    
    # return list of dictionaries
    return output_list

In [548]:
# this takes about 60 seconds to run on my computer
test = ner_sents(sents)

In [63]:
import pandas as pd
df = pd.DataFrame(test)


*Clearly* `nltk` is imperfect:

In [550]:
df.head()

|   | entity         | sent                                              | sent_num | type         |
|---|----------------|---------------------------------------------------|----------|--------------|
| 0 | James Weldon   | The Autobiography of an Ex-Colored Man\n\nJame... | 0        | PERSON       |
| 1 | Johnson Boston | The Autobiography of an Ex-Colored Man\n\nJame... | 0        | PERSON       |
| 2 | Sherman        | The Autobiography of an Ex-Colored Man\n\nJame... | 0        | PERSON       |
| 3 | French         | The Autobiography of an Ex-Colored Man\n\nJame... | 0        | GPE          |
| 4 | Company        | The Autobiography of an Ex-Colored Man\n\nJame... | 0        | ORGANIZATION |


But that doesn't mean it's not useful:

In [551]:
df['entity'].value_counts()[:30]

South                 45
New York              39
Paris                 33
Negro                 30
Southern              19
United States         17
French                17
London                16
Atlanta               15
Jacksonville          14
Washington            12
Texan                 11
Spanish               10
Europe                10
English               10
American              10
Negroes                9
North                  8
Shiny                  7
Nashville              7
Boston                 6
Bible                  6
Atlanta University     5
America                5
John Brown             5
University             5
Sixth Avenue           5
Connecticut            5
Union                  5
American Negro         5
Name: entity, dtype: int64

In [552]:
df['type'].value_counts()

GPE             409
PERSON          139
ORGANIZATION     98
LOCATION         23
GSP              16
FACILITY          7
Name: type, dtype: int64

## When to use NER?

NER can be very useful if you want to look at features like geography or frequently referenced persons. For example, we get some real specifics here: Sixth Avenue is a specific place within New York, which itself appears near the top of the list. Note, too, that NER is able to identify both single-word and multi-word named entities.

We could use the resulting dataframe to look at the sentences where Sixth Avenue appears:

In [553]:
list(df[df['entity'] == 'Sixth Avenue']['sent'])

['As soon as we landed, four of us went directly to a lodging-house in 27th Street, just west of Sixth Avenue.',
 'We went to Sixth Avenue, walked two blocks, and turned to the west into another street.',
 'We got out of the house about dark, went round to a restaurant on Sixth Avenue and ate something, then walked around for a couple of hours.',
 'My New York was limited to ten blocks; the boundaries were Sixth Avenue from Twenty-third to Thirty-third Streets, with the cross streets one block to the west.',
 'I went to Coney Island and the other resorts; took in the pre-season shows along Broadway, and ate at first class restaurants; but I shunned the old Sixth Avenue district as though it were pest infected.']

Sixth Avenue comes to mean quite a lot to the protagonist after he begins passing for white.
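If you only care about one kind of entity, such as places, you can also group the results by their `type` before looking at sentences. The sketch below works on a plain list of dictionaries shaped like the output of `ner_sents`; the five rows are copied from the first rows of the dataframe shown earlier (with the `sent` field left out for brevity), and `by_type` is a name I chose.

```python
from collections import defaultdict

# First five results from the dataframe shown earlier, without the 'sent' field
sample_rows = [
    {'type': 'PERSON', 'entity': 'James Weldon', 'sent_num': 0},
    {'type': 'PERSON', 'entity': 'Johnson Boston', 'sent_num': 0},
    {'type': 'PERSON', 'entity': 'Sherman', 'sent_num': 0},
    {'type': 'GPE', 'entity': 'French', 'sent_num': 0},
    {'type': 'ORGANIZATION', 'entity': 'Company', 'sent_num': 0},
]

by_type = defaultdict(list)  # maps each entity type to a list of entity strings
for row in sample_rows:
    by_type[row['type']].append(row['entity'])

print(by_type['GPE'])     # ['French']
print(by_type['PERSON'])  # ['James Weldon', 'Johnson Boston', 'Sherman']
```

Even this tiny sample shows why the results need checking by eye: "French" is labeled as a place, and the author's name is split across two entities.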

# Saving part-of-speech tagged texts to disk
Since POS tagging requires a good amount of computational power, you may want to save your results if you intend to process the text multiple times.

`nltk` has a standard way of storing POS-tagged data that we can write out to file to reimport later:

In [None]:
tagged_sents[49]

The method `tuple2str` converts a `('word','pos')` tuple to a standard string representation:

In [None]:
for tup in tagged_sents[49]:
    print(nltk.tuple2str(tup))

Then we can write each of these in sequence to a text file for later use using a new filemaking convention:

In [None]:
with open('/Users/e/Downloads/output.txt', mode = 'w') as output:
    for sent in tagged_sents:
        for tup in sent:        
            output.write(nltk.tuple2str(tup) + ' ') # we are adding one space between every word/pos pair

This begins with a `with` statement, which opens the file and closes it automatically when the indented block ends. The `as` clause assigns a variable name that we can use for that file during this operation.

Just as we would `open()` an existing file on our hard drive and `read()` it into memory, we can also use `open()` to create a new file and add data to it. In order to do that, we have to change the `mode` we use to `open` the file. The default `mode` is `r`, which stands for "read." That is what we have been using all quarter. `mode='w'` stands for "write," and it's what we want to use when we wish to add data to a file that didn't exist before.

**Warning**: Opening a file in mode `w` *deletes everything inside of it*. In the above case this is fine because there was nothing in my Downloads called `output.txt`. You only want to use mode `w` when you are creating new data, or overwriting old data.
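If you are worried about overwriting something by accident, Python's `open()` also accepts `mode='x'`, which creates a new file and raises a `FileExistsError` if the file is already there. The sketch below demonstrates this in a temporary folder (via the standard `tempfile` module) so that nothing on your own disk is touched; the file contents are placeholders.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'output.txt')

    # first write: the file does not exist yet, so mode 'x' succeeds
    with open(path, mode='x') as output:
        output.write('i/NN never/RB ')

    # second write: the file now exists, so mode 'x' refuses to open it
    try:
        with open(path, mode='x') as output:
            output.write('this would have replaced our data')
        overwritten = True
    except FileExistsError:
        overwritten = False

    with open(path) as data:
        contents = data.read()

print(overwritten)  # False
print(contents)     # the original data is still there
```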

## Reimporting our tagged data
Now we can use the reverse function, `str2tuple`, to import our part-of-speech tagged data:

In [None]:
with open('/Users/e/Downloads/output.txt', mode = 'r') as data: # note that mode has changed to r since we are reading
    pairs = data.read().split() # split() with no argument also drops the empty string left by the trailing space
    tagged_text = []
    for pair in pairs:
        tagged_text.append(nltk.str2tuple(pair))

In [None]:
tagged_text[:50]

When there are processes that take your computer a long time to execute, it is often good practice to write your results to disk somewhere so that you can access them later without having to re-process everything.
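One thing the approach above loses is sentence boundaries, since every pair goes onto one long line. As a sketch of an alternative, the code below writes one sentence per line and reads it back. It imitates the `word/TAG` format with plain string operations rather than calling `nltk`, and `write_tagged`, `read_tagged`, and the two-sentence sample are hypothetical names and data for illustration. It assumes that tags contain no `/` and that words contain no spaces, which holds for the tokens produced by our `tokenize` function.

```python
import os
import tempfile

# Toy sample: tags copied from sentence 49 above, split in two purely for illustration
tagged_sample = [
    [('i', 'NN'), ('never', 'RB'), ('saw', 'VBD')],
    [('her', 'PRP'), ('read', 'VB')],
]

def write_tagged(path, tagged_sents):
    """Write one sentence per line, each token as word/TAG."""
    with open(path, mode='w', encoding='utf-8') as output:
        for sent in tagged_sents:
            output.write(' '.join(word + '/' + tag for word, tag in sent) + '\n')

def read_tagged(path):
    """Read the file back into a list of sentences, each a list of (word, tag) tuples."""
    tagged_sents = []
    with open(path, mode='r', encoding='utf-8') as data:
        for line in data:
            # rsplit on the last '/' so the tag is always the final piece
            tagged_sents.append([tuple(pair.rsplit('/', 1)) for pair in line.split()])
    return tagged_sents

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'tagged.txt')
    write_tagged(path, tagged_sample)
    restored = read_tagged(path)

print(restored == tagged_sample)  # True
```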

## Aggregating part-of-speech data
Finally, let's say that we want to count up instances of words used as specific parts of speech.

One approach we could use would be to use our usual counting techniques *directly on the POS output* we generated above.

For example:

In [None]:
from collections import Counter

with open('/Users/e/Downloads/output.txt') as data:
    words = data.read().split() # split() avoids counting an empty string from the trailing space
    result = Counter(words)

In [None]:
result

We could easily apply the same technique to an entire corpus by performing the following steps:
1. POS tagging each file
2. Writing the POS-tagged versions of each file to disk
3. Reading in each POS-tagged file
4. Counting the POS pairs as above
5. Adding filename information to the dictionary like so:
```python
d['filename'] = 'output.txt'
```
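Steps 3 through 5 above can be sketched as follows. Steps 1 and 2 are skipped here: instead of tagging real texts, the code writes two tiny, hand-made tagged files (`a.txt` and `b.txt`, both hypothetical) into a temporary folder. Rather than adding `'filename'` to the counter itself, this version stores one small dictionary per word/tag pair, so the filename is never mistaken for a count.

```python
import os
import tempfile
from collections import Counter

# Stand-ins for files that have already been POS-tagged and written to disk
toy_files = {
    'a.txt': 'i/NN never/RB saw/VBD her/PRP',
    'b.txt': 'her/PRP read/VB her/PRP',
}

rows = []  # one dictionary per (file, word/tag pair)

with tempfile.TemporaryDirectory() as tmp_dir:
    # create the toy files
    for name, content in toy_files.items():
        with open(os.path.join(tmp_dir, name), mode='w') as output:
            output.write(content)

    # step 3: read each tagged file; step 4: count its pairs
    for name in sorted(os.listdir(tmp_dir)):
        with open(os.path.join(tmp_dir, name)) as data:
            pairs = data.read().split()
        counts = Counter(pairs)

        # step 5: keep the filename next to every count
        for pair, n in counts.items():
            rows.append({'filename': name, 'pair': pair, 'count': n})

for row in rows:
    print(row)
```

A list of dictionaries like `rows` can be passed straight to `pd.DataFrame` if you want to sort or filter the counts by file.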