# AnTeDe Lab A: Understanding PoS tags 

## Session goal
The goal of this session is to help you familiarize yourself with PoS (part-of-speech) tags. We begin by importing the NLTK versions of the Brown corpus and of the Penn Treebank sample (Wall Street Journal text).

In [1]:
import nltk
nltk.download('brown')
nltk.download('treebank')
nltk.download('universal_tagset')
from nltk.corpus import brown 
from nltk.corpus import treebank 

[nltk_data] Downloading package brown to
[nltk_data]     /Users/davebrunner/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package treebank to
[nltk_data]     /Users/davebrunner/nltk_data...
[nltk_data]   Package treebank is already up-to-date!
[nltk_data] Downloading package universal_tagset to
[nltk_data]     /Users/davebrunner/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


Complete the inner loop of the following function as directed by the comments.

In [2]:
def get_ground_truth_distribution(token, corpus, universal_tagset=False):

    # tagset=None keeps the corpus's native tagset
    if universal_tagset:
        corpus_tagset = 'universal'
    else:
        corpus_tagset = None

    sentences = corpus.tagged_sents(tagset=corpus_tagset)
    untagged_sentences = corpus.sents()

    # this is going to be a dict where each key is a tag
    # and each value is the number of times token carries that tag
    tag_freq = {}

    # sent is an untagged sentence
    # sentences[i] is the corresponding tagged sentence

    for i, sent in enumerate(untagged_sentences):
        # if the token we're looking for is in sent
        if token in sent:
            # for each (token, tag) tuple in the tagged sentence
            for pair in sentences[i]:

                # pair[0] contains the current token
                # pair[1] contains the corresponding tag

                # increase tag_freq[pair[1]] by one unit
                # careful because tag_freq may not yet have a
                # key corresponding to pair[1]!

                if pair[0] == token:
                    # .get() returns 0 when the tag has not been seen yet
                    tag_freq[pair[1]] = tag_freq.get(pair[1], 0) + 1
    return tag_freq
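
The counting logic of the inner loop can also be tried out without downloading a corpus. The following is a minimal sketch, assuming a small hand-made list of tagged sentences; `toy_tagged_sents` and `tag_distribution` are hypothetical names introduced only for illustration and are not part of the lab code.

```python
from collections import Counter

# Toy tagged corpus, invented for illustration (not taken from Brown or the treebank).
toy_tagged_sents = [
    [('They', 'PRP'), ('drive', 'VBP'), ('fast', 'RB'), ('.', '.')],
    [('A', 'DT'), ('fast', 'JJ'), ('car', 'NN'), ('.', '.')],
    [('He', 'PRP'), ('runs', 'VBZ'), ('fast', 'RB'), ('.', '.')],
]

def tag_distribution(token, tagged_sents):
    """Count how often each tag is assigned to `token` (hypothetical helper)."""
    return Counter(tag for sent in tagged_sents
                   for word, tag in sent
                   if word == token)

print(dict(tag_distribution('fast', toy_tagged_sents)))
```

A `Counter` starts every missing key at zero, so the "key may not exist yet" case needs no special handling. Passing `treebank.tagged_sents()` instead of the toy list should give the same counts as `get_ground_truth_distribution`, since both count every tagged occurrence of the token.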
              

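In the following cells, we look up the PoS tag distribution of individual tokens (*fast*, *back*, and later *that*) in the Penn Treebank sample, using the Penn tagset. The same function also accepts the Brown corpus, which uses the Brown tagset, and either corpus can be mapped to the universal tagset by passing `universal_tagset=True`.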

In [3]:
get_ground_truth_distribution(token='fast', corpus=treebank)

{'RB': 2, 'JJ': 1}

In [4]:
get_ground_truth_distribution(token='back', corpus=treebank)

{'NN': 2, 'VB': 1, 'RB': 15, 'RP': 6, 'JJ': 1}
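
Raw counts are easier to compare once they are normalized. As a small worked example, the sketch below turns the counts for *back* shown above into relative frequencies and picks the most frequent tag; `relative_frequencies` is a hypothetical helper, not part of the lab code.

```python
# Counts copied from the output of the cell above for 'back'.
back_counts = {'NN': 2, 'VB': 1, 'RB': 15, 'RP': 6, 'JJ': 1}

def relative_frequencies(tag_freq):
    """Turn raw tag counts into proportions that sum to 1 (hypothetical helper)."""
    total = sum(tag_freq.values())
    return {tag: count / total for tag, count in tag_freq.items()}

back_probs = relative_frequencies(back_counts)
most_frequent_tag = max(back_counts, key=back_counts.get)

print(most_frequent_tag, back_probs[most_frequent_tag])  # RB 0.6
```

With 25 occurrences in total, the adverb reading (`RB`) covers 15 of them, so always guessing `RB` for *back* would be right 60% of the time on this sample.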

The following function prints example sentences for a specific combination of token and tag. Use `num` to limit the number of examples; the default `num=0` prints all of them.

In [5]:
def get_ground_truth_examples(token, corpus, tag, num=0, universal_tagset=False):

    # tagset=None keeps the corpus's native tagset
    if universal_tagset:
        corpus_tagset = 'universal'
    else:
        corpus_tagset = None

    sentences = corpus.tagged_sents(tagset=corpus_tagset)
    untagged_sentences = corpus.sents()
    count = 0

    for i, sent in enumerate(untagged_sentences):

        # stop once num examples have been printed (num=0 means no limit)
        if num > 0 and count == num:
            break

        if token in sent:
            text = ""
            visualize = False
            for pair in sentences[i]:
                # skip Penn Treebank trace elements, which are tagged -NONE-
                if 'NONE' not in pair[1]:
                    text = text + " " + pair[0]
                # the sentence qualifies if token occurs with the requested tag
                if (pair[0] == token) and (pair[1] == tag):
                    visualize = True
            if visualize:
                count = count + 1
                print(str(count) + ' ' + text)
                print(str(sentences[i]))
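
The `'NONE' not in pair[1]` check is what keeps the printed sentence readable: the Penn Treebank inserts empty elements such as `('*', '-NONE-')` that do not correspond to any written word. The sketch below isolates that filtering step on a fragment of a tagged treebank sentence from this notebook; `readable_text` is a hypothetical helper, not part of the lab code.

```python
# Fragment of a tagged treebank sentence that appears in this notebook's output.
tagged_fragment = [('attempt', 'NN'), ('*', '-NONE-'), ('to', 'TO'),
                   ('introduce', 'VB'), ('a', 'DT'), ('new', 'JJ'),
                   ('portfolio', 'NN'), ('basket', 'NN')]

def readable_text(tagged_sent):
    """Join the tokens of a tagged sentence, skipping trace elements (hypothetical helper)."""
    return " ".join(word for word, tag in tagged_sent if 'NONE' not in tag)

print(readable_text(tagged_fragment))  # attempt to introduce a new portfolio basket
```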
                          

In [6]:
get_ground_truth_examples(token='fast', corpus=treebank, tag='JJ')

1  The New York Stock Exchange 's attempt to introduce a new portfolio basket is evidence of investors ' desires to make fast and easy transactions of large numbers of shares .
[('The', 'DT'), ('New', 'NNP'), ('York', 'NNP'), ('Stock', 'NNP'), ('Exchange', 'NNP'), ("'s", 'POS'), ('attempt', 'NN'), ('*', '-NONE-'), ('to', 'TO'), ('introduce', 'VB'), ('a', 'DT'), ('new', 'JJ'), ('portfolio', 'NN'), ('basket', 'NN'), ('is', 'VBZ'), ('evidence', 'NN'), ('of', 'IN'), ('investors', 'NNS'), ("'", 'POS'), ('desires', 'NNS'), ('*', '-NONE-'), ('to', 'TO'), ('make', 'VB'), ('fast', 'JJ'), ('and', 'CC'), ('easy', 'JJ'), ('transactions', 'NNS'), ('of', 'IN'), ('large', 'JJ'), ('numbers', 'NNS'), ('of', 'IN'), ('shares', 'NNS'), ('.', '.')]


In [7]:
get_ground_truth_distribution(token='that', corpus=treebank)

{'WDT': 214, 'IN': 513, 'DT': 77, 'RB': 3}
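
To get a feel for what the `universal_tagset` option changes, the Penn counts for *that* can be collapsed into coarser categories by hand. This is only a sketch: `penn_to_universal` is a hand-written mapping for the four tags above, stated here as an assumption and not read from NLTK's `universal_tagset` resource.

```python
from collections import Counter

# Penn-tag counts for 'that', copied from the cell output above.
that_counts = {'WDT': 214, 'IN': 513, 'DT': 77, 'RB': 3}

# Hand-written subset of a Penn-to-universal mapping; an assumption for
# illustration, not read from NLTK's universal_tagset resource.
penn_to_universal = {'WDT': 'DET', 'IN': 'ADP', 'DT': 'DET', 'RB': 'ADV'}

universal_counts = Counter()
for penn_tag, count in that_counts.items():
    universal_counts[penn_to_universal[penn_tag]] += count

print(dict(universal_counts))  # {'DET': 291, 'ADP': 513, 'ADV': 3}
```

Under this mapping the distinction between `WDT` and `DT` disappears. You can check the assumed mapping against NLTK by calling `get_ground_truth_distribution(token='that', corpus=treebank, universal_tagset=True)`.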

In [8]:
get_ground_truth_distribution(token='Exchange', corpus=treebank)

{'NNP': 50}

In [9]:
get_ground_truth_distribution(token='investor', corpus=treebank)

{'NN': 39}

In [10]:
get_ground_truth_examples(token='that', corpus=treebank, tag='IN', num=5)

1  The finding probably will support those who argue that the U.S. should regulate the class of asbestos including crocidolite more stringently than the common kind of asbestos , chrysotile , found in most schools and other buildings , Dr. Talcott said .
[('The', 'DT'), ('finding', 'NN'), ('probably', 'RB'), ('will', 'MD'), ('support', 'VB'), ('those', 'DT'), ('who', 'WP'), ('*T*-6', '-NONE-'), ('argue', 'VBP'), ('that', 'IN'), ('the', 'DT'), ('U.S.', 'NNP'), ('should', 'MD'), ('regulate', 'VB'), ('the', 'DT'), ('class', 'NN'), ('of', 'IN'), ('asbestos', 'NN'), ('including', 'VBG'), ('crocidolite', 'NN'), ('more', 'RBR'), ('stringently', 'RB'), ('than', 'IN'), ('the', 'DT'), ('common', 'JJ'), ('kind', 'NN'), ('of', 'IN'), ('asbestos', 'NN'), (',', ','), ('chrysotile', 'NN'), (',', ','), ('found', 'VBN'), ('*', '-NONE-'), ('in', 'IN'), ('most', 'JJS'), ('schools', 'NNS'), ('and', 'CC'), ('other', 'JJ'), ('buildings', 'NNS'), (',', ','), ('Dr.', 'NNP'), ('Talcott', 'NNP'), ('said', 'VBD'

In [11]:
get_ground_truth_examples(token='that', corpus=treebank, tag='DT', num=5)

1  `` What matters is what advertisers are paying per page , and in that department we are doing fine this fall , '' said Mr. Spoon .
[('``', '``'), ('What', 'WP'), ('*T*-14', '-NONE-'), ('matters', 'VBZ'), ('is', 'VBZ'), ('what', 'WP'), ('advertisers', 'NNS'), ('are', 'VBP'), ('paying', 'VBG'), ('*T*-15', '-NONE-'), ('per', 'IN'), ('page', 'NN'), (',', ','), ('and', 'CC'), ('in', 'IN'), ('that', 'DT'), ('department', 'NN'), ('we', 'PRP'), ('are', 'VBP'), ('doing', 'VBG'), ('fine', 'RB'), ('this', 'DT'), ('fall', 'NN'), (',', ','), ("''", "''"), ('said', 'VBD'), ('*T*-1', '-NONE-'), ('Mr.', 'NNP'), ('Spoon', 'NNP'), ('.', '.')]
2  State court Judge Richard Curry ordered Edison to make average refunds of about $ 45 to $ 50 each to Edison customers who have received electric service since April 1986 , including about two million customers who have moved during that period .
[('State', 'NN'), ('court', 'NN'), ('Judge', 'NNP'), ('Richard', 'NNP'), ('Curry', 'NNP'), ('ordered', 'VBD'), ('Ed