# Part-of-Speech (POS) Tagging

In [None]:
from nltk.tokenize import TweetTokenizer
from nltk import pos_tag
from nltk.help import upenn_tagset

In [None]:
upenn_tagset()

## Definition of toy dataset

We can simply reuse the example sentences from the Tokenization tutorial.

In [None]:
sentences = ["Text processing with Python is great.", 
             "It isn't (very) complicated to get started.",
             "However,careful to...you know....avoid mistakes.",
             "This is so cooool #nltkrocks :))) :-P <3."]

## Processing of sentences

Since we know that the sentences contain a lot of informal tokens, we can use the `TweetTokenizer`. For more formal text, the default tokenizer will work just fine. Even here, the default tokenizer would suffice, since the important tokens (i.e., the "real" words) are handled correctly.

In [None]:
tweet_tokenizer = TweetTokenizer()

The processing itself takes just two steps: tokenizing and POS tagging, both provided by readily available methods. Note that `pos_tag()` expects a list (of tokens/words) as input, not a string.
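
To see why the input type matters, here is a minimal sketch in plain Python (no NLTK involved). A string is itself a sequence of characters, so any function that iterates over its input, as a tagger presumably does, would see single characters instead of words. The helper `units_seen` is hypothetical and exists only for this illustration; the whitespace split is a stand-in for a real tokenizer.

```python
def units_seen(sequence):
    """Return the items a consumer would see when iterating over `sequence`."""
    return [item for item in sequence]

sentence = "It is great."

# Iterating over a string yields individual characters
as_string = units_seen(sentence)

# Iterating over a list of tokens yields the tokens
# (naive whitespace split, used here only for illustration)
as_tokens = units_seen(sentence.split())

print(as_string[:5])
print(as_tokens)
```

This is why the sentences are tokenized first and only the resulting list is passed on to the tagger.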

In [None]:
print ('\nOutput of NLTK POS tagger:')
for s in sentences:
    token_list = tweet_tokenizer.tokenize(s)
    pos_tag_list = pos_tag(token_list)
    print ('\n', pos_tag_list)

## POS tagging with spaCy

In [None]:
import spacy

In [None]:
nlp = spacy.load('en_core_web_sm')

In [None]:
print ('\nOutput of spaCy POS tagger:')
for s in sentences:
    doc = nlp(s) # doc is an object, not just a simple list
    # Let's create a list so the output matches the previous ones
    token_list = []
    for token in doc:
        token_list.append((token.text, token.tag_)) # token is also an object, not a string
    print ('\n', token_list)

The results of the NLTK and spaCy POS taggers are not exactly the same. The reason is that the two packages use different tokenizers as well as different models to POS tag the tokens; see in particular the emoticons. In most cases this doesn't matter, since "normal" words are mostly tagged correctly across different POS taggers.
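
One way to locate where two tokenizations diverge is to align the token lists with `difflib.SequenceMatcher` from the standard library. The sketch below is only an illustration: the function name `token_differences` is hypothetical, and the two token lists are hand-written examples, not actual output of either package.

```python
from difflib import SequenceMatcher

def token_differences(tokens_a, tokens_b):
    """Return pairs of token spans on which the two lists disagree."""
    matcher = SequenceMatcher(None, tokens_a, tokens_b, autojunk=False)
    differences = []
    for op, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if op != "equal":
            differences.append((tokens_a[a_start:a_end], tokens_b[b_start:b_end]))
    return differences

# Hand-written token lists for illustration (not real tokenizer output)
tokens_a = ["so", "cool", ":)", "!"]
tokens_b = ["so", "cool", ":", ")", "!"]

differences = token_differences(tokens_a, tokens_b)
print(differences)
```

Applied to the token lists from both taggers, such a comparison would point to the spans (for example emoticons) that are split differently before any tag is assigned.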

## Application use case: analysis of restaurant reviews

Knowing the POS tags of tokens/words is useful for various subsequent analyses. In the following example, we want to analyze 1,000 Yelp reviews about the restaurant "Mon Ami Gabi" in Las Vegas (USA) to see which adjectives are most commonly used.

- Link to restaurant on Yelp: https://www.yelp.com/biz/mon-ami-gabi-las-vegas-2

### Load reviews from CSV file

`pandas` is a very popular package for handling structured files like CSV files.

In [None]:
import pandas as pd

`pandas` uses the notion of *data frames* (`df`) to denote data objects.

In [None]:
df = pd.read_csv('data/reviews/yelp-reviews-mon-ami-gabi.csv')

df.head()

The CSV file with the reviews and thus the data frame have two columns: the review number and the text of the review. Since we're only interested in the review texts, we can simply extract them into a list of strings.

In [None]:
reviews = df['review'].tolist() # "review" is the name of the column of interest (see above)

### Review analysis

For each review, we perform the following steps:
- Tokenize the review and POS tag all tokens
- Check for each token whether it is an adjective
- If a token is an adjective, increase the counter for this adjective

In [None]:
# This dictionary will keep track of the count for each found adjective
adjective_frequencies = {}

# Check each review one by one
for review in reviews:
    # Tokenize the review
    token_list = tweet_tokenizer.tokenize(review)
    # POS tag all words/tokens
    pos_tag_list = pos_tag(token_list)
    # Count the number of all adjectives
    for token, tag in pos_tag_list:
        if tag[0].lower() != 'j':
            # Ignore token if it is not an adjective (recall that JJ, JJR, JJS indicate adjectives)
            continue
        # Convert token to lowercase, otherwise "Good" and "good" are considered differently
        token = token.lower()
        if token not in adjective_frequencies:
            adjective_frequencies[token] = 1.0
        else:
            adjective_frequencies[token] = adjective_frequencies[token] + 1.0

            
# We need to convert the dictionary to a list of tuples for the word cloud generation                
# Before: {"small": 45, "nice": 30, "good": 102, ...}
# After:  [("small", 45), ("nice", 30), ("good", 102), ...]
adjective_frequencies = [ (token, count) for token, count in adjective_frequencies.items() ]                
            
    
# Show the first 5 (word,count) tuples (not sorted because not needed)
#print (adjective_frequencies[:5])
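
The counting logic above could also be written more compactly with `collections.Counter`. The following is a minimal sketch, not a replacement for the cell above: the function name `count_adjectives` is hypothetical, it counts with integers instead of floats, and the `(token, tag)` pairs are hand-written for illustration instead of coming from the tagger. It checks `tag.startswith("JJ")`, which for the Penn Treebank tagset should select the same tags (JJ, JJR, JJS) as the first-letter check.

```python
from collections import Counter

def count_adjectives(tagged_tokens):
    """Count lowercased tokens whose Penn Treebank tag starts with 'JJ'."""
    counts = Counter()
    for token, tag in tagged_tokens:
        if tag.startswith("JJ"):  # JJ, JJR, JJS indicate adjectives
            counts[token.lower()] += 1
    return counts

# Hand-written (token, tag) pairs for illustration; not real tagger output
sample = [("Good", "JJ"), ("food", "NN"), ("good", "JJ"),
          ("better", "JJR"), ("was", "VBD")]

adjective_counts = count_adjectives(sample)
print(adjective_counts)

# Same list-of-tuples shape as used for the word cloud
adjective_tuples = list(adjective_counts.items())
```

A `Counter` starts every missing key at zero, so the explicit "is the token already in the dictionary" check is no longer needed.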

### Visualization of results

We use a readily available Python package (`wordcloud`) for convenience.

In [None]:
from utils.plotutil import show_wordcloud

In [None]:
show_wordcloud(adjective_frequencies)
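
As a textual complement to the word cloud, the list of `(word, count)` tuples can be sorted to show the most frequent adjectives. This is a small sketch under assumptions: `top_n` is a hypothetical helper, and the example counts are the illustrative numbers from the code comment above, not results from the actual reviews.

```python
def top_n(frequencies, n=5):
    """Return the n (word, count) tuples with the highest counts."""
    return sorted(frequencies, key=lambda pair: pair[1], reverse=True)[:n]

# Illustrative counts only; not computed from the review data
example_frequencies = [("small", 45), ("nice", 30), ("good", 102)]

top_two = top_n(example_frequencies, n=2)
print(top_two)
```

With the real data, calling `top_n(adjective_frequencies)` would list the five most common adjectives.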