# POS-Tagging & Feature Extraction
Following normalisation, we can move on to POS-tagging and feature extraction. Let's start with POS-tagging.

## POS-tagging
Part-of-speech (POS) tagging is one of the most important text analysis tasks. It classifies words into their parts of speech and labels them according to a tagset, which is the collection of tags used for tagging. Parts of speech are also known as word classes or lexical categories.

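The `nltk` library provides its own pre-trained POS tagger and also wraps external taggers such as the Stanford POS tagger, which is the one used in this notebook. Let's see how it is used.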

In [34]:
import pandas as pd
df0 = pd.read_csv("../data/interim/001_normalised_keyed_reviews_100k_sample.csv", sep="\t", low_memory=False)
df0.head()

Unnamed: 0,uniqueKey,reviewText
0,A10000012B7CGYKOMPQ4L##000100039X,"['spiritually', 'mentally', 'inspiring', 'book..."
1,A2S166WSCFIFP5##000100039X,"['one', 'must', 'books', 'masterpiece', 'spiri..."
2,A1BM81XB4QHOA3##000100039X,"['book', 'provides', 'reflection', 'apply', 'l..."
3,A1MOSTXNIO5MPJ##000100039X,"['first', 'read', 'prophet', 'college', 'back'..."
4,A2XQ5LZHTD4AFT##000100039X,"['timeless', 'classic', 'demanding', 'assuming..."


In [35]:
# For monitoring the duration of pandas operations
from tqdm import tqdm

# To avoid "RuntimeError: Set changed size during iteration"
tqdm.monitor_interval = 0

# Register `progress_apply` and `progress_map` on pandas objects
# (can use `tqdm_gui`, `tqdm_notebook`, optional kwargs, etc.)
tqdm.pandas(desc="Progress:")

# Now you can use `progress_apply` instead of `apply`
# and `progress_map` instead of `map`.
# This also works with groupby:
# df.groupby(0).progress_apply(lambda x: x**2)

In [36]:
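def convert_text_to_list(review):
    # Strip the list punctuation from the stringified list and split on commas.
    # Note: every token after the first keeps a leading space.
    return review.replace("[","").replace("]","").replace("'","").split(",")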
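The string-replacement approach leaves a leading space on every token after the first, and it would also mangle tokens that themselves contain commas, brackets or apostrophes. Assuming the column holds the `repr` of a Python list of strings (as the `df0.head()` output suggests), one possible alternative is to parse it with `ast.literal_eval`. The name `parse_token_list` is hypothetical and not part of the original notebook.

```python
import ast

def parse_token_list(raw):
    """Parse a stringified list such as "['a', 'b']" back into a list of str."""
    try:
        tokens = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Not a valid Python literal (e.g. an empty or truncated cell)
        return []
    if not isinstance(tokens, list):
        return []
    return [str(token) for token in tokens]

print(parse_token_list("['spiritually', 'mentally', 'inspiring', 'book']"))
```

It could be dropped in wherever `convert_text_to_list` is applied, e.g. `df0['reviewText'].progress_apply(parse_token_list)`.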

In [37]:
# Convert the "reviewText" field back to a list
df0['reviewText'] = df0['reviewText'].astype(str)
df0['reviewText'] = df0['reviewText'].progress_apply(convert_text_to_list)
df0['reviewText'].head()


0    [spiritually,  mentally,  inspiring,  book,  a...
1    [one,  must,  books,  masterpiece,  spirituali...
2    [book,  provides,  reflection,  apply,  life, ...
3    [first,  read,  prophet,  college,  back,  60s...
4    [timeless,  classic,  demanding,  assuming,  t...
Name: reviewText, dtype: object

In [38]:
import nltk
nltk.__version__

'3.2.4'

For more information on the Stanford POS tagger used below, see: https://nlp.stanford.edu/software/tagger.shtml#History

In [39]:
from nltk.tag import StanfordPOSTagger
from nltk import word_tokenize

# import os
# os.getcwd()

# Add the jar and model via their path (instead of setting environment variables):
jar = '../models/stanford-postagger-full-2017-06-09/stanford-postagger.jar'
model = '../models/stanford-postagger-full-2017-06-09/models/english-left3words-distsim.tagger'

pos_tagger = StanfordPOSTagger(model, jar, encoding='utf8')

In [40]:
# Example
text = pos_tagger.tag(word_tokenize("What's the airspeed of an unladen swallow ?"))
print(text)

[('What', 'WP'), ("'s", 'VBZ'), ('the', 'DT'), ('airspeed', 'NN'), ('of', 'IN'), ('an', 'DT'), ('unladen', 'JJ'), ('swallow', 'VB'), ('?', '.')]


In [41]:
def pos_tag(review):
    """Tag a tokenised review; empty reviews yield an empty list."""
    if len(review) > 0:
        return pos_tagger.tag(review)
    return []

`nltk` provides documentation for each tag, which can be queried using the tag, e.g., `nltk.help.upenn_tagset('RB')`, or a regular expression. `nltk` taggers also provide a batch method, `tag_sents`, for tagging several documents in one call; here each review is tagged individually with `progress_apply`:


In [42]:
tagged_df = pd.DataFrame(df0['reviewText'].progress_apply(lambda review: pos_tag(review)))
tagged_df.head()

Progress:: 100%|██████████| 50/50 [00:42<00:00,  1.14it/s]


Unnamed: 0,reviewText
0,"[(spiritually, RB), (mentally, RB), (inspiring..."
1,"[(one, PRP), (must, MD), (books, NNS), (master..."
2,"[(book, NN), (provides, VBZ), (reflection, NN)..."
3,"[(first, RB), (read, VB), (prophet, NN), (coll..."
4,"[(timeless, JJ), (classic, JJ), (demanding, VB..."
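Tagging one review per `tag` call can be slow with the Stanford wrapper, since each call typically starts a separate Java process. NLTK taggers also expose `tag_sents`, which accepts a list of token lists. The sketch below shows how reviews could be sent in one batch while keeping empty reviews aligned with their rows. `tag_reviews_in_batch` and the stand-in `EverythingIsANounTagger` are hypothetical names used only so the example runs without Java.

```python
def tag_reviews_in_batch(tagger, reviews):
    """Tag many tokenised reviews with a single `tag_sents` call.

    `tagger` is any object with an NLTK-style `tag_sents(list_of_token_lists)`
    method; empty reviews are skipped and come back as empty lists.
    """
    non_empty_positions = [i for i, review in enumerate(reviews) if review]
    tagged = tagger.tag_sents([reviews[i] for i in non_empty_positions])

    results = [[] for _ in reviews]
    for position, tagged_review in zip(non_empty_positions, tagged):
        results[position] = tagged_review
    return results


class EverythingIsANounTagger:
    """Stand-in tagger so the sketch runs without Java; not a real model."""

    def tag_sents(self, sentences):
        return [[(token, 'NN') for token in sentence] for sentence in sentences]


demo = tag_reviews_in_batch(EverythingIsANounTagger(),
                            [['timeless', 'classic'], [], ['book']])
print(demo)
```

With the real tagger this might be called as `tag_reviews_in_batch(pos_tagger, df0['reviewText'].tolist())`. For a corpus of this size it may be safer to pass the reviews in chunks rather than all at once.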


In [43]:
tagged_df['reviewText'][8]

[('timeless', 'JJ'),
 ('classic', 'JJ'),
 ('years', 'NNS'),
 ('ive', 'JJ'),
 ('given', 'VBN'),
 ('gift', 'NN'),
 ('times', 'NNS'),
 ('count', 'VBP'),
 ('continue', 'VB'),
 ('addresses', 'NNS'),
 ('real', 'JJ'),
 ('life', 'NN'),
 ('issues', 'NNS'),
 ('beautiful', 'JJ'),
 ('way', 'NN'),
 ('makes', 'VBZ'),
 ('us', 'PRP'),
 ('reexamine', 'VB'),
 ('attitude', 'NN'),
 ('see', 'VB'),
 ('happens', 'VBZ'),
 ('lives', 'NNS'),
 ('easy', 'JJ'),
 ('read', 'NN')]
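
A quick way to sanity-check a tagged review is to count how often each tag occurs. The sketch below does this with `collections.Counter` on the first few (word, tag) pairs from the output above; `tag_distribution` is a hypothetical helper name, not part of the original notebook.

```python
from collections import Counter

def tag_distribution(tagged_review):
    """Count how many times each POS tag occurs in one tagged review."""
    return Counter(tag for _, tag in tagged_review)

sample = [('timeless', 'JJ'), ('classic', 'JJ'), ('years', 'NNS'),
          ('ive', 'JJ'), ('given', 'VBN'), ('gift', 'NN')]
print(tag_distribution(sample).most_common())
```

Note that `ive` is tagged `JJ` here: tokens damaged by normalisation (the apostrophe in "I've" was removed) tend to receive unreliable tags, and a tag count can help surface such cases.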

The tagger uses the Penn Treebank tagset. The list of all possible tags appears below:

| Tag  | Description                              |
|------|------------------------------------------|
| CC   | Coordinating conjunction                 |
| CD   | Cardinal number                          |
| DT   | Determiner                               |
| EX   | Existential there                        |
| FW   | Foreign word                             |
| IN   | Preposition or subordinating conjunction |
| JJ   | Adjective                                |
| JJR  | Adjective, comparative                   |
| JJS  | Adjective, superlative                   |
| LS   | List item marker                         |
| MD   | Modal                                    |
| NN   | Noun, singular or mass                   |
| NNS  | Noun, plural                             |
| NNP  | Proper noun, singular                    |
| NNPS | Proper noun, plural                      |
| PDT  | Predeterminer                            |
| POS  | Possessive ending                        |
| PRP  | Personal pronoun                         |
| PRP* | Possessive pronoun                       |
| RB   | Adverb                                   |
| RBR  | Adverb, comparative                      |
| RBS  | Adverb, superlative                      |
| RP   | Particle                                 |
| SYM  | Symbol                                   |
| TO   | to                                       |
| UH   | Interjection                             |
| VB   | Verb, base form                          |
| VBD  | Verb, past tense                         |
| VBG  | Verb, gerund or present participle       |
| VBN  | Verb, past participle                    |
| VBP  | Verb, non-3rd person singular present    |
| VBZ  | Verb, 3rd person singular present        |
| WDT  | Wh-determiner                            |
| WP   | Wh-pronoun                               |
| WP*  | Possessive wh-pronoun                    |
| WRB  | Wh-adverb                                |

Note: in the table above, `*` stands for `$`, so the actual tags are `PRP$` and `WP$`.

In [None]:
# Join with the original key and persist locally to avoid re-processing
uniqueKey_series_df = df0[['uniqueKey']]
uniqueKey_series_df.head()

In [None]:
pos_tagged_keyed_reviews = pd.concat([uniqueKey_series_df, tagged_df], axis=1)
pos_tagged_keyed_reviews.head()

In [None]:
pos_tagged_keyed_reviews.to_csv("../data/interim/002_pos_tagged_keyed_reviews.csv", sep='\t', header=True, index=False)

## Nouns
Nouns generally refer to people, places, things, or concepts, e.g.: woman, Scotland, book, intelligence. Nouns can appear after determiners and adjectives, and can be the subject or object of the verb.

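The simplified noun tags are `N` for common nouns like book, and `NP` for proper nouns like Scotland. In the Penn Treebank tagset used here, common nouns are tagged `NN` (singular or mass) or `NNS` (plural), and proper nouns `NNP` (singular) or `NNPS` (plural).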

In [None]:
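def noun_collector(word_tag_list):
    """Return the words tagged as nouns (NN, NNS, NNP, NNPS)."""
    noun_tags = {'NN', 'NNS', 'NNP', 'NNPS'}
    # Treat missing (None) or empty reviews as having no nouns
    return [word for (word, tag) in (word_tag_list or []) if tag in noun_tags]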

In [None]:
nouns_df = pd.DataFrame(tagged_df['reviewText'].progress_apply(lambda review: noun_collector(review)))
nouns_df.head()
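
Once nouns have been collected per review, a simple corpus-level feature is how often each noun occurs. A minimal sketch, assuming each review is a list of noun strings (or `None` for a review with no result); `noun_frequencies` is a hypothetical helper, not part of the original notebook.

```python
from collections import Counter

def noun_frequencies(noun_lists):
    """Count noun occurrences across reviews, skipping missing entries."""
    counts = Counter()
    for nouns in noun_lists:
        if nouns:  # ignore None and empty lists
            counts.update(nouns)
    return counts

sample_nouns = [['book', 'life'], None, ['book', 'gift', 'issues']]
print(noun_frequencies(sample_nouns).most_common(1))
```

On the real data this might be called as `noun_frequencies(nouns_df['reviewText'])`, since iterating over a pandas Series yields its values.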

In [None]:
tagged_df["reviewText"][10]

In [None]:
df0['reviewText'][10]