# POS-Tagging & Feature Extraction
Following normalisation, we can now proceed to POS-tagging and feature extraction. Let's start with POS-tagging.

## POS-tagging
Part-of-speech (POS) tagging is one of the most important text analysis tasks. It classifies words by their part of speech and labels each one according to a tagset, which is the collection of tags used for the tagging. Parts of speech are also known as word classes or lexical categories.

The `nltk` library provides its own pre-trained POS tagger. Let's see how it is used.

In [1]:
import pandas as pd
df0 = pd.read_csv("../data/interim/001_normalised_keyed_reviews_100k_sample.csv", sep="\t", low_memory=False)
df0.head()

Unnamed: 0,uniqueKey,reviewText
0,A10000012B7CGYKOMPQ4L##000100039X,"['spiritually', 'mentally', 'inspiring', 'book..."
1,A2S166WSCFIFP5##000100039X,"['one', 'must', 'books', 'masterpiece', 'spiri..."
2,A1BM81XB4QHOA3##000100039X,"['book', 'provides', 'reflection', 'apply', 'l..."
3,A1MOSTXNIO5MPJ##000100039X,"['first', 'read', 'prophet', 'college', 'back'..."
4,A2XQ5LZHTD4AFT##000100039X,"['timeless', 'classic', 'demanding', 'assuming..."


In [2]:
# For monitoring duration of pandas processes
from tqdm import tqdm

# To avoid RuntimeError: Set changed size during iteration
tqdm.monitor_interval = 0

# Register `pandas.progress_apply` and `pandas.Series.map_apply` with `tqdm`
# (can use `tqdm_gui`, `tqdm_notebook`, optional kwargs, etc.)
tqdm.pandas(desc="Progress:")

# Now you can use `progress_apply` instead of `apply`
# and `progress_map` instead of `map`
# can also groupby:
# df.groupby(0).progress_apply(lambda x: x**2)

In [3]:
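def convert_text_to_list(review):
    # Strip the surrounding brackets and split on commas.
    # Note: the quote characters and leading spaces stay on each token.
    return review.replace("[", "").replace("]", "").split(",")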

In [4]:
# Convert the "reviewText" field back to a list
df0['reviewText'] = df0['reviewText'].astype(str)
df0['reviewText'] = df0['reviewText'].progress_apply(convert_text_to_list)
df0['reviewText'].head()

Progress:: 100%|██████████| 99999/99999 [00:01<00:00, 95758.99it/s] 


0    ['spiritually',  'mentally',  'inspiring',  'b...
1    ['one',  'must',  'books',  'masterpiece',  's...
2    ['book',  'provides',  'reflection',  'apply',...
3    ['first',  'read',  'prophet',  'college',  'b...
4    ['timeless',  'classic',  'demanding',  'assum...
Name: reviewText, dtype: object
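
The head of `reviewText` above still shows quote characters and extra spaces around each token, which suggests that the plain `replace`/`split` approach does not fully undo the list serialisation. A minimal sketch of an alternative, assuming every cell holds a valid Python list literal (this has not been checked against the full file), is to parse the cell with the standard library's `ast.literal_eval`. The helper name `parse_token_list` is hypothetical.

```python
import ast


def parse_token_list(cell):
    """Turn a stringified list such as "['a', 'b']" back into a list of str."""
    tokens = ast.literal_eval(cell)
    # Guard against cells that parse to something other than a list.
    if not isinstance(tokens, list):
        raise ValueError(f"expected a list literal, got {type(tokens).__name__}")
    return [str(token) for token in tokens]


cell = "['spiritually', 'mentally']"

# Naive approach: quotes and the leading space remain part of each token.
naive = cell.replace("[", "").replace("]", "").split(",")
print(naive)                   # ["'spiritually'", " 'mentally'"]

# Literal parsing: clean tokens.
print(parse_token_list(cell))  # ['spiritually', 'mentally']
```

If this fits the data, it could be passed to `progress_apply` in place of `convert_text_to_list`. Tagging would then see bare words rather than quoted strings, so the resulting tags may differ from the output shown in this notebook.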

In [6]:
import nltk

def pos_tag(review):
    # Tag non-empty token lists; empty reviews yield None
    if len(review) > 0:
        return nltk.pos_tag(review)
    return None

In [7]:
tagged_df = pd.DataFrame(df0['reviewText'].progress_apply(pos_tag))
tagged_df.head()

Progress:: 100%|██████████| 99999/99999 [05:00<00:00, 332.41it/s]


Unnamed: 0,reviewText
0,"[('spiritually', POS), ( 'mentally', NN), ( 'i..."
1,"[('one', POS), ( 'must', NN), ( 'books', NNP),..."
2,"[('book', POS), ( 'provides', NN), ( 'reflecti..."
3,"[('first', POS), ( 'read', NN), ( 'prophet', N..."
4,"[('timeless', POS), ( 'classic', NN), ( 'deman..."


`nltk` provides documentation for each tag, which can be queried using the tag, e.g., `nltk.help.upenn_tagset('RB')`, or a regular expression. `nltk` also provides a batch method, `nltk.pos_tag_sents`, for tagging several token lists (for example, a whole document) in one call.
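
As a rough sketch of batch tagging: `nltk.pos_tag_sents` accepts a list of token lists and returns one tagged list per input. The wrapper below is hypothetical; `tag_reviews` and its `batch_tagger` parameter are not part of `nltk` or of this notebook. The parameter exists only so that the function can be tried out with a stand-in tagger when the `nltk` tagger model is not installed.

```python
def tag_reviews(reviews, batch_tagger=None):
    """Tag many token lists in one call.

    reviews      -- iterable of token lists, one list per review
    batch_tagger -- callable taking a list of token lists and returning a
                    list of [(token, tag), ...] lists; defaults to
                    nltk.pos_tag_sents
    """
    if batch_tagger is None:
        # Imported lazily so the function can be used with a custom tagger.
        # The default needs the perceptron tagger model, fetched once with
        # nltk.download(...); the resource name depends on the NLTK version.
        import nltk
        batch_tagger = nltk.pos_tag_sents
    return batch_tagger([list(review) for review in reviews])


def dummy_tagger(sentences):
    # Stand-in tagger for illustration only: labels every token as "NN".
    return [[(token, "NN") for token in sentence] for sentence in sentences]


print(tag_reviews([["timeless", "classic"], ["read"]], batch_tagger=dummy_tagger))
```

With the default tagger, something like `tag_reviews(df0['reviewText'])` would tag the whole column in a single call instead of one `nltk.pos_tag` call per row. Whether that is faster on this dataset has not been measured here.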


In [8]:
# Join with the original key and persist locally to avoid re-processing
uniqueKey_series_df = df0[['uniqueKey']]
uniqueKey_series_df.head()

Unnamed: 0,uniqueKey
0,A10000012B7CGYKOMPQ4L##000100039X
1,A2S166WSCFIFP5##000100039X
2,A1BM81XB4QHOA3##000100039X
3,A1MOSTXNIO5MPJ##000100039X
4,A2XQ5LZHTD4AFT##000100039X


In [9]:
pos_tagged_keyed_reviews = pd.concat([uniqueKey_series_df, tagged_df], axis=1)
pos_tagged_keyed_reviews.head()

Unnamed: 0,uniqueKey,reviewText
0,A10000012B7CGYKOMPQ4L##000100039X,"[('spiritually', POS), ( 'mentally', NN), ( 'i..."
1,A2S166WSCFIFP5##000100039X,"[('one', POS), ( 'must', NN), ( 'books', NNP),..."
2,A1BM81XB4QHOA3##000100039X,"[('book', POS), ( 'provides', NN), ( 'reflecti..."
3,A1MOSTXNIO5MPJ##000100039X,"[('first', POS), ( 'read', NN), ( 'prophet', N..."
4,A2XQ5LZHTD4AFT##000100039X,"[('timeless', POS), ( 'classic', NN), ( 'deman..."


In [10]:
pos_tagged_keyed_reviews.to_csv("../data/interim/002_pos_tagged_keyed_reviews.csv", sep='\t', header=True, index=False)
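
Because CSV stores each tagged review as text, a later notebook that reloads this file will get strings rather than lists of tuples, and rows where `pos_tag` returned `None` will most likely come back as missing values. A minimal sketch of a restore step, assuming each non-empty cell is the Python literal form of a list of `(token, tag)` pairs, is shown below. The helper name `restore_tagged_review` is hypothetical.

```python
import ast


def restore_tagged_review(cell):
    """Parse a persisted cell back into a list of (token, tag) tuples."""
    # Missing values (e.g. NaN after reloading) are not strings;
    # treat them, and blank cells, as reviews without tags.
    if not isinstance(cell, str) or not cell.strip():
        return []
    return [tuple(pair) for pair in ast.literal_eval(cell)]


print(restore_tagged_review("[('book', 'NN'), ('provides', 'VBZ')]"))
# [('book', 'NN'), ('provides', 'VBZ')]
print(restore_tagged_review(float("nan")))
# []
```

After reading the file back with `pd.read_csv(..., sep="\t")`, the helper could be applied to the `reviewText` column with `progress_apply`.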
