# POS-Tagging & Feature Extraction
Following normalisation, we can proceed to POS-tagging and feature extraction. Let's start with POS-tagging.

## POS-tagging
Part-of-speech (POS) tagging is one of the most important text analysis tasks. It classifies each word by its part of speech and labels it with a tag from a tagset, the collection of tags used for POS-tagging. Parts of speech are also known as word classes or lexical categories.
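
To make the output format concrete, here is a minimal sketch of what a tagged sentence looks like: a list of `(token, tag)` pairs. The tags are Penn Treebank tags assigned by hand for illustration, not the output of a tagger, and `group_by_tag` is a hypothetical helper.

```python
from collections import defaultdict

# Hand-tagged example using Penn Treebank tags (illustrative, not tagger output).
tagged_sentence = [
    ("the", "DT"),     # determiner
    ("phone", "NN"),   # noun, singular
    ("works", "VBZ"),  # verb, 3rd person singular present
    ("really", "RB"),  # adverb
    ("well", "RB"),    # adverb
]


def group_by_tag(tagged_tokens):
    """Collect the tokens that share each tag (hypothetical helper)."""
    groups = defaultdict(list)
    for token, tag in tagged_tokens:
        groups[tag].append(token)
    return dict(groups)


print(group_by_tag(tagged_sentence))
```

Grouping by tag like this is one simple way to pull out, say, all adverbs or all nouns from a review once it has been tagged.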

The `nltk` library provides its own pre-trained POS-tagger. Let's see how it is used.

In [4]:
import pandas as pd
df0 = pd.read_csv("../data/interim/001_normalised_keyed_reviews_100k_sample.csv", sep="\t", low_memory=False)
df0.head()

FileNotFoundError: File b'../data/interim/001_normalised_keyed_reviews_100k_sample.csv' does not exist

In [None]:
# For monitoring duration of pandas processes
from tqdm import tqdm

# To avoid RuntimeError: Set changed size during iteration
tqdm.monitor_interval = 0

# Register `progress_apply` and `progress_map` on pandas objects with `tqdm`
# (can use `tqdm_gui`, `tqdm_notebook`, optional kwargs, etc.)
tqdm.pandas(desc="Progress:")

# Now you can use `progress_apply` instead of `apply`
# and `progress_map` instead of `map`
# can also groupby:
# df.groupby(0).progress_apply(lambda x: x**2)

In [None]:
df0['reviewText'] = df0['reviewText'].tolist();
df0.head()

In [None]:
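# Convert the "reviewText" field back to a list of tokens:
# drop the brackets, split on commas, then strip the spaces and quotes left around each token
df0['reviewText'] = df0['reviewText'].progress_apply(
    lambda text: [tok.strip(" '\"") for tok in text.replace("[", "").replace("]", "").split(",")])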
df0['reviewText'].head()
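
The bracket-stripping approach depends on how the list was serialised. A minimal alternative sketch, assuming each cell was written by pandas from a Python list of strings (so it looks like `"['good', 'phone']"`), is to parse the cell with `ast.literal_eval` from the standard library. `parse_token_list` is a hypothetical helper name.

```python
import ast


def parse_token_list(cell):
    """Parse a stringified Python list such as "['good', 'phone']" into a list."""
    tokens = ast.literal_eval(cell)
    if not isinstance(tokens, list):
        raise ValueError(f"expected a list, got {type(tokens).__name__}")
    return tokens


cell = "['good', 'phone', 'battery, mostly']"

# Parsed tokens lose their quotes and keep any embedded commas intact.
print(parse_token_list(cell))

# Naive split for comparison: quotes and spaces remain, and the last token is cut in two.
naive = cell.replace("[", "").replace("]", "").split(",")
print(naive)
```

Under the same assumption, this could be applied to the column with `df0['reviewText'].progress_apply(parse_token_list)`. If the cells are not valid Python literals, `ast.literal_eval` raises an error instead of silently producing malformed tokens.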

In [None]:
import nltk

# Tag the first 1,000 reviews only; each review is a list of tokens
tagged_df = pd.DataFrame(df0[0:1000]['reviewText'].progress_apply(nltk.pos_tag))
tagged_df.head()

`nltk` provides documentation for each tag, which can be queried by tag, e.g. `nltk.help.upenn_tagset('RB')`, or by a regular expression. `nltk` also provides a batch method, `nltk.pos_tag_sents`, for tagging many documents in one call.
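
Batch tagging can be sketched as follows. `nltk.pos_tag_sents` takes a list of token lists and returns one tagged list per input. The wrapper `tag_reviews` and its `batch_tagger` parameter are hypothetical names, introduced here so the tagger can be swapped out. Running it with the default assumes `nltk` and its pre-trained tagger model are installed.

```python
def tag_reviews(token_lists, batch_tagger=None):
    """Tag many tokenised reviews in one call.

    token_lists: list of token lists, e.g. [["good", "phone"], ["bad", "battery"]].
    batch_tagger: callable mapping token lists to tagged lists; defaults to
        nltk.pos_tag_sents (assumes nltk and its tagger model are installed).
    """
    if batch_tagger is None:
        import nltk  # imported lazily so the sketch loads without nltk
        batch_tagger = nltk.pos_tag_sents
    tagged = batch_tagger(token_lists)
    if len(tagged) != len(token_lists):
        raise ValueError("tagger returned a different number of reviews")
    return tagged
```

With the DataFrame used in this notebook, a call might look like `tag_reviews(df0['reviewText'][0:1000].tolist())`, which returns a plain list of tagged reviews in the same order as the input.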


In [None]:
# Join with the original key and persist locally to avoid re-processing
uniqueKey_series_df = df0[['uniqueKey']]
uniqueKey_series_df.head()

In [None]:
# Inner join on the index so that only rows that were actually tagged are kept
pos_tagged_keyed_reviews = pd.concat([uniqueKey_series_df, tagged_df], axis=1, join='inner')
pos_tagged_keyed_reviews.head()

In [None]:
pos_tagged_keyed_reviews.to_csv("../data/interim/002_pos_tagged_keyed_reviews.csv", sep='\t', header=True, index=False);