### Section 3.2: Tokenization and Tagging

This section covers tokenization and part-of-speech (POS) tagging, the first steps in preparing text for a GCN. Tokenization splits text into tokens (words, subwords, or punctuation marks), and POS tagging labels each token with its grammatical category, such as noun or verb. Both feed into dependency parsing, which produces the syntactic trees and graphs that GCN models take as input in NLP.

**Contents:**

1. **Understanding Tokenization in NLP**
2. **POS Tagging for Syntactic Information**
3. **Implementing Tokenization and POS Tagging with NLTK**
4. **Using spaCy for Advanced Parsing**
5. **Code Walkthrough**

---



### 1. Understanding Tokenization in NLP

- **Definition**: Tokenization is the process of splitting text into smaller units, typically words or subwords, known as tokens.
- **Importance**: Tokenization is the first step in text preprocessing, as it breaks down the text for further analysis, including POS tagging and dependency parsing.
- **Types of Tokenization**:
  - **Word Tokenization**: Splits text into individual words.
  - **Subword Tokenization**: Breaks down words into smaller meaningful units (used in transformer models).
  - **Sentence Tokenization**: Splits text into sentences.

**Example**:
  - Text: “The cat sat on the mat.”
  - Tokens: `['The', 'cat', 'sat', 'on', 'the', 'mat', '.']` (the final period is a token of its own)
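
The three granularities listed above can be illustrated with a minimal, standard-library-only sketch. It is not how NLTK or any transformer tokenizer is implemented: the regular expressions are simplifications, and `subword_tokenize` and `toy_vocab` are hypothetical names introduced here purely for illustration.

```python
import re

text = "The cat sat on the mat. It purred."

# Sentence tokenization: split on whitespace that follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# Word tokenization: runs of word characters, or single punctuation marks
words = re.findall(r"\w+|[^\w\s]", sentences[0])


def subword_tokenize(word, vocab):
    """Greedy longest-match split of one word (toy WordPiece-style sketch)."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # Nothing matched: fall back to a single character
            end = start + 1
        pieces.append(word[start:end])
        start = end
    return pieces


# Hypothetical vocabulary, chosen only to make the example work
toy_vocab = {"purr", "ed", "cat", "sat"}
subwords = subword_tokenize("purred", toy_vocab)

print("Sentences:", sentences)  # two sentences
print("Words:", words)          # words plus the final period
print("Subwords:", subwords)    # ['purr', 'ed']
```

Real subword tokenizers learn their vocabulary from data; the greedy matching here only shows the general idea of splitting a word into known pieces.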



#### Code Example: Basic Tokenization with NLTK



In [5]:
import nltk
from nltk.tokenize import word_tokenize

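# Download the tokenizer and tagger resources (only needed once).
# Newer NLTK releases may ask for 'punkt_tab' and
# 'averaged_perceptron_tagger_eng' instead.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')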


# Sample text for tokenization
text = "The cat sat on the mat."

# Tokenize the sentence into individual words
tokens = word_tokenize(text)

# Display the tokenized words
print("Tokens:", tokens)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.

Tokens: ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']


**Explanation**:

- **Tokenization**: `word_tokenize` splits the text into word-level tokens and treats the final period as a token of its own. These tokens are the units that later steps (POS tagging, parsing, feature extraction) operate on.


### 2. POS Tagging for Syntactic Information

- **Definition**: POS tagging labels each token in a sentence with its grammatical category (e.g., noun, verb, adjective).
- **Purpose**: POS tags carry grammatical information that dependency parsers can use and that can be attached to the nodes of a syntactic tree for GCNs.
- **Common POS Tags** (Penn Treebank tagset, as returned by NLTK’s default English tagger):
  - **NN**: Noun, singular or mass (e.g., cat, dog)
  - **VB**: Verb, base form (e.g., run, sit); **VBD** marks the past tense (e.g., sat)
  - **JJ**: Adjective (e.g., big, small)

**Example**:
  - Tokens: `['The', 'cat', 'sat', 'on', 'the', 'mat', '.']`
  - POS Tags: `[('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN'), ('.', '.')]`



#### Code Example: POS Tagging with NLTK



In [6]:
from nltk import pos_tag

# Perform POS tagging on the tokenized words
pos_tags = pos_tag(tokens)

# Display each word with its corresponding POS tag
print("POS Tags:", pos_tags)


POS Tags: [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN'), ('.', '.')]



**Explanation**:
- **POS Tagging**: Assigns a part-of-speech (POS) tag to each token, indicating its grammatical role (e.g., noun, verb).
- **Usage**: POS tags help in understanding sentence structure, which is essential for dependency parsing, entity recognition, and building dependency relations for GCN processing in NLP.
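
Fine-grained Penn Treebank tags such as `NN` and `VBD` are often collapsed into coarser classes before being used as features. The sketch below shows one way to do that with a plain dictionary. `PTB_TO_COARSE` and `coarsen` are hypothetical names, and the mapping is a deliberately partial, hand-written assumption, not an official conversion table.

```python
# Tagged tokens copied from the NLTK output above (Penn Treebank tags)
pos_tags = [("The", "DT"), ("cat", "NN"), ("sat", "VBD"), ("on", "IN"),
            ("the", "DT"), ("mat", "NN"), (".", ".")]

# Hypothetical, partial mapping from Penn Treebank tags to coarse classes.
# Simplification: IN also covers subordinating conjunctions, not only prepositions.
PTB_TO_COARSE = {
    "DT": "DET",
    "NN": "NOUN", "NNS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBZ": "VERB",
    "JJ": "ADJ",
    "IN": "ADP",
    ".": "PUNCT",
}


def coarsen(tagged, mapping, default="X"):
    """Replace each fine-grained tag with its coarse class; unknown tags become `default`."""
    return [(word, mapping.get(tag, default)) for word, tag in tagged]


coarse_tags = coarsen(pos_tags, PTB_TO_COARSE)
print("Coarse tags:", coarse_tags)
```

A smaller tag inventory keeps the feature dimension low, at the cost of distinctions such as tense (`VB` versus `VBD`).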


### 3. Implementing Tokenization and POS Tagging with NLTK

We can combine tokenization and POS tagging to prepare text data for graph-based representation.



#### Code Example: Tokenization and POS Tagging Pipeline


In [7]:
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag

# Sample sentence for tokenization and POS tagging
sentence = "The cat sat on the mat."

# Step 1: Tokenize the sentence into individual words
tokens = word_tokenize(sentence)
print("Tokens:", tokens)  # Display tokenized words

# Step 2: Perform POS Tagging to assign a syntactic role to each token
pos_tags = pos_tag(tokens)
print("POS Tags:", pos_tags)  # Display each word with its POS tag


Tokens: ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']
POS Tags: [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN'), ('.', '.')]



**Explanation**:
- **Tokenization**: Breaks the sentence into words, preparing each token for further analysis.
- **POS Tagging**: Labels each token with its grammatical category (e.g., noun, verb).
- **Usage**: The tagged tokens become the nodes of the graph. The edges, and hence the adjacency matrix built in later steps, come from dependency parsing, which typically draws on POS information to work out how tokens relate.
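
As a rough sketch of how the pipeline output might be organized for graph construction, each tagged token can be stored as a node record keyed by its position in the sentence. Positions matter because the same word can occur more than once. The record layout and the names `nodes` and `ids_by_tag` are assumptions made for this example only.

```python
from collections import defaultdict

# Tagged tokens copied from the pipeline output above
pos_tags = [("The", "DT"), ("cat", "NN"), ("sat", "VBD"), ("on", "IN"),
            ("the", "DT"), ("mat", "NN"), (".", ".")]

# One node per token; the position in the sentence serves as the node id
nodes = [{"id": i, "text": word, "pos": tag}
         for i, (word, tag) in enumerate(pos_tags)]

# Group node ids by POS tag, e.g. to inspect how many nodes share a category
ids_by_tag = defaultdict(list)
for node in nodes:
    ids_by_tag[node["pos"]].append(node["id"])

for node in nodes:
    print(node)
print("Node ids by tag:", dict(ids_by_tag))
```

Using integer ids rather than token text means that rows and columns of a later adjacency matrix can be addressed directly.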


### 4. Using spaCy for Advanced Parsing

While NLTK provides basic tokenization and POS tagging, **spaCy** ships pretrained pipelines that also include a dependency parser. Dependency parsing identifies syntactic relationships between tokens, such as which noun is the subject of a verb, and these relations define the edges of the adjacency matrix for a GCN.



1. **Install spaCy and the English model**:


In [8]:
!pip install spacy
!python -m spacy download en_core_web_sm





2. **Using spaCy for Tokenization, POS Tagging, and Dependency Parsing**:


In [9]:
import spacy

# Load the spaCy English model
nlp = spacy.load("en_core_web_sm")

# Sample sentence for analysis; calling nlp() runs tokenization,
# POS tagging, and dependency parsing in one pass
sentence = "The cat sat on the mat."
doc = nlp(sentence)

# Print each token with its POS tag, dependency label, and head
for token in doc:
    print(f"Token: {token.text}, POS: {token.pos_}, Dependency: {token.dep_}, Head: {token.head.text}")


Token: The, POS: DET, Dependency: det, Head: cat
Token: cat, POS: NOUN, Dependency: nsubj, Head: sat
Token: sat, POS: VERB, Dependency: ROOT, Head: sat
Token: on, POS: ADP, Dependency: prep, Head: sat
Token: the, POS: DET, Dependency: det, Head: mat
Token: mat, POS: NOUN, Dependency: pobj, Head: on
Token: ., POS: PUNCT, Dependency: punct, Head: sat



**Explanation**:
- **Token**: Each word or punctuation mark in the sentence (`token.text`).
- **POS (Part of Speech)**: The coarse-grained grammatical category of the token (`token.pos_`, e.g., NOUN, VERB). These are Universal POS tags, which is why they differ from the Penn Treebank tags (DT, NN, VBD) in the NLTK output; spaCy exposes the fine-grained tag as `token.tag_`.
- **Dependency**: The syntactic relation between the token and its head (`token.dep_`), such as `nsubj` (nominal subject) or `pobj` (object of a preposition).
- **Head**: The "parent" token that each token depends on (`token.head`). The root of the sentence (here, "sat") is its own head. Token-head pairs form the edges of the dependency graph.
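
To make the link between heads and graphs concrete, here is a minimal sketch that turns token-head pairs into an adjacency matrix. The arcs are hard-coded from the spaCy output above so that the example runs without a model. Treating arcs as undirected and adding self-loops is a common convention for GCN inputs and is an assumption here, not necessarily the construction used later in this material. `build_adjacency` is a hypothetical helper name.

```python
# (token index, dependency label, head index) for "The cat sat on the mat."
# Hard-coded from the spaCy output above. With a loaded pipeline these could be
# collected as [(t.i, t.dep_, t.head.i) for t in doc].
arcs = [(0, "det", 1), (1, "nsubj", 2), (2, "ROOT", 2), (3, "prep", 2),
        (4, "det", 5), (5, "pobj", 3), (6, "punct", 2)]


def build_adjacency(arcs, n, self_loops=True):
    """Build an n x n 0/1 adjacency matrix (list of lists) from dependency arcs."""
    adj = [[0] * n for _ in range(n)]
    for child, _label, head in arcs:
        if child != head:  # the root is its own head; that arc is not an edge
            adj[child][head] = 1
            adj[head][child] = 1  # assumption: edges are treated as undirected
    if self_loops:
        for i in range(n):
            adj[i][i] = 1  # assumption: every node is connected to itself
    return adj


adjacency = build_adjacency(arcs, n=len(arcs))
for row in adjacency:
    print(row)
```

Indices are used instead of token text because a sentence may contain the same word twice, which would make text-based heads ambiguous.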



**Advantages of Using spaCy**:
- **Built-in Dependency Parsing**: Provides syntactic relations for constructing adjacency matrices.
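- **Efficiency**: A single call to `nlp()` runs tokenization, tagging, and parsing together, so no separate parser has to be wired in, as it would be with the NLTK functions used above.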

---



### 5. Code Walkthrough: Tokenization and Tagging Pipeline with spaCy

Let’s combine the steps from tokenization, POS tagging, and dependency parsing using spaCy to prepare data for GCN processing.



In [10]:
import spacy

# Load the spaCy English model
nlp = spacy.load("en_core_web_sm")

# Define a sentence for processing
sentence = "The cat sat on the mat."
doc = nlp(sentence)

# Extract information: tokens, POS tags, and dependency relationships
tokens = [token.text for token in doc]
pos_tags = [(token.text, token.pos_) for token in doc]
dependencies = [(token.text, token.dep_, token.head.text) for token in doc]

# Display the extracted information
print("Tokens:", tokens)               # List of tokens in the sentence
print("POS Tags:", pos_tags)            # List of tuples with token and POS tag
print("Dependencies:", dependencies)    # List of tuples with token, dependency role, and head word


Tokens: ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']
POS Tags: [('The', 'DET'), ('cat', 'NOUN'), ('sat', 'VERB'), ('on', 'ADP'), ('the', 'DET'), ('mat', 'NOUN'), ('.', 'PUNCT')]
Dependencies: [('The', 'det', 'cat'), ('cat', 'nsubj', 'sat'), ('sat', 'ROOT', 'sat'), ('on', 'prep', 'sat'), ('the', 'det', 'mat'), ('mat', 'pobj', 'on'), ('.', 'punct', 'sat')]



**Explanation**:
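- **Tokens**: The words and punctuation marks that form the basic units of the sentence.
- **POS Tags**: Each token’s grammatical category, which can serve as a node feature for the GCN.
- **Dependencies**: Triples of (token, dependency label, head). Each triple describes one edge of the dependency graph, from which the adjacency matrix is built in later steps. Because heads are stored here as text, a sentence that repeats a word would make them ambiguous; `token.i` and `token.head.i` give unambiguous positions.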
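
One simple way POS tags could serve as GCN features is one-hot encoding: each token gets a vector with a single 1 in the position of its tag. The sketch below uses the tags from the output above and plain Python lists. The names `tag_vocab`, `tag_index`, and `features` are illustrative assumptions, and a real pipeline would usually fix the tag vocabulary across the whole corpus rather than per sentence.

```python
# (token, POS) pairs copied from the spaCy output above
pos_tags = [("The", "DET"), ("cat", "NOUN"), ("sat", "VERB"), ("on", "ADP"),
            ("the", "DET"), ("mat", "NOUN"), (".", "PUNCT")]

# Sorted tag vocabulary so that column order is deterministic
tag_vocab = sorted({tag for _, tag in pos_tags})
tag_index = {tag: i for i, tag in enumerate(tag_vocab)}

# One row per token, one column per tag
features = []
for _, tag in pos_tags:
    row = [0] * len(tag_vocab)
    row[tag_index[tag]] = 1
    features.append(row)

print("Columns:", tag_vocab)
for (word, _), row in zip(pos_tags, features):
    print(f"{word:>4}: {row}")
```

The resulting matrix has one row per node, so its row order matches the token order used for any adjacency matrix built from the same sentence.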

---



### Summary and Key Takeaways

- **Tokenization and POS Tagging**: Key steps in preparing text data. Tokenization segments sentences into tokens, and POS tagging labels each token with its grammatical category.
- **Dependency Parsing with spaCy**: spaCy’s parser provides hierarchical syntactic structures, enabling the construction of dependency-based adjacency matrices.
- **GCN-Ready NLP Data**: Tokenized, tagged, and parsed data is now ready for further processing, like building adjacency matrices and extracting feature vectors for GCNs.

With tokenization, tagging, and dependency parsing completed, we’re ready to proceed to the next step: creating adjacency matrices and feature representations that GCNs can process for NLP tasks.