### Section 3.1: NLP Tree Representation with GCNs

In this section, we explore how to represent natural language data as graphs that Graph Convolutional Networks (GCNs) can process. This involves converting sentences into tree structures, extracting node features, and building adjacency matrices. Libraries such as **NLTK** and **spaCy** handle the parsing and tree generation that GCN-based NLP depends on; the code examples below use NLTK.

**Contents:**

1. **Introduction to NLP Graph Representation**
2. **Converting Sentences to Dependency Trees**
3. **Extracting Features from Parsed Trees**
4. **Building the Adjacency Matrix**
5. **Code Walkthrough**

---



### 1. Introduction to NLP Graph Representation

- **Goal**: Represent text as a graph where each word or phrase becomes a node, and syntactic dependencies form the edges.
- **Why Use Trees?**: NLP data has an inherent structure, such as dependency relations between words. Trees capture hierarchical relationships, which are useful for tasks like sentiment analysis, translation, and information extraction.

**Example**: In the sentence, “The cat sat on the mat,”:
  - **Nodes**: Each word is a node (e.g., “The,” “cat,” “sat”).
  - **Edges**: Edges represent grammatical relationships, such as the subject (cat) linked to the verb (sat).
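
To make the node and edge idea concrete, here is a minimal sketch of the example sentence as a graph. The edges and relation labels are hand-annotated for illustration (an assumption, not the output of any parser), and the names `nodes`, `edges`, and `neighbors` are hypothetical.

```python
# Hand-annotated graph for "The cat sat on the mat" (an assumption, not parser output).
nodes = ["The", "cat", "sat", "on", "the", "mat"]

# (head_index, dependent_index, relation_label) triples; the labels are illustrative.
edges = [
    (1, 0, "det"),    # cat -> The
    (2, 1, "nsubj"),  # sat -> cat
    (2, 3, "prep"),   # sat -> on
    (3, 5, "pobj"),   # on  -> mat
    (5, 4, "det"),    # mat -> the
]

# Build a neighbor list, treating every edge as undirected.
neighbors = {i: [] for i in range(len(nodes))}
for head, dep, _ in edges:
    neighbors[head].append(dep)
    neighbors[dep].append(head)

# Show each word with the words it is connected to.
for i, word in enumerate(nodes):
    print(word, "->", [nodes[j] for j in neighbors[i]])
```

A tree over six words has exactly five edges, which is a quick way to check that a hand-built structure like this one is plausible.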



### 2. Converting Sentences to Dependency Trees

We’ll use **NLTK** to tokenize and tag the sentence and to build a shallow tree. The NLTK tools used below stop at chunking; real dependency relations require a dependency parser, such as the one provided by spaCy.

#### Steps:
1. **Tokenize** the sentence to split it into words.
2. **POS Tagging**: Apply Part-of-Speech (POS) tagging to each token.
3. **Chunking**: Use chunking (shallow parsing) to group the tagged tokens into a tree structure.
4. **Dependency Parsing**: Establish head-dependent relations between words to build a dependency tree. The NLTK example below covers steps 1 to 3 only.



#### Code Example: Sentence to Tree Representation using NLTK


In [1]:
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag, ne_chunk

# Download necessary NLTK resources (only needed once)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

# Example sentence for processing
sentence = "The cat sat on the mat."

# Step 1: Tokenize the sentence into individual words
tokens = word_tokenize(sentence)
print("Tokens:", tokens)  # Output each token for verification

# Step 2: Perform POS tagging to assign each word a grammatical category
pos_tags = pos_tag(tokens)
print("POS Tags:", pos_tags)  # Display the word and its associated POS tag

# Step 3: Named-entity chunking groups tokens that form named entities
# The result is an nltk Tree; this sentence has no named entities, so the tree is flat
tree = ne_chunk(pos_tags)
print("Tree Structure:", tree)  # Prints the tree in bracketed form


Tokens: ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']
POS Tags: [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN'), ('.', '.')]
Tree Structure: (S The/DT cat/NN sat/VBD on/IN the/DT mat/NN ./.)



**Explanation**:
1. **POS Tagging**:
   - Assigns a grammatical category (e.g., determiner, noun, verb) to each word, which later stages use to infer relationships between words.
   
2. **Named Entity Chunking**:
   - Groups tokens that form named entities (such as people, organizations, or locations) into subtrees of an NLTK `Tree`. The example sentence contains no named entities, so the output is a flat tree under the root `S`.
   - `ne_chunk` does not produce dependency relations. For a GCN, the edges should come from a dependency or constituency parse; the flat tree here only illustrates the tree data structure.
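
The tree printed above is flat because the sentence has no named entities. As a hedged illustration of what a tree with real internal structure looks like, the sketch below groups noun phrases with NLTK's `RegexpParser`. The grammar is an illustrative assumption, and the POS tags are copied from the output above so that no NLTK data download is needed.

```python
import nltk

# POS-tagged tokens copied from the earlier output (no NLTK data download required).
tagged = [("The", "DT"), ("cat", "NN"), ("sat", "VBD"),
          ("on", "IN"), ("the", "DT"), ("mat", "NN"), (".", ".")]

# Illustrative grammar: an optional determiner, any adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunker = nltk.RegexpParser(grammar)
chunk_tree = chunker.parse(tagged)

# Collect the words inside each NP subtree.
noun_phrases = [
    [word for word, tag in subtree.leaves()]
    for subtree in chunk_tree.subtrees()
    if subtree.label() == "NP"
]
print(noun_phrases)  # [['The', 'cat'], ['the', 'mat']]
```

This is still shallow parsing: it groups words into phrases but does not say which word governs which, so it is not a substitute for a dependency parse.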


### 3. Extracting Features from Parsed Trees

After converting the sentence to a tree structure, we need to extract features that represent each node (word).



#### Node Features:
- **One-Hot Encoding**: Each word is represented by a unique vector where only one element is “1” (e.g., the word “cat” as `[0, 1, 0, ...]`).
- **POS Tags**: Use POS tags as features for each word.
- **Word Embeddings**: Optional, using pre-trained embeddings (like Word2Vec or GloVe) for richer representations.
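
The first two feature types can be combined by concatenation. The following is a minimal sketch, assuming the tokens and POS tags from the earlier NLTK output; the helper `one_hot` and the names `word_vocab`, `tag_vocab`, and `node_features` are hypothetical.

```python
# Tokens and POS tags copied from the earlier NLTK output.
tokens = ["The", "cat", "sat", "on", "the", "mat", "."]
pos_tags = ["DT", "NN", "VBD", "IN", "DT", "NN", "."]

def one_hot(index, size):
    """Return a list of zeros of length `size` with a 1 at `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

# dict.fromkeys keeps first-seen order, so the indices are deterministic.
word_vocab = {w: i for i, w in enumerate(dict.fromkeys(tokens))}
tag_vocab = {t: i for i, t in enumerate(dict.fromkeys(pos_tags))}

# Each node feature is [word one-hot | POS-tag one-hot].
node_features = [
    one_hot(word_vocab[w], len(word_vocab)) + one_hot(tag_vocab[t], len(tag_vocab))
    for w, t in zip(tokens, pos_tags)
]

print(len(node_features), "nodes,", len(node_features[0]), "features each")
```

With this layout, "cat" and "mat" have different word components but share the same POS component, which gives the model a signal that the two words play a similar grammatical role.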



#### Code Example: Feature Extraction


In [3]:
# Define a simple vocabulary for one-hot encoding (includes the period)
vocab = ["The", "cat", "sat", "on", "the", "mat", "."]  # Case-sensitive: "The" and "the" are separate entries
# Create a dictionary that maps each word to a unique index
vocab_dict = {word: i for i, word in enumerate(vocab)}

# One-hot encode tokens based on the vocabulary
one_hot_features = []
for token in tokens:
    # Initialize a zero vector with the length of the vocabulary
    feature = [0] * len(vocab)
    # Set the index corresponding to the token to 1
    # Check if the token exists in the vocab before accessing it
    if token in vocab_dict:
        feature[vocab_dict[token]] = 1
    # Append the one-hot encoded feature vector for each token
    one_hot_features.append(feature)

# Print the one-hot encoded features for each token in the sentence
print("One-Hot Features:", one_hot_features)

One-Hot Features: [[1, 0, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0, 0], [0, 0, 1, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0], [0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 0, 0, 0, 1]]


**Explanation**:
1. **Vocabulary Dictionary**:
   - Maps each unique word to an index for one-hot encoding.
   
2. **One-Hot Encoding**:
   - Creates a vector where only the index of the current token is set to `1`.
   - Each token in the sentence is transformed into a one-hot encoded vector based on the predefined vocabulary.

3. **Output**:
   - Displays the one-hot encoded representation for each word in `tokens`, allowing numerical processing in models.


### 4. Building the Adjacency Matrix

The **adjacency matrix** represents relationships between words in the tree. Each entry \( A_{ij} \) is 1 if word \( i \) and word \( j \) are connected (for example, by a dependency relation) and 0 otherwise.



#### Steps to Create the Adjacency Matrix:
1. Use dependency parsing to determine relationships between words.
2. Create a square matrix where rows and columns correspond to words.
3. Populate the matrix based on dependency relations from the parsed tree.



#### Code Example: Building an Adjacency Matrix


In [4]:
import numpy as np

# Initialize a zero adjacency matrix with one row and column per token (7 here, including the period)
adj_matrix = np.zeros((len(tokens), len(tokens)), dtype=int)

# Define edges manually as (start_node, end_node) pairs
# These pairs simply link each word to the next one as a placeholder; they are not parser output
# The final period (index 6) is left unconnected
dependencies = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]  # Placeholder edges for the sentence

# Populate the adjacency matrix based on defined dependencies
for i, j in dependencies:
    adj_matrix[i][j] = 1  # Create an edge from node i to node j
    adj_matrix[j][i] = 1  # Add the reverse edge so the graph is undirected

# Print the adjacency matrix to visualize token dependencies
print("Adjacency Matrix:")
print(adj_matrix)


Adjacency Matrix:
[[0 1 0 0 0 0 0]
 [1 0 1 0 0 0 0]
 [0 1 0 1 0 0 0]
 [0 0 1 0 1 0 0]
 [0 0 0 1 0 1 0]
 [0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0]]



#### Explanation:
1. **Adjacency Matrix**:
   - Initializes a zero matrix where each token is treated as a graph node.
   
2. **Dependencies**:
   - Manually defines edges as a stand-in for a syntactic structure, with each tuple `(i, j)` representing a connection between tokens. Here the edges form a simple chain from each word to the next, not a true dependency parse.
   
3. **Bidirectional Edges**:
   - Sets both `A[i][j]` and `A[j][i]` to 1, which makes the matrix symmetric so that information can flow in both directions along each edge.

4. **Output**:
   - Prints the adjacency matrix, displaying the token-to-token connections in matrix form. The last row and column are all zeros because the period has no edges.
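
The manually defined pairs above link each word to its neighbor. With a dependency parse, each word would instead be linked to its syntactic head. The sketch below assumes a hand-annotated list of head indices (not parser output) for the same sentence; `heads` and `dep_adj` are hypothetical names.

```python
import numpy as np

tokens = ["The", "cat", "sat", "on", "the", "mat", "."]

# heads[i] is the index of token i's head; -1 marks the root ("sat").
# Hand-annotated for illustration; a dependency parser would supply this list.
heads = [1, 2, -1, 2, 5, 3, 2]

n = len(tokens)
dep_adj = np.zeros((n, n), dtype=int)
for child, head in enumerate(heads):
    if head >= 0:                 # The root has no incoming edge
        dep_adj[child, head] = 1  # Edge between child and head
        dep_adj[head, child] = 1  # Mirror it for an undirected graph

print(dep_adj)
```

In this version the verb "sat" is connected to three words ("cat", "on", and the period), so it acts as a hub, whereas the chain treats every word the same way.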


### 5. Code Walkthrough: Putting It All Together

Let’s combine the steps above into a full pipeline that takes a sentence, generates a tree, extracts features, and builds an adjacency matrix.


In [6]:
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag, ne_chunk
import numpy as np

# Download necessary NLTK resources (only needs to be run once)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

def process_sentence(sentence):
    # Step 1: Tokenize the sentence into individual words
    tokens = word_tokenize(sentence)

    # Step 2: Perform POS tagging to assign each word a grammatical category
    pos_tags = pos_tag(tokens)

    # Step 3: Named-entity chunking (the tree is built for illustration and is not returned)
    tree = ne_chunk(pos_tags)

    # Step 4: Create one-hot features for each token based on the sentence's unique words
    vocab = list(set(tokens))  # Unique words; set order is arbitrary, so indices can differ between runs
    vocab_dict = {word: i for i, word in enumerate(vocab)}  # Map each word to a unique index
    one_hot_features = []
    for token in tokens:
        # Create a one-hot encoded vector for each token
        feature = [0] * len(vocab)
        feature[vocab_dict[token]] = 1
        one_hot_features.append(feature)

    # Step 5: Define an adjacency matrix from placeholder edges (not parser output)
    adj_matrix = np.zeros((len(tokens), len(tokens)), dtype=int)  # Initialize a zero matrix
    dependencies = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]  # Hard-coded for the example sentence
    for i, j in dependencies:
        if i < len(tokens) and j < len(tokens):  # Skip edges that fall outside a shorter sentence
            adj_matrix[i][j] = 1  # Set edge from node i to node j
            adj_matrix[j][i] = 1  # Symmetric edge for undirected graph representation

    # Output the processed components
    return tokens, one_hot_features, adj_matrix

# Example sentence
sentence = "The cat sat on the mat."
tokens, features, adj_matrix = process_sentence(sentence)

# Display the results
print("Tokens:", tokens)
print("One-Hot Features:", features)
print("Adjacency Matrix:\n", adj_matrix)


Tokens: ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']
One-Hot Features: [[0, 0, 0, 0, 0, 0, 1], [0, 0, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1, 0], [0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 1, 0, 0], [0, 1, 0, 0, 0, 0, 0]]
Adjacency Matrix:
 [[0 1 0 0 0 0 0]
 [1 0 1 0 0 0 0]
 [0 1 0 1 0 0 0]
 [0 0 1 0 1 0 0]
 [0 0 0 1 0 1 0]
 [0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0]]
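
As a rough sanity check that a feature matrix and an adjacency matrix of these shapes fit together, the sketch below runs one round of neighbor averaging. It is a simplification under stated assumptions (self-loops plus mean aggregation, no learned weights), not a full GCN layer; `X`, `A`, `A_hat`, and `H` are illustrative names, and `X` uses the identity-style one-hot features from the feature extraction example.

```python
import numpy as np

n = 7
X = np.eye(n)  # One-hot node features, one row per token

# Same placeholder chain edges as in the walkthrough; the period (index 6) is isolated.
A = np.zeros((n, n))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]:
    A[i, j] = A[j, i] = 1

A_hat = A + np.eye(n)                   # Add self-loops so each node keeps its own features
deg = A_hat.sum(axis=1, keepdims=True)  # Number of neighbors, counting the node itself
H = (A_hat / deg) @ X                   # Each row is the mean of a node's neighborhood features

print(H.round(2))
```

After this step, each row mixes a token's own features with those of its neighbors, while the isolated period keeps its original vector unchanged.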




### Summary and Key Takeaways

- **Tree Structure Representation**: Transforming sentences into syntactic trees captures linguistic relationships that can be leveraged by GCNs.
- **Node Features**: Basic one-hot encoding is simple, but POS tags or word embeddings can enhance the feature space.
- **Adjacency Matrix**: This matrix defines the connections that GCNs use for message passing. The examples here use manually defined placeholder edges; in practice the matrix is constructed from dependency relations.
- **GCN-Ready Data**: After creating the feature matrix and adjacency matrix, the data is ready for GCN processing in subsequent sections.

In the next section, we’ll apply the generated trees and adjacency matrices in a GCN model to explore how it can perform NLP tasks like sentence classification or translation.