# Extracting fine-grained grammatical features with POS and dependency information

This notebook walks you through parsing text and identifying grammatical constructions using POS and dependency information.
Most of the functionality (except the actual extraction rules) is already coded for you.
This notebook does not assume prior knowledge of Python, but it does require an understanding of the course materials, particularly POS and dependency labels and how grammatical structure is operationalized.

Now let me explain the approach we will take in this demo.

# Design of the current pipeline


The current analysis pipeline is a very basic one in the sense that **no prior coding experience or skill** is required.
It is meant for educational purposes, so you may want to use more sophisticated tools such as TAASSC when conducting actual studies.

However, if you understand the following design principles well, you will have a good grasp of what grammatical dependency analysis with the spaCy Python package can offer. You can then start thinking more deeply about its implications for your research.

## Algorithm used in this notebook

In this notebook, the following analysis pipeline is implemented for you.

- Your input is the `file path` to your corpus files.
- The current code loads the corpus files onto Colab.
- It then iterates through the corpus files one by one to do the following:
  - Parse the sentences using spaCy
  - Conduct basic analysis (such as calculating the number of tokens, sentences, etc.)
  - **Count the number of specific grammatical structures** (**MAIN FEATURE**)
  - Store the results in a Python dictionary
- After every corpus file is processed, it can create a dataset to export.
- You can export the results for further analysis.



## What you are expected to do

You are expected to do the following:

- Articulate extraction rules that can be used to identify the desired grammatical structures
- Identify which extraction template functions to use in order to extract each grammatical structure
- Follow the instructions in this notebook to run the analysis.

# Loading necessary packages



Okay. Let's start! First we will import the necessary packages.

In [23]:
import os
import spacy  # spaCy for language analysis
import pandas as pd  # import pandas and name it `pd`. pandas allows easy handling of tabular datasets.

In [24]:
# Initialize spaCy model
nlp = spacy.load("en_core_web_sm")

## Define functions

In the following cells, I define the functions this pipeline needs in order to work.
Let me know if you are curious about what each one does.

In [25]:
def load_file(filepath: str):
    """Load text from file"""
    with open(filepath, 'r', encoding='utf-8') as f:
        return f.read()


def find_filename(filename: str):
    """Return only the file name, dropping the folder part of the path"""
    return os.path.basename(filename)


def update_results(index_name: str, result_dictionary):
    if index_name in result_dictionary:
        result_dictionary[index_name] += 1
    else:
        result_dictionary[index_name] = 1
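
To see how `update_results` accumulates counts, here is a minimal sketch. The index names are borrowed from the rules later in this notebook, and the `counts` dictionary exists only for this illustration.

```python
def update_results(index_name: str, result_dictionary):
    # Same logic as the helper above: start at 1, then add 1 on each later call.
    if index_name in result_dictionary:
        result_dictionary[index_name] += 1
    else:
        result_dictionary[index_name] = 1


counts = {}  # illustrative dictionary, not part of the pipeline
for _ in range(3):
    update_results("adjectival_modifier", counts)
update_results("direct_object", counts)

print(counts)  # {'adjectival_modifier': 3, 'direct_object': 1}
```

Note that a key is created only once a structure has been found at least once, so a structure that never occurs in a file is absent from the dictionary rather than stored as 0.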

In [26]:
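def run_basic_stats(doc):
    """Return basic size statistics for a parsed spaCy document"""
    basic_stats = {}

    basic_stats["nToken"] = len(doc)  # number of tokens, punctuation included
    basic_stats["nSentence"] = len(list(doc.sents))  # number of sentences found by the parser
    return basic_stats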

# Defining Extraction template functions

### Basic extraction rules (difficulty ★☆☆)

The first extraction function is a very simple rule that identifies a construction by:

- one dependency label only

For example, this can identify `amod` (adjectival modifier).

In [27]:
def extract_by_simple_dependency(result_dictionary: dict, token, dep_rel: str, index_name: str):
    """Count the token under `index_name` when its dependency label equals `dep_rel`"""
    if token.dep_ == dep_rel: #if the dependency label is the same as the specified one.
        update_results(index_name, result_dictionary)
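
The sketch below shows what this rule counts. So that it runs without loading a model, it uses hand-annotated stand-in tokens built with `SimpleNamespace`; the sentence and its labels are assumptions for illustration, not parser output. In the notebook, the same function receives real spaCy tokens.

```python
from types import SimpleNamespace


def update_results(index_name, result_dictionary):
    result_dictionary[index_name] = result_dictionary.get(index_name, 0) + 1


def extract_by_simple_dependency(result_dictionary, token, dep_rel, index_name):
    if token.dep_ == dep_rel:
        update_results(index_name, result_dictionary)


# Hand-annotated stand-ins for "The big red car stopped" (assumed labels).
tokens = [
    SimpleNamespace(text="The", dep_="det"),
    SimpleNamespace(text="big", dep_="amod"),
    SimpleNamespace(text="red", dep_="amod"),
    SimpleNamespace(text="car", dep_="nsubj"),
    SimpleNamespace(text="stopped", dep_="ROOT"),
]

demo_results = {}
for token in tokens:
    extract_by_simple_dependency(demo_results, token, dep_rel="amod", index_name="adjectival_modifier")

print(demo_results)  # {'adjectival_modifier': 2}
```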

In [28]:
def extract_by_pos(result_dictionary: dict, token, pos: str, index_name: str):

    if token.pos_ == pos:
        update_results(index_name, result_dictionary)

In [29]:
def extract_by_tag(result_dictionary: dict, token, tag: str, index_name: str):

    if token.tag_ == tag:
        update_results(index_name, result_dictionary)
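
The two functions above differ only in which label they check. In spaCy, `token.pos_` holds the coarse-grained universal POS (such as `VERB`), while `token.tag_` holds the fine-grained tag (for English models, Penn Treebank-style tags such as `VBZ` or `VBD`). The sketch below contrasts the two on hand-annotated stand-in tokens; the sentence and labels are assumptions for illustration.

```python
from types import SimpleNamespace


def update_results(index_name, result_dictionary):
    result_dictionary[index_name] = result_dictionary.get(index_name, 0) + 1


def extract_by_pos(result_dictionary, token, pos, index_name):
    if token.pos_ == pos:
        update_results(index_name, result_dictionary)


def extract_by_tag(result_dictionary, token, tag, index_name):
    if token.tag_ == tag:
        update_results(index_name, result_dictionary)


# Hand-annotated stand-ins for "She walks and they walked" (assumed labels).
tokens = [
    SimpleNamespace(text="She", pos_="PRON", tag_="PRP"),
    SimpleNamespace(text="walks", pos_="VERB", tag_="VBZ"),
    SimpleNamespace(text="and", pos_="CCONJ", tag_="CC"),
    SimpleNamespace(text="they", pos_="PRON", tag_="PRP"),
    SimpleNamespace(text="walked", pos_="VERB", tag_="VBD"),
]

demo_results = {}
for token in tokens:
    extract_by_pos(demo_results, token, pos="VERB", index_name="Verbs")
    extract_by_tag(demo_results, token, tag="VBZ", index_name="Third-Person Singular Verbs")

print(demo_results)  # {'Verbs': 2, 'Third-Person Singular Verbs': 1}
```

Both verbs match the coarse label, but only "walks" matches the fine-grained tag, so use `extract_by_tag` when the distinction you need (tense, number, and so on) is not visible at the POS level.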

In [30]:
def extract_by_dependency_and_head_pos(result_dictionary: dict, token, dep_rel: str, head_pos: str, index_name: str):
    """
    Extract token when it has specific dependency relation AND its head has specific POS
    Example: Find adjectival modifiers (amod) whose head is a NOUN
    """
    if token.dep_ == dep_rel and token.head.pos_ == head_pos:
        update_results(index_name, result_dictionary)
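
To illustrate how the head condition narrows a rule, the sketch below counts only those `amod` tokens whose head is a `NOUN`. The stand-in tokens are hand-annotated for illustration (assumed labels, not parser output); only the attributes the function reads are provided.

```python
from types import SimpleNamespace


def update_results(index_name, result_dictionary):
    result_dictionary[index_name] = result_dictionary.get(index_name, 0) + 1


def extract_by_dependency_and_head_pos(result_dictionary, token, dep_rel, head_pos, index_name):
    if token.dep_ == dep_rel and token.head.pos_ == head_pos:
        update_results(index_name, result_dictionary)


# Stand-in heads: a noun ("car") and a pronoun ("something").
car = SimpleNamespace(text="car", pos_="NOUN")
something = SimpleNamespace(text="something", pos_="PRON")

tokens = [
    SimpleNamespace(text="big", dep_="amod", head=car),         # "big car"
    SimpleNamespace(text="nice", dep_="amod", head=something),  # "something nice"
]

demo_results = {}
for token in tokens:
    extract_by_dependency_and_head_pos(
        demo_results, token, dep_rel="amod", head_pos="NOUN", index_name="amod_with_noun_head"
    )

print(demo_results)  # {'amod_with_noun_head': 1}
```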


In [31]:
def extract_by_dependency_and_child_features(result_dictionary: dict, token, dep_rel: str, child_lemma: str, child_pos: str, index_name: str):
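    """
    Extract token when it has specific dependency AND has a child with specific lemma and POS
    Example: Find complement clauses (ccomp) that contain the complementizer "that" (child lemma "that", child POS SCONJ)
    """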
    if token.dep_ == dep_rel:
        for child in token.children:
            if child.lemma_ == child_lemma and child.pos_ == child_pos:
                update_results(index_name, result_dictionary)
                break  # Found one match, don't count multiple times
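
As an illustration of a child-based rule, the sketch below counts complement clauses (`ccomp`) that contain the complementizer "that". The stand-in tokens are hand-annotated (assumed labels, not parser output), and the index name `that_complement_clause` is made up for this example.

```python
from types import SimpleNamespace


def update_results(index_name, result_dictionary):
    result_dictionary[index_name] = result_dictionary.get(index_name, 0) + 1


def extract_by_dependency_and_child_features(result_dictionary, token, dep_rel, child_lemma, child_pos, index_name):
    if token.dep_ == dep_rel:
        for child in token.children:
            if child.lemma_ == child_lemma and child.pos_ == child_pos:
                update_results(index_name, result_dictionary)
                break  # count each matching token only once


that = SimpleNamespace(lemma_="that", pos_="SCONJ")
he = SimpleNamespace(lemma_="he", pos_="PRON")

# "I think that he left": the clause verb "left" has "that" among its children.
left_with_that = SimpleNamespace(text="left", dep_="ccomp", children=[that, he])
# "I think he left": same clause type, but no complementizer.
left_without_that = SimpleNamespace(text="left", dep_="ccomp", children=[he])

demo_results = {}
for token in [left_with_that, left_without_that]:
    extract_by_dependency_and_child_features(
        demo_results, token, dep_rel="ccomp", child_lemma="that", child_pos="SCONJ",
        index_name="that_complement_clause",
    )

print(demo_results)  # {'that_complement_clause': 1}
```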


## Change the rules in `run_grammar_extraction` below!



In [32]:
def run_grammar_extraction(doc, filepath: str):
    """Apply every extraction rule below to each token and return the counts for one document"""
    # Start from an empty dictionary for each document (`filepath` is kept for reference only)
    extraction_results = {}

    for token in doc:
        # Extract simple dependency-based features
        extract_by_simple_dependency(extraction_results, token, dep_rel="amod", index_name="adjectival_modifier")
        extract_by_simple_dependency(extraction_results, token, dep_rel="dobj", index_name="direct_object")
        ## Add more rules here

        # Extract POS-based features
        extract_by_pos(extraction_results, token, pos="NOUN", index_name="Nouns")
        ## Add more rules here

        # Extract tag-based features
        extract_by_tag(extraction_results, token, tag="VBZ", index_name="Third-Person Singular Verbs")
        ## Add more rules here


    return extraction_results


## Wanna test?

You can test whether you get the desired result by passing an example sentence through your rules.
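
One way to do this is to pick a short sentence whose counts you can verify by hand, look at the labels each token received, and compare the rule's count with your expectation. The sketch below uses two hypothetical helpers, `describe` and `count_by_dependency`, which are not part of this notebook. To run without a model, it works on hand-annotated stand-in tokens (assumed labels, not parser output).

```python
from types import SimpleNamespace


def describe(tokens):
    """Return one (text, pos, tag, dep, head text) row per token for inspection."""
    return [(t.text, t.pos_, t.tag_, t.dep_, t.head.text) for t in tokens]


def count_by_dependency(tokens, dep_rel):
    """Count tokens carrying the given dependency label."""
    return sum(1 for t in tokens if t.dep_ == dep_rel)


# Hand-annotated stand-ins for "She eats red apples" (assumed labels).
eats = SimpleNamespace(text="eats", pos_="VERB", tag_="VBZ", dep_="ROOT")
eats.head = eats  # in spaCy, the root token is its own head
apples = SimpleNamespace(text="apples", pos_="NOUN", tag_="NNS", dep_="dobj", head=eats)
she = SimpleNamespace(text="She", pos_="PRON", tag_="PRP", dep_="nsubj", head=eats)
red = SimpleNamespace(text="red", pos_="ADJ", tag_="JJ", dep_="amod", head=apples)
sentence = [she, eats, red, apples]

for row in describe(sentence):
    print(row)

# Compare each rule's count with what you expect from reading the sentence.
amod_count = count_by_dependency(sentence, "amod")
dobj_count = count_by_dependency(sentence, "dobj")
print(amod_count, dobj_count)  # 1 1
```

In the notebook, both helpers should also accept a parsed document directly, for example `describe(nlp("She eats red apples."))`, because iterating over a spaCy `Doc` yields its tokens.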

### Run the main analysis loop

This is where the processing happens.

In [33]:
# Main processing loop
def main(CORPUS_FILES):
    results = {}

    for file in CORPUS_FILES:  # Process every file in the list
        # 1. Load the corpus file
        text = load_file(file)
        filename = find_filename(file)
        
        # 2. Parse the sentences
        doc = nlp(text)
        
        # 3. run extraction pipeline
        basic_stats = run_basic_stats(doc)
        result = run_grammar_extraction(doc, filename)
        
        # 4. Append results
        results[filename] = basic_stats
        results[filename].update(result)
        
        print(f"Processed: {file}")
    return results

In [34]:
CORPUS_FILES = ["../../corpus_data/brown_single/ca_ca01.txt"]
results = main(CORPUS_FILES)

Processed: ../../corpus_data/brown_single/ca_ca01.txt
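
To process a whole folder rather than a single file, the list of paths can be built with `glob` from the standard library. This is a minimal sketch, assuming the corpus is a flat folder of `.txt` files; the temporary folder and file names are made up so that the example runs anywhere.

```python
import glob
import os
import tempfile

# Create a throwaway folder with a few tiny files (illustration only).
corpus_dir = tempfile.mkdtemp()
for name in ["b.txt", "a.txt", "notes.md"]:
    with open(os.path.join(corpus_dir, name), "w", encoding="utf-8") as f:
        f.write("A short text.")

# Collect every .txt file, sorted so the processing order is reproducible.
corpus_files = sorted(glob.glob(os.path.join(corpus_dir, "*.txt")))
print([os.path.basename(p) for p in corpus_files])  # ['a.txt', 'b.txt']
```

The resulting list can then be passed to `main` in place of the single-file list above, with `corpus_dir` replaced by the path to your own corpus folder.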


Check the results

In [35]:
results

{'ca_ca01.txt': {'nToken': 2376,
  'nSentence': 88,
  'Nouns': 458,
  'adjectival_modifier': 112,
  'direct_object': 95,
  'Third-Person Singular Verbs': 34}}

## Write the results into dataset

In [36]:
def results_to_dataframe(results):
    """
    Convert the results dictionary to a pandas DataFrame with one row per file
    
    Parameters:
    - results: dictionary mapping each filename to its dictionary of
      basic statistics and grammatical structure counts
    """
    rows = []
    
    for filename, grammar_info in results.items():
            
        row = {
            "filename": filename,
              }
        
        for key, value in grammar_info.items():
            print(key, value)
            row.update({key:value})
        
        rows.append(row)
    
    # Create DataFrame (one row per file, one column per index)
    df = pd.DataFrame(rows)
    
    return df

In [37]:
df = results_to_dataframe(results)
df

nToken 2376
nSentence 88
Nouns 458
adjectival_modifier 112
direct_object 95
Third-Person Singular Verbs 34


|   | filename | nToken | nSentence | Nouns | adjectival_modifier | direct_object | Third-Person Singular Verbs |
|---|---|---|---|---|---|---|---|
| 0 | ca_ca01.txt | 2376 | 88 | 458 | 112 | 95 | 34 |
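
When several files are processed, a structure that never occurs in one file has no key in that file's dictionary, so its cell in the DataFrame is empty (`NaN`). The sketch below, using made-up results for two hypothetical files, fills those gaps with 0 and writes the table to a CSV file for further analysis. The output file name and temporary folder are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical results for two files; the second file has no direct objects,
# so the "direct_object" key is missing from its dictionary.
demo_results = {
    "file_a.txt": {"nToken": 100, "nSentence": 5, "direct_object": 4},
    "file_b.txt": {"nToken": 80, "nSentence": 4},
}

rows = [{"filename": name, **info} for name, info in demo_results.items()]
demo_df = pd.DataFrame(rows)

# Missing counts appear as NaN; treat them as zero occurrences.
demo_df = demo_df.fillna(0)
demo_df["direct_object"] = demo_df["direct_object"].astype(int)

# Write the table without the row index column.
out_path = os.path.join(tempfile.mkdtemp(), "grammar_results.csv")
demo_df.to_csv(out_path, index=False)
print(demo_df)
```

Passing `index=False` keeps the row numbers out of the file, which avoids an unnamed index column when the CSV is read back in.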
