# Linguistic Feature Calculation
This notebook calculates linguistic features from the processed and formatted transcripts. For the full list of features and their definitions, see [this document](README.md). The files output by this notebook contain more features than are used in classification; [this notebook](feature_processing.ipynb) cleans them to produce a final file containing only the included features.

In [87]:
# TODO: I'd like to add further comments in the code to describe the functions, as well as specific examples for each one (when applicable)

In [88]:
# This cell can be skipped if the needed libraries are already installed
# !pip install transformers
# !pip install minicons

In [89]:
# Importing necessary libraries
import warnings
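# The message argument is a regex matched against the start of the warning text, not the category name
warnings.filterwarnings("ignore", message=r"urllib3 v2.*only supports OpenSSL")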
warnings.filterwarnings("ignore", message = r"\[W007\].*", category = UserWarning)
from typing import Tuple
import numpy as np
import pandas as pd
import re
import spacy
import gensim.downloader as api
from gensim.models import KeyedVectors
from minicons import scorer
import csv
import textdescriptives as td
import contractions
import stanza

### Data Loading
This section loads the narrative data needed for the rest of the notebook: the text transcripts in three formats (cleaned with punctuation, cleaned without punctuation, and separated by utterance). It also loads the language models used in the feature calculations (SpaCy, word2vec, the minicons GPT-2 surprisal scorer, and Stanza).

In [90]:
# Loading the processed transcripts (cleaned and formatted, but retains original punctuation for SpaCy tokenization)
transcripts_punc = pd.read_csv('processed_data/transcripts_spacy_formatted.csv')

# Creating a linguistic feature DataFrame that includes coherence scores
linguistic_features = transcripts_punc[['Coherence']].copy()

In [91]:
# Loading the transcripts with no punctuation
transcripts_no_punc = pd.read_csv('processed_data/transcripts_no_punc_formatted.csv', index_col = 0)

In [92]:
# Loading the utterance-based transcriptions into a list
story_lists = []
with open('processed_data/sentences.csv', 'r', newline='') as csv_file:
    csv_reader = csv.reader(csv_file)
    for row in csv_reader:
        story_lists.append(row)
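
To illustrate the structure that `story_lists` ends up with, the sketch below reads a small in-memory CSV in the same way. The file contents are hypothetical, and the layout (one row per transcript, one field per utterance) is an assumption based on how `story_lists` is used later in this notebook.

```python
import csv
import io

# Hypothetical file contents: one row per transcript, one field per utterance
example_csv = 'The boy had a frog.,"One night, the frog escaped."\nThe dog barked.\n'

# Reading the rows exactly as the cell above does, but from a string buffer
example_story_lists = []
with io.StringIO(example_csv, newline='') as csv_file:
    for row in csv.reader(csv_file):
        example_story_lists.append(row)

# Rows may have different lengths because transcripts differ in utterance count
print(example_story_lists)
```

Quoted fields keep utterance-internal commas intact, so the second utterance of the first transcript stays a single list element.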

In [93]:
# Loading the SpaCy language processing model
nlp = spacy.load("en_core_web_sm")

In [94]:
# Loading the word2vec model
path = api.load("word2vec-google-news-300", return_path=True)
sem_sim_model = KeyedVectors.load_word2vec_format(path, binary=True)

In [95]:
# Loading the model for surprisal
surprisal_model = scorer.IncrementalLMScorer('gpt2')

In [96]:
# Loading the stanza model
stanza.download('en')
stanza_model = stanza.Pipeline(lang='en', processors='tokenize, pos, constituency')

2025-12-04 19:09:28 INFO: Downloaded file to /Users/emicatx/stanza_resources/resources.json
2025-12-04 19:09:28 INFO: Downloading default packages for language: en (English) ...
2025-12-04 19:09:29 INFO: File exists: /Users/emicatx/stanza_resources/en/default.zip
2025-12-04 19:09:32 INFO: Finished downloading models and saved to /Users/emicatx/stanza_resources
2025-12-04 19:09:32 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES


2025-12-04 19:09:32 INFO: Downloaded file to /Users/emicatx/stanza_resources/resources.json
2025-12-04 19:09:33 INFO: Loading these models for language: en (English):
| Processor    | Package             |
--------------------------------------
| tokenize     | combined            |
| mwt          | combined            |
| pos          | combined_charlm     |
| constituency | ptb3-revised_charlm |

2025-12-04 19:09:33 INFO: Using device: cpu
2025-12-04 19:09:33 INFO: Loading: tokenize
2025-12-04 19:09:33 INFO: Loading: mwt
2025-12-04 19:09:33 INFO: Loading: pos
2025-12-04 19:09:34 INFO: Loading: constituency
2025-12-04 19:09:35 INFO: Done loading processors!


### Feature Calculations
This section contains functions and code to calculate each linguistic feature. It uses SpaCy tokenization, Stanza parsing, and the textdescriptives library to determine lexical, syntactic, and macrolinguistic metrics.

In [97]:
def part_of_speech_features(input_df: pd.DataFrame, feature_df: pd.DataFrame) -> pd.DataFrame:

    """ 
        Calculates the proportion of part-of-speech tags, open-closed class ratio, and propositional density
        in a transcript
        
        Parameters 
        ----------
        input_df: pd.DataFrame
            Input DataFrame with column storing transcripts as 'Text'
        feature_df: pd.DataFrame
            DataFrame storing coherence scores and linguistic features

        Returns
        -------
        pd.DataFrame
            Feature DataFrame with columns for part-of-speech proportions, open-closed class ratio, and 
            propositional density
    """

    pos_tags = ['ADJ', 'ADP', 'ADV', 'AUX', 'CCONJ', 'DET', 'INTJ', 'NOUN', 'NUM', 'PART', 'PRON', 'PROPN', 'PUNCT', 'SCONJ', 'SPACE', 'SYM', 'VERB', 'X']
    open_class_tags = ['NOUN', 'VERB', 'ADJ', 'ADV']
    closed_class_tags = ['ADP', 'AUX', 'CCONJ', 'DET', 'INTJ', 'NUM', 'PART', 'PRON', 'SCONJ']
    propositional_density_tags = ['ADJ', 'ADV', 'ADP', 'CCONJ', 'SCONJ', 'VERB']

    for i in feature_df.index:

        string = input_df['Text'].loc[i]
        doc = nlp(string)
        token_tags = [token.pos_ for token in doc]
        n_tokens = len(token_tags)

        # Skipping empty transcripts rather than dividing by zero below
        if n_tokens == 0:
            print(f"Index {i} has no tokens; check input")
            continue

        open_class = sum(tag in open_class_tags for tag in token_tags)
        closed_class = sum(tag in closed_class_tags for tag in token_tags)
        prop_density = sum(tag in propositional_density_tags for tag in token_tags)

        # Proportion of tokens carrying each part-of-speech tag
        for pos in pos_tags:
            feature_df.loc[i, f'POS:{pos}'] = token_tags.count(pos) / n_tokens

        # Ratio of open-class to closed-class words
        try:
            feature_df.loc[i, 'open_closed_ratio'] = open_class / closed_class
        except ZeroDivisionError:
            print(f"Index {i} has no closed class words; check input")

        # Proportion of tokens carrying a propositional density tag
        feature_df.loc[i, 'propositional_density'] = prop_density / n_tokens

    return feature_df
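
As a worked example of the three calculations above, the sketch below applies the same tag groupings to a hand-tagged sentence, so it runs without loading a SpaCy model. The tags in `example_tags` are hypothetical and assigned by hand; a real tagger may label the sentence differently.

```python
from collections import Counter

# Same tag groupings as in part_of_speech_features
OPEN_CLASS = {'NOUN', 'VERB', 'ADJ', 'ADV'}
CLOSED_CLASS = {'ADP', 'AUX', 'CCONJ', 'DET', 'INTJ', 'NUM', 'PART', 'PRON', 'SCONJ'}
PROPOSITIONAL = {'ADJ', 'ADV', 'ADP', 'CCONJ', 'SCONJ', 'VERB'}

# Hypothetical hand-assigned tags for "The dog quickly chased the cat ."
example_tags = ['DET', 'NOUN', 'ADV', 'VERB', 'DET', 'NOUN', 'PUNCT']
n_tokens = len(example_tags)

# Proportion of tokens per tag (punctuation counts toward the denominator)
pos_proportions = {tag: count / n_tokens for tag, count in Counter(example_tags).items()}

n_open = sum(tag in OPEN_CLASS for tag in example_tags)        # NOUN, ADV, VERB, NOUN
n_closed = sum(tag in CLOSED_CLASS for tag in example_tags)    # DET, DET
n_prop = sum(tag in PROPOSITIONAL for tag in example_tags)     # ADV, VERB

open_closed_ratio = n_open / n_closed          # 4 / 2
propositional_density = n_prop / n_tokens      # 2 / 7

print(open_closed_ratio, round(propositional_density, 3))
```

Because every token, including punctuation, is counted in the denominator, the proportions depend on how the transcript was punctuated.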

In [98]:
# Applying the part-of-speech feature function to the transcripts cleaned with punctuation
linguistic_features = part_of_speech_features(transcripts_punc, linguistic_features)

In [99]:
def logical_operators(input_lists: list, feature_df: pd.DataFrame) -> pd.DataFrame:
    
    """ 
        Calculates the sum of logical operators (and, or, not, if...then) in an utterance, summed
        across all utterances in a transcript
        
        Parameters 
        ----------
        input_lists: list
            Input lists with each sub-list containing story transcripts separated by utterance
        feature_df: pd.DataFrame
            DataFrame storing coherence scores and linguistic features
            
        Returns
        -------
        pd.DataFrame
            Feature DataFrame with column for the number of logical operators
    """

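    # Mid-utterance operators must be preceded by a space and followed by punctuation or
    # whitespace; the word boundary keeps utterance-initial names such as "Andrew" from counting
    logical_operator_list = [
        (r'^And\b'),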
        (r' and[!?\-,.:;\'"\s]'),
        (r' not[!?\-,.:;\'"\s]'),
        (r' or[!?\-,.:;\'"\s]'),
        (r' (if) (.*?)(then)[!?\-…,.:;\'"\s]')
    ]

    for i, story in enumerate(input_lists):
        logical_operator_number = []
        for string in story:
            for pattern in logical_operator_list:
                match = re.findall(pattern, string, flags = re.IGNORECASE)
                n_operators = len(match)
                logical_operator_number.append(n_operators)
        feature_df.loc[i, 'n_logical_operators'] = np.sum(logical_operator_number)

    return feature_df

In [100]:
# Applying the logical operators function to the transcripts separated by utterance
linguistic_features = logical_operators(story_lists, linguistic_features)
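
The counting logic can be checked on a few utterances. The sketch below uses patterns modeled on `logical_operator_list`; the helper name `count_logical_operators`, the word boundary on the first pattern, and the example utterances are assumptions made for illustration.

```python
import re

# Patterns modeled on logical_operator_list in the function above
patterns = [
    r'^and\b',
    r' and[!?\-,.:;\'"\s]',
    r' not[!?\-,.:;\'"\s]',
    r' or[!?\-,.:;\'"\s]',
    r' if .*?then[!?\-,.:;\'"\s]',
]

def count_logical_operators(utterances):
    """Sum the pattern matches over every utterance of one transcript."""
    return sum(
        len(re.findall(pattern, utterance, flags=re.IGNORECASE))
        for utterance in utterances
        for pattern in patterns
    )

# Hypothetical transcript split into utterances
example_story = [
    "And the boy did not see the frog.",                                             # and, not
    "He looked in the jar or under the bed, and if it was gone then he would cry.",  # or, and, if...then
    "Andrew went home.",                                                             # no operators
]

n_operators = count_logical_operators(example_story)
print(n_operators)
```

Note that only "and" is matched at the start of an utterance; an utterance-initial "If", "Or", or "Not" has no preceding space and is therefore not counted by these patterns.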

In [101]:
def string_ttr(string: str) -> Tuple[float, int]:

    """ 
        Calculates the type-token ratio (number of unique words divided by total number of words)
        of a string
        
        Parameters 
        ----------
        string: str
            String containing text without punctuation
            
        Returns
        -------
        Tuple[float, int]
            Value of type-token ratio and number of tokens in the string (NaN and 0 if the
            string contains no tokens)
    """

    # Splitting on any whitespace so repeated spaces do not create empty tokens
    tokens = string.lower().split()
    n_tokens = len(tokens)

    if n_tokens == 0:
        print("Input string has no tokens (empty string)")
        return np.nan, 0

    # Types are the unique tokens, so the ratio always falls between 0 and 1
    ttr = len(set(tokens)) / n_tokens

    return ttr, n_tokens


def word_count_features(input_df: pd.DataFrame, feature_df: pd.DataFrame) -> pd.DataFrame:

    """ 
        Calculates the number of words and type-token ratio for each transcript
        
        Parameters 
        ----------
        input_df: pd.DataFrame
            Input DataFrame with column storing transcripts as 'Text'
        feature_df: pd.DataFrame
            DataFrame storing coherence scores and linguistic features

        Returns
        -------
        pd.DataFrame
            Feature DataFrame with columns for type-token ratio and total number of words
    """

    for i in input_df.index:
        string = input_df['Text'].loc[i]
        feature_df.loc[i, 'type_token_ratio'], feature_df.loc[i, 'n_words'] = string_ttr(string)

    return feature_df

In [102]:
# Applying the word count function to the transcripts without punctuation
linguistic_features = word_count_features(transcripts_no_punc, linguistic_features)
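
A short worked example of the type-token ratio follows. The helper `type_token_ratio` is a standalone sketch written for this illustration (it splits on any whitespace), and the sentence is made up.

```python
def type_token_ratio(text):
    """Return (unique words / total words, total words) for a lowercased text."""
    tokens = text.lower().split()
    if not tokens:
        raise ValueError("Input text has no tokens")
    return len(set(tokens)) / len(tokens), len(tokens)

# 11 tokens, 5 types: the, dog, saw, cat, and
example = "the dog saw the cat and the cat saw the dog"
ttr, n_tokens = type_token_ratio(example)

print(n_tokens, round(ttr, 3))
```

A transcript with no repeated words has a ratio of 1, and the ratio falls as words are reused.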

In [103]:
def lexical_features(dictionary: dict, feature: str, feature_df: pd.DataFrame, input_df: pd.DataFrame = transcripts_no_punc) -> pd.DataFrame:

    """ 
        Base function for calculating a series of lexical features (used for word frequency
        and semantic diversity)
        
        Parameters 
        ----------
        dictionary: dict
            Dictionary containing words and associated pre-defined values, imported from
            external sources
        feature: str
            Feature used to name column of feature DataFrame
        feature_df: pd.DataFrame
            DataFrame storing coherence scores and linguistic features
        input_df: pd.DataFrame
            Input DataFrame with column storing transcripts as 'Text', defaults to transcripts
            with no punctuation

        Returns
        -------
        pd.DataFrame
            Feature DataFrame with columns for the relevant feature
    """

    for i in input_df.index:
        string = input_df['Text'].loc[i]
        string_split = (string.lower()).split(' ')
        values = []

        # Keeping only the words that have a value in the dictionary
        for item in string_split:
            value = dictionary.get(item)

            if value is not None:
                values.append(value)

        # Averaging over the matched words (NaN if no word was matched)
        mean = np.nanmean(np.array(values))

        feature_df.loc[i, feature] = mean

    return feature_df

In [104]:
# Creating a dictionary for word frequency based on prior data
frequency = pd.read_csv('lexical_feature_data/SUBTLEXusExcel2007.csv')
frequency_dict_original = dict(zip(frequency['Word'], frequency['Lg10WF']))
frequency_dict = {(word.lower() if isinstance(word, str) else word): value for word, value in frequency_dict_original.items()}

# Applying the lexical features function for word frequency to the transcripts without punctuation
linguistic_features = lexical_features(frequency_dict, 'log10_freq_mean', linguistic_features)
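
The dictionary lookup behind `lexical_features` can be sketched with a toy table. The values in `toy_norms` are invented for illustration and are not taken from SUBTLEX-US; `mean_lexical_value` is a hypothetical standalone helper.

```python
# Hypothetical norms: made-up values, not real frequency data
toy_norms = {'the': 6.0, 'dog': 4.0, 'barked': 2.0}

def mean_lexical_value(text, norms):
    """Average the norm values of the words that appear in the table."""
    tokens = text.lower().split()
    values = [norms[token] for token in tokens if token in norms]
    # Words missing from the table are skipped, not treated as zero
    return sum(values) / len(values) if values else float('nan')

# "loudly" has no entry, so the mean is taken over the other three words
mean_value = mean_lexical_value("The dog barked loudly", toy_norms)
no_match = mean_lexical_value("zzz", toy_norms)

print(mean_value, no_match)
```

Skipping unmatched words means the mean reflects only the covered vocabulary, so transcripts with many out-of-table words are averaged over fewer items.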

In [105]:
# Creating a dictionary for semantic diversity based on prior data
semantic_diversity = pd.read_csv('lexical_feature_data/13428_2012_278_MOESM1_ESM.csv', dtype = str)
semantic_diversity = (semantic_diversity.rename(columns = {'Supplementary Materials: SemD values': 'word', 'Unnamed: 2': 'sem_diversity'})).drop(0)
semantic_diversity_dict_original = dict(zip(semantic_diversity['word'], semantic_diversity['sem_diversity']))
semantic_diversity_dict = {(word.lower() if isinstance(word, str) else word): (float(value)) for word, value in semantic_diversity_dict_original.items()}

# Applying the lexical features function for semantic diversity to the transcripts without punctuation
linguistic_features = lexical_features(semantic_diversity_dict, 'semantic_diversity_mean', linguistic_features)

In [106]:
def semantic_thematic_similarity(input_df: pd.DataFrame, feature_df: pd.DataFrame) -> pd.DataFrame:

    """ 
        Calculates semantic thematic similarity (mean cosine similarity across the adjacent word
        pairs in a transcript)
        
        Parameters
        ----------
        input_df: pd.DataFrame
            Input DataFrame with column storing transcripts as 'Text'
        feature_df: pd.DataFrame
            DataFrame storing coherence scores and linguistic features

        Returns
        -------
        pd.DataFrame
            Feature DataFrame with column for semantic thematic similarity
    """
   
    for i in input_df.index:  

        string = input_df['Text'].loc[i]
        string_split = (string.lower()).split(' ')
        sem_sim = []

        for j in range(len(string_split) - 1):
            # Skipping pairs where either word is missing from the word2vec vocabulary
            try:
                sem_sim_value = sem_sim_model.similarity(string_split[j], string_split[j + 1])
            except KeyError:
                continue
            
            sem_sim.append(float(sem_sim_value))
        
        # Averaging the pairwise similarities across the transcript
        sem_thematic_similarity = np.nanmean(np.array(sem_sim))

        feature_df.loc[i, 'semantic_thematic_similarity'] = sem_thematic_similarity

    return feature_df

In [107]:
# Applying the semantic thematic similarity function to the transcripts without punctuation
linguistic_features = semantic_thematic_similarity(transcripts_no_punc, linguistic_features)
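
To make the adjacent-pair averaging concrete, the sketch below repeats it with toy two-dimensional vectors in place of the 300-dimensional word2vec embeddings. The vectors, the nonsense word "blorp", and both helper functions are assumptions chosen so the cosines come out as round numbers.

```python
import math

# Hypothetical 2-dimensional embeddings (the real model uses 300 dimensions)
toy_vectors = {'dog': (1.0, 0.0), 'puppy': (3.0, 4.0), 'car': (0.0, 1.0)}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_adjacent_similarity(text, vectors):
    """Average cosine similarity over adjacent word pairs with known vectors."""
    tokens = text.lower().split()
    similarities = [
        cosine_similarity(vectors[first], vectors[second])
        for first, second in zip(tokens, tokens[1:])
        if first in vectors and second in vectors
    ]
    return sum(similarities) / len(similarities) if similarities else float('nan')

# Pairs: dog-puppy (0.6) and puppy-car (0.8)
all_known = mean_adjacent_similarity("dog puppy car", toy_vectors)

# "blorp" has no vector, so both pairs containing it are dropped and only dog-puppy remains
with_unknown = mean_adjacent_similarity("dog puppy blorp car", toy_vectors)

print(round(all_known, 3), round(with_unknown, 3))
```

An out-of-vocabulary word removes the two pairs it belongs to; the words on either side of it are not paired with each other.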

In [108]:
def surprisal(input_lists: list, feature_df: pd.DataFrame) -> pd.DataFrame:

    """ 
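        Calculates average surprisal for a transcript (the mean negative log probability per
        token of each utterance under the language model, averaged across utterances)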
        
        Parameters 
        ----------
        input_lists: list
            Input lists with each sub-list containing story transcripts separated by utterance
        feature_df: pd.DataFrame
            DataFrame storing coherence scores and linguistic features

        Returns
        -------
        pd.DataFrame
            Feature DataFrame with column for surprisal
    """
    
    for i, story in enumerate(input_lists):
        transcript_surprisal = []
        
        for sentence in story:
            sentence_surprisal = surprisal_model.sequence_score(sentence, reduction = lambda x: -x.mean(0).item())
            transcript_surprisal.append(sentence_surprisal)
        
        transcript_surprisal_array = np.array(transcript_surprisal)
        surprisal_mean = np.nanmean(transcript_surprisal_array)

        feature_df.loc[i, 'surprisal'] = surprisal_mean
    
    return feature_df

In [110]:
# Applying the surprisal function to the transcripts separated by utterance; this cell takes some time to run
linguistic_features = surprisal(story_lists, linguistic_features)
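
The two-stage averaging in `surprisal` can be illustrated without loading GPT-2. The token probabilities below are invented, not model output, and base-2 logarithms are used only to keep the numbers readable; the log base used by the scorer may differ.

```python
import math

# Hypothetical per-token probabilities for a transcript with two utterances
utterance_token_probs = [
    [0.5, 0.25],   # surprisals of 1 and 2 bits
    [0.125],       # surprisal of 3 bits
]

def mean_surprisal(token_probs):
    """Mean negative log2 probability over the tokens of one utterance."""
    return -sum(math.log2(p) for p in token_probs) / len(token_probs)

# Stage 1: one mean surprisal per utterance
utterance_surprisals = [mean_surprisal(probs) for probs in utterance_token_probs]

# Stage 2: unweighted mean across utterances, as in the function above
transcript_surprisal = sum(utterance_surprisals) / len(utterance_surprisals)

print(utterance_surprisals, transcript_surprisal)
```

Because the second stage is unweighted, a one-word utterance contributes as much to the transcript score as a long one; a token-weighted mean of the same data would be 2.0 rather than 2.25.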

In [111]:
def noun_cohesion_arrays(input_lists: list) -> np.ndarray:
    
    """ 
        Calculates noun cohesion/co-reference arrays for each transcript
        
        Parameters 
        ----------
        input_lists: list
            Input lists with each sub-list containing story transcripts separated by utterance

        Returns
        -------
        np.ndarray
            Object array of arrays, where each inner array corresponds to a transcript and every 
            element indicates whether the (i, j) pair of utterances in the transcript shares a 
            common noun, proper noun, or pronoun (1) or has no shared elements (0)

        Raises
        ------
        ValueError
            If array is not symmetric, indicating an error in calculation
    """
    
    noun_pos = ['NOUN', 'PROPN', 'PRON']

    story_lists_updated = []
    for story in input_lists:
        story_nouns = []
        for sentence in story:
            sentence_nouns = []
            sentence_str = str(sentence)
            doc = nlp(contractions.fix(sentence_str.lower()))
            for token in doc:
                if token.pos_ in noun_pos:
                    sentence_nouns.append(str(token))
            story_nouns.append(sentence_nouns)
        story_lists_updated.append(story_nouns)

        
    all_cohesion_arrays = np.empty(len(input_lists), dtype = object)

    for k in range(len(story_lists_updated)):
        string = story_lists_updated[k]
        cohesion_array = np.zeros((len(string), len(string)))
        for i in range(len(string)):
            current_string = string[i]
            for j in range(len(string)):
                new_string = string[j]
                if any(item in new_string for item in current_string):
                    cohesion_array[i, j] = 1
        all_cohesion_arrays[k] = cohesion_array

    for array in all_cohesion_arrays:
        if not np.array_equal(array, array.T):
            raise ValueError("Array must be symmetric; check input")

    return all_cohesion_arrays

def local_global_cohesion(input_lists: list, feature_df: pd.DataFrame) -> pd.DataFrame:

    """ 
        Calculates local and global cohesion scores for each transcript
        
        Parameters 
        ----------
        input_lists: list
            Input lists with each sub-list containing story transcripts separated by utterance
        feature_df: pd.DataFrame
            DataFrame storing coherence scores and linguistic features

        Returns
        -------
        pd.DataFrame
            Feature DataFrame with columns for local and global cohesion
    """

    input_array = noun_cohesion_arrays(input_lists)

    for k, array in enumerate(input_array):
        n = array.shape[0]

        # Transcripts with fewer than two utterances have no utterance pairs to compare
        if n < 2:
            feature_df.loc[k, 'coref_local'] = 0
            feature_df.loc[k, 'coref_global'] = 0
            continue

        # Local cohesion: proportion of adjacent utterance pairs that share a noun
        adjacent_links = sum(array[i, i + 1] for i in range(n - 1))
        feature_df.loc[k, 'coref_local'] = adjacent_links / (n - 1)

        # Global cohesion: proportion of all utterance pairs that share a noun
        # (the array is symmetric, so each pair appears twice off the diagonal)
        all_links = (array.sum() - np.diag(array).sum()) / 2
        feature_df.loc[k, 'coref_global'] = all_links / (n * (n - 1) / 2)

    feature_df = feature_df.sort_index()

    return feature_df

In [112]:
# Applying the local and global cohesion function to the transcripts separated by utterance
linguistic_features = local_global_cohesion(story_lists, linguistic_features)
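
The difference between the local and global scores is easiest to see on a small case. In the sketch below, the noun sets are hypothetical and written by hand in place of SpaCy output, and `shares_noun` is a helper introduced only for this example.

```python
# Hypothetical noun sets, one per utterance of a single transcript
utterance_nouns = [
    {'boy', 'frog'},
    {'frog', 'jar'},
    {'dog'},
    {'boy', 'dog'},
]

def shares_noun(first, second):
    """True if two utterances have at least one noun in common."""
    return bool(first & second)

n = len(utterance_nouns)

# Local cohesion: linked adjacent pairs out of the n - 1 adjacent pairs
local_links = sum(
    shares_noun(utterance_nouns[i], utterance_nouns[i + 1]) for i in range(n - 1)
)
coref_local = local_links / (n - 1)             # pairs (0,1) and (2,3) are linked: 2 / 3

# Global cohesion: linked pairs out of all n * (n - 1) / 2 utterance pairs
global_links = sum(
    shares_noun(utterance_nouns[i], utterance_nouns[j])
    for i in range(n)
    for j in range(i + 1, n)
)
coref_global = global_links / (n * (n - 1) / 2)  # pairs (0,1), (0,3), (2,3): 3 / 6

print(round(coref_local, 3), coref_global)
```

The link between the first and last utterance ("boy") counts toward the global score only, since those utterances are not adjacent.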

In [113]:
def parser(input_df: pd.DataFrame) -> list:

    """
        Uses Stanza to create constituent parse trees for each story
        
        Parameters 
        ----------
        input_df: pd.DataFrame
            Input DataFrame with column storing transcripts as 'Text'

        Returns
        -------
        list
            List of lists, where each sublist contains a list with constituent
            parsing for each sentence in a transcript
    """

    parsed_transcripts = []

    for i in input_df.index:
        string = input_df['Text'].loc[i]
        doc = stanza_model(string)
        story_updated = []
        for sentence in doc.sentences:
            tree = sentence.constituency
            string_tree = str(tree).split(' ')
            story_updated.append(string_tree)

        parsed_transcripts.append(story_updated)
    
    return parsed_transcripts

In [114]:
# Applying the constituency parsing function to the transcripts with punctuation; this cell takes some time to run
parsed_transcripts = parser(transcripts_punc)

In [115]:
def constituent_counter(input_lists: list, feature_df: pd.DataFrame) -> pd.DataFrame:

    """ 
        Calculates the number of phrasal and clausal constituents (noun phrases, verb phrases, 
        subordinate clauses, and prepositional phrases) in a transcript
        
        Parameters 
        ----------
        input_lists: list
            Parsed transcripts (output of parser), with each sub-list containing the tokenized
            constituency parse of every sentence in a transcript
        feature_df: pd.DataFrame
            DataFrame storing coherence scores and linguistic features

        Returns
        -------
        pd.DataFrame
            Feature DataFrame with column for overall number of high-level constituents
    """

    counter_pos = ["(NP", "(VP", "(SBAR", "(PP"]

    for i, story in enumerate(input_lists):
        counter = 0
        for sentence in story:
            for item in sentence:
                if item in counter_pos:
                    counter += 1

        feature_df.loc[i, 'n_constituents'] = counter

    return feature_df


def phrase_parser(input_lists: list) -> Tuple[list, list]:
    
    """ 
        Used to isolate noun and verb phrases from parsed and segmented utterances
        
        Parameters 
        ----------
        input_lists: list
            Parsed transcripts (output of parser), with each sub-list containing the tokenized
            constituency parse of every sentence in a transcript

        Returns
        -------
        Tuple[list, list]
            List containing each noun phrase for each transcript, and list containing each verb
            phrase for each transcript
    """

    noun_phrases_corpus = []
    verb_phrases_corpus = []
    closing_pattern = r'\)'

    phrasal_pos = ["(NP", "(VP"]

    for story in input_lists:
        noun_phrases_overall = []
        verb_phrases_overall = []
        for sentence in story:
            noun_phrases = []
            verb_phrases = []
            for i in range(len(sentence)):
                item = sentence[i]
                if item in phrasal_pos:
                    open_counter = 1
                    for j in range(i + 1, len(sentence)):
                        first_element = sentence[j][0]
                        match = re.findall(closing_pattern, sentence[j])
                        closed = len(match)
                        if first_element == '(':
                            open_counter += 1
                        open_counter -= closed
                        if open_counter <= 0:
                            break
                    phrase = [item for item in sentence[i:j + 1] if item[-1] == ')']
                    if item == phrasal_pos[0]:
                        noun_phrases.append(phrase)
                    else:
                        verb_phrases.append(phrase)

            noun_phrases_overall.append(noun_phrases)
            verb_phrases_overall.append(verb_phrases)

        noun_phrases_corpus.append(noun_phrases_overall)
        verb_phrases_corpus.append(verb_phrases_overall)

    return noun_phrases_corpus, verb_phrases_corpus

def phrase_length(phrase_list: list) -> list:
    
    """ 
        Used to calculate the average length of a given phrase type
        
        Parameters 
        ----------
        phrase_list: list
            Input lists containing each phrase with words as string elements for each
            transcript

        Returns
        -------
        list
            List containing the mean phrase length for each transcript
    """

    phrase_lengths = []

    for story in phrase_list:
        length_list = []
        for sentence in story:
            for phrase in sentence:
                length = len(phrase)
                length_list.append(length)

        # Averaging phrase lengths (NaN if the transcript contains no phrases of this type)
        mean = np.mean(length_list) if length_list else np.nan
        phrase_lengths.append(float(mean))

    return phrase_lengths

def phrase_mean_length(input_lists: list, feature_df: pd.DataFrame) -> pd.DataFrame:
    
    """ 
        Calculates the mean length of noun phrases and verb phrases for all stories
        in the corpus
        
        Parameters 
        ----------
        input_lists: list
            Parsed transcripts (output of parser), with each sub-list containing the
            tokenized constituency parse of every sentence in a transcript
        feature_df: pd.DataFrame
            DataFrame storing coherence scores and linguistic features

        Returns
        -------
        pd.DataFrame
            Feature DataFrame with columns for mean length of noun and verb phrases
    """

    noun_phrases, verb_phrases = phrase_parser(input_lists)

    noun_phrase_lengths = phrase_length(noun_phrases)
    verb_phrase_lengths = phrase_length(verb_phrases)
    
    feature_df['mean_np_length'] = noun_phrase_lengths
    feature_df['mean_vp_length'] = verb_phrase_lengths

    return feature_df

In [116]:
# Applying the constituent counter and mean phrase length functions to the transcripts with constituency parsing completed
linguistic_features = constituent_counter(parsed_transcripts, linguistic_features)
linguistic_features = phrase_mean_length(parsed_transcripts, linguistic_features)
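
The constituent count and phrase lengths can be traced by hand on one bracketed parse. The tree below is written by hand in Penn Treebank style as an assumed stand-in for Stanza output, and `phrase_words` is a simplified sketch of the bracket matching in `phrase_parser`, not a copy of it.

```python
# Hypothetical parse of "The small dog chased the cat into the yard."
tree = (
    "(ROOT (S (NP (DT The) (JJ small) (NN dog)) "
    "(VP (VBD chased) (NP (DT the) (NN cat)) (PP (IN into) (NP (DT the) (NN yard)))) "
    "(. .)))"
)
tokens = tree.split(' ')

# Counting the opening labels of the four constituent types
counter_pos = {"(NP", "(VP", "(SBAR", "(PP"}
n_constituents = sum(token in counter_pos for token in tokens)

def phrase_words(tokens, start):
    """Collect the words spanned by the constituent that opens at tokens[start]."""
    depth = 1
    words = []
    for token in tokens[start + 1:]:
        if token.startswith('('):
            depth += 1                       # a nested label opens a new bracket
        else:
            words.append(token.rstrip(')'))  # a word, followed by closing brackets
        depth -= token.count(')')
        if depth <= 0:                       # the starting constituent has closed
            break
    return words

np_lengths = [len(phrase_words(tokens, i)) for i, t in enumerate(tokens) if t == "(NP"]
vp_lengths = [len(phrase_words(tokens, i)) for i, t in enumerate(tokens) if t == "(VP"]

mean_np_length = sum(np_lengths) / len(np_lengths)
mean_vp_length = sum(vp_lengths) / len(vp_lengths)

print(n_constituents, np_lengths, vp_lengths)
```

Nested phrases are counted in full: the verb phrase spans six words because it contains the object noun phrase and the prepositional phrase.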

In [117]:
# Saving the linguistic feature DataFrame
linguistic_features.to_csv('linguistic_features/linguistic_features.csv')

In [118]:
def text_descriptions(input_df: pd.DataFrame) -> pd.DataFrame:
    
    """ 
        Calculates a variety of linguistic features (mean dependency distance, mean proportion of adjacent
        dependency relations, first-order coherence, and second-order coherence) using the textdescriptives
        library
        
        Parameters 
        ----------
        input_df: pd.DataFrame
            Input DataFrame with column storing transcripts as 'Text'

        Returns
        -------
        pd.DataFrame
            Feature DataFrame with columns for each derived feature
    """
     
    metric_frames = []

    for i in input_df.index:  
        string = input_df['Text'].loc[i]
        text_description_df = td.extract_metrics(text = string, metrics = ["readability", "dependency_distance", "coherence"], spacy_model = 'en_core_web_sm')
        metric_frames.append(text_description_df)

    # Concatenating once keeps the transcripts in their original order
    text_description_ordered = pd.concat(metric_frames, ignore_index = True)
    text_description_ordered = text_description_ordered[['text', 'dependency_distance_mean', 'prop_adjacent_dependency_relation_mean', 'first_order_coherence', 'second_order_coherence']]
    
    return text_description_ordered

In [119]:
# Applying the text description feature function to the transcripts with punctuation, and creating a separate DataFrame to store results
text_description_ordered = text_descriptions(transcripts_punc)

In [120]:
# Saving the text description DataFrame
text_description_ordered.to_csv('linguistic_features/text_description_ordered.csv')