Syntactic structure is fundamental to understanding natural language, representing the hierarchical organization of words in a sentence and their inter-relationships.  To formally define syntactic structure, we can consider a sentence $ S $ as a sequence of words $ w_1, w_2, ..., w_n $.  A syntactic structure, denoted $ \mathcal{G}(S) $, for a sentence $ S $ can be defined as a set of relations $ R $ over the words in $ S $, encoding how words combine to form phrases and sentences, reflecting the grammatical rules of a language.

**1. Syntactic Structure: Consistency and Dependency**

**Consistency** in syntactic structure refers to the adherence of a sentence's structure to the grammatical rules of a given language.  A consistent syntactic structure is one that is well-formed according to the grammar.  Mathematically, let $ \mathbb{G} $ be the set of grammatical rules for a language. A syntactic structure $ \mathcal{G}(S) $ for sentence $ S $ is considered consistent if and only if $ \mathcal{G}(S) $ can be derived or validated by the rules in $ \mathbb{G} $.  This can be formalized by defining a function $ \mathcal{V} $, which validates a syntactic structure against the grammar:

$$ \mathcal{V}(\mathcal{G}(S), \mathbb{G}) = \begin{cases} 1 & \text{if } \mathcal{G}(S) \text{ is consistent with } \mathbb{G} \\ 0 & \text{otherwise} \end{cases} $$

A structure is consistent if $ \mathcal{V}(\mathcal{G}(S), \mathbb{G}) = 1 $.  In practice, consistency checking involves ensuring that the syntactic relations present in $ \mathcal{G}(S) $ are permissible under the rules of grammar $ \mathbb{G} $.

**Dependency** in syntactic structure focuses on directed relationships between words.  In a dependency structure, for each word in a sentence (except for the root word), there is exactly one word that it depends on, called its head. This dependency relation can be formally defined as a binary relation $ \mathcal{D} $ on the words of a sentence $ S $.  For a sentence $ S = w_1, w_2, ..., w_n $, a dependency structure $ \mathcal{D}(S) $ is a set of ordered pairs $ (w_i, w_j) $ where $ w_i $ is the head and $ w_j $ is the dependent.

Mathematically, a dependency structure can be represented as a directed graph $ G = (V, E) $, where $ V = \{w_1, w_2, ..., w_n\} $ is the set of words in the sentence, and $ E \subseteq V \times V $ is the set of dependency arcs.  Each arc $ (w_i, w_j) \in E $ represents that $ w_i $ is the head of $ w_j $.

For a valid dependency tree, it must satisfy the following properties:

1. **Single Head**: Each word $ w_j $ (except for a designated root word $ w_r $) has exactly one head $ w_i $.  We can define a head function $ h: V \setminus \{w_r\} \rightarrow V $.  This means for each $ w_j \neq w_r $, there is a unique $ w_i = h(w_j) $ such that $ (w_i, w_j) \in E $.

2. **Connectedness**: The graph $ G $ must be connected. Specifically, for every word $ w_j $, there should be a path from the root word $ w_r $ to $ w_j $ in the undirected version of $ G $.

3. **Acyclicity**: The graph $ G $ must be acyclic.  There should be no directed cycles in $ G $.  This ensures a hierarchical structure and prevents scenarios where a word directly or indirectly depends on itself.

With these properties, a dependency structure forms a rooted tree, where the root node has no incoming arcs and represents the main verb or central element of the sentence.
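The three properties can be checked mechanically. The following is a minimal sketch, assuming words are identified by their positions `0..n-1` and arcs are given as `(head, dependent)` index pairs; the function name `is_valid_dependency_tree` and this encoding are illustrative choices, not part of the definitions above.

```python
def is_valid_dependency_tree(n, arcs, root):
    """Check single head, connectedness, and acyclicity.

    n    -- number of words, indexed 0..n-1
    arcs -- iterable of (head, dependent) index pairs
    root -- index of the designated root word
    """
    if not 0 <= root < n:
        return False
    head = {}
    for h, d in arcs:
        if not (0 <= h < n and 0 <= d < n):
            return False          # arc mentions a word outside the sentence
        if d == root or d in head:
            return False          # root has no head; other words have at most one
        head[d] = h
    if len(head) != n - 1:
        return False              # some non-root word has no head
    # With exactly one head per non-root word, the graph is a rooted tree
    # if and only if every head chain reaches the root without repeating a word.
    for start in range(n):
        seen = set()
        node = start
        while node != root:
            if node in seen:
                return False      # directed cycle: this word never reaches the root
            seen.add(node)
            node = head[node]
    return True


# "The cat sat on the mat": 0=The 1=cat 2=sat 3=on 4=the 5=mat
example_arcs = {(2, 1), (1, 0), (2, 3), (3, 5), (5, 4)}
print(is_valid_dependency_tree(6, example_arcs, root=2))
```

Following head pointers covers connectedness and acyclicity in one pass: a word that cannot reach the root is either disconnected from it or caught in a cycle.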

**2. Dependency Grammar and Treebanks**

**Dependency Grammar (DG)** is a linguistic formalism where syntactic structure is described by dependency relations between words in a sentence. As defined previously, it models sentence structure based on binary, asymmetrical relations called dependencies.  Unlike phrase structure grammar that decomposes sentences into nested constituents, dependency grammar directly represents the relationships between words, showing which words modify or govern others.

Formally, a dependency grammar can be defined as a tuple $ DG = (\Sigma, \mathcal{R}) $, where $ \Sigma $ is a set of words (vocabulary) and $ \mathcal{R} $ is a set of dependency relation types.  For each dependency arc $ (w_i, w_j) $ in the dependency graph, there is an associated relation type $ r \in \mathcal{R} $, which labels the type of dependency between $ w_i $ and $ w_j $ (e.g., subject, object, modifier). Thus, a dependency structure is more precisely a set of triples $ (w_i, r, w_j) $, indicating that $ w_i $ is the head of $ w_j $ with relation type $ r $.

**Treebanks** are annotated corpora that provide examples of syntactic structure and dependency relations for real-world sentences. They are essential resources for developing and evaluating dependency parsers. A dependency treebank, specifically, is a collection of sentences where each sentence is paired with its manually annotated dependency tree.

Mathematically, a treebank $ \mathcal{T} $ can be seen as a set of pairs $ (S_k, D_k) $ for $ k = 1, 2, ..., N $, where $ S_k $ is the $ k^{th} $ sentence in the corpus and $ D_k $ is its corresponding dependency tree.  Each $ D_k $ is represented using dependency arcs, often labeled with relation types.

For example, consider the sentence "The cat sat on the mat". A dependency treebank entry might represent it as follows, listing each arc as (dependent, relation, head) and marking the root separately:

```
(sat, root)
(cat, nsubj, sat)
(The, det, cat)
(on, prep, sat)
(mat, pobj, on)
(the, det, mat)
```

This representation specifies that 'sat' is the root, 'cat' is the nominal subject of 'sat', 'The' is a determiner for 'cat', 'on' is a preposition modifying 'sat', 'mat' is the object of preposition 'on', and 'the' is a determiner for 'mat'.
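In practice, treebank entries usually identify tokens by position rather than by word form, so that repeated words stay distinct. The sketch below stores the same entry as one record per token and regenerates the listing above; the `Token` type, its field names, and the convention that head `0` denotes the root are assumptions made for illustration.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    index: int    # 1-based position in the sentence
    form: str     # the word as written
    head: int     # position of the head; 0 means "no head" (root)
    deprel: str   # relation label


entry: List[Token] = [
    Token(1, "The", 2, "det"),
    Token(2, "cat", 3, "nsubj"),
    Token(3, "sat", 0, "root"),
    Token(4, "on", 3, "prep"),
    Token(5, "the", 6, "det"),
    Token(6, "mat", 4, "pobj"),
]


def to_triples(tokens: List[Token]) -> List[Tuple[str, ...]]:
    """Render tokens as (dependent, relation, head) triples, root as a pair."""
    forms = {t.index: t.form for t in tokens}
    triples = []
    for t in tokens:
        if t.head == 0:
            triples.append((t.form, t.deprel))
        else:
            triples.append((t.form, t.deprel, forms[t.head]))
    return triples


for triple in to_triples(entry):
    print(triple)
```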

Treebanks serve multiple critical purposes:

1. **Gold Standard Data**: Treebanks provide gold standard annotations for training supervised dependency parsers. Parsers learn to predict dependency trees by minimizing the difference between their output and the treebank annotations.

2. **Evaluation Benchmark**: Treebanks allow for standardized evaluation of parser performance. Metrics like Unlabeled Attachment Score (UAS) and Labeled Attachment Score (LAS) are calculated by comparing the parser's output against the gold standard trees in the treebank.

3. **Linguistic Resource**: Treebanks are invaluable resources for linguistic research, providing empirical data on syntactic structures and dependency patterns in different languages.
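The two attachment scores mentioned above can be computed in a few lines. This is a minimal sketch, assuming each sentence is given as a list of `(head, label)` pairs, one per token; the function name and the example prediction are hypothetical. UAS counts tokens whose predicted head is correct, and LAS additionally requires the correct label.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) for two equally long lists of (head, label) pairs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    total = len(gold)
    # UAS: the head matches, regardless of label
    uas_hits = sum(1 for (gh, _), (ph, _) in zip(gold, predicted) if gh == ph)
    # LAS: both head and label match
    las_hits = sum(1 for g, p in zip(gold, predicted) if g == p)
    return uas_hits / total, las_hits / total


# "The cat sat on the mat", heads as 1-based positions, 0 = root
gold = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "prep"), (6, "det"), (4, "pobj")]
# Made-up parser output: one wrong label (token 4) and one wrong head (token 6)
predicted = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "obl"), (6, "det"), (3, "pobj")]

uas, las = attachment_scores(gold, predicted)
print(f"UAS = {uas:.3f}, LAS = {las:.3f}")
```

Here five of six heads are correct and four of six tokens have both head and label correct.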

**3. Transition-based Dependency Parsing**

Transition-based dependency parsing is an approach that parses a sentence by performing a sequence of transitions, starting from an initial state and moving towards a final state, where a complete dependency tree is built.  It frames parsing as a state transition process.

A transition-based parser operates using a state configuration $ c = (\sigma, \beta, A) $, where:

- $ \sigma $ (stack): A stack of words that are currently being processed. Initial state: $ \sigma = [root]$ (start with a dummy root node).
- $ \beta $ (buffer): A buffer of input words yet to be processed. Initial state: $ \beta = [w_1, w_2, ..., w_n] $ (the input sentence).
- $ A $ (arc set): A set of dependency arcs constructed so far. Initial state: $ A = \emptyset $.

The parser transitions from one state to another by applying a set of predefined transition actions.  Common sets of transitions include:

For **Arc-Standard** transition system:

1. **SHIFT**: Moves the first word from the buffer $ \beta $ to the top of the stack $ \sigma $. Condition: $ \beta \neq [] $.  Transition: $ (\sigma, w|\beta, A) \rightarrow (\sigma|w, \beta, A) $.

2. **LEFT-ARC (l)**: Creates a dependency arc with label $ l $ from the top of the stack $ \sigma_1 $ (the head) to the second word on the stack $ \sigma_2 $ (the dependent), and removes $ \sigma_2 $ from the stack. Condition: $ |\sigma| \ge 2 $ and $ \sigma_2 $ is not the root node. Transition (the stack is written with its top at the right, as in SHIFT): $ (\sigma'|\sigma_2|\sigma_1, \beta, A) \rightarrow (\sigma'|\sigma_1, \beta, A \cup \{(\sigma_1, l, \sigma_2)\}) $.

3. **RIGHT-ARC (l)**: Creates a dependency arc with label $ l $ from the second word on the stack $ \sigma_2 $ (the head) to the top of the stack $ \sigma_1 $ (the dependent), and removes $ \sigma_1 $ from the stack. Condition: $ |\sigma| \ge 2 $. Transition: $ (\sigma'|\sigma_2|\sigma_1, \beta, A) \rightarrow (\sigma'|\sigma_2, \beta, A \cup \{(\sigma_2, l, \sigma_1)\}) $.
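To see the Arc-Standard transitions at work, the sketch below replays a hand-written transition sequence for "The cat sat on the mat". It assumes tokens are numbered `1..n` with `0` standing for the dummy root, keeps the stack as a Python list whose last element is the top, and stores arcs as `(head, label, dependent)`; the function name is illustrative, and no classifier is involved.

```python
def arc_standard_parse(n_words, transitions):
    """Apply a sequence of (name, label) Arc-Standard transitions."""
    stack = [0]                              # dummy root
    buffer = list(range(1, n_words + 1))
    arcs = set()                             # (head, label, dependent)
    for name, label in transitions:
        if name == "SHIFT":
            if not buffer:
                raise ValueError("SHIFT requires a non-empty buffer")
            stack.append(buffer.pop(0))
        elif name == "LEFT-ARC":
            if len(stack) < 2 or stack[-2] == 0:
                raise ValueError("LEFT-ARC needs two stack items; root cannot be a dependent")
            dependent = stack.pop(-2)        # second item becomes the dependent
            arcs.add((stack[-1], label, dependent))
        elif name == "RIGHT-ARC":
            if len(stack) < 2:
                raise ValueError("RIGHT-ARC needs two stack items")
            dependent = stack.pop()          # top item becomes the dependent
            arcs.add((stack[-1], label, dependent))
        else:
            raise ValueError(f"unknown transition: {name}")
    return stack, buffer, arcs


# 1=The 2=cat 3=sat 4=on 5=the 6=mat
transitions = [
    ("SHIFT", None), ("SHIFT", None), ("LEFT-ARC", "det"),
    ("SHIFT", None), ("LEFT-ARC", "nsubj"),
    ("SHIFT", None), ("SHIFT", None), ("SHIFT", None),
    ("LEFT-ARC", "det"), ("RIGHT-ARC", "pobj"),
    ("RIGHT-ARC", "prep"), ("RIGHT-ARC", "root"),
]
stack, buffer, arcs = arc_standard_parse(6, transitions)
print(stack, buffer, sorted(arcs))
```

The sequence uses twelve transitions for six words: each word is shifted once and later removed from the stack by exactly one arc transition.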

For **Arc-Eager** transition system:

1. **SHIFT**: Moves the first word from the buffer to the top of the stack. Condition: $ \beta \neq [] $. Same transition as in Arc-Standard.

2. **LEFT-ARC (l)**: Creates a dependency arc with label $ l $ from the first word in the buffer $ w $ (the head) to the top of the stack $ s $ (the dependent), and removes $ s $ from the stack. Condition: $ \sigma \neq [] $, $ \beta \neq [] $, $ s $ is not the root node, and $ s $ has no head yet. Transition: $ (\sigma'|s, w|\beta, A) \rightarrow (\sigma', w|\beta, A \cup \{(w, l, s)\}) $.

3. **RIGHT-ARC (l)**: Creates a dependency arc with label $ l $ from the top of the stack $ s $ (the head) to the first word in the buffer $ w $ (the dependent), and moves $ w $ from the buffer to the stack. Condition: $ \sigma \neq [] $, $ \beta \neq [] $, and $ w $ has no head yet. Transition: $ (\sigma'|s, w|\beta, A) \rightarrow (\sigma'|s|w, \beta, A \cup \{(s, l, w)\}) $.

4. **REDUCE**: Removes the top word $ s $ from the stack if it has already found its head. Condition: $ \sigma \neq [] $ and $ s $ already has a head. Transition: $ (\sigma'|s, \beta, A) \rightarrow (\sigma', \beta, A) $.
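The Arc-Eager system can be replayed in the same way. This sketch again assumes tokens numbered `1..n` with `0` as the dummy root and arcs stored as `(head, label, dependent)`; the `has_head` set is a bookkeeping helper introduced here to check the preconditions, and the transition sequence is written by hand rather than predicted.

```python
def arc_eager_parse(n_words, transitions):
    """Apply a sequence of (name, label) Arc-Eager transitions."""
    stack = [0]                              # dummy root
    buffer = list(range(1, n_words + 1))
    arcs = set()                             # (head, label, dependent)
    has_head = set()                         # tokens that already have a head
    for name, label in transitions:
        if name == "SHIFT":
            if not buffer:
                raise ValueError("SHIFT requires a non-empty buffer")
            stack.append(buffer.pop(0))
        elif name == "LEFT-ARC":
            if not stack or not buffer or stack[-1] == 0 or stack[-1] in has_head:
                raise ValueError("LEFT-ARC precondition violated")
            dependent = stack.pop()          # stack top attaches to buffer front
            arcs.add((buffer[0], label, dependent))
            has_head.add(dependent)
        elif name == "RIGHT-ARC":
            if not stack or not buffer:
                raise ValueError("RIGHT-ARC precondition violated")
            dependent = buffer.pop(0)        # buffer front attaches to stack top
            arcs.add((stack[-1], label, dependent))
            has_head.add(dependent)
            stack.append(dependent)
        elif name == "REDUCE":
            if not stack or stack[-1] not in has_head:
                raise ValueError("REDUCE requires a stack top that has a head")
            stack.pop()
        else:
            raise ValueError(f"unknown transition: {name}")
    return stack, buffer, arcs


# 1=The 2=cat 3=sat 4=on 5=the 6=mat
eager_transitions = [
    ("SHIFT", None), ("LEFT-ARC", "det"),
    ("SHIFT", None), ("LEFT-ARC", "nsubj"),
    ("RIGHT-ARC", "root"), ("RIGHT-ARC", "prep"),
    ("SHIFT", None), ("LEFT-ARC", "det"), ("RIGHT-ARC", "pobj"),
    ("REDUCE", None), ("REDUCE", None), ("REDUCE", None),
]
eager_stack, eager_buffer, eager_arcs = arc_eager_parse(6, eager_transitions)
print(eager_stack, eager_buffer, sorted(eager_arcs))
```

Compared with Arc-Standard, right dependents are attached as soon as they reach the front of the buffer, and REDUCE later clears them from the stack.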

The process continues until a terminal configuration is reached: in Arc-Standard, when the buffer is empty and the stack contains only the root node; in Arc-Eager, when the buffer is empty. The sequence of transitions determines the dependency tree.  To decide which transition to apply at each step, a classifier is trained. Mathematically, at each state $ c $, we want to choose a transition $ t $ from a set of possible transitions $ T(c) $ that leads to the correct parse. This is done by learning a scoring function $ score(c, t) $ and choosing the transition $ t^* = \arg \max_{t \in T(c)} score(c, t) $.  Traditional classifiers used hand-crafted features extracted from the current state $ c $ (e.g., words and POS tags on the stack and buffer).

**4. Neural Dependency Parsing**

Neural dependency parsing leverages neural networks to learn feature representations and perform parsing decisions, overcoming the limitations of feature engineering in traditional methods.

In **Neural Transition-based Parsing**, neural networks are used to predict the next transition in a transition-based parsing system.  Instead of manually defined features, neural networks learn representations of the parser state directly from the input words and their context.

A common architecture uses word embeddings and part-of-speech (POS) tag embeddings as inputs.  For a given parser state $ c = (\sigma, \beta, A) $, we can extract a set of words and POS tags from the stack and buffer.  Let $ \mathbf{w}_i $ be the embedding for word $ w_i $ and $ \mathbf{p}_i $ be the embedding for its POS tag $ p_i $.  We can represent a state by concatenating embeddings from words and POS tags in the stack and buffer. For example, we can use the top elements of the stack and the first few elements of the buffer:

State representation $ \mathbf{x} = [\mathbf{w}_{stack[0]}; \mathbf{p}_{stack[0]}; \mathbf{w}_{stack[1]}; \mathbf{p}_{stack[1]}; ... ; \mathbf{w}_{buffer[0]}; \mathbf{p}_{buffer[0]}; ...] $ , where $ [;] $ denotes concatenation.

This state representation $ \mathbf{x} $ is then fed into a neural network, typically a multi-layer perceptron (MLP), to predict a probability distribution over possible transitions.

$$ \begin{aligned} \mathbf{h}^{(1)} &= ReLU(\mathbf{W}^{(1)} \mathbf{x} + \mathbf{b}^{(1)}) \\ \mathbf{h}^{(2)} &= ReLU(\mathbf{W}^{(2)} \mathbf{h}^{(1)} + \mathbf{b}^{(2)}) \\ \mathbf{o} &= \mathbf{W}^{(3)} \mathbf{h}^{(2)} + \mathbf{b}^{(3)} \\ p(t|c) &= \text{softmax}(\mathbf{o})_t \end{aligned} $$

Here, $\mathbf{W}^{(i)}$ and $\mathbf{b}^{(i)}$ are weight matrices and bias vectors for layer $ i $, and $ ReLU $ is the Rectified Linear Unit activation function. $ p(t|c) $ is the probability of applying transition $ t $ in state $ c $. The transition with the highest probability is chosen.  The network is trained to maximize the probability of the correct transition sequence leading to the gold dependency tree, using cross-entropy loss.
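The forward pass and the per-step loss can be written out directly. The following is a small sketch in plain Python, assuming a randomly initialized network with made-up layer sizes (8 inputs, hidden layers of 5 and 4 units, 3 transitions); a real parser would use learned embeddings for $ \mathbf{x} $ and a numerical library for the matrix products.

```python
import math
import random


def affine(W, b, x):
    """Compute W x + b for a matrix given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def relu(v):
    return [max(0.0, a) for a in v]


def softmax(o):
    m = max(o)                               # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in o]
    z = sum(exps)
    return [e / z for e in exps]


def transition_probabilities(x, layers):
    """Two ReLU hidden layers followed by a softmax over transitions."""
    (W1, b1), (W2, b2), (W3, b3) = layers
    h1 = relu(affine(W1, b1, x))
    h2 = relu(affine(W2, b2, h1))
    o = affine(W3, b3, h2)
    return softmax(o)


def random_layer(rng, n_out, n_in):
    """Hypothetical initialization: small uniform weights, zero biases."""
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out


rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(8)]       # stand-in for the state vector
layers = [random_layer(rng, 5, 8), random_layer(rng, 4, 5), random_layer(rng, 3, 4)]

probs = transition_probabilities(x, layers)
gold = 0                                             # index of the correct transition
loss = -math.log(probs[gold])                        # cross-entropy for this step
print(probs, loss)
```

Summing this loss over all states visited along the gold transition sequence gives the training objective described above.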

In **Graph-based Neural Dependency Parsing**, neural networks are used to directly score dependency arcs.  For each pair of words $ (w_i, w_j) $ in a sentence, a neural network computes a score $ s_{ij} $ representing the likelihood that $ w_i $ is the head of $ w_j $.  These scores can be computed using bilinear functions and deep neural networks.

For example, for each word $ w_i $, we can obtain a head representation $ \mathbf{h}_i^{head} $ and a dependent representation $ \mathbf{h}_i^{dep} $ using neural networks (e.g., BiLSTM, feed-forward networks).  The score for a dependency arc from $ w_i $ to $ w_j $ can be calculated as:

$ s_{ij} = (\mathbf{h}_i^{head})^T \mathbf{U} (\mathbf{h}_j^{dep}) + (\mathbf{v}^{head})^T \mathbf{h}_i^{head} + (\mathbf{v}^{dep})^T \mathbf{h}_j^{dep} + b $

where $ \mathbf{U}, \mathbf{v}^{head}, \mathbf{v}^{dep}, b $ are parameters to be learned.  After computing scores $ s_{ij} $ for all possible arcs, we can construct the dependency tree using algorithms like Maximum Spanning Tree (MST) algorithm to find the tree that maximizes the total score of the selected arcs while satisfying the constraints of a dependency tree (single head, connected, acyclic).
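As a worked instance of the scoring formula, the sketch below evaluates $ s_{ij} $ term by term for tiny hand-picked vectors. All numeric values are made up for illustration; in a trained parser the representations and parameters would come from the network.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def arc_score(h_head, h_dep, U, v_head, v_dep, b):
    """Bilinear term plus two linear terms plus bias, as in the formula for s_ij."""
    bilinear = sum(
        h_head[a] * U[a][c] * h_dep[c]
        for a in range(len(h_head))
        for c in range(len(h_dep))
    )
    return bilinear + dot(v_head, h_head) + dot(v_dep, h_dep) + b


h_i_head = [1.0, 2.0]           # head representation of word i
h_j_dep = [3.0, 4.0]            # dependent representation of word j
U = [[1.0, 0.0], [0.0, 1.0]]    # identity, so the bilinear term is a dot product
v_head = [1.0, 1.0]
v_dep = [0.0, 1.0]
b = 0.5

s_ij = arc_score(h_i_head, h_j_dep, U, v_head, v_dep, b)
print(s_ij)
```

With these values the bilinear term is 11, the head term is 3, the dependent term is 4, and the bias is 0.5, so the score is 18.5.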

For **projective parsing**, where dependency arcs do not cross, efficient algorithms like the **Eisner algorithm** can be used to find the highest-scoring projective dependency tree in $ O(n^3) $ time, where $ n $ is the sentence length. This algorithm uses dynamic programming to compute scores for subtrees.
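The dynamic program can be sketched compactly. The version below is a first-order implementation under stated assumptions: `scores[h][m]` is the score of the arc from head `h` to dependent `m`, node `0` is an artificial root, and the root may take more than one child. The table names (`complete`, `incomplete`) and the backpointer layout are choices made for this sketch.

```python
def eisner(scores):
    """Return (best_score, heads) for the highest-scoring projective tree.

    scores[h][m] is the score of arc h -> m; node 0 is the artificial root.
    heads[m] is the head of m, and heads[0] stays None.
    """
    n = len(scores)
    neg = float("-inf")
    # Tables are indexed [s][t][d]: d = 1 means the head is the left end s,
    # d = 0 means the head is the right end t.
    complete = [[[0.0, 0.0] if s == t else [neg, neg] for t in range(n)] for s in range(n)]
    incomplete = [[[neg, neg] for _ in range(n)] for _ in range(n)]
    comp_bp = [[[None, None] for _ in range(n)] for _ in range(n)]
    inc_bp = [[[None, None] for _ in range(n)] for _ in range(n)]

    for k in range(1, n):                    # span length
        for s in range(n - k):
            t = s + k
            # Incomplete span: join two complete halves, then add an arc s-t.
            best, r = max((complete[s][q][1] + complete[q + 1][t][0], q) for q in range(s, t))
            incomplete[s][t][0] = best + scores[t][s]    # arc t -> s
            incomplete[s][t][1] = best + scores[s][t]    # arc s -> t
            inc_bp[s][t][0] = inc_bp[s][t][1] = r
            # Complete span: extend an incomplete span with a complete one.
            complete[s][t][0], comp_bp[s][t][0] = max(
                (complete[s][q][0] + incomplete[q][t][0], q) for q in range(s, t))
            complete[s][t][1], comp_bp[s][t][1] = max(
                (incomplete[s][q][1] + complete[q][t][1], q) for q in range(s + 1, t + 1))

    heads = [None] * n

    def backtrack(s, t, d, is_complete):
        if s == t:
            return
        if is_complete:
            r = comp_bp[s][t][d]
            if d == 0:
                backtrack(s, r, 0, True)
                backtrack(r, t, 0, False)
            else:
                backtrack(s, r, 1, False)
                backtrack(r, t, 1, True)
        else:
            r = inc_bp[s][t][d]
            if d == 0:
                heads[s] = t
            else:
                heads[t] = s
            backtrack(s, r, 1, True)
            backtrack(r + 1, t, 0, True)

    backtrack(0, n - 1, 1, True)
    return complete[0][n - 1][1], heads


NEG = float("-inf")                          # disallowed arcs (self-loops, arcs into root)
example_scores = [
    [NEG, 1.0, 5.0],                         # root -> w1, root -> w2
    [NEG, NEG, 2.0],                         # w1 -> w2
    [NEG, 4.0, NEG],                         # w2 -> w1
]
best_score, best_heads = eisner(example_scores)
print(best_score, best_heads)
```

The three nested loops over span length, start position, and split point give the $ O(n^3) $ running time. In the two-word example the best tree attaches `w2` to the root and `w1` to `w2`, for a total score of 9.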

For **non-projective parsing**, which allows for crossing arcs and is necessary for languages with more flexible word order, the **Chu-Liu/Edmonds algorithm** (or its variations) can be employed to find the maximum spanning tree in a directed graph. A straightforward implementation runs in $ O(n^3) $ time, and Tarjan's implementation reduces this to $ O(n^2) $ for dense graphs, so arc-factored non-projective decoding is asymptotically no more expensive than the Eisner algorithm.

The training process for graph-based neural parsers involves maximizing the score of the gold dependency tree while minimizing the scores of other possible trees.  This is often achieved using a structured learning objective, such as max-margin loss or cross-entropy loss adapted for structured prediction.

In summary, neural dependency parsing, both transition-based and graph-based, has significantly advanced the field by automating feature engineering through neural networks. They learn rich, contextualized representations of words and dependencies, leading to state-of-the-art parsing accuracy.

**Technical Conclusions (Pros, Cons, Research Directions)**

**Pros of Syntactic Structure and Dependency Parsing:**

1. **Semantic Interpretation**:  Syntactic structure, especially dependency structure, provides a crucial intermediary step towards semantic understanding. By explicitly representing relationships like subject-verb-object, it facilitates the extraction of meaning from sentences.  Formally, if we have a dependency tree $ D(S) $ for sentence $ S $, we can define a function $ \mathcal{I}(D(S)) $ that maps the syntactic structure to a semantic representation, enabling tasks like semantic role labeling and relation extraction.

2. **Robustness and Generalization**: Dependency parsing is relatively robust to word order variations, especially compared to phrase structure parsing.  This is advantageous for languages with flexible word order and for processing noisy text.  Dependency representations are designed to capture functional relationships, which are often more consistent across different linguistic constructions.

3. **Efficiency**: Transition-based parsers, particularly neural ones, offer a good balance between accuracy and speed.  With greedy decoding they run in time linear in sentence length, which makes them efficient for processing large volumes of text. Graph-based parsers, while often more accurate, can be computationally more expensive, with cubic time complexity for projective decoding.

4. **Foundation for Downstream Tasks**: Syntactic parsing serves as a foundational module for a wide range of NLP applications, including machine translation, information extraction, question answering, and text summarization.  Accurate syntactic analysis improves the performance of these downstream systems by providing structured information about the input text.

**Cons of Syntactic Structure and Dependency Parsing:**

1. **Ambiguity Resolution**: Natural language is inherently ambiguous, and syntactic parsing must grapple with lexical, structural, and attachment ambiguities.  While parsers have become very sophisticated, resolving all ambiguities correctly remains a challenge. Mathematically, if $ \mathcal{A}(S) $ is the set of possible syntactic structures for sentence $ S $, a parser aims to select the correct structure $ \mathcal{G}^*(S) \in \mathcal{A}(S) $. However, distinguishing $ \mathcal{G}^*(S) $ from other plausible but incorrect structures in $ \mathcal{A}(S) $ can be difficult.

2. **Data Dependency**: Supervised dependency parsers heavily rely on treebanks for training. The performance of a parser is often limited by the size, quality, and domain relevance of the treebank it is trained on.  For low-resource languages where treebanks are scarce or non-existent, building accurate dependency parsers is a significant challenge.

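3. **Complexity of Non-projectivity**: Handling non-projective dependencies, while crucial for some languages, complicates parsing. The basic Arc-Standard and Arc-Eager transition systems can only produce projective trees and need extensions (for example, additional transitions or pseudo-projective transformations of the treebank) to recover crossing arcs. In graph-based approaches, arc-factored models can decode non-projective trees with Chu-Liu/Edmonds, but exact decoding becomes intractable once scores depend on more than one arc at a time.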

4. **Error Propagation**: Errors in parsing can propagate to downstream NLP tasks. An incorrect parse tree can lead to misinterpretations in semantic analysis, information extraction, and other applications.  The cascade of errors can negatively impact the overall performance of NLP systems.

**Research Improvement Directions:**

1. **Contextualized Representations**:  Further improve the integration of contextual information into dependency parsing. While current neural parsers utilize word embeddings and neural network architectures like LSTMs and Transformers to capture context, research can explore more effectively modeling long-range dependencies and discourse-level context to enhance parsing accuracy.  This can involve developing novel neural architectures or incorporating attention mechanisms that better capture relevant contextual features.

2. **Low-Resource Parsing**: Develop methods for dependency parsing in low-resource settings where treebanks are limited or unavailable.  This includes exploring unsupervised or weakly supervised parsing techniques, transfer learning from high-resource languages, and active learning strategies to efficiently utilize limited annotation resources.  Mathematically, this could involve minimizing reliance on fully supervised learning by incorporating techniques like semi-supervised learning or unsupervised pre-training followed by fine-tuning with minimal supervision.

3. **Parsing Speed and Efficiency**: Enhance the speed and efficiency of dependency parsers, especially for real-time applications and processing massive text datasets.  Research can focus on optimizing parsing algorithms, developing more efficient neural network architectures for parsing, and exploring techniques like pruning search spaces in transition-based parsing or efficient MST algorithms in graph-based parsing.

4. **Handling Complex Linguistic Phenomena**: Improve parsers' ability to handle complex linguistic phenomena such as coordination, ellipsis, and long-distance dependencies.  This requires deeper linguistic analysis and potentially incorporating more explicit linguistic constraints into parsing models.  Research can explore integrating grammatical theories more directly into neural parser architectures or developing specialized modules to address specific types of complex syntactic structures.

5. **Evaluation Metrics and Diagnostics**: Develop more comprehensive evaluation metrics and diagnostic tools for dependency parsing.  Beyond UAS and LAS, metrics that evaluate the parser's ability to handle specific types of dependencies or linguistic structures would be valuable.  Error analysis and visualization tools can help identify systematic errors and guide further improvements in parsing models.

By addressing these challenges and pursuing these research directions, the field of syntactic structure and dependency parsing can continue to advance, leading to more robust, accurate, and linguistically insightful natural language processing systems.

-------------
# Syntactic Structure and Dependency Parsing

## Syntactic Structure: Consistency and Dependency

Syntactic structure represents the hierarchical organization of linguistic units within a sentence, formally defined as a directed graph $G = (V, E)$ where $V$ is the set of words and $E$ is the set of syntactic relations. Consistency in syntactic structure requires that the graph satisfies the well-formedness constraint $\forall e_i, e_j \in E, e_i \neq e_j \Rightarrow \text{dependent}(e_i) \neq \text{dependent}(e_j)$, ensuring each word has at most one head.

Dependency relations are formally expressed as binary asymmetric relations between lexical items. Given words $w_i$ and $w_j$ in sentence $S = (w_1, w_2, ..., w_n)$, a dependency relation $r(w_i, w_j)$ indicates that $w_i$ is the head and $w_j$ is the dependent. The mathematical foundation is captured by:

$$D(S) = \{(i, j, r) | w_i \text{ is head of } w_j \text{ with relation } r\}$$

The dependency structure must satisfy projectivity constraints in many formalisms, expressed as:

$$\forall (i, j, r) \in D(S), \forall k \in \mathbb{Z}: \min(i,j) < k < \max(i,j) \Rightarrow \exists (p, k, q) \in D(S) \text{ where } \min(i,j) \leq p \leq \max(i,j)$$
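The constraint translates almost literally into code. The sketch below assumes a tree encoded as a list `heads`, where `heads[j]` is the position of the head of word `j`, words are numbered from 1, and position 0 is an artificial root with `heads[0] = None`; this encoding is an assumption made for the example.

```python
def is_projective(heads):
    """Check that every word strictly inside an arc's span has its head inside that span."""
    for dep in range(1, len(heads)):
        lo, hi = sorted((heads[dep], dep))
        for k in range(lo + 1, hi):          # words strictly between head and dependent
            if not lo <= heads[k] <= hi:
                return False
    return True


# "The cat sat on the mat": 1=The 2=cat 3=sat 4=on 5=the 6=mat, 0 = root
projective_heads = [None, 2, 3, 0, 3, 6, 4]
# Arcs 3 -> 1 and 0 -> 2 cross, so this tree is non-projective
crossing_heads = [None, 3, 0, 2]

print(is_projective(projective_heads), is_projective(crossing_heads))
```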

## Dependency Grammar and Treebanks

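In a rule-based formulation, a dependency grammar can be written as a 4-tuple $G = (N, T, S, R)$ where $N$ is a set of non-terminal symbols (word categories), $T$ is a set of terminal symbols (words), $S$ is the start symbol, and $R$ is a set of rewriting rules, each licensing a head together with its dependents. The tuple has the same shape as that of a context-free grammar; the difference from constituency grammar is that the rules directly represent word-to-word relationships rather than nested phrases.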

Under an arc-factored independence assumption, the probability of a dependency tree $T$ for sentence $S$ can be modeled as a product over its arcs:

$$P(T|S) = \prod_{(i,j,r) \in D(T)} P(r|w_i,w_j,pos_i,pos_j,C)$$

where $C$ represents contextual information, $pos_i$ and $pos_j$ are part-of-speech tags.
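Because the product runs over many small factors, implementations usually work with log-probabilities. The following is a minimal sketch, assuming the per-arc probabilities are already available from some model; the lookup table and its values are made up, and conditioning on POS tags and context is folded into that lookup.

```python
import math


def tree_log_probability(tree, arc_probability):
    """Sum log P(r | ...) over the (i, j, r) triples of a tree."""
    return sum(math.log(arc_probability[(i, j, r)]) for (i, j, r) in tree)


# Hypothetical per-arc probabilities for a two-word sentence (0 = root)
arc_probability = {
    (0, 2, "root"): 0.5,
    (2, 1, "nsubj"): 0.8,
}
tree = [(0, 2, "root"), (2, 1, "nsubj")]

log_p = tree_log_probability(tree, arc_probability)
print(log_p, math.exp(log_p))
```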

Treebanks are formally defined as annotated corpora $TB = \{(S_1,T_1), (S_2,T_2), ..., (S_n,T_n)\}$ where each pair $(S_i,T_i)$ consists of a sentence and its corresponding dependency tree. Inter-annotator agreement is measured using metrics like Cohen's $\kappa$:

$$\kappa = \frac{P_o - P_e}{1 - P_e}$$

where $P_o$ is the observed agreement proportion and $P_e$ is the expected agreement by chance.
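As a worked example of the formula, the sketch below computes $\kappa$ for two annotators who labeled the same four arcs. The labels are invented for illustration, and $P_e$ is estimated from each annotator's own label frequencies, as in Cohen's definition.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)


annotator_a = ["nsubj", "obj", "nsubj", "obj"]
annotator_b = ["nsubj", "obj", "obj", "obj"]
kappa = cohens_kappa(annotator_a, annotator_b)
print(kappa)
```

Here $P_o = 3/4$ and $P_e = (2 \cdot 1 + 2 \cdot 3)/16 = 1/2$, so $\kappa = 0.5$.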

## Transition-based Dependency Parsing

Transition-based parsing is formalized as a state machine defined by the tuple $M = (C, T, c_s, C_t, \delta)$ where:
- $C$ is a set of configurations
- $T$ is a set of transitions
- $c_s$ is the initial configuration
- $C_t$ is the set of terminal configurations
- $\delta: C \times T \rightarrow C$ is the transition function

Each configuration is represented as $c = (\sigma, \beta, A)$ where $\sigma$ is the stack, $\beta$ is the buffer, and $A$ is the set of dependency arcs constructed so far.

For a sentence $w_1, w_2, ..., w_n$, the initial configuration is $c_0 = ([ROOT], [w_1, w_2, ..., w_n], \emptyset)$. The transition set typically includes:

$$\text{SHIFT}: (\sigma, w_i|\beta, A) \vdash (\sigma|w_i, \beta, A)$$
$$\text{LEFT-ARC}_r: (\sigma|w_i|w_j, \beta, A) \vdash (\sigma|w_j, \beta, A \cup \{(w_j, w_i, r)\})$$
$$\text{RIGHT-ARC}_r: (\sigma|w_i|w_j, \beta, A) \vdash (\sigma|w_i, \beta, A \cup \{(w_i, w_j, r)\})$$

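With greedy decoding, the time complexity is $O(n)$ where $n$ is the sentence length: each word is shifted once and later removed from the stack by exactly one arc transition, giving $2n$ transitions in total, each assumed to take constant time.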

## Neural Dependency Parsing

Neural dependency parsing employs vector representations and neural architectures to predict transitions or arcs. The formal definition involves a mapping function $f_\theta: X \rightarrow Y$ where $X$ represents parser configurations and $Y$ represents parser actions.

For a BiLSTM-based parser, token representations are computed as:

$$\overrightarrow{h_i} = \text{LSTM}_f(x_i, \overrightarrow{h_{i-1}})$$
$$\overleftarrow{h_i} = \text{LSTM}_b(x_i, \overleftarrow{h_{i+1}})$$
$$h_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$$

The scoring function for a transition $t$ given configuration $c$ is:

$$\text{score}(t|c) = W_2 \cdot \text{ReLU}(W_1 \cdot \phi(c) + b_1) + b_2$$

where $\phi(c)$ extracts features from configuration $c$:

$$\phi(c) = [h_{s_1}; h_{s_2}; ...; h_{s_k}; h_{b_1}; h_{b_2}; ...; h_{b_m}]$$

The objective function minimizes the negative log-likelihood:

$$\mathcal{L}(\theta) = -\sum_{i=1}^{N} \log P(a_i|c_i;\theta)$$

where $P(a_i|c_i;\theta) = \frac{\exp(\text{score}(a_i|c_i))}{\sum_{a' \in T} \exp(\text{score}(a'|c_i))}$ and the sum in the denominator ranges over the transition set $T$.
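The objective can be evaluated from raw scores with a numerically stable log-softmax. This is a small sketch assuming the scores for each configuration are already computed and listed in a fixed action order; the score values and the `ACTIONS` ordering are hypothetical.

```python
import math


def log_softmax(scores):
    """Log of the softmax, using the log-sum-exp trick for stability."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def negative_log_likelihood(score_rows, gold_actions):
    """Sum of -log P(a_i | c_i) over all configurations."""
    return -sum(log_softmax(row)[a] for row, a in zip(score_rows, gold_actions))


ACTIONS = ["SHIFT", "LEFT-ARC", "RIGHT-ARC"]       # hypothetical action order
score_rows = [[2.0, 0.5, 0.1], [0.3, 1.7, 0.2]]    # one row of scores per configuration
gold_actions = [0, 1]                              # indices into ACTIONS

loss = negative_log_likelihood(score_rows, gold_actions)
print(loss)
```

A useful sanity check is that two equal scores give each action probability one half, so a single step contributes $\log 2$ to the loss.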

Graph-based neural dependency parsers directly score arc factorized dependencies:

$$\text{score}(i \rightarrow j) = MLP([h_i; h_j])$$

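Decoding then searches for the highest-scoring tree: the Eisner algorithm finds the best projective tree in $O(n^3)$ time, while the Chu-Liu/Edmonds Maximum Spanning Tree (MST) algorithm finds the best non-projective tree in $O(n^2)$ time with Tarjan's implementation.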

Recent advancements include self-attention mechanisms where dependency scores are computed as scaled dot products of query and key projections, with $d$ the dimensionality of those projections:

$$e_{ij} = \frac{(W_Q h_i)^T (W_K h_j)}{\sqrt{d}}$$
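A single score of this form can be computed as follows. The sketch assumes $d$ is the dimensionality of the projected vectors and uses identity projections with hand-picked inputs so the arithmetic is easy to follow; none of these values come from a trained model.

```python
import math


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attention_score(h_i, h_j, W_Q, W_K):
    """Scaled dot product between the query for word i and the key for word j."""
    q = matvec(W_Q, h_i)
    k = matvec(W_K, h_j)
    d = len(q)                               # dimensionality of the projections
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(d)


identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
h_i = [1.0, 0.0, 1.0, 0.0]
h_j = [1.0, 1.0, 1.0, 1.0]

e_ij = attention_score(h_i, h_j, identity, identity)
print(e_ij)
```

With identity projections the dot product is 2 and $\sqrt{d} = 2$, so the score is 1.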

The integration of pretrained language models has yielded state-of-the-art results, with reported improvements on the order of 3-5 points in attachment scores, although the size of the gain depends on the language, domain, and baseline.