## Dependency parsing with spaCy

This script takes Unicode plain text and prints its dependency parse in CoNLL10 format (ten tab-separated columns per token). It was originally written to prepare input files for named/non-named entity extraction with [xrenner](https://corpling.uis.georgetown.edu/xrenner/).

For installation instructions for spaCy, see https://spacy.io/docs#getting-started.

In [1]:
import spacy

### Load the English tagger

**Note:** Loading the tagger is expensive. [The documentation](https://spacy.io/docs#english-init) says it can take 10-20 seconds and 2-3 GB of RAM.

In [2]:
nlp = spacy.load('en')
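
Because loading is slow, it can help to make sure it happens only once per session, even if the loading code is run again. One possible approach is sketched below with `functools.lru_cache`. The names `get_pipeline` and `fake_load` are hypothetical; `fake_load` stands in for `spacy.load` so that the sketch runs without spaCy installed.

```python
import functools

@functools.lru_cache(maxsize=None)
def get_pipeline(name, loader):
    """Call loader(name) on first use; return the cached result afterwards."""
    return loader(name)

# Hypothetical stand-in for spacy.load that records each real load.
load_calls = []

def fake_load(name):
    load_calls.append(name)
    return {'model': name}

first = get_pipeline('en', fake_load)
second = get_pipeline('en', fake_load)  # served from the cache, no second load
print(first is second, len(load_calls))
```

With spaCy available, the equivalent call would presumably be `nlp = get_pipeline('en', spacy.load)`.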

### Give spaCy some input text, then process it.

**Note:** spaCy input has to be Unicode text.
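
If the text comes from a file or another byte source rather than a string literal, it needs to be decoded first. A minimal sketch, assuming the bytes are UTF-8 encoded; the sample bytes and the variable names `raw` and `decoded` are illustrative and not part of the original notebook:

```python
# Assumption: the raw input is UTF-8; use a different codec for other encodings.
raw = b"Cato\xe2\x80\x99s family got its first lustre"  # e.g. from open(path, 'rb').read()

decoded = raw.decode('utf-8')  # a Unicode string, suitable for passing to nlp()
print(type(decoded).__name__, len(decoded))
```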

In [3]:
text = u'''1. Cato's family got its first lustre and fame from his great-grandfather Cato (a man whose virtue gained him the greatest reputation and influence among the Romans, as has been written in his Life), but the death of both parents left him an orphan, together with his brother Caepio and his sister Porcia. Cato had also a half-sister, Servilia, the daughter of his mother.1 All these children were brought up in the home of Livius Drusus, their uncle on the mother's side, who at that time was a leader in the conduct of public affairs; for he was a most powerful speaker, in general a man of the greatest discretion, and yielded to no Roman in dignity of purpose.
[2] We are told that from his very childhood Cato displayed, in speech, in countenance, and in his childish sports, a nature that was inflexible, imperturbable, and altogether steadfast. He set out to accomplish his purposes with a vigour beyond his years, and while he was harsh and repellent to those who would flatter him, he was still more masterful towards those who tried to frighten him. It was altogether difficult to make him laugh, although once in a while he relaxed his features so far as to smile; and he was not quickly nor easily moved to anger, though once angered he was inexorable.'''

In [4]:
doc = nlp(text)

### CoNLL10 output

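spaCy's output, in particular its token IDs, takes some massaging to produce a well-formed CoNLL10 document: spaCy positions tokens relative to the whole document, whereas CoNLL numbers tokens from 1 within each sentence and gives the root a head ID of 0.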

The column layout is described in the [xrenner input format documentation](https://corpling.uis.georgetown.edu/xrenner/doc/using.html#input-format).
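
The ID handling can be illustrated without spaCy installed. The sketch below is an illustration under stated assumptions, not the notebook's own code: `Tok` is a hypothetical stand-in for a spaCy token that holds a character offset (like `token.idx`), the head's offset, and the annotations, and `to_conll_rows` is a made-up helper name.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy token; idx and head_idx are character offsets.
Tok = namedtuple('Tok', 'idx text lemma tag dep head_idx')

def to_conll_rows(sentence):
    """Return one 10-column, tab-separated row per token in the sentence."""
    # Map each character offset to a 1-based, sentence-local token ID.
    ids = {tok.idx: i for i, tok in enumerate(sentence, start=1)}
    rows = []
    for tok in sentence:
        # The root has no head within the sentence, so its head ID is 0.
        head_id = 0 if tok.dep == 'ROOT' else ids[tok.head_idx]
        fields = [ids[tok.idx], tok.text, tok.lemma, tok.tag, tok.tag,
                  '_', head_id, tok.dep, '_', '_']
        rows.append('\t'.join(str(f) for f in fields))
    return rows

# A sentence that starts at character offset 120 of some longer document.
sentence = [
    Tok(120, 'Cato', 'cato', 'NNP', 'nsubj', 125),
    Tok(125, 'smiled', 'smile', 'VBD', 'ROOT', 125),
    Tok(131, '.', '.', '.', 'punct', 125),
]
for row in to_conll_rows(sentence):
    print(row)
```

Although the offsets start at 120, the rows are numbered 1 to 3, and the root row carries 0 in the head column.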

In [5]:
for sent in doc.sents:
    # Map each token's character offset (token.idx) to its 1-based position in the sentence.
    ids = {}
    for i, token in enumerate(sent):
        ids[token.idx] = i+1
        
    for token in sent:
        # Clean up token attributes
        token_id = str(ids[token.idx]).strip()
        token_text = str(token).strip()
        lemma = str(token.lemma_).strip()
        pos_tag = str(token.tag_).strip()
        depend = str(token.dep_).strip()
        
        # Set head ID correctly for root of sentence.
        if token.dep_ == 'ROOT':
            head_id = str(0)
        else:
            head_id = str(ids[token.head.idx]).strip()
        
        # CoNLL10 output
        # Comments below are modified from https://corpling.uis.georgetown.edu/xrenner/doc/using.html#input-format
        print(token_id + '\t' +      # token ID w/in sentence
              token_text + '\t' +    # token text
              lemma + '\t' +         # lemmatized token
              pos_tag + '\t' +       # part of speech tag for token
              pos_tag + '\t' +       # same tag repeated to fill the second POS column
              '_' + '\t' +           # placeholder for morphological information
              head_id + '\t' +       # ID of head token
              depend + '\t' +        # dependency function
              '_' + '\t' + '_')      # two unused columns
              

1	1	1	CD	CD	_	0	ROOT	_	_
2	.	.	.	.	_	1	punct	_	_
1	Cato	cato	NNP	NNP	_	3	poss	_	_
2	's	's	POS	POS	_	1	case	_	_
3	family	family	NN	NN	_	4	nsubj	_	_
4	got	get	VBD	VBD	_	0	ROOT	_	_
5	its	its	PRP$	PRP$	_	7	poss	_	_
6	first	first	JJ	JJ	_	7	amod	_	_
7	lustre	lustre	NN	NN	_	4	dobj	_	_
8	and	and	CC	CC	_	7	cc	_	_
9	fame	fame	NN	NN	_	7	conj	_	_
10	from	from	IN	IN	_	4	prep	_	_
11	his	his	PRP$	PRP$	_	15	poss	_	_
12	great	great	JJ	JJ	_	14	amod	_	_
13	-	-	HYPH	HYPH	_	14	punct	_	_
14	grandfather	grandfather	NN	NN	_	15	compound	_	_
15	Cato	cato	NNP	NNP	_	10	pobj	_	_
16	(	(	-LRB-	-LRB-	_	15	punct	_	_
17	a	a	DT	DT	_	18	det	_	_
18	man	man	NN	NN	_	15	appos	_	_
19	whose	whose	WP$	WP$	_	20	poss	_	_
20	virtue	virtue	NN	NN	_	21	nsubj	_	_
21	gained	gain	VBD	VBD	_	18	relcl	_	_
22	him	him	PRP	PRP	_	21	dobj	_	_
23	the	the	DT	DT	_	25	det	_	_
24	greatest	great	JJS	JJS	_	25	amod	_	_
25	reputation	reputation	NN	NN	_	21	dobj	_	_
26	and	and	CC	CC	_	25	cc	_	_
27	influence	influence	NN	NN	_	25	conj	_	_
28	among	among	IN	