# Samuels Processor

An interface to a processor for the Samuels tagger output: it annotates the text with Spacy dependency relations and extracts co-occurrence distributions.

In [1]:
import SamuelsCorpus as Sam

Set up some filenames:

In [2]:
parentdir="/Users/juliewe/Dropbox/oldbailey/speech_corpora/theft/all/samuels_tagged"
filenames={
    'f_nonl':['corpus_1800_1820_theft_f_def_semtagged','corpus_1800_1820_theft_f_wv_semtagged_new'],
    'm_nonl':['corpus_1800_1820_theft_m_def_semtagged_new','corpus_1800_1820_theft_m_wv_semtagged'],
    'm_leg':['corpus_1800_1820_theft_m_lj_semtagged'],
}

Load the Samuels files and view a sample.

* The Samuels Processor has one required argument: a list of filenames (of Samuels-tagged files) to read.

* It can also take a parent directory (`parentdir`), which is where these files are stored if they are not in the current directory.

* It can also take an `outfile` argument. This is the stem of the filenames for the associated output files. If it is not given, the default is the first filename in the list of inputs.
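
As a rough illustration of how these three arguments relate, the sketch below resolves the input paths and picks the output stem. `resolve_inputs` is a hypothetical helper written for this note, not part of `SamuelsCorpus`; the real constructor may handle these details differently.

```python
import os

def resolve_inputs(filenames, parentdir="", outfile=None):
    """Hypothetical helper mimicking the argument handling described above."""
    if not filenames:
        raise ValueError("at least one filename is required")
    # Join each filename onto the parent directory (empty means current directory).
    paths = [os.path.join(parentdir, name) for name in filenames]
    # With no outfile given, fall back to the first input filename as the stem.
    stem = outfile if outfile is not None else filenames[0]
    return paths, stem

paths, stem = resolve_inputs(["a_semtagged", "b_semtagged"], parentdir="corpus")
print(paths)
print(stem)
```

Passing `outfile='f_nonl'`, as in the cell that follows, keeps the output filenames short instead of inheriting the long name of the first input file.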

In [3]:
f=Sam.Processor(filenames['f_nonl'],parentdir=parentdir,outfile='f_nonl')
f.get_dataframe().head(10)

Initialising Spacy
Reading /Users/juliewe/Dropbox/oldbailey/speech_corpora/theft/all/samuels_tagged/corpus_1800_1820_theft_f_def_semtagged
Reading /Users/juliewe/Dropbox/oldbailey/speech_corpora/theft/all/samuels_tagged/corpus_1800_1820_theft_f_wv_semtagged_new
Read 213816, errors = 2
9150 chunks of sentences


Unnamed: 0,fileid,chunk,sentence,id,vard_lower,vard,LEMMA,POS,SEMTAG1,MWE,SEMTAG2,SEMTAG3
0,2,0,0,-1,s_begin,S_BEGIN,,,Z99,0,,
1,3,0,0,0,it,It,it,PPH1,Z8,0,04.06 [],ZF [Pronoun]
2,4,0,0,1,was,was,be,VBDZ,A3+,0,01.11.01.07 [Be/remain in specific state/condi...,AK.01.g [State/condition]
3,5,0,0,2,none,none,none,PN,Z6/Z8c,0,04.04 [],ZD [Negative]
4,6,0,0,3,of,of,of,IO,Z5,0,04.03 [Grammatical],ZC [Grammatical Item]
5,7,0,0,4,mr.,Mr.,mr.,NNB,Z1m,1:2:1,04.01.01 [Personal Name],ZA01 [Personal Name]
6,8,0,0,5,jones,Jones,jones,NP1,Z1m,1:2:2,04.01.01 [Personal Name],ZA01 [Personal Name]
7,9,0,0,6,'s,'s,'s,GE,Z5,0,04.03 [],ZC [Grammatical Item]
8,10,0,0,7,property,property,property,NN1,A9+,0,02.06.03-22.04.07 [0.94444444] [a landed prope...,AW.01.a [Possessions]
9,11,0,0,8,;,;,PUNC,YSCOL,PUNC,0,,


Having initialised the Samuels Processor with the input files, it is now time to run it.

* The first step is to take each chunk (normally one sentence) and run Spacy on it. Spacy is given the pre-tokenised list of tokens from the Samuels tagger, so there can be no disagreements over tokenisation. Spacy adds annotations for POS and grammatical relations, and the result is written to a csv file. There may be some "Ignoring new sentence number,1" warnings; these can be ignored (provided the second number is 1). They arise because Spacy occasionally decides that a chunk should be split into multiple sentences; that split is discarded so that the sentence splitting by Samuels is used.

* Step 2 is to run a distributional feature extractor. Co-occurrence features (based on grammatical relations) are extracted for every row of the file, using both forward and backward relations. Totals are calculated over the whole corpus and weights are calculated using PPMI (note that the calls in this notebook pass `measure="lpmi"`). The results are written to files. The default field is 'SEMTAG3', but others can be used, e.g. 'vard'.
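
To make step 2 more concrete, here is a minimal sketch of dependency-based feature extraction followed by PPMI weighting. It is not the `SamuelsCorpus` implementation. The toy rows stand in for the parsed output of step 1, the column names `tag`, `head` and `dep` are assumptions, and marking backward relations with a leading `_` is just one possible convention. It implements plain PPMI, not the `lpmi` measure passed in the cells that follow.

```python
import math
from collections import Counter, defaultdict

# Toy rows standing in for parsed output. 'tag', 'head' and 'dep' are assumed
# column names; 'head' is the id of the governing token in the same chunk
# (-1 for the root).
rows = [
    {"chunk": 0, "id": 0, "tag": "Pronoun", "head": 1, "dep": "nsubj"},
    {"chunk": 0, "id": 1, "tag": "Take", "head": -1, "dep": "ROOT"},
    {"chunk": 0, "id": 2, "tag": "Possessions", "head": 1, "dep": "dobj"},
    {"chunk": 1, "id": 0, "tag": "Pronoun", "head": 1, "dep": "nsubj"},
    {"chunk": 1, "id": 1, "tag": "See", "head": -1, "dep": "ROOT"},
    {"chunk": 1, "id": 2, "tag": "Pronoun", "head": 1, "dep": "dobj"},
]

def extract_features(rows):
    """Count (target, feature) pairs using forward and backward relations."""
    by_position = {(r["chunk"], r["id"]): r for r in rows}
    counts = defaultdict(Counter)
    for r in rows:
        head = by_position.get((r["chunk"], r["head"]))
        if head is None:  # root token: no governing token in this chunk
            continue
        # Forward: the dependent is described by its relation to the head.
        counts[r["tag"]][(r["dep"], head["tag"])] += 1
        # Backward: the head is described by the inverse relation.
        counts[head["tag"]][("_" + r["dep"], r["tag"])] += 1
    return counts

def ppmi(counts):
    """Weight each count by positive pointwise mutual information."""
    total = sum(sum(feats.values()) for feats in counts.values())
    target_totals = {t: sum(feats.values()) for t, feats in counts.items()}
    feature_totals = Counter()
    for feats in counts.values():
        feature_totals.update(feats)
    weighted = {}
    for target, feats in counts.items():
        weighted[target] = {}
        for feat, n in feats.items():
            pmi = math.log2(n * total / (target_totals[target] * feature_totals[feat]))
            if pmi > 0:  # PPMI keeps only positive associations
                weighted[target][feat] = pmi
    return weighted

counts = extract_features(rows)
weights = ppmi(counts)
for target in sorted(weights):
    print(target, weights[target])
```

Each target (here a semantic tag) ends up with a sparse vector of weighted features, which is the kind of structure that can then be normalised and written out.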


In [4]:
f.run(field='SEMTAG3',measure="lpmi")

Adding Spacy annotations
Extracting tokens
Running spacy
Extracting dependency features
Extracting features
Processed 10000 rows
Processed 20000 rows
Processed 30000 rows
Processed 40000 rows
Processed 50000 rows
Processed 60000 rows
Processed 70000 rows
Processed 80000 rows
Processed 90000 rows
Processed 100000 rows
Processed 110000 rows
Processed 120000 rows
Processed 130000 rows
Processed 140000 rows
Processed 150000 rows
Processed 160000 rows
Processed 170000 rows
Processed 180000 rows
Processed 190000 rows
Processed 200000 rows
Processed 210000 rows
Converting to lpmi
Normalising vectors to unit length
Completed successfully, writing f_nonl_combined.csv, f_nonl_cooccurrence.json,f_nonl_cooccurrence_byrel.json and f_nonl_rel.json


Now let's process the male corpus in the same way. All we need to do is initialise a processor with the filenames (plus parentdir and outfile stem) and then run it.

In [5]:
m=Sam.Processor(filenames['m_nonl'],parentdir=parentdir,outfile='m_nonl')
m.run(measure='lpmi')

Initialising Spacy
Reading /Users/juliewe/Dropbox/oldbailey/speech_corpora/theft/all/samuels_tagged/corpus_1800_1820_theft_m_def_semtagged_new
Reading /Users/juliewe/Dropbox/oldbailey/speech_corpora/theft/all/samuels_tagged/corpus_1800_1820_theft_m_wv_semtagged
Read 1093881, errors = 2
44854 chunks of sentences
Adding Spacy annotations
Extracting tokens
Running spacy


Extracting dependency features
Extracting features
Processed 10000 rows
Processed 20000 rows
Processed 30000 rows
Processed 40000 rows
Processed 50000 rows
Processed 60000 rows
Processed 70000 rows
Processed 80000 rows
Processed 90000 rows
Processed 100000 rows
Processed 110000 rows
Processed 120000 rows
Processed 130000 rows
Processed 140000 rows
Processed 150000 rows
Processed 160000 rows
Processed 170000 rows
Processed 180000 rows
Processed 190000 rows
Processed 200000 rows
Processed 210000 rows
Processed 220000 rows
Processed 230000 rows
Processed 240000 rows
Processed 250000 rows
Processed 260000 rows
Processed 270000 rows
Processed 280000 rows
Processed 290000 rows
Processed 300000 rows
Processed 310000 rows
Processed 320000 rows
Processed 330000 rows
Processed 340000 rows
Processed 350000 rows
Processed 360000 rows
Processed 370000 rows
Processed 380000 rows
Processed 390000 rows
Processed 400000 rows
Processed 410000 rows
Processed 420000 rows
Processed 430000 rows
Processed 44

Processed 1040000 rows
Processed 1050000 rows
Processed 1060000 rows
Processed 1070000 rows
Processed 1080000 rows
Converting to lpmi
Normalising vectors to unit length
Completed successfully, writing m_nonl_combined.csv, m_nonl_cooccurrence.json,m_nonl_cooccurrence_byrel.json and m_nonl_rel.json


In [6]:
legal=Sam.Processor(filenames['m_leg'],parentdir=parentdir,outfile='m_leg')
legal.run(measure='lpmi')

Initialising Spacy
Reading /Users/juliewe/Dropbox/oldbailey/speech_corpora/theft/all/samuels_tagged/corpus_1800_1820_theft_m_lj_semtagged
Read 184107, errors = 1
9141 chunks of sentences
Adding Spacy annotations
Extracting tokens
Running spacy


Extracting dependency features
Extracting features
Processed 10000 rows
Processed 20000 rows
Processed 30000 rows
Processed 40000 rows
Processed 50000 rows
Processed 60000 rows
Processed 70000 rows
Processed 80000 rows
Processed 90000 rows
Processed 100000 rows
Processed 110000 rows
Processed 120000 rows
Processed 130000 rows
Processed 140000 rows
Processed 150000 rows
Processed 160000 rows
Converting to lpmi
Normalising vectors to unit length
Completed successfully, writing m_leg_combined.csv, m_leg_cooccurrence.json,m_leg_cooccurrence_byrel.json and m_leg_rel.json
