# Corpus analysis

The docuscospacy package supports the generation of:

* Token frequency tables
* Ngram tables
* Collocation tables around a node word
* Keyword comparisons against a reference corpus

Most importantly, **outputs can be controlled either by part-of-speech or by DocuScope tag**. Thus, *can* as noun and *can* as verb, for example, can be disambiguated.

Additionally, tagged multi-token sequences are aggregated for analysis. So, for example, where *in spite of* is tagged as a token sequence, it is combined into a single token.

In [1]:
import warnings
warnings.filterwarnings('ignore')

## Processing a corpus

Before we generate any counts or tables, we need to load a corpus and tokenize it. Be sure you have downloaded the `en_docusco_spacy` model from [the huggingface model repository](https://huggingface.co/browndw/en_docusco_spacy).

We will also load `Corpus`, `vocabulary_size` and `corpus_num_tokens` from **tmtoolkit**. If you aren't familiar with the package, be sure to [familiarize yourself with it](https://tmtoolkit.readthedocs.io/en/latest/text_corpora.html).

We will also import `re` for some simple pre-processing.

In [2]:
import spacy
from tmtoolkit.corpus import Corpus, vocabulary_size, corpus_num_tokens
import re

First, we need to load a spacy instance from the model.

In [4]:
nlp = spacy.load('en_docusco_spacy')

Next, we will define a simple pre-processing function. **For accurate tagging**, possessive *its* should be split into two tokens. The last part of the function will eliminate carriage returns, tabs, extra spaces, etc.

Note that you can also pass other functions in the list given to the `raw_preproc` argument. For example, `raw_preproc=[pre_process, simplify_unicode_chars]` would add a function built into **tmtoolkit** that replaces accented characters with unaccented ones.

In [5]:
def pre_process(txt):
    txt = re.sub(r'\bits\b', 'it s', txt)
    txt = re.sub(r'\bIts\b', 'It s', txt)
    txt = " ".join(txt.split())
    return txt
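
As a quick check of what these steps do, the sketch below applies the same function to a made-up string (the sample text is an assumption, not drawn from the corpus). Note that the contraction *it's* and words such as *limits* are left untouched, because the pattern only matches the whole word *its*.

```python
import re

def pre_process(txt):
    # Split possessive "its"/"Its" into two tokens, as in the cell above.
    txt = re.sub(r'\bits\b', 'it s', txt)
    txt = re.sub(r'\bIts\b', 'It s', txt)
    # Collapse tabs, newlines and repeated spaces into single spaces.
    txt = " ".join(txt.split())
    return txt

sample = "Its  scope\tand its\nlimits; it's clear."
cleaned = pre_process(sample)
print(cleaned)  # It s scope and it s limits; it's clear.
```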

The target corpus is a sample of academic papers available from the [**docuscospacy** repository](https://github.com/browndw/docuscospacy/tree/main/docs/source/data). Note the token attributes being returned: `spacy_token_attrs=['tag', 'ent_iob', 'ent_type', 'is_punct']`:

In [6]:
%%time
corp = Corpus.from_folder('data/tar_corpus', spacy_instance=nlp, raw_preproc=[pre_process], spacy_token_attrs=['tag', 'ent_iob', 'ent_type', 'is_punct'])

CPU times: user 8.33 s, sys: 1.02 s, total: 9.35 s
Wall time: 9.5 s


It is simple to calculate and store some basic information about the corpus. These numbers will be useful later.

In [7]:
corpus_total = corpus_num_tokens(corp)
corpus_types = vocabulary_size(corp)
total_punct = []
for i in range(0,len(corp)):
    total_punct.append(sum(corp[i]['is_punct']))
total_punct = sum(total_punct)
non_punct = corpus_total - total_punct

In [8]:
print('Alphanumeric tokens:', non_punct, '\nPunctuation tokens:', total_punct, '\nTotal tokens:', corpus_total, '\nToken types:', corpus_types)

Alphanumeric tokens: 134410 
Punctuation tokens: 18821 
Total tokens: 153231 
Token types: 13645


## Converting a corpus

Before we generate any tables, we first need to convert the corpus into a convenient object that we can manipulate. From `docuscospacy.corpus_analysis` we will import a number of functions including `convert_corpus`. The function simply takes the object produced by the `Corpus.from_folder` function.

In [9]:
from docuscospacy.corpus_analysis import convert_corpus, frequency_table, tags_table, ngrams_table, coll_table, tags_dtm, kwic_center_node, keyness_table

In [10]:
tp = convert_corpus(corp)

The result is a dictionary, whose keys are the names of the corpus files:

In [11]:
list(tp.keys())[:9]

['acad_23',
 'acad_37',
 'acad_36',
 'acad_22',
 'acad_34',
 'acad_20',
 'acad_08',
 'acad_09',
 'acad_21']

And the values are lists of nltk-like tuples:

In [12]:
list(tp.values())[1][:9]

[('often', 'RR', 'B-Narrative'),
 ('referred', 'VVN', 'B-InformationReportVerbs'),
 ('to', 'II', 'I-InformationReportVerbs'),
 ('as', 'II', 'I-InformationReportVerbs'),
 ('the', 'AT', 'B-Citation'),
 ('"', 'Y', 'I-Citation'),
 ('cartesian', 'NN1', 'B-AcademicTerms'),
 ('circle', 'NN1', 'O-'),
 ('"', 'Y', 'O-')]
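
The `B-` and `I-` prefixes mark the beginning and the continuation of a tagged sequence, which is what allows multi-token units to be combined. As an illustration only (this is not the package's own code, and `aggregate_ds` is a hypothetical helper), the tuples above could be merged along these lines:

```python
def aggregate_ds(tagged):
    """Merge B-/I- DocuScope sequences into (phrase, category) pairs."""
    merged = []
    for token, _pos, ds in tagged:
        prefix, _, category = ds.partition('-')
        if prefix == 'I' and merged:
            # Continuation: extend the phrase opened by the preceding tag.
            phrase, cat = merged[-1]
            merged[-1] = (phrase + ' ' + token, cat)
        else:
            # 'B-' opens a new phrase; anything else is treated as untagged ('O').
            merged.append((token, category if prefix == 'B' else 'O'))
    return merged

sample = [('often', 'RR', 'B-Narrative'),
          ('referred', 'VVN', 'B-InformationReportVerbs'),
          ('to', 'II', 'I-InformationReportVerbs'),
          ('as', 'II', 'I-InformationReportVerbs'),
          ('the', 'AT', 'B-Citation'),
          ('"', 'Y', 'I-Citation'),
          ('cartesian', 'NN1', 'B-AcademicTerms'),
          ('circle', 'NN1', 'O-'),
          ('"', 'Y', 'O-')]

merged = aggregate_ds(sample)
for phrase, category in merged:
    print(phrase, '->', category)
```

In this sketch *referred to as* ends up as one unit under a single category, which mirrors the aggregation described at the start of this page.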

## Frequency tables

Frequency tables are produced by the `frequency_table` function, which takes a converted corpus object, a count against which to normalize, and a `count_by` argument that is either **'pos'** or **'ds'**, for part-of-speech or DocuScope category.

In addition to being trained on DocuScope, the spaCy model was trained on the [CLAWS7 tagset](https://ucrel.lancs.ac.uk/claws7tags.html). Those tags are the default counting method.

Here, we use `non_punct` (or the total number of tokens that are not punctuation), as the part-of-speech token count omits tokens tagged as punctuation.

In [13]:
wc = frequency_table(tp, non_punct)

The table returns columns for tokens, tags, absolute frequency, relative frequency (per million tokens) and range (the percentage of texts in which the token appears):

In [14]:
wc.sort_values('RF', ascending=False).head(10).style.hide(axis='index').format(precision=2)

Token,Tag,AF,RF,Range
the,AT,9600,71423.26,100.0
of,IO,5063,37668.33,100.0
and,CC,3673,27326.84,100.0
in,II,2892,21516.26,100.0
a,AT1,2562,19061.08,100.0
to,TO,2190,16293.43,100.0
is,VBZ,1783,13265.38,98.0
that,CST,1563,11628.6,100.0
to,II,1313,9768.62,100.0
for,IF,1103,8206.23,100.0
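
To make the columns concrete, here is a minimal plain-Python sketch of the same kind of count over a made-up two-document corpus. It is not the package's implementation: `toy_frequency_table` is a hypothetical helper, lowercasing tokens is an assumption, and treating Range as the percentage of documents containing the pair is inferred from the values shown above. Counting (token, tag) pairs rather than bare tokens is what keeps *can* as modal verb apart from *can* as noun.

```python
from collections import Counter

def toy_frequency_table(docs, n_tokens, per=1_000_000):
    """docs: dict of doc name -> list of (token, tag) pairs."""
    counts = Counter()      # absolute frequency of each (token, tag) pair
    doc_counts = Counter()  # number of documents containing each pair
    for pairs in docs.values():
        lowered = [(tok.lower(), tag) for tok, tag in pairs]
        counts.update(lowered)
        doc_counts.update(set(lowered))
    return {key: {'AF': af,
                  'RF': af / n_tokens * per,
                  'Range': doc_counts[key] / len(docs) * 100}
            for key, af in counts.items()}

docs = {
    'doc_a': [('We', 'PPIS2'), ('can', 'VM'), ('open', 'VVI'),
              ('the', 'AT'), ('can', 'NN1')],
    'doc_b': [('They', 'PPHS2'), ('can', 'VM'), ('wait', 'VVI')],
}
table = toy_frequency_table(docs, n_tokens=8)
print(table[('can', 'VM')])
print(table[('can', 'NN1')])

# The per-million scale can be checked against the first row of the table above.
rf_the = 9600 / 134410 * 1_000_000
print(round(rf_the, 2))  # 71423.26
```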


The resulting data frame is easy to filter and sort. So, here, we filter for an absolute frequency greater than 10 and tokens tagged as verbs (tags starting with 'V'):

In [15]:
wc.query('AF > 10 and Tag.str.startswith("V")').sort_values('RF', ascending=False).head(10).style.hide(axis='index').format(precision=2)

Token,Tag,AF,RF,Range
is,VBZ,1783,13265.38,98.0
be,VBI,960,7142.33,98.0
are,VBR,763,5676.66,96.0
was,VBDZ,594,4419.31,92.0
will,VM,513,3816.68,82.0
can,VM,423,3147.09,94.0
were,VBDR,385,2864.37,84.0
has,VHZ,334,2484.93,86.0
have,VH0,296,2202.22,78.0
would,VM,288,2142.7,90.0


Here, we filter for adverbs (tags starting with 'R'). Note that multi-word units tagged as a sequence are aggregated into a single token (like *et al*):

In [16]:
wc.query('Tag.str.startswith("R")').sort_values('RF', ascending=False).head(20).style.hide(axis='index').format(precision=2)

Token,Tag,AF,RF,Range
also,RR,302,2246.86,98.0
more,RGR,273,2031.1,86.0
et al,RA,201,1495.42,12.0
however,RR,183,1361.51,80.0
only,RR,163,1212.71,84.0
then,RT,130,967.19,82.0
most,RGT,120,892.79,70.0
how,RRQ,110,818.39,70.0
out,RP,100,743.99,72.0
even,RR,83,617.51,64.0


Similarly, we can generate a frequency table of DocuScope tokens by setting `count_by='ds'`. Note that here we normalize by `corpus_total`, as DocuScope includes punctuation in its tagging system:

In [17]:
wc = frequency_table(tp, n_tokens=corpus_total, count_by='ds')

In [18]:
wc.sort_values('RF', ascending=False).head(10).style.hide(axis='index').format(precision=2)

Token,Tag,AF,RF,Range
the,Syntactic Complexity,3800,24799.16,100.0
and,Syntactic Complexity,3021,19715.33,100.0
of,Syntactic Complexity,1784,11642.55,100.0
in,Syntactic Complexity,1387,9051.69,100.0
to,Syntactic Complexity,1357,8855.91,100.0
a,Syntactic Complexity,906,5912.64,100.0
of the,Syntactic Complexity,714,4659.63,98.0
for,Syntactic Complexity,684,4463.85,98.0
that,Syntactic Complexity,657,4287.64,98.0
the,O,604,3941.76,98.0


As with part-of-speech tags, we can easily filter the data frame for the desired [DocuScope category](https://docuscospacy.readthedocs.io/en/latest/docuscope.html#Categories). Here, we filter for 'Character':

In [19]:
wc.query('Tag.str.startswith("Character")').sort_values('RF', ascending=False).head(10).style.hide(axis='index').format(precision=2)

Token,Tag,AF,RF,Range
their,Character,358,2336.34,90.0
his,Character,230,1501.0,52.0
he,Character,130,848.39,46.0
students,Character,120,783.13,20.0
participants,Character,103,672.19,14.0
children,Character,88,574.3,16.0
american,Character,78,509.04,22.0
workers,Character,78,509.04,14.0
figure,Character,72,469.88,26.0
people,Character,71,463.35,60.0


Or by 'Public Terms':

In [20]:
wc.query('Tag.str.startswith("Public")').sort_values('RF', ascending=False).head(20).style.hide(axis='index').format(precision=2)

Token,Tag,AF,RF,Range
national,Public Terms,98,639.56,32.0
political,Public Terms,68,443.77,28.0
society,Public Terms,55,358.94,28.0
citizenship,Public Terms,48,313.25,6.0
population,Public Terms,44,287.15,26.0
discussion,Public Terms,43,280.62,34.0
organizations,Public Terms,43,280.62,10.0
god,Public Terms,36,234.94,6.0
lesson,Public Terms,34,221.89,6.0
government,Public Terms,28,182.73,14.0


## Tags tables

Rather than counting tokens, we can generate counts of the tags **only** by using the `tags_table` function. It works just like the `frequency_table` function, taking a dictionary created by the `convert_corpus` function, an integer against which to normalize, and a `count_by` argument of either 'pos' or 'ds'.

In [21]:
tc = tags_table(tp, non_punct)

In [22]:
tc.sort_values('RF', ascending=False).head(20).style.hide(axis='index').format(precision=2)

Tag,AF,RF,Range
NN1,24532,18.25,100.0
JJ,11234,8.36,100.0
AT,9715,7.23,100.0
II,9449,7.03,100.0
NN2,8903,6.62,100.0
IO,5063,3.77,100.0
NP1,4562,3.39,100.0
CC,4183,3.11,100.0
RR,3870,2.88,100.0
VVN,3306,2.46,100.0
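
Note that the RF column here appears to be on a different scale from the token tables: the values match a percentage of the normalizing count rather than a per-million rate. This is an inference from the numbers shown, not something stated above, but it is easy to check against the first row:

```python
af_nn1 = 24532       # absolute frequency of NN1, from the table above
non_punct = 134410   # non-punctuation token count computed earlier

rf_percent = af_nn1 / non_punct * 100
print(round(rf_percent, 2))  # 18.25
```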


And by DocuScope category:

In [23]:
dc = tags_table(tp, corpus_total, count_by='ds')

In [24]:
dc.sort_values('RF', ascending=False).head(10).style.hide(axis='index').format(precision=2)

Tag,AF,RF,Range
Syntactic Complexity,20343,13.28,100.0
Academic Terms,9315,6.08,100.0
Character,7878,5.14,100.0
Description,6933,4.52,100.0
Narrative,6538,4.27,100.0
Information Exposition,5060,3.3,100.0
Information Topics,3547,2.31,98.0
Negative,3462,2.26,100.0
Metadiscourse Cohesive,2684,1.75,100.0
Positive,2667,1.74,100.0


## Ngram tables

Ngrams (between bigrams and 5-grams) can be calculated using the `ngrams_table` function. It works much like the `frequency_table` function but with the addition of a span argument `ng_span` consisting of an integer between 2 and 5.

This will return a table of 3-grams:

In [25]:
nc = ngrams_table(tp, 3, non_punct, count_by='pos')

The returned data frame includes both the sequence of tokens and the sequence of tags:

In [26]:
nc.sort_values('RF', ascending=False).head(10).style.hide(axis='index').format(precision=2)

Token1,Token2,Token3,Tag1,Tag2,Tag3,AF,RF,Range
part,time,faculty,NN1,NNT1,NN1,126,937.43,2.0
of,part,time,IO,NN1,NNT1,53,394.32,2.0
one,of,the,MC1,IO,AT,41,305.04,48.0
the,pardoner,'s,AT,NP1,GE,41,305.04,2.0
the,fact,that,AT,NN1,CST,34,252.96,36.0
the,number,of,AT,NN1,IO,32,238.08,18.0
more,likely,to,RGR,JJ,TO,31,230.64,16.0
the,effects,of,AT,NN2,IO,30,223.2,20.0
there,is,a,EX,VBZ,AT1,30,223.2,44.0
at,community,colleges,II,NN1,NN2,28,208.32,2.0


This allows for useful filtering. For example, looking at ngrams that start with a verb:

In [27]:
nc.query('Tag1.str.startswith("V")').sort_values('RF', ascending=False).head(20).style.hide(axis='index').format(precision=2)

Token1,Token2,Token3,Tag1,Tag2,Tag3,AF,RF,Range
based,on,the,VVN,II,AT,20,148.8,28.0
is,important,to,VBZ,JJ,TO,19,141.36,16.0
be,able,to,VBI,JK,TO,19,141.36,20.0
were,able,to,VBDR,JK,TO,19,141.36,10.0
can,be,seen,VM,VBI,VVN,17,126.48,16.0
would,like,to,VM,VVI,TO,17,126.48,12.0
used,in,the,VVN,II,AT,14,104.16,18.0
be,seen,in,VBI,VVN,II,13,96.72,8.0
can,not,be,VM,XX,VBI,13,96.72,20.0
will,not,be,VM,XX,VBI,12,89.28,8.0


Or sequences that end with a past participle ('VVN') preceded by a *to be* verb ('VB'), thus showing passive constructions:

In [28]:
nc.query('Tag3.str.startswith("VVN") and Tag2.str.startswith("VB")').sort_values('RF', ascending=False).head(20).style.hide(axis='index').format(precision=2)

Token1,Token2,Token3,Tag1,Tag2,Tag3,AF,RF,Range
can,be,seen,VM,VBI,VVN,17,126.48,16.0
to,be,used,TO,VBI,VVN,10,74.4,14.0
can,be,used,VM,VBI,VVN,10,74.4,14.0
should,be,noted,VM,VBI,VVN,7,52.08,8.0
could,be,used,VM,VBI,VVN,7,52.08,10.0
will,be,asked,VM,VBI,VVN,7,52.08,8.0
has,been,shown,VHZ,VBN,VVN,6,44.64,8.0
will,be,discussed,VM,VBI,VVN,5,37.2,8.0
have,been,used,VH0,VBN,VVN,5,37.2,10.0
can,be,observed,VM,VBI,VVN,5,37.2,4.0
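
Both steps, building the n-grams and filtering them by tag, can be sketched in plain Python. This is a hedged illustration with invented input: `toy_ngrams` is a hypothetical helper, and it makes no claim about how `ngrams_table` handles document boundaries or punctuation.

```python
from collections import Counter

def toy_ngrams(tagged, n):
    """Slide a window of size n over a list of (token, tag) pairs."""
    windows = zip(*(tagged[i:] for i in range(n)))
    # Each window becomes (token1, ..., tokenN, tag1, ..., tagN).
    return [tuple(tok for tok, _ in w) + tuple(tag for _, tag in w)
            for w in windows]

tagged = [('it', 'PPH1'), ('can', 'VM'), ('be', 'VBI'), ('seen', 'VVN'),
          ('that', 'CST'), ('it', 'PPH1'), ('can', 'VM'), ('be', 'VBI'),
          ('used', 'VVN')]

counts = Counter(toy_ngrams(tagged, 3))

# Keep trigrams whose third tag is a past participle (index 5)
# preceded by a form of "be" in second position (index 4).
passives = [g for g in counts
            if g[5].startswith('VVN') and g[4].startswith('VB')]
print(passives)
```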


Similar ngram tables can be created for DocuScope sequences. Here we generate bigrams:

In [29]:
nc = ngrams_table(tp, 2, corpus_total, count_by='ds')

In [30]:
nc.sort_values('RF', ascending=False).head(10).style.hide(axis='index').format(precision=2)

Token1,Token2,Tag1,Tag2,AF,RF,Range
in,the,SyntacticComplexity,SyntacticComplexity,366,2388.55,96.0
time,faculty,InformationTopics,AcademicTerms,130,848.39,2.0
to,the,SyntacticComplexity,SyntacticComplexity,120,783.13,84.0
for,the,SyntacticComplexity,SyntacticComplexity,111,724.4,76.0
to,be,O,O,110,717.87,76.0
part,time,O,InformationTopics,84,548.19,4.0
from,the,SyntacticComplexity,SyntacticComplexity,84,548.19,68.0
by,the,SyntacticComplexity,SyntacticComplexity,83,541.67,72.0
that,the,SyntacticComplexity,SyntacticComplexity,81,528.61,76.0
in,a,SyntacticComplexity,SyntacticComplexity,78,509.04,70.0


We can, for example, find sequences tagged as 'Positive' on the left, while filtering out untagged ('O') and 'Syntactic Complexity' tokens on the right:

In [31]:
nc.query('Tag1.str.startswith("Positive") and (~Tag2.str.startswith("Syntactic") and ~Tag2.str.startswith("O"))').sort_values('RF', ascending=False).head(20).style.hide(axis='index').format(precision=2)

Token1,Token2,Tag1,Tag2,AF,RF,Range
civil,society,Positive,PublicTerms,14,91.37,4.0
free,fall,Positive,Description,11,71.79,2.0
moral,dilemmas,Positive,Narrative,9,58.73,2.0
free,will,Positive,Future,7,45.68,2.0
consent,form,Positive,AcademicTerms,6,39.16,4.0
health,professionals,Positive,Positive,4,26.1,2.0
our,design,Positive,Strategic,4,26.1,2.0
civil rights,movement,Positive,AcademicTerms,3,19.58,2.0
health,departments,Positive,PublicTerms,3,19.58,2.0
efficient,layout,Positive,AcademicTerms,3,19.58,4.0


## Collocations

Collocations within a span (left and right) of a node word can be calculated according to several association measures.

The default span is 4 tokens to the left and 4 tokens to the right of the node word.

Like `frequency_table`, `coll_table` requires a dictionary of the type generated by the `convert_corpus` function. It also requires a node word, a node tag, and an association measure statistic. 
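
To illustrate what counting "within a span" means, here is a minimal sketch that tallies the tokens found up to 4 positions to the left and right of each occurrence of a node word. The sentence is invented, `span_counts` is a hypothetical helper, and details such as how the package treats overlapping windows or document boundaries are not covered here.

```python
from collections import Counter

def span_counts(tokens, node, left=4, right=4):
    """Count tokens occurring within the span around each hit of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - left)
        # Left context, then right context; the node itself is excluded.
        window = tokens[start:i] + tokens[i + 1:i + 1 + right]
        counts.update(window)
    return counts

tokens = 'the data can be seen in the table and the data can be used'.split()
counts = span_counts(tokens, 'can')
print(counts.most_common(3))
```

These span counts correspond to the idea behind the 'Freq Span' column, which an association measure then weighs against each token's overall frequency.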

In [32]:
ct = coll_table(tp, 'can', node_tag='V', statistic='pmi', count_by='pos')

In [33]:
ct.sort_values('MI', ascending=False).head(10).style.hide(axis='index').format(precision=2)

Token,Tag,Freq Span,Freq Total,MI
undone,VVN,2,1,6.43
deicing,VV0,2,1,6.43
deployed,VVN,2,2,5.73
gt,NP1,2,2,5.73
comprehend,VVI,2,2,5.73
schematic,NN1,2,2,5.73
hinder,VVI,2,2,5.73
dip,VVI,2,2,5.73
overnight,RT,1,1,5.73
chart,VVI,1,1,5.73


In [34]:
ct.query('`Freq Total` > 5 and MI > 3 and Tag.str.startswith("V")').sort_values('MI', ascending=False).head(10).style.hide(axis='index').format(precision=2)

Token,Tag,Freq Span,Freq Total,MI
assume,VVI,6,9,5.33
arise,VVI,3,6,5.04
occur,VVI,11,23,5.0
seen,VVN,18,39,4.96
achieved,VVN,3,7,4.89
doubt,VVI,3,7,4.89
expect,VVI,5,12,4.86
studied,VVN,3,8,4.75
happen,VVI,2,6,4.64
investigated,VVN,2,7,4.48
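
The tables above are ranked by pointwise mutual information ('pmi'); the cell that follows switches to its normalized variant ('npmi'). As a reminder of how the two relate, here are the textbook definitions computed from raw counts. The counts are invented, and the package's exact calculation (for instance, any adjustment for span size) may differ, so the sketch is not expected to reproduce the values in these tables.

```python
import math

def pmi(f_xy, f_x, f_y, n):
    """Pointwise mutual information (base 2) from raw counts."""
    return math.log2((f_xy / n) / ((f_x / n) * (f_y / n)))

def npmi(f_xy, f_x, f_y, n):
    """PMI divided by -log2 p(x, y), which bounds the score to [-1, 1]."""
    return pmi(f_xy, f_x, f_y, n) / -math.log2(f_xy / n)

# Hypothetical counts: the pair co-occurs 5 times in a 1000-token corpus.
score_pmi = pmi(f_xy=5, f_x=10, f_y=20, n=1000)
score_npmi = npmi(f_xy=5, f_x=10, f_y=20, n=1000)
print(round(score_pmi, 2), round(score_npmi, 2))
```

Because plain PMI tends to reward very rare tokens, the normalized score usually ranks frequent, strongly associated collocates (such as *be* with *can*) higher.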


In [35]:
ct = coll_table(tp, 'can', node_tag='V', statistic='npmi', count_by='pos')
ct.sort_values('MI', ascending=False).head(10).style.hide(axis='index').format(precision=2)

Token,Tag,Freq Span,Freq Total,MI
be,VBI,189,960,0.63
deicing,VV0,2,1,0.58
undone,VVN,2,1,0.58
seen,VVN,18,39,0.56
assume,VVI,6,9,0.53
occur,VVI,11,23,0.53
gt,NP1,2,2,0.52
schematic,NN1,2,2,0.52
comprehend,VVI,2,2,0.52
hinder,VVI,2,2,0.52


In [36]:
ct = coll_table(tp, 'people', node_tag='Character', statistic='npmi', count_by='ds')
ct.sort_values('MI', ascending=False).head(10).style.hide(axis='index').format(precision=2)

Token,Tag,Freq Span,Freq Total,MI
diagnosed,Inquiry,2,1,0.74
that year,Narrative,2,1,0.74
. young,Character,2,1,0.74
falsely,Negative,2,3,0.63
believing that,Character,2,3,0.63
came to a,Narrative,1,1,0.63
can have a,ConfidenceHedged,1,1,0.63
i will mount,FirstPerson,1,1,0.63
from the past,Narrative,1,1,0.63
freedom to,Reasoning,1,1,0.63


We can also calculate collocations while ignoring tags completely by setting `tag_ignore` to `True`:

In [64]:
ct = coll_table(tp, 'data', tag_ignore=True, statistic='npmi')
ct.sort_values('MI', ascending=False).head(10).style.hide(axis='index').format(precision=2)

Token,Freq Span,Freq Total,MI
collection,18,23,0.72
collected,13,15,0.71
conjunctions,2,1,0.66
weighting,2,1,0.66
achieves,3,3,0.62
gathered,3,3,0.62
qualitative,12,32,0.61
ample,2,2,0.6
split,2,2,0.6
recorded,9,24,0.59


## Document-term matrices for tags

Document-term matrices are basic data structures for text analysis. Each row is a document (observation) and each column is a token (variable). These [can be produced by **tmtoolkit**](https://tmtoolkit.readthedocs.io/en/latest/preprocessing.html#Generating-a-sparse-document-term-matrix-(DTM)) using the `dtm` function.

The **docuscospacy** package allows for the creation of dtms with tag counts (rather than token counts) as variables.

These are produced by the `tags_dtm` function, which takes a dictionary created by the `convert_corpus` function and a `count_by` argument of either 'pos' or 'ds'.
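
As a rough picture of the structure being built, the sketch below counts part-of-speech tags per document for a made-up converted corpus and lays them out as rows. It is an illustration only: `toy_tags_dtm` is a hypothetical helper, it returns plain lists rather than a data frame, and it does not filter out punctuation tags.

```python
from collections import Counter

def toy_tags_dtm(converted):
    """converted: dict of doc name -> list of (token, pos_tag, ds_tag)."""
    per_doc = {doc: Counter(pos for _tok, pos, _ds in toks)
               for doc, toks in converted.items()}
    # Columns are the sorted union of all tags seen in any document.
    columns = sorted(set().union(*per_doc.values()))
    rows = [[doc] + [per_doc[doc][tag] for tag in columns]
            for doc in converted]
    return ['doc_id'] + columns, rows

converted = {
    'doc_a': [('the', 'AT', 'O-'), ('circle', 'NN1', 'O-'),
              ('is', 'VBZ', 'O-'), ('the', 'AT', 'O-')],
    'doc_b': [('circles', 'NN2', 'O-'), ('are', 'VBR', 'O-')],
}
header, rows = toy_tags_dtm(converted)
print(header)
for row in rows:
    print(row)
```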

In [37]:
tm = tags_dtm(tp)

Note that the first column, 'doc_id', contains the names of the document files. The count that is returned is the raw count.

In [38]:
tm.head(10).style.hide(axis='index').format(precision=0)

doc_id,APPGE,AT,AT1,BCL,CC,CCB,CS,CSA,CSN,CST,CSW,DA,DA1,DA2,DAR,DAT,DB,DB2,DD,DD1,DD2,DDQ,DDQGE,DDQV,EX,FO,FW,GE,IF,II,IO,IW,JJ,JJR,JJT,JK,MC,MC1,MC2,MCMC,MD,ND1,NN,NN1,NN2,NNB,NNL1,NNO,NNO2,NNT1,NNT2,NNU,NNU1,NNU2,NP1,NPD1,NPM1,PN,PN1,PNQO,PNQS,PNQV,PNX1,PPGE,PPH1,PPHO1,PPHO2,PPHS1,PPHS2,PPIO1,PPIO2,PPIS1,PPIS2,PPX1,PPX2,PPY,RA,REX,RG,RGQ,RGQV,RGR,RGT,RL,RP,RPK,RR,RRQ,RRQV,RRR,RRT,RT,TO,UH,VBDR,VBDZ,VBG,VBI,VBM,VBN,VBR,VBZ,VD0,VDD,VDG,VDI,VDN,VDZ,VH0,VHD,VHG,VHI,VHN,VHZ,VM,VMK,VV0,VVD,VVG,VVGK,VVI,VVN,VVZ,XX,ZZ1
acad_23,7,105,36,1,55,1,15,3,0,12,0,5,0,2,1,0,8,4,1,28,8,2,0,1,4,2,0,6,23,73,46,13,119,0,1,1,16,6,0,0,6,0,14,253,115,0,0,0,0,21,7,0,0,0,10,4,2,0,4,0,0,0,0,0,7,1,1,0,3,0,1,0,20,0,0,6,3,0,3,0,0,2,2,0,2,0,26,2,0,0,0,1,32,0,2,3,0,19,0,1,17,18,0,0,0,0,0,0,4,0,0,0,0,3,47,0,24,5,13,0,62,42,16,6,13
acad_37,10,48,15,3,26,4,13,5,0,28,0,4,2,2,0,0,0,3,4,14,9,7,0,1,1,0,0,1,5,45,27,8,60,1,0,0,3,1,0,0,10,0,0,104,21,0,0,0,0,6,0,0,0,0,21,0,0,0,10,0,0,0,0,0,23,1,3,11,3,0,6,2,32,0,0,0,0,0,10,0,0,0,0,2,4,0,51,4,0,3,0,2,16,0,0,2,1,8,1,2,5,21,2,2,0,0,1,3,1,0,0,1,0,3,16,0,19,2,8,0,29,7,27,13,0
acad_36,13,204,56,3,52,4,23,8,2,37,2,5,1,4,1,1,3,4,6,7,4,30,1,0,4,0,0,27,8,103,84,8,159,0,0,2,13,3,0,0,7,0,4,391,65,0,0,0,0,1,0,0,0,0,58,0,0,0,6,0,1,0,0,1,15,2,2,16,6,0,0,2,0,0,0,1,5,5,12,0,0,7,0,3,3,0,89,4,0,0,0,3,42,0,1,1,2,34,0,0,15,57,1,0,0,0,2,7,2,1,0,4,0,8,46,0,25,7,21,0,59,52,50,23,19
acad_22,51,175,83,3,123,10,22,15,3,24,0,5,1,6,1,0,1,4,2,7,3,18,1,1,0,1,10,32,14,215,103,21,203,2,0,0,9,3,0,1,3,2,3,455,163,1,0,0,0,9,5,0,0,0,298,0,0,0,3,3,11,0,0,1,7,5,2,40,3,0,3,2,0,2,0,0,14,2,6,0,0,2,3,2,10,0,68,9,0,6,0,11,21,0,3,3,1,4,0,2,1,9,2,3,1,1,0,6,1,2,0,12,1,3,24,0,31,55,36,0,29,52,76,5,3
acad_34,3,68,38,0,55,10,12,6,1,16,1,0,0,1,1,0,2,0,0,22,3,6,0,0,1,0,0,2,9,74,51,11,125,4,4,2,2,1,0,0,2,0,2,210,89,0,0,0,0,2,0,0,0,0,29,0,0,0,1,0,0,0,0,0,9,0,0,0,0,0,0,0,0,0,0,0,0,1,4,0,0,2,0,2,6,0,45,0,0,2,0,1,25,0,0,2,0,10,0,0,2,23,1,0,0,0,1,3,1,0,1,2,0,3,16,0,10,6,15,0,33,16,26,8,0
acad_20,6,91,27,0,29,1,7,1,0,7,0,2,2,1,0,0,1,1,2,3,6,5,0,0,0,15,0,10,11,109,41,9,113,1,0,0,9,0,1,0,3,0,4,222,79,1,0,0,1,4,5,1,0,0,36,0,4,0,0,0,1,0,0,0,4,1,1,0,3,0,0,1,0,0,0,0,0,1,1,0,0,1,2,1,5,0,28,0,0,1,0,2,12,0,3,8,0,0,0,0,0,1,0,3,0,0,0,0,0,6,0,1,0,0,3,0,6,39,27,0,15,15,7,4,1
acad_08,20,51,25,0,44,1,8,4,0,18,0,5,1,0,0,1,12,1,0,10,2,6,0,1,1,0,0,20,12,101,26,8,56,1,2,0,2,2,0,0,4,0,2,171,54,0,0,0,0,11,2,0,0,0,75,0,0,0,16,0,1,0,0,0,13,0,1,6,3,0,0,0,1,3,1,0,0,0,1,2,0,0,0,2,7,0,33,6,0,2,0,4,41,0,1,7,2,4,0,0,7,39,1,0,0,0,0,1,3,1,0,3,0,6,20,0,5,9,28,0,52,22,19,4,0
acad_09,65,196,42,1,83,3,41,8,2,45,0,5,1,7,2,1,5,0,2,17,7,4,0,0,1,0,0,16,36,160,94,24,150,0,2,1,2,0,0,0,4,2,23,346,93,2,0,0,0,2,1,0,0,0,262,0,0,0,6,3,13,1,0,0,17,12,2,33,7,0,1,0,0,4,1,0,0,0,5,1,0,2,1,2,8,0,70,4,0,3,0,5,50,0,10,53,1,13,0,9,1,10,0,3,1,2,0,0,0,17,3,16,0,0,27,0,10,99,36,0,45,74,24,15,0
acad_21,4,97,24,1,25,5,7,4,0,12,0,1,0,1,0,0,2,1,0,5,4,6,0,0,0,7,0,10,3,56,77,8,140,2,0,0,6,1,6,0,5,0,3,193,67,0,0,0,0,3,0,0,0,0,60,0,0,0,0,0,3,0,0,0,1,0,1,4,1,0,0,0,3,1,0,0,1,3,4,0,0,2,3,3,6,0,31,3,0,0,0,1,18,0,6,3,0,2,0,1,3,8,0,0,1,1,0,0,3,1,1,1,0,3,3,0,11,20,20,0,18,19,19,3,0
acad_35,46,323,84,1,151,2,29,8,2,25,3,7,1,1,4,0,3,8,23,33,12,18,0,0,5,1,1,9,39,255,159,24,290,1,0,2,21,6,0,0,18,0,31,892,296,0,0,0,0,16,7,0,0,0,69,0,5,0,1,1,13,0,0,0,19,1,3,4,13,0,1,0,0,1,0,4,0,2,3,2,0,1,0,1,9,0,65,5,0,3,0,4,80,0,0,3,1,85,0,1,17,55,7,0,0,1,1,2,8,1,1,9,0,9,136,1,47,15,69,0,130,109,34,23,57


The resulting data frame can be passed to **tmtoolkit** functions to create normalized counts (using the `tf_proportions` function), [tf-idf values](https://tmtoolkit.readthedocs.io/en/latest/bow.html#Term-frequency%E2%80%93inverse-document-frequency-transformation-(tf-idf)) (using the `tfidf` function), or other kinds of data structures.

In [39]:
from tmtoolkit.bow.bow_stats import tf_proportions, tfidf

In order to generate these, note that the 'doc_id' column must either be dropped or set as the row names. The `tags_dtm` function does not do this initially as a safety feature: row names **must** be unique, and keeping the doc ids as a column allows users to account for any duplicates before proceeding.

So here we move the 'doc_id' to row names:

In [40]:
tm.set_index('doc_id', inplace=True)
tm.head(10).style.format(precision=0)

doc_id,APPGE,AT,AT1,BCL,CC,CCB,CS,CSA,CSN,CST,CSW,DA,DA1,DA2,DAR,DAT,DB,DB2,DD,DD1,DD2,DDQ,DDQGE,DDQV,EX,FO,FW,GE,IF,II,IO,IW,JJ,JJR,JJT,JK,MC,MC1,MC2,MCMC,MD,ND1,NN,NN1,NN2,NNB,NNL1,NNO,NNO2,NNT1,NNT2,NNU,NNU1,NNU2,NP1,NPD1,NPM1,PN,PN1,PNQO,PNQS,PNQV,PNX1,PPGE,PPH1,PPHO1,PPHO2,PPHS1,PPHS2,PPIO1,PPIO2,PPIS1,PPIS2,PPX1,PPX2,PPY,RA,REX,RG,RGQ,RGQV,RGR,RGT,RL,RP,RPK,RR,RRQ,RRQV,RRR,RRT,RT,TO,UH,VBDR,VBDZ,VBG,VBI,VBM,VBN,VBR,VBZ,VD0,VDD,VDG,VDI,VDN,VDZ,VH0,VHD,VHG,VHI,VHN,VHZ,VM,VMK,VV0,VVD,VVG,VVGK,VVI,VVN,VVZ,XX,ZZ1
acad_23,7,105,36,1,55,1,15,3,0,12,0,5,0,2,1,0,8,4,1,28,8,2,0,1,4,2,0,6,23,73,46,13,119,0,1,1,16,6,0,0,6,0,14,253,115,0,0,0,0,21,7,0,0,0,10,4,2,0,4,0,0,0,0,0,7,1,1,0,3,0,1,0,20,0,0,6,3,0,3,0,0,2,2,0,2,0,26,2,0,0,0,1,32,0,2,3,0,19,0,1,17,18,0,0,0,0,0,0,4,0,0,0,0,3,47,0,24,5,13,0,62,42,16,6,13
acad_37,10,48,15,3,26,4,13,5,0,28,0,4,2,2,0,0,0,3,4,14,9,7,0,1,1,0,0,1,5,45,27,8,60,1,0,0,3,1,0,0,10,0,0,104,21,0,0,0,0,6,0,0,0,0,21,0,0,0,10,0,0,0,0,0,23,1,3,11,3,0,6,2,32,0,0,0,0,0,10,0,0,0,0,2,4,0,51,4,0,3,0,2,16,0,0,2,1,8,1,2,5,21,2,2,0,0,1,3,1,0,0,1,0,3,16,0,19,2,8,0,29,7,27,13,0
acad_36,13,204,56,3,52,4,23,8,2,37,2,5,1,4,1,1,3,4,6,7,4,30,1,0,4,0,0,27,8,103,84,8,159,0,0,2,13,3,0,0,7,0,4,391,65,0,0,0,0,1,0,0,0,0,58,0,0,0,6,0,1,0,0,1,15,2,2,16,6,0,0,2,0,0,0,1,5,5,12,0,0,7,0,3,3,0,89,4,0,0,0,3,42,0,1,1,2,34,0,0,15,57,1,0,0,0,2,7,2,1,0,4,0,8,46,0,25,7,21,0,59,52,50,23,19
acad_22,51,175,83,3,123,10,22,15,3,24,0,5,1,6,1,0,1,4,2,7,3,18,1,1,0,1,10,32,14,215,103,21,203,2,0,0,9,3,0,1,3,2,3,455,163,1,0,0,0,9,5,0,0,0,298,0,0,0,3,3,11,0,0,1,7,5,2,40,3,0,3,2,0,2,0,0,14,2,6,0,0,2,3,2,10,0,68,9,0,6,0,11,21,0,3,3,1,4,0,2,1,9,2,3,1,1,0,6,1,2,0,12,1,3,24,0,31,55,36,0,29,52,76,5,3
acad_34,3,68,38,0,55,10,12,6,1,16,1,0,0,1,1,0,2,0,0,22,3,6,0,0,1,0,0,2,9,74,51,11,125,4,4,2,2,1,0,0,2,0,2,210,89,0,0,0,0,2,0,0,0,0,29,0,0,0,1,0,0,0,0,0,9,0,0,0,0,0,0,0,0,0,0,0,0,1,4,0,0,2,0,2,6,0,45,0,0,2,0,1,25,0,0,2,0,10,0,0,2,23,1,0,0,0,1,3,1,0,1,2,0,3,16,0,10,6,15,0,33,16,26,8,0
acad_20,6,91,27,0,29,1,7,1,0,7,0,2,2,1,0,0,1,1,2,3,6,5,0,0,0,15,0,10,11,109,41,9,113,1,0,0,9,0,1,0,3,0,4,222,79,1,0,0,1,4,5,1,0,0,36,0,4,0,0,0,1,0,0,0,4,1,1,0,3,0,0,1,0,0,0,0,0,1,1,0,0,1,2,1,5,0,28,0,0,1,0,2,12,0,3,8,0,0,0,0,0,1,0,3,0,0,0,0,0,6,0,1,0,0,3,0,6,39,27,0,15,15,7,4,1
acad_08,20,51,25,0,44,1,8,4,0,18,0,5,1,0,0,1,12,1,0,10,2,6,0,1,1,0,0,20,12,101,26,8,56,1,2,0,2,2,0,0,4,0,2,171,54,0,0,0,0,11,2,0,0,0,75,0,0,0,16,0,1,0,0,0,13,0,1,6,3,0,0,0,1,3,1,0,0,0,1,2,0,0,0,2,7,0,33,6,0,2,0,4,41,0,1,7,2,4,0,0,7,39,1,0,0,0,0,1,3,1,0,3,0,6,20,0,5,9,28,0,52,22,19,4,0
acad_09,65,196,42,1,83,3,41,8,2,45,0,5,1,7,2,1,5,0,2,17,7,4,0,0,1,0,0,16,36,160,94,24,150,0,2,1,2,0,0,0,4,2,23,346,93,2,0,0,0,2,1,0,0,0,262,0,0,0,6,3,13,1,0,0,17,12,2,33,7,0,1,0,0,4,1,0,0,0,5,1,0,2,1,2,8,0,70,4,0,3,0,5,50,0,10,53,1,13,0,9,1,10,0,3,1,2,0,0,0,17,3,16,0,0,27,0,10,99,36,0,45,74,24,15,0
acad_21,4,97,24,1,25,5,7,4,0,12,0,1,0,1,0,0,2,1,0,5,4,6,0,0,0,7,0,10,3,56,77,8,140,2,0,0,6,1,6,0,5,0,3,193,67,0,0,0,0,3,0,0,0,0,60,0,0,0,0,0,3,0,0,0,1,0,1,4,1,0,0,0,3,1,0,0,1,3,4,0,0,2,3,3,6,0,31,3,0,0,0,1,18,0,6,3,0,2,0,1,3,8,0,0,1,1,0,0,3,1,1,1,0,3,3,0,11,20,20,0,18,19,19,3,0
acad_35,46,323,84,1,151,2,29,8,2,25,3,7,1,1,4,0,3,8,23,33,12,18,0,0,5,1,1,9,39,255,159,24,290,1,0,2,21,6,0,0,18,0,31,892,296,0,0,0,0,16,7,0,0,0,69,0,5,0,1,1,13,0,0,0,19,1,3,4,13,0,1,0,0,1,0,4,0,2,3,2,0,1,0,1,9,0,65,5,0,3,0,4,80,0,0,3,1,85,0,1,17,55,7,0,0,1,1,2,8,1,1,9,0,9,136,1,47,15,69,0,130,109,34,23,57


And convert to normalized counts:

In [41]:
tf_proportions(tm).head(10)

doc_id,APPGE,AT,AT1,BCL,CC,CCB,CS,CSA,CSN,CST,...,VMK,VV0,VVD,VVG,VVGK,VVI,VVN,VVZ,XX,ZZ1
acad_23,0.004834,0.072514,0.024862,0.000691,0.037983,0.000691,0.010359,0.002072,0.0,0.008287,...,0.0,0.016575,0.003453,0.008978,0.0,0.042818,0.029006,0.01105,0.004144,0.008978
acad_37,0.011507,0.055236,0.017261,0.003452,0.029919,0.004603,0.01496,0.005754,0.0,0.032221,...,0.0,0.021864,0.002301,0.009206,0.0,0.033372,0.008055,0.03107,0.01496,0.0
acad_36,0.00628,0.098551,0.027053,0.001449,0.025121,0.001932,0.011111,0.003865,0.000966,0.017874,...,0.0,0.012077,0.003382,0.010145,0.0,0.028502,0.025121,0.024155,0.011111,0.009179
acad_22,0.018791,0.06448,0.030582,0.001105,0.045321,0.003685,0.008106,0.005527,0.001105,0.008843,...,0.0,0.011422,0.020265,0.013265,0.0,0.010685,0.01916,0.028003,0.001842,0.001105
acad_34,0.002627,0.059545,0.033275,0.0,0.048161,0.008757,0.010508,0.005254,0.000876,0.014011,...,0.0,0.008757,0.005254,0.013135,0.0,0.028897,0.014011,0.022767,0.007005,0.0
acad_20,0.005581,0.084651,0.025116,0.0,0.026977,0.00093,0.006512,0.00093,0.0,0.006512,...,0.0,0.005581,0.036279,0.025116,0.0,0.013953,0.013953,0.006512,0.003721,0.00093
acad_08,0.01759,0.044855,0.021988,0.0,0.038698,0.00088,0.007036,0.003518,0.0,0.015831,...,0.0,0.004398,0.007916,0.024626,0.0,0.045734,0.019349,0.016711,0.003518,0.0
acad_09,0.026231,0.079096,0.016949,0.000404,0.033495,0.001211,0.016546,0.003228,0.000807,0.01816,...,0.0,0.004036,0.039952,0.014528,0.0,0.01816,0.029863,0.009685,0.006053,0.0
acad_21,0.003697,0.089649,0.022181,0.000924,0.023105,0.004621,0.00647,0.003697,0.0,0.011091,...,0.0,0.010166,0.018484,0.018484,0.0,0.016636,0.01756,0.01756,0.002773,0.0
acad_35,0.011532,0.080973,0.021058,0.000251,0.037854,0.000501,0.00727,0.002006,0.000501,0.006267,...,0.000251,0.011782,0.00376,0.017298,0.0,0.03259,0.027325,0.008523,0.005766,0.014289
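
Conceptually, the normalized counts above are each tag's share of its document's total tag count, so every row sums to 1. A minimal plain-Python sketch with made-up counts (an illustration of the idea, not **tmtoolkit**'s implementation; `row_proportions` is a hypothetical helper):

```python
def row_proportions(row):
    """Divide each count by the row total so that the row sums to 1."""
    total = sum(row)
    # Guard against empty documents to avoid dividing by zero.
    return [count / total for count in row] if total else [0.0] * len(row)

counts = {'doc_a': [2, 1, 0, 1], 'doc_b': [0, 3, 1, 0]}
props = {doc: row_proportions(row) for doc, row in counts.items()}
print(props['doc_a'])  # [0.5, 0.25, 0.0, 0.25]
```

Normalizing in this way makes documents of different lengths comparable.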


Or tf-idf:

In [42]:
tfidf(tm).head(10)

doc_id,APPGE,AT,AT1,BCL,CC,CCB,CS,CSA,CSN,CST,...,VMK,VV0,VVD,VVG,VVGK,VVI,VVN,VVZ,XX,ZZ1
acad_23,0.003351,0.049548,0.016988,0.000663,0.025954,0.0005,0.007078,0.001479,0.0,0.005663,...,0.0,0.011325,0.002359,0.006135,0.0,0.029257,0.019819,0.007659,0.002914,0.009409
acad_37,0.007976,0.037742,0.011795,0.003316,0.020444,0.003335,0.010222,0.004107,0.0,0.022016,...,0.0,0.01494,0.001573,0.00629,0.0,0.022803,0.005504,0.021536,0.010521,0.0
acad_36,0.004353,0.067339,0.018485,0.001392,0.017165,0.0014,0.007592,0.002759,0.000758,0.012213,...,0.0,0.008252,0.002311,0.006932,0.0,0.019476,0.017165,0.016743,0.007814,0.009619
acad_22,0.013025,0.044059,0.020897,0.001062,0.030967,0.00267,0.005539,0.003945,0.000867,0.006042,...,0.0,0.007805,0.013847,0.009064,0.0,0.007301,0.013092,0.01941,0.001296,0.001158
acad_34,0.001821,0.040687,0.022737,0.0,0.032908,0.006345,0.00718,0.00375,0.000687,0.009573,...,0.0,0.005983,0.00359,0.008975,0.0,0.019745,0.009573,0.015781,0.004927,0.0
acad_20,0.003869,0.057842,0.017162,0.0,0.018433,0.000674,0.004449,0.000664,0.0,0.004449,...,0.0,0.003814,0.024789,0.017162,0.0,0.009534,0.009534,0.004514,0.002617,0.000975
acad_08,0.012193,0.030649,0.015024,0.0,0.026442,0.000637,0.004808,0.002511,0.0,0.010817,...,0.0,0.003005,0.005409,0.016827,0.0,0.03125,0.013221,0.011583,0.002474,0.0
acad_09,0.018182,0.054046,0.011581,0.000388,0.022887,0.000877,0.011306,0.002304,0.000633,0.012409,...,0.0,0.002757,0.027299,0.009927,0.0,0.012409,0.020405,0.006713,0.004257,0.0
acad_21,0.002562,0.061257,0.015156,0.000888,0.015788,0.003348,0.004421,0.002639,0.0,0.007578,...,0.0,0.006947,0.01263,0.01263,0.0,0.011367,0.011999,0.012172,0.00195,0.0
acad_35,0.007993,0.055328,0.014389,0.000241,0.025866,0.000363,0.004968,0.001431,0.000393,0.004282,...,0.000526,0.008051,0.002569,0.011819,0.0,0.022268,0.018671,0.005908,0.004055,0.014975
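
As a rough illustration of what a tf-idf weighting does to a table of tag counts, here is a minimal sketch in plain Python. It assumes proportional term frequencies and a smoothed inverse document frequency, `log(1 + N / (1 + df))`; the `tfidf_rows` helper and the toy counts are hypothetical, and the weighting applied by the `tfidf` call above may differ.

```python
import math

def tfidf_rows(counts):
    """counts maps doc_id -> {tag: raw count}; returns doc_id -> {tag: weight}."""
    n_docs = len(counts)
    tags = sorted({tag for row in counts.values() for tag in row})
    # Document frequency: number of documents in which each tag occurs.
    df = {t: sum(1 for row in counts.values() if row.get(t, 0) > 0) for t in tags}
    # Smoothed idf (an assumption, not necessarily the library's formula).
    idf = {t: math.log(1 + n_docs / (1 + df[t])) for t in tags}
    weighted = {}
    for doc_id, row in counts.items():
        total = sum(row.values())
        weighted[doc_id] = {t: (row.get(t, 0) / total) * idf[t] for t in tags}
    return weighted

# Toy counts, invented for illustration only.
toy = {'doc_a': {'AT': 6, 'NN1': 3, 'VM': 1},
       'doc_b': {'AT': 4, 'NN1': 6}}
weights = tfidf_rows(toy)
```

Tags that occur in fewer documents receive a larger idf, so a rare tag such as `VM` in the toy data is weighted up relative to its raw proportion.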


## KWIC tables

There is also a function for generating Key Word in Context (KWIC) tables. The **tmtoolkit** package has [its own KWIC functions](https://tmtoolkit.readthedocs.io/en/latest/preprocessing.html#Keywords-in-context-(KWIC)-and-general-filtering-methods). The difference is that `kwic_center_node` produces a table with the node word in a center column and context columns to its left and right, whereas the **tmtoolkit** functions produce tables with a single column that includes the node word.

For display purposes the `kwic_center_node` function trims the context columns to 75 characters maximum.

The function requires a **corpus** of the type generated by the `Corpus.from_dictionary` function. A node word needs to be set and there is the option to ignore the case of the node word.

In [43]:
kcn = kwic_center_node(corp, 'data', ignore_case=True)

In [44]:
kcn.head(10).style.set_properties(subset=["Post-Node"], **{'text-align': 'left'}).set_properties(subset=["Node"], **{'text-align': 'center'})

Unnamed: 0,Doc,Pre-Node,Node,Post-Node
0,acad_23,the current and proposed systems . This analysis will include,data,"collection , current system modeling and simulation , facility layout"
1,acad_23,demand increases . To achieve this goal we will collect,data,on current procedures and use that data to standardize the
2,acad_23,we will collect data on current procedures and use that,data,to standardize the assembly and packaging processes . We will
3,acad_23,and processes we will create simulations based on real world,data,". This data will include employee recommendations , standardized work"
4,acad_23,will create simulations based on real world data . This,data,"will include employee recommendations , standardized work element times ,"
5,acad_23,our own time studies and compare the results to the,data,gathered by the workers . We will statistically analyze the
6,acad_23,by the workers . We will statistically analyze the two,data,sets to identify significant . If such differences exist we
7,acad_23,on all existing processes . For this portion of the,data,collection we will need access to the E - Dining
8,acad_23,"facility , though production need not be stopped for the",data,"gathering . To create an accurate simulation , the dimensions"
9,acad_23,- business day . With the gathered process and layout,data,we will construct a computer simulation of the current state
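
The centered layout above can be approximated in a few lines of plain Python. This is only a sketch of the idea, not the package's implementation: the `kwic_center` helper is hypothetical, it works on plain token lists rather than a **tmtoolkit** corpus, and it limits context by a token window (`width`) rather than by a character maximum.

```python
def kwic_center(docs, node, width=5, ignore_case=True):
    """docs maps doc_id -> list of tokens; returns (doc, pre, node, post) rows."""
    target = node.lower() if ignore_case else node
    rows = []
    for doc_id, tokens in docs.items():
        for i, tok in enumerate(tokens):
            candidate = tok.lower() if ignore_case else tok
            if candidate == target:
                # Context windows on either side of the node word.
                pre = ' '.join(tokens[max(0, i - width):i])
                post = ' '.join(tokens[i + 1:i + 1 + width])
                rows.append((doc_id, pre, tok, post))
    return rows

# Toy document, invented for illustration only.
docs = {'toy_01': 'We will collect data on current procedures and use that Data to standardize'.split()}
rows = kwic_center(docs, 'data', width=3)
for row in rows:
    print(row)
```

Keeping the node in its own field is what allows the left context to be right-aligned and the right context left-aligned when the table is displayed.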


There is also an option for glob-style searching, enabled by setting the `glob` argument to `True`:

In [45]:
kwc = kwic_center_node(corp, 'house*', ignore_case=True, glob=True)

In [46]:
kwc.head(10).style.set_properties(subset=["Post-Node"], **{'text-align': 'left'}).set_properties(subset=["Node"], **{'text-align': 'center'})

Unnamed: 0,Doc,Pre-Node,Node,Post-Node
0,acad_22,his baptismal name -- then al - Asad maintained a,household,"of three , a triad that Davis suggests included a"
1,acad_22,"the Joannes Leo on the census , and whom the",household,"listed may have included . Here , as she consistently"
2,acad_20,support also enabled the Liberals to strip power from the,House,"of Lords in the 1909 "" People 's Budget """
3,acad_20,( which ) found it s living symbol in the,House,"of Lords . "" Britain 's social and political establishment"
4,acad_35,of and type of relation with people living in the,household,", ( j ) number of children , ( k"
5,acad_35,"( m ) employment status , ( n ) monthly",household,"income , ( o ) age of onset of initial"
6,acad_19,an unfavorable review of a production of Shaw 's Heartbreak,House,", sharpening his critique from a linguistic to a nationalistic"
7,acad_50,", which also might include redistribution , reciprocity , and",householding,( pp . 51 ) . Unlike the liberal creed
8,acad_15,"age , gender , and presence of other breadwinners in",household,) as well as faculty divisions ( such as business
9,acad_10,learned of the events of the symposium at Agathon 's,house,"from Aristodemus , who was actually present . These degrees"
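
To see why a pattern like `house*` matches *household*, *House* and *householding*, the matching can be sketched with Python's standard `fnmatch` module. Whether `kwic_center_node` uses `fnmatch` internally is an assumption; the `glob_match` helper is hypothetical.

```python
from fnmatch import fnmatchcase

def glob_match(token, pattern, ignore_case=True):
    """Return True if token matches the glob pattern."""
    if ignore_case:
        token, pattern = token.lower(), pattern.lower()
    # fnmatchcase compares without any platform-specific case normalization.
    return fnmatchcase(token, pattern)

tokens = ['household', 'House', 'householding', 'house', 'warehouse', 'home']
hits = [t for t in tokens if glob_match(t, 'house*')]
print(hits)
```

Because `*` can match an empty string, the bare form *house* is matched too, while *warehouse* is not, since the pattern is anchored at the start of the token.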


## Keyword tables

[Keywords](https://eprints.lancs.ac.uk/id/eprint/140803/1/Rayson_2019_CorpusAnalysisofKeyWords_Submitted.pdf) are a common method for profiling corpora by statistically comparing token frequencies in one corpus (a target corpus) to those in another (a reference corpus).

To generate a keyword list, we first need to process our reference corpus, in this case a small corpus of news articles.

In [47]:
%%time
corp_ref = Corpus.from_folder('data/ref_corpus', spacy_instance=nlp, raw_preproc=pre_process, spacy_token_attrs=['tag', 'ent_iob', 'ent_type', 'is_punct'])

CPU times: user 2.2 s, sys: 238 ms, total: 2.44 s
Wall time: 2.48 s


We will also store various counts of tokens:

In [48]:
ref_total = corpus_num_tokens(corp_ref)
ref_types = vocabulary_size(corp_ref)
ref_punct = []
for i in range(0,len(corp_ref)):
    ref_punct.append(sum(corp_ref[i]['is_punct']))
ref_punct = sum(ref_punct)
ref_nonpunct = ref_total - ref_punct

In [49]:
print('Alphanumeric tokens (Reference corpus):', ref_nonpunct, '\nPunctuation tokens (Reference corpus):', ref_punct, '\nTotal tokens (Reference corpus):', ref_total, '\nTypes (Reference corpus):', ref_types)

Alphanumeric tokens (Reference corpus): 31950 
Punctuation tokens (Reference corpus): 4742 
Total tokens (Reference corpus): 36692 
Types (Reference corpus): 6364


As before, we will use the `convert_corpus` function to prepare our data for further analysis:

In [50]:
tp_ref = convert_corpus(corp_ref)

Finally, we will use `frequency_table` to generate two tables, one for the target corpus and one for the reference corpus, both normalized by total counts of non-punctuation tokens:

In [51]:
wc_target = frequency_table(tp, non_punct)
wc_ref = frequency_table(tp_ref, ref_nonpunct)
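
The effect of that normalization can be checked by hand. Judging from the values printed in the tables in this section, relative frequencies (RF) appear to be expressed per million tokens in token tables and as percentages in tag tables; this is inferred from the output rather than from the package documentation, and the two helper functions below are hypothetical.

```python
def per_million(count, total):
    """Relative frequency per million tokens."""
    return count / total * 1_000_000

def percent(count, total):
    """Relative frequency as a percentage of tokens."""
    return count / total * 100

ref_nonpunct = 31950  # non-punctuation tokens in the reference corpus, as printed above

# One occurrence in the reference corpus, as in a token table.
rf_token = per_million(1, ref_nonpunct)
# Two occurrences of a tag in the reference corpus, as in a tag table.
rf_tag = percent(2, ref_nonpunct)
print(round(rf_token, 2), round(rf_tag, 2))
```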

To generate a table of keywords, we will use `keyness_table`, which takes both our target and reference frequency tables. The Yates correction can be applied by setting the `correct` argument to `True`. Here we will leave the default, which applies no correction.

In [52]:
kw = keyness_table(wc_target, wc_ref)

The table reports the frequency data for both corpora, with a column for [log-likelihood](https://ucrel.lancs.ac.uk/llwizard.html) (the test of significance), as well as [Log Ratio](http://cass.lancs.ac.uk/log-ratio-an-informal-introduction/) (an effect size measure), and the *p*-value.

In [54]:
kw.sort_values('LR', ascending=False).head(10).style.hide(axis='index').format(precision=2)

Token,Tag,AF,RF,Range,AF Ref,RF Ref,Range Ref,LL,LR,PV
et al,RA,201.0,1495.42,12.0,0.0,0.0,0.0,85.79,6.58,0.0
faculty,NN1,186.0,1383.83,4.0,1.0,31.3,2.0,70.23,5.47,0.0
colleges,NN2,88.0,654.71,2.0,0.0,0.0,0.0,37.56,5.39,0.0
germanicus,NP1,86.0,639.83,2.0,0.0,0.0,0.0,36.71,5.35,0.0
current,JJ,80.0,595.19,38.0,0.0,0.0,0.0,34.15,5.25,0.0
hiv,NP1,75.0,557.99,6.0,0.0,0.0,0.0,32.01,5.16,0.0
pardoner,NP1,74.0,550.55,2.0,0.0,0.0,0.0,31.58,5.14,0.0
strain,NN1,74.0,550.55,14.0,0.0,0.0,0.0,31.58,5.14,0.0
thus,RR,73.0,543.11,36.0,0.0,0.0,0.0,31.16,5.12,0.0
racial,JJ,68.0,505.91,10.0,0.0,0.0,0.0,29.02,5.02,0.0
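
For readers who want to see how these statistics relate to the raw counts, here is a minimal sketch of the standard two-corpus log-likelihood and Log Ratio calculations. It is not the package's code: the function names are hypothetical, the 0.5 substituted for a zero frequency in Log Ratio is an assumption, and the target corpus total is back-calculated from the relative frequencies in the table above, so it is approximate.

```python
import math

def log_likelihood(a, b, n1, n2):
    """Log-likelihood for counts a (target) and b (reference), corpus sizes n1 and n2."""
    e1 = n1 * (a + b) / (n1 + n2)  # expected count in the target corpus
    e2 = n2 * (a + b) / (n1 + n2)  # expected count in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def log_ratio(a, b, n1, n2, zero_adjust=0.5):
    """Binary log of the ratio of relative frequencies (zero counts adjusted)."""
    a = a if a > 0 else zero_adjust
    b = b if b > 0 else zero_adjust
    return math.log2((a / n1) / (b / n2))

def p_value(ll):
    """Upper-tail probability of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(ll, 0.0) / 2))

target_total = 134_410  # assumption: approximated from AF and RF in the table
ref_total = 31_950      # non-punctuation tokens in the reference corpus

ll = log_likelihood(186, 1, target_total, ref_total)  # counts for "faculty"
lr = log_ratio(186, 1, target_total, ref_total)
print(round(ll, 2), round(lr, 2), p_value(ll))
```

The results should land close to the LL and LR values reported for *faculty*; small differences are expected because the target total used here is only an approximation.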


The table can be filtered according to various criteria, such as absolute frequency and *p*-value thresholds, before sorting:

In [55]:
kw.query('AF > 5 and `AF Ref` > 5 and PV < 0.01').sort_values('LR', ascending=False).head(10).style.hide(axis='index').format(precision=2)

Token,Tag,AF,RF,Range,AF Ref,RF Ref,Range Ref,LL,LR,PV
within,II,119.0,885.35,48.0,6.0,187.79,12.0,22.44,2.24,0.0
these,DD2,356.0,2648.61,96.0,18.0,563.38,32.0,66.98,2.23,0.0
different,JJ,116.0,863.03,66.0,6.0,187.79,10.0,21.46,2.2,0.0
such as,II,112.0,833.27,68.0,6.0,187.79,10.0,20.16,2.15,0.0
more,RGR,273.0,2031.1,86.0,15.0,469.48,24.0,48.15,2.11,0.0
between,II,236.0,1755.82,84.0,13.0,406.89,18.0,41.54,2.11,0.0
part,NN1,228.0,1696.3,52.0,14.0,438.18,20.0,36.53,1.95,0.0
system,NN1,113.0,840.71,46.0,7.0,219.09,8.0,17.96,1.94,0.0
social,JJ,112.0,833.27,44.0,7.0,219.09,6.0,17.65,1.93,0.0
process,NN1,114.0,848.15,54.0,8.0,250.39,12.0,15.99,1.76,0.0


Tables can similarly be filtered by part-of-speech tag, here keeping only tags that begin with "V" (verbs):

In [56]:
kw.query('AF > 5 and `AF Ref` > 5 and PV < 0.01 and Tag.str.startswith("V")').sort_values('LR', ascending=False).head(10).style.hide(axis='index').format(precision=2)

Token,Tag,AF,RF,Range,AF Ref,RF Ref,Range Ref,LL,LR,PV
used,VVN,143.0,1063.91,56.0,11.0,344.29,14.0,18.07,1.63,0.0
based,VVN,88.0,654.71,60.0,7.0,219.09,12.0,10.67,1.58,0.0
using,VVG,109.0,810.95,64.0,9.0,281.69,18.0,12.59,1.53,0.0
does,VDZ,166.0,1235.03,78.0,17.0,532.08,26.0,13.77,1.22,0.0
must,VM,101.0,751.43,54.0,11.0,344.29,20.0,7.46,1.13,0.01
may,VM,211.0,1569.82,72.0,24.0,751.17,20.0,14.25,1.06,0.0
is,VBZ,1783.0,13265.38,98.0,236.0,7386.54,98.0,83.08,0.85,0.0
can,VM,423.0,3147.09,94.0,59.0,1846.64,54.0,16.86,0.77,0.0
are,VBR,763.0,5676.66,96.0,119.0,3724.57,68.0,20.31,0.61,0.0
will,VM,513.0,3816.68,82.0,89.0,2785.6,62.0,8.13,0.46,0.0


Keyness tables can also be generated for counts of either part-of-speech or DocuScope tags. First, we prepare the frequency tables.

In [58]:
tag_ref = tags_table(tp_ref, ref_nonpunct, count_by='pos')
tag_tar = tags_table(tp, non_punct, count_by='pos')
ds_ref = tags_table(tp_ref, ref_total, count_by='ds')
ds_tar = tags_table(tp, corpus_total, count_by='ds')

We will set the `tags_only` argument to `True`, and we will also apply the Yates correction by setting `correct` to `True`:

In [60]:
kt = keyness_table(tag_tar, tag_ref, tags_only=True, correct=True)

In [61]:
kt.sort_values('LR', ascending=False).head(10).style.hide(axis='index').format(precision=2)

Tag,AF,RF,Range,AF Ref,RF Ref,Range Ref,LL,LR,PV
BCL,71.0,0.05,60.0,2.0,0.01,4.0,16.73,3.09,0.0
RA,231.0,0.17,38.0,10.0,0.03,16.0,47.24,2.47,0.0
REX,82.0,0.06,68.0,4.0,0.01,6.0,14.5,2.3,0.0
RGR,324.0,0.24,92.0,18.0,0.06,30.0,55.93,2.11,0.0
DDQV,9.0,0.01,18.0,0.0,0.0,0.0,1.43,2.11,0.23
DDQGE,18.0,0.01,16.0,1.0,0.0,2.0,1.96,2.11,0.16
PNQO,18.0,0.01,16.0,1.0,0.0,2.0,1.96,2.11,0.16
ZZ1,413.0,0.31,54.0,35.0,0.11,42.0,45.84,1.5,0.0
VDZ,166.0,0.12,78.0,17.0,0.05,26.0,13.22,1.23,0.0
DD2,439.0,0.33,98.0,46.0,0.14,58.0,34.83,1.19,0.0


We can do the same for the DocuScope frequency tables:

In [62]:
kds = keyness_table(ds_tar, ds_ref, tags_only=True)

In [63]:
kds.sort_values('LR', ascending=False).head(10).style.hide(axis='index').format(precision=2)

Tag,AF,RF,Range,AF Ref,RF Ref,Range Ref,LL,LR,PV
Citation Hedged,35,0.02,40.0,1,0.0,2.0,8.78,3.02,0.0
Responsibility,176,0.11,70.0,9,0.02,16.0,31.41,2.18,0.0
Information Change Negative,133,0.09,50.0,10,0.03,16.0,16.33,1.62,0.0
Academic Writing Moves,461,0.3,90.0,45,0.12,46.0,38.75,1.25,0.0
Metadiscourse Interactive,519,0.34,100.0,56,0.15,60.0,36.13,1.1,0.0
Academic Terms,9315,6.08,100.0,1016,2.77,98.0,634.22,1.09,0.0
Information Change,1252,0.82,98.0,137,0.37,78.0,84.67,1.08,0.0
Inquiry,801,0.52,100.0,90,0.25,68.0,51.2,1.04,0.0
Confidence Hedged,1341,0.88,100.0,158,0.43,86.0,76.92,0.97,0.0
Confidence Low,25,0.02,26.0,3,0.01,6.0,1.37,0.95,0.24
