# Corpus analysis

The docuscospacy package supports the generation of:

* Token frequency tables
* Ngram tables
* Collocation tables around a node word
* Keyword comparisons against a reference corpus

Most importantly, **outputs can be controlled either by part-of-speech or by DocuScope tag**. Thus, *can* as noun and *can* as verb, for example, can be disambiguated.

Additionally, tagged multi-token sequences are aggregated for analysis. So, for example, where *in spite of* is tagged as a token sequence, it is combined into a single token.

In [9]:
import warnings
warnings.filterwarnings('ignore')

## Processing a corpus

Before we generate any counts or tables, we need to load a corpus and tokenize it. Be sure you have downloaded the `en_docusco_spacy` model from [the huggingface model repository](https://huggingface.co/browndw/en_docusco_spacy).

We will also load `Corpus`, `vocabulary_size` and `corpus_num_tokens` from **tmtoolkit**. If you aren't familiar with the package, be sure to [familiarize yourself with it](https://tmtoolkit.readthedocs.io/en/latest/text_corpora.html).

We will also import `re` for some simple pre-processing.

In [10]:
import spacy
from tmtoolkit.corpus import Corpus, vocabulary_size, corpus_num_tokens
import re

First, we need to load a spacy instance from the model.

In [14]:
%%capture
!pip install https://huggingface.co/browndw/en_docusco_spacy/resolve/main/en_docusco_spacy-any-py3-none-any.whl

# Using spacy.load().
import spacy
nlp = spacy.load("en_docusco_spacy")

Next, we will define a simple pre-processing function. **For accurate tagging**, possessive *its* should be split into two tokens. The last part of the function will eliminate carriage returns, tabs, extra spaces, etc.

<div class="alert alert-info">

**Note: Adding pre-processing functions**

You can also pass other functions in the list given to the `raw_preproc` argument. For example, `raw_preproc=[pre_process, simplify_unicode_chars]` would add a function built in to **tmtoolkit** that replaces accented characters with unaccented ones.

</div>

In [15]:
def pre_process(txt):
    txt = re.sub(r'\bits\b', 'it s', txt)
    txt = re.sub(r'\bIts\b', 'It s', txt)
    txt = " ".join(txt.split())
    return(txt)
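
As a quick check of what the function does, the sketch below repeats the definition so that it runs on its own and applies it to a made-up string. The sample text is an assumption for illustration and is not taken from the corpus.

```python
import re

def pre_process(txt):
    # Split possessive "its"/"Its" into two tokens, as in the cell above.
    txt = re.sub(r'\bits\b', 'it s', txt)
    txt = re.sub(r'\bIts\b', 'It s', txt)
    # Collapse tabs, newlines and repeated spaces into single spaces.
    txt = " ".join(txt.split())
    return txt

sample = "Its  scope\tand its\nlimits"  # hypothetical input
cleaned = pre_process(sample)
print(cleaned)
```

Both capitalized and lower-case forms are split, and all runs of whitespace are reduced to one space.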

The target corpus is a sample of academic papers available from the [**docuscospacy** repository](https://github.com/browndw/docuscospacy/tree/main/docs/source/data). Note the token attributes being returned: `spacy_token_attrs=['tag', 'ent_iob', 'ent_type', 'is_punct']`:

In [16]:
%%time
corp = Corpus.from_folder('data/tar_corpus', spacy_instance=nlp, raw_preproc=[pre_process], spacy_token_attrs=['tag', 'ent_iob', 'ent_type', 'is_punct'])

CPU times: user 6.45 s, sys: 993 ms, total: 7.44 s
Wall time: 7.54 s


It is simple to calculate and store some basic information about the corpus. These numbers will be useful later.

In [17]:
corpus_total = corpus_num_tokens(corp)
corpus_types = vocabulary_size(corp)
total_punct = []
for i in range(0,len(corp)):
    total_punct.append(sum(corp[i]['is_punct']))
total_punct = sum(total_punct)
non_punct = corpus_total - total_punct

In [18]:
print('Alphanumeric tokens:', non_punct, '\nPunctuation tokens:', total_punct, '\nTotal tokens:', corpus_total, '\nToken types:', corpus_types)

Alphanumeric tokens: 134410 
Punctuation tokens: 18821 
Total tokens: 153231 
Token types: 13645


## Converting a corpus

Before we generate any tables, we first need to convert the corpus into a convenient object that we can manipulate. From `docuscospacy.corpus_analysis` we will import a number of functions including `convert_corpus`. The function simply takes the object produced by the `Corpus.from_folder` function.

In [19]:
from docuscospacy import convert_corpus, frequency_table, tags_table, ngrams_by_token, ngrams_by_tag, coll_table, tags_dtm, normalize_dtm, dtm_to_coo, kwic_center_node, keyness_table, tag_ruler

In [20]:
tp = convert_corpus(corp)

The result is a dictionary, whose keys are the names of the corpus files:

In [21]:
list(tp.keys())[:9]

['acad_23',
 'acad_37',
 'acad_36',
 'acad_22',
 'acad_34',
 'acad_20',
 'acad_08',
 'acad_09',
 'acad_21']

And the values are lists of nltk-like tuples, each holding a token, its part-of-speech tag and its DocuScope tag:

In [22]:
list(tp.values())[1][:9]

[('Often ', 'RR', 'B-Narrative'),
 ('referred ', 'VVN', 'B-InformationReportVerbs'),
 ('to ', 'II', 'I-InformationReportVerbs'),
 ('as ', 'II', 'I-InformationReportVerbs'),
 ('the ', 'AT', 'O-'),
 ('"', 'Y', 'O-'),
 ('Cartesian ', 'JJ', 'B-Description'),
 ('Circle', 'NN1', 'I-Description'),
 ('"', 'Y', 'O-')]
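
To make the aggregation of multi-token sequences concrete, here is a minimal sketch that merges tokens whose DocuScope tag starts with `I-` into the preceding `B-` token. The tuples are copied from the output above; the helper `merge_ds_sequences` is hypothetical and is not the package's own implementation, and the `Untagged` label and lower-casing are assumptions based on the tables shown later in this section.

```python
# Tuples copied from the sample output: (token, POS tag, DocuScope tag).
doc = [
    ("Often ", "RR", "B-Narrative"),
    ("referred ", "VVN", "B-InformationReportVerbs"),
    ("to ", "II", "I-InformationReportVerbs"),
    ("as ", "II", "I-InformationReportVerbs"),
    ("the ", "AT", "O-"),
    ('"', "Y", "O-"),
    ("Cartesian ", "JJ", "B-Description"),
    ("Circle", "NN1", "I-Description"),
    ('"', "Y", "O-"),
]

def merge_ds_sequences(tuples):
    """Hypothetical helper: join B-/I- runs into one (token, category) pair."""
    merged = []
    for token, _pos, ds in tuples:
        prefix, _, label = ds.partition("-")
        if prefix == "I" and merged:
            # Continuation of a sequence: append to the previous token.
            prev_token, prev_label = merged[-1]
            merged[-1] = (prev_token + token, prev_label)
        else:
            # Start of a sequence (B-) or a token outside any category (O-).
            merged.append((token, label or "Untagged"))
    return [(tok.strip().lower(), label) for tok, label in merged]

merged = merge_ds_sequences(doc)
for row in merged:
    print(row)
```

In this sketch *referred to as* and *Cartesian Circle* each end up as a single token with a single category.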

## Frequency tables

Frequency tables are produced by the `frequency_table` function, which takes a converted corpus object, a count against which to normalize and a `count_by` argument that is one of **'pos'** or **'ds'** for part-of-speech or DocuScope category.

In addition to being trained on DocuScope, the spaCy model was trained on the [CLAWS7 tagset](https://ucrel.lancs.ac.uk/claws7tags.html). Those tags are the default counting method.

<div class="alert alert-info">

**Note: Normalizing**

Here, we use `non_punct` (or the total number of tokens that are not punctuation), as the part-of-speech token count omits tokens tagged as punctuation. For normalizing DocuScope tags, it is suggested that you use a total token count that **includes** punctuation, as DocuScope tags punctuation marks under certain conditions.
    
</div>

In [23]:
wc = frequency_table(tp, non_punct)

The table returns columns for the token, its tag, absolute frequency (AF), relative frequency (RF, per million tokens) and range (the percentage of texts in which the token appears):

In [24]:
wc.head(10).style.hide(axis='index').format(precision=2)

Token,Tag,AF,RF,Range
the,AT,9601,71430.7,100.0
of,IO,5063,37668.33,100.0
and,CC,3672,27319.4,100.0
in,II,2866,21322.82,100.0
a,AT1,2560,19046.2,100.0
to,TO,2198,16352.95,100.0
is,VBZ,1784,13272.82,98.0
that,CST,1521,11316.12,100.0
to,II,1301,9679.34,100.0
for,IF,1098,8169.04,100.0
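
The columns can be reproduced by hand. The sketch below is a simplified stand-in for the counting, not the package's code: it counts (token, tag) pairs in a hypothetical two-document corpus, so that *can* as a modal verb and *can* as a noun stay separate, and then checks the per-million arithmetic against the first row of the table above. The toy corpus and the helper `toy_frequency_table` are assumptions for illustration.

```python
from collections import Counter

# Hypothetical toy corpus in the (token, POS tag, DocuScope tag) shape.
toy = {
    "doc_a": [("We ", "PPIS2", "O-"), ("can ", "VM", "O-"), ("open ", "VVI", "O-"),
              ("the ", "AT", "O-"), ("can", "NN1", "O-")],
    "doc_b": [("They ", "PPHS2", "O-"), ("can ", "VM", "O-"), ("wait", "VVI", "O-")],
}

def toy_frequency_table(tp, n_tokens):
    af = Counter()        # absolute frequency of each (token, tag) pair
    doc_freq = Counter()  # number of documents containing each pair
    for doc in tp.values():
        pairs = [(tok.strip().lower(), tag) for tok, tag, _ds in doc]
        af.update(pairs)
        doc_freq.update(set(pairs))
    rows = []
    for (tok, tag), count in af.most_common():
        rf = count / n_tokens * 1_000_000           # per million tokens
        rng = doc_freq[(tok, tag)] / len(tp) * 100  # percent of documents
        rows.append((tok, tag, count, rf, rng))
    return rows

n_toy = sum(len(doc) for doc in toy.values())
table = toy_frequency_table(toy, n_toy)
for row in table:
    print(row)

# Arithmetic check against the real table: 'the' (AT), AF 9601, non_punct 134410.
rf_the = 9601 / 134410 * 1_000_000
print(round(rf_the, 2))
```

The last line agrees with the RF of 71430.7 reported for *the*.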


The resulting data frame is easy to filter and sort. So, here, we filter for an absolute frequency greater than 10 and tokens tagged as verbs (starting with 'V'):

In [25]:
wc.query('AF > 10 and Tag.str.startswith("V")').head(10).style.hide(axis='index').format(precision=2)

Token,Tag,AF,RF,Range
is,VBZ,1784,13272.82,98.0
be,VBI,960,7142.33,98.0
are,VBR,763,5676.66,96.0
was,VBDZ,594,4419.31,92.0
will,VM,510,3794.36,82.0
can,VM,422,3139.65,94.0
were,VBDR,385,2864.37,84.0
has,VHZ,334,2484.93,86.0
have,VH0,296,2202.22,78.0
would,VM,288,2142.7,90.0


Here, we filter for adverbs. Note that multi-word units tagged as a sequence are aggregated into a single token (like *for example* or, in the output below, *et al*):

In [26]:
wc.query('Tag.str.startswith("R")').head(20).style.hide(axis='index').format(precision=2)

Token,Tag,AF,RF,Range
also,RR,302,2246.86,98.0
more,RGR,252,1874.86,80.0
et al,RA,201,1495.42,12.0
however,RR,180,1339.19,80.0
only,RR,158,1175.51,84.0
then,RT,130,967.19,82.0
most,RGT,125,929.99,72.0
how,RRQ,113,840.71,70.0
out,RP,100,743.99,72.0
even,RR,83,617.51,66.0


Similarly, we can generate a frequency table of DocuScope tokens by setting `count_by='ds'`. Note that here we normalize by `corpus_total`, as DocuScope includes punctuation in its tagging system:

In [27]:
wc = frequency_table(tp, n_tokens=corpus_total, count_by='ds')

In [28]:
wc.head(10).style.hide(axis='index').format(precision=2)

Token,Tag,AF,RF,Range
the,Untagged,5593,36500.45,100.0
and,Untagged,3512,22919.64,100.0
of,Untagged,3245,21177.18,100.0
in,Untagged,1818,11864.44,100.0
to,Untagged,1630,10637.53,100.0
a,Untagged,1415,9234.42,100.0
that,Untagged,968,6317.26,100.0
for,Untagged,795,5188.25,98.0
as,Untagged,626,4085.34,100.0
with,Untagged,596,3889.55,100.0


As with part-of-speech tags, we can easily filter the data frame for the desired [DocuScope category](https://docuscospacy.readthedocs.io/en/latest/docuscope.html#Categories). Here, we filter for 'Character':

In [29]:
wc.query('Tag.str.startswith("Character")').head(10).style.hide(axis='index').format(precision=2)

Token,Tag,AF,RF,Range
their,Character,342,2231.92,90.0
his,Character,224,1461.85,50.0
he,Character,133,867.97,50.0
students,Character,125,815.76,18.0
participants,Character,113,737.45,14.0
children,Character,93,606.93,16.0
workers,Character,90,587.35,14.0
american,Character,76,495.98,22.0
figure,Character,71,463.35,28.0
al,Character,67,437.25,6.0


Or by 'Public Terms':

In [30]:
wc.query('Tag.str.startswith("Public")').head(20).style.hide(axis='index').format(precision=2)

Token,Tag,AF,RF,Range
national,Public Terms,99,646.08,30.0
political,Public Terms,63,411.14,26.0
society,Public Terms,55,358.94,28.0
citizenship,Public Terms,51,332.83,6.0
population,Public Terms,45,293.67,28.0
discussion,Public Terms,44,287.15,34.0
organizations,Public Terms,42,274.1,10.0
god,Public Terms,34,221.89,6.0
lesson,Public Terms,34,221.89,6.0
criteria,Public Terms,24,156.63,8.0


## Tags tables

Rather than counting tokens, we can generate counts of the tags **only** by using the `tags_table` function. It works just like the `frequency_table` function, taking a dictionary created by the `convert_corpus` function, an integer against which to normalize, and a `count_by` argument of either 'pos' or 'ds'.

In [31]:
tc = tags_table(tp, non_punct)

In [32]:
tc.head(10).style.hide(axis='index').format(precision=2)

Tag,AF,RF,Range
NN1,23989,17.85,100.0
JJ,11175,8.31,100.0
AT,9714,7.23,100.0
II,9466,7.04,100.0
NN2,9185,6.83,100.0
IO,5063,3.77,100.0
CC,4182,3.11,100.0
NP1,4180,3.11,98.0
RR,4164,3.1,100.0
VVN,3383,2.52,100.0


And by DocuScope category:

In [33]:
dc = tags_table(tp, corpus_total, count_by='ds')

In [34]:
dc.sort_values('RF', ascending=False).head(10).style.hide(axis='index').format(precision=2)

Tag,AF,RF,Range
Academic Terms,9223,6.02,100.0
Character,7682,5.01,100.0
Narrative,6785,4.43,100.0
Description,6740,4.4,100.0
Information Exposition,5086,3.32,100.0
Information Topics,3636,2.37,98.0
Negative,3555,2.32,100.0
Positive,3037,1.98,100.0
Metadiscourse Cohesive,2496,1.63,100.0
Reasoning,2337,1.53,100.0
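
Note that RF in the tag tables is on a different scale from the per-million figures in the token tables. The printed values are consistent with a percentage of the normalizing count, which the short check below illustrates using numbers copied from the outputs above. This is an inference from the output rather than a statement of how the function is implemented.

```python
non_punct = 134410     # normalizing count used for the part-of-speech table
corpus_total = 153231  # normalizing count used for the DocuScope table

nn1_rf = 23989 / non_punct * 100          # NN1, AF 23989
academic_rf = 9223 / corpus_total * 100   # Academic Terms, AF 9223

print(round(nn1_rf, 2), round(academic_rf, 2))
```

These round to 17.85 and 6.02, matching the first rows of the two tables.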


## Ngram tables

N-grams (from bigrams to 4-grams) can be generated with either the `ngrams_by_token` or the `ngrams_by_tag` function:
* You can input a word or string using the `ngrams_by_token` function. With that function you need to specify whether that input should match a token completely or partially, and choose which tagset to return.
* Alternatively, you can use the `ngrams_by_tag` function. That allows you to select a tag (like **NN1** or **AcademicTerms**) as the basis for your n-grams.
* For either option, you must select the size of your n-grams (2-grams, 3-grams, or 4-grams) and the slot where your chosen word or tag should appear (on the left, in the middle, or on the right).
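
The interaction between n-gram size and slot can be sketched in a few lines of plain Python. The helper `ngrams_with_node` is hypothetical and only borrows the parameter names `span` and `node_position` from the package's functions; the sentence is made up for illustration, and the sketch covers exact matches only.

```python
def ngrams_with_node(tokens, node, span, node_position):
    """Return every span-gram in which `node` fills slot `node_position` (1-based)."""
    hits = []
    for start in range(len(tokens) - span + 1):
        gram = tuple(tokens[start:start + span])
        if gram[node_position - 1] == node:
            hits.append(gram)
    return hits

tokens = "the data from the survey and the data collection process".split()

first = ngrams_with_node(tokens, "data", span=3, node_position=1)
last = ngrams_with_node(tokens, "data", span=3, node_position=3)
print(first)
print(last)
```

Moving the node from the first to the last slot changes which trigrams qualify: the first occurrence of *data* has too little left context to sit in slot 3.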

We'll start by searching for n-grams of length **3** with **data** in the first position.

In [35]:
nc = ngrams_by_token(tp, node_word='data', n_tokens=non_punct, node_position=1, span=3, search_type='fixed', count_by='pos')

The returned data frame includes both the sequence of tokens, as well as the sequence of tags:

In [36]:
nc.head(10).style.hide(axis='index').format(precision=2)

Token1,Token2,Token3,Tag1,Tag2,Tag3,AF,RF,Range
data,from,the,NN,II,AT,6,44.64,8.0
data,collection,process,NN,NN1,NN1,3,22.32,2.0
data,was,recorded,NN,VBDZ,VVN,3,22.32,2.0
data,analysis,this,NN,NN1,DD1,2,14.88,2.0
data,can,be,NN,VM,VBI,2,14.88,4.0
data,collection,achieves,NN,NN1,VVZ,2,14.88,2.0
data,collection,and,NN,NN1,CC,2,14.88,2.0
data,collection,will,NN,NN1,VM,2,14.88,2.0
data,does,not,NN,VDZ,XX,2,14.88,4.0
data,for,the,NN,IF,AT,2,14.88,4.0


We can similarly look for n-grams that include only part of a word. For example, we can find bigrams that include a word ending with **-tion** by setting the `search_type` to **ends_with**.

In [37]:
nc = ngrams_by_token(tp, node_word='tion', n_tokens=non_punct, node_position=2, span=2, search_type='ends_with', count_by='pos')

In [38]:
nc.head(10).style.hide(axis='index').format(precision=2)

Token1,Token2,Tag1,Tag2,AF,RF,Range
the,intervention,AT,NN1,34,252.96,2.0
citizenship,education,NN1,NN1,30,223.2,2.0
the,nation,AT,NN1,27,200.88,12.0
data,collection,NN,NN1,17,126.48,8.0
higher,education,JJR,NN1,16,119.04,4.0
of,education,IO,NN1,16,119.04,8.0
the,formation,AT,NN1,15,111.6,8.0
the,notion,AT,NN1,15,111.6,16.0
brow,manipulation,NN1,NN1,14,104.16,2.0
the,manipulation,AT,NN1,13,96.72,2.0


Now we'll collect n-grams using the `ngrams_by_tag` function. Here, we'll look at 3-token sequences that end with a past participle (**VVN**).

In [39]:
nc = ngrams_by_tag(tp, tag='VVN', n_tokens=non_punct, tag_position=3, span=3, search_type='fixed', count_by='pos')

In [40]:
nc.head(10).style.hide(axis='index').format(precision=2)

Token1,Token2,Token3,Tag1,Tag2,Tag3,AF,RF,Range
can,be,seen,VM,VBI,VVN,17,126.48,16.0
can,be,used,VM,VBI,VVN,10,74.4,14.0
to,be,used,TO,VBI,VVN,10,74.4,14.0
could,be,used,VM,VBI,VVN,7,52.08,10.0
should,be,noted,VM,VBI,VVN,7,52.08,8.0
will,be,asked,VM,VBI,VVN,7,52.08,8.0
has,been,shown,VHZ,VBN,VVN,6,44.64,8.0
can,be,found,VM,VBI,VVN,5,37.2,8.0
can,be,observed,VM,VBI,VVN,5,37.2,4.0
have,been,used,VH0,VBN,VVN,5,37.2,10.0


Similar n-gram tables can be created for DocuScope sequences. Here we generate trigrams:

In [41]:
nc = ngrams_by_tag(tp, tag='AcademicTerms', n_tokens=non_punct, tag_position=3, span=3, search_type='fixed', count_by='ds')

In [42]:
nc.head(10).style.hide(axis='index').format(precision=2)

Token1,Token2,Token3,Tag1,Tag2,Tag3,AF,RF,Range
part,time,faculty,Untagged,InformationTopics,AcademicTerms,117,870.47,2.0
of,citizenship,education,Untagged,PublicTerms,AcademicTerms,10,74.4,2.0
full,time,faculty,AcademicTerms,InformationTopics,AcademicTerms,9,66.96,2.0
of,sodium,bicarbonate,Untagged,AcademicTerms,AcademicTerms,9,66.96,2.0
%,sodium,bicarbonate,InformationExposition,AcademicTerms,AcademicTerms,8,59.52,2.0
national,identity,formation,PublicTerms,AcademicTerms,AcademicTerms,8,59.52,2.0
reinforced,concrete,structures,InformationChangePositive,Description,AcademicTerms,8,59.52,2.0
academy,of,pediatrics,InformationTopics,Untagged,AcademicTerms,7,52.08,2.0
faculty,in,higher education,AcademicTerms,Untagged,AcademicTerms,7,52.08,2.0
irony,and,metaphor,Narrative,Untagged,AcademicTerms,7,52.08,2.0


## Collocations

Collocations within a span (left and right) of a node word can be calculated according to several association measures.

The default span is 4 tokens to the left and 4 tokens to the right of the node word.

Like `frequency_table`, `coll_table` requires a dictionary of the type generated by the `convert_corpus` function. It also requires a node word, a node tag, and an association measure statistic. 
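
For orientation, the three statistics used in the examples that follow (`pmi`, `npmi` and `pmi3`) can be written out from their textbook definitions. This is a sketch under that assumption: how the package counts the span frequency and which corpus size it uses are not shown here, so these functions are not guaranteed to reproduce its output exactly. The frequencies are made-up round numbers.

```python
import math

def pmi(f_xy, f_x, f_y, n):
    """Pointwise mutual information: log2 of p(x,y) / (p(x) * p(y))."""
    return math.log2((f_xy * n) / (f_x * f_y))

def npmi(f_xy, f_x, f_y, n):
    """Normalized PMI: PMI divided by -log2 p(x,y); bounded above by 1."""
    return pmi(f_xy, f_x, f_y, n) / -math.log2(f_xy / n)

def pmi3(f_xy, f_x, f_y, n):
    """Cubed variant: log2 of p(x,y)^3 / (p(x) * p(y))."""
    return math.log2(f_xy ** 3 / (f_x * f_y * n))

# Hypothetical counts: co-occurrences, node frequency, collocate frequency, corpus size.
f_xy, f_x, f_y, n = 8, 16, 32, 1024

print(pmi(f_xy, f_x, f_y, n))
print(npmi(f_xy, f_x, f_y, n))
print(pmi3(f_xy, f_x, f_y, n))
```

Plain PMI rewards rare collocates heavily, which is why filtering on total frequency is often useful; NPMI and the cubed variant are common ways of damping that effect.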

In [43]:
ct = coll_table(tp, 'can', node_tag='V', statistic='pmi', count_by='pos')

In [44]:
ct.head(10).style.hide(axis='index').format(precision=2)

Token,Tag,Freq Span,Freq Total,MI
perceive,NN1,2,1,9.27
undone,VVN,2,1,9.27
1b,FO,1,1,8.27
abrasion,NN1,1,1,8.27
abrogate,VVI,1,1,8.27
absorb,VVI,1,1,8.27
altered,JJ,1,1,8.27
ameliorate,VVI,1,1,8.27
anew,RR,1,1,8.27
antibiotics,NN2,1,1,8.27


In [45]:
ct.query('`Freq Total` > 5 and MI > 3 and Tag.str.startswith("V")').head(10).style.hide(axis='index').format(precision=2)

Token,Tag,Freq Span,Freq Total,MI
assume,VVI,6,9,7.69
arise,VVI,3,6,7.27
occur,VVI,11,23,7.21
seen,VVN,18,39,7.16
achieved,VVN,3,7,7.05
doubt,VVI,3,7,7.05
expect,VVI,5,12,7.01
studied,VVN,3,8,6.86
happen,VVI,2,6,6.69
lead,VVI,9,30,6.54


In [46]:
ct = coll_table(tp, 'can', node_tag='V', statistic='npmi', count_by='pos')
ct.head(10).style.hide(axis='index').format(precision=2)

Token,Tag,Freq Span,Freq Total,MI
be,VBI,190,960,0.63
perceive,NN1,2,1,0.58
undone,VVN,2,1,0.58
seen,VVN,18,39,0.56
assume,VVI,6,9,0.53
occur,VVI,11,23,0.53
comprehend,VVI,2,2,0.52
current,NN1,2,2,0.52
deicing,VVG,2,2,0.52
deployed,VVN,2,2,0.52


In [47]:
ct = coll_table(tp, 'people', node_tag='Character', statistic='pmi3', count_by='ds')
ct.head(10).style.hide(axis='index').format(precision=2)

Token,Tag,Freq Span,Freq Total,MI
believing that,Character,2,3,-21.34
falsely,Negative,2,3,-21.34
more and more,ForceStressed,2,4,-21.76
number,Untagged,4,33,-21.8
infected with,InformationChangeNegative,2,5,-22.08
of,Untagged,17,3245,-22.16
and,Untagged,17,3512,-22.27
severe,Negative,2,6,-22.34
significant,ForceStressed,3,22,-22.46
to,Untagged,12,1630,-22.67


We can also calculate collocations while ignoring tags completely by setting `tag_ignore` to `True`:

In [48]:
ct = coll_table(tp, 'data', tag_ignore=True, statistic='npmi')
ct.head(10).style.hide(axis='index').format(precision=2)

Token,Freq Span,Freq Total,MI
collection,18,23,0.72
collected,13,15,0.71
conjunctions,2,1,0.66
weighting,2,1,0.66
achieves,3,3,0.62
gathered,3,3,0.62
qualitative,12,32,0.61
ample,2,2,0.6
split,2,2,0.6
recorded,9,24,0.59


## Document-term matrices for tags

Document-term matrices are basic data structures for text analysis. Each row is a document (observation) and each column is a token (variable). These [can be produced by **tmtoolkit**](https://tmtoolkit.readthedocs.io/en/latest/preprocessing.html#Generating-a-sparse-document-term-matrix-(DTM)) using the `dtm` function.

The **docuscospacy** package allows for the creation of DTMs with tag counts (rather than token counts) as variables.

These are produced by the `tags_dtm` function, which takes a dictionary created by the `convert_corpus` function and a `count_by` argument of either 'pos' or 'ds'.
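
Conceptually, a tag-based DTM is one row of tag counts per document. The sketch below builds such a table from a hypothetical two-document corpus with the standard library only. The helper `toy_tags_dtm` is an illustration of the data structure, not the package's implementation, and it counts part-of-speech tags without any of the aggregation or filtering the real function may apply.

```python
from collections import Counter

# Hypothetical toy corpus in the (token, POS tag, DocuScope tag) shape.
toy = {
    "doc_a": [("the ", "AT", "O-"), ("data ", "NN", "B-AcademicTerms"), ("show", "VV0", "O-")],
    "doc_b": [("a ", "AT1", "O-"), ("result", "NN1", "B-AcademicTerms")],
}

def toy_tags_dtm(tp):
    # One Counter of POS tags per document.
    counts = {doc_id: Counter(pos for _tok, pos, _ds in doc) for doc_id, doc in tp.items()}
    # Columns are the sorted union of all tags seen in any document.
    columns = sorted(set().union(*counts.values()))
    rows = [[doc_id] + [counts[doc_id][tag] for tag in columns] for doc_id in tp]
    return ["doc_id"] + columns, rows

header, rows = toy_tags_dtm(toy)
print(header)
for row in rows:
    print(row)
```

Tags absent from a document are filled with zero, so every row has the same set of columns.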

In [49]:
tm = tags_dtm(tp)

<div class="alert alert-warning">
    
**Warning: `doc_id` column**

The first column, 'doc_id', contains the names of the document files. The `tags_dtm` function does not place document ids as row names initially as a safety feature. Row names **must** be unique. Setting the document ids as a column allows users to account for any duplicates before proceeding.

</div>

The count that is returned is the raw count.

In [50]:
tm.head(10).style.hide(axis='index').format(precision=0)

doc_id,APPGE,AT,AT1,BCL,CC,CCB,CS,CSA,CSN,CST,CSW,DA,DA1,DA2,DAR,DAT,DB,DB2,DD,DD1,DD2,DDQ,DDQGE,DDQV,EX,FO,FW,GE,IF,II,IO,IW,JJ,JJR,JJT,JK,MC,MC1,MC2,MCMC,MD,MF,ND1,NN,NN1,NN2,NNA,NNB,NNO,NNO2,NNT1,NNT2,NNU,NNU1,NNU2,NP1,NP2,NPD1,NPM1,PN,PN1,PNQO,PNQS,PNQV,PNX1,PPGE,PPH1,PPHO1,PPHO2,PPHS1,PPHS2,PPIO1,PPIO2,PPIS1,PPIS2,PPX1,PPX2,PPY,RA,REX,RG,RGQ,RGQV,RGR,RGT,RL,RP,RPK,RR,RRQ,RRQV,RRR,RRT,RT,TO,UH,VBDR,VBDZ,VBG,VBI,VBM,VBN,VBR,VBZ,VD0,VDD,VDG,VDI,VDN,VDZ,VH0,VHD,VHG,VHI,VHN,VHZ,VM,VMK,VV0,VVD,VVG,VVGK,VVI,VVN,VVZ,XX,ZZ1,ZZ2
acad_23,7,105,36,0,55,1,15,3,0,12,0,5,0,2,1,0,8,3,1,27,8,2,0,1,4,3,0,6,23,73,46,13,105,0,1,1,16,8,0,0,6,0,0,14,265,119,0,0,0,0,21,7,3,0,0,4,0,4,2,0,2,0,0,0,0,0,7,1,1,0,3,0,1,0,20,0,0,6,0,0,2,0,0,2,2,0,2,0,28,2,0,0,0,1,32,0,2,3,0,19,0,1,17,18,0,0,0,0,0,0,4,0,0,0,0,3,47,0,23,5,15,0,61,45,16,6,13,0
acad_37,10,48,15,3,26,4,12,6,0,27,0,4,2,2,0,0,0,3,4,15,9,7,0,1,1,0,0,1,5,45,27,8,61,1,0,0,3,1,0,0,10,0,0,0,91,26,0,0,0,0,6,0,0,0,0,22,0,0,0,0,10,0,0,0,0,0,23,1,3,11,3,0,6,2,32,0,0,0,0,0,10,0,0,0,0,2,4,0,52,4,0,3,0,2,15,0,0,2,1,8,1,2,5,21,2,2,0,0,1,3,1,0,0,1,0,3,16,0,24,3,9,0,30,6,25,13,0,0
acad_36,14,204,58,3,52,4,23,7,2,35,2,5,2,4,1,1,3,3,6,9,4,30,1,0,4,8,0,29,8,100,84,8,151,0,0,2,6,3,0,0,7,0,0,4,409,76,0,0,0,0,1,0,0,0,0,29,0,0,0,0,6,0,1,0,0,0,16,2,2,16,6,0,0,2,0,0,0,1,5,5,12,0,0,7,0,2,3,0,98,2,0,0,0,3,45,0,1,1,2,34,0,0,15,55,1,0,0,0,2,7,3,1,0,3,0,8,46,0,24,6,23,0,62,54,49,22,13,0
acad_22,51,175,83,3,123,10,22,14,3,23,0,5,0,6,1,0,2,4,2,6,3,18,1,1,0,2,7,32,14,217,103,21,221,2,0,0,7,3,0,0,3,0,0,3,420,167,0,3,1,0,14,5,1,0,0,311,2,0,0,0,3,3,11,0,0,1,7,5,2,40,3,0,3,2,0,2,0,0,0,2,5,0,0,2,3,2,9,0,73,9,0,6,0,11,22,0,3,3,1,4,0,2,1,9,2,3,1,1,0,6,1,2,0,12,1,3,24,0,26,53,40,0,32,55,78,5,4,0
acad_34,3,68,38,0,55,10,12,7,1,17,1,0,0,1,2,0,1,0,0,21,3,6,0,1,1,0,0,2,9,73,51,11,117,3,2,2,2,1,0,0,2,0,0,2,223,82,0,0,0,0,2,0,0,0,0,24,0,0,0,0,1,0,0,0,0,0,9,0,0,0,0,0,0,0,0,0,0,0,0,1,2,0,0,1,0,2,6,0,45,0,0,3,1,1,26,0,0,2,0,10,0,0,2,23,1,0,0,0,1,3,1,0,1,2,0,3,16,0,7,4,20,0,36,20,26,8,0,0
acad_20,7,91,27,0,29,1,7,2,0,7,0,2,1,1,0,0,1,1,2,3,6,5,0,0,0,15,0,10,11,109,41,9,119,1,0,0,9,0,1,0,3,0,0,4,224,87,0,1,0,1,4,5,1,0,0,24,0,0,4,0,0,0,1,0,0,0,4,0,1,0,3,0,0,1,0,0,0,0,0,1,1,0,0,1,2,0,3,0,29,0,0,1,0,2,12,0,3,8,0,0,0,0,0,1,0,3,0,0,0,0,0,6,0,1,0,0,3,0,3,39,24,0,15,17,6,4,1,0
acad_08,20,51,25,0,44,1,7,4,0,17,0,5,1,0,0,0,12,1,0,11,2,6,0,1,1,0,0,21,12,102,26,8,56,1,2,0,1,2,0,0,4,0,0,2,169,56,0,0,0,0,10,2,0,0,0,71,0,0,0,0,16,0,1,0,0,0,13,0,1,6,3,0,0,0,1,4,1,0,0,0,1,3,0,0,0,2,6,0,38,5,0,2,1,4,41,0,1,7,2,4,0,0,7,38,1,0,0,0,0,1,3,1,0,3,0,6,19,0,5,9,29,0,55,20,19,4,0,0
acad_09,65,196,42,1,83,3,39,8,2,44,0,5,1,7,2,1,5,0,2,18,7,4,0,0,1,0,0,16,36,164,94,24,150,0,2,1,2,0,0,0,4,0,2,23,331,100,0,0,0,0,2,1,0,0,0,263,0,0,0,0,6,3,13,1,0,0,17,12,2,33,7,0,1,0,0,4,1,0,0,0,4,1,0,2,1,2,8,0,75,4,0,3,0,5,50,0,10,53,1,13,0,9,1,10,0,3,1,2,0,0,0,17,3,16,0,0,27,0,11,93,36,0,48,78,25,15,0,0
acad_21,4,97,24,1,25,5,8,5,0,13,0,1,0,1,0,0,2,1,0,4,4,6,0,0,0,6,0,10,3,57,77,8,134,2,0,0,6,1,6,0,5,0,0,3,189,71,0,0,0,0,3,0,0,0,0,53,0,0,0,0,0,0,3,0,0,0,1,0,1,4,1,0,0,0,3,1,0,0,1,3,1,0,0,2,3,3,6,0,35,2,0,0,0,1,19,0,6,3,0,2,0,1,3,8,0,0,1,1,0,0,3,1,1,1,0,3,3,0,12,18,23,0,19,20,23,3,0,0
acad_35,47,323,84,1,151,2,29,8,2,24,3,7,1,1,5,0,3,5,23,34,12,18,0,0,5,2,0,9,39,259,159,24,289,1,0,2,17,6,0,0,18,0,0,31,878,300,2,0,0,0,15,7,0,0,0,64,0,0,5,0,1,1,13,0,0,0,19,0,3,4,13,0,1,0,0,1,0,4,0,2,2,2,0,0,0,1,8,0,68,5,0,3,0,4,79,0,0,3,1,85,0,1,17,55,6,0,0,2,1,2,8,1,1,9,0,9,136,1,46,14,83,0,131,114,35,23,57,0


A similar dtm can be created for DocuScope categories by setting `count_by` to 'ds':

In [51]:
tm = tags_dtm(tp, count_by='ds')
tm.head(10).style.hide(axis='index').format(precision=0)

doc_id,AcademicTerms,AcademicWritingMoves,Character,Citation,CitationAuthority,CitationHedged,ConfidenceHedged,ConfidenceHigh,ConfidenceLow,Contingent,Description,Facilitate,FirstPerson,ForceStressed,Future,InformationChange,InformationChangeNegative,InformationChangePositive,InformationExposition,InformationPlace,InformationReportVerbs,InformationStates,InformationTopics,Inquiry,Interactive,MetadiscourseCohesive,MetadiscourseInteractive,Narrative,Negative,Positive,PublicTerms,Reasoning,Responsibility,Strategic,Uncertainty,Untagged,Updates
acad_23,67,10,42,0,3,0,14,2,0,9,61,1,14,25,24,17,0,9,59,0,37,23,72,2,15,30,3,100,18,22,9,24,5,25,1,424,6
acad_37,30,0,54,7,0,0,6,43,3,4,21,5,6,22,0,3,0,1,45,0,5,10,0,6,5,16,2,42,31,10,4,28,1,9,9,244,6
acad_36,170,3,165,14,7,1,27,20,2,23,44,1,0,25,12,6,0,0,73,0,12,45,35,16,46,43,19,83,61,56,10,46,7,16,8,614,4
acad_22,166,2,386,13,3,0,20,14,1,20,147,8,3,37,6,27,0,4,77,50,20,11,64,30,27,42,8,156,45,50,26,29,2,25,4,824,2
acad_34,118,8,66,6,7,1,8,11,0,14,22,27,0,25,9,12,0,2,36,0,9,12,29,5,6,30,2,48,27,30,11,26,1,20,2,299,5
acad_20,53,1,105,11,1,0,1,2,0,2,39,3,0,19,1,10,1,4,26,14,7,0,22,1,2,17,3,93,76,14,48,13,1,11,0,305,1
acad_08,50,0,84,4,3,0,7,8,0,4,27,7,0,34,13,20,0,3,32,1,18,32,18,3,4,23,1,49,34,58,40,24,0,9,0,328,1
acad_09,54,1,239,18,4,2,14,20,0,18,102,7,0,43,7,9,5,2,57,49,15,6,15,3,11,36,5,172,126,71,49,28,9,44,4,810,4
acad_21,87,0,77,7,1,0,9,3,0,3,75,3,1,14,2,7,2,2,38,5,7,9,16,5,7,12,4,68,41,27,17,19,0,21,0,301,3
acad_35,297,29,313,20,0,0,43,14,0,25,113,28,1,25,109,13,0,3,126,14,52,37,170,73,50,77,17,130,54,87,34,75,16,122,5,1116,8


Counts can also be normalized using the `normalize_dtm` function. The scheme can either be set to **prop** or **tfidf**.

In [52]:
norm_tm = normalize_dtm(tm, scheme='prop')
norm_tm.head(10).style.hide(axis='index').format(precision=4)

doc_id,AcademicTerms,AcademicWritingMoves,Character,Citation,CitationAuthority,CitationHedged,ConfidenceHedged,ConfidenceHigh,ConfidenceLow,Contingent,Description,Facilitate,FirstPerson,ForceStressed,Future,InformationChange,InformationChangeNegative,InformationChangePositive,InformationExposition,InformationPlace,InformationReportVerbs,InformationStates,InformationTopics,Inquiry,Interactive,MetadiscourseCohesive,MetadiscourseInteractive,Narrative,Negative,Positive,PublicTerms,Reasoning,Responsibility,Strategic,Uncertainty,Untagged,Updates
acad_23,0.0571,0.0085,0.0358,0.0,0.0026,0.0,0.0119,0.0017,0.0,0.0077,0.052,0.0009,0.0119,0.0213,0.0205,0.0145,0.0,0.0077,0.0503,0.0,0.0315,0.0196,0.0614,0.0017,0.0128,0.0256,0.0026,0.0853,0.0153,0.0188,0.0077,0.0205,0.0043,0.0213,0.0009,0.3615,0.0051
acad_37,0.0442,0.0,0.0796,0.0103,0.0,0.0,0.0088,0.0634,0.0044,0.0059,0.031,0.0074,0.0088,0.0324,0.0,0.0044,0.0,0.0015,0.0664,0.0,0.0074,0.0147,0.0,0.0088,0.0074,0.0236,0.0029,0.0619,0.0457,0.0147,0.0059,0.0413,0.0015,0.0133,0.0133,0.3599,0.0088
acad_36,0.0992,0.0018,0.0963,0.0082,0.0041,0.0006,0.0158,0.0117,0.0012,0.0134,0.0257,0.0006,0.0,0.0146,0.007,0.0035,0.0,0.0,0.0426,0.0,0.007,0.0263,0.0204,0.0093,0.0268,0.0251,0.0111,0.0484,0.0356,0.0327,0.0058,0.0268,0.0041,0.0093,0.0047,0.3582,0.0023
acad_22,0.0707,0.0009,0.1643,0.0055,0.0013,0.0,0.0085,0.006,0.0004,0.0085,0.0626,0.0034,0.0013,0.0158,0.0026,0.0115,0.0,0.0017,0.0328,0.0213,0.0085,0.0047,0.0272,0.0128,0.0115,0.0179,0.0034,0.0664,0.0192,0.0213,0.0111,0.0123,0.0009,0.0106,0.0017,0.3508,0.0009
acad_34,0.1263,0.0086,0.0707,0.0064,0.0075,0.0011,0.0086,0.0118,0.0,0.015,0.0236,0.0289,0.0,0.0268,0.0096,0.0128,0.0,0.0021,0.0385,0.0,0.0096,0.0128,0.031,0.0054,0.0064,0.0321,0.0021,0.0514,0.0289,0.0321,0.0118,0.0278,0.0011,0.0214,0.0021,0.3201,0.0054
acad_20,0.0584,0.0011,0.1158,0.0121,0.0011,0.0,0.0011,0.0022,0.0,0.0022,0.043,0.0033,0.0,0.0209,0.0011,0.011,0.0011,0.0044,0.0287,0.0154,0.0077,0.0,0.0243,0.0011,0.0022,0.0187,0.0033,0.1025,0.0838,0.0154,0.0529,0.0143,0.0011,0.0121,0.0,0.3363,0.0011
acad_08,0.0532,0.0,0.0895,0.0043,0.0032,0.0,0.0075,0.0085,0.0,0.0043,0.0288,0.0075,0.0,0.0362,0.0138,0.0213,0.0,0.0032,0.0341,0.0011,0.0192,0.0341,0.0192,0.0032,0.0043,0.0245,0.0011,0.0522,0.0362,0.0618,0.0426,0.0256,0.0,0.0096,0.0,0.3493,0.0011
acad_09,0.0262,0.0005,0.1161,0.0087,0.0019,0.001,0.0068,0.0097,0.0,0.0087,0.0495,0.0034,0.0,0.0209,0.0034,0.0044,0.0024,0.001,0.0277,0.0238,0.0073,0.0029,0.0073,0.0015,0.0053,0.0175,0.0024,0.0835,0.0612,0.0345,0.0238,0.0136,0.0044,0.0214,0.0019,0.3934,0.0019
acad_21,0.0974,0.0,0.0862,0.0078,0.0011,0.0,0.0101,0.0034,0.0,0.0034,0.084,0.0034,0.0011,0.0157,0.0022,0.0078,0.0022,0.0022,0.0426,0.0056,0.0078,0.0101,0.0179,0.0056,0.0078,0.0134,0.0045,0.0761,0.0459,0.0302,0.019,0.0213,0.0,0.0235,0.0,0.3371,0.0034
acad_35,0.0901,0.0088,0.095,0.0061,0.0,0.0,0.013,0.0042,0.0,0.0076,0.0343,0.0085,0.0003,0.0076,0.0331,0.0039,0.0,0.0009,0.0382,0.0042,0.0158,0.0112,0.0516,0.0221,0.0152,0.0234,0.0052,0.0394,0.0164,0.0264,0.0103,0.0228,0.0049,0.037,0.0015,0.3386,0.0024
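
The **prop** values in this output are consistent with dividing each count by its row total, so that every row sums to 1. The check below uses the raw DocuScope counts for `acad_23` copied from the earlier table; it is a worked verification against the printed output, not the package's code.

```python
# DocuScope tag counts for acad_23, copied from the raw-count table.
acad_23 = [67, 10, 42, 0, 3, 0, 14, 2, 0, 9, 61, 1, 14, 25, 24, 17, 0, 9, 59,
           0, 37, 23, 72, 2, 15, 30, 3, 100, 18, 22, 9, 24, 5, 25, 1, 424, 6]

row_total = sum(acad_23)
proportions = [count / row_total for count in acad_23]

print(row_total)
print(round(proportions[0], 4))   # AcademicTerms
print(round(proportions[-2], 4))  # Untagged
```

The two rounded proportions match the 0.0571 and 0.3615 shown for `acad_23`.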


In [53]:
tfidf_tm = normalize_dtm(tm, scheme='tfidf')
tfidf_tm.head(10).style.hide(axis='index').format(precision=4)

doc_id,AcademicTerms,AcademicWritingMoves,Character,Citation,CitationAuthority,CitationHedged,ConfidenceHedged,ConfidenceHigh,ConfidenceLow,Contingent,Description,Facilitate,FirstPerson,ForceStressed,Future,InformationChange,InformationChangeNegative,InformationChangePositive,InformationExposition,InformationPlace,InformationReportVerbs,InformationStates,InformationTopics,Inquiry,Interactive,MetadiscourseCohesive,MetadiscourseInteractive,Narrative,Negative,Positive,PublicTerms,Reasoning,Responsibility,Strategic,Uncertainty,Untagged,Updates
acad_23,0.039,0.0063,0.0245,0.0,0.002,0.0,0.0082,0.0012,0.0,0.0054,0.0355,0.0006,0.0108,0.0146,0.0144,0.0099,0.0,0.0056,0.0344,0.0,0.0216,0.0138,0.0425,0.0012,0.0089,0.0175,0.0018,0.0583,0.0105,0.0128,0.0052,0.014,0.0035,0.0146,0.0008,0.247,0.0036
acad_37,0.0302,0.0,0.0544,0.0077,0.0,0.0,0.006,0.0433,0.0061,0.0041,0.0212,0.0052,0.008,0.0222,0.0,0.003,0.0,0.0011,0.0454,0.0,0.005,0.0104,0.0,0.006,0.0051,0.0161,0.002,0.0423,0.0312,0.0101,0.004,0.0282,0.0012,0.0091,0.0122,0.2459,0.0062
acad_36,0.0678,0.0013,0.0658,0.0061,0.0033,0.0008,0.0108,0.008,0.0016,0.0094,0.0175,0.0004,0.0,0.01,0.0049,0.0024,0.0,0.0,0.0291,0.0,0.0048,0.0185,0.0142,0.0064,0.0186,0.0171,0.0077,0.0331,0.0243,0.0223,0.004,0.0183,0.0033,0.0064,0.0043,0.2448,0.0016
acad_22,0.0483,0.0006,0.1123,0.0041,0.001,0.0,0.0058,0.0041,0.0006,0.006,0.0428,0.0024,0.0012,0.0108,0.0018,0.0079,0.0,0.0012,0.0224,0.017,0.0058,0.0033,0.0189,0.0087,0.008,0.0122,0.0024,0.0454,0.0131,0.0145,0.0076,0.0084,0.0007,0.0073,0.0016,0.2397,0.0006
acad_34,0.0863,0.0063,0.0483,0.0048,0.006,0.0014,0.0059,0.008,0.0,0.0105,0.0161,0.0203,0.0,0.0183,0.0068,0.0088,0.0,0.0016,0.0263,0.0,0.0066,0.009,0.0215,0.0037,0.0045,0.0219,0.0015,0.0351,0.0198,0.0219,0.008,0.019,0.0009,0.0146,0.002,0.2187,0.0038
acad_20,0.0399,0.0008,0.0791,0.0091,0.0009,0.0,0.0008,0.0015,0.0,0.0016,0.0294,0.0023,0.0,0.0143,0.0008,0.0075,0.0012,0.0032,0.0196,0.0123,0.0053,0.0,0.0168,0.0008,0.0015,0.0128,0.0023,0.0701,0.0573,0.0105,0.0362,0.0098,0.0009,0.0083,0.0,0.2298,0.0008
acad_08,0.0364,0.0,0.0611,0.0032,0.0025,0.0,0.0051,0.0058,0.0,0.003,0.0196,0.0052,0.0,0.0247,0.0097,0.0146,0.0,0.0023,0.0233,0.0008,0.0131,0.024,0.0133,0.0022,0.003,0.0167,0.0007,0.0357,0.0247,0.0422,0.0291,0.0175,0.0,0.0065,0.0,0.2387,0.0007
acad_09,0.0179,0.0004,0.0793,0.0065,0.0015,0.0013,0.0046,0.0066,0.0,0.0061,0.0338,0.0024,0.0,0.0143,0.0024,0.003,0.0025,0.0007,0.0189,0.019,0.005,0.002,0.005,0.001,0.0037,0.0119,0.0017,0.0571,0.0418,0.0236,0.0163,0.0093,0.0035,0.0146,0.0018,0.2688,0.0014
acad_21,0.0666,0.0,0.0589,0.0059,0.0009,0.0,0.0069,0.0023,0.0,0.0024,0.0574,0.0024,0.001,0.0107,0.0016,0.0054,0.0023,0.0016,0.0291,0.0045,0.0054,0.0071,0.0124,0.0038,0.0054,0.0092,0.0031,0.052,0.0314,0.0207,0.013,0.0145,0.0,0.0161,0.0,0.2303,0.0024
acad_35,0.0616,0.0065,0.0649,0.0045,0.0,0.0,0.0089,0.0029,0.0,0.0053,0.0234,0.006,0.0003,0.0052,0.0233,0.0027,0.0,0.0007,0.0261,0.0034,0.0108,0.0079,0.0358,0.0151,0.0105,0.016,0.0036,0.027,0.0112,0.018,0.007,0.0155,0.0039,0.0253,0.0014,0.2314,0.0017


A **dtm** can also be passed to **tmtoolkit** functions to create normalized counts (using the `tf_proportions` function), [tf-idf values](https://tmtoolkit.readthedocs.io/en/latest/bow.html#Term-frequency%E2%80%93inverse-document-frequency-transformation-(tf-idf)) (using the `tfidf` function), or other kinds of data structures.

In [54]:
from tmtoolkit.bow.bow_stats import tf_proportions, tfidf
from tmtoolkit.bow.dtm import dtm_to_dataframe

Beginning with version 0.12.0 of **tmtoolkit**, matrices must first be converted into a COOrdinate format. This can be done using the `dtm_to_coo` function.

In [55]:
tags_coo, docs, vocab = dtm_to_coo(tm)

In [56]:
tags_coo

<50x37 sparse matrix of type '<class 'numpy.float64'>'
	with 1666 stored elements in COOrdinate format>
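
The idea behind the coordinate format is that only non-zero cells are stored, each with its row and column position. In the output above, 1666 of the 50 x 37 = 1850 cells are stored. A minimal pure-Python sketch of the idea follows; the helper `dense_to_coo` and the small matrix are hypothetical, and a real sparse matrix keeps rows, columns and values in parallel arrays rather than a list of triples.

```python
def dense_to_coo(rows):
    """Return (row_index, col_index, value) triples for the non-zero cells only."""
    return [(i, j, value)
            for i, row in enumerate(rows)
            for j, value in enumerate(row)
            if value != 0]

dense = [
    [67, 10, 0],
    [0, 0, 7],
]
triples = dense_to_coo(dense)
print(triples)
print(len(triples), "of", sum(len(row) for row in dense), "cells stored")
```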

These can now be processed using various **tmtoolkit** functions.

In [57]:
dtm_to_dataframe(tags_coo, docs, vocab).head()

doc_id,AcademicTerms,AcademicWritingMoves,Character,Citation,CitationAuthority,CitationHedged,ConfidenceHedged,ConfidenceHigh,ConfidenceLow,Contingent,...,Narrative,Negative,Positive,PublicTerms,Reasoning,Responsibility,Strategic,Uncertainty,Untagged,Updates
acad_23,67.0,10.0,42.0,0.0,3.0,0.0,14.0,2.0,0.0,9.0,...,100.0,18.0,22.0,9.0,24.0,5.0,25.0,1.0,424.0,6.0
acad_37,30.0,0.0,54.0,7.0,0.0,0.0,6.0,43.0,3.0,4.0,...,42.0,31.0,10.0,4.0,28.0,1.0,9.0,9.0,244.0,6.0
acad_36,170.0,3.0,165.0,14.0,7.0,1.0,27.0,20.0,2.0,23.0,...,83.0,61.0,56.0,10.0,46.0,7.0,16.0,8.0,614.0,4.0
acad_22,166.0,2.0,386.0,13.0,3.0,0.0,20.0,14.0,1.0,20.0,...,156.0,45.0,50.0,26.0,29.0,2.0,25.0,4.0,824.0,2.0
acad_34,118.0,8.0,66.0,6.0,7.0,1.0,8.0,11.0,0.0,14.0,...,48.0,27.0,30.0,11.0,26.0,1.0,20.0,2.0,299.0,5.0


In [58]:
tfidf_coo = tfidf(tags_coo)
dtm_to_dataframe(tfidf_coo, docs, vocab).head()

doc_id,AcademicTerms,AcademicWritingMoves,Character,Citation,CitationAuthority,CitationHedged,ConfidenceHedged,ConfidenceHigh,ConfidenceLow,Contingent,...,Narrative,Negative,Positive,PublicTerms,Reasoning,Responsibility,Strategic,Uncertainty,Untagged,Updates
acad_23,0.039029,0.006272,0.024466,0.0,0.002039,0.0,0.008155,0.001165,0.0,0.005396,...,0.058252,0.010485,0.012815,0.005243,0.01398,0.003457,0.014563,0.000786,0.246988,0.003597
acad_37,0.030234,0.0,0.054422,0.007715,0.0,0.0,0.006047,0.043336,0.006068,0.004149,...,0.042328,0.031242,0.010078,0.004031,0.028219,0.001196,0.00907,0.012243,0.245906,0.006224
acad_36,0.067771,0.001288,0.065778,0.006103,0.003256,0.000752,0.010764,0.007973,0.0016,0.009438,...,0.033088,0.024318,0.022325,0.003987,0.018338,0.003312,0.006378,0.004305,0.244774,0.001641
acad_22,0.048287,0.000626,0.112283,0.004135,0.001018,0.0,0.005818,0.004072,0.000584,0.005988,...,0.045378,0.01309,0.014544,0.007563,0.008436,0.00069,0.007272,0.001571,0.239691,0.000599
acad_34,0.086326,0.006302,0.048284,0.0048,0.005975,0.001381,0.005853,0.008047,0.0,0.010542,...,0.035116,0.019753,0.021947,0.008047,0.019021,0.000868,0.014632,0.001975,0.218742,0.003765
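
As a rough illustration of what these transformations do, the sketch below computes row-normalized counts and a smoothed tf-idf weighting in plain Python. The counts, the helper names (`tf_proportions_sketch`, `idf_sketch`) and the smoothing constants are assumptions made for this example; consult the **tmtoolkit** documentation for the exact formula its `tfidf` function applies.

```python
import math

# Hypothetical raw counts: 3 documents x 3 tags.
docs = ["doc_a", "doc_b", "doc_c"]
vocab = ["AcademicTerms", "Citation", "Narrative"]
counts = [
    [6, 0, 4],
    [3, 2, 5],
    [8, 0, 2],
]

def tf_proportions_sketch(matrix):
    """Divide each count by its document (row) total."""
    return [[c / sum(row) for c in row] for row in matrix]

def idf_sketch(matrix, smooth_log=1, smooth_df=1):
    """Smoothed inverse document frequency for each column:
    log(smooth_log + N / (smooth_df + df)).
    The smoothing constants are assumptions of this sketch."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    weights = []
    for j in range(n_terms):
        df = sum(1 for row in matrix if row[j] > 0)  # documents containing the tag
        weights.append(math.log(smooth_log + n_docs / (smooth_df + df)))
    return weights

tf = tf_proportions_sketch(counts)
idf = idf_sketch(counts)
tfidf_values = [[t * w for t, w in zip(row, idf)] for row in tf]

for doc, row in zip(docs, tfidf_values):
    print(doc, [round(v, 4) for v in row])
```

Under this weighting, a tag that occurs in fewer documents (here the hypothetical `Citation` column) receives a larger idf weight than one that occurs in every document.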


## KWIC tables

There is also a function for generating Key Word in Context (KWIC) tables. For display purposes, the `kwic_center_node` function trims the context columns to a maximum of 75 characters.

The function requires a **corpus** of the type generated by the `Corpus.from_dictionary` function. A node word needs to be set and there is the option to ignore the case of the node word.

<div class="alert alert-info">

**Note: Other KWIC options**

The **tmtoolkit** package has [its own KWIC functions](https://tmtoolkit.readthedocs.io/en/latest/preprocessing.html#Keywords-in-context-(KWIC)-and-general-filtering-methods). The only difference is that this function produces a table with the node word in a center column and context columns to the left and right. The **tmtoolkit** functions produce tables with a single column that includes the node word.
  
</div>

In [59]:
kcn = kwic_center_node(tp, 'data', ignore_case=True, search_type='fixed')

In [60]:
kcn.head(10).style.set_properties(subset=["Post-Node"], **{'text-align': 'left'}).set_properties(subset=["Node"], **{'text-align': 'center'})

Unnamed: 0,Doc ID,Pre-Node,Node,Post-Node
0,acad_23,proposed systems. This analysis will include,data,"collection, current system modeling and"
1,acad_23,To achieve this goal we will collect,data,on current procedures and use that
2,acad_23,data on current procedures and use that,data,to standardize the assembly and packaging
3,acad_23,will create simulations based on real world,data,. This data will include employee
4,acad_23,based on real world data. This,data,"will include employee recommendations, standardized"
5,acad_23,studies and compare the results to the,data,gathered by the workers. We
6,acad_23,. We will statistically analyze the two,data,sets to identify significant. If
7,acad_23,processes. For this portion of the,data,collection we will need access to
8,acad_23,production need not be stopped for the,data,gathering. To create an accurate
9,acad_23,. With the gathered process and layout,data,we will construct a computer simulation


There is also an option to match tokens that contain a given character sequence at their beginning or end by changing the `search_type` argument:

In [61]:
kwc = kwic_center_node(tp, 'tion', ignore_case=True, search_type='ends_with')

In [62]:
kwc.head(10).style.set_properties(subset=["Post-Node"], **{'text-align': 'left'}).set_properties(subset=["Node"], **{'text-align': 'center'})

Unnamed: 0,Doc ID,Pre-Node,Node,Post-Node
0,acad_23,"and other new markets, while the",reorganization,would allow E-Dining to
1,acad_23,systems. This analysis will include data,collection,", current system modeling and simulation"
2,acad_23,"data collection, current system modeling and",simulation,", facility layout reconstruction, and"
3,acad_23,"system modeling and simulation, facility layout",reconstruction,", and final system modeling and"
4,acad_23,"reconstruction, and final system modeling and",simulation,. One of the final models
5,acad_23,following is a summary of the current,situation,of your company it s environment
6,acad_23,order forms that contain shipping and kit,information,". Then, for each order"
7,acad_23,scattered in the hallway without an overall,organization,strategy. This procedure of processing
8,acad_23,was designed to produce only a small,fraction,of the current demand. This
9,acad_23,boxes. This causes problems with both,congestion,of foot traffic and lack of


## Keyword tables

[Keywords](https://eprints.lancs.ac.uk/id/eprint/140803/1/Rayson_2019_CorpusAnalysisofKeyWords_Submitted.pdf) are a common method for profiling corpora by statistically comparing token frequencies in one corpus (a target corpus) to those in another (a reference corpus).

To generate a keyword list, we first need to process our reference corpus, in this case a small corpus of news articles.

<div class="alert alert-warning">
    
**Warning: Preparing frequency tables**

Be sure to process target and reference corpora in precisely the same way prior to comparison.

</div>

In [63]:
%%time
corp_ref = Corpus.from_folder('data/ref_corpus', spacy_instance=nlp, raw_preproc=pre_process, spacy_token_attrs=['tag', 'ent_iob', 'ent_type', 'is_punct'])

CPU times: user 1.69 s, sys: 208 ms, total: 1.9 s
Wall time: 1.93 s


We will also store various counts of tokens:

In [64]:
ref_total = corpus_num_tokens(corp_ref)
ref_types = vocabulary_size(corp_ref)
ref_punct = []
for i in range(0,len(corp_ref)):
    ref_punct.append(sum(corp_ref[i]['is_punct']))
ref_punct = sum(ref_punct)
ref_nonpunct = ref_total - ref_punct

In [65]:
print('Alphanumeric tokens (Reference corpus):', ref_nonpunct, '\nPunctuation tokens (Reference corpus):', ref_punct, '\nTotal tokens (Reference corpus):', ref_total, '\nTypes (Reference corpus):', ref_types)

Alphanumeric tokens (Reference corpus): 31950 
Punctuation tokens (Reference corpus): 4742 
Total tokens (Reference corpus): 36692 
Types (Reference corpus): 6364


As before, we will use the `convert_corpus` function to prepare our data for further analysis:

In [66]:
tp_ref = convert_corpus(corp_ref)

Finally, we will use `frequency_table` to generate two tables, one for each corpus, both normalized by the total count of non-punctuation tokens:

In [67]:
wc_target = frequency_table(tp, non_punct)
wc_ref = frequency_table(tp_ref, ref_nonpunct)
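
The kind of normalization involved can be sketched with the standard library. The example assumes that relative frequency (RF) is scaled per million tokens and that Range is the percentage of documents containing a token; the documents and the helper `profile` are hypothetical.

```python
from collections import Counter

# Hypothetical tokenized documents (punctuation already removed).
docs = {
    "doc_a": ["the", "data", "show", "the", "trend"],
    "doc_b": ["the", "study", "uses", "data"],
    "doc_c": ["a", "new", "study"],
}

total_tokens = sum(len(toks) for toks in docs.values())

# Absolute frequency counts every occurrence;
# document frequency counts each token once per document.
abs_freq = Counter(tok for toks in docs.values() for tok in toks)
doc_freq = Counter(tok for toks in docs.values() for tok in set(toks))

def profile(token, per=1_000_000):
    """Return absolute frequency, relative frequency and range (%)."""
    af = abs_freq[token]
    rf = af / total_tokens * per
    rng = doc_freq[token] / len(docs) * 100
    return af, rf, rng

af, rf, rng = profile("the")
print(af, round(rf, 2), round(rng, 2))
```

Because both corpora are normalized by their own totals, relative frequencies remain comparable even when the corpora differ greatly in size.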

To generate a table of keywords, we will use `keyness_table`, which takes both our target and reference frequency tables. The Yates correction can be applied by setting the `correct` argument to `True`. Here we will leave the default, which applies no correction.

In [68]:
kw = keyness_table(wc_target, wc_ref)

The table returns the frequency data for both corpora, with columns for [log-likelihood](https://ucrel.lancs.ac.uk/llwizard.html) (the test of significance), [Log Ratio](http://cass.lancs.ac.uk/log-ratio-an-informal-introduction/) (an effect size measure), and the *p*-value.
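
For readers who want to see how these statistics relate to the raw counts, here is a minimal sketch of the commonly used two-cell log-likelihood and of Log Ratio. It is not taken from the package source: the function names, the 0.5 substitution for zero counts, and the example counts are all assumptions, so its results may differ from those of `keyness_table`.

```python
import math

def log_likelihood(a, b, n1, n2):
    """Two-cell log-likelihood for a token with frequency `a` in a target
    corpus of `n1` tokens and `b` in a reference corpus of `n2` tokens."""
    e1 = n1 * (a + b) / (n1 + n2)  # expected target frequency
    e2 = n2 * (a + b) / (n1 + n2)  # expected reference frequency
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def log_ratio(a, b, n1, n2, zero_sub=0.5):
    """Binary log of the ratio of relative frequencies. A zero count is
    replaced by `zero_sub` (an assumption of this sketch) to avoid log(0)."""
    a = a if a > 0 else zero_sub
    b = b if b > 0 else zero_sub
    return math.log2((a / n1) / (b / n2))

def p_value(ll):
    """Upper-tail probability of a chi-square distribution with 1 df."""
    return math.erfc(math.sqrt(max(ll, 0.0) / 2))

# Hypothetical counts: 120 hits in 100,000 target tokens
# versus 30 hits in 100,000 reference tokens.
ll = log_likelihood(120, 30, 100_000, 100_000)
lr = log_ratio(120, 30, 100_000, 100_000)
print(round(ll, 2), round(lr, 2), p_value(ll))
```

Each unit of Log Ratio corresponds to a doubling, so a value of 2 means the token is about four times as frequent in the target corpus as in the reference corpus.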

In [69]:
kw.head(10).style.hide(axis='index').format(precision=2)

Token,Tag,LL,LR,PV,AF,RF,Range,AF Ref,RF Ref,Range Ref
of,IO,214.94,0.8,0.0,5063.0,37668.33,100.0,692.0,21658.84,96.0
the,AT,92.02,0.35,0.0,9601.0,71430.7,100.0,1797.0,56244.13,100.0
et al,RA,85.8,6.58,0.0,201.0,1495.42,12.0,0.0,0.0,0.0
is,VBZ,83.31,0.85,0.0,1784.0,13272.82,98.0,236.0,7386.54,98.0
faculty,NN1,70.24,5.47,0.0,186.0,1383.83,4.0,1.0,31.3,2.0
these,DD2,67.0,2.23,0.0,356.0,2648.61,96.0,18.0,563.38,32.0
this,DD1,66.46,1.04,0.0,1020.0,7588.72,100.0,118.0,3693.27,84.0
students,NN2,48.93,4.15,0.0,149.0,1108.55,20.0,2.0,62.6,4.0
education,NN1,48.7,4.99,0.0,134.0,996.95,14.0,1.0,31.3,2.0
study,NN1,45.78,3.29,0.0,165.0,1227.59,46.0,4.0,125.2,2.0


The table can be filtered by various criteria, like absolute frequency and *p*-value thresholds, and then sorted (here, by Log Ratio):

In [70]:
kw.query('AF > 5 and `AF Ref` > 5 and PV < 0.01').sort_values('LR', ascending=False).head(10).style.hide(axis='index').format(precision=2)

Token,Tag,LL,LR,PV,AF,RF,Range,AF Ref,RF Ref,Range Ref
these,DD2,67.0,2.23,0.0,356.0,2648.61,96.0,18.0,563.38,32.0
different,JJ,21.46,2.2,0.0,116.0,863.03,66.0,6.0,187.79,10.0
within,II,21.14,2.19,0.0,115.0,855.59,48.0,6.0,187.79,12.0
such as,II,20.16,2.15,0.0,112.0,833.27,68.0,6.0,187.79,10.0
between,II,41.55,2.11,0.0,236.0,1755.82,84.0,13.0,406.89,18.0
more,RGR,41.53,2.0,0.0,252.0,1874.86,80.0,15.0,469.48,22.0
system,NN1,17.96,1.94,0.0,113.0,840.71,46.0,7.0,219.09,8.0
social,JJ,17.66,1.93,0.0,112.0,833.27,44.0,7.0,219.09,6.0
process,NN1,15.7,1.75,0.0,113.0,840.71,52.0,8.0,250.39,12.0
however,RR,24.47,1.72,0.0,180.0,1339.19,80.0,13.0,406.89,18.0


Tables can similarly be filtered for part-of-speech tag:

In [71]:
kw.query('AF > 5 and `AF Ref` > 5 and PV < 0.01 and Tag.str.startswith("V")').head(10).style.hide(axis='index').format(precision=2)

Token,Tag,LL,LR,PV,AF,RF,Range,AF Ref,RF Ref,Range Ref
is,VBZ,83.31,0.85,0.0,1784.0,13272.82,98.0,236.0,7386.54,98.0
are,VBR,20.33,0.61,0.0,763.0,5676.66,96.0,119.0,3724.57,68.0
may,VM,18.41,1.26,0.0,211.0,1569.82,72.0,21.0,657.28,18.0
used,VVN,16.97,1.59,0.0,139.0,1034.15,56.0,11.0,344.29,14.0
can,VM,16.7,0.77,0.0,422.0,3139.65,94.0,59.0,1846.64,54.0
does,VDZ,13.77,1.22,0.0,166.0,1235.03,78.0,17.0,532.08,26.0
using,VVG,12.6,1.53,0.0,109.0,810.95,64.0,9.0,281.69,18.0
based,VVN,11.22,1.61,0.0,90.0,669.59,62.0,7.0,219.09,12.0
be,VBI,9.47,0.35,0.0,960.0,7142.33,98.0,179.0,5602.5,90.0
will,VM,8.89,0.48,0.0,510.0,3794.36,82.0,87.0,2723.0,62.0


Keyness tables can also be generated for counts of either part-of-speech or DocuScope tags. First, we prepare the frequency tables.

In [72]:
tag_ref = tags_table(tp_ref, ref_nonpunct, count_by='pos')
tag_tar = tags_table(tp, non_punct, count_by='pos')
ds_ref = tags_table(tp_ref, ref_total, count_by='ds')
ds_tar = tags_table(tp, corpus_total, count_by='ds')

We will set the `tags_only` argument to `True`, and we will also employ the Yates correction by setting `correct` to `True`:

In [73]:
kt = keyness_table(tag_tar, tag_ref, tags_only=True, correct=True)

In [74]:
kt.head(10).style.hide(axis='index').format(precision=2)

Tag,LL,LR,PV,AF,RF,Range,AF Ref,RF Ref,Range Ref
JJ,252.22,0.55,0.0,11175.0,8.31,100.0,1829.0,5.72,100.0
IO,221.91,0.81,0.0,5063.0,3.77,100.0,692.0,2.17,96.0
NN2,114.69,0.4,0.0,9185.0,6.83,100.0,1673.0,5.24,100.0
NN1,105.46,0.23,0.0,23989.0,17.85,100.0,4917.0,15.39,100.0
AT,94.59,0.35,0.0,9714.0,7.23,100.0,1832.0,5.73,100.0
RR,87.81,0.53,0.0,4164.0,3.1,100.0,691.0,2.16,98.0
ZZ1,58.93,1.87,0.0,395.0,0.29,58.0,26.0,0.08,32.0
DD1,57.98,0.75,0.0,1521.0,1.13,100.0,217.0,0.68,94.0
VVZ,57.3,0.69,0.0,1755.0,1.31,100.0,262.0,0.82,94.0
RGR,50.01,2.07,0.0,297.0,0.22,84.0,17.0,0.05,26.0


We can do the same for the DocuScope frequency tables:

In [75]:
kds = keyness_table(ds_tar, ds_ref, tags_only=True)

In [76]:
kds.sort_values('LR', ascending=False).head(10).style.hide(axis='index').format(precision=2)

Tag,LL,LR,PV,AF,RF,Range,AF Ref,RF Ref,Range Ref
Citation Hedged,7.46,2.86,0.01,30,0.02,36.0,1,0.0,2.0
Academic Writing Moves,60.41,1.63,0.0,471,0.31,90.0,37,0.1,46.0
Information Change,120.8,1.3,0.0,1298,0.85,100.0,128,0.35,76.0
Responsibility,19.04,1.25,0.0,216,0.14,78.0,22,0.06,26.0
Academic Terms,780.92,1.22,0.0,9223,6.02,100.0,961,2.62,98.0
Confidence Low,2.03,1.13,0.15,27,0.02,32.0,3,0.01,6.0
Metadiscourse Interactive,35.41,1.08,0.0,505,0.33,98.0,58,0.16,62.0
Inquiry,54.31,1.03,0.0,839,0.55,100.0,100,0.27,70.0
Reasoning,145.93,1.0,0.0,2337,1.53,100.0,283,0.77,90.0
Confidence Hedged,82.2,0.99,0.0,1337,0.87,100.0,163,0.44,88.0


## Single document tag highlighting

Tags (either part-of-speech or DocuScope) can be highlighted in single documents. In order to facilitate the highlighting of tags, the `tag_ruler` function generates a data frame with the complete document text and the spans of tagged tokens. From that data frame, the original document text can be easily recovered, and any tags of interest can be filtered for highlighting.

To render the highlights, an additional package is needed. For this demonstration, we will use [ipymarkup](https://nbviewer.org/github/natasha/ipymarkup/blob/master/docs.ipynb), which is simple and flexible.

In [77]:
from ipymarkup import show_span_box_markup

When calling the `tag_ruler` function, a document needs to be specified by its id, which is stored as a key in the object returned by `convert_corpus`. Keys can be recovered from our `tp` object using `tp.keys()`, for example.

In [78]:
df_pos = tag_ruler(tp, key='acad_17', count_by='pos')

The data frame contains all tokens, tags and start/end of spans:

In [79]:
df_pos.head(20)

Unnamed: 0,Token,Tag,tag_start,tag_end
0,In,II,0,2
1,the,AT,3,6
2,societal,JJ,7,15
3,realm,NN1,16,21
4,in,II,22,24
5,which,DDQ,25,30
6,Middlemarch,NP1,31,42
7,resides,VVZ,43,50
8,",",Y,50,51
9,the,AT,52,55


The output can easily be filtered, as it is here for part-of-speech tags starting with 'N' (nouns):

In [80]:
df_n = df_pos[df_pos.Tag.str.startswith('N')]
df_n.head(10)

Unnamed: 0,Token,Tag,tag_start,tag_end
3,realm,NN1,16,21
6,Middlemarch,NP1,31,42
10,demarcation,NN1,56,67
12,women,NN2,76,81
14,men,NN2,86,89
19,Notions,NN2,111,118
21,male,NN1,122,126
24,character,NN1,138,147
31,perspective,NN1,176,187
42,reading,NN1,229,236


First, we will reconstruct the document text from the **full** data frame.

In [81]:
text = ''.join(df_pos['Token'].tolist())

Next, we will construct a list of tuples from the **filtered** data frame, using the `tag_start`, `tag_end` and `Tag` columns:

In [82]:
spans = list(zip(list(df_n['tag_start']), list(df_n['tag_end']), list(df_n['Tag'])))

Finally, we can use `show_span_box_markup` to highlight the tags:

In [83]:
show_span_box_markup(text, spans)
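
The relationship between the joined text and the span offsets can be checked without any rendering package. The sketch below rebuilds the first ten rows of the part-of-speech table by hand, assuming (as the join above implies) that each token carries its own trailing whitespace; the variable names are illustrative only.

```python
# Tokens keep their trailing whitespace so that joining them restores the text.
tagged = [
    ("In ", "II"), ("the ", "AT"), ("societal ", "JJ"), ("realm ", "NN1"),
    ("in ", "II"), ("which ", "DDQ"), ("Middlemarch ", "NP1"),
    ("resides", "VVZ"), (", ", "Y"), ("the ", "AT"),
]

rows = []
cursor = 0
for token, tag in tagged:
    start = cursor
    end = start + len(token.rstrip())  # span excludes trailing whitespace
    rows.append((token, tag, start, end))
    cursor += len(token)  # the next token starts after the whitespace

# Full text comes from all tokens; spans come from the filtered tags only.
text = "".join(token for token, _ in tagged)
spans = [(start, end, tag) for _, tag, start, end in rows if tag.startswith("N")]

for start, end, tag in spans:
    print(tag, repr(text[start:end]))
```

Slicing the reconstructed text with each span returns exactly the tagged token, which is the property the highlighting step relies on.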

The same thing can be done for DocuScope tags by switching `count_by` to 'ds':

In [84]:
df_ds = tag_ruler(tp, key='acad_37', count_by='ds')
df_ds.head(20)

Unnamed: 0,Token,Tag,tag_start,tag_end
0,Often,Narrative,0,5
1,referred to as,InformationReportVerbs,6,20
2,the,Untagged,21,24
3,"""",Untagged,25,26
4,Cartesian Circle,Description,26,42
5,"""",Untagged,42,43
6,",",Untagged,43,44
7,Descartes,Untagged,45,54
8,presents a,Interactive,55,65
9,very,ConfidenceHigh,66,70


This time, we'll filter for tags related to expressions of confidence:

In [85]:
df_c = df_ds[df_ds.Tag.str.startswith('Conf')]
df_c.head(10)

Unnamed: 0,Token,Tag,tag_start,tag_end
9,very,ConfidenceHigh,66,70
33,proves the,ConfidenceHigh,246,256
54,clearly,ConfidenceHigh,371,378
56,distinctly,ConfidenceHigh,383,393
83,clearly,ConfidenceHigh,563,570
85,distinctly,ConfidenceHigh,575,585
87,is true,ConfidenceHigh,596,603
105,are true,ConfidenceHigh,729,737
113,clearly,ConfidenceHigh,789,796
115,distinctly,ConfidenceHigh,801,811


Again, the text is reconstructed from the full data frame, and the spans are taken from the filtered one:

In [86]:
text = ''.join(df_ds['Token'].tolist())
spans = list(zip(list(df_c['tag_start']), list(df_c['tag_end']), list(df_c['Tag'])))
show_span_box_markup(text, spans)