# Corpus analysis

<div class="alert alert-success">

**Update: Changes to v > 0.3.0**

Some major changes have been made with the newest version of the **docuscospacy** package. Most don't affect the syntax of the basic functions. However, the package runs all processing in [polars](https://docs.pola.rs/api/python/stable/reference/index.html) for vastly increased speed. After processing, you can easily convert a polars DataFrame [to pandas](https://docs.pola.rs/api/python/stable/reference/dataframe/api/polars.DataFrame.to_pandas.html), if that is your preference for filtering and sorting.

The package now also includes convenience functions like `corpus_from_folder` and `docuscope_parse`, which simplify the processing pipeline and reduce the number of dependencies.

Finally, though the syntax of the functions is largely unchanged from earlier versions, none of them require the passing of total counts anymore. All normalization takes place inside the functions for greater consistency.

</div>

The docuscospacy package supports the generation of:

* Token frequency tables
* Ngram tables
* Collocation tables around a node word
* Keyword comparisons against a reference corpus

Most importantly, **outputs can be controlled either by part-of-speech or by DocuScope tag**. Thus, *can* as noun and *can* as verb, for example, can be disambiguated.

Additionally, tagged multi-token sequences are aggregated for analysis. So, for example, where *in spite of* is tagged as a token sequence, it is combined into a single token.
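
The aggregation step can be pictured with a minimal sketch in plain Python. It assumes that every token row carries an id shared by all tokens in one tagged sequence (the `ds_id` column in the parsed table shown later appears to play this role, but that reading is an inference, not documented behavior). The rows and the tag assigned to *in spite of* are invented for illustration.

```python
from itertools import groupby

# Toy rows: (doc_id, token, ds_tag, ds_id). Tokens keep their trailing space.
# The tag given to "in spite of" is made up for this example.
rows = [
    ("d1", "In ", "MetadiscourseCohesive", 1),
    ("d1", "spite ", "MetadiscourseCohesive", 1),
    ("d1", "of ", "MetadiscourseCohesive", 1),
    ("d1", "the ", "Untagged", 2),
    ("d1", "rain", "Description", 3),
]

units = []
# Consecutive rows with the same (doc_id, ds_id) are treated as one unit.
for (doc_id, ds_id), group in groupby(rows, key=lambda r: (r[0], r[3])):
    group = list(group)
    units.append({
        "doc_id": doc_id,
        "token": "".join(r[1] for r in group).strip().lower(),
        "ds_tag": group[0][2],
    })

for unit in units:
    print(unit)
```

Counting the resulting units, rather than the individual rows, is what lets a phrase such as *in spite of* appear as one entry in a frequency table.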

<div class="alert alert-info">

**Note: About tmtoolkit**

The package no longer requires [tmtoolkit](https://tmtoolkit.readthedocs.io/en/latest/). However, there are functions to convert a tmtoolkit corpus to a docuscospacy DataFrame (`from_tmtoolkit`) and to convert a document-term matrix to a COOrdinate format matrix (`dtm_to_coo`), which can then be analyzed inside tmtoolkit.

</div>

In [1]:
import spacy
import docuscospacy as ds
import polars as pl

## Processing a corpus

Before we generate any counts or tables, we need to load a corpus and tokenize it. Be sure you have downloaded the `en_docusco_spacy` model from [the huggingface model repository](https://huggingface.co/browndw/en_docusco_spacy).

To install the model into your environment, use either:

`pip install https://huggingface.co/browndw/en_docusco_spacy/resolve/main/en_docusco_spacy-any-py3-none-any.whl`

Or for some newer spaCy versions:

`pip install "en_docusco_spacy @ https://huggingface.co/browndw/en_docusco_spacy/resolve/main/en_docusco_spacy-any-py3-none-any.whl"`


### Load an instance

In [None]:
%%capture
pip install "en_docusco_spacy @ https://huggingface.co/browndw/en_docusco_spacy/resolve/main/en_docusco_spacy-any-py3-none-any.whl"

In [None]:
nlp = spacy.load("en_docusco_spacy")

### Load a corpus from a directory

One easy way to prepare a corpus for processing is to use the `corpus_from_folder` function, which reads plain text (TXT) files from a directory into a polars DataFrame with 'doc_id' and 'text' columns.

The function **does not** recursively search through subdirectories. For greater control, you can use the `get_text_paths` function, which has a recursive option, and then pass the returned list of file paths to `readtext`. This approach can also be useful if, for example, you have many files and want to test a pipeline with a subsample. In such a case, the list of paths can simply be down-sampled and the resulting subset read in using `readtext`.
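
The down-sampling workflow can be sketched with the standard library alone. The helpers `list_text_paths` and `read_texts` below are hypothetical stand-ins written for this example; they are not the package's `get_text_paths` and `readtext`, whose exact signatures are not shown here. The sketch builds a throwaway folder, collects paths with and without recursion, and reads a random subset into 'doc_id' and 'text' records.

```python
import random
import tempfile
from pathlib import Path


def list_text_paths(folder, recursive=False):
    """Return sorted paths to .txt files, optionally searching subfolders."""
    pattern = "**/*.txt" if recursive else "*.txt"
    return sorted(Path(folder).glob(pattern))


def read_texts(paths):
    """Read each file into a record with 'doc_id' and 'text' keys."""
    return [
        {"doc_id": p.name, "text": p.read_text(encoding="utf-8")}
        for p in paths
    ]


# Build a tiny throwaway corpus so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for i in range(6):
        (root / f"doc_{i}.txt").write_text(f"Text number {i}.", encoding="utf-8")
    (root / "sub" / "nested.txt").write_text("A nested text.", encoding="utf-8")

    flat_paths = list_text_paths(root)                  # top level only
    all_paths = list_text_paths(root, recursive=True)   # includes subfolders

    random.seed(0)  # fixed seed so the subsample is reproducible
    sample_paths = random.sample(all_paths, k=3)
    sample_rows = read_texts(sample_paths)

print(len(flat_paths), len(all_paths), len(sample_rows))
```

Testing a pipeline on `sample_rows` first keeps parsing time short; the same code then runs unchanged on the full list of paths.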

In [3]:
ds_corpus = ds.corpus_from_folder("data/tar_corpus")

Note the resulting data structure.

In [4]:
ds_corpus.head(5)

doc_id,text
str,str
"""acad_01.txt""","""In the field of plant biology,…"
"""acad_02.txt""","""In my first paper for Complex …"
"""acad_03.txt""","""At root, every hypothesis is a…"
"""acad_04.txt""","""Several tests were administere…"
"""acad_05.txt""","""The development of necking and…"


This simple DataFrame structure is all that is expected to process the corpus. Thus, if you want to read in a CSV file, a parquet file, or similar tabular data, you can simply use one of [the input options from polars](https://docs.pola.rs/api/python/stable/reference/io.html).

The only requirements are that the first column is called 'doc_id' and contains a unique identifier and that the second column is called 'text' and contains a string.
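
When the corpus comes from a file you did not create yourself, it can help to check those two requirements before parsing. The following is a minimal sketch using only the standard library; `check_corpus` is a hypothetical helper, and the CSV contents are invented. With polars, the same checks would be made on the DataFrame's column names and on the 'doc_id' column.

```python
import csv
import io

# Hypothetical CSV contents; in practice this would be a file on disk.
raw_csv = io.StringIO(
    "doc_id,text\n"
    "a.txt,First document.\n"
    "b.txt,Second document.\n"
)

reader = csv.DictReader(raw_csv)
columns = reader.fieldnames
rows = list(reader)


def check_corpus(columns, rows):
    """Return a list of problems; an empty list means the table looks usable."""
    problems = []
    if list(columns[:2]) != ["doc_id", "text"]:
        problems.append("first two columns must be 'doc_id' and 'text'")
        return problems
    ids = [row["doc_id"] for row in rows]
    if len(ids) != len(set(ids)):
        problems.append("doc_id values are not unique")
    if any(not isinstance(row["text"], str) for row in rows):
        problems.append("text values must be strings")
    return problems


problems = check_corpus(columns, rows)
print(problems or "corpus table looks fine")
```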

### Process corpus

To process a corpus use the `docuscope_parse` function. The function requires a corpus DataFrame and the spaCy instance.

In [6]:
ds_tokens = ds.docuscope_parse(ds_corpus, nlp_model=nlp, n_process=4)

In [7]:
ds_tokens.head(20)

doc_id,token,pos_tag,ds_tag,pos_id,ds_id
str,str,str,str,u32,u32
"""acad_01.txt""","""In ""","""II""","""Untagged""",1,1
"""acad_01.txt""","""the ""","""AT""","""Untagged""",2,2
"""acad_01.txt""","""field ""","""NN1""","""Untagged""",3,3
"""acad_01.txt""","""of ""","""IO""","""Untagged""",4,4
"""acad_01.txt""","""plant ""","""NN1""","""InformationTopics""",5,5
…,…,…,…,…,…
"""acad_01.txt""","""photosynthesis""","""NN1""","""AcademicTerms""",16,13
"""acad_01.txt""",""". ""","""Y""","""Untagged""",17,14
"""acad_01.txt""","""This ""","""DD1""","""MetadiscourseCohesive""",18,15
"""acad_01.txt""","""process ""","""NN1""","""InformationTopics""",19,16


## Frequency tables

Frequency tables are produced by the `frequency_table` function, which takes the tokens table produced by `docuscope_parse` and a `count_by` argument that is one of **'pos'** or **'ds'** for part-of-speech or DocuScope category.

In addition to being trained on DocuScope, the spaCy model was trained on the [CLAWS7 tagset](https://ucrel.lancs.ac.uk/claws7tags.html). Those tags are the default counting method.

<div class="alert alert-info">

**Note: Normalizing**

Earlier versions of the package required passing a token total to the function. That is no longer required, as all normalizing is carried out inside the function.
    
</div>
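
To make the normalization concrete, here is a minimal sketch of the arithmetic behind the frequency columns, written in plain Python on toy data. It assumes relative frequency is the count per million tokens and that range is the percentage of documents containing the item; how the package itself derives the token total (for example, whether punctuation is included) is not covered here.

```python
from collections import Counter, defaultdict

# Toy (doc_id, token, tag) rows standing in for a parsed corpus.
rows = [
    ("d1", "can", "VM"), ("d1", "the", "AT"), ("d1", "can", "NN1"),
    ("d2", "the", "AT"), ("d2", "can", "VM"), ("d2", "the", "AT"),
    ("d3", "the", "AT"), ("d3", "open", "VVI"),
]

total_tokens = len(rows)
n_docs = len({doc_id for doc_id, _, _ in rows})

# Counting (token, tag) pairs keeps 'can' as verb apart from 'can' as noun.
abs_freq = Counter((token, tag) for _, token, tag in rows)
docs_with = defaultdict(set)
for doc_id, token, tag in rows:
    docs_with[(token, tag)].add(doc_id)

table = [
    {
        "Token": token,
        "Tag": tag,
        "AF": af,
        "RF": af / total_tokens * 1_000_000,
        "Range": len(docs_with[(token, tag)]) / n_docs * 100,
    }
    for (token, tag), af in abs_freq.most_common()
]

for row in table:
    print(row)
```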

In [8]:
wc = ds.frequency_table(ds_tokens)

The table includes columns for tokens, tags, absolute frequency, relative frequency (per million tokens), and range (the percentage of texts in which the token appears):

In [9]:
wc.head(10)

Token,Tag,AF,RF,Range
str,str,u32,f64,f64
"""the""","""AT""",9610,72382.989621,100.0
"""of""","""IO""",5065,38149.827516,100.0
"""and""","""CC""",3672,27657.683443,100.0
"""in""","""II""",2853,21488.93542,100.0
"""a""","""AT1""",2569,19349.833542,100.0
"""to""","""TO""",2171,16352.078092,100.0
"""is""","""VBZ""",1784,13437.17518,98.0
"""that""","""CST""",1550,11674.675745,100.0
"""to""","""II""",1324,9972.432701,100.0
"""for""","""IF""",1097,8262.657608,100.0


The resulting data frame is easy to filter and sort. So, here, we filter for an absolute frequency greater than 10 and tokens tagged as verbs (tags starting with 'V'):

In [10]:
wc.filter(
    (pl.col("AF") > 10) &
    (pl.col("Tag").str.starts_with("V"))
    )

Token,Tag,AF,RF,Range
str,str,u32,f64,f64
"""is""","""VBZ""",1784,13437.17518,98.0
"""be""","""VBI""",960,7230.766913,98.0
"""are""","""VBR""",763,5746.953286,96.0
"""was""","""VBDZ""",594,4474.037028,92.0
"""will""","""VM""",512,3856.40902,82.0
…,…,…,…,…
"""take""","""VV0""",11,82.852538,14.0
"""test""","""VVI""",11,82.852538,12.0
"""want""","""VV0""",11,82.852538,14.0
"""work""","""VV0""",11,82.852538,12.0


Here, we filter for adverbs. Note that multi-word units tagged as a sequence are aggregated into a single token (like *et al*):

In [11]:
wc.filter(
    pl.col("Tag").str.starts_with("R")
    )

Token,Tag,AF,RF,Range
str,str,u32,f64,f64
"""also""","""RR""",302,2274.678758,98.0
"""more""","""RGR""",255,1920.672461,82.0
"""et al""","""RA""",201,1513.941822,12.0
"""however""","""RR""",184,1385.896992,80.0
"""only""","""RR""",159,1197.59577,84.0
…,…,…,…,…
"""wholeheartedly""","""RR""",1,7.532049,2.0
"""wholly""","""RR""",1,7.532049,2.0
"""wirelessly""","""RR""",1,7.532049,2.0
"""wonderfully""","""RR""",1,7.532049,2.0


Similarly, we can generate a frequency table of DocuScope tokens by setting `count_by='ds'`.

In [12]:
wc = ds.frequency_table(ds_tokens, count_by='ds')

Most function words in isolation are not tagged by DocuScope (as they don't carry clear rhetorical meaning on their own).

In [13]:
wc.head(10)

Token,Tag,AF,RF,Range
str,str,u32,f64,f64
"""the""","""Untagged""",5686,52226.947488,100.0
"""and""","""Untagged""",3506,32203.249718,100.0
"""of""","""Untagged""",3148,28914.954396,100.0
"""in""","""Untagged""",1935,17773.328067,100.0
"""to""","""Untagged""",1705,15660.736101,100.0
"""a""","""Untagged""",1452,13336.884937,100.0
"""that""","""Untagged""",891,8183.997575,98.0
"""for""","""Untagged""",749,6879.701665,98.0
"""as""","""Untagged""",638,5860.146412,100.0
"""with""","""Untagged""",610,5602.961303,100.0


However, these same function words may appear in recognized phrases. This also means that the count of *the* is not inclusive of all occurrences of the token.

In [14]:
wc.filter(
    pl.col("Token").str.starts_with("the ")
    ).head(20)

Token,Tag,AF,RF,Range
str,str,u32,f64,f64
"""the same""","""InformationExposition""",35,321.481386,36.0
"""the most""","""ForceStressed""",33,303.111021,38.0
"""the study""","""AcademicTerms""",29,266.370291,4.0
"""the united states""","""InformationPlace""",25,229.629562,22.0
"""the current""","""Narrative""",22,202.074014,20.0
…,…,…,…,…
"""the community""","""PublicTerms""",14,128.592554,8.0
"""the court""","""PublicTerms""",14,128.592554,4.0
"""the second""","""InformationExposition""",14,128.592554,18.0
"""the importance of""","""AcademicWritingMoves""",13,119.407372,18.0


As with part-of-speech tags, we can easily filter the data frame for the desired [DocuScope category](https://docuscospacy.readthedocs.io/en/latest/docuscope.html#Categories). Here, we filter by 'Character':

In [15]:
wc.filter(
    pl.col("Tag").str.starts_with("Character")
    ).head(20)

Token,Tag,AF,RF,Range
str,str,u32,f64,f64
"""their""","""Character""",335,3077.036125,88.0
"""his""","""Character""",239,2195.258609,52.0
"""he""","""Character""",135,1239.999633,48.0
"""students""","""Character""",129,1184.888538,18.0
"""participants""","""Character""",106,973.629341,14.0
…,…,…,…,…
"""religious""","""Character""",54,495.999853,16.0
"""self""","""Character""",54,495.999853,28.0
"""women""","""Character""",51,468.444306,20.0
"""jews""","""Character""",45,413.333211,6.0


Or by 'Public Terms':

In [16]:
wc.filter(
    pl.col("Tag").str.starts_with("Public")
    ).head(20)

Token,Tag,AF,RF,Range
str,str,u32,f64,f64
"""national""","""PublicTerms""",100,918.518246,32.0
"""political""","""PublicTerms""",63,578.666495,24.0
"""society""","""PublicTerms""",54,495.999853,28.0
"""citizenship""","""PublicTerms""",53,486.814671,6.0
"""population""","""PublicTerms""",45,413.333211,28.0
…,…,…,…,…
"""institutions""","""PublicTerms""",21,192.888832,10.0
"""authority""","""PublicTerms""",20,183.703649,18.0
"""amendment""","""PublicTerms""",19,174.518467,6.0
"""majority of""","""PublicTerms""",19,174.518467,24.0


## Tags tables

Rather than counting tokens, we can generate counts of the tags **only** by using the `tags_table` function. It works just like the `frequency_table` function, taking the tokens table created by the `docuscope_parse` function and a `count_by` argument of either 'pos' or 'ds'.

In [17]:
tc = ds.tags_table(ds_tokens)

In [18]:
tc.head(10)

Tag,AF,RF,Range
str,u32,f64,f64
"""NN1""",24030,18.099513,100.0
"""JJ""",11392,8.58051,100.0
"""AT""",9725,7.324918,100.0
"""II""",9492,7.149421,100.0
"""NN2""",9146,6.888812,100.0
"""IO""",5065,3.814983,100.0
"""NP1""",4251,3.201874,98.0
"""CC""",4184,3.151409,100.0
"""RR""",4161,3.134086,100.0
"""VVI""",3246,2.444903,100.0


And by DocuScope category:

In [19]:
dc = ds.tags_table(ds_tokens, count_by="ds")

In [20]:
dc.head(10)

Tag,AF,RF,Range
str,u32,f64,f64
"""Untagged""",36990,33.98036,100.0
"""AcademicTerms""",9245,8.492793,100.0
"""Character""",7945,7.298566,100.0
"""Narrative""",6840,6.283473,100.0
"""Description""",6536,6.004207,100.0
"""InformationExposition""",4982,4.576646,100.0
"""InformationTopics""",3729,3.425595,98.0
"""Negative""",3679,3.379663,100.0
"""Positive""",3045,2.797248,100.0
"""MetadiscourseCohesive""",2451,2.251578,100.0


## Dispersions

The `frequency_table` function includes 'Range' as a rudimentary measure for how tokens are distributed. For more advanced measures, you can use the `dispersions_table` function. This function includes common measures like Gries' [Deviation of Proportions](https://www.stgries.info/research/2010_STG_DispersionAdjFreq_CorpLingAppl.pdf).
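
As an illustration of what such a measure captures, here is a minimal sketch of Deviation of Proportions (DP) and its normalized variant, following the published definitions: DP is half the summed absolute difference between each document's share of the corpus and its share of the item's occurrences, and the normalized value divides DP by one minus the smallest document share. The function name and the toy counts are assumptions made for this example, and the package's own implementation may differ in detail.

```python
def deviation_of_proportions(counts, doc_sizes):
    """Return (DP, DP_norm) for one item.

    counts: occurrences of the item in each document.
    doc_sizes: total tokens in each document, in the same order.
    """
    total_count = sum(counts)
    total_size = sum(doc_sizes)
    expected = [size / total_size for size in doc_sizes]   # document shares
    observed = [count / total_count for count in counts]   # item shares
    dp = 0.5 * sum(abs(o - e) for o, e in zip(observed, expected))
    dp_norm = dp / (1 - min(expected))
    return dp, dp_norm


# Three equally sized documents.
even = deviation_of_proportions([10, 10, 10], [1000, 1000, 1000])
clumped = deviation_of_proportions([30, 0, 0], [1000, 1000, 1000])
print(even, clumped)
```

Values near 0 indicate an item spread in proportion to document size; values near 1 indicate an item concentrated in a small part of the corpus.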

In [23]:
dsp = ds.dispersions_table(ds_tokens, count_by="pos")

In [24]:
dsp.head(10)

Token,Tag,AF,RF,Carrolls_D2,Rosengrens_S,Lynes_D3,DC,Juillands_D,DP,DP_norm
str,str,u64,f64,f64,f64,f64,f64,f64,f64,f64
"""the""","""AT""",9610,72382.989621,0.964601,0.984981,0.930806,0.929015,0.967197,0.102275,0.102698
"""of""","""IO""",5065,38149.827516,0.947715,0.984078,0.883843,0.90022,0.955746,0.095509,0.095904
"""and""","""CC""",3672,27657.683443,0.928468,0.978108,0.821805,0.869744,0.957209,0.124252,0.124766
"""in""","""II""",2959,22287.3326,0.930874,0.978738,0.844625,0.868134,0.953631,0.116709,0.117192
"""a""","""AT1""",2572,19372.429688,0.945612,0.981248,0.886344,0.893346,0.960714,0.114134,0.114607
"""to""","""TO""",2171,16352.078092,0.951199,0.972768,0.899994,0.903728,0.949974,0.131491,0.132035
"""is""","""VBZ""",1784,13437.17518,0.919229,0.928686,0.831238,0.831865,0.922917,0.194194,0.194997
"""that""","""CST""",1550,11674.675745,0.927448,0.956544,0.847784,0.855659,0.923811,0.156775,0.157424
"""to""","""II""",1324,9972.432701,0.938721,0.987034,0.85423,0.885227,0.963669,0.097986,0.098392
"""for""","""IF""",1099,8277.721706,0.941273,0.954536,0.875632,0.883362,0.933182,0.184637,0.185401


## Ngrams and clusters

Because of the increased efficiency of polars, these functions have been updated and now include options for both ngrams and clusters, using a distinction that will be familiar to users of [AntConc](https://www.laurenceanthony.net/software/antconc/releases/AntConc324/help.pdf).

### Ngrams

Ngrams are simply the most frequent token sequences, from 2 to 5 tokens in length. The `ngrams` function will filter for a minimum frequency. (The default is 10.)

<div class="alert alert-warning">
    
**Warning: Setting a low `min_frequency`**

Be aware that depending on the size of your corpus, ngram tables can be massive. So be cautious when setting the threshold to or near zero.

</div>

The count that is returned is the raw count.
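
The underlying idea is a sliding window over each document's tokens. The sketch below is a plain-Python illustration on toy data, not the package's polars implementation; `count_ngrams` is a hypothetical helper, and the choice to keep windows from crossing document boundaries is an assumption.

```python
from collections import Counter


def count_ngrams(docs, span=2, min_frequency=1):
    """Count token sequences of length `span` within each document."""
    counts = Counter()
    for tokens in docs.values():
        for i in range(len(tokens) - span + 1):
            counts[tuple(tokens[i:i + span])] += 1
    # Keep only sequences that meet the frequency threshold.
    return {ng: n for ng, n in counts.most_common() if n >= min_frequency}


docs = {
    "d1": ["one", "of", "the", "best", "one", "of", "the", "few"],
    "d2": ["one", "of", "the", "many"],
}
trigrams = count_ngrams(docs, span=3, min_frequency=2)
print(trigrams)
```

Every distinct window is stored before the threshold is applied, which is why a very low `min_frequency` can produce very large tables.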

In [25]:
nc = ds.ngrams(ds_tokens, span=3, min_frequency=10)

In [26]:
nc.head(10)

Token_1,Token_2,Token_3,Tag_1,Tag_2,Tag_3,AF,RF,Range
str,str,str,str,str,str,u32,f64,f64
"""part""","""time""","""faculty""","""NN1""","""NNT1""","""NN1""",124,933.97406,2.0
"""of""","""part""","""time""","""IO""","""NN1""","""NNT1""",53,399.19859,2.0
"""one""","""of""","""the""","""MC1""","""IO""","""AT""",41,308.814004,48.0
"""the""","""pardoner""","""'s""","""AT""","""NP1""","""GE""",40,301.281955,2.0
"""the""","""fact""","""that""","""AT""","""NN1""","""CST""",34,256.089662,36.0
"""the""","""number""","""of""","""AT""","""NN1""","""IO""",32,241.025564,18.0
"""there""","""is""","""a""","""EX""","""VBZ""","""AT1""",31,233.493515,44.0
"""the""","""effects""","""of""","""AT""","""NN2""","""IO""",30,225.961466,20.0
"""more""","""likely""","""to""","""RGR""","""JJ""","""TO""",29,218.429417,16.0
"""at""","""community""","""colleges""","""II""","""NN1""","""NN2""",28,210.897368,2.0


### Clusters

Clusters are built around either a token or a tag:

* You can input a word or string using the `clusters_by_token` function. With that function you need to specify whether that input should match a token completely or partially, and choose which tagset to return.
* Alternatively, you can use the `clusters_by_tag` function. That allows you to select a tag (like **NN1** or **AcademicTerms**) as the basis for your clusters.
* For either option, you must select the size of your clusters (2-grams, 3-grams, or 4-grams) and the slot where your chosen word or tag should appear (on the left, in the middle, or on the right).
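
The options above can be pictured as a filter on sliding windows: only windows whose chosen slot matches the node are counted. The following plain-Python sketch illustrates that logic on toy data. The helper `clusters` and its `"fixed"`, `"starts_with"`, and `"contains"` options are hypothetical names chosen for the example; only `ends_with` is taken from the usage shown in this section.

```python
from collections import Counter


def clusters(tokens, node, node_position=1, span=3, search_type="fixed"):
    """Count `span`-token windows whose slot `node_position` matches `node`."""
    matchers = {
        "fixed": lambda t: t == node,
        "starts_with": lambda t: t.startswith(node),
        "ends_with": lambda t: t.endswith(node),
        "contains": lambda t: node in t,
    }
    matches = matchers[search_type]
    counts = Counter()
    for i in range(len(tokens) - span + 1):
        window = tuple(tokens[i:i + span])
        # node_position is 1-based: 1 is the leftmost slot.
        if matches(window[node_position - 1]):
            counts[window] += 1
    return counts


tokens = ["data", "collection", "was", "slow", "data", "collection", "ended"]
by_word = clusters(tokens, "data", node_position=1, span=2)
by_suffix = clusters(tokens, "tion", node_position=2, span=2,
                     search_type="ends_with")
print(by_word, by_suffix)
```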

We'll start by searching for clusters of length **3** with **data** in the first position. The returned data frame includes both the sequence of tokens and the sequence of tags:

In [56]:
ds.clusters_by_token(ds_tokens, node_word='data', node_position=1, span=3).head()

Token_1,Token_2,Token_3,Tag_1,Tag_2,Tag_3,AF,RF,Range
str,str,str,str,str,str,u32,f64,f64
"""data""","""from""","""the""","""NN""","""II""","""AT""",6,45.192293,19.047619
"""data""","""was""","""recorded""","""NN""","""VBDZ""","""VVN""",3,22.596147,4.761905
"""data""","""collection""","""process""","""NN""","""NN1""","""NN1""",3,22.596147,4.761905
"""data""","""is""","""by""","""NN""","""VBZ""","""II""",2,15.064098,4.761905
"""data""","""collection""","""will""","""NN""","""NN1""","""VM""",2,15.064098,4.761905


We can similarly look for clusters that include only part of a word. For example, we can find bigrams that include a word ending with **-tion** by setting the `search_type` to **ends_with**.

In [27]:
nc = ds.clusters_by_token(ds_tokens, node_word='tion', node_position=2, span=2, search_type='ends_with', count_by='pos')

In [28]:
nc.head(10)

Token_1,Token_2,Tag_1,Tag_2,AF,RF,Range
str,str,str,str,u32,f64,f64
"""the""","""intervention""","""AT""","""NN1""",34,256.089662,2.0
"""citizenship""","""education""","""NN1""","""NN1""",30,225.961466,2.0
"""the""","""nation""","""AT""","""NN1""",27,203.365319,12.0
"""data""","""collection""","""NN""","""NN1""",17,128.044831,8.0
"""higher""","""education""","""JJR""","""NN1""",16,120.512782,4.0
"""of""","""education""","""IO""","""NN1""",16,120.512782,8.0
"""the""","""formation""","""AT""","""NN1""",15,112.980733,8.0
"""the""","""notion""","""AT""","""NN1""",15,112.980733,16.0
"""brow""","""manipulation""","""NN1""","""NN1""",14,105.448684,2.0
"""the""","""manipulation""","""AT""","""NN1""",13,97.916635,2.0


Now we'll collect n-grams using the `clusters_by_tag` function. Here, we'll look at 3-token sequences that end with a past participle (**VVN**).

In [35]:
nc = ds.clusters_by_tag(ds_tokens, tag='VVN', tag_position=3, span=3, count_by='pos')

In [36]:
nc.head(10)

Token_1,Token_2,Token_3,Tag_1,Tag_2,Tag_3,AF,RF,Range
str,str,str,str,str,str,u32,f64,f64
"""can""","""be""","""seen""","""VM""","""VBI""","""VVN""",17,128.044831,16.0
"""to""","""be""","""used""","""TO""","""VBI""","""VVN""",10,75.320489,14.0
"""can""","""be""","""used""","""VM""","""VBI""","""VVN""",10,75.320489,14.0
"""will""","""be""","""asked""","""VM""","""VBI""","""VVN""",7,52.724342,8.0
"""should""","""be""","""noted""","""VM""","""VBI""","""VVN""",7,52.724342,8.0
"""could""","""be""","""used""","""VM""","""VBI""","""VVN""",7,52.724342,10.0
"""has""","""been""","""shown""","""VHZ""","""VBN""","""VVN""",6,45.192293,8.0
"""will""","""be""","""used""","""VM""","""VBI""","""VVN""",5,37.660244,4.0
"""can""","""be""","""observed""","""VM""","""VBI""","""VVN""",5,37.660244,4.0
"""can""","""be""","""found""","""VM""","""VBI""","""VVN""",5,37.660244,8.0


Similar ngram tables can be created for DocuScope sequences. Here we generate trigrams:

In [37]:
nc = ds.clusters_by_tag(ds_tokens, tag='AcademicTerms', tag_position=3, span=3, count_by='ds')

In [38]:
nc.head(10)

Token_1,Token_2,Token_3,Tag_1,Tag_2,Tag_3,AF,RF,Range
str,str,str,str,str,str,u32,f64,f64
"""part""","""time""","""faculty""","""Untagged""","""InformationTopics""","""AcademicTerms""",112,1028.872741,2.0
"""nicaraguan""","""sign""","""language""","""Character""","""Untagged""","""AcademicTerms""",13,119.422729,2.0
"""full""","""time""","""faculty""","""AcademicTerms""","""InformationTopics""","""AcademicTerms""",11,101.050001,2.0
"""of""","""citizenship""","""education""","""Untagged""","""PublicTerms""","""AcademicTerms""",10,91.863638,2.0
"""reinforced""","""concrete""","""structures""","""InformationChangePositive""","""Description""","""AcademicTerms""",9,82.677274,2.0
"""national""","""identity""","""formation""","""PublicTerms""","""AcademicTerms""","""AcademicTerms""",8,73.49091,2.0
"""of""","""an""","""electron""","""Untagged""","""Untagged""","""AcademicTerms""",8,73.49091,2.0
"""faculty""","""in""","""higher education""","""AcademicTerms""","""Untagged""","""AcademicTerms""",7,64.304546,2.0
"""academy""","""of""","""pediatrics""","""InformationTopics""","""Untagged""","""AcademicTerms""",7,64.304546,2.0
"""the""","""rate of""","""photosynthesis""","""Untagged""","""AcademicTerms""","""AcademicTerms""",7,64.304546,2.0


## Collocations

Collocations within a span (left and right) of a node word can be calculated according to several association measures.

The default span is 4 tokens to the left and 4 tokens to the right of the node word.

Like `frequency_table`, `coll_table` requires a table of the type generated by the `docuscope_parse` function. It also requires a node word.

In [54]:
ds.coll_table(ds_tokens, 'data').head()

Token,Tag,Freq Span,Freq Total,MI
str,str,u32,u32,f64
"""collection""","""NN1""",18,23,0.721679
"""collected""","""VVN""",10,12,0.683613
"""conjunctions""","""NN2""",2,1,0.66337
"""split""","""VV0""",2,1,0.66337
"""weighting""","""NN1""",2,1,0.66337


You can also specify a node tag (by default, tags are ignored) and an association measure statistic from the point-wise mutual information family ('pmi', 'pmi2', 'pmi3', or 'npmi', which is the default).
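
For readers unfamiliar with these measures, the sketch below computes the four statistics from raw counts using their textbook definitions with base-2 logarithms. The function `pmi_family` and the counts are invented for illustration, and no claim is made that the package computes its probabilities in exactly this way.

```python
import math


def pmi_family(freq_span, freq_node, freq_collocate, total_tokens):
    """Return pmi, pmi2, pmi3 and npmi from raw counts (log base 2)."""
    p_xy = freq_span / total_tokens       # node and collocate together
    p_x = freq_node / total_tokens        # node word overall
    p_y = freq_collocate / total_tokens   # collocate overall
    pmi = math.log2(p_xy / (p_x * p_y))
    return {
        "pmi": pmi,
        "pmi2": math.log2(p_xy ** 2 / (p_x * p_y)),
        "pmi3": math.log2(p_xy ** 3 / (p_x * p_y)),
        "npmi": pmi / -math.log2(p_xy),
    }


scores = pmi_family(freq_span=8, freq_node=16, freq_collocate=32,
                    total_tokens=1024)
print(scores)
```

Plain PMI rewards rare pairings, so very low-frequency collocates tend to rise to the top; the squared and cubed variants weight the joint frequency more heavily, and the normalized variant rescales PMI to a bounded range.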

In [50]:
ct = ds.coll_table(ds_tokens, 'can', node_tag='V', statistic='pmi', count_by='pos')

In [51]:
ct.head(10)

Token,Tag,Freq Span,Freq Total,MI
str,str,u32,u32,f64
"""perceive""","""NN1""",2,1,9.294012
"""undone""","""VVN""",2,1,9.294012
"""1b""","""FO""",1,1,8.294012
"""abrasion""","""NN1""",1,1,8.294012
"""abrogate""","""VVI""",1,1,8.294012
"""absorb""","""VVI""",1,1,8.294012
"""additives""","""VVZ""",1,1,8.294012
"""altered""","""JJ""",1,1,8.294012
"""ameliorate""","""VVI""",1,1,8.294012
"""anew""","""RR""",1,1,8.294012


In [52]:
ct.filter(
    (pl.col("Freq Total") > 5) &
    (pl.col("Tag").str.starts_with("V"))
)

Token,Tag,Freq Span,Freq Total,MI
str,str,u32,u32,f64
"""assume""","""VVI""",6,9,7.70905
"""arise""","""VVI""",3,6,7.294012
"""occur""","""VVI""",11,23,7.229882
"""seen""","""VVN""",18,39,7.178535
"""achieved""","""VVN""",3,7,7.07162
…,…,…,…,…
"""have""","""VH0""",2,296,1.084559
"""was""","""VBDZ""",4,594,1.079693
"""is""","""VBZ""",11,1784,0.952544
"""does""","""VDZ""",1,165,0.92769


In [55]:
ct = ds.coll_table(ds_tokens, 'people', node_tag='Character', statistic='pmi3', count_by='ds')
ct.head(10)

Token,Tag,Freq Span,Freq Total,MI
str,str,u32,u32,f64
"""believing that""","""Character""",2,3,-21.383312
"""cure""","""Positive""",2,3,-21.383312
"""falsely""","""Negative""",2,3,-21.383312
"""of""","""Untagged""",20,3148,-21.452785
"""more and more""","""ForceStressed""",2,4,-21.798349
"""infected""","""InformationChangeNegative""",3,15,-21.950352
"""and""","""Untagged""",18,3506,-22.064185
"""who had""","""Narrative""",2,5,-22.120277
"""number""","""Untagged""",4,44,-22.257781
"""sera""","""Description""",2,6,-22.383312


## Document-term matrices for tags

Document-term matrices are basic data structures for text analysis. Each row is a document (observation) and each column is a token (variable). These [can be produced by **tmtoolkit**](https://tmtoolkit.readthedocs.io/en/latest/preprocessing.html#Generating-a-sparse-document-term-matrix-(DTM)) using the `dtm` function.

The **docuscospacy** package allows for the creation of dtms with tag counts (rather than token counts) as variables.

These are produced by the `tags_dtm` function, which takes the tokens table created by the `docuscope_parse` function and a `count_by` argument of either 'pos' or 'ds'.
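
Conceptually, the matrix is a pivot of the tokens table: one row per document, one column per tag, and a count in each cell. Here is a minimal plain-Python sketch of that reshaping on toy data. Ordering the columns by overall tag frequency is an assumption based on how the output tables in this section appear.

```python
from collections import Counter, defaultdict

# Toy (doc_id, tag) pairs standing in for a parsed corpus.
rows = [
    ("d1", "NN1"), ("d1", "AT"), ("d1", "NN1"),
    ("d2", "AT"), ("d2", "VVI"),
]

per_doc = defaultdict(Counter)
for doc_id, tag in rows:
    per_doc[doc_id][tag] += 1

# Order tag columns by overall frequency (ties keep first-seen order).
totals = Counter(tag for _, tag in rows)
columns = [tag for tag, _ in totals.most_common()]

# Tags absent from a document are filled with 0.
dtm = [
    {"doc_id": doc_id, **{tag: counts.get(tag, 0) for tag in columns}}
    for doc_id, counts in per_doc.items()
]

for row in dtm:
    print(row)
```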

In [57]:
tm = ds.tags_dtm(ds_tokens)

<div class="alert alert-warning">
    
**Warning: `doc_id` column**

The first column, 'doc_id', contains the names of the document files. As a safety feature, the `tags_dtm` function does not initially place document ids as row names. Row names **must** be unique. Setting the document ids as a column allows users to account for any duplicates before proceeding.

</div>

The count that is returned is the raw count.

In [58]:
tm.head(10)

doc_id,NN1,JJ,AT,II,NN2,IO,NP1,CC,RR,VVI,AT1,VVN,MC,TO,VVG,VM,VBZ,VVZ,CST,VV0,DD1,VVD,APPGE,CS,IF,PPH1,IW,VBI,GE,XX,VBR,DDQ,NNT1,VBDZ,CSA,DD2,…,PPHO1,FW,PPX2,DAT,MC2,NNU2,NPM1,UH,VDI,VHG,NP2,VDN,NNB,PPIO2,MCMC,RGQ,VHN,DDQGE,PNQO,VDG,VBM,RRT,VMK,DDQV,PN,PPIO1,NNO2,NNU1,PPGE,NPD1,NNO,MF,PNQV,VVGK,RPK,RGQV,RRQV
str,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,…,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
"""acad_01.txt""",252,62,99,70,69,83,2,14,24,23,24,52,28,13,13,20,16,5,15,2,22,12,0,5,12,13,8,7,1,6,3,1,2,18,3,2,…,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0
"""acad_02.txt""",419,263,187,219,229,129,62,70,137,75,72,61,17,33,21,74,54,54,48,43,49,17,15,36,11,40,25,30,15,15,21,14,12,2,14,14,…,0,0,4,1,0,0,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
"""acad_03.txt""",1345,816,377,701,825,330,353,354,257,188,124,166,353,90,98,148,89,79,87,133,73,74,41,59,40,45,73,52,27,35,66,36,41,13,14,28,…,0,1,0,6,4,0,0,20,1,2,1,0,0,0,4,2,0,2,0,0,1,1,0,0,0,0,2,0,0,0,1,0,0,0,0,0,0
"""acad_04.txt""",270,102,90,76,111,38,26,41,40,36,28,73,46,24,18,30,17,11,8,5,28,9,5,10,27,6,8,22,7,14,6,8,10,9,0,12,…,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
"""acad_05.txt""",508,196,199,148,128,70,20,48,41,41,63,78,38,24,43,40,45,56,10,25,39,12,1,29,23,13,16,23,5,10,10,16,2,14,9,5,…,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
"""acad_06.txt""",708,288,240,268,271,121,34,70,101,125,78,90,24,68,73,83,57,64,34,43,44,15,5,24,26,16,31,31,3,18,31,28,8,3,9,20,…,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0
"""acad_07.txt""",1197,534,352,391,509,175,159,219,204,169,137,217,82,93,72,177,121,64,61,69,69,24,13,75,81,45,32,96,4,55,73,29,9,13,11,33,…,0,0,0,4,0,2,0,0,1,1,8,1,1,1,0,0,2,0,0,0,0,1,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0
"""acad_08.txt""",171,56,51,103,55,26,71,44,38,52,25,17,4,39,28,19,38,20,19,5,9,12,20,7,12,13,8,4,21,4,7,6,11,7,4,2,…,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
"""acad_09.txt""",307,153,196,165,108,94,281,83,74,46,42,76,27,50,36,27,10,24,44,11,18,95,65,40,36,17,24,13,16,15,1,4,2,53,9,7,…,12,0,1,1,0,0,0,0,2,3,0,0,0,1,0,1,0,0,3,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0
"""acad_10.txt""",1033,482,455,510,231,286,311,153,240,107,201,120,56,78,98,59,101,156,80,52,102,52,68,51,32,48,32,29,41,21,21,43,10,24,31,27,…,4,6,1,0,0,0,0,1,2,4,4,0,0,0,0,4,0,2,2,1,0,1,2,0,2,0,0,0,0,0,0,0,0,0,0,0,0


A similar dtm can be created for DocuScope categories by setting `count_by` to 'ds':

In [60]:
tm = ds.tags_dtm(ds_tokens, count_by='ds')
tm.head(10)

doc_id,Untagged,AcademicTerms,Character,Narrative,Description,InformationExposition,InformationTopics,Negative,Positive,MetadiscourseCohesive,Reasoning,ForceStressed,PublicTerms,Strategic,InformationStates,InformationChange,ConfidenceHedged,InformationReportVerbs,Citation,InformationPlace,Interactive,Inquiry,Future,ConfidenceHigh,Contingent,AcademicWritingMoves,Facilitate,MetadiscourseInteractive,Updates,InformationChangePositive,CitationAuthority,FirstPerson,Responsibility,InformationChangeNegative,Uncertainty,ConfidenceLow,CitationHedged
str,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32,u32
"""acad_01.txt""",324,127,15,66,70,57,15,10,9,12,26,7,4,10,9,10,15,17,0,0,3,18,3,3,0,16,1,3,0,1,2,0,2,0,0,0,0
"""acad_02.txt""",760,255,79,133,132,157,74,67,66,97,51,54,18,24,33,40,60,38,12,9,22,8,20,20,38,5,7,3,8,26,3,9,0,2,1,1,1
"""acad_03.txt""",2392,844,465,422,435,428,240,201,160,142,160,126,52,78,124,130,137,57,415,49,39,82,42,30,43,20,28,31,21,47,23,42,3,32,9,1,3
"""acad_04.txt""",373,72,28,64,161,73,29,31,42,39,35,17,22,35,12,12,19,23,3,9,7,6,11,4,6,24,12,1,1,2,2,1,2,1,0,0,0
"""acad_05.txt""",651,200,47,133,172,79,77,73,18,42,52,33,2,14,33,65,21,27,3,0,7,10,21,5,19,17,7,5,3,0,0,1,2,0,0,1,0
"""acad_06.txt""",777,188,99,107,420,101,72,131,84,106,54,55,32,41,55,39,65,30,16,23,16,7,23,19,30,11,14,5,7,29,14,0,23,27,0,1,0
"""acad_07.txt""",1621,395,159,245,556,285,291,126,153,137,84,101,47,82,123,61,104,88,23,35,45,11,86,36,54,28,25,14,22,25,6,4,13,2,8,2,2
"""acad_08.txt""",292,60,78,48,27,36,20,33,65,21,26,34,37,10,30,22,7,18,4,2,4,5,16,6,3,0,7,2,1,3,3,0,0,0,0,0,0
"""acad_09.txt""",645,59,360,171,100,59,20,128,71,35,27,41,46,47,7,7,12,13,19,72,7,3,9,21,18,1,8,3,7,3,3,0,11,4,5,0,2
"""acad_10.txt""",1948,466,483,319,226,238,79,111,119,106,80,127,54,63,71,22,45,23,39,57,88,31,28,50,15,9,10,36,13,15,19,11,1,4,4,0,0


Counts can also be normalized using the `dtm_weight` function. The scheme can be set to 'prop', 'scale', or 'tfidf'.
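
To show what two of these schemes do to a row of counts, here is a minimal plain-Python sketch. `weight_prop` divides each count by its document's total. `weight_tfidf` uses one textbook formulation of tf-idf (proportion multiplied by the natural log of documents over documents containing the tag), which is an assumption and may not match the variant the package implements. Both helpers are hypothetical, and the 'scale' scheme is not sketched.

```python
import math


def weight_prop(dtm):
    """Divide each count by its document's total count."""
    weighted = []
    for row in dtm:
        total = sum(v for k, v in row.items() if k != "doc_id")
        weighted.append({
            k: (v if k == "doc_id" else v / total) for k, v in row.items()
        })
    return weighted


def weight_tfidf(dtm):
    """Textbook tf-idf: proportion * ln(n_docs / docs containing the tag)."""
    n_docs = len(dtm)
    tags = [k for k in dtm[0] if k != "doc_id"]
    doc_freq = {t: sum(1 for row in dtm if row[t] > 0) for t in tags}
    # Tags that never occur get an idf of 0 to avoid dividing by zero.
    idf = {t: (math.log(n_docs / df) if df else 0.0) for t, df in doc_freq.items()}
    return [
        {k: (v if k == "doc_id" else v * idf[k]) for k, v in row.items()}
        for row in weight_prop(dtm)
    ]


toy_dtm = [
    {"doc_id": "d1", "NN1": 3, "AT": 1},
    {"doc_id": "d2", "NN1": 1, "AT": 0},
]
prop = weight_prop(toy_dtm)
tfidf = weight_tfidf(toy_dtm)
print(prop)
print(tfidf)
```

Under this formulation a tag that occurs in every document receives a tf-idf weight of zero, which is worth keeping in mind for very common tags.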

In [61]:
norm_tm = ds.dtm_weight(tm, scheme='prop')
norm_tm.head(10)

doc_id,Untagged,AcademicTerms,Character,Narrative,Description,InformationExposition,InformationTopics,Negative,Positive,MetadiscourseCohesive,Reasoning,ForceStressed,PublicTerms,Strategic,InformationStates,InformationChange,ConfidenceHedged,InformationReportVerbs,Citation,InformationPlace,Interactive,Inquiry,Future,ConfidenceHigh,Contingent,AcademicWritingMoves,Facilitate,MetadiscourseInteractive,Updates,InformationChangePositive,CitationAuthority,FirstPerson,Responsibility,InformationChangeNegative,Uncertainty,ConfidenceLow,CitationHedged
str,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
"""acad_01.txt""",0.378947,0.148538,0.017544,0.077193,0.081871,0.066667,0.017544,0.011696,0.010526,0.014035,0.030409,0.008187,0.004678,0.011696,0.010526,0.011696,0.017544,0.019883,0.0,0.0,0.003509,0.021053,0.003509,0.003509,0.0,0.018713,0.00117,0.003509,0.0,0.00117,0.002339,0.0,0.002339,0.0,0.0,0.0,0.0
"""acad_02.txt""",0.325761,0.109301,0.033862,0.057008,0.05658,0.067295,0.031719,0.028718,0.02829,0.041577,0.02186,0.023146,0.007715,0.010287,0.014145,0.017145,0.025718,0.016288,0.005144,0.003858,0.00943,0.003429,0.008573,0.008573,0.016288,0.002143,0.003,0.001286,0.003429,0.011144,0.001286,0.003858,0.0,0.000857,0.000429,0.000429,0.000429
"""acad_03.txt""",0.316695,0.111744,0.061565,0.055872,0.057593,0.056666,0.031775,0.026612,0.021184,0.0188,0.021184,0.016682,0.006885,0.010327,0.016417,0.017212,0.018138,0.007547,0.054945,0.006487,0.005164,0.010857,0.005561,0.003972,0.005693,0.002648,0.003707,0.004104,0.00278,0.006223,0.003045,0.005561,0.000397,0.004237,0.001192,0.000132,0.000397
"""acad_04.txt""",0.31637,0.061069,0.023749,0.054283,0.136556,0.061917,0.024597,0.026293,0.035623,0.033079,0.029686,0.014419,0.01866,0.029686,0.010178,0.010178,0.016115,0.019508,0.002545,0.007634,0.005937,0.005089,0.00933,0.003393,0.005089,0.020356,0.010178,0.000848,0.000848,0.001696,0.001696,0.000848,0.001696,0.000848,0.0,0.0,0.0
"""acad_05.txt""",0.353804,0.108696,0.025543,0.072283,0.093478,0.042935,0.041848,0.039674,0.009783,0.022826,0.028261,0.017935,0.001087,0.007609,0.017935,0.035326,0.011413,0.014674,0.00163,0.0,0.003804,0.005435,0.011413,0.002717,0.010326,0.009239,0.003804,0.002717,0.00163,0.0,0.0,0.000543,0.001087,0.0,0.0,0.000543,0.0
"""acad_06.txt""",0.285557,0.069092,0.036384,0.039324,0.154355,0.037119,0.026461,0.048144,0.030871,0.038956,0.019846,0.020213,0.01176,0.015068,0.020213,0.014333,0.023888,0.011025,0.00588,0.008453,0.00588,0.002573,0.008453,0.006983,0.011025,0.004043,0.005145,0.001838,0.002573,0.010658,0.005145,0.0,0.008453,0.009923,0.0,0.000368,0.0
"""acad_07.txt""",0.317905,0.077466,0.031183,0.048049,0.109041,0.055893,0.05707,0.024711,0.030006,0.026868,0.016474,0.019808,0.009217,0.016082,0.024122,0.011963,0.020396,0.017258,0.004511,0.006864,0.008825,0.002157,0.016866,0.00706,0.01059,0.005491,0.004903,0.002746,0.004315,0.004903,0.001177,0.000784,0.00255,0.000392,0.001569,0.000392,0.000392
"""acad_08.txt""",0.317391,0.065217,0.084783,0.052174,0.029348,0.03913,0.021739,0.03587,0.070652,0.022826,0.028261,0.036957,0.040217,0.01087,0.032609,0.023913,0.007609,0.019565,0.004348,0.002174,0.004348,0.005435,0.017391,0.006522,0.003261,0.0,0.007609,0.002174,0.001087,0.003261,0.003261,0.0,0.0,0.0,0.0,0.0,0.0
"""acad_09.txt""",0.315558,0.028865,0.176125,0.083659,0.048924,0.028865,0.009785,0.062622,0.034736,0.017123,0.013209,0.020059,0.022505,0.022994,0.003425,0.003425,0.005871,0.00636,0.009295,0.035225,0.003425,0.001468,0.004403,0.010274,0.008806,0.000489,0.003914,0.001468,0.003425,0.001468,0.001468,0.0,0.005382,0.001957,0.002446,0.0,0.000978
"""acad_10.txt""",0.388822,0.093014,0.096407,0.063673,0.04511,0.047505,0.015768,0.022156,0.023752,0.021158,0.015968,0.025349,0.010778,0.012575,0.014172,0.004391,0.008982,0.004591,0.007784,0.011377,0.017565,0.006188,0.005589,0.00998,0.002994,0.001796,0.001996,0.007186,0.002595,0.002994,0.003792,0.002196,0.0002,0.000798,0.000798,0.0,0.0


In [62]:
tfidf_tm = ds.dtm_weight(tm, scheme='tfidf')
tfidf_tm.head(10)

doc_id,Untagged,AcademicTerms,Character,Narrative,Description,InformationExposition,InformationTopics,Negative,Positive,MetadiscourseCohesive,Reasoning,ForceStressed,PublicTerms,Strategic,InformationStates,InformationChange,ConfidenceHedged,InformationReportVerbs,Citation,InformationPlace,Interactive,Inquiry,Future,ConfidenceHigh,Contingent,AcademicWritingMoves,Facilitate,MetadiscourseInteractive,Updates,InformationChangePositive,CitationAuthority,FirstPerson,Responsibility,InformationChangeNegative,Uncertainty,ConfidenceLow,CitationHedged
str,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
"""acad_01.txt""",0.258933,0.101495,0.011988,0.052746,0.055942,0.045553,0.01216,0.007992,0.007193,0.00959,0.020779,0.005594,0.003197,0.007992,0.007403,0.007992,0.011988,0.013586,0.0,0.0,0.002432,0.014593,0.002504,0.002398,0.0,0.013357,0.000811,0.002398,0.0,0.000874,0.001834,0.0,0.001964,0.0,0.0,0.0,0.0
"""acad_02.txt""",0.222591,0.074685,0.023138,0.038953,0.03866,0.045983,0.021986,0.019623,0.01933,0.02841,0.014937,0.015816,0.005272,0.007029,0.009948,0.011715,0.017573,0.01113,0.003843,0.002928,0.006536,0.002377,0.006119,0.005858,0.011455,0.00153,0.00208,0.000879,0.002412,0.008327,0.001008,0.003558,0.0,0.00092,0.000395,0.000607,0.000734
"""acad_03.txt""",0.216396,0.076354,0.042067,0.038177,0.039353,0.03872,0.022025,0.018184,0.014475,0.012846,0.014475,0.011399,0.004704,0.007056,0.011546,0.011761,0.012394,0.005157,0.041056,0.004925,0.003579,0.007525,0.003969,0.002714,0.004004,0.00189,0.00257,0.002804,0.001955,0.00465,0.002388,0.005129,0.000334,0.004544,0.001099,0.000188,0.00068
"""acad_04.txt""",0.216174,0.041728,0.016228,0.037091,0.093308,0.042307,0.017049,0.017966,0.024341,0.022603,0.020284,0.009852,0.01275,0.020284,0.007158,0.006955,0.011012,0.01333,0.001901,0.005795,0.004115,0.003527,0.006659,0.002318,0.003579,0.01453,0.007055,0.00058,0.000597,0.001268,0.00133,0.000782,0.001425,0.00091,0.0,0.0,0.0
"""acad_05.txt""",0.241753,0.074271,0.017454,0.04939,0.063873,0.029337,0.029007,0.027109,0.006684,0.015597,0.019311,0.012255,0.000743,0.005199,0.012614,0.024138,0.007798,0.010027,0.001218,0.0,0.002637,0.003767,0.008146,0.001857,0.007262,0.006595,0.002637,0.001857,0.001147,0.0,0.0,0.000501,0.000913,0.0,0.0,0.00077,0.0
"""acad_06.txt""",0.195119,0.04721,0.024861,0.02687,0.10547,0.025363,0.018341,0.032897,0.021094,0.026619,0.01356,0.013812,0.008036,0.010296,0.014216,0.009794,0.016323,0.007534,0.004394,0.006417,0.004076,0.001783,0.006033,0.004771,0.007754,0.002885,0.003566,0.001256,0.001809,0.007964,0.004034,0.0,0.007098,0.010644,0.0,0.000521,0.0
"""acad_07.txt""",0.217223,0.052932,0.021307,0.032831,0.074507,0.038192,0.039558,0.016885,0.020503,0.018359,0.011256,0.013535,0.006298,0.010988,0.016965,0.008174,0.013937,0.011792,0.00337,0.005211,0.006117,0.001495,0.012038,0.004824,0.007448,0.003919,0.003398,0.001876,0.003034,0.003664,0.000923,0.000724,0.002141,0.000421,0.001447,0.000556,0.000672
"""acad_08.txt""",0.216872,0.044563,0.057932,0.03565,0.020053,0.026738,0.015068,0.024509,0.048276,0.015597,0.019311,0.025252,0.02748,0.007427,0.022934,0.01634,0.005199,0.013369,0.003249,0.00165,0.003014,0.003767,0.012413,0.004456,0.002293,0.0,0.005274,0.001485,0.000764,0.002437,0.002557,0.0,0.0,0.0,0.0,0.0,0.0
"""acad_09.txt""",0.215619,0.019723,0.120345,0.057164,0.033429,0.019723,0.006782,0.04279,0.023735,0.0117,0.009026,0.013706,0.015377,0.015712,0.002409,0.00234,0.004012,0.004346,0.006946,0.02674,0.002374,0.001017,0.003143,0.00702,0.006193,0.000349,0.002713,0.001003,0.002409,0.001097,0.001151,0.0,0.004519,0.002099,0.002256,0.0,0.001676
"""acad_10.txt""",0.26568,0.063556,0.065875,0.043507,0.030823,0.03246,0.01093,0.015139,0.01623,0.014457,0.010911,0.017321,0.007365,0.008592,0.009967,0.003,0.006137,0.003137,0.005817,0.008637,0.012175,0.004289,0.003989,0.006819,0.002106,0.001282,0.001384,0.00491,0.001825,0.002237,0.002974,0.002025,0.000168,0.000856,0.000736,0.0,0.0


## KWIC tables

There is also a function for generating Key Word in Context (KWIC) tables. For display purposes the `kwic_center_node` function trims the context columns to 75 characters maximum.

The function takes a tokens table of the kind passed in below (`ds_tokens`). A node word needs to be set, and there is an option to ignore the case of the node word.

<div class="alert alert-info">

**Note: Other KWIC options**

The **tmtoolkit** package has [its own KWIC functions](https://tmtoolkit.readthedocs.io/en/latest/preprocessing.html#Keywords-in-context-(KWIC)-and-general-filtering-methods). The only difference is that this function produces a table with the node word in a center column and context columns to the left and right. The **tmtoolkit** functions produce tables with a single column that includes the node word.
  
</div>

In [64]:
kcn = ds.kwic_center_node(ds_tokens, 'data', ignore_case=True, search_type='fixed')

In [66]:
kcn.head()

Doc ID,Pre-Node,Node,Post-Node
str,str,str,str
"""acad_01.txt""","""and the results were recorded …","""data ""","""chart. This was repeated for a…"
"""acad_01.txt""","""the surface. Table 1 shows the…","""data ""","""chart for the number of bubble…"
"""acad_01.txt""","""of sodium bicarbonate was calc…","""data ""","""can be seen below in Table 2"""
"""acad_01.txt""","""bicarbonate increased. As show…","""data ""","""in Tables 1 and 2 in the """
"""acad_01.txt""","""is 10.8 bubbles. Based on the ""","""data ""","""shown in Table 1, it is """


There is also an option for matching tokens that contain a character sequence at their beginning or end, which is set by changing the `search_type` argument:

In [68]:
kwc = ds.kwic_center_node(ds_tokens, 'tion', ignore_case=True, search_type='ends_with')

In [69]:
kwc.head(10)

Doc ID,Pre-Node,Node,Post-Node
str,str,str,str
"""acad_01.txt""","""photosynthesis. This process o…","""fixation ""","""of carbon dioxide in the prese…"
"""acad_01.txt""","""The end result of photosynthes…","""production ""","""of organic materials, such as …"
"""acad_01.txt""","""factor to be tested would be t…","""concentration ""","""of carbon dioxide initially pr…"
"""acad_01.txt""","""was generated: An increase in …","""concentration ""","""of carbon dioxide initially pr…"
"""acad_01.txt""","""bubbles produced by the plants…","""attention ""","""was paid to cutting the stem o…"
"""acad_01.txt""","""concentrations were accomplish…","""solution ""","""of 0.2% sodium bicarbonate wit…"
"""acad_01.txt""","""number of bubbles observed at …","""concentration ""","""of sodium bicarbonate in the f…"
"""acad_01.txt""","""number of oxygen bubbles obser…","""concentration ""","""of sodium bicarbonate was calc…"
"""acad_01.txt""","""of photosynthesis steadily inc…","""concentration ""","""of sodium bicarbonate increase…"
"""acad_01.txt""","""Tables 1 and 2 in the Results ""","""section""",""", the number of oxygen bubbles…"

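To make the `search_type` behaviour concrete, here is a minimal pure-Python sketch of a centered KWIC search. It is an illustration only, not the package's implementation: the function `kwic_sketch` and its `width` parameter are hypothetical, and the `'starts_with'` value is assumed by analogy with `'ends_with'`.

```python
def kwic_sketch(tokens, node, search_type="fixed", ignore_case=True, width=3):
    """Return (left context, node, right context) rows for matching tokens."""
    needle = node.lower() if ignore_case else node
    rows = []
    for i, tok in enumerate(tokens):
        hay = tok.lower() if ignore_case else tok
        if search_type == "fixed":
            hit = hay == needle
        elif search_type == "starts_with":  # assumed name, by analogy
            hit = hay.startswith(needle)
        elif search_type == "ends_with":
            hit = hay.endswith(needle)
        else:
            raise ValueError(f"unknown search_type: {search_type}")
        if hit:
            # Keep up to `width` tokens on either side of the node
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


# Made-up sentence, used only to exercise the function
tokens = ("The fixation of carbon dioxide drives the production "
          "of organic materials").split()
rows = kwic_sketch(tokens, "tion", search_type="ends_with")
for left, node, right in rows:
    print(f"{left:>20} | {node} | {right}")
```

Keeping the node in its own column is what makes the resulting table easy to sort or filter by the matched form.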

## Keyword tables

[Keywords](https://eprints.lancs.ac.uk/id/eprint/140803/1/Rayson_2019_CorpusAnalysisofKeyWords_Submitted.pdf) are a common method for profiling corpora by statistically comparing token frequencies in one corpus (a target corpus) to those in another (a reference corpus).

To generate a keyword list, we first need to process our reference corpus, in this case a small corpus of news articles.

<div class="alert alert-warning">
    
**Warning: Preparing frequency tables**

Be sure to process target and reference corpora in precisely the same way prior to comparison.

</div>

In [70]:
corp_ref = ds.corpus_from_folder('data/ref_corpus')
ref_tokens = ds.docuscope_parse(corp_ref, nlp_model=nlp, n_process=4)

CPU times: user 2.2 s, sys: 231 ms, total: 2.43 s
Wall time: 8.5 s


Next, we will use `frequency_table` to generate 2 tables:

In [71]:
wc_target = ds.frequency_table(ds_tokens)
wc_ref = ds.frequency_table(ref_tokens)

To generate a table of keywords, we will use `keyness_table`, which takes both our target and reference frequency tables. The Yates correction can be applied by setting the `correct` argument to 'True'. Here we will leave the default, which is no correction.

In [72]:
kw = ds.keyness_table(wc_target, wc_ref)

The table returns the frequency data for both corpora, with columns for [log-likelihood](https://ucrel.lancs.ac.uk/llwizard.html) (`LL`, the test of significance), [Log Ratio](http://cass.lancs.ac.uk/log-ratio-an-informal-introduction/) (`LR`, an effect size measure), and the *p*-value (`PV`).

In [75]:
kw.head(10)

Token,Tag,LL,LR,PV,RF,RF_Ref,AF,AF_Ref,Range,Range_Ref
str,str,f64,f64,f64,f64,f64,u32,u32,f64,f64
"""of""","""IO""",217.586864,0.804786,3.0392e-49,38149.827516,21838.753516,5065,691,100.0,96.0
"""the""","""AT""",94.076679,0.349927,3.0353e-22,72382.989621,56793.400967,9610,1797,100.0,100.0
"""et al""","""RA""",85.930266,6.582033,1.8639e-20,1513.941822,0.0,201,0,12.0,0.0
"""is""","""VBZ""",83.80889,0.849238,5.4499e-20,13437.17518,7458.677033,1784,236,98.0,98.0
"""faculty""","""NN1""",70.356482,5.47014,4.95e-17,1400.961089,31.604564,186,1,4.0,2.0
"""these""","""DD2""",67.179713,2.23679,2.4785e-16,2681.409397,568.882147,356,18,96.0,32.0
"""this""","""DD1""",66.791235,1.042692,3.0184e-16,7682.689845,3729.338516,1020,118,100.0,84.0
"""students""","""NN2""",49.021193,4.15015,2.5321e-12,1122.275281,63.209127,149,2,20.0,4.0
"""education""","""NN1""",48.779503,4.997071,2.8642e-12,1009.294548,31.604564,134,1,14.0,2.0
"""study""","""NN1""",48.152184,3.348834,3.9439e-12,1287.980356,126.418255,171,4,48.0,2.0

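As a worked check on the `LL` and `LR` columns, the sketch below recomputes both for the first row (*of*) using the standard log-likelihood and Log Ratio formulas. The helper names are hypothetical, and the corpus sizes are not printed anywhere in the table, so they are recovered here from the `AF` and `RF` columns (assuming `RF` is a rate per million tokens).

```python
import math


def log_likelihood(a, b, c, d):
    """Log-likelihood for counts a and b in corpora of c and d tokens."""
    e1 = c * (a + b) / (c + d)  # expected count in the target corpus
    e2 = d * (a + b) / (c + d)  # expected count in the reference corpus
    # A zero count contributes nothing to the sum
    t1 = a * math.log(a / e1) if a > 0 else 0.0
    t2 = b * math.log(b / e2) if b > 0 else 0.0
    return 2 * (t1 + t2)


def log_ratio(a, b, c, d):
    """Binary log of the ratio of the two relative frequencies."""
    return math.log2((a / c) / (b / d))


# Values for "of", copied from the table above
af, af_ref = 5065, 691
rf, rf_ref = 38149.827516, 21838.753516

# Assumption: RF is per million tokens, so size = AF / RF * 1,000,000
n_target = round(af / rf * 1_000_000)
n_ref = round(af_ref / rf_ref * 1_000_000)

ll = log_likelihood(af, af_ref, n_target, n_ref)
lr = log_ratio(af, af_ref, n_target, n_ref)
print(n_target, n_ref, round(ll, 2), round(lr, 3))
```

The results agree with the `LL` and `LR` values reported for *of* to within rounding, which suggests the function follows these standard definitions.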

<div class="alert alert-success">
    
**Updates: Threshold specification**

As of v0.3.0, the `keyness_table` function allows users to set a significance threshold. This is because, when comparing even moderately sized corpora, a keyness table can become massive. Thus, the function now returns only those rows that reach the specified threshold, and only tokens whose frequency is significantly higher in the target corpus than in the reference corpus. To see the reverse (tokens significantly more frequent in the reference than in the target), swap the order of the frequency tables in the function call.

</div>

The default is `threshold=0.01`, which can be seen by looking at the tail of the table:

In [76]:
kw.tail(10)

Token,Tag,LL,LR,PV,RF,RF_Ref,AF,AF_Ref,Range,Range_Ref
str,str,f64,f64,f64,f64,f64,u32,u32,f64,f64
"""rail""","""NN1""",6.84022,2.930981,0.008913,120.512782,0.0,16,0,2.0,0.0
"""recognize""","""VVI""",6.84022,2.930981,0.008913,120.512782,0.0,16,0,18.0,0.0
"""relation""","""NN1""",6.84022,2.930981,0.008913,120.512782,0.0,16,0,10.0,0.0
"""replacement""","""NN1""",6.84022,2.930981,0.008913,120.512782,0.0,16,0,6.0,0.0
"""slope""","""NN1""",6.84022,2.930981,0.008913,120.512782,0.0,16,0,4.0,0.0
"""suggested""","""VVN""",6.84022,2.930981,0.008913,120.512782,0.0,16,0,16.0,0.0
"""technologies""","""NN2""",6.84022,2.930981,0.008913,120.512782,0.0,16,0,4.0,0.0
"""wazzan""","""NP1""",6.84022,2.930981,0.008913,120.512782,0.0,16,0,2.0,0.0
"""welfare""","""NN1""",6.84022,2.930981,0.008913,120.512782,0.0,16,0,10.0,0.0
"""how""","""RRQ""",6.701434,0.969116,0.009634,866.18562,442.463892,115,14,70.0,24.0

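The relationship between `LL`, `PV`, and the threshold can be checked by hand. The sketch below treats log-likelihood as a chi-square statistic with one degree of freedom, which reproduces the *p*-values in the rows above. The helper `p_from_ll` is hypothetical. The final lines record an inference from the output rather than documented behaviour: the `LR` reported for tokens absent from the reference corpus is consistent with replacing the zero count by 0.5, using corpus sizes back-calculated from the `AF` and `RF` columns.

```python
import math


def p_from_ll(ll):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(ll / 2))


threshold = 0.01

# (token, LL) pairs copied from the last rows of the table above,
# plus one made-up LL of 6.0 to show a value that would be dropped
rows = [("welfare", 6.84022), ("how", 6.701434), ("(made-up)", 6.0)]
for token, ll in rows:
    p = p_from_ll(ll)
    print(token, round(p, 6), "kept" if p < threshold else "dropped")

# Inference only: AF = 16, AF_Ref = 0 replaced by 0.5, with corpus sizes of
# 132766 and 31641 tokens recovered as AF / RF * 1,000,000
lr_zero = math.log2((16 / 132766) / (0.5 / 31641))
print(round(lr_zero, 3))
```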

Keyness tables can also be generated for counts of either part-of-speech or DocuScope tags. First, we prepare the frequency tables.

In [77]:
tag_ref = ds.tags_table(ref_tokens, count_by='pos')
tag_tar = ds.tags_table(ds_tokens, count_by='pos')
ds_ref = ds.tags_table(ref_tokens, count_by='ds')
ds_tar = ds.tags_table(ds_tokens,  count_by='ds')

We will set the `tags_only` argument to 'True', and we will also employ the Yates correction by setting `correct` to 'True'. The `threshold` is raised to 0.05 as well:

In [80]:
kt = ds.keyness_table(tag_tar, tag_ref, tags_only=True, correct=True, threshold=.05)

In [81]:
kt.head(10)

Tag,LL,LR,PV,RF,RF_Ref,AF,AF_Ref,Range,Range_Ref
str,f64,f64,f64,f64,f64,u32,u32,f64,f64
"""JJ""",258.236798,0.554966,4.1577e-58,8.58051,5.840523,11392,1848,100.0,100.0
"""IO""",217.909342,0.804786,2.5848e-49,3.814983,2.183875,5065,691,100.0,96.0
"""NN2""",107.912423,0.386003,2.8092000000000004e-25,6.888812,5.271641,9146,1668,100.0,100.0
"""NN1""",101.543168,0.223199,6.9923e-24,18.099513,15.505199,24030,4906,100.0,100.0
"""AT""",90.876836,0.340048,1.529e-21,7.324918,5.786796,9725,1831,100.0,100.0
"""RR""",81.123951,0.508681,2.1199000000000003e-19,3.134086,2.202838,4161,697,100.0,98.0
"""ZZ1""",67.0445,2.044044,2.6545e-16,0.299776,0.07269,398,23,54.0,28.0
"""VVZ""",62.211092,0.706523,3.0855e-15,1.35125,0.82804,1794,262,98.0,92.0
"""RGR""",57.142521,2.262496,4.0535e-14,0.227468,0.047407,302,15,86.0,22.0
"""DD1""",55.060338,0.732546,1.1689e-13,1.123782,0.676338,1492,214,100.0,94.0


We can do the same for the DocuScope frequency tables:

In [83]:
kds = ds.keyness_table(ds_tar, ds_ref, tags_only=True)

In [85]:
kds.sort("LR", descending=True).head()

Tag,LL,LR,PV,RF,RF_Ref,AF,AF_Ref,Range,Range_Ref
str,f64,f64,f64,f64,f64,u32,u32,f64,f64
"""CitationHedged""",6.981271,2.954139,0.008237,0.015617,0.0,17,0,20.0,0.0
"""AcademicWritingMoves""",51.654651,1.311183,6.6174e-13,0.530053,0.213606,577,53,94.0,52.0
"""AcademicTerms""",729.47416,1.205083,1.1655999999999999e-160,8.492793,3.683701,9245,914,100.0,98.0
"""InformationChange""",101.904145,1.1768,5.8274000000000006e-24,1.230054,0.544092,1339,135,100.0,80.0
"""MetadiscourseInteractive""",31.731942,1.143007,1.7699e-08,0.400525,0.181364,436,45,100.0,50.0


## Single document tag highlighting

Tags (either part-of-speech or DocuScope) can be highlighted in single documents. In order to facilitate the highlighting of tags, the `tag_ruler` function generates a data frame with the complete document text and the spans of tagged tokens. From that data frame, the original document text can be easily recovered, and any tags of interest can be filtered for highlighting.

To render the highlights, an additional package is needed. For this demonstration, we will use [ipymarkup](https://nbviewer.org/github/natasha/ipymarkup/blob/master/docs.ipynb), which is simple and flexible.

In [86]:
from ipymarkup import show_span_box_markup

When calling the `tag_ruler` function, a `doc_id` needs to be specified. The available values can be recovered easily from the tokens table:

In [90]:
ds_tokens.get_column("doc_id").unique().sort().head(5)

doc_id
str
"""acad_01.txt"""
"""acad_02.txt"""
"""acad_03.txt"""
"""acad_04.txt"""
"""acad_05.txt"""


In [91]:
df_pos = ds.tag_ruler(ds_tokens, doc_id='acad_17.txt', count_by='pos')

The data frame contains all tokens, tags and start/end of spans:

In [92]:
df_pos.head(20)

Token,Tag,tag_start,tag_end
str,str,u32,u32
"""In ""","""II""",0,2
"""the ""","""AT""",3,6
"""societal ""","""JJ""",7,15
"""realm ""","""NN1""",16,21
"""in ""","""II""",22,24
…,…,…,…
"""are ""","""VBR""",90,93
"""starkly ""","""RR""",94,101
"""defined""","""VVN""",102,109
""". ""","""Y""",109,110


The output can easily be filtered, as it is here for part-of-speech tags starting with 'N' (nouns):

In [93]:
df_n = df_pos.filter(pl.col("Tag").str.starts_with("N"))
df_n.head(10)

Token,Tag,tag_start,tag_end
str,str,u32,u32
"""realm ""","""NN1""",16,21
"""Middlemarch ""","""NP1""",31,42
"""demarcation ""","""NN1""",56,67
"""women ""","""NN2""",76,81
"""men ""","""NN2""",86,89
"""Notions ""","""NN2""",111,118
"""male ""","""NN1""",122,126
"""character ""","""NN1""",138,147
"""perspective""","""NN1""",176,187
"""reading ""","""NN1""",229,236


First, we will reconstruct the document text from the **full** data frame.

In [95]:
text = ''.join(df_pos['Token'].to_list())

Next, we will construct a list of tuples from the **filtered** data frame, using the `tag_start`, `tag_end` and `Tag` columns:

In [96]:
spans = list(zip(list(df_n['tag_start']), list(df_n['tag_end']), list(df_n['Tag'])))

Finally, we can use `show_span_box_markup` to highlight the tags:

In [97]:
show_span_box_markup(text, spans)
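
To see why the text must come from the full data frame while the spans may come from a filtered one, it helps to look at how the offsets line up. The sketch below is a plain-Python illustration using the first five tokens shown earlier for `acad_17.txt`. It assumes that `tag_start` and `tag_end` are character positions in the joined text and that each span excludes the token's trailing whitespace, which matches the printed values.

```python
# Tokens and tags from the first rows of the tag_ruler output;
# each token keeps its trailing whitespace
tokens = ["In ", "the ", "societal ", "realm ", "in "]
tags = ["II", "AT", "JJ", "NN1", "II"]

# Joining every token restores the running text
text = "".join(tokens)

spans = []
offset = 0
for token, tag in zip(tokens, tags):
    start = offset
    end = start + len(token.rstrip())  # span stops before the whitespace
    spans.append((start, end, tag))
    offset += len(token)  # the next token starts after the whitespace

# Slicing the full text with each span recovers the bare token
recovered = [text[start:end] for start, end, _ in spans]
print(spans)
print(recovered)
```

Because the offsets index into the complete text, dropping rows from the data frame does not shift them, so filtered spans still point at the right characters.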

The same thing can be done for DocuScope tags by switching `count_by` to 'ds':

In [99]:
df_ds = ds.tag_ruler(ds_tokens, doc_id='acad_37.txt', count_by='ds')
df_ds.head(20)

Token,Tag,tag_start,tag_end
str,str,u32,u32
"""Often ""","""Narrative""",0,5
"""referred ""","""InformationReportVerbs""",6,14
"""to ""","""InformationReportVerbs""",15,17
"""as ""","""InformationReportVerbs""",18,20
"""the ""","""Untagged""",21,24
…,…,…,…
"""argument ""","""AcademicTerms""",83,91
"""about ""","""Untagged""",92,97
"""the ""","""Untagged""",98,101
"""existence ""","""Untagged""",102,111


This time, we'll filter for tags related to expressions of confidence:

In [100]:
df_c = df_ds.filter(pl.col("Tag").str.starts_with("Conf"))
df_c.head(10)

Token,Tag,tag_start,tag_end
str,str,u32,u32
"""very ""","""ConfidenceHigh""",66,70
"""clearly ""","""ConfidenceHigh""",371,378
"""distinctly ""","""ConfidenceHigh""",383,393
"""clearly ""","""ConfidenceHigh""",563,570
"""distinctly ""","""ConfidenceHigh""",575,585
"""is ""","""ConfidenceHigh""",596,598
"""true""","""ConfidenceHigh""",599,603
"""are ""","""ConfidenceHigh""",729,732
"""true""","""ConfidenceHigh""",733,737
"""clearly ""","""ConfidenceHigh""",789,796


Again, the text is reconstructed from the full data frame, and the spans are taken from the filtered one:

In [101]:
text = ''.join(df_ds['Token'].to_list())
spans = list(zip(list(df_c['tag_start']), list(df_c['tag_end']), list(df_c['Tag'])))
show_span_box_markup(text, spans)

## Compatibility with tmtoolkit

The **docuscospacy** package no longer requires **tmtoolkit** as a dependency. However, some functions are included that allow users to move data between the two.

All necessary pre-processing is now done inside the `docuscope_parse` function. If you choose to use tmtoolkit, you will need to explicitly define your own pre-processing function. **For accurate tagging**, possessive *its* should be split into two tokens. The last step of the function below collapses carriage returns, tabs, and runs of spaces into single spaces.

<div class="alert alert-info">

**Note: Adding pre-processing functions**

You can also pass other functions as part of the `raw_preproc` argument in a list. For example, `raw_preproc=[pre_process, simplify_unicode_chars]` would add a function built into **tmtoolkit** that replaces accented characters with unaccented ones.

</div>

In [102]:
import re
from tmtoolkit.corpus import Corpus

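def pre_process(txt):
    # Split possessive "its"/"Its" into two tokens for accurate tagging
    txt = re.sub(r'\bits\b', 'it s', txt)
    txt = re.sub(r'\bIts\b', 'It s', txt)
    # Collapse carriage returns, tabs, and repeated spaces into single spaces
    txt = " ".join(txt.split())
    return txt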
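
As a quick check of what this function does, the sketch below repeats the definition so that it runs on its own and applies it to a made-up string containing both forms of *its* along with irregular whitespace.

```python
import re


def pre_process(txt):
    # Same logic as the function above, repeated so the example is standalone
    txt = re.sub(r'\bits\b', 'it s', txt)
    txt = re.sub(r'\bIts\b', 'It s', txt)
    return " ".join(txt.split())


sample = "Its  claws\tand its\nteeth"
cleaned = pre_process(sample)
print(cleaned)  # It s claws and it s teeth

# The contraction is left alone because "it's" never contains the string "its"
print(pre_process("it's fine"))  # it's fine
```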

In [103]:
corp = Corpus.from_folder('data/tar_corpus', spacy_instance=nlp, raw_preproc=[pre_process], spacy_token_attrs=['tag', 'ent_iob', 'ent_type', 'is_punct'])

### Converting a corpus

To convert a tmtoolkit Corpus object, use the `from_tmtoolkit` function.

<div class="alert alert-info">

**Note: `convert_corpus` function**

Note that the `convert_corpus` function has been deprecated. Use the `from_tmtoolkit` function instead.

</div>

In [105]:
tm_corpus = ds.from_tmtoolkit(corp)

The result is a docuscospacy tokens DataFrame with one row per token and a `doc_id` column identifying the source document:

In [106]:
tm_corpus.head()

doc_id,token,pos_tag,ds_tag,pos_id,ds_id
str,str,str,str,u32,u32
"""acad_01""","""In ""","""II""","""Untagged""",1,1
"""acad_01""","""the ""","""AT""","""Untagged""",2,2
"""acad_01""","""field ""","""NN1""","""Untagged""",3,3
"""acad_01""","""of ""","""IO""","""Untagged""",4,4
"""acad_01""","""plant ""","""NN1""","""InformationTopics""",5,5


A **dtm** can also be passed to **tmtoolkit** functions to create normalized counts (using the `tf_proportions` function), [tf-idf values](https://tmtoolkit.readthedocs.io/en/latest/bow.html#Term-frequency%E2%80%93inverse-document-frequency-transformation-(tf-idf)) (using the `tfidf` function), or other kinds of data structures.

In [110]:
from tmtoolkit.bow.bow_stats import tf_proportions, tfidf
from tmtoolkit.bow.dtm import dtm_to_dataframe

Beginning with version 0.12.0 of **tmtoolkit**, matrices must first be converted into a COOrdinate format. This can be done using the `dtm_to_coo` function.

In [107]:
tags_coo, docs, vocab = ds.dtm_to_coo(tm)

In [108]:
tags_coo

<COOrdinate sparse matrix of dtype 'uint32'
	with 1657 stored elements and shape (50, 37)>
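
For readers unfamiliar with the COOrdinate layout, the following plain-Python sketch shows the idea on a 3 × 3 excerpt of the tag counts (three documents, three tags, values taken from the count table in this section). It illustrates the format only. It is not how `dtm_to_coo` is implemented.

```python
docs = ["acad_01.txt", "acad_02.txt", "acad_05.txt"]
vocab = ["Citation", "InformationPlace", "Interactive"]
dense = [
    [0, 0, 3],
    [12, 9, 22],
    [3, 0, 7],
]

# COO keeps three parallel lists (row index, column index, value),
# with one entry for every non-zero cell
rows, cols, vals = [], [], []
for i, row in enumerate(dense):
    for j, value in enumerate(row):
        if value != 0:
            rows.append(i)
            cols.append(j)
            vals.append(value)

print(len(vals), "stored elements for shape", (len(dense), len(dense[0])))
for i, j, v in zip(rows, cols, vals):
    print(docs[i], vocab[j], v)
```

Only non-zero cells are stored, which is why the full matrix reports 1657 stored elements rather than 50 × 37 = 1850.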

These can now be processed using various **tmtoolkit** functions:

In [111]:
dtm_to_dataframe(tags_coo, docs, vocab).head()

Unnamed: 0,Untagged,AcademicTerms,Character,Narrative,Description,InformationExposition,InformationTopics,Negative,Positive,MetadiscourseCohesive,Reasoning,ForceStressed,PublicTerms,Strategic,InformationStates,InformationChange,ConfidenceHedged,InformationReportVerbs,Citation,InformationPlace,Interactive,Inquiry,Future,ConfidenceHigh,Contingent,AcademicWritingMoves,Facilitate,MetadiscourseInteractive,Updates,InformationChangePositive,CitationAuthority,FirstPerson,Responsibility,InformationChangeNegative,Uncertainty,ConfidenceLow,CitationHedged
acad_01.txt,324,127,15,66,70,57,15,10,9,12,26,7,4,10,9,10,15,17,0,0,3,18,3,3,0,16,1,3,0,1,2,0,2,0,0,0,0
acad_02.txt,760,255,79,133,132,157,74,67,66,97,51,54,18,24,33,40,60,38,12,9,22,8,20,20,38,5,7,3,8,26,3,9,0,2,1,1,1
acad_03.txt,2392,844,465,422,435,428,240,201,160,142,160,126,52,78,124,130,137,57,415,49,39,82,42,30,43,20,28,31,21,47,23,42,3,32,9,1,3
acad_04.txt,373,72,28,64,161,73,29,31,42,39,35,17,22,35,12,12,19,23,3,9,7,6,11,4,6,24,12,1,1,2,2,1,2,1,0,0,0
acad_05.txt,651,200,47,133,172,79,77,73,18,42,52,33,2,14,33,65,21,27,3,0,7,10,21,5,19,17,7,5,3,0,0,1,2,0,0,1,0


In [112]:
tfidf_coo = tfidf(tags_coo)
dtm_to_dataframe(tfidf_coo, docs, vocab).head()

Unnamed: 0,Untagged,AcademicTerms,Character,Narrative,Description,InformationExposition,InformationTopics,Negative,Positive,MetadiscourseCohesive,Reasoning,ForceStressed,PublicTerms,Strategic,InformationStates,InformationChange,ConfidenceHedged,InformationReportVerbs,Citation,InformationPlace,Interactive,Inquiry,Future,ConfidenceHigh,Contingent,AcademicWritingMoves,Facilitate,MetadiscourseInteractive,Updates,InformationChangePositive,CitationAuthority,FirstPerson,Responsibility,InformationChangeNegative,Uncertainty,ConfidenceLow,CitationHedged
acad_01.txt,0.258933,0.101495,0.011988,0.052746,0.055942,0.045553,0.01216,0.007992,0.007193,0.00959,0.020779,0.005594,0.003197,0.007992,0.007403,0.007992,0.011988,0.013586,0.0,0.0,0.002432,0.014593,0.002504,0.002398,0.0,0.013357,0.000811,0.002398,0.0,0.000874,0.001834,0.0,0.001964,0.0,0.0,0.0,0.0
acad_02.txt,0.222591,0.074685,0.023138,0.038953,0.03866,0.045983,0.021986,0.019623,0.01933,0.02841,0.014937,0.015816,0.005272,0.007029,0.009948,0.011715,0.017573,0.01113,0.003843,0.002928,0.006536,0.002377,0.006119,0.005858,0.011455,0.00153,0.00208,0.000879,0.002412,0.008327,0.001008,0.003558,0.0,0.00092,0.000395,0.000607,0.000734
acad_03.txt,0.216396,0.076354,0.042067,0.038177,0.039353,0.03872,0.022025,0.018184,0.014475,0.012846,0.014475,0.011399,0.004704,0.007056,0.011546,0.011761,0.012394,0.005157,0.041056,0.004925,0.003579,0.007525,0.003969,0.002714,0.004004,0.00189,0.00257,0.002804,0.001955,0.00465,0.002388,0.005129,0.000334,0.004544,0.001099,0.000188,0.00068
acad_04.txt,0.216174,0.041728,0.016228,0.037091,0.093308,0.042307,0.017049,0.017966,0.024341,0.022603,0.020284,0.009852,0.01275,0.020284,0.007158,0.006955,0.011012,0.01333,0.001901,0.005795,0.004115,0.003527,0.006659,0.002318,0.003579,0.01453,0.007055,0.00058,0.000597,0.001268,0.00133,0.000782,0.001425,0.00091,0.0,0.0,0.0
acad_05.txt,0.241753,0.074271,0.017454,0.04939,0.063873,0.029337,0.029007,0.027109,0.006684,0.015597,0.019311,0.012255,0.000743,0.005199,0.012614,0.024138,0.007798,0.010027,0.001218,0.0,0.002637,0.003767,0.008146,0.001857,0.007262,0.006595,0.002637,0.001857,0.001147,0.0,0.0,0.000501,0.000913,0.0,0.0,0.00077,0.0
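
The tf-idf values can be traced back to the raw counts. The sketch below recomputes the `Untagged` cell for `acad_01.txt`, assuming a smoothed inverse document frequency of ln(1 + N / (1 + df)). That formula is an assumption about the defaults in use, not something stated in this document, and so is the document frequency of 50 (that is, `Untagged` occurring in every document).

```python
import math

n_docs = 50      # number of rows, from the matrix shape (50, 37)
doc_freq = 50    # assumption: "Untagged" occurs in all 50 documents
count = 324      # "Untagged" count for acad_01.txt
doc_total = 855  # sum of all tag counts in the acad_01.txt row

# Term frequency as a proportion of the document's tags
tf = count / doc_total

# Assumed smoothed idf: ln(1 + N / (1 + df))
idf = math.log(1 + n_docs / (1 + doc_freq))

tfidf_value = tf * idf
print(round(tf, 6), round(idf, 6), round(tfidf_value, 4))
```

Under these assumptions the product comes out at about 0.2589, in line with the first cell of the table above.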
