# TACS: Documentation

## Installation

TACS requires `Python` and the libraries `pandas` and `spacy`; it also uses `re`, which is part of the Python standard library. Visualisations require `Jupyter Notebook`.

In [227]:
%run tacs.py

## Introduction

### Framework

The dictionary consists of words and phrases (**Lexemes**). At the lowest level, lexemes that are lexical variants (_pass phrase_ and _password_) or very similar in meaning (_security gap_ and _security hole_) are grouped into **Terms**. Each term is assigned to one of four **Domains**: Cyber Security (e.g., _hacker, password_), Security (e.g., _safety, danger_), Cyber (e.g., _mobile phone, user_), and General (e.g., _company, child_). Terms are further grouped into the broader **Concepts** to which they relate (e.g., the **Terms** _Exploit_ and _Security Flaw_ are grouped under the **Concept** _Vulnerability_). Concepts are assigned to one of several categories, nested within two over-arching dimensions, Security and Context, which can be thought of as separate dictionaries. The **Security Categories** are: 'Cybersecurity', 'Security Actor', 'Security General', 'Security Mechanism', 'Threat Actor', 'Threat General', 'Threat Mechanism'. The **Context Categories** are: 'Activity', 'Cyber Entity', 'Individual', 'Org', 'Quality'.
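The nesting described above (lexeme, term, domain, concept, category, dimension) can be pictured as a small lookup structure. The following is an illustrative sketch only, not how TACS stores its dictionary: the names `Term`, `CONCEPTS`, `TERMS` and `lookup` are hypothetical, and the term names and domain codes assigned to the example lexemes are assumptions. Only the concept-to-category pairs are taken from the category table in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    name: str       # term label
    domain: str     # domain code: "CS", "S", "C" or "G"
    concept: str    # broader concept the term belongs to
    lexemes: tuple  # lexical variants grouped under this term


# Concept -> (dimension, category), as listed in the category table.
CONCEPTS = {
    "Vulnerability": ("Security", "Threat Mechanism"),
    "Hacker": ("Security", "Threat Actor"),
}

# ASSUMPTION: term names and domain codes below are illustrative guesses.
TERMS = [
    Term("Security Flaw", "CS", "Vulnerability", ("security gap", "security hole")),
    Term("Exploit", "CS", "Vulnerability", ("exploit",)),
    Term("Hacker", "CS", "Hacker", ("hacker",)),
]

# Flat index from each lexeme to the term that contains it.
LEXEME_INDEX = {lex: term for term in TERMS for lex in term.lexemes}


def lookup(lexeme):
    """Return the path from dimension down to term for a lexeme, or None."""
    term = LEXEME_INDEX.get(lexeme.lower())
    if term is None:
        return None
    dimension, category = CONCEPTS[term.concept]
    return {
        "dimension": dimension,
        "category": category,
        "concept": term.concept,
        "term": term.name,
        "domain": term.domain,
    }
```

Calling `lookup("security hole")` walks up the hierarchy from the lexeme to its dimension; a lexeme that is not in the index yields `None`.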

The `tacs_show()` function with the `out='vis'` option visualises the framework as an interactive sunburst diagram in a Jupyter Notebook. The option `out='table'` produces a table at the level chosen via `level`. If `level='category'` is selected, the table shows one category per row, with concepts and their domains listed in a string. If `level='concept'`, each concept is displayed followed by a string of terms.

In [228]:
tacs_show(context=True, out='vis')

In [25]:
tacs_show('table',level='cat')

| dict | cat | concept_domain |
|---|---|---|
| Security | Threat Mechanism | Vulnerability(S), Vulnerability(CS), Misc Thre... |
| Security | Threat General | Harm(S), Crime(S), Threat(S), Risk(CS), Risk(S... |
| Security | Threat Actor | Hacker(CS), Criminal(S), Attacker(S) |
| Security | Security Mechanism | Updates(G), Updates(CS), Updates(S), Training(... |
| Security | Security General | Resilience(S), Confidentiality(S), Authenticit... |
| Security | Security Actor | Security Expert(CS), Security Expert(S), Secur... |
| Security | Cybersecurity | Cybersecurity(CS), Cybersecurity(S) |
| Context | Quality | Positive(S), Positive(G), Negative(G) |
| Context | Org | Tech(C), Organisation(S), Medical(S), Law & Or... |
| Context | Individual | User(C), User(G), Relationship(G), Pronoun3(G)... |


## Tagging & Tokenisation

At the foundation of all TACS operations is the tagging function, which identifies each dictionary lexeme in a document. The function returns a tokenised version of the document, in which each token is represented by a tuple containing the token and its respective category or, if the word is not in the dictionary, the string 'None'.
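To make the output format concrete, here is a minimal sketch of dictionary tagging that produces `(token, category)` tuples. It is a toy under stated assumptions, not the `tacs_tag` implementation: the function name `tag_sketch`, the three-entry `LEXICON` and the categories assigned to its entries are hypothetical, and tokenisation is a plain regular expression rather than spaCy.

```python
import re

# ASSUMPTION: toy lexicon with illustrative categories, not the TACS dictionary.
LEXICON = {
    "password": "Security Mechanism",
    "hacker": "Threat Actor",
    "mobile phone": "Cyber Entity",
}
# Longest lexeme, in words, so multi-word phrases can be matched.
MAX_LEN = max(len(lexeme.split()) for lexeme in LEXICON)


def tag_sketch(text):
    """Tokenise text and pair each token with its category or the string 'None'."""
    words = re.findall(r"\w+", text.lower())
    tagged = []
    i = 0
    while i < len(words):
        # Try the longest phrase first so 'mobile phone' wins over 'mobile'.
        for n in range(MAX_LEN, 0, -1):
            if i + n > len(words):
                continue
            phrase = " ".join(words[i:i + n])
            if phrase in LEXICON:
                tagged.append((phrase, LEXICON[phrase]))
                i += n
                break
        else:
            # No lexeme starts at this position.
            tagged.append((words[i], "None"))
            i += 1
    return tagged
```

Matching the longest phrase first is one simple way to treat multi-word lexemes as single tokens; the real tagger may resolve overlaps differently.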

In [31]:
tacs_tag('the boy is bullied online')

the boy is bullied online

In [32]:
tacs_annotate_doc(tacs_tag('the boy is bullied online'),render=True)

*TACS offers a number of helper functions to render and process tokenised documents. These are explained in greater detail in the supplementary materials. Here we use the render function, which returns the annotated document in HTML format.

### Context-aware tagging

The tagger tags non-cybersecurity terms only if they appear in the vicinity of cybersecurity content. Currently, context is determined by a simple rule, `cs|(c&s)`: a non-cs term is tagged only if its `n` neighbouring terms contain either one cybersecurity term, or one cyber term and one security term.

In [34]:
tacs_annotate_doc(tacs_tag('the boy is bullied online'),render=True)
tacs_annotate_doc(tacs_tag('the boy is bullied in school'),render=True)
tacs_annotate_doc(tacs_tag('the boy is shopping in school'),render=True)

The context rule offers a few simple parameters:
- it can be disabled altogether via `context_rule=False`
- the default context rule `cs|(c&s)` can be replaced with a more conservative version, `context_rule='cs'`, in which the neighbours must include a cybersecurity term
- the window of neighbours can be specified, e.g. `context_window=10`
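The rule and its parameters can be sketched as a single predicate. This is a hedged illustration, not the TACS code: the function name `in_context` and the per-token domain codes (`'cs'`, `'c'`, `'s'`, `'g'`, or `None` for untagged tokens) are assumptions, as is the reading of the window as `context_window` tokens on either side of the term.

```python
def in_context(domains, i, context_rule="cs|(c&s)", context_window=10):
    """Decide whether the non-cs term at position i should be tagged.

    domains: list of domain codes per token ('cs', 'c', 's', 'g')
             or None for tokens that are not in the dictionary.
    """
    if context_rule is False:
        return True  # rule disabled: every dictionary term is tagged
    # ASSUMPTION: the window spans context_window tokens on each side.
    start = max(0, i - context_window)
    neighbours = domains[start:i] + domains[i + 1:i + 1 + context_window]
    has_cs = "cs" in neighbours
    if context_rule == "cs":
        return has_cs  # conservative rule: a cybersecurity neighbour is required
    # Default rule cs|(c&s): one cs term, or one cyber plus one security term.
    return has_cs or ("c" in neighbours and "s" in neighbours)


# 'the boy is shopping online using a password'; domain codes are assumed.
doc = [None, "g", None, "g", "c", None, None, "cs"]
wide = in_context(doc, 1, context_rule="cs", context_window=10)
narrow = in_context(doc, 1, context_rule="cs", context_window=4)
```

Under these assumptions `wide` is true and `narrow` is false, because _password_ lies six tokens away from _boy_ and therefore falls outside a window of four.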

Note the different parameters in the following examples:

In [33]:
tacs_annotate_doc(tacs_tag('the boy is shopping online',context_rule=False),render=True)
tacs_annotate_doc(tacs_tag('the boy is shopping online',context_rule='cs'),render=True)
tacs_annotate_doc(tacs_tag('the boy is shopping online using a password',context_rule='cs'),render=True)
tacs_annotate_doc(tacs_tag('the boy is shopping online using a password',context_rule='cs',context_window=4),render=True)

## Counting

> ### TODO
> - path to folder as input; subfolders as grouping var

To demonstrate this function, we first import a dataset, `data`.

In [109]:
from pathlib import Path
import pandas as pd

# The corpus folder sits next to the current working directory
corpuspath = Path.cwd().parent / 'corpus'
data = pd.read_csv(corpuspath / 'data.csv')
# Draw a random sample of 100 documents for the demonstration
data = data.sample(100).reset_index(drop=True)

The simplest form of analysis is counting the number of times TACS terms occur in text. The counting function provides the options to count Concepts or Categories, specify a grouping variable or count in each document. 

**Input:** The input of the counting function can be a list of documents or a table (dataframe). The function first checks whether the input contains documents that are already preprocessed with TACS (spaCy docs). If not, it looks for full-text documents and processes them (this takes time). If preprocessing is required, the full-text column in the dataframe should be labelled 'text'; otherwise its label must be specified via the `textcol` parameter.

**Grouping:** The function allows for 3 types of aggregation, specified via the `aggr` parameter: 
- `all` aggregates all documents and returns a single column for the entire dataset
- `each` returns a single column per document
- a custom grouping variable in the form of a list (corresponding to the length and order of documents) or the name of a column in the dataframe

**Output:** Table with dictionary categories as rows and documents as columns. Dictionary levels can optionally be aggregated (e.g., Concept counts summed into Categories). If counting across documents (`aggr='all'`), the most frequent instances of the aggregated level (e.g., Terms if Concept counts are shown; Concepts for Categories) are included in the table as a string.
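The three aggregation modes can be illustrated with a small pandas sketch. It assumes tagged documents are lists of `(token, category)` tuples, as returned by the tagger; the function name `count_sketch` and the `doc_<i>` column labels are hypothetical, and it omits levels, normalisation and the dataframe-column form of grouping that `tacs_count` supports.

```python
from collections import Counter

import pandas as pd


def count_sketch(tokdocs, aggr="all"):
    """Count categories in tagged documents.

    tokdocs: list of documents, each a list of (token, category) tuples.
    aggr: 'all', 'each', or a list with one group label per document.
    """
    if aggr == "all":
        groups = ["all"] * len(tokdocs)
    elif aggr == "each":
        groups = [f"doc_{i}" for i in range(len(tokdocs))]
    else:
        groups = list(aggr)
        if len(groups) != len(tokdocs):
            raise ValueError("grouping list must match the number of documents")

    counts = {}
    for group, doc in zip(groups, tokdocs):
        counter = counts.setdefault(group, Counter())
        # Skip tokens that are not in the dictionary.
        counter.update(cat for _, cat in doc if cat != "None")

    # Categories as rows, groups (or documents) as columns; missing counts become 0.
    table = pd.DataFrame({group: dict(c) for group, c in counts.items()})
    return table.fillna(0).astype(int)
```

Passing a list such as `['newspapers', 'magazines', 'newspapers']` as `aggr` sums the counts of the first and third documents into one column.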

**Parameters:**
- level
- aggregate
- normalisation
- ...

In [113]:
data['tokdocs'] = tacs_tag(data.text)

In [114]:
tacs_count(data, level='concept', aggr='srctype', subcount = True)

| dict | cat | concept | doc_newspapers | doc_tradejournals | doc_magazines |
|---|---|---|---|---|---|
| Context | Activity | Access | 1 | 4 | 1 |
| Context | Activity | Comm | 22 | 29 | 13 |
| Context | Activity | Financial | 0 | 0 | 1 |
| Context | Activity | Political | 0 | 0 | 0 |
| Context | Activity | Sharing | 1 | 2 | 1 |
| Context | Activity | UI | 2 | 5 | 8 |
| Context | Cyber Entity | Asset | 68 | 162 | 59 |
| Context | Cyber Entity | Device | 84 | 78 | 56 |
| Context | Cyber Entity | Digi Comm | 1 | 2 | 3 |
| Context | Cyber Entity | Internet | 83 | 122 | 94 |


## Annotation

Annotates documents such that TACS terms appear in bold, with their TACS category as a subscript. Annotated documents can be exported to a single HTML file (`save_annot=True`) and/or displayed within a Jupyter Notebook (`show_annot=True`). Currently, only HTML annotation is supported; future versions may add a DOCX variant.
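The markup itself is simple. The sketch below shows one way a tagged document could be turned into HTML with bold terms and subscripted categories; `annotate_sketch` is a hypothetical name, and the exact markup that `tacs_annotate` emits may differ.

```python
from html import escape


def annotate_sketch(tokdoc):
    """Render (token, category) tuples as HTML: bold term, category as subscript."""
    parts = []
    for token, cat in tokdoc:
        if cat == "None":
            # Tokens outside the dictionary are passed through unchanged.
            parts.append(escape(token))
        else:
            parts.append(f"<b>{escape(token)}</b><sub>{escape(cat)}</sub>")
    return " ".join(parts)


html_doc = annotate_sketch([("the", "None"), ("hacker", "Threat Actor")])
```

In a notebook, the resulting string can be shown with `IPython.display.HTML(html_doc)`, or written to an `.html` file to save it.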

In [209]:
tacs_annotate(data[:3], output = 'html', custom=False, level='concept', context=True)

## Queries

The query function returns documents or sentences in which certain TACS categories appear. Queries can be specified using category keys (e.g. 'tttp') or labels (e.g. 'Threat Mechanism') linked through AND and OR operators. For example, `Mobile Device AND tttp` would look for sentences (or documents) in which both of these categories are mentioned. By default, documents are split into sentences (`qsents=True`) and each sentence is queried individually; if this is disabled, full documents are queried.

Currently, query returns True if the specified condition is met at least once. Future iterations might implement an option to specify a custom threshold, e.g., Mobile Device mentioned > 3.
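The "met at least once" semantics amount to a membership test on the set of categories found in each unit. The following is a minimal sketch under stated assumptions, not the `tacs_query` implementation: `matches` and `query_sketch` are hypothetical names, only category labels (not keys) are handled, AND binds tighter than OR, and parentheses are not parsed.

```python
def matches(query, categories):
    """True if the set of categories satisfies labels joined by AND / OR.

    AND binds tighter than OR; parentheses are not handled in this sketch.
    """
    return any(
        all(label.strip() in categories for label in clause.split(" AND "))
        for clause in query.split(" OR ")
    )


def query_sketch(sentences, query):
    """Return one True/False per sentence (a list of (token, category) tuples)."""
    results = []
    for sent in sentences:
        # A category counts as present if it occurs at least once.
        cats = {cat for _, cat in sent if cat != "None"}
        results.append(matches(query, cats))
    return results
```

A count threshold, as suggested for future iterations, would replace the set with a `Counter` and compare each label's count against the threshold.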

**Output:**
- Data: The result of the query can be saved as a table, with one searched unit (document or sentence) per row and a column for the query outcome (True, i.e., condition met, or False). The id and full text of the unit are included. Selected via `save_data=True`.
- Annotated documents*: The queried documents are returned in full. If sentences are queried separately, sentences matching the query are highlighted. Selected via `fulldoc=True`.
- Excerpts*: Documents are returned as an itemised list of sentences matching the query. Selected via `fulldoc=False`.

*As per the Annotation function, TACS terms appear in bold with categories as subscripts; the output can be saved (`save_annot=True`) and/or displayed in a Jupyter Notebook (`show_annot=True`). Currently, only HTML annotation is supported; future versions may add a DOCX variant.

**Parameters:**
- ...

In [218]:
%run tacs.py

In [226]:
tacs_query(data[:3], 'tttp', fulldoc=True, save_data=False, annot='html', show_annot = True, custom=False, level='concept', context=True)



## Co-occurrence

In this example, the tokens function is used to select only the words and phrases that appear near TACS terms. Topic modelling and visualisation are performed by the external libraries `gensim` and `pyLDAvis`.
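Selecting tokens "near" TACS terms can be read as a positional window filter. The sketch below shows that idea only; `near_terms` is a hypothetical name, the window is assumed to extend `topic_window` positions on either side of each tagged token, and the real `tacs_tokens` additionally maps terms to concept labels and can remove stop words and extremes.

```python
def near_terms(tokdoc, topic_window=8):
    """Keep tokens lying within topic_window positions of any tagged token."""
    # Positions of tokens that carry a dictionary category.
    hits = [i for i, (_, cat) in enumerate(tokdoc) if cat != "None"]
    keep = set()
    for i in hits:
        start = max(0, i - topic_window)
        stop = min(len(tokdoc), i + topic_window + 1)
        keep.update(range(start, stop))
    # Return the surviving tokens in their original order.
    return [tokdoc[i][0] for i in sorted(keep)]
```

Documents without any tagged token yield an empty list, so they contribute nothing to the topic model.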

In [None]:
dat = data[data.corpuscat == "cs media newspapers"].text
tokdocs = tacs_tokens(tacs_tag(dat), level='concept',show=True, cs_only=True, topic_window=8,
                                             remove_stops = True, trim_extremes=True)
quicktm(tokdocs, num_topics = 10)

## Queries: keyword in context

In [None]:
q = 'Mobile Device'
kwic([tok for doc in tacs_tag(dat) for tok in doc],q,2)

---

> ### TODO
> - make counting work without notebook? simple interface--> link to folder, saves count table in folder? 
>
> #### Fixes
> - improve accuracy
> - spacy tagging
> - do without textacy and trasnf()
> - rewrite summarise(): same as vis(); incorporate grpsummary()
> - fix aggr=each
>
> #### TM
> - allocate topics to documents
>
> #### Queries / Object-centred
> - rewrite: input should be query, such as ((concept & concept)|concept)
> - better/slower version would be with sentences--spacy (or nltk if faster?)
>     - `doc > sent > tok` structure and rewrite all functions to work with it?
> - incorporate render(); different output options
> - use spacy to get s-v-o and q
> - use spacy to get full sentences


-----------
# Supplementary / Under the hood 

## Tagging

In [None]:
sample['tokdocs'] = tacs_tag(sample.text)
sample['tokdocs_tacsonly'] = tacs_tag(sample.text, tacs_only=True)

In [None]:
tagdemo = pd.DataFrame()
tagdemo['text'] = [sample.text[2][:600]]
tagdemo['tokdocs'] = [sample.tokdocs[2][:150]]
tagdemo['tokdocs_tacsonly'] = [sample.tokdocs_tacsonly[2][:100]]

In [None]:
tagdemo.style.set_properties(**{'text-align': 'left'})

## Token Actions

In [None]:
tokdemo = pd.DataFrame()
tokdemo['tokdocs'] = [sample.tokdocs_tacsonly[0][:100]]
tokdemo['tokdocs_to_concepts'] = [tacs_tokens(sample.tokdocs_tacsonly, level='concept',show=True)[0][:100]]
tokdemo['tokdocs_to_cats'] = [tacs_tokens(sample.tokdocs_tacsonly, level='category',show=True)[0][:100]]
tokdemo['tokdocs_concepts'] = [tacs_tokens(sample.tokdocs, level='concept',show=True, cs_only=False, 
                                             remove_stops = False, trim_extremes=False)[0][:100]]
# Distinct column names so the second window size does not overwrite the first
tokdemo['tokdocs_trimtopic_10'] = [tacs_tokens(sample.tokdocs, level='concept',show=True, cs_only=True,
                                            topic_window=10,
                                             remove_stops = False, trim_extremes=False)[0][:100]]
tokdemo['tokdocs_trimtopic_3'] = [tacs_tokens(sample.tokdocs, level='concept',show=True, cs_only=True,
                                            topic_window=3,
                                             remove_stops = False, trim_extremes=False)[0][:100]]



In [None]:
tokdemo.style.set_properties(**{'text-align': 'left'})

---