# TACS: Documentation

## Installation

TACS consists of two files:
- `csd.csv`: Dictionary data with one Term per row, categories it belongs to as columns, individual lexemes listed in a single string.
- `tacs.py`: A Python file with functions to compile the dictionary from the dictionary data and to use it.

Running TACS requires [`Python`](https://wiki.python.org/moin/BeginnersGuide/Download) and the libraries [`pandas`](https://pandas.pydata.org/getting_started.html) and [`spacy`](https://spacy.io/usage). Visualisations require `Jupyter Notebook`. With all required packages installed, download the files `tacs.py` and `csd.csv` and run `tacs.py` in a desired Python application.

In [16]:
%run tacs.py

## Introduction

### Framework

The dictionary consists of words and phrases (**Lexemes**). At the lowest level, lexemes that are lexical variants (_pass phrase_ and _password_) or very similar in meaning (_security gap_ and _security hole_) are grouped into **Terms**. Each term is assigned to 1 of 4 **Domains**: Cyber Security (e.g., _hacker, password_), Security (e.g., _safety, danger_), Cyber (e.g., _mobile phone, user_), and General (e.g., _company, child_). Terms are further grouped into the broader **Concepts** to which they relate (e.g., the **Terms** _Exploit_ and _Security Flaw_ are grouped under the **Concept** _Vulnerability_). Concepts are assigned to 1 of several categories, nested within 2 over-arching dimensions, Security and Context, which can be thought of as separate dictionaries. The **Security Categories** are: 'Cybersecurity', 'Security Actor', 'Security General', 'Security Mechanism', 'Threat Actor', 'Threat General', 'Threat Mechanism'. The **Context Categories** are: 'Activity', 'Cyber Entity', 'Individual', 'Org', 'Quality'.

The `tacs_show()` function with the `out='vis'` option can be used to visualise the framework as an interactive sunburst diagram in a Jupyter Notebook. The option `out='table'` produces a table at the level chosen via the `level` parameter. If `level='category'` is selected, the table shows one category per row, with its concepts and their domains listed in a string. If `level='concept'`, each concept is displayed, followed by a string of its terms.

In [None]:
tacs_show(context=True, out='vis')

In [3]:
tacs_show('table',level='cat')

| dict | cat | concept_domain |
|---|---|---|
| Security | Threat Mechanism | Vulnerability(S), Vulnerability(CS), Misc Thre... |
| Security | Threat General | Harm(S), Crime(S), Threat(S), Risk(CS), Risk(S... |
| Security | Threat Actor | Hacker(CS), Criminal(S), Attacker(S) |
| Security | Security Mechanism | Updates(G), Updates(CS), Updates(S), Training(... |
| Security | Security General | Resilience(S), Confidentiality(S), Authenticit... |
| Security | Security Actor | Security Expert(CS), Security Expert(S), Secur... |
| Security | Cybersecurity | Cybersecurity(CS), Cybersecurity(S) |
| Context | Quality | Positive(S), Positive(G), Negative(G) |
| Context | Org | Tech(C), Organisation(S), Medical(S), Law & Or... |
| Context | Individual | User(C), User(G), Relationship(G), Pronoun3(G)... |
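
To make the category-level table easier to read, the following is a minimal sketch of how such rows could be assembled from flat records. It is not the code in `tacs.py`: the record layout and the helper `category_table` are assumptions for illustration, and the sample records are copied from the three untruncated rows of the table.

```python
from collections import defaultdict

# Hypothetical flat records: (dimension, category, concept, domain).
# Values are taken from the untruncated rows of the category table.
RECORDS = [
    ("Security", "Threat Actor", "Hacker", "CS"),
    ("Security", "Threat Actor", "Criminal", "S"),
    ("Security", "Threat Actor", "Attacker", "S"),
    ("Security", "Cybersecurity", "Cybersecurity", "CS"),
    ("Security", "Cybersecurity", "Cybersecurity", "S"),
    ("Context", "Quality", "Positive", "S"),
    ("Context", "Quality", "Positive", "G"),
    ("Context", "Quality", "Negative", "G"),
]


def category_table(records):
    """Group concepts by (dimension, category) into 'Concept(Domain)' strings."""
    grouped = defaultdict(list)
    for dimension, category, concept, domain in records:
        grouped[(dimension, category)].append(f"{concept}({domain})")
    # Join each group into the comma-separated string shown in the table.
    return {key: ", ".join(values) for key, values in grouped.items()}


table = category_table(RECORDS)
for (dimension, category), concepts in table.items():
    print(f"{dimension:<10}{category:<15}{concepts}")
```

The same concept can appear once per domain (e.g., `Positive(S)` and `Positive(G)`), which is why the domain is kept next to each concept name.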


## Data

TACS operates on text documents, which can be supplied to its functions as a string, a list of strings, or a table (dataframe) with a text column (labelled `text`, unless a custom label is specified via the `textcol` parameter). The final version should also be able to import documents (.txt, .docx, .pdf) directly from a specified folder path.
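
As a rough illustration of the three accepted input shapes, the sketch below normalises each of them into a plain list of strings. The helper `collect_texts` is hypothetical and not part of `tacs.py`; a plain dictionary of columns stands in for a dataframe so that the example runs without `pandas`.

```python
def collect_texts(data, textcol="text"):
    """Return a list of document strings from a string, a list of strings,
    or a table-like mapping of column label -> list of values."""
    if isinstance(data, str):
        # A single document.
        return [data]
    if isinstance(data, dict):
        # Stand-in for a dataframe: look up the text column by its label.
        if textcol not in data:
            raise KeyError(f"no column labelled {textcol!r}")
        return list(data[textcol])
    # Any other iterable is treated as a list of documents.
    return [str(item) for item in data]


single = collect_texts("the boy is bullied online")
several = collect_texts(["first document", "second document"])
custom = collect_texts({"body": ["first document"]}, textcol="body")
print(single, several, custom)
```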

In [8]:
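import pandas as pd  # harmless if pandas is already imported
data = pd.read_csv('demodata.csv')  # the documents are expected in a column labelled `text`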

## Tagging & Tokenisation

The first step in all TACS operations is to preprocess and tag the documents. During this step, documents are segmented into individual units called tokens (words, phrases, punctuation marks, etc.) using `spacy`. A spacy document is a sequence of tokens, each of which has various attributes (e.g., part-of-speech tags `tok.pos_`, lemmatised versions `tok.lemma_`). TACS categories are stored in a custom attribute `._.csd`.

Tagging is performed using the `tacs_tag()` function. This is the slowest process in TACS and it is therefore recommended to call the tagging function once on all documents, save the output as a separate object (e.g., `tokdocs`) or a column in the dataframe, and use it as input for the remaining TACS functions.

The remaining TACS functions will first check whether the data provided as input contains documents that are already tagged. If not, the function will look for full-text documents and process them (this will take time). 

### Context-aware tagging

The function attempts to tag non-cybersecurity terms only if they appear in the context (vicinity) of cybersecurity. Currently, context is determined by a simple rule, `cs|(c&s)`: a non-cybersecurity term is tagged only if its `n` neighbouring terms contain one cybersecurity term, or one cyber term and one security term.

Context-aware tagging can be disabled or adjusted with the following parameters:
- `context_rule`: The default, `cs|(c&s)`, tags a term if nearby terms include at least one cybersecurity term, or one cyber term and one security term. It can be replaced with a more conservative rule, `cs`, under which the neighbours must include a cybersecurity term. Set to `False` to disable context-aware tagging altogether, so that every dictionary lexeme is tagged regardless of context.
- `context_window=10`: Specifies the number of neighbouring terms taken into account by the context rule.
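
The context rule can be sketched in a few lines. This is an illustration under stated assumptions, not the logic in `tacs.py`: the function `passes_context` is hypothetical, the window is assumed to count dictionary terms on each side, and the term itself is assumed to count towards the rule (which would be consistent with all three terms of 'the boy is bullied online' being tagged).

```python
def passes_context(domains, index, rule="cs|(c&s)", window=10):
    """Decide whether the dictionary term at `index` would be tagged.

    `domains` lists one domain code per dictionary term, in document order:
    'cs' (cyber security), 's' (security), 'c' (cyber) or 'g' (general).
    """
    if rule is False:
        return True  # context-aware tagging disabled
    if rule not in ("cs", "cs|(c&s)"):
        raise ValueError(f"unknown context rule: {rule!r}")
    # Assumption: the window spans `window` terms on each side and
    # includes the term itself.
    start = max(0, index - window)
    context = set(domains[start:index + window + 1])
    if "cs" in context:
        return True  # a cybersecurity term is nearby
    # Only the default rule also accepts one cyber plus one security term.
    return rule == "cs|(c&s)" and {"c", "s"} <= context


# Assumed domains for 'boy', 'bullied', 'online': general, security, cyber.
domains = ["g", "s", "c"]
default_rule = [passes_context(domains, i) for i in range(len(domains))]
strict_rule = [passes_context(domains, i, rule="cs") for i in range(len(domains))]
print(default_rule, strict_rule)
```

Under these assumptions the default rule accepts every term in the example, while the stricter `cs` rule rejects all of them because no cybersecurity term is present.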

#### Illustration of TACS tagging

In [18]:
doc = tacs_tag('the boy is bullied online')
[(tok,tok._.csd) for tok in doc]

[(the, 'None'),
 (boy, 'c_ind_ind_boy_dg'),
 (is, 'None'),
 (bullied, 's_tttp_bulli_bullying_ds'),
 (online, 'c_cy_int_online_dc')]
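
The tag strings above appear to pack several levels of the framework into one underscore-separated value. The sketch below splits them into named fields; the field names (dictionary, category, concept, term, domain) are a guess based on this output and on the category key `tttp`, not a documented format, and `parse_csd` is a hypothetical helper.

```python
from typing import NamedTuple, Optional


class CsdTag(NamedTuple):
    # Field names are assumptions inferred from the example output.
    dictionary: str
    category: str
    concept: str
    term: str
    domain: str


def parse_csd(tag: str) -> Optional[CsdTag]:
    """Split a `._.csd` value into its five assumed components."""
    if tag == "None":
        return None  # untagged tokens carry the string 'None'
    parts = tag.split("_")
    if len(parts) != 5:
        raise ValueError(f"unexpected tag layout: {tag!r}")
    return CsdTag(*parts)


parsed = parse_csd("s_tttp_bulli_bullying_ds")
print(parsed)
```

A fixed five-part split would break if any component itself contained an underscore, so this should be checked against the dictionary data before relying on it.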

#### Note the parameters and output in the following examples:

The examples use the TACS annotation function, which is described in a subsequent section. The point of interest here is how the tagging changes with the different parameters.

In [6]:
tacs_annotate_doc(tacs_tag('the boy is bullied online'),render=True)
tacs_annotate_doc(tacs_tag('the boy is bullied in school'),render=True)
tacs_annotate_doc(tacs_tag('the boy is shopping in school'),render=True)

In [7]:
tacs_annotate_doc(tacs_tag('the boy is shopping online',context_rule=False),render=True)
tacs_annotate_doc(tacs_tag('the boy is shopping online',context_rule='cs'),render=True)
tacs_annotate_doc(tacs_tag('the boy is shopping online using a password',context_rule='cs'),render=True)
tacs_annotate_doc(tacs_tag('the boy is shopping online using a password',context_rule='cs',context_window=4),render=True)

#### Tagging all documents and saving the output in a column labelled `tacsdocs`

In [21]:
data['tacsdocs'] = tacs_tag(data.text)

## Counting

The simplest form of analysis is counting the number of times TACS terms occur in text. The counting function provides the options to count Concepts or Categories, across all documents, in each document, or in specified groups of documents. 

**Grouping:** The function allows for 3 types of aggregation, specified via the `aggr` parameter: 
- `all` aggregates all documents and returns a single column for the entire dataset
- `each` returns a single column per document
- a custom grouping variable in the form of a list (corresponding to the length and order of the documents) or the name of a column in the dataframe

**Output**: Table with Dictionary Categories as rows and documents as columns, with the option to aggregate dictionary levels (e.g., sum Concept counts into Categories). If counting across documents (`aggr='all'`), the most frequent instances of the level below the one shown (e.g., Terms if Concept counts are shown; Concepts for Categories) are included in the table as a string, with the option to include the frequency of each unit (`subcount=True`).

**Parameters:**
- `level`: Specifies whether to count Categories, Concepts or Terms. Default is Concept.
- `aggr`: Specifies how to aggregate the documents, default is `all`.
- `subcount`: If True, the frequency of each unit is included in the string of most frequent instances.
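
The three aggregation modes can be illustrated with a small stand-alone sketch. It assumes each document has already been reduced to a list of concept labels, and the helper `count_concepts` is hypothetical; the real `tacs_count()` works on tagged documents and also accepts a column name, which is not handled here.

```python
from collections import Counter, defaultdict


def count_concepts(docs, aggr="all"):
    """Count concept labels per group.

    `docs` is a list of documents, each a list of concept labels.
    `aggr` is 'all', 'each', or a list with one group label per document.
    """
    if aggr == "all":
        groups = ["all"] * len(docs)
    elif aggr == "each":
        groups = list(range(len(docs)))
    elif isinstance(aggr, str):
        raise ValueError("column names are not supported in this sketch")
    else:
        groups = list(aggr)
        if len(groups) != len(docs):
            raise ValueError("grouping list must match the number of documents")
    counts = defaultdict(Counter)
    for group, concepts in zip(groups, docs):
        counts[group].update(concepts)
    return dict(counts)


docs = [["Hacker", "Vulnerability"], ["Hacker"], ["Updates", "Hacker"]]
overall = count_concepts(docs)
per_doc = count_concepts(docs, aggr="each")
# 'news' and 'forum' are made-up group labels for illustration.
by_group = count_concepts(docs, aggr=["news", "forum", "news"])
print(overall, per_doc, by_group)
```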

In [None]:
data['tokdocs'] = tacs_tag(data.text)

In [None]:
tacs_count(data, level='concept', aggr='srctype', subcount = True)

## Annotation

Annotates documents such that TACS terms appear in bold, with their TACS category as a subscript. Annotated documents can be exported to a single HTML file (`save_annot=True`) and/or displayed within a Jupyter Notebook (`show_annot=True`). Currently, only HTML annotation is supported. Future versions can implement a DOCX variant.
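
The markup described here can be sketched as follows. This is a simplified illustration rather than the output of `tacs_annotate()`: the helper `annotate_html` is hypothetical, tokens are given as plain (text, label) pairs, and they are joined with single spaces instead of preserving the original whitespace.

```python
from html import escape


def annotate_html(tagged_tokens):
    """Render (text, label) pairs as HTML: labelled tokens in bold with
    the label as a subscript; a label of None means the token is untagged."""
    pieces = []
    for text, label in tagged_tokens:
        if label is None:
            pieces.append(escape(text))
        else:
            pieces.append(f"<b>{escape(text)}</b><sub>{escape(label)}</sub>")
    return " ".join(pieces)


snippet = annotate_html([("the", None), ("hacker", "Threat Actor"), ("returned", None)])
print(snippet)
```

Escaping the token text keeps characters such as `<` or `&` in the source documents from being interpreted as markup.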

In [14]:
tacs_annotate(data[:3], show_annot = True, custom=False, level='concept', context=True)

Tagging documents. This might take a couple of minutes.


## Query

The query function returns documents or sentences in which certain TACS categories appear. Queries can be specified using category keys (e.g., 'tttp') or labels (e.g., 'Threat Mechanisms') linked through AND and OR operators. For example, `Mobile Devices AND tttp` would look for sentences (or documents) in which both of these categories are mentioned. By default, documents are split into sentences (`qsents=True`) and each sentence is queried individually; if this is disabled, full documents are queried. Currently, a query returns True if the specified condition is met at least once. Future iterations might implement an option to specify a custom threshold, e.g., Mobile Device mentioned > 3. As in the Annotation function, TACS terms appear in bold with categories as subscripts; the output can be saved and/or displayed in a Jupyter Notebook. Currently, only HTML annotation is supported. Future versions can implement a DOCX variant.

**Output:**
- **Data**: The result of the query can be returned and/or saved as a table, with one searched unit (document or sentence) per row and a column for the query outcome (True if the condition is met, False otherwise). The id and full text of the unit are included. Selected via `data_return=True` and/or `data_save=True`.
- **Annotated documents**: The queried documents are returned in full. If sentences are queried separately, sentences matching the query are highlighted. Selected via `return_all=True`.
- **Segments**: Documents are returned as an itemised list of sentences matching the query. Selected via `return_all=False`.

**Parameters:**
- `data`: Object containing TACS or plain text documents (single document, list, dataframe). If no TACS documents are present, `tacs_tag` is called on the plain text documents. In a dataframe, plain text documents should be located in a column labelled 'text', or the column label should be specified via `textcol`.
- `query`: A string specifying a query of TACS entities, specified with full or partial labels (e.g., Protection) or keys (e.g., 'tttp'), and linked with the operators AND and OR.
- `qsents`: True/False. If True, documents are segmented into sentences and the query is performed at sentence level.
- `return_all`: True/False. If True, the full document is returned and query outcome is indicated via highlights in annotated documents or a boolean _query_ variable in data. If False, only matching documents/sentences are returned.
- `data_return`: If True, the function returns a dataframe with the query outcome.
- `data_save`: If True, the query outcome dataframe is written to disk under the name _tacsq__{query}_
- `annot_return`: If True and if using a Jupyter Notebook, the annotated documents will be displayed in the notebook. Not recommended for large data.
- `annot_save`: If True, annotated documents will be saved to disk in the form of a single document, individual documents separated by headers.
- `annot_markup`: Specifies markup language for the annotation. Currently only `html` is implemented.
- `annot_level`: Specifies the level of categorisation displayed in the annotation. Default is 'Concept'.
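
To make the query semantics concrete, here is a minimal sketch of evaluating an AND/OR query against the categories found in one sentence or document. It rests on two assumptions that the documentation does not confirm: AND binds more tightly than OR, and an operand matches when it is a case-insensitive substring of any key or label. The helper `matches_query` is hypothetical.

```python
def matches_query(query, categories):
    """Return True if the query condition is met at least once.

    `categories` holds the category keys and labels present in the unit.
    """
    present = [category.lower() for category in categories]

    def operand_matches(operand):
        needle = operand.strip().lower()
        # Partial labels are allowed, so a substring match is used.
        return bool(needle) and any(needle in category for category in present)

    # Split on OR first, then require every AND operand within a clause.
    return any(
        all(operand_matches(operand) for operand in clause.split(" AND "))
        for clause in query.split(" OR ")
    )


sentence = {"tttp", "Threat Mechanism", "Mobile Devices"}
both = matches_query("Mobile Devices AND tttp", sentence)
missing = matches_query("Mobile Devices AND Hacker", sentence)
either = matches_query("Hacker OR tttp", sentence)
print(both, missing, either)
```

With `qsents=True`, a check like this would run once per sentence; otherwise it would run once on the categories of the whole document.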

In [54]:
tacs_query(data[:3], 'tttp', qsents=True, return_all=True, data_save=False, annot_markup='html', annot_return = True, annot_save=True, annot_level='concept')

In [56]:
data[:5].to_csv('demodata.csv')

---