# Stage II
This stage collects, analyzes and selects literature on a bibliometric level, without ever needing to open any of the documents. Using available indexing and metadata, as well as abstract and text extraction, screening is based on a representation of each document instead of the document itself. This makes assessing the literature more efficient and keeps the workload manageable, because little effort is invested in documents that are excluded later.

**Workload distribution**

|Actor|Time|
|:----|:----|
|Researcher|15 min|
|Machine|30 min|

**Tools**
* Search Engine (e.g. [Google Scholar](https://scholar.google.de/schhp?#d=gs_asd))
* [Zotero](https://www.zotero.org/)
* [Python](https://www.python.org/), including [pip](https://pip.pypa.io/en/stable/installation/) for ```pip install bnw-tools```, installing the package [bnw-tools](https://pypi.org/project/bnw-tools/) containing the SWARM-SLR code
* Optional:
    * Markdown file viewer (e.g. [Obsidian](https://obsidian.md/), preferably with [Dataview](https://github.com/blacksmithgu/obsidian-dataview) plugin)
    * [Jupyter](https://jupyter.org/) for using this Jupyter notebook itself (the file you are reading right now)

## Task 3: Search
|Step|Result|Requirement|
|:----|:----|:----|
|Find resources|preliminary document list|16. use reliable sources<br>17. find relevant documents<br>18. find similar documents|
|Remove duplicates|preliminary document set|19. identify duplicate documents|
|Find missing documents|curated document set|20. find unindexed documents<br>21. identify document set gaps|

<div align="center">
<img src="images/Task 3.jpg" width="90%" />
</div>

### Step 3.1: Find resources
Requirements:
* use reliable sources
* find relevant documents
* find similar documents

1. **Reference manager**: First, a digital library setup is required. While this setup uses [Zotero](https://www.zotero.org/), others like [Citavi](https://www.citavi.com/) are equally valid, though some steps may need to be modified to fit their specific functionality. The following steps are recommended:
    1. Install the Zotero application and browser plugin.
    2. Sign in to your Zotero account for both.
    3. **Zotero library**: Set up a library (/group) for this SWARM-SLR.
    4. **Query collections**: For each search query, set up a collection (/folder) within this library.
2. **Search engine**: While each SWARM-SLR only uses one *Reference manager*, it can and should use multiple search engines, such as Google Scholar, Semantic Scholar, OKMaps etc. Proceed as follows:
    1. For each search engine:
        1. For each query:
            1. Create a collection (/folder) within the respective *query collection*. Select this collection with a left-click.
            2. Open the search engine and use the query there.
            3. Store each result page using the Zotero plugin. While the number of results should be between 100 and 500, it is recommended to stop after cataloguing the first ~200 results.
                * *Hint*: While you can open multiple tabs and increase the number of search results per page, neither is recommended. Many search engines use temporary timeouts to sanction highly frequent access.
            4. Then export the *query collection* as BibTeX for later reference.

Most dedicated scholarly search engines provide reliable sources, so only minor curation is required later on. While the majority of relevant documents is collected during this step, there are various means through which the document set can be expanded later on. Examples include:
* A highly relevant document is found and is used to query **Connected Papers**. This results in a recommendation of additional relevant papers, which can be downloaded as BibTeX and added via Zotero's "Import from BibTeX" functionality.
* A (large) research project relevant to a research question is found and has its publications catalogued. These results can be either added through BibTeX (see above), or looked up individually and added manually.

Within the *Zotero library*, a new collection called "others" should be created. Within that, each of these examples (e.g. Connected Papers dump / research project publications / ...) is stored as its own collection. As with the *query collection*, the BibTeX representation of these collections should be exported and stored.

### Step 3.2: Remove duplicates
**Requirements:**
* identify duplicate documents

*Hint*: Before any duplicates are removed, it is important to make sure all collections have their BibTeX stored. If not, it is impossible to reliably trace back the origin of a document once duplicates are removed.

Most duplicates can be removed by using Zotero's "Duplicate items" section next to the collections. Each item there can be merged with the click of a button, removing the duplicates.

Other duplicates may only be found later on, for example when calculating document similarity. This step primarily reduces the future overhead of processing the same file twice, although it presumably cannot prevent it entirely.
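
Duplicates that Zotero does not flag can often be spotted by comparing normalised titles. The following is a minimal sketch, assuming each reference is available as a dictionary with a `title` key. This is a hypothetical representation for illustration, not the data structure used by bnw-tools or Zotero.

```python
import re
from collections import defaultdict


def normalise_title(title):
    """Lower-case a title and collapse everything but letters and digits."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def find_duplicates(references):
    """Group references that share the same normalised title."""
    groups = defaultdict(list)
    for ref in references:
        groups[normalise_title(ref["title"])].append(ref)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical references collected from two query collections.
references = [
    {"title": "A Survey of Knowledge Graphs", "source": "query 1"},
    {"title": "A survey of knowledge graphs.", "source": "query 2"},
    {"title": "Another Paper", "source": "query 1"},
]
duplicates = find_duplicates(references)
```

Keeping the `source` of each entry in the group mirrors the hint above: the origin of a document stays traceable even after the duplicates are merged.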

### Step 3.3: Find missing documents
Requirements:
* find unindexed documents
* identify document set gaps

Once all queries are processed and all apparent duplicates removed, the last step of task 3 includes three phases:
1. **Find documents**: First, all PDFs of the library are downloaded using Zotero's "Find Available PDFs" feature. Select the library and all references therein, then find this feature by right-clicking any of them. After processing, presumably not all PDFs will have been found. By filtering the library by attachments, each reference without a PDF can be inspected. The researcher can attempt to obtain missing PDFs of promising references through other means, e.g. from their own (department's) library, by buying the work, contacting the authors or consulting a librarian.
2. **Identify gaps**: If works known to the researcher that are deemed relevant to the research question are not in the Zotero library, they can be added as a dedicated collection within the "others" collection. Each document added here should be thoroughly inspected for the following reasons:
    * Works in a second (non-English) language are usually not picked up by search engines. This can be circumvented by translating each query into said language, which extends the scope in exchange for significant overhead in data management. While not recommended, this process is not advised against and can be conducted if the benefit is deemed worth the investment.
    * Potentially, the search queries have a blind spot, either being too exclusive in their keyword combination or not including an important keyword. This indicates a) an oversight in "Step 2.4 Refine with related literature", as well as b) a potential to soft-reset the survey back to Step 2.5. Generally, the collected references are not invalid and can be kept, leaving only the modified and added queries to be re-run.
        * *Hint*: Instead of adding this kind of missing keyword ```B``` as ```A OR B```, a new query should be added without the ```A```. While in hindsight the ```A OR B``` variation would have been preferred, re-running the ```A OR B``` query will return already recorded results.
    
    Similarly, gaps can be identified without the right documents to fill them. If such a gap is identified *via known missing keywords*, proceed with a soft reset as described above. If not, this blind spot might be a finding of the SWARM-SLR itself, noting not a gap in the search, but in the literature itself.
3. **Export library**: Concluding task 3, the Zotero library should be exported as BibTeX, this time with "Export Files" checked.

*Note*: This survey methodology can only process documents that have a PDF available. If the PDF is behind a paywall or otherwise inaccessible, the work will not be considered beyond this point. This, among other things, further supports the relevance of Open Science.
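
To see how many references drop out because no PDF is available, the references can be split by attachment. The sketch below assumes each reference is a dictionary in which a non-empty `file` key marks an attached PDF. Both the dictionary layout and the key name are assumptions for illustration.

```python
def split_by_pdf(references):
    """Separate references with an attached file from those without one."""
    with_pdf = [ref for ref in references if ref.get("file")]
    without_pdf = [ref for ref in references if not ref.get("file")]
    return with_pdf, without_pdf


# Hypothetical references; only the second and third lack a PDF.
references = [
    {"title": "Open access paper", "file": "files/1/paper.pdf"},
    {"title": "Paywalled paper"},
    {"title": "Paper with empty attachment field", "file": ""},
]
with_pdf, without_pdf = split_by_pdf(references)
```

The `without_pdf` list is the one to work through manually when looking for missing PDFs.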

## Task 4: Select

|Step|Result|Requirement|
|:----|:----|:----|
|Extract structured data from documents|metadata|22. extract publication date<br>23. extract author(s)<br>24. extract publication venue<br>25. extract specified keywords|
| |content|26. extract document text<br>27. differentiate chapters|
| |bag of words|28. remove special characters<br>29. identify multi-word expressions<br>30. expand acronyms<br>31. lemmatise words<br>32. normalise words<br>33. remove stopwords|
| |contribution statements|34. identify statements|
|Calculate relational measurements within the document set|document representation|35. calculate "term frequency" (tf) and "term frequency - inverse document frequency" (tf-idf)<br>36. calculate word embedding and document embedding<br>37. represent document machine-readable|
| |similarity of documents within the document set|38. consider synonyms<br>39. consider polysemes<br>40. calculate document similarity|
|Identify documents relevant for the research questions|relevance of documents for research question|41. represent research question machine-readable<br>42. calculate document relevancy for research question|

<div align="center">
<img src="images/Task 4.jpg" width="90%" />
</div>

Since this task is done largely computationally, the following graphic adds insight into each step and their interaction. 
<div align="center">
<img src="images/Task 4 - Details.jpg" width="100%" />
</div>

In [None]:
# Task 4: Select
# Setup
from bnw_tools.extract.nlp import util_nlp
from bnw_tools.publish import util_wordcloud
from bnw_tools.publish.Obsidian import nlped_whispered_folder

folder_path = "D:/workspace/Zotero/SE2A-B4-2"
language = "en"
nlptools = util_nlp.NLPTools()

# Analyze Step (4.1, 4.2 and 4.3)
folder = util_nlp.Folder(folder_path, nlptools=nlptools, language=language)

# Publish (for better usability)
util_wordcloud.folder(folder)
nlped_whispered_folder.folder(folder, force=True)

### Step 4.1: Extract structured data from documents
**Requirements:**
* metadata
    * extract publication date
    * extract author(s)
    * extract publication venue
    * extract specified keywords
* content
    * extract document text
    * differentiate chapters
* bag of words
    * remove special characters
    * identify multi-word expressions
    * expand acronyms
    * lemmatise words
    * normalise words
    * remove stopwords
* contribution statements
    * identify statements

This step is broken down into extracting various structured data from documents. With this step, we begin to utilize Python, requiring setup.

#### Metadata
Given the BibTeX export, most of the metadata is already structured.

In [None]:
# Example for metadata extracted from Zotero's BibTeX export:
folder.media_resources[0].pdf.metadata
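
To illustrate why a BibTeX export counts as structured metadata, the following sketch pulls the fields named in the requirements out of a single entry with a regular expression. It assumes simple `key = {value}` fields on one line each, without nested braces, and uses a made-up entry. It is not how bnw-tools reads the export.

```python
import re

# Hypothetical entry, shaped like a typical BibTeX record.
ENTRY = """@article{doe_example_2020,
    title = {An Example Title},
    author = {Doe, Jane and Roe, Richard},
    journal = {Journal of Examples},
    year = {2020},
    keywords = {survey, example},
}"""

# One field per line: key = {value},
FIELD = re.compile(r"^\s*(\w+)\s*=\s*\{(.*)\},?\s*$", re.MULTILINE)


def parse_fields(entry):
    """Return the `key = {value}` fields of one BibTeX entry as a dict."""
    return {key.lower(): value for key, value in FIELD.findall(entry)}


fields = parse_fields(ENTRY)
authors = fields["author"].split(" and ")
keywords = [keyword.strip() for keyword in fields["keywords"].split(",")]
```

For real exports with nested braces or multi-line values, a dedicated BibTeX parser is the safer choice.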

#### Content
Working on a purely textual basis, this pipeline extracts all text from the PDF. This has a multitude of shortcomings, including among others:
* PDFs that do not allow automatic mining, actively hinder it or are otherwise incompatible with it are excluded from this approach.
* Previously formatted text loses context, such as tables, image captions, page numbers, headers and footers, etc.
* Overlapping text, either visible or invisible, may contaminate the mined text.

Despite these shortcomings, the method continues to build on text extraction, because prior publications are available as PDF at best and as (scanned) print versions at worst. If more tools like [ORKG](https://orkg.org/) and [SciKGTeX](https://github.com/Christof93/SciKGTeX) were used in scientific practice, this and other steps would benefit greatly.

In the meantime, we proceed as follows:
1. For each PDF
    1. **Extract text**: If no available PDF reader can extract text from a given PDF, this step is skipped for said document. It will not be further analyzed computationally and awaits manual evaluation later on.
    2. **Clean text**: Due to the nature of PDF, a multitude of errors can only be cleaned up *tentatively* after extraction. These attempts are classified as ```always correct [++]```, ```potentially wrong [+-]``` or ```highly situational [--]```. They include:
        * [++] Replace known error characters: ```ﬄ``` -> ```ffl```
        * [++] Remove line breaks inside words: ```know-\n ledge``` -> ```know-ledge```
        * [+-] Remove hyphenation inside words: ```know-ledge``` -> ```knowledge```
        * [+-] Remove PDF authors, institutions, etc.
        * [+-] Remove headers and footers
        * [+-] Re-position captions
        * [--] Remove / re-position footnotes
        * [--] Remove / re-position table content
    3. **Structure text**: Attempt to differentiate between:
        * Abstract
        * Chapters / sections
        * References

Each extracted and potentially structured text is stored for future use.
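
The first three clean-up rules listed above can be sketched with the standard library alone. The function names are hypothetical, and checking joined words against a vocabulary is only one assumed way to make the `[+-]` rule less error-prone. This is not the implementation of `PDFDocument.clean_text`.

```python
import re

# Ligature characters that PDF extraction commonly leaves behind.
LIGATURES = {"ﬀ": "ff", "ﬁ": "fi", "ﬂ": "fl", "ﬃ": "ffi", "ﬄ": "ffl"}


def replace_ligatures(text):
    """[++] Replace known error characters with their plain letters."""
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    return text


def join_broken_words(text):
    """[++] Remove line breaks inside hyphenated words, keeping the hyphen."""
    return re.sub(r"(\w)-\n\s*(\w)", r"\1-\2", text)


def remove_hyphenation(text, vocabulary):
    """[+-] Drop a hyphen only if the joined word is in a known vocabulary."""
    def join(match):
        joined = match.group(1) + match.group(2)
        return joined if joined.lower() in vocabulary else match.group(0)
    return re.sub(r"\b(\w+)-(\w+)\b", join, text)


raw = "Know-\n ledge about baﬄes is well-known."
cleaned = remove_hyphenation(
    join_broken_words(replace_ligatures(raw)), vocabulary={"knowledge"}
)
```

Here `well-known` keeps its hyphen because `wellknown` is not in the vocabulary, which shows why this rule is only `potentially` correct: its quality depends entirely on the word list.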

In [None]:
# Text is read from the pdf, with minimal invasive cleanup.
# Extract text:     Completely implemented. Mostly reliable, though some files are still unreadable.
# Clean text:       Work in progress, only known error characters implemented.
# Structure text:   Work in progress.

import inspect
from bnw_tools.extract.util_pdf import PDFDocument

print(inspect.getsource(PDFDocument.from_file))
print(inspect.getsource(PDFDocument.clean_text))

folder.media_resources[0].pdf.text

#### Bag of words
Before the plain text can be converted to a bag of words (BoW), some sub-steps are required: 
* **remove special characters**: For the purpose of the BoW, special characters such as ```!"§$%&/()=?``` are not required and are removed.
* **identify multi-word expressions**: Words that semantically form a single unit, such as ```New York```, should be treated as such. 
* **expand acronyms**: Acronyms such as ```BoW``` and their long forms such as ```Bag of Words``` should not both be stored, hence each acronym is expanded to its long form.
* **lemmatise words**: Inflected forms like ```New Yorks``` and ```instances``` are reduced to their uninflected lemmas ```New York``` and ```instance```.
* **normalise words**: Terms like ```U.S.A.``` and ```anti-discriminatory``` are assigned to their *equivalence classes* and hence match with other words like them, such as ```USA``` and ```antidiscriminatory```. Note that terms such as ```Windows``` might appear in different *equivalence classes*, where a search for ```window``` might return instances of ```Windows```, but not the other way around: (```window```, ```windows```, ```Window```, ```Windows```); (```Windows```).
* **remove stopwords**: Grammatical words that carry no distinct meaning are removed, such as ```such```, ```as``` and ```and```.

Each of these processes should be handled with care, since potentially important information can be lost.
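
A reduced version of this pipeline, covering only the removal of special characters and stopwords, might look as follows. The stopword list is a tiny illustrative sample, lower-casing stands in as a crude form of normalisation, and multi-word expressions, acronyms and lemmatisation are left out because they need language resources beyond the standard library. This is a sketch, not the bnw-tools implementation.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; real lists are far longer.
STOPWORDS = {"such", "as", "and", "the", "a", "of", "is", "are"}


def bag_of_words(text, stopwords=STOPWORDS):
    """Count the tokens that remain after basic cleaning."""
    # remove special characters: keep word characters and whitespace only
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    tokens = cleaned.split()
    # remove stopwords
    return Counter(token for token in tokens if token not in stopwords)


bow = bag_of_words("Such words as 'and' are removed; words and counts remain!")
```

Note that lower-casing already merges ```Windows``` and ```windows```, which is exactly the kind of information loss the paragraph above warns about.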

In [None]:
# Bag of words
# remove special characters:        completely implemented.
# identify multi-word expressions:  work in progress.
# expand acronyms:                  work in progress.
# lemmatise words:                  completely implemented.
# normalise words:                  work in progress.
# remove stopwords:                 completely implemented, continues to be improved.
folder.media_resources[0].nlp_analysis.bag_of_words.get()

#### Contribution statements
Representing a document as a set of statements is potentially among the most significant forms of knowledge representation. Unless manual curation already exists, either through the authors' use of [SciKGTeX](https://github.com/Christof93/SciKGTeX) or through a fellow researcher's use of [ORKG](https://orkg.org/), potentially in a prior SWARM-SLR, these statements have to be identified in the extracted text.

In [None]:
# Contribution
# A module is in development, but due to the complexity not ready for implementation yet.
# -> The current iteration of the SWARM-SLR includes no statement extraction module.

## Task 5: Evaluate
|Step|Result|Requirement|
|:----|:----|:----|
|Remove out-of-scope document subset|in-scope document set|43. define in-/exclusion criteria<br>44. exclude documents|
|Evaluate relevancy measurement|selection approval|45. evaluate similarity of documents<br>46. evaluate relevance of documents|
|Classify document subsets|literature subsets|47. define thresholds<br>48. divide document set into subsets|

### Remove out-of-scope document subset
**Requirements:**
* define in-/exclusion criteria
* exclude documents

While formally each exclusion criterion could be rephrased as an inclusion criterion, their nature is different:
* **Inclusion criteria** are defined *a priori* and designate which documents *can* be significant for answering the research question. They are derived from the scope of the research question and aim to increase feasibility.
    * e.g. published between YYYY and XXXX
* **Exclusion criteria** are defined *a posteriori* and designate which documents *will not* be evaluated to answer the research question. They narrow down the research question's scope and aim to increase quality.
    * e.g. without a scientific evaluation

The exclusion can be semi-automated, while certain criteria will most likely take effect during the later manual review of remaining papers.

In [None]:
# There is currently no automatic support for additional criteria.
# The available information is curated in a final note, however, making the application of additional criteria easier.
# Eventually, certain reliably automatable criteria, such as date ranges, can be implemented.
folder.media_resources[0].pdf.metadata['date']
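
A date range is one of the criteria that could be automated reliably. The sketch below assumes that the `date` value is a string starting with a four-digit year (for example `2021` or `2021-05-17`). The helper name and the sample records are hypothetical.

```python
def within_years(date, first_year, last_year):
    """Return True if a date string starting with YYYY lies in the range."""
    year = int(date[:4])
    return first_year <= year <= last_year


# Hypothetical metadata records; only the 'date' key is used here.
documents = [
    {"title": "Old work", "date": "2009-11-02"},
    {"title": "Recent work", "date": "2021"},
]
in_scope = [doc for doc in documents if within_years(doc["date"], 2015, 2023)]
```

Dates in other formats would need to be parsed first, which is one reason why such criteria are only semi-automatable.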

### Evaluate relevancy measurement
**Requirements:**
* evaluate similarity of documents
* evaluate relevance of documents

Works can be pre-evaluated according to their alignment with the research question via the keywords. This means that the researcher can begin with works that are statistically more likely to be relevant to the research questions, and can gradually work down until all research questions are satisfied. Because each RQ has its own set of keywords, the calculation of significance can be adjusted throughout the survey process.

In [None]:
import inspect
from bnw_tools.extract.nlp.util_nlp import Folder

# Similarity of documents is simple tf-idf cosine similarity.
print(inspect.getsource(Folder.calc_sim_matrix))
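
For readers without the package at hand, tf-idf cosine similarity can be sketched in plain Python. This uses the textbook variant with $idf = \log(N / df)$ on already tokenised documents. The weighting and smoothing used by `Folder.calc_sim_matrix` may differ, so treat the function names and details as assumptions.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Turn tokenised documents into sparse tf-idf vectors (dicts)."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zero."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Three toy documents, already reduced to their bags of words.
docs = [
    ["software", "engineering", "survey"],
    ["software", "engineering", "review"],
    ["aircraft", "wing", "design"],
]
vectors = tf_idf_vectors(docs)
similar = cosine_similarity(vectors[0], vectors[1])
unrelated = cosine_similarity(vectors[0], vectors[2])
```

The first two documents share two of three terms and therefore score above zero, while the third shares none and scores exactly zero.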

#### Document relevancy
Document relevance, however, is calculated as a keyword-weighted cosine similarity over term frequencies that are normalized per keyword:
* Reduce the tf vector to the keywords -> keyword-frequency vector $kf$
  * Sum the tf scores of all words matching the keyword (e.g. "software" -> "software", "software-engineer", ...)
* Normalize $kf$ keyword by keyword -> normalized keyword-frequency vector $kf_{n}$
* Raise the scores using either the root or the log method to reduce the impact of keyword stuffing while still rewarding higher frequencies -> raised normalized keyword-frequency vector $kf_{nr}$
  * SWARM-SLR currently uses the square root (solid green line) to keep the influence simple and intuitive.
* Calculate the weighted cosine similarity of $kf_{nr}$ ($kf_{nr,i} \in [0,1]$), the keyword vector ($k_{i} \in \{0,1\}$) and the weight vector ($w_{i} \in [0,1]$)

As a result, documents that use the most diverse, most highly weighted and most frequent keywords receive the highest score.

*Hint*: Without raising the score, a document mentioning a keyword 100 times would potentially be 100 times more relevant than another document of similar length that mentions the keyword only once. An example with square root:
* A: 100 keywords in 10,000 words  -> $kf = 0.01$    ->  $kf_{n} = 1.0$  ->  $kf_{nr} = 1.0$
* B: 1 keyword in 10,000 words    -> $kf = 0.0001$  ->  $kf_{n} = 0.01$ ->  $kf_{nr} = 0.1$
* C: 10 keywords in 10,000,000 words -> $kf = 0.000001$  ->  $kf_{n} = 0.0001$ ->  $kf_{nr} = 0.01$
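
The first three steps of the calculation can be reproduced for documents A, B and C in a few lines. The sketch assumes that a word "matches" a keyword if it starts with it, and that normalization divides by the highest frequency of that keyword across all documents, which is consistent with the numbers in the hint. The function names are hypothetical and are not the ones from `bnw_tools.review.similarity`.

```python
import math


def keyword_frequency(tf, keyword):
    """Sum the tf scores of all terms starting with the keyword (assumed rule)."""
    return sum(score for term, score in tf.items() if term.startswith(keyword))


def normalise_keyword(values):
    """Scale one keyword's frequencies across documents by their maximum."""
    highest = max(values)
    return [value / highest if highest else 0.0 for value in values]


def raise_scores(values):
    """Dampen differences with the square root (values must lie in [0, 1])."""
    return [math.sqrt(value) for value in values]


# Documents A, B and C from the hint, for a single keyword.
kf = [100 / 10_000, 1 / 10_000, 10 / 10_000_000]
kf_n = normalise_keyword(kf)  # approximately [1.0, 0.01, 0.0001]
kf_nr = raise_scores(kf_n)    # approximately [1.0, 0.1, 0.01]
```

After raising, document A scores 10 times higher than B instead of 100 times, so frequent use of a keyword is still rewarded but no longer dominates the ranking.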

<div align="center">
<img src="images/root log plot.png" width="90%" />
</div>

<div style="display: flex; justify-content: center;">
<img src="images/root.png" width="45%" />
<img src="images/log.png" width="45%" />
</div>

In [None]:
import inspect
from bnw_tools.extract.nlp.util_nlp import Folder
from bnw_tools.review.similarity import compute_weighed_similarity, normalize

print(inspect.getsource(Folder.calc_rq_sim))
print(inspect.getsource(compute_weighed_similarity))
print(inspect.getsource(normalize))

In [None]:
print(folder.sim_matrix[0])
folder.rq_sim_mat[0]

### Classify document subsets
**Requirements:**
* define thresholds
* divide document set into subsets

With the metrics established in this stage, a final classification of the document set is possible. Similar to the in-/exclusion criteria, this classification functions as a set of soft criteria that specify the conditions under which a given document is evaluated. By applying certain **thresholds** to the document relevancy, the document set is divided into subsets that *must (M)*, *should (S)*, *could (C)* and *won't (W)* be considered. Unlike hard criteria, these are guidelines that serve as decision support for which works to prioritise.

Giving an example:
* Documents *must* be evaluated if their relevancy score is above 75 % (M=.75), they have been published within the last year or in a Q1/A venue.
* Documents *should* be evaluated if their relevancy score is above 50 % (S=.50) or they have been published within the last 3 years.
* Documents *could* be evaluated if their relevancy score is above 25 % (C=.25) or below 5 % (W=.05).
* Documents *won't* be evaluated if their relevancy score is above 5 % (W=.05) and none of the conditions above apply.

*Note*:
* These are *soft* criteria and are to be perceived as guidelines.
* Relevancy scores of 0 and scores below the *W* threshold are perceived as errors and move the document up to *C*.
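
The score-based part of the example can be written as a small function. It only considers the relevancy score and leaves out the publication date and venue conditions. The function name and default thresholds are taken from the example above and are not part of bnw-tools.

```python
def classify(score, must=0.75, should=0.50, could=0.25, wont=0.05):
    """Map a relevancy score in [0, 1] to the subset M, S, C or W."""
    if score > must:
        return "M"
    if score > should:
        return "S"
    # Scores below the W threshold are treated as likely errors and are
    # moved up to C so that they receive a manual look.
    if score > could or score < wont:
        return "C"
    return "W"


# Hypothetical relevancy scores for five documents.
scores = [0.8, 0.6, 0.3, 0.1, 0.0]
subsets = [classify(score) for score in scores]
```

Since these are soft criteria, the resulting labels are a suggested reading order rather than a final decision.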


In [None]:
# This step defines thresholds that inform the next steps. All data required is already present in the folder.