# SWIM

# Stage II

## Task 3: Search
|Step|Result|Requirement|
:----|:----|:----|
|Find resources|preliminary document list|16. use reliable sources<br>17. find relevant documents<br>18. find similar documents|
|Remove duplicates|preliminary document set|19. identify duplicate documents|
|Find missing documents|curated document set|20. find unindexed documents<br>21. identify document set gaps|

<div align="center">
<img src="images/Stage III.jpg" width="990" />
</div>

### Step 3.1: Find resources
**Requirements:**
* use reliable sources
* find relevant documents
* find similar documents

1. **Reference manager**: First, a digital library setup is required. While this setup uses [Zotero](https://www.zotero.org/), others like [Citavi](https://www.citavi.com/) are equally valid, though some steps may need to be modified to fit their specific functionality. The following steps are recommended:
    1. Install the Zotero application and browser plugin.
    2. Sign in to your Zotero account for both.
    3. **Zotero library**: Set up a library (/group) for this literature SWIM.
    4. **Query collections**: For each search query, set up a collection (/folder) within this library.
2. **Search engine**: While each literature SWIM uses only one *Reference manager*, it can and should use multiple search engines, such as Google Scholar, Semantic Scholar, or OKMaps. Proceed as follows:
    1. For each search engine:
        1. For each query:
            1. Create a collection (/folder) within the respective *query collection*. Select this collection with a left-click.
            2. Open the search engine and use the query there.
            3. Store each page of results using the Zotero plugin. While the number of results should be between 100 and 500, it is recommended to stop after cataloguing the first ~200 results.
                * *Hint*: While you can open multiple tabs and increase the number of search results per page, neither is recommended. Many search engines use temporary timeouts to sanction high-frequency access.
            4. Then export the *query collection* as BibTeX for later reference.
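
The nested loop above implies one collection per combination of query and search engine. As a minimal sketch, the expected collection layout can be listed up front and checked against Zotero afterwards. The library name, engine names, and query labels below are placeholders, not part of the SWIM specification.

```python
from itertools import product

# Hypothetical inputs: replace with the survey's own engines and queries.
search_engines = ["Google Scholar", "Semantic Scholar"]
queries = ["query 1", "query 2"]


def plan_collections(library, engines, queries):
    """Return one 'library / query / engine' path per engine-query pair."""
    return [
        f"{library} / {query} / {engine}"
        for engine, query in product(engines, queries)
    ]


collection_paths = plan_collections("SWIM library", search_engines, queries)
for path in collection_paths:
    print(path)
```

The list can serve as a checklist while working through the search engines one by one.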

The reliability of sources is generally given for most dedicated scholarly search engines, with only minor curation required later on. While the majority of relevant documents will be collected during this step, there are various means through which the document set can be expanded later on. Examples include:
* A highly relevant document is found and is used to query **Connected Papers**. This results in a recommendation of additional relevant papers, which can be downloaded as BibTeX and added via Zotero's "Import from BibTeX" functionality.
* A (large) research project relevant to a research question is found and has its publications catalogued. These results can either be added through BibTeX (see above), or looked up individually and added manually.

Within the *Zotero library*, a new collection called "others" should be created. Within that, each of these examples (e.g. Connected Papers dump / research project publications / ...) is stored as its own collection. Like the *query collection*, the BibTeX representation of these collections should be exported and stored.

### Step 3.2: Remove duplicates
**Requirements:**
* identify duplicate documents

*Hint*: Before any duplicates are removed, it is important to make sure all collections have their BibTeX stored. If not, it is impossible to reliably trace back the origin of a document once duplicates are removed.
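
One way to keep the origin of each document traceable is to index the stored BibTeX exports before merging anything. The following is a minimal sketch, assuming all exports sit as `.bib` files in one directory and that citation keys are stable across exports. The function names and the directory layout are illustrative, and the regular expression covers only the common `@type{key,` entry header.

```python
import re
from collections import defaultdict
from pathlib import Path

# Matches the start of a BibTeX entry such as "@article{smith2020,"
# and captures the citation key.
ENTRY_KEY = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")


def keys_in_bibtex(text):
    """Return the citation keys of all entries in a BibTeX string."""
    return ENTRY_KEY.findall(text)


def provenance_index(export_dir):
    """Map each citation key to the export files (collections) containing it."""
    index = defaultdict(list)
    for bib_file in sorted(Path(export_dir).glob("*.bib")):
        text = bib_file.read_text(encoding="utf-8")
        for key in keys_in_bibtex(text):
            index[key].append(bib_file.name)
    return dict(index)
```

Calling `provenance_index(directory)` on the export directory then yields, for every key, the collections in which the reference was originally found.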

Most duplicates can be removed by using Zotero's "Duplicate items" section next to the collections. Each item there can be merged with the click of a button, removing the duplicates.

Other duplicates may only be found later on, for example when calculating document similarity. This step primarily reduces the future overhead of processing the same file twice, although it presumably cannot prevent it entirely.
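
Duplicates that Zotero does not flag often differ only in capitalisation, punctuation, or accents in the title. As a hedged illustration (this is not how Zotero detects duplicates), titles can be normalised and grouped. Both function names are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict


def normalise_title(title):
    """Lower-case a title, strip accents, and reduce it to letters and digits."""
    decomposed = unicodedata.normalize("NFKD", title)
    ascii_title = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", " ", ascii_title.lower()).strip()


def group_duplicates(titles):
    """Return groups of titles that share the same normalised form."""
    groups = defaultdict(list)
    for title in titles:
        groups[normalise_title(title)].append(title)
    return [group for group in groups.values() if len(group) > 1]
```

Each returned group is only a candidate set: different works can share a title, so the groups should be checked manually before merging.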

### Step 3.3: Find missing documents
**Requirements:**
* find unindexed documents
* identify document set gaps

Once all queries are processed and all apparent duplicates removed, the last step of task 3 includes three phases:
1. **Find documents**: First, all PDFs of the library are downloaded using Zotero's "Find Available PDFs" feature. Select the library and all references therein, then right-click any of them to find this feature. Once this has been processed, presumably not all PDFs will have been found. By filtering the library by attachments, each reference without a PDF can be inspected. The researcher can attempt to obtain the missing PDFs of promising references through other means, e.g. from their own (department's) library, by buying the work, by contacting the authors, or by consulting a librarian.
2. **Identify gaps**: If works known to the researcher that are deemed relevant to the research question are not in the Zotero library, they can be added as a dedicated collection within the "others" collection. Each document added here should be thoroughly inspected, as its absence can point to one of the following issues:
    * Works in a second (non-English) language are usually not picked up by search engines. This can be circumvented by translating each query into that language, which extends the scope in exchange for significant overhead in data management. While not recommended by default, this process is not advised against and can be conducted if the benefit is deemed worth the investment.
    * The search queries may have a blind spot, either being too exclusive in their keyword combination or not including an important keyword. This indicates a) an oversight in "Step 2.4 Refine with related literature", as well as b) a potential to soft-reset the survey back to Step 2.5. Generally, the references collected so far remain valid and can be kept, leaving only the modified and added queries to be re-run.
        * *Hint*: Instead of adding this kind of missing keyword ```B``` as ```A OR B```, a new query should be added without the ```A```. While in hindsight the ```A OR B``` variation would have been preferred, re-running the ```A OR B``` query will return already recorded results.
    
    Similarly, gaps can be identified without having the right documents to fill them. If such a gap is identified *via known missing keywords*, proceed with a soft reset as described above. If not, this blind spot might be a finding of the SWIM survey itself, indicating a gap not in the search but in the literature itself.
3. **Export library**: Concluding task 3, the Zotero library should be exported as BibTeX, this time with "Export Files" checked.
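
After the export, the references that still lack a PDF can also be listed from the BibTeX file itself. The sketch below assumes that the export writes a `file` field for every entry with an attachment and that each entry starts with `@` at the beginning of a line. Both are assumptions about the export format and should be verified against an actual export. The function names are illustrative.

```python
import re

ENTRY_START = re.compile(r"^(?=@)", re.MULTILINE)
ENTRY_HEADER = re.compile(r"@\w+\s*\{\s*([^,\s]+)\s*,")
FILE_FIELD = re.compile(r"^\s*file\s*=", re.MULTILINE | re.IGNORECASE)


def split_entries(bibtex_text):
    """Split a BibTeX string into its individual entries."""
    parts = ENTRY_START.split(bibtex_text)
    return [part.strip() for part in parts if part.strip().startswith("@")]


def entries_without_file(bibtex_text):
    """Return the citation keys of entries that have no 'file' field."""
    keys = []
    for entry in split_entries(bibtex_text):
        header = ENTRY_HEADER.match(entry)
        if header and not FILE_FIELD.search(entry):
            keys.append(header.group(1))
    return keys
```

The resulting keys form the worklist for obtaining missing PDFs through other means.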

## Task 4: Select
|Step|Result|Requirement|
:----|:----|:----|
|Extract structured data from documents|metadata|22. extract publication date<br>23. extract author(s)<br>24. extract publication venue<br>25. extract specified keywords|
| |content|26. extract document text<br>27. differentiate chapters|
| |bag of words|28. remove special characters<br>29. identify multi-words<br>30. expand acronyms<br>31. lemmatise words<br>32. normalise words<br>33. remove stopwords|
| |contribution statements|34. identify statements|
|Calculate relational measurements within the document set|document representation|35. calculate "term frequency" (tf) and "term frequency - inverse document frequency" (tf-idf)<br>36. calculate word embedding and document embedding<br>37. represent document machine-readable|
| |similarity of documents within the document set|38. consider synonyms<br>39. consider polysemes<br>40. calculate document similarity|
|Identify documents relevant for the research questions|relevance of documents for research question|41. represent research question machine-readable<br>42. calculate document relevancy for research question|
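
Requirements 35 and 40 in the table above can be illustrated with a minimal sketch using only the standard library. It assumes documents are already tokenised, uses relative term frequency and the plain logarithmic idf (one of several common variants), and compares documents by cosine similarity. The function names are illustrative and not part of the SWIM specification.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Return one {term: tf-idf weight} dict per tokenised document."""
    n_docs = len(documents)
    # Document frequency: number of documents each term appears in.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Terms that occur in every document receive a weight of zero under this idf variant, so they do not contribute to the similarity.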

### Step 4.1: Extract structured data from documents
**Requirements:**
* metadata
    * extract publication date
    * extract author(s)
    * extract publication venue
    * extract specified keywords
* content
    * extract document text
    * differentiate chapters
* bag of words
    * remove special characters
    * identify multi-words
    * expand acronyms
    * lemmatise words
    * normalise words
    * remove stopwords
* contribution statements
    * identify statements

This step extracts several kinds of structured data from the documents. From here on, the workflow uses Python, which requires some setup.

```python
# Setup
directory_path = "..."
```


#### Metadata
Given the BibTeX export, most of these metadata are already structured.

```python
# ...
```
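
As a hedged sketch of how the metadata could be read from a single BibTeX entry, the following uses only the standard library. It assumes brace-delimited field values with one field per line, and the field names `year`, `author`, `journal`, `booktitle`, and `keywords`. Other export formats may use different fields (for example `date`), so the names are assumptions, and both functions are hypothetical helpers.

```python
import re


def field_value(entry, name):
    """Return the brace-delimited value of a BibTeX field, or None if absent."""
    pattern = r"^\s*" + re.escape(name) + r"\s*=\s*\{"
    match = re.search(pattern, entry, flags=re.MULTILINE | re.IGNORECASE)
    if match is None:
        return None
    depth, start = 1, match.end()
    # Walk forward until the opening brace is balanced again.
    for position in range(start, len(entry)):
        if entry[position] == "{":
            depth += 1
        elif entry[position] == "}":
            depth -= 1
            if depth == 0:
                return entry[start:position]
    return None  # unbalanced braces


def extract_metadata(entry):
    """Collect publication year, authors, venue, and keywords of one entry."""
    authors = field_value(entry, "author")
    keywords = field_value(entry, "keywords")
    return {
        "year": field_value(entry, "year"),
        "authors": [a.strip() for a in authors.split(" and ")] if authors else [],
        "venue": field_value(entry, "journal") or field_value(entry, "booktitle"),
        "keywords": [k.strip() for k in re.split(r"[,;]", keywords)] if keywords else [],
    }
```

Missing fields are returned as `None` or an empty list, which makes incomplete records easy to spot.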

#### Content
Extract document text via .... Then differentiate chapters.

```python
# ...
```
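
Once the text of a document is available as a plain string, chapters could be differentiated with a simple heading heuristic. The sketch below assumes top-level headings are numbered lines such as `1 Introduction` or `2. Related Work`. This is an assumption about the documents, not a general rule, and numbered lines in the body text will produce false positives.

```python
import re

# Assumption: a chapter heading is a line that starts with a number,
# an optional dot, and a capitalised title of at most ~80 characters.
HEADING = re.compile(r"^(\d+)\.?[ \t]+([A-Z][^\n]{0,80})$", re.MULTILINE)


def split_chapters(text):
    """Return (heading, body) pairs for each numbered heading found in text."""
    matches = list(HEADING.finditer(text))
    chapters = []
    for index, match in enumerate(matches):
        if index + 1 < len(matches):
            end = matches[index + 1].start()
        else:
            end = len(text)
        chapters.append((match.group(2).strip(), text[match.end():end].strip()))
    return chapters
```

Text before the first heading (title, abstract) is not returned by this sketch and would need separate handling.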

#### Bag of words
* remove special characters
* identify multi-words
* expand acronyms
* lemmatise words
* normalise words
* remove stopwords

```python
# ...
```
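
The six requirements above can be chained into one small pipeline. The following is a minimal standard-library sketch, not the final implementation: the acronym table, multi-word list, and stopword set are tiny hypothetical examples, the order of the steps is one possible choice, and `naive_lemma` is a crude plural-stripping stand-in for a real lemmatiser.

```python
import re
from collections import Counter

# Hypothetical resources: a real survey would supply domain-specific lists.
ACRONYMS = {"nlp": "natural language processing"}
MULTI_WORDS = ["natural language processing", "machine learning"]
STOPWORDS = {"the", "a", "an", "of", "and", "is", "are", "for", "in", "to", "on"}


def naive_lemma(word):
    """Crude plural stripping; a stand-in for a real lemmatiser."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def bag_of_words(text):
    """Turn raw text into a Counter of normalised tokens."""
    text = text.lower()                                  # normalise case
    text = re.sub(r"[^a-z0-9\s]", " ", text)             # remove special characters
    words = [ACRONYMS.get(w, w) for w in text.split()]   # expand acronyms
    text = " ".join(words)
    for phrase in MULTI_WORDS:                           # join multi-words into one token
        joined = phrase.replace(" ", "_")
        text = re.sub(r"\b" + re.escape(phrase) + r"\b", joined, text)
    tokens = [t if "_" in t else naive_lemma(t) for t in text.split()]
    return Counter(t for t in tokens if t not in STOPWORDS)  # remove stopwords
```

Expanding acronyms before joining multi-words lets an acronym and its long form end up as the same token.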

#### Contribution statements
* identify statements

```python
# ...
```
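
One simple, admittedly rough way to identify candidate contribution statements is to look for sentences that contain typical cue phrases. The cue list below is a hypothetical starting point and the sentence splitting is naive, so the output should be treated as candidates for manual review rather than a definitive extraction.

```python
import re

# Hypothetical cue phrases that often introduce a contribution statement.
CUE_PHRASES = [
    "we propose",
    "we present",
    "we introduce",
    "we show that",
    "this paper presents",
    "our contribution",
]
CUE_PATTERN = re.compile(
    "|".join(re.escape(cue) for cue in CUE_PHRASES), re.IGNORECASE
)


def contribution_statements(text):
    """Return the sentences of text that contain at least one cue phrase."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [sentence for sentence in sentences if CUE_PATTERN.search(sentence)]
```

Abbreviations such as "e.g." will split sentences incorrectly under this heuristic, which is one reason to replace it with a proper sentence tokeniser later.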

## Task 5: Evaluate
|Step|Result|Requirement|
:----|:----|:----|
|Remove out-of-scope document subset|in-scope document set|43. define in-/exclusion criteria<br>44. exclude documents|
|Evaluate relevancy measurement|selection approval|45. evaluate similarity of documents<br>46. evaluate relevance of documents|
|Classify document subsets|literature subsets|47. define thresholds<br>48. divide document set into subsets|

### Remove out-of-scope document subset
**Requirements:**
* define in-/exclusion criteria
* exclude documents

### Evaluate relevancy measurement
**Requirements:**
* evaluate similarity of documents
* evaluate relevance of documents

### Classify document subsets
**Requirements:**
* define thresholds
* divide document set into subsets
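
One possible way to divide the document set is sketched below, under the assumption that every document already has a numeric relevance score. The subset names, the threshold values, and the function itself are hypothetical and only illustrate the mechanics of applying thresholds.

```python
def classify_by_threshold(scores, thresholds):
    """Assign each document to the highest subset whose lower bound it reaches.

    scores: {document id: relevance score}
    thresholds: [(subset name, lower bound), ...]
    Documents below every bound are collected under "excluded".
    """
    ordered = sorted(thresholds, key=lambda item: item[1], reverse=True)
    subsets = {name: [] for name, _ in ordered}
    subsets["excluded"] = []
    for doc_id, score in scores.items():
        for name, lower_bound in ordered:
            if score >= lower_bound:
                subsets[name].append(doc_id)
                break
        else:
            # No threshold was reached.
            subsets["excluded"].append(doc_id)
    return subsets
```

For example, `classify_by_threshold({"a": 0.9, "b": 0.5, "c": 0.1}, [("core", 0.7), ("peripheral", 0.3)])` places `a` in `core`, `b` in `peripheral`, and `c` in `excluded`.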