# Stage I
In this stage we transition from an implicit informal research interest to a set of formalized human- and machine-readable research questions.

**Workload distribution**
|Actor|Time|
|:----|:----|
|Researcher|300 min|
|Machine|15 min|
|Domain Experts|10 min|

**Tools**
* LLM (e.g. [ChatGPT](https://chat.openai.com/))
* Search Engine (e.g. [Google Scholar](https://scholar.google.de/schhp?#d=gs_asd))
* Optional:
    * Markdown file viewer (e.g. [Obsidian](https://obsidian.md/), preferably with [Dataview](https://github.com/blacksmithgu/obsidian-dataview) plugin)

## Task 1: Planning a review
|Step|Result|Requirement|
|:----|:----|:----|
|Formulate research interest|explicit informal research interest|1. formalize abstract interest<br>2. utilize central keywords|

<div align="center">
<img src="images/Task 1.jpg" width="527" />
</div>

### Step 1.1: Formulate research interest
Requirements:
* formalize abstract interest
* utilize central keywords

This task is one of many that can be assisted by a Large Language Model (LLM). Each of these tasks will be described as an LLM prompt and tagged as such.

#### LLM prompt

```
Given the following research interest:
"<describe your research interest. Describe what you want to do to the best of your abilities. This can range from one sentence to a whole paragraph.>"

Provide a list of relevant keywords from the given research interest. Provide the list with **bold** keywords.
Then provide a list of 10 additional relevant scientific keywords for this research interest. Briefly describe each keyword's relevancy to the research interest. Provide the list with **bold** keywords and the relevancy description as a subitem.
Then formalize this research interest using these scientific keywords. Highlight the keywords in **bold**.
```

#### Example
```
Given the following research interest:
"I want to understand what engineers use to describe their knowledge. I know they use CPACS and XML, maybe some use Ontologies and Knowledge graphs. I want to analyze if they use natural language, data models, markup language, ..."

Provide a list of relevant keywords from the given research interest. Provide the list with **bold** keywords.
Then provide a list of 10 additional relevant scientific keywords for this research interest. Briefly describe each keyword's relevancy to the research interest. Provide the list with **bold** keywords and the relevancy description as a subitem.
Then formalize this research interest using these scientific keywords. Highlight the keywords in **bold**.

Provide the entire answer as one code block.
```

Note the added last line: it makes the reply easy to copy and paste, at the cost of reduced readability within the chat application.
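If the prompt is assembled in a script rather than typed by hand, the template can be filled programmatically. The following is a minimal sketch in Python; the names `PROMPT_TEMPLATE`, `CODE_BLOCK_SUFFIX` and `build_keyword_prompt` are hypothetical helpers for illustration and not part of the workflow itself.

```python
# Hypothetical helper: fills the Step 1.1 prompt template with a research interest.
PROMPT_TEMPLATE = """Given the following research interest:
"{interest}"

Provide a list of relevant keywords from the given research interest. Provide the list with **bold** keywords.
Then provide a list of 10 additional relevant scientific keywords for this research interest. Briefly describe each keyword's relevancy to the research interest. Provide the list with **bold** keywords and the relevancy description as a subitem.
Then formalize this research interest using these scientific keywords. Highlight the keywords in **bold**."""

# Optional last line that asks for a copy-paste friendly reply.
CODE_BLOCK_SUFFIX = "Provide the entire answer as one code block."


def build_keyword_prompt(interest: str, as_code_block: bool = True) -> str:
    """Return the Step 1.1 prompt for an informal research interest."""
    interest = interest.strip()
    if not interest:
        raise ValueError("The research interest must not be empty.")
    prompt = PROMPT_TEMPLATE.format(interest=interest)
    if as_code_block:
        prompt += "\n\n" + CODE_BLOCK_SUFFIX
    return prompt
```

Calling `build_keyword_prompt("I want to understand what engineers use to describe their knowledge.")` yields a prompt of the same shape as the example above, ready to be pasted into the LLM.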

#### Reply
**Keywords from Given Research Interest:**
- **CPACS**
- **XML**
- **Ontologies**
- **Knowledge graphs**
- **Natural language**
- **Data models**
- **Markup language**

**Additional Scientific Keywords:**
1. **Semantic Web**
   - Relevancy: Engineers might utilize Semantic Web technologies to represent and link data in a meaningful way, enhancing interoperability and knowledge representation.

2. **Metadata**
   - Relevancy: Metadata is essential for organizing and describing engineering knowledge, providing context and facilitating efficient retrieval.

3. **RDF (Resource Description Framework)**
   - Relevancy: RDF can be used for expressing relationships in a machine-readable form, contributing to knowledge representation and sharing.

4. **Linked Data**
   - Relevancy: Engineers may employ linked data principles to connect and expose structured information, fostering a more interconnected knowledge ecosystem.

5. **Knowledge Representation**
   - Relevancy: Understanding how engineers formally represent knowledge is crucial, and knowledge representation frameworks play a key role in this process.

6. **Data Integration**
   - Relevancy: Engineers often need to integrate data from diverse sources; studying data integration methods is pertinent to understanding their knowledge-sharing practices.

7. **Machine Learning**
   - Relevancy: Machine learning techniques can be applied to analyze patterns and trends in engineering knowledge usage, offering insights into preferences and practices.

8. **Collaborative Knowledge**
   - Relevancy: Investigating how engineers collaborate on knowledge creation and sharing can provide a holistic view of their information exchange processes.

9. **Taxonomy**
   - Relevancy: Taxonomies are foundational for organizing and classifying knowledge, offering a structured framework for information retrieval.

10. **Text Mining**
    - Relevancy: Engineers may use text mining to extract valuable insights from unstructured data, such as documents and reports, contributing to their knowledge base.

**Formalized Research Interest:**
"I aim to investigate how **engineers** articulate and share their knowledge using various technologies, including **CPACS** and **XML**. Additionally, I seek to explore the role of **ontologies** and **knowledge graphs** in knowledge representation. The study will analyze the utilization of **natural language**, **data models**, and **markup language**. The investigation extends to examining the application of **Semantic Web** technologies, **metadata**, **RDF (Resource Description Framework)**, **linked data**, and other related concepts. Furthermore, the research will delve into the realms of **knowledge representation**, **data integration**, **machine learning**, **collaborative knowledge** practices, **taxonomy**, and **text mining** within the context of engineering knowledge sharing."



## Task 2: Defining a scope
|Step|Result|Requirement|
|:----|:----|:----|
|Check for related research questions|narrowed informal research interest|3. know what previous scholarly works already covered<br>4. know what previous scholarly works were missing|
|Refine scientific interest|specific research question|5. use specific vocabulary<br>6. state a clear scope<br>7. state a clear perspective|
|Formulate Search Query|preliminary weighted keywords and queries|8. use relevant keywords<br>9. weight individual keywords|
|Refine with related literature|weighted keywords and refined queries|10. extend the vocabulary<br>11. identify polysemes and synonyms|
|Re-evaluate with domain experts|validated weighted keywords and queries|12. validate research questions<br>13. validate keywords<br>14. validate weights<br>15. validate search query|

<div align="center">
<img src="images/Task 2.jpg" width="1398" />
</div>

### Step 2.1: Check for related research questions
Requirements:
* Know what previous scholarly works already covered
* Know what previous scholarly works were missing

To achieve this, we query large databases, such as [Google Scholar](https://scholar.google.de/) and [Semantic Scholar](https://www.semanticscholar.org/), as well as more tailored solutions like the [ORKG](https://orkg.org/) and [OKMaps](https://openknowledgemaps.org/). While the general approach stays the same (Search -> Find -> Open -> Read document), there are differences when using each solution:

* **Google Scholar**
    * Many operators for querying https://scholar.google.de/ are deprecated; only ```A -B```, ```A OR B``` and ```"A B"```, as well as ```author:A``` and ```source:B```, still work. The advanced search dialog at https://scholar.google.de/schhp?#d=gs_asd helps to get a feeling for how queries are built. The two most important operators are:
    * ```"knowledge work"``` searches for the exact word group and ignores instances such as "work requires specific knowledge".
    * ```knowledge OR work``` searches for works that include either word.<br>Note that ```OR``` links only the objects (word or word group) directly to its left and right, making these two queries identical:
        *   ```"knowledge work" OR Curation wiki```
        *   ```Curation OR "knowledge work" wiki```
* **Semantic Scholar**
    * Available at https://www.semanticscholar.org/. Semantic Scholar is queried similarly to Google Scholar, but with a different advanced search syntax. It is more focused on API access, which has enabled tools such as the next one.
* **Open Knowledge Maps (OKMaps)**
    * https://openknowledgemaps.org/ uses the Semantic Scholar API to create "Knowledge Maps", which present up to 100 documents most relevant to a given topic, clustered by their overarching topics. The user queries it the same way as Semantic Scholar, with a list of keywords in no particular syntax.
* **Open Research Knowledge Graph (ORKG)**
    * https://orkg.org/ can be directly queried for related research fields and questions:
        * https://orkg.org/search/knowledge?types=ResearchField,Problem
    * Additional types such as Paper, List, Comparison and Review are also worth including once explicitly formalized research questions are being evaluated.
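The ORKG search URL shown above can also be generated for other terms and result types. The following is a minimal sketch in Python that only mirrors the URL pattern from the example; the helper name `orkg_search_url` is hypothetical, and it is an assumption that terms containing spaces or special characters are accepted in percent-encoded form.

```python
from urllib.parse import quote

ORKG_SEARCH_BASE = "https://orkg.org/search"


def orkg_search_url(term: str, types: list[str]) -> str:
    """Build an ORKG search URL following the pattern of the example above."""
    term = term.strip()
    if not term:
        raise ValueError("The search term must not be empty.")
    # Percent-encode the term so that spaces and special characters are URL-safe.
    path = quote(term, safe="")
    # Type names are joined with literal commas, as in the example URL.
    type_list = ",".join(quote(t, safe="") for t in types)
    return f"{ORKG_SEARCH_BASE}/{path}?types={type_list}"


print(orkg_search_url("knowledge", ["ResearchField", "Problem"]))
# https://orkg.org/search/knowledge?types=ResearchField,Problem
```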

### Step 2.2: Refine scientific interest
Requirements:
* use specific vocabulary
* state a clear scope
* state a clear perspective

#### LLM prompt
```
The goal is to transform a given scientific interest into scientific research questions.

A research question meets 3 requirements:
1. Specificity of topic: The topic must be precisely defined and identifiable.
2. Focus of topic: There must be one clearly identifiable subject of study.
3. Perspective: A researchable interest emerges from the vantage point of an academic discipline.

Provide a list "Potential topics" and a list "Potential perspectives" that are relevant for this scientific interest.

Provide clear and researchable research questions. Provide the minimum number of research questions needed to cover the scope of the research interest. Briefly describe how each research question meets all requirements.

Provide a list "Covered topics" and a list "Covered perspectives" that are covered by these research questions.

Here is an example of a successful transformation:
Scientific Interest (Input): "I am interested in why students are not achieving."
Research Question (Output): "What effect does understanding specific academic language have on achievement in the natural sciences for third-grade Hispanic second-language learners?"

Scientific interest (Input):
"<Your explicit informal research interest from Step 1.1>"
```

#### Example
```
The goal is to transform a given scientific interest into scientific research questions.

A research question meets 3 requirements:
1. Specificity of topic: The topic must be precisely defined and identifiable.
2. Focus of topic: There must be one clearly identifiable subject of study.
3. Perspective: A researchable interest emerges from the vantage point of an academic discipline.

Provide a list "Potential topics" and a list "Potential perspectives" that are relevant for this scientific interest.

Provide clear and researchable research questions. Provide the minimum number of research questions needed to cover the scope of the research interest. Briefly describe how each research question meets all requirements.

Provide a list "Covered topics" and a list "Covered perspectives" that are covered by these research questions.

Here is an example of a successful transformation:
Scientific Interest (Input): "I am interested in why students are not achieving."
Research Question (Output): "What effect does understanding specific academic language have on achievement in the natural sciences for third-grade Hispanic second-language learners?"

Scientific interest (Input):
"I aim to investigate how **engineers** articulate and share their knowledge using various technologies, including **CPACS** and **XML**. Additionally, I seek to explore the role of **ontologies** and **knowledge graphs** in knowledge representation. The study will analyze the utilization of **natural language**, **data models**, and **markup language**. The investigation extends to examining the application of **Semantic Web** technologies, **metadata**, **RDF (Resource Description Framework)**, **linked data**, and other related concepts. Furthermore, the research will delve into the realms of **knowledge representation**, **data integration**, **machine learning**, **collaborative knowledge** practices, **taxonomy**, and **text mining** within the context of engineering knowledge sharing."
```

#### Reply
**Potential Topics:**
1. **Utilization of CPACS and XML in Engineering Knowledge Sharing**
2. **Role of Ontologies and Knowledge Graphs in Engineering Knowledge Representation**
3. **Analysis of Natural Language Usage in Engineering Communication**
4. **Impact of Data Models on Engineering Knowledge Articulation**
5. **Evaluation of Markup Language in Engineering Documentation**
6. **Application of Semantic Web Technologies in Engineering**
7. **Role of Metadata in Organizing and Describing Engineering Knowledge**
8. **Effectiveness of RDF (Resource Description Framework) in Engineering Knowledge Representation**
9. **Utilization of Linked Data in Interconnecting Engineering Information**
10. **Implementation of Knowledge Representation Frameworks in Engineering Practices**

**Potential Perspectives:**
1. **Engineering Education**
2. **Information Science**
3. **Computer Science**
4. **Data Science**
5. **Knowledge Management**

**Research Questions:**
1. **How do engineers utilize CPACS and XML in articulating and sharing their knowledge within the field of [chosen perspective] education?**
   - *Meets Specificity:* Focuses on the specific use of CPACS and XML in knowledge sharing.
   - *Meets Focus:* Examines the subject from the perspective of engineering education.
   - *Meets Perspective:* Relates to the vantage point of engineering education.

2. **What is the role of ontologies and knowledge graphs in representing engineering knowledge, and how do they contribute to [chosen perspective] information science?**
   - *Meets Specificity:* Focuses on the role of ontologies and knowledge graphs in knowledge representation.
   - *Meets Focus:* Examines the subject from the perspective of information science.
   - *Meets Perspective:* Relates to the vantage point of information science.

3. **In what ways does natural language play a significant role in the communication practices of engineers within the context of [chosen perspective] computer science?**
   - *Meets Specificity:* Focuses on the role of natural language in engineering communication.
   - *Meets Focus:* Examines the subject from the perspective of computer science.
   - *Meets Perspective:* Relates to the vantage point of computer science.

**Covered Topics:**
- Utilization of CPACS and XML
- Role of Ontologies and Knowledge Graphs
- Natural Language Usage
- Data Models
- Markup Language
- Semantic Web Technologies
- Metadata
- RDF (Resource Description Framework)
- Linked Data
- Knowledge Representation Frameworks

**Covered Perspectives:**
- Engineering Education
- Information Science
- Computer Science
- Data Science
- Knowledge Management



### Step 2.3: Formulate search query
Requirements:
* use relevant keywords
* weight individual keywords

Following the query guide for Google Scholar, we create at least one search query per research question. Each query meets the following requirements:
* Topic specific: Each query must include keywords to narrow down on topic specific documents.
* Reduce blind spots: Google Scholar tries to match all keywords, so each additional keyword reduces the number of potential results. Merge similar keywords with ```OR``` to include more of the relevant documents without excluding others.
* Length: Google Scholar cuts the query off after 256 characters. Make sure the query does not exceed this length.

Important notes in completing this task:
* LLMs can once again assist in finding relevant keywords, but are (currently) unreliable at tasks such as counting characters. They should be used for keyword collection only; the resulting keywords should be manually curated, completed and weighted.
* The query itself should also be created manually. A good query returns between 200 and 500 results.
* Google Scholar is just one of the search engines. This query is the baseline for all engines and might need modifications depending on the specific search syntax.
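Although the choice of keywords remains manual, the mechanical parts of the requirements above (quoting word groups, merging similar keywords with `OR`, and checking the length limit) can be scripted. The following is a minimal sketch in Python; the helper names `quote_term` and `build_query` are hypothetical, and the sketch assumes that each inner list holds interchangeable keywords while separate lists must all match.

```python
MAX_QUERY_LENGTH = 256  # character limit stated above for Google Scholar


def quote_term(term: str) -> str:
    """Wrap multi-word terms in double quotes so they match as exact word groups."""
    term = term.strip()
    return f'"{term}"' if " " in term else term


def build_query(groups: list[list[str]]) -> str:
    """Join the keywords of each group with OR and place the groups side by side."""
    parts = []
    for group in groups:
        terms = [quote_term(t) for t in group if t.strip()]
        if terms:
            parts.append(" OR ".join(terms))
    query = " ".join(parts)
    # Fail loudly instead of letting the search engine silently cut the query.
    if len(query) > MAX_QUERY_LENGTH:
        raise ValueError(
            f"Query has {len(query)} characters; the limit is {MAX_QUERY_LENGTH}."
        )
    return query


query = build_query([
    ["CPACS", "XML"],
    ["knowledge sharing", "knowledge exchange"],
    ["aerospace engineering"],
])
print(query)
# CPACS OR XML "knowledge sharing" OR "knowledge exchange" "aerospace engineering"
```

The number of results still has to be checked by running the query in the search engine itself.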

#### LLM prompt
````
The goal is to define the most important keywords for a search query to answer a research question.

Cluster at least 20 keywords into 4 informatively titled groups.
Each group contains at least 5 keywords and has a title that accurately reflects the group.
Each keyword has a relevancy weight from 1 (low) to 10 (high).

Example:
```
### title_group_1
keyword_1_group_1:: 10
keyword_2_group_1:: 7
...
### title_group_2
keyword_1_group_2:: 9
...
```

Research question: "<RQ 1 from above>"

Provide the entire answer as a code block.
````

#### Example
````
The goal is to define the most important keywords for a search query to answer a research question.

Cluster at least 20 keywords into 4 informatively titled groups.
Each group contains at least 5 keywords and has a title that accurately reflects the group.
Each keyword has a relevancy weight from 1 (low) to 10 (high).

Example:
```
### title_group_1
keyword_1_group_1:: 10
keyword_2_group_1:: 7
...
### title_group_2
keyword_1_group_2:: 9
...
```

Research question: "How do engineers utilize CPACS and XML in articulating and sharing their knowledge within the field of aerospace engineering?"

Provide the entire answer as a code block.
````

#### Reply
```
### CPACS in Aerospace Engineering
CPACS utilization:: 10
XML integration in CPACS:: 9
Aerospace engineering standards:: 8
CPACS data exchange:: 8
XML-based knowledge sharing:: 7

### Knowledge Articulation Methods
Engineering knowledge sharing:: 9
Articulation techniques in aerospace:: 8
Collaborative knowledge platforms:: 7
Knowledge dissemination strategies:: 7
Engineering documentation standards:: 6

### XML in Aerospace Engineering
XML in aviation:: 9
Aerospace data representation:: 8
XML schema for engineering:: 7
XML-based communication in aerospace:: 7
Data interoperability in aerospace:: 6

### Collaborative Engineering Practices
Collaborative aerospace projects:: 9
Knowledge sharing in engineering teams:: 8
Team-based engineering workflows:: 7
Collaborative tools in aerospace:: 7
Engineering best practices sharing:: 6
```
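A reply in this `keyword:: weight` format can be read back into a data structure for manual curation and weighting. The following is a minimal sketch in Python; the function name `parse_weighted_keywords` is hypothetical, and the sketch assumes that group titles start with `### ` and that every weight is an integer.

```python
def parse_weighted_keywords(reply: str) -> dict[str, dict[str, int]]:
    """Parse '### group' headings and 'keyword:: weight' lines into nested dicts."""
    groups: dict[str, dict[str, int]] = {}
    current = None
    for raw_line in reply.splitlines():
        line = raw_line.strip()
        if line.startswith("### "):
            # A heading opens a new keyword group.
            current = line[4:].strip()
            groups[current] = {}
        elif "::" in line and current is not None:
            keyword, _, weight = line.rpartition("::")
            try:
                groups[current][keyword.strip()] = int(weight)
            except ValueError:
                continue  # skip lines whose weight is not an integer
    return groups


reply = """### CPACS in Aerospace Engineering
CPACS utilization:: 10
XML integration in CPACS:: 9

### Knowledge Articulation Methods
Engineering knowledge sharing:: 9
"""
weights = parse_weighted_keywords(reply)

# Flatten all groups and sort by weight to see the strongest query candidates first.
top = sorted(
    ((kw, w) for group in weights.values() for kw, w in group.items()),
    key=lambda item: item[1],
    reverse=True,
)
print(top[0])  # ('CPACS utilization', 10)
```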

#### Manually synthesized query
```CPACS OR XML "Knowledge sharing" OR "knowledge communication" OR "Knowledge exchange" OR "Knowledge articulation" OR "Best Practice sharing" "Aerospace engineering"```

### Step 2.4: Refine with related literature
**Requirements**:
* extend the vocabulary
* identify polysemes and synonyms

**Workload**: Up to 60 min per research question 

By running the search query and reading the resulting documents, its fitness for the research question can be evaluated. Different types of findings might warrant a re-evaluation of the search query:
* **Missing keywords**: Keywords that are relevant to the research question, but missing from the list, should be added.
* **Wrong weight**: Many reasons can lead to the re-evaluation of a keyword weight. 
    * **Polysemes** like "web" lead to inclusion of unrelated work. Their weight should be **reduced**.
    * **Synonyms** like "work" and "labour" are generally harmless. They can be added to the list as missing keywords. There is usually a gradient in fitness to the research question between them, which should be represented in their weights. If the query is below the length limit, they can also be included via ```OR``` next to the original word.

### Step 2.5: Re-evaluate with domain experts
Requirements:
* validate research questions
* validate keywords
* validate weights
* validate search query

The final step of Stage I is the re-evaluation with domain experts. The goal of Task 2 is to be exceptionally well prepared for this final evaluation, minimizing the time required from external experts.

The domain expert is presented with the **research questions**, including the **weighted keywords** and **search queries**. The validation includes:
* re-formulation, addition and removal of research questions
* addition, removal and re-structuring of the keywords
* increasing and decreasing keyword weights
* re-formulation, addition and removal of search queries

Once this step is completed, the research questions for the SWIM survey are set. 

#### Modifying the research questions after Step 2.5
Modifying the research questions later in the survey process can lead to inconsistencies and reduce overall quality. Findings that would warrant a modification are preferably addressed in the Discussion section; alternatively, they warrant a soft reset of the SWIM survey by repeating Step 2.5 and re-assessing all following steps.