# Microsourcing: A literature review (illustration)

Gerit Wagner (Otto-Friedrich-Universität Bamberg)  
Julian Prester (The University of Sydney Business School, University of Sydney)  
Roman Lukyanenko (McIntire School of Commerce, University of Virginia)  
Guy Paré (Department of Information Technologies, HEC Montréal)

> **Important**
>
> The data repository (CoLRev project) is currently stored separately: <a href="https://github.com/fs-ise/C5-DM-vignette" target="_blank">fs-ise/C5-DM-vignette</a>. Before submission, the Quarto manuscript (vignette) will be added as the last commit on top of the CoLRev project (it should then be available at `data/data/paper.md`). For now, it is kept separate to allow for force-push updates.
>
> **TODO**:
>
> -   Update links in this document
> -   Include vignette (screenshot?) in the paper

## Plan

The review is conducted using a <a href="https://github.com/fs-ise/C5-DM-vignette" target="_blank">shared GitHub repository</a>, which was synchronized locally by the team. <span style="display:inline-block;padding:.15rem .5rem;border-radius:999px;
background:#eef;color:#224;font-size:.85em;">Curate</span>

## Search

We specified search strategies for the DBLP and Crossref application programming interfaces (APIs)[1] using the core keyword *microsourcing* and a set of semantically related synonyms. We also reused samples from prior reviews ([Wagner, Prester, and Paré 2021](#ref-WagnerPresterPare2021); [Fiers 2023](#ref-Fiers2023)). The resulting query formulations were systematically tabulated to document the conceptual scope of the search and to enable consistent execution across data sources (see <a href="#tbl-searches" class="quarto-xref">Table 1</a>).

| Source | Search strategy | Search results |
|:---|:---|:---|
| Crossref (API search) | [crossref_search_history.json](https://github.com/fs-ise/C5-DM-vignette/blob/main/data/search/crossref_search_history.json) | [crossref.bib](https://github.com/fs-ise/C5-DM-vignette/blob/main/data/search/crossref.bib) |
| DBLP (API search) | [dblp_search_history.json](https://github.com/fs-ise/C5-DM-vignette/blob/main/data/search/dblp_search_history.json) | [dblp.bib](https://github.com/fs-ise/C5-DM-vignette/blob/main/data/search/dblp.bib) |
| Prior review: Wagner, Prester, and Paré ([2021](#ref-WagnerPresterPare2021)) | [Wagner2021_search_history.json](https://github.com/fs-ise/C5-DM-vignette/blob/main/data/search/WagnerPresterPare2021_search_history.json) | [Wagner2021.bib](https://github.com/fs-ise/C5-DM-vignette/blob/main/data/search/WagnerPresterPare2021.bib) |
| Prior review: Fiers ([2023](#ref-Fiers2023)) | [Fiers2023_search_history.json](https://github.com/fs-ise/C5-DM-vignette/blob/main/data/search/Fiers2023_search_history.json) | [Fiers2023.csv](https://github.com/fs-ise/C5-DM-vignette/blob/main/data/search/Fiers2023.csv) |

Table 1: Overview of search strategies and results.

The search strategies are stored in JSON format together with the raw data files in the <a href="https://github.com/fs-ise/C5-DM-vignette/tree/main/data/search" target="_blank">data/search</a> directory, in line with the standard of Haddaway et al. ([2022](#ref-HaddawayRethlefsenDaviesEtAl2022)).
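
As a rough illustration of how a keyword-based API search and its documentation might fit together, the sketch below builds query URLs for the public Crossref and DBLP search endpoints and serializes a small search-history record as JSON. Only the keyword *microsourcing* comes from the text; the synonyms, the helper names, and the JSON field names are hypothetical and do not reproduce the files in the repository or the schema of Haddaway et al.

```python
import json
from urllib.parse import urlencode

# Only "microsourcing" is taken from the text; the other terms are
# placeholder synonyms used for illustration.
TERMS = ["microsourcing", "microtask", "crowd work"]


def crossref_url(term: str, rows: int = 100) -> str:
    """Build a Crossref REST API query for works matching one term."""
    return "https://api.crossref.org/works?" + urlencode({"query": term, "rows": rows})


def dblp_url(term: str, hits: int = 100) -> str:
    """Build a DBLP publication search query returning JSON."""
    params = {"q": term, "format": "json", "h": hits}
    return "https://dblp.org/search/publ/api?" + urlencode(params)


def search_history(source: str, urls: list[str], results_file: str) -> str:
    """Serialize a search-history record (hypothetical field names)."""
    record = {
        "source": source,
        "queries": urls,
        "results_file": results_file,
    }
    return json.dumps(record, indent=2)


if __name__ == "__main__":
    urls = [crossref_url(term) for term in TERMS]
    print(search_history("Crossref", urls, "data/search/crossref.bib"))
```

Keeping the query URLs next to the exported results makes it possible to rerun the same search later and compare the outcome with the stored raw data.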

Record metadata is curated as follows: <span style="display:inline-block;padding:.15rem .5rem;border-radius:999px;
background:#eef;color:#224;font-size:.85em;">Curate</span>

-   Data retrieved in the search is stored in the <a href="https://github.com/fs-ise/C5-DM-vignette/tree/main/data/search" target="_blank">data/search</a> directory; the <a href="https://github.com/fs-ise/C5-DM-vignette/commits/main/data/search" target="_blank">Git history of this path</a> shows that the files were preserved in their original form, i.e., treated as raw data
-   Records were imported into the <a href="https://github.com/fs-ise/C5-DM-vignette/blob/main/data/records.bib" target="_blank">data/records.bib</a> as the primary data structure; the <a href="https://github.com/fs-ise/C5-DM-vignette/commits/main/data/records.bib" target="_blank">Git history of this file</a> documents how each record evolved through the process (e.g., manual or computational change of metadata, merging of records, prescreening decisions)
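
The Git histories linked in the list above can also be inspected locally. The following is a minimal sketch, assuming a local clone of the repository and an installed `git` executable; the helper names `history_command` and `path_history` are hypothetical.

```python
import subprocess
from pathlib import Path


def history_command(path: str) -> list[str]:
    """Return the git command that lists the commits touching a path."""
    return ["git", "log", "--oneline", "--", path]


def path_history(repo_dir: str, path: str) -> list[str]:
    """Run the command inside a local clone and return one line per commit."""
    result = subprocess.run(
        history_command(path),
        cwd=Path(repo_dir),
        capture_output=True,
        text=True,
        check=True,  # raise if git fails, e.g. outside a repository
    )
    return result.stdout.splitlines()
```

For example, `path_history(".", "data/search")` would list the commits that touched the raw search files, and `path_history(".", "data/records.bib")` the commits that changed the records.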

For primary data (record metadata), the BibTeX format was chosen, and consistent formatting was ensured by CoLRev ([Wagner and Prester 2025](#ref-WagnerPrester2025)). BibTeX is machine-readable, and changes can be interpreted easily when inspecting the Git history.
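
To illustrate why line-oriented BibTeX records are easy to review in a version history, the sketch below compares two versions of a record with Python's `difflib`. The record is invented for illustration and is not part of the review sample; the file labels are likewise placeholders.

```python
import difflib

# Invented record before metadata preparation.
BEFORE = """@article{Example2020,
  author  = {Doe, Jane},
  title   = {MICROSOURCING PLATFORMS},
  journal = {Example Journal},
  year    = {2020},
}"""

# The same invented record after a title fix and an added DOI field.
AFTER = """@article{Example2020,
  author  = {Doe, Jane},
  title   = {Microsourcing platforms},
  journal = {Example Journal},
  year    = {2020},
  doi     = {10.0000/EXAMPLE},
}"""


def record_diff(old: str, new: str) -> list[str]:
    """Return a unified diff between two versions of a BibTeX record."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="records.bib (before)",
            tofile="records.bib (after)",
            lineterm="",
        )
    )


if __name__ == "__main__":
    print("\n".join(record_diff(BEFORE, AFTER)))
```

Because each field sits on its own line, the diff isolates the changed title and the added field, which is similar to what a reviewer sees in a Git commit view.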

> **Explanation**
>
> Data was structured as follows:
>
> <figure>
> <img src="attachment:figures/recommendation.png" alt="Data structures" />
> <figcaption aria-hidden="true">Data structures</figcaption>
> </figure>

> **TODO**
>
> -   Search-query was used to validate syntactic correctness and …
> -   scope (????)

## Dedupe

Metadata was prepared using CoLRev and extensions (see <a href="https://github.com/fs-ise/C5-DM-vignette/commit/051e115fff389f209afb9a4fbe77e6a33271264c" target="_blank">prep commit</a>).

Deduplication was done using BibDedupe ([Wagner 2024](#ref-Wagner2024)). The resulting changes are recorded in the <a href="https://github.com/fs-ise/C5-DM-vignette/commit/c22178d10fb90954d681f428fc5b08c72b5e6d48" target="_blank">dedupe commit</a>.
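
For readers unfamiliar with bibliographic deduplication, the sketch below shows the general idea with a deliberately simple stand-in: two records are flagged as duplicates when their publication years match and their normalized titles are highly similar. This is not BibDedupe's algorithm or API; the function names, the threshold of 0.9, and the example records are assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase a title, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", title.lower())
    return " ".join(cleaned.split())


def is_duplicate(a: dict, b: dict, threshold: float = 0.9) -> bool:
    """Flag two records as duplicates (simplified stand-in rule)."""
    if a["year"] != b["year"]:
        return False
    similarity = SequenceMatcher(
        None, normalize(a["title"]), normalize(b["title"])
    ).ratio()
    return similarity >= threshold


if __name__ == "__main__":
    # Invented records, e.g. the same paper retrieved from two sources.
    first = {"title": "Microsourcing platforms: a review.", "year": "2020"}
    second = {"title": "Microsourcing Platforms - A Review", "year": "2020"}
    print(is_duplicate(first, second))
```

A dedicated tool additionally has to handle missing fields, abbreviated journal names, and conference-versus-journal versions, which is why merges are worth validating before they are committed.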

> **TODO**
>
> Dedupe changes were validated using the max-diff strategy (`colrev validate XXXX`). Preparation changes were validated using the max-diff strategy (`colrev validate XXXX`).

## Prescreen

> **TODO**
>
> Replace this figure:
>
> <figure>
> <img src="attachment:figures/illustration-lr-transparency.png" alt="Data structures" />
> <figcaption aria-hidden="true">Data structures</figcaption>
> </figure>
>
> -   For prescreening, we tested the new <a href="temp_file.txt" target="_blank">llm-prescreener</a> in <a href="temp_file.txt" target="_blank">ref</a>. Comparison with prescreening decisions of GW showed low reliability with the llm-prescreener (command + kappa). Results were therefore reverted ([ref](temp_file.txt)) and a fully manual prescreen was implemented.
>
> Note: this could also be done in a separate branch, or the changes could be undone using a hard git reset.
>
> -   Screen: full-text documents were shared in a protected drive (link to Dropbox)
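
The reliability comparison mentioned in the note above could be expressed as Cohen's kappa between two sets of prescreening decisions. The following is a minimal sketch of the standard formula; the function name and the label values are hypothetical, and no actual decisions from the project are reproduced here.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Compute Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)
    # Share of items on which both raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, based on each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / n**2
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Kappa corrects raw agreement for chance: when most records are excluded by both raters, raw agreement can look high even if the raters rarely agree on which records to include.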

## Data extraction

For data extraction, the following scenarios were considered:

-   A bibliometric analysis (the citation network is <a href="temp_file.txt" target="_blank">TODO</a>).
-   An emergent mapping study (the notes are [TODO](temp_file.txt) and illustrated <a href="temp_file.txt" target="_blank">here</a>).
-   A structured extraction of evidence (a preliminary coding scheme is <a href="temp_file.txt" target="_blank">TODO</a> and the pilot coding <a href="temp_file.txt" target="_blank">TODO</a>).

> **Explanation**
>
> In line with recommendation XY, data structures range from unstructured to structured; they should be aligned with the type of review.
>
> <figure>
> <img src="attachment:figures/data_structures.png" alt="Data structures" />
> <figcaption aria-hidden="true">Data structures</figcaption>
> </figure>

## Synthesis

The narrative synthesis is in the <a href="https://github.com/fs-ise/C5-DM-vignette/blob/main/data/data/paper.md" target="_blank">paper document</a> in Markdown format, which allows larger teams to work on the same document (similar to the [covid19-review](https://github.com/greenelab/covid19-review)).

To make the review reusable, we added the [CC BY 4.0](https://github.com/fs-ise/C5-DM-vignette/blob/main/LICENSE.txt) license[2].

The current status of the project is automatically updated with every change and reflected in the PRISMA chart ([Page et al. 2021](#ref-PageMcKenzieBossuytEtAl2021)):

[1] For the illustration, we relied on open-access API searches because licensing restrictions do not allow publication of raw data exported from databases such as WOS or EBSCO.

[2] Indexing in SYNERGY and searchRxiv is planned once the review progresses beyond the *illustration* stage.

```python
from py_prisma.py_prisma import plot_prisma_from_records

plot_prisma_from_records(records_path="/home/gerit/ownCloud/action-office/LRDM/C5-DM-vignette/data/records.bib", show=True)
```
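
To give an idea of how the counts behind such a chart could be derived from the records file, the sketch below tallies records by a status field. It assumes that each record carries a field named `colrev_status`; that field name, the status values, and the sample records are assumptions for illustration and are not taken from the repository. The sketch is independent of the plotting library used in the project.

```python
import re
from collections import Counter

# Matches lines such as:  colrev_status = {md_processed},
STATUS_PATTERN = re.compile(r"^\s*colrev_status\s*=\s*\{([^}]*)\}", re.MULTILINE)


def status_counts(bibtex_text: str) -> Counter:
    """Count records per status value found in a BibTeX string."""
    return Counter(value.strip() for value in STATUS_PATTERN.findall(bibtex_text))


# Invented sample with assumed status values.
SAMPLE = """@article{A2020,
  colrev_status = {rev_prescreen_included},
  title         = {First example},
}
@article{B2021,
  colrev_status = {rev_prescreen_excluded},
  title         = {Second example},
}
@article{C2022,
  colrev_status = {rev_prescreen_included},
  title         = {Third example},
}"""

if __name__ == "__main__":
    for status, count in status_counts(SAMPLE).most_common():
        print(f"{status}: {count}")
```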

> **TODO**
>
> The PRISMA chart is generated but needs to be tested properly.

# References

Fiers, Fien. 2023. “Inequality and Discrimination in the Online Labor Market: A Scoping Review.” *New Media & Society* 25 (12): 3714–34. <https://doi.org/10.1177/14614448221128379>.

Haddaway, Neal R., Melissa L. Rethlefsen, Melinda Davies, Julie Glanville, Bethany McGowan, Kate Nyhan, and Sarah Young. 2022. “A Suggested Data Structure for Transparent and Repeatable Reporting of Bibliographic Searching.” *Campbell Systematic Reviews* 18 (4): 1–12. <https://doi.org/10.1002/CL2.1288>.

Page, Matthew J., Joanne E. McKenzie, Patrick M. Bossuyt, Isabelle Boutron, Tammy C. Hoffmann, Cynthia D. Mulrow, Larissa Shamseer, et al. 2021. “The PRISMA 2020 Statement: An Updated Guideline for Reporting Systematic Reviews.” *Systematic Reviews* 10 (1). <https://doi.org/10.1186/S13643-021-01626-4>.

Wagner, Gerit. 2024. “BibDedupe: An Open-Source Python Library for Bibliographic Record Deduplication.” *Journal of Open Source Software* 9 (97): 6318. <https://doi.org/10.21105/JOSS.06318>.

Wagner, Gerit, and Julian Prester. 2025. “CoLRev: An Open-Source Environment for Collaborative Reviews.” <https://github.com/CoLRev-Environment/colrev>.

Wagner, Gerit, Julian Prester, and Guy Paré. 2021. “Exploring the Boundaries and Processes of Digital Platforms for Knowledge Work: A Review of Information Systems Research.” *The Journal of Strategic Information Systems* 30 (4): 101694. <https://doi.org/10.1016/j.jsis.2021.101694>.