Skip to content

ScienceDeJargonizer simplifies scientific jargon for journalists without scientific backgrounds, using LLMs . This tool transforms complex terms into clear explanations, aiding accurate and accessible science reporting.

Notifications You must be signed in to change notification settings

ericlee878/ScienceDeJargonizer

Repository files navigation

Science De-jargonizer

About

We built a little system and wrote a short paper about it to submit to the Computation+Journalism Symposium 2024: the Science De-jargonizer can simplify scientific jargon for journalists without scientific backgrounds, using Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG). This tool transforms complex terms into clear explanations, aiding accurate and accessible science reporting. It is a prototype, intended as a roof-of-concept to demonstrate the potential benefits and drawbacks of such an application. Here's our codebase for the study, with all our data collection code, data analysis, prompts, and datasets.

Features

  • Jargon Identification: Automatically identifies complex scientific terms within academic texts.
  • Personalization: Identifies jargon terms based on the user’s expertise and background.
  • Clear Explanations: Generates easy-to-understand definitions and explanations based on the context of a paper.

How we built this

We ran a short pilot study to evaluate the potential of GPT-4 and RAG for identifying and defining jargon terms for the benefit of science reporters. We tested different prompts to (i) personalize jargon identification based on the reader's science expertise, and (ii) to generate accurate, high-quality definitions of jargon terms for easy reading.

We evaluated the identified jargon terms and definitions by comparing them to ground-truth annotations from two annotators with varying scientific expertise. This was a relatively small-scale study, we looked at jargon terms for arXiv CS abstracts (n=64), sampled from articles published in March 2024 and in the following primary categories: cs.AI, cs.HC, cs.CY. We also compared two different approaches for definition generation: using RAG with context from the fulltext, and just using the paper abstract as context in the prompt. The abstract-only condition performed a little better!

We aim to continue this work further, at scale, and with improved appraoches to prompting and UI design.

Running the interface locally

The interface displays the jargon identified by one of our annotators, as well as all the correct definitions for it generated by the abstract-only system. You can run it locally as follows:

cd my-app
npm start

Prompts we used

Jargon Identification

System prompt for GPT-4:

Your task is to identify jargon terms in scientific abstracts for readers, specific to the methods and concepts introduced in the study discussed. Jargon terms can encompass multiple words that refer to a concept. Only identify jargon that prevents readers from developing a basic understanding of the important concepts in the study. If a term is defined in the abstract, it is not jargon. Here is some information about the reader, their expertise, and the domain of the scientific abstract they are reading:
Reader Description: ANNOTATOR DESCRIPTION
Domain of Abstract: Computer Science, focusing on SUB-CATEGORY OF ABSTRACT.
From the provided abstract, only return a comma-separated list of jargon terms given what you know about the reader, their expertise, and the abstract domain. Retain the exact wording of the jargon terms as they appear in the abstract. Do not make any changes in wording or punctuation.

Jargon Definition

We retained the default system prompt from llama-index for both the RAG-based and the abstract-based jargon definitions:

You are an expert Q&A system that is trusted around the world. Always answer the query using the provided context information, and not prior knowledge. Some rules to follow:

  1. Never directly reference the given context in your answer.
  2. Avoid statements like "Based on the context, ..." or "The context information ..." or anything along those lines.

The query prompt for the RAG-based system:

Context information is below:

CONTEXT FROM RETRIEVAL STEP

Given the context information and not prior knowledge, answer the query. Query: Please use 1-2 sentences to explain the following term so that even a reader without deep scientific and technical knowledge can understand it easily: JARGON TERM Answer:

The query prompt for the abstract-based system was similar to the RAG-based system, but it used the entire abstract as context instead of retrieved fulltext snippets.

Contact

Acknowledgements

  • Thanks to the Computational Journalism Lab at Northwestern for their support.

About

ScienceDeJargonizer simplifies scientific jargon for journalists without scientific backgrounds, using LLMs . This tool transforms complex terms into clear explanations, aiding accurate and accessible science reporting.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published