Skip to content

bias-check/encoding_values

Repository files navigation

🧠 Encoding Values: Injecting Morality into Machines


📌 Project Overview

This project explores how explicit ethical value sets—such as feminist, ecological, and decolonial frameworks—can be injected into large language models through prompt conditioning. We aim to measure how moral system prompts shape model responses in terms of tone, framing, inclusivity, and alignment.


🎯 Key Goals

  • Examine how ethical "value sets" influence the outputs of generative AI models.
  • Establish methods to prime LLMs with ethical orientations through system-level prompts.
  • Identify shifts in semantic meaning, toxicity, and value alignment across different moral framings.
  • Provide an empirical foundation for future work in pluralistic alignment and value-conditioned generation.

🔬 Research Objective

To rigorously evaluate the impact of explicit ethical priming on LLM-generated content and assess:

  • How values diverge or reinforce one another in generation.
  • Whether certain moral framings redirect model biases toward inclusive outcomes.
  • The limits and contradictions when values are complex or conflicting.

⚙️ Methodology

  1. Define Moral Value Sets:

    • Use ethical perspectives as system prompts: e.g., feminist, ecological, decolonial.
  2. Prompt Execution with LLMs:

    • Run identical input prompts across each moral frame using GPT or LLaMA 3 models.
  3. Divergence Analysis:

    • 🔁 Semantic Similarity via Sentence-BERT
    • 🚨 Toxicity & Moderation Filters (e.g., Perspective API or open models)
    • 🎯 Value Alignment Classifiers (custom heuristics or open-source bias detectors)

📏 Evaluation Metrics (non-exhaustive)

Metric Purpose
Semantic Similarity Score Measure how values shift tone/content
Toxicity / Bias Levels Detect reduction or amplification of harm
Value Alignment Accuracy Measure conformance to intended ethical framing
Inclusion/Diversity Indicators Count inclusive terms, representation markers

🧭 Areas for Exploration

This roadmap is simply a starting point. We anticipate significant refinement during actual research. Key areas that require further investigation include:

  • Fine-tuning styles and framing of value prompts.
  • Modeling and resolving conflicting ethical values.
  • Model-specific effects (e.g., base vs instruction-tuned, LLaMA vs Mistral).
  • Extension to multilingual and culturally diverse value sets.
  • Impact of value priming in downstream applications (e.g., summarization, dialogue).

📅 Project Milestones

Milestone # Description
1️⃣ Literature review on ethical prompting, alignment, and moral philosophy
2️⃣ Define moral value prompts and implement prompting pipeline using LLM APIs
3️⃣ Build evaluation toolkit and metrics: semantic similarity, toxicity, alignment scores
4️⃣ Run experiments, analyze divergences, quantify impact of value priming
5️⃣ Draft workshop paper, prepare visuals, submit and release public repo (if needed)

👥 Team Responsibilities

Name POC and Driver
Aarushi Lead on experimentation, LLM workflows, project planning
Neha Kamath Lead on classifier analysis, metrics, and empirical evaluation
Arjun Lead on value set definition, prompt conditioning, and ethical framing

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •  

Languages