# LLM Model Ranking via Self-Generated Challenge Questions

This notebook orchestrates a simple model-ranking workflow:
- Ask one model to generate a challenging, nuanced question.
- Pose that question to multiple models and collect their answers.
- Ask a judge model to compare the answers and produce a rough ranking.

Notes:
- Keep prompts fixed between runs to reduce variance; model output can still differ from run to run.
- Be mindful of API usage and costs.
- Do not print full API keys; only partial, masked previews.

In [1]:
# Imports
# - os/json: environment and simple serialization
# - dotenv: load API keys from .env
# - openai: client for OpenAI-compatible APIs
# - display utils: richer notebook output
import os
import json
from dotenv import load_dotenv
from openai import OpenAI
from IPython.display import Markdown, display


In [2]:
# Load environment variables from a local .env file
# override=True makes values in .env take precedence over existing environment variables
load_dotenv(override=True)

True

In [3]:
# Fetch API keys (masked preview only)
openai_api_key = os.getenv("OPENAI_API_KEY")
google_api_key = os.getenv("GOOGLE_API_KEY")

if openai_api_key:
    print(f"OpenAI API key found and begins {openai_api_key[:8]}")
else:
    print("OpenAI API key not found")

if google_api_key:
    print(f"Google API key found and begins {google_api_key[:2]}")
else:
    print("Google API key not found")


OpenAI API key found and begins sk-proj-
Google API key found and begins AI
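
The previews above slice each key inline. As a minimal sketch, the same idea can be wrapped in a small helper so every key is previewed the same way. The name `mask_key` and its `visible` parameter are hypothetical and not part of the original notebook.

```python
def mask_key(key, visible=4):
    """Return a short preview of a secret: the first `visible` characters, then '...'."""
    if not key:
        # Covers both None (variable missing) and an empty string.
        return "<not set>"
    return f"{key[:visible]}..."


# Example with a made-up value; never pass a real key into logged output in full.
print(mask_key("sk-proj-example-not-a-real-key", visible=8))
```

Only the leading characters are ever formatted into the string, so the rest of the key never reaches the notebook output.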


In [4]:
# Prompt to generate a single challenging evaluation question
request = (
    "Please come up with a challenging, nuanced question that I can ask a number of LLMs "
    "to evaluate their intelligence. Answer only with the question, no explanation."
)
messages = [{"role": "user", "content": request}]


In [5]:
# Inspect the message payload to be sent to the model
messages

[{'role': 'user',
  'content': 'Please come up with a challenging, nuanced question that I can ask a number of LLMs to evaluate their intelligence. Answer only with the question, no explanation.'}]

In [6]:
# Call the model to generate the evaluation question
# Note: ensure your OPENAI_API_KEY is set; adjust model name as needed
openai = OpenAI()
response = openai.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
)
question = response.choices[0].message.content
print(question)


What are the ethical implications and potential social consequences of implementing AI algorithms in decision-making processes for areas such as criminal justice, hiring, and healthcare?


In [7]:
# Prepare parallel lists: competitors[i] is the model that produced answers[i]
competitors = []  # model names, e.g. ["gpt-4o-mini", "gemini-2.0-flash"]
answers = []      # answer text, in the same order as `competitors`

# Start a fresh conversation containing only the generated question
messages = [{'role': 'user', 'content': question}]


In [8]:
# First competitor: gpt-4o-mini, using the OpenAI client created above.
# Send the shared `messages`, then record the model name and its answer.

model_name = "gpt-4o-mini"

response = openai.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)


The implementation of AI algorithms in decision-making processes across various sectors, including criminal justice, hiring, and healthcare, presents several ethical implications and potential social consequences. Here are some key considerations:

### Ethical Implications

1. **Bias and Discrimination**: AI algorithms can perpetuate and amplify existing biases present in training data. In areas like criminal justice and hiring, this can lead to discrimination against marginalized groups, reinforcing societal inequalities.

2. **Transparency and Accountability**: Many AI algorithms operate as "black boxes," making it difficult to understand how decisions are made. This lack of transparency raises ethical concerns about accountability when outcomes adversely affect individuals' lives.

3. **Informed Consent**: In healthcare, the use of AI in decision-making may not sufficiently involve patient consent. Patients might not be aware of how AI influences their diagnosis or treatment, raising ethical issues around autonomy and informed consent.

4. **Surveillance and Privacy**: In criminal justice, the use of AI for predictive policing can lead to over-surveillance of certain communities, raising ethical concerns about privacy rights and the potential for criminalization based on flawed data.

5. **Dehumanization**: Relying too heavily on algorithms can lead to a dehumanized approach in sectors like healthcare, where patient care could become overly mechanized, undermining the human aspects of compassion and empathy.

### Potential Social Consequences

1. **Erosion of Trust**: The use of AI can lead to public skepticism if individuals feel that decisions are made by impersonal algorithms rather than human judgment. Building trust in institutions may be challenging if AI outcomes are viewed with suspicion.

2. **Deepening Inequality**: If AI exacerbates existing biases, it could widen the gap between different social groups, leading to economic and social disparities being entrenched in hiring practices or access to justice and healthcare.

3. **Job Displacement**: In hiring processes, the reliance on AI can lead to job displacement for roles that require human judgment and nuance, exacerbating unemployment and economic insecurity for certain demographics.

4. **Social Silos**: Algorithms can create echo chambers or filter bubbles, particularly in hiring and media. This homogenization can limit diversity in workplaces and societal discourse, reinforcing existing social divides.

5. **Dependency on Technology**: As sectors increasingly adopt AI-driven decision-making, there can be a growing dependence on technology, which may reduce human oversight and critical thinking in decision-making processes.

### Conclusion

Implementing AI algorithms in decision-making processes raises complex ethical dilemmas that need careful consideration. Policymakers, organizations, and developers must prioritize ethical frameworks that address bias, accountability, and transparency, alongside social initiatives aimed at mitigating adverse consequences. The goal should be to harness the benefits of AI while safeguarding fundamental human rights and social equity. This requires ongoing dialogue and engagement among stakeholders, including communities affected by these technologies.

In [9]:
# Second competitor: gemini-2.0-flash, reached through Google's OpenAI-compatible endpoint.
# Send the same `messages`, then record the model name and its answer.

gemini = OpenAI(api_key=google_api_key, base_url="https://generativelanguage.googleapis.com/v1beta/openai/")
model_name = "gemini-2.0-flash"

response = gemini.chat.completions.create(model=model_name, messages=messages)
answer = response.choices[0].message.content

display(Markdown(answer))
competitors.append(model_name)
answers.append(answer)



## Ethical Implications and Social Consequences of AI in Decision-Making:

Implementing AI algorithms in decision-making processes for areas like criminal justice, hiring, and healthcare presents significant ethical implications and potential social consequences. While AI promises efficiency, accuracy, and reduced bias, its misuse or poorly designed implementation can exacerbate existing inequalities and introduce new challenges.

**I. Ethical Implications:**

*   **Bias and Discrimination:**
    *   **Data Bias:** AI algorithms learn from the data they are trained on. If the training data reflects existing societal biases (e.g., gender, race, socioeconomic status), the AI will likely perpetuate and amplify these biases in its decisions.  This can lead to unfair or discriminatory outcomes.
    *   **Algorithmic Bias:** Even with seemingly unbiased data, the way algorithms are designed, the features they prioritize, and the criteria they optimize for can unintentionally introduce or reinforce bias.
    *   **Example:**  A hiring algorithm trained on historical data where mostly men were hired for technical roles might discriminate against female applicants, even if qualifications are equal.
*   **Transparency and Explainability (Explainable AI - XAI):**
    *   **Black Box Problem:** Many AI algorithms, especially complex deep learning models, are "black boxes." It can be difficult or impossible to understand how they arrive at a specific decision. This lack of transparency raises concerns about accountability and fairness.
    *   **Justification and Due Process:** Without understanding the reasoning behind an AI's decision, it becomes challenging to challenge or appeal the outcome. This undermines fundamental principles of due process, particularly in criminal justice and hiring contexts.
    *   **Example:** If an AI-powered risk assessment tool labels someone as a "high risk" for recidivism without clear justification, they are unfairly prejudiced by the system and may not have a fair chance to defend themselves.
*   **Autonomy and Human Oversight:**
    *   **Over-reliance on AI:**  Blindly trusting AI decisions without human oversight can lead to errors and unintended consequences. Human judgment, critical thinking, and ethical considerations are still necessary, especially in high-stakes situations.
    *   **De-skilling:** Relying heavily on AI can de-skill human professionals (e.g., doctors, judges, recruiters) in making informed decisions. This can weaken their judgment and ability to handle situations outside the AI's capabilities.
    *   **Accountability:**  Who is responsible when an AI makes a mistake? Is it the developer, the user, or the AI itself?  Assigning responsibility becomes complex when AI systems operate autonomously.
*   **Privacy and Data Security:**
    *   **Data Collection and Usage:** AI systems require vast amounts of data, often including sensitive personal information. This raises concerns about data privacy, security, and potential misuse.
    *   **Surveillance and Monitoring:** AI can be used for mass surveillance, tracking individuals, and profiling their behavior. This can infringe on fundamental rights to privacy and freedom.
    *   **Example:**  Using AI to analyze healthcare data without proper anonymization and consent could expose patients' sensitive medical history.
*   **Job Displacement and Economic Inequality:**
    *   **Automation:** AI-powered automation can displace workers in various sectors, leading to job losses and increased economic inequality.
    *   **Concentration of Power:**  The development and deployment of AI are often concentrated in the hands of a few powerful tech companies. This can exacerbate existing power imbalances and limit access to opportunities for marginalized communities.

**II. Potential Social Consequences:**

*   **Erosion of Trust:**
    *   **Distrust in Institutions:**  If AI systems are perceived as unfair, biased, or opaque, they can erode public trust in institutions like the justice system, healthcare providers, and employers.
    *   **Social Fragmentation:**  Algorithmic bias can exacerbate social divisions and create further distrust between groups.
*   **Reinforcement of Social Stratification:**
    *   **Perpetuation of Inequality:** If AI systems are used to allocate resources (e.g., loans, healthcare, educational opportunities) in a biased way, they can reinforce existing social stratification and limit opportunities for marginalized communities.
    *   **Creation of New Forms of Discrimination:** AI can create new forms of discrimination based on factors that are not traditionally recognized as protected characteristics (e.g.,  zip code, social network connections).
*   **Dehumanization:**
    *   **Loss of Empathy and Compassion:**  Over-reliance on AI in areas like healthcare can lead to a loss of empathy and compassion in human interactions.  The focus may shift to data and algorithms, potentially neglecting the human element.
    *   **Reduced Human Agency:** When AI makes decisions that affect people's lives, it can diminish their sense of agency and control over their own futures.
*   **Increased Social Control:**
    *   **Surveillance State:** AI-powered surveillance technologies can be used to monitor and control populations, potentially leading to a more authoritarian society.
    *   **Chilling Effect on Free Speech:**  The fear of being monitored and penalized by AI systems can have a chilling effect on free speech and dissent.
*   **Unintended Consequences:**
    *   **Unpredictable Outcomes:**  The complexity of AI systems means that they can produce unintended consequences that are difficult to foresee or control.
    *   **Systemic Risks:**  The widespread use of AI can create systemic risks that could have devastating consequences if the systems fail or are misused.

**III. Mitigating the Risks:**

To mitigate these risks, it is crucial to:

*   **Develop Ethical Guidelines and Regulations:** Establish clear ethical guidelines and regulations for the development and deployment of AI systems.
*   **Promote Transparency and Explainability:**  Prioritize the development of XAI methods that can help us understand how AI systems make decisions.
*   **Address Data Bias:**  Actively work to identify and mitigate bias in training data. Ensure data diversity and representativeness.
*   **Incorporate Human Oversight:**  Maintain human oversight in AI-driven decision-making processes, especially in high-stakes situations.
*   **Protect Privacy and Data Security:**  Implement robust data privacy and security measures to protect sensitive personal information.
*   **Promote Education and Awareness:**  Educate the public about the potential benefits and risks of AI. Foster critical thinking and responsible AI adoption.
*   **Focus on Fairness and Equity:**  Design AI systems that promote fairness and equity, rather than reinforcing existing inequalities.
*   **Promote Diverse and Inclusive AI Development Teams:** Ensure diverse perspectives are included in the development process to mitigate potential biases.

By carefully considering these ethical implications and social consequences, we can work to harness the potential benefits of AI while minimizing its risks and ensuring that AI systems are used in a way that benefits all of humanity.
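
The two cells above repeat the same call pattern for each model. A minimal sketch of how that could be folded into a loop, assuming each competitor is given as a `(model_name, client)` pair where `client` is any OpenAI-compatible client such as `openai` or `gemini` above. The helper names `ask` and `collect_answers` are hypothetical and not part of the original notebook.

```python
def ask(client, model_name, messages):
    """Send `messages` to `model_name` and return the reply text."""
    response = client.chat.completions.create(model=model_name, messages=messages)
    return response.choices[0].message.content


def collect_answers(targets, messages):
    """Query each (model_name, client) pair in order.

    Returns two parallel lists: the model names and their answers.
    """
    competitors, answers = [], []
    for model_name, client in targets:
        answers.append(ask(client, model_name, messages))
        competitors.append(model_name)
    return competitors, answers
```

With the clients defined earlier, a call might look like `competitors, answers = collect_answers([("gpt-4o-mini", openai), ("gemini-2.0-flash", gemini)], messages)`. Each call is a billed API request, so the cost note at the top still applies.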


In [10]:
print(competitors)
print(answers)

['gpt-4o-mini', 'gemini-2.0-flash']
['The implementation of AI algorithms in decision-making processes across various sectors, including criminal justice, hiring, and healthcare, presents several ethical implications and potential social consequences. Here are some key considerations:\n\n### Ethical Implications\n\n1. **Bias and Discrimination**: AI algorithms can perpetuate and amplify existing biases present in training data. In areas like criminal justice and hiring, this can lead to discrimination against marginalized groups, reinforcing societal inequalities.\n\n2. **Transparency and Accountability**: Many AI algorithms operate as "black boxes," making it difficult to understand how decisions are made. This lack of transparency raises ethical concerns about accountability when outcomes adversely affect individuals\' lives.\n\n3. **Informed Consent**: In healthcare, the use of AI in decision-making may not sufficiently involve patient consent. Patients might not be aware of how AI 

In [11]:
# Pair each competitor with its answer using zip
for competitor, answer in zip(competitors, answers):
    print(f"Competitor: {competitor}\n\n{answer}")

Competitor: gpt-4o-mini

The implementation of AI algorithms in decision-making processes across various sectors, including criminal justice, hiring, and healthcare, presents several ethical implications and potential social consequences. Here are some key considerations:

### Ethical Implications

1. **Bias and Discrimination**: AI algorithms can perpetuate and amplify existing biases present in training data. In areas like criminal justice and hiring, this can lead to discrimination against marginalized groups, reinforcing societal inequalities.

2. **Transparency and Accountability**: Many AI algorithms operate as "black boxes," making it difficult to understand how decisions are made. This lack of transparency raises ethical concerns about accountability when outcomes adversely affect individuals' lives.

3. **Informed Consent**: In healthcare, the use of AI in decision-making may not sufficiently involve patient consent. Patients might not be aware of how AI influences their diagn

In [12]:
# Let's bring this together - note the use of "enumerate"

together = ""
for index, answer in enumerate(answers):
    together += f"# Response from competitor {index+1}\n\n"
    together += answer + "\n\n"

print(together)


# Response from competitor 1

The implementation of AI algorithms in decision-making processes across various sectors, including criminal justice, hiring, and healthcare, presents several ethical implications and potential social consequences. Here are some key considerations:

### Ethical Implications

1. **Bias and Discrimination**: AI algorithms can perpetuate and amplify existing biases present in training data. In areas like criminal justice and hiring, this can lead to discrimination against marginalized groups, reinforcing societal inequalities.

2. **Transparency and Accountability**: Many AI algorithms operate as "black boxes," making it difficult to understand how decisions are made. This lack of transparency raises ethical concerns about accountability when outcomes adversely affect individuals' lives.

3. **Informed Consent**: In healthcare, the use of AI in decision-making may not sufficiently involve patient consent. Patients might not be aware of how AI influences their 
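
The loop in the cell above grows `together` with repeated `+=`. As a sketch, an equivalent version builds the sections first and joins them once, using `enumerate(..., start=1)` so the numbering needs no `index+1`. The function name `format_responses` is hypothetical.

```python
def format_responses(answers):
    """Join answers into one string, each under a numbered competitor heading."""
    sections = [
        f"# Response from competitor {number}\n\n{answer}\n\n"
        for number, answer in enumerate(answers, start=1)
    ]
    return "".join(sections)
```

Calling `format_responses(answers)` should produce the same text as `together`; the competitor numbers are 1-based, which matters later when the judge's ranking is mapped back to model names.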

In [13]:
judge = f"""You are judging a competition between {len(competitors)} competitors.
Each model has been given this question:

{question}

Your job is to evaluate each response for clarity and strength of argument, and rank them in order of best to worst.
Respond with JSON, and only JSON, with the following format:
{{"results": ["best competitor number", "second best competitor number", ...]}}

Here are the responses from each competitor:

{together}

Now respond with the JSON with the ranked order of the competitors, nothing else. Do not include markdown formatting or code blocks."""


In [14]:
print(judge)

You are judging a competition between 2 competitors.
Each model has been given this question:

What are the ethical implications and potential social consequences of implementing AI algorithms in decision-making processes for areas such as criminal justice, hiring, and healthcare?

Your job is to evaluate each response for clarity and strength of argument, and rank them in order of best to worst.
Respond with JSON, and only JSON, with the following format:
{"results": ["best competitor number", "second best competitor number", ...]}

Here are the responses from each competitor:

# Response from competitor 1

The implementation of AI algorithms in decision-making processes across various sectors, including criminal justice, hiring, and healthcare, presents several ethical implications and potential social consequences. Here are some key considerations:

### Ethical Implications

1. **Bias and Discrimination**: AI algorithms can perpetuate and amplify existing biases present in training 

In [15]:
judge_messages = [{"role": "user", "content": judge}]

In [16]:
# Judgement time!

openai = OpenAI()
response = openai.chat.completions.create(
    model="o3-mini",
    messages=judge_messages,
)
results = response.choices[0].message.content
print(results)


{"results": ["2", "1"]}


In [17]:
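# OK let's turn this into results!
# The judge returns 1-based competitor numbers as strings, best first,
# so each is converted to int and reduced by 1 to index into `competitors`.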

results_dict = json.loads(results)
ranks = results_dict["results"]
for index, result in enumerate(ranks):
    competitor = competitors[int(result)-1]
    print(f"Rank {index+1}: {competitor}")

Rank 1: gemini-2.0-flash
Rank 2: gpt-4o-mini
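
The cell above trusts the judge to return clean JSON that lists every competitor exactly once. A minimal sketch of a more defensive parser, assuming the reply may occasionally arrive wrapped in a Markdown code fence despite the prompt. The function name `parse_ranking` is hypothetical and not part of the original notebook.

```python
import json


def parse_ranking(raw, competitors):
    """Parse the judge's JSON reply into model names ordered best to worst.

    Raises ValueError if the reply is not valid JSON in the expected shape,
    or if it does not rank every competitor exactly once.
    """
    text = raw.strip()
    # Tolerate a reply wrapped in a Markdown code fence.
    if text.startswith("```"):
        text = text.strip("`").strip()
        if text.lower().startswith("json"):
            text = text[4:]
    try:
        ranks = [int(r) for r in json.loads(text)["results"]]
    except (KeyError, TypeError, ValueError) as exc:
        # json.JSONDecodeError is a subclass of ValueError.
        raise ValueError(f"Unparseable judge reply: {raw!r}") from exc
    # A valid ranking is a permutation of 1..N.
    if sorted(ranks) != list(range(1, len(competitors) + 1)):
        raise ValueError(f"Not a full ranking of {len(competitors)} competitors: {ranks}")
    return [competitors[r - 1] for r in ranks]


for position, name in enumerate(
    parse_ranking('{"results": ["2", "1"]}', ["gpt-4o-mini", "gemini-2.0-flash"]), start=1
):
    print(f"Rank {position}: {name}")
```

The permutation check catches a judge that repeats a number, skips a competitor, or returns an out-of-range index, any of which would otherwise print a misleading ranking or raise an `IndexError`.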
