# **How Accurate are GPT Metrics Compared to Human Analysis?**

### Overview
In this demonstration, you will compare GPT4's assessment of Coherence against human-graded samples. This demo uses the SummEval dataset, which contains LLM-generated summaries that have been human-graded on Coherence. Coherence measures the quality of all sentences in a model's predicted answer and how naturally they fit together.

You will use Azure PromptFlow to generate Coherence evaluations with GPT4. In PromptFlow you will use two Coherence prompts: a standard template and a template using 'emotion', as described in the EmotionPrompt paper linked below. You will then use this notebook to compare and analyze the GPT4, GPT4 with emotion, and human-graded coherence scores.

 **_Go Deeper_**  
- [Are Large Language Model-based Evaluators the Solution to Scaling Up Multilingual Evaluation?](https://ar5iv.labs.arxiv.org/html/2309.07462)
- [G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment](https://ar5iv.labs.arxiv.org/html/2303.16634)
- [EmotionPrompt: Leveraging Psychology for Large Language Models Enhancement via Emotional Stimulus](https://arxiv.org/pdf/2307.11760v3.pdf)
- SummEval: Re-evaluating Summarization Evaluation [[Paper]](https://arxiv.org/pdf/2007.12626.pdf) / [[Repository]](https://github.com/Yale-LILY/SummEval#data)  
  
**_Prerequisites_**  

Ensure that your environment is set up by completing the steps outlined in [0_setup.ipynb](./0_setup.ipynb)  
_Optional_: For an overview of GPT based metrics, please see [1_gpt_evaluation.ipynb](./1_gpt_evaluation.ipynb)

### Step 1: Examine Input Data
The SummEval dataset provides human-labeled analysis of LLM-generated summaries. The annotations cover summaries generated by 16 models from 100 source news articles (1,600 examples in total). You will use a subset of ~250 samples.
Each summary was annotated by 5 independent crowdsourced workers and 3 independent experts (8 annotations in total).
Summaries were evaluated across 4 dimensions: coherence, consistency, fluency, and relevance.
Each source news article comes with the original reference from the CNN/DailyMail dataset and 10 additional crowdsourced reference summaries.  
  
Take a look at the data [summEval_human_labeled.jsonl](../data/inputs/summEval_human_labeled.jsonl)  
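
To make the annotation layout concrete, here is a minimal sketch of collapsing the per-annotator scores into one coherence value per group. It assumes each entry in `expert_annotations` and `turker_annotations` is a dictionary keyed by dimension name. The record and its scores are hypothetical, and averaging is only one possible aggregation; the benchmark flow may combine the annotations differently.

```python
from statistics import mean

# Hypothetical record shaped like one SummEval entry; all scores are made up.
record = {
    "decoded": "example summary text",
    "expert_annotations": [
        {"coherence": 2, "consistency": 5, "fluency": 5, "relevance": 3},
        {"coherence": 3, "consistency": 5, "fluency": 4, "relevance": 3},
        {"coherence": 2, "consistency": 5, "fluency": 5, "relevance": 4},
    ],
    "turker_annotations": [
        {"coherence": 4, "consistency": 4, "fluency": 4, "relevance": 4},
        {"coherence": 3, "consistency": 4, "fluency": 5, "relevance": 3},
        {"coherence": 4, "consistency": 3, "fluency": 4, "relevance": 4},
        {"coherence": 5, "consistency": 4, "fluency": 4, "relevance": 5},
        {"coherence": 4, "consistency": 5, "fluency": 3, "relevance": 4},
    ],
}


def mean_score(annotations, dimension="coherence"):
    """Average one dimension across all annotators of a single summary."""
    return mean(a[dimension] for a in annotations)


expert_coherence = mean_score(record["expert_annotations"])
turker_coherence = mean_score(record["turker_annotations"])
print(f"expert: {expert_coherence:.2f}, turker: {turker_coherence:.2f}")
```

Keeping the expert and crowdsourced groups separate lets you benchmark GPT4 against each of them independently.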

_Note: Human labels are subjective and are **not** perfect. The comparison is for benchmarking purposes and should be considered separately from result quality._

### Step 2: Run Evaluation Pipeline

**IMPORTANT:** Be sure to analyze and experiment with the [gpt_eval_benchmark](../src/promptflow/evaluation_flows/gpt_eval_benchmark/) PromptFlow used in this step.  
  
  _This step may take up to 10 minutes to complete_

In [9]:
from promptflow import PFClient

# PFClient can help manage your runs and connections.
pf = PFClient()

# Define Flows and Data
eval_flow = "../src/promptflow/evaluation_flows/gpt_eval_benchmark" # set flow directory
data = "../data/inputs/summEval_human_labeled_subset.jsonl" # set the data file

# Run the evaluation flow over the human-labeled SummEval samples
eval_run = pf.run(
    flow=eval_flow,
    data=data,
    stream=False,
    column_mapping={  # map columns in the data file to the flow's inputs
      "expert_annotations": "${data.expert_annotations}",
      "turker_annotations": "${data.turker_annotations}",
      "response": "${data.decoded}",
    }
)
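
The `column_mapping` argument pairs each flow input with a `${data.<column>}` reference into the data file. The toy resolver below only illustrates that idea for a single row. It is not PromptFlow's implementation, and `resolve_mapping` along with the sample row are hypothetical.

```python
import re

# Matches references of the form ${data.<column>}.
DATA_REFERENCE = re.compile(r"^\$\{data\.(\w+)\}$")


def resolve_mapping(column_mapping, row):
    """Replace each ${data.<column>} reference with that column's value in one row."""
    resolved = {}
    for flow_input, reference in column_mapping.items():
        match = DATA_REFERENCE.match(reference)
        # Values that are not ${data.*} references pass through unchanged.
        resolved[flow_input] = row[match.group(1)] if match else reference
    return resolved


# Hypothetical row from the input JSONL file.
row = {
    "decoded": "a generated summary",
    "expert_annotations": [{"coherence": 2}],
    "turker_annotations": [{"coherence": 4}],
}
mapping = {
    "expert_annotations": "${data.expert_annotations}",
    "turker_annotations": "${data.turker_annotations}",
    "response": "${data.decoded}",
}

flow_inputs = resolve_mapping(mapping, row)
print(flow_inputs["response"])
```

Note that the flow input named `response` is fed from the `decoded` column, so the input name and the column name do not have to match.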



Helpful Documentation:  
[Run and Evaluate a PromptFlow](https://microsoft.github.io/promptflow/how-to-guides/run-and-evaluate-a-flow/index.html)  
[PFClient Documentation](https://microsoft.github.io/promptflow/reference/python-library-reference/promptflow.html)

### Step 3: Analyze Outputs

In [10]:
import pandas as pd

output_data = "../data/outputs/gpt_benchmark_results.json"

output_df = pd.read_json(output_data)
display(output_df)

| Unnamed: 0 | response | gpt_coherence | gpt_coherence_w_emotion | expert_coherence | turker_coherence |
|---|---|---|---|---|---|
| 0 | paul merson was brought on with only seven min... | 3 | 3 | 1 | 3 |
| 1 | paul merson has restarted his row with andros ... | 4 | 5 | 2 | 2 |
| 2 | paul merson has restarted his row with andros ... | 5 | 5 | 2 | 4 |
| 3 | paul merson has restarted his row with andros ... | 4 | 4 | 2 | 5 |
| 4 | paul merson has restarted his row with andros ... | 5 | 5 | 3 | 2 |
| ... | ... | ... | ... | ... | ... |
| 242 | world no 1 williams said her struggle to beat ... | 4 | 5 | 2 | 5 |
| 243 | serena williams said her struggle to beat sara... | 5 | 5 | 4 | 1 |
| 244 | twice french open champion serena williams sai... | 5 | 5 | 4 | 4 |
| 245 | serena williams beat sara errani 4-6 7-6(3) 6-... | 5 | 5 | 2 | 3 |
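
As a first look at how the scores are distributed, the sketch below counts how often each score from 1 to 5 appears per metric and tabulates the signed gap between GPT4 and the expert score. To stay self-contained it uses only the nine rows visible in the truncated display above, so the numbers say nothing about the full result set. The 1 to 5 scale is assumed from the displayed values, and the helper name `score_counts` is hypothetical.

```python
from collections import Counter

# The nine rows visible in the truncated display (indices 0-4 and 242-245).
scores = {
    "gpt_coherence":           [3, 4, 5, 4, 5, 4, 5, 5, 5],
    "gpt_coherence_w_emotion": [3, 5, 5, 4, 5, 5, 5, 5, 5],
    "expert_coherence":        [1, 2, 2, 2, 3, 2, 4, 4, 2],
    "turker_coherence":        [3, 2, 4, 5, 2, 5, 1, 4, 3],
}


def score_counts(values, low=1, high=5):
    """Return how many times each score from low to high occurs."""
    counts = Counter(values)
    return [counts[score] for score in range(low, high + 1)]


distributions = {name: score_counts(values) for name, values in scores.items()}

# Signed gap between the GPT4 score and the expert score for the same summary.
gap_vs_expert = [
    gpt - expert
    for gpt, expert in zip(scores["gpt_coherence"], scores["expert_coherence"])
]

for name, counts in distributions.items():
    print(f"{name:>24}: {counts}")
print("GPT4 minus expert:", sorted(Counter(gap_vs_expert).items()))
```

On the full results, the same counts could be computed directly from the columns of `output_df` and plotted as overlaid histograms.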


In [None]:
#TODO: General histogram overlay for each metric

#TODO: Distribution of gpt_coherence / w_emotion variance against expert

#TODO: Distribution of gpt_coherence / w_emotion variance against turker (crowdsourced)

#TODO: Overall aggregate metrics for accuracy
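
One possible starting point for the aggregate metrics is sketched below: mean absolute error, exact-match rate, and Pearson correlation between a GPT4 score column and a human score column. These particular metrics are an assumption about what "accuracy" should mean here, not something the notebook prescribes. The data is limited to the nine rows visible in the Step 3 display, so the printed values are illustrative only.

```python
from math import sqrt


def mean_absolute_error(predicted, reference):
    """Average absolute difference between paired scores."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


def exact_match_rate(predicted, reference):
    """Fraction of pairs where both scores are identical."""
    return sum(p == r for p, r in zip(predicted, reference)) / len(predicted)


def pearson(x, y):
    """Pearson correlation coefficient; assumes neither input is constant."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# The nine rows visible in the truncated Step 3 display.
gpt = [3, 4, 5, 4, 5, 4, 5, 5, 5]
gpt_emotion = [3, 5, 5, 4, 5, 5, 5, 5, 5]
expert = [1, 2, 2, 2, 3, 2, 4, 4, 2]
turker = [3, 2, 4, 5, 2, 5, 1, 4, 3]

for label, predicted in [("gpt", gpt), ("gpt_w_emotion", gpt_emotion)]:
    for human_label, human in [("expert", expert), ("turker", turker)]:
        print(
            f"{label} vs {human_label}: "
            f"MAE={mean_absolute_error(predicted, human):.2f}, "
            f"exact={exact_match_rate(predicted, human):.2f}, "
            f"r={pearson(predicted, human):.2f}"
        )
```

Mean absolute error shows how far apart the scores are on the rating scale, while correlation shows whether GPT4 ranks summaries in a similar order to the human graders even when it scores them consistently higher or lower.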
