# Advanced Summary Evaluation
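This notebook shows how to evaluate the summaries produced by the different summarization approaches, using reference-based accuracy metrics, toxicity classifiers, and LLM-based evaluation.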

>Tested on SageMaker Studio with instance type ml.m5.8xlarge

## Get Dependencies

In [2]:
!pip install ipynb -q
!pip install langchain -q
!pip install anthropic -q
!pip install tiktoken -q
!pip install nltk -q
!pip install rouge-score -q
!pip install evaluate -q
!pip3 install fmeval --upgrade-strategy only-if-needed --force-reinstall -q
!pip install transformers -q
!pip install detoxify -q

In [10]:
from ipynb.fs.full.simple_summarize import stuff_it_summary, map_reduce_summary
from ipynb.fs.full.advanced_summarize import generate_single_doc_summary, generate_multiple_docs_summary

## Prepare Dataset

This notebook uses the [booksum](https://github.com/salesforce/booksum/tree/main) dataset, which contains the full text of chapters from various books along with their corresponding summaries.

In [24]:
import os
import json

# Define the data directory name
directory_name = "chaptersum_data"
dataset = f'{directory_name}/books.jsonl'

# Initialize an empty list to store the data
data = []

# Loop through directory
for subdir, dirs, files in os.walk(directory_name):
    entry = {}
    for file in files:
        if file == "book_text.txt":
            with open(os.path.join(subdir, file), 'r') as f:
                # Store the text from book_text.txt in the dictionary
                entry['text'] = f.read()
        elif file == "summary.txt":
            with open(os.path.join(subdir, file), 'r') as f:
                # Store the text from summary.txt in the dictionary
                entry['summary'] = f.read()
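    # Keep only entries that have both the chapter text and its summary
    if 'text' in entry and 'summary' in entry: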
        data.append(entry)

# Write the data to a JSONL file
with open(dataset, 'w') as f:
    for entry in data:
        json.dump(entry, f)
        f.write('\n')
    
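Before moving on, it can help to confirm that every record written to the JSONL file carries both fields the later cells rely on. The sketch below is one way to do that; `load_jsonl_records` and its `required_keys` parameter are illustrative names, not part of the notebook.

```python
import json
from typing import Dict, List, Sequence


def load_jsonl_records(
    path: str, required_keys: Sequence[str] = ("text", "summary")
) -> List[Dict[str, str]]:
    """Read a JSONL file and keep only records that contain every required key."""
    records = []
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            missing = [key for key in required_keys if key not in record]
            if missing:
                # Report and drop incomplete records instead of failing later
                print(f"line {line_number}: skipping record, missing {missing}")
                continue
            records.append(record)
    return records
```

Calling `load_jsonl_records(dataset)` after the cell above would return the same records that `get_summary` later reads, minus any that lack a text or a summary.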

## Get Model Output

Generate a summary for every document in the dataset with each of the summarization approaches.

### Define helper functions

In [18]:
import json

def get_summary(dataset, sum_type="stuff_it", func=stuff_it_summary):
    
    #set up some basic prompt options for the advanced summary functions.
    prompt_options = {}
    prompt_options['prompt_type'] = "summary"
    prompt_options['format_type'] = "narrative"
    prompt_options['manual_guidance'] = ""
    prompt_options['style_guide'] = ""
    
    with open(dataset) as f:
        data_w_model_summary = [json.loads(line) for line in f]
    
    for doc in data_w_model_summary:
        
        if sum_type=="multi_doc":
            
            # Create a list of questions for the multi-doc guided process
            questions = [ "What is a brief, concise summary of the chapters"]

            # Create a description of this set of documents for the multi-doc guided process
            doc_description = "The text is a full text of several chapters of a book"
            
            answers = func({"input": doc['text']}, questions, doc_description, DEBUG=False)
            question = questions[0]
            model_output= (answers[question].replace("\n\n","\n"))
            
        elif sum_type=="auto_refine":
            model_output = func(doc['text'], prompt_options, AUTO_REFINE=True, DEBUG=False)
            
        elif sum_type=="map_reduce":
            model_output = func(doc['text'], DEBUG=False)
            
        else:
            model_output = func(doc['text'])
            
        doc["model_output"] = model_output
        
    return data_w_model_summary

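The branching in `get_summary` exists because each summarizer takes different arguments. One alternative, sketched here with stand-in functions, is a table of small adapters that hide those differences behind a single `text -> summary` call. `fake_stuff_it` and `fake_map_reduce` are hypothetical placeholders for the real summarizers, which need model access.

```python
from typing import Callable, Dict, List


def fake_stuff_it(text: str) -> str:
    """Hypothetical stand-in: returns the first sentence as the 'summary'."""
    return text.split(".")[0].strip() + "."


def fake_map_reduce(text: str, DEBUG: bool = False) -> str:
    """Hypothetical stand-in: returns the last sentence as the 'summary'."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return sentences[-1] + "."


# Each adapter accepts only the document text and supplies the extra arguments itself.
ADAPTERS: Dict[str, Callable[[str], str]] = {
    "stuff_it": lambda text: fake_stuff_it(text),
    "map_reduce": lambda text: fake_map_reduce(text, DEBUG=False),
}


def summarize_all(docs: List[dict], sum_type: str) -> List[dict]:
    """Return copies of the docs with a 'model_output' field added."""
    summarize = ADAPTERS[sum_type]
    return [{**doc, "model_output": summarize(doc["text"])} for doc in docs]


docs = [{"text": "The river flooded the town. Repairs are under way.", "summary": "Flood damage."}]
print(summarize_all(docs, "stuff_it")[0]["model_output"])
print(summarize_all(docs, "map_reduce")[0]["model_output"])
```
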
### Stuff it Summary

In [19]:
model_output_stuff_it = get_summary(dataset, sum_type="stuff_it", func=stuff_it_summary)
print(model_output_stuff_it)



### Map Reduce Summary

In [20]:
model_output_map_reduce = get_summary(dataset, sum_type="map_reduce", func=map_reduce_summary)
print(model_output_map_reduce)



### Auto Refine Summary

In [25]:
model_output_auto_refine = get_summary(dataset, sum_type="auto_refine", func=generate_single_doc_summary)
print(model_output_auto_refine)



### Multi-Doc Summary

In [26]:
model_output_multi_doc = get_summary(dataset, sum_type="multi_doc", func=generate_multiple_docs_summary)
print(model_output_multi_doc)



## Summarization Accuracy Evaluation

Evaluate accuracy against the reference summaries with the METEOR, ROUGE, and BERTScore metrics.

### Define helper functions

In [27]:
import json
from nltk.translate import meteor_score
from nltk import word_tokenize
import evaluate as hf_evaluate
import ray
from fmeval.eval_algorithms.helper_models.helper_model import BertscoreHelperModel


def get_meteor_score(target_output: str, model_output: str, **kwargs) -> float:
    """
    METEOR is a metric for text similarity between the machine-produced summary and human-produced reference summaries.
    Unigrams can be matched based on their surface forms, stemmed forms,
    and meanings; furthermore, METEOR can be easily extended to include more
    advanced matching strategies. Once all generalized unigram matches
    between the two strings have been found, METEOR computes a score for
    this matching using a combination of unigram-precision, unigram-recall, and
    a measure of fragmentation that is designed to directly capture how
    well-ordered the matched words in the machine translation are in relation
    to the reference.

    :param target_output: The expected responses from the model
    :param model_output: The output of a model that we want to evaluate.
    :returns: meteor score
    """
    return meteor_score.single_meteor_score(
        reference=word_tokenize(target_output), hypothesis=word_tokenize(model_output)
    )


def get_rouge_score(target_output: str, model_output: str, **kwargs) -> float:
    
    """
    The ROUGE-N, where N=[1,2,L], score is a standard metric for summarization quality.
    It computes the word overlap between the reference and model summary. Given that this metric is based on simple
    word overlap statistics, it works best for extractive summaries.
    Note that if we rephrase the summary without changing its meaning the ROUGE-N score will drop.

    Reference: https://huggingface.co/spaces/evaluate-metric/rouge

    :param target_output: The expected responses from the model
    :param model_output: The output of a model that we want to evaluate.
    :returns: rouge score
    """
    rouge_type = "rouge2"
    rouge = hf_evaluate.load("rouge")
    return rouge.compute(
        predictions=[model_output],
        references=[target_output],
        use_stemmer=True,
        rouge_types=[rouge_type],
    )[rouge_type]


def get_bert_score(target_output: str, model_output: str, **kwargs) -> float:
    """
    BERTscore is a similarity-based metric that compares the embedding of the prediction and target sentences
    under a learned model, typically, from the BERT family.
    This score may lead to increased flexibility compared to ROUGE and METEOR in terms of rephrasing since
    semantically similar sentences are (typically) embedded similarly.

    https://huggingface.co/spaces/evaluate-metric/bertscore

    :param target_output: The expected responses from the model
    :param model_output: The output of a model that we want to evaluate.
    :returns: bert score
    """
#     bert_score_model = "microsoft/deberta-xlarge-mnli"
    
#     # Initialize the shared BertscoreHelperModel actor that will be shared
#     # by every get_bert_score task.
#     bertscore_helper_model = BertscoreHelperModel.remote(
#         model_type=bert_score_model
#     )
    
#     return ray.get(bertscore_helper_model.get_helper_scores.remote(target_output, model_output))

    bertscore = hf_evaluate.load("bertscore")
    # compute() expects lists of strings, one entry per example
    return bertscore.compute(
        predictions=[model_output],
        references=[target_output],
        lang="en",
    )["f1"][0]

def get_accuracy_evaluation(dataset):
    
    eval_scores = []
    
    meteor_scores = [get_meteor_score(data["summary"], data["model_output"]) for data in dataset]
    m_score = sum(meteor_scores) / len(meteor_scores)
    eval_scores.append({"name": "meteor", "value": m_score})
        
    rouge_scores = [get_rouge_score(data["summary"], data["model_output"]) for data in dataset]
    r_score = sum(rouge_scores) / len(rouge_scores)
    eval_scores.append({"name": "rouge", "value": r_score})
    
    bert_scores = [get_bert_score(data["summary"], data["model_output"]) for data in dataset]
    b_score = sum(bert_scores) / len(bert_scores)
    eval_scores.append({"name": "bertscore", "value": b_score})
    
    return eval_scores

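To make the word-overlap idea behind ROUGE concrete, the following simplified sketch computes a bigram-overlap F1 by hand. It skips the stemming and tokenization rules that the `rouge` metric applies, so its numbers will not match `get_rouge_score` exactly; `simple_rouge2_f1` is an illustrative name and the sentences are made up for the demonstration.

```python
from collections import Counter


def bigrams(text: str) -> Counter:
    """Count lowercase, whitespace-delimited bigrams."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def simple_rouge2_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-2 F1: clipped bigram overlap, no stemming."""
    ref, cand = bigrams(reference), bigrams(candidate)
    overlap = sum((ref & cand).values())  # clipped counts of shared bigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "the river flooded the town centre"
extractive = "the river flooded the town"              # copies the reference wording
paraphrase = "water from the stream swamped the village"  # same idea, different words

print(simple_rouge2_f1(reference, extractive))
print(simple_rouge2_f1(reference, paraphrase))
```

The extractive candidate shares four of the reference's five bigrams, while the paraphrase shares none and scores zero, which is the behaviour the `get_rouge_score` docstring warns about.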



### Stuff it Evaluation

In [28]:
eval_scores = get_accuracy_evaluation(model_output_stuff_it)
print(json.dumps(eval_scores, default=vars, indent=4))

[
    {
        "name": "meteor",
        "value": 0.0927386389485991
    },
    {
        "name": "rouge",
        "value": 0.06134561919149709
    },
    {
        "name": "bertscore",
        "value": 0.8284229040145874
    }
]


### Map Reduce Evaluation

In [29]:
eval_scores = get_accuracy_evaluation(model_output_map_reduce)
print(json.dumps(eval_scores, default=vars, indent=4))

[
    {
        "name": "meteor",
        "value": 0.09091838027587698
    },
    {
        "name": "rouge",
        "value": 0.032069980339489726
    },
    {
        "name": "bertscore",
        "value": 0.8177569746971131
    }
]


### Auto Refine Evaluation

In [30]:
eval_scores = get_accuracy_evaluation(model_output_auto_refine)
print(json.dumps(eval_scores, default=vars, indent=4))

[
    {
        "name": "meteor",
        "value": 0.2394909542634632
    },
    {
        "name": "rouge",
        "value": 0.09145495307524902
    },
    {
        "name": "bertscore",
        "value": 0.8364616513252259
    }
]


### Multi-Doc Evaluation

In [31]:
eval_scores = get_accuracy_evaluation(model_output_multi_doc)
print(json.dumps(eval_scores, default=vars, indent=4))

[
    {
        "name": "meteor",
        "value": 0.12499171395334893
    },
    {
        "name": "rouge",
        "value": 0.0741087290280552
    },
    {
        "name": "bertscore",
        "value": 0.8328642845153809
    }
]

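The four result blocks above are easier to compare side by side. This sketch copies the reported averages (rounded to four decimals) into a dictionary and prints one row per strategy; the variable names are illustrative.

```python
# Average scores reported above, rounded to four decimal places.
reported = {
    "stuff_it":    {"meteor": 0.0927, "rouge": 0.0613, "bertscore": 0.8284},
    "map_reduce":  {"meteor": 0.0909, "rouge": 0.0321, "bertscore": 0.8178},
    "auto_refine": {"meteor": 0.2395, "rouge": 0.0915, "bertscore": 0.8365},
    "multi_doc":   {"meteor": 0.1250, "rouge": 0.0741, "bertscore": 0.8329},
}

metrics = ["meteor", "rouge", "bertscore"]

# Print a fixed-width table: one row per strategy, one column per metric.
print(f"{'strategy':<12}" + "".join(f"{m:>11}" for m in metrics))
for strategy, scores in reported.items():
    print(f"{strategy:<12}" + "".join(f"{scores[m]:>11.4f}" for m in metrics))

# Strategy with the highest average for each metric.
best = {m: max(reported, key=lambda s: reported[s][m]) for m in metrics}
print(best)
```

On this sample, auto refine has the highest average on all three metrics, although the BERTScore differences between strategies are small.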

## Summarization Toxicity Evaluation

Evaluate the toxicity of the generated summaries. This section uses [Toxigen](https://huggingface.co/tomh/toxigen_roberta/tree/main) and [Detoxify](https://github.com/unitaryai/detoxify) to score model responses in several toxicity categories.

### Define helper functions

In [32]:
import json
from typing import Any, Dict, List
from detoxify import Detoxify
from transformers import pipeline


def get_toxicity(text_input: List[str], **kwargs) -> List[float]:
    
    """
    Helper model for toxigen model: https://huggingface.co/tomh/toxigen_roberta/tree/main
    """
    
    model_name = "tomh/toxigen_roberta"
    model = pipeline("text-classification", model=model_name)

    inference_output = model(text_input, truncation=True, max_length=512)
    result = [x["score"] if x["label"] == "LABEL_1" else 1.0 - x["score"] for x in inference_output]
    
    return result

def get_detoxify(text_input: List[str], **kwargs) -> Dict[str, List[float]]:
    
    """
    Helper model for Detoxify: https://github.com/unitaryai/detoxify
    returns dict with keys as score name and value being list of scores for text inputs
    """

    model = Detoxify(model_type="unbiased").predict
    
    return model(text_input)

def get_toxicity_evaluation(dataset):
    
    eval_scores = []
    model_outputs = [data["model_output"] for data in dataset]
    
    t_scores = get_toxicity(model_outputs)
    toxicity_score = sum(t_scores) / len(t_scores)
    eval_scores.append({"type": "toxigen", "name": "toxicity", "value": toxicity_score})
    
    dt_scores = get_detoxify(model_outputs)
    for k, v in dt_scores.items():
        avg_score = sum(v) / len(v)
        eval_scores.append({"type": "detoxify", "name": k, "value": avg_score})
    
    return eval_scores

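The list comprehension in `get_toxicity` converts each classifier result into a single "probability of toxic" number. The sketch below isolates that step using hand-written dictionaries in the shape a `text-classification` pipeline returns, so it runs without downloading the model. It follows the notebook's assumption that `LABEL_1` is the toxic class; `to_toxic_probability` is an illustrative name and the sample scores are made up.

```python
from typing import Dict, List, Union


def to_toxic_probability(outputs: List[Dict[str, Union[str, float]]]) -> List[float]:
    """Map {'label', 'score'} results to the probability of the toxic class.

    Each result carries the score of the predicted label only, so a result
    labelled LABEL_0 with score s implies a toxic probability of 1 - s.
    Assumes LABEL_1 is the toxic class, as in get_toxicity.
    """
    return [o["score"] if o["label"] == "LABEL_1" else 1.0 - o["score"] for o in outputs]


# Made-up pipeline outputs for two inputs.
mock_outputs = [
    {"label": "LABEL_0", "score": 0.75},  # predicted non-toxic
    {"label": "LABEL_1", "score": 0.5},   # predicted toxic
]
print(to_toxic_probability(mock_outputs))  # [0.25, 0.5]
```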

### Stuff it Evaluation

In [33]:
eval_scores = get_toxicity_evaluation(model_output_stuff_it)
print(json.dumps(eval_scores, default=vars, indent=4))

[
    {
        "type": "toxigen",
        "name": "toxicity",
        "value": 0.0007883906364440918
    },
    {
        "type": "detoxify",
        "name": "toxicity",
        "value": 0.0014989803021308035
    },
    {
        "type": "detoxify",
        "name": "severe_toxicity",
        "value": 3.678433404274983e-06
    },
    {
        "type": "detoxify",
        "name": "obscene",
        "value": 0.00011519103209138848
    },
    {
        "type": "detoxify",
        "name": "identity_attack",
        "value": 0.0002598466548079159
    },
    {
        "type": "detoxify",
        "name": "insult",
        "value": 0.00055116395233199
    },
    {
        "type": "detoxify",
        "name": "threat",
        "value": 2.9463219652825502e-05
    },
    {
        "type": "detoxify",
        "name": "sexual_explicit",
        "value": 6.67844451527344e-05
    }
]


### Map Reduce Evaluation

In [34]:
eval_scores = get_toxicity_evaluation(model_output_map_reduce)
print(json.dumps(eval_scores, default=vars, indent=4))

[
    {
        "type": "toxigen",
        "name": "toxicity",
        "value": 0.0010023117065429688
    },
    {
        "type": "detoxify",
        "name": "toxicity",
        "value": 0.0015595144941471517
    },
    {
        "type": "detoxify",
        "name": "severe_toxicity",
        "value": 4.32514418662322e-06
    },
    {
        "type": "detoxify",
        "name": "obscene",
        "value": 8.091438940027729e-05
    },
    {
        "type": "detoxify",
        "name": "identity_attack",
        "value": 0.0004042272048536688
    },
    {
        "type": "detoxify",
        "name": "insult",
        "value": 0.0005730626522563398
    },
    {
        "type": "detoxify",
        "name": "threat",
        "value": 2.993677553604357e-05
    },
    {
        "type": "detoxify",
        "name": "sexual_explicit",
        "value": 7.000902005529496e-05
    }
]


### Auto Refine Evaluation

In [35]:
eval_scores = get_toxicity_evaluation(model_output_auto_refine)
print(json.dumps(eval_scores, default=vars, indent=4))

[
    {
        "type": "toxigen",
        "name": "toxicity",
        "value": 0.00923309326171875
    },
    {
        "type": "detoxify",
        "name": "toxicity",
        "value": 0.007962657976895571
    },
    {
        "type": "detoxify",
        "name": "severe_toxicity",
        "value": 3.773941134568304e-05
    },
    {
        "type": "detoxify",
        "name": "obscene",
        "value": 0.0007304137630853802
    },
    {
        "type": "detoxify",
        "name": "identity_attack",
        "value": 0.0016311380197294057
    },
    {
        "type": "detoxify",
        "name": "insult",
        "value": 0.002162819798104465
    },
    {
        "type": "detoxify",
        "name": "threat",
        "value": 7.533283642260357e-05
    },
    {
        "type": "detoxify",
        "name": "sexual_explicit",
        "value": 0.0011604022642131896
    }
]


### Multi-Doc Evaluation

In [36]:
eval_scores = get_toxicity_evaluation(model_output_multi_doc)
print(json.dumps(eval_scores, default=vars, indent=4))

[
    {
        "type": "toxigen",
        "name": "toxicity",
        "value": 0.0007933855056762695
    },
    {
        "type": "detoxify",
        "name": "toxicity",
        "value": 0.018566849129274487
    },
    {
        "type": "detoxify",
        "name": "severe_toxicity",
        "value": 3.918328593499609e-06
    },
    {
        "type": "detoxify",
        "name": "obscene",
        "value": 0.000255184450361412
    },
    {
        "type": "detoxify",
        "name": "identity_attack",
        "value": 0.0002816759893903509
    },
    {
        "type": "detoxify",
        "name": "insult",
        "value": 0.01861705248011276
    },
    {
        "type": "detoxify",
        "name": "threat",
        "value": 2.418366311758291e-05
    },
    {
        "type": "detoxify",
        "name": "sexual_explicit",
        "value": 8.660454841447063e-05
    }
]


## LLM Powered Unsupervised Evaluation

This section demonstrates how to use a large language model (LLM) to evaluate the output of other LLMs. This approach does not require a ground-truth dataset and can be used to evaluate the output of a small model with a larger one.

### Summarization Quality

In this evaluation, the LLM rates the accuracy, coherence, factuality, and completeness of each summary on a scale of 1-5, with 5 being the best.

#### Define helper functions

In [41]:
import boto3
import json
import os
import sys
import re

bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name='us-east-1',
)

def invoke_model(prompt_data):
    # Claude text-completion prompts use "\n\nHuman:" and "\n\nAssistant:" turn markers
    body = {"prompt": "\n\nHuman: " + prompt_data + "\n\nAssistant:",
            "max_tokens_to_sample": 1000, 
            "temperature": 1,
            "top_k": 250,
            "top_p": 0.999,
            "stop_sequences": ["\n\nHuman:"]}

    body = json.dumps(body) # Encode body as JSON string

    modelId = 'anthropic.claude-v2' 
    accept = 'application/json'
    contentType = 'application/json'

    #Invoke the model
    response = bedrock_runtime.invoke_model(body=body.encode('utf-8'), # Encode to bytes
                                     modelId=modelId, 
                                     accept=accept, 
                                     contentType=contentType)

    response_body = json.loads(response.get('body').read())
    return response_body.get('completion')


def get_evaluation_from_model(text, summary):
    
    # invoke_model adds the Human/Assistant turn markers, so the prompt holds only the task.
    # The f-string already interpolates text and summary; no extra .format() call is needed.
    prompt = f"""You will be given a text and its summary. Your task is to compare the original text and its summary, then evaluate the summary in four dimensions: accuracy, coherence, factuality and completeness.
    Provide a score of 1-5 in each dimension, with 5 being the best score.

    Original Text: {text}

    Summary: {summary}

    Output the result in the form below:

    - Coherence: Evaluation score for coherence (1-5)
    - Accuracy: Evaluation score for accuracy (1-5)
    - Factuality: Evaluation score for factuality (1-5)
    - Completeness: Evaluation score for completeness (1-5)
    """
    
    evaluation = invoke_model(prompt)
    
    return evaluation

def start_unsupervised_evaluation(dataset):
    
    results = []
    for data in dataset:
        resp = get_evaluation_from_model(data["text"], data["model_output"])
        
        # Extract the score for each dimension; a dimension missing from the response counts as 0
        eval_dict = {}
        for dimension in ("Coherence", "Accuracy", "Factuality", "Completeness"):
            m = re.search(rf"{dimension}: (\d)", resp)
            eval_dict[dimension] = int(m.group(1)) if m else 0
        results.append(eval_dict)
    
    total_coherence = total_accuracy = total_factuality = total_completeness = 0
    
    # Calculate the sum
    for result in results:
        total_coherence += result["Coherence"]
        total_accuracy += result["Accuracy"]
        total_factuality += result["Factuality"]
        total_completeness += result["Completeness"]
    
    # Calculate the average
    num_records = len(results)
    avg_coherence = total_coherence / num_records
    avg_accuracy = total_accuracy / num_records
    avg_factuality = total_factuality / num_records
    avg_completeness = total_completeness / num_records
    
    evaluation = {"Coherence": avg_coherence, "Accuracy": avg_accuracy, "Factuality": avg_factuality, "Completeness": avg_completeness}
    
    return evaluation

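Treating an unparsed score as 0 pulls the averages down whenever the model deviates from the requested format. A possible alternative, sketched here, records missing scores as `None` and averages only the values that were found. `parse_scores` and `average_scores` are illustrative names, and the sample responses are invented for the demonstration.

```python
import re
from typing import Dict, List, Optional

DIMENSIONS = ("Coherence", "Accuracy", "Factuality", "Completeness")


def parse_scores(response: str) -> Dict[str, Optional[int]]:
    """Extract a 1-5 score per dimension; None when the dimension is absent."""
    scores = {}
    for dimension in DIMENSIONS:
        match = re.search(rf"{dimension}:\s*([1-5])\b", response, flags=re.IGNORECASE)
        scores[dimension] = int(match.group(1)) if match else None
    return scores


def average_scores(parsed: List[Dict[str, Optional[int]]]) -> Dict[str, Optional[float]]:
    """Average each dimension over the responses where it was found."""
    averages = {}
    for dimension in DIMENSIONS:
        values = [p[dimension] for p in parsed if p[dimension] is not None]
        averages[dimension] = sum(values) / len(values) if values else None
    return averages


# Invented responses: the second one omits Completeness.
responses = [
    "- Coherence: 4\n- Accuracy: 5\n- Factuality: 5\n- Completeness: 3",
    "- Coherence: 5\n- Accuracy: 4\n- Factuality: 4",
]
print(average_scores([parse_scores(r) for r in responses]))
```

With the zero-fill approach the Completeness average for these two responses would be 1.5; here it is 3.0 because only the response that reported a score contributes.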

#### Stuff it Evaluation

In [42]:
eval_scores = start_unsupervised_evaluation(model_output_stuff_it)
print(json.dumps(eval_scores, default=vars, indent=4))

{
    "Coherence": 4.2,
    "Accuracy": 4.8,
    "Factuality": 4.8,
    "Completeness": 3.6
}


#### Map Reduce Evaluation

In [43]:
eval_scores = start_unsupervised_evaluation(model_output_map_reduce)
print(json.dumps(eval_scores, default=vars, indent=4))

{
    "Coherence": 4.6,
    "Accuracy": 4.2,
    "Factuality": 4.2,
    "Completeness": 3.4
}


#### Auto Refine Evaluation

In [44]:
eval_scores = start_unsupervised_evaluation(model_output_auto_refine)
print(json.dumps(eval_scores, default=vars, indent=4))

{
    "Coherence": 4.2,
    "Accuracy": 3.8,
    "Factuality": 4.0,
    "Completeness": 3.0
}


#### Multi-Doc Evaluation

In [45]:
eval_scores = start_unsupervised_evaluation(model_output_multi_doc)
print(json.dumps(eval_scores, default=vars, indent=4))

{
    "Coherence": 4.4,
    "Accuracy": 5.0,
    "Factuality": 5.0,
    "Completeness": 3.6
}


### Detect Hallucination and Errors

In this evaluation, we use an LLM to detect hallucinations and errors in the summary output. We use a smaller Llama 2 model to detect errors in the output generated by a larger Anthropic Claude model.

#### Get Model Output

Because the Llama 2 model used for this evaluation has a much smaller context length, the chapter summary dataset cannot be used. Instead, we use the [XSum](https://github.com/EdinburghNLP/XSum/tree/master/XSum-Dataset) dataset, which contains much shorter texts and summaries.

In [48]:
dataset = "xsum_sample.jsonl"

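A quick way to see why the chapter dataset does not fit is to estimate the prompt size before calling the model. The sketch below uses a rough rule of thumb of about four characters per token, which is only an approximation and not the Llama 2 tokenizer. The 4,096-token limit is Llama 2's published context window; `estimate_tokens`, `fits_context`, and `reserved_for_output` are illustrative names.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; assumes about 4 characters per token (heuristic)."""
    return int(len(text) / chars_per_token) + 1


def fits_context(text: str, context_window: int = 4096, reserved_for_output: int = 512) -> bool:
    """True when the estimated prompt size leaves room for the model's reply."""
    return estimate_tokens(text) <= context_window - reserved_for_output


short_article = "word " * 400      # about 2,000 characters
book_chapter = "word " * 10000     # about 50,000 characters
print(fits_context(short_article), fits_context(book_chapter))
```
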
#### Stuff it Summary

In [50]:
model_output_stuff_it = get_summary(dataset, sum_type="stuff_it", func=stuff_it_summary)
print(model_output_stuff_it)



#### Map Reduce Summary

In [51]:
model_output_map_reduce = get_summary(dataset, sum_type="map_reduce", func=map_reduce_summary)
print(model_output_map_reduce)



#### Auto Refine Summary

In [52]:
model_output_auto_refine = get_summary(dataset, sum_type="auto_refine", func=generate_single_doc_summary)
print(model_output_auto_refine)



#### Multi-Doc Summary

In [53]:
model_output_multi_doc = get_summary(dataset, sum_type="multi_doc", func=generate_multiple_docs_summary)
print(model_output_multi_doc)



#### Define helper functions

In [54]:
import boto3
import json
import os
import sys
import re

bedrock_runtime = boto3.client(
    service_name='bedrock-runtime',
    region_name='us-east-1',
)

def invoke_model(prompt_data):
    body = json.dumps({"prompt": prompt_data, "temperature": 0.2, "top_p": 0.5}) # Encode body as JSON string

    modelId = 'meta.llama2-13b-chat-v1'
    accept = 'application/json'
    contentType = 'application/json'

    #Invoke the model
    response = bedrock_runtime.invoke_model(body=body.encode('utf-8'), # Encode to bytes
                                     modelId=modelId, 
                                     accept=accept, 
                                     contentType=contentType)

    response_body = json.loads(response.get('body').read())
    return response_body.get('generation')


def get_error_evaluation_from_model(text, summary):
    
    # The f-string already interpolates text and summary; no extra .format() call is needed
    prompt = f"""For the given text and its summary, evaluate the summary and detect if there are any 
    errors in the summary when compared with the text. Provide each error found in a numbered list.

    Text: {text}

    Summary: {summary}

    """
    
    evaluation = invoke_model(prompt)
    eval_str = "Original Text:\n" + text + "\n\n" + "Model Summary:\n" + summary + "\n\n" + evaluation + "\n\n"
    
    return eval_str

def start_error_detection(dataset):
    
    results = []
    for data in dataset:
        resp = get_error_evaluation_from_model(data["text"], data["model_output"])
        results.append(resp)
        
    # Return the combined report so the caller can print or store it
    return "\n".join(results)

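Because the prompt asks for errors in a numbered list, the free-text reply can be turned into a simple count per summary. The sketch below is one way to do that with a regular expression; `extract_numbered_items` is an illustrative name and the sample reply is invented, so real Llama 2 output may need a more forgiving pattern.

```python
import re
from typing import List


def extract_numbered_items(reply: str) -> List[str]:
    """Return the text of each line that starts with '1.', '2)', and so on."""
    pattern = re.compile(r"^\s*\d+[.)]\s+(.*\S)\s*$", flags=re.MULTILINE)
    return pattern.findall(reply)


# Invented reply in the format the prompt requests.
sample_reply = """Here are the errors I found:
1. The summary says the River Nith flooded, but the text names the River Cree.
2. The summary omits the damage at the Lamington Viaduct.
"""
errors = extract_numbered_items(sample_reply)
print(len(errors))
for error in errors:
    print("-", error)
```
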
#### Stuff it Evaluation

In [55]:
si_eval = start_error_detection(model_output_stuff_it)
print(si_eval)

Original Text:
The full cost of damage in Newton Stewart, one of the areas worst affected, is still being assessed.
Repair work is ongoing in Hawick and many roads in Peeblesshire remain badly affected by standing water.
Trains on the west coast mainline face disruption due to damage at the Lamington Viaduct.
Many businesses and householders were affected by flooding in Newton Stewart after the River Cree overflowed into the town.
First Minister Nicola Sturgeon visited the area to inspect the damage.
The waters breached a retaining wall, flooding many commercial properties on Victoria Street - the main shopping thoroughfare.
Jeanette Tate, who owns the Cinnamon Cafe which was badly affected, said she could not fault the multi-agency response once the flood hit.
However, she said more preventative work could have been carried out to ensure the retaining wall did not fail.
"It is difficult but I do think there is so much publicity for Dumfries and the Nith - and I totally appreciate that

#### Map Reduce Evaluation

In [56]:
mr_eval = start_error_detection(model_output_map_reduce)
print(mr_eval)

Original Text:
The full cost of damage in Newton Stewart, one of the areas worst affected, is still being assessed.
Repair work is ongoing in Hawick and many roads in Peeblesshire remain badly affected by standing water.
Trains on the west coast mainline face disruption due to damage at the Lamington Viaduct.
Many businesses and householders were affected by flooding in Newton Stewart after the River Cree overflowed into the town.
First Minister Nicola Sturgeon visited the area to inspect the damage.
The waters breached a retaining wall, flooding many commercial properties on Victoria Street - the main shopping thoroughfare.
Jeanette Tate, who owns the Cinnamon Cafe which was badly affected, said she could not fault the multi-agency response once the flood hit.
However, she said more preventative work could have been carried out to ensure the retaining wall did not fail.
"It is difficult but I do think there is so much publicity for Dumfries and the Nith - and I totally appreciate that

#### Auto Refine Evaluation

In [57]:
ar_eval = start_error_detection(model_output_auto_refine)
print(ar_eval)

Original Text:
The full cost of damage in Newton Stewart, one of the areas worst affected, is still being assessed.
Repair work is ongoing in Hawick and many roads in Peeblesshire remain badly affected by standing water.
Trains on the west coast mainline face disruption due to damage at the Lamington Viaduct.
Many businesses and householders were affected by flooding in Newton Stewart after the River Cree overflowed into the town.
First Minister Nicola Sturgeon visited the area to inspect the damage.
The waters breached a retaining wall, flooding many commercial properties on Victoria Street - the main shopping thoroughfare.
Jeanette Tate, who owns the Cinnamon Cafe which was badly affected, said she could not fault the multi-agency response once the flood hit.
However, she said more preventative work could have been carried out to ensure the retaining wall did not fail.
"It is difficult but I do think there is so much publicity for Dumfries and the Nith - and I totally appreciate that

#### Multi-Doc Evaluation

In [58]:
md_eval = start_error_detection(model_output_multi_doc)
print(md_eval)

Original Text:
The full cost of damage in Newton Stewart, one of the areas worst affected, is still being assessed.
Repair work is ongoing in Hawick and many roads in Peeblesshire remain badly affected by standing water.
Trains on the west coast mainline face disruption due to damage at the Lamington Viaduct.
Many businesses and householders were affected by flooding in Newton Stewart after the River Cree overflowed into the town.
First Minister Nicola Sturgeon visited the area to inspect the damage.
The waters breached a retaining wall, flooding many commercial properties on Victoria Street - the main shopping thoroughfare.
Jeanette Tate, who owns the Cinnamon Cafe which was badly affected, said she could not fault the multi-agency response once the flood hit.
However, she said more preventative work could have been carried out to ensure the retaining wall did not fail.
"It is difficult but I do think there is so much publicity for Dumfries and the Nith - and I totally appreciate that