In [3]:
# Install HF components used in this task
! pip install transformers datasets evaluate sentencepiece



In [4]:
# Load the combined excerpts that serve as the QA context
with open('sample_data/combined.txt', 'r', encoding='utf-8') as file:
    context = file.read()

print(len(context))

9564


# PART 1

In [5]:
from transformers import pipeline

default_qa_pipeline = pipeline('question-answering')

def ask_question(qa_pipeline, context, question):
    result = qa_pipeline(question=question, context=context)
    return result

def print_answer(question, answer):
    print(f"Question: {question}")
    print(f"Answer: {answer['answer']} (Score: {answer['score']})", end="\n\n")

No model was supplied, defaulted to distilbert-base-cased-distilled-squad and revision 626af31 (https://huggingface.co/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
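
The warning above can be addressed by naming the model and revision explicitly. A minimal sketch, assuming the checkpoint name and short revision hash reported in the warning are the ones to pin; `build_pinned_qa_pipeline` is a hypothetical helper name, and the import is deferred so that defining the helper does not load the library or download anything:

```python
# Model name and revision are taken from the warning printed above.
QA_MODEL = "distilbert-base-cased-distilled-squad"
QA_REVISION = "626af31"


def build_pinned_qa_pipeline(model=QA_MODEL, revision=QA_REVISION):
    """Create a question-answering pipeline with an explicit model and revision."""
    # Imported lazily so that this helper can be defined without loading transformers.
    from transformers import pipeline

    return pipeline("question-answering", model=model, revision=revision)
```

Calling `build_pinned_qa_pipeline()` should give the same kind of object as `pipeline('question-answering')`, but a later change to the library's default model would no longer alter the results.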



In [6]:
questions = [
    "Who is the antagonist in the text?",
    "Who is the protagonist in the text?",
    "Who is the perpetrator in the text?",
    "What was the crime?",
    "What is the setting of the crime scene?",
    "What is the evidence of the crime?",
    "What is the case against the perpetrator?",
    "What are the characteristics of Holmes?",
    "What are the characteristics of Watson?",
    "What are the characteristics of Stapleton?",
]

for idx, question in enumerate(questions):
    answer = ask_question(default_qa_pipeline, context, question)
    print(f"Question #{idx+1}")
    print_answer(question, answer)

Question #1
Question: Who is the antagonist in the text?
Answer: Dr. Watson (Score: 0.6252009272575378)

Question #2
Question: Who is the protagonist in the text?
Answer: Dr. Mortimer (Score: 0.2579243779182434)

Question #3
Question: Who is the perpetrator in the text?
Answer: Dr. Mortimer (Score: 0.6852825880050659)

Question #4
Question: What was the crime?
Answer: smoking a cigar (Score: 0.06112584099173546)

Question #5
Question: What is the setting of the crime scene?
Answer: hearth-rug (Score: 0.4477247893810272)

Question #6
Question: What is the evidence of the crime?
Answer: an almost
incredible facial distortion (Score: 0.17967890202999115)

Question #7
Question: What is the case against the perpetrator?
Answer: new baronet might refuse to live here (Score: 7.210580952232704e-05)

Question #8
Question: What are the characteristics of Holmes?
Answer: physical and spiritual (Score: 0.42503899335861206)

Question #9
Question: What are the characteristics of Watson?
Answer: phys

### PART 1 Observations

The question-answering results on the selected excerpts from "The Hound of the Baskervilles" were generally inaccurate. The pipeline named Dr. Watson as the antagonist and Dr. Mortimer as both protagonist and perpetrator, although Jack Stapleton is the antagonist and perpetrator. It gave the crime as "smoking a cigar" rather than Stapleton's plot leading to the death of Sir Charles Baskerville. The answers for the crime-scene setting ("hearth-rug") and the case against the perpetrator were likewise wrong and captured none of the novel's relevant detail. The characteristics returned for Holmes, Watson, and Stapleton were vague or only partially correct, focusing on physical traits rather than their more significant psychological and behavioral qualities. Overall, the responses show how difficult it is for an extractive model to pull precise answers from this text.
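
One way to read the scores is to flag answers that fall below a cutoff. The sketch below uses the answers and scores printed in the Part 1 output (rounded, questions 1 to 8 only, since the rest of the output is cut off). The 0.3 threshold and the `low_confidence` helper are illustrative assumptions, not part of the original notebook:

```python
# Answers and scores copied from the default pipeline output above (scores rounded).
recorded = [
    ("Who is the antagonist in the text?", "Dr. Watson", 0.6252),
    ("Who is the protagonist in the text?", "Dr. Mortimer", 0.2579),
    ("Who is the perpetrator in the text?", "Dr. Mortimer", 0.6853),
    ("What was the crime?", "smoking a cigar", 0.0611),
    ("What is the setting of the crime scene?", "hearth-rug", 0.4477),
    ("What is the evidence of the crime?", "an almost incredible facial distortion", 0.1797),
    ("What is the case against the perpetrator?", "new baronet might refuse to live here", 7.2e-05),
    ("What are the characteristics of Holmes?", "physical and spiritual", 0.4250),
]


def low_confidence(rows, threshold=0.3):
    """Return the (question, answer, score) rows whose score is below the threshold."""
    return [row for row in rows if row[2] < threshold]


for question, answer, score in low_confidence(recorded):
    print(f"{score:.4f}  {question} -> {answer}")
```

The score is the model's confidence in the extracted span, not a measure of correctness: the antagonist and perpetrator answers clear this threshold even though both are wrong.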

# PART 1 DEMO

In [31]:
from transformers import pipeline
from ipywidgets import widgets, Layout
from IPython.display import display, clear_output

def build_ui():
    question_input = widgets.Text(
        value='',
        placeholder='Type your question here...',
        description='Question:',
        disabled=False,
        layout=Layout(width='50%')
    )

    submit_button = widgets.Button(
        description='Get Answer',
        disabled=False,
        button_style='success',
        tooltip='Click to get the answer',
        icon='check'
    )

    output_area = widgets.Output()

    return question_input, submit_button, output_area

question_input, submit_button, output_area = build_ui()

def on_submit_button_clicked(_):
    with output_area:
        clear_output(wait=True)
        if question_input.value.strip() == '':
            print("Please enter the question.")
        else:
            answer = default_qa_pipeline(question=question_input.value, context=context)
            print_answer(question_input.value, answer)

def render_ui(output_area, question_input, submit_button, click_handler):
    submit_button.on_click(click_handler)
    display(question_input, submit_button, output_area)

render_ui(output_area, question_input, submit_button, on_submit_button_clicked)



# PART 1.2

In [8]:
distilbert_qa_pipeline = pipeline('question-answering', model='distilbert-base-uncased-distilled-squad')

for idx, question in enumerate(questions):
    answer = ask_question(distilbert_qa_pipeline, context, question)
    print(f"Question #{idx+1}")
    print_answer(question, answer)


Question #1
Question: Who is the antagonist in the text?
Answer: Dr. Mortimer (Score: 0.7157095670700073)

Question #2
Question: Who is the protagonist in the text?
Answer: Dr. Mortimer (Score: 0.6830291152000427)

Question #3
Question: Who is the perpetrator in the text?
Answer: Dr. Mortimer (Score: 0.7829236388206482)

Question #4
Question: What was the crime?
Answer: cardiac exhaustion (Score: 0.4838252067565918)

Question #5
Question: What is the setting of the crime scene?
Answer: the moor (Score: 0.077842578291893)

Question #6
Question: What is the evidence of the crime?
Answer: medical evidence (Score: 0.14137911796569824)

Question #7
Question: What is the case against the perpetrator?
Answer: dyspnœa and
death from cardiac exhaustion (Score: 0.07204890251159668)

Question #8
Question: What are the characteristics of Holmes?
Answer: physical and spiritual (Score: 0.5203794836997986)

Question #9
Question: What are the characteristics of Watson?
Answer: dignified, solid, and re

### PART 1.2 Observations

Compared with the default model (distilbert-base-cased-distilled-squad), the uncased DistilBERT model (distilbert-base-uncased-distilled-squad) was no better at identifying characters in "The Hound of the Baskervilles": it named Dr. Mortimer as antagonist, protagonist, and perpetrator, whereas the default model named Dr. Watson as antagonist and Dr. Mortimer for the other two roles. The uncased model reported higher confidence on all three answers. It gave somewhat more relevant answers for the crime and the setting, citing "cardiac exhaustion" and "the moor" where the default model gave "smoking a cigar" and "hearth-rug." Both models struggled with specificity, particularly when characterizing Holmes, Watson, and Stapleton, offering vague descriptors such as "physical and spiritual" or "dignified, solid, and reassuring." The answers suggest that neither model had a strong grasp of the limited excerpts supplied as context. Overall, the uncased model showed slight improvements but still failed to answer basic questions about the provided context accurately.
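
The comparison can be made concrete by lining up the two sets of answers. This sketch uses only the answers printed above for questions 1 to 8 (the later output is truncated); the `agreement` helper is a hypothetical name introduced for illustration:

```python
# Answers copied from the two runs above, questions 1-8.
cased = [
    "Dr. Watson",
    "Dr. Mortimer",
    "Dr. Mortimer",
    "smoking a cigar",
    "hearth-rug",
    "an almost incredible facial distortion",
    "new baronet might refuse to live here",
    "physical and spiritual",
]
uncased = [
    "Dr. Mortimer",
    "Dr. Mortimer",
    "Dr. Mortimer",
    "cardiac exhaustion",
    "the moor",
    "medical evidence",
    "dyspnœa and death from cardiac exhaustion",
    "physical and spiritual",
]


def agreement(first, second):
    """Return the 1-based question numbers where both models gave the same answer."""
    return [n for n, (a, b) in enumerate(zip(first, second), start=1) if a == b]


print(agreement(cased, uncased))
```

Agreement between the models says nothing about correctness; here the two models agree on the protagonist and perpetrator, and both answers are wrong.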

# PART 2

In [9]:
def augment_answer(question, answer):
    # Wrap the extracted span in a full sentence; the first matching keyword wins.
    q = question.lower()
    if "antagonist" in q:
        return f"The antagonist is {answer}."
    elif "protagonist" in q:
        return f"The protagonist is {answer}."
    elif "evidence" in q:
        # Checked before "crime" so that "evidence of the crime" gets this template.
        return f"The evidence of the crime is {answer}."
    elif "crime scene" in q:
        return f"The setting of the crime scene is {answer}."
    elif "crime" in q:
        return f"The crime was {answer}."
    elif " case " in q:
        return f"The case against the perpetrator involves {answer}."
    elif "perpetrator" in q:
        return f"{answer} is the perpetrator."
    elif "characteristic" in q:
        return f"They are characterized by {answer}."
    else:
        return answer
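
Because the first matching keyword decides the template, the order of the checks matters. A table-driven variant makes that order explicit. This is a sketch, not the notebook's implementation: `RULES` and `augment_with_rules` are hypothetical names, and the "evidence" rule is deliberately placed ahead of the "crime" rules so that "What is the evidence of the crime?" receives the evidence template:

```python
# Ordered (keyword, template) rules; the first keyword found in the question wins.
RULES = [
    ("antagonist", "The antagonist is {answer}."),
    ("protagonist", "The protagonist is {answer}."),
    ("evidence", "The evidence of the crime is {answer}."),
    ("crime scene", "The setting of the crime scene is {answer}."),
    ("crime", "The crime was {answer}."),
    (" case ", "The case against the perpetrator involves {answer}."),
    ("perpetrator", "{answer} is the perpetrator."),
    ("characteristic", "They are characterized by {answer}."),
]


def augment_with_rules(question, answer, rules=RULES):
    """Return the answer wrapped in the template of the first matching rule."""
    q = question.lower()
    for keyword, template in rules:
        if keyword in q:
            return template.format(answer=answer)
    # No rule matched: return the raw answer unchanged.
    return answer


print(augment_with_rules("What is the evidence of the crime?", "an almost incredible facial distortion"))
```

Keeping the rules in a list makes it easy to reorder them or add one without touching the control flow.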

In [28]:
def ask_question(qa_pipeline, en_to_es_pipeline, es_to_en_pipeline, context, question):
    result = qa_pipeline(question=question, context=context)
    # augmenting the answer here because the model gives us back pithy responses
    answer = augment_answer(question, result['answer'])
    es_result = en_to_es_pipeline(answer)
    es_answer = es_result[0]['translation_text']
    en_result = es_to_en_pipeline(es_answer)
    en_answer = en_result[0]['translation_text']
    return answer, es_answer, en_answer

def print_translation_answer(question, answer, es_answer, en_answer):
    print(f"Question: {question}")
    print(f"Answer in English: {answer}")
    print(f"Answer in Spanish: {es_answer}")
    print(f"Answer in English, translated from the above: {en_answer}", end="\n\n")

In [29]:
translator_en_to_es = pipeline('translation', model='Helsinki-NLP/opus-mt-en-es')
translator_es_to_en = pipeline('translation', model='Helsinki-NLP/opus-mt-es-en')

for idx, question in enumerate(questions):
    answer, es_answer, en_answer = ask_question(default_qa_pipeline, translator_en_to_es, translator_es_to_en, context, question)
    print(f"Question #{idx+1}")
    print_translation_answer(question, answer, es_answer, en_answer)



Question #1
Question: Who is the antagonist in the text?
Answer in English: The antagonist is Dr. Watson.
Answer in Spanish: El antagonista es el Dr. Watson.
Answer in English, translated from the above: The antagonist is Dr. Watson.

Question #2
Question: Who is the protagonist in the text?
Answer in English: The protagonist is Dr. Mortimer.
Answer in Spanish: El protagonista es el Dr. Mortimer.
Answer in English, translated from the above: The protagonist is Dr. Mortimer.

Question #3
Question: Who is the perpetrator in the text?
Answer in English: Dr. Mortimer is the perpetrator.
Answer in Spanish: El Dr. Mortimer es el autor.
Answer in English, translated from the above: Dr. Mortimer is the author.

Question #4
Question: What was the crime?
Answer in English: The crime was smoking a cigar.
Answer in Spanish: El crimen era fumar un cigarro.
Answer in English, translated from the above: The crime was smoking a cigarette.

Question #5
Question: What is the setting of the crime scene?


### PART 2 Observations

Round-trip translation (English to Spanish and back) with the default QA pipeline and the Helsinki-NLP models produced several errors and inconsistencies. The QA model's misidentification of Dr. Watson as the antagonist and Dr. Mortimer as the protagonist passed through translation unchanged. Some meaning shifted along the way: "perpetrator" became "autor" and came back as "author," and "cigar" came back as "cigarette." A significant error arose in the crime-scene setting, where "hearth-rug" was mistranslated in Spanish and came back into English as "a firearm." The characteristics of Holmes, Watson, and Stapleton were translated accurately, but the QA model's answers were vague to begin with. Overall, the translation step largely preserved the meaning of the QA model's output, inaccuracies included, which shows that the translated result can only be as good as the QA answer it starts from.
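
Drift introduced by the round trip can be spotted by comparing the words of the original English sentence with those of the back-translated one. The sketch below uses sentence pairs copied from the output above; `words` and `round_trip_drift` are hypothetical helpers, and the set-based comparison is a simplifying assumption that ignores word order and repeated words:

```python
import re


def words(sentence):
    """Lower-case the sentence and return its set of word tokens."""
    return set(re.findall(r"[\w'-]+", sentence.lower()))


def round_trip_drift(original, round_trip):
    """Return (words lost, words gained) between the original and the round trip."""
    before, after = words(original), words(round_trip)
    return sorted(before - after), sorted(after - before)


# (original English, English after the Spanish round trip), from the output above.
pairs = [
    ("The antagonist is Dr. Watson.", "The antagonist is Dr. Watson."),
    ("Dr. Mortimer is the perpetrator.", "Dr. Mortimer is the author."),
    ("The crime was smoking a cigar.", "The crime was smoking a cigarette."),
]

for original, round_trip in pairs:
    lost, gained = round_trip_drift(original, round_trip)
    print(f"{original!r}: lost={lost}, gained={gained}")
```

An empty result means the wording survived the round trip; it does not mean the underlying QA answer was right.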

# PART 2.2

In [30]:
translator_en_to_es = pipeline('translation', model='domenicrosati/opus-mt-en-es-scielo')
translator_es_to_en = pipeline('translation', model='domenicrosati/opus-mt-es-en-scielo')

for idx, question in enumerate(questions):
    answer, es_answer, en_answer = ask_question(default_qa_pipeline, translator_en_to_es, translator_es_to_en, context, question)
    print(f"Question #{idx+1}")
    print_translation_answer(question, answer, es_answer, en_answer)

Question #1
Question: Who is the antagonist in the text?
Answer in English: The antagonist is Dr. Watson.
Answer in Spanish: El antagonista es el doctor Watson.
Answer in English, translated from the above: The antagonist is Dr. Watson.

Question #2
Question: Who is the protagonist in the text?
Answer in English: The protagonist is Dr. Mortimer.
Answer in Spanish: El protagonista es el Dr. Mortimer.
Answer in English, translated from the above: The protagonist is Dr. Mortimer.

Question #3
Question: Who is the perpetrator in the text?
Answer in English: Dr. Mortimer is the perpetrator.
Answer in Spanish: El autor es el Dr. Mortimer.
Answer in English, translated from the above: The author is Dr. Mortimer.

Question #4
Question: What was the crime?
Answer in English: The crime was smoking a cigar.
Answer in Spanish: El delito fue fumar un cigarro.
Answer in English, translated from the above: The crime was smoking a cigarette.

Question #5
Question: What is the setting of the crime scen

In [34]:
question_input, submit_button, output_area = build_ui()

def on_submit_translation_button_clicked(_):
    with output_area:
        clear_output(wait=True)
        if question_input.value.strip() == '':
            print("Please enter the question.")
        else:
            answer, es_answer, en_answer = ask_question(default_qa_pipeline, translator_en_to_es, translator_es_to_en, context, question_input.value)
            print_translation_answer(question_input.value, answer, es_answer, en_answer)

render_ui(output_area, question_input, submit_button, on_submit_translation_button_clicked)


### PART 2.2 Observations

With the domenicrosati models, the QA pipeline's misidentifications (Dr. Watson as the antagonist, Dr. Mortimer as the protagonist and perpetrator) again survived translation, as they did with the Helsinki-NLP models. Each translation model also introduced its own errors: the domenicrosati model rendered "hearth-rug" as "centrifuged," whereas the Helsinki-NLP model produced "firearm." The evidence of the crime came back as "an almost incredible facial distortion" with both models, matching the QA pipeline's original output. Both models shifted the meaning of the case against the perpetrator somewhat, but the domenicrosati translation stayed closer to the original (incorrect) QA response than the Helsinki-NLP one did. Overall, the domenicrosati translations were syntactically coherent but, like the Helsinki-NLP translations, could not correct the QA model's inaccuracies and sometimes added errors of their own.
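
To see where the two translation models diverge, their Spanish outputs can be compared directly. This sketch uses the Spanish sentences printed above for questions 1 to 4 and the standard library's `difflib.SequenceMatcher`; the `similarity` helper is a hypothetical name, and a character-level ratio is only a rough proxy for translation agreement:

```python
from difflib import SequenceMatcher

# Spanish outputs copied from the two runs above (questions 1-4).
helsinki = [
    "El antagonista es el Dr. Watson.",
    "El protagonista es el Dr. Mortimer.",
    "El Dr. Mortimer es el autor.",
    "El crimen era fumar un cigarro.",
]
scielo = [
    "El antagonista es el doctor Watson.",
    "El protagonista es el Dr. Mortimer.",
    "El autor es el Dr. Mortimer.",
    "El delito fue fumar un cigarro.",
]


def similarity(a, b):
    """Character-level similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a, b).ratio()


for number, (h, s) in enumerate(zip(helsinki, scielo), start=1):
    print(f"Q{number}: {similarity(h, s):.2f}")
```

A ratio of 1.0 means the two models produced identical Spanish; lower values point to sentences worth reading side by side.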

# PART 3

In [13]:
def summarize(summarizer_pipeline):
    # Each file holds the excerpt for one aspect of the story
    file_names = [
        'antagonist',
        'protagonist',
        'crime',
        'evidence',
        'resolution'
    ]

    for file_name in file_names:
        with open(f'sample_data/{file_name}.txt', 'r', encoding='utf-8') as file:
            text = file.read()
        # min_length and max_length are measured in tokens, so a summary can stop mid-sentence
        summary = summarizer_pipeline(text, min_length=25, max_length=50, do_sample=False)
        print(f'{file_name.title()} Summary:')
        print(f"Summary: {summary[0]['summary_text']}", end='\n\n')

In [14]:
default_summarizer = pipeline('summarization')

summarize(default_summarizer)

No model was supplied, defaulted to sshleifer/distilbart-cnn-12-6 and revision a4f8f3e (https://huggingface.co/sshleifer/distilbart-cnn-12-6).
Using a pipeline without specifying a model name and revision in production is not recommended.



Antagonist Summary:
Summary:  A small, slim, clean-shaven, prim-faced man, flaxen-haired and lean-jawed, dressed in a grey suit and wearing a straw hat . A tin box for botanical specimens hung over his

Protagonist Summary:
Summary:  Holmes picked up the stick which our visitor had left behind him the night before . It was a fine, thick piece of wood, of the sort which is known as a “Penang lawyer.” Just under the head was

Crime Summary:
Summary:  Sir Charles Baskerville had declared his intention of starting next day for London, and had ordered his master to prepare his luggage . That night he went out for his nocturnal walk, in the course of which he was in the

Evidence Summary:
Summary:  He led me back into the house with a candle in his hand, and he held it up against the time-stained portrait on the wall . He looked at the broad plumed hat, the curling love-locks, the lace

Resolution Summary:
Summary:  Stapleton had hoped that his wife might lure Sir Charles Baskerville to his 

# PART 3.2

In [15]:
fb_summarizer = pipeline('summarization', model='facebook/bart-large-cnn')

summarize(fb_summarizer)


Antagonist Summary:
Summary: Dr. Watson was walking on the moor when he heard a voice calling him by name. A small, slim, clean-shaven,prim-faced man, flaxen-haired and lean-jawed, between thirty

Protagonist Summary:
Summary: The stick was a fine, thick piece of wood, of the sort which is known as a “Penang lawyer.” Just under the head was a broad silver band nearly an inch wide. “To James Mortimer

Crime Summary:
Summary: Sir Charles Baskerville was found dead on the moor at the end of a night walk. The coroner's jury returned a verdict in accordance with the medical evidence. No signs of violence were to be discovered upon Sir Charles’

Evidence Summary:
Summary: He led me back into the                banqueting-hall, his bedroom candle in his hand, and he held it up against the time-stained portrait on the wall. The face of Stapleton had sprung out of the canvas

Resolution Summary:
Summary: Stapleton had hoped that his wife might lure Sir Charles Baskerville to his ruin. But she p

### PART 3 Observations

Overall, the default DistilBART model (sshleifer/distilbart-cnn-12-6) summarized less well than the Facebook BART model (facebook/bart-large-cnn). The default model produced numerous grammatical errors (e.g., extra spaces before periods and incomplete sentences), whereas the BART model produced cleaner, more readable summaries. The summaries from both models are mostly inaccurate, though the BART model's Crime and Resolution summaries were fairly accurate; the default model handled those two summaries less well.
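
The stray spaces before periods and the cut-off final sentences can be cleaned up after generation. The sketch below is one possible post-processing step, not something the notebook does: `tidy_summary` is a hypothetical helper that normalizes spacing and drops any trailing fragment after the last sentence-ending mark. It treats every period followed by a space as a sentence end, so abbreviations such as "Dr." can cause an early cut:

```python
import re

# Closing quotation marks that may directly follow sentence-ending punctuation.
CLOSING_QUOTES = "\"'\u201d\u2019"
SENTENCE_END = re.compile(r"[.!?][" + re.escape(CLOSING_QUOTES) + r"]?(?= |$)")


def tidy_summary(text):
    """Normalize spacing and trim a summary to its last complete sentence."""
    # Collapse runs of whitespace and strip the ends.
    cleaned = re.sub(r"\s+", " ", text).strip()
    # Remove the space left before punctuation, e.g. "hat ." becomes "hat."
    cleaned = re.sub(r" ([.,;:!?])", r"\1", cleaned)
    ends = list(SENTENCE_END.finditer(cleaned))
    if not ends:
        # No complete sentence found: return the cleaned text as-is.
        return cleaned
    return cleaned[: ends[-1].end()]


raw = (
    " A small, slim, clean-shaven, prim-faced man, flaxen-haired and lean-jawed, "
    "dressed in a grey suit and wearing a straw hat . "
    "A tin box for botanical specimens hung over his"
)
print(tidy_summary(raw))
```

Trimming hides the truncation rather than fixing it; raising `max_length` in the `summarize` call is the other lever, at the cost of longer summaries.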