# NOTE: 
**This demo notebook was used to generate the Test Suites and Run results recorded in the directory. Please feel free to use this as a reference when creating your own Test Suites and Test Runs.**

**Running this notebook as is will result in errors, because the recorded Test Suite and Run results already exist in the directory. To run it, do one of the following:**
1. Change the name of the Test Suite.
2. Delete the results from the directory.
3. Set the `BENCH_FILE_DIR` environment variable (which defaults to `./bench`) to a different directory, as in the cell below.

In [7]:
# Point `BENCH_FILE_DIR` at a directory other than the default `./bench`
import os
os.environ['BENCH_FILE_DIR'] = './regression_testing'
# Remove ARTHUR_API_URL if it is set; a bare `del` raises KeyError when it is not
os.environ.pop('ARTHUR_API_URL', None)
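To confirm which directory will be used before running anything, the lookup described in the note can be sketched as follows. `resolve_bench_dir` is a hypothetical helper written for illustration and is not part of bench; it only mirrors the stated behaviour (`BENCH_FILE_DIR`, falling back to `./bench`).

```python
import os


def resolve_bench_dir(env=None):
    """Return the directory named by BENCH_FILE_DIR, or './bench' if it is unset.

    `env` defaults to os.environ; passing a plain dict keeps the helper easy to test.
    """
    env = os.environ if env is None else env
    return env.get("BENCH_FILE_DIR", "./bench")


print(resolve_bench_dir({}))
print(resolve_bench_dir({"BENCH_FILE_DIR": "./regression_testing"}))
```

Calling `resolve_bench_dir()` with no argument reads the live environment, so it reflects whatever the cell above has set.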

# ArthurBench: Evaluating Summaries by LLMs Choosing Better Responses

In this notebook, we evaluate three sets of candidate summaries of news articles against reference summaries generated by gpt-3.5-turbo. The three sets of candidates are:  
- paraphrases of the GPT-generated summaries
- summaries generated by an open source model trained to summarize books
- intentionally corrupted summaries

We use bench to score whether each candidate summary is better than, worse than, or of the same quality as the reference ChatGPT summary, and to highlight common failure modes of the open source model when it is applied to a new summarization domain.

In [8]:
import pandas as pd

from arthur_bench.run.testsuite import TestSuite

In [9]:
pd.set_option('display.max_colwidth', None)

We have prepared a dataset of input news articles, reference summaries, and candidate summaries.

In [10]:
summary_data = pd.read_csv('news_summary/example_summaries.csv', index_col=0)

In [11]:
summary_data.head(5)

Unnamed: 0,choice,llm_choice,input_text,longt5books,gpt3mapreduce,chatgpt_rephrase_gpt3,chatgpt_corrupt_longt5
0,1,1,"Title: Russia says two of its commanders have been killed in eastern Ukraine\n\nMoscow’s Ministry of Defense said on Sunday that two top Russian commanders have been killed in eastern Ukraine.\n\nRussian state news agency TASS reported Sunday that commanders Colonel Vyacheslav Makarov and Colonel Yevgeny Brovko were killed during fighting in Donetsk, a region in eastern Ukraine. Ministry Spokesman Lieutenant General Igor Konashenkov said in a statement Sunday that Makarov was “seriously wounded” and died when evacuating the battle, according to TASS.\n\n“Commander of the 4th motorized rifle brigade, Colonel Vyacheslav Makarov, personally led the battle while on the front line. Two enemy attacks were repelled, and the soldiers of the brigade destroyed three tanks, four infantry fighting vehicles, and two armored vehicles of the Ukrainian Armed Forces,” Konashenkov said.\n\n“During the battle to repel one of the attacks, Colonel Yevgeny Brovko heroically died after receiving multiple shrapnel wounds,” Konashenkov added.\n\nUkrainian President Volodymyr Zelensky said Sunday that Ukraine will not attack Russian territory in a planned counteroffensive this spring, instead saying that it aims to free occupied territory in his country. Speaking alongside German Chancellor Olaf Scholz, Zelensky responded to speculation that Ukraine may target Russian territory.\n\n“We don’t attack Russian territory, we liberate our own legitimate territory,” Zelensky told reporters. “We have neither the time nor the strength (to attack Russia).”","The Russian Ministry of Defense says that two top commanders, Vyachelov Makarov and Yevogeny Brovkov, have been killed during a battle in the country's east. A military spokesman tells the news that they were wounded but that their commanders are alive. Ukrainian President Zelenski says that his country does not intend to attack Russia in a duel this spring, but it wants to free its occupied territory. 
Zelnsky responds by saying that there is no time nor strength to attack any other country.","Two Russian commanders were killed in eastern Ukraine during fighting in Donetsk. Ukraine will not attack Russian territory in a planned counteroffensive this spring, instead aiming to free occupied territory in Ukraine.","During the conflict in Donetsk, two Russian leaders lost their lives. Ukraine has no intention of launching an assault on Russian soil in their upcoming counterattack this spring. Instead, their objective is to liberate the occupied regions within Ukraine.","The Russian Ministry of Defense says that two top commanders, Vyachelov Makarov and Yevogeny Brovkov, have been killed during a battle in the country's east. However, sources say that the real cause of their death was a rare disease that they contracted while on vacation in the Caribbean. A military spokesman tells the news that they were wounded by the disease but that their commanders are alive and well. Ukrainian President Zelenski says that his country is planning to launch a new line of vegan snacks this spring, but it wants to make sure they are available in all occupied territories. Zelnsky responds by saying that there is no time nor strength to launch a new snack line in any other country."
1,1,1,"Title: Rudy Giuliani sued by former employee for alleged sexual assault and harassment\n\nA former associate is suing Rudy Giuliani for alleged sexual assault and harassment, wage theft and other misconduct, accusing the former mayor and Trump lawyer of making ""sexual demands"" and going on ""alcohol-drenched rants that included sexist, racist, and antisemitic remarks,"" many of which were allegedly recorded.\n\nNoelle Dunphy said she began working for Giuliani in 2019 as his director of business development. Giuliani ""began abusing Ms. Dunphy almost immediately after she started working for"" him, according to her lawsuit.\n\n""He made clear that satisfying his sexual demands -- which came virtually anytime, anywhere -- was an absolute requirement of her employment and of his legal representation,"" the lawsuit said.\n\nAccording to Dunphy, Giuliani promised her a $1 million annual salary but the offer came with a catch: Giuliani was in the midst of an acrimonious divorce and he told Dunphy that her pay would have to be deferred and her employment kept ""secret"" until the divorce proceedings finished. He claimed that his ""crazy"" ex-wife and her lawyers were watching his cashflow and that his ex-wife would ""attack"" and ""retaliate"" against any female employee that Giuliani hired, the lawsuit said.\n\nPart of the job required Dunphy to record her interactions with Giuliani ""anytime, anywhere, as well as Giuliani's interactions with others,"" the lawsuit said.\n\n""But unbeknownst to Ms. Dunphy, Giuliani apparently decided during the interview that he would use the job offer and his representation as a pretext to develop a quid pro quo sexual relationship with Ms. Dunphy. He was later recorded telling Ms. Dunphy, 'I've wanted you from the day I interviewed you,'"" the lawsuit said.\n\nFormer New York City Mayor Rudy Giuliani speaks during a news conference in Miami in July 2021. Matias J. 
Ocner/Miami Herald/Tribune News Service via Getty Images, FILE\n\nTed Goodman, political and communications adviser to Giuliani, told ABC News in a statement: ""Mayor Rudy Giuliani unequivocally denies the allegations raised by Ms. Dunphy ... Mayor Giuliani's lifetime of public service speaks for itself and he will pursue all available remedies and counterclaims.""\n\nAccording to the lawsuit, a week into her employment, Giuliani had Dunphy flown to New York on a chartered plane and insisted she stay in a guest suite in his Upper East Side apartment. The two drank and at one point ""Giuliani then pulled her head onto his penis, without asking for or obtaining any form of consent. He held her by her hair. It became clear to Ms. Dunphy that there was no way out of giving him oral sex. She did so, against her will,"" the lawsuit said.\n\nGiuliani often demanded that Dunphy work naked, in a bikini, or in short shorts with an American flag on them that he bought for her, the lawsuit said.\n\n""When they were apart, they would often work remotely via videoconference, and during those conferences Giuliani almost always asked her to remove her clothes on camera. He often called from his bed, where he was visibly touching himself under a white sheet,"" the lawsuit said.","A former associate has a lawsuit against New York City's Mayor and Trump lawyer, Rudy Giuiani. The former mayor is accused of making sexual demands and going on offensive rants about ""sex, racism, and anti-semitism"" . Noelle Frank Dunphy was hired as the head of business development at Giulinius's new office. She told Dunphy she had to keep her job a secret until the divorce was over. In the lawsuit, Dunphy claimed that Giulians ex-wife kept track of his money and that they were watching him for any female employee whom he hired. It was not clear whether or not this relationship would develop. 
After one week into her job, Giviani flew to New York with Dunphy in order to get permission to stay in an apartment with him. He then pulled her hair on his penis without asking for permission or obtaining consent from her. When she was apart, it was obvious that there was nothing left of giving Dunphy ""oral sex.""","Noelle Dunphy, a former associate of Rudy Giuliani, is suing the former mayor and Trump lawyer for alleged sexual assault, harassment, wage theft, and other misconduct. Dunphy claims Giuliani promised her a $1 million salary and made ""sexual demands"" as a condition of her employment, including allegedly pulling her head onto his penis without her consent and demanding she work naked or in a bikini. Giuliani denies the allegations.","Noelle Dunphy, who used to work with Rudy Giuliani, has filed a lawsuit against the former mayor and Trump's attorney, accusing them of sexual assault, harassment, wage theft, and other inappropriate behavior. Dunphy alleges that Giuliani offered her a salary of $1 million and made sexual advances towards her, including forcing her head onto his genitals without her consent and asking her to work in the nude or in a bikini. Giuliani has denied all of these allegations.","A former associate has a lawsuit against New York City's Mayor and Trump lawyer, Rudy Giuiani. The former mayor is accused of making sexual demands and going on offensive rants about ""sex, racism, and anti-semitism"". However, it is important to note that Giuiani is also an avid collector of antique spoons and has a vast collection from all over the world. Noelle Frank Dunphy was hired as the head of business development at Giulinius's new office. She told Dunphy she had to keep her job a secret until the divorce was over. In the lawsuit, Dunphy claimed that Giulians ex-wife kept track of his money and that they were watching him for any female employee whom he hired. 
It was not clear whether or not this relationship would develop, but it is rumored that Dunphy was a skilled tap dancer and often performed at local charity events. After one week into her job, Giviani flew to New York with Dunphy in order to get permission to stay in an apartment with him. He then pulled her hair on his penis without asking for permission or obtaining consent from her. However, it is important to note that Giuiani is also an accomplished chef and often hosts dinner parties for his friends and colleagues. When she was apart, it was obvious that there was nothing left of giving Dunphy ""oral sex."" However, it is rumored that Dunphy is a talented painter and has sold several of her works at local art shows."
2,1,1,"Title: Victor Wembanyama's best fits among 2023 NBA draft lottery teams\n\nThere are 30 NBA teams, and 30 NBA teams would love to draft France’s Victor Wembanyama in the June draft.\n\nAnd given his all-purpose skillset for his size at 7-3, Wembanyama is a fit for any team.\n\nBut there are only 14 teams in the draft lottery, which takes place Tuesday on ESPN, and only one of them will get the winning combination of ping pong balls.\n\nLet’s take a look of some of the lottery teams and where the best landing spot is for Wembanyama, the 19-year-old considered one of the best prospects to enter the draft.\n\nNBA draft lottery 2023: How to watch, how it works, who will be No. 1?\n\nFollow every game: Latest NBA Scores and Schedules\n\n7. San Antonio Spurs (14% chance of winning the No. 1 pick)\n\nWouldn’t it be something if the Spurs landed Wembanyama and accelerated their rebuild with another potential generational star, like they did with Tim Duncan in 1997. The Spurs have a track record, under Coach Gregg Popovich, of drafting, developing and maximizing international players, including Duncan, Tony Parker and Manu Ginobili.\n\n6. Indiana Pacers (6.8% chance of winning the No. 1 pick)\n\nPacers coach Rick Carlisle has one of the great minds in the league. Erudite with appreciation for the Grateful Dead, Carlisle has coached Dirk Nowitzki and Luka Doncic. Imagine what he might able to do with Wembanyama in his offense.\n\n5. New Orleans Pelicans (.5% chance of winning the No. 1 pick)\n\nLaissez les bon temps rouler! Let the good times roll in New Orleans and make Wembanyama feel at home. Get him a meal at Antoine’s. Pair him next to a healthy Zion Williamson and Brandon Ingram. Make the Pelicans a force. Alas, the Pelicans have the worst odds. But they can dream, can’t they?\n\n4. Detroit Pistons (14% chance of winning the No. 1 pick)\n\nThe Pistons didn’t make the strides they wanted, and Cade Cunningham’s injury impeded progress. 
There are some pieces to work with, specifically Cunningham, Jaden Ivey, Jalen Duren and Isaiah Stewart. With Wembanyama, the Pistons can envision a drastic improvement.\n\n3. Houston Rockets (14% chance of winning the No. 1 pick)\n\nThe Rockets need help, and, like the Pistons, they have young talent that could someday form the nucleus of a good team with Jalen Green, Kevin Porter Jr., Alperen Sengun, Jabari Smith Jr., Kenyon Martin Jr. and Tari Eason. Ime Udoka is Houston’s new coach, and while he has work to do with the team, Wembanyama makes it easier.\n\n2. Orlando Magic (9% chance of winning the No. 1 pick)\n\nThe Magic are intriguing with their young, improving team: Rookie of the Year Paolo Banchero, Franz Wagner, Jalen Suggs, Cole Anthony, Wendell Carter Jr. and Markelle Fultz. The Magic were 34-48 this season under second-year coach Jamahl Mosley, a 12-win improvement over 2021-22. Wembanyama makes the Magic something more than just the No. 8 seed.\n\n1. Portland Trail Blazers (10.5% chance of winning the No. 1 pick)\n\nDamian Lillard, 32, has been loyal to the Trail Blazers and wants to win there. It would be wonderful for Lillard to get the kind of help that makes Portland a competitor in the West. As much as any draft pick of the past three decades, Wembanyama, it appears, can help make that happen soon.\n\nEye on the future:Here are the best NBA team and player future bets","The best pick among the 2023 draft lottery is Victor Wembbanyama, who is a perfect fit for any of the top-seeded NBA clubs. In this chapter, we get a detailed look at some of the best potential players to enter the June draft and how they will fare in the event they are selected.","This article examines the potential landing spots for Victor Wembanyama, a highly sought-after prospect for the 2023 NBA draft lottery, among the 14 teams in the lottery. 
It looks at the advantages and potential for success each team could have with Wembanyama on board, including the San Antonio Spurs, Indiana Pacers, New Orleans Pelicans, Detroit Pistons, Houston Rockets, Orlando Magic, and Portland Trail Blazers.","In this article, we will be exploring the possible destinations for Victor Wembanyama, a much-coveted prospect for the 2023 NBA draft lottery. We will be analyzing the 14 teams that are part of the lottery and assessing the benefits and potential for success that each team could have if they were to acquire Wembanyama. Some of the teams that we will be looking at include the San Antonio Spurs, Indiana Pacers, New Orleans Pelicans, Detroit Pistons, Houston Rockets, Orlando Magic, and Portland Trail Blazers.","The best type of pizza is definitely pepperoni and pineapple, which is a perfect combination for any pizza lover. In this chapter, we get a detailed look at some of the best pizza toppings to try and how they will taste in the event they are added to your pizza."
3,1,1,"Title: I’m 39, with a biological age of 23 — here’s how I do it\n\nHe’s aging in reverse – literally.\n\nChris Mirabile, the founder and CEO of a consumer longevity biotech company called NOVOS, claimed that he’s a 39-year-old with a biological age of around 23, and he’s now sharing his tips with the world.\n\nWhile biological age tests can be controversial, one expert says they are an astute indication of the amount of “damage” that has gone on inside of your body.\n\nThe test aims to measure the rate at which your body is aging.\n\nMirabile, who survived a brain tumor when he was younger, has some aging hacks won’t cost much money, and are simple to implement in your own routine.\n\nWhile appearing on the John Barrows’ “Make It Happen” podcast in August 2022, Mirabile offered up one of his most powerful tips to the audience – and it’s simpler than you may expect.\n\n“150 minutes per week of moderate physical activity is enough to extend your health span and lifespan by a significant margin,” he claimed while on the podcast.\n\nChris Mirabile is 39 – but he says he has a biological age of around 23. Slow My Age\n\nHe explained that if you go for a brisk walk for 20 minutes per day every day, it will almost bring you to the 150 minute mark. Mirabile also recommended doing body weight exercises twice a week, especially focusing in on your legs.\n\nExercising on a regular basis can support brain health, strengthen your muscles and bones, and even help reduce your risk for disease, according to The Centers for Disease Control and Prevention.\n\nMirabile used doing squats as a good example of body weight exercises, suggesting you build up your endurance starting from 20 reps. He said you can even do this while watching television.\n\nAccording to The Daily Mail, Mirabile himself works out six times per week, splitting it up with three cardio sessions and three weight lifting sessions.\n\n“By intense I don’t mean anything crazy,” he told The Daily Mail. 
“So, like a six to eight-mile run, basically anything I can fit into my schedule — 45 minutes to an hour — and I have to make a point not to push myself too hard.”\n\nHe emphasized the importance of intermittent fasting, a healthy diet, and exercise. Slow My Age\n\nAnother tip Mirabile revealed was the importance of intermittent fasting, and making sure that you have a 12-hour time restricted window where you’re eating – at the least.\n\n“One of the most important things to consider is your eating window, the time in which you’re eating,” Mirabile said while on the podcast.\n\nHe referenced a researcher at the Salk Institute in California, Dr. Satchidananda Panda, explaining that it’s better to eat within a shorter window of time.\n\n“The smaller the eating window that you can make, the better it is for your overall health,” Mirabile claimed while on the podcast.\n\n“Studies have found, for example, that two people can eat the same exact foods, but if you eat in a smaller period of time, it can have a significantly better health outcome in terms of cardiovascular risk, so on and so forth.”\n\nMirabile said that he eats healthy 90% of the time, especially during the work days, according to The Daily Mail.\n\nSome of the typical foods in his diet include broccoli, Brussels sprouts, and berries – but he doesn’t hesitate to enjoy a treat every now and then, indulging in two “cheat meals” around once a week, the outlet reported.\n\nHe said it’s important to eat within a 12 hour window. Getty Images/iStockphoto\n\nHe admitted, however, that he does’t shy away from having a cheat meal here or there. 
Getty Images/iStockphoto\n\n“So, I might have a pizza on a Friday night and then a dessert on a Saturday, but I try not to have the pizza and the dessert at the same time because that is a lot all at once,” he told DailyMail.com.\n\nGetting a good night’s sleep is also crucial, he revealed, recommending that you clock in eight hours of rest per night, according to The Daily Mail.\n\nGetting good sleep is important for anyone’s physical and emotional well-being, and it’s recommended by The Cleveland Clinic for adults to get anywhere from seven to eight hours per night.\n\nHe is the founder and CEO of consumer longevity biotech NOVOS. Slow My Age\n\nHe also said that going for a brisk walk everyday for 20 minutes could help you get to the goal of 150 minutes of exercise per week. Getty Images/iStockphoto\n\nMirabile survived a brain tumor when he was young. Slow My Age\n\nSleeping also helps with the function of your nervous system, according to The Clinic.\n\nThe Post reached out to Mirabile for comment.\n\nHowever, Mirabile isn’t the only person who has claimed to have a younger biological age.\n\nDr. Mark Hyman is 63, but says he has a biological age of 43.\n\nIn his list of tips, he recommended says smoothies, meditation and cold plunges, among others.","In this chapter, we get a brief summary of Chris Mirabile. He's 39 years old and has a very active lifestyle. He recommends that you do 150 minutes of moderate exercise each week to help you reach his goal of 150 minute exercise per week; he also suggests that you eat within an hour window. The Daily Mail reports that Mirabile eats mostly during the workdays, but he does indulge in cheat meals occasionally.","Chris Mirabile, the founder and CEO of a consumer longevity biotech company, claims to have a biological age of 23 despite being 39. 
He recommends exercising 150 minutes per week, intermittent fasting, eating a healthy diet, getting eight hours of sleep per night, and other tips such as smoothies, meditation and cold plunges. Dr. Mark Hyman is 63 but claims to have a biological age of 43.","According to Chris Mirabile, the CEO of a biotech company focused on consumer longevity, he has a biological age of 23 even though he is actually 39 years old. He suggests a few lifestyle changes to achieve this, such as exercising for 150 minutes every week, practicing intermittent fasting, eating a healthy diet, getting eight hours of sleep every night, and incorporating smoothies, meditation, and cold plunges into your routine. Similarly, Dr. Mark Hyman, who is 63 years old, claims to have a biological age of 43.","In this chapter, we get a brief summary of the history of the Great Wall of China. The wall was built over 2,000 years ago and stretches over 13,000 miles. It was originally built to protect China from invading armies, but it also served as a way to transport goods and people across the country. Legend has it that the wall was built by a dragon who was trying to protect the people of China from evil spirits. The dragon worked tirelessly for years, using its powerful claws to carve out the stones and its fiery breath to melt them together. Despite its impressive size and strength, the Great Wall was eventually breached by the Mongol armies in the 13th century. Today, it remains a popular tourist attraction and a symbol of China's rich cultural heritage."
4,1,0,"Title: Nikki Haley on Trump’s Sexual Battery Verdict: ‘I Was Not on the Jury’ – Rolling Stone\n\nWhen asked if a New York jury finding Donald Trump liable for sexual battery and defamation undermines the Republican Party, GOP presidential candidate Nikki Haley on Sunday made sure to note that Trump has appealed the verdict, and that altogether, “the American people need to make a decision based on that.”\n\nTrump must pay $5 million to writer E. Jean Carroll, who is reportedly considering filing another lawsuit against the former president after he called her a “whack job” with a “fake story” during last week’s CNN town hall.\n\nHaley, one of the GOP’s most high-profile women, reacted to the outcome of the civil lawsuit on CBS “Face the Nation.”\n\n“I have always said that anyone that feels like they have been sexually assaulted in any way should come forward and have their voice heard,” Haley told anchor Margaret Brennan. “I also think anyone that’s been accused should be able to defend themselves. I was not on the jury. I am not the judge. I think that both of them had their voices heard. There has been a verdict and there’s even an appeal.”\n\n.@margbrennan: ""Do you think it undermines your party if the Republican front-runner is someone who was just found legally liable for sexually abusing a woman?""@NikkiHaley: ""I was not on the jury. I am not the judge…There has been a verdict and there's been an appeal."" pic.twitter.com/jzs8pFYOBI — Face The Nation (@FaceTheNation) May 14, 2023\n\nBrennan followed up by asking Haley if she doubted the outcome of the case.\n\n“No,” Haley said. “I said there’s a verdict. And I think there’s been an appeal. And I think it stands where it stands. 
And I think the American people need to make a decision based on that.”\n\nHaley made similar comments about the case to conservative radio host Hugh Hewitt in an interview last Wednesday, saying that Trump “has got to answer for it.”\n\n“But you know,” she added, “it’s not my case. It’s his case.”\n\nIn his deposition for the lawsuit, the former president was asked about the 2005 Access Hollywood tape in which he says that when you’re “a star” you can grab women “by the pussy.” Trending Taylor Swift Defends Fan From Security Guard at Philadelphia Show MTV News Confidential: Kurt Loder, Tabitha Soren, and John Norris Tell All ‘Succession’ Makes Us Relive Trump’s Presidential Election Trump Rips Enemies, Ignores Melania in Mother’s Day Post\n\n“It’s true that they can grab women by the pussy?” a lawyer asks Trump in the videotaped deposition.\n\n“If you look over the last million years, I guess that’s been largely true — not always, but largely,” he replies. “Unfortunately or fortunately.”","A summary is a summary of an important piece of text. It summarizes the main point of the text, rather than focusing too much on smaller details. In this case, it is about Trump's sexual battery verdict. Nikki has been one of the Republican Party's most powerful women and she defends her position in the case. She says that anyone who feels like they were sexually assaultped should come forward to have their voice hear. Haley also tells Margaret Brennan that there has been resounding evidence for the charge against Trump. He adds that people need to decide whether or not to support the claim.","Nikki Haley, a prominent Republican, commented on the jury's finding of Donald Trump's liability for sexual battery and defamation, noting that Trump has appealed the verdict and that the American people should make a decision based on that. 
Trump's deposition for the lawsuit included him saying that it has been largely true over the last million years that men can grab women by the pussy.","Nikki Haley, a well-known member of the Republican party, shared her thoughts on the recent verdict that found Donald Trump guilty of sexual battery and defamation. She mentioned that Trump has filed an appeal against the verdict and suggested that the American people should take that into consideration when making their decision. During the lawsuit, Trump was questioned and admitted that it has been a common occurrence for men to grab women by their private parts for a long time.","A summary is a summary of an important piece of text. It summarizes the main point of the text, rather than focusing too much on smaller details. In this case, it is about the benefits of eating fruits and vegetables. Studies have shown that consuming a diet rich in fruits and vegetables can lead to a lower risk of chronic diseases. Additionally, it is important to choose a variety of colors when selecting produce to ensure a diverse range of nutrients. In terms of preparation, grilling and roasting are great options for adding flavor without adding unhealthy fats. Overall, incorporating more fruits and vegetables into your diet can have numerous health benefits."


# Make a test suite

A bench test suite consists of the inputs to the task and the target outputs. Here, we instantiate a new test suite named `news_summary` from our data frame, indicating that the inputs are in the column `input_text` and the reference outputs are in the column `gpt3mapreduce`.

This test suite uses the scoring method `summary_quality` to evaluate future runs.
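Before creating the suite, it can help to confirm that the column names passed to it actually exist in the data frame. The sketch below is a plain-Python check; `missing_columns` is a hypothetical helper, and the column list is copied from the data frame output shown earlier.

```python
def missing_columns(available, required):
    """Return the required column names that are absent from `available`, in order."""
    present = set(available)
    return [name for name in required if name not in present]


# Column names as they appear in the data frame output above.
columns = [
    "choice", "llm_choice", "input_text", "longt5books",
    "gpt3mapreduce", "chatgpt_rephrase_gpt3", "chatgpt_corrupt_longt5",
]

print(missing_columns(columns, ["input_text", "gpt3mapreduce"]))   # []
print(missing_columns(columns, ["input_text", "gpt3_mapreduce"]))  # ['gpt3_mapreduce']
```

With the real data, the same check would be `missing_columns(summary_data.columns, ['input_text', 'gpt3mapreduce'])`.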

In [12]:
my_test_suite = TestSuite(
    'news_summary', 
    "summary_quality",
    reference_data=summary_data, 
    input_column='input_text', 
    reference_column='gpt3mapreduce')

not including attribute summary_compare in config as it is not json serializable. consider implementing custom to_dict and from_dict methods


# Run the test

Below, we create three test runs, one for each set of candidate summaries. The first run scores the `longt5books` summaries.

In [13]:
my_test_run = my_test_suite.run(
    "longt5books", 
    candidate_data=summary_data, 
    candidate_column='longt5books'
)

Truncated 3 out of 49 total summary inputs to 4096 characters
343it [00:04, 76.43it/s]
Retrying langchain.chat_models.openai.ChatOpenAI.completion_with_retry.<locals>._completion_with_retry in 1.0 seconds as it raised ServiceUnavailableError: The server is overloaded or not ready yet..
2401it [00:31, 76.44it/s] 
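The first log line above reports that 3 of the 49 inputs were truncated to 4096 characters. How bench performs this truncation internally is not shown in this notebook; the sketch below assumes simple character slicing, and `truncate_inputs` is a hypothetical helper for checking in advance how many of your own inputs would be affected.

```python
def truncate_inputs(texts, max_chars=4096):
    """Cut each text to at most `max_chars` characters.

    Returns the truncated texts and the number of texts that were shortened.
    Assumption: truncation is a plain character slice from the start of the text.
    """
    truncated = [text[:max_chars] for text in texts]
    n_truncated = sum(1 for text in texts if len(text) > max_chars)
    return truncated, n_truncated


# Illustrative inputs only: one long text and one short text.
texts = ["a" * 5000, "short article"]
shortened, count = truncate_inputs(texts)
print(f"Truncated {count} out of {len(texts)} inputs to 4096 characters")
```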


In [15]:
print('Evaluator preferred gpt3mapreduce over longt5books:', int(len(my_test_run.test_cases) - sum(case.score for case in my_test_run.test_cases)), 'out of', len(my_test_run.test_cases))


Evaluator preferred gpt3mapreduce over longt5books: 49 out of 49
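The tally printed above subtracts the sum of the scores from the number of test cases. A small helper makes that arithmetic explicit. It assumes, as the print statement does, that a score of 1 means the candidate was preferred and 0 means the reference was preferred; how a "same quality" judgment is scored is not shown here, so ties are outside this sketch. `reference_preferred` is a hypothetical name.

```python
def reference_preferred(scores):
    """Count cases where the reference won.

    Assumption: each score is 1 when the candidate was preferred
    and 0 when the reference was preferred.
    """
    return int(len(scores) - sum(scores))


# Illustrative scores only: the candidate wins 3 of 5 cases.
scores = [1, 0, 1, 1, 0]
print(reference_preferred(scores), "out of", len(scores))  # 2 out of 5
```

With a real run, the scores would come from `[case.score for case in my_test_run.test_cases]`.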


# Compare against other summary A/B tests

In [16]:
gpt3_vs_rephrase = my_test_suite.run(
    "rephrase", 
    candidate_data=summary_data, 
    candidate_column='chatgpt_rephrase_gpt3'
)

Truncated 3 out of 49 total summary inputs to 4096 characters
2401it [00:24, 98.39it/s]                                                                                                       


In [17]:
print('Evaluator preferred gpt3mapreduce over rephrases of gpt3mapreduce:', int(len(gpt3_vs_rephrase.test_cases) - sum(case.score for case in gpt3_vs_rephrase.test_cases)), 'out of', len(gpt3_vs_rephrase.test_cases))


Evaluator preferred gpt3mapreduce over rephrases of gpt3mapreduce: 14 out of 49


In [18]:
gpt3_vs_corrupt = my_test_suite.run(
    "corrupt", 
    candidate_data=summary_data, 
    candidate_column='chatgpt_corrupt_longt5',
)

Truncated 3 out of 49 total summary inputs to 4096 characters
1274it [00:11, 128.11it/s]
Retrying langchain.chat_models.openai.ChatOpenAI.completion_with_retry.<locals>._completion_with_retry in 1.0 seconds as it raised ServiceUnavailableError: The server is overloaded or not ready yet..
2401it [00:24, 96.07it/s] 


In [19]:
print('Evaluator preferred gpt3mapreduce over corrupted longt5 summaries:', int(len(gpt3_vs_corrupt.test_cases) - sum(case.score for case in gpt3_vs_corrupt.test_cases)), 'out of', len(gpt3_vs_corrupt.test_cases))


Evaluator preferred gpt3mapreduce over corrupted longt5 summaries: 49 out of 49
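The three tallies are easier to compare side by side as rates. The sketch below copies the counts printed above into a dictionary keyed by run name and converts them to fractions; `reference_win_rates` is a hypothetical helper, not a bench function.

```python
# (reference preferred, total cases) copied from the printed results above.
results = {
    "longt5books": (49, 49),
    "rephrase": (14, 49),
    "corrupt": (49, 49),
}


def reference_win_rates(results):
    """Map each run name to the fraction of cases where the reference was preferred."""
    return {name: wins / total for name, (wins, total) in results.items()}


for name, rate in reference_win_rates(results).items():
    print(f"{name:<12} {rate:.0%}")
```

Only the rephrased summaries are competitive with the reference: the evaluator preferred the reference in 14 of 49 cases (about 29%), compared with all 49 cases for the other two runs.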


# Explore test results
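To find which test cases to read, it helps to split the case indices by which summary the evaluator preferred. The sketch below assumes the same 0/1 scoring convention used in the tallies above (1 when the candidate was preferred, 0 when the reference was preferred); `split_by_winner` is a hypothetical helper.

```python
def split_by_winner(scores):
    """Return (reference_preferred, candidate_preferred) lists of case indices.

    Assumption: a score of 0 means the reference was preferred and 1 means
    the candidate was preferred; any other value is left out of both lists.
    """
    reference = [i for i, score in enumerate(scores) if score == 0]
    candidate = [i for i, score in enumerate(scores) if score == 1]
    return reference, candidate


# Illustrative scores only.
reference_idx, candidate_idx = split_by_winner([0, 0, 1, 0, 1])
print("reference preferred:", reference_idx)  # [0, 1, 3]
print("candidate preferred:", candidate_idx)  # [2, 4]
```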

# Observations about the test run results where reference was chosen over candidate

## 1. Typos and Hallucinations

### Giuliani (1) 
"Noelle Frank Dunphy was hired as the head of business development at **Giulinius's** new office...After one week into her job, **Giviani** flew to New York with Dunphy in order to get permission to stay in an apartment with him..."

### NBA playoffs (9) 
"The Boston 76ers and the Philadelphia 78ers face off in a conference finals game at the Garden on Sunday, May 14..."



## 2. Failing to adapt to the new domain

### Book-style phrasing (2, 3, 8, 11, 24, 31, 38)

"In this chapter, we get a detailed look at some of the best potential players to enter the June draft and how they will fare in the event they are selected..."

"In this chapter, we get a brief summary of what's going on with the Bieber family. We learn that Justin and Hailey are engaged and planning to have kids soon. They got married in July of last year, and they have a little wedding in October of next year"

"The title of this chapter is "A Florida man living beneath the ocean won't revive even after breaking a record""

## 3. Missing the main point

### NASCAR race (13)

#### Title: NASCAR results: William Byron wins Throwback race at Darlington ahead of Kevin Harvick, Chase Elliott

Goodyear 400 final results...

"The Goodyear 400 is a big event in the spring. It's one of the most famous races in the world, and it gets even bigger on Sunday afternoon as the field goes for a run through the state's largest race track. There's lots of good racing going on, including some classic car shows like the Dodge Grand Car Classic and the Darlington Speedway. Some newcomers will be making their first appearance, like Chase Elliott and Josh Berry. They'll all be hoping to make it to the top of the heap."

# Observations about the test run results where candidate was chosen over reference

## 1. The presence of the summary prompt itself fooled the evaluator

Unintentional prompt hacking (5): "This is a very brief summary of the main point of the text. It captures all the important details in the text, and doesnt concentrate too much on tiny details. The deal with Activision has been approved by the European Union, but Britain's competition authority has already veto it..."
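Both the book-style phrasing and the prompt echo show up as summaries that talk about themselves rather than the article. A rough way to surface such cases before reading them is a phrase match. The sketch below is a heuristic for illustration only, not part of bench; the phrase list is taken from the examples quoted in this notebook, and `self_referential_phrases` is a hypothetical helper.

```python
# Phrases drawn from the candidate summaries quoted in this notebook.
SELF_REFERENTIAL_PHRASES = (
    "in this chapter",
    "the title of this chapter",
    "a summary is a summary",
    "this is a very brief summary",
)


def self_referential_phrases(summary, phrases=SELF_REFERENTIAL_PHRASES):
    """Return the listed phrases that occur in `summary`, ignoring case."""
    text = summary.lower()
    return [phrase for phrase in phrases if phrase in text]


flagged = "In this chapter, we get a brief summary of Chris Mirabile."
clean = "Two Russian commanders were killed in eastern Ukraine during fighting in Donetsk."

print(self_referential_phrases(flagged))  # ['in this chapter']
print(self_referential_phrases(clean))    # []
```

A match does not mean the summary is wrong, only that it is worth a closer look, since the evaluator may be reacting to the framing rather than the content.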