From c3d5380944d666bc14956901074236a36450cec5 Mon Sep 17 00:00:00 2001
From: Aakash Thatte
Date: Thu, 13 Jun 2024 21:54:26 +0530
Subject: [PATCH 1/5] Add docs for summarization metric

---
 docs/concepts/metrics/summarization_score.md | 75 ++++++++++++++++++++
 1 file changed, 75 insertions(+)
 create mode 100644 docs/concepts/metrics/summarization_score.md

diff --git a/docs/concepts/metrics/summarization_score.md b/docs/concepts/metrics/summarization_score.md
new file mode 100644
index 000000000..3f8ed2fed
--- /dev/null
+++ b/docs/concepts/metrics/summarization_score.md
@@ -0,0 +1,75 @@
+# Summarization Score

This metric measures how well the `summary` captures the important information from the `contexts`. The intuition behind it is that a good summary should contain all the important information present in the context (that is, the text being summarized).

We first extract a set of important keyphrases from the context. These keyphrases are then used to generate a set of questions. The answers to these questions are always `yes` (`1`) for the context. We then pose the same questions to the summary and calculate the summarization score as the ratio of correctly answered questions to the total number of questions. The flowchart below illustrates the process:

```{mermaid}
graph LR
    A[Context] --> B[Keyphrases]
    B ---> C[Questions]
    A --> C

    C --> D[Answers]
    E[Summary] ----> D
    D --> F[Summarization Score]
```

```{hint}
**Summary**: JPMorgan Chase & Co. is an American multinational finance company headquartered in New York City. It is the largest bank in the United States and the world's largest by market capitalization as of 2023. Founded in 1799, it is a major provider of investment banking services, with US$3.9 trillion in total assets, and ranked #1 in the Forbes Global 2000 ranking in 2023.

**Keyphrases**: [
    "JPMorgan Chase & Co.",\
    "American multinational finance company",\
    "headquartered in New York City",\
    "largest bank in the United States",\
    "world's largest bank by market capitalization",\
    "founded in 1799",\
    "major provider of investment banking services",\
    "US$3.9 trillion in total assets",\
    "ranked #1 in Forbes Global 2000 ranking",\
    ]

**Questions**: [
    "Is JPMorgan Chase & Co. an American multinational finance company?",\
    "Is JPMorgan Chase & Co. headquartered in New York City?",\
    "Is JPMorgan Chase & Co. the largest bank in the United States?",\
    "Is JPMorgan Chase & Co. the world's largest bank by market capitalization as of 2023?",\
    "Is JPMorgan Chase & Co. considered systemically important by the Financial Stability Board?",\
    "Was JPMorgan Chase & Co. founded in 1799 as the Chase Manhattan Company?",\
    "Is JPMorgan Chase & Co. a major provider of investment banking services?",\
    "Is JPMorgan Chase & Co. the fifth-largest bank in the world by assets?",\
    "Does JPMorgan Chase & Co. operate the largest investment bank by revenue?",\
    "Was JPMorgan Chase & Co. ranked #1 in the Forbes Global 2000 ranking?",\
    "Does JPMorgan Chase & Co. provide investment banking services?",\
    ]

**Answers**: ["0", "1", "1", "1", "0", "0", "1", "1", "1", "1", "1"]
```

We compute the question-answer score from these answers, which form a list of `1`s and `0`s: it is the ratio of correctly answered questions (answer = `1`) to the total number of questions.
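To make the arithmetic concrete, here is a minimal sketch of the question-answer step applied to the example answers above; it is illustrative only, since ragas generates the keyphrases, questions, and answers with an LLM. The formal definition follows below.

```{code-block} python
# Minimal sketch of the QA-score step. Illustrative only: in ragas the
# questions and answers are produced by an LLM; here we reuse the example list.
answers = ["0", "1", "1", "1", "0", "0", "1", "1", "1", "1", "1"]

correct = sum(1 for a in answers if a == "1")
qa_score = correct / len(answers)
print(f"QA score = {correct}/{len(answers)} = {qa_score:.2f}")  # QA score = 8/11 = 0.73
```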
```{math}
:label: question-answer-score
\text{QA score} = \frac{|\text{correctly answered questions}|}{|\text{total questions}|}
```

We also provide an option to penalize longer summaries via a conciseness score. If this option is enabled, the final score is calculated as the average of the QA score and the conciseness score. The conciseness score ensures that a summary which simply copies the whole text does not get a high score, even though it would obviously answer all questions correctly.

```{math}
:label: conciseness-score
\text{conciseness score} = 1 - \frac{\text{length of summary}}{\text{length of context}}
```

The final summarization score is then calculated as:

```{math}
:label: summarization-score
\text{Summarization Score} = \frac{\text{QA score} + \text{conciseness score}}{2}
```

From 388e38013d1fca9e44d0311c7e9c57e031a7390e Mon Sep 17 00:00:00 2001
From: Aakash Thatte
Date: Thu, 13 Jun 2024 22:07:40 +0530
Subject: [PATCH 2/5] Add docs for summarization metric

---
 docs/concepts/metrics/index.md               |  4 ++
 docs/concepts/metrics/summarization_score.md | 49 +++++++++++++-------
 2 files changed, 35 insertions(+), 18 deletions(-)

diff --git a/docs/concepts/metrics/index.md b/docs/concepts/metrics/index.md
index e2f6abc65..228cc7412 100644
--- a/docs/concepts/metrics/index.md
+++ b/docs/concepts/metrics/index.md
@@ -15,6 +15,9 @@ Just like in any machine learning system, the performance of individual componen
 - [Context precision](context_precision.md)
 - [Context relevancy](context_relevancy.md)
 - [Context entity recall](context_entities_recall.md)
+- [Summarization Score](summarization_score.md)
+
+```{toctree}

## End-to-End Evaluation

@@ -36,5 +39,6 @@
context_entities_recall
semantic_similarity
answer_correctness
critique
+summarization_score
```

diff --git a/docs/concepts/metrics/summarization_score.md b/docs/concepts/metrics/summarization_score.md
index 3f8ed2fed..7c302f043 100644
--- a/docs/concepts/metrics/summarization_score.md
+++ b/docs/concepts/metrics/summarization_score.md
@@ -16,7 +16,25 @@ graph LR
    D --> F[Summarization Score]
```

+We compute the question-answer score from these answers, which form a list of `1`s and `0`s: it is the ratio of correctly answered questions (answer = `1`) to the total number of questions.
+
+```{math}
+:label: question-answer-score
+\text{QA score} = \frac{|\text{correctly answered questions}|}{|\text{total questions}|}
+```
+
+We also provide an option to penalize longer summaries via a conciseness score. If this option is enabled, the final score is calculated as the average of the QA score and the conciseness score. The conciseness score ensures that a summary which simply copies the whole text does not get a high score, even though it would obviously answer all questions correctly.
+
+```{math}
+:label: conciseness-score
+\text{conciseness score} = 1 - \frac{\text{length of summary}}{\text{length of context}}
+```
+
+The final summarization score is then calculated as:
+
+```{math}
+:label: summarization-score
+\text{Summarization Score} = \frac{\text{QA score} + \text{conciseness score}}{2}
+```

```{hint}
**Summary**: JPMorgan Chase & Co. is an American multinational finance company headquartered in New York City. It is the largest bank in the United States and the world's largest by market capitalization as of 2023. Founded in 1799, it is a major provider of investment banking services, with US$3.9 trillion in total assets, and ranked #1 in the Forbes Global 2000 ranking in 2023.

**Keyphrases**: [
    "JPMorgan Chase & Co.",\
    "American multinational finance company",\
    "headquartered in New York City",\
    "largest bank in the United States",\
    "world's largest bank by market capitalization",\
    "founded in 1799",\
    "major provider of investment banking services",\
    "US$3.9 trillion in total assets",\
    "ranked #1 in Forbes Global 2000 ranking",\
    ]

**Questions**: [
    "Is JPMorgan Chase & Co. an American multinational finance company?",\
    "Is JPMorgan Chase & Co. headquartered in New York City?",\
    "Is JPMorgan Chase & Co. the largest bank in the United States?",\
    "Is JPMorgan Chase & Co. the world's largest bank by market capitalization as of 2023?",\
    "Is JPMorgan Chase & Co. considered systemically important by the Financial Stability Board?",\
    "Was JPMorgan Chase & Co. founded in 1799 as the Chase Manhattan Company?",\
    "Is JPMorgan Chase & Co. a major provider of investment banking services?",\
    "Is JPMorgan Chase & Co. the fifth-largest bank in the world by assets?",\
    "Does JPMorgan Chase & Co. operate the largest investment bank by revenue?",\
    "Was JPMorgan Chase & Co. ranked #1 in the Forbes Global 2000 ranking?",\
    "Does JPMorgan Chase & Co. provide investment banking services?",\
    ]

**Answers**: ["0", "1", "1", "1", "0", "0", "1", "1", "1", "1", "1"]
```

+## Example

-We compute the question-answer score from these answers, which form a list of `1`s and `0`s: it is the ratio of correctly answered questions (answer = `1`) to the total number of questions.

```{math}
:label: question-answer-score
\text{QA score} = \frac{|\text{correctly answered questions}|}{|\text{total questions}|}
```

We also provide an option to penalize longer summaries via a conciseness score. If this option is enabled, the final score is calculated as the average of the QA score and the conciseness score. The conciseness score ensures that a summary which simply copies the whole text does not get a high score, even though it would obviously answer all questions correctly.

```{math}
:label: conciseness-score
\text{conciseness score} = 1 - \frac{\text{length of summary}}{\text{length of context}}
```

The final summarization score is then calculated as:

```{math}
:label: summarization-score
\text{Summarization Score} = \frac{\text{QA score} + \text{conciseness score}}{2}
```

```{code-block} python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import summarization_score

# c1, c2 are the source texts (contexts) and s1, s2 the corresponding summaries,
# all plain strings supplied by the user.
data_samples = {
    'contexts': [[c1], [c2]],
    'summary': [s1, s2],
}
dataset = Dataset.from_dict(data_samples)
score = evaluate(dataset, metrics=[summarization_score])
score.to_pandas()
```

From e2ecfcb45e736b5a98191a7724a3faf6e01a5dd2 Mon Sep 17 00:00:00 2001
From: Aakash Thatte
Date: Thu, 13 Jun 2024 22:08:00 +0530
Subject: [PATCH 3/5] Add docs for summarization metric

---
 docs/references/metrics.rst | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/references/metrics.rst b/docs/references/metrics.rst
index dbf2a89f2..8cc5e60bf 100644
--- a/docs/references/metrics.rst
+++ b/docs/references/metrics.rst
@@ -9,6 +9,7 @@ Metrics
    ragas.metrics.context_precision
    ragas.metrics.context_recall
    ragas.metrics.context_entity_recall
+   ragas.metrics.summarization_score

.. automodule:: ragas.metrics
    :members:

From c9a8f74d5e2a92989564afc9b6402eea34b5dbac Mon Sep 17 00:00:00 2001
From: Aakash Thatte
Date: Thu, 13 Jun 2024 22:08:37 +0530
Subject: [PATCH 4/5] Add mermaid

---
 docs/conf.py | 1 +
 1 file changed, 1 insertion(+)

diff --git a/docs/conf.py b/docs/conf.py
index 8f379141e..b87dd95a5 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -36,6 +36,7 @@
    "sphinxawesome_theme.highlighting",
    # "sphinxawesome_theme.docsearch",
    "myst_nb",
+   "sphinxcontrib.mermaid",
]

source_suffix = [".rst", ".md"]
From bd328706294a56fd747127428427fc8164395f8d Mon Sep 17 00:00:00 2001
From: Aakash Thatte
Date: Fri, 14 Jun 2024 23:04:24 +0530
Subject: [PATCH 5/5] Remove mermaid

---
 docs/concepts/metrics/summarization_score.md | 14 +-------------
 docs/conf.py                                 |  1 -
 2 files changed, 1 insertion(+), 14 deletions(-)

diff --git a/docs/concepts/metrics/summarization_score.md b/docs/concepts/metrics/summarization_score.md
index 7c302f043..224acd8c5 100644
--- a/docs/concepts/metrics/summarization_score.md
+++ b/docs/concepts/metrics/summarization_score.md
@@ -2,19 +2,7 @@

This metric measures how well the `summary` captures the important information from the `contexts`. The intuition behind it is that a good summary should contain all the important information present in the context (that is, the text being summarized).

-We first extract a set of important keyphrases from the context. These keyphrases are then used to generate a set of questions. The answers to these questions are always `yes` (`1`) for the context. We then pose the same questions to the summary and calculate the summarization score as the ratio of correctly answered questions to the total number of questions. The flowchart below illustrates the process:
-
-```{mermaid}
-graph LR
-    A[Context] --> B[Keyphrases]
-    B ---> C[Questions]
-    A --> C
-
-    C --> D[Answers]
-    E[Summary] ----> D
-    D --> F[Summarization Score]
-```
+We first extract a set of important keyphrases from the context. These keyphrases are then used to generate a set of questions. The answers to these questions are always `yes` (`1`) for the context. We then pose the same questions to the summary and calculate the summarization score as the ratio of correctly answered questions to the total number of questions.

We compute the question-answer score from these answers, which form a list of `1`s and `0`s: it is the ratio of correctly answered questions (answer = `1`) to the total number of questions.

diff --git a/docs/conf.py b/docs/conf.py
index b87dd95a5..8f379141e 100644
--- a/docs/conf.py
+++ b/docs/conf.py
@@ -36,7 +36,6 @@
    "sphinxawesome_theme.highlighting",
    # "sphinxawesome_theme.docsearch",
    "myst_nb",
-   "sphinxcontrib.mermaid",
]

source_suffix = [".rst", ".md"]
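The `Example` snippet added in patch 2 leaves the context and summary values as placeholders. A hypothetical filled-in call might look like the following; the context string is a stand-in, and only the summary reuses the JPMorgan Chase example from the documentation page above.

```{code-block} python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import summarization_score

# Hypothetical values: `context` is a stand-in passage; `summary` reuses the
# JPMorgan Chase example summary shown earlier on the page.
context = "JPMorgan Chase & Co. is an American multinational finance company ..."
summary = (
    "JPMorgan Chase & Co. is an American multinational finance company "
    "headquartered in New York City. It is the largest bank in the United States "
    "and the world's largest by market capitalization as of 2023."
)

dataset = Dataset.from_dict({"contexts": [[context]], "summary": [summary]})
result = evaluate(dataset, metrics=[summarization_score])
print(result.to_pandas())
```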