# Study Summaries Comparison

## Clustering

We organize both summaries into predefined themes.  

To compare the summaries, we define four themes for each one, chosen so that a given theme covers roughly the same portion of the original study in both.

The generic themes that are easy to identify in both summaries are `Introduction`, `Methodology`, `Findings` and `Conclusion`.

For each summary we define a dictionary with these generic themes as keys and the actual section titles from the summary text as values.


In [86]:
# MySummary.txt themes
my_themes = {
    'Introduction': 'Introduction and Motivation',
    'Methodology' : 'Research Methodology',
    'Findings'    : 'Findings and Analysis',
    'Conclusion'  : 'Limitations and Future Work'}

# LLM_Summary.txt themes
llm_themes = {
    'Introduction': 'Introduction',
    'Methodology' : 'Methodology',
    'Findings'    : 'Key Findings',
    'Conclusion'  : 'Conclusion and Future Work'}

We define a function that splits a summary's text into sections based on its headings.  
The function returns a dictionary with the generic headings `Introduction`, `Methodology`, `Findings` and `Conclusion` as keys and the corresponding text as values.


In [104]:
import re

def extract_sections(text, themes):
    """
    Splits text into sections based on the provided theme headings.
    Returns a dictionary with theme as key and corresponding text as value.
    """
    sections = {}
    # get a list of the actual text headings
    text_headings = themes.values()
    # create a reverse mapping of actual text_headings -> generic headings
    generic_headings = {v:k for k,v in themes.items()}
    # Create a regex pattern that matches any of the text headings.
    # (Assuming that text headings appear at the beginning of a line)
    pattern = r'(?m)^(' + '|'.join(re.escape(heading) for heading in text_headings) + r')'
    
    # Find all matches and split text accordingly.
    splits = re.split(pattern, text)
    # re.split returns a list where headings are also part of the result.
    # The first element is any text before the first heading (if any).
    current_heading = None
    for segment in splits:
        segment = segment.strip()
        if segment in text_headings:
            current_heading = segment
            sections[generic_headings[current_heading]] = ""
        elif current_heading:
            sections[generic_headings[current_heading]] += segment + "\n"
    return sections
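
The splitting step relies on a detail of `re.split`: when the pattern contains a capturing group, the matched headings are kept in the result list. A minimal sketch on a made-up two-section text (`toy_text` and `toy_headings` are illustrative names, not part of the summaries):

```python
import re

# Hypothetical two-section text, used only for illustration
toy_text = "Introduction\nWhy the study matters.\nMethodology\nHow it was done.\n"
toy_headings = ["Introduction", "Methodology"]

# Same pattern construction as in extract_sections
pattern = r'(?m)^(' + '|'.join(re.escape(h) for h in toy_headings) + r')'

# The capturing group makes re.split keep the headings between the text parts;
# the first element is the (here empty) text before the first heading
parts = re.split(pattern, toy_text)
print(parts)
# ['', 'Introduction', '\nWhy the study matters.\n', 'Methodology', '\nHow it was done.\n']
```

This is why the loop in `extract_sections` can tell headings and body text apart by checking whether a segment is one of the known headings.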


Reading the summary files:

In [299]:
# Read the files
with open("MySummary.txt", "r", encoding="utf-8") as file:
    my_summary = file.read()

with open("LLM_Summary.txt", "r", encoding="utf-8") as file:
    llm_summary = file.read()


Extracting the sections for each summary.

In [None]:
# Extract sections for each summary
my_sections = extract_sections(my_summary, my_themes)
llm_sections = extract_sections(llm_summary, llm_themes)

Checking and validating the section split result:

In [111]:
# check the number of sections
print("my_sections length:", len(my_sections))
print("llm_sections length:", len(llm_sections))

my_sections length: 4
llm_sections length: 4


Visually inspecting the last section for both cases:

In [112]:
print(my_sections.keys())

dict_keys(['Introduction', 'Methodology', 'Findings', 'Conclusion'])


In [114]:
print(my_sections['Conclusion'])

While domain models can clearly highlight missing requirements, this study did not evaluate whether analysts effectively identify and correct those omissions in practice. Future research should include user studies to explore the practical effectiveness of domain models in supporting requirements validation.

Conclusion
This empirical study provides concrete evidence supporting domain models' value as effective tools for completeness checking in natural-language requirements specifications. By systematically highlighting omissions, particularly entirely missing requirements, domain models can significantly improve requirements quality, making them valuable components of requirements engineering practice.



In [115]:
print(llm_sections.keys())

dict_keys(['Introduction', 'Methodology', 'Findings', 'Conclusion'])


In [116]:
print(llm_sections['Conclusion'])

The study provides empirical evidence that domain models can help identify missing and under-specified requirements, though their effectiveness depends on how frequently concepts are referenced in the requirements. The results suggest that domain models should be complemented by other techniques for completeness checking. Future work should focus on user studies to evaluate whether analysts can effectively leverage domain models in practice.



## Diffing with Python's `difflib`

We use the built-in `difflib` module to compute the similarity ratio between two sections.  
We define the function `compute_similarity_ratios`, which computes the similarity between the corresponding sections of the two summaries.

In [187]:
from difflib import SequenceMatcher

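def compute_similarity_ratios(sequences1, sequences2):
    """
    Returns a dictionary mapping each theme to the SequenceMatcher ratio
    of the corresponding texts. Both dictionaries must share the same keys.
    """
    ratios = {}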
    for theme in sequences1:
        ratio = SequenceMatcher(
            None,
            sequences1[theme],
            sequences2[theme]
        ).ratio()
        ratios[theme] = ratio
    return ratios


Computing and displaying the similarity ratios between `my_sections` and `llm_sections`:

In [188]:
my_llm_similarity_ratios = compute_similarity_ratios(my_sections, llm_sections)
for theme in my_llm_similarity_ratios:
    print(f'{theme:12}: {my_llm_similarity_ratios[theme]:.2f}')

Introduction: 0.05
Methodology : 0.13
Findings    : 0.10
Conclusion  : 0.05


The `Introduction` and `Conclusion` sections reach a similarity ratio of only `5%`, while `Methodology` reaches `13%` and `Findings` reaches `10%`.

These ratios are low because `SequenceMatcher` matches literal character sequences. It suits detailed comparisons such as diffing program code, whereas two summaries that express the same ideas in different words share few long identical runs.
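
To make the character-level behaviour concrete, the sketch below recomputes the ratio by hand as `2 * M / T`, where `M` is the number of matched characters and `T` is the combined length of both strings. The two sentences are made-up paraphrases, not text from the summaries:

```python
from difflib import SequenceMatcher

# Two hypothetical paraphrases of the same idea
a = "Domain models help find missing requirements."
b = "Omitted requirements can be detected with a domain model."

matcher = SequenceMatcher(None, a, b)

# M: total number of characters covered by the matching blocks
matched = sum(block.size for block in matcher.get_matching_blocks())
# ratio = 2 * M / T, with T the combined length of both strings
manual_ratio = 2 * matched / (len(a) + len(b))

print(f"ratio(): {matcher.ratio():.2f}, manual: {manual_ratio:.2f}")
```

Both values agree, which shows that the score depends only on literally matching characters and not on meaning.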

## TF-IDF and Cosine Similarity

We convert each section into a vector representation using TF-IDF (via scikit-learn) and then calculate the cosine similarity between the two vectors.  
This method measures the overlap in vocabulary rather than in exact character sequences:

In [170]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def compute_cosine_similarities(sections1, sections2):
    cos_similarities = {}
    for theme in sections1:
        vectorizer = TfidfVectorizer()
        texts = [sections1[theme], sections2[theme]]
        tfidf_matrix = vectorizer.fit_transform(texts)
        # Use the imported cosine_similarity function from scikit-learn
        sim = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])
        cos_similarities[theme] = sim[0][0]
    return cos_similarities


In [176]:
my_llm_cosine_similarities = compute_cosine_similarities(my_sections, llm_sections)
for theme in my_llm_cosine_similarities:
    print(f'{theme:12}: {my_llm_cosine_similarities[theme]:.2f}')

Introduction: 0.65
Methodology : 0.59
Findings    : 0.65
Conclusion  : 0.47


The TF-IDF/cosine similarity results show a more pronounced overall content similarity between the two summaries than the `difflib` ratios.
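
For intuition about what the cosine score measures, here is a minimal sketch that computes cosine similarity on raw term counts. It deliberately leaves out the IDF weighting and the tokenization rules of `TfidfVectorizer`, so its values will not match the scikit-learn results; `cosine_from_counts` is a hypothetical helper name:

```python
import math
from collections import Counter

def cosine_from_counts(text_a, text_b):
    """Cosine similarity of two texts based on lowercase word counts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Dot product over the words that occur in both texts
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Word order does not matter, only the shared vocabulary
print(cosine_from_counts("domain models find omissions", "omissions find domain models"))
print(cosine_from_counts("domain models", "user studies"))
```

The first pair uses the same words in a different order and scores close to 1, while the second pair shares no words and scores 0.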

## Embedding-Based Comparison

In [183]:
from sentence_transformers import SentenceTransformer, util

def compute_semantic_similarities(sections1, sections2):
    sem_similarities = {}
    # Load the model once instead of once per theme
    model = SentenceTransformer('all-MiniLM-L6-v2')
    for theme in sections1:
        # Encode the function arguments rather than the global dictionaries
        embeddings = model.encode([sections1[theme], sections2[theme]], convert_to_tensor=True)
        # pytorch_cos_sim returns a 1x1 tensor
        sem_similarities[theme] = util.pytorch_cos_sim(embeddings[0], embeddings[1])
    return sem_similarities



In [184]:
my_llm_semantic_similarities = compute_semantic_similarities(my_sections, llm_sections)
for theme in my_llm_semantic_similarities:
    print(f'{theme:12}: {my_llm_semantic_similarities[theme][0][0]:.2f}')

Introduction: 0.86
Methodology : 0.88
Findings    : 0.92
Conclusion  : 0.88


The embedding-based approach shows that the semantic differences between the summaries are relatively small.

## Keyword Extraction Using RAKE

In [285]:
import pandas as pd
from rake_nltk import Rake
import nltk
# nltk.download('stopwords')
# nltk.download('punkt_tab')

def extract_keywords(text):
    # Initialize RAKE (make sure you've downloaded the NLTK stopwords if not already done)
    rake = Rake()
    rake.extract_keywords_from_text(text)
    return set(rake.get_ranked_phrases())


Extracting the keywords for the generic themes and feeding the number of keywords into
a pandas DataFrame.

In [296]:
# Defining themes that match the keys in my_sections and llm_sections dictionaries
themes = ['Introduction', 'Methodology', 'Findings', 'Conclusion']

# Prepare a dictionary to hold the counts for each theme
data = {}

for theme in themes:
    # Get text for the theme; if a theme isn't found, default to an empty string.
    my_text = my_sections.get(theme, "")
    llm_text = llm_sections.get(theme, "")
    
    # Extract keywords from both texts
    my_keywords = extract_keywords(my_text)
    llm_keywords = extract_keywords(llm_text)
    
    # Calculate common and unique keyword sets
    common_keywords = my_keywords.intersection(llm_keywords)
    unique_to_my = my_keywords - llm_keywords
    unique_to_llm = llm_keywords - my_keywords
    
    # Calculate total keywords for each summary
    total_my = len(common_keywords) + len(unique_to_my)
    total_llm = len(common_keywords) + len(unique_to_llm)
    
    # Calculate ratios and format as an integer followed by "%"
    ratio_my = int((len(common_keywords) / total_my * 100)) if total_my > 0 else 0
    ratio_llm = int((len(common_keywords) / total_llm * 100)) if total_llm > 0 else 0
    
    # Store the counts and formatted ratios in a sub-dictionary for this theme
    data[theme] = {
        'Common Keywords': len(common_keywords),
        'Unique to MySummary': len(unique_to_my),
        'Unique to LLM_Summary': len(unique_to_llm),
        'MySummary Common Ratio [%]': f"{ratio_my} %",
        'LLM_Summary Common Ratio [%]': f"{ratio_llm} %"
    }

# Create a DataFrame where columns are themes and rows are the categories
df = pd.DataFrame(data)

# Optionally, reorder the rows
desired_order = [
    'Common Keywords', 
    'Unique to MySummary', 
    'Unique to LLM_Summary', 
    'MySummary Common Ratio [%]', 
    'LLM_Summary Common Ratio [%]'
]
df = df.reindex(desired_order)

print(df)

                             Introduction Methodology Findings Conclusion
Common Keywords                        10           6        7          3
Unique to MySummary                    68          29       82         22
Unique to LLM_Summary                  56          42       48         16
MySummary Common Ratio [%]           12 %        17 %      7 %       12 %
LLM_Summary Common Ratio [%]         15 %        12 %     12 %       15 %
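
The ratio rows follow directly from the counts: each ratio is the number of common keywords divided by the total number of keywords of that summary, truncated to an integer percentage. A small check with the counts copied from the table above (the `counts` dictionary is only an illustration, and integer division stands in for the notebook's `int(...)` truncation):

```python
# (common, unique to MySummary, unique to LLM_Summary), copied from the table
counts = {
    'Introduction': (10, 68, 56),
    'Methodology': (6, 29, 42),
    'Findings': (7, 82, 48),
    'Conclusion': (3, 22, 16),
}

ratios = {}
for theme, (common, only_my, only_llm) in counts.items():
    # Share of each summary's keywords that also appear in the other summary
    ratio_my = common * 100 // (common + only_my)
    ratio_llm = common * 100 // (common + only_llm)
    ratios[theme] = (ratio_my, ratio_llm)
    print(f"{theme:12}: {ratio_my} % / {ratio_llm} %")
```

For `Introduction`, for example, 10 of the 78 keywords in MySummary are shared, which truncates to 12 %.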


# Optional Challenge

## Reading the Original Study into a Markdown File

In [203]:
# downloading the original study pdf
import wget
url = 'https://link.springer.com/content/pdf/10.1007/s10664-019-09693-x.pdf'
pdf_filename = wget.download(url)


100% [..........................................................................] 3582534 / 3582534

We are going to convert the original PDF file into Markdown with Mistral AI.  
This requires registering on the [Mistral](https://console.mistral.ai/home) homepage, where a free plan is available, and generating an API key.

The API key can be stored in a file called `.env` in the current directory, in the format:
```
MISTRAL_API_KEY=Ta7...
```
We should add this file to `.gitignore` so that it is not uploaded to a public repository. The file can store further API keys on separate lines.
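
Since `os.getenv` returns `None` when a variable is missing, a small guard can make a misconfigured `.env` file fail early. A minimal sketch, where `require_env` is a hypothetical helper and not part of `python-dotenv` or the Mistral client:

```python
import os

def require_env(name):
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        # Covers both a missing variable and an empty value
        raise RuntimeError(f"{name} is not set; add it to the .env file")
    return value
```

After `load_dotenv()` has run, `api_key = require_env("MISTRAL_API_KEY")` could then take the place of the plain `os.getenv` call.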



In [213]:
# Initialize Mistral client with API key
from mistralai import Mistral
import os
from dotenv import load_dotenv
load_dotenv()

# api_key = "your API key" # plain text API key can be used when notebook is not shared
api_key = os.getenv("MISTRAL_API_KEY")
client = Mistral(api_key=api_key)

In [215]:
# Import required libraries
from pathlib import Path
from mistralai import DocumentURLChunk, ImageURLChunk, TextChunk
import json


In [216]:
# Verify PDF file exists
pdf_file = Path(pdf_filename)
assert pdf_file.is_file()

In [217]:
# Upload PDF file to Mistral's OCR service
uploaded_file = client.files.upload(
    file={
        "file_name": pdf_file.stem,
        "content": pdf_file.read_bytes(),
    },
    purpose="ocr",
)

In [251]:
# Get URL for the uploaded file
signed_url = client.files.get_signed_url(file_id=uploaded_file.id, expiry=3)

# Process PDF with OCR
pdf_response = client.ocr.process(
    document=DocumentURLChunk(document_url=signed_url.url),
    model="mistral-ocr-latest",
    include_image_base64=True
)

# Convert response to JSON format
response_dict = json.loads(pdf_response.model_dump_json())

In [226]:
# checking the json object structure
response_dict.keys()

dict_keys(['pages', 'model', 'usage_info'])

In [227]:
# checking the length of the scanned pdf
len(response_dict['pages'])

31

In [281]:
# Displaying a sample page (index 17) from the scanned pdf
from IPython.display import display, Markdown
display(Markdown(response_dict['pages'][17]['markdown']))

applied. Since the loop of L. 5 has a random component, it has to be run multiple times to account for random variation (L. 4). The scatter plot resulting from running the algorithm of Fig. 5 is the basis for answering RQ1 in Section 4.

# 4 Results and Discussion 

In this section, we (i) discuss the results of our case studies, (ii) answer the research questions posed in Section 3.1, and (iii) argue about the representativeness of the requirements samples used in our case studies.

Table 1 provides key statistics about the outcomes of our data collection as per the procedures discussed in Sections 3.3.1, 3.3.2 and 3.3.3. For each case study, the table provides the following information: (1) the number of omittable elements of different types: as explained in Section 3.3.2, an omittable element can be an entire requirement or a certain segment of a requirement, namely a condition, constraint, or object; (2) the number and type of interdependencies between the omittable constraints and objects (the types were discussed in Section 3.3.2); (3) interrater agreement, computed as Cohen's $\kappa$ (1960), for the identification and classification of omittable segments by two coders. The $\kappa$ scores indicate strong, almost perfect or perfect agreement in all case studies; (4) the number of domain model elements (of which the number of elements tacit in the requirements is shown in brackets); and (5) the number of trace links from the requirements to the domain model.

## Algorithm Simulation

Input: - A domain model

- Set of requirements, their omittable segments, and the interdependencies between these segments
- Set $\mathcal{L}$ of trace links from the requirements to the domain model
- Omission type $T \in\{\mathrm{Req}$, Cond, Cons, Obj $\}$ to simulate
- Number $n$ of simulation runs

Output: - Scatter plot Plot showing the percentage of unsupported domain model elements versus the number of omissions

1. Let $\mathcal{K}$ be the set of domain model elements not supported by $\mathcal{L}$
2. Let $\mathcal{O}$ be the set of all omittable segments of type $T$
3. Let Plot be initially empty
4. for $i=1$ to $n$ :
5. for $j=1$ to $|\mathcal{O}|$ :
6. Randomly pick $j$ elements, $\mathcal{E}=\left\{e_{1}, \cdots, e_{j}\right\}$, from $\mathcal{O}$
7. if $(T=$ Cons or $T=$ Obj $)$
8. Minimally modify $\mathcal{E}$ to resolve any interdependency violations
9. Let $\mathcal{L}^{-}$ be the set of all trace links originating from $\mathcal{E}$
10. $\mathcal{L}^{\prime}=\mathcal{L} \backslash \mathcal{L}^{-}$
11. Let $\mathcal{U}$ be the set of domain model elements supported by $\mathcal{L}$ but not $\mathcal{L}^{\prime}$
12. Let $\mathcal{X}$ be the domain elements in $\mathcal{K}$ that will become unreachable from non- $\mathcal{K}$ elements if $\mathcal{U}$ is removed from the domain model
13. Place the following datapoint on Plot: $(|\mathcal{E}|,(|\mathcal{U}|+|\mathcal{X}|) / D)$, where $D$ is the total number of domain model elements
14. return Plot

Fig. 5 Simulation algorithm for computing sensitivity

In [280]:
# Concatenating the pages into a single Markdown file
original_study = ''.join(page['markdown'] for page in response_dict['pages'])

In [276]:
# Backup of the original_study markdown file
with open("original_study.md", "w", encoding="utf-8") as file:
    file.write(original_study)


## Summarization Quality Evaluation with ROUGE

In [305]:
from rouge import Rouge

rouge = Rouge()

# For the 'rouge' package, the order is candidate summary first, then reference.
scores_my = rouge.get_scores(my_summary, original_study)
scores_llm = rouge.get_scores(llm_summary, original_study)

print("My Summary vs Original Study:")
print(scores_my)

print("LLM Summary vs Original Study:")
print(scores_llm)


My Summary vs Original Study:
[{'rouge-1': {'r': 0.08398133748055987, 'p': 0.6506024096385542, 'f': 0.14876032855341545}, 'rouge-2': {'r': 0.01447051085288314, 'p': 0.19020172910662825, 'f': 0.02689486421162684}, 'rouge-l': {'r': 0.07900466562986003, 'p': 0.6120481927710844, 'f': 0.1399449015561703}}]
LLM Summary vs Original Study:
[{'rouge-1': {'r': 0.0656298600311042, 'p': 0.694078947368421, 'f': 0.11992043036238709}, 'rouge-2': {'r': 0.019403639552729664, 'p': 0.34368932038834954, 'f': 0.036733422252820475}, 'rouge-l': {'r': 0.06314152410575427, 'p': 0.6677631578947368, 'f': 0.1153736841276613}}]


In [306]:
def round_scores(score_list, decimals=2):
    rounded_list = []
    for score in score_list:
        rounded_score = {}
        for metric, values in score.items():
            rounded_score[metric] = {k: round(v, decimals) for k, v in values.items()}
        rounded_list.append(rounded_score)
    return rounded_list

# Round the scores computed in the previous cell
rounded_my = round_scores(scores_my)
rounded_llm = round_scores(scores_llm)

print("Rounded My Summary vs Original Study:")
print(rounded_my)
print("Rounded LLM Summary vs Original Study:")
print(rounded_llm)


Rounded My Summary vs Original Study:
[{'rouge-1': {'r': 0.08, 'p': 0.65, 'f': 0.15}, 'rouge-2': {'r': 0.01, 'p': 0.19, 'f': 0.03}, 'rouge-l': {'r': 0.08, 'p': 0.61, 'f': 0.14}}]
Rounded LLM Summary vs Original Study:
[{'rouge-1': {'r': 0.07, 'p': 0.69, 'f': 0.12}, 'rouge-2': {'r': 0.02, 'p': 0.34, 'f': 0.04}, 'rouge-l': {'r': 0.06, 'p': 0.67, 'f': 0.12}}]


Feeding the results into a dataframe:

In [307]:
import pandas as pd

# Build a DataFrame with rounded ROUGE scores for each metric and summary
metrics = ['rouge-1', 'rouge-2', 'rouge-l']
data = {'Metric': metrics}
for label, scores in (('My', scores_my), ('LLM', scores_llm)):
    for column, key in (('Recall', 'r'), ('Precision', 'p'), ('F1', 'f')):
        data[f'{label}_{column}'] = [round(scores[0][m][key], 2) for m in metrics]

df = pd.DataFrame(data)
print(df)


    Metric  My_Recall  My_Precision  My_F1  LLM_Recall  LLM_Precision  LLM_F1
0  rouge-1       0.08          0.65   0.15        0.07           0.69    0.12
1  rouge-2       0.01          0.19   0.03        0.02           0.34    0.04
2  rouge-l       0.08          0.61   0.14        0.06           0.67    0.12
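
The pattern of high precision and low recall is what one would expect when a short summary is scored against a much longer reference. A simplified sketch of ROUGE-1 on unique whitespace-separated tokens illustrates this; `rouge1_sketch` is a hypothetical helper, and the `rouge` package's own tokenization and counting may differ:

```python
def rouge1_sketch(candidate, reference):
    """Simplified ROUGE-1 on unique lowercase tokens (illustration only)."""
    cand = set(candidate.lower().split())
    ref = set(reference.lower().split())
    overlap = len(cand & ref)
    # Precision: share of candidate tokens that appear in the reference
    precision = overlap / len(cand) if cand else 0.0
    # Recall: share of reference tokens that appear in the candidate
    recall = overlap / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {'r': recall, 'p': precision, 'f': f1}

# Made-up texts: a short candidate fully contained in a longer reference
toy_reference = ("domain models reveal omissions in natural language "
                 "requirements and support completeness checking")
toy_candidate = "domain models reveal omissions"
toy_scores = rouge1_sketch(toy_candidate, toy_reference)
print({k: round(v, 2) for k, v in toy_scores.items()})
```

Every candidate token occurs in the reference, so precision is 1, but the candidate covers only 4 of the 12 reference tokens, so recall is about 0.33. The longer the reference relative to the summary, the lower the attainable recall.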
