# Study Summaries Comparison

## Clustering

We organize both summaries into predefined themes.

After analyzing both summaries, we define four themes for each one, chosen so that corresponding themes cover roughly the same portion of the original study.


In [25]:
# MySummary.txt themes
my_themes = [
    'Introduction and Motivation',
    'Research Methodology',
    'Findings and Analysis',
    'Limitations and Future Work']

# LLM_Summary.txt themes
llm_themes = [
    'Introduction',
    'Methodology',
    'Key Findings',
    'Conclusion and Future Work']
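
Since the themes are meant to cover matching portions of the study, it can help to make the correspondence explicit. The sketch below pairs the two lists by position; it assumes both lists are written in the same order, and `theme_pairs` is a hypothetical name introduced only for illustration.

```python
my_themes = [
    'Introduction and Motivation',
    'Research Methodology',
    'Findings and Analysis',
    'Limitations and Future Work']

llm_themes = [
    'Introduction',
    'Methodology',
    'Key Findings',
    'Conclusion and Future Work']

# Pairing by position only makes sense if both lists have the same length.
if len(my_themes) != len(llm_themes):
    raise ValueError("Theme lists must have the same length to be paired.")

# Assumption: the i-th theme of one summary corresponds to the i-th theme
# of the other.
theme_pairs = list(zip(my_themes, llm_themes))

for mine, llm in theme_pairs:
    print(f"{mine!r} <-> {llm!r}")
```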

We define a function that splits a summary's text into sections based on the theme headings:


In [10]:
import re

def extract_sections(text, themes):
    """
    Splits text into sections based on the provided theme headings.
    Returns a dictionary with theme as key and corresponding text as value.
    """
    sections = {}
    # Build a regex that matches any theme heading at the start of a line.
    # Longer headings come first so that a heading which is a prefix of
    # another one cannot shadow the longer match.
    ordered = sorted(themes, key=len, reverse=True)
    pattern = r'(?m)^(' + '|'.join(re.escape(theme) for theme in ordered) + r')'

    # Because the pattern has a capturing group, re.split keeps the matched
    # headings in the result, alternating with the text between them.
    # The first element is any text before the first heading (if any).
    splits = re.split(pattern, text)
    current_heading = None
    for segment in splits:
        segment = segment.strip()
        if segment in themes:
            current_heading = segment
            sections[current_heading] = ""
        elif current_heading and segment:
            # Skip empty segments so sections do not collect blank lines.
            sections[current_heading] += segment + "\n"
    return sections
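
To illustrate why the loop above sees headings and body text alternately, here is a minimal sketch of how `re.split` behaves when the pattern contains a capturing group. The sample text and the two headings are made up for the example and are not taken from the summary files.

```python
import re

# Hypothetical sample text; the real summaries are read from files.
sample = (
    "Preamble\n"
    "Introduction\n"
    "Why the study matters.\n"
    "Methodology\n"
    "How it was done.\n"
)
headings = ["Introduction", "Methodology"]

# Same construction as in extract_sections: one capturing group that
# matches any heading at the start of a line.
pattern = r'(?m)^(' + '|'.join(re.escape(h) for h in headings) + r')'

parts = re.split(pattern, sample)
for part in parts:
    print(repr(part))
```

The first element is the text before the first heading, and each heading is followed by the text that belongs to it, which is why `extract_sections` only starts collecting once `current_heading` is set.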


Reading the summary files:

In [36]:
# Read the files
with open("MySummary.txt", "r", encoding="utf-8") as file:
    my_summary_text = file.read()

with open("LLM_Summary.txt", "r", encoding="utf-8") as file:
    llm_summary_text = file.read()
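
Before splitting, it may be worth checking that every theme heading actually appears at the start of a line, since a missing heading would silently produce no section. The following is a rough sketch of such a check; `missing_headings` is a hypothetical helper, and the sample text stands in for the file contents.

```python
import re

def missing_headings(text, themes):
    """Return the themes that never appear at the start of a line in text."""
    return [theme for theme in themes
            if not re.search(r'(?m)^' + re.escape(theme), text)]

# Hypothetical stand-in for the contents of a summary file.
sample_text = "Introduction\nSome context.\nKey Findings\nSome results.\n"
sample_themes = ['Introduction', 'Methodology', 'Key Findings']

missing = missing_headings(sample_text, sample_themes)
print(missing)
```

With the real data, the same call could be made with `my_summary_text` and `my_themes`, or with `llm_summary_text` and `llm_themes`.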
