<a href="https://colab.research.google.com/github/Vkreations/CCSO/blob/master/2025_11_28_%5BEKatis%5D_PyNLP_L1_School_Curricula_Text_Extraction_and_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Text Extraction and Analysis of L1 School Curricula with lemmatization**

In this program, we will explore the fundamentals of text extraction and analysis using Python, with practical examples performed on Greek language texts extracted from education curricula.

The goal of this program is to provide a step-by-step guide that helps you understand how to work with textual data, even if you have no prior experience with Python.
The program calculates the following metrics step by step for each text/document, displays the results, and also saves them in an Excel file:
*   Basic Metrics: Total pages, characters, words, unique words.
*   Cosine similarity.
*   Top-100 words frequencies (before processing)
*   Summary of POS words: nouns, adjectives, adverbs, verbs.
*   Top-25 of POS words: nouns, adjectives, adverbs, verbs.
*   POS words comparison.
*   Top-25 Bigrams.

First we will extract text from PDF documents, move on to analyzing and visualizing word frequencies, and then explore techniques such as text similarity analysis. By the end of this tutorial, you'll also learn how to apply linguistic tools like part-of-speech (POS) tagging and understand stop words and their importance in basic text analysis tasks.


# step 1. Uploading and Reading PDF Files

In this section, we set up **the necessary tools to upload and read text from PDF files:**

1. Install PyMuPDF: PyMuPDF is widely used for processing PDF files.

2. Import fitz: The main PyMuPDF module, which we'll use to interact with PDF documents.

3. Upload files from our computer to Google Colab.


In [None]:
# --- Step 1. Uploading and Reading PDF Files ---
# Install necessary library for reading PDF
!pip install PyMuPDF  # Install the PyMuPDF library

# Import the required library
import fitz  # PyMuPDF
from google.colab import files # To be able to upload files from our computer

# Upload and read our PDF files
print('-'*50)
print("Please upload the three PDF files of School Curricula for Text Extraction and Analysis:")
uploaded_files = files.upload()  # We use the files.upload() function to open
                                 # a dialog where we can select and upload our PDF files.
                                 # The files.upload() function returns a dictionary-like object.
                                 # This dictionary's keys are the filenames of the uploaded files
                                 # and the values are the contents of those files.
                                 # We assign this object to the variable uploaded_files.


# step 2. Extracting Text from PDF Files

In this section, we extract and store the text content of the uploaded PDF files for further analysis.

First, we initialize two dictionaries: one for the extracted text and one for the page count of each file.

Then we extract the text of each PDF page by page and store it.

We can then preview the extracted text of each file to verify successful extraction by printing the first few characters.



In [None]:

# --- Step 2. Extracting Text from PDF Files ---
# Initialize a dictionary to store the extracted text for each PDF
# The keys will be the filenames, and the values will be the corresponding extracted text
pdf_texts = {} # this is empty, but will eventually hold the extracted texts
pdf_pages = {}

# --- TEXT EXTRACTION ---
# Extract text from each uploaded PDF
for filename in uploaded_files.keys(): # Loop through each uploaded file
    pdf_text = "" # Inside the loop, pdf_text is reset to an empty string at the beginning of each iteration.

    with fitz.open(filename) as doc: # Open the PDF file using PyMuPDF
        pdf_pages[filename] = doc.page_count # Store the number of pages for the current PDF
        for page in doc: # This inner loop iterates through each page of the currently open PDF document
            pdf_text += page.get_text() # This method is used to extract the text from each page and append it
    pdf_texts[filename] = pdf_text # After processing all the pages, store the full text of each file in the dictionary, associating it with the file name
    print(f"Text was successfully extracted from {filename}")

'''# --- TEST THE RESULTS ---
# Print a summary of the extracted text (first 500 characters for each file)
for file_name, text in pdf_texts.items(): # Loop through the dictionary to preview the extracted text
   print('='*100)
   print(f"\nExtracted Text from {file_name} (First 500 characters):")
   print(text[:500])
'''


## Exported Filename
This function checks the uploaded filenames for specific Greek keywords and assigns a corresponding filename to the exported Excel (.xlsx) file.

In [None]:
def generate_exported_excelfile(filename, all_filenames=None):
    # If we have all filenames, check for the cross-level 2023 group
    if all_filenames and len(all_filenames) >= 3:
        # Check if all files are from 2023 and represent different levels
        all_2023 = all('2023' in f for f in all_filenames)
        has_demotiko = any(('Δημοτ' in f) or ('ΔΗΜΟΤ' in f) for f in all_filenames)
        has_gymnasio = any(('Γυμν' in f) or ('ΓΥΜΝ' in f) for f in all_filenames)
        has_lykeio = any(('Λυκ' in f) or ('ΛΥΚ' in f) for f in all_filenames)

        if all_2023 and has_demotiko and has_gymnasio and has_lykeio:
            return 'ΠΣ_2023_Δημ_Γυμν_Λυκ_Text_Metrics.xlsx'

    # Original logic for other cases
    exported_filename = 'ΠΣ_'

    # Add level information
    if ('Δημοτ' in filename) or ('ΔΗΜΟΤ' in filename):
        exported_filename += 'Δημοτικο'
    elif ('Γυμν' in filename) or ('ΓΥΜΝ' in filename):
        exported_filename += 'Γυμνασιο'
    elif ('Λυκ' in filename) or ('ΛΥΚ' in filename):
        exported_filename += 'Λυκειο'
    elif ('Νηπ' in filename) or ('ΝΗΠ' in filename):
        exported_filename += 'Νηπειαγωγειο'
    else:
        exported_filename += 'Αλλο'

    exported_filename += '_Text_Metrics.xlsx'
    return exported_filename

# Call this function in the main processing code like this:
# excel_filename = generate_exported_excelfile(list(pdf_texts.keys())[0], list(pdf_texts.keys()))

# step 3. Calculating Basic Text Metrics

The following part of the code defines a function to **calculate basic metrics**: number of characters, number of words, unique words





In [None]:
# --- Step 3. Calculating & Displaying Basic Text Metrics ---
# --- CALCULATE
# Compare basic metrics between the 3 texts
# Define a function to calculate basic text metrics
def Calculate_Metrics(text): # we define a function named Calculate_Metrics
                             # that takes a single argument, text, which represents the text to analyze
    word_list = text.split()  # Break the text into a list of words using spaces as the delimiter
    num_words = len(word_list)  # Total word count
    num_chars = len(text)  # Total character count including spaces and punctuation
    unique_words = len(set(word_list))  # Unique word count
    '''
    The set() function creates a set from the word_list.
    A set is a data structure that can only contain unique elements.
    Any duplicate elements in the original list are removed when you create a set
    '''
    return num_chars, num_words, unique_words # To be used to output the metrics
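
To see how whitespace splitting shapes these counts, here is a minimal sketch on a short sample sentence. The helper name `basic_metrics` and the sample text are assumptions made for illustration; the helper simply mirrors the three counts described above.

```python
def basic_metrics(text):
    """Return (characters, words, unique words) using whitespace splitting."""
    words = text.split()
    return len(text), len(words), len(set(words))


sample = "The cat saw the cat. The end."
chars, words, unique = basic_metrics(sample)
print(f"characters={chars}, words={words}, unique words={unique}")

# Case and attached punctuation make tokens distinct:
# "The" and "the" count as two words, and so do "cat" and "cat."
print(sorted(set(sample.split())))
```

Because the split is purely on whitespace, the unique-word count is an upper estimate of the vocabulary: lowercasing and removing punctuation first would merge several of these tokens.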


## Displaying and Comparing Text Metrics

This part of the code **calculates and displays the key metrics, using the function Calculate_Metrics above, for each uploaded text**.


In [None]:
# --- DISPLAY RESULTS
# Display metrics for each text
print("\nComparing Text Metrics:")
metrics = {} # Creates an empty dictionary to store the calculated metrics for each text.
             # Each filename will serve as a key, with its metrics stored as values
for filename, text in pdf_texts.items(): # Iterates through the uploaded files and their extracted text
    num_chars, num_words, unique_words = Calculate_Metrics(text) # Calls the Calculate_Metrics function with the extracted text
    metrics[filename] = {
        "Pages": pdf_pages[filename],
        "Characters": num_chars,
        "Words": num_words,
        "Unique Words": unique_words
    } # Saves the calculated metrics for the current file in the metrics dictionary
      # Each file is represented by its name as the key and its metrics as a dictionary of values
    print(f"\nMetrics for {filename}:") # Display the file name for the metrics to be printed
    print(f"  Number of Pages: {pdf_pages[filename]}")
    print(f"  Number of Characters: {num_chars}")
    print(f"  Number of Words: {num_words}")
    print(f"  Number of Unique Words: {unique_words}")



# step 4. Comparison of the Text Metrics

This part **compares the text metrics calculated above for the uploaded files/texts**.


In [None]:
# --- Step 4. METRICS COMPARISON ---
import matplotlib.pyplot as plt
import pandas as pd
# !pip install xlsxwriter # Install the xlsxwriter library
!pip install openpyxl # Install openpyxl for appending sheets to existing Excel files

# Compare the 3 texts side by side
if len(metrics) == 3:             # Check if there are exactly 3 texts to compare
    doc_names = list(metrics.keys())  # Filenames from the metrics dictionary
                                      # (not named `files`, which is the google.colab module imported in Step 1)

    # Print a header for the side-by-side comparison
    print("\nSide-by-Side Comparison:")
    print(f"{'Metric':<20} {doc_names[0]:<30} {doc_names[1]:<25} {doc_names[2]:<25}")  # Print column headers with formatted spacing
    print("-" * 103)  # Print a separator line for better readability

    # Loop through the keys representing the metrics to display their values
    for key in ["Pages", "Characters", "Words", "Unique Words"]:
        # Print each metric name, and the corresponding values for all three texts, side by side
        print(f"{key:<20} {metrics[doc_names[0]][key]:<30} {metrics[doc_names[1]][key]:<25} {metrics[doc_names[2]][key]:<25} ")

# --- METRICS PLOTTING ---
# Create a DataFrame from the metrics dictionary for plotting
metrics_df = pd.DataFrame(metrics)

# --- METRICS STORING ---
# Export the metrics DataFrame to an Excel file
# Use openpyxl engine with mode='w' to create/overwrite the file initially
with pd.ExcelWriter(generate_exported_excelfile(list(pdf_texts.keys())[0], list(pdf_texts.keys())), engine='openpyxl', mode='w') as writer:
    metrics_df.to_excel(writer, sheet_name='Basic_Metrics', index=True)
print("\n Text Metrics also exported to", generate_exported_excelfile(list(pdf_texts.keys())[0], list(pdf_texts.keys())), " \n")

# Plotting the 2-D column chart
metrics_df.plot(kind='bar', figsize=(12, 6))
plt.title('Comparison of Text Metrics across Documents')
plt.xlabel('Document')
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right') # Rotate x-axis labels for better readability
plt.tight_layout() # Adjust layout to prevent labels from overlapping
plt.show()


'''# Plotting one chart for each metric
for metric_name in metrics_df.index: # Iterate through 'Characters', 'Words', 'Unique Words'
    plt.figure(figsize=(5, 5))
    # Select the row corresponding to the current metric and plot it as a bar chart
    metrics_df.loc[metric_name].plot(kind='bar', color=['skyblue', 'lightcoral', 'lightgreen'])
    plt.title(f'{metric_name} Across Documents')
    plt.xlabel('Document')
    plt.ylabel(metric_name)
    plt.xticks(rotation=45, ha='right') # Rotate x-axis labels for better readability
    plt.tight_layout() # Adjust layout to prevent labels from overlapping
    plt.show()'''

# step 5. Clean noise
This part cleans the extracted text by removing headers, footers, and page numbers. The text patterns are based on the extracted files.

In [None]:
# --- Step 5. Clean the noise (unwanted parts) ---

import re                           # import re module for Regular Expressions
from typing import List, Optional

# Function for cleaning the extracted text by removing headers, footers, and page numbers
def CleanText(text: str,                                     # the extracted text
               header: Optional[List[str]] = None,            # List of header patterns to remove
               footer: Optional[List[str]] = None,            # List of footer patterns to remove
               page_num: Optional[List[str]] = None) -> str:  # List of regex patterns to identify page numbers

    cleaned_text = text

    # ΔΕΠΠΣ2003--NΓ_Δημοτικου.pdf and ΠΣ2011-NΓ_Δημοτικου.pdf have no header/footer text.
    # The patterns below occur in the header/footer of ΠΣ2023-ΝεοελληνικήΓλωσσα_ΔHMOTIKOY_ΠΣ21v2.pdf
    # Customize the header pattern
    if header is None:
      header = [r'Νεοελληνική Γλώσσα Δημοτικού', r'Νεοελληνική Γλώσσα Α΄, Β΄ και Γ΄ Γυμνασίου']

    # Customize the footer pattern
    if footer is None:
      footer = [r'Επιχειρησιακό Πρόγραμμα', r'Επιχειρησιακό Πρόγραμμα ', r'επιχειρησιακό πρόγραμμα ', r'επιχειρησιακό πρόγραμμα.+',
                r'Ανάπτυξη Ανθρώπινου Δυναμικού,', r'ΑνάπτυξηΑνθρώπινου Δυναμικού,', r'ανάπτυξη ανθρώπινου δυναμικού,.+',
                r'Εκπαίδευση και Διά Βίου Μάθηση', r'εκπαίδευση και διά βίου μάθηση.+',
                r'Με τη συγχρηματοδότηση της Ελλάδας και της Ευρωπαϊκής Ένωσης', r'με τη συγχρηματοδότηση της ελλάδας και της ευρωπαϊκής ένωσης .+',
                r'ΠΡΟΓΡΑΜΜΑΤΑ', r'προγράμματα.+',
                r'ΣΠΟΥΔΩΝ', r'σπουδών',
                ]

    # Page numbers: a bare number, or a number followed by a literal '|'
    # (the pipe must be escaped, otherwise it means alternation in a regex)
    if page_num is None:
      page_num = [r"^\s*\d+\s*$", r"^\s*\d+\s*\|\s*$"]

    # Split text into lines for processing
    lines = text.split('\n')
    cleaned_lines = []

    for line in lines:
        line = line.strip()

        # Skip empty lines
        if not line:
            continue

        # Check for header patterns
        if any(re.fullmatch(pattern, line, re.IGNORECASE) for pattern in header):
            continue

        # Check for footer patterns
        if any(re.fullmatch(pattern, line, re.IGNORECASE) for pattern in footer):
            continue

        # Check for page number patterns
        if any(re.fullmatch(pattern, line) for pattern in page_num):
            continue

        # If line passed all filters, keep it
        cleaned_lines.append(line)

    # Reconstruct the text
    cleaned_text = '\n'.join(cleaned_lines)

    # Additional cleanup for remaining artifacts
    cleaned_text = re.sub(r'\n{3,}', '\n\n', cleaned_text)  # Reduce multiple newlines
    cleaned_text = re.sub(r'[^\S\n]{2,}', ' ', cleaned_text)  # Reduce multiple spaces

    return cleaned_text

'''# --- TEST RESULTS ---
# Test if the repeated text from header and footer are removed
for file_name, text in pdf_texts.items(): # Loop through the dictionary to preview the extracted text
   text = CleanText(text)
   print('='*100)
   print(f"\nExtracted Text from {file_name} (First 1500 characters):")
   print(text[:1500])
'''
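
The line-level filtering described in this step can be tried on a tiny sample. The sketch below is an illustration only: the function name `drop_page_number_lines`, the two patterns, and the sample text are assumptions, not the notebook's actual configuration. It shows that `re.fullmatch` removes a line only when the whole line matches, and that a literal pipe has to be written as `\|`.

```python
import re

# Hypothetical patterns for illustration; real ones depend on the PDFs at hand.
PAGE_NUMBER_PATTERNS = [r"\d+", r"\d+\s*\|"]


def drop_page_number_lines(text):
    """Drop lines that consist only of a page number, optionally followed by '|'."""
    kept = []
    for line in text.split("\n"):
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if any(re.fullmatch(pattern, line) for pattern in PAGE_NUMBER_PATTERNS):
            continue  # the whole line is a page number
        kept.append(line)
    return "\n".join(kept)


sample = "Chapter 1\n12\n3 lessons per week\n  7 |\n"
print(drop_page_number_lines(sample))
```

The line `3 lessons per week` survives because it only starts with a number, while `12` and `7 |` are removed.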

# step 6. Cosine Similarity
In this section, we calculate the cosine similarity between each pair of texts to measure how similar they are. First, we convert the texts into word-count vectors using CountVectorizer. Then we compute the cosine similarity between these vectors and display the results as a table. Because word counts are never negative, the values lie between 0 and 1: values close to 1 indicate that the texts use similar words in similar proportions, and values close to 0 indicate little shared vocabulary.

In [None]:
# --- Step 6. COSINE SIMILARITY
# Import necessary libraries
from sklearn.feature_extraction.text import CountVectorizer  # To convert text data into
                                                             # a matrix of token counts
from sklearn.metrics.pairwise import cosine_similarity  # To calculate the cosine similarity between vectors
import pandas as pd  # For handling data and displaying tables

# Create a vector representation of the texts
vectorizer = CountVectorizer().fit_transform(pdf_texts.values())  # Convert the texts
                                                                  # into a matrix of token counts
                                                                  # (bag of words model)

vectors = vectorizer.toarray()  # Convert the sparse matrix into a dense array for easier processing

# Calculate cosine similarity
cos_sim = cosine_similarity(vectors)  # Compute the cosine similarity
                                      # between the vectorized text representations

# Create a DataFrame for better readability
cos_sim_df = pd.DataFrame(cos_sim, index=pdf_texts.keys(), columns=pdf_texts.keys())

'''
Create a DataFrame (table) from the cosine similarity matrix,
using the filenames as both row and column labels
'''

# Explicitly call display() in Google Colab to render the DataFrame as a nicely formatted table
from IPython.display import display  # Import display from IPython to render
                                     # the DataFrame in a nice table format
display(cos_sim_df) # Display the cosine similarity matrix as a well-formatted table in Google Colab

# Provide an explanation for the reader of how to interpret the cosine similarity values
print("\nThe cosine similarity values range from 0 to 1. A value of 1 indicates that the texts use the same words in the same proportions,")
print("while a value closer to 0 means the texts are more dissimilar.")

# Save the Cosine_Similarity to an Excel file in a sheet named 'Cosine_Similarity'
# Use 'openpyxl' engine and 'a' mode to append to an existing workbook
with pd.ExcelWriter(generate_exported_excelfile(list(pdf_texts.keys())[0], list(pdf_texts.keys())), engine='openpyxl', mode='a', if_sheet_exists='replace') as writer:
    cos_sim_df.to_excel(writer, sheet_name='Cosine_Similarity', index=True)
print("\n Cosine Similarity also exported to", generate_exported_excelfile(list(pdf_texts.keys())[0], list(pdf_texts.keys())), " \n")
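
For intuition, cosine similarity is the dot product of two count vectors divided by the product of their lengths: cos(A, B) = (A · B) / (‖A‖ ‖B‖). The sketch below computes it with the standard library only. It is an illustration under stated assumptions: the helper `cosine_from_counts` is hypothetical, and it tokenizes by whitespace, whereas CountVectorizer by default lowercases the text and keeps only tokens of two or more word characters, so the numbers will not match scikit-learn exactly.

```python
import math
from collections import Counter


def cosine_from_counts(text_a, text_b):
    """Cosine similarity between two texts, using raw whitespace-token counts."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text shares nothing with any other text
    return dot / (norm_a * norm_b)


print(cosine_from_counts("a b b", "a b b"))    # identical texts: approximately 1
print(cosine_from_counts("a b", "c d"))        # no shared words: 0
print(cosine_from_counts("a b", "b a b a"))    # same proportions: approximately 1
```

The last call shows why a value near 1 does not prove that two texts are identical: only the relative word proportions matter, not word order or document length.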


# step 7. Frequencies

This code calculates and displays the most frequently occurring words in each text. It uses Python's Counter to count word occurrences and outputs the Top-100 words for each text, along with their frequencies. This provides an overview of the most common terms used in the documents.

In [None]:
# --- Step 7. Calculate words frequency ---

from collections import Counter  # Import the Counter class from the collections module
import pandas as pd           # Import pandas for DataFrame functionality
# !pip install xlsxwriter # already installed in previous section
# openpyxl should be installed in the previous step

# Define a function to calculate word frequencies
def Get_Word_Frequencies(text):
    words = text.split()  # Split the text into a list of words based on spaces
    return Counter(words)  # Count the frequency of each word in the list and return a Counter object

'''# Retrieve and display the 100 most common words along with their frequencies
word_freq = Get_Word_Frequencies(text)
for word, freq in word_freq.most_common(100):
    print(f"  {word}: {freq}")  # Print each word and its frequency, formatted for readability'''


# Initialize dictionary to store top 100 words and their frequencies for each document
top100_words = {}

# Iterate over the uploaded PDF texts
for filename, text in pdf_texts.items():
    word_freq = Get_Word_Frequencies(text)  # Get the word frequencies for the current text
    # Get the 100 most common words and their frequencies and store them in a dictionary
    top100_words[filename] = dict(word_freq.most_common(100))
    # print(f"\n--- Top 100 Words in {filename}:")
    # for x, y in top100_words[filename].items():    # Print the Top-100 values in one column, one under the other
      # print(x, ':', y)

# Create a DataFrame from the collected top words for side-by-side comparison
# Fill NaN values with 0, indicating the word was not in the top 100 for that document, and convert to int
top100_words_df = pd.DataFrame(top100_words).fillna(0).astype(int)

print("\nTop 100 words (before preprocessing) Side-by-Side:")
# Display the DataFrame. Using display() for better formatting in Colab if needed.
from IPython.display import display
display(top100_words_df)

# Save the top100_words_df to an Excel file in a sheet named 'Top-100 Pre'
# Use 'openpyxl' engine and 'a' mode to append to an existing workbook
with pd.ExcelWriter(generate_exported_excelfile(list(pdf_texts.keys())[0], list(pdf_texts.keys())), engine='openpyxl', mode='a', if_sheet_exists='replace') as writer:
    top100_words_df.to_excel(writer, sheet_name='Top-100 Pre', index=True)
print("\nTop 100 words (before preprocessing) have been exported to ", generate_exported_excelfile(list(pdf_texts.keys())[0], list(pdf_texts.keys())), " in a sheet named 'Top-100 Pre'.")
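
A small standard-library sketch can make the side-by-side table easier to read. The two mini-documents and the names `docs`, `TOP_N`, `top` and `table` are assumptions for illustration. It repeats the same idea at a tiny scale: take the top-N words per document, then line them up and fill the gaps with 0.

```python
from collections import Counter

# Hypothetical mini-documents (short Greek phrases chosen for illustration)
docs = {
    "doc_a": "το σχολείο και το βιβλίο και το μάθημα",
    "doc_b": "η γλώσσα και η γραμματική η γλώσσα",
}

TOP_N = 2
# Top-N words per document, stored as {word: frequency}
top = {name: dict(Counter(text.split()).most_common(TOP_N)) for name, text in docs.items()}

# Union of all top words, with 0 where a word is not in a document's top-N
all_words = sorted(set().union(*top.values()))
table = {word: [top[name].get(word, 0) for name in docs] for word in all_words}

for word, counts in table.items():
    print(word, counts)
```

Note that a 0 in such a table means "not among the top-N words of that document", not "absent from the document": the word `και` occurs once in `doc_b` but still shows 0 there. The raw counts are also dominated by function words, which is what the stop word removal in the next step addresses.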

# step 8. Frequencies after Processing with Tokenization, POS tagging and Lemmatization
The code in the following parts calculates and displays the most frequently occurring words in each text after processing. First, the text is tokenized, POS-tagged and lemmatized, and the Top-25 nouns, adjectives, adverbs and verbs are counted. Then the occurrences across all four POS categories are counted, and the Top-100 words for each text are output along with their frequencies. This provides an overview of the most common content words used in the documents.

In [None]:
# Preprocessing - Lowercasing and Stop Words Removal ---

# Install NLTK for stop words
!pip install nltk

import nltk
from nltk.corpus import stopwords
from collections import Counter

# Download the stop words list from NLTK
nltk.download('stopwords')
stop_words = set(stopwords.words('greek'))

# Add custom stop words
# Note: "\u00b5" is the micro sign (U+00B5), distinct from Greek mu (U+03BC);
# "\uf0b7", "\uf0fc" and "\uf02d" are private-use code points that appear in the extracted text
custom_stop_words = {
    "-", "από", "της", "τη", "την", "ότι", "είναι", "πως", "το", "τα", "σε", "ως",
    "τους", "του", "των", "τις", "είναι", "\u00b5ε", "ή", "•", "/", "−", "\uf0b7",
    "(π.χ.", "β΄", "γ΄", "γυμνασίου", "σ.", " \uf0fc:", "τ.", "π.χ.", "β.μ.", "μια",
    "ένα", "αλλά", "ότι", "ενός", "τους", "κάθε", "του/της", "μέσω", "α.", "τους.",
    "τους/τις", "α΄,", "α΄", "\uf02d", "\uf0fc ", "\uf0fc", "/", "|", "στα", "μιας",
    "στις", "–", "να:", "\u00b5ια", "&", "1", "1.", "2", "3.", "β.μ.", "κτλ.",
    "\u00b5ια", "α-β", "λ.χ.",
}  # Add stop words here
stop_words.update(custom_stop_words)  # Update the stop words set with the custom words


# Function to preprocess text
def preprocess_text(text):
    # Lowercase the text
    text = text.lower()
    # Split text into words and remove stop words
    words = text.split()
    filtered_words = [word for word in words if word not in stop_words] # Iterate through each word in the words list.
                                                                        # In each iteration, the current word is assigned to the variable word.
    return filtered_words
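
Because the text is split on whitespace only, punctuation stays attached to words, which is why the custom list above also contains entries such as `"τους."`. The sketch below illustrates the effect and one possible alternative, stripping punctuation from each token before the lookup. The mini stop list, the helper `filter_tokens` and its `strip_punct` parameter are assumptions for illustration, not part of the notebook.

```python
import string

# Hypothetical mini stop list; the notebook builds its own from NLTK plus custom entries.
STOP_WORDS = {"το", "και", "τους"}


def filter_tokens(text, strip_punct=False):
    """Lowercase, split on whitespace, optionally strip ASCII punctuation, drop stop words."""
    tokens = text.lower().split()
    if strip_punct:
        tokens = [token.strip(string.punctuation) for token in tokens]
    return [token for token in tokens if token and token not in STOP_WORDS]


sample = "Το βιβλίο και οι μαθητές τους."
print(filter_tokens(sample))                    # "τους." slips through the filter
print(filter_tokens(sample, strip_punct=True))  # "τους" is now recognised as a stop word
```

Stripping punctuation keeps the stop list shorter, at the cost of also altering abbreviations such as `π.χ.`; which trade-off is better depends on the texts.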

## Save Results
The following function stores the results in the exported Excel file in 3 sheets, titled 'Summary_POS', 'Top-25_POS' and 'Top-100_POS'.

In [None]:
# Stores the frequency data to Excel with the specified organization
# Appends new sheets to existing Excel file - Colab compatible version
import os
import openpyxl
from openpyxl import Workbook

def Store_Data_toExcel(top25_data_by_doc, top100_data_by_doc, summary_data_by_doc, excelfile):
    try:
        # Check if file exists in Colab environment
        if os.path.exists(excelfile):
            # Load existing workbook
            wb = openpyxl.load_workbook(excelfile)
            print(f"✓ Appending to existing Excel file: '{excelfile}'")
        else:
            # Create new workbook if it doesn't exist
            wb = Workbook()
            # Remove default sheet
            wb.remove(wb['Sheet'])
            print(f"✓ Creating new Excel file: '{excelfile}'")

        # --- SHEET 1: Summary_POS (POS Summary for each document) ---
        # Remove existing sheet if it exists
        if 'Summary_POS' in wb.sheetnames:
            del wb['Summary_POS']
        ws1 = wb.create_sheet("Summary_POS")

        # Define starting columns for each document
        doc_columns = {
            0: 'A',  # Document 1
            1: 'F',  # Document 2
            2: 'K'   # Document 3
        }

        # Populate Sheet 1 - Summary_POS
        for doc_idx, (doc_filename, summary_data) in enumerate(summary_data_by_doc.items()):
            start_col = doc_columns.get(doc_idx, 'A')

            # Document title
            ws1[f'{start_col}1'] = doc_filename

            # Headers for summary
            ws1[f'{start_col}2'] = 'POS_Category'
            ws1[f'{chr(ord(start_col)+1)}2'] = 'Unique_Words'
            ws1[f'{chr(ord(start_col)+2)}2'] = 'Total_Words'

            current_row = 3

            # Add summary data for each POS category
            pos_categories = ['NOUNS', 'ADJECTIVES', 'ADVERBS', 'VERBS', 'TOTAL_POS']
            for pos_cat in pos_categories:
                if pos_cat in summary_data:
                    ws1[f'{start_col}{current_row}'] = pos_cat
                    ws1[f'{chr(ord(start_col)+1)}{current_row}'] = summary_data[pos_cat]['unique']
                    ws1[f'{chr(ord(start_col)+2)}{current_row}'] = summary_data[pos_cat]['total']
                    current_row += 1

        # --- SHEET 2: Top-25_POS (Top-25 by POS) ---
        # Remove existing sheet if it exists
        if 'Top-25_POS' in wb.sheetnames:
            del wb['Top-25_POS']
        ws2 = wb.create_sheet("Top-25_POS")

        # Define starting columns for each document
        doc_columns_sheet2 = {
            0: 'A',  # Document 1
            1: 'F',  # Document 2
            2: 'K'   # Document 3
        }

        # Populate Sheet 2
        for doc_idx, (doc_filename, pos_data) in enumerate(top25_data_by_doc.items()):
            start_col = doc_columns_sheet2.get(doc_idx, 'A')

            # Document title
            ws2[f'{start_col}1'] = doc_filename

            # Headers
            ws2[f'{start_col}2'] = 'POS_category'
            ws2[f'{chr(ord(start_col)+1)}2'] = 'Rank'
            ws2[f'{chr(ord(start_col)+2)}2'] = 'Word'
            ws2[f'{chr(ord(start_col)+3)}2'] = 'Frequency'

            current_row = 3

            # Add data for each POS category
            for pos_category, data in pos_data.items():
                for rank, (word, freq) in data:
                    ws2[f'{start_col}{current_row}'] = pos_category
                    ws2[f'{chr(ord(start_col)+1)}{current_row}'] = rank
                    ws2[f'{chr(ord(start_col)+2)}{current_row}'] = word
                    ws2[f'{chr(ord(start_col)+3)}{current_row}'] = freq
                    current_row += 1
                current_row += 1  # Add empty row between POS categories

        # --- SHEET 3: Top-100_POS ---
        # Remove existing sheet if it exists
        if 'Top-100_POS' in wb.sheetnames:
            del wb['Top-100_POS']
        ws3 = wb.create_sheet("Top-100_POS")

        # Define starting columns for each document in sheet 3
        doc_columns_sheet3 = {
            0: 'B',  # Document 1
            1: 'E',  # Document 2
            2: 'H'   # Document 3
        }

        # Populate Sheet 3
        for doc_idx, (doc_filename, top100_data) in enumerate(top100_data_by_doc.items()):
            start_col = doc_columns_sheet3.get(doc_idx, 'B')

            # Document title
            ws3[f'{start_col}1'] = doc_filename

            # Headers - Adjusted to match specification
            if doc_idx == 0:  # Document 1
                ws3['A2'] = 'Rank'
                ws3['B2'] = 'Word'
                ws3['C2'] = 'COUNT'
            elif doc_idx == 1:  # Document 2
                ws3['E2'] = 'Word'
                ws3['F2'] = 'COUNT'
            elif doc_idx == 2:  # Document 3
                ws3['H2'] = 'Word'
                ws3['I2'] = 'COUNT'

            current_row = 3

            # Add Top-100 data
            for rank, (word, freq) in top100_data:
                if doc_idx == 0:  # Document 1
                    ws3[f'A{current_row}'] = rank
                    ws3[f'B{current_row}'] = word
                    ws3[f'C{current_row}'] = freq
                elif doc_idx == 1:  # Document 2
                    ws3[f'E{current_row}'] = word
                    ws3[f'F{current_row}'] = freq
                elif doc_idx == 2:  # Document 3
                    ws3[f'H{current_row}'] = word
                    ws3[f'I{current_row}'] = freq

                current_row += 1

        # Save the workbook
        wb.save(excelfile)
        print(f"✓ Successfully updated Excel file: '{excelfile}'")
        print("✓ Sheet 'Summary_POS': POS summary statistics by document")
        print("✓ Sheet 'Top-25_POS': Top-25 for each POS category by document")
        print("✓ Sheet 'Top-100_POS': Top-100 words across all POS by document")

        # Offer download in Colab
        # print(f"\n📥 Download the Excel file:")
        # files.download(excelfile)

        return True

    except Exception as e:
        print(f"\n❌ Error updating Excel file: {e}")
        return False
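
The function above finds neighbouring columns with `chr(ord(start_col)+1)`, which works as long as every column is a single letter from A to Z. For layouts that might grow past column Z, a conversion from a column index to Excel letters is safer. The helper below, `column_letter`, is a hypothetical standard-library sketch of that conversion; openpyxl ships an equivalent utility, `openpyxl.utils.get_column_letter`.

```python
def column_letter(index):
    """Convert a 1-based column index to Excel letters (1 -> 'A', 27 -> 'AA')."""
    if index < 1:
        raise ValueError("column index must be >= 1")
    letters = ""
    while index > 0:
        # Excel columns are a base-26 system without a zero digit, hence the "- 1"
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


# Start columns for documents placed 5 columns apart (A, F, K, ... as in the sheets above)
start_columns = [column_letter(1 + 5 * i) for i in range(7)]
print(start_columns)

# By contrast, plain character arithmetic leaves the alphabet after 'Z'
print(repr(chr(ord("Z") + 1)))
```

With three documents the single-letter approach is sufficient; the helper only matters if more documents or wider blocks are added.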

## Calculates the Frequencies
This part calculates and displays the most frequently occurring words in each text: the Top-25 nouns, adjectives, adverbs and verbs, and the Top-100 words across all four POS categories, along with their frequencies. The calculations are made after tokenization and lemmatization of the text.

In [None]:
# --- Step 8. Frequencies after Processing ---  (Not saving POS Summary)
# Install spaCy and Greek language model
!pip install spacy
!pip install openpyxl
!pip install matplotlib
!python -m spacy download el_core_news_sm  # Greek language model

import spacy
from collections import Counter
import pandas as pd
import openpyxl
from openpyxl import Workbook
import os
from google.colab import files
import matplotlib.pyplot as plt
import numpy as np

# Load the Greek model
try:
    nlp = spacy.load("el_core_news_sm")
except OSError:
    print("Greek model not found. Downloading...")
    !python -m spacy download el_core_news_sm
    nlp = spacy.load("el_core_news_sm")

# Initialize counters
noun_counts = Counter()
adj_counts = Counter()
adv_counts = Counter()
verb_counts = Counter()
all_pos_counts = Counter()

# Dictionaries to store data for all sheets
top25_data_by_doc = {}  # {doc_name: {pos_type: [(rank, (word, freq))]}}
top100_data_by_doc = {}  # {doc_name: [(rank, (word, freq))]}
summary_data_by_doc = {}  # {doc_name: {category: {'unique': x, 'total': y}}}
# Dictionaries to store Top-25 and Top-100 results for comparison
top25_nouns = {}
top25_adj = {}
top25_adv = {}
top25_verbs = {}
top100_allpos = {}

# Lists to store summary data for plotting
plot_summary_data = {
    'filenames': [],
    'nouns_total': [],
    'adjectives_total': [],
    'adverbs_total': [],
    'verbs_total': [],
    'total_pos': []
}

# Process each text
doc_counter = 0
for filename, text in pdf_texts.items():
    print(f"\n{'='*60}")
    print(f"Analyzing: {filename}")
    print(f"{'-'*60}")

    # Add error handling for text processing
    try:
        text = CleanText(text)  # Clean the text from noise of header and footer
        filtered_words = preprocess_text(text)
        processed_text = " ".join(filtered_words)

        # spaCy's default limit is 1,000,000 characters per text, so longer texts are truncated
        if len(processed_text) > 1000000:
            print("⚠ Large text detected, analyzing only the first 1,000,000 characters...")
            doc = nlp(processed_text[:1000000])  # Analyze only the first 1,000,000 characters
        else:
            doc = nlp(processed_text)

        # Reset counters for each document
        doc_noun_counts = Counter()
        doc_adj_counts = Counter()
        doc_adv_counts = Counter()
        doc_verb_counts = Counter()
        doc_all_pos_counts = Counter()

        # Iterate through tokens in the document with progress indicator
        total_tokens = len(doc)
        print(f"Processing {total_tokens} tokens...")

        for i, token in enumerate(doc):
            # Progress indicator for large documents
            if i % 10000 == 0 and i > 0:
                print(f"Processed {i}/{total_tokens} tokens...")

            lemma = token.lemma_.lower().strip()
            pos = token.pos_

            if not lemma or token.is_punct or token.is_space:
                continue

            if pos == "NOUN":
                doc_noun_counts[lemma] += 1
            elif pos == "ADJ":
                doc_adj_counts[lemma] += 1
            elif pos == "ADV":
                doc_adv_counts[lemma] += 1
            elif pos == "VERB":
                doc_verb_counts[lemma] += 1

            if pos in ["NOUN", "ADJ", "ADV", "VERB"]:
                doc_all_pos_counts[lemma] += 1

        # Print results with summary
        print(f"\n📊 SUMMARY FOR {filename}:")
        print(f"  Nouns: {len(doc_noun_counts)} unique")
        print(f"  Adjectives: {len(doc_adj_counts)} unique")
        print(f"  Adverbs: {len(doc_adv_counts)} unique")
        print(f"  Verbs: {len(doc_verb_counts)} unique")
        print(f"  Total POS words: {sum(doc_all_pos_counts.values())}")

        # Store data for plotting
        plot_summary_data['filenames'].append(filename)
        plot_summary_data['nouns_total'].append(sum(doc_noun_counts.values()))
        plot_summary_data['adjectives_total'].append(sum(doc_adj_counts.values()))
        plot_summary_data['adverbs_total'].append(sum(doc_adv_counts.values()))
        plot_summary_data['verbs_total'].append(sum(doc_verb_counts.values()))
        plot_summary_data['total_pos'].append(sum(doc_all_pos_counts.values()))

        # Store Top-25 Nouns in list
        top25_nouns[filename] = doc_noun_counts.most_common(25)
        print(f"\n--- TOP 25 NOUNS in {filename} ---")
        for word, count in doc_noun_counts.most_common(25):
            print(f"{word}: {count}")

        # Store Top-25 Adjectives in list
        top25_adj[filename] = doc_adj_counts.most_common(25)
        print(f"\n--- TOP 25 ADJECTIVES in {filename} ---")
        for word, count in doc_adj_counts.most_common(25):
            print(f"{word}: {count}")

        # Store Top-25 Adverbs in list
        top25_adv[filename] = doc_adv_counts.most_common(25)
        print(f"\n--- TOP 25 ADVERBS in {filename} ---")
        for word, count in doc_adv_counts.most_common(25):
            print(f"{word}: {count}")

        # Store Top-25 Verbs in list
        top25_verbs[filename] = doc_verb_counts.most_common(25)
        print(f"\n--- TOP 25 VERBS in {filename} ---")
        for word, count in doc_verb_counts.most_common(25):
            print(f"{word}: {count}")

        # Store Top-100 All POS in list
        top100_allpos[filename] = doc_all_pos_counts.most_common(100)
        print(f"\n--- TOP 100 All POS in {filename} ---")
        for word, count in doc_all_pos_counts.most_common(100):
            print(f"{word}: {count}")

        # Update global counters
        noun_counts.update(doc_noun_counts)
        adj_counts.update(doc_adj_counts)
        adv_counts.update(doc_adv_counts)
        verb_counts.update(doc_verb_counts)
        all_pos_counts.update(doc_all_pos_counts)

        # --- STORE DATA FOR EXCEL ---

        # Store Top-25 data organized by POS category
        top25_data_by_doc[filename] = {
            'NOUN': list(enumerate(doc_noun_counts.most_common(25), 1)),
            'ADJECTIVE': list(enumerate(doc_adj_counts.most_common(25), 1)),
            'ADVERB': list(enumerate(doc_adv_counts.most_common(25), 1)),
            'VERB': list(enumerate(doc_verb_counts.most_common(25), 1))
        }

        # Store Top-100 data (all POS combined)
        top100_data_by_doc[filename] = list(enumerate(doc_all_pos_counts.most_common(100), 1))

        # Store Summary data for the new Summary_POS sheet
        summary_data_by_doc[filename] = {
            'NOUNS': {'unique': len(doc_noun_counts), 'total': sum(doc_noun_counts.values())},
            'ADJECTIVES': {'unique': len(doc_adj_counts), 'total': sum(doc_adj_counts.values())},
            'ADVERBS': {'unique': len(doc_adv_counts), 'total': sum(doc_adv_counts.values())},
            'VERBS': {'unique': len(doc_verb_counts), 'total': sum(doc_verb_counts.values())},
            'TOTAL_POS': {'unique': len(doc_all_pos_counts), 'total': sum(doc_all_pos_counts.values())}
        }

        doc_counter += 1

    except Exception as e:
        print(f"❌ Error processing {filename}: {e}")
        continue

# --- CREATE SUMMARY PLOT ---
def create_summary_plot(plot_data):
    """
    Create a grouped bar plot showing POS summary by file
    """
    filenames = plot_data['filenames']
    n_files = len(filenames)

    # Set up the plot
    fig, ax = plt.subplots(figsize=(12, 8))

    # Define the categories and their colors
    categories = ['Nouns', 'Adjectives', 'Adverbs', 'Verbs', 'Total POS']
    colors = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd']

    # Set the width of bars and positions
    bar_width = 0.15
    x_pos = np.arange(n_files)

    # Create bars for each category
    for i, (category, color) in enumerate(zip(categories, colors)):
        if category == 'Nouns':
            values = plot_data['nouns_total']
        elif category == 'Adjectives':
            values = plot_data['adjectives_total']
        elif category == 'Adverbs':
            values = plot_data['adverbs_total']
        elif category == 'Verbs':
            values = plot_data['verbs_total']
        else:  # Total POS
            values = plot_data['total_pos']

        positions = x_pos + (i - 2) * bar_width
        bars = ax.bar(positions, values, bar_width, label=category, color=color, alpha=0.8)

        # Add value labels on bars
        for bar, value in zip(bars, values):
            height = bar.get_height()
            ax.text(bar.get_x() + bar.get_width()/2., height + height*0.01,
                   f'{value:,}', ha='center', va='bottom', fontsize=9, rotation=0)

    # Customize the plot
    ax.set_xlabel('Documents', fontsize=12, fontweight='bold')
    ax.set_ylabel('Word Count', fontsize=12, fontweight='bold')
    ax.set_title('POS Distribution Summary by Document', fontsize=14, fontweight='bold')
    ax.set_xticks(x_pos)

    # Shorten filenames for better display
    short_names = [name[:20] + '...' if len(name) > 20 else name for name in filenames]
    ax.set_xticklabels(short_names, rotation=45, ha='right')

    # Add legend
    ax.legend(bbox_to_anchor=(1.05, 1), loc='upper left')

    # Add grid for better readability
    ax.grid(True, axis='y', alpha=0.3, linestyle='--')

    # Adjust layout to prevent label cutoff
    plt.tight_layout()

    # Display the plot
    plt.show()

    return fig

# Generate and display the plot
print(f"\n{'='*60}")
print("GENERATING SUMMARY PLOT")
print(f"{'-'*60}")

if plot_summary_data['filenames']:
    summary_plot = create_summary_plot(plot_summary_data)
    print("✅ Summary plot generated successfully!")
else:
    print("❌ No data available for plotting")

# --- CALL THE FUNCTION TO STORE DATA TO EXCEL ---
if top25_data_by_doc and top100_data_by_doc and summary_data_by_doc:
    # Use the generated filename from your existing function
    excel_file = generate_exported_excelfile(list(pdf_texts.keys())[0], list(pdf_texts.keys())) if pdf_texts else 'ΠΣ_Αλλο_Metrics.xlsx'
    Store_Data_toExcel(top25_data_by_doc, top100_data_by_doc, summary_data_by_doc, excel_file)
else:
    print("❌ No data to save to Excel")

# Print overall statistics across all documents
print(f"\n{'='*60}")
print("OVERALL STATISTICS ACROSS ALL TEXTS")
print(f"{'-'*60}")

print(f"\n--- OVERALL TOP 100 WORDS (All POS) ---")
for word, count in all_pos_counts.most_common(100):
    print(f"{word}: {count}")

print(f"\n--- OVERALL TOP 25 NOUNS ---")
for word, count in noun_counts.most_common(25):
    print(f"{word}: {count}")

print(f"\n--- OVERALL TOP 25 ADJECTIVES ---")
for word, count in adj_counts.most_common(25):
    print(f"{word}: {count}")

print(f"\n--- OVERALL TOP 25 ADVERBS ---")
for word, count in adv_counts.most_common(25):
    print(f"{word}: {count}")

print(f"\n--- OVERALL TOP 25 VERBS ---")
for word, count in verb_counts.most_common(25):
    print(f"{word}: {count}")

# Print POS distribution summary
print(f"\n--- POS DISTRIBUTION SUMMARY ---")
print(f"Total Nouns: {sum(noun_counts.values())} (Unique: {len(noun_counts)})")
print(f"Total Adjectives: {sum(adj_counts.values())} (Unique: {len(adj_counts)})")
print(f"Total Adverbs: {sum(adv_counts.values())} (Unique: {len(adv_counts)})")
print(f"Total Verbs: {sum(verb_counts.values())} (Unique: {len(verb_counts)})")
print(f"Total Words Counted: {sum(all_pos_counts.values())} (Unique: {len(all_pos_counts)})")

# step 9. Word Frequencies Comparisons
The following code compares the Top-25 lists of each POS category and the Top-100 list across the first three documents, and calculates the percentage of words that are new in the 3rd document. It also saves the results to a new "POS_Comparisons" sheet in the existing Excel file. The function below implements these comparisons:
* Top-25 Nouns, Adjectives, Adverbs, Verbs comparisons
* All POS Top-100 comparisons
* Overlap analysis between documents
* New word identification in 3rd document
* Percentage calculations for novelty

In [None]:
import openpyxl
from openpyxl import Workbook
import os

# Compare POS lists across documents
def compare_pos_lists(top25_nouns, top25_adj, top25_adv, top25_verbs, top100_allpos, excel_file):

    # Get filenames
    filenames = list(top25_nouns.keys())
    if len(filenames) < 3:
        print("❌ Need at least 3 documents for comparison")
        return

    filename1, filename2, filename3 = filenames[0], filenames[1], filenames[2]

    # Extract just the words from the (word, frequency) tuples
    nouns1 = set([word for word, freq in top25_nouns[filename1]])
    nouns2 = set([word for word, freq in top25_nouns[filename2]])
    nouns3 = set([word for word, freq in top25_nouns[filename3]])

    adj1 = set([word for word, freq in top25_adj[filename1]])
    adj2 = set([word for word, freq in top25_adj[filename2]])
    adj3 = set([word for word, freq in top25_adj[filename3]])

    adv1 = set([word for word, freq in top25_adv[filename1]])
    adv2 = set([word for word, freq in top25_adv[filename2]])
    adv3 = set([word for word, freq in top25_adv[filename3]])

    verbs1 = set([word for word, freq in top25_verbs[filename1]])
    verbs2 = set([word for word, freq in top25_verbs[filename2]])
    verbs3 = set([word for word, freq in top25_verbs[filename3]])

    allpos1 = set([word for word, freq in top100_allpos[filename1]])
    allpos2 = set([word for word, freq in top100_allpos[filename2]])
    allpos3 = set([word for word, freq in top100_allpos[filename3]])

    # Calculate comparisons
    comparison_results = {}

    # NOUNS comparisons
    comparison_results['NOUNS'] = {
        '1a_nouns_1_in_2': len(nouns1.intersection(nouns2)),
        '1b_nouns_2_in_3': len(nouns2.intersection(nouns3)),
        '1c_new_nouns_in_3': len(nouns3 - nouns2 - nouns1),
        '1c_percentage_new_nouns': len(nouns3 - nouns2 - nouns1) / len(nouns3) * 100 if nouns3 else 0
    }

    # ADJECTIVES comparisons
    comparison_results['ADJECTIVES'] = {
        '1a_adj_1_in_2': len(adj1.intersection(adj2)),
        '1b_adj_2_in_3': len(adj2.intersection(adj3)),
        '1c_new_adj_in_3': len(adj3 - adj2 - adj1),
        '1c_percentage_new_adj': len(adj3 - adj2 - adj1) / len(adj3) * 100 if adj3 else 0
    }

    # ADVERBS comparisons
    comparison_results['ADVERBS'] = {
        '1a_adv_1_in_2': len(adv1.intersection(adv2)),
        '1b_adv_2_in_3': len(adv2.intersection(adv3)),
        '1c_new_adv_in_3': len(adv3 - adv2 - adv1),
        '1c_percentage_new_adv': len(adv3 - adv2 - adv1) / len(adv3) * 100 if adv3 else 0
    }

    # VERBS comparisons
    comparison_results['VERBS'] = {
        '1a_verbs_1_in_2': len(verbs1.intersection(verbs2)),
        '1b_verbs_2_in_3': len(verbs2.intersection(verbs3)),
        '1c_new_verbs_in_3': len(verbs3 - verbs2 - verbs1),
        '1c_percentage_new_verbs': len(verbs3 - verbs2 - verbs1) / len(verbs3) * 100 if verbs3 else 0
    }

    # TOP-100 ALL POS comparisons
    comparison_results['TOP100_ALL_POS'] = {
        '1a_allpos_1_in_2': len(allpos1.intersection(allpos2)),
        '1b_allpos_2_in_3': len(allpos2.intersection(allpos3)),
        '1c_new_allpos_in_3': len(allpos3 - allpos2 - allpos1),
        '1c_percentage_new_allpos': len(allpos3 - allpos2 - allpos1) / len(allpos3) * 100 if allpos3 else 0
    }

    # Print results by GROUP
    print("COMPARISON RESULTS")
    print(f"{'-'*80}")

    print(f"\n📊 COMPARISON BETWEEN:")
    print(f"  Document 1: {filename1}")
    print(f"  Document 2: {filename2}")
    print(f"  Document 3: {filename3}")
    print(f"{'-'*80}")

    # NOUNS Group
    print(f"\n🔤 NOUNS COMPARISON:")
    print(f"  1a. Nouns from {filename1} also in {filename2}: {comparison_results['NOUNS']['1a_nouns_1_in_2']}/25")
    print(f"  1b. Nouns from {filename2} also in {filename3}: {comparison_results['NOUNS']['1b_nouns_2_in_3']}/25")
    print(f"  1c. NEW nouns in {filename3}: {comparison_results['NOUNS']['1c_new_nouns_in_3']}/25 ({comparison_results['NOUNS']['1c_percentage_new_nouns']:.1f}%)")

    # ADJECTIVES Group
    print(f"\n🎨 ADJECTIVES COMPARISON:")
    print(f"  1a. Adjectives from {filename1} also in {filename2}: {comparison_results['ADJECTIVES']['1a_adj_1_in_2']}/25")
    print(f"  1b. Adjectives from {filename2} also in {filename3}: {comparison_results['ADJECTIVES']['1b_adj_2_in_3']}/25")
    print(f"  1c. NEW adjectives in {filename3}: {comparison_results['ADJECTIVES']['1c_new_adj_in_3']}/25 ({comparison_results['ADJECTIVES']['1c_percentage_new_adj']:.1f}%)")

    # ADVERBS Group
    print(f"\n⚡ ADVERBS COMPARISON:")
    print(f"  1a. Adverbs from {filename1} also in {filename2}: {comparison_results['ADVERBS']['1a_adv_1_in_2']}/25")
    print(f"  1b. Adverbs from {filename2} also in {filename3}: {comparison_results['ADVERBS']['1b_adv_2_in_3']}/25")
    print(f"  1c. NEW adverbs in {filename3}: {comparison_results['ADVERBS']['1c_new_adv_in_3']}/25 ({comparison_results['ADVERBS']['1c_percentage_new_adv']:.1f}%)")

    # VERBS Group
    print(f"\n🎭 VERBS COMPARISON:")
    print(f"  1a. Verbs from {filename1} also in {filename2}: {comparison_results['VERBS']['1a_verbs_1_in_2']}/25")
    print(f"  1b. Verbs from {filename2} also in {filename3}: {comparison_results['VERBS']['1b_verbs_2_in_3']}/25")
    print(f"  1c. NEW verbs in {filename3}: {comparison_results['VERBS']['1c_new_verbs_in_3']}/25 ({comparison_results['VERBS']['1c_percentage_new_verbs']:.1f}%)")

    # TOP-100 ALL POS Group
    print(f"\n📈 TOP-100 ALL POS COMPARISON:")
    print(f"  1a. Words from {filename1} also in {filename2}: {comparison_results['TOP100_ALL_POS']['1a_allpos_1_in_2']}/100")
    print(f"  1b. Words from {filename2} also in {filename3}: {comparison_results['TOP100_ALL_POS']['1b_allpos_2_in_3']}/100")
    print(f"  1c. NEW words in {filename3}: {comparison_results['TOP100_ALL_POS']['1c_new_allpos_in_3']}/100 ({comparison_results['TOP100_ALL_POS']['1c_percentage_new_allpos']:.1f}%)")

    # Save to Excel
    save_comparison_to_excel(comparison_results, filenames, excel_file)

    return comparison_results

## Save Results
The following function handles the Excel integration: it saves the comparison results to a new "POS_Comparisons" sheet in the existing Excel file, replacing that sheet if it already exists.

In [None]:
# Save comparison results to Excel file
def save_comparison_to_excel(comparison_results, filenames, excel_file):

    try:
        # Check if file exists
        if os.path.exists(excel_file):
            wb = openpyxl.load_workbook(excel_file)
            print(f"\n✓ Appending to existing Excel file: '{excel_file}'")
        else:
            wb = Workbook()
            # Remove default sheet
            wb.remove(wb['Sheet'])
            print(f"✓ Creating new Excel file: '{excel_file}'")

        # Remove existing comparison sheet if it exists
        if 'POS_Comparisons' in wb.sheetnames:
            del wb['POS_Comparisons']

        # Create new sheet for comparisons
        ws = wb.create_sheet("POS_Comparisons")

        # Add title and document info
        ws['A1'] = "POS CATEGORY COMPARISONS"
        ws['E1'] = f"Comparing: {filenames[0]}, {filenames[1]}, {filenames[2]}"

        # Headers
        headers = ['POS Category', 'AA', 'Count', 'Percentage', 'Description']
        for col, header in enumerate(headers, 1):
            ws.cell(row=4, column=col, value=header)

        current_row = 5

        # Define comparison descriptions
        descriptions = {
            '1a': f"Words from {filenames[0]} also in {filenames[1]}",
            '1b': f"Words from {filenames[1]} also in {filenames[2]}",
            '1c': f"NEW words in {filenames[2]}"
        }

        # Add data for each POS category
        for pos_category, results in comparison_results.items():
            # Add category header
            ws.cell(row=current_row, column=1, value=pos_category)
            current_row += 1

            # Add comparison results
            comparisons = [
                ('1a', results.get('1a_nouns_1_in_2', results.get('1a_adj_1_in_2', results.get('1a_adv_1_in_2', results.get('1a_verbs_1_in_2', results.get('1a_allpos_1_in_2')))))),
                ('1b', results.get('1b_nouns_2_in_3', results.get('1b_adj_2_in_3', results.get('1b_adv_2_in_3', results.get('1b_verbs_2_in_3', results.get('1b_allpos_2_in_3')))))),
                ('1c', results.get('1c_new_nouns_in_3', results.get('1c_new_adj_in_3', results.get('1c_new_adv_in_3', results.get('1c_new_verbs_in_3', results.get('1c_new_allpos_in_3'))))))
            ]

            percentages = [
                None,  # No percentage for 1a
                None,  # No percentage for 1b
                results.get('1c_percentage_new_nouns', results.get('1c_percentage_new_adj', results.get('1c_percentage_new_adv', results.get('1c_percentage_new_verbs', results.get('1c_percentage_new_allpos')))))
            ]

            for i, (comp_type, count) in enumerate(comparisons):
                ws.cell(row=current_row, column=2, value=comp_type)
                ws.cell(row=current_row, column=3, value=count)

                if percentages[i] is not None:
                    ws.cell(row=current_row, column=4, value=f"{percentages[i]:.1f}%")

                ws.cell(row=current_row, column=5, value=descriptions[comp_type])
                current_row += 1

            current_row += 1  # Add empty row between categories

        # Add summary section
        ws.cell(row=current_row, column=1, value="SUMMARY")
        current_row += 1

        summary_data = [
            ("Total new nouns", comparison_results['NOUNS']['1c_new_nouns_in_3']),
            ("Total new adjectives", comparison_results['ADJECTIVES']['1c_new_adj_in_3']),
            ("Total new adverbs", comparison_results['ADVERBS']['1c_new_adv_in_3']),
            ("Total new verbs", comparison_results['VERBS']['1c_new_verbs_in_3']),
            ("Total new words (Top-100)", comparison_results['TOP100_ALL_POS']['1c_new_allpos_in_3'])
        ]

        for label, value in summary_data:
            ws.cell(row=current_row, column=1, value=label)
            ws.cell(row=current_row, column=3, value=value)
            current_row += 1

        # Auto-adjust column widths
        for column in ws.columns:
            max_length = 0
            column_letter = column[0].column_letter
            for cell in column:
                if cell.value:
                    max_length = max(max_length, len(str(cell.value)))
            adjusted_width = (max_length + 2)
            ws.column_dimensions[column_letter].width = adjusted_width

        # Save the workbook
        wb.save(excel_file)
        print(f"✓ Comparison results saved to sheet 'POS_Comparisons' in '{excel_file}'")

    except Exception as e:
        print(f"❌ Error saving comparison results to Excel: {e}")

In [None]:
# Step 9. COMPARISON ANALYSIS of WORDS
print(f"\n{'='*80}")
print("COMPARISON ANALYSIS")

# Check if we have the required data
if len(top25_nouns) >= 3 and len(top25_adj) >= 3 and len(top25_adv) >= 3 and len(top25_verbs) >= 3 and len(top100_allpos) >= 3:
    # Use the same Excel filename as before
    excel_file = generate_exported_excelfile(list(pdf_texts.keys())[0], list(pdf_texts.keys()))

    # Perform comparisons and save to Excel
    comparison_results = compare_pos_lists(top25_nouns, top25_adj, top25_adv, top25_verbs, top100_allpos, excel_file)

    print(f"\n✅ Comparison analysis completed successfully! Results saved to Excel file: '{excel_file}'.")

else:
    print("❌ Not enough data for comparison. Need at least 3 documents with complete POS data.")

# step 10. Most common Bigrams

In this part, **we generate and analyze bigrams** (pairs of consecutive words) from the text of each PDF. Using the nltk library's ngrams function, we calculate the frequency of bigrams in each text and display the top 25 most common bigrams along with their frequency. This helps us identify recurring word combinations in the documents.
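
To make the idea concrete, here is a minimal sketch that counts bigrams with plain Python. It uses `zip` instead of `nltk.ngrams` so that it runs without extra libraries; the helper name `count_bigrams` and the sample sentence are hypothetical.

```python
from collections import Counter

def count_bigrams(text):
    """Count consecutive word pairs in whitespace-tokenized text."""
    words = text.split()
    # Pair each word with its successor: (w0, w1), (w1, w2), ...
    return Counter(zip(words, words[1:]))

sample = "the cat sat on the mat and the cat slept"
bigram_counts = count_bigrams(sample)

# Show the single most frequent bigram
for bigram, freq in bigram_counts.most_common(1):
    print(" ".join(bigram), freq)  # the cat 2
```

A text of N words yields N - 1 bigrams, so the ten-word sample produces nine pairs, of which only "the cat" occurs twice.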

In [None]:
from nltk import ngrams  # Import the 'ngrams' function from the nltk library, which helps generate n-grams
from collections import Counter  # Import Counter, which counts the occurrences of items

# Define a function to get n-grams
def get_ngrams(text, n=2):  # Default n-gram size set to 2
    # 'text.split()' splits the input text into words on whitespace
    words = text.split()
    # 'ngrams(words, n)' yields the n-grams of the word list;
    # 'Counter' counts how often each n-gram occurs
    return Counter(ngrams(words, n))

# Display top 25 bigrams (n=2) for each text
print("\nTop 25 Bigrams in Each Text:")
# Loop through each text and its corresponding file name in the pdf_texts dictionary
for filename, text in pdf_texts.items():
    # Call the get_ngrams function with n=2 (bigrams) to calculate the frequency of bigrams
    bigram_freq = get_ngrams(text, n=2)

    # Print the name of the current file
    print(f"\nTop Bigrams in {filename}:")

    # Loop through the 25 most common bigrams (word pairs) and their frequencies
    for bigram, freq in bigram_freq.most_common(25):
        # 'bigram' is a tuple of two words, 'freq' is the frequency of that bigram
        print(f"  {' '.join(bigram)}: {freq}")
        # Use 'join' to combine the words in the bigram into a single string for display


## Calculate the Bigrams in cleaned (processed) text

In [None]:
# Save top 25 bigrams for each document to Excel in a new sheet
def save_bigrams_to_excel(pdf_texts, excel_file):
    try:
        # Check if file exists
        if os.path.exists(excel_file):
            wb = openpyxl.load_workbook(excel_file)
            print(f"✓ Appending bigrams to existing Excel file: '{excel_file}'")
        else:
            wb = Workbook()
            # Remove default sheet
            wb.remove(wb['Sheet'])
            print(f"✓ Creating new Excel file for bigrams: '{excel_file}'")

        # Remove existing bigrams sheet if it exists
        if 'Top-25 Bigrams' in wb.sheetnames:
            del wb['Top-25 Bigrams']

        # Create new sheet for bigrams
        ws = wb.create_sheet("Top-25 Bigrams")

        # Populate the sheet with bigram data, three columns per document
        # (Rank, Bigram, Frequency): document 1 starts at column A, document 2 at D,
        # document 3 at G, and so on, so extra documents never overwrite earlier ones
        for doc_idx, (filename, text) in enumerate(pdf_texts.items()):
            start_col = doc_idx * 3 + 1  # 1-based column index

            # Get bigram frequencies
            bigram_freq = get_ngrams(text, n=2)

            # Document title
            ws.cell(row=1, column=start_col, value=filename)

            # Headers
            ws.cell(row=2, column=start_col, value='Rank')
            ws.cell(row=2, column=start_col + 1, value='Bigram')
            ws.cell(row=2, column=start_col + 2, value='Frequency')

            current_row = 3

            # Add top 25 bigrams
            for rank, (bigram, freq) in enumerate(bigram_freq.most_common(25), 1):
                bigram_text = ' '.join(bigram)

                ws.cell(row=current_row, column=start_col, value=rank)
                ws.cell(row=current_row, column=start_col + 1, value=bigram_text)
                ws.cell(row=current_row, column=start_col + 2, value=freq)
                current_row += 1

        # Auto-adjust column widths
        for column in ws.columns:
            max_length = 0
            column_letter = column[0].column_letter
            for cell in column:
                if cell.value:
                    max_length = max(max_length, len(str(cell.value)))
            adjusted_width = (max_length + 2)
            ws.column_dimensions[column_letter].width = adjusted_width

        # Save the workbook
        wb.save(excel_file)
        print(f"✓ Top 25 bigrams saved to sheet 'Top-25 Bigrams' in '{excel_file}'")
        return True

    except Exception as e:
        print(f"❌ Error saving bigrams to Excel: {e}")
        return False

# Define a function to get n-grams from preprocessed text
def get_ngrams(text, n=2):
    text = CleanText(text)                  # Clean the text from noise of header and footer
    words = preprocess_text(text)  # Preprocess to remove stop words and lowercase
    return Counter(ngrams(words, n))

# Display top 25 bigrams after stop words removal
print("\nTop 25 Bigrams in Each Text (Without Stop Words, HeaderFooter):")
for filename, text in pdf_texts.items():
    bigram_freq = get_ngrams(text, n=2)  # Get bigram frequencies
    print(f"\nTop Bigrams in {filename}:")
    for bigram, freq in bigram_freq.most_common(25):
        print(f"  {' '.join(bigram)}: {freq}")

# Save bigrams to Excel
print(f"\n{'='*60}")
print("SAVING BIGRAMS TO EXCEL")

# Generate the Excel filename and save bigrams
if pdf_texts:
    excel_file = generate_exported_excelfile(list(pdf_texts.keys())[0], list(pdf_texts.keys()))
    bigrams_saved = save_bigrams_to_excel(pdf_texts, excel_file)

    if bigrams_saved:
        print(f"✅ Bigrams successfully saved to: {excel_file}. Check the 'Top-25 Bigrams' sheet for bigram frequencies")
    else:
        print("❌ Failed to save bigrams to Excel")
else:
    print("❌ No text data available to extract bigrams")

# Export Excel file




In [None]:
# Download Excel file from Colab
print(f"\n📥 Download the Excel file:")
files.download(excel_file)
