Feature/ab#32008 add office extraction support by jacobwillsmith · Pull Request #2048 · bcgov/Unity

jacobwillsmith · 2026-02-27T20:14:08Z

Pull request overview

This pull request adds Office document text extraction support for AI attachment summarization.

It enables .docx, .xls, and .xlsx content to be extracted and normalized before it is sent into AI attachment analysis, improving summary quality and reducing metadata-only fallbacks when readable file content exists.

Changes:

Added Word/Excel extraction paths in TextExtractionService for .docx, .xls, and .xlsx
Reused shared normalization/cleanup helpers for Office extraction output before AI prompt usage
Added bounded extraction limits for Office parsing to keep processing constrained
Added NPOI package reference to the application project

# Conflicts:
#   applications/Unity.GrantManager/src/Unity.GrantManager.Application/AI/TextExtractionService.cs
#   applications/Unity.GrantManager/src/Unity.GrantManager.Application/Unity.GrantManager.Application.csproj

github-actions · 2026-02-27T20:17:37Z

🧪 Unit Test Results (Parallel Execution)

📊 Summary

Result	Count
✅ Passed	`451`
❌ Failed	`0`
⚠️ Skipped	`0`

📄 HTML Reports

Merged Tests (HTML): Included in artifacts
Generated automatically by CI.

github-actions · 2026-02-28T00:29:06Z

🧪 Unit Test Results (Parallel Execution)

📊 Summary

Result	Count
✅ Passed	`451`
❌ Failed	`0`
⚠️ Skipped	`0`

📄 HTML Reports

Merged Tests (HTML): Included in artifacts
Generated automatically by CI.

github-actions · 2026-03-02T18:12:05Z

🧪 Unit Test Results (Parallel Execution)

📊 Summary

Result	Count
✅ Passed	`451`
❌ Failed	`0`
⚠️ Skipped	`0`

📄 HTML Reports

Merged Tests (HTML): Included in artifacts
Generated automatically by CI.

AndreGAot

At some point it would be nice to make these limits (max rows, table rows, etc.) configurable.
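
One possible shape for that follow-up, as a minimal sketch only: read each limit from configuration and fall back to a built-in default when the setting is missing or invalid. The `ReadLimit` helper, the `TextExtraction:*` keys, and every numeric value below are hypothetical; only the limit names (`MaxDocxParagraphs`, `MaxDocxTableRows`, `MaxExcelCellsPerRow`) come from this PR, and a plain dictionary stands in for the real configuration source.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: return a positive integer setting, or the fallback
// when the key is missing, not a number, or not positive.
static int ReadLimit(IReadOnlyDictionary<string, string> settings, string key, int fallback)
{
    return settings.TryGetValue(key, out var raw)
        && int.TryParse(raw, out var parsed)
        && parsed > 0
        ? parsed
        : fallback;
}

// Stand-in for configuration values; keys and values are illustrative only.
var settings = new Dictionary<string, string>
{
    ["TextExtraction:MaxDocxTableRows"] = "500",
    ["TextExtraction:MaxExcelCellsPerRow"] = "not-a-number",
};

// The fallback values are placeholders, not the defaults used in the PR.
int maxDocxTableRows = ReadLimit(settings, "TextExtraction:MaxDocxTableRows", 200);
int maxExcelCellsPerRow = ReadLimit(settings, "TextExtraction:MaxExcelCellsPerRow", 50);
int maxDocxParagraphs = ReadLimit(settings, "TextExtraction:MaxDocxParagraphs", 1000);

Console.WriteLine($"{maxDocxTableRows} {maxExcelCellsPerRow} {maxDocxParagraphs}");
```

Falling back instead of throwing keeps extraction bounded even when a setting is mistyped, which seems to match the intent of the existing hard-coded limits.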

JamesPasta · 2026-03-04T18:42:56Z

Looks good, you can merge these. I'm just running a Copilot review.

I would suggest using StringBuilder, based on the Copilot review.

Copilot

Pull request overview

Adds Office document text extraction to the AI attachment summarization pipeline so .docx, .xls, and .xlsx attachments can contribute real content (after normalization/length limiting) instead of falling back to metadata-only summaries.

Changes:

Added NPOI dependency to support Word/Excel parsing.
Implemented .docx extraction via XWPFDocument (paragraph + table text).
Implemented .xls/.xlsx extraction via WorkbookFactory with basic bounds and normalization.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
applications/Unity.GrantManager/src/Unity.GrantManager.Application/Unity.GrantManager.Application.csproj	Adds the NPOI package reference required for Office parsing.
applications/Unity.GrantManager/src/Unity.GrantManager.Application/AI/TextExtractionService.cs	Adds Word/Excel extraction logic and routes Office files through normalization/limits before AI prompt usage.


Copilot · 2026-03-04T18:45:48Z

+                var parts = new List<string>();
+
+                foreach (var paragraphText in document.Paragraphs.Take(MaxDocxParagraphs).Select(paragraph => paragraph.ParagraphText))
+                {
+                    if (!string.IsNullOrWhiteSpace(paragraphText))
+                    {
+                        parts.Add(paragraphText);
+                    }
+                }
+
+                foreach (var table in document.Tables)
+                {
+                    foreach (var row in table.Rows.Take(MaxDocxTableRows))
+                    {
+                        foreach (var cell in row.GetTableCells().Take(MaxDocxTableCellsPerRow))
+                        {
+                            var text = cell.GetText();
+                            if (!string.IsNullOrWhiteSpace(text))
+                            {
+                                parts.Add(text);
+                            }
+                        }
+                    }
+                }
+
+                var combined = string.Join(Environment.NewLine, parts);
+                if (combined.Length > MaxExtractedTextLength)
+                {
+                    combined = combined.Substring(0, MaxExtractedTextLength);
+                }
+
+                return combined;


ExtractTextFromWordDocx builds up parts from paragraphs/tables without short-circuiting once the overall extracted text budget is reached. For large docs/tables this can do substantial unnecessary work and allocations (especially the final string.Join) even though the result is truncated to MaxExtractedTextLength. Track a running character count and break out of the paragraph/table loops when the limit is hit (and consider appending to a StringBuilder instead of collecting all parts).

Suggested change

-                var parts = new List<string>();
-                foreach (var paragraphText in document.Paragraphs.Take(MaxDocxParagraphs).Select(paragraph => paragraph.ParagraphText))
-                {
-                    if (!string.IsNullOrWhiteSpace(paragraphText))
-                    {
-                        parts.Add(paragraphText);
-                    }
-                }
-                foreach (var table in document.Tables)
-                {
-                    foreach (var row in table.Rows.Take(MaxDocxTableRows))
-                    {
-                        foreach (var cell in row.GetTableCells().Take(MaxDocxTableCellsPerRow))
-                        {
-                            var text = cell.GetText();
-                            if (!string.IsNullOrWhiteSpace(text))
-                            {
-                                parts.Add(text);
-                            }
-                        }
-                    }
-                }
-                var combined = string.Join(Environment.NewLine, parts);
-                if (combined.Length > MaxExtractedTextLength)
-                {
-                    combined = combined.Substring(0, MaxExtractedTextLength);
-                }
-                return combined;
+                var builder = new StringBuilder();
+                var limitReached = false;
+                var firstPart = true;
+                void AppendWithLimit(string text)
+                {
+                    if (limitReached || string.IsNullOrWhiteSpace(text))
+                    {
+                        return;
+                    }
+                    var remaining = MaxExtractedTextLength - builder.Length;
+                    if (remaining <= 0)
+                    {
+                        limitReached = true;
+                        return;
+                    }
+                    // Append separator between parts, counted against the limit
+                    if (!firstPart)
+                    {
+                        var separator = Environment.NewLine;
+                        if (separator.Length <= remaining)
+                        {
+                            builder.Append(separator);
+                            remaining -= separator.Length;
+                        }
+                        else
+                        {
+                            builder.Append(separator.Substring(0, remaining));
+                            limitReached = true;
+                            return;
+                        }
+                    }
+                    firstPart = false;
+                    remaining = MaxExtractedTextLength - builder.Length;
+                    if (remaining <= 0)
+                    {
+                        limitReached = true;
+                        return;
+                    }
+                    if (text.Length <= remaining)
+                    {
+                        builder.Append(text);
+                    }
+                    else
+                    {
+                        builder.Append(text.Substring(0, remaining));
+                        limitReached = true;
+                    }
+                }
+                foreach (var paragraphText in document.Paragraphs.Take(MaxDocxParagraphs).Select(paragraph => paragraph.ParagraphText))
+                {
+                    AppendWithLimit(paragraphText);
+                    if (limitReached)
+                    {
+                        break;
+                    }
+                }
+                if (!limitReached)
+                {
+                    foreach (var table in document.Tables)
+                    {
+                        foreach (var row in table.Rows.Take(MaxDocxTableRows))
+                        {
+                            foreach (var cell in row.GetTableCells().Take(MaxDocxTableCellsPerRow))
+                            {
+                                var text = cell.GetText();
+                                AppendWithLimit(text);
+                                if (limitReached)
+                                {
+                                    break;
+                                }
+                            }
+                            if (limitReached)
+                            {
+                                break;
+                            }
+                        }
+                        if (limitReached)
+                        {
+                            break;
+                        }
+                    }
+                }
+                return builder.ToString();

Copilot · 2026-03-04T18:45:49Z

+                        var cellTexts = row.Cells
+                            .Take(MaxExcelCellsPerRow)
+                            .Select(GetCellText)
+                            .Where(value => !string.IsNullOrWhiteSpace(value))
+                            .ToList();
+
+                        processedRows++;
+
+                        if (cellTexts.Count == 0)
+                        {
+                            continue;
+                        }
+
+                        var rowText = string.Join(" | ", cellTexts);


In the Excel path, a single cell/row can still be extremely large (e.g., a cell with a huge text blob). rowText is fully materialized and added before checking/truncating, which can defeat the intent of bounded extraction and cause large allocations. Consider truncating individual cell values and/or truncating rowText to the remaining budget before adding it to rows/incrementing totalLength.

Suggested change

-                        var cellTexts = row.Cells
-                            .Take(MaxExcelCellsPerRow)
-                            .Select(GetCellText)
-                            .Where(value => !string.IsNullOrWhiteSpace(value))
-                            .ToList();
-                        processedRows++;
-                        if (cellTexts.Count == 0)
-                        {
-                            continue;
-                        }
-                        var rowText = string.Join(" | ", cellTexts);
+                        // Build row text with awareness of the remaining global budget to avoid large allocations.
+                        var remainingForRow = MaxExtractedTextLength - totalLength;
+                        if (remainingForRow <= 0)
+                        {
+                            break;
+                        }
+                        var rowBuilder = new StringBuilder();
+                        var cells = row.Cells.Take(MaxExcelCellsPerRow);
+                        foreach (var cell in cells)
+                        {
+                            if (remainingForRow <= 0)
+                            {
+                                break;
+                            }
+                            var value = GetCellText(cell);
+                            if (string.IsNullOrWhiteSpace(value))
+                            {
+                                continue;
+                            }
+                            // Add separator if this is not the first value in the row.
+                            if (rowBuilder.Length > 0)
+                            {
+                                const string separator = " | ";
+                                if (separator.Length > remainingForRow)
+                                {
+                                    // No space left even for the separator; stop adding to this row.
+                                    break;
+                                }
+                                rowBuilder.Append(separator);
+                                remainingForRow -= separator.Length;
+                            }
+                            if (value.Length > remainingForRow)
+                            {
+                                value = value.Substring(0, remainingForRow);
+                            }
+                            rowBuilder.Append(value);
+                            remainingForRow -= value.Length;
+                            if (remainingForRow <= 0)
+                            {
+                                break;
+                            }
+                        }
+                        processedRows++;
+                        if (rowBuilder.Length == 0)
+                        {
+                            continue;
+                        }
+                        var rowText = rowBuilder.ToString();


Copilot · 2026-03-04T18:45:49Z

+            catch (Exception ex)
+            {
+                _logger.LogWarning(ex, "Word (.docx) text extraction failed");
+                return string.Empty;


The warning log in the .docx extraction catch block doesn't include the file name, which makes production troubleshooting harder when multiple attachments are processed. Consider passing fileName into ExtractTextFromWordDocx (or logging from the caller) so the log message includes {FileName} like the PDF/Excel paths do.

JamesPasta

Please review the Copilot suggestions.

JamesPasta · 2026-03-04T19:20:31Z

+                        var cellTexts = row.Cells
+                            .Take(MaxExcelCellsPerRow)
+                            .Select(GetCellText)
+                            .Where(value => !string.IsNullOrWhiteSpace(value))
+                            .ToList();
+
+                        processedRows++;
+
+                        if (cellTexts.Count == 0)
+                        {
+                            continue;
+                        }
+
+                        var rowText = string.Join(" | ", cellTexts);


NPOI loads entire workbooks into memory. If you process large Excel files, with multiple concurrent uploads, in a web API, you may hit memory pressure.
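
One way this concern could be mitigated, sketched under assumptions rather than taken from the PR: skip files above a size threshold before parsing, and cap how many workbooks are parsed at once with a `SemaphoreSlim`. The names `ExtractBoundedAsync`, `MaxWorkbookBytes`, and `parseGate`, and both limit values, are hypothetical; `parseWorkbook` stands in for the NPOI-based extraction call. File size is only a rough proxy for memory use, since .xlsx content is compressed on disk.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical limits; the values are illustrative, not taken from the PR.
const long MaxWorkbookBytes = 10 * 1024 * 1024;
var parseGate = new SemaphoreSlim(2); // at most 2 workbooks parsed at once

async Task<string> ExtractBoundedAsync(long fileSizeBytes, Func<string> parseWorkbook)
{
    // Skip files that would be too expensive to load fully into memory.
    if (fileSizeBytes > MaxWorkbookBytes)
    {
        return string.Empty;
    }

    // Wait for a free slot so concurrent uploads cannot all parse at once.
    await parseGate.WaitAsync();
    try
    {
        // parseWorkbook is a stand-in for the real extraction call.
        return parseWorkbook();
    }
    finally
    {
        parseGate.Release();
    }
}

var small = await ExtractBoundedAsync(1_024, () => "A | B");
var large = await ExtractBoundedAsync(50L * 1024 * 1024, () => "never parsed");
Console.WriteLine($"small='{small}', large='{large}'");
```

Returning an empty string for oversized files mirrors the failure path shown in this PR, where extraction errors return `string.Empty`.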

github-actions · 2026-03-04T19:51:42Z

🧪 Unit Test Results (Parallel Execution)

📊 Summary

Result	Count
✅ Passed	`458`
❌ Failed	`0`
⚠️ Skipped	`0`

📄 HTML Reports

Merged Tests (HTML): Included in artifacts
Generated automatically by CI.

jacobwillsmith · 2026-03-04T20:03:02Z

Added bounded StringBuilder accumulation with early-stop limits to reduce temporary allocations during DOCX/Excel text extraction.

jacobwillsmith added 2 commits February 27, 2026 10:55

AB#32008 Add Office document text extraction support (Word/Excel)

a960047

# Conflicts:
#   applications/Unity.GrantManager/src/Unity.GrantManager.Application/AI/TextExtractionService.cs
#   applications/Unity.GrantManager/src/Unity.GrantManager.Application/Unity.GrantManager.Application.csproj

AB#32008 Resolve ICell specificity error

7ed4d3f

jacobwillsmith requested review from AndreGAot, JamesPasta, hasanpour and ligovbc February 27, 2026 20:14

AB#32008 Sonar fix simplify docx paragraph extraction loop

ce760cb

AB#32008 Simplify text extraction async flow and stale comments

5d02ba9

jacobwillsmith marked this pull request as ready for review March 2, 2026 23:46

AndreGAot approved these changes Mar 4, 2026

JamesPasta requested a review from Copilot March 4, 2026 18:40

Copilot started reviewing on behalf of JamesPasta March 4, 2026 18:41

Copilot AI reviewed Mar 4, 2026

JamesPasta approved these changes Mar 4, 2026

AB#32008 Optimize Office text extraction memory usage and limits

b16d19d

jacobwillsmith merged commit f683528 into dev Mar 4, 2026
22 checks passed

jacobwillsmith deleted the feature/AB#32008-AddOfficeExtractionSupport branch March 4, 2026 20:57

-                var parts = new List<string>();
-                foreach (var paragraphText in document.Paragraphs.Take(MaxDocxParagraphs).Select(paragraph => paragraph.ParagraphText))
-                {
-                    if (!string.IsNullOrWhiteSpace(paragraphText))
-                    {
-                        parts.Add(paragraphText);
-                    }
-                }
-                foreach (var table in document.Tables)
-                {
-                    foreach (var row in table.Rows.Take(MaxDocxTableRows))
-                    {
-                        foreach (var cell in row.GetTableCells().Take(MaxDocxTableCellsPerRow))
-                        {
-                            var text = cell.GetText();
-                            if (!string.IsNullOrWhiteSpace(text))
-                            {
-                                parts.Add(text);
-                            }
-                        }
-                    }
-                }
-                var combined = string.Join(Environment.NewLine, parts);
-                if (combined.Length > MaxExtractedTextLength)
-                {
-                    combined = combined.Substring(0, MaxExtractedTextLength);
-                }
-                return combined;
+                var builder = new StringBuilder();
+                var limitReached = false;
+                var firstPart = true;
+                void AppendWithLimit(string text)
+                {
+                    if (limitReached || string.IsNullOrWhiteSpace(text))
+                    {
+                        return;
+                    }
+                    var remaining = MaxExtractedTextLength - builder.Length;
+                    if (remaining <= 0)
+                    {
+                        limitReached = true;
+                        return;
+                    }
+                    // Append separator between parts, counted against the limit
+                    if (!firstPart)
+                    {
+                        var separator = Environment.NewLine;
+                        if (separator.Length <= remaining)
+                        {
+                            builder.Append(separator);
+                            remaining -= separator.Length;
+                        }
+                        else
+                        {
+                            builder.Append(separator.Substring(0, remaining));
+                            limitReached = true;
+                            return;
+                        }
+                    }
+                    firstPart = false;
+                    remaining = MaxExtractedTextLength - builder.Length;
+                    if (remaining <= 0)
+                    {
+                        limitReached = true;
+                        return;
+                    }
+                    if (text.Length <= remaining)
+                    {
+                        builder.Append(text);
+                    }
+                    else
+                    {
+                        builder.Append(text.Substring(0, remaining));
+                        limitReached = true;
+                    }
+                }
+                foreach (var paragraphText in document.Paragraphs.Take(MaxDocxParagraphs).Select(paragraph => paragraph.ParagraphText))
+                {
+                    AppendWithLimit(paragraphText);
+                    if (limitReached)
+                    {
+                        break;
+                    }
+                }
+                if (!limitReached)
+                {
+                    foreach (var table in document.Tables)
+                    {
+                        foreach (var row in table.Rows.Take(MaxDocxTableRows))
+                        {
+                            foreach (var cell in row.GetTableCells().Take(MaxDocxTableCellsPerRow))
+                            {
+                                var text = cell.GetText();
+                                AppendWithLimit(text);
+                                if (limitReached)
+                                {
+                                    break;
+                                }
+                            }
+                            if (limitReached)
+                            {
+                                break;
+                            }
+                        }
+                        if (limitReached)
+                        {
+                            break;
+                        }
+                    }
+                }
+                return builder.ToString();

-                        var cellTexts = row.Cells
-                            .Take(MaxExcelCellsPerRow)
-                            .Select(GetCellText)
-                            .Where(value => !string.IsNullOrWhiteSpace(value))
-                            .ToList();
-                        processedRows++;
-                        if (cellTexts.Count == 0)
-                        {
-                            continue;
-                        }
-                        var rowText = string.Join(" | ", cellTexts);
+                        // Build row text with awareness of the remaining global budget to avoid large allocations.
+                        var remainingForRow = MaxExtractedTextLength - totalLength;
+                        if (remainingForRow <= 0)
+                        {
+                            break;
+                        }
+                        var rowBuilder = new StringBuilder();
+                        var cells = row.Cells.Take(MaxExcelCellsPerRow);
+                        foreach (var cell in cells)
+                        {
+                            if (remainingForRow <= 0)
+                            {
+                                break;
+                            }
+                            var value = GetCellText(cell);
+                            if (string.IsNullOrWhiteSpace(value))
+                            {
+                                continue;
+                            }
+                            // Add separator if this is not the first value in the row.
+                            if (rowBuilder.Length > 0)
+                            {
+                                const string separator = " | ";
+                                if (separator.Length > remainingForRow)
+                                {
+                                    // No space left even for the separator; stop adding to this row.
+                                    break;
+                                }
+                                rowBuilder.Append(separator);
+                                remainingForRow -= separator.Length;
+                            }
+                            if (value.Length > remainingForRow)
+                            {
+                                value = value.Substring(0, remainingForRow);
+                            }
+                            rowBuilder.Append(value);
+                            remainingForRow -= value.Length;
+                            if (remainingForRow <= 0)
+                            {
+                                break;
+                            }
+                        }
+                        processedRows++;
+                        if (rowBuilder.Length == 0)
+                        {
+                            continue;
+                        }
+                        var rowText = rowBuilder.ToString();
