# Semantic Categories

Pass a list of labeled text segments to the LLM and ask it to return, for each label, a category and a short description.
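
As a minimal sketch of the data involved, each segment carries a label of the form `index:start-stop`, so the character offsets survive the round trip through the LLM. The `Segment` type, the `labelSegments` helper, and the sample text are illustrative assumptions and not part of the notebook; the helper also assumes paragraphs are separated by exactly two newlines.

```typescript
// Hypothetical shape of one entry in the list sent to the LLM.
interface Segment {
  text: string;
  label: string; // "index:start-stop", offsets into the original text
}

// Split on blank lines and label each non-empty paragraph with its offsets.
// Simplified: assumes paragraphs are separated by exactly two newlines.
function labelSegments(text: string): Segment[] {
  const segments: Segment[] = [];
  let offset = 0;
  for (const part of text.split("\n\n")) {
    if (part !== "") {
      const stop = offset + part.length;
      segments.push({ text: part, label: `${segments.length}:${offset}-${stop}` });
    }
    offset += part.length + 2; // skip the two newline characters
  }
  return segments;
}

const sample = "# Title\n\nFirst paragraph.\n\nSecond paragraph.";
const labelled = labelSegments(sample);
console.log(JSON.stringify(labelled, null, 2));
```

Because `stop` is exclusive, `sample.substring(start, stop)` recovers each segment's text from its label alone.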

In [None]:
import { SystemMessage, HumanMessage } from "@langchain/core/messages";
import { ChatPromptTemplate, MessagesPlaceholder } from "@langchain/core/prompts";
import { JsonOutputParser } from "@langchain/core/output_parsers";
import { readFileSync } from 'node:fs';

import { EXPERIMENTS_DIR, SERVER_DATA_DIR } from '../server/src/util/fileUtils.ts';
import { getNotebookLogger } from '../server/src/Logger.ts';
import { newModel } from '../server/src/agents/agent.ts';

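// Split text into paragraphs separated by blank lines and record each
// paragraph's character offsets in the original text (stop is exclusive).
function splitText(text: string) {
  const segments = text.split('\n\n');
  const segind: {segment: string, start: number, stop: number}[] = [];
  let lastIndex = 0;
  for (const segment of segments) {
    segind.push({segment: segment, start: lastIndex, stop: lastIndex + segment.length});
    lastIndex += segment.length + 2; // +2 for the two newlines
  }
  const segindices: {segment: string, start: number, stop: number}[] = [];
  for (const {segment, start, stop} of segind) {
    if (segment !== '') {
      if (segment[0] === '\n') {
        // A run of three newlines leaves one leading newline: drop it and shift start.
        segindices.push({segment: segment.slice(1), start: start+1, stop: stop});
      } else {
        segindices.push({segment, start, stop});
      }
    }
  }
  return segindices;
}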

const prompt = ChatPromptTemplate.fromMessages([ new MessagesPlaceholder("messages") ]);
const llm = newModel("Anthropic");
const parser = new JsonOutputParser();
const chain = prompt.pipe(llm).pipe(parser);

const logger = getNotebookLogger();
const lhsText = readFileSync(`${SERVER_DATA_DIR}/AES-md/selected-text.txt`, 'utf-8');
const PROMPT = readFileSync(`${EXPERIMENTS_DIR}/annotateNodePromptCategories7.txt`, 'utf-8');

const lhsSegIndices = splitText(lhsText);
const lhsSegments = [];
for (let i = 0; i < lhsSegIndices.length; i++) {
  const {segment, start, stop} = lhsSegIndices[i];
  lhsSegments.push({text: segment, label: `${i}:${start}-${stop}`}); 
}

const userInput = JSON.stringify(lhsSegments, null, 2);
const output = await chain.invoke({ messages: [
  new SystemMessage(PROMPT),
  new HumanMessage(userInput)
]});
logger.info(output);

```json
[
  {
    "label": "0:0-29",
    "description": "Section header introducing the algorithm specifications",
    "category": "Navigation"
  },
  {
    "label": "1:31-161",
    "description": "Introduction of the general functions CIPHER() and INVCIPHER() for AES algorithms",
    "category": "Definition"
  },
  {
    "label": "2:163-361",
    "description": "Footnote explaining why neutral terminology (CIPHER/INVCIPHER) is used instead of encryption/decryption",
    "category": "Elaboration"
  },
  {
    "label": "3:363-651",
    "description": "Explanation of rounds and round keys as core components of the CIPHER and INVCIPHER algorithms",
    "category": "Definition"
  },
  {
    "label": "4:653-964",
    "description": "Definition of the KEYEXPANSION routine that generates round keys from the cipher key",
    "category": "Definition"
  },
  {
    "label": "5:966-1560",
    "description": "Explanation of the differences between AES-128, AES-192, and AES-256 in terms of key lengt

In [23]:
console.log(JSON.stringify(output, null, 2));

[
  {
    "label": "0:0-18",
    "description": "Section header for the preprocessing section",
    "category": "Navigation"
  },
  {
    "label": "1:20-205",
    "description": "Overview of the three preprocessing steps: padding the message, parsing the message, and setting the initial hash value",
    "category": "Elaboration"
  },
  {
    "label": "2:207-233",
    "description": "Subsection header for padding the message",
    "category": "Navigation"
  },
  {
    "label": "3:235-544",
    "description": "Explanation of the purpose of padding to ensure the message is a multiple of 512 or 1024 bits",
    "category": "Intent"
  },
  {
    "label": "4:546-582",
    "description": "Subsubsection header for SHA-1, SHA-224 and SHA-256 padding",
    "category": "Navigation"
  },
  {
    "label": "5:584-1133",
    "description": "Detailed algorithm for padding messages in SHA-1, SHA-224 and SHA-256, including the formula for calculating padding bits and an example",
    "category": "Algorit

In [24]:
console.log("SEGMENTS LENGTH", lhsSegments.length);
console.log("LLM OUTPUT LENGTH", output.length);

// Count how often each input label appears in the LLM output.
const counter: Record<string, number> = {};
for (const {label} of lhsSegments) {
  counter[label] = 0;
}
for (const {label} of output) {
  if (label in counter) {
    if (counter[label] !== 0) {
      console.log("ERROR: LABEL ALREADY EXISTS IN OUTPUT", label, counter[label]);
    }
    counter[label] += 1;
  } else {
    console.log("ERROR: LABEL NOT FOUND IN LHS TEXT", label);
  }
}
// Every input label must appear exactly once in the output.
for (const {label} of lhsSegments) {
  if (counter[label] !== 1) {
    console.log("ERROR: LABEL COUNT IN OUTPUT IS NOT 1", label, counter[label]);
  }
}
console.log("DONE: If no errors above, then everything is ok.");

SEGMENTS LENGTH 29
LLM OUTPUT LENGTH 29
DONE: If no errors above, then everything is ok.
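
The same check can be sketched as a reusable function that returns the problem labels instead of printing them. The names `LabelReport` and `checkLabels` and the sample labels are hypothetical, introduced only for illustration.

```typescript
// Hypothetical report type: which labels went wrong, and how.
interface LabelReport {
  missing: string[];    // sent to the LLM but never returned
  duplicated: string[]; // returned more than once
  unknown: string[];    // returned but never sent
}

// Compare the labels sent to the LLM with the labels it returned.
function checkLabels(sent: string[], returned: string[]): LabelReport {
  const counts: Record<string, number> = {};
  for (const label of sent) counts[label] = 0;
  const unknown: string[] = [];
  for (const label of returned) {
    if (Object.prototype.hasOwnProperty.call(counts, label)) {
      counts[label] += 1;
    } else {
      unknown.push(label);
    }
  }
  const missing = sent.filter((label) => counts[label] === 0);
  const duplicated = sent.filter((label) => counts[label] > 1);
  return { missing, duplicated, unknown };
}

const report = checkLabels(
  ["0:0-18", "1:20-205", "2:207-233"],
  ["0:0-18", "0:0-18", "9:999-1000"],
);
console.log(report);
```

Returning the three lists separately makes it possible to retry only the missing segments rather than rerunning the whole request.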


In [25]:
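const annotations = [];
for (const {label, description, category} of output) {
  // Labels have the form "index:start-stop"; skip any label the LLM returned malformed.
  const match = label.match(/(\d+):(\d+)-(\d+)/);
  if (match === null) {
    console.log("ERROR: MALFORMED LABEL", label);
    continue;
  }
  const start = parseInt(match[2], 10);
  const stop = parseInt(match[3], 10);
  const text = lhsText.substring(start, stop);
  annotations.push({start, stop, category, description, text});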
  console.log("LABEL", label);
  console.log("CATEGORY", category);
  console.log("DESCRIPTION", description);
  console.log("TEXT", text);
  console.log("");
}

LABEL 0:0-18
CATEGORY Navigation
DESCRIPTION Section header for the preprocessing section
TEXT # 5. PREPROCESSING

LABEL 1:20-205
CATEGORY Elaboration
DESCRIPTION Overview of the three preprocessing steps: padding the message, parsing the message, and setting the initial hash value
TEXT Preprocessing consists of three steps: padding the message, $M$ (Sec. 5.1), parsing the message into message blocks (Sec. 5.2), and setting the initial hash value, $H^{(0)}$ (Sec. 5.3).

LABEL 2:207-233
CATEGORY Navigation
DESCRIPTION Subsection header for padding the message
TEXT ## 5.1 Padding the Message

LABEL 3:235-544
CATEGORY Intent
DESCRIPTION Explanation of the purpose of padding to ensure the message is a multiple of 512 or 1024 bits
TEXT The purpose of this padding is to ensure that the padded message is a multiple of 512 or 1024 bits, depending on the algorithm. Padding can be inserted before hash computation begins on a message, or at any other time during the hash computation prior to proc
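
Since the labels come back from the LLM, they may not always match the expected `index:start-stop` pattern. A minimal sketch of a defensive parser follows; `parseLabel` and the shortened `source` string are hypothetical and only illustrate the idea.

```typescript
// Hypothetical helper: turn "index:start-stop" into numbers, or null if malformed.
function parseLabel(label: string): { index: number; start: number; stop: number } | null {
  const match = /^(\d+):(\d+)-(\d+)$/.exec(label);
  if (match === null) return null;
  const index = parseInt(match[1], 10);
  const start = parseInt(match[2], 10);
  const stop = parseInt(match[3], 10);
  // Reject ranges that run backwards.
  return start <= stop ? { index, start, stop } : null;
}

// Shortened stand-in for the source text, for illustration only.
const source = "# 5. PREPROCESSING\n\nPreprocessing consists of three steps.";
const parsed = parseLabel("0:0-18");
if (parsed !== null) {
  console.log(source.substring(parsed.start, parsed.stop)); // prints "# 5. PREPROCESSING"
}
console.log(parseLabel("not-a-label")); // prints null
```

Anchoring the pattern with `^` and `$` rejects labels that merely contain a valid-looking fragment.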

In [26]:
console.log(JSON.stringify(annotations, null, 2));

[
  {
    "start": 0,
    "stop": 18,
    "category": "Navigation",
    "description": "Section header for the preprocessing section",
    "text": "# 5. PREPROCESSING"
  },
  {
    "start": 20,
    "stop": 205,
    "category": "Elaboration",
    "description": "Overview of the three preprocessing steps: padding the message, parsing the message, and setting the initial hash value",
    "text": "Preprocessing consists of three steps: padding the message, $M$ (Sec. 5.1), parsing the message into message blocks (Sec. 5.2), and setting the initial hash value, $H^{(0)}$ (Sec. 5.3)."
  },
  {
    "start": 207,
    "stop": 233,
    "category": "Navigation",
    "description": "Subsection header for padding the message",
    "text": "## 5.1 Padding the Message"
  },
  {
    "start": 235,
    "stop": 544,
    "category": "Intent",
    "description": "Explanation of the purpose of padding to ensure the message is a multiple of 512 or 1024 bits",
    "text": "The purpose of this padding is to ensu

In [12]:
const queryInput = "Can you rewrite the following prompt so that it is clearer?";
// The reply is free-form text, so bypass the JSON parser that `chain` ends with.
const queryOutput = await prompt.pipe(llm).invoke({ messages: [
  new SystemMessage(queryInput),
  new HumanMessage(PROMPT)
]});
console.log(queryOutput.content);

I'll rewrite the prompt to make it clearer:

# Semantic Classification of Cryptographic Standards

## Your Role
You are an expert in analyzing cryptographic standards written in natural language.

## Input Format
You will receive sections from a FIPS publication describing a cryptographic algorithm. The text will be:
- In markdown format with LaTeX mathematical formulas
- Divided into consecutive labeled segments
- The complete set of segments forms what we'll call "the full text"

## Your Task
For each segment, analyze its content in context of the full text and classify it into exactly one of these categories:

1. **Preamble**: Document front matter in markdown
2. **Navigation**: Headers, outlines, or internal references
3. **Algorithm**: Formal algorithm or function description (like pseudocode)
4. **Parameter**: Formal definition of data parameters used in algorithms
5. **Definition**: Formal definition of terms or concepts
6. **Elaboration**: Text explaining algorithms, parameters