# LLM-assisted Concept Mining for COBOL applications

This Jupyter Notebook shows how to mine abbreviations and discover the first concepts that make up a legacy COBOL mainframe codebase with the help of Large Language Models. It uses Python, pandas and Claude 3.5 Sonnet to generate insights from something as simple as a list of files.

## Introduction

### Software archaeology

"Archaeologists try to find the clues left by people who lived before us, and they try to make sense of them." ([Source](https://www.youtube.com/watch?v=TFejIkYDH9Q)). This work is no different for software developers maintaining older systems, who venture as software archaeologists into old codebases like adventurers into a long-forgotten city. Many tasks are the same, such as deciphering the hieroglyphs (aka cryptic abbreviations) left behind by the original creators and unlocking valuable knowledge hidden beneath layers of history. And just as traditional archaeologists piece together ancient stories, software archaeologists help us understand the stories of legacy systems that have stood the test of time, paving the way for a successful modernization journey.

### Today's subject under inspection
I tried to find a real legacy codebase to analyze, but realistic examples are hard to come by. This is where the "Mainframe CardDemo Application" ([GitHub](https://github.com/aws-samples/aws-mainframe-modernization-carddemo)) from AWS comes into play: this project is a sample application designed to showcase modernization strategies for mainframe workloads using AWS services. It features a typical banking scenario with COBOL programs, JCL scripts, and related data files. The demo is meant to provide a realistic environment for demonstrating refactoring, replatforming, and migration of legacy code to cloud-based solutions. Although the number of files in this project is tiny, I think it is a realistic starting point for exploring an unknown legacy application written in a programming language I'm not familiar with.

## Analysis
### Step 1: Clone the repository

To begin, clone the source code repository:
```bash
git clone https://github.com/aws-samples/aws-mainframe-modernization-carddemo.git
```

Next, I build a nice, clean list of relative file paths to work with.


In [65]:
import glob

root_dir = "../aws-mainframe-modernization-carddemo/app/"
glob_list = glob.glob(f"{root_dir}**/*.*", recursive=True)
file_list = [f.replace("\\","/").replace(root_dir, "") for f in glob_list]
file_list[:5]

['bms/COACTUP.bms',
 'bms/COACTVW.bms',
 'bms/COADM01.bms',
 'bms/COBIL00.bms',
 'bms/COCRDLI.bms']
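
Even before any LLM is involved, a plain file list already says something about how a codebase is organized. The following is a minimal sketch, not part of the original notebook, that counts paths by top-level directory and by file extension. The helper names `count_by_top_dir` and `count_by_extension` are my own, and the sample paths are a handful taken from the outputs in this notebook rather than the full list.

```python
from collections import Counter
from pathlib import PurePosixPath

# A few paths taken from this notebook's outputs; in the notebook itself,
# `file_list` would be used instead.
sample_files = [
    "bms/COACTUP.bms",
    "bms/COACTVW.bms",
    "cbl/COACTUPC.cbl",
    "data/ASCII/acctdata.txt",
]

def count_by_top_dir(paths):
    """Count paths by their first path component (hypothetical helper)."""
    return Counter(PurePosixPath(p).parts[0] for p in paths)

def count_by_extension(paths):
    """Count paths by lower-cased file extension (hypothetical helper)."""
    return Counter(PurePosixPath(p).suffix.lower() for p in paths)

dir_counts = count_by_top_dir(sample_files)
ext_counts = count_by_extension(sample_files)
print(dir_counts.most_common())
print(ext_counts.most_common())
```

Counts like these are a cheap baseline for cross-checking any percentages an LLM later reports about the codebase.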

### Step 2: Set up LLM assistant

I use Claude 3.5 Sonnet, a large language model from Anthropic, to expand the abbreviations. For this type of application, I always set a low temperature to get results that are as consistent as possible across runs (even a temperature of 0 does not guarantee identical output). I’ve also noticed that LLMs tend to get lazy when working on tasks I delegate because I don’t want to do them myself. Perhaps my own approach is rubbing off on the model. That’s why I make sure to remind the LLM not to be lazy in this case.

In [66]:
import anthropic
client = anthropic.Anthropic()

def ask(prompt):
    return client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=2000,
        temperature=0.0,
        system="""
        You're a software archaeologist who tries to make sense of the past.
        Respond in short and clear explanations.
        Don't be lazy!""",
        messages=[{"role": "user", "content": prompt}]
    ).content[0].text

### Step 3: Define the base prompt
The base prompt was designed to guide the LLM in extracting and expanding abbreviations. I paid attention to providing clear, step-by-step instructions to help the LLM through abbreviation extraction and concept definition. I included handling for alternative meanings and uncertainty by suggesting confidence scores and alternative entries. Additionally, I requested simple regex patterns for practical file matching, which can be useful later for finding all relevant files based on the extracted patterns. I specified a structured output as a JSON schema for consistency and easier integration with subsequent analysis. To my knowledge, Sonnet doesn’t have a built-in option for this kind of structured output, so I also included an example to clarify expectations (and to avoid breaking the following code).

In [67]:
base_prompt = """

Below is a list of paths from a software program containing numerous abbreviations.

Your task is to build a list of abbreviations and a glossary entry of their corresponding concepts.
For this, determine the meaning of an abbreviation found in the paths and filenames.
For very similar abbreviations, make a separate entry. Don't put more than one abbreviation in one entry.

Then, provide a definition of each concept and provide the information if it is a business concept or a technical concept.
With that, also estimate a confidence score between 0 and 1 indicating how certain you are about the term and its definition.

Additionally, find a simple regular expression that identifies all files related to the abbreviation (and their alternative spellings).

If there are no more abbreviations to discover, return an empty JSON list.

Output only and directly as a JSON array (not with a key), strictly following this schema:
- 'abbreviation': The abbreviation.
- 'meaning': The meaning of the abbreviation.
- 'type': business, technical
- 'definition': An explanation of the concept.
- 'regex': A regular expression to locate related files.
- 'confidence': A value between 0 and 1 indicating certainty.
- 'alternative': An alternative meaning, or "-" if none exists.

Example output:
[
    {
      "abbreviation": "ABC",
      "meaning": "Air Bullet Container",
      "type": "technical",
      "definition": "A storage unit used for air bullets in testing scenarios.",
      "regex": ".*ABC.*",
      "confidence": 0.8,
      "alternative": "-"
    }
]

Now, expand the following abbreviations to their full meanings:
"""
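
Since the schema is only described in the prompt, nothing guarantees that the model sticks to it. The following is a minimal sketch, under my own assumptions, of how each returned entry could be checked before it is used. `validate_entry`, `REQUIRED_KEYS` and `ALLOWED_TYPES` are hypothetical names and not part of the original notebook. The "good" entry is the example from the prompt, the "bad" one is made up for illustration.

```python
import re

# Field names and allowed types as listed in the base prompt.
REQUIRED_KEYS = {"abbreviation", "meaning", "type", "definition",
                 "regex", "confidence", "alternative"}
ALLOWED_TYPES = {"business", "technical"}

def validate_entry(entry):
    """Return a list of problems found in one glossary entry (empty if fine)."""
    if not isinstance(entry, dict):
        return ["entry is not a JSON object"]
    problems = []
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    if "type" in entry and entry["type"] not in ALLOWED_TYPES:
        problems.append(f"unexpected type: {entry['type']!r}")
    if "confidence" in entry:
        value = entry["confidence"]
        if not (isinstance(value, (int, float)) and 0 <= value <= 1):
            problems.append(f"confidence not between 0 and 1: {value!r}")
    if "regex" in entry:
        try:
            re.compile(entry["regex"])
        except (re.error, TypeError) as exc:
            problems.append(f"regex does not compile: {exc}")
    return problems

good = {
    "abbreviation": "ABC",
    "meaning": "Air Bullet Container",
    "type": "technical",
    "definition": "A storage unit used for air bullets in testing scenarios.",
    "regex": ".*ABC.*",
    "confidence": 0.8,
    "alternative": "-",
}
# Made-up broken entry: missing keys, unknown type, bad confidence, bad regex.
bad = {"abbreviation": "XYZ", "type": "other", "confidence": 1.5, "regex": "("}

print(validate_entry(good))
for problem in validate_entry(bad):
    print(problem)
```

Checking that the regex compiles is worth the extra lines, because a broken pattern would otherwise only surface later when files are matched.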

### Step 4: Assemble the initial prompt
Next, I combine the base prompt with the file list from above. I intentionally shuffle the filenames to avoid any ordering bias and (hopefully) keep the LLM engaged with a varied input. The shuffled list is then appended to the base prompt as a single string with one file path per line, which makes it easier for the model to process.

In [81]:
import random

random.shuffle(file_list)
prompt = base_prompt + "\n".join(file_list)
print(prompt[:100] + "...")



Below is a list of paths from a software program containing numerous abbreviations.

Your task is ...
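
One thing to keep in mind: an unseeded `random.shuffle` produces a different order on every run, which works against the goal of consistent results. A minimal sketch of a seeded alternative could look like this. `shuffled_copy` is a hypothetical helper and the seed value is arbitrary. The sample paths are the first five entries shown in Step 1.

```python
import random

def shuffled_copy(items, seed):
    """Return a shuffled copy of `items`; the same seed gives the same order."""
    rng = random.Random(seed)  # private generator, global random state untouched
    result = list(items)       # copy, so the original list keeps its order
    rng.shuffle(result)
    return result

sample_files = [
    "bms/COACTUP.bms",
    "bms/COACTVW.bms",
    "bms/COADM01.bms",
    "bms/COBIL00.bms",
    "bms/COCRDLI.bms",
]

first = shuffled_copy(sample_files, seed=42)
second = shuffled_copy(sample_files, seed=42)
print(first == second)                        # same seed, same order
print(sorted(first) == sorted(sample_files))  # nothing lost or duplicated
```

Working on a copy also keeps `file_list` in its original order for the later steps.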


### Step 5: Ask Claude via backfeeding
The easy part here is that I simply send the prompt to the LLM and hope that we get the results as a JSON data structure (_fingers always crossed_). In more detail, this process leverages an iterative feedback loop to refine the extraction of abbreviations. After each pass, it updates the prompt by including abbreviations already identified, guiding the AI to focus on new terms and avoid redundancy. The analysis continues as long as the extracted meanings maintain a high confidence level, and it stops once confidence drops below a set threshold. 

In [69]:
import json

min_confidence = 1.0

json_results = []

backfeed = ""

while min_confidence > 0.7:

    evolving_prompt = prompt + "\n\n" + backfeed

    res = ask(evolving_prompt)
    json_result = json.loads(res)

    if len(json_result) == 0:
        break

    min_confidence = min([i['confidence'] for i in json_result])
    json_results.extend(json_result)
    # build the backfeed only after adding the new batch, so that it is included as well
    backfeed = "\n\nI already discovered these abbreviations that I don't need anymore:\n" + "\n".join([i['abbreviation'] for i in json_results])
    
print(str(json_results)[:100])

[{'abbreviation': 'CBL', 'meaning': 'COBOL', 'type': 'technical', 'definition': 'Common Business-Ori
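
The loop relies on the model always answering with clean JSON and on the confidence eventually dropping. The following is a hedged sketch of the same idea with a few guards added: it tolerates text around the JSON array, skips abbreviations it has already seen, and caps the number of rounds. `parse_json_array`, `mine` and `max_rounds` are my own names and assumptions, and the canned replies stand in for real calls to `ask` so that the sketch runs without an API key.

```python
import json

def parse_json_array(text):
    """Extract a JSON array from a reply, ignoring text before and after it.

    Simplistic on purpose: it assumes the surrounding text contains no
    square brackets of its own.
    """
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in reply")
    return json.loads(text[start:end + 1])

def mine(ask_fn, prompt, threshold=0.7, max_rounds=10):
    """Collect glossary entries round by round until a stop condition is met."""
    found = {}  # abbreviation -> entry, keeps insertion order
    for _ in range(max_rounds):
        backfeed = ""
        if found:
            backfeed = ("\n\nI already discovered these abbreviations "
                        "that I don't need anymore:\n" + "\n".join(found))
        batch = parse_json_array(ask_fn(prompt + backfeed))
        new = [e for e in batch if e["abbreviation"] not in found]
        if not new:
            break  # empty list or only repeats: nothing left to learn
        for entry in new:
            found[entry["abbreviation"]] = entry
        if min(e["confidence"] for e in new) <= threshold:
            break  # same stop rule as the while condition in the notebook
    return list(found.values())

# Canned replies standing in for the LLM (hypothetical, shortened entries).
canned = iter([
    'Sure, here it is:\n[{"abbreviation": "CBL", "confidence": 0.95}]',
    '[{"abbreviation": "CBL", "confidence": 0.95},'
    ' {"abbreviation": "BMS", "confidence": 0.9}]',
    '[]',
])
results = mine(lambda prompt: next(canned), "list of paths")
print([e["abbreviation"] for e in results])
```

Note that this sketch deduplicates on the abbreviation alone, which is stricter than the `drop_duplicates` call in Step 6 and would discard a second meaning for the same abbreviation.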


### Step 6: Prepare for first analyses
For easier inspection of the result (and later analysis steps), I like to load the expanded abbreviations and their meanings in a pandas DataFrame.

In [75]:
import pandas as pd
abbreviations = pd.DataFrame.from_dict(json_results)\
    .sort_values(by='abbreviation')\
    .drop_duplicates(subset=["abbreviation", "meaning"])
abbreviations.head()

Unnamed: 0,abbreviation,meaning,type,definition,regex,confidence,alternative
19,ACCT,Account,business,A financial record or arrangement between a cu...,.*AC+T.*,0.95,-
32,ACTUP,Account Update,technical,Process or interface for updating account info...,.*ACTUP.*,0.9,-
42,ACTVW,Account View,business,Process or interface for viewing account details,.*ACTVW.*,0.9,-
20,ADM,Administration,business,System administration and management functions,.*ADM.*,0.9,-
18,ADMIN,Administrator,business,A user with administrative privileges in the s...,.*ADM.*,0.9,-


In [76]:
abbreviations

Unnamed: 0,abbreviation,meaning,type,definition,regex,confidence,alternative
19,ACCT,Account,business,A financial record or arrangement between a cu...,.*AC+T.*,0.95,-
32,ACTUP,Account Update,technical,Process or interface for updating account info...,.*ACTUP.*,0.9,-
42,ACTVW,Account View,business,Process or interface for viewing account details,.*ACTVW.*,0.9,-
20,ADM,Administration,business,System administration and management functions,.*ADM.*,0.9,-
18,ADMIN,Administrator,business,A user with administrative privileges in the s...,.*ADM.*,0.9,-
8,ASCII,American Standard Code for Information Interch...,technical,A character encoding standard used by most mod...,.*ASCII.*,0.95,-
21,BIL,Billing,business,Process of generating and managing customer bills,.*BIL.*,0.95,-
11,BMS,Basic Mapping Support,technical,CICS facility for handling screen layouts and ...,.*\.bms|\.BMS$,0.9,-
54,CATG,Category,business,Classification or grouping of items or transac...,.*CATG.*,0.9,-
0,CBL,COBOL,technical,"Common Business-Oriented Language, a programmi...",.*\.cbl|\.CBL$,0.95,-
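
A quick look at the table shows that `ADM` and `ADMIN` received the same regular expression, so the two entries cannot be told apart by their files later on. A small sketch for spotting such cases, using a few (abbreviation, regex) pairs copied from the table, might look like this. `shared_patterns` is a hypothetical helper, and plain Python is used here so the snippet runs without pandas.

```python
from collections import defaultdict

# (abbreviation, regex) pairs copied from the table above.
entries = [
    ("ACCT", ".*AC+T.*"),
    ("ADM", ".*ADM.*"),
    ("ADMIN", ".*ADM.*"),
    ("BIL", ".*BIL.*"),
]

def shared_patterns(pairs):
    """Map each regex used by more than one abbreviation to those abbreviations."""
    by_regex = defaultdict(list)
    for abbreviation, regex in pairs:
        by_regex[regex].append(abbreviation)
    return {regex: names for regex, names in by_regex.items() if len(names) > 1}

duplicates = shared_patterns(entries)
print(duplicates)  # {'.*ADM.*': ['ADM', 'ADMIN']}
```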


### Step 7: Connect files on found patterns
Next, I search for all files that match the pattern in the `regex` column, so that real files can be assigned to each abbreviation / concept. I iterate through the list of known abbreviations, lightly adjust each regular expression, and use it to filter the file list. The matched files and their share of the total file count are stored for each abbreviation and added to the DataFrame for analysis. This approach isn't entirely clean: a file can match more than one abbreviation and thus belong to several concepts, so the results overlap. But I think this is good enough for now.

In [77]:
import re

files = pd.Series(file_list)

abbreviation_files = []
proportions = []

length_all_files = len(file_list)

for _, entry in abbreviations.iterrows():
    # make plain capture groups non-capturing (leaving escaped "\(" and existing "(?...)" alone)
    # and relax start anchors, since every path begins with a directory name
    regex = re.sub(r"(?<!\\)\((?!\?)", "(?:", entry['regex']).replace("^", ".*")
    files_found = files[files.str.contains(regex)].sort_values().to_list()
    
    proportions.append(len(files_found)/length_all_files)
    abbreviation_files.append(files_found)

abbreviations['prop'] = proportions
abbreviations['paths'] = abbreviation_files
abbreviations.head()

Unnamed: 0,abbreviation,meaning,type,definition,regex,confidence,alternative,prop,paths
19,ACCT,Account,business,A financial record or arrangement between a cu...,.*AC+T.*,0.95,-,0.110345,"[bms/COACTUP.bms, bms/COACTVW.bms, cbl/CBACT01..."
32,ACTUP,Account Update,technical,Process or interface for updating account info...,.*ACTUP.*,0.9,-,0.02069,"[bms/COACTUP.bms, cbl/COACTUPC.cbl, cpy-bms/CO..."
42,ACTVW,Account View,business,Process or interface for viewing account details,.*ACTVW.*,0.9,-,0.02069,"[bms/COACTVW.bms, cbl/COACTVWC.cbl, cpy-bms/CO..."
20,ADM,Administration,business,System administration and management functions,.*ADM.*,0.9,-,0.034483,"[bms/COADM01.bms, cbl/COADM01C.cbl, cpy-bms/CO..."
18,ADMIN,Administrator,business,A user with administrative privileges in the s...,.*ADM.*,0.9,-,0.034483,"[bms/COADM01.bms, cbl/COADM01C.cbl, cpy-bms/CO..."


In [78]:
abbreviations

Unnamed: 0,abbreviation,meaning,type,definition,regex,confidence,alternative,prop,paths
19,ACCT,Account,business,A financial record or arrangement between a cu...,.*AC+T.*,0.95,-,0.110345,"[bms/COACTUP.bms, bms/COACTVW.bms, cbl/CBACT01..."
32,ACTUP,Account Update,technical,Process or interface for updating account info...,.*ACTUP.*,0.9,-,0.02069,"[bms/COACTUP.bms, cbl/COACTUPC.cbl, cpy-bms/CO..."
42,ACTVW,Account View,business,Process or interface for viewing account details,.*ACTVW.*,0.9,-,0.02069,"[bms/COACTVW.bms, cbl/COACTVWC.cbl, cpy-bms/CO..."
20,ADM,Administration,business,System administration and management functions,.*ADM.*,0.9,-,0.034483,"[bms/COADM01.bms, cbl/COADM01C.cbl, cpy-bms/CO..."
18,ADMIN,Administrator,business,A user with administrative privileges in the s...,.*ADM.*,0.9,-,0.034483,"[bms/COADM01.bms, cbl/COADM01C.cbl, cpy-bms/CO..."
8,ASCII,American Standard Code for Information Interch...,technical,A character encoding standard used by most mod...,.*ASCII.*,0.95,-,0.062069,"[data/ASCII/acctdata.txt, data/ASCII/carddata...."
21,BIL,Billing,business,Process of generating and managing customer bills,.*BIL.*,0.95,-,0.02069,"[bms/COBIL00.bms, cbl/COBIL00C.cbl, cpy-bms/CO..."
11,BMS,Basic Mapping Support,technical,CICS facility for handling screen layouts and ...,.*\.bms|\.BMS$,0.9,-,0.117241,"[bms/COACTUP.bms, bms/COACTVW.bms, bms/COADM01..."
54,CATG,Category,business,Classification or grouping of items or transac...,.*CATG.*,0.9,-,0.013793,"[data/EBCDIC/AWS.M2.CARDDEMO.TRANCATG.PS, jcl/..."
0,CBL,COBOL,technical,"Common Business-Oriented Language, a programmi...",.*\.cbl|\.CBL$,0.95,-,0.193103,"[cbl/CBACT01C.cbl, cbl/CBACT02C.cbl, cbl/CBACT..."
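
To get a feeling for how much the concepts overlap, the mapping can be turned around: instead of listing files per abbreviation, list abbreviations per file. Here is a minimal sketch with plain `re`, using three patterns and three paths taken from the tables in this notebook. `concepts_per_file` is a hypothetical helper. It uses `re.search`, which, as far as I know, is also what `Series.str.contains` does for regex patterns.

```python
import re
from collections import defaultdict

# Patterns and paths copied from the tables in this notebook.
patterns = {
    "ACCT": ".*AC+T.*",
    "ACTUP": ".*ACTUP.*",
    "BMS": r".*\.bms|\.BMS$",
}
sample_files = ["bms/COACTUP.bms", "bms/COBIL00.bms", "cbl/CBACT01C.cbl"]

def concepts_per_file(paths, patterns):
    """Map each path to the abbreviations whose regex matches it."""
    result = defaultdict(list)
    for abbreviation, pattern in patterns.items():
        compiled = re.compile(pattern)
        for path in paths:
            if compiled.search(path):
                result[path].append(abbreviation)
    return dict(result)

mapping = concepts_per_file(sample_files, patterns)
overlaps = {path: names for path, names in mapping.items() if len(names) > 1}
print(overlaps)  # {'bms/COACTUP.bms': ['ACCT', 'ACTUP', 'BMS']}
```

Some overlap is expected and even useful here, since one file often combines a technical concept (such as `BMS`) with a business concept (such as `ACCT`).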


### Step 8: Coverage of abbreviation information

This step calculates the proportion of files in the project that contain identifiable abbreviation information. The resulting value indicates the percentage of files for which we were able to extract meaningful details based on the identified abbreviations.

In [79]:
files_with_info_about_abbreviations = len(abbreviations.explode('paths')['paths'].drop_duplicates())
coverage = files_with_info_about_abbreviations / length_all_files
coverage_text = f"{coverage*100:.2f}% of the files contain one or more abbreviations of concepts we know about"
print(coverage_text)

99.31% of the files contain one or more abbreviations of concepts we know about
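
The complement of the coverage, the files that matched nothing, is often the more interesting list, because those files still need a manual look. A small sketch of how both could be computed with plain sets follows. `coverage_report` is a hypothetical helper, and `misc/notes.txt` is a made-up path that only serves as an example of an unmatched file.

```python
def coverage_report(all_files, files_per_concept):
    """Return the covered share and the sorted list of files no concept matched."""
    everything = set(all_files)
    covered = set().union(*files_per_concept.values()) & everything
    uncovered = sorted(everything - covered)
    share = len(covered) / len(everything) if everything else 0.0
    return share, uncovered

all_files = [
    "bms/COACTUP.bms",
    "cbl/COACTUPC.cbl",
    "data/ASCII/acctdata.txt",
    "misc/notes.txt",  # made-up path, matched by no concept
]
files_per_concept = {
    "ACTUP": ["bms/COACTUP.bms", "cbl/COACTUPC.cbl"],
    "ASCII": ["data/ASCII/acctdata.txt"],
}

share, uncovered = coverage_report(all_files, files_per_concept)
print(f"{share * 100:.2f}% covered, not covered: {uncovered}")
```

Keep in mind that coverage only says that a file name matched at least one pattern, not that the expansion of the abbreviation is correct.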


### Step 9: Generate assessment report

In [80]:
from IPython.display import display, Markdown

assessment_prompt = f"""
Here is a table with the information about a COBOL application.
Find the key insights for a software archaeologist and summarize the findings in this assessment:
    
{abbreviations.to_markdown()}

Also: {coverage_text}
"""

result = ask(assessment_prompt)

display(Markdown(result))

Here's a clear archaeological assessment of this COBOL application:

Key Findings:

1. System Type:
- This is a Credit Card Management System running on IBM mainframe
- Uses CICS for transaction processing
- Handles customer accounts, cards, and financial transactions

2. Technical Architecture:
- Core components: COBOL programs (.cbl), CICS screens (.bms), copybooks (.cpy)
- Data stored in EBCDIC format physical sequential files (.PS)
- Batch processing through JCL jobs
- Heavy use of copybooks (31% of codebase) suggesting modular design

3. Main Business Domains:
- Account management (ACCT*)
- Card operations (CRD*)
- Customer data (CUST*)
- Transaction processing (TRAN*)
- User security (USR*)
- Billing/Statements (BIL*, STM*)

4. Notable Patterns:
- Consistent naming conventions (CO prefix for programs)
- Clear separation between business and technical components
- Strong batch processing component (20% JCL)
- Comprehensive user interface (multiple BMS screens)

5. System Maturity:
- Well-structured with clear naming conventions
- High confidence in identified abbreviations (99.31% coverage)
- Complete mainframe ecosystem (online + batch)
- Comprehensive security and user management

This appears to be a mature, production-grade mainframe application following standard IBM mainframe architectural patterns of its era.

### Summary

In this analysis, I applied an LLM-assisted approach to decipher the abbreviations embedded in a legacy codebase, with the goal of better understanding its key business and technical concepts. Using a structured prompt and an iterative feedback loop, I extracted abbreviations, expanded their meanings, and linked them to the relevant files with regular expressions. The high coverage shows that almost every file matches at least one identified abbreviation, which gives a first overview of the code's structure and purpose. Feeding the already discovered abbreviations back into the prompt helps to avoid redundancy and steers each round toward new terms. The resulting insights can provide a solid foundation for upcoming modernization efforts and help in understanding the critical business logic of a legacy application.