# Code Splitter (using ANTLR)

## Prerequisites
Install a Java runtime (needed to run the ANTLR tool):
```
apt update
apt install openjdk-11-jre
apt install openjdk-11-jdk
```

## Useful Links
- ANTLR Grammars: https://github.com/antlr/grammars-v4 (start rule can be found in `pom.xml` -> `entryPoint`)
- ANTLR with Python runtime: https://github.com/antlr/antlr4/blob/master/doc/python-target.md
- More examples for using ANTLR with Python: https://github.com/jszheng/py3antlr4book
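
The `entryPoint` lookup mentioned in the first link can be scripted. The following is a minimal sketch, assuming the grammar's `pom.xml` stores the start rule in an `<entryPoint>` element (possibly under an XML namespace). `find_entry_point` is a hypothetical helper name, and the sample XML is illustrative rather than copied from the repository.

```python
import xml.etree.ElementTree as ET


def find_entry_point(pom_xml: str):
    """Return the text of the first <entryPoint> element, or None if absent."""
    root = ET.fromstring(pom_xml)
    for element in root.iter():
        # Tags look like "{namespace}entryPoint" when the POM declares a namespace.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "entryPoint" and element.text:
            return element.text.strip()
    return None


# Illustrative POM fragment (an assumption, not the real file contents).
sample_pom = """
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <build><plugins><plugin>
    <configuration>
      <entryPoint>compilationUnit</entryPoint>
    </configuration>
  </plugin></plugins></build>
</project>
"""

print(find_entry_point(sample_pom))
```

The returned name is the parser method to call on the generated parser, for example `parser.compilationUnit()` for the C grammar.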

## Related
- Code Splitter in LangChain [Docs](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/code_splitter), [Code](https://github.com/langchain-ai/langchain/blob/b01a443ee525e274335f475a849a1681240ff249/libs/langchain/langchain/text_splitter.py#L816)

## Download the ANTLR tool and generate the parser

In [36]:
# Install ANTLR Python runtime
!pip install antlr4-python3-runtime==4.13.0
# Download ANTLR tool
!wget https://www.antlr.org/download/antlr-4.13.0-complete.jar
# Download ANTLR grammar for C language
!wget https://raw.githubusercontent.com/antlr/grammars-v4/master/c/C.g4
# Generate the C parser for Python (the .jar file is not needed afterwards)
!java -jar ./antlr-4.13.0-complete.jar -Dlanguage=Python3 C.g4

--2023-10-20 15:06:48--  https://www.antlr.org/download/antlr-4.13.0-complete.jar
Resolving www.antlr.org (www.antlr.org)... 2606:50c0:8001::153, 2606:50c0:8003::153, 2606:50c0:8002::153, ...
Connecting to www.antlr.org (www.antlr.org)|2606:50c0:8001::153|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2148972 (2.0M) [application/java-archive]
Saving to: ‘antlr-4.13.0-complete.jar.1’


2023-10-20 15:06:49 (247 MB/s) - ‘antlr-4.13.0-complete.jar.1’ saved [2148972/2148972]

--2023-10-20 15:06:49--  https://raw.githubusercontent.com/antlr/grammars-v4/master/c/C.g4
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8003::154, 2606:50c0:8000::154, 2606:50c0:8001::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8003::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 17617 (17K) [text/plain]
Saving to: ‘C.g4.1’


2023-10-20 15:06:49 (93.9 MB/s) - ‘C.g4.1’ saved [17617/17617]


In [51]:
# Test installation
!wget https://raw.githubusercontent.com/postgres/postgres/master/src/backend/storage/large_object/inv_api.c
!mv -f inv_api.c input.c
!pygrun C compilationUnit --tokens input.c

--2023-10-20 15:25:06--  https://raw.githubusercontent.com/postgres/postgres/master/src/backend/storage/large_object/inv_api.c
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8001::154, 2606:50c0:8003::154, 2606:50c0:8002::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8001::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25210 (25K) [text/plain]
Saving to: ‘inv_api.c.1’


2023-10-20 15:25:06 (44.5 MB/s) - ‘inv_api.c.1’ saved [25210/25210]

[@0,0:1371='/*-------------------------------------------------------------------------\n *\n * inv_api.c\n *\t  routines for manipulating inversion fs large objects. This file\n *\t  contains the user-level large object application interface routines.\n *\n *\n * Note: we access pg_largeobject.data using its C struct declaration.\n * This is safe because it immediately follows pageno which is an int4 field,\n * and therefore the data field will always

In [1]:
# Import only the runtime classes used below instead of a wildcard import
from antlr4 import CommonTokenStream, FileStream, ParseTreeListener, ParseTreeWalker

In [2]:
from CLexer import CLexer
from CParser import CParser
from CListener import CListener

In [3]:
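# Lex and parse the file; compilationUnit is the grammar's start rule
# FileStream defaults to ASCII, so the encoding is set explicitly
input_stream = FileStream('/home/ubuntu/postgres-bot/input.c', encoding='utf-8')
lexer = CLexer(input_stream)
stream = CommonTokenStream(lexer)
parser = CParser(stream)
tree = parser.compilationUnit()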

line 469:58 no viable alternative at input 'ereport(ERROR,\n\t\t\t\t(errcode(ERRCODE_INVALID_PARAMETER_VALUE),\n\t\t\t\t errmsg_internal("invalid large object seek target: " INT64_FORMAT'
line 469:58 no viable alternative at input '(errcode(ERRCODE_INVALID_PARAMETER_VALUE),\n\t\t\t\t errmsg_internal("invalid large object seek target: " INT64_FORMAT'
line 469:58 no viable alternative at input 'errmsg_internal("invalid large object seek target: " INT64_FORMAT'
line 469:58 missing ';' at 'INT64_FORMAT'
line 470:18 mismatched input ')' expecting ';'
line 819:64 no viable alternative at input 'ereport(ERROR,\n\t\t\t\t(errcode(ERRCODE_INVALID_PARAMETER_VALUE),\n\t\t\t\t errmsg_internal("invalid large object truncation target: " INT64_FORMAT'
line 819:64 no viable alternative at input '(errcode(ERRCODE_INVALID_PARAMETER_VALUE),\n\t\t\t\t errmsg_internal("invalid large object truncation target: " INT64_FORMAT'
line 819:64 no viable alternative at input 'errmsg_internal("invalid large object tr
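
The messages above all point at `INT64_FORMAT`, which looks like a preprocessor macro placed next to a string literal; the grammar parses unpreprocessed source, so it likely cannot treat that as string concatenation. To see which source lines are affected, the messages can be turned into structured records. This is a minimal sketch, assuming the messages have been captured as text (for example copied from the cell output) and follow the `line <line>:<column> <message>` shape shown above; `parse_antlr_errors` is a hypothetical helper name.

```python
import re
from collections import Counter

# Shape of the console messages shown above: "line <line>:<column> <message>".
ERROR_LINE = re.compile(r"^line (\d+):(\d+) (.*)$")


def parse_antlr_errors(stderr_text: str):
    """Return a (line, column, message) tuple for each recognised error line."""
    errors = []
    for raw in stderr_text.splitlines():
        match = ERROR_LINE.match(raw.strip())
        if match:
            errors.append((int(match.group(1)), int(match.group(2)), match.group(3)))
    return errors


# Two of the messages from the output above.
sample = """line 469:58 missing ';' at 'INT64_FORMAT'
line 470:18 mismatched input ')' expecting ';'
"""

errors = parse_antlr_errors(sample)
errors_per_line = Counter(line for line, _, _ in errors)
print(errors)
print(sorted(errors_per_line))
```

Comparing the reported line numbers with function line ranges shows which chunks were parsed with errors.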

In [4]:
# Simple indented printout of the parse tree
from antlr4.tree.Tree import TerminalNode

def print_tree(tree, rule_names, indent=0):
    if tree is None:
        return

    # Leaves are tokens: print their text
    if isinstance(tree, TerminalNode):
        print(f"{' ' * indent}{tree.getText()}")
        return

    rule_name = rule_names[tree.getRuleIndex()] if tree.getRuleIndex() >= 0 else "Unknown"

    print(f"{' ' * indent}{rule_name}")

    # children is None for a rule context without children (for example after
    # a syntax error); iterating over None raised the TypeError seen earlier
    for child in tree.children or []:
        print_tree(child, rule_names, indent + 2)

print_tree(tree, parser.ruleNames)

compilationUnit
  translationUnit
    externalDeclaration
      declaration
        declarationSpecifiers
          declarationSpecifier
            typeSpecifier
              typedefName
                bool
          declarationSpecifier
            typeSpecifier
              typedefName
                lo_compat_privileges
        ;
    externalDeclaration
      declaration
        declarationSpecifiers
          declarationSpecifier
            storageClassSpecifier
              static
          declarationSpecifier
            typeSpecifier
              typedefName
                Relation
        initDeclaratorList
          initDeclarator
            declarator
              directDeclarator
                lo_heap_r
            =
            initializer
              assignmentExpression
                conditionalExpression
                  logicalOrExpression
                    logicalAndExpression
                      inclusiveOrExpression
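
The C grammar nests expression rules deeply, as the output above shows, so a recursive printer can get close to Python's recursion limit on large inputs. Below is a sketch of the same traversal with an explicit stack. It is duck-typed and runs on stand-in node classes (`Leaf` and `Rule` are hypothetical, not ANTLR classes); it assumes that only rule contexts expose `getRuleIndex()` and that `children` may be `None`.

```python
def format_tree(root, rule_names, indent_step=2):
    """Return the indented tree as a list of lines, using an explicit stack."""
    lines = []
    stack = [(root, 0)]
    while stack:
        node, indent = stack.pop()
        if node is None:
            continue
        if not hasattr(node, "getRuleIndex"):
            # Assumption: nodes without getRuleIndex() are leaves (tokens).
            lines.append(" " * indent + node.getText())
            continue
        index = node.getRuleIndex()
        lines.append(" " * indent + (rule_names[index] if index >= 0 else "Unknown"))
        # Push children in reverse so they are popped in source order.
        for child in reversed(node.children or []):
            stack.append((child, indent + indent_step))
    return lines


class Leaf:
    """Stand-in for a terminal node (hypothetical, for illustration only)."""
    def __init__(self, text):
        self.text = text

    def getText(self):
        return self.text


class Rule:
    """Stand-in for a rule context (hypothetical, for illustration only)."""
    def __init__(self, index, children=None):
        self.index = index
        self.children = children

    def getRuleIndex(self):
        return self.index


names = ["declaration", "storageClassSpecifier", "initDeclaratorList"]
demo = Rule(0, [Rule(1, [Leaf("static")]), Rule(2), Leaf(";")])
lines = format_tree(demo, names)
print("\n".join(lines))
```

With the real parse tree the call would be `format_tree(tree, parser.ruleNames)`.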

In [5]:
# Parser rule names (the node types that can appear in the parse tree)
parser.ruleNames

['startRule',
 'primaryExpression',
 'genericSelection',
 'genericAssocList',
 'genericAssociation',
 'postfixExpression',
 'argumentExpressionList',
 'unaryExpression',
 'unaryOperator',
 'castExpression',
 'multiplicativeExpression',
 'additiveExpression',
 'shiftExpression',
 'relationalExpression',
 'equalityExpression',
 'andExpression',
 'exclusiveOrExpression',
 'inclusiveOrExpression',
 'logicalAndExpression',
 'logicalOrExpression',
 'conditionalExpression',
 'assignmentExpression',
 'assignmentOperator',
 'expression',
 'constantExpression',
 'declaration',
 'declarationSpecifiers',
 'declarationSpecifiers2',
 'declarationSpecifier',
 'initDeclaratorList',
 'initDeclarator',
 'storageClassSpecifier',
 'typeSpecifier',
 'structOrUnionSpecifier',
 'structOrUnion',
 'structDeclarationList',
 'structDeclaration',
 'specifierQualifierList',
 'structDeclaratorList',
 'structDeclarator',
 'enumSpecifier',
 'enumeratorList',
 'enumerator',
 'enumerationConstant',
 'atomicTypeSpecifie

In [6]:
# Attributes and methods available on a parse tree node
dir(tree)

['EMPTY',
 'EOF',
 '__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'accept',
 'addChild',
 'addErrorNode',
 'addTokenNode',
 'children',
 'copyFrom',
 'depth',
 'enterRule',
 'exception',
 'exitRule',
 'getAltNumber',
 'getChild',
 'getChildCount',
 'getChildren',
 'getPayload',
 'getRuleContext',
 'getRuleIndex',
 'getSourceInterval',
 'getText',
 'getToken',
 'getTokens',
 'getTypedRuleContext',
 'getTypedRuleContexts',
 'invokingState',
 'isEmpty',
 'parentCtx',
 'parser',
 'removeLastChild',
 'setAltNumber',
 'start',
 'stop',
 'toString',
 'toStringTree',
 'translationUnit']

In [33]:
# Re-lex the file and load every token, including comment tokens
input_stream = FileStream('/home/ubuntu/postgres-bot/input.c', encoding='utf-8')
lexer = CLexer(input_stream)
token_stream = CommonTokenStream(lexer)
token_stream.fill()

for tok in token_stream.tokens:
    print(tok)

# Note: token types 119 and 120 are BlockComment and LineComment

[@0,0:1371='/*-------------------------------------------------------------------------\n *\n * inv_api.c\n *\t  routines for manipulating inversion fs large objects. This file\n *\t  contains the user-level large object application interface routines.\n *\n *\n * Note: we access pg_largeobject.data using its C struct declaration.\n * This is safe because it immediately follows pageno which is an int4 field,\n * and therefore the data field will always be 4-byte aligned, even if it\n * is in the short 1-byte-header format.  We have to detoast it since it's\n * quite likely to be in compressed or short format.  We also need to check\n * for NULLs, since initdb will mark loid and pageno but not data as NOT NULL.\n *\n * Note: many of these routines leak memory in CurrentMemoryContext, as indeed\n * does most of the backend code.  We expect that CurrentMemoryContext will\n * be a short-lived context.  Data that must persist across function calls\n * is kept either in CacheMemoryContext (t

In [8]:
token = token_stream.tokens[10]
print('Token text:', token.text)
print('Token line:', token.line)
print('Token index:', token.tokenIndex)
print('Token type:', token.type)
print('Token source:', token.getTokenSource())

Token text: #include "access/genam.h"
Token line: 36
Token index: 10
Token type: 115
Token source: <CLexer.CLexer object at 0x7f8394261600>
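
Since the token stream keeps comments, the `type`, `line`, and `text` attributes shown above are enough to locate every comment and the lines it spans. This sketch runs on stand-in tokens so that it is self-contained; `collect_comments` is a hypothetical helper, and the type numbers are the ones noted earlier for this generated lexer.

```python
from types import SimpleNamespace


def collect_comments(tokens, comment_types):
    """Return (start_line, end_line, text) for every token of a comment type."""
    comments = []
    for tok in tokens:
        if tok.type in comment_types:
            # tok.line is the line of the token's first character.
            end_line = tok.line + tok.text.count("\n")
            comments.append((tok.line, end_line, tok.text))
    return comments


# Stand-in tokens; with the real lexer, pass token_stream.tokens and
# {CLexer.BlockComment, CLexer.LineComment} (assumed attribute names).
BLOCK, LINE, OTHER = 119, 120, 1  # illustrative type numbers
tokens = [
    SimpleNamespace(type=BLOCK, line=1, text="/*\n * header\n */"),
    SimpleNamespace(type=OTHER, line=4, text="int"),
    SimpleNamespace(type=LINE, line=5, text="// trailing note"),
]

comments = collect_comments(tokens, {BLOCK, LINE})
for start, end, text in comments:
    print(start, end, repr(text))
```

A comment whose end line sits directly above a function's start line can then be attached to that function's chunk.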


In [9]:
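class FuncLineCollector(ParseTreeListener):
    """Collect the (start_line, end_line) pair of every function definition."""

    def __init__(self):
        self.func_lines = []   # (start_line, end_line) pairs, 1-based
        self.func_start = -1   # first line of the function being visited
        self.depth = 0         # nesting depth of compound statements

    def enterFunctionDefinition(self, ctx):
        self.func_start = ctx.start.line
        self.depth = 0

    def enterCompoundStatement(self, ctx):
        self.depth += 1

    def exitCompoundStatement(self, ctx):
        self.depth -= 1
        # Depth 0 means the outermost block (the function body) just closed
        if self.depth == 0:
            func_end = ctx.stop.line
            self.func_lines.append((self.func_start, func_end))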


In [29]:
listener = FuncLineCollector()
ParseTreeWalker().walk(listener, tree)

print(f"Found {len(listener.func_lines)} functions:")
for start, end in listener.func_lines:
    print(f"Function on lines {start} to {end}")

Found 14 functions:
Function on lines 74 to 93
Function on lines 98 to 124
Function on lines 131 to 161
Function on lines 169 to 196
Function on lines 211 to 242
Function on lines 254 to 332
Function on lines 338 to 343
Function on lines 350 to 371
Function on lines 379 to 425
Function on lines 427 to 474
Function on lines 476 to 487
Function on lines 489 to 580
Function on lines 582 to 777
Function on lines 779 to 955
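
With the line ranges collected, splitting the source into one chunk per function is a matter of slicing lines. This is a minimal sketch, assuming 1-based inclusive ranges like the ones printed above; `split_by_line_ranges` is a hypothetical helper and the sample source is made up for illustration.

```python
def split_by_line_ranges(source: str, ranges):
    """Return one text chunk per (start, end) range; lines are 1-based, inclusive."""
    lines = source.split("\n")
    return ["\n".join(lines[start - 1:end]) for start, end in ranges]


# Illustrative source; with the notebook's objects the call would be
# split_by_line_ranges(source_text, listener.func_lines).
sample_source = (
    "int a;\n"
    "\n"
    "int f(void)\n"
    "{\n"
    "    return 1;\n"
    "}\n"
    "\n"
    "int g(void)\n"
    "{\n"
    "    return 2;\n"
    "}\n"
)

chunks = split_by_line_ranges(sample_source, [(3, 6), (8, 11)])
for chunk in chunks:
    print(chunk)
    print("---")
```

Text between the ranges (includes, globals, comments) is not returned by this sketch and would need its own chunks in a complete splitter.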


In [28]:
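# Character length of the first 93 lines, i.e. up to the end of the first function
# (assumes those lines fit within the first 10001 characters fetched by getText)
len('\n'.join(input_stream.getText(0,10000).split('\n')[0:93]))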

3132
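
The computation above can be generalised to map any line range to character offsets, which is convenient when chunk sizes are measured in characters. This is a sketch under the same conventions (1-based inclusive lines, `\n` line endings); `line_start_offsets` and `char_span` are hypothetical helper names.

```python
def line_start_offsets(lines):
    """Return the character offset at which each line starts."""
    offsets = [0]
    for line in lines[:-1]:
        offsets.append(offsets[-1] + len(line) + 1)  # +1 for the newline
    return offsets


def char_span(source: str, start_line: int, end_line: int):
    """Return (start, end) offsets of the given lines; end is exclusive
    and does not include the newline that terminates end_line."""
    lines = source.split("\n")
    offsets = line_start_offsets(lines)
    start = offsets[start_line - 1]
    end = offsets[end_line - 1] + len(lines[end_line - 1])
    return start, end


sample_text = "ab\ncde\n\nf"
start, end = char_span(sample_text, 2, 3)
print(start, end, repr(sample_text[start:end]))
```

For a range starting at line 1, the end offset equals the length computed by the `'\n'.join(...)` expression above.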