dspinellis/tokenizer

tokenizer

Tokenize source code into integer vectors, symbols, or discrete tokens.

The following languages are currently supported.

  • C
  • C#
  • C++
  • Go
  • Java
  • JavaScript
  • PHP
  • Python
  • Rust
  • TypeScript

Build

cd src
make

Test

Ensure CppUnit is installed. Depending on your environment, you may also need to tell make where CppUnit's headers and libraries are installed by passing the corresponding flags as command-line arguments. For example, under macOS pass ADDCXXFLAGS='-I /opt/homebrew/include' ADDLDFLAGS='-L /opt/homebrew/lib' as arguments to make.

cd src
make test

Install

cd src
sudo make install

Run

tokenizer file.c
tokenizer -l Java -o statement <file.java

Examples of tokenizing "hello world" programs in diverse languages

C into integers

$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/c/c.c | tokenizer -l C
35      320     60      2000    46      2001    62      322     2002    40     41       123     2003    40      625     41      59      327     1500    59     125

C into symbols

$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/c/c.c | tokenizer -l C -s
# include < ID:2000 . ID:2001 > int ID:2002 ( ) { ID:2003 ( STRING_LITERAL
) ; return 0 ; }
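
The integer and symbol outputs above describe the same token stream, so they can be lined up position by position. The sketch below does this with the two outputs copied from this README; it does not invoke tokenizer. In this one sample, single-character tokens carry their character code (35 is #) and identifiers are numbered from 2000. Treat both as observations on this example, not as a documented guarantee.

```python
# Outputs copied verbatim from the two C examples in this README.
integers = ("35 320 60 2000 46 2001 62 322 2002 40 41 123 "
            "2003 40 625 41 59 327 1500 59 125").split()
symbols = ("# include < ID:2000 . ID:2001 > int ID:2002 ( ) { ID:2003 ( "
           "STRING_LITERAL ) ; return 0 ; }").split()

# Both outputs should contain exactly one entry per token.
assert len(integers) == len(symbols)

# Pair each integer code with the symbol printed at the same position.
pairs = list(zip((int(i) for i in integers), symbols))

# Single-character symbols whose integer equals their character code.
ascii_coded = [(n, s) for n, s in pairs if len(s) == 1 and ord(s) == n]

for number, symbol in pairs:
    print(f"{number:>5}  {symbol}")
```

Printing the pairs side by side is a quick way to read an integer vector when checking what a given code stands for.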

C# into integers

$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/c/csharp.cs | tokenizer -l "C#"
312     2000    123     360     376     2001    40      41      123     2002   46       2003    46      2004    40      627     41      59      125     125

C# into symbols

$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/c/csharp.cs | tokenizer -l "C#" -s
class ID:2000 { static void ID:2001 ( ) { ID:2002 . ID:2003 . ID:2004
( STRING_LITERAL ) ; } }

C# method-only into integers

$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/c/csharp.cs | tokenizer -l "C#" -o method
123     2002    46      2003    46      2004    40      627     41      59     125
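
Comparing the two C# integer outputs above, the -o method vector appears as one contiguous run inside the whole-file vector, with the identifier numbers unchanged. The following check uses only the values copied from this README (no tokenizer invocation); find_run is a helper introduced here for illustration.

```python
def find_run(haystack, needle):
    """Return the first index where needle occurs contiguously, or -1."""
    n = len(needle)
    for start in range(len(haystack) - n + 1):
        if haystack[start:start + n] == needle:
            return start
    return -1


# Whole-file output of the C# example.
whole_file = [312, 2000, 123, 360, 376, 2001, 40, 41, 123, 2002, 46,
              2003, 46, 2004, 40, 627, 41, 59, 125, 125]
# Output of the same file with -o method.
method_only = [123, 2002, 46, 2003, 46, 2004, 40, 627, 41, 59, 125]

start = find_run(whole_file, method_only)
print(start)
```

The run starts at the opening brace of the method body and stops before the brace that closes the class.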

C++ into symbols

$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/c/c%2B%2B.cpp | tokenizer -l C++ -s
# include < ID:2000 > LINE_COMMENT using namespace ID:2001 ; int ID:2002
( ) LINE_COMMENT { ID:2003 LSHIFT STRING_LITERAL LSHIFT ID:2004 ;
LINE_COMMENT return 0 ; LINE_COMMENT }

Java into symbols

$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/j/Java.java | tokenizer -l Java -s
public class ID:2000 { public static void ID:2001 ( ID:2002 [ ] ID:2003 )
{ ID:2004 . ID:2005 . ID:2006 ( STRING_LITERAL ) ; } }

C++ into code tokens

$ curl -s https://raw.githubusercontent.com/leachim6/hello-world/master/c/c%2B%2B.cpp | tokenizer -l C++ -c
#
include
<
iostream
>
// ...
using
namespace
std
;
int
main
(
)
// ...
{
cout
<<
"..."
<<
endl
;
// ...
return
0
;
// ...
}

Examples of tokenizer code preprocessing

Token-by-token difference

Produce a token-by-token difference between the current version of the file tokenizer.cpp and the one in version v1.1.

diff <(git show v1.1:./tokenizer.cpp | tokenizer -l C++ -b) \
  <(tokenizer -l C++ -b tokenizer.cpp)
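
Comparing token sequences instead of text lines makes such a difference insensitive to layout changes. As a minimal illustration, assuming two hypothetical, already tokenized versions of a function (the token lists below are made up, not tokenizer output), Python's difflib can report only the tokens that differ:

```python
import difflib

# Hypothetical token sequences for two versions of the same function.
old_tokens = ["int", "main", "(", ")", "{", "return", "0", ";", "}"]
new_tokens = ["int", "main", "(", ")", "{", "return", "1", ";", "}"]

matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens, autojunk=False)

# Keep only the spans that are not identical in both versions.
changes = [(tag, old_tokens[i1:i2], new_tokens[j1:j2])
           for tag, i1, i2, j1, j2 in matcher.get_opcodes()
           if tag != "equal"]
print(changes)
```

Only the changed literal is reported. Re-indenting or re-wrapping the function would leave both token lists, and therefore the result, unchanged.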

Clone detection

List Type 2 (near) clones in the tokenizer source code.

tokenizer -l C++ -c -f -o line *.cpp *.h | mpcd
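
Type 2 clones are usually described as fragments that are identical apart from identifier names, literal values, layout, and comments. The sketch below illustrates only the identifier part of that idea on hypothetical token lines written in the style of the -s output shown earlier. It is not how mpcd works and does not reproduce the command above; normalize and type2_groups are names introduced here.

```python
import re
from collections import defaultdict


def normalize(symbol_line):
    """Replace numbered identifiers (ID:2000, ID:2001, ...) with a plain ID."""
    return re.sub(r"\bID:\d+\b", "ID", symbol_line)


def type2_groups(lines):
    """Group 1-based line numbers whose normalized token lines are equal."""
    groups = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        groups[normalize(line)].append(number)
    return [numbers for numbers in groups.values() if len(numbers) > 1]


# Hypothetical symbolic token lines, one source line each.
sample = [
    "ID:2000 = ID:2001 + 1 ;",
    "return ID:2000 ;",
    "ID:2002 = ID:2003 + 1 ;",
]
print(type2_groups(sample))
```

Lines 1 and 3 differ only in the identifiers they name, so they end up in the same group once identifiers are abstracted away.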

Reference manual

You can read the command's Unix manual page through this link.

In 2023, version 2.0 of the tokenizer was released with a simpler and more orthogonal command-line interface. To convert code that uses the old interface, you can read the Unix manual page of the original v1.1 version through this link.

Contributing

To support a new language proceed as follows.

  • Open an issue with the language name and a pointer to its lexical structure definition.
  • Add a comment indicating that you're working on it.
  • List the language's keywords in a file named language-keyword.txt. Keep alphabetic order. If the language supports a C-like preprocessor, add those keywords as well.
  • Copy the source code files of an existing language that most resembles the new language to create the new language files: languageTokenizer.cpp, languageTokenizer.h, languageTokenizerTest.h.
  • In the copied files rename all instances (uppercase, lowercase, CamelCase) of the existing language name to the new language name.
  • Create a list of the new language's operators and punctuators, and methodically go through the languageTokenizer.cpp switch statements to ensure that these are correctly handled. When code is missing or different, base the new code on an existing pattern. Keep token names used for the same semantic purpose the same across languages. If you need a new token name, just write Token:MY_NAME and it will be defined automatically.
  • Add code to handle the language's comments.
  • Adjust, if needed, the handling of constants and literals. Note that for the sake of simplicity and efficiency, the tokenizer can assume that its input is correct.
  • To implement features that aren't handled in the language whose tokenizer implementation you copied, look at the implementation of other language tokenizers that have these features.
  • If you need to reuse a method from another language, move it to TokenizerBase.
  • Add the object file languageTokenizer.o to the OBJ list of file names in the Makefile.
  • Add unit tests for any new or modified features you implemented.
  • Update the file UnitTests.cpp to include the unit test header file, and call addTest with the unit test suite.
  • Update the method process_file in tokenizer.cpp to call the tokenizer you implemented, and add the language's name to the list of supported languages.
  • Ensure the language is correctly tokenized, both by running the tokenizer and by running the unit tests with make test.
  • Update the manual page tokenizer.1 and this README.md file.
  • Bump the middle number of the semantic version string in tokenizer.cpp.
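
The keyword-file step asks for alphabetic order, which is easy to break when adding entries by hand. A small check along the following lines may help. It assumes plain code-point ordering and works on an in-memory list, because the exact layout of the keyword file is not described here; out_of_order and the sample keywords are illustrative.

```python
def out_of_order(keywords):
    """Return (previous, current) pairs where current sorts before previous."""
    return [(a, b) for a, b in zip(keywords, keywords[1:]) if b < a]


# Hypothetical keyword list with one misplaced entry.
sample = ["break", "case", "auto", "char"]
print(out_of_order(sample))
```

An empty result means the list is already sorted under this ordering.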