Introduction

This repository contains the implementation of a basic lexer.

A lexer explodes a given string into a list of tokens.

Installation

From the command line:

composer require dbeurive\lexer

If you want to include this package to your project, then edit your file composer.json and add the following entry:

"require": {
    "dbeurive/lexer": "*"
}

Synopsis

    $varProcessor = function(array $inMatches) {
        $name = strtolower($inMatches[2]);
        switch (strtoupper($inMatches[1])) {
            case 'L': return 'LOCAL_'  . $name;
            case 'G': return 'GLOBAL_' . $name;
        }
        throw new \Exception("Impossible error!");
    };
   
    $tokens = array(
        array('/[0-9]+/',                'numeric'),
        array('/\\$([lg])([a-z0-9]+)/i', 'variable', $varProcessor),
        array('/[a-z]{2,}/i',            'function'),
        array('/(\\+|\\-|\\*|\\/)/',     'operator'),
        array('/\\(/',                   'open_bracket'),
        array('/\\)/',                   'close_bracket'),
        array('/(\\s+|\\r?\\n)/',        'blank', function(array $m) { return null; })
    );
    
    try {
        $lexer = new Lexer($tokens);
        $text = '$gConstant1 + sin($lCoef1) / cos($lcoef2) * $gTemp - tan(21)';
        $tokens = $lexer->lex($_text);
    } catch (\Exception $e) {
        print "ERROR: " . $e->getMessage() . "\n";
        exit(1);
    }
    
    /** @var Token $_token */
    foreach ($tokens as $_token) {
        printf("%s %s\n", $_token->type, $_token->value);
    }

Specifications

Description

The lexer is configured by a list of tokens specifications:

array(
    <token specification>,
    <token specification>,
    ...
)

Each token specification is an array that contains 2 or 3 elements.

<token specification> = array(<regexp>, <type>, [<transformer callback>])

The first element is a regular expression that describes the token.
The second element is a name that identifies the type of the token.
The optional third element is a function that is applied to the token's value before it is returned.

WARNING

Make sure to double all characters "\" within the regular expressions that define the tokens. That is: '/\s/' becomes '/\\s/'.

The signature of the optional third element (<transformer callback>) must be:

mixed|null function(array $inMatches)

The array ($inMatches) passed to the function comes from the processing of the regular expression that describes the token.

The first element of the array ($inMatches[0]) contains the text that matches the full pattern.
The second element of the array ($inMatches[1]) contains the text that matched the first captured parenthesized subpattern.
The third element of the array ($inMatches[2]) contains the text that matched the second captured parenthesized subpattern.
... and so on.

See the description for the PHP function preg_match().

If the function returns the value null, then the detected token is "ignored". That is: it will not be inserted into the list of extracted tokens.
If the function returns a non-null value, then the token is inserted in the list of detected tokens. The value of the inserted token will be the value returned by the function (<transformer callback>).

Very important note

Be aware that the order of declarations of the tokens is important.

The example 2 illustrates this point.

    use dbeurive\Lexer\Lexer;
    use dbeurive\Lexer\Token;
    
    $text = 'AAAA AA';
    
    // ---------------------------------------------------------
    // TEST 1
    // ---------------------------------------------------------
    
    $specifications = array(
        array('/AA/',                    'type A2'),
        array('/A/',                     'type A1'),
        array('/(\\s+|\\r?\\n)/',        'blank', function(array $m) { return null; })
    );
    
    try {
        $lexer = new Lexer($specifications);
        $tokens = $lexer->lex($text);
    } catch (\Exception $e) {
        print "ERROR: " . $e->getMessage() . "\n";
        exit(1);
    }
    
    print "Test1: $text\n\n";
    dumpToken($tokens);
    print "\n";
    
    // ---------------------------------------------------------
    // TEST 2
    // ---------------------------------------------------------
    
    $specifications = array(
        array('/A/',                     'type A1'),
        array('/AA/',                    'type A2'),
        array('/(\\s+|\\r?\\n)/',        'blank', function(array $m) { return null; })
    );
    
    try {
        $lexer = new Lexer($specifications);
        $tokens = $lexer->lex($text);
    } catch (\Exception $e) {
        print "ERROR: " . $e->getMessage() . "\n";
        exit(1);
    }
    
    print "Test2: $text\n\n";
    dumpToken($tokens);
    
    exit(0);
    
    function dumpToken(array $inTokens) {
        $max = 0;
    
        /** @var Token $_token */
        foreach ($inTokens as $_token) {
            $max = strlen($_token->type) > $max ? strlen($_token->type) : $max;
        }
    
        /** @var Token $_token */
        foreach ($inTokens as $_token) {
            printf("%${max}s %s\n", $_token->type, $_token->value);
        }
    }

The result is:

Test1: AAAA AA

type A2 AA
type A2 AA
type A2 AA

Test2: AAAA AA

type A1 A
type A1 A
type A1 A
type A1 A
type A1 A
type A1 A

API

Constructor

    /**
     * Lexer constructor.
     * @param array $inSpecifications This array represents the tokens specifications.
     *        Each element of this array is an array that specifies a token.
     *        It contains 2 or 3 elements.
     *        - First element: a regular expression that describes the token.
     *        - Second element: the name of the token.
     *        - Third element: an optional callback function.
     *          The signature of this function must be:
     *          null|string function(array $inMatches)
     * @throws \Exception
     */
    public function __construct(array $inSpecifications)

Please see the section "specifications" for a detailed description of the parameter.

lex()

    /**
     * Explode a given string into a list of tokens.
     * @param string $inString The string to explode into tokens.
     * @return array The method returns a list of tokens.
     *         Each element of the returned list is an instance of the class Token.
     * @throws \Exception
     * @see Token
     */
    public function lex($inString)

This method "parses" a given text and returns a list of detected tokens.

The returned array contains the list of detected tokens.

Each element of the returned array is an instance of the class \dbeurive\Lexer\Token.

    /**
     * Class Token
     *
     * This class implements a token.
     *
     * @package dbeurive\Lexer
    */
    class Token
    {
        /** @var null|mixed Token's value. */
        public $value = null;
        /** @var null|string Token's type. */
        public $type = null;
    
        /**
         * Token constructor.
         * @param string $inOptValue The token's value.
         * @param string $inOptType The token's type.
         */
        public function __construct($inOptValue=null, $inOptType=null)
        {
            $this->value = $inOptValue;
            $this->type  = $inOptType;
        }
    }

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
examples		examples
src		src
tests		tests
.gitignore		.gitignore
LICENSE.TXT		LICENSE.TXT
README.md		README.md
composer.json		composer.json
composer.lock		composer.lock
phpdoc.xml		phpdoc.xml
phpunit.xml		phpunit.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Introduction

Installation

Synopsis

Specifications

Description

Very important note

API

Constructor

lex()

About

Releases

Packages

Languages

License

dbeurive/lexer

Folders and files

Latest commit

History

Repository files navigation

Introduction

Installation

Synopsis

Specifications

Description

Very important note

API

Constructor

lex()

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages