Permalink
Find file
Fetching contributors…
Cannot retrieve contributors at this time
590 lines (545 sloc) 15.6 KB

Bootstrapping a Language

Bootstrapping A Language

Once upon a time there was a REPL

#!/usr/bin/env ruby

def repl
  while true
    "geeklisp> ".display
    p(gets)
  end
end

repl

But that REPL didn’t do much

$ ./geeklisp
geeklisp> (define foo 2)
(define foo 2)
geeklisp> (define bar 5)
(define bar 5)
geeklisp> (+ foo bar)
(+ foo bar)
geeklisp>

This is the true story… of several lines of code… picked to live in a REPL… to work together and have their lines executed… find out what happens when code stops being polite… and starts getting real…

Everyone should write a language

file:lisplogo.png

Peter Norvig

file:norvig.jpg

The basics

FormSyntaxExample
variable referencevarx
constant literalnumber12
quotation(quote exp)(quote (1 2 3))
conditional(if test conseq alt)(if (< 10 20) #t #f)
assignment(set! var exp)(set! x (* x x))
definition(define var exp)(define r 3)
procedure(lambda (var...) exp)(lambda (r) (* r r)))
sequencing(begin exp)(begin (set! x 1) (* x 2))
procedure call(proc exp)(square 12)

Let’s start turning our REPL into a real language

Turning the lisp source into tokens (Read)

def to_atom(src)
  return "[" if src =='('
  return "]" if src ==')'
  return src if src =~ /^-?\d+$/ || src =~ /^-?\d*\.\d+$/
  ':' + src
end

def read(src)
  tokens = src.gsub('(', ' ( ').gsub(')', ' ) ').split
  Kernel.eval(tokens.map{|s| to_atom(s)}.join(' ').gsub(' ]',']').gsub(/([^\[]) /,'\1, '))
end

def repl
  while true
    "geeklisp> ".display
    p(read(gets))
  end
end

What it looks like

$ ./geeklisp
geeklisp> (+ 2 3)
[:+, 2, 3]
geeklisp> (define sqr (lambda (x) (* x x)))
[:define, :sqr, [:lambda, [:x], [:*, :x, :x]]]
geeklisp> sqr
:sqr
geeklisp> (sqr 2)
[:sqr, 2]

Basic Evaluation

def eval(x)
  return x if !x.is_a? Array
  case x[0]
    when :quote then x[1..-1]
  end
end

def repl
  while true
    "geeklisp> ".display
    p(eval(read(gets)))
  end
end

Did we take a step back?

$ ./geeklisp
geeklisp> (+ 2 3)
nil
geeklisp> (define sqr (lambda (x) (* x x)))
nil
geeklisp> (sqr 4)
nil
geeklisp> (quote (1 2 3 4))
[[1, 2, 3, 4]]

We need a way to keep track of things so we can refer to them later

class Env < Hash
  def initialize(keys=[], vals=[], outer=nil)
    @outer = outer
    keys.zip(vals).each{|p| store(*p)}
  end

  def [] (name)
    super(name) || @outer[name]
  end

  def set(name, value)
    key?(name) ? store(name, value) : @outer.set(name, value)
  end
end

Update eval to accept an env and repl to initialize an env

def eval(symbol, env)
  return env[symbol] if symbol.is_a? Symbol
  return symbol if !symbol.is_a? Array
  case symbol[0]
  when :quote then symbol[1..-1]
  when :define then env[symbol[1]] = eval(symbol[2], env)
  when :set! then env.set(symbol[1], eval(symbol[2], env))
  when :env then env
  end
end

def repl
  env = Env.new
  while true
    "geeklisp> ".display
    p(eval(read(gets), env))
  end
end

Give it a try

./geeklisp
geeklisp> (define foo 5)
5
geeklisp> (define sqr (lambda (x) (* x x)))
nil
geeklisp> (define bar 27)
27
geeklisp> foo
5
geeklisp> bar
27
geeklisp> (define baz (quote (1 2 3)))
[[1, 2, 3]]
geeklisp> (env)
{:foo=>5, :sqr=>nil, :bar=>27, :baz=>[[1, 2, 3]]}

But our language still doesn’t really know how to do anything…

$ ./geeklisp
geeklisp> (+ 2 3)
nil
geeklisp> (car (1 2 3))
nil
geeklisp> (cdr (1 2 3))
nil
geeklisp> (+ (* 3 4) 5)
nil

Define the basic primitives

def init(env)
  [:+, :-, :*, :/, :>, :<, :>=, :<=, :==].each do |op|
    env[op] = lambda{|a, b| a.send(op, b)}
  end
end

def repl
  env = init(Env.new)
  while true
    "geeklisp> ".display
    p(eval(read(gets), env))
  end
end

def eval(x, env)
  ...
  else
    exps = x.map{|exp| eval(exp, env)}
    exps[0].call(*exps[1..-1])
  end
end

Now we’re cooking

$ ./geeklisp
geeklisp> (* 2 3)
6
geeklisp> (> 2 3)
false
geeklisp> (< 2 3)
true
geeklisp> (/ 4 12)
0
geeklisp> (/ 12 4)
3
geeklisp> (- 5 6)
-1
geeklisp> (== 5 5)
true

Bootstrapping more basic functions

def init(env)
  [:+, :-, :*, :/, :>, :<, :>=, :<=, :==].each do |op|
    env[op] = lambda{|a, b| a.send(op, b)}
  end

  env.update({:length => lambda{|x| x.length},
              :cons => lambda{|x,y| [x] + y},
              :car => lambda{|x| x[0]},
              :cdr => lambda{|x| x[1..-1]},
              :append => lambda{|x,y| x + y},
              :list => lambda{|*xs| xs},
              :list? => lambda{|x| x.is_a? Array},
              :symbol? => lambda{|x| x.is_a? Symbol},
              :not => lambda{|x| !x},
              :display => lambda{|x| p x}})
end

Try it out

(define foo (list 1 2 3))
[1, 2, 3]
geeklisp> (car foo)
1
geeklisp> (cdr foo)
[2, 3]
geeklisp> (cons 4 foo)
[4, 1, 2, 3]
geeklisp> (list? foo)
true
geeklisp> (define bar 5)
5
geeklisp> (list? 5)
false

We’re almost there! Let’s round things out

def eval(symbol, env)
  ...
  when :if
    _, test, conseq, alt = symbol
    eval(eval(test, env) ? conseq : alt, env)
  when :lambda
    _, vars, exp = symbol
    Proc.new{|*args| eval(exp, Env.new(vars, args, env))}
  when :begin
    symbol[1..-1].reduce([nil, env]){|val_env, exp| [eval(exp, val_env[1]), val_env[1]]}[0]
  else
    exps = symbol.map{|exp| eval(exp, env)}
    exps[0].call(*exps[1..-1])
  end
end

Take in your new awesomeness

$ ./geeklisp
geeklisp> (if (< 3 5) 10 20)
10
geeklisp> (if (> 3 5) 10 20)
20
geeklisp> (define square (lambda (x) (* x x)))
#<Proc:0x00000001001b7aa0@./geeklisp:41>
geeklisp> (square 12)
144
geeklisp> (begin (set! x 1) (set! x (+ x 1)) (* x 2))
4

We now have the basics of a language

But we are missing a lot

  • No error handling
  • Not a complete implementation of the scheme spec
  • No ability to load a file and run it
  • No core after the bootstrap (stdlib)
  • No callcc/tail recursion
  • No save-world

In case Lisp isn’t your thing…

Let’s look at another toy language

LLVM

file:llvm.png

???

  • LLVM (formerly Low Level Virtual Machine) is compiler infrastructure written in C++
  • It is designed for compile-time, link-time, run-time, and “idle-time” optimization of programs written in arbitrary programming languages.
  • Originally implemented for C and C++, the language-agnostic design (and the success) of LLVM has since spawned a wide variety of front ends
  • Languages with compilers which use LLVM include Objective-C, Fortran, Ada, Haskell, Java bytecode, Python, Ruby, ActionScript, GLSL, and Rust.

Tokenization

static int gettok() {
  static int LastChar = ' ';

  while (isspace(LastChar))
    LastChar = getchar();

  if (isalpha(LastChar)) {
    IdentifierStr = LastChar;
    while (isalnum((LastChar = getchar())))
      IdentifierStr += LastChar;

    if (IdentifierStr == "def") return tok_def;
    if (IdentifierStr == "extern") return tok_extern;
    return tok_identifier;
  }

if (isdigit(LastChar) || LastChar == '.') {
  std::string NumStr;
  do {
    NumStr += LastChar;
    LastChar = getchar();
  } while (isdigit(LastChar) || LastChar == '.');

  NumVal = strtod(NumStr.c_str(), 0);
  return tok_number;
}

if (LastChar == '#') {
  do LastChar = getchar();
  while (LastChar != EOF && LastChar != '\n' && LastChar != '\r');

  if (LastChar != EOF)
    return gettok();
}

AST

class ExprAST {
public:
  virtual ~ExprAST() {}
  virtual Value *Codegen() = 0;
};

class NumberExprAST : public ExprAST {
  double Val;
public:
  NumberExprAST(double val) : Val(val) {}
  virtual Value *Codegen();
};

class VariableExprAST : public ExprAST {
  std::string Name;
public:
  VariableExprAST(const std::string &name) : Name(name) {}
  virtual Value *Codegen();
};

class BinaryExprAST : public ExprAST {
  char Op;
  ExprAST *LHS, *RHS;
public:
  BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
    : Op(op), LHS(lhs), RHS(rhs) {}
  virtual Value *Codegen();
};

class CallExprAST : public ExprAST {
  std::string Callee;
  std::vector<ExprAST*> Args;
public:
  CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
    : Callee(callee), Args(args) {}
  virtual Value *Codegen();
};

class PrototypeAST {
  std::string Name;
  std::vector<std::string> Args;
public:
  PrototypeAST(const std::string &name, const std::vector<std::string> &args)
    : Name(name), Args(args) {}

  Function *Codegen();
};

class FunctionAST {
  PrototypeAST *Proto;
  ExprAST *Body;
public:
  FunctionAST(PrototypeAST *proto, ExprAST *body)
    : Proto(proto), Body(body) {}

  Function *Codegen();
};

Parsing

static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
  while (1) {
    int TokPrec = GetTokPrecedence();
    int BinOp = CurTok;
    if (TokPrec < ExprPrec) return LHS;

    getNextToken();

    ExprAST *RHS = ParsePrimary();
    if (!RHS) return 0;

    int NextPrec = GetTokPrecedence();
    if (TokPrec < NextPrec) {
      RHS = ParseBinOpRHS(TokPrec+1, RHS);
      if (RHS == 0) return 0;
    }

    LHS = new BinaryExprAST(BinOp, LHS, RHS);
  }
}

Codegen

Function *FunctionAST::Codegen() {
  NamedValues.clear();

  Function *TheFunction = Proto->Codegen();
  if (TheFunction == 0) return 0;

  BasicBlock *BB = BasicBlock::Create(getGlobalContext(), "entry", TheFunction);
  Builder.SetInsertPoint(BB);

  if (Value *RetVal = Body->Codegen()) {
    Builder.CreateRet(RetVal);
    verifyFunction(*TheFunction);
    return TheFunction;
  }

  TheFunction->eraseFromParent();
  return 0;
}

Handlers

static void HandleDefinition() {
  if (FunctionAST *F = ParseDefinition()) {
    if (Function *LF = F->Codegen()) {
      fprintf(stderr, "Read function definition:");
      LF->dump();
    }
  } else {
    getNextToken();
  }
}

static void HandleExtern() {
  if (PrototypeAST *P = ParseExtern()) {
    if (Function *F = P->Codegen()) {
      fprintf(stderr, "Read extern: ");
      F->dump();
    }
  } else {
    getNextToken();
  }
}

Interactive shell

static void MainLoop() {
  while (1) {
    fprintf(stderr, "pon> ");
    switch (CurTok) {
    case tok_eof:    return;
    case ';':        getNextToken(); break;
    case tok_def:    HandleDefinition(); break;
    case tok_extern: HandleExtern(); break;
    default:         HandleTopLevelExpression(); break;
    }
  }
}

Putting it all together

int main() {
  LLVMContext &Context = getGlobalContext();

  BinopPrecedence['<'] = 10;
  BinopPrecedence['+'] = 20;
  BinopPrecedence['-'] = 20;
  BinopPrecedence['*'] = 40;

  fprintf(stderr, "pon> ");
  getNextToken();

  TheModule = new Module("Pon JIT", Context);

  MainLoop();

  TheModule->dump();

  return 0;
}

Giving it a spin

clang++ -g -O3 pon.cpp `llvm-config --cppflags --ldflags --libs core` -o pon
pon> 4+5;
pon> Read top-level expression:
define double @0() {
entry:
  ret double 9.000000e+00
}

pon> def foo(a b) a*a + 2*a*b + b*b;
pon> Read function definition:
define double @foo(double %a, double %b) {
entry:
  %multmp = fmul double %a, %a
  %multmp1 = fmul double 2.000000e+00, %a
  %multmp2 = fmul double %multmp1, %b
  %addtmp = fadd double %multmp, %multmp2
  %multmp3 = fmul double %b, %b
  %addtmp4 = fadd double %addtmp, %multmp3
  ret double %addtmp4
}

pon> def bar(a) foo(a, 4.0) + bar(31337);
pon> Read function definition:
define double @bar(double %a) {
entry:
  %calltmp = call double @foo(double %a, double 4.000000e+00)
  %calltmp1 = call double @bar(double 3.133700e+04)
  %addtmp = fadd double %calltmp, %calltmp1
  ret double %addtmp
}

pon> extern cos(x);
pon> Read extern:
declare double @cos(double)

pon> cos(1.234);
pon> Read top-level expression:
define double @1() {
entry:
  %calltmp = call double @cos(double 1.234000e+00)
  ret double %calltmp
}

THIS IS FUN!

You can learn a so much by doing this!

You will think more about how your programming language does this!

You might even explore the implementation of your language to see how it works!

References