Compiler: optimize codegen performance
asterite committed Dec 16, 2016
1 parent 0ef2c6f commit c792139
Showing 1 changed file with 33 additions and 52 deletions.
85 changes: 33 additions & 52 deletions src/compiler/crystal/compiler.cr
@@ -248,29 +248,20 @@ module Crystal

     private def codegen(program, units : Array(CompilationUnit), lib_flags, output_filename, output_dir)
       object_names = units.map &.object_filename
-      multithreaded = LLVM.start_multithreaded
-
-      # First write bitcodes: it breaks if we paralellize it
-      unless multithreaded
-        Crystal.timing("Codegen (crystal)", @stats) do
-          units.each &.write_bitcode
-        end
-      end
-
-      msg = multithreaded ? "Codegen (bc+obj)" : "Codegen (obj)"
       target_triple = target_machine.triple

-      Crystal.timing(msg, @stats) do
+      Crystal.timing("Codegen (bc+obj)", @stats) do
         if units.size == 1
           first_unit = units.first

-          codegen_single_unit(program, first_unit, target_triple, multithreaded)
+          codegen_single_unit(program, first_unit, target_triple)

           if emit = @emit
             first_unit.emit(emit, emit_base_filename || output_filename)
           end
         else
-          codegen_many_units(program, units, target_triple, multithreaded)
+          codegen_many_units(program, units, target_triple)
         end
       end

@@ -288,37 +279,28 @@ module Crystal
         end
       end

-    private def codegen_many_units(program, units, target_triple, multithreaded)
+    private def codegen_many_units(program, units, target_triple)
       jobs_count = 0
       wait_channel = Channel(Nil).new(@n_threads)

-      while unit = units.pop?
-        fork_and_codegen_single_unit(program, unit, target_triple, multithreaded, wait_channel)
+      units.each_slice(Math.max(units.size / @n_threads, 1)) do |slice|
         jobs_count += 1

-        if jobs_count >= @n_threads
-          wait_channel.receive
-          jobs_count -= 1
+        spawn do
+          codegen_process = fork do
+            slice.each do |unit|
+              codegen_single_unit(program, unit, target_triple)
+            end
+          end
+          codegen_process.wait
+          wait_channel.send nil
         end
       end

-      while jobs_count > 0
-        wait_channel.receive
-        jobs_count -= 1
-      end
-    end
-
-    private def fork_and_codegen_single_unit(program, unit, target_triple, multithreaded, wait_channel)
-      spawn do
-        codegen_process = fork { codegen_single_unit(program, unit, target_triple, multithreaded) }
-        codegen_process.wait
-        wait_channel.send nil
-      end
+      jobs_count.times { wait_channel.receive }
     end

-    private def codegen_single_unit(program, unit, target_triple, multithreaded)
+    private def codegen_single_unit(program, unit, target_triple)
       unit.llvm_mod.target = target_triple
-      unit.write_bitcode if multithreaded
       unit.compile
     end

@@ -420,16 +402,11 @@ module Crystal
       end
     end

-    def write_bitcode
-      llvm_mod.write_bitcode(bc_name_new)
-    end
-
     def compile
       bc_name = self.bc_name
-      bc_name_new = self.bc_name_new
       object_name = self.object_name

-      must_compile = true
+      memory_buffer = llvm_mod.write_bitcode

       # To compile a file we first generate a `.bc` file and then
       # create an object file from it. These `.bc` files are stored
@@ -442,18 +419,26 @@
       # the `.o` file will also be the same, so we simply reuse the
       # old one. Generating an `.o` file is what takes most time.
       if !compiler.emit && !@bc_flags_changed && File.exists?(bc_name) && File.exists?(object_name)
-        if FileUtils.cmp(bc_name, bc_name_new)
-          # If the user cancelled a previous compilation it might be that
-          # the .o file is empty
-          if File.size(object_name) > 0
-            File.delete bc_name_new
-            must_compile = false
-          end
+        memory_io = IO::Memory.new(memory_buffer.to_slice)
+        changed = File.open(bc_name) { |bc_file| !FileUtils.cmp(bc_file, memory_io) }
+
RX14 (Contributor) commented on Dec 16, 2016:
Would it be better to store a checksum in the filename, or other metadata, then simply checksum the module in-memory? I imagine it would be faster.
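
For illustration, here is a rough sketch of that idea in Crystal, not code from the commit: store a small digest file next to the cached object file and compare it against a digest of the freshly generated bitcode. The checksum_name helper and the choice of MD5 are assumptions for the sketch, not part of the compiler.

require "crypto/md5"

# Hypothetical helper: true when the cached .o can be reused because the stored
# digest matches the digest of the bitcode we just generated.
def object_reusable?(checksum_name : String, object_name : String, bitcode : Bytes) : Bool
  return false unless File.exists?(checksum_name) && File.exists?(object_name)
  stored = File.read(checksum_name).strip
  fresh = Crypto::MD5.hex_digest(String.new(bitcode))
  stored == fresh && File.size(object_name) > 0
end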

asterite (Author, Member) commented on Dec 16, 2016:
We can store a separate checksum. The problem is, the case where we can optimize is when the contents are exactly the same, and an equal checksum doesn't imply the contents are identical... (that case is very unlikely, but possible)

RX14 (Contributor) commented on Dec 16, 2016:
Well, if you get a sha256sum collision then you're pretty much the luckiest person in the universe.

lribeiro commented on Dec 17, 2016:
@RX14 I second that: the odds of a sha256 collision are insanely low. And on the same file with just a few changes, it would be practically impossible.

refi64 (Contributor) commented on Dec 17, 2016:
And, for comparison, it's over 40 times more likely that an asteroid will hit and kill everything on earth than for an SHA-256 collision to happen.

asterite (Author, Member) commented on Dec 17, 2016:
I can try storing an MD5 instead of the bitcode. I never said sha256, where did that come from? :-)

We probably can't use sha256 in the compiler because then we'll depend on openssl, something we are avoiding right now.

I guess MD5 will work well too?

RX14 (Contributor) commented on Dec 17, 2016:
@asterite We don't need a cryptographic hash, so maybe it would be best to pick a hash function based on speed/simplicity.
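
As a concrete illustration of the kind of simple, non-cryptographic hash being suggested (written with current Crystal's wrapping-multiply operator; nothing in the commit uses this):

# FNV-1a, 64-bit: a tiny, fast, non-cryptographic hash over a byte slice.
def fnv1a_64(data : Bytes) : UInt64
  hash = 0xcbf29ce484222325_u64
  data.each do |byte|
    hash ^= byte
    hash = hash &* 0x100000001b3_u64 # wrapping multiply by the FNV-1a prime
  end
  hash
end

fnv1a_64("hello world".to_slice) # => a 64-bit digest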

asterite (Author, Member) commented on Dec 17, 2016:
Well, I just tried using MD5 checksums but it's about 3 times slower than what we have now. Computing MD5 is slower than just comparing raw bytes from a previous compilation...

RX14 (Contributor) commented on Dec 17, 2016:
@asterite that doesn't sound right to me. I would have thought that hashing a value in memory should be much faster than reading it from disk to compare...

asterite (Author, Member) commented on Dec 17, 2016:
Hm, I forgot to compile the compiler in release mode to test this. However...

require "file_utils"
require "crypto/md5"

time = Time.now
eq = File.open("big.log") do |f1|
  File.open("big.log") do |f2|
    FileUtils.cmp(f1, f2)
  end
end
puts "compare read: #{Time.now - time}"
p eq

string = File.read("big.log")
time = Time.now
Crypto::MD5.hex_digest(string)
puts "md5: #{Time.now - time}"

$ du -h big.log
109M	big.log

Output, with --release:

compare read: 00:00:00.0552930
true
md5: 00:00:01.3015940

So it does seem to be slower to compute an md5 hash than to compare files.

RX14 (Contributor) commented on Dec 17, 2016:
@asterite That's an unfair test: it uses streaming IO on one side but reads the whole file into memory on the other.

RX14 (Contributor) commented on Dec 17, 2016:
Oh, but it doesn't include the time of reading into memory...

asterite (Author, Member) commented on Dec 17, 2016:
Yes. Also, LLVM just writes everything to a memory buffer and then you have to write it back out. Maybe if we could stream from LLVM directly into an md5 hash it would be faster...

I must say that I was kind of surprised that hashing with md5 turned out to be slower than comparing files... maybe I'm doing something wrong and not noticing it.

RX14 (Contributor) commented on Dec 17, 2016:
Using openssl speed md5 I'm getting around 700MB/s on md5, so this just seems to be Crystal being super slow.

The 'numbers' are in 1000s of bytes per second processed.
type             16 bytes     64 bytes    256 bytes   1024 bytes   8192 bytes
md5              69275.87k   205260.59k   452296.87k   648269.82k   742446.42k
sha1             77490.51k   223112.04k   491776.77k   709430.27k   821138.77k

+        # If the user cancelled a previous compilation
+        # it might be that the .o file is empty
+        if !changed && File.size(object_name) > 0
+          # We can skip compilation
+          memory_buffer.dispose
+          memory_buffer = nil
+        else
+          # We need to compile, so we'll write the memory buffer to file
         end
       end

-      if must_compile
-        File.rename(bc_name_new, bc_name)
+      # If there's a memory buffer, it means we must create a .o from it
+      if memory_buffer
+        # Create the .bc file (for next compilations)
+        File.write(bc_name, memory_buffer.to_slice)
+        memory_buffer.dispose
+
         compiler.optimize llvm_mod if compiler.release?
         compiler.target_machine.emit_obj_to_file llvm_mod, object_name
       end
@@ -492,10 +477,6 @@ module Crystal
       "#{@output_dir}/#{@name}.bc"
     end

-    def bc_name_new
-      "#{@output_dir}/#{@name}.new.bc"
-    end
-
     def ll_name
       "#{@output_dir}/#{@name}.ll"
     end
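
To summarize the caching scheme the diff above introduces, here is a condensed, self-contained sketch; it is simplified (the emit and --release paths are omitted) and the helper name is made up:

require "file_utils"

# Simplified restatement of the new compile path: the bitcode stays in memory,
# is compared against the cached .bc on disk, and the .o is only regenerated
# when the bitcode actually changed.
def object_file_reusable?(bc_name : String, object_name : String, bitcode : Bytes) : Bool
  return false unless File.exists?(bc_name) && File.exists?(object_name)
  memory_io = IO::Memory.new(bitcode)
  unchanged = File.open(bc_name) { |bc_file| FileUtils.cmp(bc_file, memory_io) }
  # An empty .o means a previous compilation was cancelled half-way.
  unchanged && File.size(object_name) > 0
end

When this returns false, the buffer is written out as the new .bc and the object file is emitted again, which is the expensive step.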

6 comments on commit c792139

sdogruyol (Member) commented on c792139, Dec 16, 2016:

From what I've seen over 5 compilations of Kemal, this brings great improvements after the first compile.

0.20.1

crystal build --release --stats src/kemal.cr  3.59s user 0.15s system 101% cpu 3.683 total

Master

/crystal/bin/crystal build --release --stats   0.83s user 0.14s system 106% cpu 

πŸŽ‰

asterite (Author, Member) commented:

@sdogruyol Cool! What if you compile many times without --release?

luislavena (Contributor) commented:

asterite (Author, Member) commented:

@luislavena Nice! Is this on Linux or Mac? I expect performance on Linux to improve a lot more than on Mac, mostly because now only 8 forks are done instead of thousands.
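
For reference, a self-contained sketch of the fork-per-slice pattern described here, written against the commit-era stdlib (top-level fork with a block, integer division); the names are illustrative, not the compiler's own:

def compile_in_slices(units : Array(String), n_threads : Int32)
  wait_channel = Channel(Nil).new(n_threads)
  jobs_count = 0

  # Split the units into at most n_threads slices; each slice becomes one
  # child process instead of one process per unit.
  units.each_slice(Math.max(units.size / n_threads, 1)) do |slice|
    jobs_count += 1
    spawn do
      child = fork do
        slice.each { |unit| puts "codegen #{unit}" } # stand-in for real codegen
      end
      child.wait
      wait_channel.send nil
    end
  end

  # Wait until every forked slice has finished.
  jobs_count.times { wait_channel.receive }
end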

luislavena (Contributor) commented:

@asterite Linux (4.2.0 x86_64) (sorry for delayed response πŸ˜„)

ysbaddaden (Contributor) commented on c792139, Dec 18, 2016:

Some quick compilations of Crystal (master) on Linux (Ubuntu 14.04, x86_64) without optimizations:

# crystal 0.20.1 (empty cache):
real    0m44.628s
user    0m36.692s
sys     0m42.833s

# master (empty cache):
real    0m30.164s
user    0m48.214s
sys     0m2.035s

# master (cache):
real    0m19.821s
user    0m22.709s
sys     0m1.577s

I note that sys time goes from above 40s to below 2s! There go the slow Linux forks. With a cache, compiling Crystal goes down from 44s to 20s, which is more than twice as fast overall. Good job!
