I have a simple MLP with three fully connected layers. I built it manually with `te` and specified the dtype for each layer. The exported library runs much slower at inference than the float32 version, and the exported .so is about twice as large as the float32 one. How is this possible?