Can someone explain why the posterior `z` is restricted to a diagonal Gaussian?

Maybe I do not understand this paper thoroughly, but can someone explain this?
The posterior `z` is modelled as a diagonal Gaussian, and the `Zero initialization` part says that it `ensures that the posterior distribution as a simple normal distribution`.
If the posterior is such a simple distribution, why is a complex prior flow needed to learn its distribution?