Add split_length by token in preprocessor #4983

@yudataguy

Description

Is your feature request related to a problem? Please describe.
With LLMs like ChatGPT, text is measured in tokens, not words. To make the best use of embeddings and other token-limited components, it would be best to have a split-by-token feature.

Describe the solution you'd like
Add another split option, token, and measure chunk size in tokens.

Other packages like langchain and llama-index already have this feature.
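A minimal sketch of what a split-by-token option could look like, with a pluggable tokenizer. A real implementation would plug in a model tokenizer (e.g. tiktoken or a Hugging Face tokenizer); here a trivial whitespace tokenizer stands in so the sketch stays self-contained. The function name `split_by_tokens` and its parameters are illustrative, not the actual preprocessor API.

```python
from typing import Callable, List


def split_by_tokens(
    text: str,
    split_length: int,
    # Stand-in tokenizer/detokenizer; swap in a real model tokenizer
    # (e.g. tiktoken encode/decode) for accurate token counts.
    tokenize: Callable[[str], List[str]] = str.split,
    detokenize: Callable[[List[str]], str] = " ".join,
) -> List[str]:
    """Split `text` into chunks of at most `split_length` tokens."""
    tokens = tokenize(text)
    return [
        detokenize(tokens[i : i + split_length])
        for i in range(0, len(tokens), split_length)
    ]


chunks = split_by_tokens("one two three four five", split_length=2)
# → ["one two", "three four", "five"]
```

With a real tokenizer the chunk boundaries would fall on model tokens rather than whitespace words, which is what makes the chunks fit embedding-model limits exactly.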
