Machine learning for text classification
This notebook introduces a text classification project. The main task is to predict the topic of a short text, such as a question or a statement.
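- The labels cover 16 classes, which are described in data/label.csv: 生活 (life) | 心理学 (psychology) | 电影 (film) | 游戏 (games) | 恋爱 (romance) | 音乐 (music) | 大学 (university) | 心理 (mind) | 情感 (emotions) | 互联网 (internet) | 社会 (society) | 人际交往 (interpersonal relationships) | 教育 (education) | 汽车 (cars) | 医学 (medicine) | 法律 (law)
- The dataset is split into train (129,176 samples) and test (32,614 samples); both are in the data directory.
- The examples use a small subset of the data, e.g. 10,000 samples.
- The features are traditional statistical features such as TF-IDF.
- Model types: XGBoost / RandomForest
- Pipeline of this project: feature extraction | model training | parameter selection | data balancing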
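To make the TF-IDF feature concrete, here is a minimal sketch using only the Python standard library. It assumes the plain `tf * log(N / df)` weighting and pre-tokenized input; the toy documents and the `tfidf` helper are hypothetical and not part of the project code. Library implementations such as scikit-learn's `TfidfVectorizer` apply additional smoothing and normalization, so their values will differ.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Uses the unsmoothed weighting tf * log(N / df), which is an
    assumption for illustration, not necessarily the notebook's variant.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical, pre-tokenized toy questions (not taken from the project data).
docs = [
    ["how", "to", "learn", "guitar"],
    ["how", "to", "buy", "a", "car"],
    ["best", "guitar", "songs"],
]
vectors = tfidf(docs)

# Terms that occur in fewer documents receive a higher weight.
for term, weight in sorted(vectors[0].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

In the first document, "learn" occurs in only one document and therefore outweighs "how", which occurs in two. This is the property that lets a classifier focus on topic-specific words.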
For feature selection and parameter selection:
- use built-in functions such as CV (cross-validation) to choose the best parameters;
- use xgb.plot_importance to identify the most important features;
- use grid search to find better parameters.
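The combination of cross-validation and grid search can be sketched as follows. This is an illustration under stated assumptions, not the notebook's actual code: it uses scikit-learn's `GridSearchCV` with a `RandomForestClassifier` on a handful of hypothetical English samples, and the parameter grid is chosen arbitrarily. An XGBoost classifier with a scikit-learn compatible interface could be substituted in the `clf` step.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy samples; the real project reads its data from the data directory.
texts = [
    "best guitar songs for beginners", "how to tune a guitar",
    "new rock album review", "favorite piano music",
    "how to buy a used car", "car engine makes noise",
    "best electric car this year", "car insurance cost",
]
# Two of the 16 labels: 音乐 (music) and 汽车 (cars).
labels = ["音乐"] * 4 + ["汽车"] * 4

# Feature extraction and model training chained into one estimator.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(random_state=0)),
])

# Illustrative grid; the values are assumptions, not tuned settings.
param_grid = {
    "clf__n_estimators": [10, 50],
    "clf__max_depth": [None, 5],
}

# Every parameter combination is scored with 2-fold cross-validation.
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Wrapping the vectorizer and the classifier in one `Pipeline` means the TF-IDF vocabulary is refit on the training part of each fold, so no information from the validation part leaks into the features.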
You can run the notebook cells one after another, in order, to see what each step does.