# AI100 Text classification introduction

Here is a comparison of different strategies for the text classification task on [AI100](http://competition.ai100.com.cn/html/game_det.html?id=24&tab=1). The final submission took [third place](http://geek.ai100.com.cn/2017/06/01/1665).

The notebook has three parts: part 1 (bag of words), part 2 (LDA) and part 3 (word2vec).

# Part 1: Bag of words

First we try a statistical machine learning strategy, i.e., [bag of words (BoW)](https://en.wikipedia.org/wiki/Bag-of-words_model) combined with different algorithms.

For unstructured text data, we first need to extract numerical feature vectors.

Here we use bag of words to represent text. The training data (corpus) is converted to a matrix in which each row represents a document and each column represents a word from the vocabulary of the whole corpus. Each element of the matrix can be a word count or a tf-idf weight. Please refer to [the scikit-learn tutorial](http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html) for more details.

Note that this matrix is **high-dimensional and sparse**!
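
To make the representation concrete, here is a minimal pure-Python sketch of a document-term count matrix. The toy corpus is made up for illustration and is assumed to be already segmented into space-separated words; the notebook itself uses `CountVectorizer` for this step.

```python
from collections import Counter

# Toy corpus (hypothetical): each document is already segmented into words.
docs = [
    "apple banana apple",
    "banana cherry",
    "cherry cherry durian",
]

# Vocabulary: every distinct word in the corpus, in sorted order.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Document-term matrix: one row per document, one column per vocabulary word.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocabulary])

# Count how many entries are zero to see the sparsity.
zeros = sum(row.count(0) for row in matrix)
total = len(matrix) * len(vocabulary)

print(vocabulary)
for row in matrix:
    print(row)
print("fraction of zero entries: %.2f" % (zeros / float(total)))
```

Even in this tiny corpus half of the entries are zero; with a real vocabulary of tens of thousands of words almost all of them are, which is why a sparse matrix format is used.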

## Step 1: reading input and text segmentation

In this section, we will import necessary packages first (such as [scikit-learn](http://scikit-learn.org/), [pandas](http://pandas.pydata.org/), [numpy](http://www.numpy.org/), [jieba](https://github.com/fxsjy/jieba)), load input data and perform Chinese text segmentation.

In [None]:
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import jieba as jb
import codecs
import os
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
pd.set_option('display.max_colwidth', None)  # show full cell content (older pandas releases used -1 here)

In [None]:
data_dir = 'materials' # input file directory
# read input
training_data = pd.read_csv(os.path.join(data_dir, 'training.csv'), names=['text_label', 'text_content'], encoding='utf8')

In [None]:
# size of training data
training_data.shape

In [None]:
# take a glance 
training_data

In [None]:
# count the number of documents in each category
# we see that the training data is imbalanced
training_data['text_label'].value_counts()#.plot(kind='bar')
# or training_data.groupby(training_data['text_label']).count().plot(kind='bar')

In [None]:
# helper function: read a UTF-8 text file into a list of lines
def read_file(file_path):
    # the context manager closes the file even if reading fails
    with codecs.open(file_path, encoding='utf-8') as f:
        return [line.rstrip('\n').rstrip('\r') for line in f]

# a set makes the per-word stop word lookup fast
stopwordsCN = set(read_file(os.path.join(data_dir, 'stopWords_cn.txt')))

# helper function: text segmentation
def cut_content(each_row):
    # segment with jieba, drop stop words, and join with spaces for CountVectorizer
    return ' '.join([word for word in jb.lcut(each_row['text_content']) if word not in stopwordsCN])

In [None]:
# manually add domain-specific words to the jieba dictionary
new_words = read_file(os.path.join(data_dir, 'ai100_words.txt'))
for word in new_words:
    jb.add_word(word)

# perform text segmentation
training_data['text_content_segmentation'] = training_data.apply(cut_content, axis=1)
#testing_data.head(10)
#selected_index = range(1001, 1200)
#print(selected_index)
#training_data.iloc[selected_index]

In [None]:
# look at the first ten rows of the segmentation results
training_data.head(10)['text_content_segmentation']

## Step 2: extracting features

In this section, we will extract tf-idf numerical features using the [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [TfidfTransformer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html) classes.
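
As a worked example of the weighting, the sketch below computes tf-idf by hand for a toy count matrix. It assumes the default `TfidfTransformer` settings (raw term counts, smoothed idf `ln((1 + n) / (1 + df)) + 1`, L2 row normalization); the numbers are made up for illustration.

```python
import math

# Toy count matrix (hypothetical): rows are documents, columns are terms.
counts = [
    [2, 1, 0],
    [0, 1, 1],
    [0, 1, 3],
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents that contain each term.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency.
idf = [math.log((1.0 + n_docs) / (1.0 + d)) + 1.0 for d in df]

def tfidf_row(row):
    """Weight term counts by idf, then scale the row to unit L2 norm."""
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    return [v / norm for v in weighted] if norm > 0 else weighted

tfidf = [tfidf_row(row) for row in counts]

print("df: ", df)
print("idf:", [round(w, 3) for w in idf])
for row in tfidf:
    print([round(v, 3) for v in row])
```

The term in the middle column appears in every document, so it gets the lowest idf and contributes least to each row.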

In [None]:
# word count feature
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(training_data['text_content_segmentation'])
# play with X_train_counts: https://de.dariah.eu/tatom/working_with_text.html
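
One detail worth checking for Chinese text: the default `token_pattern` of `CountVectorizer` only keeps tokens of two or more word characters, so single-character words produced by the segmentation step are silently dropped. The sketch below applies that default regular expression, and a single-character-friendly alternative, to a made-up segmented sentence.

```python
import re

# Default token_pattern of CountVectorizer: two or more word characters.
default_pattern = r"(?u)\b\w\w+\b"
# Alternative pattern that also keeps single-character tokens.
keep_single_pattern = r"(?u)\b\w+\b"

# Hypothetical output of the segmentation step (space-separated words).
segmented = u"我 爱 北京 天安门"

default_tokens = re.findall(default_pattern, segmented)
all_tokens = re.findall(keep_single_pattern, segmented)

print(default_tokens)
print(all_tokens)
```

If single-character words matter for the task, passing `token_pattern=r"(?u)\b\w+\b"` to `CountVectorizer` is one option to try; whether it helps accuracy here has not been measured.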

In [None]:
type(X_train_counts)

In [None]:
X_train_counts.shape

In [None]:
count_vect.vocabulary_#.get(u'企业')

In [None]:
# tf-idf feature
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

In [None]:
type(X_train_tfidf)

In [None]:
X_train_tfidf.shape

## Step 3: build classifier

In this section, we build the model with a linear [Support Vector Machine (SVM)](https://github.com/jakevdp/sklearn_pycon2015/blob/master/notebooks/03.1-Classification-SVMs.ipynb), trained with stochastic gradient descent via `SGDClassifier`.
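
To show what `loss='hinge'` with `penalty='l2'` means, here is a simplified sketch of stochastic subgradient descent for a binary linear SVM. It is not how `SGDClassifier` is implemented: it uses a fixed learning rate and two classes only, whereas scikit-learn uses a learning-rate schedule and one-vs-rest for multiple classes. The data and the helper names are hypothetical.

```python
def score(w, b, x):
    """Linear decision function w.x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def hinge_loss(w, b, x, y):
    """Hinge loss max(0, 1 - y * (w.x + b)) for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * score(w, b, x))

def sgd_epoch(w, b, data, lr=0.1, alpha=1e-3):
    """One pass of stochastic subgradient descent on hinge loss + L2 penalty."""
    for x, y in data:
        margin = y * score(w, b, x)
        # The L2 penalty always shrinks the weights a little.
        w = [wi * (1.0 - lr * alpha) for wi in w]
        # The hinge term contributes only when the margin is violated.
        if margin < 1.0:
            w = [wi + lr * y * xi for wi, xi in zip(w, x)]
            b = b + lr * y
    return w, b

# Toy, linearly separable data: (features, label).
data = [([2.0, 0.0], 1), ([1.5, 0.5], 1), ([0.0, 2.0], -1), ([0.5, 1.5], -1)]

w, b = [0.0, 0.0], 0.0
for _ in range(20):
    w, b = sgd_epoch(w, b, data)

predictions = [1 if score(w, b, x) > 0 else -1 for x, _ in data]
print(predictions)
```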

In [None]:
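# linear SVM
# http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
# note: newer scikit-learn releases use max_iter instead of n_iter; tol=None keeps a fixed number of epochs
svm_clf = SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, max_iter=5, tol=None)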

In [None]:
# 10-fold cross-validation results
# http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html
cv_scores = cross_val_score(svm_clf, X_train_tfidf, training_data['text_label'], cv=KFold(n_splits=10), n_jobs=-1)

In [None]:
cv_scores

In [None]:
np.mean(cv_scores)

## Step 4: pipeline

In this section, we build a [pipeline](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline) for feature extraction and model building.
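
One benefit of a pipeline is that `cross_val_score` refits every step on each training fold, so the vocabulary and idf statistics never come from held-out documents (in Step 3 the tf-idf matrix was computed on the full training set before cross-validation). The pure-Python sketch below illustrates this fit/transform split; `fit_vocabulary` and `transform` are hypothetical helpers, not scikit-learn functions.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Learn the vocabulary from the training documents only."""
    return sorted({word for doc in docs for word in doc.split()})

def transform(docs, vocabulary):
    """Count vocabulary words per document; unknown words are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        rows.append([counts[word] for word in vocabulary])
    return rows

# Hypothetical folds of an already segmented corpus.
train_fold = ["stock market rises", "team wins match"]
held_out_fold = ["market match postponed"]

vocabulary = fit_vocabulary(train_fold)          # fit on the training fold only
X_train = transform(train_fold, vocabulary)
X_valid = transform(held_out_fold, vocabulary)   # "postponed" is not in the vocabulary

print(vocabulary)
print(X_train)
print(X_valid)
```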

In [None]:
text_clf = Pipeline([('word_counter', CountVectorizer()),
                    ('tfidf_computer', TfidfTransformer()),
                    ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, max_iter=5, tol=None))])

In [None]:
text_clf = text_clf.fit(training_data['text_content_segmentation'], training_data['text_label'])

In [None]:
cv_scores = cross_val_score(text_clf, training_data['text_content_segmentation'], training_data['text_label'], cv=KFold(n_splits=10), n_jobs=-1)

In [None]:
np.mean(cv_scores)

## Step 5: predicting

In this section, the above model is used to make predictions.

In [None]:
# read the test data
testing_data = pd.read_csv(os.path.join(data_dir, 'testing.csv'), names=['text_index', 'text_content'], encoding='utf-8')

In [None]:
testing_data.shape

In [None]:
testing_data['text_content_segmentation'] = testing_data.apply(cut_content, axis=1)

In [None]:
#testing_data.head(10)
selected_index = list(range(81, 90))
print(selected_index)
testing_data.iloc[selected_index]

In [None]:
predict_res = text_clf.predict(testing_data['text_content_segmentation'])

In [None]:
predict_res

In [None]:
# save to csv file
np.savetxt(os.path.join(data_dir, 'results.csv'), np.dstack((np.arange(1, predict_res.size+1), predict_res))[0],"%d,%d")
# submit accuracy: 0.836
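
The `np.dstack` call above pairs each prediction with a 1-based row index and writes the pairs as `index,label` lines. A more explicit way to produce the same layout is sketched below with the standard `csv` module; the prediction values are made up, and an in-memory buffer stands in for the output file.

```python
import csv
import io

# Hypothetical predicted labels for three test documents.
predictions = [3, 1, 7]

# StringIO stands in for an open file such as results.csv.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
for index, label in enumerate(predictions, start=1):
    writer.writerow([index, label])

output = buffer.getvalue()
print(output)
```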

## Step 6: debugging

Several directions for improvement are possible:

[1] model parameter tuning, richer features (n-grams or word2vec)

[2] feature selection

[3] try XGBoost and liblinear

[4] model ensemble

[5] deal with imbalanced dataset

[6] better text segmentation vocabulary

### The following code shows the solution for [1].

In [None]:
text_clf = Pipeline([('word_counter', CountVectorizer()),
                    ('tfidf_computer', TfidfTransformer()),
                    ('clf', SGDClassifier(loss='hinge'))])
parameters = {
    'word_counter__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'word_counter__min_df': (1, 2, 3, 4, 5),
    
    'tfidf_computer__norm': ('l1', 'l2'),
    'tfidf_computer__use_idf': (True, False),
    'tfidf_computer__smooth_idf': (True, False),
    'tfidf_computer__sublinear_tf': (True, False)
}

gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)
gs_clf = gs_clf.fit(training_data['text_content_segmentation'], training_data['text_label'])

print(gs_clf.best_score_)
for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))
# submit accuracy: 0.858
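
Before launching a grid search it helps to know how many models it will train. The sketch below counts the combinations in the grid above with `itertools.product`; the number of folds is an assumption, since the `GridSearchCV` default depends on the scikit-learn version.

```python
from itertools import product

# Same grid as in the cell above.
parameters = {
    'word_counter__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'word_counter__min_df': (1, 2, 3, 4, 5),
    'tfidf_computer__norm': ('l1', 'l2'),
    'tfidf_computer__use_idf': (True, False),
    'tfidf_computer__smooth_idf': (True, False),
    'tfidf_computer__sublinear_tf': (True, False),
}

# Every combination of one value per parameter.
combinations = list(product(*parameters.values()))

n_folds = 5  # assumption: depends on the installed scikit-learn version
n_fits = len(combinations) * n_folds

print("parameter combinations:", len(combinations))
print("model fits with %d folds: %d" % (n_folds, n_fits))
```

Note that `smooth_idf` has no effect when `use_idf` is `False`, so part of this grid is redundant and could be pruned to save time.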

### We try [2] feature selection.

In [None]:
from sklearn.feature_selection import SelectKBest, chi2

text_clf = Pipeline([('word_counter', CountVectorizer()),
                    ('tfidf_computer', TfidfTransformer()),
                    ('select_feature', SelectKBest(chi2)),
                    ('clf', SGDClassifier(loss='hinge', penalty='elasticnet', class_weight='balanced', alpha=1e-05, max_iter=20, tol=None))])

parameters = {
    # http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
    #'word_counter__ngram_range': [(1, 1), (1, 2)],
    #'word_counter__min_df': (1, 2, 3, 4, 5),

    # http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html
    #'tfidf_computer__smooth_idf': (True, False),
    #'tfidf_computer__sublinear_tf': (True, False),

    #'select_feature__score_func': (),
    'select_feature__k': (10000, 20000, 25000) #50000, 100000, 150000

    # http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
    #'clf__penalty': ('l1', 'l2', 'elasticnet'),
    #'clf__alpha': (0.001, 0.0001, 0.00001),
    #'clf__n_iter': (20, 30, 50, 100, 200),
    #'clf__class_weight': ('balanced', None)
    #'clf__C': (1, 10, 100, 1000),
    #'clf__penalty': ('l1', 'l2'),
    #'clf__loss': ('hinge', 'squared_hinge')
    #'clf__multi_class': ('ovr', 'crammer_singer'),
    #'clf__max_iter': (1000, 2000)
}

print('start to grid search')
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1, cv=10)
#X_train, X_test, y_train, y_test = train_test_split(training_data['text_content_segmentation'], training_data['text_label'])

gs_clf.fit(training_data['text_content_segmentation'], training_data['text_label'])
#gs_clf.fit(X_train, y_train)

print('best results:')
print(gs_clf.best_score_)

for param_name in sorted(parameters.keys()):
    print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))
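
For intuition about what `SelectKBest(chi2)` ranks, the sketch below computes a chi-square score per term on a toy count matrix: the observed total of a term in each class is compared with the total expected if the term were independent of the class. This is a simplified pure-Python illustration under that definition, with made-up data, and not the library code.

```python
def chi2_scores(X, y):
    """Chi-square score of each feature column against the class labels."""
    classes = sorted(set(y))
    n_samples = len(X)
    n_features = len(X[0])
    feature_total = [sum(row[j] for row in X) for j in range(n_features)]
    scores = [0.0] * n_features
    for c in classes:
        class_prob = sum(1 for label in y if label == c) / float(n_samples)
        for j in range(n_features):
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            expected = class_prob * feature_total[j]
            if expected > 0:
                scores[j] += (observed - expected) ** 2 / expected
    return scores

# Toy data: term 0 occurs only in class "a", term 1 is spread evenly.
X = [[3, 1], [2, 1], [0, 1], [0, 1]]
y = ["a", "a", "b", "b"]

scores = chi2_scores(X, y)
ranking = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)

print(scores)
print("features ranked by score:", ranking)
```

The evenly spread term scores zero and would be the first to be discarded when only the top `k` features are kept.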

### We then [3] try XGBoost and liblinear.

[XGBoost](http://xgboost.readthedocs.io/en/latest/) is a very popular package in [Kaggle](https://www.kaggle.com/) competitions. Here, we have a look at the power of XGBoost. However, XGBoost does not seem to work well on these high-dimensional, sparse features.

In [None]:
# references: 
# http://machinelearningmastery.com/develop-first-xgboost-model-python-scikit-learn/
# https://jessesw.com/XG-Boost/

import xgboost as xgb

count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(training_data['text_content_segmentation'])

tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)


cv_params = {
    'max_depth': [3, 4],
    'min_child_weight': [1, 3]
}

init_params = {'learning_rate': 0.1, 'n_estimators': 1000, 'seed':0, 'subsample': 0.8, 'colsample_bytree': 0.8, 'objective': 'multi:softmax'}

print('start to gs:')
gs_clf = GridSearchCV(xgb.XGBClassifier(**init_params), cv_params, scoring='accuracy', n_jobs=-1)
# toarray() returns a plain ndarray; note that densifying a large tf-idf matrix is memory-hungry
gs_clf.fit(X_train_tfidf.toarray(), training_data['text_label'])

print('best results:')
print(gs_clf.best_score_)

for param_name in sorted(cv_params.keys()):
    print("%s: %r" % (param_name, gs_clf.best_params_[param_name]))
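
Converting the tf-idf matrix to a dense array is costly because every zero is stored explicitly. The back-of-the-envelope sketch below compares the memory of a dense float64 matrix with a compressed sparse row (CSR) layout; the corpus size, vocabulary size and number of non-zeros per document are hypothetical, not measured on the AI100 data.

```python
def dense_bytes(n_rows, n_cols, value_size=8):
    """Memory of a dense matrix that stores every entry."""
    return n_rows * n_cols * value_size

def csr_bytes(n_rows, n_nonzero, value_size=8, index_size=4):
    """Memory of a CSR matrix: values, column indices and row pointers."""
    return n_nonzero * (value_size + index_size) + (n_rows + 1) * index_size

# Hypothetical sizes for illustration only.
n_docs, n_terms, nonzero_per_doc = 5000, 100000, 200

dense = dense_bytes(n_docs, n_terms)
sparse = csr_bytes(n_docs, n_docs * nonzero_per_doc)

print("dense:  %.1f MB" % (dense / 1e6))
print("sparse: %.1f MB" % (sparse / 1e6))
```

Under these assumptions the dense copy is several hundred times larger. XGBoost documents SciPy sparse matrices as an accepted input type, so passing the sparse matrix directly may be worth trying.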


### Now, we do [4] model ensemble.

In [None]:
# references: 
# http://sebastianraschka.com/Articles/2014_ensemble_classifier.html
# https://stats.stackexchange.com/questions/190151/ensembling-with-votingclassifier
# http://machinelearningmastery.com/ensemble-machine-learning-algorithms-python-scikit-learn/
# http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html


from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

estimators = []
svm_clf = SGDClassifier(loss='hinge', penalty='l2', alpha=0.0001, max_iter=5, tol=None)
estimators.append(('svm', svm_clf))

mnb = MultinomialNB(alpha=0.1)
estimators.append(('nb', mnb))

rf_clf = RandomForestClassifier(n_estimators=50)
estimators.append(('rf', rf_clf))

voting = VotingClassifier(estimators)

text_clf = Pipeline([('word_counter', CountVectorizer(ngram_range=(1, 2))),
                     ('tf_idf_computer', TfidfTransformer()),
                     ('clf', voting)])

cv_scores = cross_val_score(text_clf, training_data['text_content_segmentation'], training_data['text_label'], cv=KFold(n_splits=3), n_jobs=-1)
ss = text_clf.fit(training_data['text_content_segmentation'], training_data['text_label'])

print(cv_scores)
print(np.mean(cv_scores))
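
`VotingClassifier` uses hard (majority) voting by default: each model predicts a label and the most frequent label wins, with ties resolved in ascending label order. A minimal sketch of that rule, using made-up predictions from three models:

```python
from collections import Counter

def hard_vote(predictions_per_model):
    """Majority vote per sample; ties go to the smallest label."""
    n_samples = len(predictions_per_model[0])
    voted = []
    for i in range(n_samples):
        votes = Counter(model_preds[i] for model_preds in predictions_per_model)
        best = max(votes.values())
        voted.append(min(label for label, count in votes.items() if count == best))
    return voted

# Hypothetical predictions of three models on four documents.
svm_preds = [1, 2, 3, 1]
nb_preds = [1, 3, 3, 2]
rf_preds = [2, 2, 3, 3]

ensemble = hard_vote([svm_preds, nb_preds, rf_preds])
print(ensemble)
```

On the last document all three models disagree, so the tie rule decides; an ensemble helps most when its members make different mistakes.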

### And we [5] deal with imbalanced dataset.

In [None]:
# references: 
# http://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/
# http://blog.csdn.net/heyongluoyao8/article/details/49408131
# http://www.jianshu.com/p/3e8b9f2764c8
# https://github.com/ThoughtWorksInc/dataclouds/blob/master/source/_posts/%E4%B8%8D%E5%B9%B3%E8%A1%A1%E6%95%B0%E6%8D%AE%E5%A4%84%E7%90%86.md

'''
# subsampling
sample_size = 200

def sample_data(each_group, sam_size):
    if len(each_group.index) > sam_size:
        return each_group.loc[np.random.choice(each_group.index, sam_size, False),:]
    else:
        return each_group
    
banlanced_data = training_data.groupby(training_data['text_label']).apply(lambda x: sample_data(x, sample_size))

pd.value_counts(banlanced_data['text_label'].values)
text_clf = text_clf.fit(banlanced_data['text_content_segmentation'], banlanced_data['text_label'])
cv_scores = cross_val_score(text_clf, banlanced_data['text_content_segmentation'], banlanced_data['text_label'], cv=KFold(n_splits=10), n_jobs=-1)
np.mean(cv_scores)


'''

from collections import Counter
from imblearn.over_sampling import RandomOverSampler
count_vect = CountVectorizer(min_df=1)
X_train_counts = count_vect.fit_transform(training_data['text_content_segmentation'])

tfidf_transformer = TfidfTransformer(smooth_idf=True, sublinear_tf=True)
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

print('RandomOverSampler')
#sm = SMOTEENN()
#X_resampled, y_resampled = sm.fit_resample(X_train_tfidf.toarray(), training_data['text_label'])

ros = RandomOverSampler(random_state=42)
# newer imbalanced-learn releases name this method fit_resample (formerly fit_sample)
X_resampled, y_resampled = ros.fit_resample(X_train_tfidf.toarray(), training_data['text_label'])
print('Resampled dataset shape {}'.format(Counter(y_resampled)))
#Resampled dataset shape Counter({1: 1271, 2: 1271, 3: 1271, 4: 1271, 5: 1271, 6: 1271, 7: 1271, 8: 1271, 9: 1271, 10: 1271, 11: 1271})

print('svm')
text_clf = SGDClassifier(loss='hinge', penalty='l2', class_weight='balanced', max_iter=20, tol=None)
text_clf.fit(X_resampled, y_resampled)

print('predict')
testing_data = pd.read_csv(os.path.join(data_dir, 'testing.csv'), names=['text_index', 'text_content'], encoding='utf-8')
testing_data['text_content_segmentation'] = testing_data.apply(cut_content, axis=1)

# reuse the vectorizer and transformer fitted on the training data
X_new_counts = count_vect.transform(testing_data['text_content_segmentation'])
X_new_tfidf = tfidf_transformer.transform(X_new_counts)


tested_predicted = text_clf.predict(X_new_tfidf.toarray())
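
Random oversampling simply duplicates documents from the smaller classes until every class is as large as the biggest one. The pure-Python sketch below shows the idea on made-up data; `random_oversample` is a hypothetical helper, not the imbalanced-learn implementation.

```python
import random
from collections import Counter, defaultdict

def random_oversample(samples, labels, seed=42):
    """Duplicate random samples of each class up to the largest class size."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)
    target = max(len(group) for group in by_label.values())
    out_samples, out_labels = [], []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels

# Hypothetical imbalanced data: four documents in class 1, one each in 2 and 3.
samples = ["d1", "d2", "d3", "d4", "d5", "d6"]
labels = [1, 1, 1, 1, 2, 3]

resampled, resampled_labels = random_oversample(samples, labels)
print(Counter(resampled_labels))
```

When estimating accuracy with cross-validation, oversampling should be applied to the training folds only; otherwise copies of the same document can land in both the training and the validation fold and inflate the score.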


# Part 2: LDA

In [None]:
# to do...
# references: 
# https://gist.github.com/aronwc/8248457
# http://mccormickml.com/2016/03/25/lsa-for-text-classification-tutorial/
# https://github.com/chrisjmccormick/LSA_Classification/blob/master/runClassification_LSA.py

# Part 3: Word2vec

[Word2vec](https://en.wikipedia.org/wiki/Word2vec) is a group of related models (continuous bag-of-words (CBOW) and skip-gram) used to produce word embeddings, i.e., distributed representations of words.

Here, we use the [gensim](https://radimrehurek.com/gensim/) implementation of Word2vec.

## Solution 1: Word2vec + SVM

In [None]:
# references: 
# http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/
%load materials/word2vec/word2vec+svm.py
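
The loaded script is not reproduced in this notebook. A common way to feed Word2vec into an SVM, and an assumption about what the script does, is to average the vectors of the words in a document and use the mean as the feature vector. A minimal sketch with made-up two-dimensional embeddings:

```python
def mean_vector(tokens, embeddings, dim):
    """Average the embeddings of known tokens; zeros if none is known."""
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]

# Hypothetical word embeddings; a trained Word2vec model would supply these.
embeddings = {
    "good": [1.0, 0.0],
    "movie": [0.0, 1.0],
    "bad": [-1.0, 0.0],
}

doc_vector = mean_vector(["good", "movie", "unknown"], embeddings, dim=2)
empty_vector = mean_vector(["unknown"], embeddings, dim=2)

print(doc_vector)
print(empty_vector)
```

The resulting fixed-length vectors can be passed to any classifier in place of the tf-idf rows used in part 1.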

## Solution 2: Doc2vec + SVM

In [None]:
# references: 
#http://www.shuang0420.com/2016/06/01/gensim-doc2vec%E5%AE%9E%E6%88%98/
#https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-IMDB.ipynb
#http://stackoverflow.com/questions/31321209/doc2vec-how-to-get-document-vectors
#https://datascience.stackexchange.com/questions/10216/doc2vec-how-to-label-the-paragraphs-gensim
#https://groups.google.com/forum/#!topic/word2vec-toolkit/HpVnMfeo5PM
#https://github.com/linanqiu/word2vec-sentiments/blob/master/word2vec-sentiment.ipynb
#http://nadbordrozd.github.io/blog/2016/05/20/text-classification-with-word2vec/
%load materials/word2vec/doc2vec+svm.py

## Solution 3: FastText

Possibly because the training set is small, FastText reaches an accuracy of only about 75%.

Step 1: prepare training data

In [None]:
%load materials/Fasttext/AI100_step1_prepare_training_data.py
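
The preparation script is not shown here. For supervised training, fastText expects one document per line, with the label marked by a prefix (`__label__` by default) followed by the text. The sketch below formats made-up rows that way; `to_fasttext_line` is a hypothetical helper and may differ from what the loaded script does.

```python
def to_fasttext_line(label, segmented_text, prefix="__label__"):
    """Format one training example as '<prefix><label> <text>'."""
    return "%s%s %s" % (prefix, label, segmented_text)

# Hypothetical (label, segmented text) rows.
rows = [(3, "word1 word2 word3"), (1, "word4 word5")]

lines = [to_fasttext_line(label, text) for label, text in rows]
for line in lines:
    print(line)
```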

Step 2: training the model

In [None]:
%load materials/Fasttext/AI100_step2_training.py

Step 3: predicting

In [None]:
%load materials/Fasttext/AI100_step3_predicting.py

## Solution 4: Word2vec + CNN

To do...