# Source code for ANLP: NLP2021

<strong>An Experimental Evaluation of Japanese Tokenizers for Sentiment-Based Text Classification</strong><br/>
Andre Rusli & Makoto Shishido (Tokyo Denki University)

## Import Packages

In [1]:
import pandas as pd
import numpy as np
import sentencepiece as spm
from sudachipy import tokenizer
from sudachipy import dictionary
import MeCab
import time
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score

## Import Train (340K) and Test (40K) Data

In [2]:
df_train = pd.read_csv('../data/rakuten-sentiment-dataset/binary/sampled_binary_train.csv', header=None)
df_test = pd.read_csv('../data/rakuten-sentiment-dataset/binary/sampled_binary_test.csv', header=None)

print("Total training samples: ", len(df_train))
print("Total testing samples: ", len(df_test))

Total training samples:  340000
Total testing samples:  40000


In [3]:
df_train.head()

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1 | 臭い | 余りにも、匂いがきつく安物みたいです。\n安いから仕方ないかな？ |
| 1 | 1 | 残念…。 | マキシスカートのスリムタイプのグレーを購入したのですが、商品に不備があったとの事でカーキに変... |
| 2 | 2 | | 発送がメチャクチャ早くて\nびっくりしました(≧▽≦)\n包装も、丁寧で厳重で、信頼◎です(... |
| 3 | 2 | 玄関に飾りました | 見た目すっきりですが、クローバーのラインストーンがアクセントになっていていいと思います。\n... |
| 4 | 1 | うちの子にはダメ | とても良い商品だと思いますが、避妊手術をした推定１才4ヶ月で体重4キロの子には不向きでした。... |


In [4]:
df_test.head()

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 1 | 残念です… | 243254-20151129-0858902215\n生写真プレゼントに惹かれて他の予約商... |
| 1 | 1 | 1日で毛玉びっしり！ | 家で1日履いて夜お風呂で脱いだ時には毛玉びっしり！\n一回きり？もう外では履けない！\n起毛... |
| 2 | 2 | 食いつき良し | 当方のワンコはドーベルマンです。\nガムだと７日程度（ＬＬサイズ）、アキレスだとほんの１５分... |
| 3 | 1 | 残念 | Ｎ１１を購入。\nでも折り返し部分が幅広だし履いていてチクチク感があり痒い。\n箪笥の肥やしです。 |
| 4 | 2 | 使いやすいです | キッチンペーパーがむき出しのホルダーはたくさんあるけれど、キッチンペーパーが見えなくそのうえ... |


## Train SentencePiece Models from Train Data (vocab_size=8000, 16000, 32000)

In [21]:
start = time.time()
spm.SentencePieceTrainer.train(input='../data/rakuten-sentiment-dataset/binary/sampled_binary_train.csv', model_prefix='sp-model-16000-340k', vocab_size=16000)
end = time.time()
print('Time to train a SP model w/ 16000 vocab_size, from 340k train data:', end-start)

start = time.time()
spm.SentencePieceTrainer.train(input='../data/rakuten-sentiment-dataset/binary/sampled_binary_train.csv', model_prefix='sp-model-16000-340k', vocab_size=16000)
end = time.time()
print('Time to train a SP model w/ 16000 vocab_size, from 340k train data:', end-start)


Time to train a SP model w/ 16000 vocab_size, from 340k train data: 361.1047897338867
Time to train a SP model w/ 16000 vocab_size, from 340k train data: 360.28389596939087


In [22]:
start = time.time()
spm.SentencePieceTrainer.train(input='../data/rakuten-sentiment-dataset/binary/sampled_binary_train.csv', model_prefix='sp-model-16000-340k', vocab_size=16000)
end = time.time()
print('Time to train a SP model w/ 16000 vocab_size, from 340k train data:', end-start)

start = time.time()
spm.SentencePieceTrainer.train(input='../data/rakuten-sentiment-dataset/binary/sampled_binary_train.csv', model_prefix='sp-model-8000-340k', vocab_size=8000)
end = time.time()
print('Time to train a SP model w/ 8000 vocab_size, from 340k train data:', end-start)

start = time.time()
spm.SentencePieceTrainer.train(input='../data/rakuten-sentiment-dataset/binary/sampled_binary_train.csv', model_prefix='sp-model-8000-340k', vocab_size=8000)
end = time.time()
print('Time to train a SP model w/ 8000 vocab_size, from 340k train data:', end-start)

start = time.time()
spm.SentencePieceTrainer.train(input='../data/rakuten-sentiment-dataset/binary/sampled_binary_train.csv', model_prefix='sp-model-8000-340k', vocab_size=8000)
end = time.time()
print('Time to train a SP model w/ 8000 vocab_size, from 340k train data:', end-start)

Time to train a SP model w/ 16000 vocab_size, from 340k train data: 360.79864263534546
Time to train a SP model w/ 8000 vocab_size, from 340k train data: 389.89155197143555
Time to train a SP model w/ 8000 vocab_size, from 340k train data: 388.27756357192993
Time to train a SP model w/ 8000 vocab_size, from 340k train data: 388.90677785873413


In [24]:
start = time.time()
spm.SentencePieceTrainer.train(input='../data/rakuten-sentiment-dataset/binary/sampled_binary_train.csv', model_prefix='sp-model-32000-340k', vocab_size=32000)
end = time.time()
print('Time to train a SP model w/ 32000 vocab_size, from 340k train data:', end-start)

start = time.time()
spm.SentencePieceTrainer.train(input='../data/rakuten-sentiment-dataset/binary/sampled_binary_train.csv', model_prefix='sp-model-32000-340k', vocab_size=32000)
end = time.time()
print('Time to train a SP model w/ 32000 vocab_size, from 340k train data:', end-start)

start = time.time()
spm.SentencePieceTrainer.train(input='../data/rakuten-sentiment-dataset/binary/sampled_binary_train.csv', model_prefix='sp-model-32000-340k', vocab_size=32000)
end = time.time()
print('Time to train a SP model w/ 32000 vocab_size, from 340k train data:', end-start)

Time to train a SP model w/ 32000 vocab_size, from 340k train data: 313.60328674316406
Time to train a SP model w/ 32000 vocab_size, from 340k train data: 314.5716471672058
Time to train a SP model w/ 32000 vocab_size, from 340k train data: 313.32940673828125
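
The cells above repeat the same `start`/`end` pattern for every run. A minimal sketch of a reusable timer follows, assuming a hypothetical helper named `timed` (not part of the original notebook) and a stand-in workload in place of `spm.SentencePieceTrainer.train`. It collects the elapsed times so that repeated runs can be averaged.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Print the wall-clock time of the enclosed block and optionally record it."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            # Group repeated measurements under the same label
            results.setdefault(label, []).append(elapsed)
        print(f"{label}: {elapsed:.3f} s")


# Stand-in workload (hypothetical); in the notebook this would be the training call
timings = {}
for _ in range(3):
    with timed("toy workload", timings):
        sum(i * i for i in range(100_000))

runs = timings["toy workload"]
print(f"mean over {len(runs)} runs: {sum(runs) / len(runs):.3f} s")
```

Averaging several runs, as the notebook does by hand with three repetitions per vocabulary size, reduces the influence of one-off system noise on the reported time.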


## Tokenization Feature

In [5]:
# mecab w/ unidic-lite
wakati = MeCab.Tagger("-Owakati")

#sudachi
sudachi = dictionary.Dictionary().create()
mode = tokenizer.Tokenizer.SplitMode.B

#sentencepiece
sp = spm.SentencePieceProcessor(model_file='./sp-model-32000-340k.model')

rev1 = df_train[2][291809]
rev2 = df_test[2][17827]
rev3 = df_train[2][558]

print("Review 1:\n", rev1)
print("MeCab:\n", wakati.parse(rev1).split())
print("Sudachi:\n", [m.surface() for m in sudachi.tokenize(rev1, mode)])
print("SP:\n", sp.encode(rev1, out_type=str))

print("\n\nReview 2:\n", rev2)
print("MeCab:\n", wakati.parse(rev2).split())
print("Sudachi:\n", [m.surface() for m in sudachi.tokenize(rev2, mode)])
print("SP:\n", sp.encode(rev2, out_type=str))

print("\n\nReview 3:\n", rev3)
print("MeCab:\n", wakati.parse(rev3).split())
print("Sudachi:\n", [m.surface() for m in sudachi.tokenize(rev3, mode)])
print("SP:\n", sp.encode(rev3, out_type=str))

Review 1:
 ギターのノブにするため買いました。\n特に悪い所もなく、届いたとき商品はプチプチで包まれていました。\nどちらかと言うと良かったと思います。\nまだつけて間もないのでこれからどうなるかわからないですけど 今のところ不自由は何一つないです。
MeCab:
 ['ギター', 'の', 'ノブ', 'に', 'する', 'ため', '買い', 'まし', 'た', '。', '\\', 'n', '特に', '悪い', '所', 'も', 'なく', '、', '届い', 'た', 'とき', '商品', 'は', 'プチプチ', 'で', '包ま', 'れ', 'て', 'い', 'まし', 'た', '。', '\\', 'n', 'どちら', 'か', 'と', '言う', 'と', '良かっ', 'た', 'と', '思い', 'ます', '。', '\\', 'n', 'まだ', 'つけ', 'て', '間', 'も', 'ない', 'の', 'で', 'これ', 'から', 'どう', 'なる', 'か', 'わから', 'ない', 'です', 'けど', '今', 'の', 'ところ', '不', '自由', 'は', '何', '一', 'つ', 'ない', 'です', '。']
Sudachi:
 ['ギター', 'の', 'ノブ', 'に', 'する', 'ため', '買い', 'まし', 'た', '。', '\\', 'n', '特に', '悪い', '所', 'も', 'なく', '、', '届い', 'た', 'とき', '商品', 'は', 'プチプチ', 'で', '包ま', 'れ', 'て', 'い', 'まし', 'た', '。', '\\', 'n', 'どちら', 'か', 'と', '言う', 'と', '良かっ', 'た', 'と', '思い', 'ます', '。', '\\', 'n', 'まだ', 'つけ', 'て', '間', 'も', 'ない', 'の', 'で', 'これ', 'から', 'どう', 'なる', 'か', 'わから', 'ない', 'です', 'けど', ' ', '今', 'の', 'ところ', '不自由', 'は', '何', '一', 'つ', 'ない', 'です', '。'

## TF-IDF Vectorizer

In [6]:
def tokenize_sp(text):
    tokenized = sp.encode(text, out_type=str)
    return tokenized

In [7]:
def tokenize_mecab(text):
    tokenized = wakati.parse(text).split()
    return tokenized

In [8]:
def tokenize_sudachi(text):
    tokenized = [m.surface() for m in sudachi.tokenize(text, mode)]
    return tokenized

In [9]:
X_train, y_train = df_train[2], df_train[0]
print("Total train samples: ", len(X_train))

Total train samples:  340000


In [10]:
tfidfVect_mecab = TfidfVectorizer(tokenizer=tokenize_mecab)
tfidfVect_sudachi = TfidfVectorizer(tokenizer=tokenize_sudachi)
tfidfVect_sp = TfidfVectorizer(tokenizer=tokenize_sp)

In [11]:
start = time.time()
X_train_tfidf_mecab = tfidfVect_mecab.fit_transform(X_train)
end = time.time()
print("TFIDF Vect time (MeCab): ", end-start)

start = time.time()
X_train_tfidf_sudachi = tfidfVect_sudachi.fit_transform(X_train)
end = time.time()
print("TFIDF Vect time (Sudachi): ", end-start)

start = time.time()
X_train_tfidf_sp = tfidfVect_sp.fit_transform(X_train)
end = time.time()
print("TFIDF Vect time (SentencePiece): ", end-start)

TFIDF Vect time (MeCab):  27.098011016845703
TFIDF Vect time (Sudachi):  1417.198209285736
TFIDF Vect time (SentencePiece):  19.105496168136597
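
To make explicit what each `fit_transform` call computes, here is a pure-Python sketch of TF-IDF with a pluggable tokenizer. It assumes the `TfidfVectorizer` default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row). The function name `tfidf` and the toy pre-segmented reviews are illustrative only, and `str.split` stands in for the MeCab, Sudachi, and SentencePiece tokenizers.

```python
import math
from collections import Counter


def tfidf(docs, tokenizer):
    """Return one {token: weight} dict per document (smoothed IDF, L2-normalised)."""
    tokenized = [tokenizer(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each token
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + count)) + 1 for tok, count in df.items()}

    rows = []
    for toks in tokenized:
        tf = Counter(toks)  # raw term counts within the document
        weights = {tok: count * idf[tok] for tok, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({tok: w / norm for tok, w in weights.items()} if norm else {})
    return rows


# Toy, already-segmented reviews (hypothetical data)
docs = ["良い 商品 です", "悪い 商品 です", "良い"]
rows = tfidf(docs, str.split)
for doc, row in zip(docs, rows):
    print(doc, {tok: round(w, 3) for tok, w in row.items()})
```

Only the tokenizer changes between the three vectorizers in the notebook. The weighting itself is identical, so the large gap in the timings above comes from tokenization cost rather than from the TF-IDF arithmetic.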


## Model Training

In [12]:
start = time.time()
clf_lr_mecab = LogisticRegression(random_state=0).fit(X_train_tfidf_mecab, y_train)
end = time.time()
print("Training time (LR-MeCab): ", end-start)

start = time.time()
clf_lr_sudachi = LogisticRegression(random_state=0).fit(X_train_tfidf_sudachi, y_train)
end = time.time()
print("Training time (LR-Sudachi): ", end-start)

start = time.time()
clf_lr_sp = LogisticRegression(random_state=0).fit(X_train_tfidf_sp, y_train)
end = time.time()
print("Training time (LR-SP): ", end-start)

start = time.time()
clf_mnb_mecab = MultinomialNB().fit(X_train_tfidf_mecab, y_train)
end = time.time()
print("Training time (MNB-MeCab): ", end-start)

start = time.time()
clf_mnb_sudachi = MultinomialNB().fit(X_train_tfidf_sudachi, y_train)
end = time.time()
print("Training time (MNB-Sudachi): ", end-start)

start = time.time()
clf_mnb_sp = MultinomialNB().fit(X_train_tfidf_sp, y_train)
end = time.time()
print("Training time (MNB-SP): ", end-start)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Training time (LR-MeCab):  21.328707456588745


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Training time (LR-Sudachi):  21.394134998321533


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Training time (LR-SP):  13.090460777282715
Training time (MNB-MeCab):  0.1601853370666504
Training time (MNB-Sudachi):  0.11599230766296387
Training time (MNB-SP):  0.0952596664428711


## Evaluation (Error Rate = 1 - accuracy)

In [15]:
# Predictions on train set
X_train, y_train = df_train[2], df_train[0]
print("Total train samples: ", len(X_train))

predicted = clf_lr_mecab.predict(X_train_tfidf_mecab)
print("Train Error rate (LR-MeCab)", (1-np.mean(predicted == y_train))*100)

predicted = clf_lr_sudachi.predict(X_train_tfidf_sudachi)
print("Train Error rate (LR-Sudachi)", (1-np.mean(predicted == y_train))*100)

predicted = clf_lr_sp.predict(X_train_tfidf_sp)
print("Train Error rate (LR-SP)", (1-np.mean(predicted == y_train))*100)

predicted = clf_mnb_mecab.predict(X_train_tfidf_mecab)
print("Train Error rate (MNB-MeCab)", (1-np.mean(predicted == y_train))*100)

predicted = clf_mnb_sudachi.predict(X_train_tfidf_sudachi)
print("Train Error rate (MNB-Sudachi)", (1-np.mean(predicted == y_train))*100)

predicted = clf_mnb_sp.predict(X_train_tfidf_sp)
print("Train Error rate (MNB-SP)", (1-np.mean(predicted == y_train))*100)

Total train samples:  340000
Train Error rate (LR-MeCab) 8.514999999999995
Train Error rate (LR-Sudachi) 8.456764705882359
Train Error rate (LR-SP) 6.476176470588236
Train Error rate (MNB-MeCab) 11.237352941176471
Train Error rate (MNB-Sudachi) 11.039705882352946
Train Error rate (MNB-SP) 8.277941176470593


In [16]:
# Predictions on test set
X_test, y_test = df_test[2], df_test[0]
print("Total test samples: ", len(X_test))

X_test_tfidf = tfidfVect_mecab.transform(X_test)
predicted = clf_lr_mecab.predict(X_test_tfidf)
print("Error rate (LR-MeCab)", (1-np.mean(predicted == y_test))*100)

X_test_tfidf = tfidfVect_sudachi.transform(X_test)
predicted = clf_lr_sudachi.predict(X_test_tfidf)
print("Error rate (LR-Sudachi)", (1-np.mean(predicted == y_test))*100)

X_test_tfidf = tfidfVect_sp.transform(X_test)
predicted = clf_lr_sp.predict(X_test_tfidf)
print("Error rate (LR-SP)", (1-np.mean(predicted == y_test))*100)

X_test_tfidf = tfidfVect_mecab.transform(X_test)
predicted = clf_mnb_mecab.predict(X_test_tfidf)
print("Error rate (MNB-MeCab)", (1-np.mean(predicted == y_test))*100)

X_test_tfidf = tfidfVect_sudachi.transform(X_test)
predicted = clf_mnb_sudachi.predict(X_test_tfidf)
print("Error rate (MNB-Sudachi)", (1-np.mean(predicted == y_test))*100)

X_test_tfidf = tfidfVect_sp.transform(X_test)
predicted = clf_mnb_sp.predict(X_test_tfidf)
print("Error rate (MNB-SP)", (1-np.mean(predicted == y_test))*100)

Total test samples:  40000
Error rate (LR-MeCab) 9.635000000000005
Error rate (LR-Sudachi) 9.607500000000002
Error rate (LR-SP) 7.997500000000002
Error rate (MNB-MeCab) 12.517500000000004
Error rate (MNB-Sudachi) 12.417500000000004
Error rate (MNB-SP) 8.897500000000003
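
The error rate reported above is `(1 - accuracy) * 100`, i.e. the percentage of misclassified samples. A small self-contained sketch of the same computation is shown below, using a hypothetical helper `error_rate_percent` and toy labels rather than the notebook's predictions.

```python
def error_rate_percent(predicted, actual):
    """Percentage of positions where the prediction differs from the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return (1 - correct / len(actual)) * 100


# Toy labels (hypothetical): one mistake out of five samples
y_true = [1, 1, 2, 2, 1]
y_pred = [1, 2, 2, 2, 1]
rate = error_rate_percent(y_pred, y_true)
print(f"Error rate: {rate:.2f}%")
```

Read this way, the 9.635% test error for LR-MeCab corresponds to 3,854 misclassified reviews out of 40,000. The long decimal tails in the printed values (for example `9.635000000000005`) are floating-point representation artifacts, so rounding at print time gives cleaner output.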


In [None]:
# Predictions on few example reviews
X_new = ['悪い商品で、値段も高すぎる', '満足！']
X_new_tfidf = tfidfVect_mecab.transform(X_new)
predicted = clf_lr_mecab.predict(X_new_tfidf)  # a classifier fitted on MeCab TF-IDF features
predicted

## Grid Search CV -- Logistic Regression Parameter Tuning

In [13]:
# Grid Search CV => Logistic Regression + SentencePiece

from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV

start = time.time()
# set parameters
model = LogisticRegression(random_state=0)
solvers = ['newton-cg', 'lbfgs', 'liblinear']
penalty = ['l2']
c_values = [100, 10, 1.0, 0.1, 0.01]
# define grid search
grid = dict(solver=solvers,penalty=penalty,C=c_values)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
grid_search = GridSearchCV(estimator=model, param_grid=grid, n_jobs=-1, cv=cv, scoring='accuracy',error_score=0)
grid_result = grid_search.fit(X_train_tfidf_sp, y_train)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
end = time.time()
print("Grid search time (LR-SP): ", end-start)

Best: 0.920125 using {'C': 10, 'penalty': 'l2', 'solver': 'lbfgs'}
0.907683 (0.001330) with: {'C': 100, 'penalty': 'l2', 'solver': 'newton-cg'}
0.916786 (0.001765) with: {'C': 100, 'penalty': 'l2', 'solver': 'lbfgs'}
0.907678 (0.001332) with: {'C': 100, 'penalty': 'l2', 'solver': 'liblinear'}
0.918808 (0.001292) with: {'C': 10, 'penalty': 'l2', 'solver': 'newton-cg'}
0.920125 (0.001375) with: {'C': 10, 'penalty': 'l2', 'solver': 'lbfgs'}
0.918800 (0.001293) with: {'C': 10, 'penalty': 'l2', 'solver': 'liblinear'}
0.919631 (0.001389) with: {'C': 1.0, 'penalty': 'l2', 'solver': 'newton-cg'}
0.919613 (0.001409) with: {'C': 1.0, 'penalty': 'l2', 'solver': 'lbfgs'}
0.919633 (0.001388) with: {'C': 1.0, 'penalty': 'l2', 'solver': 'liblinear'}
0.896632 (0.001390) with: {'C': 0.1, 'penalty': 'l2', 'solver': 'newton-cg'}
0.896648 (0.001387) with: {'C': 0.1, 'penalty': 'l2', 'solver': 'lbfgs'}
0.896633 (0.001389) with: {'C': 0.1, 'penalty': 'l2', 'solver': 'liblinear'}
0.843839 (0.001359) with: {'

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
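
The cost of this search follows from the size of the grid and the cross-validation scheme. The sketch below counts the model fits using the same parameter lists and fold settings as the cell above. The variable names `combos`, `folds_per_combo`, and `total_fits` are illustrative.

```python
from itertools import product

# Same values as in the grid search cell
solvers = ['newton-cg', 'lbfgs', 'liblinear']
penalty = ['l2']
c_values = [100, 10, 1.0, 0.1, 0.01]
n_splits, n_repeats = 10, 3

combos = list(product(solvers, penalty, c_values))  # every parameter combination
folds_per_combo = n_splits * n_repeats              # repeated stratified k-fold
total_fits = len(combos) * folds_per_combo

print(len(combos), folds_per_combo, total_fits)  # 15 30 450
```

With 340,000 training samples and 10 splits, each of the 450 fits trains on 306,000 reviews and validates on 34,000. Assuming the `GridSearchCV` default of `refit=True`, one more fit on the full training set follows for the best combination.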


## Model Training and Evaluation

### Logistic Regression (random_state=0, C=10, penalty=l2, solver=lbfgs) -- SentencePiece

In [15]:
start = time.time()
clf_lr_sp = LogisticRegression(random_state=0, C=10, penalty='l2', solver='lbfgs').fit(X_train_tfidf_sp, y_train)
end = time.time()
print("Training time (LR-SP): ", end-start, " seconds.")

predicted = clf_lr_sp.predict(X_train_tfidf_sp)
print("Train Error rate (LR-SP)", (1-np.mean(predicted == y_train))*100)

X_test, y_test = df_test[2], df_test[0]
print("Total test samples: ", len(X_test))

X_test_tfidf = tfidfVect_sp.transform(X_test)
predicted = clf_lr_sp.predict(X_test_tfidf)
print("Test Error rate (LR-SP)", (1-np.mean(predicted == y_test))*100)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Training time (LR-SP):  13.420908451080322  seconds.
Train Error rate (LR-SP) 5.560588235294118
Total test samples:  40000
Test Error rate (LR-SP) 7.779999999999998
