shap-authorship-analysis-demo

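Demo repository for authorship analysis with LightGBM and SHAP.
The application extracts stylistic features from a text, classifies it among known authors with LightGBM, and plots the features that drove each classification.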

Demo (Project Gutenberg)

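From the Gutenberg Corpus available through NLTK (https://www.gutenberg.org/), the novels of Chesterton and Bryant were selected and grouped by author. The two authors were labeled True and False respectively, and the result was used as the dataset.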

To build the dataset, each paragraph across an author's novels was treated as one data point, and LightGBM was trained and evaluated on those paragraphs.
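The paragraph-per-data-point idea can be sketched as follows. This is a minimal illustration, assuming paragraphs are separated by blank lines; the helper names `split_paragraphs` and `build_dataset` are hypothetical and are not taken from this repository.

```python
import re


def split_paragraphs(raw_text):
    """Split raw text into paragraphs on blank lines and collapse inner whitespace."""
    blocks = re.split(r"\n\s*\n", raw_text)
    paragraphs = [" ".join(block.split()) for block in blocks]
    return [p for p in paragraphs if p]  # drop empty blocks


def build_dataset(texts_by_author, positive_author):
    """Return (paragraph, label) pairs; the label is True for positive_author."""
    dataset = []
    for author, text in texts_by_author.items():
        for paragraph in split_paragraphs(text):
            dataset.append((paragraph, author == positive_author))
    return dataset


# Tiny stand-in texts; the real input would be the full novels.
sample = {
    "Chesterton": "First paragraph,\nwrapped over two lines.\n\nSecond paragraph.",
    "Bryant": "Only one paragraph here.\n\n\n",
}
dataset = build_dataset(sample, positive_author="Chesterton")
```

Each tuple in `dataset` then corresponds to one row that feature extraction would turn into a numeric vector.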

Prediction used 73 numeric features in total: about 30 grammatical features plus the in-paragraph frequencies of about 40 parts of speech.
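As a rough illustration of what such a feature vector might look like, the sketch below computes a few surface features and the relative frequency of some part-of-speech tags for one paragraph. The tag subset, the feature names, and the choice of relative frequency over raw counts are all assumptions; the README does not list the actual 73 features. The input is assumed to be already tagged as (token, tag) pairs, for example by a POS tagger.

```python
from collections import Counter

# Hypothetical tag subset; the actual tags used by the project are not listed here.
POS_TAGS = ["NN", "VB", "JJ", "RB", "IN"]


def pos_frequencies(tagged_tokens, tags=POS_TAGS):
    """Return the relative frequency of each tag within one paragraph."""
    counts = Counter(tag for _, tag in tagged_tokens)
    total = len(tagged_tokens)
    return {f"pos_{tag}": (counts[tag] / total if total else 0.0) for tag in tags}


def surface_features(paragraph):
    """A few hypothetical surface-level style features."""
    words = paragraph.split()
    n = len(words)
    stripped_lengths = [len(w.strip(".,;:!?\"'")) for w in words]
    return {
        "word_count": n,
        "avg_word_length": sum(stripped_lengths) / n if n else 0.0,
        "comma_per_word": paragraph.count(",") / n if n else 0.0,
    }


paragraph = "The quiet man walked slowly, and he smiled."
tagged = [
    ("The", "DT"), ("quiet", "JJ"), ("man", "NN"), ("walked", "VB"),
    ("slowly", "RB"), (",", ","), ("and", "CC"), ("he", "PRP"),
    ("smiled", "VB"), (".", "."),
]
features = {**surface_features(paragraph), **pos_frequencies(tagged)}
```

Merging the two dictionaries gives one flat row per paragraph, which is the shape a tabular learner such as LightGBM expects.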

Because a LightGBM model is effectively a black box, SHAP (https://shap.readthedocs.io/en/latest/index.html) was introduced to explain the criteria the model uses for classification.
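To give a feel for what a SHAP value is, the sketch below computes exact Shapley values for a toy three-feature function by enumerating every feature coalition. It does not use the SHAP library or LightGBM; `toy_model`, the instance, and the all-zero baseline are illustrative assumptions. The SHAP library offers much faster algorithms for tree models, since this brute-force approach grows as 2^n in the number of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, instance, baseline):
    """Exact Shapley values by enumerating every feature coalition.

    Features outside a coalition are replaced by their baseline value.
    """
    n = len(instance)

    def value(coalition):
        mixed = [instance[i] if i in coalition else baseline[i] for i in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(x):
    """Hypothetical model: one additive term and one interaction term."""
    return 2.0 * x[0] + x[1] * x[2]


instance = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, instance, baseline)
```

The contributions sum to the difference between the model output for the instance and for the baseline, and the interaction term is split evenly between the two features involved. This additivity is what the plots in the later sections visualize.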

Dataset details

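Writing style: novels (early 1900s)
Cross-validation method: 100-fold cross validation
Dataset: Project Gutenberg Selections source (novels by Chesterton and Bryant, chosen because the two authors write in similar styles)
The number of paragraphs per author is shown below.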

Author	Paragraphs
Chesterton	4055
Bryant	1194
Total	5249

(Note: the number of data points is imbalanced, so as a next step, an implementation that randomly selects 1000 data points from each author will be added.)
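The balancing step described in the note could look roughly like the sketch below. It is not the repository's implementation; `downsample_per_author`, the fixed seed, and the placeholder strings standing in for paragraphs are assumptions made for illustration.

```python
import random


def downsample_per_author(dataset, per_author=1000, seed=0):
    """Randomly keep at most per_author data points for each label."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    by_label = {}
    for item, label in dataset:
        by_label.setdefault(label, []).append(item)

    balanced = []
    for label, items in by_label.items():
        chosen = rng.sample(items, min(per_author, len(items)))
        balanced.extend((item, label) for item in chosen)
    rng.shuffle(balanced)  # avoid leaving the rows grouped by author
    return balanced


# Placeholder data with the same class sizes as the table above.
dataset = [(f"c{i}", True) for i in range(4055)] + [(f"b{i}", False) for i in range(1194)]
balanced = downsample_per_author(dataset)
```

With 1000 points per author, a classifier that always answers Chesterton would score 50% accuracy, which makes the reported metrics easier to interpret.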

Confusion matrix (100-fold cross validated model)

Actual \ Predicted	Bryant	Chesterton
Bryant	612	582
Chesterton	202	3853

(Note: because of the class imbalance, the model tends to guess Chesterton. 582 of the 1194 Bryant paragraphs, about 49%, were predicted as Chesterton.)
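A confusion matrix for a cross-validated model is typically built by pooling the held-out predictions from every fold. The sketch below shows that pooling with a plain, unstratified 100-fold split. The splitter, the function names, and the majority-class stand-in classifier are assumptions; the README does not say how the repository splits the data, and the real pipeline trains LightGBM instead.

```python
import random


def kfold_indices(n_samples, n_folds=100, seed=0):
    """Shuffle the indices and deal them into n_folds nearly equal test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[fold::n_folds] for fold in range(n_folds)]


def cross_val_predict(features, labels, fit, predict, n_folds=100):
    """Train on all folds but one, predict the held-out fold, pool the predictions."""
    predictions = [None] * len(labels)
    for test_idx in kfold_indices(len(labels), n_folds):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        model = fit([features[i] for i in train_idx], [labels[i] for i in train_idx])
        for i in test_idx:
            predictions[i] = predict(model, features[i])
    return predictions


def fit_majority(train_features, train_labels):
    """Stand-in learner: remember the majority label of the training fold."""
    return sum(train_labels) * 2 >= len(train_labels)


def predict_majority(model, feature_row):
    return model


labels = [True] * 4055 + [False] * 1194  # True = Chesterton, False = Bryant
features = [[0.0]] * len(labels)          # placeholder feature rows
predictions = cross_val_predict(features, labels, fit_majority, predict_majority)
fold_sizes = sorted({len(fold) for fold in kfold_indices(len(labels))})
```

With 5249 paragraphs and 100 folds, each held-out fold contains only 52 or 53 paragraphs, so every model is trained on roughly 99% of the data.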

Scores (100-fold cross validated model)

Metrics	Score
F1	0.907
ROC-AUC	0.870
Accuracy	0.850
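Accuracy and F1 can be recomputed from the confusion matrix as a sanity check. The sketch below assumes Chesterton is the positive (True) class, which matches the labeling described earlier. ROC-AUC cannot be recovered this way because it depends on the predicted scores, not only on the final class decisions.

```python
# Counts copied from the confusion matrix; Chesterton is treated as the positive class.
tp = 3853  # Chesterton predicted as Chesterton
fn = 202   # Chesterton predicted as Bryant
fp = 582   # Bryant predicted as Chesterton
tn = 612   # Bryant predicted as Bryant

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
bryant_recall = tn / (tn + fp)  # share of Bryant paragraphs identified correctly

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f}")
print(f"f1={f1:.4f} bryant_recall={bryant_recall:.4f}")
```

The recomputed accuracy and F1 agree with the table to within 0.001. The much lower recall for Bryant is hidden by the headline F1 score, because F1 here is measured on the majority class.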

Force plot

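Plot showing the prediction for the first data point and the feature contributions behind it.
Larger SHAP values push the model toward predicting Chesterton, and smaller values push it toward Bryant.
(The true author of the first data point is Chesterton, and the model predicted it correctly.)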
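A force plot visualizes an additive decomposition: the model output for one data point equals a base value plus the sum of that point's SHAP values. The sketch below shows the arithmetic with made-up numbers. The feature names and values are hypothetical, and it assumes the SHAP values are expressed in log-odds, which is common for tree-based binary classifiers but is not confirmed by this README.

```python
import math

# Hypothetical values for one paragraph; names and numbers are illustrative only.
base_value = 1.2  # assumed average model output (log-odds) over the data
shap_values = {
    "comma_per_word": 0.8,    # pushes toward Chesterton
    "pos_NN": 0.3,            # pushes toward Chesterton
    "avg_word_length": -0.5,  # pushes toward Bryant
}


def sigmoid(x):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


raw_output = base_value + sum(shap_values.values())
probability_chesterton = sigmoid(raw_output)
predicted_author = "Chesterton" if probability_chesterton >= 0.5 else "Bryant"
```

Positive contributions appear as arrows pushing the output up and negative ones as arrows pushing it down, and the point where they meet is `raw_output`.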

Decision plot

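The force plot for the first data point, broken down into one row per feature.
Larger SHAP values push the model toward predicting Chesterton, and smaller values push it toward Bryant.
(The true author of the first data point is Chesterton, and the model predicted it correctly.)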
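The line in a decision plot can be read as a running total: starting from the base value, each feature's SHAP value is added in turn until the final model output is reached. The sketch below reproduces that path with hypothetical numbers, assuming features are ordered so that the most influential one is added last.

```python
from itertools import accumulate

# Hypothetical values for one paragraph; names and numbers are illustrative only.
base_value = 1.2
shap_values = {"comma_per_word": 0.8, "pos_NN": 0.3, "avg_word_length": -0.5}

# Smallest absolute contribution first, so the strongest feature is added last.
ordered = sorted(shap_values.items(), key=lambda kv: abs(kv[1]))
path = list(accumulate((value for _, value in ordered), initial=base_value))

for (name, value), running in zip(ordered, path[1:]):
    print(f"{name:>16} {value:+.2f} -> {running:.2f}")
```

The last entry of `path` equals the base value plus the sum of all SHAP values, so the decision plot and the force plot end at the same model output.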

Summary plot (bar)

Shows the overall contribution of each feature to the classification across the 100-fold cross validation.
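The bar heights in this kind of plot are conventionally the mean absolute SHAP value of each feature over all data points. The sketch below computes that ranking for a small made-up matrix. The feature names and values are hypothetical, and how the repository pools SHAP values across the 100 folds is not stated here, so a single combined matrix is assumed.

```python
def mean_abs_shap(shap_matrix, feature_names):
    """Rank features by mean absolute SHAP value across data points."""
    n_rows = len(shap_matrix)
    importance = {
        name: sum(abs(row[col]) for row in shap_matrix) / n_rows
        for col, name in enumerate(feature_names)
    }
    return sorted(importance.items(), key=lambda kv: kv[1], reverse=True)


feature_names = ["comma_per_word", "pos_NN", "avg_word_length"]
# One row per data point, one column per feature (illustrative numbers only).
shap_matrix = [
    [0.8, 0.3, -0.5],
    [-0.6, 0.1, 0.2],
    [0.4, -0.2, -0.5],
]
ranking = mean_abs_shap(shap_matrix, feature_names)
```

Taking the absolute value matters: a feature that pushes strongly toward Chesterton for some paragraphs and strongly toward Bryant for others would otherwise average out to near zero despite being influential.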

Summary plot

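Plot showing how high or low values of each feature contributed to the classification of each data point across the 100-fold cross validation.
Larger SHAP values push the model toward predicting Chesterton, and smaller values push it toward Bryant.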
