<a href="https://colab.research.google.com/github/c-c-c-c/dm_integration/blob/master/myMecab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

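# Text Classification
Text classification proceeds in the following order.

1. Preprocessing
2. Conversion to a numeric representation (CountVectorizer, TfidfVectorizer)
3. Training a classifier

In this notebook, we build a text classifier by classifying Wikipedia entries.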
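
As a rough orientation, the three steps above can be sketched end to end with scikit-learn. The toy documents, the labels, and the `make_pipeline` wiring below are illustrative assumptions, not the data or the exact code used later in this notebook; the documents are assumed to be already tokenized and space-separated.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Step 1 (preprocessing) is assumed done: each document is already a
# space-separated string of tokens. These toy documents are hypothetical.
train_docs = [
    "soccer goal match",
    "goal keeper soccer",
    "stock market price",
    "market price bank",
]
train_labels = ["sports", "sports", "finance", "finance"]

# Step 2 (numeric representation) and step 3 (classifier) chained together.
model = make_pipeline(
    TfidfVectorizer(token_pattern=r"(?u)\b\w+\b"),
    LogisticRegression(),
)
model.fit(train_docs, train_labels)

predicted = model.predict(["soccer match"])
print(predicted)
```

Wrapping both steps in one pipeline means the vocabulary learned during `fit` is reused automatically at prediction time.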

In [47]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [48]:
!apt install aptitude
!aptitude install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file -y
!pip install mecab-python3==0.7

Reading package lists... Done
Building dependency tree       
Reading state information... Done
aptitude is already the newest version (0.8.10-6ubuntu1).
0 upgraded, 0 newly installed, 0 to remove and 25 not upgraded.
mecab is already installed at the requested version (0.996-5)
libmecab-dev is already installed at the requested version (0.996-5)
mecab-ipadic-utf8 is already installed at the requested version (2.7.0-20070801+main-1)
git is already installed at the requested version (1:2.17.1-1ubuntu0.5)
make is already installed at the requested version (4.1-9.1ubuntu1)
curl is already installed at the requested version (7.58.0-2ubuntu3.8)
xz-utils is already installed at the requested version (5.2.2-1.3)
file is already installed at the requested version (1:5.32-2ubuntu0.3)
mecab is already installed at the requested version (0.996-5)
libmecab-dev is already installed at the requested version (0.996-5)
mecab-ipadic-utf8 is already installed at the requested version (2.7.0-20070801+mai

In [49]:
!git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
!echo yes | mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -n -a

fatal: destination path 'mecab-ipadic-neologd' already exists and is not an empty directory.
[install-mecab-ipadic-NEologd] : Start..
[install-mecab-ipadic-NEologd] : Check the existance of libraries
[install-mecab-ipadic-NEologd] :     find => ok
[install-mecab-ipadic-NEologd] :     sort => ok
[install-mecab-ipadic-NEologd] :     head => ok
[install-mecab-ipadic-NEologd] :     cut => ok
[install-mecab-ipadic-NEologd] :     egrep => ok
[install-mecab-ipadic-NEologd] :     mecab => ok
[install-mecab-ipadic-NEologd] :     mecab-config => ok
[install-mecab-ipadic-NEologd] :     make => ok
[install-mecab-ipadic-NEologd] :     curl => ok
[install-mecab-ipadic-NEologd] :     sed => ok
[install-mecab-ipadic-NEologd] :     cat => ok
[install-mecab-ipadic-NEologd] :     diff => ok
[install-mecab-ipadic-NEologd] :     tar => ok
[install-mecab-ipadic-NEologd] :     unxz => ok
[install-mecab-ipadic-NEologd] :     xargs => ok
[install-mecab-ipadic-NEologd] :     grep => ok
[install-mecab-ipadic-NEo

In [0]:
import joblib
import MeCab
import numpy as np
import pandas as pd
import re

from collections import Counter
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.svm import LinearSVC

## Loading the data

In [0]:
# Load the EPG data from an Excel workbook (the CSV load below is left commented out)
# df = pd.read_csv("./drive/My Drive/0_インテグ作業/data/EPG_checking (1).csv")
df = pd.read_excel("./drive/My Drive/0_インテグ作業/data/EPG_checking0212.xlsx")

In [0]:
df["sharp_epg_tknz"] = np.nan

In [53]:
df

Unnamed: 0.1,Unnamed: 0,drama_key,Unnamed: 2,Unnamed: 3,Unnamed: 4,drama_title,sharp_title,sharp_cnt,sharp_cnt_epg_regex,start_time,sharp_epg,sharp_epg_hand_corrected,Unnamed: 12,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16,Unnamed: 17,Unnamed: 18,Unnamed: 19,Unnamed: 20,Unnamed: 21,Unnamed: 22,Unnamed: 23,Unnamed: 24,sharp_epg_tknz
0,0.0,1910_CX_月21,1910,CX,月21,シャーロック,番組名シャーロック【危険な天才探偵×スマートな医師ー今夜、運命の出逢い】　#01,1.0,['#01'],"['21', '22']",都内にある病院の中庭で、この病院に勤務する消化器内科医、赤羽栄光(あかばね・はるき/中尾明慶...,都内にある病院の中庭で、この病院に勤務する消化器内科医、赤羽栄光(あかばね・はるき/中尾明慶...,,,,,,,,,,,,,,
1,1.0,1910_CX_月21,1910,CX,月21,シャーロック,番組名シャーロック【探偵×医師最強バディ始動!天才VS聖女の謎解き心理戦】　#02,2.0,['#02'],"['21', '22']",ついにバディ結成で最初の事件に挑む!新宿駅で死んだ女は別人に成り代わった女だった…。“ディー...,ついにバディ結成で最初の事件に挑む!新宿駅で死んだ女は別人に成り代わった女だった…。“ディー...,,,,,,,,,,,,,,
2,2.0,1910_CX_月21,1910,CX,月21,シャーロック,番組名シャーロック【天才VS詐欺師の騙し合い!ノンストップのどんでん返し】　#03,3.0,['#03'],"['21', '21']",オリンピックを翌年に控えたTOKYOの街を舞台に、現代の日本に生まれ変わった世界一有名なバデ...,オリンピックを翌年に控えたTOKYOの街を舞台に、現代の日本に生まれ変わった世界一有名なバデ...,,,,,,,,,,,,,,
3,3.0,1910_CX_月21,1910,CX,月21,シャーロック,番組名シャーロック【試合直前に消えたボクシング世界王者!オレンジの傘の謎】　#04,4.0,['#04'],"['21', '21']",誉獅子雄(ディーン・フジオカ)は、若宮潤一(岩田剛典)とボクシングの試合を観戦。ボクシングの...,誉獅子雄(ディーン・フジオカ)は、若宮潤一(岩田剛典)とボクシングの試合を観戦。ボクシングの...,,,,,,,,,,,,,,
4,4.0,1910_CX_月21,1910,CX,月21,シャーロック,番組名シャーロック【歩く死体の謎!熱帯魚と母の愛…その真相、正義か狂気か】　#05,5.0,['#05'],"['21', '21']",若宮潤一(岩田剛典)が誉獅子雄(ディーン・フジオカ)に文句を言っている。獅子雄は、同居してい...,若宮潤一(岩田剛典)が誉獅子雄(ディーン・フジオカ)に文句を言っている。獅子雄は、同居してい...,,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6149,,,,,,,,,,,,,,,,,,,,,,,,,,
6150,2967.0,,#VALUE!,,,,,5.0,,,,,#VALUE!,,,,,,,,,,,,,
6151,2968.0,,#VALUE!,,,,,6.0,,,,,#VALUE!,,,,,,,,,,,,,
6152,2969.0,,#VALUE!,,,,,7.0,,,,,#VALUE!,,,,,,,,,,,,,


In [0]:
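def removeTrash(text):
    """Strip URLs, credits, broadcaster names, and stray symbols from EPG text."""
    result_text = text
    # URLs
    result_text = re.sub(r"https?://[\w/:%#\$&\?\(\)~\.=\+\-]+", "", result_text)
    # Boilerplate labels ("program details", "produced by")
    result_text = re.sub(r"番組詳細|制作・著作|制作著作", "", result_text)
    # Selected symbols and brackets
    result_text = re.sub(r"[!\(\)=『』～/]", "", result_text)
    # Broadcaster names
    result_text = re.sub(r"フジテレビ|日本テレビ|TBS|テレビ朝日|関西テレビ", "", result_text)
    # "【公式...】" blocks (non-greedy, so each block is removed separately)
    result_text = re.sub(r"【公式.*?】", "", result_text)
    # Full-width (ideographic) spaces
    result_text = re.sub(r"\u3000", "", result_text)

    return result_text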


In [55]:
removeTrash(df['sharp_epg'][20])

'桑野阿部寛が仕事中に倒れ、中川尾美としのりの病院に運び込まれる。見舞いに訪れたまどか吉田羊、有希江稲森いずみ、早紀深川麻衣は、皮肉も全く言わず、いつになく素直で別人のような態度の桑野に大いに驚く。病気をきっかけに桑野が“いい人"になったと喜ぶ有希江と早紀に対して、普段から桑野とケンカばかりのまどかだけは、素直なのはあくまで一時的なものに違いないと疑うが、有希江から「桑野さんに厳しすぎ」と指摘されてしまう。    案の定、まどかは回復した桑野とまたもやささいなことで言い争いに。有希江や早紀には好意的なのに、なぜ自分には皮肉ばかり言うのかー。納得がいかないまどかに対し、早紀は、男と女の間には言葉と感情が裏腹になることがあると力説。それを体現した自分の舞台を見にきてほしいと、まどかたちを誘う。数日後、都合が悪くなり、舞台に行けなくなったまどかが困っていると、偶然桑野がやって来て、自分が代わりに行くと言い出す。早紀が、桑野には来てほしくないと言っていたことを思い出したまどかは慌てるが、一度行く気になった桑野を止められず、結局、桑野と有希江が舞台を見に行くことに。まどかは2人のデート?が気になって…。'

In [69]:
mecab = MeCab.Tagger()
mecab.parse("")  # warm-up call kept from the original; commonly used to avoid garbled node.surface
tokenized_rows = []
for text in df['sharp_epg_hand_corrected']:
    # Skip missing values (NaN) and any other non-string cells
    if not isinstance(text, str):
        tokenized_rows.append(np.nan)
        continue

    # Remove URLs, symbols, and other noise
    text = removeTrash(text)

    text_tokenized = []
    node = mecab.parseToNode(text)
    while node:
        feature = node.feature
        # Drop BOS/EOS, particles, symbols, auxiliary verbs, and person names
        if not feature.startswith(("BOS/EOS", "助詞", "記号", "助動詞")) and "人名" not in feature:
            text_tokenized.append(node.surface)
        node = node.next
    tokenized_rows.append(text_tokenized)

# Assign the whole column at once to avoid pandas' SettingWithCopyWarning
df["sharp_epg_tknz"] = pd.Series(tokenized_rows, index=df.index, dtype=object)
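
The part-of-speech filter used in the loop above can be tried without installing MeCab. The sketch below uses hand-written `(surface, feature)` pairs as a stand-in for MeCab nodes; the feature strings imitate the IPADIC-style comma-separated format but are illustrative assumptions, not real analyzer output.

```python
# Hypothetical (surface, feature) pairs standing in for MeCab nodes.
nodes = [
    ("", "BOS/EOS,*,*,*,*,*,*,*,*"),
    ("桑野", "名詞,固有名詞,人名,姓,*,*,桑野,クワノ,クワノ"),
    ("が", "助詞,格助詞,一般,*,*,*,が,ガ,ガ"),
    ("仕事", "名詞,サ変接続,*,*,*,*,仕事,シゴト,シゴト"),
    ("中", "名詞,接尾,副詞可能,*,*,*,中,チュウ,チュー"),
    ("に", "助詞,格助詞,一般,*,*,*,に,ニ,ニ"),
    ("倒れ", "動詞,自立,*,*,一段,連用形,倒れる,タオレ,タオレ"),
    ("た", "助動詞,*,*,*,特殊・タ,基本形,た,タ,タ"),
    ("。", "記号,句点,*,*,*,*,。,。,。"),
    ("", "BOS/EOS,*,*,*,*,*,*,*,*"),
]

# Part-of-speech prefixes to drop: sentence boundaries, particles,
# symbols, and auxiliary verbs.
EXCLUDED_PREFIXES = ("BOS/EOS", "助詞", "記号", "助動詞")


def keep_token(feature):
    """Return True when the token passes the same filter as the loop."""
    return not feature.startswith(EXCLUDED_PREFIXES) and "人名" not in feature


tokens = [surface for surface, feature in nodes if keep_token(feature)]
print(tokens)
```

Only content-bearing tokens survive; the person name is dropped because its feature string contains `人名`, even though it is a noun.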


In [72]:
df['sharp_epg_tknz'].iloc[190:193]
#text = df['sharp_detail_epg_tknz'][190:220]
# if type(text) is not str: 
#     if np.isnan(text) :
#         print("hoge") 
# removeTrash (df['sharp_detail_epg'][881])

190    [実, 母子, 肉体, 関係, 持た, せる, モンテ・クリスト・, 真, 海, ディーン・...
191    [恋人, 親友, 同僚, 家族, あなた, 最も, 信頼, する, 人, あなた, 地獄, ...
192    [モンテ・クリスト・, 真, 海, ディーン・フジオカ, 自分, おとしめ, 寺, 角, 類...
Name: sharp_epg_tknz, dtype: object

In [73]:
df["sharp_epg_tknz"].iloc[192]

['モンテ・クリスト・',
 '真',
 '海',
 'ディーン・フジオカ',
 '自分',
 'おとしめ',
 '寺',
 '角',
 '類',
 'この世',
 '抹殺',
 'する',
 'そして',
 '真',
 '海',
 '復讐',
 'ため',
 '放っ',
 '向け',
 '矢',
 'ついに',
 '的',
 '射よ',
 'し',
 'い',
 '家',
 '裏',
 '組織',
 'ユンロンジョン・リ',
 '押し入り',
 '何',
 '知ら',
 'すみれ',
 '明日',
 '花',
 '英',
 '人質',
 'とら',
 'れる',
 'かつて',
 '片棒',
 '担い',
 'ショーン・リージョーナカムラ',
 '事件',
 '目撃',
 '者',
 '香港',
 '警察',
 '接触',
 'する',
 'こと',
 'なっ',
 '目撃',
 '者',
 'ショー',
 'ン',
 '夫婦',
 '娘',
 'エデルヴァ',
 'エデルヴァ',
 'たち',
 '人身',
 '売買',
 'し',
 '日本語',
 '話す',
 '男',
 '引き取っ',
 '誰',
 '狙わ',
 'れ',
 'いる',
 'の',
 'その',
 '男',
 '探し出す',
 'よう',
 '命令',
 'する',
 '家',
 '飛び出し',
 '予定',
 'キャンセル',
 '連絡',
 '神楽',
 '電話',
 'する',
 'つかまら',
 'その',
 '頃',
 '神楽',
 '訪ね',
 'き',
 '真',
 '海',
 '会っ',
 'い',
 '真',
 '海',
 '引き合わ',
 'せ',
 '之',
 '詐欺',
 '師',
 '神楽',
 '謝る',
 'そして',
 '間もなく',
 'ショー',
 'ン',
 '事件',
 '関与',
 'し',
 'い',
 'こと',
 '香港',
 '警察',
 '知る',
 'こと',
 'なる',
 '教え',
 '寺',
 '角',
 '遺体',
 '匿名',
 '通報',
 '警察',
 '発見',
 'さ',
 'れる',
 '寺',
 '角',
 '浜浦',
 '町',
 '出身',
 '15',
 '年',
 '前',
 '暖',
 '捕まっ',
 

In [59]:
df["sharp_epg"][190]

'実の母子に肉体関係を持たせる…。モンテ・クリスト・真海(ディーン・フジオカ)の魔手は神楽清(新井浩文)の妻・留美(稲森いずみ)を、彼女が不倫の果てに産んだ安堂完治(葉山奨之)と結びつけた。  真海の次なる一手は入間公平(高橋克典)。真海は、外務省勤務でマレーシアに駐在していた出口文矢(尾上寛之)を日本に呼び戻して自身の別荘に招待。出口は、入間が自ら選んだ娘・未蘭(岸井ゆきの)の婚約者だ。日本に帰れたことを喜ぶ出口に、真海は頼みがあると持ちかける。それは入間の父・貞吉(伊武雅刀)を殺して欲しいというものだった。驚く出口に真海は冗談だと告げるが、入間家は貞吉の莫大な遺産相続で揉めていることを吹き込む。  一方、未蘭は『富永水産』に頼んでおいたダボハゼを取りに行く。守尾信一朗(高杉真宙)に会った途端、未蘭の顔が輝いた。ランチに出ると未蘭は、貞吉の反対で結婚がなくなったことを信一朗に話す。信一朗と未蘭の未来に明るい陽が差し込んだかに思えたが…。  未蘭が帰宅すると出口が来ていた。入間は出口に、貞吉の遺言書の件を話す。未蘭が出口と結婚したら遺産は全て寄付するというものだ。入間は、それでも未蘭と結婚して欲しいと出口に頼む。出口は真海に成り行きを報告。すると真海は「未蘭との結婚前に貞吉を殺して遺産を相続してしまえば良い」と出口にささやく。逡巡する出口に真海は「貞吉はかつて人を殺している」と話し出した。ディーン・フジオカ\u3000    大倉忠義\u3000    山本美月\u3000    高杉真宙\u3000  葉山奨之\u3000  岸井ゆきの\u3000  桜井ユキ\u3000  三浦誠己\u3000  渋川清彦  \u3000・\u3000  新井浩文  \u3000/\u3000  田中泯  \u3000・\u3000  風吹ジュン  \u3000・\u3000  木下ほうか  \u3000/\u3000  山口紗弥加\u3000  伊武雅刀\u3000  稲森いずみ\u3000    高橋克典【原作】  アレクサンドル・デュマ(仏)「モンテ・クリスト伯」(1841年)\u3000    【脚本】  黒岩勉(『僕のヤバイ妻』、『ようこそ、わが家へ』、『ストロベリーナイト』)\u3000    【プロデュース】  太田大(『名前をなくした女神』、『息もできない夏』、『

In [0]:
str(df["sharp_epg_tknz"].iloc[190])

In [0]:
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")

In [0]:

X = vectorizer.fit_transform(
    [str(i) for i in df["sharp_epg_tknz"].values]
)
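
Because each cell of `sharp_epg_tknz` holds a Python list, `str(i)` above produces text such as `"['実', '母子']"`, and the tokens are recovered only because `token_pattern` skips the brackets, quotes, and commas. The sketch below checks this with the standard `re` module on a made-up three-token list; joining the tokens with spaces is shown as an alternative that yields the same tokens here and may be easier to read.

```python
import re

TOKEN_PATTERN = r"(?u)\b\w+\b"  # same pattern passed to the vectorizer

tokens = ["実", "母子", "関係"]  # made-up example list

# What str() produces for a list: brackets, quotes, and commas included.
as_repr = str(tokens)
print(as_repr)

# The pattern keeps only runs of word characters, so the punctuation
# introduced by str() is dropped.
from_repr = re.findall(TOKEN_PATTERN, as_repr)

# Alternative: join the tokens with spaces before vectorizing.
from_joined = re.findall(TOKEN_PATTERN, " ".join(tokens))

print(from_repr)
print(from_joined)
```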

In [77]:
X

<6154x27928 sparse matrix of type '<class 'numpy.float64'>'
	with 495290 stored elements in Compressed Sparse Row format>

In [78]:
pd.DataFrame(X.toarray(), columns=[ x[0] for x in sorted(vectorizer.vocabulary_.items(), key=lambda x: x[1]) ])

Unnamed: 0,0,00,000,02,03,0456666,1,10,100,1000,1014,1018,102,1024,1028,103,1030,1031,104,105,11,110,1110,1111,1113,1114,1117,1118,112,1121,1125,1128,113,114,116,117,118,12,120,1200,...,黙々と,黙っ,黙ら,黙り込む,黙り込ん,黙る,黙秘,黙認,黛,鼓,鼓動,鼓舞,鼠,鼻,鼻息,鼻歌,鼻血,齋,齢,龍,龍崎,龍門,０,１,１つ,２,３,４,５,６,７,８,９,ａｖ,ｃｍ,ｄｎａ,ｍｄ,ｎｈｋ,ｏｌ,ｓｎｓ
0,0.0,0.0,0.0,0.0,0.0,0.0,0.047506,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.037161,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.042783,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.110092,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.09312,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6149,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6150,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6151,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6152,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.00000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [79]:
vectorizer.idf_

array([6.83464828, 8.62640775, 8.62640775, ..., 9.03187286, 9.03187286,
       9.03187286])
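
These values can be checked by hand. With its default `smooth_idf=True`, `TfidfVectorizer` computes `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, where `n` is the number of documents and `df(t)` is the number of documents containing term `t`. The sketch below plugs in `n = 6154` (the row count of `X` above); the document frequencies 17, 2, and 1 are assumptions inferred from the printed values, not read from the data.

```python
import math


def smooth_idf(n_docs, doc_freq):
    """IDF with add-one smoothing, as used when smooth_idf=True."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


n_docs = 6154  # number of rows in X

# Document frequencies below are assumptions inferred from the output.
for doc_freq in (17, 2, 1):
    print(doc_freq, round(smooth_idf(n_docs, doc_freq), 8))
```

A word that appears in a single document gets the largest possible IDF for this corpus, which is why so many entries share the same top value.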

In [0]:
# Inspect the IDF values

idf = pd.Series(vectorizer.idf_, index=[ x[0] for x in sorted(vectorizer.vocabulary_.items(), key=lambda x: x[1]) ]) \
    .to_frame("idf") \
    .sort_values("idf", ascending=False)

In [81]:


pd.set_option('display.max_rows', 1000)
idf.iloc[1:100]

Unnamed: 0,idf
創始,9.031873
割り込ま,9.031873
割り込み,9.031873
割り込ん,9.031873
割引,9.031873
割烹,9.031873
創っ,9.031873
創ら,9.031873
創る,9.031873
創れ,9.031873


## Morphological analysis
---

This section shows morphological analysis with MeCab and some simple processing examples.

### A simple example

In [0]:
# Note: this code only converts the document-term matrix to a DataFrame to make it easier to read for this exercise; there is no need to memorize it
pd.DataFrame(X.toarray(), columns=[ x[0] for x in sorted(vectorizer.vocabulary_.items(), key=lambda x: x[1]) ])

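In the document-term matrix produced by CountVectorizer, each row represents one document, and each column stores the number of occurrences of one word. (Note: in some references the columns of a "document-term matrix" represent the documents, so take care when reading other material.)

To find out which column corresponds to which word, look at the vectorizer's vocabulary_.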

In [0]:
vectorizer.vocabulary_
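
To make the row and column layout concrete, the following pure-Python sketch builds a small document-term matrix by hand. It imitates, rather than calls, `CountVectorizer`: lowercasing and alphabetical column order are assumed to match the library defaults, and the three sentences are toy data.

```python
import re
from collections import Counter

docs = ["This is a pen", "I am not a pen", "Hello"]  # toy documents


def tokenize(text):
    # Lowercase, then extract word tokens (single characters included).
    return re.findall(r"(?u)\b\w+\b", text.lower())


# Map each word to a column index, in alphabetical order.
vocabulary = {
    word: index
    for index, word in enumerate(sorted({w for d in docs for w in tokenize(d)}))
}

# One row per document, one column per word, each cell a count.
matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    matrix.append([counts.get(word, 0) for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```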

New documents are converted with transform.

Words that did not appear during `fit_transform` (that is, words not in vocabulary_) do not appear in the matrix.

In [0]:
# Note that transform also expects its input as a list
X_new = vectorizer.transform([
    "Hello I like this pen"
])
X_new.toarray()
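
The cell above relies on a vectorizer fitted in an earlier step. As a self-contained sketch, the example below fits a `CountVectorizer` on three toy sentences (an assumption for illustration) and then transforms a sentence containing the unseen word `like`.

```python
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
vectorizer.fit(["This is a pen", "I am not a pen", "Hello"])  # toy data

# "like" never appeared during fit, so it has no column.
X_new = vectorizer.transform(["Hello I like this pen"])

print(sorted(vectorizer.vocabulary_))
print(X_new.toarray())
print("like" in vectorizer.vocabulary_)
```

The transformed row has exactly as many columns as the fitted vocabulary, and the unseen word is silently ignored.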

### Applying to Japanese data
Now apply it to preprocessed Japanese data. If the DataFrame holds space-separated text, the DataFrame column can be passed in directly.

In [0]:
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(df["text_tokenized"])

In [0]:
X

In [0]:
X[0].toarray()

↑ With a document-term matrix of this scale, very few cells actually hold a value relative to the size of the matrix (number of documents x number of words).
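
To put a number on this, the TF-IDF matrix summary shown earlier in this notebook reports a 6154x27928 matrix with 495290 stored elements. The arithmetic below, a small sketch using only those three figures, gives the fraction of cells that actually hold a value.

```python
n_docs, n_terms = 6154, 27928   # matrix shape from the earlier output
stored = 495290                 # stored (nonzero) elements

total_cells = n_docs * n_terms
density = stored / total_cells

print(f"total cells: {total_cells:,}")
print(f"density:     {density:.4%}")
```

Well under 1% of the cells are filled, which is why scikit-learn returns a sparse matrix rather than a dense array.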

In [0]:
vectorizer.vocabulary_

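## TF-IDF transformation
---

A TF-IDF transformation weights each word's occurrence count by how rare the word is across the whole set of documents.

A TF-IDF-weighted document-term matrix can be created with `TfidfVectorizer`.

### A simple example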

In [0]:
# By default, token_pattern excludes single-character words, so set it to keep them
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")

In [0]:
# Pass the space-separated documents as a list
X = vectorizer.fit_transform([
    "This is a pen",
    "I am not a pen",
    "Hello"
])
X.toarray()

In [0]:
# Note: this code only converts the document-term matrix to a DataFrame to make it easier to read for this exercise; there is no need to memorize it
pd.DataFrame(X.toarray(), columns=[ x[0] for x in sorted(vectorizer.vocabulary_.items(), key=lambda x: x[1]) ])

Words that also appear in other documents receive a lower weight for the same occurrence count. The weight of each word is available in `idf_`. Let's check.

In [0]:
vectorizer.idf_

In [0]:
# Note: this code only converts the IDF values to a DataFrame to make them easier to read for this exercise; there is no need to memorize it
pd.DataFrame(np.atleast_2d(vectorizer.idf_), columns=[ x[0] for x in sorted(vectorizer.vocabulary_.items(), key=lambda x: x[1]) ])

By default, each document vector is normalized so that its norm is 1 (norm="l2"). Let's check.

In [0]:
np.linalg.norm(X.toarray(), axis=1)
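
The weighting and the normalization can also be reproduced by hand. The sketch below recomputes the TF-IDF row for the first toy sentence using the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and L2 normalization, which are scikit-learn's defaults; it is a simplified reimplementation for illustration, not the library's code.

```python
import math
from collections import Counter

docs = [["this", "is", "a", "pen"],
        ["i", "am", "not", "a", "pen"],
        ["hello"]]  # the toy sentences, lowercased and split

n_docs = len(docs)
# Document frequency: in how many documents each word appears.
doc_freq = Counter(word for doc in docs for word in set(doc))
idf = {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}

# TF-IDF for the first document, then scale the row to unit L2 norm.
counts = Counter(docs[0])
raw = {w: counts[w] * idf[w] for w in counts}
norm = math.sqrt(sum(v * v for v in raw.values()))
weights = {w: v / norm for w, v in raw.items()}

print(weights)
print(math.sqrt(sum(v * v for v in weights.values())))
```

Here `a` and `pen` occur in two documents, so they end up with a smaller weight than `this` and `is`, which occur in only one.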

### Applying to Japanese data
As with CountVectorizer, apply it to the preprocessed Japanese data.

In [0]:
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(df["text_tokenized"])

In [0]:
X[0].toarray()

Let's look at the words with the largest and smallest IDF values.

In [0]:
# Note: this code only converts the IDF values to a DataFrame to make them easier to read for this exercise; there is no need to memorize it
pd.Series(vectorizer.idf_, index=[ x[0] for x in sorted(vectorizer.vocabulary_.items(), key=lambda x: x[1]) ]) \
    .to_frame("idf") \
    .sort_values("idf", ascending=False)

## Text classification
---
Using the quantified text information, let's classify texts by category.

In [0]:
# Test data
df_test = pd.read_csv("dataset/wikipedia-test.txt", sep="\t")
df_test.head()

### Creating the document-term matrix (TF-IDF applied)
Before creating the document-term matrix, first run morphological analysis on the text.

In [0]:
mecab = MeCab.Tagger("-O wakati")

text_tokenized = []
for text in df_test['text']:
    text_tokenized.append(mecab.parse(text))
    
df_test['text_tokenized'] = text_tokenized

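Next, generate a document-term matrix for the training data and for the test data.<br>
The two matrices must have the same number of dimensions, so use `transform` for the test data, which reuses the vocabulary learned from the training data.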

In [0]:
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(df["text_tokenized"])

In [0]:
X_test = vectorizer.transform(df_test["text_tokenized"])
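
The reason for calling `transform` rather than `fit_transform` on the test data can be seen on a small example. The training and test sentences below are made up; the point is only that both matrices end up with the same number of columns, whereas refitting on the test set would not guarantee that.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["犬 が 走る", "猫 が 眠る", "犬 と 猫"]  # toy, pre-tokenized
test_docs = ["鳥 が 飛ぶ", "犬 が 眠る"]

vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
X_train = vectorizer.fit_transform(train_docs)

# Reuse the vocabulary learned from the training data.
X_test = vectorizer.transform(test_docs)

# Refitting on the test data would learn a different vocabulary.
X_refit = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b").fit_transform(test_docs)

print(X_train.shape, X_test.shape, X_refit.shape)
```

A classifier trained on `X_train` can accept `X_test` because the column counts agree; it could not accept `X_refit`.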

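### Applying and comparing classifiers
Once the document-term matrix is ready, the rest is the same as the machine learning covered so far.

Let's compare the performance of several classifiers.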

In [0]:
# Show Accuracy, Precision/Recall/F-score/Support, and the Confusion Matrix
def show_evaluation_metrics(y_true, y_pred):
    print("Accuracy:")
    print(accuracy_score(y_true, y_pred))
    print()
    
    print("Report:")
    print(classification_report(y_true, y_pred))
    
    print("Confusion matrix:")
    print(confusion_matrix(y_true, y_pred))
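
To see what these metrics measure, the sketch below computes accuracy and a confusion matrix by hand for four made-up labels. It follows the same convention as scikit-learn's `confusion_matrix` (rows are true labels, columns are predicted labels, both in sorted order) but is a plain-Python illustration.

```python
y_true = ["a", "b", "a", "b"]  # made-up ground truth
y_pred = ["a", "b", "b", "b"]  # made-up predictions

# Accuracy: share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Confusion matrix: rows are true labels, columns are predicted labels.
labels = sorted(set(y_true) | set(y_pred))
index = {label: i for i, label in enumerate(labels)}
matrix = [[0] * len(labels) for _ in labels]
for t, p in zip(y_true, y_pred):
    matrix[index[t]][index[p]] += 1

print(accuracy)
print(matrix)
```

The off-diagonal cell in the first row records the one `a` document that was misclassified as `b`.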

### Logistic regression

In [0]:
clf_lr = LogisticRegression(n_jobs=-1)
clf_lr.fit(X, df["category"])

Predictions are obtained with predict (the same applies to the other classifiers).

In [0]:
clf_lr.predict(X_test)

In [0]:
y_test_pred = clf_lr.predict(X_test)
show_evaluation_metrics(df_test["category"], y_test_pred)

### SVM (linear kernel)

In [0]:
clf_svc = LinearSVC()
clf_svc.fit(X, df["category"])

In [0]:
y_test_pred = clf_svc.predict(X_test)
show_evaluation_metrics(df_test["category"], y_test_pred)

### Random Forest

In [0]:
clf_rf = RandomForestClassifier(n_estimators=50, n_jobs=-1)
clf_rf.fit(X, df["category"])

In [0]:
y_test_pred = clf_rf.predict(X_test)
show_evaluation_metrics(df_test["category"], y_test_pred)
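
The three classifiers can also be compared in a single loop. The sketch below does so on a tiny made-up dataset, so the scores themselves mean nothing; `random_state=0` is added only to make the random forest repeatable.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

# Toy, pre-tokenized documents (hypothetical).
train_docs = ["soccer goal match", "goal keeper soccer", "match referee goal",
              "stock market price", "market price bank", "bank stock loan"]
train_labels = ["sports"] * 3 + ["finance"] * 3
test_docs = ["soccer referee", "bank loan price"]
test_labels = ["sports", "finance"]

vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
X_train = vectorizer.fit_transform(train_docs)
X_test = vectorizer.transform(test_docs)

classifiers = {
    "LogisticRegression": LogisticRegression(),
    "LinearSVC": LinearSVC(),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Fit each classifier on the same features and record its test accuracy.
scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, train_labels)
    scores[name] = accuracy_score(test_labels, clf.predict(X_test))

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Keeping the vectorizer and the data fixed while swapping only the classifier makes the comparison fair.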

## [Exercise 1]
Change the parameters of any of the algorithms above and try to improve the classification accuracy.

In [0]:
# Enter your code here

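### Saving the model
A model can be saved with `pickle`, but with `joblib`, giving the file name a suffix such as `.gz` or `.bz2` compresses it automatically. Likewise, it is decompressed automatically when loaded.

Models in language processing are often large, so compressing them matters.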

In [0]:
joblib.dump(clf_rf, "wikipedia_category_classifier.pkl.gz")
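
The effect of compression can be illustrated with the standard library alone. The sketch below uses `pickle` with `gzip` as a stand-in for `joblib`, and a repetitive dictionary as a stand-in for a trained model; both are assumptions made to keep the example self-contained.

```python
import gzip
import os
import pickle
import tempfile

# A repetitive object standing in for a large trained model.
model_like = {"weights": [0.0] * 100_000, "classes": ["a", "b"]}

with tempfile.TemporaryDirectory() as tmp:
    raw_path = os.path.join(tmp, "model.pkl")
    gz_path = os.path.join(tmp, "model.pkl.gz")

    # Save once uncompressed and once through gzip.
    with open(raw_path, "wb") as f:
        pickle.dump(model_like, f)
    with gzip.open(gz_path, "wb") as f:
        pickle.dump(model_like, f)

    raw_size = os.path.getsize(raw_path)
    gz_size = os.path.getsize(gz_path)

    # Reading through gzip.open decompresses transparently.
    with gzip.open(gz_path, "rb") as f:
        restored = pickle.load(f)

print(raw_size, gz_size)
print(restored == model_like)
```

How much space is saved depends on the model; highly repetitive data like this toy object compresses far better than typical learned weights.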

### Testing model loading

In [0]:
clf_rf_restored = joblib.load("wikipedia_category_classifier.pkl.gz")
clf_rf_restored

The loaded model was already trained above, so it can make predictions with `predict`.

In [0]:
clf_rf_restored.predict(X_test)

## [Exercise 2]
Save the model built in Exercise 1 using joblib.

In [0]:
# Enter your code here