<a href="https://colab.research.google.com/github/c-c-c-c/dm_integration/blob/master/myMecab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

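# Text classification
Text classification proceeds in the following order:

1. Preprocessing
2. Numerical representation (CountVectorizer, TfidfVectorizer)
3. Training a classifier

In this notebook, we build a text classifier by classifying Wikipedia entries.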
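As a rough orientation, the three steps above can be sketched end to end on a toy corpus. The documents and labels below are hypothetical and already tokenized (step 1 is assumed done); `make_pipeline` is one convenient way to chain steps 2 and 3, not necessarily how the rest of this notebook is organized.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical, already tokenized (space-separated) documents and labels.
train_docs = [
    "探偵 事件 捜査 犯人",
    "刑事 事件 容疑者 逮捕",
    "恋人 結婚 告白 恋愛",
    "結婚 恋 再会 告白",
]
train_labels = ["crime", "crime", "romance", "romance"]

# Steps 2 and 3 chained: TF-IDF vectorization followed by a classifier.
model = make_pipeline(
    TfidfVectorizer(token_pattern=r"(?u)\b\w+\b"),
    LogisticRegression(),
)
model.fit(train_docs, train_labels)

# A new, tokenized document is vectorized and classified in one call.
predicted = model.predict(["事件 犯人 逮捕"])
print(predicted)
```

Keeping the vectorizer and the classifier in one object means new text always passes through the same vocabulary that was learned during training.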

In [205]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [206]:
!apt install aptitude
!aptitude install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file -y
!pip install mecab-python3==0.7

Reading package lists... Done
Building dependency tree       
Reading state information... Done
aptitude is already the newest version (0.8.10-6ubuntu1).
0 upgraded, 0 newly installed, 0 to remove and 25 not upgraded.
mecab is already installed at the requested version (0.996-5)
libmecab-dev is already installed at the requested version (0.996-5)
mecab-ipadic-utf8 is already installed at the requested version (2.7.0-20070801+main-1)
git is already installed at the requested version (1:2.17.1-1ubuntu0.5)
make is already installed at the requested version (4.1-9.1ubuntu1)
curl is already installed at the requested version (7.58.0-2ubuntu3.8)
xz-utils is already installed at the requested version (5.2.2-1.3)
file is already installed at the requested version (1:5.32-2ubuntu0.3)
mecab is already installed at the requested version (0.996-5)
libmecab-dev is already installed at the requested version (0.996-5)
mecab-ipadic-utf8 is already installed at the requested version (2.7.0-20070801+mai

In [207]:
!git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
!echo yes | mecab-ipadic-neologd/bin/install-mecab-ipadic-neologd -n -a

fatal: destination path 'mecab-ipadic-neologd' already exists and is not an empty directory.
[install-mecab-ipadic-NEologd] : Start..
[install-mecab-ipadic-NEologd] : Check the existance of libraries
[install-mecab-ipadic-NEologd] :     find => ok
[install-mecab-ipadic-NEologd] :     sort => ok
[install-mecab-ipadic-NEologd] :     head => ok
[install-mecab-ipadic-NEologd] :     cut => ok
[install-mecab-ipadic-NEologd] :     egrep => ok
[install-mecab-ipadic-NEologd] :     mecab => ok
[install-mecab-ipadic-NEologd] :     mecab-config => ok
[install-mecab-ipadic-NEologd] :     make => ok
[install-mecab-ipadic-NEologd] :     curl => ok
[install-mecab-ipadic-NEologd] :     sed => ok
[install-mecab-ipadic-NEologd] :     cat => ok
[install-mecab-ipadic-NEologd] :     diff => ok
[install-mecab-ipadic-NEologd] :     tar => ok
[install-mecab-ipadic-NEologd] :     unxz => ok
[install-mecab-ipadic-NEologd] :     xargs => ok
[install-mecab-ipadic-NEologd] :     grep => ok
[install-mecab-ipadic-NEo

In [0]:
import joblib
import MeCab
import numpy as np
import pandas as pd
import re

from collections import Counter
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.svm import LinearSVC

## Loading the data

In [0]:
# Load the CSV file (read_csv defaults to comma-separated input)
df = pd.read_csv("./drive/My Drive/0_インテグ作業/data/EPG_checking (1).csv")

In [0]:
df["sharp_detail_epg_tknz"] = np.nan

In [211]:
df

Unnamed: 0.1,Unnamed: 0,drama_key,drama_title,sharp_title,sharp_num_cnt,sharp_num_epg,start_time,sharp_detail_epg,sharp_detail_epg_tknz
0,0,1910_CX_月21,シャーロック,番組名シャーロック【危険な天才探偵×スマートな医師ー今夜、運命の出逢い】　#01,1,['#01'],"['21', '22']",番組詳細【公式HP】 https://www.fujitv.co.jp/sherlock/...,
1,1,1910_CX_月21,シャーロック,番組名シャーロック【探偵×医師最強バディ始動!天才VS聖女の謎解き心理戦】　#02,2,['#02'],"['21', '22']",番組詳細「FIVBワールドカップバレーボール2019　男子　日本×ブラジル」延長の際、放送時...,
2,2,1910_CX_月21,シャーロック,番組名シャーロック【天才VS詐欺師の騙し合い!ノンストップのどんでん返し】　#03,3,['#03'],"['21', '21']",番組詳細【公式HP】 https://www.fujitv.co.jp/sherlock/...,
3,3,1910_CX_月21,シャーロック,番組名シャーロック【試合直前に消えたボクシング世界王者!オレンジの傘の謎】　#04,4,['#04'],"['21', '21']",番組詳細【公式HP】 https://www.fujitv.co.jp/sherlock/...,
4,4,1910_CX_月21,シャーロック,番組名シャーロック【歩く死体の謎!熱帯魚と母の愛…その真相、正義か狂気か】　#05,5,['#05'],"['21', '21']",番組詳細【公式HP】 https://www.fujitv.co.jp/sherlock/...,
...,...,...,...,...,...,...,...,...,...
7662,7662,0710_TBS_日21,ハタチの恋人,番組名ハタチの恋人「運命の再会」,14,,"['21', '21']",番組詳細レストランに向かった圭祐(明石家さんま)は、かつて交際していた絵里(小泉今日子)と再...,
7663,7663,0710_TBS_日21,ハタチの恋人,番組名ハタチの恋人「バージンロード」,15,,"['21', '21']",番組詳細ユリ(長沢まさみ)は、由紀夫(塚本高史)の父親に会うと約束した日に、圭祐(明石家さん...,
7664,7664,0707_CX_月21,ファースト・キス,番組名ファースト・キス,1,,"['21', '21']",番組詳細ファースト・キス◇秋生（平岡祐太）への思いを捨て、ロサンゼルスに帰ることを決意した美...,
7665,7665,0707_CX_月21,ファースト・キス,番組名ファースト・キス「兄と妹の最終章！」,2,,"['21', '21']",番組詳細成田空港で飛行機に乗ろうとしていた美緒(井上真央)の前に、和樹(伊藤英明)が現れる。...,


In [0]:
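def removeTrash(text):
    """Strip URLs, boilerplate labels, symbols, and broadcaster names from EPG text."""
    result_text = text
    # URLs
    result_text = re.sub(r"https?://[\w/:%#\$&\?\(\)~\.=\+\-]+", "", result_text)
    # Boilerplate labels such as "番組詳細" (program details)
    result_text = re.sub(r"番組詳細|制作・著作|制作著作", "", result_text)
    # Half-width symbols and a few full-width brackets
    result_text = re.sub(r"[!\(\)=『』～/]", "", result_text)
    # Broadcaster names
    result_text = re.sub(r"フジテレビ|日本テレビ|TBS|テレビ朝日|関西テレビ", "", result_text)
    # Tags such as 【公式HP】, then full-width spaces
    result_text = re.sub(r"【公式.*?】", "", result_text)
    result_text = re.sub(r"\u3000", "", result_text)

    return result_text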


In [213]:
removeTrash(df['sharp_detail_epg'][20])

'桑野阿部寛が仕事中に倒れ、中川尾美としのりの病院に運び込まれる。見舞いに訪れたまどか吉田羊、有希江稲森いずみ、早紀深川麻衣は、皮肉も全く言わず、いつになく素直で別人のような態度の桑野に大いに驚く。病気をきっかけに桑野が“いい人"になったと喜ぶ有希江と早紀に対して、普段から桑野とケンカばかりのまどかだけは、素直なのはあくまで一時的なものに違いないと疑うが、有希江から「桑野さんに厳しすぎ」と指摘されてしまう。    案の定、まどかは回復した桑野とまたもやささいなことで言い争いに。有希江や早紀には好意的なのに、なぜ自分には皮肉ばかり言うのかー。納得がいかないまどかに対し、早紀は、男と女の間には言葉と感情が裏腹になることがあると力説。それを体現した自分の舞台を見にきてほしいと、まどかたちを誘う。数日後、都合が悪くなり、舞台に行けなくなったまどかが困っていると、偶然桑野がやって来て、自分が代わりに行くと言い出す。早紀が、桑野には来てほしくないと言っていたことを思い出したまどかは慌てるが、一度行く気になった桑野を止められず、結局、桑野と有希江が舞台を見に行くことに。まどかは2人のデート?が気になって…。阿部寛    吉田羊  深川麻衣  塚本高史  咲妃みゆ  平祐奈    阿南敦子  奈緒  荒井敦史  小野寺ずる  美音    REDRICE湘南乃風  デビット伊東  不破万作    三浦理恵子  尾美としのり  稲森いずみ  草笛光子【脚本】  尾崎将也結婚できない男アットホーム・ダッドシグナル長期未解決事件捜査班他  【演出】  三宅喜重結婚できない男パーフェクトワールド僕のヤバイ妻サイレーン刑事×彼女×完全悪女  小松隆志MMJ結婚できない男美しい隣人白い春植田尚MMJ結婚できない男サキまっすぐな男鬼嫁日記  【チーフプロデューサー】  安藤和久  東城祐司MMJ  【プロデューサー】  米田孝  伊藤達哉MMJ  木曽貴美子MMJ  【制作】    MMJメディアミックス・ジャパン詳しい情報は番組ホームページへ  「まだ結婚できない男」で検索'

In [214]:
mecab = MeCab.Tagger()
mecab.parse("")  # common workaround for node.surface problems in older mecab-python3 releases

tokenized_rows = []
for text in df['sharp_detail_epg']:
    # Missing descriptions (NaN) stay missing
    if not isinstance(text, str):
        tokenized_rows.append(np.nan)
        continue

    # Remove URLs, symbols, and other noise
    text = removeTrash(text)

    text_tokenized = []
    node = mecab.parseToNode(text)
    while node:
        # Skip BOS/EOS, particles, symbols, auxiliary verbs, and person names
        if not node.feature.startswith(("BOS/EOS", "助詞", "記号", "助動詞")) and\
            "人名" not in node.feature:
            text_tokenized.append(node.surface)
        node = node.next

    tokenized_rows.append(text_tokenized)

# Assign the whole column at once instead of chained .iloc assignment
df["sharp_detail_epg_tknz"] = pd.Series(tokenized_rows, index=df.index, dtype=object)


In [215]:
df['sharp_detail_epg_tknz'][190:220]

190    [菜, 津, 美, 小市, 慢太, 郎, YOU, ほか, 穂香, 拉致, さ, れ, ４,...
191    [菜, 津, 美, 小市, 慢太, 郎, YOU, ほか, 橘, ひかり, 3, 年, 前,...
192    [菜, 津, 美, 小市, 慢太, 郎, YOU, ほか, 襲撃, さ, れ, 消息, 絶つ...
193    [菜, 津, 美, 小市, 慢太, 郎, YOU, 介, ほか, 病院, 息子, 真, 襲っ...
194    [菜, 津, 美, 小市, 慢太, 郎, YOU, 介, ほか, 殺さ, れ, 妻, 3, ...
195    [菜, 津, 美, 小市, 慢太, 郎, YOU, 介, ほか, たち, 緊急, 指令, 室...
196    [菜, 津, 美, 小市, 慢太, 郎, YOU, 介, ほか, 出演, タイムリミット, ...
197    [笑い, ため, 何でも, やる, 学園, 爆笑, 王, ", こと, 主人公, 右, 前,...
198    [文化, 祭, アドリブ, 漫才, 大, 成功, さ, せ, 右, 大知, コンビ, き, ...
199    [本気, 芸人, 目指す, 決意, 固め, 右, 大知, 漫才, 日本一, 決める, コンテ...
200    [右, 解散, し, 漫才, コンビ, ねずみ, 花火, 姉, 交際, 知る, 十, 年, ...
201    [右, 笑い, まっすぐ, 姿勢, 胸, 打た, れ, 自分, たち, 原点, 取り戻し, ...
202    [漫才, 日本一, 決める, NMC, 決勝, 進出, 決め, コンビ, デジタル, きん,...
203    [ついに, 最終, 高校, 卒業, し, 右, 一人暮らし, 始め, 大知, お笑い, 養成...
204    [再び, 右, 大知, コンビ, 戻っ, ゃり, 暮らし, 芸人, 選抜, クラス, 芸, ...
205    [トキワ, 自動車, 経営, 戦略, 室, 次長, 天敵, 常務, 対立, し, 府中, 工...
206    [府中, 工場, 戦う, こと, 決意, し, 新, 監督, 候補, 探す, 中, 因縁, ...
207    [初, 公式, 戦, 控え, 新生, アストロズ

In [216]:
df["sharp_detail_epg_tknz"][190]

['菜',
 '津',
 '美',
 '小市',
 '慢太',
 '郎',
 'YOU',
 'ほか',
 '穂香',
 '拉致',
 'さ',
 'れ',
 '４',
 '時間',
 '経過',
 '希',
 '葵',
 '暴行',
 '動画',
 'ライブ',
 '配信',
 'する',
 '時刻',
 '迫る',
 '沖',
 'たち',
 '強行',
 '犯',
 '係',
 '逃亡',
 'ルート',
 'なる',
 '港',
 '急ぐ',
 'ら',
 '葵',
 '捜索',
 '全力',
 '注ぐ',
 '一方',
 '過去',
 '洗っ',
 'い',
 'その',
 '意外',
 '正体',
 '突き止める',
 'それ',
 '聞い',
 '栞',
 '菜',
 '津',
 '美',
 '妹',
 '葵',
 'ある',
 '廃校',
 '監禁',
 'さ',
 'れ',
 'いる',
 'はず',
 '言い',
 '出し',
 '廃校',
 '緊迫',
 '救出',
 '劇',
 '始まる',
 '演出',
 '原作',
 '"',
 'Based',
 'on',
 'the',
 'series',
 '"',
 'Voice',
 '",',
 'produced',
 'and',
 'distributed',
 'byStudio',
 'Dragon',
 'Corporation',
 'and',
 'CJ',
 'ENM',
 'Co',
 '.,',
 'Ltd',
 '"',
 '脚本',
 '主題歌',
 'BLUE',
 'ENCOUNT',
 'バッドパラドックス',
 '音楽',
 '芦屋',
 'チーフ',
 'プロデューサー',
 'プロデューサー',
 '制作',
 '協力',
 'AXON',
 '製作',
 '著作']

In [217]:
df["sharp_detail_epg"][190]

'番組詳細唐沢寿明、真木よう子、増田貴久、木村祐一、石橋菜津美、田村健太郎、安井順平/小市慢太郎、YOU/菊池桃子 ほか森下葵（矢作穂香）が拉致されて４時間が経過。新田（森永悠希）が葵の暴行動画のライブ配信する時刻が迫る。沖原（木村祐一）たち強行犯係は逃亡ルートとなる港に急ぐが、樋口（唐沢寿明）らは葵の捜索に全力を注ぐ。一方、新田の過去を洗っていた緒方（田村健太郎）はその意外な正体を突き止める。それを聞いた栞（石橋菜津美）は、妹の葵がある廃校に監禁されているはずだと言い出し…。廃校で、緊迫の救出劇が始まる!!【演出】久保田充【原作】"Based on the series "Voice", produced and distributed byStudio Dragon Corporation and CJ ENM Co.,Ltd"  【脚本】浜田秀哉【主題歌】BLUE ENCOUNT「バッドパラドックス」  【音楽】ゲイリー芦屋【チーフプロデューサー】池田健司  【プロデューサー】尾上貴洋、後藤庸介  【制作協力】AXON  【製作著作】日本テレビ【公式HP】https://www.ntv.co.jp/voice/  【公式Twitter】https://twitter.com/voice_ntv  【公式Instagram】https://www.instagram.com/voice.ntv/'

In [218]:
df["sharp_detail_epg"].iloc[190]

'番組詳細唐沢寿明、真木よう子、増田貴久、木村祐一、石橋菜津美、田村健太郎、安井順平/小市慢太郎、YOU/菊池桃子 ほか森下葵（矢作穂香）が拉致されて４時間が経過。新田（森永悠希）が葵の暴行動画のライブ配信する時刻が迫る。沖原（木村祐一）たち強行犯係は逃亡ルートとなる港に急ぐが、樋口（唐沢寿明）らは葵の捜索に全力を注ぐ。一方、新田の過去を洗っていた緒方（田村健太郎）はその意外な正体を突き止める。それを聞いた栞（石橋菜津美）は、妹の葵がある廃校に監禁されているはずだと言い出し…。廃校で、緊迫の救出劇が始まる!!【演出】久保田充【原作】"Based on the series "Voice", produced and distributed byStudio Dragon Corporation and CJ ENM Co.,Ltd"  【脚本】浜田秀哉【主題歌】BLUE ENCOUNT「バッドパラドックス」  【音楽】ゲイリー芦屋【チーフプロデューサー】池田健司  【プロデューサー】尾上貴洋、後藤庸介  【制作協力】AXON  【製作著作】日本テレビ【公式HP】https://www.ntv.co.jp/voice/  【公式Twitter】https://twitter.com/voice_ntv  【公式Instagram】https://www.instagram.com/voice.ntv/'

In [219]:
str(df["sharp_detail_epg_tknz"].iloc[190])

'[\'菜\', \'津\', \'美\', \'小市\', \'慢太\', \'郎\', \'YOU\', \'ほか\', \'穂香\', \'拉致\', \'さ\', \'れ\', \'４\', \'時間\', \'経過\', \'希\', \'葵\', \'暴行\', \'動画\', \'ライブ\', \'配信\', \'する\', \'時刻\', \'迫る\', \'沖\', \'たち\', \'強行\', \'犯\', \'係\', \'逃亡\', \'ルート\', \'なる\', \'港\', \'急ぐ\', \'ら\', \'葵\', \'捜索\', \'全力\', \'注ぐ\', \'一方\', \'過去\', \'洗っ\', \'い\', \'その\', \'意外\', \'正体\', \'突き止める\', \'それ\', \'聞い\', \'栞\', \'菜\', \'津\', \'美\', \'妹\', \'葵\', \'ある\', \'廃校\', \'監禁\', \'さ\', \'れ\', \'いる\', \'はず\', \'言い\', \'出し\', \'廃校\', \'緊迫\', \'救出\', \'劇\', \'始まる\', \'演出\', \'原作\', \'"\', \'Based\', \'on\', \'the\', \'series\', \'"\', \'Voice\', \'",\', \'produced\', \'and\', \'distributed\', \'byStudio\', \'Dragon\', \'Corporation\', \'and\', \'CJ\', \'ENM\', \'Co\', \'.,\', \'Ltd\', \'"\', \'脚本\', \'主題歌\', \'BLUE\', \'ENCOUNT\', \'バッドパラドックス\', \'音楽\', \'芦屋\', \'チーフ\', \'プロデューサー\', \'プロデューサー\', \'制作\', \'協力\', \'AXON\', \'製作\', \'著作\']'

In [0]:
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")

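In [0]:
# Each row is a token list (or NaN); str() gives its repr, and token_pattern
# then picks the word tokens back out of that string.
X = vectorizer.fit_transform(
    [str(i) for i in df["sharp_detail_epg_tknz"].values]
)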

In [222]:
X

<7667x30691 sparse matrix of type '<class 'numpy.float64'>'
	with 734548 stored elements in Compressed Sparse Row format>

In [223]:
pd.DataFrame(X.toarray(), columns=[ x[0] for x in sorted(vectorizer.vocabulary_.items(), key=lambda x: x[1]) ])

Unnamed: 0,0,00,000,01,02,03,04,0406,0456666,050,07,0710,08,1,10,100,1000,1014,1018,102,1024,1028,103,1030,1031,104,105,109,11,110,1100,111,1110,1111,1113,1114,1117,1118,112,1121,...,黙り込ん,黙る,黙秘,黙認,黛,鼓,鼓動,鼓笛隊,鼓舞,鼠,鼻,鼻息,鼻歌,鼻血,齋,齋籐,齢,龍,龍崎,龍門,０,１,１つ,２,３,４,５,６,７,８,９,ａｖ,ｃｍ,ｄｎａ,ｈｐ,ｍｄ,ｎｈｋ,ｏｌ,ｓｉｔ,ｓｎｓ
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.040330,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.033866,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.037010,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.093859,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.083199,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7662,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7663,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7664,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7665,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000000,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [224]:
vectorizer.idf_

array([6.57751527, 8.33537319, 8.55851674, ..., 9.25166392, 9.25166392,
       8.84619882])

In [0]:
# Inspect the IDF values

idf = pd.Series(vectorizer.idf_, index=[ x[0] for x in sorted(vectorizer.vocabulary_.items(), key=lambda x: x[1]) ]) \
    .to_frame("idf") \
    .sort_values("idf", ascending=False)

In [230]:



pd.set_option('display.max_rows', 1000)
idf.iloc[1:100]

Unnamed: 0,idf
山場,9.251664
サンダー,9.251664
山地,9.251664
迎い,9.251664
サントリー,9.251664
山中湖,9.251664
サンド,9.251664
山の神,9.251664
サンバ,9.251664
山々,9.251664


## Morphological analysis
---

This section shows morphological analysis with MeCab and a few simple processing examples.
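With an IPADIC-style dictionary, MeCab reports a surface form for each token together with a comma-separated feature string that begins with the part of speech. The part-of-speech filtering used in the tokenization loop earlier can be sketched without MeCab installed. The `(surface, feature)` pairs below are hand-written, hypothetical examples in that style (real values would come from `mecab.parseToNode`), and `keep_token` is an illustrative helper name.

```python
# Hypothetical (surface, feature) pairs written by hand in IPADIC style;
# real values would come from mecab.parseToNode(text).
nodes = [
    ("", "BOS/EOS,*,*,*,*,*,*,*,*"),
    ("樋口", "名詞,固有名詞,人名,姓,*,*,樋口,ヒグチ,ヒグチ"),
    ("は", "助詞,係助詞,*,*,*,*,は,ハ,ワ"),
    ("廃校", "名詞,一般,*,*,*,*,廃校,ハイコウ,ハイコー"),
    ("へ", "助詞,格助詞,一般,*,*,*,へ,ヘ,エ"),
    ("急ぐ", "動詞,自立,*,*,五段・ガ行,基本形,急ぐ,イソグ,イソグ"),
    ("。", "記号,句点,*,*,*,*,。,。,。"),
    ("", "BOS/EOS,*,*,*,*,*,*,*,*"),
]

# Parts of speech dropped by the loop: sentence markers, particles,
# symbols, and auxiliary verbs.
EXCLUDED_POS = ("BOS/EOS", "助詞", "記号", "助動詞")


def keep_token(feature):
    """Return True for tokens that should stay in the tokenized text."""
    if feature.startswith(EXCLUDED_POS):
        return False
    # Drop person names wherever the tag appears in the feature string.
    return "人名" not in feature


tokens = [surface for surface, feature in nodes if keep_token(feature)]
print(tokens)  # ['廃校', '急ぐ']
```

Only the content words survive, which is why the tokenized EPG rows shown earlier contain mostly nouns and verb stems.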

### A simple example

In [0]:
# Note: this code only converts the document-term matrix to a DataFrame so it is easier to read during the exercise; there is no need to memorize it
pd.DataFrame(X.toarray(), columns=[ x[0] for x in sorted(vectorizer.vocabulary_.items(), key=lambda x: x[1]) ])

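In the document-term matrix produced by CountVectorizer, each row (horizontal direction) represents one document, and each column stores the number of occurrences of one word. (Note: in some references the documents run along the columns instead, so take care when reading other literature.)

To find out which column corresponds to which word, look at the vectorizer's vocabulary_.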
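A minimal sketch of that layout, on three hypothetical English documents: the matrix has one row per document, and `vocabulary_` maps each word to its column index.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["this is a pen", "i am not a pen", "hello hello"]

# token_pattern is widened so that one-character words such as "a" are kept.
count_vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
counts = count_vectorizer.fit_transform(docs)

# One row per document, one column per vocabulary entry.
print(counts.shape)

# Look up the column index of "hello", then its count in the third document.
hello_col = count_vectorizer.vocabulary_["hello"]
print(counts[2, hello_col])
```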

In [0]:
vectorizer.vocabulary_

New documents are converted with transform.

Note that words that did not appear during fit\_transform (that is, words not in vocabulary_) do not appear in the matrix.
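The effect on unseen words can be checked directly. In this sketch (toy documents, variable names are illustrative), two of the five words in the new document are missing from the fitted vocabulary and therefore contribute nothing to the transformed row.

```python
from sklearn.feature_extraction.text import CountVectorizer

fitted = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
fitted.fit(["this is a pen", "i am not a pen"])

new_doc = "hello i like this pen"
new_counts = fitted.transform([new_doc])

# Words absent from vocabulary_ are silently ignored by transform.
unseen = [w for w in new_doc.split() if w not in fitted.vocabulary_]
print(unseen)            # ['hello', 'like']
print(new_counts.sum())  # 3, from "i", "this", and "pen"
```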

In [0]:
# Note that transform also expects its input as a list
X_new = vectorizer.transform([
    "Hello I like this pen"
])
X_new.toarray()

### Applying it to Japanese data
Let's apply it to preprocessed Japanese data. If the DataFrame holds space-separated text, the DataFrame column can be passed as-is.

In [0]:
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(df["text_tokenized"])

In [0]:
X

In [0]:
X[0].toarray()

↑ In a document-term matrix of this scale, almost none of the cells hold a value relative to the size of the matrix (documents × words).
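How sparse a matrix is can be measured from its `nnz` attribute (the number of stored entries). The sketch below does this for a toy matrix and then repeats the arithmetic for the 7667x30691 matrix with 734548 stored elements reported earlier in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["this is a pen", "i am not a pen", "hello"]
matrix = CountVectorizer(token_pattern=r"(?u)\b\w+\b").fit_transform(docs)

# nnz is the number of explicitly stored (non-zero) entries.
n_docs, n_terms = matrix.shape
density = matrix.nnz / (n_docs * n_terms)
print(f"{matrix.nnz} of {n_docs * n_terms} cells are non-zero ({density:.1%})")

# The same calculation for the matrix reported earlier in this notebook.
epg_density = 734548 / (7667 * 30691)
print(f"{epg_density:.2%}")
```

For the EPG matrix, roughly 0.3% of the cells are filled, which is why scikit-learn returns a sparse matrix rather than a dense array.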

In [0]:
vectorizer.vocabulary_

## TF-IDF transformation
---

The TF-IDF transformation weights each word's occurrence count by how rare the word is across all documents.

A TF-IDF-weighted document-term matrix can be created with `TfidfVectorizer`.
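With scikit-learn's default settings (`smooth_idf=True`, `norm="l2"`), the weight of a term $t$ is computed as $\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1$, where $n$ is the number of documents and $\mathrm{df}(t)$ the number of documents containing $t$; each row is then scaled to unit L2 norm. As a check against the `idf_` output shown earlier: with 7,667 documents, a term found in a single document gets $\ln(7668/2) + 1 \approx 9.2517$, which matches the largest values listed there. The sketch below recomputes the weights by hand on a toy corpus and compares them with `TfidfVectorizer`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["this is a pen", "i am not a pen", "hello"]
pattern = r"(?u)\b\w+\b"

counts = CountVectorizer(token_pattern=pattern).fit_transform(docs).toarray()
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)  # number of documents containing each term

# Smoothed IDF, as used by scikit-learn when smooth_idf=True (the default).
idf_manual = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the raw counts, then scale every row to unit L2 norm.
weighted = counts * idf_manual
tfidf_manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

tfidf = TfidfVectorizer(token_pattern=pattern)
tfidf_sklearn = tfidf.fit_transform(docs).toarray()

idf_matches = np.allclose(idf_manual, tfidf.idf_)
matrix_matches = np.allclose(tfidf_manual, tfidf_sklearn)
print(idf_matches, matrix_matches)
```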

### A simple example

In [0]:
# By default token_pattern drops one-character words, so set it to keep them
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")

In [0]:
# Pass space-separated documents as a list
X = vectorizer.fit_transform([
    "This is a pen",
    "I am not a pen",
    "Hello"
])
X.toarray()

In [0]:
# Note: this code only converts the document-term matrix to a DataFrame so it is easier to read during the exercise; there is no need to memorize it
pd.DataFrame(X.toarray(), columns=[ x[0] for x in sorted(vectorizer.vocabulary_.items(), key=lambda x: x[1]) ])

Words that also occur in other documents receive a lower weight for the same occurrence count. Each word's weight is available in `idf_`. Let's check.

In [0]:
vectorizer.idf_

In [0]:
# Note: this code only converts the IDF values to a DataFrame so they are easier to read during the exercise; there is no need to memorize it
pd.DataFrame(np.atleast_2d(vectorizer.idf_), columns=[ x[0] for x in sorted(vectorizer.vocabulary_.items(), key=lambda x: x[1]) ])

By default, each document vector is normalized so that its norm is 1 (norm="l2"). Let's check.

In [0]:
np.linalg.norm(X.toarray(), axis=1)

### Applying it to Japanese data
As with CountVectorizer, let's apply it to preprocessed Japanese data.

In [0]:
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(df["text_tokenized"])

In [0]:
X[0].toarray()

Let's check which words have large IDF values and which have small ones.

In [0]:
# Note: this code only converts the IDF values to a DataFrame so they are easier to read during the exercise; there is no need to memorize it
pd.Series(vectorizer.idf_, index=[ x[0] for x in sorted(vectorizer.vocabulary_.items(), key=lambda x: x[1]) ]) \
    .to_frame("idf") \
    .sort_values("idf", ascending=False)

## Text classification
---
Using the quantified text information, let's classify texts by category.

In [0]:
# Test data
df_test = pd.read_csv("dataset/wikipedia-test.txt", sep="\t")
df_test.head()

### Building the document-term matrix (TF-IDF applied)
Before building the document-term matrix, first run morphological analysis on the text.

In [0]:
mecab = MeCab.Tagger("-O wakati")

text_tokenized = []
for text in df_test['text']:
    text_tokenized.append(mecab.parse(text))
    
df_test['text_tokenized'] = text_tokenized

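Next, build a document-term matrix for the training data and another for the test data.<br>
The two matrices must have the same number of dimensions, so the test data is converted with `transform`, reusing the vectorizer fitted on the training data.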
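The reason for using `transform` rather than a second `fit_transform` can be seen on a toy example. The documents below are hypothetical and already tokenized; the variable names are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["事件 捜査 犯人", "結婚 告白 再会"]
test_docs = ["犯人 逃亡 事件", "告白 失恋"]

shared = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
X_train_demo = shared.fit_transform(train_docs)  # learns the vocabulary
X_test_demo = shared.transform(test_docs)        # reuses that vocabulary

# Both matrices have one column per training-vocabulary term.
print(X_train_demo.shape, X_test_demo.shape)

# Fitting a separate vectorizer on the test data builds a different
# vocabulary, so its columns would not line up with the training matrix.
refit = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b").fit(test_docs)
same_vocabulary = sorted(refit.vocabulary_) == sorted(shared.vocabulary_)
print(same_vocabulary)
```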

In [0]:
vectorizer = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
X = vectorizer.fit_transform(df["text_tokenized"])

In [0]:
X_test = vectorizer.transform(df_test["text_tokenized"])

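### Applying and comparing classifiers
Once the document-term matrix is ready, the rest is the same as the machine learning covered so far.

Let's compare the performance of several classifiers.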
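One way to organize such a comparison is to loop over a dictionary of classifiers and collect one score per model. This is a sketch on hypothetical toy data, not the notebook's dataset; in the notebook the inputs would be `X`, `X_test`, and the `category` columns.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

# Hypothetical toy data, already tokenized.
train_docs = ["事件 捜査 犯人", "刑事 事件 逮捕", "結婚 告白 再会", "恋人 結婚 告白"]
train_y = ["crime", "crime", "romance", "romance"]
test_docs = ["犯人 逮捕 事件", "恋人 告白"]
test_y = ["crime", "romance"]

vec = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b")
X_tr, X_te = vec.fit_transform(train_docs), vec.transform(test_docs)

classifiers = {
    "LogisticRegression": LogisticRegression(),
    "LinearSVC": LinearSVC(),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Fit each model on the same features and record its test accuracy.
scores = {}
for name, clf in classifiers.items():
    clf.fit(X_tr, train_y)
    scores[name] = accuracy_score(test_y, clf.predict(X_te))
    print(f"{name}: {scores[name]:.3f}")
```

Because every model sees exactly the same feature matrix, differences in the scores reflect the classifiers rather than the preprocessing.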

In [0]:
# Show Accuracy, Precision/Recall/F-score/Support, and the Confusion Matrix
def show_evaluation_metrics(y_true, y_pred):
    print("Accuracy:")
    print(accuracy_score(y_true, y_pred))
    print()
    
    print("Report:")
    print(classification_report(y_true, y_pred))
    
    print("Confusion matrix:")
    print(confusion_matrix(y_true, y_pred))

### Logistic regression

In [0]:
clf_lr = LogisticRegression(n_jobs=-1)
clf_lr.fit(X, df["category"])

Predictions are obtained with predict (the same applies to the other classifiers).

In [0]:
clf_lr.predict(X_test)

In [0]:
y_test_pred = clf_lr.predict(X_test)
show_evaluation_metrics(df_test["category"], y_test_pred)

### SVM (linear kernel)

In [0]:
clf_svc = LinearSVC()
clf_svc.fit(X, df["category"])

In [0]:
y_test_pred = clf_svc.predict(X_test)
show_evaluation_metrics(df_test["category"], y_test_pred)

### Random Forest

In [0]:
clf_rf = RandomForestClassifier(n_estimators=50, n_jobs=-1)
clf_rf.fit(X, df["category"])

In [0]:
y_test_pred = clf_rf.predict(X_test)
show_evaluation_metrics(df_test["category"], y_test_pred)

## [Exercise 1]
Change the parameters of one of the algorithms above to improve classification accuracy.
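One possible starting point, offered only as a sketch: `GridSearchCV` tries each candidate parameter value with cross-validation and reports the best one. The toy data and the grid of `C` values below are assumptions for illustration; suitable values for the real dataset have to be found by experiment.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for X and df["category"].
docs = [
    "事件 捜査 犯人", "刑事 事件 逮捕", "犯人 逃亡 捜査", "容疑者 逮捕 刑事",
    "結婚 告白 再会", "恋人 結婚 告白", "恋 再会 恋人", "告白 恋 結婚",
]
labels = ["crime"] * 4 + ["romance"] * 4

X_demo = TfidfVectorizer(token_pattern=r"(?u)\b\w+\b").fit_transform(docs)

# Try several regularization strengths with 2-fold cross-validation.
search = GridSearchCV(LinearSVC(), param_grid={"C": [0.1, 1.0, 10.0]}, cv=2)
search.fit(X_demo, labels)
print(search.best_params_, search.best_score_)
```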

In [0]:
# Enter your code here

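### Saving the model
Models can be saved with `pickle`, but with `joblib`, giving the file name a suffix such as `.gz` or `.bz2` compresses the file automatically. Loading works the same way: the file is decompressed automatically as it is read.

Models in language processing are often large, so compressing them matters.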
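The effect of the file extension can be observed by saving the same model twice and comparing file sizes. This sketch uses a hypothetical toy model and a temporary directory so that nothing is left on disk; the file names are illustrative.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy model; any fitted estimator is saved the same way.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit([[0, 0], [0, 1], [1, 0], [1, 1]], [0, 0, 1, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    raw_path = os.path.join(tmp_dir, "model.pkl")
    gz_path = os.path.join(tmp_dir, "model.pkl.gz")

    joblib.dump(clf, raw_path)  # no compression
    joblib.dump(clf, gz_path)   # compression inferred from the .gz extension

    raw_size = os.path.getsize(raw_path)
    gz_size = os.path.getsize(gz_path)
    print(raw_size, gz_size)

    # Loading decompresses transparently.
    restored = joblib.load(gz_path)

# The restored model predicts the same labels as the original.
same = (restored.predict([[1, 0]]) == clf.predict([[1, 0]])).all()
print(same)
```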

In [0]:
joblib.dump(clf_rf, "wikipedia_category_classifier.pkl.gz")

### Model loading test

In [0]:
clf_rf_restored = joblib.load("wikipedia_category_classifier.pkl.gz")
clf_rf_restored

The loaded model was already trained above, so `predict` can be used right away.

In [0]:
clf_rf_restored.predict(X_test)

## [Exercise 2]
Save the model built in Exercise 1 with joblib.

In [0]:
# Enter your code here