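# Data Mining Report 3

+ Overall flow
    + Download the 13 HTML files for chapters 0 through 12 of the NLTK book.
    + Generate BoW-based feature vectors (Level 1 or Level 2).
    + Generate co-occurrence-matrix-based feature vectors (Level 3).
    + Run a classification task on the labeled documents (Level 4).
+ Level 1: For each document file, generate a feature vector with ``Bag-of-Words``.
+ Level 2: Generate feature vectors with ``BoW`` weighted by ``TF-IDF``.
+ Level 3: Generate feature vectors from the word ``co-occurrence matrix``.
+ Level 4: Perform ``document classification``.
+ Optional examples
    + Try generating ``feature vectors`` from mutual information.
    + Try ``dimensionality reduction`` with ``SVD`` on the feature vectors based on the co-occurrence matrix or on mutual information.
    + Reduce to ``2 dimensions`` with SVD. Pick one word of interest and map its top 10 and bottom 10 words into the 2-D space. Observe how they are scattered and consider how closely the result matches your expectations.
    + Try natural language processing on ``Japanese documents``.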

In [2]:
import os
import nltk
from nltk.tokenize import wordpunct_tokenize, sent_tokenize
import numpy as np
import glob
import scipy.spatial.distance as distance
import re

# LEVEL 1: Generate a Bag-of-Words feature vector for each document file

+ collect_words_eng(): builds a word codebook from a collection of English documents

NLTK resources that need to be downloaded

In [16]:
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/e175751/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/e175751/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /Users/e175751/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

# Bag-of-Words

## Build the term feature set (codebook) from the document collection

In [11]:

def collect_words_eng(docs):
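    '''
    Build a word codebook from a collection of English documents.
    For simplicity, the processing steps are hard-coded.
    It might be easier to use if they could be specified as needed.

    :param docs(list): One string per document; a list of documents.
    :return (list): Unique words after sentence splitting, word splitting, lemmatization, and stop word removal.
    '''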
    
    codebook = []
    stopwords = nltk.corpus.stopwords.words('english') 
    
    #stopwords.append('.')   # Add the period.
    #stopwords.append(',')   # Add the comma.
    #stopwords.append('')    # Add the empty string.
    
    symbol = ["'", '"', ':', ';', '.', ',', '-', '!', '?', "'s","<",">","_"]
    '''
    SWList = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've",
              "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 
              'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 
              'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 
              'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 
              'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 
              'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 
              'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on',
              'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 
              'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 
              'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will',
              'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain',
              'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', 
              "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't",
              'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn',
              "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
    '''
    
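    # Note: `docs` holds whole document strings, so this FreqDist is keyed by
    # entire documents rather than by words. The membership check in the loop
    # below therefore does not filter out stop words or symbols.
    clean_frequency = nltk.FreqDist(w.lower() for w in docs if w.lower() not in stopwords + symbol)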
    
    wnl = nltk.stem.wordnet.WordNetLemmatizer()
    
    for doc in docs:
        for sent in sent_tokenize(doc):
            for word in wordpunct_tokenize(sent):
                this_word = wnl.lemmatize(word.lower())
                if this_word not in codebook and this_word not in clean_frequency:
                    codebook.append(this_word)
    return codebook

In [12]:
def collect_words_eng1(docs):
    
    codebook = []
    stopwords = nltk.corpus.stopwords.words('english')
    stopwords.append('.')   # Add the period.
    stopwords.append(',')   # Add the comma.
    stopwords.append('')    # Add the empty string.
    wnl = nltk.stem.wordnet.WordNetLemmatizer()
    for doc in docs:
        for sent in sent_tokenize(doc):
            for word in wordpunct_tokenize(sent):
                this_word = wnl.lemmatize(word.lower())
                if this_word not in codebook and this_word not in stopwords:
                    codebook.append(this_word)
    return codebook

Sample (test)

In [19]:
docs3 = []
docs3.append("This is test.")
docs3.append("That is test too.")
docs3.append("There are so many many tests.")

Using the ``stopwords`` list (``collect_words_eng1``)
The codebook has 2 entries, so each vector has 2 dimensions.

In [20]:
codebook = collect_words_eng1(docs3)
print('codebook = ',codebook)

codebook =  ['test', 'many']


Using ``clean_frequency`` (``collect_words_eng``)
The codebook has 10 entries, so each vector has 10 dimensions.

In [21]:
codebook = collect_words_eng(docs3)
print('codebook = ',codebook)

codebook =  ['this', 'is', 'test', '.', 'that', 'too', 'there', 'are', 'so', 'many']
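Both functions above test `this_word not in codebook` against a list, which rescans the list for every token. The following is a minimal sketch of the same first-seen-order codebook with dictionary-based membership. It makes several assumptions: `build_codebook` is a hypothetical name, the regex tokenizer and the small `STOPWORDS` set are stand-ins for NLTK's tokenizers and English stop word list, and no lemmatization is applied.

```python
import re

# Assumption: a tiny stand-in stop word set; the notebook uses NLTK's English list.
STOPWORDS = {"this", "is", "that", "too", "there", "are", "so"}

def build_codebook(docs, stopwords=STOPWORDS):
    """Return unique lowercased word tokens in first-seen order, skipping stop words."""
    seen = {}  # dicts preserve insertion order and give O(1) membership tests
    for doc in docs:
        for token in re.findall(r"[a-z0-9]+", doc.lower()):
            if token not in stopwords:
                seen.setdefault(token, None)
    return list(seen)

docs3 = ["This is test.", "That is test too.", "There are so many many tests."]
print(build_codebook(docs3))  # ['test', 'many', 'tests']
```

Because this sketch skips lemmatization, `tests` remains a separate entry, whereas the NLTK-based functions fold it into `test`.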


## Build document vectors using the codebook as features (direct vector generation)

In [10]:
def make_vectors_eng(docs, codebook):
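    '''Build document vectors using the codebook as features (direct vector generation).

    :param docs(list): One string per document; a list of documents.
    :param codebook(list): List of unique words.
    :return (list): Vectors whose features are the occurrence counts of the codebook words.
    '''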
    vectors = []
    wnl = nltk.stem.wordnet.WordNetLemmatizer()
    for doc in docs:
        this_vector = []
        fdist = nltk.FreqDist()
        for sent in sent_tokenize(doc):
            for word in wordpunct_tokenize(sent):
                this_word = wnl.lemmatize(word.lower())
                fdist[this_word] += 1
        for word in codebook:
            this_vector.append(fdist[word])
        vectors.append(this_vector)
    return vectors


In [23]:
vectors = make_vectors_eng(docs3, codebook)
for index in range(len(docs3)):
    print('docs[{}] = {}'.format(index,docs3[index]))
    print('vectors[{}] = {}'.format(index,vectors[index]))
    print('----')

docs[0] = This is test.
vectors[0] = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
----
docs[1] = That is test too.
vectors[1] = [0, 1, 1, 1, 1, 1, 0, 0, 0, 0]
----
docs[2] = There are so many many tests.
vectors[2] = [0, 0, 1, 1, 0, 0, 1, 1, 1, 2]
----
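The counting step inside `make_vectors_eng` can be illustrated on its own. This is a small sketch, assuming the document has already been tokenized, lowercased, and lemmatized; `count_vector` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter

def count_vector(tokens, codebook):
    """Map a list of tokens to term counts in codebook order."""
    counts = Counter(tokens)  # terms missing from the document count as 0
    return [counts[term] for term in codebook]

codebook = ['this', 'is', 'test', '.', 'that', 'too', 'there', 'are', 'so', 'many']
# Tokens of "There are so many many tests." after lowercasing and lemmatization.
tokens = ['there', 'are', 'so', 'many', 'many', 'test', '.']
print(count_vector(tokens, codebook))  # [0, 0, 1, 1, 0, 0, 1, 1, 1, 2]
```

The result matches `vectors[2]` above: each position is the number of times the corresponding codebook word occurs in the document.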


## Euclidean distance

In [13]:
def euclidean_distance(vectors):
    vectors = np.array(vectors)
    distances = []
    for i in range(len(vectors)):
        temp = []
        for j in range(len(vectors)):
            temp.append(np.linalg.norm(vectors[i] - vectors[j]))
        distances.append(temp)
    return distances

In [25]:
distances = euclidean_distance(vectors)
print('# euclidean_distance')
for index in range(len(distances)):
    print(distances[index])


# euclidean_distance
[0.0, 1.7320508075688772, 3.0]
[1.7320508075688772, 0.0, 3.1622776601683795]
[3.0, 3.1622776601683795, 0.0]
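As a cross-check on the double loop, the same distance matrix can be computed with NumPy broadcasting. This is a sketch under the assumption that all vectors fit in memory as one dense array; `pairwise_euclidean` is a hypothetical name.

```python
import numpy as np

def pairwise_euclidean(vectors):
    """Return the (n, n) matrix of Euclidean distances between row vectors."""
    X = np.asarray(vectors, dtype=float)
    diff = X[:, None, :] - X[None, :, :]  # shape (n, n, d): every row minus every row
    return np.linalg.norm(diff, axis=-1)

vectors = [
    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 1, 1, 1, 2],
]
D = pairwise_euclidean(vectors)
print(np.round(D, 3))
```

For example, the first two vectors differ by 1 in exactly three positions, so their distance is sqrt(3) ≈ 1.732, as in the output above. The intermediate array has n × n × d elements, which is fine for 13 documents but grows quickly for larger collections.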


## Cosine similarity (SciPy version)

In [14]:
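def cosine_similarity(vectors):
    # Note: scipy's distance.cosine returns the cosine *distance*
    # (1 - cosine similarity), so 0.0 means the vectors point the same way.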
    vectors = np.array(vectors)
    distances = []
    for i in range(len(vectors)):
        temp = []
        for j in range(len(vectors)):
            temp.append(distance.cosine(vectors[i], vectors[j]))
        distances.append(temp)
    return distances

## Cosine similarity (this one is the real one)

In [15]:
def cos_sim(vector):
    vectors = np.array(vector)
    ListVector=[]
    for i in range(len(vectors)):
        temp=[]
        for j in range(len(vectors)):
            temp.append(np.dot(vectors[i], vectors[j]) / (np.linalg.norm(vectors[i]) * np.linalg.norm(vectors[j])))
        ListVector.append(temp)
    return ListVector

In [28]:
hoge = cos_sim(vectors)
print('# cosine_similarity')
for index in range(len(hoge)):
    print(hoge[index])

# cosine_similarity
[1.0, 0.6708203932499369, 0.3333333333333333]
[0.6708203932499369, 0.9999999999999998, 0.29814239699997197]
[0.3333333333333333, 0.29814239699997197, 1.0]


In [29]:
similarities = cosine_similarity(vectors)
print('# cosine_similarity')
for index in range(len(similarities)):
    print(similarities[index])

# cosine_similarity
[0.0, 0.3291796067500631, 0.6666666666666667]
[0.3291796067500631, 0.0, 0.7018576030000281]
[0.6666666666666667, 0.7018576030000281, 0.0]
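The two outputs above are related by distance = 1 − similarity: `cos_sim` returns the similarity, while the SciPy-based function returns the cosine distance. A minimal pure-Python sketch for a single pair makes this explicit; `pair_cosine` is a hypothetical helper and assumes neither vector is all zeros.

```python
import math

def pair_cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

u = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
v = [0, 1, 1, 1, 1, 1, 0, 0, 0, 0]
sim = pair_cosine(u, v)
print(round(sim, 4))      # similarity: 0.6708
print(round(1 - sim, 4))  # cosine distance: 0.3292
```

Here the dot product is 3 and the norms are 2 and sqrt(5), giving 3 / (2·sqrt(5)) ≈ 0.6708.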


## Now apply this to the actual documents

Store the file paths in a list.

In [3]:
# A single glob call is enough; the surrounding loop repeated the same call 13 times.
List_Data_NL = glob.glob("./data/*.html")

In [4]:
List_Data_NL

['./data/kadai1.html',
 './data/kadai6.html',
 './data/kadai10.html',
 './data/kadai11.html',
 './data/kadai7.html',
 './data/kadai4.html',
 './data/kadai12.html',
 './data/kadai8.html',
 './data/kadai9.html',
 './data/kadai13.html',
 './data/kadai5.html',
 './data/kadai2.html',
 './data/kadai3.html']

In [5]:
DataPath = "./data/kadai"

In [6]:
def cleanhtml(raw_html):
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, '', raw_html)
    return cleantext

In [7]:
sentence = []
for i in range(1,len(List_Data_NL)+1):
    with open(DataPath +str(i) + ".html" ) as f:
        r = f.read()
        text = cleanhtml(r)
        sentence.append(text)

In [9]:
sentence

 '\n\n\n\nfunction astext(node)\n{\n    return node.innerHTML.replace(/(]+)>)/ig,"")\n                         .replace(/&gt;/ig, ">")\n                         .replace(/&lt;/ig, "<")\n                         .replace(/&quot;/ig, \'"\')\n                         .replace(/&amp;/ig, "&");\n}\n\nfunction copy_notify(node, bar_color, data)\n{\n    // The outer box: relative + inline positioning.\n    var box1 = document.createElement("div");\n    box1.style.position = "relative";\n    box1.style.display = "inline";\n    box1.style.top = "2em";\n    box1.style.left = "1em";\n  \n    // A shadow for fun\n    var shadow = document.createElement("div");\n    shadow.style.position = "absolute";\n    shadow.style.left = "-1.3em";\n    shadow.style.top = "-1.3em";\n    shadow.style.background = "#404040";\n    \n    // The inner box: absolute positioning.\n    var box2 = document.createElement("div");\n    box2.style.position = "relative";\n    box2.style.border = "1px solid #a0a0a0";\n    box

10500
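The extracted text above still contains JavaScript, because `cleanhtml` removes only the tags and keeps everything between `<script>` and `</script>`. A possible variation, sketched here with a hypothetical `clean_html_text` helper, drops `<script>` and `<style>` blocks first. Regex-based HTML handling is brittle, so this is only an illustration; an HTML parser would be more robust.

```python
import re

def clean_html_text(raw_html):
    """Drop <script>/<style> blocks, then strip any remaining tags."""
    # (?is): case-insensitive, and '.' also matches newlines inside the block.
    no_blocks = re.sub(r'(?is)<(script|style)\b.*?</\1\s*>', '', raw_html)
    return re.sub(r'(?s)<[^>]*>', '', no_blocks)

sample = '<html><script>function f(){ return 1 < 2; }</script><p>Hello <b>NLTK</b></p></html>'
print(clean_html_text(sample))  # Hello NLTK
```

Removing script content before building the codebook would keep identifiers such as `box1` or `createElement` out of the vocabulary, at the cost of changing the vectors and similarities reported below.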

### Codebook generation

In [16]:
codebook = collect_words_eng(sentence)
print('codebook = ',codebook)



### Document vectors

In [18]:
vectors = make_vectors_eng(sentence, codebook)
for index in range(len(sentence)):
    print('vectors[{}] = {}'.format(index,vectors[index]))
    print('----')

vectors[0] = [13, 3, 57, 14, 23, 104, 11, 436, 2, 5, 1, 5, 1, 4, 4, 4, 300, 1, 1, 1, 1, 1, 2, 1, 97, 3, 2, 40, 5, 189, 1, 2, 222, 3, 11, 3, 2, 16, 8, 37, 5, 3, 8, 10, 6, 40, 4, 20, 19, 2, 16, 7, 21, 7, 121, 8, 84, 1, 2, 2, 26, 4, 21, 6, 1, 1, 11, 62, 26, 48, 46, 1, 2, 2, 32, 1, 2, 1, 5, 1, 1, 1, 1, 62, 1, 2, 19, 1, 1, 10, 7, 219, 133, 2, 1, 1, 2, 5, 1, 2, 1, 3, 1, 1, 1, 1, 1, 2, 6, 17, 3, 2, 2, 5, 1, 14, 1, 13, 10, 2, 2, 3, 1, 5, 21, 2, 1, 2, 3, 4, 1, 12, 229, 1, 2, 1, 2, 11, 2, 4, 2, 2, 26, 2, 2, 1, 31, 2, 1, 1, 8, 15, 26, 1, 59, 1, 1, 1, 1, 4, 10, 8, 3, 3, 18, 2, 3, 3, 11, 2, 3, 5, 1, 1, 1, 3, 4, 11, 8, 3, 1, 10, 9, 1, 3, 1, 1, 1, 26, 2, 2, 1, 1, 1, 2, 5, 7, 5, 2, 2, 2, 4, 14, 7, 1, 92, 1, 2, 7, 38, 5, 2, 2, 25, 22, 48, 31, 2, 56, 3, 1, 3, 41, 1, 1, 1, 6, 3, 1, 1, 1, 5, 33, 1, 1, 1, 10, 3, 6, 2, 1, 2, 1, 4, 25, 1, 29, 11, 4, 3, 30, 2, 1, 6, 8, 4, 3, 4, 4, 4, 5, 1, 1, 1, 1, 1, 1, 5, 11, 3, 3, 1, 17, 3, 10, 1, 22, 1, 14, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 14, 5, 24, 48, 13, 10, 3, 

### Compute the Euclidean distances

In [19]:
distances = euclidean_distance(vectors)
print('# euclidean_distance')
for index in range(len(distances)):
    print(distances[index])

# euclidean_distance
[0.0, 1611.9857319467812, 2211.5716131294507, 2678.0039208335747, 2249.7193158258656, 1997.7191994872553, 1775.0504218190536, 1007.4790320398732, 1313.3392554858017, 1403.9907407102085, 795.6418792396489, 1610.5294160616875, 248.18944377229263]
[1611.9857319467812, 0.0, 1079.8273010069713, 1238.4655828887617, 847.4426234265067, 769.4114633926375, 942.8700864912408, 1013.7869598687882, 930.4719232733463, 1157.6579805797564, 2189.460207448402, 815.6512735231889, 1749.6713977201548]
[2211.5716131294507, 1079.8273010069713, 0.0, 940.6784785462033, 1134.5377913494112, 898.9755280317702, 1556.1834724736027, 1622.224706999619, 1581.9352072698805, 1640.3581925908743, 2713.5266720634977, 1422.728364797722, 2323.9477188611622]
[2678.0039208335747, 1238.4655828887617, 940.6784785462033, 0.0, 934.0706611386528, 1063.5229193581115, 1644.6163686404195, 1974.9706326930534, 1822.5358707032353, 1892.6597686853281, 3212.9349511000064, 1560.6473016027676, 2803.7241305092766]
[2249.71

### Compute the cosine similarities

In [20]:
cosin = cos_sim(vectors)
print('# cosine_similarity')
for index in range(len(cosin)):
    print(np.round(cosin[index],3))

# cosine_similarity
[1.    0.817 0.726 0.745 0.811 0.782 0.85  0.882 0.875 0.765 0.699 0.836
 0.971]
[0.817 1.    0.926 0.966 0.977 0.959 0.921 0.902 0.907 0.852 0.627 0.932
 0.789]
[0.726 0.926 1.    0.964 0.919 0.944 0.824 0.838 0.822 0.806 0.593 0.855
 0.708]
[0.745 0.966 0.964 1.    0.96  0.958 0.869 0.868 0.864 0.849 0.58  0.901
 0.723]
[0.811 0.977 0.919 0.96  1.    0.958 0.932 0.914 0.919 0.88  0.628 0.93
 0.78 ]
[0.782 0.959 0.944 0.958 0.958 1.    0.9   0.912 0.89  0.847 0.606 0.909
 0.761]
[0.85  0.921 0.824 0.869 0.932 0.9   1.    0.921 0.936 0.855 0.68  0.897
 0.826]
[0.882 0.902 0.838 0.868 0.914 0.912 0.921 1.    0.938 0.857 0.657 0.909
 0.874]
[0.875 0.907 0.822 0.864 0.919 0.89  0.936 0.938 1.    0.873 0.702 0.888
 0.86 ]
[0.765 0.852 0.806 0.849 0.88  0.847 0.855 0.857 0.873 1.    0.552 0.829
 0.762]
[0.699 0.627 0.593 0.58  0.628 0.606 0.68  0.657 0.702 0.552 1.    0.596
 0.638]
[0.836 0.932 0.855 0.901 0.93  0.909 0.897 0.909 0.888 0.829 0.596 1.
 0.829]
[0.971 0.789

## Check the relationship between each pair of files using cosine similarity

In [40]:
for i in range(len(sentence)):
    for j in range(i + 1, len(sentence)):   # visit each pair (i, j) with i < j once
        print(i, j)
        pair = [sentence[i], sentence[j]]   # avoid shadowing the built-in `list`
        codebook = collect_words_eng(pair)
        vectors = make_vectors_eng(pair, codebook)
        similarities = cos_sim(vectors)
        print('# cosine_similarity')
        for index in range(len(similarities)):
            print(np.round(similarities[index],3))

0 1
# cosine_similarity
[1.    0.811]
[0.811 1.   ]
0 2
# cosine_similarity
[1.    0.813]
[0.813 1.   ]
0 3
# cosine_similarity
[1.    0.781]
[0.781 1.   ]
0 4
# cosine_similarity
[1.  0.8]
[0.8 1. ]
0 5
# cosine_similarity
[1.   0.82]
[0.82 1.  ]
0 6
# cosine_similarity
[1.    0.876]
[0.876 1.   ]
0 7
# cosine_similarity
[1.    0.828]
[0.828 1.   ]
0 8
# cosine_similarity
[1.    0.858]
[0.858 1.   ]
0 9
# cosine_similarity
[1.    0.793]
[0.793 1.   ]
0 10
# cosine_similarity
[1.    0.651]
[0.651 1.   ]
0 11
# cosine_similarity
[1.    0.871]
[0.871 1.   ]
0 12
# cosine_similarity
[1.    0.907]
[0.907 1.   ]
1 2
# cosine_similarity
[1.    0.987]
[0.987 1.   ]
1 3
# cosine_similarity
[1.    0.994]
[0.994 1.   ]
1 4
# cosine_similarity
[1.    0.993]
[0.993 1.   ]
1 5
# cosine_similarity
[1.    0.991]
[0.991 1.   ]
1 6
# cosine_similarity
[1.    0.963]
[0.963 1.   ]
1 7
# cosine_similarity
[1.    0.978]
[0.978 1.   ]
1 8
# cosine_similarity
[1.   0.98]
[0.98 1.  ]
1 9
# cosine_similarity
[

Here, as a different **preprocessing** step, we **standardize** the document vectors.

## Standardizing the document vectors

In [41]:
#from sklearn import preprocessing as pp
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

In [42]:
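# Note: `codebook` was last reassigned inside the pairwise loop above (In [40]),
# so at this point it covers only the final pair of documents, not all 13.
vectors = make_vectors_eng(sentence, codebook)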
ppSS = StandardScaler()

In [43]:
data_std = ppSS.fit_transform(vectors)

In [44]:
print(type(data_std))
print(len(data_std[0]))

<class 'numpy.ndarray'>
3411


In [45]:
for index in range(len(data_std)):
    print('vectors[{}] = {}'.format(index,data_std[index]))
    print('-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------')

vectors[0] = [-0.77918914 -0.63900965 -0.71651428 ... -0.28867513 -0.46188022
 -0.28867513]
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
vectors[1] = [-0.1414084   0.54772256  0.66075442 ... -0.28867513 -0.46188022
 -0.28867513]
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
vectors[2] = [-0.06637537  0.54772256  0.69699833 ... -0.28867513 -0.46188022
 -0.28867513]
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
vectors[3] = [ 0.30878977 -0.63900965  2.47295008 ... -0.28867513 -0.46188022
 -0.28867513]
--------------------------------------------------------------------------------------------------------------
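What standardization does to each column can be sketched with NumPy. This assumes the behavior of `StandardScaler` with default settings: subtract the column mean, divide by the population standard deviation, and leave constant columns unscaled. `standardize_columns` is a hypothetical name. The repeated value -0.28867513 in the output above is consistent with a term that occurs in only one of the 13 documents, since the other 12 entries then become -1/sqrt(12).

```python
import numpy as np

def standardize_columns(X):
    """Z-score each column: subtract the mean, divide by the population std."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)   # ddof=0, i.e. the population standard deviation
    std[std == 0] = 1.0   # constant columns stay at 0 instead of dividing by 0
    return (X - mean) / std

# One term that occurs once in a single document out of 13.
column = np.zeros((13, 1))
column[0, 0] = 1
z = standardize_columns(column)
print(np.round(z[1, 0], 8))  # -0.28867513
```

Because every column is centered, most entries of a sparse count matrix become negative, which is why the cosine similarities computed on the standardized data can fall below zero.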

In [46]:
hoge = cos_sim(data_std)
print('# cosine_similarity')
for index in range(len(hoge)):
    print(np.round(hoge[index],3))

# cosine_similarity
[ 1.    -0.068 -0.117 -0.289 -0.258 -0.179 -0.161  0.057 -0.063  0.026
  0.292 -0.212  0.245]
[-0.068  1.     0.073  0.079  0.047  0.042 -0.009 -0.068 -0.053 -0.103
 -0.163 -0.205 -0.143]
[-0.117  0.073  1.     0.062  0.001  0.05  -0.07  -0.082 -0.088 -0.082
 -0.143 -0.135 -0.171]
[-0.289  0.079  0.062  1.     0.162  0.026 -0.017 -0.11  -0.03  -0.124
 -0.304 -0.102 -0.271]
[-0.258  0.047  0.001  0.162  1.     0.041  0.025 -0.146 -0.064 -0.115
 -0.312 -0.096 -0.249]
[-0.179  0.042  0.05   0.026  0.041  1.    -0.003  0.005 -0.02  -0.053
 -0.178 -0.18  -0.174]
[-0.161 -0.009 -0.07  -0.017  0.025 -0.003  1.     0.025 -0.019 -0.056
 -0.179 -0.162 -0.133]
[ 0.057 -0.068 -0.082 -0.11  -0.146  0.005  0.025  1.     0.035  0.033
  0.081 -0.243 -0.037]
[-0.063 -0.053 -0.088 -0.03  -0.064 -0.02  -0.019  0.035  1.     0.118
 -0.071 -0.209 -0.085]
[ 0.026 -0.103 -0.082 -0.124 -0.115 -0.053 -0.056  0.033  0.118  1.
  0.016 -0.221 -0.044]
[ 0.292 -0.163 -0.143 -0.304 -0.312 -0.178 

## Data analysis after normalization (min-max scaling)

In [47]:
ms = MinMaxScaler()
mms = ms.fit_transform(vectors)
print(mms)

[[0.30701754 0.         0.20720721 ... 0.         0.         0.        ]
 [0.45614035 0.33333333 0.54954955 ... 0.         0.         0.        ]
 [0.47368421 0.33333333 0.55855856 ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.38596491 1.         0.51351351 ... 0.         0.         0.        ]
 [0.31578947 0.         0.07207207 ... 1.         0.2        1.        ]]
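Min-max scaling maps each column to the range [0, 1] using that column's minimum and maximum. The following sketch uses toy numbers rather than the notebook's data and assumes that constant columns are mapped to 0; `min_max_scale` is a hypothetical name.

```python
import numpy as np

def min_max_scale(X):
    """Scale each column to [0, 1] via (x - min) / (max - min)."""
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    col_range = X.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0  # constant columns map to 0
    return (X - col_min) / col_range

toy_counts = [[2, 5], [4, 5], [10, 5]]  # toy data: 3 documents, 2 terms
scaled = min_max_scale(toy_counts)
print(scaled)
```

Unlike standardization, the scaled values stay non-negative, so cosine similarities computed on them remain between 0 and 1.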


In [48]:
for i in range(len(mms[0])):
    print(mms[0][i],vectors[0][i])

0.3070175438596491 36
0.0 1
0.2072072072072072 25
0.15789473684210525 4
0.0 1
1.0 1
0.13142857142857142 24
0.06875 28
0.06875 27
0.0 6
0.19872167344567113 419
0.13888888888888887 55
0.060810810810810814 26
0.03013182674199623 51
0.2695035460992908 48
0.0 0
0.01986754966887417 11
0.3070175438596491 36
0.09465020576131686 322
0.2 1
0.05475285171102661 312
0.0 1
0.25 1
1.0 1
0.05325443786982249 77
0.0125 1
0.4011627906976744 208
0.16666666666666666 2
0.1 2
0.07414829659318636 74
0.2647058823529412 11
0.08094768015794669 85
0.20000000000000004 3
0.14714980114891738 342
0.0 1
0.08571428571428572 5
0.07142857142857142 6
0.10526315789473682 31
0.31059506531204645 215
0.0 1
0.07692307692307693 38
0.0 1
0.14285714285714285 4
0.5 2
0.8108108108108109 64
0.025210084033613446 3
0.0927536231884058 148
0.05439330543933054 13
1.0 3
0.11001964636542239 60
0.5185185185185185 14
0.8253968253968254 104
0.14285714285714288 11
1.0 2
0.38461538461538464 5
1.0 1
1.0 5
1.0 1
1.0 4
0.00404040404040404 4
0.0333

0.5 1
0.044444444444444446 4
0.03296703296703297 3
0.0 0
0.072992700729927 10
0.007299270072992699 8
0.0 0
0.19565217391304346 9
0.031914893617021274 3
0.011111111111111112 1
0.1 1
0.09090909090909091 1
0.04 2
0.01098901098901099 1
0.011111111111111112 1
0.0 0
0.0 0
0.09090909090909091 1
0.0 0
0.0 0
0.1 1
0.06349206349206349 4
0.0 0
0.0 0
0.0 0
0.0 0
0.08333333333333333 3
0.0 0
0.0 0
0.0 0
0.3225806451612903 10
0.25 2
0.0 0
0.0 0
0.09090909090909091 1
0.3333333333333333 2
0.058823529411764705 2
0.0 0
0.0 0
0.008849557522123894 1
0.04641350210970464 11
0.008403361344537815 1
0.008695652173913044 1
0.008695652173913044 1
0.008695652173913044 1
0.035460992907801414 5
0.003115264797507788 1
0.030927835051546393 3
0.008670520231213872 3
0.0 0
0.0 0
0.0 0
0.00911854103343465 3
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
1.0 2
0.05555555555555555 1
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.021739130434782608 1
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0

0.0 0
0.0 0
0.2222222222222222 2
0.0 0
0.0 0
0.0 0
0.0 0
1.0 1
0.75 3
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.16666666666666666 1
0.0 0
0.16666666666666666 1
0.0 0
0.0 0
0.0 0
0.16666666666666666 2
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.06060606060606061 3
0.29411764705882354 5
0.0 0
0.0 0
0.14285714285714285 1
0.0 0
0.2727272727272727 3
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.3333333333333333 1
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.125 1
0.16666666666666666 2
0.0 0
0.0 0
0.3684210526315789 21
0.3333333333333333 1
0.0 0
0.5 1
0.04 1
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.05 1
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.044444444444444446 2
0.1111111111111111 1
0.0 0
0.0 0
0.0 0
0.0 0
1.0 2
0.0 0
0.0 0
0.0 0
0.11764705882352941 2
0.03225806451612903 1
0.125 1
0.058823529411764705 1
0.0 0
0.0 0
0.0 0
0.0 0
0.25 1
0.4545454545454546 5
0.14285714285714285 1
0.0 0
0.14285714285714285 1
0.0 0
0.0 0
0.6666666666666666 2
0.09090909090909091 1
0.2 1
0.41666666666666663 5
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0

0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.058823529411764705 2
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.3333333333333333 1
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.16666666666666666 1
0.0 0
0.0 0
0.0 0
0.0 0
0.42857142857142855 3
0.0 0
1.0 1
0.0 0
0.0 0
0.0 0
0.5 1
0.0 0
1.0 5
0.0 0
1.0 1
0.0 0
0.0 0
0.0 0
0.0 0
0.3333333333333333 1
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.2 6
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.23529411764705882 4
0.0 0
0.0 0
0.0 0
1.0 1
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.6666666666666666 2
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.5 1
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
1.0 3
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.0 0
0.

In [49]:
hoge = cos_sim(mms)
print('# cosine_similarity')
for index in range(len(hoge)):
    print(np.round(hoge[index],3))

# cosine_similarity
[1.    0.601 0.57  0.521 0.528 0.552 0.541 0.599 0.573 0.575 0.051 0.424
 0.55 ]
[0.601 1.    0.69  0.688 0.67  0.692 0.652 0.662 0.653 0.609 0.048 0.46
 0.447]
[0.57  0.69  1.    0.677 0.65  0.681 0.623 0.641 0.633 0.61  0.038 0.479
 0.416]
[0.521 0.688 0.677 1.    0.692 0.667 0.635 0.639 0.653 0.596 0.059 0.475
 0.387]
[0.528 0.67  0.65  0.692 1.    0.667 0.642 0.621 0.637 0.595 0.044 0.469
 0.394]
[0.552 0.692 0.681 0.667 0.667 1.    0.648 0.682 0.661 0.621 0.035 0.456
 0.427]
[0.541 0.652 0.623 0.635 0.642 0.648 1.    0.668 0.645 0.615 0.029 0.446
 0.436]
[0.599 0.662 0.641 0.639 0.621 0.682 0.668 1.    0.676 0.641 0.045 0.452
 0.458]
[0.573 0.653 0.633 0.653 0.637 0.661 0.645 0.676 1.    0.675 0.059 0.454
 0.452]
[0.575 0.609 0.61  0.596 0.595 0.621 0.615 0.641 0.675 1.    0.021 0.42
 0.44 ]
[0.051 0.048 0.038 0.059 0.044 0.035 0.029 0.045 0.059 0.021 1.    0.054
 0.046]
[0.424 0.46  0.479 0.475 0.469 0.456 0.446 0.452 0.454 0.42  0.054 1.
 0.285]
[0.55  0.447 

## Compressing the data with principal component analysis (standardized data)

In [50]:
from sklearn.decomposition import PCA

In [51]:
pca = PCA(n_components=10)
pca.fit(data_std)
print(pca.explained_variance_ratio_)
print("------------------------------------------------------------------------------------------------------------------------------------------------------")
print(pca.singular_values_)
print("------------------------------------------------------------------------------------------------------------------------------------------------------")
pca_X = pca.transform(data_std)
print(pca_X)

[0.30571528 0.16943287 0.0849565  0.06835538 0.06490385 0.0574546
 0.05544767 0.04645715 0.04230637 0.03883215]
------------------------------------------------------------------------------------------------------------------------------------------------------
[116.34629908  86.61494477  61.33273226  55.01490413  53.60795238
  50.43783141  49.54908631  45.35448967  43.28095887  41.46576691]
------------------------------------------------------------------------------------------------------------------------------------------------------
[[-1.18160964e+01  2.06532198e+01 -1.58291485e+00 -1.34507865e+00
   7.34272066e-01 -4.21848692e+00 -1.17986159e+00 -9.20080918e+00
   1.21000796e+01 -5.14824092e+00]
 [-7.39059816e+00 -1.25558051e+01  2.03533271e-01 -2.39158208e+00
  -5.42101998e+00 -3.57650560e+00 -1.49981726e+01 -1.07674155e+01
   3.16270821e+01 -6.61631430e+00]
 [-4.30205424e+00 -1.19505126e+01 -4.11783831e+00 -5.28660507e+00
  -1.17443400e+01 -1.31779064e+01 -2.47759010e+01 -2.
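The quantities printed by this cell (`explained_variance_ratio_`, `singular_values_`, and the transformed data) can be related to a plain SVD of the centered matrix. This is a sketch on random stand-in data, not the notebook's vectors; `pca_svd` is a hypothetical name, and the signs of individual components may differ from what scikit-learn returns.

```python
import numpy as np

def pca_svd(X, n_components):
    """Project the rows of X onto the top principal components via SVD."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # PCA operates on column-centered data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained_variance = S ** 2 / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    scores = U[:, :n_components] * S[:n_components]  # equals Xc @ Vt[:n_components].T
    return scores, ratio[:n_components], S[:n_components]

rng = np.random.default_rng(0)
X = rng.normal(size=(13, 50))  # stand-in for 13 document vectors
scores, ratio, singular_values = pca_svd(X, n_components=10)
print(scores.shape)
print(ratio.sum())
```

With 13 documents the centered matrix has rank at most 12, so at most 12 components carry any variance regardless of the vocabulary size.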

## Final cosine similarity computation (standardized data)

In [52]:
hoge = cos_sim(pca_X)
print('# cosine_similarity')
for index in range(len(hoge)):
    print(np.round(hoge[index],3))

# cosine_similarity
[ 1.     0.318 -0.167 -0.503 -0.398 -0.358 -0.264  0.295 -0.064  0.213
  0.58  -0.307  0.477]
[ 0.318  1.     0.076  0.108  0.057  0.059 -0.017  0.058 -0.092 -0.185
 -0.235 -0.224 -0.194]
[-0.167  0.076  1.     0.063  0.001  0.052 -0.07  -0.207 -0.089 -0.084
 -0.145 -0.135 -0.173]
[-0.503  0.108  0.063  1.     0.162  0.021 -0.019 -0.244 -0.03  -0.117
 -0.301 -0.101 -0.267]
[-0.398  0.057  0.001  0.162  1.     0.044  0.027 -0.402 -0.06  -0.113
 -0.312 -0.096 -0.249]
[-0.358  0.059  0.052  0.021  0.044  1.    -0.02   0.313 -0.044 -0.056
 -0.184 -0.182 -0.171]
[-0.264 -0.017 -0.07  -0.019  0.027 -0.02   1.     0.304 -0.04  -0.065
 -0.189 -0.163 -0.134]
[ 0.295  0.058 -0.207 -0.244 -0.402  0.313  0.304  1.     0.49   0.282
  0.374 -0.575 -0.069]
[-0.064 -0.092 -0.089 -0.03  -0.06  -0.044 -0.04   0.49   1.     0.099
 -0.093 -0.214 -0.092]
[ 0.213 -0.185 -0.084 -0.117 -0.113 -0.056 -0.065  0.282  0.099  1.
 -0.011 -0.226 -0.061]
[ 0.58  -0.235 -0.145 -0.301 -0.312 -0.184 

## Cosine similarity on data with PCA only

In [53]:
pca.fit(vectors)
pca_y = pca.transform(vectors)

In [54]:
hoge = cos_sim(pca_y)
print('# cosine_similarity')
for index in range(len(hoge)):
    print(np.round(hoge[index],3))

# cosine_similarity
[ 1.    -0.944 -0.935 -0.983 -0.981 -0.946  0.147  0.508  0.351 -0.516
  0.995  0.779  0.998]
[-0.944  1.     0.862  0.954  0.929  0.888 -0.21  -0.614 -0.324  0.48
 -0.932 -0.805 -0.943]
[-0.935  0.862  1.     0.935  0.912  0.935 -0.232 -0.479 -0.471  0.359
 -0.933 -0.744 -0.942]
[-0.983  0.954  0.935  1.     0.969  0.937 -0.262 -0.56  -0.42   0.475
 -0.969 -0.804 -0.98 ]
[-0.981  0.929  0.912  0.969  1.     0.927 -0.185 -0.468 -0.396  0.413
 -0.971 -0.754 -0.978]
[-0.946  0.888  0.935  0.937  0.927  1.    -0.158 -0.503 -0.405  0.363
 -0.946 -0.759 -0.955]
[ 0.147 -0.21  -0.232 -0.262 -0.185 -0.158  1.    -0.019  0.466 -0.058
  0.077  0.275  0.128]
[ 0.508 -0.614 -0.479 -0.56  -0.468 -0.503 -0.019  1.     0.096 -0.434
  0.521  0.484  0.524]
[ 0.351 -0.324 -0.471 -0.42  -0.396 -0.405  0.466  0.096  1.     0.148
  0.296  0.205  0.329]
[-0.516  0.48   0.359  0.475  0.413  0.363 -0.058 -0.434  0.148  1.
 -0.497 -0.576 -0.496]
[ 0.995 -0.932 -0.933 -0.969 -0.971 -0.946  

## PCA on the normalized data

In [55]:
pca.fit(mms)
pca_z = pca.transform(mms)

In [56]:
hoge = cos_sim(pca_z)
print('# cosine_similarity')
for index in range(len(hoge)):
    print(np.round(hoge[index],3))

# cosine_similarity
[ 1.     0.16  -0.134 -0.502 -0.392 -0.374 -0.293  0.229 -0.155  0.186
  0.738 -0.291  0.534]
[ 0.16   1.     0.036  0.068  0.019  0.045 -0.035 -0.066 -0.088 -0.167
 -0.229 -0.181 -0.172]
[-0.134  0.036  1.     0.054 -0.011  0.039 -0.083 -0.235 -0.102 -0.095
 -0.152 -0.108 -0.184]
[-0.502  0.068  0.054  1.     0.154  0.01  -0.03  -0.26  -0.04  -0.138
 -0.314 -0.069 -0.272]
[-0.392  0.019 -0.011  0.154  1.     0.036  0.009 -0.413 -0.064 -0.125
 -0.31  -0.072 -0.251]
[-0.374  0.045  0.039  0.01   0.036  1.    -0.049  0.355 -0.067 -0.079
 -0.194 -0.166 -0.169]
[-0.293 -0.035 -0.083 -0.03   0.009 -0.049  1.     0.276 -0.053 -0.05
 -0.185 -0.147 -0.119]
[ 0.229 -0.066 -0.235 -0.26  -0.413  0.355  0.276  1.     0.358  0.144
  0.53  -0.496 -0.082]
[-0.155 -0.088 -0.102 -0.04  -0.064 -0.067 -0.053  0.358  1.     0.092
 -0.092 -0.182 -0.088]
[ 0.186 -0.167 -0.095 -0.138 -0.125 -0.079 -0.05   0.144  0.092  1.
  0.01  -0.219 -0.049]
[ 0.738 -0.229 -0.152 -0.314 -0.31  -0.194 -

## Discussion

Let us examine this using the **document vectors** of two documents.

# LEVEL 2: Generate feature vectors with BoW weighted by TF-IDF

### Import the required module

In [57]:
import sklearn.feature_extraction.text as fe_text

## Example

### Create the data

In [58]:
docs = []
docs.append("You can get dis-counted price with trade-in.")
docs.append("iPhone 11 shoots beautifully sharp 4K video at 60 fps across all its cameras.")
docs.append("From $16.62/mo. or $399 with trade-in.")

Generate vectors with **Bag-of-Words**.

In [59]:
def bow(docs):
    vectorizer = fe_text.CountVectorizer(stop_words='english')
    vectors = vectorizer.fit_transform(docs)
    return vectors.toarray(), vectorizer

Generate **Bag-of-Words** vectors weighted by **TF-IDF**.

In [60]:
def bow_tfidf(docs):
    vectorizer = fe_text.TfidfVectorizer(norm=None, stop_words='english')
    vectors = vectorizer.fit_transform(docs)
    return vectors.toarray(), vectorizer

### Plain Bag-of-Words

In [61]:
vectors, vectorizer = bow(docs)
print('# normal BoW')
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
print(vectorizer.get_feature_names_out().tolist())
print(vectors)

# normal BoW
['11', '16', '399', '4k', '60', '62', 'beautifully', 'cameras', 'counted', 'dis', 'fps', 'iphone', 'mo', 'price', 'sharp', 'shoots', 'trade', 'video']
[[0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 1 0]
 [1 0 0 1 1 0 1 1 0 0 1 1 0 0 1 1 0 1]
 [0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0]]
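The feature list shows `dis` and `counted` as separate terms because `CountVectorizer` by default lowercases the text and keeps runs of two or more word characters, so hyphens, `$`, `.`, and `/` all act as separators. The sketch below reproduces only that tokenization step with the default `token_pattern`; `tokenize` is a hypothetical helper, and stop word removal is not included.

```python
import re

# CountVectorizer's default token_pattern: tokens of 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(doc):
    """Lowercase the document and split it the way the default pattern does."""
    return TOKEN_PATTERN.findall(doc.lower())

print(tokenize("You can get dis-counted price with trade-in."))
# ['you', 'can', 'get', 'dis', 'counted', 'price', 'with', 'trade', 'in']
print(tokenize("From $16.62/mo. or $399 with trade-in."))
# ['from', '16', '62', 'mo', 'or', '399', 'with', 'trade', 'in']
```

Tokens that appear here but not in the feature list above (such as `you`, `with`, and `in`) were dropped by `stop_words='english'`.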


### Bag-of-Words with TF-IDF

In [62]:
vectors, vectorizer = bow_tfidf(docs)
print('# BoW + tfidf')
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
print(vectorizer.get_feature_names_out().tolist())
print(vectors)

# BoW + tfidf
['11', '16', '399', '4k', '60', '62', 'beautifully', 'cameras', 'counted', 'dis', 'fps', 'iphone', 'mo', 'price', 'sharp', 'shoots', 'trade', 'video']
[[0.         0.         0.         0.         0.         0.
  0.         0.         1.69314718 1.69314718 0.         0.
  0.         1.69314718 0.         0.         1.28768207 0.        ]
 [1.69314718 0.         0.         1.69314718 1.69314718 0.
  1.69314718 1.69314718 0.         0.         1.69314718 1.69314718
  0.         0.         1.69314718 1.69314718 0.         1.69314718]
 [0.         1.69314718 1.69314718 0.         0.         1.69314718
  0.         0.         0.         0.         0.         0.
  1.69314718 0.         0.         0.         1.28768207 0.        ]]
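The two distinct values in this output can be reproduced by hand. With `norm=None` and each term occurring once per document, the weight is just the IDF, and the default smoothed form is idf = ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df the number of documents containing the term. `smooth_idf` is a hypothetical helper written for this check.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

print(round(smooth_idf(3, 1), 8))  # 1.69314718, a term in one of three documents
print(round(smooth_idf(3, 2), 8))  # 1.28768207, 'trade' appears in two documents
```

Terms that occur in more documents get a smaller weight, which is why `trade` is scored lower than the terms unique to one document.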


## Run on the actual data

In [63]:
DataPath = "./data/kadai"
sentence = []
for i in range(1,len(List_Data_NL)+1):
    with open(DataPath +str(i) + ".html" ) as f:
        r = f.read()
        sentence.append(r)

### First, plain Bag-of-Words

In [64]:
vectors1, vectorizer1 = bow(sentence)
print('# normal BoW')
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out().
print(vectorizer1.get_feature_names_out().tolist())
print(vectors1)

# normal BoW
[[0 0 1 ... 0 0 0]
 [2 4 1 ... 0 0 0]
 [4 4 1 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 2 1 ... 0 0 0]
 [0 0 1 ... 0 0 0]]


In [65]:
hoge = cos_sim(vectors1)
print('# cosine_similarity')
for index in range(len(hoge)):
    print(np.round(hoge[index],3))

# cosine_similarity
[1.    0.759 0.75  0.733 0.712 0.744 0.778 0.73  0.795 0.802 0.167 0.725
 0.91 ]
[0.759 1.    0.992 0.995 0.991 0.989 0.963 0.982 0.977 0.967 0.069 0.968
 0.659]
[0.75  0.992 1.    0.99  0.988 0.992 0.968 0.985 0.972 0.965 0.084 0.97
 0.655]
[0.733 0.995 0.99  1.    0.994 0.989 0.955 0.982 0.97  0.96  0.051 0.967
 0.63 ]
[0.712 0.991 0.988 0.994 1.    0.987 0.953 0.985 0.964 0.949 0.072 0.968
 0.613]
[0.744 0.989 0.992 0.989 0.987 1.    0.969 0.986 0.972 0.963 0.063 0.968
 0.654]
[0.778 0.963 0.968 0.955 0.953 0.969 1.    0.96  0.957 0.954 0.095 0.949
 0.709]
[0.73  0.982 0.985 0.982 0.985 0.986 0.96  1.    0.971 0.955 0.102 0.963
 0.645]
[0.795 0.977 0.972 0.97  0.964 0.972 0.957 0.971 1.    0.98  0.095 0.944
 0.707]
[0.802 0.967 0.965 0.96  0.949 0.963 0.954 0.955 0.98  1.    0.095 0.935
 0.719]
[0.167 0.069 0.084 0.051 0.072 0.063 0.095 0.102 0.095 0.095 1.    0.116
 0.154]
[0.725 0.968 0.97  0.967 0.968 0.968 0.949 0.963 0.944 0.935 0.116 1.
 0.646]
[0.91  0.659
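
`cos_sim` is defined outside this section. As a rough illustration of what a pairwise cosine-similarity matrix like the one above contains, here is a minimal NumPy sketch; `pairwise_cosine` is a hypothetical name and not necessarily how `cos_sim` is implemented.

```python
import numpy as np

def pairwise_cosine(matrix, eps=1e-12):
    """Return the (n_docs, n_docs) cosine-similarity matrix of the rows."""
    x = np.asarray(matrix, dtype=float)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.maximum(norms, eps)   # guard against all-zero rows
    return unit @ unit.T

counts = np.array([[1, 1, 0],
                   [2, 2, 0],
                   [0, 0, 3]])
sims = pairwise_cosine(counts)
print(np.round(sims, 3))
```

Rows 0 and 1 point in the same direction, so their similarity is 1 even though the counts differ; row 2 shares no terms with them, so its similarity to both is 0. A SciPy sparse matrix would need `.toarray()` before being passed in.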

### Next, Bag-of-Words with TF-IDF

In [66]:
vectors, vectorizer = bow_tfidf(sentence)
print('# BoW + tfidf')
print(vectorizer.get_feature_names())
print(vectors)

# BoW + tfidf
[[ 0.          0.          1.07410797 ...  0.          0.
   0.        ]
 [ 5.08089008  5.76733101  1.07410797 ...  0.          0.
   0.        ]
 [10.16178016  5.76733101  1.07410797 ...  0.          0.
   0.        ]
 ...
 [ 0.          0.          0.         ...  0.          0.
   0.        ]
 [ 0.          2.8836655   1.07410797 ...  0.          0.
   0.        ]
 [ 0.          0.          1.07410797 ...  0.          0.
   0.        ]]


In [70]:
vectors

array([[ 0.        ,  0.        ,  1.07410797, ...,  0.        ,
         0.        ,  0.        ],
       [ 5.08089008,  5.76733101,  1.07410797, ...,  0.        ,
         0.        ,  0.        ],
       [10.16178016,  5.76733101,  1.07410797, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  2.8836655 ,  1.07410797, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  1.07410797, ...,  0.        ,
         0.        ,  0.        ]])

In [67]:
hoge = cos_sim(vectors)
print('# cosine_similarity')
for index in range(len(hoge)):
    print(np.round(hoge[index],3))

# cosine_similarity
[1.    0.741 0.731 0.717 0.696 0.724 0.749 0.705 0.769 0.77  0.111 0.704
 0.869]
[0.741 1.    0.986 0.991 0.986 0.981 0.944 0.966 0.963 0.947 0.045 0.955
 0.631]
[0.731 0.986 1.    0.985 0.982 0.983 0.947 0.968 0.957 0.943 0.056 0.957
 0.626]
[0.717 0.991 0.985 1.    0.99  0.981 0.936 0.968 0.957 0.94  0.034 0.955
 0.604]
[0.696 0.986 0.982 0.99  1.    0.978 0.934 0.969 0.95  0.929 0.047 0.955
 0.587]
[0.724 0.981 0.983 0.981 0.978 1.    0.948 0.972 0.955 0.939 0.042 0.952
 0.624]
[0.749 0.944 0.947 0.936 0.934 0.948 1.    0.933 0.929 0.919 0.062 0.921
 0.67 ]
[0.705 0.966 0.968 0.968 0.969 0.972 0.933 1.    0.949 0.925 0.067 0.94
 0.611]
[0.769 0.963 0.957 0.957 0.95  0.955 0.929 0.949 1.    0.956 0.062 0.923
 0.672]
[0.77  0.947 0.943 0.94  0.929 0.939 0.919 0.925 0.956 1.    0.062 0.907
 0.678]
[0.111 0.045 0.056 0.034 0.047 0.042 0.062 0.067 0.062 0.062 1.    0.078
 0.099]
[0.704 0.955 0.957 0.955 0.955 0.952 0.921 0.94  0.923 0.907 0.078 1.
 0.615]
[0.869 0.631

## After applying normalization

In [68]:
mms = ms.fit_transform(vectors)

In [69]:
hoge = cos_sim(mms)
print('# cosine_similarity')
for index in range(len(hoge)):
    print(np.round(hoge[index],3))

# cosine_similarity
[1.    0.26  0.21  0.23  0.267 0.249 0.253 0.255 0.252 0.242 0.039 0.264
 0.343]
[0.26  1.    0.24  0.27  0.29  0.272 0.265 0.254 0.249 0.221 0.025 0.234
 0.235]
[0.21  0.24  1.    0.234 0.248 0.238 0.219 0.22  0.216 0.194 0.012 0.225
 0.195]
[0.23  0.27  0.234 1.    0.307 0.263 0.256 0.261 0.255 0.224 0.025 0.245
 0.204]
[0.267 0.29  0.248 0.307 1.    0.302 0.289 0.262 0.282 0.242 0.012 0.275
 0.235]
[0.249 0.272 0.238 0.263 0.302 1.    0.286 0.287 0.284 0.24  0.013 0.244
 0.235]
[0.253 0.265 0.219 0.256 0.289 0.286 1.    0.297 0.278 0.243 0.016 0.253
 0.251]
[0.255 0.254 0.22  0.261 0.262 0.287 0.297 1.    0.286 0.253 0.017 0.253
 0.257]
[0.252 0.249 0.216 0.255 0.282 0.284 0.278 0.286 1.    0.304 0.029 0.252
 0.258]
[0.242 0.221 0.194 0.224 0.242 0.24  0.243 0.253 0.304 1.    0.019 0.227
 0.24 ]
[0.039 0.025 0.012 0.025 0.012 0.013 0.016 0.017 0.029 0.019 1.    0.035
 0.029]
[0.264 0.234 0.225 0.245 0.275 0.244 0.253 0.253 0.252 0.227 0.035 1.
 0.247]
[0.343 0.23
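
The similarities drop sharply after scaling. The definition of `ms` is not shown in this section; assuming it is a column-wise min-max scaler (for example scikit-learn's `MinMaxScaler`), the sketch below shows on made-up numbers why such scaling can lower cosine similarity: a feature that dominated every document is rescaled to the same [0, 1] range as the rare features. `min_max_columns` and `cosine_rows` are hypothetical helpers.

```python
import numpy as np

def min_max_columns(x):
    """Scale each column to [0, 1]; constant columns are mapped to 0."""
    x = np.asarray(x, dtype=float)
    col_min = x.min(axis=0)
    span = x.max(axis=0) - col_min
    span = np.where(span == 0, 1.0, span)   # avoid division by zero
    return (x - col_min) / span

def cosine_rows(x):
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    return unit @ unit.T

raw = np.array([[100., 1., 0.],
                [ 90., 0., 1.],
                [ 80., 1., 1.]])
scaled = min_max_columns(raw)
raw_sim = cosine_rows(raw)
scaled_sim = cosine_rows(scaled)
print(np.round(raw_sim[0, 1], 3), np.round(scaled_sim[0, 1], 3))
```

Before scaling, the first column dominates both rows and they look almost identical; afterwards the small differences in the other columns carry equal weight.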

# Level 3: Generate feature vectors from the word co-occurrence matrix

Import the required modules.

## Example

Text data

In [92]:
sentences = 'pandas is an open source programming tools. The best way to get pandas is via conda. "conda install pandas"'

In [93]:
print(sentences)
print('len(sentences) = ', len(sentences))

pandas is an open source programming tools. The best way to get pandas is via conda. "conda install pandas"
len(sentences) =  107


In [94]:
DataPath = "./data/kadai"
sentence = ""
for i in range(1,len(List_Data_NL)+1):
    with open(DataPath +str(i) + ".html" ) as f:
        r = f.read()
        sentence += r

**Preprocessing** the text

In [75]:
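def preprocess(text):
    """Turn raw text into a sequence of word ids.

    :param text(str): input text.
    :return: (corpus, word_to_id, id_to_word); corpus is a NumPy array of word ids.
    """
    text = text.lower()
    text = text.replace('.', ' .')   # make the period a separate token
    text = text.replace('"', '')     # drop double quotes
    words = text.split(' ')          # splits on single spaces only

    word_to_id = {}
    id_to_word = {}
    for word in words:
        if word not in word_to_id:
            new_id = len(word_to_id)   # ids follow order of first appearance
            word_to_id[word] = new_id
            id_to_word[new_id] = word
    corpus = np.array([word_to_id[w] for w in words])
    return corpus, word_to_id, id_to_word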


In [95]:
corpus, word_to_id, id_to_word = preprocess(sentences)
vocab_size = len(word_to_id)
print(corpus)
print(word_to_id)
print(id_to_word)

[ 0  1  2  3  4  5  6  7  8  9 10 11 12  0  1 13 14  7 14 15  0]
{'pandas': 0, 'is': 1, 'an': 2, 'open': 3, 'source': 4, 'programming': 5, 'tools': 6, '.': 7, 'the': 8, 'best': 9, 'way': 10, 'to': 11, 'get': 12, 'via': 13, 'conda': 14, 'install': 15}
{0: 'pandas', 1: 'is', 2: 'an', 3: 'open', 4: 'source', 5: 'programming', 6: 'tools', 7: '.', 8: 'the', 9: 'best', 10: 'way', 11: 'to', 12: 'get', 13: 'via', 14: 'conda', 15: 'install'}


### Verifying with the actual data

In [103]:
corpus, word_to_id, id_to_word = preprocess(sentence)
vocab_size = len(word_to_id)
print(corpus)
print(word_to_id)
print(id_to_word)

[    0     1     2 ...  2388  2409 35793]


Create the **co-occurrence matrix**.

In [96]:
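def create_co_matrix(corpus, vocab_size, window_size=1):
    """Count how often each pair of word ids occurs within window_size positions.

    :param corpus: sequence of word ids.
    :param vocab_size(int): number of distinct words.
    :param window_size(int): number of neighbors counted on each side.
    :return: (vocab_size, vocab_size) int32 co-occurrence matrix.
    """
    corpus_size = len(corpus)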
    co_matrix = np.zeros((vocab_size, vocab_size), dtype=np.int32)

    for idx, word_id in enumerate(corpus):
        for i in range(1, window_size+1):
            left_idx = idx - i
            right_idx = idx + i
            if left_idx >= 0:
                left_word_id = corpus[left_idx]
                co_matrix[word_id, left_word_id] += 1
            if right_idx < corpus_size:
                right_word_id = corpus[right_idx]
                co_matrix[word_id, right_word_id] += 1
    return co_matrix
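
To check the counting rule on a case small enough to verify by hand, the sketch below applies the same window logic to a five-token toy corpus. `count_cooccurrences` is a hypothetical, self-contained restatement of the function above, not a replacement for it.

```python
import numpy as np

def count_cooccurrences(corpus, vocab_size, window_size=1):
    """Same counting rule as create_co_matrix, written with a sliding range."""
    co = np.zeros((vocab_size, vocab_size), dtype=np.int32)
    for idx, word_id in enumerate(corpus):
        lo = max(0, idx - window_size)
        hi = min(len(corpus), idx + window_size + 1)
        for ctx in range(lo, hi):
            if ctx != idx:                      # skip the center word itself
                co[word_id, corpus[ctx]] += 1
    return co

toy_corpus = np.array([0, 1, 2, 1, 0])          # e.g. "a b c b a"
toy_matrix = count_cooccurrences(toy_corpus, vocab_size=3)
print(toy_matrix)
```

Word 1 sits next to word 0 twice and next to word 2 twice, and the matrix is symmetric because every neighbor relation is counted from both sides.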

In [104]:
co_matrix = create_co_matrix(corpus, vocab_size, window_size=2)
df = pd.DataFrame(co_matrix, index=word_to_id.keys(), columns=word_to_id.keys())
df

Unnamed: 0,<!--,saved,from,url=(0035)https://www,.nltk,.org/book/ch00,.html,--> <html,xmlns=http://www,.w3,...,serve the,"society, and",pathway,riches of,.</p> <p><em>but,present:,hacking!</em></p> <!--,name=noun_phrase_index_term> <p,name=noun_phrase_index_term>updated,acst</p> </div> </div> </div> </body></html>
\n<!--,0,1,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
saved,1,0,13,12,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
from,1,13,2,12,13,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
url=(0035)https://www,0,12,12,0,12,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
.nltk,0,0,13,12,0,5,12,0,0,0,...,0,0,0,0,0,0,0,0,0,0
.org/book/ch00,0,0,0,1,5,0,1,1,0,0,...,0,0,0,0,0,0,0,0,0,0
.html,0,0,0,0,12,1,0,12,12,0,...,0,0,0,0,0,0,0,0,0,0
-->\n<html,0,0,0,0,0,1,12,0,12,12,...,0,0,0,0,0,0,0,0,0,0
xmlns=http://www,0,0,0,0,0,0,12,12,0,12,...,0,0,0,0,0,0,0,0,0,0
.w3,0,0,0,0,0,0,0,12,12,0,...,0,0,0,0,0,0,0,0,0,0
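
The row and column labels above still contain HTML fragments such as `<!--` and `xmlns=http://www`, because `preprocess` splits the raw file on single spaces only. One possible cleanup is sketched below, under the assumption that regex-based tag removal is adequate for these pages (a real HTML parser would be more robust). `strip_html_tokens` is a hypothetical helper and was not used to produce any output in this report.

```python
import re

def strip_html_tokens(raw_html):
    """Lowercase, drop HTML comments and tags, and split on whitespace."""
    text = re.sub(r'<!--.*?-->', ' ', raw_html, flags=re.DOTALL)  # comments
    text = re.sub(r'<[^>]+>', ' ', text)                          # tags
    text = text.lower().replace('.', ' .')   # keep '.' as its own token
    return text.split()                      # any run of whitespace, incl. newlines

sample = '<!-- saved from url -->\n<p>The <em>natural</em> language\ncommunity.</p>'
tokens = strip_html_tokens(sample)
print(tokens)
```

Splitting with `split()` instead of `split(' ')` also prevents tokens that span a line break.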


Compute the **cosine similarity**.

In [98]:
def cos_similarity(x, y, eps=1e-8):
    nx = x / (np.sqrt(np.sum(x ** 2)) + eps)
    ny = y / (np.sqrt(np.sum(y ** 2)) + eps)
    return np.dot(nx, ny)

Print the top 5 words by cosine similarity.

In [114]:
def most_similar(query, word_to_id, id_to_word, word_matrix, top=5):
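    """Print the words most similar to the query by cosine similarity.

    :param query(str): query word.
    :param word_to_id(dict): maps each word to its id.
    :param id_to_word(dict): maps each id to its word.
    :param word_matrix: co-occurrence matrix.
    :param top(int): how many of the top entries to print.
    :return: None.
    """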
    if query not in word_to_id:
        print('%s is not found' % query)
        return

    print('[query] ' + query)
    query_id = word_to_id[query]
    query_vec = word_matrix[query_id]

    vocab_size = len(word_to_id)
    similarity = np.zeros(vocab_size)
    for i in range(vocab_size):
        similarity[i] = cos_similarity(word_matrix[i], query_vec)

    count = 0
    for i in (-1 * similarity).argsort():
        if id_to_word[i] == query:
            continue
        print(' %s: %s' % (id_to_word[i], np.round(similarity[i],3)))
        count += 1
        if count >= top:
            return
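
The loop in `most_similar` calls `cos_similarity` once per vocabulary entry, which becomes slow for a vocabulary of tens of thousands of words. A possible vectorized alternative is sketched below: normalize every row once, then obtain all similarities with a single matrix-vector product. `rank_by_cosine` is a hypothetical helper that returns ids and scores instead of printing them.

```python
import numpy as np

def rank_by_cosine(query_id, word_matrix, top=5, eps=1e-8):
    """Return (ids, scores) of the `top` rows most similar to row `query_id`."""
    m = np.asarray(word_matrix, dtype=float)
    unit = m / (np.linalg.norm(m, axis=1, keepdims=True) + eps)
    scores = unit @ unit[query_id]       # cosine similarity to every row
    scores[query_id] = -np.inf           # exclude the query itself
    order = np.argsort(-scores)[:top]    # highest similarity first
    return order, scores[order]

demo = np.array([[0, 2, 0],
                 [2, 0, 2],
                 [0, 2, 0],
                 [1, 1, 0]])
ids, scores = rank_by_cosine(0, demo, top=2)
print(ids, np.round(scores, 3))
```

For a large co-occurrence matrix the normalized copy is as large as the matrix itself, so this trades memory for speed.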

In [115]:
print('\n# most_similar() with co_matrix')
user_query = "pandas"
most_similar(user_query, word_to_id, id_to_word, co_matrix)


# most_similar() with co_matrix
pandas is not found


## Report word-to-word similarity (cosine similarity) in matrix notation

In [116]:
print("\n# most_similar() with co_matrix")
user_word = "natural"
most_similar(user_word,word_to_id, id_to_word, co_matrix)


# most_similar() with co_matrix
[query] natural
 processing: 0.719
 technologies: 0.716
 community</h2>
<p>the: 0.706
 <em>natural: 0.689
 technology: 0.625


In [117]:
print("\n# most_similar() with co_matrix")
user_word = "language"
most_similar(user_word,word_to_id, id_to_word, co_matrix)


# most_similar() with co_matrix
[query] language
 tagged
corpus,: 0.709
 words: 0.706
 text: 0.686
 task: 0.683
 texts: 0.668


In [118]:
print("\n# most_similar() with co_matrix")
user_word = "text"
most_similar(user_word,word_to_id, id_to_word, co_matrix)


# most_similar() with co_matrix
[query] text
 sentence: 0.881
 context: 0.861
 grammar: 0.852
 vocabulary: 0.847
 words: 0.844


In [119]:
print("\n# most_similar() with co_matrix")
user_word = "count"
most_similar(user_word,word_to_id, id_to_word, co_matrix)


# most_similar() with co_matrix
[query] count
 : 0.895
 use: 0.748
 first: 0.743
 tag: 0.743
 name: 0.74


In [120]:
print("\n# most_similar() with co_matrix")
user_word = "python"
most_similar(user_word,word_to_id, id_to_word, co_matrix)


# most_similar() with co_matrix
[query] python
 text: 0.778
 sentence: 0.771
 program: 0.758
 grammar: 0.756
 tag: 0.75


In [121]:
print("\n# most_similar() with co_matrix")
user_word = "sentence"
most_similar(user_word,word_to_id, id_to_word, co_matrix)


# most_similar() with co_matrix
[query] sentence
 text: 0.881
 name: 0.87
 tag: 0.869
 corpus: 0.868
 string: 0.867


# Level 4: Classify documents

In [123]:
from sklearn.datasets import fetch_20newsgroups
from pprint import pprint
import glob
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.pipeline import make_pipeline
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

In [73]:
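newsgroups_train = fetch_20newsgroups(subset='train')
# target_names is already a list. The original call wrapped it in list(), which
# raised "TypeError: 'list' object is not callable" in this session, most likely
# because the name `list` had been rebound to a list object earlier.
pprint(newsgroups_train.target_names)

Downloading 20news dataset. This may take a few minutes.
Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)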

In [93]:
categories = ['alt.atheism', 'talk.religion.misc','comp.graphics', 'sci.space']

In [94]:
newsgroups_train = fetch_20newsgroups(subset='train',categories=categories)

In [98]:
newsgroups_train

  "Subject: Re: Biblical Backing of Koresh's 3-02 Tape (Cites enclosed)\nFrom: kmcvay@oneb.almanac.bc.ca (Ken Mcvay)\nOrganization: The Old Frog's Almanac\nLines: 20\n\nIn article <20APR199301460499@utarlg.uta.edu> b645zaw@utarlg.uta.edu (stephen) writes:\n\n>Seems to me Koresh is yet another messenger that got killed\n>for the message he carried. (Which says nothing about the \n\nSeems to be, barring evidence to the contrary, that Koresh was simply\nanother deranged fanatic who thought it neccessary to take a whole bunch of\nfolks with him, children and all, to satisfy his delusional mania. Jim\nJones, circa 1993.\n\n>In the mean time, we sure learned a lot about evil and corruption.\n>Are you surprised things have gotten that rotten?\n\nNope - fruitcakes like Koresh have been demonstrating such evil corruption\nfor centuries.\n-- \nThe Old Frog's Almanac - A Salute to That Old Frog Hisse'f, Ryugen Fisher \n     (604) 245-3205 (v32) (604) 245-4366 (2400x4) SCO XENIX 2.3.2 GT \n  Ladys

In [96]:
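# A Bunch is dict-like, so len() counts its keys, not the documents.
# Use len(newsgroups_train.data) for the number of documents.
len(newsgroups_train)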

5

In [99]:
newsgroups_train.

 "Subject: Re: Biblical Backing of Koresh's 3-02 Tape (Cites enclosed)\nFrom: kmcvay@oneb.almanac.bc.ca (Ken Mcvay)\nOrganization: The Old Frog's Almanac\nLines: 20\n\nIn article <20APR199301460499@utarlg.uta.edu> b645zaw@utarlg.uta.edu (stephen) writes:\n\n>Seems to me Koresh is yet another messenger that got killed\n>for the message he carried. (Which says nothing about the \n\nSeems to be, barring evidence to the contrary, that Koresh was simply\nanother deranged fanatic who thought it neccessary to take a whole bunch of\nfolks with him, children and all, to satisfy his delusional mania. Jim\nJones, circa 1993.\n\n>In the mean time, we sure learned a lot about evil and corruption.\n>Are you surprised things have gotten that rotten?\n\nNope - fruitcakes like Koresh have been demonstrating such evil corruption\nfor centuries.\n-- \nThe Old Frog's Almanac - A Salute to That Old Frog Hisse'f, Ryugen Fisher \n     (604) 245-3205 (v32) (604) 245-4366 (2400x4) SCO XENIX 2.3.2 GT \n  Ladysm

In [102]:
vectorizer = TfidfVectorizer()

In [103]:
vectors = vectorizer.fit_transform(newsgroups_train.data)
vectors.shape

(2034, 34118)

In [104]:
vectors.nnz / float(vectors.shape[0])

159.0132743362832
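
Dividing `nnz` by the number of rows gives the average number of nonzero TF-IDF entries per document. The arithmetic below, using only the figures printed above, relates that average to the total number of stored values and to the density of the matrix.

```python
n_docs, n_features = 2034, 34118          # vectors.shape from the output above
avg_nonzero_per_doc = 159.0132743362832   # vectors.nnz / n_docs from the output above

total_nonzero = round(avg_nonzero_per_doc * n_docs)
density = avg_nonzero_per_doc / n_features
print(total_nonzero, f'{density:.4%}')
```

Fewer than half a percent of the entries are nonzero, which is why `TfidfVectorizer` returns a sparse matrix rather than a dense array.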

In [106]:
newsgroups_test = fetch_20newsgroups(subset='test',categories=categories)
vectors_test = vectorizer.transform(newsgroups_test.data)

In [107]:
clf = MultinomialNB(alpha=.01)
clf.fit(vectors, newsgroups_train.target)
pred = clf.predict(vectors_test)

In [108]:
metrics.f1_score(newsgroups_test.target, pred, average='macro')

0.8821359240272957
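
With `average='macro'`, F1 is computed per class and then averaged without weighting, so each of the four categories counts equally regardless of its size. The sketch below reproduces that definition by hand on made-up labels (not the newsgroup data); `macro_f1` is a hypothetical helper.

```python
import numpy as np

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 = 2*TP / (2*TP + FP + FN)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for label in np.unique(np.concatenate([y_true, y_pred])):
        tp = np.sum((y_pred == label) & (y_true == label))
        fp = np.sum((y_pred == label) & (y_true != label))
        fn = np.sum((y_pred != label) & (y_true == label))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
toy_score = macro_f1(toy_true, toy_pred)
print(round(toy_score, 3))
```

The per-class scores here are 0.5, 0.8 and 2/3, so the macro average is about 0.656.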

In [124]:
data = fetch_20newsgroups()

In [125]:
print(data.target_names)

['alt.atheism', 'comp.graphics', 'comp.os.ms-windows.misc', 'comp.sys.ibm.pc.hardware', 'comp.sys.mac.hardware', 'comp.windows.x', 'misc.forsale', 'rec.autos', 'rec.motorcycles', 'rec.sport.baseball', 'rec.sport.hockey', 'sci.crypt', 'sci.electronics', 'sci.med', 'sci.space', 'soc.religion.christian', 'talk.politics.guns', 'talk.politics.mideast', 'talk.politics.misc', 'talk.religion.misc']


In [126]:
train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

In [127]:
print("Train Data Count =", len(train.data))
print("Test Data Count =", len(test.data))

Train Data Count = 11314
Test Data Count = 7532


In [128]:
print(train.data[10][:500])

From: irwin@cmptrc.lonestar.org (Irwin Arnstein)
Subject: Re: Recommendation on Duc
Summary: What's it worth?
Distribution: usa
Expires: Sat, 1 May 1993 05:00:00 GMT
Organization: CompuTrac Inc., Richardson TX
Keywords: Ducati, GTS, How much? 
Lines: 13

I have a line on a Ducati 900GTS 1978 model with 17k on the clock.  Runs
very well, paint is the bronze/brown/orange faded out, leaks a bit of oil
and pops out of 1st with hard accel.  The shop will fix trans and oil 
leak.  They sold the bike t


In [129]:
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

In [130]:
model.fit(train.data, train.target)

Pipeline(memory=None,
         steps=[('tfidfvectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('multinomialnb',
                 MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))],
         verbose=False)

In [131]:
print('Train accuracy = %.3f' % model.score(train.data, train.target))
print(' Test accuracy = %.3f' % model.score(test.data, test.target))

Train accuracy = 0.933
 Test accuracy = 0.774


In [132]:
plt.rcParams['figure.figsize'] = (15.0, 15.0)

In [133]:
# `labels` was undefined in the original run (NameError); predict on the test set first.
labels = model.predict(test.data)
mat = confusion_matrix(test.target, labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=train.target_names, yticklabels=train.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label')
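
`confusion_matrix(y_true, y_pred)` returns a matrix whose rows index the true class and whose columns index the predicted class, which is why the cell above plots `mat.T` with 'true label' on the x-axis. A minimal NumPy sketch of the same counting on made-up labels follows; `confusion_counts` is a hypothetical helper, not scikit-learn's implementation.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    mat = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(mat, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return mat

toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
toy_mat = confusion_counts(toy_true, toy_pred, n_classes=3)
print(toy_mat)
print(toy_mat.T)   # transposed: rows = predicted, columns = true
```

The diagonal holds the correctly classified documents, so `toy_mat.trace() / toy_mat.sum()` is the accuracy.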