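When working with large text corpora, we need to convert millions of words into numerical form before they can be fed to a machine learning algorithm. This is where the bag-of-words model comes in.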

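Bag-of-words is essentially a model that first learns a vocabulary from all the words across every document. It then models each document by counting how often each vocabulary word occurs in it, which yields one count vector per document. Those vectors can then be used as input for machine learning.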
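To make the two steps concrete, here is a minimal pure-Python sketch of the idea, not the implementation used later in this notebook. The sample sentences and the helper name `to_vector` are made up for illustration, and tokenization is assumed to be a plain whitespace split.

```python
from collections import Counter

# Two toy documents (hypothetical sample data)
docs = ["the cat sat on the mat", "the dog sat"]

# Step 1: learn the vocabulary, i.e. every distinct word across all documents
vocab = sorted({word for doc in docs for word in doc.split()})

def to_vector(doc, vocab):
    """Step 2: count how often each vocabulary word occurs in one document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [to_vector(doc, vocab) for doc in docs]

print(vocab)
for vec in vectors:
    print(vec)
```

Each document becomes a vector with one slot per vocabulary word, so documents of different lengths end up with vectors of the same size.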

In [1]:
import numpy as np
from nltk.corpus import brown
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = 'all'

In [2]:
# Quick refresher: list.append and str.join
a=[]
a.append('word')
a.append('data dd')
a
' '.join(a)


['word', 'data dd']

'word data dd'

In [3]:
# Chunking function
def splitter(data, num_words):  # each chunk holds num_words words
    words = data.split(' ')
    output = []
    
    cur_words = []
    cur_count = 0
    for word in words:
        cur_words.append(word)
        cur_count += 1
        if cur_count == num_words:   # the chunk is full
            output.append(' '.join(cur_words)) # join the words with spaces into one string and append it to output
            cur_words = []
            cur_count = 0
     
    # Leftover words (fewer than num_words) form a final chunk; if the word count
    # divides evenly, this appends one extra empty-string chunk at the end
    output.append(' '.join(cur_words))
    return output    # list of chunk strings

# Example:
d='a b c d'
splitter(d,2)  # 2 words per chunk: 2 full chunks plus a trailing empty chunk = 3 chunks
splitter(d,3)  # 3 words per chunk: 2 chunks
len(d)         # length of the string in characters, not words

['a b', 'c d', '']

['a b c', 'd']

7
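As the first output shows, `splitter` appends an empty chunk whenever the word count divides evenly by `num_words`. If that is not wanted, a slicing-based variant avoids it. This is only a sketch: `split_into_chunks` is a hypothetical name, and it uses `str.split()` with no argument, which also collapses repeated whitespace, unlike `split(' ')` above.

```python
def split_into_chunks(text, num_words):
    """Split text into chunks of at most num_words words, with no empty chunks."""
    words = text.split()
    return [
        ' '.join(words[i:i + num_words])
        for i in range(0, len(words), num_words)
    ]

even = split_into_chunks('a b c d', 2)    # word count divides evenly
uneven = split_into_chunks('a b c d', 3)  # last chunk is shorter
print(even)
print(uneven)
```

Swapping this into the notebook would change the number of chunks produced from the Brown sample, so the `min_df` and `max_df` settings used later would need to be revisited.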

In [5]:
# read the data from the Brown corpus
data = ' '.join(brown.words()[:10000])  # join the word list into one string

# Number of words in each chunk: 5 chunks of 2000 words each (splitter also appends a trailing empty chunk)
text_chunks = splitter(data, 2000)

chunks = []
counter = 0  # running index, stored under the 'index' key
for text in text_chunks:
    chunk = {'index': counter, 'text': text} # one dictionary per chunk
    chunks.append(chunk)     # list of dictionaries
    counter += 1

#print(len(chunks))
chunks[0]

{'index': 0,

## Extract document term matrix

This matrix records how many times each word occurs in each document.

For this task, sklearn provides a better implementation than nltk.
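The `CountVectorizer` call in the next cell passes `min_df` and `max_df`, which prune the vocabulary by document frequency: terms that appear in fewer than `min_df` documents, or in more than a `max_df` proportion of documents when it is a float, are ignored. The pure-Python sketch below imitates that filtering on toy data. The function name `filter_by_document_frequency` and the sample documents are assumptions for illustration, and it uses a plain whitespace split rather than sklearn's tokenizer.

```python
from collections import Counter

def filter_by_document_frequency(docs, min_df, max_df):
    """Keep words whose document frequency is at least min_df (a count)
    and at most max_df * n_docs (max_df given as a proportion)."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))  # count each word once per document
    max_count = max_df * n_docs
    return sorted(w for w, c in doc_freq.items() if min_df <= c <= max_count)

docs = ["the tax bill", "the tax vote", "the city vote"]
kept = filter_by_document_frequency(docs, min_df=2, max_df=0.95)
# An extra empty document raises n_docs, so 'the' (3 of 4 documents) now passes max_df
kept_with_empty = filter_by_document_frequency(docs + [""], min_df=2, max_df=0.95)
print(kept)
print(kept_with_empty)
```

The second call is relevant here because `splitter` adds a trailing empty chunk for the 10000-word sample: with that extra document, words present in all five non-empty chunks still fall under the `max_df=.95` threshold.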

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=5, max_df=.95)
doc_term_matrix = vectorizer.fit_transform([chunk['text'] for chunk in chunks])

vocab = np.array(vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
print('\nVocabulary')
print(vocab)    

print('\nDocument term matrix') # To print in tabular form, we need to format this:
chunk_names = ['Chunk-0', 'Chunk-1', 'Chunk-2', 'Chunk-3', 'Chunk-4']
formatted_row = '{:>12}' * (len(chunk_names) + 1)
print('\n', formatted_row.format('Word', *chunk_names), '\n')

# Iterate over all words and print each word's count in the different chunks
for word, item in zip(vocab, doc_term_matrix.T):
    # 'item' is a one-row sparse matrix; item.data holds only its nonzero counts
    output = [str(x) for x in item.data]
    print(formatted_row.format(word, *output))  # formatting only


Vocabulary
['about' 'after' 'against' 'aid' 'all' 'also' 'an' 'and' 'are' 'as' 'at'
 'be' 'been' 'before' 'but' 'by' 'committee' 'congress' 'did' 'each'
 'education' 'first' 'for' 'from' 'general' 'had' 'has' 'have' 'he'
 'health' 'his' 'house' 'in' 'increase' 'is' 'it' 'last' 'made' 'make'
 'may' 'more' 'no' 'not' 'of' 'on' 'one' 'only' 'or' 'other' 'out' 'over'
 'pay' 'program' 'proposed' 'said' 'similar' 'state' 'such' 'take' 'than'
 'that' 'the' 'them' 'there' 'they' 'this' 'time' 'to' 'two' 'under' 'up'
 'was' 'were' 'what' 'which' 'who' 'will' 'with' 'would' 'year' 'years']

Document term matrix

         Word     Chunk-0     Chunk-1     Chunk-2     Chunk-3     Chunk-4 

       about           1           1           1           1           3
       after           2           3           2           1           3
     against           1           2           2           1           1
         aid           1           1           1           3           5
         all         