### TF-IDF

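TF: Term Frequency, measures how often a term occurs in a document.

TF(t) = (number of times t appears in the document) / (total number of terms in the document).

IDF: Inverse Document Frequency, measures how important a term is.

Some words occur very often but clearly carry little information, such as 'is', 'the', and 'and'.

To compensate, we raise the weight of rare words and lower the weight of common ones.

IDF(t) = log_e(total number of documents / number of documents containing t).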

TF-IDF = TF * IDF
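
The three formulas above can be sketched in plain Python as follows. This is a minimal illustration, not a library API: the function names, the toy corpus, and the choice to return 0.0 for a term that appears in no document (to avoid dividing by zero) are all assumptions made for this example. Each document is assumed to be a list of tokens.

```python
import math

def tf(term, tokens):
    """Term frequency: occurrences of term divided by the document length."""
    return tokens.count(term) / len(tokens)

def idf(term, docs):
    """Natural-log inverse document frequency; 0.0 if no document contains term."""
    matches = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / matches) if matches else 0.0

def tf_idf(term, tokens, docs):
    """Product of the two quantities above."""
    return tf(term, tokens) * idf(term, docs)

# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    "the cat sat".split(),
    "the dog barked".split(),
    "the cat purred".split(),
]

print(tf("cat", docs[0]))            # 1 of 3 tokens
print(idf("the", docs))              # in every document, so log(3/3) = 0.0
print(idf("dog", docs))              # in 1 of 3 documents, so log(3)
print(tf_idf("cat", docs[0], docs))  # (1/3) * log(3/2)
```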


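>Example:

>A document contains 100 words, and the word baby appears 3 times.

>Then TF(baby) = 3 / 100 = 0.03.

>Now suppose we have 10M documents, and baby appears in 1,000 of them.

>Then IDF(baby) = log_10(10,000,000 / 1,000) = 4. This example uses the base-10 logarithm to keep the numbers round; with the natural logarithm from the definition above, the value is ln(10,000) ≈ 9.21.

>So TF-IDF(baby) = TF(baby) * IDF(baby) = 0.03 * 4 = 0.12 (or 0.03 * 9.21 ≈ 0.28 with the natural logarithm).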
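
A quick numeric check of the example, comparing the base-10 logarithm (which gives the value 4) with the natural logarithm used in the IDF definition. The variable names are chosen for this sketch only.

```python
import math

tf_baby = 3 / 100              # 3 occurrences in a 100-word document
ratio = 10_000_000 / 1_000     # total documents / documents containing "baby"

idf_log10 = math.log10(ratio)  # base-10 logarithm
idf_ln = math.log(ratio)       # natural logarithm

print(idf_log10, tf_baby * idf_log10)
print(round(idf_ln, 2), round(tf_baby * idf_ln, 2))
```

The choice of base only rescales every IDF value by the same constant factor, so it does not change how terms rank against each other.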

In [18]:
import nltk
from nltk.text import TextCollection
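# Put all documents into a TextCollection.
# Note: TextCollection does not tokenize its input. It expects each text to be
# a sequence of tokens; the raw strings passed here are treated as character
# sequences, so the counts below are character- and substring-based.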
corpus = TextCollection(['this is sentence one',
                        'this is sentence two',
                        'this is sentence three'])

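# tf, idf, and tf-idf can be computed directly
# (term: a term in the sentence, text: the sentence itself)
print(corpus.tf('this', 'this is sentence four'))
text =  'this is sentence four'
print(text.count('this'),len(text))  # count is 1, but len(text) is 21: text is a str, so len() counts characters, not words
print(corpus.idf('this'))
print(corpus.tf_idf('this', 'this is sentence four'))
# tf of 'this' in the given sentence * idf of 'this' over the corpus
# here: 1/21 * log(3/3) = 0, since 'this' occurs in all 3 documents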

# Likewise, how do we get a fixed-size vector that can represent any sentence?

# For each new sentence
new_sentence = 'this is sentence five'
standard_vocab = nltk.word_tokenize('this is sentence one two that three')
print(standard_vocab)
# Loop over every word in the vocabulary:
for word in standard_vocab:
    print(corpus.tf_idf(word, new_sentence))
    # The printed values form one long vector (length = vocabulary size)
print('--------')
new_sentence1 = 'one two'
# Repeat for a second sentence:
for word in standard_vocab:
    print(corpus.tf_idf(word, new_sentence1))

0.047619047619047616
1 21
0.0
0.0
['this', 'is', 'sentence', 'one', 'two', 'that', 'three']
0.0
0.0
0.0
0.0
0.0
0.0
0.0
--------
0.0
0.0
0.0
0.15694461266687282
0.15694461266687282
0.0
0.0
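
The character-level numbers in this output (1/21, and 1/7 * ln 3 ≈ 0.157) come from passing raw strings: `count`, `len`, and `in` then operate on characters and substrings rather than words. The sketch below applies the same `count / len` and `log(N / matches)` expressions once to strings and once to token lists. The helper functions `tf` and `idf` and the whitespace `split()` tokenization are stand-ins written for this illustration, not NLTK calls.

```python
import math

def tf(term, text):
    # Works on any sequence that supports count() and len() (str or list).
    return text.count(term) / len(text)

def idf(term, texts):
    # log(number of texts / number of texts containing term), 0.0 if none match.
    matches = sum(1 for text in texts if term in text)
    return math.log(len(texts) / matches) if matches else 0.0

raw = ['this is sentence one', 'this is sentence two', 'this is sentence three']
tokenized = [s.split() for s in raw]   # whitespace tokenization, an assumption

sentence = 'this is sentence four'
print(tf('this', sentence))            # str: 1 match / 21 characters
print(tf('this', sentence.split()))    # list: 1 match / 4 tokens
print(sentence.count('is'))            # str: 2, because 'this' also contains 'is'
print(sentence.split().count('is'))    # list: 1

# Fixed-length vector for a new sentence, one entry per vocabulary word.
vocab = ['this', 'is', 'sentence', 'one', 'two', 'that', 'three']
new_tokens = 'one two'.split()
vector = [tf(w, new_tokens) * idf(w, tokenized) for w in vocab]
print(vector)
```

With token lists, the entries for 'one' and 'two' become 1/2 * ln 3 instead of 1/7 * ln 3, because the sentence length is now counted in words.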


In [13]:
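# Source code of TextCollection (excerpt from nltk/text.py; Text is defined
# earlier in that module, and log and LazyConcatenation are imported at its top)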
class TextCollection(Text):
    """A collection of texts, which can be loaded with list of texts, or
    with a corpus consisting of one or more texts, and which supports
    counting, concordancing, collocation discovery, etc.  Initialize a
    TextCollection as follows:

    >>> import nltk.corpus
    >>> from nltk.text import TextCollection
    >>> print('hack'); from nltk.book import text1, text2, text3
    hack...
    >>> gutenberg = TextCollection(nltk.corpus.gutenberg)
    >>> mytexts = TextCollection([text1, text2, text3])

    Iterating over a TextCollection produces all the tokens of all the
    texts in order.
    """
    def __init__(self, source):
        if hasattr(source, 'words'): # bridge to the text corpus reader
            source = [source.words(f) for f in source.fileids()]

        self._texts = source
        Text.__init__(self, LazyConcatenation(source))
        self._idf_cache = {}

    def tf(self, term, text):
        """ The frequency of the term in text. """
        return text.count(term) / len(text)


    def idf(self, term):
        """ The number of texts in the corpus divided by the
        number of texts that the term appears in.
        If a term does not appear in the corpus, 0.0 is returned. """
        # idf values are cached for performance.
        idf = self._idf_cache.get(term)
        if idf is None:
            matches = len([True for text in self._texts if term in text])
            # FIXME Should this raise some kind of error instead?
            idf = (log(len(self._texts) / matches) if matches else 0.0)
            self._idf_cache[term] = idf
        return idf


    def tf_idf(self, term, text):
        return self.tf(term, text) * self.idf(term)