## Problem
Convert text to features using TF-IDF.
## Solution
Term frequency (TF): the ratio of the number of times a word occurs in a
sentence to the length of that sentence (its total word count).

TF captures the importance of a word relative to the length of the
document. For example, a word that occurs 3 times in a 10-word sentence
is not equivalent to the same word occurring 3 times in a 100-word
sentence. It should get more weight in the first case, and TF reflects
that: 3/10 = 0.3 versus 3/100 = 0.03.
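
The comparison above can be checked with a few lines of Python. This is only a minimal sketch: `term_frequency` is a hypothetical helper, and it assumes naive whitespace tokenization with no punctuation handling.

```python
def term_frequency(term, sentence):
    """Return count(term) / number of tokens (naive whitespace tokenization)."""
    tokens = sentence.lower().split()
    return tokens.count(term.lower()) / len(tokens)

# The same word occurs 3 times in a 10-word and in a 100-word "sentence".
short_doc = "cat cat cat " + "filler " * 7    # 10 tokens
long_doc = "cat cat cat " + "filler " * 97    # 100 tokens

tf_short = term_frequency("cat", short_doc)
tf_long = term_frequency("cat", long_doc)
print(tf_short, tf_long)
```

The word gets ten times more weight in the shorter sentence, even though its raw count is identical.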

Inverse Document Frequency (IDF): the IDF of a word is the log of the
ratio of the total number of rows (documents) to the number of rows in
which that word is present.

IDF = log(N/n), where N is the total number of rows and n is the
number of rows in which the word is present.

IDF measures the rareness of a term. Words like “a” and “the” show up in
nearly every document of the corpus, whereas rare words do not. A word
that appears in almost all documents is of little use for classification
or information retrieval, because it does not help distinguish one
document from another. IDF down-weights such words.
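
As a rough illustration of IDF = log(N/n), the sketch below computes it by hand for a three-document corpus. The helpers `tokenize` and `idf` are hypothetical, the natural logarithm is an assumption (the formula above does not fix a base), and the tokenizer only lowercases and strips trailing periods and commas.

```python
import math

docs = ["The quick brown fox jumped over the lazy dog.",
        "The dog.",
        "The fox"]

def tokenize(doc):
    # Lowercase and strip simple punctuation; a real tokenizer does more.
    return [w.strip(".,").lower() for w in doc.split()]

def idf(term, docs):
    n_docs = len(docs)                                           # N
    n_containing = sum(1 for d in docs if term in tokenize(d))   # n
    return math.log(n_docs / n_containing)   # undefined when n == 0

for term in ["the", "dog", "quick"]:
    print(term, round(idf(term, docs), 4))
```

Here "the" appears in every document, so its IDF is log(3/3) = 0, while "quick" appears in only one document and receives the largest value.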

TF-IDF is simply the product of TF and IDF, so that both drawbacks are
addressed: raw counts that favor long documents, and common words that
carry little information. This makes predictions and information
retrieval more relevant.
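
A minimal from-scratch sketch of the product, using the textbook definitions given here rather than any library. The names `tokenize`, `doc_freq`, and `tf_idf` are illustrative, and the natural logarithm is an assumption.

```python
import math
from collections import Counter

docs = ["The quick brown fox jumped over the lazy dog.",
        "The dog.",
        "The fox"]

def tokenize(doc):
    # Lowercase and strip simple punctuation.
    return [w.strip(".,").lower() for w in doc.split()]

tokenized = [tokenize(d) for d in docs]
n_docs = len(tokenized)

# Number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(tokens):
    counts = Counter(tokens)
    return {term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()}

scores = [tf_idf(tokens) for tokens in tokenized]
for doc, score in zip(docs, scores):
    print(doc, {term: round(value, 3) for term, value in score.items()})
```

With these definitions, "the" scores 0 in every document because it occurs in all of them, while rarer words keep a positive weight.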

## Step 6-1 Read the text data
Start with a familiar phrase and two shorter documents:


In [49]:
Text = ["The quick brown fox jumped over the lazy dog.",
    "The dog.",
    "The fox"]

## Step 6-2 Creating the Features
Execute the following code on the text data:


In [63]:
#Import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

#Create the transform
vectorizer = TfidfVectorizer()

#Tokenize and build vocab
vectorizer.fit(Text)

#Summarize
print(vectorizer.vocabulary_)
print(vectorizer.idf_)


{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}
[1.69314718 1.28768207 1.28768207 1.69314718 1.69314718 1.69314718
 1.69314718 1.        ]
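
Note that these values are not plain log(N/n): the word "the" appears in all three documents yet has an IDF of 1, not 0. With its default settings (`smooth_idf=True`), scikit-learn computes ln((1 + N) / (1 + n)) + 1. The sketch below reproduces the printed numbers in pure Python; the `doc_freq` dictionary was counted by hand from the three documents and is the only input.

```python
import math

n_docs = 3
# Number of documents containing each term, counted by hand from Text.
doc_freq = {"brown": 1, "dog": 2, "fox": 2, "jumped": 1,
            "lazy": 1, "over": 1, "quick": 1, "the": 3}

# Smoothed IDF: ln((1 + N) / (1 + n)) + 1
smoothed_idf = {term: math.log((1 + n_docs) / (1 + df)) + 1
                for term, df in doc_freq.items()}

for term, value in smoothed_idf.items():
    print(term, round(value, 8))
```

The smoothing acts as if one extra document containing every term had been seen, which avoids division by zero, and the added 1 keeps terms that occur in every document from being zeroed out.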


In [62]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ['aaa ccc aaa aaa', 
          'aaa aaa', 
          'aaa aaa aaa', 
          'aaa aaa aaa aaa',
          'aaa bbb aaa bbb aaa',
          'ccc aaa aaa ccc aaa'
         ]
print(corpus)
vectorizer = CountVectorizer() 

X = vectorizer.fit_transform(corpus)

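# Get all words in the bag-of-words vocabulary
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
word = vectorizer.get_feature_names_out().tolist()
print(word)

# Get the number of times each word occurs in each row (document)
counts = X.toarray()
print(counts)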

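# # Method 1: apply TfidfTransformer to the count matrix
# transformer = TfidfTransformer()
# tfidf = transformer.fit_transform(X)
# #tfidf = transformer.fit_transform(counts)  # has exactly the same effect as the previous line


# Method 2: fit TfidfVectorizer directly on the raw text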
transformer = TfidfVectorizer()
transformer.fit(corpus)
tfidf = transformer.transform(corpus)

print(tfidf)
print(tfidf.toarray())

['aaa ccc aaa aaa', 'aaa aaa', 'aaa aaa aaa', 'aaa aaa aaa aaa', 'aaa bbb aaa bbb aaa', 'ccc aaa aaa ccc aaa']
['aaa', 'bbb', 'ccc']
[[3 0 1]
 [2 0 0]
 [3 0 0]
 [4 0 0]
 [3 2 0]
 [3 0 2]]
  (0, 2)	0.5243329281310096
  (0, 0)	0.85151334721046
  (1, 0)	1.0
  (2, 0)	1.0
  (3, 0)	1.0
  (4, 1)	0.8323642772534078
  (4, 0)	0.5542289327998063
  (5, 2)	0.7763051366495072
  (5, 0)	0.63035730725644
[[0.85151335 0.         0.52433293]
 [1.         0.         0.        ]
 [1.         0.         0.        ]
 [1.         0.         0.        ]
 [0.55422893 0.83236428 0.        ]
 [0.63035731 0.         0.77630514]]
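
The rows above can be reproduced by hand, which shows what the default settings do: the raw count is used as TF (not divided by document length), it is multiplied by the smoothed IDF ln((1 + N) / (1 + n)) + 1, and each row is then scaled to unit L2 norm. The following is a sketch in pure Python that starts from the count matrix printed earlier; it is not how scikit-learn implements the computation internally.

```python
import math

# Count matrix from the CountVectorizer output (columns: aaa, bbb, ccc).
counts = [[3, 0, 1],
          [2, 0, 0],
          [3, 0, 0],
          [4, 0, 0],
          [3, 2, 0],
          [3, 0, 2]]

n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency and smoothed IDF per column.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

tfidf = []
for row in counts:
    weighted = [c * w for c, w in zip(row, idf)]        # raw count * idf
    norm = math.sqrt(sum(v * v for v in weighted))      # L2 norm of the row
    tfidf.append([v / norm for v in weighted])

for row in tfidf:
    print([round(v, 8) for v in row])
```

Because of the L2 normalization, any document that contains only "aaa" ends up as [1, 0, 0] regardless of how many times the word is repeated.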
