## Introduction to TF-IDF

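`TF-IDF` is a statistical method for evaluating how important a word is to one document within a document collection or corpus. A word's importance increases with the number of times it appears in the document, and decreases with how frequently it appears across the corpus.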


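For personalized reading, an `item` is an article. Following the first step above, we first need to extract attributes from the article content that represent it. A common approach is to represent an article by the words that appear in it, with the weight of each word usually computed using `tf-idf` from information retrieval. For this article, for example, the words "CB", "recommendation" and "preference" would get relatively large weights, while the word "barbecue" would get a low weight. In this way an abstract article can be represented by a concrete vector.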

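The main idea of `TF-IDF` is this: if a word or phrase appears with a high frequency (TF) in one article and rarely appears in other articles, it is considered to discriminate well between categories and to be well suited for classification. `TF-IDF` is simply the product `TF * IDF`.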

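> `Term frequency (TF)` is the number of times a given word appears in a document. This number is usually normalized (typically by dividing the count by the total number of words in the document) to avoid a bias toward long documents. (The same word is likely to have a higher raw count in a long document than in a short one, regardless of how important it is.)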

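> Note, however, that some common words contribute little to the topic, whereas less frequent words are often the ones that express what the article is about, so using TF alone is not appropriate. The weighting must satisfy one requirement: the better a word predicts the topic, the larger its weight, and vice versa. If a word appears in only a few of the articles in the collection, it says a lot about the topic of those articles, so its weight should be large. This is the job of IDF.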


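## `TF-IDF` algorithm steps

**Step 1: compute the term frequency**

Term frequency (TF) = number of times a word appears in the document

Because documents differ in length, the term frequency is normalized so that different documents can be compared.

$\text{TF}=\frac{\text{number of times the word appears in the document}}{\text{total number of words in the document}}$

**Step 2: compute the inverse document frequency**

This step needs a corpus, which models the environment in which the language is used.

$\text{IDF}=\log\left(\frac{\text{total number of documents in the corpus}}{\text{number of documents containing the word}+1}\right)$

The more common a word is, the larger the denominator and the smaller the inverse document frequency, approaching 0. The 1 is added to the denominator to avoid division by zero (which would happen if no document contained the word). log means taking the logarithm of the resulting value.

**Step 3: compute TF-IDF**

$\text{TF-IDF}=\text{TF}\times\text{IDF}$

TF-IDF therefore increases with the number of times a word appears in the document and decreases with how common the word is across the language as a whole. This makes automatic keyword extraction straightforward: compute the TF-IDF value of every word in the document, sort the values in descending order, and take the top few words.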
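
The three steps above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function names `term_frequency`, `inverse_document_frequency` and `top_keywords`, the whitespace tokenization, the toy corpus, and the choice of the natural logarithm are all assumptions made for this example.

```python
import math

def term_frequency(word, doc_tokens):
    """Step 1: occurrences of `word` divided by the document length."""
    return doc_tokens.count(word) / len(doc_tokens)

def inverse_document_frequency(word, corpus_tokens):
    """Step 2: log(N / (df + 1)), using the +1 from the formula above."""
    df = sum(1 for tokens in corpus_tokens if word in tokens)
    return math.log(len(corpus_tokens) / (df + 1))

def top_keywords(doc_index, corpus_tokens, k=3):
    """Step 3: score each distinct word of one document, sort descending, keep k."""
    doc_tokens = corpus_tokens[doc_index]
    scores = {
        w: term_frequency(w, doc_tokens) * inverse_document_frequency(w, corpus_tokens)
        for w in set(doc_tokens)
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Hypothetical toy corpus, tokenized on whitespace (an assumption of this sketch).
corpus = [
    "tf idf weights rare words".split(),
    "common words appear in many documents".split(),
    "many documents share common words".split(),
    "some documents are long".split(),
]
print(top_keywords(0, corpus, k=3))
```

In this toy corpus the word "words" occurs in three of the four documents, so its IDF is log(4/4) = 0 and it never ranks as a keyword, while the words unique to the first document share the top score.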

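## Advantages and disadvantages

The advantages of `TF-IDF` are that it is simple, fast, and easy to understand. The disadvantage is that term frequency alone is not always a complete measure of a word's importance: an important word may not occur very often. The computation also ignores position information and cannot capture a word's importance in its context. To capture contextual structure, you may need an approach such as `word2vec`.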

## sklearn examples

In [5]:
# sklearn computation: example 1
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I come to China to travel",
    "This is a car popular in China",
    "I love tea and Apple ",
    "The work is to write some papers in science"]

vectorizer = CountVectorizer()

# Count the words first, then convert the count matrix to TF-IDF weights
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
# print(vectorizer.fit_transform(corpus))
print(tfidf)

  (0, 16)	0.4424621378947393
  (0, 15)	0.697684463383976
  (0, 4)	0.4424621378947393
  (0, 3)	0.348842231691988
  (1, 14)	0.45338639737285463
  (1, 9)	0.45338639737285463
  (1, 6)	0.3574550433419527
  (1, 5)	0.3574550433419527
  (1, 3)	0.3574550433419527
  (1, 2)	0.45338639737285463
  (2, 12)	0.5
  (2, 7)	0.5
  (2, 1)	0.5
  (2, 0)	0.5
  (3, 18)	0.3565798233381452
  (3, 17)	0.3565798233381452
  (3, 15)	0.2811316284405006
  (3, 13)	0.3565798233381452
  (3, 11)	0.3565798233381452
  (3, 10)	0.3565798233381452
  (3, 8)	0.3565798233381452
  (3, 6)	0.2811316284405006
  (3, 5)	0.2811316284405006


In [0]:
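# sklearn computation: example 2
# TfidfVectorizer combines CountVectorizer and TfidfTransformer in one step
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf2 = TfidfVectorizer()
result = tfidf2.fit_transform(corpus)  # named `result` to avoid shadowing the `re` module
print(result)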

  (0, 16)	0.4424621378947393
  (0, 3)	0.348842231691988
  (0, 15)	0.697684463383976
  (0, 4)	0.4424621378947393
  (1, 5)	0.3574550433419527
  (1, 9)	0.45338639737285463
  (1, 2)	0.45338639737285463
  (1, 6)	0.3574550433419527
  (1, 14)	0.45338639737285463
  (1, 3)	0.3574550433419527
  (2, 1)	0.5
  (2, 0)	0.5
  (2, 12)	0.5
  (2, 7)	0.5
  (3, 10)	0.3565798233381452
  (3, 8)	0.3565798233381452
  (3, 11)	0.3565798233381452
  (3, 18)	0.3565798233381452
  (3, 17)	0.3565798233381452
  (3, 13)	0.3565798233381452
  (3, 5)	0.2811316284405006
  (3, 6)	0.2811316284405006
  (3, 15)	0.2811316284405006
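
The weights printed by both examples are not the plain TF × IDF from the formulas earlier. With its default settings, scikit-learn uses raw term counts as TF, a smoothed IDF of ln((1 + N) / (1 + df)) + 1, and then L2-normalizes each row. The sketch below reproduces this by hand in plain Python. The helper names `tokenize` and `sklearn_style_tfidf` are hypothetical, and the regular expression mirrors the default token pattern (lowercased tokens of two or more word characters), which is why "I" and "a" are dropped.

```python
import math
import re
from collections import Counter

corpus = ["I come to China to travel",
          "This is a car popular in China",
          "I love tea and Apple ",
          "The work is to write some papers in science"]

def tokenize(text):
    # Lowercase and keep tokens of 2+ word characters, like the default analyzer.
    return re.findall(r"\b\w\w+\b", text.lower())

def sklearn_style_tfidf(docs):
    """Hypothetical helper: raw counts * smoothed IDF, then L2 row normalization."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word.
    df = Counter(w for tokens in tokenized for w in set(tokens))
    idf = {w: math.log((1 + n_docs) / (1 + c)) + 1 for w, c in df.items()}
    rows = []
    for tokens in tokenized:
        weights = {w: tf * idf[w] for w, tf in Counter(tokens).items()}
        norm = math.sqrt(sum(v * v for v in weights.values()))
        rows.append({w: v / norm for w, v in weights.items()})
    return rows

rows = sklearn_style_tfidf(corpus)
for word, weight in sorted(rows[0].items()):
    print(word, round(weight, 4))
```

For the first sentence this gives about 0.3488 for "china", 0.4425 for "come" and "travel", and 0.6977 for "to", which agrees with row 0 of the printed output.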


## Example blog post

[How to process textual data using TF-IDF in Python](https://www.freecodecamp.org/news/how-to-process-textual-data-using-tf-idf-in-python-cd2bbc0a94a3/)

## Experiment with sample data


In [0]:
docA = "The cat sat on my face"
docB = "The dog sat on my bed"

In [0]:
# Split each document's sentence into words on spaces
bowA = docA.split(" ")
bowB = docB.split(" ")

In [0]:
print(bowA)
print(bowB)

['The', 'cat', 'sat', 'on', 'my', 'face']
['The', 'dog', 'sat', 'on', 'my', 'bed']


In [0]:
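# wordSet is the vocabulary: every word that appears in any of the texts
# set() creates an unordered collection of unique elements
# union() returns the union of two sets: all elements of both, with duplicates kept only once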
wordSet = set(bowA).union(set(bowB))

In [0]:
print(wordSet)
type(wordSet)

{'sat', 'my', 'bed', 'dog', 'cat', 'on', 'The', 'face'}


set

In [0]:
# dict.fromkeys(keys, value)
# keys:	Required. An iterable specifying the keys of the new dictionary
# value: Optional. The value for all keys. Default value is None
# wordDictA and wordDictB are dictionaries
wordDictA = dict.fromkeys(wordSet, 0) 
wordDictB = dict.fromkeys(wordSet, 0)

In [0]:
print(wordDictA)
print(wordDictB)

{'dog': 0, 'on': 0, 'cat': 0, 'my': 0, 'sat': 0, 'The': 0, 'bed': 0, 'face': 0}
{'dog': 0, 'on': 0, 'cat': 0, 'my': 0, 'sat': 0, 'The': 0, 'bed': 0, 'face': 0}


In [0]:
for word in bowA:
    # wordDictA[word] = wordDictA[word] + 1
    wordDictA[word]+=1
    
for word in bowB:
    wordDictB[word]+=1

In [0]:
print(wordDictA['dog'])
print(wordDictB['dog'])
print(wordDictA)
print(wordDictB)

0
1
{'dog': 0, 'on': 1, 'cat': 1, 'my': 1, 'sat': 1, 'The': 1, 'bed': 0, 'face': 1}
{'dog': 1, 'on': 1, 'cat': 0, 'my': 1, 'sat': 1, 'The': 1, 'bed': 1, 'face': 0}
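
As an alternative to filling the dictionaries by hand, the same counts can be obtained with `collections.Counter` from the standard library. A minimal sketch that reuses the two sample sentences; the names `vocabulary`, `counterA`, `counterB`, `countsA` and `countsB` are introduced here only for illustration.

```python
from collections import Counter

docA = "The cat sat on my face"
docB = "The dog sat on my bed"

bowA = docA.split(" ")
bowB = docB.split(" ")
vocabulary = set(bowA) | set(bowB)

counterA = Counter(bowA)
counterB = Counter(bowB)

# A Counter returns 0 for missing keys, so every vocabulary word gets an entry.
countsA = {word: counterA[word] for word in vocabulary}
countsB = {word: counterB[word] for word in vocabulary}

print(countsA["dog"], countsB["dog"])  # 0 1
```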


In [0]:
import pandas as pd
pd.DataFrame([wordDictA, wordDictB])

|   | dog | on | cat | my | sat | The | bed | face |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 |


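### Compute TF
$\text{TF}=\frac{\text{number of times the word appears in the document}}{\text{total number of words in the document}}$

In [6]:
# Compute TF
def computeTF(wordDict, bow):
    # Store the TF of every vocabulary word for the document `bow` in a dictionary
    tfDict = {}
    bowCount = len(bow)  # total number of words in the document
    # Iterate over every word in the dictionary;
    # dict.items() yields (key, value) pairs
    for word, count in wordDict.items():    # the values counted earlier in wordDictA and wordDictB are the counts
        tfDict[word] = count / float(bowCount)
    return tfDict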


In [7]:
tfBowA = computeTF(wordDictA, bowA)
print("**********")
tfBowB = computeTF(wordDictB, bowB)


In [0]:
tfBowA

{'The': 0.16666666666666666,
 'bed': 0.0,
 'cat': 0.16666666666666666,
 'dog': 0.0,
 'face': 0.16666666666666666,
 'my': 0.16666666666666666,
 'on': 0.16666666666666666,
 'sat': 0.16666666666666666}

In [0]:
tfBowB

{'The': 0.16666666666666666,
 'bed': 0.16666666666666666,
 'cat': 0.0,
 'dog': 0.16666666666666666,
 'face': 0.0,
 'my': 0.16666666666666666,
 'on': 0.16666666666666666,
 'sat': 0.16666666666666666}

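### Compute IDF
$\text{IDF}=\log\left(\frac{\text{total number of documents in the corpus}}{\text{number of documents containing the word}+1}\right)$

Note: the implementation below uses `math.log10` and does not add 1 to the denominator. Every word in the vocabulary occurs in at least one document, so the denominator is never zero.

In [0]:
# Compute IDF
# docList is the list of all documents (one word-count dict per document)
def computeIDF(docList):
    import math
    # Store the IDF values in a dictionary
    idfDict = {}
    N = len(docList)    # total number of documents in the corpus

    # Initialize idfDict so that the values can be filled in later
    # docList[0].keys(): all keys of the first document's dict, i.e. ['dog', 'on', 'cat', 'my', 'sat', 'The', 'bed', 'face']
    # Any document would do: every document's dict already contains the full vocabulary and only the values differ, similar to one-hot
    # dict.fromkeys(docList[0].keys(), 0): set the value of every key to 0, i.e. {'dog': 0, 'on': 0, 'cat': 0, 'my': 0, 'sat': 0, 'The': 0, 'bed': 0, 'face': 0}
    idfDict = dict.fromkeys(docList[0].keys(), 0)
    for doc in docList: # docList = [wordDictA, wordDictB]
        for word, val in doc.items():
            if val > 0:
                # idfDict acts as a counter: for each document, add 1 for every word that occurs at least once,
                # so the value ends up being the number of documents that contain the word
                idfDict[word] += 1

    for word, val in idfDict.items():
        idfDict[word] = math.log10(N / float(val))  # base-10 logarithm, no +1 in the denominator

    return idfDict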

In [0]:
idfs = computeIDF([wordDictA, wordDictB])

In [0]:
idfs

{'The': 0.0,
 'bed': 0.3010299956639812,
 'cat': 0.3010299956639812,
 'dog': 0.3010299956639812,
 'face': 0.3010299956639812,
 'my': 0.0,
 'on': 0.0,
 'sat': 0.0}
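
Note that `computeIDF` above uses `math.log10(N / df)`, without the `+ 1` that appears in the formula. The sketch below compares the two variants on the sample corpus (N = 2). The function names `idf_unsmoothed` and `idf_smoothed` are hypothetical, and base 10 is used for both only so that the numbers are comparable with the output above.

```python
import math

def idf_unsmoothed(n_docs, doc_freq):
    # Variant used by computeIDF above: log10(N / df); requires df >= 1.
    return math.log10(n_docs / doc_freq)

def idf_smoothed(n_docs, doc_freq):
    # Variant from the formula: log10(N / (df + 1)).
    return math.log10(n_docs / (doc_freq + 1))

n_docs = 2
for label, df in [("word in one document (e.g. 'cat')", 1),
                  ("word in both documents (e.g. 'The')", 2)]:
    print(label, idf_unsmoothed(n_docs, df), idf_smoothed(n_docs, df))
```

With only two documents, the `+ 1` variant gives 0 for a word that occurs in one document and a negative value for a word that occurs in both, so it is not very informative on a corpus this small.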

### Compute TF-IDF
$\text{TF-IDF}=\text{TF}\times\text{IDF}$

In [0]:
# Compute TF-IDF
# tfBow holds the TF values computed earlier for the words of one document
def computeTFIDF(tfBow, idfs):
    tfidf = {}
    for word, val in tfBow.items():
        tfidf[word] = val*idfs[word]
    return tfidf

In [0]:
tfidfBowA = computeTFIDF(tfBowA, idfs)
tfidfBowB = computeTFIDF(tfBowB, idfs)

In [0]:
tfidfBowA

{'The': 0.0,
 'bed': 0.0,
 'cat': 0.050171665943996864,
 'dog': 0.0,
 'face': 0.050171665943996864,
 'my': 0.0,
 'on': 0.0,
 'sat': 0.0}

In [0]:
import pandas as pd
pd.DataFrame([tfidfBowA, tfidfBowB])

|   | bed | The | dog | on | my | sat | face | cat |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.050172 | 0.050172 |
| 1 | 0.050172 | 0.0 | 0.050172 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
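
To turn these scores into keywords, sort the words of each document by TF-IDF in descending order and keep the top few, as described in the algorithm steps. A minimal sketch: `top_k` is a hypothetical helper, and the dictionary is copied from the `tfidfBowA` output above.

```python
def top_k(tfidf, k=2):
    """Return the k highest-scoring (word, score) pairs, ties broken alphabetically."""
    return sorted(tfidf.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

tfidfBowA = {'The': 0.0, 'bed': 0.0, 'cat': 0.050171665943996864, 'dog': 0.0,
             'face': 0.050171665943996864, 'my': 0.0, 'on': 0.0, 'sat': 0.0}

print(top_k(tfidfBowA))
# [('cat', 0.050171665943996864), ('face', 0.050171665943996864)]
```

The words shared by both sentences ("The", "sat", "on", "my") score 0, so only the words that distinguish document A remain.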


----

# Experiment with real rules data


## Compute TF-IDF for the first-column title data only

### Data processing

In [0]:
# Mount Google Drive in Colab
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
!ls '/content/drive/My Drive/实验/知识点'

 c_c++_rules.txt
 challenge_answer.csv
 challenge_answer.gsheet
'crawling&data_processing_edu-answer.ipynb'
 c++_rule_opt.xls
 c_rule_opt.xlsx
 c++_rule_opt.xlsx
 c++rules.gsheet
 c_rules.txt
 c_rule.xls
 c++_rule.xls
 c++知识点.gsheet
 c++知识点关系映射.gsheet
 c++知识点关系映射（可自动识别知识点）.gsheet
 delete_duplication_words_out_map.csv
 delete_duplication_words_out_map.gsheet
 delete_duplication_words.py
 extract_keywords.py
 keyword_maps_rules.ipynb
'str_text (1).gdoc'
 str_text.gdoc
 str_text.txt
 temp.csv
 test.csv
 test.xls
 Untitled0.ipynb
 知识点位置序列.gsheet
 简历.gdoc


In [0]:
!pip3 install xlutils

Collecting xlutils
  Downloading https://files.pythonhosted.org/packages/c7/55/e22ac73dbb316cabb5db28bef6c87044a95914f713a6e81b593f8a0d2f79/xlutils-2.0.0-py2.py3-none-any.whl (55kB)
Installing collected packages: xlutils
Successfully installed xlutils-2.0.0


In [8]:
import xlwt
import xlrd
from xlutils.copy import copy
import re

workbook = xlrd.open_workbook(r'/content/drive/My Drive/实验/知识点/c++_rule_opt.xls')
# Get a sheet object by index; indices start at 0
sheet = workbook.sheet_by_index(0)
print(sheet.name)   # name of the sheet
print(sheet.nrows)  # number of rows in the sheet

word_bag = []
word_set = []
for i in range(sheet.nrows):
    # Get the value of a specific cell
    str_cell = sheet.cell(i,0).value
    # Strip double quotes from the start and end of each row's sentence
    str_cell = str_cell.strip('"')
    # Remove double quotes inside each row's sentence
    str_cell = str_cell.replace('"', '')
    # Split each row's sentence into words on spaces
    word_set = str_cell.split(" ")

    #print(word_set)
    word_bag.append(word_set)
print(word_bag)

# word_bag holds the word list of every row; word_bag_union collects every word that appears in any row
# set() creates an unordered collection of unique elements
word_bag_union = []
for i in range(sheet.nrows-1):
    word_bag_union_tmp = set(word_bag[i]).union(set(word_bag[i+1]))
    word_bag_union = set.union(word_bag_union_tmp, word_bag_union)
    # for k,v in word_bag_union_tmp.items():
    #     word_bag_union[k] = v
    # word_bag_union = dict(word_bag_union_tmp.items() + word_bag_union.items())
print(word_bag_union)

for i in range(sheet.nrows):
    # dict.fromkeys(word_bag_union, 0): create a dict with every vocabulary word as a key and 0 as its value
    wordDict = dict.fromkeys(word_bag_union, 0)
    # Dynamically create the variables wordDict_0 ... wordDict_438, each with every key set to 0
    exec('wordDict_{} = {}'.format(i, wordDict))

print(wordDict_0)
print(wordDict_438)



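In [0]:
# Run the previous cell before running this one; otherwise the counts accumulate
# Go through the words of each row in the first column and add 1 to a word's count each time it occurs
word_set_1 = []
names = locals()
for i in range(sheet.nrows):
    # Get the value of a specific cell; (i, 0) is row i, column 1
    str_cell_1 = sheet.cell(i,0).value
    # Strip double quotes from the start and end of each row's sentence in the first column
    str_cell_1 = str_cell_1.strip('"')
    # Remove double quotes inside each row's sentence in the first column
    str_cell_1 = str_cell_1.replace('"', '')
    # Split each row's sentence in the first column into words on spaces
    word_set_1 = str_cell_1.split(" ")
    # Go through the words of the row and add 1 to a word's count each time it occurs
    # names['wordDict_' + str(i)] refers to the dynamically created variable, e.g. wordDict_0, wordDict_1
    # wordDict_i holds the one-hot style count dictionary for row i of the first column
    for word in word_set_1:
        names['wordDict_' + str(i)][word] = names['wordDict_' + str(i)][word] +1

# Check which words in the first row of the first column (variable wordDict_0) have a value >= 1
print({k:v for k, v in wordDict_0.items() if v >= 1})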

{'summary': 1}
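
The dynamically generated variables (`wordDict_0`, `wordDict_1`, ...) work in a notebook, but a plain list of dictionaries is usually easier to index and to pass to other functions. The sketch below shows that alternative on a few made-up rows. It is not the notebook's own approach: `rows` stands in for the first-column cells of the sheet, and `count_words` is a hypothetical helper.

```python
def count_words(rows):
    """Build one word-count dict per row over a shared vocabulary."""
    # Same cleaning as above: strip and remove double quotes, split on spaces.
    tokenized = [row.strip('"').replace('"', '').split(" ") for row in rows]
    vocabulary = set().union(*tokenized)
    word_dicts = []
    for tokens in tokenized:
        counts = dict.fromkeys(vocabulary, 0)
        for word in tokens:
            counts[word] += 1
        word_dicts.append(counts)
    return word_dicts

# Made-up stand-ins for the first-column cells of the sheet.
rows = ['summary',
        '"RAII objects should not be temporary"',
        'macros should not be redefined']
word_dicts = count_words(rows)
print({k: v for k, v in word_dicts[0].items() if v >= 1})  # {'summary': 1}
```

With this layout, `word_dicts[i]` plays the role of `wordDict_i`, and the whole list can be passed directly to a function such as `computeIDF`.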


In [0]:
import pandas as pd
pd.DataFrame([wordDict_0, wordDict_1, wordDict_2])

Unnamed: 0,Unnamed: 1,Code,&&,Relational,conversions,integer,managed,<ctime>,"labels',",conditional,"structures,","handlers',",loop-counter,"out,",comments,C++,namespace,"temporary,",Moved-from,initialization,Comment,"statement,","declarations',","self-assigned,","operator',",Member,//,shall,parameter,headers,on,blocks,files,some,User-defined,"if',",Assignment,sides,"clause,",using,...,macro,"null',",definition,function,Flexible,matching,Overriding,statement,"namespaces,",#define'd,mktemp,"cases,",not,"appropriate,",comply,"whitespaces,",setjmp,constructors,corresponding,line,"handler',",consist,"immediately,",....1,at,<filename>,"parameters,",always,forms,Accessible,occurrence,grouped,logic,operators,regular,order,Access,using-declarations,Unions,%s
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


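### Compute TF
$\text{TF}=\frac{\text{number of times the word appears in the document}}{\text{total number of words in the document}}$


In [0]:
# Compute TF
def computeTF(wordDict, row_word):
    # Store the TF of every vocabulary word for the row `row_word` in a dictionary
    tfDict = {}
    wordCount = len(row_word)  # total number of words in the row
    # Iterate over every word in the dictionary;
    # dict.items() yields (key, value) pairs
    for word, count in wordDict.items():    # the values counted earlier in the wordDict_i dictionaries are the counts
        tfDict[word] = count/float(wordCount)
    return tfDict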

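In [0]:
names = locals()

for i in range(sheet.nrows):
    # Get the value of a specific cell; (i, 0) is row i, column 1
    str_cell_2 = sheet.cell(i,0).value
    # Strip double quotes from the start and end of each row's sentence in the first column
    str_cell_2 = str_cell_2.strip('"')
    # Remove double quotes inside each row's sentence in the first column
    str_cell_2 = str_cell_2.replace('"', '')
    # Split each row's sentence in the first column into words on spaces
    word_set_2 = str_cell_2.split(" ")
    # names['tf_' + str(i)] creates a variable dynamically, e.g. tf_0, tf_1
    names['tf_' + str(i)] = computeTF(names['wordDict_' + str(i)], word_set_2)

print(tf_0)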





In [0]:
tf_1

{'': 0.0,
 'Code': 0.0,
 '&&': 0.0,
 'Relational': 0.0,
 'conversions': 0.0,
 'integer': 0.0,
 'managed': 0.0,
 '<ctime>': 0.0,
 "labels',": 0.0,
 'conditional': 0.0,
 'structures,': 0.0,
 "handlers',": 0.0,
 'loop-counter': 0.0,
 'out,': 0.0,
 'comments': 0.0,
 'C++': 0.0,
 'namespace': 0.0,
 'temporary,': 0.16666666666666666,
 'Moved-from': 0.0,
 'initialization': 0.0,
 'Comment': 0.0,
 'statement,': 0.0,
 "declarations',": 0.0,
 'self-assigned,': 0.0,
 "operator',": 0.0,
 'Member': 0.0,
 '//': 0.0,
 'shall': 0.0,
 'parameter': 0.0,
 'headers': 0.0,
 'on': 0.0,
 'blocks': 0.0,
 'files': 0.0,
 'some': 0.0,
 'User-defined': 0.0,
 "if',": 0.0,
 'Assignment': 0.0,
 'sides': 0.0,
 'clause,': 0.0,
 'using': 0.0,
 'Return': 0.0,
 'operands,': 0.0,
 'list,': 0.0,
 'located': 0.0,
 'Scoped': 0.0,
 'methods,': 0.0,
 'stated': 0.0,
 'global': 0.0,
 'const': 0.0,
 'transferred': 0.0,
 'specialized,': 0.0,
 'extern': 0.0,
 'Types': 0.0,
 'naming': 0.0,
 'sizeof': 0.0,
 'Raw': 0.0,
 'language': 0.

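### Compute IDF
$\text{IDF}=\log\left(\frac{\text{total number of documents in the corpus}}{\text{number of documents containing the word}+1}\right)$

Note: the implementation below uses `math.log10` and does not add 1 to the denominator. Every word in the vocabulary occurs in at least one row, so the denominator is never zero.

In [0]:
# Compute IDF
# docList is the list of all documents (one word-count dict per row)
def computeIDF(docList):
    import math
    # Store the IDF values in a dictionary
    idfDict = {}
    N = len(docList)    # total number of documents in the corpus

    # Initialize idfDict so that the values can be filled in later
    # docList[0].keys(): all keys of the first document's dict, i.e. the full vocabulary
    # Any document would do: every document's dict already contains the full vocabulary and only the values differ, similar to one-hot
    # dict.fromkeys(docList[0].keys(), 0): set the value of every key to 0
    idfDict = dict.fromkeys(docList[0].keys(), 0)
    for doc in docList: # docList = [wordDict_0, wordDict_1, ...]
        for word, val in doc.items():
            if val > 0:
                # idfDict acts as a counter: for each document, add 1 for every word that occurs at least once,
                # so the value ends up being the number of documents that contain the word
                idfDict[word] += 1

    for word, val in idfDict.items():
        idfDict[word] = math.log10(N / float(val))  # base-10 logarithm, no +1 in the denominator

    return idfDict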

In [0]:
# First collect the one-hot style count dictionaries of the first column, row by row, into a list
# Then call computeIDF to compute the IDF values
names = locals()
wordDict_list = []

for i in range(sheet.nrows):
    wordDict_list.append(names['wordDict_' + str(i)])
print(wordDict_list[0])



In [0]:
idfs = computeIDF(wordDict_list)
idfs

{'': 0.9990118437559339,
 'Code': 2.6424645202421213,
 '&&': 2.165343265522459,
 'Relational': 2.6424645202421213,
 'conversions': 2.6424645202421213,
 'integer': 2.6424645202421213,
 'managed': 2.6424645202421213,
 '<ctime>': 2.6424645202421213,
 "labels',": 2.6424645202421213,
 'conditional': 2.040404528914159,
 'structures,': 2.6424645202421213,
 "handlers',": 2.6424645202421213,
 'loop-counter': 1.8643132698584777,
 'out,': 2.6424645202421213,
 'comments': 2.040404528914159,
 'C++': 2.34143452457814,
 'namespace': 2.165343265522459,
 'temporary,': 2.6424645202421213,
 'Moved-from': 2.6424645202421213,
 'initialization': 2.34143452457814,
 'Comment': 2.6424645202421213,
 'statement,': 2.34143452457814,
 "declarations',": 2.34143452457814,
 'self-assigned,': 2.6424645202421213,
 "operator',": 2.34143452457814,
 'Member': 2.165343265522459,
 '//': 2.165343265522459,
 'shall': 1.195306488899902,
 'parameter': 2.165343265522459,
 'headers': 2.6424645202421213,
 'on': 1.6882220108027965,

### Compute TF-IDF
$\text{TF-IDF}=\text{TF}\times\text{IDF}$

In [0]:
# Compute TF-IDF
# tf holds the TF values computed earlier for the words of one document
def computeTFIDF(tf, idfs):
    tfidf = {}
    for word, val in tf.items():
        tfidf[word] = val*idfs[word]
    return tfidf

In [0]:
names = locals()
for i in range(sheet.nrows):
    # tfidf_0 = computeTFIDF(tf_0, idfs)
    names['tfidf_' + str(i)] = computeTFIDF(names['tf_' + str(i)], idfs)
    
print(tfidf_0)
print(tfidf_1)
# Check which words in the first row (tfidf_0) have a TF-IDF value >= 1,
# and which words in the next three rows have a value > 0
print({k:v for k, v in tfidf_0.items() if v >= 1})
print({k:v for k, v in tfidf_1.items() if v > 0})
print({k:v for k, v in tfidf_2.items() if v > 0})
print({k:v for k, v in tfidf_3.items() if v > 0})
print({k:v for k, v in tfidf_3.items() if v > 0})

{'summary': 2.6424645202421213}
{'temporary,': 0.44041075337368685, 'should': 0.010637551712341504, 'be': 0.025183804401308105, 'RAII': 0.44041075337368685, 'objects': 0.39023908742969, 'not': 0.041921568856457044}
{'should': 0.005802300934004458, 'without': 0.2128576840525582, 'of': 0.07857393362349797, 'macros': 0.18549132081037809, 'arguments,': 0.18549132081037809, 'all': 0.17668131962782752, 'be': 0.013736620582531694, 'invoked': 0.2402240472947383, 'their': 0.196849387774769, 'Function-like': 0.2128576840525582, 'not': 0.022866310285340207}
{'pthread_mutex_t': 0.196849387774769, 'should': 0.005802300934004458, 'the': 0.08499039492219865, 'in': 0.09081925852326672, "locked',": 0.2402240472947383, 'they': 0.18549132081037809, 'reverse': 0.2402240472947383, 'be': 0.013736620582531694, 'were': 0.2402240472947383, 'unlocked': 0.2128576840525582, 'order': 0.18549132081037809}


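## Conclusion
Using only the title data in the first column, the computed keywords are not distinctive, probably because the titles contain too few words. The next step is to add the description data.

## Compute TF-IDF for the first-column title data and the eighth-column description data

### Data processing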

In [0]:
# Mount Google Drive in Colab
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [0]:
!ls '/content/drive/My Drive/实验/知识点'

 c_c++_rules.txt
 challenge_answer.csv
 challenge_answer.gsheet
'Copy of c++_rule_opt.gsheet'
'Copy of c++_rule_opt.xlsx'
'crawling&data_processing_edu-answer.ipynb'
 c++_rule_opt.xls
 c_rule_opt.xlsx
 c++_rule_opt.xlsx
 c++rules.gsheet
 c_rules.txt
 c_rule.xls
 c++_rule.xls
 c++知识点.gsheet
 c++知识点关系映射.gsheet
 c++知识点关系映射（可自动识别知识点）.gsheet
 delete_duplication_words_out_map.csv
 delete_duplication_words_out_map.gsheet
 delete_duplication_words.py
 extract_keywords.py
 keyword_maps_rules.ipynb
'str_text (1).gdoc'
 str_text.gdoc
 str_text.txt
 temp.csv
 test.csv
 test.xls
 Untitled0.ipynb
 知识点位置序列.gsheet
 简历.gdoc


In [0]:
!pip3 install xlutils

Collecting xlutils
  Downloading https://files.pythonhosted.org/packages/c7/55/e22ac73dbb316cabb5db28bef6c87044a95914f713a6e81b593f8a0d2f79/xlutils-2.0.0-py2.py3-none-any.whl (55kB)
Installing collected packages: xlutils
Successfully installed xlutils-2.0.0


In [0]:
import xlwt
import xlrd
from xlutils.copy import copy
import re

workbook = xlrd.open_workbook(r'/content/drive/My Drive/实验/知识点/c++_rule_opt.xls')
# Get a sheet object by index; indices start at 0
sheet = workbook.sheet_by_index(0)
print(sheet.name)   # name of the sheet
print(sheet.nrows)  # number of rows in the sheet

word_bag = []
word_set = []
word_set_title = []
word_set_description = []
string = []
for i in range(sheet.nrows):
    # Get the values of specific cells: column 1 (title) and column 8 (description)
    str_cell_title = sheet.cell(i,0).value
    str_cell_description = sheet.cell(i,7).value
    # Strip double quotes, single quotes and commas from the start and end of each sentence
    str_cell_title = str_cell_title.strip('"')
    str_cell_description = str_cell_description.strip('"')
    str_cell_title = str_cell_title.strip("'")
    str_cell_description = str_cell_description.strip("'")
    str_cell_title = str_cell_title.strip(',')
    str_cell_description = str_cell_description.strip(',')
    # Replace double and single quotes inside each sentence with spaces
    str_cell_title = str_cell_title.replace('"', ' ')
    str_cell_description = str_cell_description.replace('"', ' ')
    str_cell_title = str_cell_title.replace("'", ' ')
    str_cell_description = str_cell_description.replace("'", ' ')
    # In the eighth column, replace <p>,</p>,<em>,</em>,<li>,</li>,<code>,</code>,<h2>,</h2>,<ul>,</ul>,
    # <pre>,</pre> and \n with spaces
    string = ['<p>','</p>','<em>','</em>','<li>','</li>','<code>','</code>',
              '<h2>','</h2>','<ul>','</ul>','<pre>','</pre>','\n','\\n']
    for j in range(len(string)):
        str_cell_description = str_cell_description.replace(string[j], ' ')
    # Split each row's sentences into words on spaces
    word_set_title = str_cell_title.split(" ")
    word_set_description = str_cell_description.split(" ")
    # Concatenate the two lists
    word_set = word_set_title + word_set_description
    word_bag.append(word_set)
print(word_bag)

C++_rules_opt
439
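
The chain of `strip` and `replace` calls above can also be written with regular expressions. The sketch below is one possible variant, not the notebook's method: `clean_text` is a hypothetical helper, it removes any HTML-like tag rather than only the listed ones, and it splits on runs of whitespace so that empty-string tokens do not end up in the vocabulary. One caveat is that tokens written like header names, such as `<ctime>`, would be removed as well.

```python
import re

def clean_text(text):
    """Hypothetical helper: drop tags, quotes and literal backslash-n, then tokenize."""
    text = re.sub(r"<[^>]+>", " ", text)   # any HTML-like tag, e.g. <p> or </code>
    text = text.replace("\\n", " ")        # literal backslash-n sequences
    text = re.sub(r"[\"']", " ", text)     # double and single quotes
    return text.split()                    # split on whitespace, so no empty tokens

sample = '<p>Use <code>std::lock_guard</code> instead of "manual" locking.</p>\\n'
print(clean_text(sample))
# ['Use', 'std::lock_guard', 'instead', 'of', 'manual', 'locking.']
```

Splitting with `split()` rather than `split(" ")` matters here: consecutive spaces left behind by the replacements otherwise produce `''` tokens, which then receive counts and TF-IDF values like any other word.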


In [0]:
# word_bag holds the word list of every row; word_bag_union collects every word that appears in any row
# set() creates an unordered collection of unique elements
word_bag_union = []
for i in range(sheet.nrows-1):
    word_bag_union_tmp = set(word_bag[i]).union(set(word_bag[i+1]))
    word_bag_union = set.union(word_bag_union_tmp, word_bag_union)
    # for k,v in word_bag_union_tmp.items():
    #     word_bag_union[k] = v
    # word_bag_union = dict(word_bag_union_tmp.items() + word_bag_union.items())
print(word_bag_union)

for i in range(sheet.nrows):
    # dict.fromkeys(word_bag_union, 0): create a dict with every vocabulary word as a key and 0 as its value
    wordDict = dict.fromkeys(word_bag_union, 0)
    # Dynamically create the variables wordDict_0 ... wordDict_438, each with every key set to 0
    exec('wordDict_{} = {}'.format(i, wordDict))

print(wordDict_0)
print(wordDict_438)



In [0]:
# Run the previous cell before running this one; otherwise the counts accumulate
# Go through the words of each row (title and description) and add 1 to a word's count each time it occurs
word_set_title_1 = []
word_set_description_1 = []
names = locals()
for i in range(sheet.nrows):
    # Get the values of specific cells; (i, 0) is row i, column 1 and (i, 7) is row i, column 8
    str_cell_title_1 = sheet.cell(i,0).value
    str_cell_description_1 = sheet.cell(i,7).value
    # Strip double quotes, single quotes and commas from the start and end of each sentence
    str_cell_title_1 = str_cell_title_1.strip('"')
    str_cell_description_1 = str_cell_description_1.strip('"')
    str_cell_title_1 = str_cell_title_1.strip("'")
    str_cell_description_1 = str_cell_description_1.strip("'")
    str_cell_title_1 = str_cell_title_1.strip(',')
    str_cell_description_1 = str_cell_description_1.strip(',')
    # Replace double and single quotes inside each sentence with spaces
    str_cell_title_1 = str_cell_title_1.replace('"', ' ')
    str_cell_description_1 = str_cell_description_1.replace('"', ' ')
    str_cell_title_1 = str_cell_title_1.replace("'", ' ')
    str_cell_description_1 = str_cell_description_1.replace("'", ' ')
    # In the eighth column, replace <p>,</p>,<em>,</em>,<li>,</li>,<code>,</code>,<h2>,</h2>,<ul>,</ul>,
    # <pre>,</pre> and \n with spaces
    string_1 = ['<p>','</p>','<em>','</em>','<li>','</li>','<code>','</code>',
              '<h2>','</h2>','<ul>','</ul>','<pre>','</pre>','\n','\\n']
    for j in range(len(string_1)):
        # Use string_1 here (the original indexed the list `string` from an earlier cell)
        str_cell_description_1 = str_cell_description_1.replace(string_1[j], ' ')
    # Split each row's sentences into words on spaces
    word_set_title_1 = str_cell_title_1.split(" ")
    word_set_description_1 = str_cell_description_1.split(" ")
    # Concatenate the two lists
    word_set_1 = word_set_title_1 + word_set_description_1
    # Go through the words of the row and add 1 to a word's count each time it occurs
    # names['wordDict_' + str(i)] refers to the dynamically created variable, e.g. wordDict_0, wordDict_1
    # wordDict_i holds the one-hot style count dictionary for row i
    for word in word_set_1:
        names['wordDict_' + str(i)][word] = names['wordDict_' + str(i)][word] +1

# Check which words in the first row (variable wordDict_0) have a value >= 1
print({k:v for k, v in wordDict_0.items() if v >= 1})


{'summary': 1, 'description': 1}


In [0]:
import pandas as pd
pd.DataFrame([wordDict_0, wordDict_1, wordDict_2])

Unnamed: 0,Unnamed: 1,which,std::make_unique&lt;Person&gt;(,simplifies,*memccpy(void,method(),Green,generally,"reference,",10);,>C++,std::auto_ptr&lt;Shape&gt;,chain,accessible.,goes,"memcpy(&amp;dest,",declares,AirPlane,"x(i),",non-empty,sets,becoming,sprintf(),rewritten,indicates,0xFA,if(ARR!=nullptr),Method,"&gt;,",atol,https://www.securecoding.cert.org/confluence/x/RAE,invalidated,explanations,v(...);,below.,19-3-1,"cont.substring(pos1,",valid,Non-zero,sequences,...,effects.,unwary,g(shared_ptr&lt;S&gt;(new,"{APPLE,",so,business,rule;,MSC04-C.</a>,depends,"conditions,","Today,",logical,permutation.size());,"2004,",precedence,<ctime>,_fixed,hope,confuse,side-effects.,delete[],[](FILE*file){return,"nowhere,",inside,(Person*)malloc(sizeof(Person));,keystrokes.,comma,sequence.,(E.G.,maintenance.,"changes,",(void,Unevaluated,inner,^[a-z][a-zA-Z0-9]*$,(const-qualified,classic,"clumsy,",hopelessly,"CFront,"
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,45,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,23,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


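### Compute TF
$\text{TF}=\frac{\text{number of times the word appears in the document}}{\text{total number of words in the document}}$

In [0]:
# Compute TF
def computeTF(wordDict, row_word):
    # Store the TF of every vocabulary word for the row `row_word` in a dictionary
    tfDict = {}
    wordCount = len(row_word)  # total number of words in the row
    # Iterate over every word in the dictionary;
    # dict.items() yields (key, value) pairs
    for word, count in wordDict.items():    # the values counted earlier in the wordDict_i dictionaries are the counts
        tfDict[word] = count/float(wordCount)
    return tfDict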

In [0]:
names = locals()

for i in range(sheet.nrows):
    # Get the values of specific cells; (i, 0) is row i, column 1 and (i, 7) is row i, column 8
    str_cell_title_2 = sheet.cell(i,0).value
    str_cell_description_2 = sheet.cell(i,7).value
    # Strip double quotes, single quotes and commas from the start and end of each sentence
    str_cell_title_2 = str_cell_title_2.strip('"')
    str_cell_description_2 = str_cell_description_2.strip('"')
    str_cell_title_2 = str_cell_title_2.strip("'")
    str_cell_description_2 = str_cell_description_2.strip("'")
    str_cell_title_2 = str_cell_title_2.strip(',')
    str_cell_description_2 = str_cell_description_2.strip(',')
    # Replace double and single quotes inside each sentence with spaces
    str_cell_title_2 = str_cell_title_2.replace('"', ' ')
    str_cell_description_2 = str_cell_description_2.replace('"', ' ')
    str_cell_title_2 = str_cell_title_2.replace("'", ' ')
    str_cell_description_2 = str_cell_description_2.replace("'", ' ')
    # In the eighth column, replace <p>,</p>,<em>,</em>,<li>,</li>,<code>,</code>,<h2>,</h2>,<ul>,</ul>,
    # <pre>,</pre> and \n with spaces
    string_1 = ['<p>','</p>','<em>','</em>','<li>','</li>','<code>','</code>',
              '<h2>','</h2>','<ul>','</ul>','<pre>','</pre>','\n','\\n']
    for j in range(len(string_1)):
        # Use string_1 here (the original indexed the list `string` from an earlier cell)
        str_cell_description_2 = str_cell_description_2.replace(string_1[j], ' ')
    # Split each row's sentences into words on spaces
    word_set_title_2 = str_cell_title_2.split(" ")
    word_set_description_2 = str_cell_description_2.split(" ")
    # Concatenate the two lists
    word_set_2 = word_set_title_2 + word_set_description_2
    # names['tf_' + str(i)] creates a variable dynamically, e.g. tf_0, tf_1
    names['tf_' + str(i)] = computeTF(names['wordDict_' + str(i)], word_set_2)

print(tf_0)



In [0]:
tf_1

{'': 0.23809523809523808,
 'which': 0.0,
 'std::make_unique&lt;Person&gt;(': 0.0,
 'simplifies': 0.0,
 '*memccpy(void': 0.0,
 'method()': 0.0,
 'Green': 0.0,
 'generally': 0.0,
 'reference,': 0.0,
 '10);': 0.0,
 '>C++': 0.005291005291005291,
 'std::auto_ptr&lt;Shape&gt;': 0.0,
 'chain': 0.0,
 'accessible.': 0.0,
 'goes': 0.0,
 'memcpy(&amp;dest,': 0.0,
 'declares': 0.0,
 'AirPlane': 0.0,
 'x(i),': 0.0,
 'non-empty': 0.0,
 'sets': 0.0,
 'becoming': 0.0,
 'sprintf()': 0.0,
 'rewritten': 0.0,
 'indicates': 0.0,
 '0xFA': 0.0,
 'if(ARR!=nullptr)': 0.0,
 'Method': 0.0,
 '&gt;,': 0.0,
 'atol': 0.0,
 'https://www.securecoding.cert.org/confluence/x/RAE': 0.0,
 'invalidated': 0.0,
 'explanations': 0.0,
 'v(...);': 0.0,
 'below.': 0.0,
 '19-3-1': 0.0,
 'cont.substring(pos1,': 0.0,
 'valid': 0.0,
 'Non-zero': 0.0,
 'sequences': 0.0,
 '2.5.': 0.0,
 'Modern': 0.0,
 'Enumerations': 0.0,
 'fclose': 0.0,
 '-1;': 0.0,
 'resources:': 0.0,
 'addressed': 0.0,
 '<td>getwd</td>': 0.0,
 'reaches': 0.0,
 'free

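### Compute IDF
$\text{IDF}=\log\left(\frac{\text{total number of documents in the corpus}}{\text{number of documents containing the word}+1}\right)$

Note: the implementation below uses `math.log10` and does not add 1 to the denominator. Every word in the vocabulary occurs in at least one row, so the denominator is never zero.

In [0]:
# Compute IDF
# docList is the list of all documents (one word-count dict per row)
def computeIDF(docList):
    import math
    # Store the IDF values in a dictionary
    idfDict = {}
    N = len(docList)    # total number of documents in the corpus

    # Initialize idfDict so that the values can be filled in later
    # docList[0].keys(): all keys of the first document's dict, i.e. the full vocabulary
    # Any document would do: every document's dict already contains the full vocabulary and only the values differ, similar to one-hot
    # dict.fromkeys(docList[0].keys(), 0): set the value of every key to 0
    idfDict = dict.fromkeys(docList[0].keys(), 0)
    for doc in docList: # docList = [wordDict_0, wordDict_1, ...]
        for word, val in doc.items():
            if val > 0:
                # idfDict acts as a counter: for each document, add 1 for every word that occurs at least once,
                # so the value ends up being the number of documents that contain the word
                idfDict[word] += 1

    for word, val in idfDict.items():
        idfDict[word] = math.log10(N / float(val))  # base-10 logarithm, no +1 in the denominator

    return idfDict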

In [0]:
# First collect the one-hot style count dictionaries (title plus description), row by row, into a list
# Then call computeIDF to compute the IDF values
names = locals()
wordDict_list = []

for i in range(sheet.nrows):
    wordDict_list.append(names['wordDict_' + str(i)])
print(wordDict_list[0])



In [0]:
idfs = computeIDF(wordDict_list)
idfs

{'': 0.000990409738021873,
 'which': 0.6786766928965661,
 'std::make_unique&lt;Person&gt;(': 2.6424645202421213,
 'simplifies': 2.6424645202421213,
 '*memccpy(void': 2.6424645202421213,
 'method()': 2.6424645202421213,
 'Green': 2.34143452457814,
 'generally': 2.040404528914159,
 'reference,': 2.040404528914159,
 '10);': 2.34143452457814,
 '>C++': 0.8716125085999771,
 'std::auto_ptr&lt;Shape&gt;': 2.6424645202421213,
 'chain': 1.9434945159061026,
 'accessible.': 2.6424645202421213,
 'goes': 1.9434945159061026,
 'memcpy(&amp;dest,': 2.6424645202421213,
 'declares': 2.34143452457814,
 'AirPlane': 2.6424645202421213,
 'x(i),': 2.6424645202421213,
 'non-empty': 2.34143452457814,
 'sets': 2.34143452457814,
 'becoming': 2.6424645202421213,
 'sprintf()': 2.6424645202421213,
 'rewritten': 2.6424645202421213,
 'indicates': 1.6424645202421213,
 '0xFA': 2.6424645202421213,
 'if(ARR!=nullptr)': 2.6424645202421213,
 'Method': 2.6424645202421213,
 '&gt;,': 2.34143452457814,
 'atol': 2.64246452024212

### Computing TF-IDF
$\text{TF-IDF} = \text{term frequency (TF)} \times \text{inverse document frequency (IDF)}$

In [0]:
# Compute TF-IDF
# tf holds the TF value of every word in one document, computed earlier
def computeTFIDF(tf, idfs):
    tfidf = {}
    for word, val in tf.items():
        tfidf[word] = val*idfs[word]
    return tfidf
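
As a rough illustration of how the product is used, the sketch below multiplies hypothetical TF and IDF values and then ranks the words in descending order, which is the usual way to pick keywords. `compute_tfidf`, `tf_a` and `toy_idfs` are made-up names and numbers, not values from the notebook.

```python
import math

def compute_tfidf(tf, idfs):
    """Multiply each word's term frequency by its inverse document frequency."""
    return {word: val * idfs[word] for word, val in tf.items()}

# Hypothetical inputs: a 6-word document in a 2-document corpus.
tf_a = {'cat': 1 / 6, 'The': 1 / 6, 'dog': 0.0}
toy_idfs = {'cat': math.log10(2), 'The': 0.0, 'dog': math.log10(2)}

tfidf_a = compute_tfidf(tf_a, toy_idfs)

# Rank words by TF-IDF, highest first, and keep the top candidate keywords.
ranked = sorted(tfidf_a.items(), key=lambda kv: kv[1], reverse=True)
print(ranked[:2])
```

Here 'The' scores 0 because it occurs in every document, and 'dog' scores 0 because it does not occur in this document at all, so only 'cat' stands out.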

In [0]:
names = locals()
for i in range(sheet.nrows):
    # tfidf_0 = computeTFIDF(tf_0, idfs)
    names['tfidf_' + str(i)] = computeTFIDF(names['tf_' + str(i)], idfs)
    
print(tfidf_0)
print(tfidf_1)
# Check which words in the first row of the first column (i.e. variable wordDict_0) have a TF-IDF value >= 1
print({k:v for k, v in tfidf_0.items() if v >= 1})
print({k:v for k, v in tfidf_1.items() if v > 0.01})
print({k:v for k, v in tfidf_2.items() if v > 0.01})
print({k:v for k, v in tfidf_3.items() if v > 0.01})

{'summary': 1.3212322601210607, 'description': 1.17071726228907}
{'unlocked': 0.010283039766698955, 'RAII': 0.02959227412473774, 'to)': 0.01398129375789482, 'scoped_lock': 0.01398129375789482, 'chances': 0.01398129375789482, 'corruption.': 0.01398129375789482, 'ES.84</a>': 0.01398129375789482, 'temporary,': 0.01398129375789482, 'acquired': 0.012388542458085397, 'resource': 0.0493204568745629, 'controls': 0.01398129375789482, 'compliant.': 0.012388542458085397, 'scoped_lock{myMutex};': 0.01398129375789482, 'lifetime': 0.03238737347482792, 'f()': 0.014195074334160212, 'use,': 0.010795791158275971, 'created,': 0.01398129375789482, 'object:': 0.01398129375789482, 'objects.': 0.011456842674722006, 'https://github.com/isocpp/CppCoreGuidelines/blob/036324/CppCoreGuidelines.md#es84-dont-try-to-declare-a-local-variable-with-no-name': 0.01398129375789482, 'temporaries': 0.01398129375789482, 'mutex': 0.030849119300096865, 'locked': 0.010283039766698955, '(try': 0.01398129375789482, 'associates': 

----

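## The original plan was to use TF-IDF to measure how important each knowledge point is within a rule, and from that decide which knowledge point the rule belongs to. It turns out, however, that many rules never mention any knowledge point

The plan now is to use the similarity between two pieces of text to decide which knowledge point each rule belongs to.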
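
The notebook does not say which similarity measure is intended. One common choice is the cosine similarity between TF-IDF vectors, sketched below under that assumption. The function `cosine_similarity`, the vectors `rule_vec` and `topic_vecs`, and all the weights are hypothetical and serve only as an illustration.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse vectors stored as {word: weight} dicts."""
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[w] * vec_b[w] for w in shared)
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty or all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical TF-IDF weights for one rule and two knowledge-point descriptions.
rule_vec = {'mutex': 0.03, 'lock': 0.02, 'resource': 0.05}
topic_vecs = {
    'concurrency': {'mutex': 0.04, 'lock': 0.03, 'thread': 0.05},
    'memory': {'resource': 0.02, 'pointer': 0.06, 'free': 0.04},
}

scores = {topic: cosine_similarity(rule_vec, vec) for topic, vec in topic_vecs.items()}
best_topic = max(scores, key=scores.get)  # knowledge point with the highest similarity
print(scores, best_topic)
```

Because the comparison is made on whole vectors, a rule can be matched to a knowledge point even when the name of the knowledge point never appears in the rule text, provided the two share other weighted terms.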

In [0]:
wordset = {"a","b"}
print(wordset)

{'a', 'b'}


In [0]:
!pip install google_images_download 



In [0]:
from google_images_download import google_images_download   #importing the library

response = google_images_download.googleimagesdownload()   #class instantiation

arguments = {"keywords":"black bear,grizzly bear,teddy bear","limit":100,"print_urls":True}   #creating list of arguments
paths = response.download(arguments)   #passing the arguments to the function
print(paths)


Item no.: 1 --> Item name = black bear
Evaluating...
Starting Download...


Unfortunately all 100 could not be downloaded because some images were not downloadable. 0 is all we got for this search filter!

Errors: 0


Item no.: 2 --> Item name = grizzly bear
Evaluating...
Starting Download...


Unfortunately all 100 could not be downloaded because some images were not downloadable. 0 is all we got for this search filter!

Errors: 0


Item no.: 3 --> Item name = teddy bear
Evaluating...
Starting Download...


Unfortunately all 100 could not be downloaded because some images were not downloadable. 0 is all we got for this search filter!

Errors: 0

({'black bear': [], 'grizzly bear': [], 'teddy bear': []}, 0)


We are not convinced that our means for answering RQ3 is
fair, although we do not yet have a better procedure.
Restricting the population to only those users who answer
questions, measuring scores and normalizing by the number of
people answering questions all reduce the amount of support
available as evidence for use of new technologies by older
programmers, and weaker statistical measures may suit for
establishing use, e.g. simple counts of the number of different
ages of users asking and answering questions about a given
technology/tag. 