## The jieba third-party library

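- jieba uses a Chinese lexicon to estimate how likely adjacent characters are to belong to the same word.
- Character sequences with a high probability are grouped into words, and this determines the segmentation result.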
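
The idea in the two points above can be illustrated with a toy dictionary-based segmenter. This is only a minimal sketch, not jieba's implementation: the `FREQ` table and its counts are made up for illustration, and `segment` is a hypothetical helper that picks the split with the highest product of word probabilities by dynamic programming.

```python
import math

# Hypothetical toy lexicon: word -> frequency (made-up numbers, not jieba's dictionary).
FREQ = {"中国": 50, "国是": 2, "是": 80, "一个": 40, "一": 30, "个": 30,
        "伟大": 20, "的": 100, "国家": 45, "中": 10, "国": 10, "家": 10,
        "伟": 1, "大": 15}
TOTAL = sum(FREQ.values())

def segment(sentence, max_len=4):
    """Return the split of `sentence` with the highest total log-probability."""
    n = len(sentence)
    # best[i] = (log-probability of the best split of sentence[i:], end of its first word)
    best = [None] * (n + 1)
    best[n] = (0.0, n)
    for i in range(n - 1, -1, -1):
        candidates = []
        for j in range(i + 1, min(n, i + max_len) + 1):
            word = sentence[i:j]
            freq = FREQ.get(word)
            if freq is None:
                if j - i > 1:
                    continue  # unknown multi-character strings are not words
                freq = 1      # unknown single characters get a small count
            candidates.append((math.log(freq / TOTAL) + best[j][0], j))
        best[i] = max(candidates)
    # Follow the stored end indices to rebuild the word list.
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(sentence[i:j])
        i = j
    return words

print(segment("中国是一个伟大的国家"))
```

With this toy table, "中国" + "是" beats "中" + "国是" because both of its words are far more frequent, which is the sense in which high-probability character groups become words.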

- Install from the command prompt (cmd): pip install jieba  

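- Precise mode (`lcut`): splits the text exactly, with no redundant words (the most commonly used mode)  
- Full mode: scans for every possible word, so the result contains redundant words  
- Search-engine mode: starts from the precise-mode result and additionally splits long words (useful in specific settings)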

In [1]:
import jieba

In [2]:
# Precise mode
ls = '中国是一个伟大的国家'
jieba.lcut(ls)

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\admin\AppData\Local\Temp\jieba.cache
Loading model cost 0.847 seconds.
Prefix dict has been built succesfully.


['中国', '是', '一个', '伟大', '的', '国家']

In [3]:
# Full mode
jieba.lcut(ls,cut_all=True)

['中国', '国是', '一个', '伟大', '的', '国家']

In [4]:
# Search-engine mode
jieba.lcut_for_search(ls)

['中国', '是', '一个', '伟大', '的', '国家']

In [5]:
jieba.lcut_for_search("中华人民共和国是伟大的")

['中华', '华人', '人民', '共和', '共和国', '中华人民共和国', '是', '伟大', '的']

In [6]:
# Add a new word to the segmentation dictionary
jieba.add_word("蟒蛇语言")

- English word frequency: remove the interference of punctuation and letter case first.

In [7]:
def getText():
    txt = open("D:/Vina_test/hamlet.txt","r").read()  # read the text file
    txt = txt.lower() # lowercase everything so case does not matter
    for ch in "!'#$%()*<+,.~`_>?/@[\\]{}^&~|‘’":
        txt = txt.replace(ch," ")  # replace punctuation and special symbols with spaces
    return txt

hamletTxt = getText()
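words = hamletTxt.split()  # split on whitespace by default; returns a list
# The dictionary maps each word to its frequency
counts = {}
for word in words:
    # counts.get returns the current count for word, or 0 if the key is missing; the assignment then stores count + 1
    counts[word] = counts.get(word,0) +1
items = list(counts.items())  # convert the dict to a list of (word, count) pairs so it can be sorted
items.sort(key=lambda x:x[1],reverse = True)  # sort by the count (second element); ascending by default, reverse=True gives descending
for i in range(10):
    word,count = items[i]  # unpack the pair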
    print("{0:<10}{1:>5}".format(word,count))


the        1142
and         964
to          744
of          668
i           625
a           531
you         526
my          513
hamlet      461
in          449
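
The counting loop above can also be written with `collections.Counter`. The sketch below is an alternative, not the cell's exact behavior: it assumes words consist only of the letters a-z, so every other character (including quotes, hyphens, colons, and digits) acts as a separator, which is stricter than the character list used in `getText`. `top_words` is a hypothetical helper name and the sample sentence stands in for the file contents.

```python
import re
from collections import Counter

def top_words(text, n=10):
    """Return the n most frequent lowercase words as (word, count) pairs."""
    # Lowercase first, then keep only runs of letters; everything else separates words.
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common(n)

sample = "To be, or not to be: that is the question."
for word, count in top_words(sample, 3):
    print("{0:<10}{1:>5}".format(word, count))
```

To apply it to the play, pass the text read from the file instead of `sample`.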


- Chinese word frequency: punctuation and letter case need no special handling; the text has to be segmented first

In [8]:
import jieba 
txt = open("D:/Vina_test/threekingdoms.txt","r",encoding="utf-8").read()  # read the text file
words = jieba.lcut(txt)
counts={}
for word in words:
    if len(word) == 1:
        continue
    else:
        counts[word] = counts.get(word,0) +1
items = list(counts.items())  # list of (word, count) pairs
items.sort(key=lambda x:x[1],reverse=True)
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word,count))

曹操          953
孔明          836
将军          772
却说          656
玄德          585
关公          510
丞相          491
二人          469
不可          440
荆州          425
玄德曰         390
孔明曰         390
不能          384
如此          378
张飞          358


- Removing irrelevant words and merging different names for the same character

In [9]:
import jieba 
txt = open("D:/Vina_test/threekingdoms.txt","r",encoding="utf-8").read()  # read the text file
# The exclusion set was built up over several trial runs
excludes ={"主公","一人","不知","人马","都督","今日","魏兵","陛下","不敢","引兵","东吴","于是","大喜","天下","次日","将军","却说","荆州","二人","不可","不能","如此","左右","如何","商议","军士","军马"}
words = jieba.lcut(txt)
counts={}
# Map different names for the same character to one canonical name
for word in words:
    if len(word) == 1:
        continue
    elif word =="诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公"  or word == "云长":
        rword = "关羽"
    elif word == "孟德"  or word == "丞相":
        rword = "曹操"
    elif word == "玄德"  or word == "玄德曰":
        rword = "刘备"
    else:
        rword = word
    counts[rword] = counts.get(rword,0) +1
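# Drop the excluded words; pop with a default avoids a KeyError if a word never occurred
for word in excludes:
    counts.pop(word, None)
items = list(counts.items())  # list of (word, count) pairs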
items.sort(key=lambda x:x[1],reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word,count))

曹操         1451
孔明         1383
刘备         1252
关羽          784
张飞          358
吕布          300
赵云          278
孙权          264
司马懿         221
周瑜          217
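
The `elif` chain and the separate deletion loop can be folded into one lookup table. The following is a minimal sketch under stated assumptions: `ALIASES` copies the mappings from the cell above, `count_names` is a hypothetical helper, and the token list is hand-made rather than real `jieba.lcut` output.

```python
from collections import Counter

# Alias table copied from the cell above: variant name -> canonical name.
ALIASES = {"诸葛亮": "孔明", "孔明曰": "孔明",
           "关公": "关羽", "云长": "关羽",
           "孟德": "曹操", "丞相": "曹操",
           "玄德": "刘备", "玄德曰": "刘备"}

def count_names(words, excludes=frozenset()):
    """Count multi-character words after alias mapping, skipping excluded words."""
    counts = Counter()
    for word in words:
        if len(word) == 1:
            continue                      # single characters are ignored, as above
        word = ALIASES.get(word, word)    # fall back to the word itself
        if word not in excludes:
            counts[word] += 1
    return counts

# Hand-made token list for illustration (not real segmentation output).
tokens = ["玄德", "曰", "孔明", "将军", "诸葛亮", "玄德曰", "丞相", "将军"]
result = count_names(tokens, excludes={"将军"})
print(result.most_common())
```

Keeping the aliases in a dictionary means a new name variant is one extra entry rather than another `elif` branch.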


## Word clouds

- Install from the command prompt (requires a C++ 14.0 build environment):
```
pip install wordcloud
```

- The module name `wordcloud` is lowercase; the class name `WordCloud` is capitalized  
```
w = wordcloud.WordCloud()   # create the word cloud object w
```

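|Parameter|Description|
|-------------------|:------------------------------------|
|width|Image width; default 400 pixels|
|height|Image height; default 200 pixels|
|min_font_size|Smallest font size used in the cloud; default 4|
|max_font_size|Largest font size used in the cloud; adjusted automatically from the image height|
|font_step|Step between font sizes; default 1|
|font_path|Path to the font file; default None<br>e.g. Microsoft YaHei: font_path="msyh.ttc"|
|max_words|Maximum number of words shown in the cloud; default 200|
|stopwords|Set of words to exclude from the cloud, e.g. stopwords={"Python"}|
|mask|Shape of the cloud; default is a rectangle<br><p align="left">>>>from scipy.misc import imread</p><br><p align="left">>>>mk=imread("Vina.png")</p><br><p align="left">>>>w = wordcloud.WordCloud(mask=mk)</p>|
|background_color|Background color; default black<br><p align="left">>>>w = wordcloud.WordCloud(background_color = "white")</p>|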

|Method|Description|
|-----------------------------:|--------------------------:|
|w.generate(txt)|Load the text txt into the WordCloud object w|
|w.to_file(filename)|Save the word cloud image as a .png or .jpg file|

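- `generate` splits the text on whitespace, filters out low-frequency and very short tokens, and assigns font sizes to the remaining words according to their frequency
- English text needs no segmentation; Chinese text must be segmented first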
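
Because font size follows word frequency, the counts can also be computed explicitly and handed to the cloud. The sketch below assumes the tokens are already segmented (for example by `jieba.lcut`); `word_frequencies` and `render` are hypothetical helpers, the output path is a placeholder, and `generate_from_frequencies` is the `WordCloud` method that accepts a word-to-count mapping.

```python
from collections import Counter

def word_frequencies(tokens, min_len=2):
    """Count tokens, skipping blanks and tokens shorter than min_len characters."""
    return Counter(t for t in tokens if len(t.strip()) >= min_len)

def render(frequencies, out_path, font_path="msyh.ttc"):
    """Draw a word cloud from a word -> count mapping and save it."""
    # Imported lazily so the counting helper works without wordcloud installed.
    import wordcloud
    w = wordcloud.WordCloud(font_path=font_path, background_color="white")
    w.generate_from_frequencies(frequencies)
    w.to_file(out_path)

# Pre-segmented tokens, standing in for the list returned by jieba.lcut(...).
tokens = ["词云", "的", "字号", "由", "词频", "决定", "，",
          "词频", "越", "高", "字号", "越", "大"]
freqs = word_frequencies(tokens)
print(freqs.most_common(2))
# render(freqs, "cloud.png")  # placeholder path; needs wordcloud and the font file
```

Passing frequencies directly skips wordcloud's own whitespace splitting and filtering, so any filtering has to be done in `word_frequencies`.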

In [1]:
import wordcloud
c = wordcloud.WordCloud()
c.generate("wordcloud by Python")
c.to_file('D:/Vina_test/wordcloud_test.jpg')

<wordcloud.wordcloud.WordCloud at 0x1d14aef7048>

- Word cloud from a string

In [1]:
import jieba
import wordcloud
txt = '中央财经大学是教育部直属的、教育部、财政部和北京市共建的大学，是国家“双一流”建设、“211工程”建设和首批“优势学科创新平台”项目建设高校。'
w = wordcloud.WordCloud(width=1000,height=700,font_path="msyh.ttc")
w.generate(" ".join(jieba.lcut(txt)))  # join the jieba tokens with spaces, then pass the string to the WordCloud object
w.to_file("D:/Vina_test/中财词云.png")

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\admin\AppData\Local\Temp\jieba.cache
Loading model cost 0.969 seconds.
Prefix dict has been built succesfully.


<wordcloud.wordcloud.WordCloud at 0x291dcd584e0>

- Word cloud from a .txt file after segmentation

In [3]:
import jieba
import wordcloud
f = open("D:/Vina_test/新时代中国特色社会主义.txt","r",encoding="utf-8")
t = f.read()
f.close()
ls = jieba.lcut(t)
txt = " ".join(ls)
w = wordcloud.WordCloud(width=1000,height=700,font_path="msyh.ttc",background_color = "white",max_words = 30)
w.generate(txt)
w.to_file("D:/Vina_test/新时代中国特色社会主义1.png")

<wordcloud.wordcloud.WordCloud at 0x1d14aed8940>

In [4]:
import jieba
import wordcloud
f = open("D:/Vina_test/关于实施乡村振兴战略的意见.txt","r",encoding="utf-8")
t = f.read()
f.close()
ls = jieba.lcut(t)
txt = " ".join(ls)
w = wordcloud.WordCloud(width=1000,height=700,font_path="msyh.ttc",background_color = "white",max_words = 30)
w.generate(txt)
w.to_file("D:/Vina_test/关于实施乡村振兴战略的意见1.png")

<wordcloud.wordcloud.WordCloud at 0x1d14d34c160>

- Using `mask` to produce word clouds in other shapes

In [5]:
# Five-pointed star
import jieba
import wordcloud
from scipy.misc import imread
mask = imread("D:/Vina_test/五角星.jpg")
f = open("D:/Vina_test/新时代中国特色社会主义.txt","r",encoding="utf-8")
t = f.read()
f.close()
ls = jieba.lcut(t)
txt = " ".join(ls)
w = wordcloud.WordCloud(width=1000,height=700,font_path="msyh.ttc",background_color = "white",mask = mask)
w.generate(txt)
w.to_file("D:/Vina_test/新时代中国特色社会主义2.png")

`imread` is deprecated in SciPy 1.0.0, and will be removed in 1.2.0.
Use ``imageio.imread`` instead.
  """


<wordcloud.wordcloud.WordCloud at 0x1d14d136898>

In [6]:
# Five-pointed star
import jieba
import wordcloud
from scipy.misc import imread
mask = imread("D:/Vina_test/五角星.jpg")
f = open("D:/Vina_test/关于实施乡村振兴战略的意见.txt","r",encoding="utf-8")
t = f.read()
f.close()
ls = jieba.lcut(t)
txt = " ".join(ls)
w = wordcloud.WordCloud(width=1000,height=700,font_path="msyh.ttc",background_color = "white",mask = mask)
w.generate(txt)
w.to_file("D:/Vina_test/关于实施乡村振兴战略的意见2.png")

`imread` is deprecated in SciPy 1.0.0, and will be removed in 1.2.0.
Use ``imageio.imread`` instead.
  """


<wordcloud.wordcloud.WordCloud at 0x1d14cfbcd68>

In [7]:
# Map of China
import jieba
import wordcloud
from scipy.misc import imread
mask = imread("D:/Vina_test/中国地图.jpg")
f = open("D:/Vina_test/新时代中国特色社会主义.txt","r",encoding="utf-8")
t = f.read()
f.close()
ls = jieba.lcut(t)
txt = " ".join(ls)
w = wordcloud.WordCloud(width=1000,height=700,font_path="msyh.ttc",background_color = "white",mask = mask)
w.generate(txt)
w.to_file("D:/Vina_test/新时代中国特色社会主义3.png")

`imread` is deprecated in SciPy 1.0.0, and will be removed in 1.2.0.
Use ``imageio.imread`` instead.
  """


<wordcloud.wordcloud.WordCloud at 0x1d14d34c400>

In [8]:
# Map of China
import jieba
import wordcloud
from scipy.misc import imread
mask = imread("D:/Vina_test/中国地图.jpg")
f = open("D:/Vina_test/关于实施乡村振兴战略的意见.txt","r",encoding="utf-8")
t = f.read()
f.close()
ls = jieba.lcut(t)
txt = " ".join(ls)
w = wordcloud.WordCloud(width=1000,height=700,font_path="msyh.ttc",background_color = "white",mask = mask)
w.generate(txt)
w.to_file("D:/Vina_test/关于实施乡村振兴战略的意见3.png")

`imread` is deprecated in SciPy 1.0.0, and will be removed in 1.2.0.
Use ``imageio.imread`` instead.
  """


<wordcloud.wordcloud.WordCloud at 0x1d14d17fba8>

- Word cloud for the ecologically livable village construction text, with uninformative words removed

In [10]:
import jieba
import wordcloud
from scipy.misc import imread
mask = imread("D:/Vina_test/中国地图.jpg")
f = open("D:/Vina_test/生态村.txt","r",encoding="utf-8")
t = f.read()
f.close()
ls = jieba.lcut(t)
txt = " ".join(ls)
w = wordcloud.WordCloud(stopwords={"农村","村民","村庄","建设"},width=1000,height=700,font_path="msyh.ttc",background_color = "white",mask = mask)
w.generate(txt)
w.to_file("D:/Vina_test/生态村.png")

`imread` is deprecated in SciPy 1.0.0, and will be removed in 1.2.0.
Use ``imageio.imread`` instead.


<wordcloud.wordcloud.WordCloud at 0x2668faeba90>
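
The cells above repeat the same steps with different files, and the warning shows that `scipy.misc.imread` is deprecated. The sketch below wraps the steps in one hypothetical function, `make_cloud`, and loads the mask with Pillow and NumPy instead. It assumes `jieba`, `wordcloud`, `numpy`, and `Pillow` are installed and that the font file `msyh.ttc` can be found; the default options mirror the ones used in the cells.

```python
def make_cloud(text_path, out_path, mask_path=None, stopwords=None, **kwargs):
    """Segment a UTF-8 text file with jieba and save a word cloud image."""
    # Third-party packages are imported lazily, inside the function.
    import jieba
    import wordcloud

    mask = None
    if mask_path is not None:
        import numpy as np
        from PIL import Image
        # wordcloud leaves pure-white pixels of the mask empty.
        mask = np.array(Image.open(mask_path))

    # The with-block closes the file even if reading fails.
    with open(text_path, "r", encoding="utf-8") as f:
        txt = " ".join(jieba.lcut(f.read()))

    options = dict(width=1000, height=700, font_path="msyh.ttc",
                   background_color="white", mask=mask, stopwords=stopwords)
    options.update(kwargs)  # caller overrides, e.g. max_words=30
    w = wordcloud.WordCloud(**options)
    w.generate(txt)
    w.to_file(out_path)
    return w

# Example call, equivalent in intent to the last cell above:
# make_cloud("D:/Vina_test/生态村.txt", "D:/Vina_test/生态村.png",
#            mask_path="D:/Vina_test/中国地图.jpg",
#            stopwords={"农村", "村民", "村庄", "建设"})
```

When a mask is supplied, wordcloud takes the image size from the mask, so `width` and `height` only matter for the rectangular case.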