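## Supervised and Unsupervised Learning
The supervised learning algorithms covered in this book are:
- Neural networks
- Decision trees
- Support vector machines
- Bayesian filtering

The unsupervised learning algorithms covered are:
- Clustering
- Non-negative matrix factorization
- Self-organizing maps

This section focuses on clustering.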

In [1]:
import feedparser
import re

# Return the title of an RSS feed and a dictionary of word counts for it
def getwordcounts(url):
    # Parse the feed
    d=feedparser.parse(url)
    wc={}

    # Loop over all the entries
    for e in d.entries:
        if 'summary' in e:
            summary=e.summary
        else:
            summary=e.description

        # Extract a list of words
        words=getwords(e.title+' '+summary)
        for word in words:
            wc.setdefault(word,0)
            wc[word]+=1
    return d.feed.title,wc

# Strip all HTML tags from the text, split it on non-letter characters, and return the words as a lowercase list
def getwords(html):
    # Remove all the HTML tags
    txt=re.compile(r'<[^>]+>').sub('',html)

    # Split words on runs of non-letter characters
    words=re.compile(r'[^A-Za-z]+').split(txt)

    # Convert to lowercase, skipping empty strings
    return [word.lower() for word in words if word!='']
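
To make the tokenization step concrete, here is a minimal, self-contained sketch of the same idea: strip the tags, split on non-letter characters, then lowercase. The name `tokenize` and the sample string are made up for illustration and are not part of the notebook.

```python
import re

# Hypothetical helper names; the regexes follow the approach described above.
TAG_RE = re.compile(r'<[^>]+>')          # matches an HTML tag such as <b> or </p>
NON_LETTER_RE = re.compile(r'[^A-Za-z]+')  # matches a run of non-letter characters

def tokenize(html):
    """Strip HTML tags, split on non-letters, and lowercase the words."""
    text = TAG_RE.sub('', html)
    return [w.lower() for w in NON_LETTER_RE.split(text) if w != '']

# Made-up sample input
sample = "<p>Hello, <b>World</b>! It's 2024.</p>"
tokens = tokenize(sample)
print(tokens)
```

Because the apostrophe counts as a separator, a contraction such as "It's" is split into two tokens. This is probably why fragments like `ve` and `wouldn` show up as columns in the generated data.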

Loop over the feeds and build the dataset.

In [6]:
apcount={}      # word -> number of blogs in which it appears more than once
wordcounts={}   # blog title -> {word: count}
with open('feedlist.txt','r') as f:
    # Strip the trailing newline from each URL and skip blank lines
    feedlist=[line.strip() for line in f if line.strip()]
# Build the word counts for each blog
for feedurl in feedlist:
    try:
        title,wc=getwordcounts(feedurl)
        wordcounts[title]=wc
        for word,count in wc.items():
            apcount.setdefault(word,0)
            if count>1:
                apcount[word]+=1
    except Exception:
        print('Failed to parse feed %s' % feedurl)

Failed to parse feed http://battellemedia.com/index.xml

Failed to parse feed http://blog.guykawasaki.com/index.rdf

Failed to parse feed http://feeds.searchenginewatch.com/sewblog

Failed to parse feed http://blog.topix.net/index.rdf

Failed to parse feed http://blogs.abcnews.com/theblotter/index.rdf

Failed to parse feed http://feeds.feedburner.com/ConsumingExperienceFull

Failed to parse feed http://flagrantdisregard.com/index.php/feed/

Failed to parse feed http://featured.gigaom.com/feed/

Failed to parse feed http://feeds.feedburner.com/instapundit/main

Failed to parse feed http://jeremy.zawodny.com/blog/rss2.xml

Failed to parse feed http://michellemalkin.com/index.rdf

Failed to parse feed http://moblogsmoproblems.blogspot.com/rss.xml

Failed to parse feed http://beta.blogger.com/feeds/27154654/posts/full?alt=rss

Failed to parse feed http://powerlineblog.com/index.rdf

Failed to parse feed http://feeds.feedburner.com/Publishing20

Failed to parse feed http://scienceblogs.com/

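Build the word list. Words such as "the" appear in nearly every blog, and rare words such as "flim-flam" appear in only a few, so neither kind helps tell blogs apart.

We keep only the words that appear in between 10% and 50% of the blogs.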

In [7]:
wordlist=[]
for w,bc in apcount.items():
    frac=float(bc)/len(feedlist)
    if frac>0.1 and frac<0.5:
        wordlist.append(w)
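
The filter is easier to follow on a toy example. The sketch below uses made-up data and hypothetical helper names (`document_frequency`, `select_words`). Two simplifications are assumed: a word counts as present in a blog if it occurs at all, whereas the cell above only counts blogs where it occurs more than once, and the bounds are widened so that five toy blogs are enough to show both cutoffs.

```python
def document_frequency(word_counts):
    """Map each word to the number of documents that contain it."""
    df = {}
    for counts in word_counts.values():
        for word in counts:
            df[word] = df.get(word, 0) + 1
    return df

def select_words(word_counts, low, high):
    """Keep words whose share of documents lies strictly between low and high."""
    n_docs = len(word_counts)
    df = document_frequency(word_counts)
    return sorted(w for w, n in df.items() if low < n / n_docs < high)

# Made-up data: blog title -> {word: count}
toy_counts = {
    "blog_a": {"the": 9, "python": 3, "zebra": 1},
    "blog_b": {"the": 7, "python": 2},
    "blog_c": {"the": 5, "python": 1},
    "blog_d": {"the": 8},
    "blog_e": {"the": 6},
}

# "the" is in 5/5 blogs (too common), "zebra" in 1/5 (too rare), "python" in 3/5.
selected = select_words(toy_counts, low=0.3, high=0.9)
print(selected)
```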

Use the word list and the list of blogs to create a text file containing one large matrix of the word counts for each blog.

In [10]:
with open('blogdata.txt','w') as f:
    f.write('Blog')
    for word in wordlist:
        f.write('\t%s' %word)
    f.write('\n')
    for blog,wc in wordcounts.items():
        f.write(blog)
        for word in wordlist:
            if word in wc:
                f.write('\t%d' %wc[word])
            else:
                f.write('\t0')
        f.write('\n')
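
Since the real file is wide and hard to read, the layout may be clearer on a two-word, two-blog example. This sketch writes the same tab-separated layout to an in-memory buffer. `write_matrix` and the toy data are hypothetical and only mirror the cell above.

```python
import io

def write_matrix(out, wordlist, wordcounts):
    """Write a header row of words, then one tab-separated row of counts per blog."""
    out.write('Blog')
    for word in wordlist:
        out.write('\t%s' % word)
    out.write('\n')
    for blog, wc in wordcounts.items():
        out.write(blog)
        for word in wordlist:
            # Words missing from a blog are written as 0
            out.write('\t%d' % wc.get(word, 0))
        out.write('\n')

# Made-up data for illustration
toy_wordlist = ['python', 'music']
toy_wordcounts = {
    'blog_a': {'python': 3, 'the': 9},
    'blog_b': {'music': 2, 'python': 1},
}

buf = io.StringIO()
write_matrix(buf, toy_wordlist, toy_wordcounts)
matrix_text = buf.getvalue()
print(matrix_text)
```

Each row is one blog and each column is one word. Words outside the word list (here `the`) are dropped.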

At this point there should be a blogdata.txt file in the working directory. Let's take a look at it (the raw output is hard to read).

In [11]:
with open('blogdata.txt','r') as f:
    print(f.read())

Blog	drive	come	music	shows	open	through	p	bad	make	better	bill	world	help	reason	four	ve	me	do	order	john	their	wednesday	best	could	else	another	low	more	city	before	phone	still	need	run	working	available	long	yet	program	special	weeks	youtube	here	web	become	good	try	far	he	i	which	major	x	job	same	almost	became	those	google	hope	behind	take	human	took	important	being	though	my	man	tv	am	an	anything	put	buy	hard	course	plan	watch	coming	car	some	morning	wouldn	see	community	reading	work	matter	tell	but	about	trump	does	big	possible	announced	them	up	down	certain	every	perhaps	your	second	top	when	june	than	everything	very	close	b	needs	short	soon	offers	note	case	start	like	doing	email	late	sure	takes	bit	online	change	saying	lots	version	think	million	without	so	story	new	let	real	stories	idea	small	design	state	website	clear	health	just	including	photos	hear	year	home	thought	security	days	content	example	others	where	stop	news	stay	deal	his	such	play	too	are	right	she	or	turn	whe

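## Hierarchical Clustering
Hierarchical clustering builds a hierarchy of groups by repeatedly merging the two most similar groups.

Each group starts out as a single item. In each iteration, the algorithm computes the distance between every pair of groups and merges the closest two into a new group. This repeats until only one group remains.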
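
The merging loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not the notebook's implementation: it uses Euclidean distance instead of the Pearson-based distance, represents a merged group by the average of the two vectors, and the names `euclidean` and `merge_order` are made up.

```python
from math import sqrt

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def merge_order(points, distance=euclidean):
    """Return the pairs of cluster labels in the order they get merged."""
    # Start with one cluster per point; each cluster is (label, vector).
    clusters = [(str(i), list(p)) for i, p in enumerate(points)]
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        (label_a, vec_a), (label_b, vec_b) = clusters[i], clusters[j]
        # Assumption: the merged cluster is represented by the average vector
        merged_vec = [(x + y) / 2.0 for x, y in zip(vec_a, vec_b)]
        merges.append((label_a, label_b))
        # Delete the higher index first so the lower index stays valid
        del clusters[j]
        del clusters[i]
        clusters.append(("(%s+%s)" % (label_a, label_b), merged_vec))
    return merges

# Two tight pairs that are far apart from each other
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
merges = merge_order(points)
print(merges)
```

The two nearby pairs are merged first, and the two resulting groups are joined last. That final merge forms the top of the hierarchy.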

In [12]:
def readfile(filename):
    with open(filename,'r') as f:
        lines=f.readlines()

    # The first line holds the column titles
    colnames=lines[0].strip().split('\t')[1:]
    rownames=[]
    data=[]
    for line in lines[1:]:
        p=line.strip().split('\t')
        # The first column in each row is the row name
        rownames.append(p[0])
        # The rest of the row is the data for that row
        data.append([float(x) for x in p[1:]])
    return rownames,colnames,data

Next we define closeness. We use the **Pearson correlation**, which measures how well two sets of data fit a straight line.


In [13]:
from math import sqrt
def pearson(v1,v2):
    # Simple sums
    sum1=sum(v1)
    sum2=sum(v2)

    # Sums of the squares
    sum1Sq=sum([pow(v,2) for v in v1])
    sum2Sq=sum([pow(v,2) for v in v2])

    # Sum of the products
    pSum=sum([v1[i]*v2[i] for i in range(len(v1))])

    # Calculate r (the Pearson correlation)
    num=pSum-(sum1*sum2/len(v1))
    den=sqrt((sum1Sq-pow(sum1,2)/len(v1))*(sum2Sq-pow(sum2,2)/len(v1)))
    if den==0:
        return 0

    # Return 1-r so that more similar vectors are closer together
    return 1.0-num/den

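The Pearson correlation is 1.0 when the two vectors match perfectly and close to 0 when there is no relationship between them. It is negative, down to -1.0, when they are inversely related.

We return 1 minus the correlation so that the more similar two items are, the smaller the distance between them.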
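
A small worked example shows the range of this distance. The function below is a self-contained restatement of the same formula under the hypothetical name `pearson_distance`. The input vectors are made up.

```python
from math import sqrt

def pearson_distance(v1, v2):
    """Return 1 - r, where r is the Pearson correlation of v1 and v2."""
    n = len(v1)
    sum1, sum2 = sum(v1), sum(v2)
    sum1_sq = sum(v * v for v in v1)
    sum2_sq = sum(v * v for v in v2)
    p_sum = sum(a * b for a, b in zip(v1, v2))

    num = p_sum - sum1 * sum2 / n
    den = sqrt((sum1_sq - sum1 ** 2 / n) * (sum2_sq - sum2 ** 2 / n))
    if den == 0:
        # A constant vector has no defined correlation; follow the cell above
        return 0
    return 1.0 - num / den

# [2, 4, 6] rises exactly in step with [1, 2, 3]: r = 1, distance 0
same_trend = pearson_distance([1, 2, 3], [2, 4, 6])
# [3, 2, 1] falls as [1, 2, 3] rises: r = -1, distance 2
opposite_trend = pearson_distance([1, 2, 3], [3, 2, 1])
print(same_trend, opposite_trend)
```

The distance therefore ranges from 0 for perfectly correlated vectors to 2 for perfectly anti-correlated ones. Note that scale does not matter: `[2, 4, 6]` is at distance 0 from `[1, 2, 3]` even though its values are twice as large.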

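Create a bicluster class that represents a node in the hierarchy. It stores the node's vector, its left and right children, the distance between them, and an id.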

In [15]:
class bicluster:
    def __init__(self,vec,left=None,right=None,distance=0.0,id=None):
        self.left=left
        self.right=right
        self.vec=vec
        self.id=id
        self.distance=distance
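
To show how such nodes fit together, here is a sketch that builds a two-leaf tree by hand and walks it. `Node` is a hypothetical stand-in with the same fields as `bicluster`. The averaged vector, the distance value 0.5, and the id of -1 for the merged node are assumptions made for illustration.

```python
class Node:
    """Hypothetical stand-in with the same fields as the bicluster class."""
    def __init__(self, vec, left=None, right=None, distance=0.0, id=None):
        self.left = left
        self.right = right
        self.vec = vec
        self.id = id
        self.distance = distance

def leaf_ids(node):
    """Collect the ids of the leaves below a node, left to right."""
    if node.left is None and node.right is None:
        return [node.id]
    return leaf_ids(node.left) + leaf_ids(node.right)

# Two leaves, each holding one original row
a = Node([1.0, 2.0], id=0)
b = Node([3.0, 4.0], id=1)

# Assumption: a merged node holds the average of its children's vectors
merged_vec = [(x + y) / 2.0 for x, y in zip(a.vec, b.vec)]
root = Node(merged_vec, left=a, right=b, distance=0.5, id=-1)

ids = leaf_ids(root)
print(ids, root.vec)
```

Leaves have no children, so the whole hierarchy can be recovered by following `left` and `right` from the root.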

Store the correlation distance computed for each pair so that it does not have to be recalculated.
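
One way to store the per-pair results is a dictionary keyed by the two node ids. The sketch below is an assumption about how such a cache could look, not the notebook's code. `make_cached_distance`, the call counter, and the toy distance function are all hypothetical.

```python
def make_cached_distance(distance):
    """Wrap a distance function so each (id_a, id_b) pair is computed only once."""
    cache = {}
    calls = {"count": 0}  # counts real computations, for demonstration only

    def cached(id_a, vec_a, id_b, vec_b):
        key = (id_a, id_b)
        if key not in cache:
            calls["count"] += 1
            cache[key] = distance(vec_a, vec_b)
        return cache[key]

    return cached, calls

def abs_diff(v1, v2):
    """Toy distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(v1, v2))

cached_distance, calls = make_cached_distance(abs_diff)
first = cached_distance(0, [1, 2], 1, [3, 5])
second = cached_distance(0, [1, 2], 1, [3, 5])  # served from the cache
print(first, second, calls["count"])
```

Keying by id rather than by vector works because an id identifies one cluster for the whole run, so a cached value never goes stale.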

In [None]:
def hcluster(rows,distance=pearson):