## Programming Collective Intelligence [code](https://github.com/cataska/programming-collective-intelligence-code)

> Reference: 
> 
> pandas: 
> 
> 	http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
> 
> 	http://www.open-open.com/lib/view/open1402477162868.html
> 
> Pics:
> 
> 	http://image.baidu.com/albumlist/134217728%2027274463

### 0. Recommender Systems
##### Collaborative Filtering
> - Build the dataset

In [None]:
# A dictionary of movie critics and their ratings of a small
# set of movies
critics={'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,
 'Just My Luck': 3.0, 'Superman Returns': 3.5, 'You, Me and Dupree': 2.5, 
 'The Night Listener': 3.0},
'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5, 
 'Just My Luck': 1.5, 'Superman Returns': 5.0, 'The Night Listener': 3.0, 
 'You, Me and Dupree': 3.5}, 
'Michael Phillips': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.0,
 'Superman Returns': 3.5, 'The Night Listener': 4.0},
'Claudia Puig': {'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
 'The Night Listener': 4.5, 'Superman Returns': 4.0, 
 'You, Me and Dupree': 2.5},
'Mick LaSalle': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0, 
 'Just My Luck': 2.0, 'Superman Returns': 3.0, 'The Night Listener': 3.0,
 'You, Me and Dupree': 2.0}, 
'Jack Matthews': {'Lady in the Water': 3.0, 'Snakes on a Plane': 4.0,
 'The Night Listener': 3.0, 'Superman Returns': 5.0, 'You, Me and Dupree': 3.5},
'Toby': {'Snakes on a Plane':4.5,'You, Me and Dupree':1.0,'Superman Returns':4.0}}


In [None]:
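# A function to merge lists
from itertools import chain

def merge_list(*l):
    # Concatenate any number of iterables into one new list
    # (works on Python 3, where reduce is no longer a builtin and
    # dict.keys() returns a view without extend())
    return list(chain.from_iterable(l))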

In [None]:
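# It's easier to work with pandas
import pandas
# Collect every movie title; sorted so the column order is deterministic
# (it may differ from the order in the output shown below)
movies = sorted(set(merge_list(*[person.keys() for person in critics.values()])))
people = list(critics.keys())
# Rows are critics, columns are movies; 0.0 marks a movie the critic has not rated
critics_matrix = pandas.DataFrame([[critics[person].get(m, 0.0) for m in movies] for person in people], index=people, columns=movies)
critics_matrix.head(10)
# get a person:
# critics_matrix.loc["Toby"]
# get a movie:
# critics_matrix["The Night Listener"]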

Unnamed: 0,Lady in the Water,Snakes on a Plane,Just My Luck,Superman Returns,The Night Listener,"You, Me and Dupree"
Jack Matthews,3.0,4.0,0.0,5.0,3.0,3.5
Mick LaSalle,3.0,4.0,2.0,3.0,3.0,2.0
Claudia Puig,0.0,3.5,3.0,4.0,4.5,2.5
Lisa Rose,2.5,3.5,3.0,3.5,3.0,2.5
Toby,0.0,4.5,0.0,4.0,0.0,1.0
Gene Seymour,3.0,3.5,1.5,5.0,3.0,3.5
Michael Phillips,2.5,3.0,0.0,3.5,4.0,0.0


### Similarity Measures
> - Euclidean distance: $ D_{eu} = \sqrt{(x_1-y_1)^2+(x_2-y_2)^2} $ (shown for two dimensions). So that a smaller distance yields a higher similarity, take
$ Similarity = \dfrac{1}{1+D_{eu}}$

In [None]:
def distance(v1, v2):
    # Zero stands for "no score", so compare only movies that both people rated
    index = (v1 != 0) & (v2 != 0)
    # Similarity = 1 / (1 + Euclidean distance)
    return 1/(1+((v1-v2)[index]**2).sum()**0.5)

def sim_distance(matrix, p1, p2):
    v1 = matrix.loc[p1]
    v2 = matrix.loc[p2]
    return distance(v1, v2)
# score = sim_distance(critics_matrix, "Gene Seymour", "Lisa Rose")
# score = sim_distance(critics_matrix, "Lisa Rose", "Lisa Rose")
# print(score)
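
As a quick sanity check of the formula, the following sketch recomputes the Euclidean similarity between Toby and Lisa Rose by hand in plain Python, without pandas. The helper name `euclidean_similarity` is illustrative only, and returning 0.0 when two people share no movies is an assumption; the ratings are copied from `critics`.

```python
import math

# Ratings copied from the `critics` dictionary
toby = {'Snakes on a Plane': 4.5, 'You, Me and Dupree': 1.0, 'Superman Returns': 4.0}
lisa = {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
        'Superman Returns': 3.5, 'You, Me and Dupree': 2.5, 'The Night Listener': 3.0}

def euclidean_similarity(a, b):
    """Hypothetical helper: 1 / (1 + Euclidean distance) over co-rated items."""
    shared = [item for item in a if item in b]
    if not shared:
        return 0.0  # no overlap: treated as no similarity (an assumption)
    dist = math.sqrt(sum((a[item] - b[item]) ** 2 for item in shared))
    return 1 / (1 + dist)

toby_lisa = euclidean_similarity(toby, lisa)
print(round(toby_lisa, 4))  # 0.3483
```

The three shared movies differ by 1.0, 1.5 and 0.5, so the squared differences sum to 3.5 and the similarity is 1 / (1 + sqrt(3.5)).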

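> Pearson Correlation Score: tends to give better results when the data is not well normalized, for example when some critics rate consistently higher or lower than others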

In [None]:
def pearson(v1, v2):
    # Zero stands for "no score", so correlate only movies that both people rated
    index = (v1 != 0) & (v2 != 0)
    # correlation
    return v1[index].corr(v2[index])
def sim_pearson(matrix, p1, p2):
    v1 = matrix.loc[p1]
    v2 = matrix.loc[p2]
    return pearson(v1, v2)
# score = sim_pearson(critics_matrix, "Gene Seymour", "Lisa Rose")
# print(score)
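
To make the Pearson score concrete, here is a minimal sketch that computes it from its definition (covariance divided by the product of the standard deviations) for Lisa Rose and Gene Seymour. The name `pearson_similarity` is illustrative, and returning 0.0 for fewer than two shared movies or for zero variance is an assumption rather than what pandas does (pandas returns NaN in those cases).

```python
import math

# Ratings copied from the `critics` dictionary; both critics rated all six movies
lisa = {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5, 'Just My Luck': 3.0,
        'Superman Returns': 3.5, 'You, Me and Dupree': 2.5, 'The Night Listener': 3.0}
gene = {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5, 'Just My Luck': 1.5,
        'Superman Returns': 5.0, 'The Night Listener': 3.0, 'You, Me and Dupree': 3.5}

def pearson_similarity(a, b):
    """Hypothetical helper: Pearson correlation over co-rated items."""
    shared = [item for item in a if item in b]
    n = len(shared)
    if n < 2:
        return 0.0  # assumption: too little overlap counts as no correlation
    mean_a = sum(a[i] for i in shared) / n
    mean_b = sum(b[i] for i in shared) / n
    cov = sum((a[i] - mean_a) * (b[i] - mean_b) for i in shared)
    var_a = sum((a[i] - mean_a) ** 2 for i in shared)
    var_b = sum((b[i] - mean_b) ** 2 for i in shared)
    if var_a == 0 or var_b == 0:
        return 0.0  # assumption: constant ratings carry no correlation signal
    return cov / math.sqrt(var_a * var_b)

lisa_gene = pearson_similarity(lisa, gene)
print(round(lisa_gene, 4))  # 0.3961
```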

## Important: a shortcut for computing pairwise Pearson and Euclidean similarities over the whole matrix

In [None]:
# Shortcut for computing all pairwise Pearson correlation coefficients
def get_sims(matrix, method="pearson"):
    if method == "pearson":
        # Note: DataFrame.corr treats the 0.0 placeholders as real ratings,
        # unlike pearson() above, which skips them
        dist = matrix.T.corr('pearson')
        # Alternative implementation that skips unrated movies:
        # dist = matrix.T.apply(lambda col1: matrix.T.apply(lambda col2: pearson(col1, col2)))
    else:
        # Shortcut for computing all pairwise Euclidean similarities
        dist = matrix.T.apply(lambda col1: matrix.T.apply(lambda col2: distance(col1, col2)))
    return dist
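
One caveat with the `corr` shortcut is that the 0.0 placeholders enter the correlation as if they were real ratings. A possible workaround, sketched below on a small illustrative subset of the ratings, is to mask the zeros as NaN first, since `DataFrame.corr` skips missing values pairwise. The frame `ratings` and the variable names are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Illustrative subset of the ratings (rows: critics, columns: movies); 0 means "not rated"
ratings = pd.DataFrame(
    [[2.5, 3.5, 3.0, 3.5],
     [3.0, 3.5, 1.5, 5.0],
     [0.0, 4.5, 0.0, 4.0]],
    index=["Lisa Rose", "Gene Seymour", "Toby"],
    columns=["Lady in the Water", "Snakes on a Plane",
             "Just My Luck", "Superman Returns"],
)

# Zeros treated as real ratings
naive = ratings.T.corr(method="pearson")

# Zeros masked as NaN, so only co-rated movies contribute to each pair
masked = ratings.replace(0, np.nan).T.corr(method="pearson")

print(naive.loc["Toby", "Gene Seymour"])
print(masked.loc["Toby", "Gene Seymour"])
```

In this subset Toby and Gene Seymour share only two movies and disagree on their order, so the masked correlation is negative while the naive one is positive. With so little overlap a pair can also come out as NaN (for instance when one critic gave identical ratings to every shared movie), which the caller would need to handle.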

### Making Recommendations

In [None]:
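# Compute recommendations
import numpy as np
# Use the Pearson correlation coefficient as the similarity measure
sim = get_sims(critics_matrix, "pearson")
# Use the Euclidean-distance similarity instead (this overwrites the Pearson result above)
sim = get_sims(critics_matrix, "euclidean")

def get_recom(sim, matrix):
    # Similarity-weighted sum of ratings, divided by each person's total similarity.
    # The total includes the person's own row and critics who did not rate the movie.
    scores = sim.dot(matrix).divide(sim.sum(), axis=0)
    # Keep scores only for movies the person has not rated yet
    recom = scores * (matrix == 0)
    return recom
get_recom(sim, critics_matrix)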

Unnamed: 0,Lady in the Water,Snakes on a Plane,Just My Luck,Superman Returns,The Night Listener,"You, Me and Dupree"
Jack Matthews,0.0,0,1.110282,0,0.0,0.0
Mick LaSalle,0.0,0,0.0,0,0.0,0.0
Claudia Puig,1.582296,0,0.0,0,0.0,0.0
Lisa Rose,0.0,0,0.0,0,0.0,0.0
Toby,1.52954,0,1.094246,0,2.311728,0.0
Gene Seymour,0.0,0,0.0,0,0.0,0.0
Michael Phillips,0.0,0,1.251455,0,0.0,1.740976


In [None]:
critics_matrix

Unnamed: 0,Lady in the Water,Snakes on a Plane,Just My Luck,Superman Returns,The Night Listener,"You, Me and Dupree"
Jack Matthews,3.0,4.0,0.0,5.0,3.0,3.5
Mick LaSalle,3.0,4.0,2.0,3.0,3.0,2.0
Claudia Puig,0.0,3.5,3.0,4.0,4.5,2.5
Lisa Rose,2.5,3.5,3.0,3.5,3.0,2.5
Toby,0.0,4.5,0.0,4.0,0.0,1.0
Gene Seymour,3.0,3.5,1.5,5.0,3.0,3.5
Michael Phillips,2.5,3.0,0.0,3.5,4.0,0.0


## From `recom` we can conclude that the recommendations for Toby rank as `The Night Listener > Lady in the Water > Just My Luck`
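
The scores in `recom` are divided by the sum of all of a person's similarities, including critics who never rated the movie, which is why they sit well below the 1 to 5 rating scale. A common variant normalizes each score only over the critics who actually rated that movie. The sketch below shows that variant in plain Python; `recommend`, the reduced `prefs` data and the similarity weights in `sims` are illustrative assumptions, not values computed by the notebook.

```python
def recommend(person, prefs, sims):
    """Hypothetical helper: similarity-weighted average rating for unrated items.

    prefs: {person: {item: rating}}
    sims:  {other_person: similarity to `person`}
    """
    totals, weight_sums = {}, {}
    for other, sim in sims.items():
        if other == person or sim <= 0:
            continue  # skip self and non-positive similarities
        for item, rating in prefs[other].items():
            if item in prefs[person]:
                continue  # only score items the person has not rated
            totals[item] = totals.get(item, 0.0) + sim * rating
            weight_sums[item] = weight_sums.get(item, 0.0) + sim
    # Normalize by the similarities of the critics who rated each item
    scores = {item: totals[item] / weight_sums[item] for item in totals}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Reduced data set; the ratings are a subset of `critics`
prefs = {
    'Toby': {'Snakes on a Plane': 4.5, 'Superman Returns': 4.0},
    'Lisa Rose': {'Snakes on a Plane': 3.5, 'The Night Listener': 3.0,
                  'Just My Luck': 3.0},
    'Claudia Puig': {'Snakes on a Plane': 3.5, 'The Night Listener': 4.5},
}
sims = {'Lisa Rose': 0.5, 'Claudia Puig': 0.25}  # illustrative weights, not computed

ranking = recommend('Toby', prefs, sims)
print(ranking)  # [('The Night Listener', 3.5), ('Just My Luck', 3.0)]
```

Because each score is a weighted average of real ratings, it stays on the same scale as the ratings themselves.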

# Item Similarity

In [None]:
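# In practice we only need the transpose of the critics matrix
critics_t = critics_matrix.T
sim = get_sims(critics_t, method="pearson")
# Pearson similarities can be negative, so some of the scores below are negative
recom = get_recom(sim, critics_t)
# print(sim)
recom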

Unnamed: 0,Jack Matthews,Mick LaSalle,Claudia Puig,Lisa Rose,Toby,Gene Seymour,Michael Phillips
Lady in the Water,0.0,0,0.830085,0,-0.59782,0,0.0
Snakes on a Plane,-0.0,0,0.0,0,-0.0,0,0.0
Just My Luck,-0.235952,0,0.0,0,-1.796236,0,-0.298214
Superman Returns,0.0,0,0.0,0,0.0,0,0.0
The Night Listener,0.0,0,0.0,0,-3.748855,0,0.0
"You, Me and Dupree",0.0,0,0.0,0,0.0,0,1.560825

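When the ratings are kept in a nested dictionary like `critics` rather than in a DataFrame, the same transposition can be sketched by swapping the two key levels. `transpose_prefs` and `sample` are illustrative names; the ratings are a subset of `critics`.

```python
def transpose_prefs(prefs):
    """Swap the key levels: {person: {item: rating}} -> {item: {person: rating}}."""
    result = {}
    for person, ratings in prefs.items():
        for item, rating in ratings.items():
            result.setdefault(item, {})[person] = rating
    return result

sample = {
    'Toby': {'Snakes on a Plane': 4.5, 'Superman Returns': 4.0},
    'Lisa Rose': {'Snakes on a Plane': 3.5, 'Just My Luck': 3.0},
}
by_movie = transpose_prefs(sample)
print(by_movie['Snakes on a Plane'])  # {'Toby': 4.5, 'Lisa Rose': 3.5}
```

Unlike the DataFrame version, missing ratings are simply absent here, so no 0.0 placeholder is needed.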
