# 20201104 Matrix Factorization and Bayesian Models

**Deadline: 12:00 noon, November 8, 2020**

---

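Notes:

1. Read the code in **SVD_Model.ipynb** carefully and make sure you understand it, then answer the questions below.

2. If you use any library other than common ones such as numpy, pandas, scipy, and sklearn, flag it in a comment. If you take an original approach, explain it in a comment as well. Your code should be highly readable.

3. Name your submission [Name-StudentID.ipynb].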


## Question 1

Following the code in SVD_Model.ipynb, use the same method to predict how people in **each occupation** would rate **each movie**.

1.0 Load the data

In [1]:
import numpy as np
import pandas as pd

# All three files are tab-separated; usecols keeps only the listed columns.
ratings = pd.read_csv('ratings.csv', sep='\t', encoding='latin-1', usecols=['user_id', 'movie_id', 'rating', 'timestamp'])
users = pd.read_csv('users.csv', sep='\t', encoding='latin-1', usecols=['user_id', 'gender', 'zipcode', 'age_desc', 'occ_desc'])
movies = pd.read_csv('movies.csv', sep='\t', encoding='latin-1', usecols=['movie_id', 'title', 'genres'])
users  # display the user table
# movies

,user_id,gender,zipcode,age_desc,occ_desc
0,1,F,48067,Under 18,K-12 student
1,2,M,70072,56+,self-employed
2,3,M,55117,25-34,scientist
3,4,M,02460,45-49,executive/managerial
4,5,M,55455,25-34,writer
...,...,...,...,...,...
6035,6036,F,32603,25-34,scientist
6036,6037,F,76006,45-49,academic/educator
6037,6038,F,14706,56+,academic/educator
6038,6039,F,01060,45-49,other or not specified


1.1 Merge in each user's occupation, then compute the mean rating that each occupation gives to each movie.

In [2]:
# Mean rating per (occupation, movie); pairs with no rating are filled with 0.
ratingsTable = (ratings.merge(users[['user_id', 'occ_desc']], on='user_id')
                .pivot_table(index='occ_desc', columns='movie_id', values='rating', aggfunc='mean')
                .fillna(0))
ratingsTable

occ_desc \ movie_id,1,2,3,4,5,6,7,8,9,10,...,3943,3944,3945,3946,3947,3948,3949,3950,3951,3952
K-12 student,3.904255,3.322581,3.6,3.0,3.692308,4.25,3.5,3.090909,2.6,4.081081,...,3.75,1.75,1.8,2.25,3.0,4.214286,4.583333,4.0,3.0,3.933333
academic/educator,4.229299,3.301887,2.675676,2.809524,3.230769,3.578947,3.342105,3.25,1.666667,3.183333,...,3.111111,2.0,0.0,2.0,3.25,3.767857,4.173913,4.0,3.666667,3.757576
artist,4.0,2.970588,2.777778,2.5,3.333333,3.853659,3.208333,3.0,4.0,3.694444,...,3.4,0.0,1.0,3.0,3.5,3.368421,4.0,3.333333,5.0,3.434783
clerical/admin,4.492537,3.285714,2.909091,3.333333,3.285714,3.625,3.2,3.0,2.0,3.608696,...,4.0,0.0,1.0,4.0,5.0,4.346154,4.375,3.5,0.0,3.333333
college/grad student,4.040404,3.116505,3.060606,2.392857,2.860465,4.034965,3.5,2.5625,2.55,3.633333,...,2.857143,1.0,2.066667,2.533333,3.666667,3.773723,4.306452,3.111111,4.6,3.6
customer service,4.025641,3.4,2.75,2.2,3.111111,3.909091,3.571429,0.0,2.666667,3.3,...,1.0,0.0,0.0,2.666667,2.0,4.111111,3.666667,0.0,2.0,3.333333
doctor/health care,4.394737,3.107143,2.714286,3.125,3.333333,4.068966,3.214286,4.0,2.4,3.72,...,4.333333,0.0,1.0,2.666667,3.666667,3.633333,4.125,3.0,4.5,4.105263
executive/managerial,4.189055,3.115942,3.045455,2.944444,3.228571,3.92381,3.183673,2.5,3.333333,3.597938,...,4.25,0.0,1.0,2.230769,3.818182,3.57,4.0,3.75,3.2,3.877551
farmer,4.5,2.666667,3.0,0.0,1.0,4.0,3.0,0.0,3.0,3.25,...,0.0,0.0,0.0,1.0,0.0,4.5,0.0,0.0,0.0,0.0
homemaker,4.028571,3.5,2.875,0.0,3.142857,4.5,4.25,3.666667,0.0,3.1,...,0.0,0.0,0.0,0.0,3.5,3.461538,2.666667,0.0,4.0,4.0
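The zeros in the table above come from `fillna(0)` and stand for "no rating", not for a rating of zero. The sketch below illustrates that distinction on a tiny made-up dataset. The frames `ratings_toy` and `users_toy` are hypothetical and are not taken from the assignment files.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not from the MovieLens files.
ratings_toy = pd.DataFrame({
    'user_id': [1, 1, 2, 3],
    'movie_id': [10, 20, 10, 20],
    'rating': [4, 2, 5, 3],
})
users_toy = pd.DataFrame({
    'user_id': [1, 2, 3],
    'occ_desc': ['artist', 'artist', 'farmer'],
})

# Mean rating per (occupation, movie); unrated pairs stay NaN at this point.
table = (ratings_toy.merge(users_toy, on='user_id')
         .pivot_table(index='occ_desc', columns='movie_id',
                      values='rating', aggfunc='mean'))

observed = table.notna().to_numpy()  # True where at least one rating exists
filled = table.fillna(0)             # same convention as the cell above

print(table)
print(observed)
```

Keeping a boolean mask such as `observed` next to the filled table is one way to tell real ratings from placeholders later on.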


1.2 Split the data into a training set and a validation set. You may use either hold-out or cross-validation.

In [3]:
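ratingsSeq = np.array(ratingsTable).reshape(-1)  # flatten the table row by row
row = ratingsTable.shape[0]
col = ratingsTable.shape[1]
# display(ratingsSeq!=0)
seq = np.arange(row*col)[ratingsSeq != 0]  # flat indices of observed (nonzero) entries
# display(seq)
total = seq.size
np.random.shuffle(seq)  # unseeded, so the folds differ between runs
K = 10  # number of cross-validation folds
length = total//K  # fold size; the last total % K shuffled entries are never held out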
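With `length = total//K`, the last `total % K` shuffled indices never land in a validation fold. As a possible alternative, the sketch below uses `np.array_split` so that every index belongs to exactly one fold. The array `observed_idx` is a hypothetical stand-in for `seq`, and the fixed seed is an assumption added for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed (assumption) so the folds are reproducible
observed_idx = np.arange(23)     # hypothetical stand-in for `seq`
rng.shuffle(observed_idx)

K = 10
# array_split tolerates sizes that are not divisible by K:
# fold sizes differ by at most one and no index is dropped.
folds = np.array_split(observed_idx, K)

for k, test_idx in enumerate(folds):
    # Everything outside fold k forms the training part.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != k])
    print(k, len(train_idx), len(test_idx))
```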

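1.3 Train and validate the model. Compute the SVD on the training set and predict on the validation set. Report the RMSE (root-mean-square error) of the predicted ratings on both the training set and the validation set.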

In [4]:
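def calSvd(mat, N):
    # Center each row, keep the top N singular components, then add the row means back.
    mean = mat.mean(axis=1, keepdims=True)
    mat = mat - mean  # out-of-place, so the caller's array is not modified
    u, s, vt = np.linalg.svd(mat, full_matrices=False)  # vt holds the rows of V^T
    return (u[:, :N] @ np.diag(s[:N]) @ vt[:N]) + mean

sumErr = 0
for i in range(K):
    newMat = ratingsSeq.copy()
    testNum = seq[i*length:(i+1)*length]  # flat indices held out in fold i
    newMat[testNum] = 0  # hide the held-out ratings
    newMat = calSvd(newMat.reshape(row, col), 3).reshape(-1)
    testSeq = newMat[testNum]  # predictions for the held-out entries
    ansSeq = ratingsSeq[testNum]  # true ratings for the held-out entries
    sumErr += (((ansSeq - testSeq)**2).mean())**0.5 / K  # running mean of per-fold RMSE
sumErr  # validation RMSE averaged over the K folds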

1.3479145117487463
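The cell above reports only the validation RMSE, while the task asks for both training and validation error. The sketch below shows one way to compute both for a single split. It runs on a random toy matrix rather than the real ratings, and the helpers `truncated_svd_predict` and `rmse` are hypothetical names introduced for illustration.

```python
import numpy as np

def truncated_svd_predict(mat, n_components):
    """Rank-n reconstruction of a row-centered copy of `mat`, with row means added back."""
    row_mean = mat.mean(axis=1, keepdims=True)
    u, s, vt = np.linalg.svd(mat - row_mean, full_matrices=False)
    # Scaling the kept columns of u by s is equivalent to u @ diag(s).
    return (u[:, :n_components] * s[:n_components]) @ vt[:n_components] + row_mean

def rmse(pred, truth):
    """Root-mean-square error between two arrays of equal shape."""
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

rng = np.random.default_rng(0)
full = rng.integers(1, 6, size=(6, 8)).astype(float)  # toy ratings in 1..5
flat = full.reshape(-1)

idx = rng.permutation(flat.size)
test_idx, train_idx = idx[:10], idx[10:]

masked = flat.copy()
masked[test_idx] = 0  # hide the held-out entries, as the loop above does
pred = truncated_svd_predict(masked.reshape(full.shape), 3).reshape(-1)

train_rmse = rmse(pred[train_idx], flat[train_idx])
valid_rmse = rmse(pred[test_idx], flat[test_idx])
print('train RMSE: %f  validation RMSE: %f' % (train_rmse, valid_rmse))
```

In the real notebook, `train_idx` would be the observed indices outside the current fold, so that unobserved zeros do not count toward the training error.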

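## Question 2

Use a naive Bayes classifier to predict each user's **occupation** from their **viewing interests**.

2.1 Split the data into a training set and a validation set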


In [5]:
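# Map each genre name to a column index.
dictMovie = {}
numTypes = 0
for s in movies.genres:
    for word in s.split('|'):
        if word not in dictMovie:
            dictMovie[word] = numTypes
            numTypes += 1
# featureMovies[m] is the multi-hot genre vector of movie m (row 0 is unused).
featureMovies = np.zeros((movies.movie_id.max()+1, numTypes))
for i, s in zip(movies.movie_id, movies.genres):
    for word in s.split('|'):
        featureMovies[i][dictMovie[word]] = 1
# Map each occupation name to an integer class label.
dictOccu = {}
numOccu = 0
numUsers = users.shape[0]
for s in users.occ_desc:
    if s not in dictOccu:
        dictOccu[s] = numOccu
        numOccu += 1
# occuUsers[u-1] is the class label of user u (this assumes user_id runs from 1 to numUsers).
occuUsers = np.zeros(numUsers, dtype='int64')
for i, s in zip(users.user_id, users.occ_desc):
    occuUsers[i-1] = dictOccu[s]
# featureUsers[u-1] sums the genre vectors of the movies user u rated, weighted by rating/5.
featureUsers = np.zeros((numUsers, numTypes))
for user, movie, rate in zip(ratings.user_id, ratings.movie_id, ratings.rating):
    featureUsers[user-1] += rate/5*featureMovies[movie]
# Shuffle the user indices and split them into K folds.
K = 10
seq = np.arange(numUsers)
np.random.shuffle(seq)
length = numUsers//K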
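The feature construction above can also be written with pandas operations instead of explicit loops. The sketch below is one possible vectorized variant on made-up data. The frames `movies_toy` and `ratings_toy` are hypothetical, and `Series.str.get_dummies` is used here as a swapped-in replacement for the hand-built genre dictionary.

```python
import pandas as pd

# Hypothetical toy data, not from the MovieLens files.
movies_toy = pd.DataFrame({
    'movie_id': [1, 2, 3],
    'genres': ['Animation|Comedy', 'Drama', 'Comedy|Drama'],
})
ratings_toy = pd.DataFrame({
    'user_id': [1, 1, 2],
    'movie_id': [1, 3, 2],
    'rating': [5, 4, 2],
})

# One 0/1 column per genre, indexed by movie_id.
genre_flags = movies_toy.set_index('movie_id').genres.str.get_dummies(sep='|')

# Genre vector of each rated movie, scaled by rating / 5 as in the cell above.
weights = ratings_toy.rating.to_numpy()[:, None] / 5
weighted = genre_flags.loc[ratings_toy.movie_id.to_numpy()].to_numpy() * weights

# Sum the weighted genre vectors per user.
user_features = (pd.DataFrame(weighted, columns=genre_flags.columns)
                 .groupby(ratings_toy.user_id.to_numpy()).sum())
print(user_features)
```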


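2.2 Train and validate the model. Train a naive Bayes classifier on the training set and predict on the validation set. Report the accuracy and F-score on both the training set and the validation set.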
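As a reading aid, the `train` and `predict` functions in the cell that follows appear to implement a multinomial naive Bayes rule with additive smoothing. The symbols below are notation introduced here, not taken from the notebook: $n$ is the number of training users, $n_c$ the number in occupation $c$, $C$ the number of occupations (`numOccu`), $d$ the number of genres (`numTypes`), $x_{uj}$ the rating-weighted genre feature of user $u$, and $\alpha = 0.2$ the smoothing constant.

```latex
\begin{aligned}
P(c) &= \frac{n_c + \alpha}{n + \alpha C}, \\
\theta_{cj} &= \frac{N_{cj} + \alpha}{\sum_{j'=1}^{d} N_{cj'} + \alpha d},
\qquad N_{cj} = \sum_{u:\, y_u = c} x_{uj}, \\
\hat{y}(x) &= \arg\max_{c} \Big[ \log P(c) + \sum_{j=1}^{d} x_j \log \theta_{cj} \Big].
\end{aligned}
```

Here `probOccu` holds $\log P(c)$, `probFeature` holds $\log \theta_{cj}$, and `predict` evaluates the bracketed score for every class and returns the index of the largest one.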

In [6]:
from sklearn.metrics import f1_score

def train(feature, occu):
    # Log class priors with additive smoothing (0.2 per class).
    # minlength keeps one entry per occupation even if a class is absent from the fold.
    probOccu = np.bincount(occu, minlength=numOccu).astype('float64').reshape(-1, 1)
    probOccu += 0.2
    probOccu /= probOccu.sum()
    # Per-class genre weights: sum the feature vectors of each class, smooth, normalize each row.
    probFeature = np.zeros((numOccu, numTypes))
    for i, arr in enumerate(feature):
        probFeature[occu[i]] += arr
    probFeature += 0.2
    probFeature /= probFeature.sum(axis=1, keepdims=True)
    return np.log(probFeature), np.log(probOccu)

def predict(arr, feature, occu):
    # Score per class: log prior plus the feature-weighted sum of log genre probabilities.
    prob = (arr*feature).sum(axis=1, keepdims=True) + occu
    return prob.argmax()

def evaluate(result, answer):
    a = (result == answer).sum() / result.size  # accuracy
    f = f1_score(answer, result, average='macro')  # unweighted mean of per-class F1
    return a, f

sumA, sumF = 0, 0
for i in range(K):
    testSeq = seq[i*length:(i+1)*length]
    trainSeq = np.append(seq[:i*length], seq[(i+1)*length:])
    probFeature, probOccu = train(featureUsers[trainSeq], occuUsers[trainSeq])
    result = np.zeros(testSeq.size, dtype='int64')
    for j, s in enumerate(testSeq):  # j avoids shadowing the fold index i
        result[j] = predict(featureUsers[s], probFeature, probOccu)
    a, f = evaluate(result, occuUsers[testSeq])
    print('Accuracy:%f F-score(macro):%f' % (a, f))
    sumA += a/K
    sumF += f/K
print('Average: Accuracy:%f F-score(macro):%f' % (sumA, sumF))

Accuracy:0.105960 F-score(macro):0.071050
Accuracy:0.094371 F-score(macro):0.068933
Accuracy:0.094371 F-score(macro):0.082845
Accuracy:0.097682 F-score(macro):0.060382
Accuracy:0.097682 F-score(macro):0.065735
Accuracy:0.110927 F-score(macro):0.068566
Accuracy:0.094371 F-score(macro):0.061803
Accuracy:0.114238 F-score(macro):0.063973
Accuracy:0.099338 F-score(macro):0.062939
Accuracy:0.084437 F-score(macro):0.061069
Average: Accuracy:0.099338 F-score(macro):0.066729
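The output above covers the validation folds only, while the task also asks for training-set metrics. The sketch below shows one way to report both, using scikit-learn's `MultinomialNB` as a swapped-in library implementation on synthetic data. The arrays `X` and `y` are random placeholders for `featureUsers` and `occuUsers`. Setting `alpha=0.2` mirrors the smoothing constant used above, but scikit-learn does not smooth the class prior, so its numbers may differ slightly from the hand-written classifier.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X = rng.poisson(lam=3.0, size=(200, 18)).astype(float)  # synthetic non-negative features
y = rng.integers(0, 4, size=200)                        # synthetic class labels

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0)

model = MultinomialNB(alpha=0.2).fit(X_train, y_train)

scores = {}
for name, X_part, y_part in [('train', X_train, y_train),
                             ('valid', X_valid, y_valid)]:
    pred = model.predict(X_part)
    scores[name] = (accuracy_score(y_part, pred),
                    f1_score(y_part, pred, average='macro'))
    print('%s: Accuracy:%f F-score(macro):%f' % ((name,) + scores[name]))
```

Because the labels here are random, the scores carry no meaning beyond showing the reporting pattern.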
