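A quick attempt at writing MF (matrix factorization) in Keras  
→ Passing multiple inputs through a dataset is tedious; raw TF is the one to trust after all  
Not so much trust as the fact that all the responsibility lies with me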

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from tqdm import tqdm_notebook as tqdm
import matplotlib.pyplot as plt
import sklearn
from sklearn.decomposition import NMF
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from collections import defaultdict
from copy import copy, deepcopy
import os
from datetime import datetime
import random
import math

np.random.seed(1234)
sns.set_style("darkgrid")
pd.set_option("display.max_columns", 100)
pd.set_option("display.max_rows", 500)

%matplotlib inline
%config InlineBackend.figure_formats = {"png", "retina"}

In [2]:
from logging import getLogger
from typing import List, Any, Dict

logger = getLogger(__name__)

In [3]:
import tensorflow as tf

print(tf.__version__)
tf.test.is_gpu_available()

1.12.0


True

In [4]:
# config = tf.ConfigProto()
# config.gpu_options.allow_growth = True
# config.gpu_options.per_process_gpu_memory_fraction = 0.4
# sess = tf.Session(config=config)

In [5]:
tf.test.gpu_device_name()

'/device:GPU:0'

Reload the module's .py files automatically so that edits take effect right away

In [6]:
%load_ext autoreload

In [7]:
%autoreload 2

# Loading the Dataset

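Uses MovieLens  
Following Rendle et al. 2009 (BPR), the rating values are dropped  
The task is binary classification of whether a user watches a movie, predicted only from logs of movies that were watched  
The BPR paper applies the same treatment to a Netflix subsample, restricted to  
users who rated at least 10 items and  
items rated by at least 10 users,  
which leaves 10,000 users, 5,000 items, and 565,738 ratings  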
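
The "at least 10" restriction described above is not applied anywhere in this notebook. A minimal sketch of how it could be done with pandas follows. The function name `filter_min_counts` and the toy data are hypothetical, and repeating the filter until nothing changes is an assumption about how to make both conditions hold at once.

```python
import pandas as pd


def filter_min_counts(df, min_user_ratings=10, min_item_ratings=10):
    """Keep only users and items with enough ratings, repeating until stable."""
    while True:
        n_before = len(df)
        # Number of ratings made by the user of each row.
        user_counts = df.groupby("userId")["movieId"].transform("count")
        df = df[user_counts >= min_user_ratings]
        # Number of ratings received by the item of each remaining row.
        item_counts = df.groupby("movieId")["userId"].transform("count")
        df = df[item_counts >= min_item_ratings]
        if len(df) == n_before:
            return df


# Toy log: users 1 and 2 rate movies 10 and 20; user 3 rates only movie 30.
toy = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 20, 30],
    "rating":  [4.0, 3.5, 5.0, 2.0, 1.0],
})
filtered = filter_min_counts(toy, min_user_ratings=2, min_item_ratings=2)
# Drop the rating values so only the implicit "watched" signal remains.
implicit = filtered[["userId", "movieId"]].reset_index(drop=True)
```

Removing sparse items can push a user below the threshold again, which is why the sketch loops instead of filtering once.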

# Creating the Iterator and Dataset

In [8]:
log_data = pd.read_csv("../input/movielenz/ratings.csv")

In [9]:
log_data["userId"].nunique(), log_data["movieId"].nunique()

(671, 9066)

In [10]:
log_data.groupby(["userId"])[["movieId"]].aggregate(["count", "nunique"]).aggregate(["min", "mean", "median", "max"])

|        | movieId count | movieId nunique |
|--------|---------------|-----------------|
| min    | 20.0          | 20.0            |
| mean   | 149.037258    | 149.037258      |
| median | 71.0          | 71.0            |
| max    | 2391.0        | 2391.0          |


Each user has rated at least 20 movies, with a median of 71 and a maximum of 2,391

In [11]:
log_data.groupby(["movieId"])[["userId"]].count().aggregate(["min", "mean", "median", "max"])

|        | userId    |
|--------|-----------|
| min    | 1.0       |
| mean   | 11.030664 |
| median | 3.0       |
| max    | 341.0     |


Each movie has been rated by at least 1 user, with a median of 3 and a maximum of 341

In [12]:
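class Dataset(object):
    """Interaction log plus mappings from raw IDs to contiguous indices."""

    def __init__(self, data):
        self.data = data
        self.n_data = len(data)
        self.n_users = len(set(data["userId"]))
        self.n_items = len(set(data["movieId"]))
        # One entry per log row, so the same ID appears many times.
        self.user_ids = list(data["userId"])
        self.item_ids = list(data["movieId"])
        # np.unique returns the IDs sorted; number them 0..n-1 in that order.
        self.user2index = dict(zip(np.unique(self.user_ids), range(self.n_users)))
        self.item2index = dict(zip(np.unique(self.item_ids), range(self.n_items)))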
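
Raw IDs need not be contiguous, so a mapping like `user2index` is what lets an ID select a row of a factor matrix. A small sketch of that lookup on made-up IDs (the values here are hypothetical, not taken from the data):

```python
import numpy as np

user_ids = [42, 7, 42, 1000]          # raw IDs as they might appear in the log
unique_ids = np.unique(user_ids)      # sorted: [7, 42, 1000]
user2index = dict(zip(unique_ids, range(len(unique_ids))))

# Convert the raw ID column into row indices of a factor matrix.
user_indices = np.array([user2index[u] for u in user_ids])
```

Repeated IDs map to the same index, so the result has one entry per log row, not per unique user.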

In [13]:
dataset = Dataset(log_data)

In [14]:
dataset.n_data, dataset.n_users, dataset.n_items

(100004, 671, 9066)

# Training

In [15]:
import sys
sys.path.append("/root/docker/tfrecos/")

In [16]:
import tfrecos as tfr

In [17]:
model = tfr.model.MatrixFactorization(
    n_latent_factors=10,
    learning_rate=0.01,
    reg_user=0.001,
    reg_item=0.001,
    batch_size=1000,
    epoch_size=50,
    test_size=0.1,
    save_directory_path="../logs/20190712_MF",
    scope_name="MF",
    try_count=5,
    n_users=dataset.n_users,
    n_items=dataset.n_items,
    user2index=dataset.user2index,
    item2index=dataset.item2index)
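
The internals of `tfr.model.MatrixFactorization` are not shown in this notebook. As a rough illustration of what hyperparameters such as `n_latent_factors`, `learning_rate`, `reg_user`, and `reg_item` typically control, here is a NumPy sketch of matrix factorization with a sigmoid output and log loss. The choice of loss, the initialization, and every name below are assumptions for illustration, not the library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_latent_factors = 4, 6, 3
learning_rate, reg_user, reg_item = 0.1, 0.001, 0.001

P = rng.normal(scale=0.1, size=(n_users, n_latent_factors))  # user factors
Q = rng.normal(scale=0.1, size=(n_items, n_latent_factors))  # item factors


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def predict(u, i):
    """Probability that user u interacts with item i."""
    return sigmoid(np.sum(P[u] * Q[i], axis=-1))


def sgd_step(u, i, y):
    """One SGD update of log loss plus L2 penalties for a single (u, i, y)."""
    err = predict(u, i) - y                  # gradient of log loss w.r.t. the score
    grad_p = err * Q[i] + reg_user * P[u]
    grad_q = err * P[u] + reg_item * Q[i]
    P[u] -= learning_rate * grad_p
    Q[i] -= learning_rate * grad_q


before = predict(0, 1)
for _ in range(200):
    sgd_step(0, 1, 1.0)                      # repeatedly fit one observed interaction
after = predict(0, 1)
```

With only positive examples the model can push every score toward 1, so a real training loop on implicit feedback would normally also need unobserved pairs as negatives.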

In [18]:
ckpt_path = "../logs/20190712_MF/checkpoint/model_3.ckpt"

In [20]:
model.build_model(ckpt_path)

Please call load_weights instead


AssertionError: None

In [23]:
latest = tf.train.latest_checkpoint(model.checkpoint_path)

In [24]:
model.load_weights(latest)

INFO:tensorflow:Restoring parameters from ../logs/20190712_MF/checkpoint/model_19.ckpt


In [18]:
model.fit(user_ids=dataset.user_ids, item_ids=dataset.item_ids)

train_dataset out of range
0.6104006 0.60147685
valid_dataset out of range
0.5779597 0.5658795
INFO:tensorflow:../logs/20190712_MF/checkpoint/model_0.ckpt is not in all_model_checkpoint_paths. Manually adding it.
train_dataset out of range
0.5574784 0.5436275
valid_dataset out of range
0.5403211 0.52565765
INFO:tensorflow:../logs/20190712_MF/checkpoint/model_1.ckpt is not in all_model_checkpoint_paths. Manually adding it.
train_dataset out of range
0.48243958 0.46333647
valid_dataset out of range
0.4436193 0.42003465
INFO:tensorflow:../logs/20190712_MF/checkpoint/model_2.ckpt is not in all_model_checkpoint_paths. Manually adding it.
train_dataset out of range
0.3879348 0.35938638
valid_dataset out of range
0.38026455 0.3503985
INFO:tensorflow:../logs/20190712_MF/checkpoint/model_3.ckpt is not in all_model_checkpoint_paths. Manually adding it.
train_dataset out of range
0.3691074 0.33654755
valid_dataset out of range
0.39172578 0.35916933
INFO:tensorflow:../logs/20190712_MF/checkpoint/m

In [25]:
item_factors = model.get_item_factors(item_ids=dataset.item_ids,
                                     normalize=True)

In [26]:
item_factors.shape

(100004, 10)

In [27]:
user_factors = model.get_user_factors(user_ids=dataset.user_ids[:10],
                                     normalize=True)

In [28]:
user_factors.shape

(10, 10)

In [29]:
predictions = model.predict(user_ids=dataset.user_ids,
                            item_ids=dataset.item_ids)

In [31]:
np.mean(predictions)

0.8686248044368201

The mean prediction on data where a click actually occurred is about 0.87 → reasonably accurate
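
The mean above is computed only on observed interactions, so a model that outputs 1 everywhere would score even higher. One possible complement is to compare each observed pair with a randomly sampled unobserved item for the same user, which gives an AUC-style number. The sketch below is self-contained: `sampled_auc` and `score_fn` are hypothetical names, and it assumes every user has at least one unobserved item.

```python
import numpy as np


def sampled_auc(score_fn, user_idx, item_idx, n_items, seed=0):
    """Fraction of (positive, random negative) pairs ranked correctly."""
    rng = np.random.default_rng(seed)
    seen = set(zip(user_idx.tolist(), item_idx.tolist()))
    wins = 0.0
    for u, i in zip(user_idx.tolist(), item_idx.tolist()):
        j = int(rng.integers(n_items))
        while (u, j) in seen:                # resample until the item is unobserved
            j = int(rng.integers(n_items))
        pos, neg = score_fn(u, i), score_fn(u, j)
        # A tie counts as half a win, as in the usual AUC definition.
        wins += 1.0 if pos > neg else 0.5 if pos == neg else 0.0
    return wins / len(user_idx)


users = np.array([0, 0, 1])
items = np.array([0, 1, 2])
n_items = 4

# A scorer that always answers 1.0 has a perfect mean on positives but no ranking power.
constant = sampled_auc(lambda u, i: 1.0, users, items, n_items)

# A scorer that knows the observed pairs exactly ranks every pair correctly.
table = np.zeros((2, n_items))
table[users, items] = 1.0
perfect = sampled_auc(lambda u, i: table[u, i], users, items, n_items)
```

To use it with the model here, `score_fn` would have to wrap the model's prediction for a single user and item index, which depends on an API not shown in this notebook.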