
Merge branch 'master' of https://beechung@github.com/beechung/Latent-Factor-Models.git

Conflicts:
	doc/tutorial.pdf
Commit a1a1d1601d0f656f63a9b1923094b7e475ac34ca (2 parents: 77a115f + 61829ec), committed by Bee-Chung Chen on Jan 7, 2012
Showing with 44 additions and 20 deletions.
  1. +24 −6 doc/quick-start.tex
  2. BIN doc/tutorial.pdf
  3. +17 −13 src/R/BST.R
  4. +3 −1 src/unit-test/multicontext_model/regression-test-0.R
doc/quick-start.tex
@@ -7,7 +7,7 @@ \subsection{Quick Start}
\parahead{Step 1}
Read the training and test observation tables ({\tt obs.train} and {\tt obs.test}), their corresponding observation feature tables ({\tt x\_obs.train} and {\tt x\_obs.test}), the source feature table ({\tt x\_src}), the destination feature table ({\tt x\_dst}) and the edge context feature table ({\tt x\_ctx}) from the corresponding files. Note that if you replace these tables with your own data, you must not change the column names. Assuming we use the dense format of the feature files, the sample code is as follows:
{\small\begin{verbatim}
-input.dir = "test-data/multicontext_model/simulated-mtx-uvw-10K"
+>input.dir = "test-data/multicontext_model/simulated-mtx-uvw-10K"
obs.train = read.table(paste(input.dir,"/obs-train.txt",sep=""),
sep="\t", header=FALSE, as.is=TRUE);
names(obs.train) = c("src_id", "dst_id", "src_context",
@@ -38,8 +38,8 @@ \subsection{Quick Start}
\end{verbatim}}
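The remaining input tables are read with the same {\tt read.table} pattern; below is a minimal sketch that continues the session above. The file name {\tt obs-test.txt} is an assumption by analogy with {\tt obs-train.txt}; {\tt dense-feature-ctxt.txt} and the {\tt ctx\_id} column-name convention follow the unit-test script.
{\small\begin{verbatim}
obs.test = read.table(paste(input.dir,"/obs-test.txt",sep=""),
                      sep="\t", header=FALSE, as.is=TRUE);
names(obs.test) = names(obs.train);   # same columns as the training table
x_ctx = read.table(paste(input.dir,"/dense-feature-ctxt.txt",sep=""),
                   sep="\t", header=FALSE, as.is=TRUE);
names(x_ctx)[1] = "ctx_id";           # first column must be the context ID
\end{verbatim}}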
Then we run a simple latent factor model using the following command
{\small\begin{verbatim}
->ans = fit.bst(obs.train=obs.train, obs.test=obs.test, x.obs.train=x.obs.train,
- x.obs.test=x.obs.test, x_src=x_src, x_dst=x_dst, x_ctx=x_ctx,
+>ans = fit.bst(obs.train=obs.train, obs.test=obs.test, x_obs.train=x_obs.train,
+ x_obs.test=x_obs.test, x_src=x_src, x_dst=x_dst, x_ctx=x_ctx,
out.dir = "/tmp/unit-test/simulated-mtx-uvw-10K",
model.name="uvw", nFactors=3, nIter=10);
\end{verbatim}}
@@ -48,11 +48,12 @@ \subsection{Quick Start}
\parahead{Step 3}
Once Step 2 is finished, we have the predicted values of the response variable $y$, since the test data were given as input to the function. (If we do not have test data, we can simply omit the {\tt obs.test} and {\tt x\_obs.test} options, and the final output will only contain the model parameters, without prediction results.) Check the file {\tt prediction} inside the output directory (in our example, the file is {\tt /tmp/unit-test/simulated-mtx-uvw-10K\_uvw/prediction}). The file has two columns: the original observed $y$ and the predicted $y$ (the {\tt pred\_y} column). Standard metrics such as the complete-data log-likelihood and RMSE (root mean squared error) are written to the file {\tt summary}. Check Section \ref{sec:output} for more details.
+Please note that the predicted values of $y$ for model {\tt uvw} can also be found in {\tt ans\$pred.y\$uvw}.
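For example, the RMSE can be recomputed from the {\tt prediction} file; the following is a minimal sketch, assuming the file has a header row and that the observed column is named {\tt y} (matching the {\tt pred\_y} naming above).
{\small\begin{verbatim}
pred = read.table("/tmp/unit-test/simulated-mtx-uvw-10K_uvw/prediction",
                  header=TRUE);  # header and column names are assumptions
rmse = sqrt(mean((pred$y - pred$pred_y)^2));
\end{verbatim}}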
\parahead{Run multiple models simultaneously}
We can actually run multiple BST models simultaneously using the following command
{\small\begin{verbatim}
->ans = fit.bst(obs.train=obs.train, obs.test=obs.test, x.obs.train=x.obs.train,
- x.obs.test=x.obs.test, x_src=x_src, x_dst=x_dst, x_ctx=x_ctx,
+>ans = fit.bst(obs.train=obs.train, obs.test=obs.test, x_obs.train=x_obs.train,
+ x_obs.test=x_obs.test, x_src=x_src, x_dst=x_dst, x_ctx=x_ctx,
out.dir = "/tmp/unit-test/simulated-mtx-uvw-10K",
model.name=c("uvw1", "uvw2"), nFactors=c(1,2), nIter=10);
\end{verbatim}}
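Each model then has its own entry in the returned object and its own output directory. The following minimal sketch extrapolates from the single-model case above; the per-model structure of {\tt ans\$pred.y} and the exact directory names are assumptions.
{\small\begin{verbatim}
pred.y.uvw1 = ans$pred.y$uvw1;  # predictions of model uvw1
pred.y.uvw2 = ans$pred.y$uvw2;  # predictions of model uvw2
# per-model output, following the <out.dir>_<model.name> pattern of Step 3:
#   /tmp/unit-test/simulated-mtx-uvw-10K_uvw1/prediction
#   /tmp/unit-test/simulated-mtx-uvw-10K_uvw2/prediction
\end{verbatim}}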
@@ -69,7 +70,24 @@ \subsection{Quick Start}
\item {\tt nIter} specifies the number of EM iterations. All the models share the same number of iterations.
\item {\tt nSamplesPerIter} specifies the number of Gibbs samples per E-step. It can be either a scalar, meaning every EM iteration uses the same {\tt nSamplesPerIter}, or a vector of length {\tt nIter}, i.e., each EM iteration has its own value of {\tt nSamplesPerIter}. Note that all models share the same {\tt nSamplesPerIter}.
\item {\tt is.logistic} specifies whether to use a logistic link function for models on binary response data. The default is FALSE. It can be either a single Boolean value shared by all models, or a vector of Boolean values with length equal to the number of models (see the sketch after this list).
-\item {\tt src.dst.same} specifies whether the source and destination nodes are actually the same. If they are, $\left<\bm{u}_i, \bm{v}_j, \bm{w}_k\right>$ in the model specified in Eq~\ref{eq:uvw-model} will be replaced by $\left<\bm{v}_i, \bm{v}_j, \bm{w}_k\right>$. Default is FALSE.
+\item {\tt src.dst.same}: Whether source nodes and destination nodes refer to the same set of entities. For example, if source nodes represent users and destination nodes represent items, {\tt src.dst.same} should be set to {\tt FALSE}. However, if both source and destination nodes represent users (e.g., users rate other users) and ${\tt src\_id} = A$ refers to the same user $A$ as ${\tt dst\_id} = A$, then {\tt src.dst.same} should be set to {\tt TRUE}. In terms of modeling, when {\tt src.dst.same} is TRUE, $\left<\bm{u}_i, \bm{v}_j, \bm{w}_k\right>$ in the model specified in Eq~\ref{eq:uvw-model} is replaced by $\left<\bm{v}_i, \bm{v}_j, \bm{w}_k\right>$. The default of {\tt src.dst.same} is FALSE.
\item {\tt control} has a list of more advanced parameters that will be introduced later.
\end{itemize}
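To make the scalar and vector forms of these options concrete, here is a minimal sketch of the multi-model call above with {\tt nSamplesPerIter}, {\tt is.logistic} and {\tt src.dst.same} spelled out; the values are only for illustration (the simulated data are not binary, so {\tt is.logistic} stays FALSE for both models).
{\small\begin{verbatim}
ans = fit.bst(obs.train=obs.train, obs.test=obs.test,
      x_obs.train=x_obs.train, x_obs.test=x_obs.test,
      x_src=x_src, x_dst=x_dst, x_ctx=x_ctx,
      out.dir = "/tmp/unit-test/simulated-mtx-uvw-10K",
      model.name=c("uvw1", "uvw2"), nFactors=c(1,2),
      nIter=10, nSamplesPerIter=100,
      is.logistic=c(FALSE, FALSE),  # one value per model
      src.dst.same=FALSE);
\end{verbatim}}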
+\parahead{Advanced parameters} {\tt control=fit.bst.control(...)} contains the following advanced parameters (a short usage sketch follows the list):
+\begin{itemize}
+\item {\tt rm.self.link}: Whether to remove self-edges. If {\tt src.dst.same=TRUE}, you can choose to remove observations with ${\tt src\_id} = {\tt dst\_id}$ by setting {\tt rm.self.link=TRUE}. Otherwise, {\tt rm.self.link} should be set to {\tt FALSE}. The default of {\tt rm.self.link} is FALSE.
+\item {\tt add.intercept}: Whether you want to add an intercept to each feature matrix. If {\tt add.intercept=TRUE}, a column of all 1s will be added to every feature matrix. The default of {\tt add.intercept} is TRUE.
+\item {\tt has.gamma} specifies whether to include $\gamma_k$ in the model specified in Eq~\ref{eq:uvw-model}. If {\tt has.gamma=FALSE}, $\gamma_k$ is disabled and removed from the model. By default, {\tt has.gamma} is set to FALSE unless the training response data {\tt obs.train} have no source or destination context but do have edge context.
+\item {\tt reg.algo} and {\tt reg.control} specify how the regression priors are fitted. If they are set to {\tt NULL} (default), R's basic linear regression function {\tt lm} is used to fit the prior regression coefficients $\bm{g}, \bm{d}, \bm{h}, G, D$ and $H$. Currently, we only support two other algorithms, {\tt "GLMNet"} and {\tt "RandomForest"}; therefore, {\tt reg.algo} can only take three values: NULL, {\tt "GLMNet"} and {\tt "RandomForest"} (the latter two are strings). Notice that if {\tt "RandomForest"} is used, the regression priors become nonlinear; see~\cite{gmf:recsys11} for more information.
+\item {\tt nBurnin} is the number of burn-in samples per E-step. The default is 10\% of {\tt nSamplesPerIter}.
+\item {\tt init.params} is a list of the initial values of all the variance component parameters. The default value of {\tt init.params} is
+{\small\begin{verbatim}
+init.params=list(var_alpha=1, var_beta=1, var_gamma=1,
+ var_u=1, var_v=1, var_w=1, var_y=NULL,
+ relative.to.var_y=FALSE, var_alpha_global=1,
+ var_beta_global=1)
+\end{verbatim}}
+The details of these parameters can be found in Section \ref{}.
+\item {\tt random.seed} is the random seed for the model fitting procedure.
+\end{itemize}
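As noted above, these advanced parameters are passed through {\tt control=fit.bst.control(...)}. The following is a minimal sketch with illustrative values; only parameters listed above are used, and the chosen values are assumptions for demonstration, not recommendations.
{\small\begin{verbatim}
ans = fit.bst(obs.train=obs.train, obs.test=obs.test,
      x_obs.train=x_obs.train, x_obs.test=x_obs.test,
      x_src=x_src, x_dst=x_dst, x_ctx=x_ctx,
      out.dir = "/tmp/unit-test/simulated-mtx-uvw-10K",
      model.name="uvw", nFactors=3, nIter=10,
      control=fit.bst.control(rm.self.link=FALSE, add.intercept=TRUE,
                              has.gamma=FALSE, nBurnin=20,
                              random.seed=1));
\end{verbatim}}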
doc/tutorial.pdf: binary file not shown.
src/R/BST.R
@@ -6,13 +6,13 @@ fit.bst <- function(
code.dir = "", # The top-level directory of where code get installed, "" if you are in that directory
obs.train, # The training response data
obs.test = NULL, # The testing response data
- x.obs.train = NULL, # The data of training observation features
- x.obs.test = NULL, # The data of testing observation features
+ x_obs.train = NULL, # The data of training observation features
+ x_obs.test = NULL, # The data of testing observation features
x_src = NULL, # The data of context features for source nodes
x_dst = NULL, # The data of context features for destination nodes
x_ctx = NULL, # The data of context features for edges
out.dir = "", # The directory of output files
- model.name = "model", #The name of the model, can be any string or vector of strings
+ model.name = "model", #The name of the model, can be any string or a vector of strings
nFactors, # Number of factors, can be any positive integer or vector of positive integers with length=length(model.name)
nIter = 20, # Number of EM iterations
nSamplesPerIter = 200, # Number of Gibbs samples per E step, can be a vector of numbers with length=nIter
@@ -25,6 +25,9 @@ fit.bst <- function(
if (code.dir!="") code.dir = sprintf("%s/",code.dir);
#if (out.dir!="") out.dir = sprintf("%s/",out.dir);
+ if (floor(nIter)!=nIter || nIter<=0 || length(nIter)>1) stop("nIter must be a positive integer scalar!");
+ if (floor(nSamplesPerIter)!=nSamplesPerIter || nSamplesPerIter<=0 || length(nSamplesPerIter)>1) stop("nSamplesPerIter must be a positive integer scalar!");
+
# Load all the required libraries and source code
if (class(try(load.code(code.dir)))=="try-error") stop("Wrong code.dir. Please double check where the code is installed.");
@@ -33,10 +36,11 @@ fit.bst <- function(
if (!is.null(obs.test)) {
if (is.null(obs.test$src_id) || is.null(obs.test$dst_id) || is.null(obs.test$y)) stop("obs.test must have src_id, dst_id, and response y");
}
- names(x_src)[1] = "src_id";
- names(x_dst)[1] = "dst_id";
- names(x_ctx)[1] = "ctx_id";
-
+ if (is.null(x_obs.train) && !is.null(x_obs.test)) stop("x_obs.train does not exist while x_obs.test is used!");
+ if (is.null(x_obs.test) && !is.null(x_obs.train)) stop("x_obs.test does not exist while x_obs.train is used!");
+ if (!is.null(x_obs.train) && !is.null(x_obs.test)) {
+ if (ncol(x_obs.train)!=ncol(x_obs.test)) stop("ncol(x_obs.train)!=ncol(x_obs.test)! The number of features for training and test should be exactly the same!");
+ }
# Index data: Put the input data into the right form
# Convert IDs into numeric indices and
# Convert some data frames into matrices
@@ -61,7 +65,7 @@ fit.bst <- function(
# Model Settings
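# Default for has.gamma (when not set by the user): FALSE, unless only
# edge-context features (x_ctx) are provided without x_src and x_dst.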
if (is.null(control$has.gamma)) {
control$has.gamma = FALSE;
- if (is.null(obs.train$src_context) && is.null(obs.train$dst_context) && !is.null(obs.train$ctx_id)) control$has.gamma = TRUE;
+ if (is.null(x_src) && is.null(x_dst) && !is.null(x_ctx)) control$has.gamma = TRUE;
}
if (length(nFactors)==1) nFactors = rep(nFactors, length(model.name));
if (length(control$has.gamma)==1) control$has.gamma = rep(control$has.gamma,length(model.name));
@@ -72,13 +76,15 @@ fit.bst <- function(
for (i in 1:length(is.logistic))
{
if (is.logistic[i]!=0 && is.logistic[i]!=1) stop("is.logistic must be boolean!");
+ if (is.logistic[i] && length(which(obs.train$y!=0 & obs.train$y!=1))>0) stop("Logistic link function should not be used for non-binary training data! Please set is.logistic=F");
+ if (is.logistic[i] && length(which(obs.test$y!=0 & obs.test$y!=1))>0) stop("Logistic link function should not be used for non-binary test data! Please set is.logistic=F");
}
setting = data.frame(
name = model.name,
nFactors = nFactors, # number of interaction factors
has.u = rep(!src.dst.same,length(model.name)), # whether to use u_i' v_j or v_i' v_j
- has.gamma = control$has.gamma, # just set to F
+ has.gamma = control$has.gamma,
nLocalFactors = rep(0,length(model.name)), # just set to 0
is.logistic = is.logistic # whether to use the logistic model for binary rating
);
@@ -156,8 +162,6 @@ load.code <- function(code.dir)
source(sprintf("%ssrc/R/model/multicontext_model_utils.R",code.dir));
source(sprintf("%ssrc/R/model/multicontext_model_MStep.R",code.dir));
source(sprintf("%ssrc/R/model/multicontext_model_EM.R",code.dir));
- #source(sprintf("%ssrc/R/model/GLMNet.R",code.dir));
- #source(sprintf("%ssrc/R/model/RandomForest.R",code.dir));
}
fit.bst.control <- function (
@@ -187,7 +191,7 @@ fit.bst.control <- function (
if (reg.algo!="GLMNet" && reg.algo!="RandomForest") stop("reg.algo must be NULL, GLMNet, or RandomForest. Make sure they are strings.");
}
if (!is.null(nBurnin)) {
- if (nBurnin<0 || !is.integer(nBurnin)) stop("nBurnin must be a positive integer");
+ if (nBurnin<0 || floor(nBurnin)!=nBurnin || length(nBurnin)>1) stop("nBurnin must be a positive integer");
}
list(rm.self.link=rm.self.link,add.intercept=add.intercept, has.gamma=has.gamma, reg.algo=reg.algo, reg.control=reg.control, nBurnin=nBurnin, init.params=init.params, random.seed=random.seed)
-}
+}
src/unit-test/multicontext_model/regression-test-0.R
@@ -19,8 +19,10 @@ x_ctx = read.table(paste(input.dir,"/dense-feature-ctxt.txt",sep=""), sep="\t",
names(x_ctx)[1] = "ctx_id";
# (2) Call BST
-ans = fit.bst(obs.train=obs.train, obs.test=obs.test, x.obs.train=x.obs.train, x.obs.test=x.obs.test, x_src=x_src, x_dst=x_dst, x_ctx=x_ctx,
+ans = fit.bst(obs.train=obs.train, obs.test=obs.test, x_obs.train=x_obs.train, x_obs.test=x_obs.test, x_src=x_src, x_dst=x_dst, x_ctx=x_ctx,
out.dir = "/tmp/unit-test/simulated-mtx-uvw-10K", model.name=c("uvw1", "uvw2"), nFactors=c(1,2), nIter=10);
+#ans = fit.bst(obs.train=obs.train, x_obs.train=x_obs.train, x_src=x_src, x_dst=x_dst, x_ctx=x_ctx,
+# out.dir = "/tmp/unit-test/simulated-mtx-uvw-10K", model.name=c("uvw1", "uvw2"), nFactors=c(1,2), nIter=10);
# (3) Compare to the reference run
warnings()
