Commit

again readthedocs
David committed May 28, 2018
1 parent 2a98689 commit b5c16ff
Showing 1 changed file with 121 additions and 121 deletions.
242 changes: 121 additions & 121 deletions hpfrec/__init__.py
@@ -5,127 +5,127 @@
pd.options.mode.chained_assignment = None

class HPF:
"""
Hierarchical Poisson Factorization
Model for recommending items based on probabilistic Poisson factorization
on sparse count data (e.g. number of times a user played different songs),
using variational inference with coordinate-ascent.
Can use different stopping criteria for the optimization procedure:
1) Run for a fixed number of iterations (stop_crit='maxiter').
2) Calculate the log-likelihood every N iterations (stop_crit='train-llk' and check_every)
and stop once {1 - curr/prev} is below a certain threshold (stop_thr).
3) Calculate the log-likelihood in a user-provided validation set (stop_crit='val-llk', val_set and check_every)
and stop once {1 - curr/prev} is below a certain threshold. For this criterion, you might want to lower the
default threshold (see Note).
4) Check the difference in the user-factor matrix after every N iterations (stop_crit='diff-norm', check_every)
and stop once the *l2-norm* of this difference is below a certain threshold (stop_thr).
Note that this is *not a percent* difference as it is for log-likelihood criteria, so you should put a larger
value than the default here.
This is a much faster criterion to calculate and is recommended for larger datasets.
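As a minimal sketch of selecting each criterion (the constructor arguments follow the
signature shown below; passing val_set to .fit is an assumption based on the description
of criterion 3):

    from hpfrec import HPF

    # 1) run for a fixed number of iterations
    model = HPF(k=30, stop_crit='maxiter', maxiter=100)

    # 2) training log-likelihood, checked every 10 iterations
    model = HPF(k=30, stop_crit='train-llk', check_every=10, stop_thr=1e-3)

    # 3) log-likelihood on a validation set; 'val_set' as a .fit argument
    #    is an assumption here
    model = HPF(k=30, stop_crit='val-llk', check_every=10)

    # 4) l2-norm of the change in the user-factor matrix; note the larger threshold
    model = HPF(k=30, stop_crit='diff-norm', check_every=10, stop_thr=1e-2)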
If passing reindex=True, it will internally reindex all user and item IDs. Your data will not require
reindexing if the IDs for users and items in counts_df meet the following criteria:
1) Are all integers.
2) Start at zero.
3) Don't have any enumeration gaps, i.e. if there is a user '4', user '3' must also be there.
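For illustration, a counts_df that would not require reindexing could look like this
(the column names 'UserId', 'ItemId', 'Count' are an assumption about the expected input):

    import pandas as pd

    # integer IDs, starting at zero, with no gaps -> safe to pass reindex=False
    counts_df = pd.DataFrame({
        'UserId': [0, 0, 1, 2],
        'ItemId': [0, 1, 1, 2],
        'Count' : [3, 1, 2, 5]
    })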
If you only want to obtain the fitted parameters and use your own API later for recommendations,
you can pass produce_dicts=False and pass a folder where to save them in csv format (they are also
available as numpy arrays in this object's Theta and Beta attributes). Otherwise, the model
will create Python dictionaries with entries for each user and item, which can take quite a bit of
RAM memory. These are required for making predictions later through this package's API.
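A sketch of that workflow, continuing the counts_df example above (the save folder is a
placeholder path, not from the original docs):

    model = HPF(k=30, produce_dicts=False, save_folder='/tmp/hpf_params')
    model.fit(counts_df)
    Theta = model.Theta  # (nusers, k) user-factor matrix, also saved as csv
    Beta = model.Beta    # (nitems, k) item-factor matrix, also saved as csv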
Passing verbose=True will also print RMSE (root mean squared error) at each iteration.
For slightly better speed, pass verbose=False once you know what a good threshold should be
for your data.
Note
----
If 'check_every' is not None and stop_crit is not 'diff-norm', it will, every N iterations,
calculate the log-likelihood of the data. By default, this is the full likelihood, including a constant
that depends on the data but not on the parameters and which is quite slow to compute. It is
calculated this way by default because, without this constant, the number can turn positive
and break the likelihood-based stopping criteria. You can nevertheless turn this constant off
if you are confident that your likelihood values will not turn positive.
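For instance (the flag name full_llk is an assumption; the text above only says the
constant can be turned off, without naming the option):

    # compute the log-likelihood without the slow data-dependent constant;
    # 'full_llk' is a hypothetical flag name for the behavior described above
    model = HPF(k=30, stop_crit='train-llk', check_every=10, full_llk=False)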
If you pass a validation set, it will calculate the log-likelihood *of the non-zero observations
only*, rather than the complete likelihood that includes also the combinations of users and items
not present in the data (assumed to be zero), so it is more likely that you will see positive numbers here.
Compared to ALS, iterations of this algorithm are much faster to compute, so don't be afraid of
passing large values for maxiter.
In some unlucky cases, the parameters can become NaN in the first iteration, in which case you will see
odd values for the log-likelihood and RMSE. If this happens, try again with a different random seed.
Parameters
----------
k : int
Number of latent factors to use.
a : float
Shape parameter for the user-factor matrix.
a_prime : float
Shape parameter and dividend of the rate parameter for the user activity vector.
b_prime : float
Divisor of the rate parameter for the user activity vector.
c : float
Shape parameter for the item-factor matrix.
c_prime : float
Shape parameter and dividend of the rate parameter for the item popularity vector.
d_prime : float
Divisor of the rate parameter for the item popularity vector.
ncores : int
Number of cores to use to parallelize computations.
If set to -1, will use the maximum available on the computer.
stop_crit : str, one of 'maxiter', 'train-llk', 'val-llk', 'diff-norm'
Stopping criterion for the optimization procedure.
check_every : None or int
Calculate log-likelihood every N iterations.
stop_thr : float
Threshold for the proportional increase in log-likelihood, or for the l2-norm of the difference between successive user-factor matrices.
maxiter : int
Maximum number of iterations for which to run the optimization procedure.
reindex : bool
Whether to reindex data internally.
random_seed : int or None
Random seed to use when starting the parameters.
allow_inconsistent_math : bool
Whether to allow inconsistent floating-point math (producing slightly different results on each run)
which would allow parallelization of the updates for the shape parameters of Lambda and Gamma.
verbose : bool
Whether to print convergence messages.
keep_data : bool
Whether to keep information about which user was associated with each item
in the training set, so as to exclude those items later when making Top-N
recommendations.
save_folder : str or None
Folder where to save all model parameters as csv files.
produce_dicts : bool
Whether to produce Python dictionaries for users and items, which
are used by the prediction API of this package.
Attributes
----------
Theta : array (nusers, k)
User-factor matrix.
Beta : array (nitems, k)
Item-factor matrix.
user_mapping_ : array (nusers,)
ID of the user (as passed to .fit) of each row of Theta.
item_mapping_ : array (nitems,)
ID of the item (as passed to .fit) of each row of Beta.
user_dict_ : dict (nusers)
Dictionary with the mapping between user IDs (as passed to .fit) and rows of Theta.
item_dict_ : dict (nitems)
Dictionary with the mapping between item IDs (as passed to .fit) and rows of Beta.
is_fitted : bool
Whether the model has been fit to some data.
niter : int
Number of iterations for which the fitting procedure was run.
References
----------
[1] Gopalan, P., Hofman, J. M., & Blei, D. M. (2015). Scalable Recommendation with Hierarchical Poisson Factorization. Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence (UAI).
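Example
-------
A minimal end-to-end sketch, assuming a counts_df as in the examples above
(illustrative values only):

    from hpfrec import HPF

    model = HPF(k=30, random_seed=1, verbose=False)
    model.fit(counts_df)
    assert model.is_fitted
    # row i of Theta corresponds to the original user ID user_mapping_[i]
    print(model.Theta.shape, model.Beta.shape, model.niter)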
"""
"""
Hierarchical Poisson Factorization
Model for recommending items based on probabilistic Poisson factorization
on sparse count data (e.g. number of times a user played different songs),
using variational inference with coordinate-ascent.
Can use different stopping criteria for the opimization procedure:
1) Run for a fixed number of iterations (stop_crit='maxiter').
2) Calculate the log-likelihood every N iterations (stop_crit='train-llk' and check_every)
and stop once {1 - curr/prev} is below a certain threshold (stop_thr)
3) Calculate the log-likelihood in a user-provided validation set (stop_crit='val-llk', val_set and check_every)
and stop once {1 - curr/prev} is below a certain threshold. For this criterion, you might want to lower the
default threshold (see Note).
4) Check the the difference in the user-factor matrix after every N iterations (stop_crit='diff-norm', check_every)
and stop once the *l2-norm* of this difference is below a certain threshold (stop_thr).
Note that this is *not a percent* difference as it is for log-likelihood criteria, so you should put a larger
value than the default here.
This is a much faster criterion to calculate and is recommended for larger datasets.
If passing reindex=True, it will internally reindex all user and item IDs. Your data will not require
reindexing if the IDs for users and items in counts_df meet the following criteria:
1) Are all integers.
2) Start at zero.
3) Don't have any enumeration gaps, i.e. if there is a user '4', user '3' must also be there.
If you only want to obtain the fitted parameters and use your own API later for recommendations,
you can pass produce_dicts=False and pass a folder where to save them in csv format (they are also
available as numpy arrays in this object's Theta and Beta attributes). Otherwise, the model
will create Python dictionaries with entries for each user and item, which can take quite a bit of
RAM memory. These are required for making predictions later through this package's API.
Passing verbose=True will also print RMSE (root mean squared error) at each iteration.
For slighly better speed pass verbose=False once you know what a good threshold should be
for your data.
Note
----
If 'check_every' is not None and stop_crit is not 'diff-norm', it will, every N iterations,
calculate the log-likelihood of the data. By default, this is the full likelihood, including a constant
that depends on the data but not on the parameters and which is quite slow to compute. The reason why
it's calculated by default like this is because, if not adding this constant, the number can turn positive
and will mess with the stopping criterion for likelihood. You can nevertheless choose to turn this constant off
if you are confident that your likelihood values will not get positive.
If you pass a validation set, it will calculate the log-likelihood *of the non-zero observations
only*, rather than the complete likelihood that includes also the combinations of users and items
not present in the data (assumed to be zero), thus it's more likely that you might see positive numbers here.
Compared to ALS, iterations from this algorithm are a lot faster to compute, so don't be scared about passing
large numbers for maxiter.
In some unlucky cases, the parameters will become NA in the first iteration, in which case you should see
weird values for log-likelihood and RMSE. If this happens, try again with a different random seed.
Parameters
----------
k : int
Number of latent factors to use.
a : float
Shape parameter for the user-factor matrix.
a_prime : float
Shape parameter and dividend of the rate parameter for the user activity vector.
b_prime : float
Divisor of the rate parameter for the user activity vector.
c : float
Shape parameter for the item-factor matrix.
c_prime : float
Shape parameter and dividend of the rate parameter for the item popularity vector.
d_prime : float
Divisor o the rate parameter for the item popularity vector.
ncores : int
Number of cores to use to parallelize computations.
If set to -1, will use the maximum available on the computer.
stop_crit : str, one of 'maxiter', 'train-llk', 'val-llk', 'diff-norm'
Stopping criterion for the optimization procedure.
check_every : None or int
Calculate log-likelihood every N iterations.
stop_thr : float
Threshold for proportion increase in log-likelihood or l2-norm for difference between matrices.
maxiter : int
Maximum number of iterations for which to run the optimization procedure.
reindex : bool
Whether to reindex data internally.
random_seed : int or None
Random seed to use when starting the parameters.
allow_inconsistent_math : bool
Whether to allow inconsistent floating-point math (producing slightly different results on each run)
which would allow parallelization of the updates for the shape parameters of Lambda and Gamma.
verbose : bool
Whether to print convergence messages.
keep_data : bool
Whether to keep information about which user was associated with each item
in the training set, so as to exclude those items later when making Top-N
recommendations.
save_folder : str or None
Folder where to save all model parameters as csv files.
produce_dicts : bool
Whether to produce Python dictionaries for users and items, which
are used by the prediction API of this package.
Attributes
----------
Theta : array (nusers, k)
User-factor matrix.
Beta : array (nitems, k)
Item-factor matrix.
user_mapping_ : array (nusers,)
ID of the user (as passed to .fit) of each row of Theta.
item_mapping_ : array (nitems,)
ID of the item (as passed to .fit) of each row of Beta.
user_dict_ : dict (nusers)
Dictionary with the mapping between user IDs (as passed to .fit) and rows of Theta.
item_dict_ : dict (nitems)
Dictionary with the mapping between item IDs (as passed to .fit) and rows of Beta.
is_fitted : bool
Whether the model has been fit to some data.
niter : int
Number of iterations for which the fitting procedure was run.
References
----------
[1] Scalable Recommendation with Hierarchical Poisson Factorization (P. Gopalan, 2015)
"""
def __init__(self, k=30, a=0.3, a_prime=0.3, b_prime=1.0,
c=0.3, c_prime=0.3, d_prime=1.0, ncores=-1,
stop_crit='train-llk', check_every=10, stop_thr=1e-3,