Commit

again readthedocs
David committed May 28, 2018
1 parent 2a98689 commit b5c16ff
Showing 1 changed file with 121 additions and 121 deletions.
242 changes: 121 additions & 121 deletions hpfrec/__init__.py
@@ -5,127 +5,127 @@
pd.options.mode.chained_assignment = None

class HPF:
"""
Hierarchical Poisson Factorization
Model for recommending items based on probabilistic Poisson factorization
on sparse count data (e.g. number of times a user played different songs),
using variational inference with coordinate-ascent.
Can use different stopping criteria for the optimization procedure:
1) Run for a fixed number of iterations (stop_crit='maxiter').
2) Calculate the log-likelihood every N iterations (stop_crit='train-llk' and check_every)
and stop once {1 - curr/prev} is below a certain threshold (stop_thr).
3) Calculate the log-likelihood in a user-provided validation set (stop_crit='val-llk', val_set and check_every)
and stop once {1 - curr/prev} is below a certain threshold. For this criterion, you might want to lower the
default threshold (see Note).
4) Check the difference in the user-factor matrix after every N iterations (stop_crit='diff-norm', check_every)
and stop once the *l2-norm* of this difference is below a certain threshold (stop_thr).
Note that this is *not a percent* difference as it is for log-likelihood criteria, so you should put a larger
value than the default here.
This is a much faster criterion to calculate and is recommended for larger datasets.
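As a minimal sketch of selecting each criterion (the constructor arguments follow the
signature shown below; passing val_set to .fit is an assumption based on the description
of criterion 3):

    from hpfrec import HPF

    # 1) run for a fixed number of iterations
    model = HPF(k=30, stop_crit='maxiter', maxiter=100)

    # 2) training log-likelihood, checked every 10 iterations
    model = HPF(k=30, stop_crit='train-llk', check_every=10, stop_thr=1e-3)

    # 3) log-likelihood on a validation set; 'val_set' as a .fit argument
    #    is an assumption here
    model = HPF(k=30, stop_crit='val-llk', check_every=10)

    # 4) l2-norm of the change in the user-factor matrix; note the larger threshold
    model = HPF(k=30, stop_crit='diff-norm', check_every=10, stop_thr=1e-2)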
If passing reindex=True, it will internally reindex all user and item IDs. Your data will not require
reindexing if the IDs for users and items in counts_df meet the following criteria:
1) Are all integers.
2) Start at zero.
3) Don't have any enumeration gaps, i.e. if there is a user '4', user '3' must also be there.
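For illustration, a counts_df that would not require reindexing could look like this
(the column names 'UserId', 'ItemId', 'Count' are an assumption about the expected input):

    import pandas as pd

    # integer IDs, starting at zero, with no gaps -> safe to pass reindex=False
    counts_df = pd.DataFrame({
        'UserId': [0, 0, 1, 2],
        'ItemId': [0, 1, 1, 2],
        'Count' : [3, 1, 2, 5]
    })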
If you only want to obtain the fitted parameters and use your own API later for recommendations,
you can pass produce_dicts=False and pass a folder where to save them in csv format (they are also
available as numpy arrays in this object's Theta and Beta attributes). Otherwise, the model
will create Python dictionaries with entries for each user and item, which can take quite a bit of
RAM memory. These are required for making predictions later through this package's API.
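A sketch of that workflow, continuing the counts_df example above (the save folder is a
placeholder path, not from the original docs):

    model = HPF(k=30, produce_dicts=False, save_folder='/tmp/hpf_params')
    model.fit(counts_df)
    Theta = model.Theta  # (nusers, k) user-factor matrix, also saved as csv
    Beta = model.Beta    # (nitems, k) item-factor matrix, also saved as csv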
Passing verbose=True will also print RMSE (root mean squared error) at each iteration.
For slightly better speed, pass verbose=False once you know what a good threshold should be
for your data.
Note
----
If 'check_every' is not None and stop_crit is not 'diff-norm', it will, every N iterations,
calculate the log-likelihood of the data. By default, this is the full likelihood, including a constant
that depends on the data but not on the parameters and which is quite slow to compute. It is
calculated this way by default because, without this constant, the number can turn positive
and break the likelihood-based stopping criteria. You can nevertheless turn this constant off
if you are confident that your likelihood values will not turn positive.
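For instance (the flag name full_llk is an assumption; the text above only says the
constant can be turned off, without naming the option):

    # compute the log-likelihood without the slow data-dependent constant;
    # 'full_llk' is a hypothetical flag name for the behavior described above
    model = HPF(k=30, stop_crit='train-llk', check_every=10, full_llk=False)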
If you pass a validation set, it will calculate the log-likelihood *of the non-zero observations
only*, rather than the complete likelihood that includes also the combinations of users and items
not present in the data (assumed to be zero), so it is more likely that you will see positive numbers here.
Compared to ALS, iterations of this algorithm are much faster to compute, so don't be afraid of
passing large values for maxiter.
In some unlucky cases, the parameters can become NaN in the first iteration, in which case you will see
odd values for the log-likelihood and RMSE. If this happens, try again with a different random seed.
Parameters
----------
k : int
Number of latent factors to use.
a : float
Shape parameter for the user-factor matrix.
a_prime : float
Shape parameter and dividend of the rate parameter for the user activity vector.
b_prime : float
Divisor of the rate parameter for the user activity vector.
c : float
Shape parameter for the item-factor matrix.
c_prime : float
Shape parameter and dividend of the rate parameter for the item popularity vector.
d_prime : float
Divisor of the rate parameter for the item popularity vector.
ncores : int
Number of cores to use to parallelize computations.
If set to -1, will use the maximum available on the computer.
stop_crit : str, one of 'maxiter', 'train-llk', 'val-llk', 'diff-norm'
Stopping criterion for the optimization procedure.
check_every : None or int
Calculate log-likelihood every N iterations.
stop_thr : float
Threshold for the proportional increase in log-likelihood, or for the l2-norm of the difference between successive user-factor matrices.
maxiter : int
Maximum number of iterations for which to run the optimization procedure.
reindex : bool
Whether to reindex data internally.
random_seed : int or None
Random seed to use when starting the parameters.
allow_inconsistent_math : bool
Whether to allow inconsistent floating-point math (producing slightly different results on each run)
which would allow parallelization of the updates for the shape parameters of Lambda and Gamma.
verbose : bool
Whether to print convergence messages.
keep_data : bool
Whether to keep information about which user was associated with each item
in the training set, so as to exclude those items later when making Top-N
recommendations.
save_folder : str or None
Folder where to save all model parameters as csv files.
produce_dicts : bool
Whether to produce Python dictionaries for users and items, which
are used by the prediction API of this package.
Attributes
----------
Theta : array (nusers, k)
User-factor matrix.
Beta : array (nitems, k)
Item-factor matrix.
user_mapping_ : array (nusers,)
ID of the user (as passed to .fit) of each row of Theta.
item_mapping_ : array (nitems,)
ID of the item (as passed to .fit) of each row of Beta.
user_dict_ : dict (nusers)
Dictionary with the mapping between user IDs (as passed to .fit) and rows of Theta.
item_dict_ : dict (nitems)
Dictionary with the mapping between item IDs (as passed to .fit) and rows of Beta.
is_fitted : bool
Whether the model has been fit to some data.
niter : int
Number of iterations for which the fitting procedure was run.
References
----------
[1] Gopalan, P., Hofman, J. M., & Blei, D. M. (2015). Scalable Recommendation with Hierarchical Poisson Factorization. Proceedings of the 31st Conference on Uncertainty in Artificial Intelligence (UAI).
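Example
-------
A minimal end-to-end sketch, assuming a counts_df as in the examples above
(illustrative values only):

    from hpfrec import HPF

    model = HPF(k=30, random_seed=1, verbose=False)
    model.fit(counts_df)
    assert model.is_fitted
    # row i of Theta corresponds to the original user ID user_mapping_[i]
    print(model.Theta.shape, model.Beta.shape, model.niter)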
"""
"""
Hierarchical Poisson Factorization
Model for recommending items based on probabilistic Poisson factorization
on sparse count data (e.g. number of times a user played different songs),
using variational inference with coordinate-ascent.
Can use different stopping criteria for the opimization procedure:
1) Run for a fixed number of iterations (stop_crit='maxiter').
2) Calculate the log-likelihood every N iterations (stop_crit='train-llk' and check_every)
and stop once {1 - curr/prev} is below a certain threshold (stop_thr)
3) Calculate the log-likelihood in a user-provided validation set (stop_crit='val-llk', val_set and check_every)
and stop once {1 - curr/prev} is below a certain threshold. For this criterion, you might want to lower the
default threshold (see Note).
4) Check the the difference in the user-factor matrix after every N iterations (stop_crit='diff-norm', check_every)
and stop once the *l2-norm* of this difference is below a certain threshold (stop_thr).
Note that this is *not a percent* difference as it is for log-likelihood criteria, so you should put a larger
value than the default here.
This is a much faster criterion to calculate and is recommended for larger datasets.
If passing reindex=True, it will internally reindex all user and item IDs. Your data will not require
reindexing if the IDs for users and items in counts_df meet the following criteria:
1) Are all integers.
2) Start at zero.
3) Don't have any enumeration gaps, i.e. if there is a user '4', user '3' must also be there.
If you only want to obtain the fitted parameters and use your own API later for recommendations,
you can pass produce_dicts=False and pass a folder where to save them in csv format (they are also
available as numpy arrays in this object's Theta and Beta attributes). Otherwise, the model
will create Python dictionaries with entries for each user and item, which can take quite a bit of
RAM memory. These are required for making predictions later through this package's API.
Passing verbose=True will also print RMSE (root mean squared error) at each iteration.
For slighly better speed pass verbose=False once you know what a good threshold should be
for your data.
Note
----
If 'check_every' is not None and stop_crit is not 'diff-norm', it will, every N iterations,
calculate the log-likelihood of the data. By default, this is the full likelihood, including a constant
that depends on the data but not on the parameters and which is quite slow to compute. The reason why
it's calculated by default like this is because, if not adding this constant, the number can turn positive
and will mess with the stopping criterion for likelihood. You can nevertheless choose to turn this constant off
if you are confident that your likelihood values will not get positive.
If you pass a validation set, it will calculate the log-likelihood *of the non-zero observations
only*, rather than the complete likelihood that includes also the combinations of users and items
not present in the data (assumed to be zero), thus it's more likely that you might see positive numbers here.
Compared to ALS, iterations from this algorithm are a lot faster to compute, so don't be scared about passing
large numbers for maxiter.
In some unlucky cases, the parameters will become NA in the first iteration, in which case you should see
weird values for log-likelihood and RMSE. If this happens, try again with a different random seed.
Parameters
----------
k : int
Number of latent factors to use.
a : float
Shape parameter for the user-factor matrix.
a_prime : float
Shape parameter and dividend of the rate parameter for the user activity vector.
b_prime : float
Divisor of the rate parameter for the user activity vector.
c : float
Shape parameter for the item-factor matrix.
c_prime : float
Shape parameter and dividend of the rate parameter for the item popularity vector.
d_prime : float
Divisor o the rate parameter for the item popularity vector.
ncores : int
Number of cores to use to parallelize computations.
If set to -1, will use the maximum available on the computer.
stop_crit : str, one of 'maxiter', 'train-llk', 'val-llk', 'diff-norm'
Stopping criterion for the optimization procedure.
check_every : None or int
Calculate log-likelihood every N iterations.
stop_thr : float
Threshold for proportion increase in log-likelihood or l2-norm for difference between matrices.
maxiter : int
Maximum number of iterations for which to run the optimization procedure.
reindex : bool
Whether to reindex data internally.
random_seed : int or None
Random seed to use when starting the parameters.
allow_inconsistent_math : bool
Whether to allow inconsistent floating-point math (producing slightly different results on each run)
which would allow parallelization of the updates for the shape parameters of Lambda and Gamma.
verbose : bool
Whether to print convergence messages.
keep_data : bool
Whether to keep information about which user was associated with each item
in the training set, so as to exclude those items later when making Top-N
recommendations.
save_folder : str or None
Folder where to save all model parameters as csv files.
produce_dicts : bool
Whether to produce Python dictionaries for users and items, which
are used by the prediction API of this package.
Attributes
----------
Theta : array (nusers, k)
User-factor matrix.
Beta : array (nitems, k)
Item-factor matrix.
user_mapping_ : array (nusers,)
ID of the user (as passed to .fit) of each row of Theta.
item_mapping_ : array (nitems,)
ID of the item (as passed to .fit) of each row of Beta.
user_dict_ : dict (nusers)
Dictionary with the mapping between user IDs (as passed to .fit) and rows of Theta.
item_dict_ : dict (nitems)
Dictionary with the mapping between item IDs (as passed to .fit) and rows of Beta.
is_fitted : bool
Whether the model has been fit to some data.
niter : int
Number of iterations for which the fitting procedure was run.
References
----------
[1] Scalable Recommendation with Hierarchical Poisson Factorization (P. Gopalan, 2015)
"""
def __init__(self, k=30, a=0.3, a_prime=0.3, b_prime=1.0,
c=0.3, c_prime=0.3, d_prime=1.0, ncores=-1,
stop_crit='train-llk', check_every=10, stop_thr=1e-3,