uploaded

david-cortes · May 28, 2018 · c7bb8af
1 parent 4ace5f5
commit c7bb8af
Showing 14 changed files with 3,069 additions and 0 deletions.
diff --git a/MANIFEST.in b/MANIFEST.in
@@ -0,0 +1,2 @@
+global-include *.pyx
+global-include *.pxd 
diff --git a/README.md b/README.md
@@ -0,0 +1,130 @@
+# Hierarchical Poisson Factorization
+
+This is a Python package for hierarchical Poisson factorization, a form of probabilistic matrix factorization used for recommender systems with implicit count data, based on the paper _Scalable Recommendation with Hierarchical Poisson Factorization (P. Gopalan, 2015)_.
+
+Supports parallelization (through OpenMP) and different stopping criteria for the coordinate-ascent procedure. The bottleneck computations are written in fast Cython code.
+
+## Model description
+
+The model produces a non-negative low-rank factorization `Y ~= UV'` of a matrix of count data (such as the number of times each user played each song in some internet service), where the counts are assumed to come from the following generative model:
+```
+ksi_u ~ Gamma(a_prime, a_prime/b_prime)
+Theta_uk ~ Gamma(a, ksi_u)
+
+eta_i ~ Gamma(c_prime, c_prime/d_prime)
+Beta_ik ~ Gamma(c, eta_i)
+
+Y_ui ~ Poisson(Theta_u' Beta_i)
+```
+The parameters are fit using mean-field approximation (a form of Bayesian variational inference) with coordinate ascent (updating each parameter separately until convergence).
+
+## Why is it more efficient
+
+In typical settings for recommendations with implicit data, most users only ever see/click/play/buy a handful of items out of the whole catalog, so the matrix of user-item interactions is extremely sparse (most entries are zero). Algorithms like implicit-ALS or BPR (Bayesian personalized ranking) require iterating over some or all of the missing combinations not seen in the data (e.g. songs not played by each user) in order to compute their respective loss functions, which is slow and does not scale well.
+
+However, Poisson likelihood is given by the formula:
+```L(y) = yhat^y * exp(-yhat) / y!```
+
+Taking the logarithm (log-likelihood), this becomes:
+```l(y) = -log(y!) + y*log(yhat) - yhat```
+
+Since `log(0!) = 0` and `y*log(yhat) = 0` when `y = 0`, a zero entry contributes only `-yhat` to the log-likelihood. The sum of predictions over all combinations of users and items can be quickly calculated as `sum yhat = sum_{i,j} <U_i, V_j> = <sum_i U_i, sum_j V_j>` (`U` and `V` are non-negative matrices), so the model never needs to perform calculations on entries that are equal to zero: leaving them out of the data implicitly treats them as zeros.
+
+Moreover, negative Poisson log-likelihood is a more appropriate loss for count data than squared loss, which tends to produce poor results when the values to predict follow an exponential rather than a normal distribution.
+
+## Installation
+
+The package is available on PyPI and can be installed with:
+```pip install hpfrec```
+
+As it contains Cython code, it requires a C compiler. On Windows, this usually means a Visual Studio installation (or MinGW + GCC). If using Anaconda, you might also need to configure it to use said Visual Studio instead of MinGW, otherwise the installation from `pip` might fail. For more details see this guide:
+[Cython Extensions On Windows](https://github.com/cython/cython/wiki/CythonExtensionsOnWindows)
+
+On Python 2.7 on Windows, it might additionally require installing extra Visual Basic modules.
+
+On Linux and Mac, the `pip` install should work out-of-the-box, as long as the system has `gcc` (included by default in most installs).
+
+## Sample usage
+
+```python
+import pandas as pd, numpy as np
+from hpfrec import HPF
+
+## Generating sample counts data
+nusers = 10**2
+nitems = 10**2
+nobs = 10**4
+counts_df = pd.DataFrame({
+    'UserId' : np.random.randint(nusers, size=nobs),
+    'ItemId' : np.random.randint(nitems, size=nobs),
+    'Count' : np.random.gamma(1,1, size=nobs)
+    })
+
+## Initializing the model object
+recommender = HPF()
+
+## Full function call
+recommender = HPF(k=20,
+                  a=.3, a_prime=.3, b_prime=1.0,
+                  c=.3, c_prime=.3, d_prime=1.0,
+                  ncores=-1, stop_crit='train-llk', check_every=10, stop_thr=1e-3,
+                  maxiter=100, reindex=True, random_seed=None,
+                  allow_inconsistent_math=False, verbose=True, full_llk=True,
+                  keep_data=True, save_folder=None, produce_dicts=True
+                  )
+
+## Fitting to the data
+recommender.fit(counts_df)
+
+## Fitting the model while monitoring a validation set
+recommender = HPF(k=20,
+                  a=.3, a_prime=.3, b_prime=1.0,
+                  c=.3, c_prime=.3, d_prime=1.0,
+                  ncores=-1, stop_crit='val-llk', check_every=10, stop_thr=1e-3,
+                  maxiter=100, reindex=True, random_seed=None,
+                  allow_inconsistent_math=False, verbose=True, full_llk=True,
+                  keep_data=True, save_folder=None, produce_dicts=True
+                  )
+recommender.fit(counts_df, val_set=counts_df.sample(10**3))
+## Note: a real validation set should NEVER be a subset of the training set
+
+## Making predictions
+recommender.topN(user=10, n=10, exclude_seen=True)
+recommender.topN(user=10, n=10, exclude_seen=False, items_pool=np.array([1,2,3,4,5,6,100]))
+recommender.predict(user=10, item=11)
+recommender.predict(user=[10,10,10], item=[1,2,3])
+recommender.predict(user=[10,11,12], item=[4,5,6])
+
+## Evaluating model likelihood
+recommender.eval_llk(counts_df, full_llk=True)
+```
+
+If passing `reindex=True`, all user and item IDs that you pass to `.fit` will be reindexed internally (they need to be hashable types like `str`, `int` or `tuple`), and you can use these same IDs to make predictions later. The IDs returned by `predict` and `topN` are also the IDs that were passed to `.fit`.
+
+For a more detailed example, see the IPython notebook [recommending songs with EchoNest MillionSong dataset](http://nbviewer.jupyter.org/github/david-cortes/hpfrec/blob/master/example/hpfrec_echonest.ipynb) illustrating its usage with the EchoNest TasteProfile dataset.
+
+This package contains only functionality related to fitting this model. For general evaluation metrics for recommendations on implicit data see other packages such as [lightFM](https://github.com/lyst/lightfm).
+
+## Documentation
+
+Documentation is available at readthedocs: [http://hpfrec.readthedocs.io/en/latest/](http://hpfrec.readthedocs.io/en/latest/)
+
+It is also internally documented through docstrings (e.g. you can try `help(hpfrec.HPF)`, `help(hpfrec.HPF.fit)`, etc.).
+
+## Improving performance
+
+For better performance, use scipy and numpy libraries compiled against MKL. On Windows, you can find Python wheels (installable with pip after downloading them) of numpy and scipy precompiled with MKL in [Christoph Gohlke's website](https://www.lfd.uci.edu/~gohlke/pythonlibs/). On Linux and Mac, these come by default in Anaconda installations (but are likely to get overwritten if you enable `conda-forge`). In some small experiments of mine, this yielded a near 4x speed improvement compared to using free linear algebra libraries.
+
+The constructor for HPF accepts some parameters that make it run faster (if you know what you're doing): these are `allow_inconsistent_math=True`, `stop_crit='diff-norm'`, `reindex=False`, and `verbose=False`. See the documentation for more details.
+
+## Troubleshooting
+
+* Package uses only one CPU core: make sure that your C compiler supports OpenMP (both Visual Studio and GCC do).
+* Error with `vcvarsall.bat`: see installation instructions (you need to configure your Python installation to use Visual Studio and set the correct paths to libraries). If you are using Python 2, try installing under a Python 3 environment and the problem might disappear.
+* Parameters turn to NaN: you might have run into an unlucky parameter initialization. Try using a different random seed, or changing the number of latent factors (`k`). If passing `reindex=False`, try changing to `reindex=True`.
+
+The package has only been tested under Python 3.6.
+
+## References
+* [1] Gopalan, Prem, Jake M. Hofman, and David M. Blei. "Scalable Recommendation with Hierarchical Poisson Factorization." UAI. 2015.
+* [2] Gopalan, Prem, Jake M. Hofman, and David M. Blei. "Scalable recommendation with poisson factorization." arXiv preprint arXiv:1311.1704 (2013).
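A note on the README's "Model description" and "Why is it more efficient" sections above: the following is a minimal sketch, not code from the package, that simulates counts from the generative model and checks numerically that the Poisson log-likelihood computed from the non-zero entries alone matches the dense computation. It assumes `Gamma(shape, rate)` is the parametrization meant in the README (NumPy takes `scale = 1 / rate`). The hyperparameter values and all variable names are illustrative, not the package's defaults or internals.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
nusers, nitems, k = 30, 40, 5

# Illustrative hyperparameters (assumed for this sketch, not package defaults).
a, a_prime, b_prime = 0.3, 3.0, 1.0
c, c_prime, d_prime = 0.3, 3.0, 1.0

# Assumption: Gamma(shape, rate) as written in the README; NumPy wants scale = 1 / rate.
ksi = rng.gamma(a_prime, b_prime / a_prime, size=nusers)
Theta = rng.gamma(a, 1.0 / ksi[:, None], size=(nusers, k))
eta = rng.gamma(c_prime, d_prime / c_prime, size=nitems)
Beta = rng.gamma(c, 1.0 / eta[:, None], size=(nitems, k))
Y = rng.poisson(Theta @ Beta.T)

# log(y!) computed elementwise through the log-gamma function.
log_factorial = np.vectorize(lambda y: math.lgamma(y + 1.0), otypes=[float])

# Dense log-likelihood: touches every user-item pair, zeros included.
Yhat = Theta @ Beta.T
llk_dense = np.sum(-log_factorial(Y) + Y * np.log(Yhat) - Yhat)

# Sparse log-likelihood: only the non-zero counts are visited, and the sum of
# all predictions comes from the factored form <sum_u Theta_u, sum_i Beta_i>.
rows, cols = np.nonzero(Y)
y_nz = Y[rows, cols]
yhat_nz = np.einsum("ij,ij->i", Theta[rows], Beta[cols])
sum_yhat = Theta.sum(axis=0) @ Beta.sum(axis=0)
llk_sparse = np.sum(-log_factorial(y_nz) + y_nz * np.log(yhat_nz)) - sum_yhat

print(llk_dense, llk_sparse)
```

The sparse version does work proportional to the number of non-zero entries plus `(nusers + nitems) * k`, rather than `nusers * nitems * k`, which is the point the README makes about not iterating over missing combinations.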
diff --git a/docs/Makefile b/docs/Makefile
@@ -0,0 +1,20 @@
+# Minimal makefile for Sphinx documentation
+#
+
+# You can set these variables from the command line.
+SPHINXOPTS    =
+SPHINXBUILD   = python -msphinx
+SPHINXPROJ    = hpfrec
+SOURCEDIR     = .
+BUILDDIR      = _build
+
+# Put it first so that "make" without argument is like "make help".
+help:
+	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+
+.PHONY: help Makefile
+
+# Catch-all target: route all unknown targets to Sphinx using the new
+# "make mode" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).
+%: Makefile
+	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
diff --git a/docs/conf.py b/docs/conf.py
@@ -0,0 +1,183 @@
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+#
+# hpfrec documentation build configuration file, created by
+# sphinx-quickstart on Sun May 27 22:58:46 2018.
+#
+# This file is execfile()d with the current directory set to its
+# containing dir.
+#
+# Note that not all possible configuration values are present in this
+# autogenerated file.
+#
+# All configuration values have a default; values that are commented out
+# serve to show the default.
+
+# If extensions (or modules to document with autodoc) are in another directory,
+# add these directories to sys.path here. If the directory is relative to the
+# documentation root, use os.path.abspath to make it absolute, like shown here.
+#
+import os  # needed below for os.environ
+# import sys
+# sys.path.insert(0, os.path.abspath('.'))
+
+
+# -- General configuration ------------------------------------------------
+
+# If your documentation needs a minimal Sphinx version, state it here.
+#
+# needs_sphinx = '1.0'
+
+# Add any Sphinx extension module names here, as strings. They can be
+# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
+# ones.
+extensions = ['sphinx.ext.autodoc', 'sphinx.ext.napoleon', 'sphinx_rtd_theme']
+napoleon_google_docstring = False
+napoleon_use_param = False
+napoleon_use_ivar = True
+
+# Add any paths that contain templates here, relative to this directory.
+templates_path = ['_templates']
+
+# The suffix(es) of source filenames.
+# You can specify multiple suffix as a list of string:
+#
+# source_suffix = ['.rst', '.md']
+source_suffix = '.rst'
+
+# The master toctree document.
+master_doc = 'index'
+
+# General information about the project.
+project = 'hpfrec'
+copyright = '2018, David Cortes'
+author = 'David Cortes'
+
+# The version info for the project you're documenting, acts as replacement for
+# |version| and |release|, also used in various other places throughout the
+# built documents.
+#
+# The short X.Y version.
+version = '0.1'
+# The full version, including alpha/beta/rc tags.
+release = '0.1'
+
+# The language for content autogenerated by Sphinx. Refer to documentation
+# for a list of supported languages.
+#
+# This is also used if you do content translation via gettext catalogs.
+# Usually you set "language" from the command line for these cases.
+language = None
+
+# List of patterns, relative to source directory, that match files and
+# directories to ignore when looking for source files.
+# This patterns also effect to html_static_path and html_extra_path
+exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
+
+# The name of the Pygments (syntax highlighting) style to use.
+pygments_style = 'sphinx'
+
+# If true, `todo` and `todoList` produce output, else they produce nothing.
+todo_include_todos = False
+
+
+# -- Options for HTML output ----------------------------------------------
+
+# The theme to use for HTML and HTML Help pages.  See the documentation for
+# a list of builtin themes.
+#
+html_theme = 'alabaster'
+on_rtd = os.environ.get('READTHEDOCS', None) == 'True'
+
+if not on_rtd:  # only import and set the theme if we're building docs locally
+    import sphinx_rtd_theme
+    html_theme = 'sphinx_rtd_theme'
+    html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
+
+# otherwise, readthedocs.org uses their theme by default, so no need to specify it
+
+# Theme options are theme-specific and customize the look and feel of a theme
+# further.  For a list of options available for each theme, see the
+# documentation.
+#
+# html_theme_options = {}
+
+# Add any paths that contain custom static files (such as style sheets) here,
+# relative to this directory. They are copied after the builtin static files,
+# so a file named "default.css" will overwrite the builtin "default.css".
+html_static_path = ['_static']
+
+# Custom sidebar templates, must be a dictionary that maps document names
+# to template names.
+#
+# This is required for the alabaster theme
+# refs: http://alabaster.readthedocs.io/en/latest/installation.html#sidebars
+html_sidebars = {
+    '**': [
+        'about.html',
+        'navigation.html',
+        'relations.html',  # needs 'show_related': True theme option to display
+        'searchbox.html',
+        'donate.html',
+    ]
+}
+
+
+# -- Options for HTMLHelp output ------------------------------------------
+
+# Output file base name for HTML help builder.
+htmlhelp_basename = 'hpfrecdoc'
+
+
+# -- Options for LaTeX output ---------------------------------------------
+
+latex_elements = {
+    # The paper size ('letterpaper' or 'a4paper').
+    #
+    # 'papersize': 'letterpaper',
+
+    # The font size ('10pt', '11pt' or '12pt').
+    #
+    # 'pointsize': '10pt',
+
+    # Additional stuff for the LaTeX preamble.
+    #
+    # 'preamble': '',
+
+    # Latex figure (float) alignment
+    #
+    # 'figure_align': 'htbp',
+}
+
+# Grouping the document tree into LaTeX files. List of tuples
+# (source start file, target name, title,
+#  author, documentclass [howto, manual, or own class]).
+latex_documents = [
+    (master_doc, 'hpfrec.tex', 'hpfrec Documentation',
+     'David Cortes', 'manual'),
+]
+
+
+# -- Options for manual page output ---------------------------------------
+
+# One entry per manual page. List of tuples
+# (source start file, name, description, authors, manual section).
+man_pages = [
+    (master_doc, 'hpfrec', 'hpfrec Documentation',
+     [author], 1)
+]
+
+
+# -- Options for Texinfo output -------------------------------------------
+
+# Grouping the document tree into Texinfo files. List of tuples
+# (source start file, target name, title, author,
+#  dir menu entry, description, category)
+texinfo_documents = [
+    (master_doc, 'hpfrec', 'hpfrec Documentation',
+     author, 'hpfrec', 'One line description of project.',
+     'Miscellaneous'),
+]
+
+
+
diff --git a/docs/hpfrec.rst b/docs/hpfrec.rst
@@ -0,0 +1,10 @@
+hpfrec package
+==============
+
+Module contents
+---------------
+
+.. automodule:: hpfrec
+    :members:
+    :undoc-members:
+    :show-inheritance: 
diff --git a/docs/index.rst b/docs/index.rst
@@ -0,0 +1,20 @@
+.. hpfrec documentation master file, created by
+   sphinx-quickstart on Sun May 27 22:58:46 2018.
+   You can adapt this file completely to your liking, but it should at least
+   contain the root `toctree` directive.
+
+Welcome to hpfrec's documentation!
+==================================
+
+.. toctree::
+   :maxdepth: 2
+   :caption: Contents:
+
+
+
+Indices and tables
+==================
+
+* :ref:`genindex`
+* :ref:`modindex`
+* :ref:`search`