RuntimeWarning: divide by zero encountered in log #174

Closed
variablenerd opened this issue Sep 15, 2020 · 13 comments
@variablenerd

When passing a GSDMM short text clustering model to pyLDAvis for visualisation, I sometimes get 'divide by zero' warnings even though the visualisation is created successfully. How can these be resolved? Is it because of a small corpus? I am usually building these models on around 100 documents containing 10-15 tokens each. Screenshot attached, would appreciate help on this!

I am using Python 3.7 on MacOS Catalina version 10.15.3.

[Screenshot attached: 2020-09-15 at 16 18 29 (2)]

@msusol
Collaborator

msusol commented Feb 24, 2021

Is this an issue given the latest release v3.2.2?

@TimSchopf

TimSchopf commented Mar 7, 2021

I got the same issue using Python 3.8.5, numpy 1.20.1 and pyLDAvis 3.2.2 on macOS Big Sur 11.2.2. However, I did not use GSDMM; I visualized the data with the "bring your own model" functionality.

@msusol
Collaborator

msusol commented Mar 7, 2021

Do you have a Jupyter notebook available that will reproduce the error? I'd like to reproduce the error, then implement a fix showing the error is handled.

@TimSchopf

> Do you have a Jupyter notebook available that will reproduce the error? I'd like to reproduce the error, then implement a fix showing the error is handled.

Yes, this zip file contains an example notebook with the issue.

@msusol
Collaborator

msusol commented Mar 7, 2021

So I am not seeing the error with your example. Python 3.9.2 on BigSur

Can you please provide me the exact error stack trace you are seeing?

I am pretty sure all we need to do is change each instance of np.log(...) to np.log(... + epsilon), where epsilon = 1e-15

https://github.com/RaRe-Technologies/gensim/blob/2dcaaf80f4fb8023acc2f118b0966d92fca9500e/gensim/topic_coherence/direct_confirmation_measure.py
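As a rough illustration of the epsilon idea, here is a minimal sketch using plain NumPy. The array `probs` and the value `epsilon = 1e-15` are assumptions chosen for the example, not values taken from pyLDAvis. It compares shifting the argument with simply silencing the warning and keeping `-inf`.

```python
import numpy as np

# Assumed smoothing constant for illustration only.
epsilon = 1e-15

# Hypothetical probabilities; the last entry is exactly zero.
probs = np.array([0.5, 0.25, 0.0])

# Option 1: shift the argument so np.log never sees an exact zero.
shifted = np.log(probs + epsilon)

# Option 2: keep the exact result (-inf) but suppress the warning locally.
with np.errstate(divide="ignore"):
    raw = np.log(probs)

print(shifted)
print(raw)
```

The first option changes every value slightly and turns `-inf` into a large negative number; the second leaves the numbers untouched and only hides the message, so downstream code still has to cope with `-inf`.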

['appnope==0.1.2', 'argon2-cffi==20.1.0', 'async-generator==1.10', 'attrs==20.3.0', 'backcall==0.2.0', 'bleach==3.3.0', 'blessings==1.7', 'cffi==1.14.4', 'click==7.1.2', 'decorator==4.4.2', 'defusedxml==0.6.0', 'entrypoints==0.3', 'funcy==1.15', 'future==0.18.2', 'gensim==3.8.3', 'ipykernel==5.4.3', 'ipython-genutils==0.2.0', 'ipython==7.20.0', 'ipywidgets==7.6.3', 'jedi==0.18.0', 'jinja2==2.11.3', 'joblib==1.0.0', 'jsonschema==3.2.0', 'jupyter-client==6.1.11', 'jupyter-console==6.2.0', 'jupyter-core==4.7.1', 'jupyter-http-over-ws==0.0.8', 'jupyter-tensorboard==0.2.0', 'jupyter==1.0.0', 'jupyterlab-pygments==0.1.2', 'jupyterlab-widgets==1.0.0', 'markupsafe==1.1.1', 'mistune==0.8.4', 'nbclient==0.5.1', 'nbconvert==6.0.7', 'nbformat==5.1.2', 'nest-asyncio==1.5.1', 'nltk==3.5', 'notebook==6.2.0', 'numexpr==2.7.2', 'numpy==1.20.0', 'packaging==20.9', 'pandas==1.2.1', 'pandocfilters==1.4.3', 'parso==0.8.1', 'pexpect==4.8.0', 'pickleshare==0.7.5', 'pip==21.0.1', 'prometheus-client==0.9.0', 'prompt-toolkit==3.0.14', 'ptyprocess==0.7.0', 'pycparser==2.20', 'pygments==2.7.4', 'pyldavis==3.2.2', 'pyparsing==2.4.7', 'pyrsistent==0.17.3', 'python-dateutil==2.8.1', 'pytz==2021.1', 'pyzmq==22.0.2', 'qtconsole==5.0.2', 'qtpy==1.9.0', 'regex==2020.11.13', 'scikit-learn==0.24.1', 'scipy==1.6.0', 'send2trash==1.5.0', 'setuptools==52.0.0', 'six==1.15.0', 'sklearn==0.0', 'smart-open==4.1.2', 'terminado==0.9.2', 'testpath==0.4.4', 'threadpoolctl==2.1.0', 'tornado==6.1', 'tqdm==4.56.0', 'traitlets==5.0.5', 'wcwidth==0.2.5', 'webencodings==0.5.1', 'wheel==0.36.2', 'widgetsnbextension==3.5.1']

@TimSchopf

I always get the following three RuntimeWarning messages when calling the pyLDAvis.prepare function:

/opt/anaconda3/lib/python3.8/site-packages/pyLDAvis/_prepare.py:236: RuntimeWarning: divide by zero encountered in log
  log_1 = np.log(pd.eval("(topic_given_term.T / topic_proportion)"))

/opt/anaconda3/lib/python3.8/site-packages/pyLDAvis/_prepare.py:259: RuntimeWarning: divide by zero encountered in log
  log_lift = np.log(pd.eval("topic_term_dists / term_proportion")).astype("float64")

/opt/anaconda3/lib/python3.8/site-packages/pyLDAvis/_prepare.py:260: RuntimeWarning: divide by zero encountered in log
  log_ttd = np.log(topic_term_dists).astype("float64")
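Since a warning carries no stack trace by default, one way to obtain one is to ask NumPy to raise instead of warn. The sketch below assumes plain NumPy arrays and a hypothetical stand-in function `log_lift`; it is not the pyLDAvis code, only an illustration of `np.errstate(divide="raise")`.

```python
import traceback

import numpy as np


def log_lift(topic_term_dists, term_proportion):
    # Hypothetical stand-in for the log-lift computation named in the warning.
    return np.log(topic_term_dists / term_proportion)


# Made-up data: one topic assigns zero probability to the first term.
topic_term_dists = np.array([[0.0, 0.6, 0.4],
                             [0.3, 0.3, 0.4]])
term_proportion = np.array([0.2, 0.4, 0.4])

trace = None
# Promote floating-point "divide" conditions to exceptions inside this block.
with np.errstate(divide="raise"):
    try:
        log_lift(topic_term_dists, term_proportion)
    except FloatingPointError:
        trace = traceback.format_exc()

print(trace)
```

The printed traceback names the function and line where the zero reached `np.log`, which is usually enough to tell whether the zero came from the input distributions or from an intermediate division.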

Installed packages:

['aiohttp-cors==0.7.0', 'aiohttp==3.7.3', 'aioredis==1.3.1', 'alabaster==0.7.12', 'anaconda-client==1.7.2', 'anaconda-navigator==1.10.0', 'anaconda-project==0.8.3', 'applaunchservices==0.2.1', 'appnope==0.1.0', 'appscript==1.1.1', 'argh==0.26.2', 'argon2-cffi==20.1.0', 'asn1crypto==1.4.0', 'astroid==2.4.2', 'astropy==4.0.2', 'async-generator==1.10', 'async-timeout==3.0.1', 'atomicwrites==1.4.0', 'attrs==20.3.0', 'autoflake==1.4', 'autopep8==1.5.5', 'babel==2.8.1', 'backcall==0.2.0', 'backports.functools-lru-cache==1.6.1', 'backports.shutil-get-terminal-size==1.0.0', 'backports.tempfile==1.0', 'backports.weakref==1.0.post1', 'beautifulsoup4==4.9.3', 'bitarray==1.6.1', 'bkcharts==0.2', 'bleach==3.2.1', 'blessings==1.7', 'blis==0.7.4', 'bokeh==2.2.3', 'boto==2.49.0', 'bottleneck==1.3.2', 'brotlipy==0.7.0', 'cachetools==4.1.1', 'catalogue==1.0.0', 'certifi==2020.6.20', 'cffi==1.14.3', 'chardet==3.0.4', 'click==7.1.2', 'cloudpickle==1.6.0', 'clyent==1.2.2', 'colorama==0.4.4', 'colorful==0.5.4', 'conda-build==3.20.5', 'conda-package-handling==1.7.2', 'conda-verify==3.4.2', 'conda==4.9.2', 'contextlib2==0.6.0.post1', 'cryptography==3.1.1', 'cycler==0.10.0', 'cymem==2.0.5', 'cython==0.29.21', 'cytoolz==0.11.0', 'dask==2.30.0', 'decorator==4.4.2', 'defusedxml==0.6.0', 'diff-match-patch==20200713', 'dill==0.3.3', 'distributed==2.30.1', 'docutils==0.16', 'entrypoints==0.3', 'et-xmlfile==1.0.1', 'fastcache==1.1.0', 'filelock==3.0.12', 'flake8==3.8.4', 'flask==1.1.2', 'fsspec==0.8.3', 'funcy==1.15', 'future==0.18.2', 'gensim==4.0.0b0', 'gevent==20.9.0', 'glob2==0.7', 'globre==0.1.5', 'gmpy2==2.0.8', 'google-api-core==1.23.0', 'google-auth==1.23.0', 'google==3.0.0', 'googleapis-common-protos==1.52.0', 'gpustat==0.6.0', 'greenlet==0.4.17', 'grpcio==1.33.2', 'h5py==2.10.0', 'hdbscan==0.8.27', 'heapdict==1.0.1', 'hiredis==1.1.0', 'html5lib==1.1', 'idna==2.10', 'imageio==2.9.0', 'imagesize==1.2.0', 'importlib-metadata==2.0.0', 'iniconfig==1.1.1', 'intervaltree==3.1.0', 
'ipykernel==5.3.4', 'ipython-genutils==0.2.0', 'ipython==7.21.0', 'ipywidgets==7.5.1', 'isort==5.6.4', 'itsdangerous==1.1.0', 'jdcal==1.4.1', 'jedi==0.17.1', 'jinja2==2.11.2', 'joblib==1.0.1', 'json5==0.9.5', 'jsonschema==3.2.0', 'jupyter-client==6.1.7', 'jupyter-console==6.2.0', 'jupyter-core==4.6.3', 'jupyter==1.0.0', 'jupyterlab-pygments==0.1.2', 'jupyterlab-server==1.2.0', 'jupyterlab==2.2.6', 'keyring==21.4.0', 'kiwisolver==1.3.0', 'lazy-object-proxy==1.4.3', 'libarchive-c==2.9', 'llvmlite==0.34.0', 'locket==0.2.0', 'lxml==4.6.1', 'mapply==0.1.4', 'markupsafe==1.1.1', 'matplotlib==3.3.2', 'mccabe==0.6.1', 'mistune==0.8.4', 'mkl-fft==1.2.0', 'mkl-random==1.1.1', 'mkl-service==2.3.0', 'mock==4.0.2', 'modin==0.8.2', 'more-itertools==8.6.0', 'mpmath==1.1.0', 'msgpack==1.0.0', 'multidict==5.0.2', 'multipledispatch==0.6.0', 'multiprocess==0.70.11.1', 'murmurhash==1.0.5', 'navigator-updater==0.2.1', 'nbclient==0.5.1', 'nbconvert==6.0.7', 'nbformat==5.0.8', 'nest-asyncio==1.4.2', 'networkx==2.5', 'nltk==3.5', 'nose==1.3.7', 'notebook==6.1.4', 'numba==0.51.2', 'numexpr==2.7.1', 'numpy==1.20.1', 'numpydoc==1.1.0', 'nvidia-ml-py3==7.352.0', 'olefile==0.46', 'opencensus-context==0.1.2', 'opencensus==0.7.11', 'openpyxl==3.0.5', 'packaging==20.4', 'panda==0.3.1', 'pandarallel==1.5.1', 'pandas==1.1.4', 'pandocfilters==1.4.3', 'parso==0.7.0', 'partd==1.1.0', 'path==15.0.0', 'pathlib2==2.3.5', 'pathos==0.2.7', 'pathtools==0.1.2', 'patsy==0.5.1', 'pep8==1.7.1', 'pexpect==4.8.0', 'pickleshare==0.7.5', 'pillow==8.0.1', 'pip==21.0.1', 'pkginfo==1.6.1', 'plac==1.1.3', 'pluggy==0.13.1', 'ply==3.11', 'pox==0.2.9', 'ppft==1.6.6.3', 'preshed==3.0.5', 'prometheus-client==0.8.0', 'prompt-toolkit==3.0.8', 'protobuf==3.14.0', 'psutil==5.7.2', 'ptyprocess==0.6.0', 'py-spy==0.3.3', 'py==1.9.0', 'pyarrow==1.0.0', 'pyasn1-modules==0.2.8', 'pyasn1==0.4.8', 'pycodestyle==2.6.0', 'pycosat==0.6.3', 'pycparser==2.20', 'pycurl==7.43.0.6', 'pydocstyle==5.1.1', 'pyflakes==2.2.0', 'pygments==2.7.2', 
'pyldavis==3.2.2', 'pylint==2.6.0', 'pynndescent==0.5.2', 'pyodbc==4.0.0-unsupported', 'pyopenssl==19.1.0', 'pyparsing==2.4.7', 'pyqt5-qt==5.15.2', 'pyqt5-sip==12.8.1', 'pyqt5==5.15.3', 'pyqtwebengine-qt==5.15.2', 'pyqtwebengine==5.15.3', 'pyrsistent==0.17.3', 'pysocks==1.7.1', 'pytest==0.0.0', 'python-dateutil==2.8.1', 'python-jsonrpc-server==0.4.0', 'python-language-server==0.35.1', 'pytz==2020.1', 'pywavelets==1.1.1', 'pyyaml==5.3.1', 'pyzmq==19.0.2', 'qdarkstyle==2.8.1', 'qtawesome==1.0.1', 'qtconsole==4.7.7', 'qtpy==1.9.0', 'ray==1.0.1.post1', 'redis==3.4.1', 'regex==2020.10.15', 'requests==2.24.0', 'rope==0.18.0', 'rsa==4.6', 'rtree==0.9.4', 'ruamel-yaml==0.15.87', 'scikit-image==0.17.2', 'scikit-learn==0.24.1', 'scipy==1.5.2', 'seaborn==0.11.0', 'send2trash==1.5.0', 'setuptools==50.3.1.post20201107', 'simplegeneric==0.8.1', 'singledispatch==3.4.0.3', 'six==1.15.0', 'smart-open==4.0.1', 'snowballstemmer==2.0.0', 'sortedcollections==1.2.1', 'sortedcontainers==2.2.2', 'soupsieve==2.0.1', 'spacy==2.3.5', 'sphinx==3.2.1', 'sphinxcontrib-applehelp==1.0.2', 'sphinxcontrib-devhelp==1.0.2', 'sphinxcontrib-htmlhelp==1.0.3', 'sphinxcontrib-jsmath==1.0.1', 'sphinxcontrib-qthelp==1.0.3', 'sphinxcontrib-serializinghtml==1.1.4', 'sphinxcontrib-websupport==1.2.4', 'spyder-kernels==1.9.4', 'spyder==4.1.5', 'sqlalchemy==1.3.20', 'srsly==1.0.5', 'statsmodels==0.12.0', 'swifter==1.0.7', 'sympy==1.6.2', 'tables==3.6.1', 'tblib==1.7.0', 'terminado==0.9.1', 'testpath==0.4.4', 'thinc==7.4.5', 'threadpoolctl==2.1.0', 'tifffile==2020.10.1', 'tmtoolkit==0.10.0', 'toml==0.10.1', 'toolz==0.11.1', 'tornado==6.0.4', 'tqdm==4.50.2', 'traitlets==5.0.5', 'typing-extensions==3.7.4.3', 'ujson==4.0.1', 'umap-learn==0.5.1', 'unicodecsv==0.14.1', 'urllib3==1.25.11', 'wasabi==0.8.2', 'watchdog==0.10.3', 'wcwidth==0.2.5', 'webencodings==0.5.1', 'werkzeug==1.0.1', 'wheel==0.35.1', 'widgetsnbextension==3.5.1', 'wrapt==1.11.2', 'wurlitzer==2.0.1', 'xlrd==1.2.0', 'xlsxwriter==1.3.7', 'xlwings==0.20.8', 
'xlwt==1.3.0', 'xmltodict==0.12.0', 'yapf==0.30.0', 'yarl==1.6.3', 'zict==2.0.0', 'zipp==3.4.0', 'zope.event==4.5.0', 'zope.interface==5.1.2']

@msusol msusol self-assigned this Mar 9, 2021
msusol added a commit that referenced this issue Mar 12, 2021
- removing unused variables
- providing for np.log(0) fix
- also fixing flake issue with 'l'
msusol added a commit that referenced this issue Mar 12, 2021
RuntimeWarning: divide by zero encountered in log #174

- providing for np.log(0) fix (removing pd.eval())
@msusol
Collaborator

msusol commented Mar 12, 2021

Err..forgot to run the pytest before committing...TBC

msusol added a commit that referenced this issue Mar 14, 2021
@msusol
Collaborator

msusol commented Mar 14, 2021

@TimSchopf Please give the recent commit a try

pip install -e git://github.com/bmabey/pyLDAvis.git@b1bd4f9#egg=pyLDAvis

@TimSchopf

> @TimSchopf Please give the recent commit a try
>
> pip install -e git://github.com/bmabey/pyLDAvis.git@b1bd4f9#egg=pyLDAvis

I tested the commit and got the following RuntimeWarning messages:

/Users/timschopf/src/pyldavis/pyLDAvis/_prepare.py:238: RuntimeWarning: divide by zero encountered in log
  log_1 = np.log(

/Users/timschopf/src/pyldavis/pyLDAvis/_prepare.py:263: RuntimeWarning: divide by zero encountered in log
  log_lift = np.log(

@msusol
Collaborator

msusol commented Mar 14, 2021

Can you try upgrading to pandas==1.2.3? I cannot reproduce the same runtime warning (for a "true" divide by zero), even with contrived data using only pandas. When pd.eval() is given a division by zero, it currently returns np.inf, and my latest commit handles this correctly.

Note: RuntimeWarning: divide by zero encountered in log also comes from np.log(0)

np.log(0)

:1: RuntimeWarning: divide by zero encountered in log
np.log(0)
-inf

So my approach may need to change to address that scenario, np.log(0), rather than a true division by zero (x / 0).
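To tell the two sources apart, it can help to capture the warnings and look at the message text. This is a minimal sketch using only NumPy and the standard `warnings` module; the helper name `divide_warnings` is made up for the example, and the exact wording of the division message varies between NumPy versions.

```python
import warnings

import numpy as np


def divide_warnings(func):
    # Hypothetical helper: run func and return the RuntimeWarning messages it emits.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        func()
    return [str(w.message) for w in caught
            if issubclass(w.category, RuntimeWarning)]


# np.log(0): reported as a "divide by zero" even though nothing is divided.
log_msgs = divide_warnings(lambda: np.log(np.array([0.0])))

# A true division by zero: the message names the division ufunc instead.
div_msgs = divide_warnings(lambda: np.array([1.0]) / np.array([0.0]))

print(log_msgs)
print(div_msgs)
```

Both messages start with "divide by zero encountered in", so the trailing ufunc name is the part that shows which operation received the zero.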

@msusol
Collaborator

msusol commented Mar 14, 2021

Try this code in a notebook by itself:

import numpy as np
import pandas as pd

pd.__version__
np.__version__

'1.2.3'
'1.20.1'

Create some mock data:

epsilon = 1e-6

topic_proportion = pd.Series([0.526902, 0.473098])
topic_proportion.index.name = 'topic'

topic_given_term = pd.DataFrame([(0.491445, 0.553725, 0.573605),
                                 (0.508555, 0.446275, 0.426395)],
                                columns =[0, 1, 2]) 
topic_given_term.columns.names = ['term']
topic_given_term.index.name = 'topic'

topic_term_dists = pd.DataFrame([(0.098468, 0.145994, 0.150607),
                                 (0.101896, 0.117664, 0.111955)],
                                columns =[0, 1, 2]) 
topic_term_dists.columns.names = ['term']
topic_term_dists.index.name = 'topic'

term_frequency = pd.Series([2.101885, 2.784418, 2.778742])
term_frequency.index.name = 'term'

# marginal distribution over terms (width of blue bars)
term_proportion = term_frequency / term_frequency.sum()

# compute the distinctiveness and saliency of the terms:
# this determines the R terms that are displayed when no topic is selected
tt_sum = topic_term_dists.sum()
topic_given_term = pd.eval("topic_term_dists / tt_sum")

Now the following should return:

print(pd.eval("topic_given_term.T / topic_proportion").where(pd.eval("topic_given_term.T / topic_proportion") != np.inf, np.log(epsilon)))

topic         0         1
term
0      0.932708  1.074945
1      1.050907  0.943304
2      1.088638  0.901282

Now, let's create the condition for divide by zero

topic_proportion = pd.Series([1, 0])  # 100% , 0% probability
topic_proportion.index.name = 'topic'

# marginal distribution over terms (width of blue bars)
term_proportion = term_frequency / term_frequency.sum()

# compute the distinctiveness and saliency of the terms:
# this determines the R terms that are displayed when no topic is selected
tt_sum = topic_term_dists.sum()
topic_given_term = pd.eval("topic_term_dists / tt_sum")

..and now run the division by zero scenario

print(pd.eval("topic_given_term.T / topic_proportion"))

topic         0    1
term
0      0.491446  inf
1      0.553725  inf
2      0.573605  inf

print(pd.eval("topic_given_term.T / topic_proportion").where(pd.eval("topic_given_term.T / topic_proportion") != np.inf, epsilon))

topic         0         1
term
0      0.491446  0.000001
1      0.553725  0.000001
2      0.573605  0.000001

Taking the log of that result then gives:

print(np.log(pd.eval("topic_given_term.T / topic_proportion").where(pd.eval("topic_given_term.T / topic_proportion") != np.inf, epsilon)))

topic         0          1
term
0     -0.710404 -13.815511
1     -0.591087 -13.815511
2     -0.555813 -13.815511
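The expression above evaluates the same ratio twice. A sketch of the same idea that computes the mask once is shown below; `safe_log` is a hypothetical helper written for this example, not a pyLDAvis function, and it goes slightly further than the snippet above by also replacing zeros and NaN values, which is an assumption about what a caller would want.

```python
import numpy as np
import pandas as pd


def safe_log(frame, epsilon=1e-6):
    # Hypothetical helper: replace anything np.log cannot handle cleanly
    # (inf, NaN, zero or negative values) with epsilon, then take the log.
    usable = np.isfinite(frame) & (frame > 0)
    return np.log(frame.where(usable, epsilon))


# Ratios copied from the division-by-zero output in this comment.
ratio = pd.DataFrame({0: [0.491446, 0.553725, 0.573605],
                      1: [np.inf, np.inf, np.inf]})

result = safe_log(ratio)
print(result)
```

With `epsilon = 1e-6` the replaced entries come out as `log(1e-6)`, matching the second column printed above, while the finite entries are unchanged.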

@TimSchopf

> Can you try upgrading to pandas==1.2.3. I can not reproduce the same runtime warning, even with contrived data using only pandas. When pd.eval() is given division by zero, it returns np.inf currently... And my latest commit correctly handles this.
>
> Note: RuntimeWarning: divide by zero encountered in log also comes from np.log(0)
>
> np.log(0)
>
> :1: RuntimeWarning: divide by zero encountered in log
> np.log(0)
> -inf
>
> So my approach may need to change to address that scenario versus a true division by zero # / 0

After upgrading to pandas==1.2.3 and numpy==1.20.1 I got no more RuntimeWarning messages. In addition, my notebook output looks the same as yours using the proposed mock data.

@msusol
Collaborator

msusol commented Mar 14, 2021

My solution will then need to change to address only np.log(0). It looks like the zero case is handled without a warning when the argument comes from pd.eval() (see below: no warning for that call, only for the direct np.log(0)).

topic_proportion = pd.Series([0.526902, 0.473098])
topic_proportion.index.name = 'topic'

topic_given_term = pd.DataFrame([(0.491445, 0.553725, 0.573605),
                                 (0.508555, 0.446275, 0.426395)],
                                columns =[0, 1, 2]) 
topic_given_term.columns.names = ['term']
topic_given_term.index.name = 'topic'

topic_term_dists = pd.DataFrame([#(0.098468, 0.145994, 0.150607),
                                 (0.0, 0.0, 0.0),  # np.log(0)
                                 (0.101896, 0.117664, 0.111955)],
                                columns =[0, 1, 2]) 
topic_term_dists.columns.names = ['term']
topic_term_dists.index.name = 'topic'

term_frequency = pd.Series([2.101885, 2.784418, 2.778742])
term_frequency.index.name = 'term'

print(topic_proportion, '\n')
print(topic_given_term, '\n')
print(topic_given_term.T, '\n')
print(topic_term_dists, '\n')
print(term_frequency, '\n')

# marginal distribution over terms (width of blue bars)
term_proportion = term_frequency / term_frequency.sum()

# compute the distinctiveness and saliency of the terms:
# this determines the R terms that are displayed when no topic is selected
tt_sum = topic_term_dists.sum()
topic_given_term = pd.eval("topic_term_dists / tt_sum")

print(np.log(pd.eval("topic_given_term.T / topic_proportion")))

np.log(0)

topic
0    0.526902
1    0.473098
dtype: float64

term          0         1         2
topic
0      0.491445  0.553725  0.573605
1      0.508555  0.446275  0.426395

topic         0         1
term
0      0.491445  0.508555
1      0.553725  0.446275
2      0.573605  0.426395

term          0         1         2
topic
0      0.000000  0.000000  0.000000
1      0.101896  0.117664  0.111955

term
0    2.101885
1    2.784418
2    2.778742
dtype: float64

topic    0         1
term
0     -inf  0.748453
1     -inf  0.748453
2     -inf  0.748453

:36: RuntimeWarning: divide by zero encountered in log
np.log(0)
-inf

msusol added a commit that referenced this issue Mar 14, 2021
np.log(0): throws RuntimeWarning: divide by zero encountered in log.

but np.log(pd.eval(..)) handles correctly in pandas==1.2.3 and numpy==1.20.1
@msusol msusol closed this as completed Mar 14, 2021