RuntimeWarning: divide by zero encountered in log #174

variablenerd opened this issue Sep 15, 2020 · 13 comments

When passing a GSDMM short text clustering model to pyLDAvis for visualisation, I sometimes get 'divide by zero' warnings even though the visualisation is created successfully. How can these be resolved? Is it because of a small corpus? I am usually building these models on around 100 documents containing 10-15 tokens each. Screenshot attached, would appreciate help on this!

I am using Python 3.7 on MacOS Catalina version 10.15.3.

[Attached image: Screenshot 2020-09-15 at 16 18 29 (2)]

msusol commented Feb 24, 2021

Is this an issue given the latest release v3.2.2?

TimSchopf commented Mar 7, 2021

I got the same issue using Python 3.8.5, numpy 1.20.1 and pyLDAvis 3.2.2 on macOS Big Sur 11.2.2. However, I did not use GSDMM; I visualized the data with the "bring your own model" functionality.

msusol commented Mar 7, 2021

Do you have a Jupyter notebook available that will reproduce the error? I'd like to reproduce the error, then implement a fix showing the error is handled.

Yes, this zip file contains an example notebook that reproduces the issue.

msusol commented Mar 7, 2021

I am not seeing the error with your example (Python 3.9.2 on Big Sur).

Can you please provide me the exact error stack trace you are seeing?

I am pretty sure all we need to do is change each instance of np.log(...) to np.log(... + epsilon), where epsilon is a small constant such as 1e-15.
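The epsilon idea above can be sketched in isolation as follows. This is only an illustration under assumptions, not the code that was committed to pyLDAvis: `safe_log` is a hypothetical helper name, and the default of 1e-15 is an assumed value (the mock data later in this thread uses 1e-6).

```python
import numpy as np

EPSILON = 1e-15  # assumed small constant, not a value taken from pyLDAvis


def safe_log(values, epsilon=EPSILON):
    """Hypothetical helper: shift inputs by epsilon so log never sees an exact zero."""
    return np.log(np.asarray(values, dtype="float64") + epsilon)


probs = np.array([0.0, 0.25, 0.75])
shifted = safe_log(probs)

# Every entry is finite; the zero maps to log(epsilon) instead of -inf.
print(np.isfinite(shifted).all())
```

The trade-off is that every value is shifted slightly, not only the zeros, so the result depends on the chosen epsilon.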

I always get the following three RuntimeWarning messages when calling the pyLDAvis.prepare function:

/opt/anaconda3/lib/python3.8/site-packages/pyLDAvis/ RuntimeWarning: divide by zero encountered in log
  log_1 = np.log(pd.eval("(topic_given_term.T / topic_proportion)"))

/opt/anaconda3/lib/python3.8/site-packages/pyLDAvis/ RuntimeWarning: divide by zero encountered in log
  log_lift = np.log(pd.eval("topic_term_dists / term_proportion")).astype("float64")

/opt/anaconda3/lib/python3.8/site-packages/pyLDAvis/ RuntimeWarning: divide by zero encountered in log
  log_ttd = np.log(topic_term_dists).astype("float64")
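Since a RuntimeWarning does not come with a stack trace, one way to get one is to make NumPy raise instead of warn. The sketch below uses `np.errstate(divide="raise")` on a small made-up array; wrapping the real call that emits the warning in the same context manager is an assumption that has not been tested against pyLDAvis here.

```python
import numpy as np

values = np.array([0.5, 0.0, 0.25])  # made-up data containing one exact zero

# Inside this block NumPy raises FloatingPointError instead of warning,
# so the traceback points at the line that took log(0).
try:
    with np.errstate(divide="raise"):
        np.log(values)
    raised = False
except FloatingPointError as exc:
    raised = True
    print("caught:", exc)
```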

msusol added a commit that referenced this issue Mar 12, 2021
- removing unused variables
- providing for np.log(0) fix
- also fixing flake issue with 'l'
msusol added a commit that referenced this issue Mar 12, 2021
RuntimeWarning: divide by zero encountered in log #174

- providing for np.log(0) fix (removing pd.eval())
msusol commented Mar 12, 2021

Err..forgot to run the pytest before committing...TBC

msusol added a commit that referenced this issue Mar 14, 2021
msusol commented Mar 14, 2021

@TimSchopf Please give the recent commit a try

pip install -e git://

I tested the commit and got the following RuntimeWarning messages:

/Users/timschopf/src/pyldavis/pyLDAvis/ RuntimeWarning: divide by zero encountered in log
  log_1 = np.log(

/Users/timschopf/src/pyldavis/pyLDAvis/ RuntimeWarning: divide by zero encountered in log
  log_lift = np.log(

msusol commented Mar 14, 2021

Can you try upgrading to pandas==1.2.3? I cannot reproduce the same runtime warning (for a "true" divide by zero), even with contrived data using only pandas. When pd.eval() is given a division by zero, it currently returns np.inf, and my latest commit handles this correctly.

Note: RuntimeWarning: divide by zero encountered in log also comes from np.log(0)


:1: RuntimeWarning: divide by zero encountered in log

So my approach may need to change to address that scenario, np.log(0), rather than a true division by zero (x / 0).
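The difference between the two scenarios can be seen by capturing the warnings NumPy emits for each. This is a minimal sketch with made-up one-element arrays; `warning_messages` is a hypothetical helper, and the exact wording of the division warning varies between NumPy versions, so only the log message is matched literally.

```python
import warnings

import numpy as np


def warning_messages(func):
    """Hypothetical helper: run func and return its result plus any warning texts."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        result = func()
    return result, [str(w.message) for w in caught]


# np.log(0) gives -inf and reports "divide by zero encountered in log".
log_result, log_msgs = warning_messages(lambda: np.log(np.array([0.0])))

# A true x / 0 gives +inf and reports a divide-by-zero from the division itself.
div_result, div_msgs = warning_messages(lambda: np.array([1.0]) / np.array([0.0]))

print(log_result, log_msgs)
print(div_result, div_msgs)
```

The sign of the infinity is a quick way to tell the cases apart: -inf points at a log of zero, +inf at a zero denominator.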

msusol commented Mar 14, 2021

Try this code in a notebook by itself:

import numpy as np
import pandas as pd



Create some mock data:

epsilon = 1e-6

topic_proportion = pd.Series([0.526902, 0.473098])
topic_proportion.index.name = 'topic'

topic_given_term = pd.DataFrame([(0.491445, 0.553725, 0.573605),
                                 (0.508555, 0.446275, 0.426395)],
                                columns=[0, 1, 2])
topic_given_term.columns.names = ['term']
topic_given_term.index.name = 'topic'

topic_term_dists = pd.DataFrame([(0.098468, 0.145994, 0.150607),
                                 (0.101896, 0.117664, 0.111955)],
                                columns=[0, 1, 2])
topic_term_dists.columns.names = ['term']
topic_term_dists.index.name = 'topic'

term_frequency = pd.Series([2.101885, 2.784418, 2.778742])
term_frequency.index.name = 'term'

# marginal distribution over terms (width of blue bars)
term_proportion = term_frequency / term_frequency.sum()

# compute the distinctiveness and saliency of the terms:
# this determines the R terms that are displayed when no topic is selected
tt_sum = topic_term_dists.sum()
topic_given_term = pd.eval("topic_term_dists / tt_sum")

And now this should return:

print(pd.eval("topic_given_term.T / topic_proportion").where(pd.eval("topic_given_term.T / topic_proportion") != np.inf, np.log(epsilon)))

topic 0 1
0 0.932708 1.074945
1 1.050907 0.943304
2 1.088638 0.901282

Now, let's create the condition for divide by zero

topic_proportion = pd.Series([1, 0])  # 100%, 0% probability
topic_proportion.index.name = 'topic'

# marginal distribution over terms (width of blue bars)
term_proportion = term_frequency / term_frequency.sum()

# compute the distinctiveness and saliency of the terms:
# this determines the R terms that are displayed when no topic is selected
tt_sum = topic_term_dists.sum()
topic_given_term = pd.eval("topic_term_dists / tt_sum")

...and now run the division-by-zero scenario:

print(pd.eval("topic_given_term.T / topic_proportion"))

topic 0 1
0 0.491446 inf
1 0.553725 inf
2 0.573605 inf

print(pd.eval("topic_given_term.T / topic_proportion").where(pd.eval("topic_given_term.T / topic_proportion") != np.inf, epsilon))

topic 0 1
0 0.491446 0.000001
1 0.553725 0.000001
2 0.573605 0.000001

Taking the log of that then gives:

print(np.log(pd.eval("topic_given_term.T / topic_proportion").where(pd.eval("topic_given_term.T / topic_proportion") != np.inf, epsilon)))

topic 0 1
0 -0.710404 -13.815511
1 -0.591087 -13.815511
2 -0.555813 -13.815511
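The masking steps above evaluate the same ratio twice. A compact variant, sketched here with plain pandas operators instead of pd.eval(), computes it once and masks with `np.isfinite`. Note the assumptions: this is not the committed fix, the numbers are the mock values from this comment, and `np.isfinite` also replaces NaN (for example from 0 / 0), which the `!= np.inf` comparison does not.

```python
import numpy as np
import pandas as pd

epsilon = 1e-6

topic_given_term = pd.DataFrame([(0.491445, 0.553725, 0.573605),
                                 (0.508555, 0.446275, 0.426395)],
                                columns=[0, 1, 2])
topic_proportion = pd.Series([1.0, 0.0])  # second topic has zero weight

# Compute the ratio once; column 1 becomes inf because of the zero denominator.
ratio = topic_given_term.T / topic_proportion

# Replace every non-finite entry with epsilon before taking the log.
masked = ratio.where(np.isfinite(ratio), epsilon)
log_ratio = np.log(masked)

print(log_ratio)
```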


After upgrading to pandas==1.2.3 and numpy==1.20.1 I got no more RuntimeWarning messages. In addition, my notebook output looks the same as yours using the proposed mock data.

msusol commented Mar 14, 2021

My solution will then need to change to address only np.log(0). Division by zero inside pd.eval() already appears to be handled correctly (see below: no warning is raised for it, only for np.log(0) directly).

topic_proportion = pd.Series([0.526902, 0.473098])
topic_proportion.index.name = 'topic'

topic_given_term = pd.DataFrame([(0.491445, 0.553725, 0.573605),
                                 (0.508555, 0.446275, 0.426395)],
                                columns=[0, 1, 2])
topic_given_term.columns.names = ['term']
topic_given_term.index.name = 'topic'

topic_term_dists = pd.DataFrame([#(0.098468, 0.145994, 0.150607),
                                 (0.0, 0.0, 0.0),  # np.log(0)
                                 (0.101896, 0.117664, 0.111955)],
                                columns=[0, 1, 2])
topic_term_dists.columns.names = ['term']
topic_term_dists.index.name = 'topic'

term_frequency = pd.Series([2.101885, 2.784418, 2.778742])
term_frequency.index.name = 'term'

print(topic_proportion, '\n')
print(topic_given_term, '\n')
print(topic_given_term.T, '\n')
print(topic_term_dists, '\n')
print(term_frequency, '\n')

# marginal distribution over terms (width of blue bars)
term_proportion = term_frequency / term_frequency.sum()

# compute the distinctiveness and saliency of the terms:
# this determines the R terms that are displayed when no topic is selected
tt_sum = topic_term_dists.sum()
topic_given_term = pd.eval("topic_term_dists / tt_sum")

print(np.log(pd.eval("topic_given_term.T / topic_proportion")))


0 0.526902
1 0.473098
dtype: float64

term 0 1 2
0 0.491445 0.553725 0.573605
1 0.508555 0.446275 0.426395

topic 0 1
0 0.491445 0.508555
1 0.553725 0.446275
2 0.573605 0.426395

term 0 1 2
0 0.000000 0.000000 0.000000
1 0.101896 0.117664 0.111955

0 2.101885
1 2.784418
2 2.778742
dtype: float64

topic 0 1
0 -inf 0.748453
1 -inf 0.748453
2 -inf 0.748453

:36: RuntimeWarning: divide by zero encountered in log
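The -inf column above comes from the all-zero topic row. Before passing a matrix to pyLDAvis, it may help to check where such zeros sit. The following is a diagnostic sketch only, using the mock matrix from this comment; the variable names `zero_mask`, `zero_counts` and `empty_topics` are made up for illustration.

```python
import pandas as pd

topic_term_dists = pd.DataFrame([(0.0, 0.0, 0.0),  # topic with no probability mass
                                 (0.101896, 0.117664, 0.111955)],
                                columns=[0, 1, 2])

# True wherever a topic assigns exactly zero probability to a term.
zero_mask = topic_term_dists == 0

# Number of zero entries per topic, and topics that are zero everywhere.
zero_counts = zero_mask.sum(axis=1)
empty_topics = zero_mask.all(axis=1)

print(zero_counts)
print(empty_topics[empty_topics].index.tolist())
```

Any topic flagged here will produce log(0) in the computations shown in this thread.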

msusol added a commit that referenced this issue Mar 14, 2021
np.log(0): throws RuntimeWarning: divide by zero encountered in log.

but np.log(pd.eval(..)) handles correctly in pandas==1.2.3 and numpy==1.20.1
msusol closed this as completed Mar 14, 2021
