FloatingPointError: divide by zero encountered in true_divide #52
It turns out that after removing the empty strings, by running dataframe.replace('', numpy.nan, inplace=True) on the affected fields, it was able to run completely without issue.
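The workaround described above can be sketched as follows (the column names are illustrative, not from the actual proprietary data):

```python
import numpy as np
import pandas as pd

# Illustrative frame: empty strings mixed into numeric columns force
# pandas to store them as dtype object, so downstream numeric code
# does not treat them as floats. Column names are hypothetical.
df = pd.DataFrame({"latitude": ["45.5", "", "47.6"],
                   "longitude": ["-122.7", "-122.3", ""]})

# The workaround from the comment: turn empty strings into NaN in place...
df.replace("", np.nan, inplace=True)

# ...after which the columns can be coerced to proper floats.
df = df.astype(float)
print(df.dtypes)
```

After this, the latitude/longitude columns are real float64 columns with NaN where the empty strings were, which is what numeric routines expect.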
@jmcneal84 Thank you, this is a great catch! Thanks also for following up with the workaround, but this should definitely be handled by the system. I will look into this for the next revision.
@jmcneal84 would you have a test case for this you could upload here? I'm having trouble reproducing this exact case. Thank you!
I unfortunately can't upload an example of the data since it's proprietary. I do believe that it had to do with empty strings being in the latitude and longitude fields.
I was getting the code to work without issues by replacing the empty strings with NaN, but then it hit the issue again. I'm still trying to trace down exactly which field is causing the problem: the features part runs fine, but the error is thrown in the pairwise comparison. Since it seems to be a correlation or covariance issue, it may be that the values in the column are so close to the average that it returns 0, and thus causes it to throw the error. Below is the exact error output:

C:\Users\&lt;user&gt;\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\lib\function_base.py:2551: RuntimeWarning: Degrees of freedom &lt;= 0 for slice
File "", line 1, in
File "C:\Users\&lt;user&gt;\AppData\Local\Continuum\anaconda3\lib\site-packages\sweetviz\sv_public.py", line 13, in analyze
File "C:\Users\&lt;user&gt;\AppData\Local\Continuum\anaconda3\lib\site-packages\sweetviz\dataframe_report.py", line 243, in __init__
File "C:\Users\&lt;user&gt;\AppData\Local\Continuum\anaconda3\lib\site-packages\sweetviz\dataframe_report.py", line 423, in process_associations
File "C:\Users\&lt;user&gt;\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\series.py", line 2322, in corr
File "C:\Users\&lt;user&gt;\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\nanops.py", line 71, in _f
File "C:\Users\&lt;user&gt;\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\nanops.py", line 1352, in nancorr
File "C:\Users\&lt;user&gt;\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\nanops.py", line 1373, in func
File "&lt;__array_function__ internals&gt;", line 6, in corrcoef
File "C:\Users\&lt;user&gt;\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\lib\function_base.py", line 2551, in corrcoef
File "&lt;__array_function__ internals&gt;", line 6, in cov
File "C:\Users\&lt;user&gt;\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\lib\function_base.py", line 2480, in cov
I may have figured out the issue with the data. It seems to be coming from columns that are mostly 0's with an occasional 1, 2, or 3. Out of over 2 million rows of data, if a column is mostly 0's you'll get a divide-by-zero error. I'm running sweetviz on a random 15% sample of my data, so if it happens to pull only the records with 0 in a given column, that may explain where the error is coming from.
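A hypothetical illustration of that sampling risk (sizes scaled down from the 2-million-row case; the column name and seeds are made up):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# A column that is almost entirely zeros with a handful of 1s, 2s, and 3s,
# as described above (10,000 rows here instead of 2 million).
values = np.zeros(10_000)
hot_rows = rng.choice(10_000, size=5, replace=False)
values[hot_rows] = rng.integers(1, 4, size=5)
df = pd.DataFrame({"mostly_zero": values})

# A 15% random sample can easily miss every nonzero row, leaving a
# constant column whose standard deviation is 0, which is exactly the
# condition that makes a Pearson correlation divide by zero.
sample = df.sample(frac=0.15, random_state=0)
is_constant = sample["mostly_zero"].nunique() <= 1
print(f"sampled {len(sample)} rows; constant column: {is_constant}")
```

With only 5 nonzero rows out of 10,000, roughly 44% of 15% samples contain no nonzero value at all, so the failure is intermittent, which matches the behavior reported.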
I was thinking more about this last night. Since the pairwise comparison is looking at the correlation of numbers, if a column in a data set by chance has all of the same values, the standard deviation used in the correlation would be zero, and thus cause a divide-by-zero/floating-point issue.
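That hypothesis is easy to check with NumPy alone; a constant column has zero standard deviation, so Pearson's denominator is zero and np.corrcoef degenerates (a minimal sketch, not sweetviz's actual code path):

```python
import warnings
import numpy as np

constant = np.array([7.0, 7.0, 7.0, 7.0])   # zero standard deviation
varying = np.array([1.0, 2.0, 3.0, 4.0])

with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)
    r = np.corrcoef(constant, varying)[0, 1]

# Pearson r = cov(x, y) / (std(x) * std(y)); with std(constant) == 0 the
# division is 0/0, so NumPy emits a RuntimeWarning and returns NaN.
# If the floating-point error state is escalated, e.g. with
# np.errstate(invalid='raise'), the same operation raises
# FloatingPointError instead, similar to the crash reported above.
print(r)   # nan
```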
I haven't had time to try this, but a possible solution may be that line 423 of dataframe_report.py just needs to be changed from
Along with the above code I added one more try/except block, and it appears to work. The runtime warning is still displayed, but at least the process finishes.
Hello @jmcneal84! Thank you so much for your efforts in debugging this! I have been overwhelmed by work until recently and could not look at this again. I finally did, and I cannot seem to get that error. I tried setting columns to NaN, an empty string, "all 0's except a single 0.00001", or combinations of these, but it still comes out fine (at least no crashes; trivial correlations with all fields the same come out as 0). The exception catching would work, of course, but I would really like to understand the issue better. I know it's been a while, but I was wondering if you have had any thoughts on this since. Thanks again,
I actually haven't really looked at it either since I made the fix in my own code. The only thing I can remember was that the entire columns were the same values. It may have failed since there were multiple columns that were all zeros. |
Hello again @jmcneal84! With the latest version 1.1 I made error handling for the correlations more robust, but I don't think I was ever able to reproduce your error case. I think you may have moved on from that data, but if you ever get a chance to test with 1.1, I would be curious to see whether my changes fixed some of this. Thanks again,
Same problem with version 1.1.2
I ran into the same problem and found the two columns that were causing it. I did not manage to fix the problem. I am sending you an example, since I see you were not able to reproduce the error. There are a lot of empty values, which I think is the problem.
Hello @KlemenVrhovec, thank you so much for your report! Because of it I was able to quickly locate the issue: it is caused when only a SINGLE line contains non-NaN values for the 2 features in a correlation. I must have overlooked this case when I was testing (I probably only tested when there was NO data, but didn't check for a single line). Anyway, this makes a lot of sense (pretty obvious in retrospect), and I have now added a warning message and am assigning a correlation coefficient of 1.0. It's a bit of an edge case, but that feels like the best solution; see https://stats.stackexchange.com/questions/94150/why-is-the-pearson-correlation-1-when-only-two-data-values-are-available for where I gathered that strategy. This has been published in the new 2.0.8 build; please let me know if that fixes it, and if so I will close this issue. Thanks again!
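The failing condition (only one row where both features are non-NaN) and the 1.0 fallback can be sketched like so; `safe_corr` is a hypothetical helper for illustration, not the actual sweetviz 2.0.8 code:

```python
import numpy as np
import pandas as pd

def safe_corr(s1: pd.Series, s2: pd.Series) -> float:
    """Pearson correlation that handles the degenerate single-pair case."""
    valid = s1.notna() & s2.notna()
    n = int(valid.sum())
    if n == 0:
        return np.nan   # no overlap at all: correlation is undefined
    if n == 1:
        # A single paired observation has zero variance, so the usual
        # formula divides by zero. Following the strategy described above,
        # report a trivial correlation of 1.0 instead of crashing.
        return 1.0
    return s1[valid].corr(s2[valid], method="pearson")

# Only index 0 has non-NaN values in both series: the crashing case.
a = pd.Series([1.0, np.nan, np.nan, 4.0])
b = pd.Series([2.0, 5.0, np.nan, np.nan])
print(safe_corr(a, b))   # 1.0
```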
Closing, will reopen if it comes up again! |
I think you have the same (or similar) problem somewhere else as well:
I ran into a "FloatingPointError: divide by zero encountered in true_divide" in the pairwise feature portion of the code. Apparently there was a divide by zero issue in the cov part of the underlying code.
The trace of the error is as follows:
file: sv_public.py, line 13, in analyze, pairwise_analysis, feat_cfg)
file: dataframe_report.py, line 243, in __init__, self.process_associations(features_to_process, source_target_series, compare_target_series)
file: dataframe_report.py, line 423, in process_associations, feature.source.corr(other.source, method='pearson')
file: series.py, line 2254, in corr, this.values, other.values, method=method, min_periods=min_periods
file: nanops.py, line 69, in _f, return f(*args, **kwargs)
file: nanops.py, line 1240, in nancorr, return f(a, b)
file: nanops.py, line 1256, in _pearson, return np.corrcoef(a, b)[0, 1]
file: &lt;__array_function__ internals&gt;, line 6, in corrcoef
file: function_base.py, line 2526, in corrcoef, c = cov(x, y, rowvar)
file: &lt;__array_function__ internals&gt;, line 6, in cov
file: function_base.py, line 2455, in cov, c = np.true_divide(1, fact)
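That last frame is where the division happens: NumPy's cov computes a normalization factor fact = N - ddof (number of observations minus the delta degrees of freedom, 1 by default), so with a single valid observation fact is 0 and np.true_divide(1, fact) divides by zero. A minimal reproduction:

```python
import warnings
import numpy as np

# With one observation per variable, N - ddof = 1 - 1 = 0, so the
# 1/fact normalization inside np.cov is a division by zero.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", RuntimeWarning)   # "Degrees of freedom <= 0"
    with np.errstate(divide="raise"):
        try:
            np.cov([1.0], [2.0])
            raised = False
        except FloatingPointError:
            raised = True
print(raised)   # True: the same FloatingPointError as in the trace
```

Under NumPy's default error state this only warns and yields inf/NaN, which is why the crash appears only in environments where divide errors are escalated to exceptions.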
My dataframe had some empty strings where nulls should have been, but other columns with similar characteristics never threw this error.