FloatingPointError: divide by zero encountered in true_divide #52

Closed
jmcneal84 opened this issue Aug 3, 2020 · 17 comments
Labels
bug Something isn't working

Comments

@jmcneal84

I ran into a "FloatingPointError: divide by zero encountered in true_divide" in the pairwise feature portion of the code. Apparently there was a divide by zero issue in the cov part of the underlying code.

The trace of the error is as follows:
file: sv_public.py, line 13, in analyze, pairwise_analysis, feat_cfg)
file: dataframe_report.py, line 243, in __init__, self.process_associations(features_to_process, source_target_series, compare_target_series)
file: dataframe_report.py, line 423, in process_associations, feature.source.corr(other.source, method='pearson')
file: series.py, line 2254, in corr, this.values, other.values, method=method, min_periods=min_periods
file: nanops.py, line 69, in _f, return f(*args, **kwargs)
file: nanops.py, line 1240, in nancorr, return f(a, b)
file: nanops.py, line 1256, in _pearson, return np.corrcoef(a, b)[0, 1]
file: <array_function internals>, line 6, in corrcoef
file: function_base.py, line 2526, in corrcoef, c = cov(x, y, rowvar)
file: <array_function internals>, line 6, in cov
file: function_base.py, line 2455, in cov, c *= np.true_divide(1, fact)

My dataframe had some empty strings where nulls should have been. Other columns with similar contents never threw this error, though.

@jmcneal84
Author

It turns out that replacing the empty strings with NaN, by running dataframe.replace('', numpy.nan, inplace=True) on the affected fields, let it run to completion without issue.
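
For reference, a minimal sketch of that workaround on a made-up frame. The column names `lat` and `lon` and the values are hypothetical, not taken from the reporter's data. It replaces empty strings with NaN and then coerces the columns to a numeric dtype, on the assumption that a column that held empty strings is stored as `object`:

```python
import numpy as np
import pandas as pd

# Hypothetical data: numeric fields where missing values arrived as ''.
df = pd.DataFrame({
    "lat": [40.7, "", 41.2],
    "lon": [-74.0, -73.9, ""],
})

# Step 1: the workaround described above, written without inplace=True.
cleaned = df.replace("", np.nan)

# Step 2: make sure every column ends up with a numeric dtype;
# anything that still cannot be parsed becomes NaN.
cleaned = cleaned.apply(pd.to_numeric, errors="coerce")

print(cleaned.dtypes)
print(cleaned.isna().sum())
```

Returning a new frame instead of mutating in place keeps the original data available for comparison while debugging.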

@fbdesignpro
Owner

@jmcneal84 Thank you, this is a great catch! Thanks also for following up with the workaround, but this should definitely be handled by the system. I will look into this for the next revision.

@fbdesignpro
Owner

@jmcneal84 would you have a test case for this you could upload here? I'm having trouble reproducing this exact case. Thank you!

@jmcneal84
Author

I unfortunately can't upload an example of the data since it's proprietary. I do believe that it had to do with empty strings being in latitude and longitude fields.

@fbdesignpro added the bug label on Aug 12, 2020

@jmcneal84
Author

I was getting the code to work without issues by replacing the empty strings with NaN, but then it hit the issue again. I'm still trying to track down exactly which field is causing it: the features step runs fine, and the error is thrown in the pairwise comparison. Since it seems to be a correlation or covariance issue, it may be that the values in the column are so close to the average that the spread comes out as 0, which then causes the error. Below is the exact error output.

C:\Users<user>\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\lib\function_base.py:2551: RuntimeWarning: Degrees of freedom <= 0 for slice
c = cov(x, y, rowvar)
Traceback (most recent call last):

File "", line 1, in
my_report = sv.analyze(df)

File "C:\Users<user>\AppData\Local\Continuum\anaconda3\lib\site-packages\sweetviz\sv_public.py", line 13, in analyze
pairwise_analysis, feat_cfg)

File "C:\Users<user>\AppData\Local\Continuum\anaconda3\lib\site-packages\sweetviz\dataframe_report.py", line 243, in __init__
self.process_associations(features_to_process, source_target_series, compare_target_series)

File "C:\Users<user>\AppData\Local\Continuum\anaconda3\lib\site-packages\sweetviz\dataframe_report.py", line 423, in process_associations
feature.source.corr(other.source, method='pearson')

File "C:\Users<user>\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\series.py", line 2322, in corr
this.values, other.values, method=method, min_periods=min_periods

File "C:\Users<user>\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\nanops.py", line 71, in _f
return f(*args, **kwargs)

File "C:\Users<user>l\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\nanops.py", line 1352, in nancorr
return f(a, b)

File "C:\Users<user>\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\nanops.py", line 1373, in func
return np.corrcoef(a, b)[0, 1]

File "<array_function internals>", line 6, in corrcoef

File "C:\Users<user>\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\lib\function_base.py", line 2551, in corrcoef
c = cov(x, y, rowvar)

File "<array_function internals>", line 6, in cov

File "C:\Users<user>l\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\lib\function_base.py", line 2480, in cov
c *= np.true_divide(1, fact)

@jmcneal84
Author

I may have figured out the issue with the data. It seems to be coming from columns that are mostly 0's with an occasional 1, 2 or 3. Out of over 2 million rows of data, if you have mostly 0's you'll get a divide by 0 error. I'm doing a random 15% sample of my data in sweetviz, and so if it happens to pull all the records with 0 in a specified column, that may explain where the error is coming from.
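
One way to check this hypothesis before running the report is to look for columns that are constant in the sampled data. The sketch below uses made-up data and a hypothetical helper name, `constant_columns`; the fixed slice stands in for a random sample that happens to miss the non-zero row:

```python
import pandas as pd

def constant_columns(frame: pd.DataFrame) -> list:
    """Return names of columns with at most one distinct non-NaN value."""
    counts = frame.nunique(dropna=True)
    return [name for name, n in counts.items() if n <= 1]

# Hypothetical data: 'flag' is almost always 0, 'x' varies.
df = pd.DataFrame({
    "flag": [0, 0, 0, 0, 0, 0, 0, 3],
    "x": [1.0, 2.5, 3.1, 4.8, 5.2, 6.9, 7.3, 8.0],
})

# Stand-in for a random sample that misses the single non-zero row.
sample = df.iloc[:4]

print(constant_columns(df))      # columns constant in the full data
print(constant_columns(sample))  # columns constant in the subset
```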

@jmcneal84
Author

I was thinking more about this last night. Since the pairwise comparison is looking at the correlation of numbers, if a column in a data set by chance has all of the same values, the standard deviation used in the correlation would be zero, and thus cause a divide by zero/floating point issue.
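
To make that reasoning concrete, here is a plain-Python sketch of the textbook Pearson formula. It is not the pandas or NumPy implementation, and the function name `pearson` is only illustrative. When one column is constant, its sum of squared deviations is zero, so the denominator of the formula is zero and the coefficient is undefined:

```python
import math

def pearson(xs, ys):
    """Textbook Pearson r; returns None when the denominator is zero."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(var_x * var_y)
    if denom == 0.0:
        return None  # undefined: at least one column has no spread
    return cov / denom

constant = [0, 0, 0, 0]
varying = [1, 2, 3, 4]

print(pearson(constant, varying))       # undefined for a constant column
print(pearson(varying, [2, 4, 6, 8]))   # perfectly linear pair
```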

@jmcneal84
Author

I haven't had time to try this, but a possible solution may be to change line 423 of dataframe_report.py from
    cur_associations[other.source.name] = \
        feature.source.corr(other.source, method='pearson')
to
    try:
        cur_associations[other.source.name] = \
            feature.source.corr(other.source, method='pearson')
    except:
        cur_associations[other.source.name] = 0.0

@jmcneal84
Author

Along with the above code I added one more try and except block, and it appears to work. The runtime warning is still displayed, but at least it finishes the process.
This is the code I changed in the dataframe_report.py file, starting at line 417:

elif self[feature_name]["type"] == FeatureType.TYPE_NUM:
    # NUM source
    # ------------------------------------
    if self[other.source.name]["type"] == FeatureType.TYPE_NUM:
        # NUM-NUM
        try:
            cur_associations[other.source.name] = \
                feature.source.corr(other.source, method='pearson')
        except:
            cur_associations[other.source.name] = 0.0
        # TODO: display correlation error better in graph!
        if isnan(cur_associations[other.source.name]):
            cur_associations[other.source.name] = 0.0
        mirror_association(self._associations, feature_name, other.source.name, \
                           cur_associations[other.source.name])
        if process_compare:
            try:
                cur_associations_compare[other.source.name] = \
                    feature.compare.corr(other.compare, method='pearson')
            except:
                cur_associations_compare[other.source.name] = 0.0
            # TODO: display correlation error better in graph!
            if isnan(cur_associations_compare[other.source.name]):
                cur_associations_compare[other.source.name] = 0.0
            mirror_association(self._associations_compare, feature_name, other.source.name, \
                               cur_associations_compare[other.source.name])
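
A narrower alternative to a bare `except:` is to catch only `FloatingPointError`. The sketch below is standalone and is not sweetviz's code; `safe_corr` and its `fallback` parameter are hypothetical names. It sets NumPy's error state to raise on divide-by-zero and invalid operations inside the call, which is the mode in which NumPy reports these conditions as `FloatingPointError` rather than as a `RuntimeWarning`:

```python
import warnings
import numpy as np

def safe_corr(a, b, fallback=0.0):
    """Pearson r over rows where both inputs are non-NaN, else `fallback`."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    valid = ~(np.isnan(a) | np.isnan(b))
    try:
        # Turn floating-point problems into exceptions, then catch only
        # that exception type instead of everything.
        with np.errstate(divide="raise", invalid="raise"), \
                warnings.catch_warnings():
            warnings.simplefilter("ignore", RuntimeWarning)
            r = np.corrcoef(a[valid], b[valid])[0, 1]
    except FloatingPointError:
        return fallback
    return fallback if np.isnan(r) else float(r)

nan = float("nan")
print(safe_corr([1, 2, 3], [2, 4, 6]))               # ordinary case
print(safe_corr([1.0, nan, nan], [2.0, 3.0, nan]))   # one usable row
print(safe_corr([0, 0, 0], [1, 2, 3]))               # constant column
```

Catching a single exception type keeps unrelated bugs (a typo, a missing key) from being silently turned into a correlation of 0.0.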

@fbdesignpro
Owner

Hello @jmcneal84! Thank you so much for your efforts in debugging this! I have literally been overwhelmed by work until recently and could not look at this again. I finally did, and I cannot seem to get that error. I tried setting columns to NAN, an empty string, "all 0's except a single 0.00001" or combinations of these, but it's still coming out fine (at least no crashes; trivial correlations with all fields the same come out as 0).

The exception catching would work, of course, but I would really like to understand the issue better. I know it's been a while but I was wondering if you had any thoughts on this since.

Thanks again,
Francois

@jmcneal84
Author

I actually haven't looked at it either since I made the fix in my own code. The only thing I can remember is that entire columns held the same value. It may have failed because there were multiple columns that were all zeros.
The error I kept getting had to do with the Pearson correlation dividing by zero at some point in the pairwise comparison of two columns.

@fbdesignpro
Owner

Hello again @jmcneal84! With the latest version 1.1 I made error handling for the correlations more robust, but I don't think I was ever able to get your error case. I think you may have moved on from that data but if you ever get a chance to test with 1.1 I would be curious to see if my changes fixed some of this.

Thanks again,
Francois

@luchaoqi

Same problem with version 1.1.2

~/anaconda3/envs/python3/lib/python3.6/site-packages/sweetviz/sv_public.py in compare_intra(source_df, condition_series, names, target_feat, feat_cfg, pairwise_analysis)
     42     report = sweetviz.DataframeReport([data_true, names[0]], target_feat,
     43                                       [data_false, names[1]],
---> 44                                       pairwise_analysis, feat_cfg)
     45     return report
     46 

~/anaconda3/envs/python3/lib/python3.6/site-packages/sweetviz/dataframe_report.py in __init__(self, source, target_feature_name, compare, pairwise_analysis, fc)
    245             self.progress_bar.reset(total=len(features_to_process))
    246             self.progress_bar.set_description_str("[Step 2/3] Processing Pairwise Features")
--> 247             self.process_associations(features_to_process, source_target_series, compare_target_series)
    248 
    249             self.progress_bar.reset(total=1)

~/anaconda3/envs/python3/lib/python3.6/site-packages/sweetviz/dataframe_report.py in process_associations(self, features_to_process, source_target_series, compare_target_series)
    445                         if process_compare:
    446                             cur_associations_compare[other.source.name] = \
--> 447                                 feature.compare.corr(other.compare, method='pearson')
    448                             # TODO: display correlation error better in graph!
    449                             if isnan(cur_associations_compare[other.source.name]):

~/anaconda3/envs/python3/lib/python3.6/site-packages/pandas/core/series.py in corr(self, other, method, min_periods)
   2320         if method in ["pearson", "spearman", "kendall"] or callable(method):
   2321             return nanops.nancorr(
-> 2322                 this.values, other.values, method=method, min_periods=min_periods
   2323             )
   2324 

~/anaconda3/envs/python3/lib/python3.6/site-packages/pandas/core/nanops.py in _f(*args, **kwargs)
     69             try:
     70                 with np.errstate(invalid="ignore"):
---> 71                     return f(*args, **kwargs)
     72             except ValueError as e:
     73                 # we want to transform an object array

~/anaconda3/envs/python3/lib/python3.6/site-packages/pandas/core/nanops.py in nancorr(a, b, method, min_periods)
   1350 
   1351     f = get_corr_func(method)
-> 1352     return f(a, b)
   1353 
   1354 

~/anaconda3/envs/python3/lib/python3.6/site-packages/pandas/core/nanops.py in func(a, b)
   1371 
   1372         def func(a, b):
-> 1373             return np.corrcoef(a, b)[0, 1]
   1374 
   1375         return func

<__array_function__ internals> in corrcoef(*args, **kwargs)

~/anaconda3/envs/python3/lib/python3.6/site-packages/numpy/lib/function_base.py in corrcoef(x, y, rowvar, bias, ddof)
   2549         warnings.warn('bias and ddof have no effect and are deprecated',
   2550                       DeprecationWarning, stacklevel=3)
-> 2551     c = cov(x, y, rowvar)
   2552     try:
   2553         d = diag(c)

<__array_function__ internals> in cov(*args, **kwargs)

~/anaconda3/envs/python3/lib/python3.6/site-packages/numpy/lib/function_base.py in cov(m, y, rowvar, bias, ddof, fweights, aweights)
   2478         X_T = (X*w).T
   2479     c = dot(X, X_T.conj())
-> 2480     c *= np.true_divide(1, fact)
   2481     return c.squeeze()
   2482 

FloatingPointError: divide by zero encountered in true_divide

@KlemenVrhovec

I ran into the same problem and found the two columns that cause it, but I did not manage to fix it. I am sending you an example, since I see you were not able to reproduce the error. There are a lot of empty values, which I think is the problem.

example

import pandas as pd
import sweetviz as sv
df = pd.read_csv('data.csv')
my_report = sv.analyze(source=[df, 'Report'], pairwise_analysis='on')

@fbdesignpro
Owner

Hello @KlemenVrhovec, thank you so much for your report! Because of it I was able to quickly locate the issue; it happens when only a SINGLE row contains non-NaN values for both features in a correlation. I must have overlooked this case when I was testing (I probably only tested when there was NO data, but didn't check for a single row).
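
The arithmetic behind this can be sketched in plain Python. The tracebacks above end at `c *= np.true_divide(1, fact)` in NumPy's `cov`, preceded by the warning "Degrees of freedom <= 0 for slice". With the default `ddof` of 1, the divisor is the number of jointly valid rows minus 1, which is zero when only one row survives. The helper names and data below are hypothetical:

```python
import math

def jointly_valid_rows(xs, ys):
    """Count rows where neither value is NaN."""
    return sum(1 for x, y in zip(xs, ys)
               if not (math.isnan(x) or math.isnan(y)))

def covariance_scale(n_rows, ddof=1):
    """Return 1 / (n_rows - ddof), or None when the divisor is not positive."""
    fact = n_rows - ddof
    if fact <= 0:
        return None  # the "Degrees of freedom <= 0" situation
    return 1.0 / fact

nan = float("nan")
col_a = [1.5, nan, nan, 4.0]
col_b = [nan, 2.0, nan, 7.0]

# Only the last row has a value in both columns.
n = jointly_valid_rows(col_a, col_b)
print(n, covariance_scale(n))
```

A check like `jointly_valid_rows(...) < 2` before computing the correlation would catch both the no-data case and the single-row case.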

Anyway, this makes a lot of sense (pretty obvious, actually, in retrospect). I have now added a warning message and assign a correlation coefficient of 1.0 in that case. It's a bit of an edge case but that feels like the best solution; see https://stats.stackexchange.com/questions/94150/why-is-the-pearson-correlation-1-when-only-two-data-values-are-available for where I gathered that strategy.

This has been published in the new 2.0.8 build. Please let me know if that fixes it; if so, I will close this issue.

Thanks again!

@fbdesignpro
Owner

Closing, will reopen if it comes up again!

@ds-noahdolev

I think you have the same (or similar) problem somewhere else as well:

----> 1 my_report = sweetviz.compare([train, "Train"], [test, "Test"], "target")

/opt/Anaconda3/envs/basic_ml/lib/python3.8/site-packages/sweetviz/sv_public.py in compare(source, compare, target_feat, feat_cfg, pairwise_analysis)
     20             feat_cfg: FeatureConfig = None,
     21             pairwise_analysis: str = 'auto'):
---> 22     report = sweetviz.DataframeReport(source, target_feat, compare,
     23                                       pairwise_analysis, feat_cfg)
     24     return report

/opt/Anaconda3/envs/basic_ml/lib/python3.8/site-packages/sweetviz/dataframe_report.py in __init__(self, source, target_feature_name, compare, pairwise_analysis, fc)
    278             self.progress_bar.reset(total=len(features_to_process))
    279             self.progress_bar.set_description_str("[Step 2/3] Processing Pairwise Features")
--> 280             self.process_associations(features_to_process, source_target_series, compare_target_series)
    281 
    282             self.progress_bar.reset(total=1)

/opt/Anaconda3/envs/basic_ml/lib/python3.8/site-packages/sweetviz/dataframe_report.py in process_associations(self, features_to_process, source_target_series, compare_target_series)
    450                         if process_compare:
    451                             cur_associations_compare[other.source.name] = \
--> 452                                 associations.correlation_ratio(feature.compare, other.compare)
    453                             mirror_association(self._associations_compare, feature_name, other.source.name, \
    454                                                cur_associations_compare[other.source.name])

/opt/Anaconda3/envs/basic_ml/lib/python3.8/site-packages/sweetviz/from_dython.py in correlation_ratio(categories, measurements, nan_strategy, nan_replace_value)
    244         eta = 0.0
    245     else:
--> 246         eta = np.sqrt(numerator / denominator)
    247     return eta

FloatingPointError: divide by zero encountered in double_scalars
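
The failing line divides `numerator` by `denominator`, and the condition guarding the `eta = 0.0` branch is not visible in the traceback. As a hedged illustration only, here is a plain-Python version of the standard correlation ratio with an explicit guard on the denominator. It is not a copy of `from_dython.py`, and it ignores the `nan_strategy` handling that the real function takes as a parameter:

```python
import math
from collections import defaultdict

def correlation_ratio(categories, measurements):
    """Correlation ratio (eta) with an explicit guard on the denominator."""
    groups = defaultdict(list)
    for cat, value in zip(categories, measurements):
        groups[cat].append(value)

    total_mean = sum(measurements) / len(measurements)
    # Between-group variation: group size times squared distance of the
    # group mean from the overall mean.
    numerator = sum(
        len(vals) * (sum(vals) / len(vals) - total_mean) ** 2
        for vals in groups.values()
    )
    # Total variation of the measurements around the overall mean.
    denominator = sum((v - total_mean) ** 2 for v in measurements)

    if numerator == 0 or denominator == 0:
        return 0.0  # no between-group or no overall variation
    return math.sqrt(numerator / denominator)

print(correlation_ratio(list("aabb"), [1, 1, 3, 3]))  # groups fully separated
print(correlation_ratio(list("aabb"), [5, 5, 5, 5]))  # constant measurements
```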


5 participants