Skewed Data Distributions and Homoscedasticity #81

Open
vdemchenko3 opened this issue Mar 21, 2023 · 9 comments

@vdemchenko3

Hi,

I'm wondering what the best approach is for data that is highly right-skewed. Is it best to take a log transform to make it more "normal", or does DirectLiNGAM deal with skewed data? The causal graphs are substantially different if I take the log and then normalise the data, compared to only normalising the data and keeping the skewed distribution. I couldn't find an implementation of Hyvarinen & Smith (2013) for skewed data.

Also, my understanding is that LiNGAM is specifically made for non-Gaussian distributions, but I'm a bit confused about how this affects the adjacency matrix computation using linear regression, since as I understand it non-Gaussian distributions violate homoscedasticity.

Any clarity on these two topics would be greatly appreciated!

@sshimizu2006
Collaborator

sshimizu2006 commented Mar 22, 2023

You don't have to take a log transform to make variables more normal. Non-Gaussianity itself does not necessarily violate homoscedasticity (constant variance).
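A minimal sketch of this point on synthetic data, using only the Python standard library. The variable names, sample size, and coefficients are illustrative assumptions and are not taken from the lingam package. Both the cause and the noise are right-skewed (exponential), yet the noise is independent of the cause, so the regression residuals have roughly the same variance for small and large values of `x`:

```python
import random
import statistics

def ols_fit(x, y):
    """Return (slope, intercept) of the least-squares line y ~ slope * x + intercept."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

rng = random.Random(0)  # fixed seed so the run is reproducible
n = 20000

# Right-skewed cause and right-skewed, zero-mean noise (both assumed for illustration).
x = [rng.expovariate(1.0) for _ in range(n)]
e = [rng.expovariate(1.0) - 1.0 for _ in range(n)]
y = [2.0 * xi + ei for xi, ei in zip(x, e)]

slope, intercept = ols_fit(x, y)
residuals = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]

# Compare the residual variance below and above the median of x.
median_x = statistics.median(x)
var_low = statistics.pvariance([r for xi, r in zip(x, residuals) if xi <= median_x])
var_high = statistics.pvariance([r for xi, r in zip(x, residuals) if xi > median_x])

print(f"slope={slope:.3f}  var_low={var_low:.3f}  var_high={var_high:.3f}")
```

The two variances come out close to each other even though nothing here is Gaussian, which is the sense in which non-Gaussianity and homoscedasticity are separate properties.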

@vdemchenko3
Author

Hi,

Thank you for your reply!

What about scaling the data so that all variables lie in [0, 1]? I've run analyses both with and without scaling and found significantly different DAGs.

@sshimizu2006
Collaborator

sshimizu2006 commented Apr 27, 2023

If you transform your data, the data-generating process will change. That would be the reason you get different results.

@vdemchenko3
Author

I see. So is the suggestion to not change the data at all (no minmax scaling, no log transforms) before running causal discovery?

@sshimizu2006
Collaborator

Well, my point is that it depends on the class of the data generation process you assume.

@vdemchenko3
Author

Could you elaborate a bit on that? I'm mostly working with survey-type data where respondents answer various questions.

@sshimizu2006
Collaborator

Ok, well, my suggestion is that you can do log transforms if you find that previous works in your field do that, but it would be better not to do minmax scaling.

@vdemchenko3
Author

Why is it better not to do minmax scaling?

@sshimizu2006
Collaborator

I don't have a strong reason. I just don't often see minmax scaling used in the context of causal discovery. The point is that if you apply some transformation and then run LiNGAM, for example, you are assuming a linear non-Gaussian model for the transformed data. You need to consider whether that assumption is valid.
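A rough illustration of why the choice of scale matters, again on synthetic data with the Python standard library only. Everything below (names, distributions, coefficients) is an assumption made for the sketch and says nothing about any particular survey dataset. The data are generated by a linear model with independent non-Gaussian noise on the raw scale. After taking logs of both variables, a straight-line fit leaves residuals whose spread depends on the cause, so the linear model with independent noise no longer describes the transformed data:

```python
import math
import random
import statistics

def residual_variance_by_half(x, y):
    """Fit y ~ a * x + b by least squares, then return the residual
    variance for points with x at or below its median and above it."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    intercept = my - slope * mx
    med = statistics.median(x)
    low = [yi - (slope * xi + intercept) for xi, yi in zip(x, y) if xi <= med]
    high = [yi - (slope * xi + intercept) for xi, yi in zip(x, y) if xi > med]
    return statistics.pvariance(low), statistics.pvariance(high)

rng = random.Random(0)  # fixed seed so the run is reproducible
n = 20000

# Linear data-generating process on the raw scale (assumed for illustration):
# a right-skewed positive cause and uniform noise that is independent of it.
x = [rng.lognormvariate(0.0, 1.0) for _ in range(n)]
e = [rng.random() for _ in range(n)]
y = [2.0 * xi + ei for xi, ei in zip(x, e)]

raw_low, raw_high = residual_variance_by_half(x, y)
log_low, log_high = residual_variance_by_half(
    [math.log(v) for v in x], [math.log(v) for v in y]
)

print(f"raw scale: var_low={raw_low:.4f}  var_high={raw_high:.4f}")
print(f"log scale: var_low={log_low:.4f}  var_high={log_high:.4f}")
```

On the raw scale the two residual variances are similar; on the log scale the residuals are much more spread out for small values of the cause. The reverse can also happen: if the data are generated linearly on the log scale, the raw scale is the one that violates the model, which is why the appropriate scale depends on the data-generating process you are willing to assume.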
