Large calculation time difference between r-causal and causal-cmd v1.5.0 #83
Comments
@yasu-sh One of the issues has to do with default parameters. If you don't provide a specific parameter, it will use the default value set by Java (boolean type) or by Tetrad (int and double types). Each Tetrad algorithm has its own default parameter values, optimized for it. Using the --default flag makes all the parameters, including boolean ones, use the algorithm-specific default values. This may improve your run time.
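For example, adapting the command that appears later in this thread (note the flag is written both "--default" and "--defaults" in this discussion--check causal-cmd --help for the exact spelling in your version):

```
java -jar causal-cmd-1.5.0-jar-with-dependencies.jar --data-type discrete --delimiter comma --score bdeu-score --algorithm fges --dataset dt.hailfinder.csv --default
```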
Of course, there are other factors that also affect the runtime. |
@yasu-sh Also, from 2019 to this year is a lot of time--we've made a number of optimizations to the code since then, even in the FGES class. Though I'm sure Kevin is right about parameter choices and how the parameters get used. |
@jdramsey Thanks for your views and opinions. I tried the argument as you recommended. [1] "INFO No prior knowledge for TETRAD."
|
Thanks! |
As I mentioned yesterday, I compared 3 conditions:
basic execution in PowerShell, v1.5.0 vs. v1.2.0. I have noticed several points, as below:
How come? I feel like profiling v1.2.0, but it might not yield new ideas. Certainly the hailfinder ground truth has many high-degree nodes, which would make the search time-consuming. Logs
|
Well, one comment I can make is that the R version is very out of date; the person working on that left the group some time ago, and there's been some difficulty updating it, though a number of people are interested, so I may try my hand at it myself (though I am not skilled at R programming). If it's just a matter of changing the causal-cmd version and updating the calls made, perhaps I can. But it's not using 1.5.0--it's some much older version. |
Much work has been done since 1.2.0 in any case; I would rely more on 1.5.0. BTW we will soon come out with a new version; enough changes have accrued, I think, even in this short time. |
Thanks. Okay, I will rely on ver. 1.5.0 and use different settings on the latest version, as you recommended. Certainly accuracy and speed have a trade-off. |
Hold on, for 1.5.0, could you add the "--defaults" flag? |
Hold on...let me think...
|
lngamma huh? That's interesting. I did switch from our "quick and dirty" ln gamma calculation to the library lngamma calculation in the Apache library--I wonder if that could have made the difference? I suppose I could switch back...
|
Here's the lngamma I was using:
```java
public static double lngamma(double xx) {
    // Returns the value ln[Γ(xx)] for xx > 0.
    if (xx <= 0) return Double.NaN;

    // Internal arithmetic is done in double precision, a nicety
    // that you can omit if five-figure accuracy is good enough.
    double x;
    double y;
    double tmp;
    double ser;
    int j;

    y = x = xx;
    tmp = x + 5.5;
    tmp -= (x + 0.5) * Math.log(tmp);
    ser = 1.000000000190015;

    for (j = 0; j <= 5; j++) {
        ser += ProbUtils.cof[j] / ++y;
    }

    return -tmp + Math.log(2.5066282746310005 * ser / x);
}

private static final double[] cof = {76.18009172947146, -86.50532032941677,
        24.01409824083091, -1.231739572450155, 0.1208650973866179e-2,
        -0.5395239384953e-5};
```
Here's the loggamma I switched to in the Apache library:
```java
public static double logGamma(double x) {
    double ret;
    if (!Double.isNaN(x) && !(x <= 0.0)) {
        if (x < 0.5) {
            return logGamma1p(x) - FastMath.log(x);
        }
        if (x <= 2.5) {
            return logGamma1p(x - 0.5 - 0.5);
        }
        if (x <= 8.0) {
            int n = (int) FastMath.floor(x - 1.5);
            double prod = 1.0;
            for (int i = 1; i <= n; ++i) {
                prod *= x - (double) i;
            }
            return logGamma1p(x - (double) (n + 1)) + FastMath.log(prod);
        }
        // Lanczos approximation for large x.
        double sum = lanczos(x);
        double tmp = x + 4.7421875 + 0.5;
        ret = (x + 0.5) * FastMath.log(tmp) - tmp + HALF_LOG_2_PI
                + FastMath.log(sum / x);
    } else {
        ret = Double.NaN;
    }
    return ret;
}

public static double logGamma1p(double x)
        throws NumberIsTooSmallException, NumberIsTooLargeException {
    if (x < -0.5) {
        throw new NumberIsTooSmallException(x, -0.5, true);
    } else if (x > 1.5) {
        throw new NumberIsTooLargeException(x, 1.5, true);
    } else {
        return -FastMath.log1p(invGamma1pm1(x));
    }
}

public static double log1p(double x) {
    if (x == -1.0) {
        return Double.NEGATIVE_INFINITY;
    } else if (x == Double.POSITIVE_INFINITY) {
        return Double.POSITIVE_INFINITY;
    } else {
        double xpa;
        if (!(x > 1.0E-6) && !(x < -1.0E-6)) {
            // Tiny |x|: short Taylor series for ln(1 + x).
            xpa = (x * 0.3333333333333333 - 0.5) * x + 1.0;
            return xpa * x;
        } else {
            // Two-part representation of 1 + x, then a high-precision log.
            xpa = 1.0 + x;
            double xpb = -(xpa - 1.0 - x);
            double[] hiPrec = new double[2];
            double lores = log(xpa, hiPrec);
            if (Double.isInfinite(lores)) {
                return lores;
            } else {
                double fx1 = xpb / xpa;
                double epsilon = 0.5 * fx1 + 1.0;
                return epsilon * fx1 + hiPrec[1] + hiPrec[0];
            }
        }
    }
}

private static double log(double x, double[] hiPrec) {
    if (x == 0.0) {
        return Double.NEGATIVE_INFINITY;
    } else {
        long bits = Double.doubleToRawLongBits(x);
        if (((bits & Long.MIN_VALUE) != 0L || x != x) && x != 0.0) {
            // Negative or NaN argument.
            if (hiPrec != null) {
                hiPrec[0] = Double.NaN;
            }
            return Double.NaN;
        } else if (x == Double.POSITIVE_INFINITY) {
            if (hiPrec != null) {
                hiPrec[0] = Double.POSITIVE_INFINITY;
            }
            return Double.POSITIVE_INFINITY;
        } else {
            int exp = (int) (bits >> 52) - 1023;
            if ((bits & 9218868437227405312L) == 0L) {
                // Subnormal: renormalize the mantissa.
                if (x == 0.0) {
                    if (hiPrec != null) {
                        hiPrec[0] = Double.NEGATIVE_INFINITY;
                    }
                    return Double.NEGATIVE_INFINITY;
                }
                for (bits <<= 1; (bits & 4503599627370496L) == 0L; bits <<= 1) {
                    --exp;
                }
            }
            double ab;
            double xa;
            if ((exp == -1 || exp == 0) && x < 1.01 && x > 0.99 && hiPrec == null) {
                // Quick path for x near 1: polynomial in (x - 1), carried as
                // split high/low parts (ya, yb); the multiplications by
                // 1.073741824E9 are the 2^30 splitting trick.
                xa = x - 1.0;
                double xb = xa - x + 1.0;
                double tmp = xa * 1.073741824E9;
                double aa = xa + tmp - tmp;
                ab = xa - aa;
                xa = aa;
                xb = ab;
                double[] lnCoef_last = LN_QUICK_COEF[LN_QUICK_COEF.length - 1];
                double ya = lnCoef_last[0];
                double yb = lnCoef_last[1];
                for (int i = LN_QUICK_COEF.length - 2; i >= 0; --i) {
                    // y = y * x + coef, in split arithmetic.
                    aa = ya * xa;
                    ab = ya * xb + yb * xa + yb * xb;
                    tmp = aa * 1.073741824E9;
                    ya = aa + tmp - tmp;
                    yb = aa - ya + ab;
                    double[] lnCoef_i = LN_QUICK_COEF[i];
                    aa = ya + lnCoef_i[0];
                    ab = yb + lnCoef_i[1];
                    tmp = aa * 1.073741824E9;
                    ya = aa + tmp - tmp;
                    yb = aa - ya + ab;
                }
                aa = ya * xa;
                ab = ya * xb + yb * xa + yb * xb;
                tmp = aa * 1.073741824E9;
                ya = aa + tmp - tmp;
                yb = aa - ya + ab;
                return ya + yb;
            } else {
                // General path: ln(x) = exp * ln(2) + ln(mantissa) + ln(1 + eps).
                double[] lnm = FastMath.lnMant.LN_MANT[(int) ((bits
                        & 4499201580859392L) >> 42)];
                double epsilon = (double) (bits & 4398046511103L)
                        / (4.503599627370496E15 + (double) (bits & 4499201580859392L));
                double lnza = 0.0;
                double lnzb = 0.0;
                double tmp;
                double aa;
                if (hiPrec != null) {
                    // High-precision polynomial in split arithmetic.
                    tmp = epsilon * 1.073741824E9;
                    aa = epsilon + tmp - tmp;
                    ab = epsilon - aa;
                    xa = aa;
                    double numer = (double) (bits & 4398046511103L);
                    double denom = 4.503599627370496E15
                            + (double) (bits & 4499201580859392L);
                    aa = numer - aa * denom - ab * denom;
                    double xb = ab + aa / denom;
                    double[] lnCoef_last =
                            LN_HI_PREC_COEF[LN_HI_PREC_COEF.length - 1];
                    double ya = lnCoef_last[0];
                    double yb = lnCoef_last[1];
                    for (int i = LN_HI_PREC_COEF.length - 2; i >= 0; --i) {
                        aa = ya * xa;
                        ab = ya * xb + yb * xa + yb * xb;
                        tmp = aa * 1.073741824E9;
                        ya = aa + tmp - tmp;
                        yb = aa - ya + ab;
                        double[] lnCoef_i = LN_HI_PREC_COEF[i];
                        aa = ya + lnCoef_i[0];
                        ab = yb + lnCoef_i[1];
                        tmp = aa * 1.073741824E9;
                        ya = aa + tmp - tmp;
                        yb = aa - ya + ab;
                    }
                    aa = ya * xa;
                    ab = ya * xb + yb * xa + yb * xb;
                    lnza = aa + ab;
                    lnzb = -(lnza - aa - ab);
                } else {
                    // Standard-precision polynomial for ln(1 + eps), Horner form.
                    lnza = -0.16624882440418567;
                    lnza = lnza * epsilon + 0.19999954120254515;
                    lnza = lnza * epsilon + -0.2499999997677497;
                    lnza = lnza * epsilon + 0.3333333333332802;
                    lnza = lnza * epsilon + -0.5;
                    lnza = lnza * epsilon + 1.0;
                    lnza *= epsilon;
                }
                // Compensated summation of exp * ln(2) + ln(mantissa) + lnz.
                tmp = 0.6931470632553101 * (double) exp;
                aa = 0.0;
                ab = tmp + lnm[0];
                xa = -(ab - tmp - lnm[0]);
                tmp = ab;
                aa += xa;
                ab += lnza;
                xa = -(ab - tmp - lnza);
                tmp = ab;
                aa += xa;
                ab += 1.1730463525082348E-7 * (double) exp;
                xa = -(ab - tmp - 1.1730463525082348E-7 * (double) exp);
                tmp = ab;
                aa += xa;
                ab += lnm[1];
                xa = -(ab - tmp - lnm[1]);
                tmp = ab;
                aa += xa;
                ab += lnzb;
                xa = -(ab - tmp - lnzb);
                aa += xa;
                if (hiPrec != null) {
                    hiPrec[0] = ab;
                    hiPrec[1] = aa;
                }
                return ab + aa;
            }
        }
    }
}
```
You're suggesting the second method is slower than the first. That's a real possibility; and then the question is, which is more accurate? My hunch is that the library method is more accurate--it's a more careful calculation.
If I get a chance, I'll compare the two.
Joe
|
Yeah, I switched to the Apache Gamma.logGamma method in the 7.1.3-1 release--item 52 here: |
...and the BDeu score makes extensive use of this function:
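For context, the standard BDeu score (Heckerman et al.; this formula is background, not quoted from the thread) is a large sum of log-gamma terms, which is why this function sits on the search's hot path:

$$\mathrm{BDeu}=\sum_{i=1}^{n}\sum_{j=1}^{q_i}\left[\ln\Gamma\!\left(\frac{\alpha}{q_i}\right)-\ln\Gamma\!\left(N_{ij}+\frac{\alpha}{q_i}\right)+\sum_{k=1}^{r_i}\left(\ln\Gamma\!\left(N_{ijk}+\frac{\alpha}{r_i q_i}\right)-\ln\Gamma\!\left(\frac{\alpha}{r_i q_i}\right)\right)\right]$$

where $r_i$ is the number of categories of variable $i$, $q_i$ the number of joint parent configurations, $N_{ijk}$ the cell counts with $N_{ij}=\sum_k N_{ijk}$, and $\alpha$ the equivalent sample size. Every term is one log-gamma call, and FGES evaluates this score for a huge number of candidate parent sets.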
|
Holy crap, I never did the comparison--out to 6 decimal places they're the same!!
Here's the code:
|
So which is faster? Here's my code:
Here's the result:
Gamma.logGamma is significantly faster in CPU time. Hmm... so that's not the reason. |
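The code and results referred to above were posted as attachments and aren't reproduced here. A minimal sketch of such a comparison--assuming Tetrad's edu.cmu.tetrad.util.ProbUtils.lngamma and Apache Commons Math 3's org.apache.commons.math3.special.Gamma.logGamma are both on the classpath (the ProbUtils location is an assumption based on the code above)--might look like:

```java
import org.apache.commons.math3.special.Gamma;
import edu.cmu.tetrad.util.ProbUtils;  // assumed location of lngamma

public class LogGammaCompare {
    public static void main(String[] args) {
        // Accuracy: largest absolute disagreement over a grid of inputs.
        double maxDiff = 0.0;
        for (double x = 0.5; x <= 1000.0; x += 0.25) {
            double diff = Math.abs(ProbUtils.lngamma(x) - Gamma.logGamma(x));
            maxDiff = Math.max(maxDiff, diff);
        }
        System.out.println("max |lngamma - logGamma| = " + maxDiff);

        // Speed: crude wall-clock timing of many calls.
        final int n = 20_000_000;
        double sink = 0.0;  // accumulate results so the JIT can't drop the calls

        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) {
            sink += ProbUtils.lngamma((i % 1000) + 1.5);
        }
        long t1 = System.nanoTime();
        for (int i = 0; i < n; i++) {
            sink += Gamma.logGamma((i % 1000) + 1.5);
        }
        long t2 = System.nanoTime();

        System.out.printf("ProbUtils.lngamma: %d ms%n", (t1 - t0) / 1_000_000);
        System.out.printf("Gamma.logGamma:    %d ms%n", (t2 - t1) / 1_000_000);
        System.out.println("(ignore: " + sink + ")");
    }
}
```

A crude wall-clock loop like this is only indicative; a harness such as JMH would give more trustworthy numbers.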
Hold on, there was also a change in whether FGES was parallelized or not. @yasu-sh what size machine are you using? Parallelization for FGES was turned off by default for 7.1.3-1 (I'll add that to the release notes). To turn it on, I think you need to add the flag "--parallelized". Could you try that?
This is a new field in FGES that was added in April 2022. We added it so that we could run FGES in the algcomparison API and compare it to other algorithms, where the multithreading was done across algorithm runs instead of inside FGES. |
Added this note to the release notes for 7.1.3-1 (sorry for leaving it out--that was a major release; there are probably a few other defaults I left out as well):
|
@jdramsey Thanks a lot for sharing your analysis and considerations! I have already checked different configurations in v1.5.0, and I set similar option settings, including --parallelized, as far as I could.
OS: Microsoft Windows 10 Pro x64 21H2 / 10.0.19044 N/A Build 19044. I would like to add more later on. See you then. |
@yasu-sh Have you discovered any new information about this? Your machine should certainly be large enough to do parallel processing. When you run Java, it's possible the heap size is too small--that would slow things down a lot. You can increase the heap size by adding a flag to the Java command, like -Xmx10g for 10 gigabytes. |
@jdramsey Unfortunately, no information so far. I noticed this phenomenon with the -Xmx4G flag for Java. I am afraid I need to help my colleagues on other projects for 2 weeks, so I have little time to look into this topic. On Win10 the CPU is not fully occupied by the process... Windows-specific???
Data file (same as posted above): https://github.com/bd2kccd/causal-cmd/files/10575729/dt.hailfinder.csv |
This time (just adding --parallelized and expanding the heap to 10 GB), an out-of-memory error came up: PS E:\Tetrad> java -Xmx10G -jar "causal-cmd-1.5.0-jar-with-dependencies.jar" --data-type discrete --delimiter comma --score bdeu-score --json-graph --algorithm fges --dataset dt.hailfinder.csv --parallelized |
@jdramsey Even at 10 GB, the max memory size might not be enough. Let me try 25 GB. |
Sometimes explosive memory consumption happens in a sawtooth pattern.
I have checked this myself on a Mac machine. |
@jdramsey The other day I purchased a Mac mini M2 (the new model) with the minimum spec for private use.
|
@yasu-sh I will give it a shot. Sorry, we've been focused lately on getting a new release of Tetrad and causal-cmd out the door so I haven't had a lot of time to think (other than that). I don't think the new release will solve this problem. However, I've got 64 GB of RAM on my laptop so maybe I can get further than you, I don't know. Anyway, we're supposed to get the new release out by today or maybe tomorrow so maybe after that, I'll have a little bit of time to address this. |
@yasu-sh Could you zip and attach this file for me--it would speed things up: dt.hailfinder.csv. Thanks, or give me a link to download it. You may have given this to me before, but I can't find it immediately. |
Never mind, I found it in the thread. I've got a few minutes right now; let me try loading it up in the interface and searching... |
Actually, I would remove columns 27 and 41 from the data; they have too many categories. That might be the problem. Are they needed? Without them, fGES finishes quite quickly. |
Here's the dataset without these variables (tab-delimited)--see if causal-cmd can deal with this instead. My theory is that the problem is that these 11-category variables were causing the algorithm to make cross-classification tables that were just too big. |
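To put rough numbers on that (an illustrative calculation, not from the thread): the score has to build a cell-count table of r × q entries for a child with r categories and q joint parent configurations. A 4-category child with two 11-category parents already needs 4 × 11 × 11 = 484 cells, and each additional 11-category parent multiplies that by 11, where a binary parent would only double it. With the high node degrees in the hailfinder ground truth, those tables blow up quickly.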
What happened was that we had spent so much time fixing issues in Tetrad and finally had a version (7.2.2) that seemed to have fixed all outstanding reported bugs...so then we stopped and had a meeting and decided the next thing to do was to make it easier to deal with Tetrad from Python... so Bryan Andrews and I started looking around at recent methods for doing this and found JPype, which is just beautiful. |
@jdramsey I appreciate your proposals, and I will check the instructions for py-tetrad. |
I agree--there is really no separation between Python and Java using this method. You just have to get used to writing Java "in Python" using Python syntax; JPype handles all the details for you. It "marries" the Python and Java virtual machines to make this happen. There are a few details: you have to start the JVM from Python as before, and you can only do it once per session. Also, as I said, you need Python >= 3.7 (since all underlying references to Python 2 need to be eliminated from the underlying code). A few other things. |
My experience in analytics is mainly in R, but some years ago I used Python more often. |
Thanks! |
I'd like to revisit R as well at some point, and see if there aren't more up-to-date ways of connecting R to Java. The main impediment is I would have to find time to do that myself; there's no one else in our group that does R. But I will try to get to it. |
One thing that may be nice from your point of view for R is that we wrote a method to save out graphs in the "endpoint matrix" format of PCALG--it's returned as a numpy array, though I may change that to a pandas data frame so the variable names can be returned as well; you can always get a numpy array from the data frame if you don't need the variable names. But that might be the beginning of a way to interface Java with R in a more up-to-date way. |
Anyway, first things first; we'd like to make it very easy to use Tetrad from Python, much easier than currently, and py-tetrad seems to promise that. |
@jdramsey I have confirmed that py-tetrad works on my Windows machine.
The location of 'file' is here. |
Huh, interesting! OK, I learned something. Feel free to let me know if you find any valuable tweaks. But the Java version and the Python version are good, based on what I was expecting. I think I know what you mean by "python environment only"--it would be nice if I could figure out how to publish the project on PyPI. I'll have to figure out how to fix the paths, but I can look at other projects. And I promise I will start thinking about R again. |
This was a cross-comment on "python environment"... sorry. (This was added later.) -- Possible ways from R:
Questions:
|
py-tetrad ran with PyCharm CE on Win10 in a Japanese locale! @jdramsey Thank you for your help! |
That's great! :-D I'm glad it worked! It's good to know there are some options for R; I'll try them out when I can. Our old project, https://github.com/bd2kccd/r-causal, used the command-line option, but I still need to look closely at it. (Unfortunately, the guy who was maintaining that code left us, and we haven't had a chance to work through what he did, but I will.) The current problem is that it uses an old version of Tetrad, and a lot of the code needs to be updated; in that sense, accessing Tetrad through Python would be a great option if it could work, though that does involve coordinating three languages. Also, I am still curious whether your problems are Windows-specific. Maybe I could get it to work on my Mac. |
Looking briefly at the Reticulate docs, it seems I could solve the file path problem by figuring out how to publish py-tetrad to PyPI. I've never done that before, but maybe I can figure it out. Then you could just import pytetrad and continue using the functionality in your project from there. I'm still learning how to do these sorts of things in Python. |
@jdramsey Thank you for your investigation and document check. For example, the latest version of reticulate (v1.28, compiled on my PC) crashes with an error when calling Python. ex. rstudio/reticulate#1258
command prompt at R console
This is return value of file
R environment information:
Python environment information
|
@jdramsey This is my understanding of the several ways to use Tetrad from the console / from other packages in Python/R.
In my understanding, rJava is the current gold-standard interface between Java and R. |
@yasu-sh This is wonderful--thank you so much for doing this! Yes, this will most certainly help me. I will spend some time in the next few days studying these options. I am interested in your observation, first of all, that causal-cmd needs file access; I hadn't thought of that as a limitation for some people, but it might be; you're right. And the remark on r-causal is beneficial as well. Could reticulate's incompatibility be overcome? Anyway, this is food for thought. Today I also started another project, which I plan to proceed slowly with. We have a command-line tool for making algorithm comparisons that is quite nice, algcomparison. The procedure is based in Tetrad and has been heavily used, so it is well-tested. Still, Kevin Bui added a facility to allow external search results to be added to the comparison tables from other platforms like Python, R, or Matlab. I decided to code up a Python wrapper for it to do Tetrad and causal-learn algorithms, but I'd like to extend that to R algorithms in, say, PCALG or bnlearn. So at some point, I need to reverse what you're suggesting, i.e., run R algorithms from Python, not just Python algorithms from R. I'll have to think of how to do that as well. But this could be a real contribution to the literature, showing how algorithms from different projects compare to one another on standard statistics so that you can pick the best algorithms (across platforms) for a task you have in mind. I do have to solve that technical challenge, though. |
@yasu-sh I wonder if the slowness of reflection in for loops might be because the java jar needs to be loaded each time? This was one of the clever things about JPype, I thought--their insistence that the Java jar is loaded only once per session. It doesn't seem to me that reflection itself should take very much time. |
@jdramsey Thanks for your quick reply.
File access on causal-cmd
Regarding file access: it is not a limitation as such. I just thought it could become one if you want to hand over big data (> 2 GB). causal-cmd from the console is basic and reliable; I like to use it when there are no performance issues. (So far I use small data.)
Reflection on R Java
No, I don't mean the jar is called multiple times. The Java Reflection API needs to find the correct method among many candidates.
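To illustrate the cost of repeated reflective lookup (a self-contained sketch in plain Java, not Tetrad or rJava code--the timing loop below is hypothetical):

```java
import java.lang.reflect.Method;

public class ReflectionLookupDemo {
    public static void main(String[] args) throws Exception {
        final int n = 1_000_000;
        String s = "hello";

        // Repeated lookup: getMethod must search the candidate methods
        // on every iteration, which is what hurts inside tight loops.
        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) {
            Method m = String.class.getMethod("length");
            m.invoke(s);
        }
        long t1 = System.nanoTime();

        // Cached lookup: resolve the Method once, then just invoke it.
        Method cached = String.class.getMethod("length");
        long t2 = System.nanoTime();
        for (int i = 0; i < n; i++) {
            cached.invoke(s);
        }
        long t3 = System.nanoTime();

        System.out.printf("repeated lookup: %d ms%n", (t1 - t0) / 1_000_000);
        System.out.printf("cached lookup:   %d ms%n", (t3 - t2) / 1_000_000);
    }
}
```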
R from Python
I have heard that a colleague uses rpy2. It seems to work without fatal issues. |
@yasu-sh On reflection--I see. That is a real limitation. We tested the data translation methods in JPype with some rather large datasets, and they didn't slow down like that. I need to think about it. I did notice that the Tetrad data loading routine implemented by Kevin was much faster than the one in Python for continuous data; I should test that with discrete data as well. Let me add a discrete simulation to py_tetrad, do a save and load in Python, and transfer it back to Java, and see where there is a slowdown. |
@yasu-sh It's fast enough in py-tetrad; I made a 500-variable dataset with N = 500 using this method: https://github.com/cmu-phil/py-tetrad/blob/main/examples/simulating_data_discrete.py Then I converted it from Tetrad to pandas in Python and back again and added print statements to see how long each step took. Loading the JVM took a few seconds, and the simulation itself also took a few seconds, but the conversion to pandas and the conversion from pandas to Tetrad each took about one second, which I thought was OK. So the question is whether there's a method to transfer a dataset to R that's about the same speed. |
@yasu-sh I spent some time today turning py-tetrad into a Python package. This may solve the file path problem. It's not done by any means, but it's going in the right direction. All of the hard-coded paths are gone. It's not much of a package yet, just two files plus several examples, but it will grow. Also, it must be installed by checking it out from GitHub and then using pip to install the package, so the instructions have changed. But perhaps now in R you can import the package and run the examples? I'll have to try it. |
@jdramsey The package runs successfully on Python.
I'll check R next. |
@jdramsey By the way, shall we move to a py-tetrad issue from here? I think this issue can be closed; the matter is for py-tetrad now. |
Sounds good. |
I am wondering why the calculation time differs so much between 2019 and 2023.
Could you tell me why this big difference happens? Are the algorithms different?
I have used the profiling tool in IntelliJ IDEA with causal-cmd v1.5.0, but the process looks normal, and the score test spends a great deal of time in log-gamma calculation.
r-causal based on causal-cmd ver. 1.2.0-SNAPSHOT
causal-cmd based on @kvb2univpitt's release
no prior knowledge (no information in prior.txt)
dataset: hailfinder (https://github.com/bd2kccd/causal-cmd/files/10555256/dt.tetrad.csv)
The conditions are the same as below, including the dataset:
#80 (comment)
The process was aborted, since profiling makes the run much longer.