enable tests in ga.spec #5

Closed
marcindulak opened this issue Feb 15, 2020 · 33 comments

Comments

@marcindulak
Contributor

As mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1709933#c15, we should execute the tests during the build of ga, but the test section currently appears to be specific to mpich:

fedpkg/ga/ga.spec

Lines 214 to 220 in 7cf8461

%if %{?do_test}0
%{_mpich_load}
cd %{name}-%{version}-mpich
make check
cd ..
%{_mpich_unload}
%endif
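
A minimal dry-run sketch of how the same steps could be repeated for every MPI flavor rather than for mpich only. It only prints the commands. The `name`, `version`, and `arch` values and the `print_check_steps` helper are illustrative assumptions, and the `module load` lines stand in for whatever `%{_mpich_load}` and its openmpi counterpart expand to in the real spec.

```shell
#!/bin/sh
# Dry-run sketch: prints the per-flavor test steps instead of running them.
# name, version and arch are hypothetical stand-ins for the spec macros.
name=ga
version=5.7.2
arch=x86_64

# print_check_steps FLAVOR...: emit the load / check / unload sequence
# for each MPI flavor given on the command line.
print_check_steps() {
    for mpi in "$@"; do
        echo "module load mpi/${mpi}-${arch}"
        echo "cd ${name}-${version}-${mpi}"
        echo "make check"
        echo "cd .."
        echo "module unload mpi/${mpi}-${arch}"
    done
}

print_check_steps mpich openmpi
```

Keeping the flavor list in one place means a failing flavor can be dropped from the tests on a given release without duplicating the block.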

@marcindulak
Contributor Author

marcindulak commented Feb 15, 2020

The mpi modules set the MPI_SUFFIX variable, which includes an underscore:

grep MPI_SUFFIX /etc/modulefiles/mpi/mpich-x86_64 /usr/share/modulefiles/mpi/openmpi-x86_64 
/etc/modulefiles/mpi/mpich-x86_64:setenv        MPI_SUFFIX    _mpich
/usr/share/modulefiles/mpi/openmpi-x86_64:setenv			MPI_SUFFIX	_openmpi

We should use that in the spec instead of manually using a dash (-) in

fedpkg/ga/ga.spec

Lines 147 to 151 in 7cf8461

pushd %{name}-%{ga_version}
popd
for i in mpich openmpi; do
cp -a %{name}-%{ga_version} %{name}-%{version}-$i
done
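
A small sketch of the suggestion, assuming the directory name is built from the suffix the MPI module exports. The `build_dir` helper and the `name` and `version` values are hypothetical. In the spec they would come from `%{name}` and `%{version}`, and `MPI_SUFFIX` would be set by loading the module rather than by the loop used here.

```shell
#!/bin/sh
# Hypothetical values; in the spec these would be %{name} and %{version}.
name=ga
version=5.7.2

# build_dir SUFFIX: print the per-MPI build directory name using the
# suffix exported by the MPI module (e.g. "_mpich"), not a hardcoded "-mpich".
build_dir() {
    printf '%s-%s%s\n' "$name" "$version" "$1"
}

# The loop only imitates what loading each module would set.
for MPI_SUFFIX in _mpich _openmpi; do
    build_dir "$MPI_SUFFIX"
done
```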

@edoapra
Owner

edoapra commented Feb 16, 2020

There is a "chicken and egg" issue here (it is likely to affect GlobalArrays/ga#154, too).
If you try to run the GA tests with the current ga.spec, which uses the --enable-peigs option,
you end up with test binaries that need the Peigs library to run (you will get an undefined pdspev_ symbol).
The obvious solution is to remove the --enable-peigs option, but this will make the GA RPMs useless for NWChem unless NWChem has been built with ScaLAPACK instead of Peigs. Luckily,
this should be the case, but I need to test this.

@edoapra
Owner

edoapra commented Feb 18, 2020

@marcindulak I have updated my ga and nwchem trees to enable the ga tests.
Successful koji rpm logs for epel7, f30, rawhide
https://koji.fedoraproject.org/koji/tasks?owner=edoapra&state=all

@edoapra
Owner

edoapra commented Feb 28, 2020

@marcindulak ga 5.7.2 is out
https://github.com/GlobalArrays/ga/releases/tag/v5.7.2
I have updated ga.spec and elempatch.patch accordingly.
2c11609

@marcindulak
Contributor Author

@edoapra
Owner

edoapra commented Mar 2, 2020

I will have a look at it. In the meantime, I have just partially removed the last commit
6ba1a3e

@edoapra
Owner

edoapra commented Mar 2, 2020

After adding verbose output for the tests (6c92d39),
the log shows that gethostbyname is failing.
This could be due to the network on the koji build machines not being correctly configured. Can we check this?
By the way, the build works on my RHEL6 box.
https://koji.fedoraproject.org/koji/getfile?taskID=42116109&volume=DEFAULT&name=build.log&offset=-4000

===========================================================================
   Communication Runtime for Extreme Scale (comex) 1.1: ./test-suite.log   
===========================================================================
6 of 6 tests failed.  
.. contents:: :depth: 2
FAIL: testing/perf (exit: 1)
============================
Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(467)..............: 
MPID_Init(177).....................: channel initialization failed
MPIDI_CH3_Init(70).................: 
MPID_nem_init(319).................: 
MPID_nem_tcp_init(171).............: 
MPID_nem_tcp_get_business_card(418): 
MPID_nem_tcp_init(377).............: gethostbyname failed, buildvm-10.phx2.fedoraproject.org  (errno 2)

@edoapra
Owner

edoapra commented Mar 2, 2020

errno=2 should correspond to

TRY_AGAIN
              A temporary error occurred on an authoritative name server.  Try again later.
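
For reference, a small lookup of the resolver error codes defined in `<netdb.h>`, assuming the number printed in the log is `h_errno` even though the message labels it `errno` (as a plain `errno`, 2 would instead be ENOENT). The `h_errno_name` helper is hypothetical.

```shell
#!/bin/sh
# h_errno_name CODE: map an h_errno value from <netdb.h> to its symbolic name.
h_errno_name() {
    case "$1" in
        1) echo HOST_NOT_FOUND ;;
        2) echo TRY_AGAIN ;;
        3) echo NO_RECOVERY ;;
        4) echo NO_DATA ;;
        *) echo UNKNOWN ;;
    esac
}

h_errno_name 2
```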

@edoapra
Owner

edoapra commented Mar 2, 2020

This is a recent MPICH bug report that could be related
https://src.fedoraproject.org/rpms/mpich/pull-request/2

@edoapra
Owner

edoapra commented Mar 2, 2020

The openmpi tests do work, though.

@marcindulak
Contributor Author

marcindulak commented Mar 4, 2020

I believe I've already seen "gethostbyname failed" on el6. Maybe we should exclude mpich from the el6 tests? Can you look into the el7/el8 failures?

After 6ba1a3e and 6c92d39

| release | mpich | openmpi | koji task |
|---|---|---|---|
| el6 | gethostbyname failed | ? (slow/hangs) | https://koji.fedoraproject.org/koji/taskinfo?taskID=42159052 |
| el7 | ? (slow/hangs) | Comparison failed | https://koji.fedoraproject.org/koji/taskinfo?taskID=42117101 |
| el8 | OK | Segmentation fault | https://koji.fedoraproject.org/koji/taskinfo?taskID=42159069 |

Another thing to check/fix - how do we limit the number of cores used by the tests? I would set it to 2.

@edoapra
Owner

edoapra commented Mar 4, 2020

7fb103a sets NPROCS=2 for tests
Are you sure that all your builds were using commit 6c92d39? I do not see the verbose log in your failed tests for epel7

@marcindulak
Contributor Author

marcindulak commented Mar 4, 2020

epel7 https://koji.fedoraproject.org/koji/taskinfo?taskID=42159068 is the hanging build and it uses make VERBOSE=1 check (that line is in the log). The el6 hanging build is here https://koji.fedoraproject.org/koji/taskinfo?taskID=42160273

I've updated the table above to point to the right epel7 build, which fails with "Comparison failed" with openmpi.

@edoapra
Owner

edoapra commented Mar 4, 2020

Thanks for the prompt update
The openmpi version for el7 looks odd (or old) to me. From root.log

openmpi-devel             x86_64   1.10.7-5.el7     

The seg fault for el8 could be an issue that was reported for openmpi 4.1.0
open-mpi/ompi#6789
I am trying the workaround

@edoapra
Owner

edoapra commented Mar 4, 2020

I will have to get rid of the NPROCS=2 commit since some tests do require 4 processes

@marcindulak
Contributor Author

> I will have to get rid of the NPROCS=2 commit since some tests do require 4 processes

Can we disable those tests? We don't know what cpu resources we have on the koji build servers.
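
One way to stay within whatever the build host offers is to cap the requested process count at the number of online cores. This is only a sketch: the `cap_procs` helper is hypothetical, and how the resulting value reaches the GA test harness is not shown in this thread.

```shell
#!/bin/sh
# cap_procs WANTED AVAILABLE: print the smaller of the two integers.
cap_procs() {
    if [ "$1" -le "$2" ]; then
        echo "$1"
    else
        echo "$2"
    fi
}

# Number of online cores, falling back to 1 if getconf cannot report it.
available=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)

echo "NPROCS=$(cap_procs 2 "$available")"
```

Tests that need more processes than the cap would still have to be skipped separately.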

Apparently el7 provides a newer openmpi, but called openmpi3 https://bugzilla.redhat.com/show_bug.cgi?id=1709933#c3 and scalapack.spec uses special logic to handle this exception in naming https://bugzilla.redhat.com/show_bug.cgi?id=1709933#c18
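
The naming exception can be sketched as a lookup keyed on the release. The `openmpi_pkg` helper is hypothetical and is not the logic scalapack.spec actually uses. It only illustrates that el7 is the one release where the newer openmpi carries a different name.

```shell
#!/bin/sh
# openmpi_pkg RHEL_MAJOR: print the openmpi package base name for a release.
# Assumption from this thread: only el7 ships the newer build as "openmpi3".
openmpi_pkg() {
    case "$1" in
        7) echo openmpi3 ;;
        *) echo openmpi ;;
    esac
}

for rel in 6 7 8; do
    echo "el${rel}: $(openmpi_pkg "$rel")"
done
```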

@edoapra
Owner

edoapra commented Mar 4, 2020

Pushed ed797e7 to limit the number of tests and restore NPROCS=2

@edoapra
Owner

edoapra commented Mar 4, 2020

Green light now on epel8 and epel7

@edoapra
Owner

edoapra commented Mar 4, 2020

@marcindulak please try the latest commit c495ba3

@marcindulak
Contributor Author

el6/el7/el8/f33 build now. Should I package this ga for all of them, or are we still waiting for some changes in ga needed by nwchem?

@edoapra
Owner

edoapra commented Mar 4, 2020

I would go ahead and package this ga, since it contains all the changes needed by nwchem

@edoapra
Owner

edoapra commented Mar 5, 2020

small ga.spec changes with f978657

@marcindulak
Contributor Author

How critical is this change? Could you merge the fedora spec https://src.fedoraproject.org/rpms/ga/tree/master into your repo? We diverged in terms of the changelog, since I'm not merging your github changes into the fedora git repository (I had to release for example 5.6.5-8). We could agree on how we synchronize work between this github and fedora repository.

@edoapra
Owner

edoapra commented Mar 5, 2020

These changes are not critical at all. They bear zero impact on functionality

@edoapra
Owner

edoapra commented Mar 5, 2020

> Could you merge the fedora spec https://src.fedoraproject.org/rpms/ga/tree/master into your repo? We diverged in terms of the changelog, since I'm not merging your github changes into the fedora git repository (I had to release for example 5.6.5-8). We could agree on how we synchronize work between this github and fedora repository.

8354f3a

@marcindulak
Contributor Author

Thanks.

dereferencing_fix.patch seems to be missing. If the patches are not critical, can we hold them back, and rely on them being part of the source of the next ga release instead?

@edoapra
Owner

edoapra commented Mar 5, 2020

Thanks for spotting this 6dd7868

Yes, we can hold them back and wait for next ga release

@marcindulak
Contributor Author

Would you be willing then to go back to 5.7.2-2 in your repo? Fedora will keep making auto commits to the spec master branch and we'll diverge in the changelog.

@edoapra
Owner

edoapra commented Mar 5, 2020

Back to 5.7.2-2
c3019a0

@marcindulak
Contributor Author

I meant rather to sync to what's currently in https://src.fedoraproject.org/rpms/ga/blob/master/f/ga.spec, without additions to the changelog or

fedpkg/ga/ga.spec

Lines 16 to 17 in c3019a0

Patch1: ga572_version.patch
Patch2: dereferencing_fix.patch

Making another build for all the epel and fedora releases will require 1-2 hours of attention, and I prefer not to spend time on those now.

I think the way forward could be to have a branch in your repo called e.g. develop, and to keep your master branch always synced with the fedora master. We could prepare new builds in the develop branch and merge to your master only when things work; then I take your master and write it into the fedora master.

edoapra added a commit that referenced this issue Mar 6, 2020
edoapra added a commit that referenced this issue Mar 6, 2020
@marcindulak
Contributor Author

I think this issue can be closed as fixed
