Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Troubles with GS register #607

Open
gweodoo opened this issue Jul 21, 2017 · 6 comments
Open

Troubles with GS register #607

gweodoo opened this issue Jul 21, 2017 · 6 comments

Comments

@gweodoo
Copy link
Contributor

gweodoo commented Jul 21, 2017

Hi guys,

I'm still working on DMTCP integration as a C/R system for a custom MPI implementation and I'm focused to the DMTCP TLS mechanism.

TL;DR

The GS register does not seem to be handled in a per-thread storage, leading to inconsistent values at restart time.

Explanation

I saw that TLS save/restore are handled through a ThreadTLSInfo, located in a per-thread structure, and TLSInfo_RestoreTLSState()/TLSInfo_SaveTLSState() are responsible of updating this tlsInfo. %fs register seems to be stored in a field named gdtentrytls, which is a 1-entry array:

typedef struct _ThreadTLSInfo {
  segreg_t fs, gs;  // thread local storage pointers
  struct user_desc gdtentrytls[1];
} ThreadTLSInfo;

This field is updated by tls_get_thread_area()/tls_set_thread_area(), the "internals" that TLSInfo_* functions are wrapping. Why %gs is not handled accordingly ? It seems that this register is stored in a per-process storage: myinfo_gs. However, %gs being a per-CPU register too, it should be stored the same way as %fs. The fact is that we use GS (supposed unused in 64-bit mode) to reproduce a TLS mechanism for our own user-level thread model (like the NPTL does with FS).

Is there any reason for FS and GS to be handled differently ? According to me, these registers could just be saved/restored through arch_prctl calls, manipulating 64-bit addresses. Then, DMTCP just needs to update PID/TID in the TCB accordingl (at least for x86_64).

Furthermore, I saw that the struct user_desc was used to store the FS register in the base_addr member (probably to transparently switch between arch_prctl() and {g,s}et_thread_area() depending on the architecture). But, on 64-bit systems, this member is an int, to stay compliant with the long definition in 32-bit mode. This leads arch_prctl to erase the next struct member, named limit. As these two fields are depicted as "not used" for 64-bit mode, it is not a problem but just to say (actually a bit confusing when debugging).

This post is based on my knowledge of TLS mechanism for x86_64 architecture and complexity introduced by DMTCP should be related to the need to support more than one (like ARM). So, maybe it is non-trivial to do it.

For now, my work-around (reachable at gweodoo/dmtcp@07b4957) consists in replacing the memeber with a 2-entry array, to store both the register states and an update of two "internals" accordingly (still on the 2.5 branch). But i think a proper solution would be better.

Let me know if some parts of my post are unclear, I will be pleased to provide some examples.
Thanks for reading and your help.

@rohgarg
Copy link
Contributor

rohgarg commented Jul 21, 2017

Hi @gweodoo! Thanks again for continuing to work on this.

It's been a while since I looked at this part of the code, so it'll take
me some time to review this. Your analysis and patch seem reasonable. You
are right about the fact that we try to support many different platforms:
x86, x86-64, ARM, and AArch64; I don't know what the proper fix would
look like yet.

However, I do have one question: why did we not see this issue earlier? In
other words, what's the sequence of steps/scenario/context that lead/leads
to this bug?

@rohgarg rohgarg added this to the 2.5.1 milestone Jul 21, 2017
@gweodoo
Copy link
Contributor Author

gweodoo commented Jul 21, 2017

To be short, we discovered it because we need to use GS as a per-CPU register.

We implement a user-level thread scheduler relying over POSIX threads. In such case, we can potentially have multiple user-level threads scheduled on the same Unix thread. The FS register is used by the NPTL to store POSIX thread storage while the GS is used to provide the "per user-level" storage (a "nested one").

One typical example is a simple MPI program setting its rank as a global variable. As this data changes at MPI task level, the storage model for this variable has to change too. Thus, we reproduce, somehow, what the libc does with the TLS mechanism at our scale.
For further explanation, see here.

For 64bit mode, GS register is not used, whereas in 32bit mode, DMTCP uses a macro TLSSEGREG to switch the base_addr field to the corresponding GS register, so it should work.

For a proper fix, would not be better to replace these explicit field with a "JIT-determined" type, depending on the architecture ? Something like:

typedef enum { FS, GS, NB } reg_type_t;
#ifdef __i386__
#define TLS_TYPE_T struct user_desc
#define tls_get_thread_area(st, which) get_thread_area(st)
#define tls_set_thread_area(st, which) set_thread_area(st)
#elif __x86_64__
#define TLS_TYPE_T unsigned long
#define tls_get_thread_area(st, which) arch_prctl(ARCH_GET_#which, st)
#define tls_set_thread_area(st, which) arch_prctl(ARCH_SET_#which, st)
#elif __arm__
//...
#endif
TLS_TYPE_T reg_arrays[NB];

and then, only referring a register by calling tls_get_thread_area(reg_arrays + GS, GS)

@rohgarg
Copy link
Contributor

rohgarg commented Jul 21, 2017

Hmmm... Before I try to come up with one, do you have a simple program that I can try locally (to reproduce this issue and poke around)?

(I don't think I need access to your entire stack; just a program that simulates this bug by using the GS register?)

@rohgarg rohgarg added the 3.0.0 label Jul 21, 2017
@gweodoo
Copy link
Contributor Author

gweodoo commented Jul 24, 2017

Sure, the following code should reproduce the pattern: create multiple threads bound to the same core, then each thread sets its own buffer for the GS register:

#define _GNU_SOURCE
#include <pthread.h>
#include <asm/prctl.h>
#include <sys/prctl.h>
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

#ifndef NB_TH
#define NB_TH 4
#endif

void * func (void * arg)
{
	int id = *(int*)arg;
	void* buf = &id; /* this whill be our GS buffer */
	cpu_set_t set;

	/* check the effective affinity */
	CPU_ZERO(&set);
	pthread_getaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
	assert(CPU_ISSET(0, &set));
	assert(CPU_COUNT(&set) == 1);

	/* set the GS register for the current thread */
	arch_prctl(ARCH_SET_GS, (unsigned long)buf);
	
	fprintf(stderr, "PRE: Th#%d - buf = %d (%p)\n", id, *(int*)buf, buf );

	/* Do the checkpoint here */
	sleep(4);
	
	/* check again the affinity (need fix PR#606) */
	/* But the bug should be reproductible without this check */
	CPU_ZERO(&set);
	pthread_getaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
	assert(CPU_ISSET(0, &set));
	assert(CPU_COUNT(&set) == 1);

	/* Get the GS value back */
	arch_prctl(ARCH_GET_GS, (unsigned long*)&buf);
	fprintf(stderr, "POST: Th#%d - buf = %d (%p)\n", id, *(int*)buf, buf );
	
	assert(buf == &id);
	return NULL;
}

int main(int argc, char *argv[])
{
	pthread_t th[NB_TH];
	int i;
	cpu_set_t set;

	/* attach all threads to 0, creating race over GS register */
	for(i = 0; i < NB_TH; ++i)
	{
		pthread_attr_t attr;

		CPU_ZERO(&set);
		CPU_SET(0, &set);
		pthread_attr_init(&attr);
		pthread_attr_setaffinity_np(&attr, sizeof(cpu_set_t), &set);
		pthread_create(th+i, &attr, func, &i);
	}

	for (i = 0; i < NB_TH; ++i) 
	{
		pthread_join(th[i], NULL);
	}
	return 0;
}

@rohgarg
Copy link
Contributor

rohgarg commented Jul 27, 2017

@gweodoo: Thanks for sharing the code. Please go ahead and generate a pull request with your patch. I will approve and merge it.

Also, we should add your test case to make check, although I think the affinity bits are not required. If you want, I can try to clean up the test and generate a separate patch for make check.

@gweodoo
Copy link
Contributor Author

gweodoo commented Jul 28, 2017

Do you want a PR with the workaround or the more generic solution ? Because the first one is implemented, the second one was just prospective. Maybe, as a first step, the hot fix could be used to fix GS specifically.

Yes absolutely, feel free to use this test for your validation system. The affinity is not mandatory but it was allowing me to check that the checkpoint was done in a per-thread basis (=each thread effectively saves its own value) and not a per-CPU (someone iterating over CPUs to back registers up). But indeed, to demonstrate the issue described here, the thread pinning is not necessary.

Note: as arch_prctl is a Linux-x86_64 extension. Maybe the test should not be automatically run with make check, exposin the user to potential "false" failures.

@jiajuncao jiajuncao modified the milestones: 2.5.1, 2.5.2 Sep 5, 2017
@jiajuncao jiajuncao added 2.5.2 and removed 2.5.1 labels Sep 5, 2017
@jiajuncao jiajuncao modified the milestones: 2.5.2, 2.6.0 Nov 15, 2017
@jiajuncao jiajuncao added 2.6.0 and removed 2.5.2 labels Nov 15, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants