Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix unneeded stops and restarts in AD monitoring #1340

Merged
merged 10 commits into from Jul 11, 2018
Merged

Conversation

anodos325
Copy link
Contributor

Ticket #33453

There are a several distinct situations where our AD monitoring will either turn off the Active Directory service or restart the Active Directory service when it is not required.
The improvements are as follows:

  1. Use SRV records to enumerate the LDAP servers in the environment and then check to see if the LDAP port is listening on any of them. If we pass this check, then we do not turn off AD service.
    Simplify checks to see if the AD service is properly running, by reducing it to a single test "wbinfo -t". This reduces the liklihood of a false-positive that we are no longer joined to the domain.
  2. Add a low-impact recovery attempt if we fail the getStarted() check. Winbind in its current form establishes a connection to only a single DC. It has been observed in testing that if this DC goes down, then we will fail the getStarted() check. Restarting winbind was sufficient in these situations to put the domain join back into a healthy state. Try this first before performing a more disruptive recovery attempt.
  3. The default AD DNS timeout and lifetime values make it so that if Nameserver 1 in the FreeNAS configuration is down, then get_SRV_records() in freenasldap.py will always fail (we never try to get SRV records from Nameserver 2). If we reduce the timeout to 1/3 of the overall lifetime, then we will query other nameservers and AD will stay up.

@ghost ghost assigned anodos325 Jun 9, 2018
@ghost ghost added the review label Jun 9, 2018
@anodos325 anodos325 requested review from b-a-t and jhixson74 June 9, 2018 16:00
@b-a-t
Copy link
Contributor

b-a-t commented Jun 12, 2018

rebuild

s.connect((host, port))
connected = True
for i in range(0, max_tries):
# Make three attempts to get SRV records from DNS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Make max_tries attempts...." - current value of 3 is definable

r.lifetime = _fs().directoryservice.activedirectory.dns.lifetime
r.timeout = r.lifetime / 3

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand the reasons you made it this way and that you don't want to mess with sysctl() updates, but I'd keep ability to modify timeout through the sysctl(). How about:

r.timeout = _fs().directoryservice.activedirectory.dns.timeout / 2
Or
r.lifetime = _fs().directoryservice.activedirectory.dns.lifetime * 2

if host_list:
break
else:
self.logger.debug("[ServiceMonitorThread] Attempt %d to query SRV records failed " % (i))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, try to avoid legacy "%x" % (v) syntax. Use either f'{v}' or '{}'.format(v) instead.


if connected:
break
for h in host_list:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a bit hard to see it here, but looks like host_list you are referencing here is declared, if any, in another scope. I actually commented about this to you before.
It's declared on an inner scope regarding this one. Also, if if self.name != 'activedirectory': - it's not declared at all. Move host_list = [] outside of the if.

elif service == 'ldap':
fnldap = FreeNAS_LDAP(flags=FLAGS_DBINIT)
else:
fnldap = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor thing, but you can just declare above fnldap = None and avoid having else: clause.

@@ -82,31 +85,36 @@ def isEnabled(self, service):
return enabled

@private
def tryConnect(self, host, port):
def tryConnect(self, host, port, fnldap):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to declare it as: def tryConnect(self, host, port, fnldap = None):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, as we only call get_ldap_servers we don't need all this code around fnldap, see below.

@b-a-t
Copy link
Contributor

b-a-t commented Jun 13, 2018

@miwi-fbsd says that Jenkins fails cause your branch is outdated and requires rebase. Can you try it with next iteration?

for i in range(0, max_tries):
# Make max_tries attempts to get SRV records from DNS
host_list = fnldap.get_ldap_servers(host)
if host_list:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, who is the host here? Originally that was one of the DCs, but get_ldap_servers() expects domain instead:

    def get_ldap_servers(domain, site=None):
        dcs = []
        if not domain:
            return dcs

        host = "_ldap._tcp.%s" % domain
        if site:
            host = "_ldap._tcp.%s._sites.%s" % (site, domain)

        dcs = FreeNAS_ActiveDirectory_Base.get_SRV_records(host)
        return dcs

Copy link
Contributor Author

@anodos325 anodos325 Jun 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Host is ultimately from freenas-v1.db here:
sqlite> SELECT sm_host FROM services_servicemonitor ;
sm_host = THEGIBSON.LAN

            thread = ServiceMonitorThread(
                id=s['id'], frequency=s['sm_frequency'], retry=s['sm_retry'],
                host=s['sm_host'], port=s['sm_port'], name=thread_name,
                logger=self.logger, middleware=self.middleware
            )

It is the domain.

for i in range(0, max_tries):
# XXX What about UDP?
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.settimeout(timeout)
Copy link
Contributor

@b-a-t b-a-t Jun 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def port_is_listening(host, port, errors=[]): looks very similar to the given code, except it doesn't set any timeout for the socket connection. It may be reasonable to add:

   def port_is_listening(host, port, errors=[], timeout=0):
       ....
       if timeout:
          s.settimeout(timeout)

in the freenasldap.py.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we talked about this Andrew, timur is correct

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. Made changes.

connected = True
for i in range(0, max_tries):
# Make max_tries attempts to get SRV records from DNS
host_list = fnldap.get_ldap_servers(host)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can simply call:

host_list = FreeNAS_ActiveDirectory_Base.get_ldap_servers(host)

as it's a static method of the class. So no need for creating fnldap object at all, neither for passing it around.

@b-a-t
Copy link
Contributor

b-a-t commented Jun 13, 2018

Found in the dns.resolver.__init__ reference to the attribute self.rotate, which shuffles list of the nameservers, allowing to pick the random one for the start. Not sure, does that make any benefit in this situation.

Copy link

@jhixson74 jhixson74 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like my bike sheds painted hot pink

@@ -639,6 +635,14 @@ def _started_notify(self, verb, what):
res = True
return res

async def _reload_activedirectory(self, **kwargs):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about naming this "_restart_activedirectory" since that is what it is actually doing ;-) If you are going to make a reload method, then have to reload samba_server ;-)

Copy link
Contributor Author

@anodos325 anodos325 Jun 16, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have a _restart_active directory method that calls "/etc/directoryservice/ActiveDirectory/ctl restart". I don't see any reason to change the existing restart method because we want to start from a clean slate to ensure proper AD join.

What I would ideally prefer here is regenerating smb4.conf (if needed) and killing / restarting winbind process as an "AD reload". I.e. _reload_cifs, then restart winbind. This will force new DC connection, but not kill existing SMB sessions.

It should be something less than a full "AD restart", but more than just reloading a config file.

@@ -17,7 +16,10 @@
if not apps.ready:
django.setup()

from freenasUI.common.freenassysctl import freenas_sysctl as _fs
from freenasUI.common.freenasldap import (
FreeNAS_ActiveDirectory_Base,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just import FreeNAS_ActiveDirectory here, there is no need to use the base class

for i in range(0, max_tries):
# XXX What about UDP?
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.settimeout(timeout)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we talked about this Andrew, timur is correct

connected = True
for i in range(0, max_tries):
# Make max_tries attempts to get SRV records from DNS
host_list = FreeNAS_ActiveDirectory_Base.get_ldap_servers(host)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/FreeNAS_ActiveDirectory_Base/FreeNAS_ActiveDirectory/

s.settimeout(None)
s.close()
for h in host_list:
port_is_listening = FreeNAS_ActiveDirectory_Base.port_is_listening(str(h.target), h.port, errors=[])

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/FreeNAS_ActiveDirectory_Base/FreeNAS_ActiveDirectory/

midclt call servicemonitor.stop resulted in self.finished.set() being called, which effectively removed the frequency control for service monitoring. As long as the service remained "connected", "started", and "enabled", then service monitoring would never stop.
Copy link

@jhixson74 jhixson74 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to go

@jhixson74 jhixson74 merged commit 172ac4d into master Jul 11, 2018
@ghost ghost removed the review label Jul 11, 2018
@jhixson74 jhixson74 deleted the FIX-33453-MASTER branch July 11, 2018 20:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants