# User Search
Use this notebook to:
1. Try to find an account based on partial knowledge (name, email address, IM handles)
2. List all orgs the account belongs to (from the configured subset)
  - You will need org owner permissions to perform these searches

# Boilerplate
Skip/hide this. Common usage is below.

If you see this text, you may want to enable the nbextension "Collapsible Headings", so you can hide this section in common usage.

## Tune as needed

Several functions use lru_cache, and many of them are called len(orgs_to_check) times per query. If a cache is undersized, entries are evicted before they can be reused and run times get quite long. (With adequate sizing, only the first query should be slow; after that, all data should be in the cache.)

See the "Cache Tuning & Clearing" section below.
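
To see why sizing matters, here is a minimal sketch that is not part of the notebook. ``make_lookup``, ``fake_lookup`` and the org names are hypothetical stand-ins for the cached GitHub calls. A cache smaller than the number of distinct arguments never gets a hit when the arguments are cycled in order, while an adequately sized cache serves the whole second pass.

```python
from functools import lru_cache

def make_lookup(maxsize):
    """Build a hypothetical cached lookup; stands in for a per-org API call."""
    @lru_cache(maxsize=maxsize)
    def fake_lookup(org_name):
        return org_name.lower()
    return fake_lookup

orgs = ["org-{}".format(i) for i in range(50)]  # hypothetical org names

small = make_lookup(maxsize=8)    # smaller than len(orgs)
large = make_lookup(maxsize=64)   # larger than len(orgs)

for _ in range(2):                # two passes, like two successive queries
    for name in orgs:
        small(name)
        large(name)

small_info = small.cache_info()
large_info = large.cache_info()
print("undersized:", small_info)
print("adequate:  ", large_info)
```

Comparing ``hits`` and ``misses`` in the two reports shows the effect: in the undersized case, every call in the second pass is still a miss.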

#### Orgs to check

This hard-coded list should eventually be replaced with a call that returns every org the credentials have owner access to.
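
One possible way to build that list is the GitHub REST endpoint ``GET /user/memberships/orgs``, which reports the authenticated user's role in each org. The sketch below (Python 3) is an assumption about how that could look, not the notebook's code. ``owner_org_names`` and ``fetch_memberships`` are hypothetical helpers, only the first page of up to 100 results is fetched, and the token is assumed to have permission to read org memberships.

```python
import json
import urllib.request

def owner_org_names(memberships):
    """Return the org logins where the membership is active with role 'admin' (owner)."""
    return {
        m["organization"]["login"]
        for m in memberships
        if m.get("role") == "admin" and m.get("state") == "active"
    }

def fetch_memberships(token):
    """Fetch the first page of org memberships for the token's user (hypothetical helper)."""
    request = urllib.request.Request(
        "https://api.github.com/user/memberships/orgs?state=active&per_page=100",
        headers={
            "Authorization": "token {}".format(token),
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))

# Offline illustration with made-up data shaped like the API response
sample = [
    {"state": "active", "role": "admin", "organization": {"login": "example-org"}},
    {"state": "active", "role": "member", "organization": {"login": "other-org"}},
]
owned = owner_org_names(sample)
print(sorted(owned))
```

Once ``api_key`` has been loaded, something like ``orgs_to_check = owner_org_names(fetch_memberships(api_key))`` could stand in for the hard-coded list.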

In [None]:
# use the output of ./get_org_info.py --names-only for below
orgs_to_check = set(
"""
Common-Voice
Mozilla-Commons
Mozilla-Games
Mozilla-JetPack
Mozilla-TWQA
MozillaDataScience
MozillaDPX
MozillaFoundation
MozillaReality
MozillaSecurity
MozillaWiki
Pocket
Thunderbird-client
devtools-html
firefox-devtools
fxos
fxos-eng
iodide-project
mdn
moz-pkg-testing
mozilla
mozilla-applied-ml
mozilla-archive
mozilla-b2g
mozilla-bteam
mozilla-conduit
mozilla-extensions
mozilla-frontend-infra
mozilla-iam
mozilla-it
mozilla-l10n
mozilla-lockbox
mozilla-lockwise
mozilla-metrics
mozilla-mobile
mozilla-partners
mozilla-platform-ops
mozilla-private
mozilla-rally
mozilla-releng
mozilla-services
mozilla-spidermonkey
mozilla-standards
mozilla-svcops
mozilla-tw
mozmeao
nss-dev
projectfluent
taskcluster
""".split())

print("{:3d} orgs to check.".format(len(orgs_to_check)))

#### Cache Tuning & Clearing

Various functions use lru_cache -- this outputs the values to see if they are tuned appropriately.

Note that these have no meaning until after 1 or more queries have been run.
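
As a rough guide to reading those values, the sketch below condenses ``cache_info()`` into one line per function and flags any cache that has reached its ``maxsize``. ``cache_report`` and ``tiny`` are hypothetical names used only for illustration. A full cache is just a hint: it matters mainly when ``misses`` keeps growing on repeated queries.

```python
from functools import lru_cache

def cache_report(funcs):
    """Return one summary line per lru_cache-wrapped function."""
    lines = []
    for func in funcs:
        info = func.cache_info()
        full = info.maxsize is not None and info.currsize >= info.maxsize
        lines.append("{:<12} hits={} misses={} size={}/{}{}".format(
            func.__name__, info.hits, info.misses, info.currsize, info.maxsize,
            "  <-- full, consider a larger maxsize" if full else ""))
    return lines

@lru_cache(maxsize=2)
def tiny(x):          # hypothetical stand-in for a cached lookup
    return x * 2

for value in (1, 2, 3, 1):
    tiny(value)

report = cache_report([tiny])
print("\n".join(report))
```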

In [None]:
print("_search_for_user")
print(_search_for_user.cache_info())
print("_search_for_org")
print(_search_for_org.cache_info())

print("get_collaborators")
print(get_collaborators.cache_info())
print("get_members")
print(get_members.cache_info())

print("get_org_owners")
print(get_org_owners.cache_info())
print("get_inspectable_org_object")
print(get_inspectable_org_object.cache_info())

In [None]:
print("clearing caches...")
_search_for_user.cache_clear()
_search_for_org.cache_clear()
get_collaborators.cache_clear()
get_members.cache_clear()
get_org_owners.cache_clear()
get_inspectable_org_object.cache_clear()


## Code

### main code (CIS/IAM)

Not every operator will have a valid token for the CIS system, so fail gently when none is available.

In [None]:
def check_CIS(email):
    if _has_cis_access():
        login = _get_cis_info(email)
        display("CIS info for {} reports '{}'".format(email, login))
        return login
    else:
        display("Skipping CIS check, no token available.")

In [None]:
def _has_cis_access():
    import os
    return os.environ.get("CIS_CLIENT_ID", "") and os.environ.get("CIS_CLIENT_SECRET", "")

In [None]:
import os
import requests

_cis_bearer_token = None  # cached so the token is only requested once per session

def _get_cis_bearer_token():
    global _cis_bearer_token
    if _cis_bearer_token:
        return _cis_bearer_token
    else:
        url = "https://auth.mozilla.auth0.com/oauth/token"
        headers = {"Content-Type": "application/json"}
        payload = {
            "client_id": os.environ["CIS_CLIENT_ID"],
            "client_secret": os.environ["CIS_CLIENT_SECRET"],
            "audience": "api.sso.mozilla.com",
            "grant_type": "client_credentials"
        }
        resp = requests.post(url, json=payload, headers=headers)
        data = resp.json()
        _cis_bearer_token = data["access_token"]
        return _cis_bearer_token
    
def _get_cis_info(email):
    try:
        from urllib.parse import quote  # Python 3
    except ImportError:
        from urllib import quote  # Python 2
    bearer_token = _get_cis_bearer_token()
    # first get the v4 id
    url = "https://person.api.sso.mozilla.com/v2/user/primary_email/{}?active=any".format(quote(email))
    headers = {"Authorization": "Bearer {}".format(bearer_token)}
    resp = requests.get(url, headers=headers)
    data = resp.json()
    login = v4id = None
    try:
        v4id = data["identities"]["github_id_v4"]["value"]
    except (KeyError, TypeError):  # missing or null identity entry
        pass
    if v4id:
        # if there was a v4 id, map it to a login, via graphQL
        query = """
            query id_lookup($id_to_check: ID!) {
              node(id: $id_to_check) {
                ... on User {
                  login
                  id
                  databaseId
                }
              }
            }
            """
        variables = '{ "id_to_check": "' + str(v4id) + '" }'
        url = 'https://api.github.com/graphql'
        headers = {"Authorization": "Token {}".format(api_key)}
        payload = {
            "query": query,
            "variables": variables,
        }
        resp = requests.post(url, headers=headers, json=payload)
        try:
            data = resp.json()
            login = data["data"]["node"]["login"]
        except (KeyError, TypeError):  # a null "node" raises TypeError
            login = None
    return login


### main code (GitHub)

#### helpers

In [None]:
# print some debug information
import github3
print(github3.__version__)
print(github3.__file__)

In [None]:
# set values here - you can also override below

# get api key from environment, fall back to file
import os
api_key = os.environ.get("GITHUB_PAT", "")
if not api_key:
    api_key = open(".credentials", "r").readlines()[1].strip()
if not api_key:
    raise OSError("no GitHub PAT found")

In [None]:
import time

In [None]:

def print_limits(e=None, verbose=False):
    if e:
#         display("API limit reached, try again in 5 minutes.\n")
        display(str(e))

    reset_max = reset_min = 0
    limits = gh.rate_limit()
    resources = limits["resources"]
#     print("{:3d} keys: ".format(len(resources.keys())), resources.keys())
#     print(resources)
    for reset in resources.keys():
        reset_at = resources[reset]["reset"]
        reset_max = max(reset_at, reset_max)
        if not resources[reset]["remaining"]:
            reset_min = min(reset_at, reset_min if reset_min else reset_at)
            if verbose:
                print("EXPIRED for {} {}".format(reset, resources[reset]["remaining"]))
        else:
            if verbose:
                print("remaining for {} {}".format(reset, resources[reset]["remaining"]))

    if not reset_min:
        print("No limits reached currently.")
    else:
        print("Minimum reset at {} UTC ({})".format(time.asctime(time.gmtime(reset_min)),
                                                    time.asctime(time.localtime(reset_min))))
    print("All reset at {} UTC ({})".format(time.asctime(time.gmtime(reset_max)),
                                            time.asctime(time.localtime(reset_max))))

    
try:
    gh = github3.login(token=api_key)
    print("You are authenticated as {}".format(gh.me().login))
except (github3.exceptions.ForbiddenError, github3.exceptions.ConnectionError) as e:
    print(str(e))
    print_limits()
try:
    from functools import lru_cache
except ImportError:
    from backports.functools_lru_cache import lru_cache
    
print_limits()

From here on, use ``gh`` to access all data

In [None]:
@lru_cache(maxsize=128)
def _search_for_user(user):
    l = list(gh.search_users(query="type:user "+user))
    display(u"found {} potentials for {}".format(len(l), user))
    return l

@lru_cache(maxsize=512)
def _search_for_org(user):
    l = list(gh.search_users(query="type:org "+user))
    display(u"found {} potentials for {}".format(len(l), user))
    return l

def get_user_counts(user):
    # display(u"SEARCH '{}'".format(user))
    l = _search_for_user(user)
    for u in l:
        # display(u"   FOUND '{}'".format(u))
        yield u


In [None]:
displayed_users = set() # cache to avoid duplicate output
def show_users(user_list, search_term):
    global displayed_users
    unique_users = set(user_list)
    count = len(unique_users)
    if count >10:
        # Even if there are too many, we still want to check the 'root' term, if it matched
        try:
            seed_user = gh.user(search_term.encode('ascii', 'replace'))
            display(u"... too many to be useful, still trying '{}' ...".format(seed_user.login))
            displayed_users.add(seed_user)
#             print("search_term {}; seed_user {}; seed_user.login {}".format(search_term, seed_user, seed_user.login))
        except github3.exceptions.NotFoundError:
            display(u"... too many to be useful, '{}' is not a user".format(search_term))
    else:
        for u in [x for x in unique_users if x not in displayed_users]:
            displayed_users.add(u)
            u.user.refresh()  # return value was never used
    if 0 < count <= 10:
        return [u.login for u in unique_users]
    else:
        return []

from itertools import permutations

def _permute_seeds(seeds):
    if len(seeds) == 1:
        yield seeds[0]
    else:
        for x, y in permutations(seeds, 2):
            permutation = " ".join([x,y])
            display(u"   trying phrase permutation {}".format(permutation))
            yield permutation
            permutation = "".join([x,y])
            display(u"   trying permutation {}".format(permutation))
            yield permutation
            
def gather_possibles(seeds):
    found = set()
    # sometimes get a phrase coming in - e.g. "First Last"
    for seed in _permute_seeds(seeds.split()):
        maybes = show_users(get_user_counts(seed), seed)
        found.update(maybes)
        # if it was an email addr, try again with the mailbox name
        if '@' in seed:
            seed2 = seed.split('@')[0]
            display(u"Searching for mailbox name '{}' (gather_possibles)".format(seed2))
            maybes = show_users(get_user_counts(seed2), seed2)
            found.update(maybes)
    return found


In [None]:
class OutsideCollaboratorIterator(github3.structs.GitHubIterator):
    def __init__(self, org):
        super(OutsideCollaboratorIterator, self).__init__(
            count=-1, #get all
            url=org.url + "/outside_collaborators",
            cls=github3.users.ShortUser,
            session=org.session,
        )

@lru_cache(maxsize=512)
def get_collaborators(org):
    collabs = [x.login.lower() for x in OutsideCollaboratorIterator(org)]
    return collabs

def is_collaborator(org, login):
    return bool(login.lower() in get_collaborators(org))

# provide same interface for members -- but the iterator is free :D
@lru_cache(maxsize=512)
def get_members(org):
    collabs = [x.login.lower() for x in org.members()]
    return collabs

def is_member(org, login):
    return bool(login.lower() in get_members(org))

In [None]:
@lru_cache(maxsize=64)
def get_org_owners(org):
    owners = org.members(role="admin")
    logins = [x.login for x in owners]
    return logins

@lru_cache(maxsize=128)
def get_inspectable_org_object(org_name):
    try:
        o = gh.organization(org_name)
        # make sure we have enough chops to inspect it
        get_org_owners(o)
        is_member(o, "qzu"*3)
        is_collaborator(o, "qzu"*3)
    except github3.exceptions.NotFoundError:
        o = None
        display("No such organization: '{}'".format(org_name))
    except github3.exceptions.ForbiddenError as e:
        o = None
        display("\n\nWARNING: Not enough permissions for org '{}'\n\n".format(org_name))
    except Exception as e:
        o = None
        display("didn't expect to get here")
    return o

def check_login_perms(logins, headers=None):
    any_perms = []
    any_perms.append("=" * 30)
    if headers:
        any_perms.extend(headers)
    if not len(logins):
        any_perms.append("\nFound no valid usernames")
    else:
        any_perms.append("\nChecking {} usernames for membership in {} orgs".format(len(logins), len(orgs_to_check)))
        for login in logins:
            start_msg_count = len(any_perms)
            for org in orgs_to_check:
                o = get_inspectable_org_object(org)
                if o is None:
                    continue
                if is_member(o, login):
                    url = "https://github.com/orgs/{}/people?utf8=%E2%9C%93&query={}".format(o.login, login)
                    msg = "FOUND! {} has {} as a member: {}".format(o.login, login, url)
                    owner_logins =  get_org_owners(o)
                    is_owner = login in owner_logins
                    if is_owner:
                        msg += "\n  NOTE: {} is an OWNER of {}".format(login, org)
                    any_perms.append(msg)
                if is_collaborator(o, login):
                    url = "https://github.com/orgs/{}/outside-collaborators?utf8=%E2%9C%93&query={}".format(o.login, login)
                    any_perms.append("FOUND! {} has {} as a collaborator: {}".format(o.login, login, url))
            else:
                end_msg_count = len(any_perms)
                if end_msg_count > start_msg_count:
                    # some found, put a header on it, then add a blank line
                    any_perms.insert(start_msg_count, "\nFound {:d} orgs for {}:".format(end_msg_count-start_msg_count, login))
                    any_perms.append("")
                else:
                    any_perms.append("No permissions found for {}".format(login))
    return any_perms

In [None]:
def extract_addresses(text):
    """Get email addresses from text
    """
    # ASSUME that text is a list of email addresses (possibly empty)
    if not text: 
        return []
#     print("before: %s" % text)
    text = text.replace('[', '').replace(']','').replace("b'", "").replace("'", "")
#     print("after: %s" % text)
#     print(" split: %s" % text.split())
    return text.split()
    #raise ValueError("couldn't parse '{}'".format(text))


#### main driver

In [None]:
import re
import os

re_flags = re.MULTILINE | re.IGNORECASE

def process_from_email(email_body):
    # get rid of white space
    email_body = os.linesep.join(
        [s.strip() for s in email_body.splitlines() if s.strip()]
    )
    if not email_body:
        return

    user = set()
    
    # Extract data from internal email format
    match = re.search(r'^Full Name: (?P<full_name>\S.*)$', email_body, re_flags)
    if match:
        # add base and some variations
        full_name = match.group("full_name")
        user.add(full_name)
        # remove spaces, forward & reversed
        user.add(full_name.replace(' ', ''))
        user.add(''.join(full_name.split()[::-1]))
        # use hyphens, forward & reversed
        user.add(full_name.replace(' ', '-'))
        user.add('-'.join(full_name.split()[::-1]))

    match = re.search(r'^Email: (?P<primary_email>.*)$', email_body, re_flags)
    primary_email = match.group("primary_email") if match else None
    user.add(primary_email)
    default_login = primary_email.split('@')[0] if primary_email else None
    if default_login:
        # add some common variations that may get discarded for "too many" matches
        user.update([
            "moz{}".format(default_login),
            "moz-{}".format(default_login),
            "mozilla{}".format(default_login),
            "mozilla-{}".format(default_login),
            "{}moz".format(default_login),
            "{}-moz".format(default_login),
        ])
        
    # let user start manual work before we do all the GitHub calls
    display("Check these URLs for Heroku activity:")
    display("  Heroku Access: https://people.mozilla.org/a/heroku-members/edit?section=members")
    if primary_email:
        # these links need an address; skip them when the body had no "Email:" line
        quoted_email = primary_email.replace('@', '%40')
        display("     copy/paste for ^^ query:  :{}:  ".format(primary_email))
        display("  People: https://people.mozilla.org/s?who=all&query={}".format(quoted_email))
        display("  Heroku: https://dashboard.heroku.com/teams/mozillacorporation/access?filter={}".format(quoted_email))
    display(email_body)

    match = re.search(r'^Github Profile: (?P<github_profile>.*)$', email_body, re_flags)
    declared_github = match.group("github_profile") if match else None
    user.add(declared_github)
    display("Declared GitHub {}".format(declared_github))
    
    # check CIS for verified login (not all users will have creds)
    verified_github_login = check_CIS(primary_email) if primary_email else None
    if verified_github_login:
        user.add(verified_github_login)
        display("Verified GitHub {}".format(verified_github_login))

    match = re.search(r'^Zimbra Alias: (?P<other_email>.*)$', email_body, re_flags)
    possible_aliases = extract_addresses(match.group("other_email") if match else None)
    user.update(possible_aliases)

    # new field: Email Alias -- list syntax (brackets)
    match = re.search(r'^Email Alias: \s*\[(?P<alias_email>.*)\]', email_body, re_flags)
    user.add(match.group("alias_email") if match else None)

    # we consider each token in the IM line as a possible GitHub login
    match = re.search(r'^IM:\s*(.*)$', email_body, re_flags)
    if match:
        im_line = match.groups()[0]
        matches = re.finditer(r'\W*((\w+)(?:\s+\w+)*)', im_line)
        user.update(x.group(1) for x in matches)  # finditer always returns an iterator

    match = re.search(r'^Bugzilla Email: (?P<bz_email>.*)$', email_body, re_flags)
    user.add(match.group("bz_email") if match else None)
    
    # grab the department name, for a heuristic on whether we expect to find perms
    expect_github_login = False
    match = re.search(r'^\s*Dept Name: (?P<dept_name>\S.*)$', email_body, re_flags)
    if match:
        department_name = match.groups()[0].lower()
        dept_keys_inferring_github = ["firefox", "engineering", "qa", "operations"]
        for key in dept_keys_inferring_github:
            if key in department_name:
                expect_github_login = True
                break
    

    # clean up some noise, case insensitively, "binary" markers
    user = {x.lower() for x in user if x and (len(x) > 2)}
    to_update = [x[2:-1] for x in user if (x.startswith("b'") and x.endswith("'"))]
    user.update(to_update)
    user = {x for x in user if not (x.startswith("b'") and x.endswith("'"))}

    # the tokens to ignore are added based on discovery,
    # they tend to cause the searches to get rate limited.
    user = user - {None, "irc", "slack", "skype", "b", 'hotmail', 'mozilla', 'ro', 'com', 'softvision', 'mail', 
                  'twitter', 'blog', 'https', 'jabber', 'net', 'github', 'gmail',
                  'facebook', 'guy', 'pdx', 'yahoo', 'aim', 'whatsapp', 'gtalk', 'google',
                  'gpg', 'telegram', 'keybase', 'zoom', 'name', }
    global displayed_users
    displayed_users = set()
    try:
        headers = [u"Search seeds: '{}'".format("', '".join(user)),]
        display(*headers)
        guesses = set()
        for term in user:
            possibles = gather_possibles(term)
            guesses.update({x.lower() for x in possibles})
        # include declared_github if it exists
        if declared_github:
            guesses.add(declared_github.lower())
        guesses.update({x.login.lower() for x in displayed_users})
        display("Checking logins {}".format(guesses))
        msgs = []
        msgs = check_login_perms(guesses, headers)
        found_perms = "FOUND!" in "".join(msgs)
        display("msgs {}; headers {}".format(len(msgs), len(headers)))
        display("found_perms {}; declared_github {} {}".format(found_perms, declared_github, bool(declared_github)))

        if declared_github and not found_perms:
            msgs.append("Even for declared login '{}'.".format(declared_github))
        if expect_github_login and not found_perms:
            msgs.append("WARNING: expected GitHub permissions for dept '{}'".format(department_name))
        msgs.append("Finished all reporting.")
        display(*msgs)
    except github3.exceptions.ForbiddenError as e:
        print_limits(e)
        raise e

In [None]:
from ipywidgets import interact_manual, Layout, widgets
from IPython.display import display
   
text = widgets.Textarea(
    value='email: \nim: ',
    placeholder='Paste ticket description here!',
    description='Email body:',
    layout=Layout(width='95%'),
    disabled=False
)

run_process = interact_manual.options(manual_name="Process")

In [None]:
try:
    text_type = unicode  # Python 2
except NameError:
    text_type = str  # Python 3

def display(*args):
    # iPyWidgets don't like unicode - ensure everything we try to put there is ascii
    text = u"\n".join([text_type(x) for x in args])  # deal with None values by casting to text
    cleaned = text.encode('ascii', 'replace').decode('ascii')
    if cleaned.strip():
        print(cleaned)

In [None]:
def check_github_logins(logins):
    logins_to_check = set(logins.split())
    for login in logins_to_check:
        print("\nworking on %s:" % login)
        msgs = check_login_perms([login])
        display(*msgs)


#### EML file support

In [None]:
# read EML file support
import email
from ipywidgets import FileUpload
from pprint import pprint as pp
from IPython.display import display as display_widget


In [None]:
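def extract_reply(body):
    # collect the quoted ("> ") lines of a reply, stopping at the first quoted line that starts with "--"
    extracted = []
    for line in body.splitlines():  # handles both "\r\n" and "\n" line endings
        if line.startswith('> --'):
            break
        elif line.startswith('> '):
            extracted.append(line[2:])
    return extracted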

def process_from_file(uploader):
    # message = email.message_from_string()
    for file in uploader.value.keys():
        print("checking %s" % file)
        pp(uploader.value[file].keys())
        message = email.message_from_string(uploader.value[file]["content"])
        for part in message.walk():
            if part.get_content_maintype() == 'multipart':
                continue
            else:
                mime = part.get_content_type()
                if 'plain' in mime:
                    body = part.get_payload()
                    # this could be the original, or a reply
                    if re.search(r'''^Full Name:''', body, re_flags):
                        print("original email:")
                        process_from_email(body)
                    elif re.search(r'''^> Full Name:''', body, re_flags):
                        print("reply:")
                        process_from_email("\n".join(extract_reply(body)))
                    else:
                        print("no match!\n%s" % body)

# Start of common usage (How To)

Currently, there are three common use cases:
- processing an offboarding email (via downloaded EML file),
- processing an offboarding email (via message copy/paste), and
- ad hoc lookup of GitHub logins

For anything else, you're on your own!

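All usage requires the following setup:
1. Make your GitHub PAT available, either in the ``GITHUB_PAT`` environment variable or on the second line of a ``.credentials`` file (see the "helpers" cells)
2. Fill in the list of orgs to check in the "Orgs to check" cell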

## EML File parsing

Upload the file using the button below, then process it by running the cell below the button. You can only process one file at a time, but the "file uploaded" count will continue to increase (a UI glitch).
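
For orientation, the sketch below shows the kind of traversal involved in finding the plain-text body of an EML message with the standard ``email`` module. The message content and the ``plain_text_bodies`` helper are hypothetical; real offboarding emails carry more fields.

```python
import email

# Hypothetical message; real offboarding emails will have more fields
raw_eml = (
    "From: hr@example.com\r\n"
    "To: admin@example.com\r\n"
    "Subject: offboarding\r\n"
    "MIME-Version: 1.0\r\n"
    "Content-Type: text/plain; charset=\"us-ascii\"\r\n"
    "\r\n"
    "Full Name: Pat Example\r\n"
    "Email: pexample@example.com\r\n"
)

def plain_text_bodies(raw):
    """Return the payload of every text/plain part in an RFC 822 message string."""
    message = email.message_from_string(raw)
    bodies = []
    for part in message.walk():
        if part.get_content_maintype() == "multipart":
            continue  # containers carry no text of their own
        if part.get_content_type() == "text/plain":
            bodies.append(part.get_payload())
    return bodies

bodies = plain_text_bodies(raw_eml)
print(bodies[0])
```

Note that ``get_payload()`` without ``decode=True`` returns the payload as stored, so a quoted-printable or base64 encoded part would need decoding before the field patterns can match.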

In [None]:
_uploader = FileUpload(accept="*.eml", multiple=False)
display_widget(_uploader)
#check_file(_uploader)

In [None]:
def check_file(f):
    try:
        #display_widget(_uploader)
        process_from_file(f)
        print("completed")
    except Exception as e:
        print(repr(e))
        raise
check_file(_uploader)

## Process offboarding email body text (copy/paste)

Usage steps for each user:
1. Copy the entire text of the email
2. Paste it into the text area below
3. Click the "Process" button
4. Use the generated links to check for Heroku authorization
5. After "Finished all reporting." is printed, copy/paste the final output into the email
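
The pasted text is scanned for labelled lines such as "Full Name:", "Email:" and "Github Profile:". The sketch below illustrates that style of extraction on a made-up body. ``extract_field`` is a hypothetical helper, not the function the notebook uses, and the sample values are invented.

```python
import re

RE_FLAGS = re.MULTILINE | re.IGNORECASE

# Hypothetical email body; the field labels are the ones the parser searches for
sample_body = """\
Full Name: Pat Example
Email: pexample@example.com
Github Profile: pat-example
Dept Name: Firefox Engineering
"""

def extract_field(label, body):
    """Return the text after 'label:' on its own line, or None if absent."""
    pattern = r"^{}:[ \t]*(?P<value>.*)$".format(re.escape(label))
    match = re.search(pattern, body, RE_FLAGS)
    return match.group("value").strip() if match else None

fields = {label: extract_field(label, sample_body)
          for label in ("Full Name", "Email", "Github Profile", "Zimbra Alias")}
print(fields)
```

A label that does not appear in the body yields ``None``, which is why the search-seed set has to be cleaned of empty values before any GitHub queries are made.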

In [None]:
@run_process(t=text)
def show_matches(t):
    try:
        process_from_email(t)
    except Exception as e:
        print(repr(e))
        pass

## Ad Hoc Lookup

Fill in the list of desired logins (separated by whitespace) in the cell below.

In [None]:
check_github_logins(
    """

 """
)
print("done")

# To Do

- check invites as well, using manage_invitations.py
- code doesn't handle hyphenated github logins, e.g. 'marco-c' (gets split)
- github lookup should strip https... so can use link from people.m.o
- dpreston, aka fzzy, doesn't have any GitHub perms
- fix permutations of names
- preprocess to remove all (colon separated) :b':':[:]: (maybe not the :b: & :':)
- add link to Heroku service accounts to check
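
Two of the open items above (hyphenated logins such as 'marco-c', and stripping the https prefix from a profile link) could be approached with a small normalization step. The sketch below is one possible shape, not a tested fix for the notebook. ``normalize_login`` and ``LOGIN_RE`` are hypothetical, and the pattern is only a loose approximation of valid GitHub logins.

```python
import re

# GitHub logins are alphanumerics and hyphens; this pattern is a loose
# approximation used only to validate a candidate login.
LOGIN_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")

def normalize_login(text):
    """Reduce a profile URL or bare token to a candidate login (hypothetical helper)."""
    text = text.strip()
    # drop scheme and host if a full profile link was pasted
    text = re.sub(r"^(?:https?://)?(?:www\.)?github\.com/", "", text, flags=re.IGNORECASE)
    # keep only the first path segment
    text = text.split("/")[0]
    match = LOGIN_RE.fullmatch(text)
    return match.group(0) if match else None

candidates = [normalize_login(s) for s in (
    "https://github.com/marco-c",
    "marco-c",
    "github.com/marco-c/some-repo",
    "not a login",
)]
print(candidates)
```

Because the hyphen is kept as part of the token, a login like 'marco-c' survives intact instead of being split into two search seeds.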


- ~~GitHub login no longer part of email, but user id is available via CIS~~
- ~~add "clear cache" button to purge after long idle~~ _(in tuning section)_
- ~~add common login with 'moz{,illa}' tacked on, sometimes with a dash~~
- ~~update link to view access group on people.m.o~~
- ~~add "trying" info to copy/paste output~~
- ~~double check that "even for declared login" code still active~~
- ~~add formatted output summary for copy/paste~~
- ~~when a guess is multiple words, each word should be tried separately as well~~
- ~~code should always search for stated github, even if search is "too many" (e.g. "past")~~
- ~~does not call out owner status (reports as member)~~
- ~~add short ldap name as an "always check"~~
- ~~always check stem when search gives too many (i.e. go for the exact match)~~
- ~~treat Zimbra Aliases as a potential multi valued list (or empty)~~
- ~~"-" is a valid character in GitHub logins. Try as separator first-last and last-first~~
