# People

What are they? How can we represent people in a meaningful way without having *real* data to draw from. In our organization we should have a few key pieces of information on each person:

* Unique identifier (a name would be great)
* Office/Location
* Job Title
* Team / Line of Business
* Manager (this is key to understanding how reporting works)
* Other info: programming languages, apps used, timezone, date hired, projects worked on, ...

## Faker

Let's generate some fake data! https://faker.readthedocs.io/en/master/

In [150]:
from faker import Faker
fake = Faker()

fake.profile() # lots of great stuff in here!

{'job': 'Scientific laboratory technician',
 'company': 'Davis-Garcia',
 'ssn': '676-50-6536',
 'residence': '782 Reyes Lake Suite 394\nLake Jenniferport, CA 73531',
 'current_location': (Decimal('-46.844331'), Decimal('173.534615')),
 'blood_group': 'A+',
 'website': ['https://wade-hughes.org/',
  'http://key.org/',
  'https://www.luna-hall.com/'],
 'username': 'sarahterrell',
 'name': 'Carrie Smith',
 'sex': 'F',
 'address': '7619 Lane Lane\nSouth Cynthiashire, GA 21547',
 'mail': 'maryroberts@hotmail.com',
 'birthdate': datetime.date(1965, 6, 13)}

We can also do things like ensure uniqueness for individual entries across all entries

In [151]:
from faker.exceptions import UniquenessException
import traceback

try:
    for i in range(10):
        print(fake.unique.prefix())
except UniquenessException:
    print("😟")

Mrs.
Mr.
Mx.
Dr.
Ms.
Ind.
Miss
Misc.
😟


Try generating a few people and see if it looks like a good representation of our organization

In [152]:
...

Ellipsis

This is a good start but ... it's kind of wonky. We have people all over the world with so many different jobs! Let's keep the spirit of this but implement some of our own limitations on fields to ensure things line up with what we'd expect a company org to look like


First, a few more interesting features: we can also register new `providers` if anything is missing. If needed these can be customized for different locales 

In [153]:
from faker.providers import DynamicProvider

employment_status_provider = DynamicProvider(
     provider_name="employment",
     elements=["Full Time", "Part Time", "Contract"],
)

fake.add_provider(employment_status_provider)

fake.employment()

'Contract'

We can customize this further by using the `Faker.BaseProvider`

In [154]:
# first, import a similar Provider or use the default one
from more_itertools import one
from faker.providers import BaseProvider

# create new provider class
class EmploymentStatus(BaseProvider):
    statuses = {"Full Time": 0.7, "Part Time": 0.05, "Contract": 0.3}
    def employment(self) -> str:
        return one(fake.random.choices(
            list(self.statuses), 
            weights=self.statuses.values()
        ))

# then add new provider to faker instance
fake.add_provider(EmploymentStatus)

fake.employment()


'Contract'

### A Tech Focused Person Data

To ground us in this task, let's define a new `Person` object that we can fill up with info (and a few other objects):

In [155]:
from dataclasses import dataclass, field
from typing import Literal
from enum import Enum, auto
import datetime

class timezone(str, Enum):
    EST = auto()
    PST = auto()
    UTC = auto()

@dataclass
class Location:
    city: str
    tz: timezone
    country: str

@dataclass
class Person:
    """Someone who works in our company!"""
    name: str
    hire_date: datetime.date
    status: Literal["Full Time", "Part Time", "Contract"]
    languages: list[str] = field(default_factory=list)
    manager:str = None
    team: str = None 
    title: str = None
    location: Location = None

In [156]:
Person(name="Employee #1",hire_date=datetime.date.today(), status="Full Time", location=Location("New York", "EST", "USA"))

Person(name='Employee #1', hire_date=datetime.date(2023, 4, 24), status='Full Time', languages=[], manager=None, team=None, title=None, location=Location(city='New York', tz='EST', country='USA'))

In [157]:
import numpy as np
import random

def choose_a_few(
    options: list[str],
    weights: list[int | float] = None,
    max_choices: int = None,
    min_choices: int = 0,
) -> list[str]:
    """A helpful function to pick a random number of choices from a list of options
    
    By default skews the weights toward the first options in the list"""
    max_choices = np.clip(max_choices or len(options), min_choices, len(options))
    
    # how many choices will we make this time?
    divisor = max_choices * (max_choices + 1) / 2    
    k_weights = [int(x) / divisor for x in range(max_choices, min_choices-1, -1)]
    n_choices = np.random.choice(list(range(min_choices,max_choices+1)), p=k_weights)
    
    # make the choices
    choices = random.choices(options, weights=weights, k=n_choices)
    return list(set(choices))


Now to make some people. Let's re-use whatever we can from `Faker` and then add some more of our own fields. We can also extend where needed to keep our code clear and consistent:

In [158]:
class ProgrammingLanguages(BaseProvider):    
    languages = {
        "Python": 0.25,
        "Scala": 0.1,
        "Go": 0.08,
        "JavaScript": 0.3,
        "Java": 0.3,
        "Typescript": 0.17,
        "Erlang": 0.01,
        "Elixir": 0.001,
    }
    def programming_languages(self) -> str:
        return choose_a_few(list(self.languages), weights=self.languages.values())

fake.add_provider(ProgrammingLanguages)


In [159]:
def make_person() -> Person:
    return Person(
        name = fake.name(),
        hire_date = fake.date_between(start_date="-3y", end_date="today"),
        status = fake.employment(),
        languages = fake.programming_languages(),
        team = None, # hrmmmm this is harder
        title = None, # let's be smarter with this
        location = None, # let's also be smarter with this
    )

make_person()

Person(name='Brandon Kennedy', hire_date=datetime.date(2022, 4, 12), status='Full Time', languages=['Java', 'JavaScript', 'Python'], manager=None, team=None, title=None, location=None)

Now we can generate more complex attributes in a smart way. Let's set up some rules about where offices are, what teams are in which offices, then pick titles based on other info (e.g. Developers probably know at least one language ... )

In [160]:
TEAM_TITLES:dict[str,list[str]] = {
    "DevX": ["Engineer", "Engineer", "Engineer", "Engineer", "Engineer", "AVP"],
    "DevOps": ["Engineer", "Senior Engineer", "Manager"],
    "Sales": ["Associate"],
    "Support": ["Analyst", "Manager"],
    "Platform": ["Engineer", "Senior Engineer","Managing Engineer", "AVP", "VP"],
    "Product": ["Engineer", "Manager", "Product Owner", "AVP", "VP"],
    "Internal Tools": ["Engineer", "Senior Engineer", "Manager", "AVP", "VP"],
    "Business": ["Analyst", "Associate", "Vice President", "Director", "Managing Director"]
}


def title_city_team():
    # just a few locations
    offices = {
        location.city: location
        for location in [
            Location("New York", tz="EST", country="USA"),
            Location("Seattle", tz="PST", country="USA"),
            Location("Toronto", tz="EST", country="CAN"),
            Location("London", tz="UTC", country="GBR"),
            Location("Fort Lauderdale", tz="EST", country="USA"),
            Location("Dublin", tz="UTC", country="IRL"),
        ]
    }
    # codify the hierarchical structure
    allowed_teams_per_office = {
        "New York": ["Sales", "Product", "Business"],
        "Toronto": ["Platform", "Product", "Internal Tools", "Sales", "Business"],
        "Fort Lauderdale": ["DevX"],
        "Dublin": ["DevOps", "Support"],
        "London": ["Sales", "Business"],
        "Seattle": ["Internal Tools", "Product", "Platform"],
    }
    allowed_titles_per_team = TEAM_TITLES

    city = random.choice(list(offices))
    team = random.choice(allowed_teams_per_office[city])
    title = choose_a_few(
        allowed_titles_per_team[team], max_choices=1, min_choices=1
    ).pop()
    
    return {
        "location": Location(city=city, tz=offices[city].tz, country=offices[city].country),
        "title": title,
        "team": team,
    }


title_city_team()


{'location': Location(city='London', tz='UTC', country='GBR'),
 'title': 'Analyst',
 'team': 'Business'}

After running this we should have a better balanced org in terms of region + titles. Then we just need to add the connections in -- i.e. who's the boss?!

In [161]:
def make_person() -> Person:
    title_city_team_ = title_city_team()
    technical = 1 if "Engineer" in title_city_team_["title"] else 0
    return Person(
        name = fake.name(),
        hire_date = fake.date_between(start_date="-3y", end_date="today"),
        status = fake.employment(),
        languages = fake.programming_languages(),
        **title_city_team_,
    )


In [162]:
import pandas as pd
people_df = pd.DataFrame((make_person() for _ in range(150)))
people_df.head()

Unnamed: 0,name,hire_date,status,languages,manager,team,title,location
0,Karen Sparks,2021-10-13,Full Time,[],,Support,Manager,"{'city': 'Dublin', 'tz': 'UTC', 'country': 'IRL'}"
1,Douglas Joseph,2022-05-02,Contract,"[Java, Typescript, JavaScript]",,DevX,Engineer,"{'city': 'Fort Lauderdale', 'tz': 'EST', 'coun..."
2,Jennifer Beltran,2023-01-18,Full Time,[],,DevX,Engineer,"{'city': 'Fort Lauderdale', 'tz': 'EST', 'coun..."
3,Jessica Miller,2022-10-28,Full Time,[Python],,Business,Managing Director,"{'city': 'New York', 'tz': 'EST', 'country': '..."
4,Sandra Fuller,2023-02-19,Full Time,"[Go, Java, JavaScript, Python]",,Platform,VP,"{'city': 'Toronto', 'tz': 'EST', 'country': 'C..."


So, let's group by Team and then pick a manager for everyone. Let's use these rules:

* People report to someone of a higher title if possible, else to a peer
* Reporting happens within a team
* We already ordered `TEAM_TITLES` based on *rank*
* Team leads should be listed as reporting to themselves (for now)

In [165]:
# calculate team ranks
ranks = {team: {title: rank + 1 for rank,title in enumerate(titles)} for team, titles in TEAM_TITLES.items()}
for team in ranks:
    people_df.loc[people_df.team==team, "rank"] = people_df.loc[people_df.team==team].title.map(ranks[team])
people_df = people_df.sort_values(by=["team","rank"])
people_df.sample(3)

Unnamed: 0,name,hire_date,status,languages,manager,team,title,location,rank
8,Brian Lewis,2021-09-28,Full Time,[Typescript],,Platform,VP,"{'city': 'Toronto', 'tz': 'EST', 'country': 'C...",5.0
81,Kayla Baker,2022-04-19,Contract,"[Erlang, Scala, JavaScript, Python]",,Business,Director,"{'city': 'London', 'tz': 'UTC', 'country': 'GBR'}",4.0
41,Vincent Rodriguez,2022-11-10,Contract,[],,Product,VP,"{'city': 'New York', 'tz': 'EST', 'country': '...",5.0


In [166]:
# determine supervisor
def naivereportsto(row, df):
    supervisor = (
        df[(df.index < row.name)].query(f"""rank <= {row["rank"]}-1""").tail(1)["name"]
    )
    supervisor = supervisor.item() if not supervisor.empty else None
    peer = df[(df.index < row.name)].query(f"""rank  == {row["rank"]}""").head(1)["name"]
    peer = peer.item() if not peer.empty else None
    return supervisor or peer or row["name"]


def reportsto(df):
    return df.assign(manager=df.apply(naivereportsto, df=df, axis=1))


def supervisors(df):
    df = df.groupby("team", group_keys=False).apply(reportsto).reset_index(drop=True)
    return df


people_df = people_df.pipe(supervisors)
people_df.head(5)


Unnamed: 0,name,hire_date,status,languages,manager,team,title,location,rank
0,Chad White,2020-11-05,Full Time,[JavaScript],Chad White,Business,Analyst,"{'city': 'London', 'tz': 'UTC', 'country': 'GBR'}",1.0
1,Charles Jones,2020-11-07,Full Time,[],Chad White,Business,Analyst,"{'city': 'London', 'tz': 'UTC', 'country': 'GBR'}",1.0
2,Lisa Nelson,2023-03-15,Full Time,"[Go, Typescript, JavaScript, Python]",Chad White,Business,Analyst,"{'city': 'London', 'tz': 'UTC', 'country': 'GBR'}",1.0
3,Austin Fowler,2022-06-16,Full Time,"[Java, JavaScript, Python]",Chad White,Business,Analyst,"{'city': 'London', 'tz': 'UTC', 'country': 'GBR'}",1.0
4,Michael Anderson,2020-05-06,Full Time,[Typescript],Chad White,Business,Analyst,"{'city': 'London', 'tz': 'UTC', 'country': 'GBR'}",1.0


Now we just need a CEO for all the team leads to report to. Set their manager as themselves to help us out later. We need to make sure to include all the other information in the DF that we just generated, namely `rank` and `manager`. Here let's also set the CEO as reporting to themselves 

In [169]:
CEO = make_person().__dict__ | {"team":"CEO", "title":"CEO", "status":"Full Time"}
CEO["location"] = CEO["location"].__dict__
people_df = pd.concat([people_df, pd.DataFrame([CEO])])
CEO_mask = people_df.name==CEO["name"]
people_df.loc[(people_df.manager == people_df.name) | CEO_mask ,"manager"]=CEO["name"]
people_df.loc[CEO_mask, "rank"] = people_df["rank"].max()+1

Alright, we have something now. Does this seems reasonably distributed? Let's use `plotly` to explore our people's dimensions and get a feel for the data

In [173]:
# let's flatten the nested pieces of the DataFrame (`people_df.location`)
expanded_df = people_df.assign(**people_df.location.apply(pd.Series))
expanded_df

Unnamed: 0,name,hire_date,status,languages,manager,team,title,location,rank,city,tz,country
0,Chad White,2020-11-05,Full Time,[JavaScript],Lindsey Miller,Business,Analyst,"{'city': 'London', 'tz': 'UTC', 'country': 'GBR'}",1.0,London,UTC,GBR
1,Charles Jones,2020-11-07,Full Time,[],Chad White,Business,Analyst,"{'city': 'London', 'tz': 'UTC', 'country': 'GBR'}",1.0,London,UTC,GBR
2,Lisa Nelson,2023-03-15,Full Time,"[Go, Typescript, JavaScript, Python]",Chad White,Business,Analyst,"{'city': 'London', 'tz': 'UTC', 'country': 'GBR'}",1.0,London,UTC,GBR
3,Austin Fowler,2022-06-16,Full Time,"[Java, JavaScript, Python]",Chad White,Business,Analyst,"{'city': 'London', 'tz': 'UTC', 'country': 'GBR'}",1.0,London,UTC,GBR
4,Michael Anderson,2020-05-06,Full Time,[Typescript],Chad White,Business,Analyst,"{'city': 'London', 'tz': 'UTC', 'country': 'GBR'}",1.0,London,UTC,GBR
...,...,...,...,...,...,...,...,...,...,...,...,...
146,Jacob Kaufman,2021-01-15,Contract,[Python],Michael Gallagher,Support,Manager,"{'city': 'Dublin', 'tz': 'UTC', 'country': 'IRL'}",2.0,Dublin,UTC,IRL
147,Alexander Gonzales,2020-07-22,Full Time,"[Java, Scala, Typescript, Python]",Charles Fox,Support,Manager,"{'city': 'Dublin', 'tz': 'UTC', 'country': 'IRL'}",2.0,Dublin,UTC,IRL
148,Elizabeth Wilkerson,2022-07-02,Full Time,[Typescript],Lisa Long,Support,Manager,"{'city': 'Dublin', 'tz': 'UTC', 'country': 'IRL'}",2.0,Dublin,UTC,IRL
149,Kathryn Smith,2020-06-29,Full Time,[Java],Lisa Long,Support,Manager,"{'city': 'Dublin', 'tz': 'UTC', 'country': 'IRL'}",2.0,Dublin,UTC,IRL


In [176]:
import plotly.express as px

fig = px.bar(
    expanded_df,
    x="title",
    color="team",
    hover_name="name",
    hover_data=["team", "tz", "city","manager","languages"],
    facet_col="country",
    template="plotly_dark",
)
fig.update_xaxes(matches=None, title_text=None)
