# People

What are they? How can we represent people in a meaningful way without having *real* data to draw from? In our organization we should have a few key pieces of information on each person:

* Unique identifier (a name would be great)
* Office/Location
* Job Title
* Team / Line of Business
* Manager (this is key to understanding how reporting works)
* Other info: programming languages, apps used, timezone, date hired, projects worked on, ...

## Faker

Let's generate some fake data! https://faker.readthedocs.io/en/master/

In [1]:
from faker import Faker
fake = Faker()

fake.profile() # lots of great stuff in here!

{'job': 'Rural practice surveyor',
 'company': 'Hoover-Wheeler',
 'ssn': '694-45-7892',
 'residence': '174 Kimberly Drives Suite 100\nDavidland, TX 11410',
 'current_location': (Decimal('-50.064406'), Decimal('37.725507')),
 'blood_group': 'O-',
 'website': ['http://decker-goodman.net/'],
 'username': 'bonnie57',
 'name': 'Melinda Smith',
 'sex': 'F',
 'address': 'PSC 3998, Box 2594\nAPO AE 46207',
 'mail': 'natasha33@gmail.com',
 'birthdate': datetime.date(1936, 1, 20)}

We can also ensure uniqueness: `fake.unique` never returns the same value twice for a given provider, and raises a `UniquenessException` once it runs out of new values

In [2]:
from faker.exceptions import UniquenessException
try:
    for i in range(10):
        print(fake.unique.prefix())
except UniquenessException:
    print("😟")

Mr.
Dr.
Mrs.
Mx.
Miss
Misc.
Ind.
Ms.
😟
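
To make the exhaustion behaviour above concrete, here is a minimal standard-library sketch of the same idea: remember what has been returned and give up after a fixed number of attempts. This is not Faker's implementation; `unique_draws`, `ExhaustedError`, and the `max_attempts` budget are hypothetical names chosen for illustration.

```python
import random


class ExhaustedError(Exception):
    """Raised when no unseen value turns up within the attempt budget."""


def unique_draws(options, rng, max_attempts=1000):
    """Yield values from `options` without repeats; raise once none are left."""
    seen = set()
    while True:
        for _ in range(max_attempts):
            value = rng.choice(options)
            if value not in seen:
                seen.add(value)
                yield value
                break
        else:
            # every attempt hit a value we had already returned
            raise ExhaustedError(f"no new value after {max_attempts} attempts")


prefixes = ["Mr.", "Dr.", "Mrs.", "Ms."]
draws = unique_draws(prefixes, random.Random(1))
collected = []
try:
    for _ in range(10):
        collected.append(next(draws))
except ExhaustedError:
    print(f"ran out after {len(collected)} unique values")
```

In Faker itself, the documentation describes `fake.unique.clear()` for resetting the remembered values.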


Try generating a few people and see whether they look like a good representation of our organization.

In [3]:
...

Ellipsis

This is a good start but ... it's kind of wonky. We have people all over the world with so many different jobs! Let's keep the spirit of this but put our own limits on the fields so that things line up with what we'd expect a company org to look like.


First, a few more interesting features: we can register new `providers` if anything is missing. If needed, these can be customized for different locales.

In [4]:
from faker.providers import DynamicProvider

employment_status_provider = DynamicProvider(
     provider_name="employment",
     elements=["Full Time", "Part Time", "Contract"],
)

fake.add_provider(employment_status_provider)

fake.employment()

'Full Time'

We can customize this further by subclassing `faker.providers.BaseProvider`, for example to weight the choices:

In [5]:
# first, import a similar Provider or use the default one
from more_itertools import one
from faker.providers import BaseProvider

# create new provider class
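class EmploymentStatus(BaseProvider):
    # relative weights: random.choices normalizes them, so they need not sum to 1
    statuses = {"Full Time": 0.7, "Part Time": 0.05, "Contract": 0.3}
    def employment(self) -> str:
        # draw from the provider's own generator rather than the global `fake`
        return one(self.generator.random.choices(
            list(self.statuses),
            weights=list(self.statuses.values())
        ))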

# then add new provider to faker instance
fake.add_provider(EmploymentStatus)

fake.employment()


'Full Time'
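
The weights in `statuses` add up to 1.05 rather than 1. That still works because `random.choices` treats weights as relative. The sketch below uses only the standard library to compare the implied probabilities with observed frequencies; the seed, the sample size, and the names `rng`, `draws`, `expected`, and `observed` are illustrative assumptions.

```python
import random
from collections import Counter

statuses = {"Full Time": 0.7, "Part Time": 0.05, "Contract": 0.3}

# Relative weights: the implied probability is weight / sum(weights).
total = sum(statuses.values())
expected = {status: weight / total for status, weight in statuses.items()}

rng = random.Random(42)  # seeded only to make the sketch repeatable
draws = rng.choices(list(statuses), weights=list(statuses.values()), k=10_000)
observed = {status: count / len(draws) for status, count in Counter(draws).items()}

for status in statuses:
    print(
        f"{status:<10} expected={expected[status]:.3f} "
        f"observed={observed.get(status, 0.0):.3f}"
    )
```

If the numbers are meant to be read as probabilities, adjusting them to sum to 1 would make the intent clearer.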

### Tech-Focused Person Data

To ground us in this task, let's define a new `Person` object that we can fill up with info (and a few other objects):

In [6]:
from dataclasses import dataclass, field
from typing import Literal
from enum import Enum
import datetime

class timezone(str, Enum):
    # explicit string values so that timezone.EST == "EST"
    EST = "EST"
    PST = "PST"
    UTC = "UTC"

@dataclass
class Location:
    city: str
    tz: timezone
    country: str

@dataclass
class Person:
    """Someone who works in our company!"""
    name: str
    hire_date: datetime.date
    status: Literal["Full Time", "Part Time", "Contract"]
    languages: list[str] = field(default_factory=list)
    manager: str | None = None
    team: str | None = None
    title: str | None = None
    location: Location | None = None

In [7]:
Person(name="Employee #1",hire_date=datetime.date.today(), status="Full Time", location=Location("New York", "EST", "USA"))

Person(name='Employee #1', hire_date=datetime.date(2023, 12, 2), status='Full Time', languages=[], manager=None, team=None, title=None, location=Location(city='New York', tz='EST', country='USA'))
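
Note that a `Literal` annotation on a dataclass field is only a hint for type checkers; nothing stops an unexpected `status` at runtime. A minimal sketch of one way to enforce it, using a hypothetical `CheckedPerson` class (not part of the notebook) and `typing.get_args`:

```python
from dataclasses import dataclass
from typing import Literal, get_args

Status = Literal["Full Time", "Part Time", "Contract"]


@dataclass
class CheckedPerson:
    """Hypothetical variant of Person that validates `status` on construction."""

    name: str
    status: Status

    def __post_init__(self) -> None:
        allowed = get_args(Status)  # the tuple of permitted literal values
        if self.status not in allowed:
            raise ValueError(f"status must be one of {allowed}, got {self.status!r}")


CheckedPerson(name="Employee #1", status="Full Time")  # accepted

try:
    CheckedPerson(name="Employee #2", status="Intern")
except ValueError as err:
    print(err)
```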

In [8]:
import numpy as np
import random

def choose_a_few(
    options: list[str],
    weights: list[int | float] | None = None,
    max_choices: int | None = None,
    min_choices: int = 0,
) -> list[str]:
    """Pick a random number of distinct choices from a list of options.

    The number of draws is skewed toward `min_choices`. Draws are made with
    replacement and then de-duplicated, so fewer items than drawn may come back."""
    max_choices = int(np.clip(max_choices or len(options), min_choices, len(options)))
    if max_choices == 0:
        return []

    # how many choices will we make this time? (weights decrease linearly)
    counts = list(range(min_choices, max_choices + 1))
    k_weights = np.arange(max_choices, min_choices - 1, -1, dtype=float)
    # normalize by the actual sum so the probabilities add up to 1 for any min_choices
    n_choices = int(np.random.choice(counts, p=k_weights / k_weights.sum()))

    # make the choices, dropping duplicates while keeping draw order
    choices = random.choices(options, weights=weights, k=n_choices)
    return list(dict.fromkeys(choices))
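
To see how the number of draws is distributed, here is a small worked sketch using exact fractions. `count_probabilities` is a hypothetical helper that reproduces only the linearly decreasing weighting. It normalizes by the sum of the raw weights, which equals the closed form `max_choices * (max_choices + 1) / 2` whenever `min_choices` is 0 or 1.

```python
from fractions import Fraction


def count_probabilities(min_choices: int, max_choices: int) -> dict[int, Fraction]:
    """Probability of each draw count under linearly decreasing weights."""
    counts = list(range(min_choices, max_choices + 1))
    # the largest weight goes to the smallest count
    raw = list(range(max_choices, min_choices - 1, -1))
    total = sum(raw)
    return {count: Fraction(weight, total) for count, weight in zip(counts, raw)}


for count, probability in count_probabilities(0, 3).items():
    print(count, probability)
```

With `min_choices=0` the largest count gets weight 0, so it is never drawn. Because draws are made with replacement and then de-duplicated, the returned list can also be shorter than the drawn count.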


Now to make some people. Let's re-use whatever we can from `Faker` and then add some more of our own fields. We can also extend where needed to keep our code clear and consistent:

In [9]:
class ProgrammingLanguages(BaseProvider):    
    languages = {
        "Python": 0.25,
        "Scala": 0.1,
        "Go": 0.08,
        "JavaScript": 0.3,
        "Java": 0.3,
        "Typescript": 0.17,
        "Erlang": 0.01,
        "Elixir": 0.001,
    }
    def programming_languages(self) -> list[str]:
        return choose_a_few(list(self.languages), weights=list(self.languages.values()))

fake.add_provider(ProgrammingLanguages)


In [10]:
def make_person() -> Person:
    return Person(
        name = fake.name(),
        hire_date = fake.date_between(start_date="-3y", end_date="today"),
        status = fake.employment(),
        languages = fake.programming_languages(),
        team = None, # hrmmmm this is harder
        title = None, # let's be smarter with this
        location = None, # let's also be smarter with this
    )

make_person()

Person(name='Steven Wilson', hire_date=datetime.date(2023, 5, 31), status='Full Time', languages=['Go'], manager=None, team=None, title=None, location=None)

Now we can generate more complex attributes in a smart way. Let's set up some rules about where offices are, what teams are in which offices, then pick titles based on other info (e.g. Developers probably know at least one language ... )

In [11]:
TEAM_TITLES:dict[str,list[str]] = {
    "DevX": ["Engineer", "Engineer", "Engineer", "Engineer", "Engineer", "AVP"],
    "DevOps": ["Engineer", "Senior Engineer", "Manager", "Senior Manager"],
    "Sales": ["Associate", "VP"],
    "Support": ["Analyst", "Manager"],
    "Platform": ["Engineer", "Senior Engineer","Managing Engineer", "AVP", "VP"],
    "Product": ["Engineer", "Manager", "Product Owner", "AVP", "VP"],
    "Internal Tools": ["Engineer", "Senior Engineer", "Manager", "AVP", "VP"],
    "Business": ["Analyst", "Associate", "Vice President", "Director", "Managing Director"]
}

# codify the hierarchical structure
allowed_teams_per_office = {
    "New York": ["Sales", "Product", "Business"],
    "Toronto": ["Platform", "Product", "Internal Tools", "Sales", "Business"],
    "Fort Lauderdale": ["DevX"],
    "Dublin": ["DevOps", "Support"],
    "London": ["Sales", "Business"],
    "Seattle": ["Internal Tools", "Product", "Platform"],
}
offices = {
    location.city: location
    for location in [
        Location("New York", tz="EST", country="USA"),
        Location("Seattle", tz="PST", country="USA"),
        Location("Toronto", tz="EST", country="CAN"),
        Location("London", tz="UTC", country="GBR"),
        Location("Fort Lauderdale", tz="EST", country="USA"),
        Location("Dublin", tz="UTC", country="IRL"),
    ]
}

def title_city_team():
    # pick an office, then a team allowed in that office, then a title on that team
    city = random.choice(list(offices))
    team = random.choice(allowed_teams_per_office[city])
    # repeated titles in TEAM_TITLES act as weights for this draw
    title = random.choice(TEAM_TITLES[team])

    return {
        "location": Location(city=city, tz=offices[city].tz, country=offices[city].country),
        "title": title,
        "team": team,
    }


title_city_team()


{'location': Location(city='London', tz='UTC', country='GBR'),
 'title': 'VP',
 'team': 'Sales'}
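
Since the office is drawn uniformly and the team is then drawn uniformly within that office, the overall share of each team follows directly from the mapping. The sketch below computes those shares exactly, assuming the uniform two-step draw described above; `team_share` is an illustrative name.

```python
from fractions import Fraction

allowed_teams_per_office = {
    "New York": ["Sales", "Product", "Business"],
    "Toronto": ["Platform", "Product", "Internal Tools", "Sales", "Business"],
    "Fort Lauderdale": ["DevX"],
    "Dublin": ["DevOps", "Support"],
    "London": ["Sales", "Business"],
    "Seattle": ["Internal Tools", "Product", "Platform"],
}

team_share: dict[str, Fraction] = {}
for city, teams in allowed_teams_per_office.items():
    for team in teams:
        # P(city) * P(team | city), both uniform
        p = Fraction(1, len(allowed_teams_per_office)) * Fraction(1, len(teams))
        team_share[team] = team_share.get(team, Fraction(0)) + p

for team, share in sorted(team_share.items(), key=lambda item: item[1], reverse=True):
    print(f"{team:<15} {float(share):.3f}")
```

A single-team office such as Fort Lauderdale gives its team a full one-sixth share, which is worth keeping in mind when judging how balanced the org looks.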

After running this we should have a better-balanced org in terms of region + titles. Then we just need to add the connections in -- i.e. who's the boss?!

In [12]:
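def make_person() -> Person:
    title_city_team_ = title_city_team()
    # flag for technical roles; not used yet, but could enforce a minimum number of languages
    technical = 1 if "Engineer" in title_city_team_["title"] else 0
    return Person(
        name = fake.name(),
        # note: formatted as a YYYYMMDD string, although Person annotates hire_date as a date
        hire_date = fake.date_between(start_date="-3y", end_date="today").strftime("%Y%m%d"),
        status = fake.employment(),
        languages = fake.programming_languages(),
        **title_city_team_,
    )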
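
The `technical` flag is computed but never used, so engineers can still end up with an empty language list. One possible way to apply the earlier idea that developers know at least one language is sketched below. `ensure_min_languages` and the shortened `LANGUAGES` pool are hypothetical and use only the standard library.

```python
import random

LANGUAGES = ["Python", "Scala", "Go", "JavaScript", "Java"]  # shortened pool for the sketch


def ensure_min_languages(
    languages: list[str], pool: list[str], minimum: int, rng: random.Random
) -> list[str]:
    """Top up `languages` with distinct picks from `pool` until it holds `minimum` items."""
    result = list(languages)
    missing = [lang for lang in pool if lang not in result]
    needed = min(max(0, minimum - len(result)), len(missing))
    result.extend(rng.sample(missing, needed))
    return result


rng = random.Random(0)
title = "Senior Engineer"
technical = 1 if "Engineer" in title else 0

print(ensure_min_languages([], LANGUAGES, technical, rng))      # one language added
print(ensure_min_languages(["Go"], LANGUAGES, technical, rng))  # already satisfied, unchanged
```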


In [13]:
import pandas as pd
people_df = pd.DataFrame((make_person() for _ in range(150)))
people_df.head()

Unnamed: 0,name,hire_date,status,languages,manager,team,title,location
0,Jason Burton,20231119,Full Time,[],,Support,Manager,"{'city': 'Dublin', 'tz': 'UTC', 'country': 'IRL'}"
1,Aaron Brooks,20210714,Full Time,"[Python, Java]",,Platform,Senior Engineer,"{'city': 'Seattle', 'tz': 'PST', 'country': 'U..."
2,Michael Gates,20210623,Full Time,"[Scala, JavaScript, Java]",,DevOps,Senior Engineer,"{'city': 'Dublin', 'tz': 'UTC', 'country': 'IRL'}"
3,Gwendolyn Hays,20221110,Full Time,"[Python, JavaScript, Typescript]",,Business,Vice President,"{'city': 'London', 'tz': 'UTC', 'country': 'GBR'}"
4,Kimberly Sanchez,20210112,Full Time,[],,Business,Analyst,"{'city': 'New York', 'tz': 'EST', 'country': '..."


So, let's group by Team and then pick a manager for everyone. Let's use these rules:

* People report to someone of a higher title if possible, else to a peer
* Reporting happens within a team
* We already ordered `TEAM_TITLES` based on *rank*
* Team leads should be listed as reporting to themselves (for now)
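
Before looking at the pandas version, the rules above can be sketched in plain Python on a tiny hand-made team. This is a simplified, deterministic variant for illustration only: `assign_managers` is a hypothetical helper, and the tie-breaking (report to the closest higher rank, otherwise to the first peer listed earlier) is an assumption.

```python
people = [
    {"name": "Ana", "team": "Sales", "rank": 1},
    {"name": "Ben", "team": "Sales", "rank": 1},
    {"name": "Cy", "team": "Sales", "rank": 2},
    {"name": "Ed", "team": "Sales", "rank": 2},
    {"name": "Di", "team": "Support", "rank": 1},
]


def assign_managers(people: list[dict]) -> dict[str, str]:
    """Map each person's name to their manager's name, within teams only."""
    by_team: dict[str, list[dict]] = {}
    for person in people:
        by_team.setdefault(person["team"], []).append(person)

    managers: dict[str, str] = {}
    for members in by_team.values():
        for position, person in enumerate(members):
            higher = [p for p in members if p["rank"] > person["rank"]]
            earlier_peers = [p for p in members[:position] if p["rank"] == person["rank"]]
            if higher:
                boss = min(higher, key=lambda p: p["rank"])  # closest rank above
            elif earlier_peers:
                boss = earlier_peers[0]  # fall back to a peer
            else:
                boss = person  # team lead reports to themselves (for now)
            managers[person["name"]] = boss["name"]
    return managers


print(assign_managers(people))
```

Only letting people report to peers listed earlier keeps two top-ranked peers from reporting to each other.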

In [14]:
# calculate team ranks
ranks = {team: {title: rank + 1 for rank,title in enumerate(titles)} for team, titles in TEAM_TITLES.items()}
for team in ranks:
    people_df.loc[people_df.team==team, "rank"] = people_df.loc[people_df.team==team].title.map(ranks[team])
people_df = people_df.sort_values(by=["team","rank"])
people_df.sample(3)

Unnamed: 0,name,hire_date,status,languages,manager,team,title,location,rank
11,Jessica Perry,20201225,Contract,[],,Product,AVP,"{'city': 'Seattle', 'tz': 'PST', 'country': 'U...",4.0
52,Nicole Jenkins,20201228,Contract,"[Scala, Java]",,DevX,Engineer,"{'city': 'Fort Lauderdale', 'tz': 'EST', 'coun...",5.0
143,Richard Chandler,20220330,Full Time,[Scala],,Platform,Engineer,"{'city': 'Seattle', 'tz': 'PST', 'country': 'U...",1.0
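
One detail worth knowing about the rank calculation: when a title is repeated in `TEAM_TITLES` (as `Engineer` is for `DevX`), the dict comprehension keeps the position of the last duplicate. The short sketch below shows this, along with a de-duplicated alternative; `dense_ranks` is an illustrative name, not something the notebook defines.

```python
titles = ["Engineer", "Engineer", "Engineer", "Engineer", "Engineer", "AVP"]

# Same comprehension shape as the rank calculation: later duplicates overwrite earlier ones.
ranks = {title: rank + 1 for rank, title in enumerate(titles)}
print(ranks)  # {'Engineer': 5, 'AVP': 6}

# De-duplicating first (order preserved) gives consecutive ranks instead.
dense_ranks = {title: rank + 1 for rank, title in enumerate(dict.fromkeys(titles))}
print(dense_ranks)  # {'Engineer': 1, 'AVP': 2}
```

The relative order is the same either way, so the manager assignment is unaffected. Only the absolute rank numbers differ between teams.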


In [15]:
# determine supervisor
def naivereportsto(row, df, allow_peer_reports:bool=False):
    supervisor = (
        df[(df.index < row.name)].query(f"""rank > {row["rank"]}""").tail(1)["name"]
    )
    supervisor = supervisor.item() if not supervisor.empty else None
    if not supervisor and allow_peer_reports:
        peer = df[(df.index < row.name)].query(f"""rank  == {row["rank"]}""").head(1)["name"]
        peer = peer.item() if not peer.empty else None
        return supervisor or peer or row["name"]
    return supervisor or row["name"]


def reportsto(df, allow_peer_reports:bool):
    return df.assign(manager=df.apply(naivereportsto, df=df, allow_peer_reports=allow_peer_reports, axis=1))


def supervisors(df, allow_peer_reports:bool):
    df = df.groupby("team", group_keys=False).apply(reportsto, allow_peer_reports=allow_peer_reports).reset_index(drop=True)
    return df


people_df = people_df.pipe(supervisors, allow_peer_reports=True)
people_df.sample(5)


Unnamed: 0,name,hire_date,status,languages,manager,team,title,location,rank
58,Tiffany Austin,20220906,Full Time,"[Scala, Go, Java]",Alyssa Rodriguez,DevX,Engineer,"{'city': 'Fort Lauderdale', 'tz': 'EST', 'coun...",5.0
92,Melanie Moore,20211001,Full Time,[Typescript],Melanie Moore,Platform,AVP,"{'city': 'Toronto', 'tz': 'EST', 'country': 'C...",4.0
136,Zachary Edwards,20210606,Full Time,[Python],Brian Lopez,Sales,VP,"{'city': 'London', 'tz': 'UTC', 'country': 'GBR'}",2.0
118,April Williams,20210128,Contract,"[JavaScript, Typescript, Java]",Kyle Graves,Sales,Associate,"{'city': 'New York', 'tz': 'EST', 'country': '...",1.0
38,Barbara Guzman,20221120,Full Time,[],Joanne Bridges,DevOps,Senior Engineer,"{'city': 'Dublin', 'tz': 'UTC', 'country': 'IRL'}",2.0


Now we just need a CEO for all the team leads to report to. The new row needs the same columns we just generated, namely `rank` and `manager`. Team leads currently report to themselves, so we repoint them to the CEO, and we set the CEO as reporting to themselves to help us out later.

In [16]:
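from dataclasses import asdict

# asdict also converts the nested Location into a plain dict
CEO = asdict(make_person()) | {"team": "CEO", "title": "CEO", "status": "Full Time"}
# ignore_index avoids a duplicate index label for the appended row
people_df = pd.concat([people_df, pd.DataFrame([CEO])], ignore_index=True)
CEO_mask = people_df.name == CEO["name"]
# team leads (who report to themselves) and the CEO now report to the CEO
people_df.loc[(people_df.manager == people_df.name) | CEO_mask, "manager"] = CEO["name"]
people_df.loc[CEO_mask, "rank"] = people_df["rank"].max() + 1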
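
With the CEO in place, the reporting structure should form a single tree. The sketch below is one way to sanity-check that on a plain name-to-manager mapping. `validate_org` is a hypothetical helper, and it assumes names are unique.

```python
def validate_org(reports: dict[str, str]) -> list[str]:
    """Return a list of problems found in a name -> manager mapping."""
    problems = []

    # exactly one person (the CEO) should report to themselves
    roots = [name for name, manager in reports.items() if name == manager]
    if len(roots) != 1:
        problems.append(f"expected exactly one self-reporting root, found {len(roots)}")

    # every manager must be a known person
    for name, manager in reports.items():
        if manager not in reports:
            problems.append(f"{name} reports to unknown manager {manager}")

    # walking up the chain from anyone must never revisit a person
    for name in reports:
        seen = set()
        current = name
        while current in reports and reports[current] != current:
            if current in seen:
                problems.append(f"reporting cycle involving {name}")
                break
            seen.add(current)
            current = reports[current]
    return problems


print(validate_org({"CEO": "CEO", "Lead": "CEO", "Dev": "Lead"}))  # []
print(validate_org({"A": "B", "B": "A", "CEO": "CEO"}))
```

With the DataFrame above, something like `dict(zip(people_df.name, people_df.manager))` could supply the mapping, though Faker does not guarantee unique names unless `fake.unique` is used.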

Alright, we have something now. Does this seem reasonably distributed? Let's use `plotly` to explore our people's dimensions and get a feel for the data

In [17]:
# let's flatten the nested pieces of the DataFrame (`people_df.location`)
expanded_df = people_df.assign(**people_df.location.apply(pd.Series))
expanded_df

Unnamed: 0,name,hire_date,status,languages,manager,team,title,location,rank,city,tz,country
0,Kimberly Sanchez,20210112,Full Time,[],Gwendolyn Hays,Business,Analyst,"{'city': 'New York', 'tz': 'EST', 'country': '...",1.0,New York,EST,USA
1,Gabriel Payne,20221027,Part Time,"[Python, JavaScript]",Dr. Nathaniel Kim MD,Business,Analyst,"{'city': 'New York', 'tz': 'EST', 'country': '...",1.0,New York,EST,USA
2,Jacqueline Adkins,20230725,Contract,"[JavaScript, Java]",Dr. Nathaniel Kim MD,Business,Analyst,"{'city': 'Toronto', 'tz': 'EST', 'country': 'C...",1.0,Toronto,EST,CAN
3,Ashley Meyer,20230818,Full Time,"[JavaScript, Java]",Dr. Nathaniel Kim MD,Business,Analyst,"{'city': 'London', 'tz': 'UTC', 'country': 'GBR'}",1.0,London,UTC,GBR
4,Kim Rowland,20220513,Full Time,[Java],Gary Chambers,Business,Analyst,"{'city': 'Toronto', 'tz': 'EST', 'country': 'C...",1.0,Toronto,EST,CAN
...,...,...,...,...,...,...,...,...,...,...,...,...
146,Mary Contreras,20220918,Contract,"[JavaScript, Typescript, Java]",Jason Burton,Support,Manager,"{'city': 'Dublin', 'tz': 'UTC', 'country': 'IRL'}",2.0,Dublin,UTC,IRL
147,Alicia Lee,20220401,Contract,[Python],Jason Burton,Support,Manager,"{'city': 'Dublin', 'tz': 'UTC', 'country': 'IRL'}",2.0,Dublin,UTC,IRL
148,Miranda Mendoza,20221015,Full Time,[],Jason Burton,Support,Manager,"{'city': 'Dublin', 'tz': 'UTC', 'country': 'IRL'}",2.0,Dublin,UTC,IRL
149,Angela Phelps,20230930,Full Time,"[Python, JavaScript, Typescript, Java]",Jason Burton,Support,Manager,"{'city': 'Dublin', 'tz': 'UTC', 'country': 'IRL'}",2.0,Dublin,UTC,IRL
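
The flattening step lifts the keys of each nested `location` dict into top-level columns while keeping the original column. The same idea on plain dictionaries, as a minimal sketch with a hypothetical `flatten_location` helper and made-up records:

```python
records = [
    {"name": "Employee #1", "location": {"city": "New York", "tz": "EST", "country": "USA"}},
    {"name": "Employee #2", "location": {"city": "Dublin", "tz": "UTC", "country": "IRL"}},
]


def flatten_location(record: dict) -> dict:
    """Copy a record and lift the keys of its nested `location` dict to the top level."""
    flat = dict(record)
    flat.update(flat.get("location") or {})
    return flat


for record in records:
    print(flatten_location(record))
```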


In [18]:
import plotly.express as px

fig = px.bar(
    expanded_df,
    x="title",
    color="team",
    hover_name="name",
    hover_data=["team", "tz", "city","manager","languages"],
    facet_col="country",
    template="plotly_dark",
)
fig.update_xaxes(matches=None, title_text=None)


## Synthetic Data

Essentially what we've done up until this point is to define distributions of values that our people should fall into and some rules about how those distributions overlap. Then we sample from those distributions, resulting in something that adheres to those distributions when we look at it statistically.

Another approach, if we had *real* data but couldn't use it directly, is to generate synthetic data from it using a GAN: the generator is rewarded for preserving the statistical properties of the real data while not reproducing the actual values.

[ydata-synthetic](https://docs.synthetic.ydata.ai/) implements a number of GAN-based synthetic data generators: 