Let's start by loading the data and taking a look at it.

In [1]:
import pandas as pd

employees_df = pd.read_csv("../data/employees.csv")
safehouses_df = pd.read_csv("../data/safehouses.csv")
divisions_df = pd.read_csv("../data/divisions.csv")
managers_df = pd.read_csv("../data/managers.csv")
actions_df = pd.read_csv("../data/actions.csv")


## Employees

In [2]:
employees_df.sample(5)


Unnamed: 0,EmployeeID,EmployeeName,JobTitle,Email,Phone,Manager
18217,18221,Jonathan Barber,Data Analyst,jonathan_barber@brlda.gov,(709)358-0654x450,Theresa Rhodes
25455,25460,Luis Hobbs,Scrum Master,luis_hobbs@brlda.gov,001-783-416-2922x9352,Dr. Charlene Hicks
9952,9955,Adrienne Green,Data Analyst,adrienne_green@brlda.gov,(357)404-7351,Jasmine Ford
9720,9723,Stephen Smith,Scrum Master,stephen_smith@brlda.gov,250.989.9731x1477,Donna Ross
2487,2489,Patrick Benitez,Project Manager,patrick_benitez@brlda.gov,\t+1-087-924-1347x1681,Terry Carroll


If they removed their users, there should be gaps in the sequence of EmployeeID values.

In [3]:
# Check min and max EmployeeID
min_eid = employees_df["EmployeeID"].min()
max_eid = employees_df["EmployeeID"].max()
print("Min EmployeeID: ", min_eid)
print("Max EmployeeID: ", max_eid)

# Find employee IDs that are missing inside the min-max range (gaps in the sequence)
missing_employees = set(range(min_eid, max_eid + 1)) - set(employees_df["EmployeeID"])
print("Employees ids not in the min-max range: ", missing_employees)


Min EmployeeID:  1
Max EmployeeID:  26849
Employees ids not in the min-max range:  {14976, 22602, 26188, 1423, 4284}


We should also check that every manager listed in the Manager column is an employee.

In [4]:
managers = set(employees_df["Manager"])
employees = set(employees_df["EmployeeName"])

managers - employees


set()

Are these "missing employees" in other tables?

In [5]:
divisions_df[divisions_df["EmployeeID"].isin(missing_employees)]


Unnamed: 0,EmployeeID,EmployeeName,Division,Project,known_safehouses
1422,1423,,[Division 7],[Project e-enable_holistic_models],"[14, 214, 181, 219]"
4283,4284,,[Division 7],[Project repurpose_collaborative_methodologies...,"[10, 219]"
14975,14976,,[Division 7],[Project transform_24/365_functionalities],"[25, 154, 231, 33, 219]"
22601,22602,,[Division 7],"[Project monetize_one-to-one_mindshare, Projec...","[12, 221, 19, 18, 219]"
26187,26188,,[Division 7],[Project extend_robust_action-items],"[7, 219]"


What actions did they take?

In [6]:
actions_df[actions_df["EmployeeID"].isin(missing_employees)].sort_values("ActionDate")[
    "ActionDescription"
].values


array(['perform data mining on social media data for sentiment analysis through Advanced_coherent_architecture on Slovenia. Maintain strict confidentiality.',
       'construct algorithms for automatic gait recognition through Phased_background_model on Guinea. Maintain strict confidentiality. During the covert operation, intercepted communications hinted at the presence of the three devices.',
       'Initiate operation Networked_discrete_system_engine, targeting Yemen with objective to perform sentiment analysis on social media influencers In a confidential dossier, a defector mentioned the three devices being utilized to manipulate global financial markets.',
       'Operation Re-contextualized_attitude-oriented_protocol to perform sentiment analysis on online news articles on North Macedonia is in progress. While examining classified documents, references to the three devices were discovered.',
       "construct algorithms for automatic vein recognition through Upgradable_increment

Most of these action descriptions mention "the three devices" (the first one shown above does not).
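
To check that claim without reading every description, the rows can be flagged by a substring match. This is only a minimal sketch: `toy_actions` and its rows are made-up stand-ins for `actions_df`, and the phrase "three devices" is taken from the output above.

```python
import pandas as pd

# Toy stand-in for actions_df (hypothetical rows, not the real data)
toy_actions = pd.DataFrame(
    {
        "EmployeeID": [1423, 4284, 14976],
        "ActionDescription": [
            "perform data mining on social media data. Maintain strict confidentiality.",
            "intercepted communications hinted at the presence of the three devices.",
            "references to the three devices were discovered.",
        ],
    }
)

# Plain substring match (regex=False), one boolean per row
mentions_devices = toy_actions["ActionDescription"].str.contains(
    "three devices", regex=False
)

print(mentions_devices.value_counts())
# Descriptions that do NOT mention the devices
print(toy_actions.loc[~mentions_devices, "ActionDescription"].tolist())
```

On the real data, the same mask could be applied to the filtered `actions_df` rows from the cell above.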

## Safehouses

In [7]:
safehouses_df.sample(5)


Unnamed: 0,ID,City,Address,Latitude,Longitude
49,150,London,"13 Graemesdyke Avenue, London, SW14 7BH, Unite...",51.465893,-0.275055
181,8,Tokyo,"34L-16R, Tokyo Wangan Road, Haneda Kukou, Ota,...",35.556423,139.772161
183,7,Paris,"12 Rue Victor Hugo, 91390 Morsang-sur-Orge, Fr...",48.664273,2.357661
165,16,Moscow,"улица Глинки 8, Firsanovka, Khimki, Moscow Obl...",55.948711,37.24212
84,83,Cartagena,"Centro, 472000 Cartagena, BOL, Colombia",10.428231,-75.570062


In [8]:
# Map of safehouses with Latitude and Longitude
import folium

safehouses_map = folium.Map(
    location=[safehouses_df["Latitude"].mean(), safehouses_df["Longitude"].mean()],
    zoom_start=4,
)

for index, row in safehouses_df.iterrows():
    folium.Marker([row["Latitude"], row["Longitude"]], popup=row["ID"]).add_to(
        safehouses_map
    )

safehouses_map


Cute, but not very useful for now. Let's move on to divisions.

## Divisions

In [9]:
divisions_df


Unnamed: 0,EmployeeID,EmployeeName,Division,Project,known_safehouses
0,1,Kelly Rios,[Division 1],[Project deliver_visionary_web-readiness],"[232, 1, 73, 217]"
1,2,Madison Barr,"[Division 6, Division 3]",[Project repurpose_collaborative_methodologies],"[192, 26, 118, 4]"
2,3,Sue Anderson,"[Division 5, Division 1, Division 10]",[Project repurpose_collaborative_methodologies],"[19, 8, 130, 50]"
3,4,Laura Carlson,"[Division 9, Division 9, Division 2]",[Project streamline_proactive_e-markets],"[15, 232]"
4,5,Carrie Ali,"[Division 3, Division 6]",[Project deliver_visionary_web-readiness],"[158, 118]"
...,...,...,...,...,...
26844,26845,Christopher Riley,"[Division 8, Division 10, Division 5]","[Project drive_value-added_mindshare, Project ...",[226]
26845,26846,Eric Chan,"[Division 2, Division 6, Division 1]",[Project deliver_visionary_web-readiness],"[15, 99, 58]"
26846,26847,Amy Vazquez,"[Division 8, Division 10, Division 5]","[Project embrace_transparent_networks, Project...",[22]
26847,26848,Clifford Reyes,"[Division 9, Division 2, Division 6]",[Project disintermediate_distributed_experienc...,[98]


In [10]:
# TODO: Explode Division and Project
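
One possible way to handle this TODO, sketched on a toy frame. It assumes the list-like columns come out of the CSV as plain strings such as `[Division 6, Division 3]` with unquoted items, and that no item contains a comma. The helper `parse_bracket_list` and `toy_divisions` are hypothetical names introduced here.

```python
import pandas as pd

# Toy stand-in for divisions_df (assumption: list columns are stored as strings)
toy_divisions = pd.DataFrame(
    {
        "EmployeeID": [1, 2],
        "Division": ["[Division 1]", "[Division 6, Division 3]"],
        "Project": [
            "[Project deliver_visionary_web-readiness]",
            "[Project repurpose_collaborative_methodologies]",
        ],
    }
)


def parse_bracket_list(text):
    """Turn '[a, b]' into ['a', 'b']; assumes items never contain a comma."""
    inner = text.strip().strip("[]")
    return [item.strip() for item in inner.split(",") if item.strip()]


parsed = toy_divisions.assign(
    Division=toy_divisions["Division"].map(parse_bracket_list),
    Project=toy_divisions["Project"].map(parse_bracket_list),
)

# One row per (employee, division, project) combination
exploded = (
    parsed.explode("Division")
    .reset_index(drop=True)
    .explode("Project")
    .reset_index(drop=True)
)
print(exploded)
```

Exploding both columns yields the cross product of divisions and projects per employee, which may or may not be the shape wanted for later analysis.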


## Managers

In [11]:
managers_df


Unnamed: 0,ManagerName,Employee_1,Employee_2,Employee_3,Employee_4,Employee_5,Employee_6,Employee_7,Employee_8,Employee_9,...,Employee_20,Employee_21,Employee_22,Employee_23,Employee_24,Employee_25,Employee_26,Employee_27,Employee_28,Employee_29
0,Raven Price,Carl Schmidt,Cristina Thompson,Mrs. Katherine Franklin,Blake Garcia,Drew Berger,Dave Dennis,Gary Castaneda,Joan Crawford,Nicole Johnson,...,,,,,,,,,,
1,David Smith,Anthony Flores,John Trujillo,Elizabeth Page,William Hicks,Gail Salazar,Linda Gonzalez,Christopher Miller,James Beard,Bradley Rowe,...,,,,,,,,,,
2,Michelle Frazier,Curtis Johnson,Michael Hancock,Renee Shaw,Lisa Lewis,Stephen Hunter,,,,,...,,,,,,,,,,
3,Melissa Conrad,Denise Gibson PhD,Christina Li,Rachel Snyder,James Taylor,Ivan Robles,Julie Adams,Michelle White,,,...,,,,,,,,,,
4,April Martin,Ryan Wright,Dr. Philip Jordan,Gary Wolfe,Deborah Green,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3830,Peter Burns,Kristin Ballard,Lisa Leach,Matthew Warner,Teresa Alexander,Robin Pierce,,,,,...,,,,,,,,,,
3831,Randall Martinez,Maria Ortiz,George Graves,Amy Baker,Deanna Cole,Samantha Collins,Melissa Aguilar,Michael Nguyen,,,...,,,,,,,,,,
3832,Peter Jackson,Christopher Cook,Shannon Solis,Matthew Davidson,Lorraine Moore,Grant Garcia,,,,,...,,,,,,,,,,
3833,Susan Nguyen,Kevin Parks,James Holmes,Danielle Long,Lynn Solomon,Justin Fisher,Sarah Moran,,,,...,,,,,,,,,,


In [12]:
# Transform to long format
clean_managers_df = managers_df.melt(
    id_vars=["ManagerName"], value_name="Employee", var_name="EmployeeNumber"
)


In [13]:
clean_managers_df


Unnamed: 0,ManagerName,EmployeeNumber,Employee
0,Raven Price,Employee_1,Carl Schmidt
1,David Smith,Employee_1,Anthony Flores
2,Michelle Frazier,Employee_1,Curtis Johnson
3,Melissa Conrad,Employee_1,Denise Gibson PhD
4,April Martin,Employee_1,Ryan Wright
...,...,...,...
111210,Peter Burns,Employee_29,
111211,Randall Martinez,Employee_29,
111212,Peter Jackson,Employee_29,
111213,Susan Nguyen,Employee_29,
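
The melted frame keeps one row per manager for every `Employee_N` column, so many rows carry no employee at all. A minimal sketch of dropping those placeholder rows, shown on a made-up wide table (`toy_managers` and the names in it are hypothetical):

```python
import pandas as pd

# Toy stand-in for managers_df (made-up names)
toy_managers = pd.DataFrame(
    {
        "ManagerName": ["Manager A", "Manager B"],
        "Employee_1": ["Emp 1", "Emp 2"],
        "Employee_2": ["Emp 3", None],
    }
)

long_df = toy_managers.melt(
    id_vars=["ManagerName"], value_name="Employee", var_name="EmployeeNumber"
)

# Keep only the rows that actually name an employee
long_df = long_df.dropna(subset=["Employee"]).reset_index(drop=True)
print(long_df)
```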


In [14]:
managers_df[managers_df["Employee_1"].isna()]


Unnamed: 0,ManagerName,Employee_1,Employee_2,Employee_3,Employee_4,Employee_5,Employee_6,Employee_7,Employee_8,Employee_9,...,Employee_20,Employee_21,Employee_22,Employee_23,Employee_24,Employee_25,Employee_26,Employee_27,Employee_28,Employee_29
3686,Jessica Stone,,,,,,,,,,...,,,,,,,,,,


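Jessica Stone is the only manager whose Employee_1 is empty, so she is the only manager without any employees (unless some rows have NaN gaps, with employees listed only in later columns).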
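
The caveat about gaps can be checked directly: a row has a gap when an empty cell is followed by a filled one further to the right. The sketch below assumes the employee columns are ordered left to right and all start with `Employee_`; `toy_wide` is a made-up example.

```python
import pandas as pd

# Toy stand-in for managers_df (made-up names)
toy_wide = pd.DataFrame(
    {
        "ManagerName": ["Manager A", "Manager B", "Manager C"],
        "Employee_1": ["Emp 1", None, None],
        "Employee_2": ["Emp 2", "Emp 3", None],
        "Employee_3": [None, None, None],
    }
)

employee_cols = [c for c in toy_wide.columns if c.startswith("Employee_")]

# 1 where a cell is filled, 0 where it is missing
filled = toy_wide[employee_cols].notna().astype(int)

# Running minimum along each row: stays 1 only while every cell so far is filled
prefix = filled.cummin(axis=1)

# A gap exists if some filled cell comes after a missing one
has_gap = (filled != prefix).any(axis=1)

# Managers with no employee in any column
no_employees = filled.sum(axis=1) == 0

print(toy_wide.loc[has_gap, "ManagerName"].tolist())
print(toy_wide.loc[no_employees, "ManagerName"].tolist())
```

If `has_gap` is all False on the real table, checking `Employee_1` alone is enough to find managers without employees.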

In [15]:
clean_managers_df[clean_managers_df["Employee"] == "Jessica Stone"]


Unnamed: 0,ManagerName,EmployeeNumber,Employee
17210,Christopher Mckenzie,Employee_5,Jessica Stone


In [16]:
employees_df[employees_df["EmployeeName"] == "Jessica Stone"]


Unnamed: 0,EmployeeID,EmployeeName,JobTitle,Email,Phone,Manager
11560,11563,Jessica Stone,Data Scientist,jessica_stone@brlda.gov,001-234-563-9331,Christopher Mckenzie


In [17]:
divisions_df[divisions_df["EmployeeName"] == "Jessica Stone"]


Unnamed: 0,EmployeeID,EmployeeName,Division,Project,known_safehouses
11562,11563,Jessica Stone,[Division 4],"[Project mesh_cutting-edge_experiences, Projec...","[18, 42, 44]"


In [18]:
clean_managers_df[clean_managers_df["Employee"] == "Christopher Mckenzie"]


Unnamed: 0,ManagerName,EmployeeNumber,Employee
21390,Kayla Lee,Employee_6,Christopher Mckenzie


In [19]:
employees_df[employees_df["EmployeeName"] == "Christopher Mckenzie"]


Unnamed: 0,EmployeeID,EmployeeName,JobTitle,Email,Phone,Manager
20679,20683,Christopher Mckenzie,Business Analyst,christopher_mckenzie@brlda.gov,3860450516,Kayla Lee
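
Cells 15 to 19 follow the reporting line by hand, one lookup at a time. The same walk can be sketched as a loop over an employee-to-manager mapping. `management_chain` is a hypothetical helper, the toy frame reuses only the three names seen above, and the sketch assumes employee names are unique, which the real data may not guarantee.

```python
import pandas as pd

# Toy stand-in for employees_df. Kayla Lee's manager is left empty here
# only to end the toy chain; her real manager is not shown in the outputs above.
toy_employees = pd.DataFrame(
    {
        "EmployeeName": ["Jessica Stone", "Christopher Mckenzie", "Kayla Lee"],
        "Manager": ["Christopher Mckenzie", "Kayla Lee", None],
    }
)


def management_chain(df, name, max_depth=50):
    """Return [name, manager, manager's manager, ...], stopping on cycles."""
    manager_of = dict(zip(df["EmployeeName"], df["Manager"]))
    chain = [name]
    seen = {name}
    while len(chain) <= max_depth:
        manager = manager_of.get(chain[-1])
        # Stop at the top of the chain or if the chain loops back on itself
        if manager is None or pd.isna(manager) or manager in seen:
            break
        chain.append(manager)
        seen.add(manager)
    return chain


print(management_chain(toy_employees, "Jessica Stone"))
```

The `seen` set matters on the real table: since every manager is also an employee, a chain can only end by looping or by hitting the depth limit.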
