Let's start by loading the data.

In [1]:
import pandas as pd

employees_df = pd.read_csv("../data/employees.csv")
safehouses_df = pd.read_csv("../data/safehouses.csv")
divisions_df = pd.read_csv("../data/divisions.csv")
managers_df = pd.read_csv("../data/managers.csv")
actions_df = pd.read_csv("../data/actions.csv")


## Employees

In [2]:
employees_df.sample(5)


Unnamed: 0,EmployeeID,EmployeeName,JobTitle,Email,Phone,Manager
14007,14010,Michael Welch,Business Analyst,michael_welch@brlda.gov,614.574.6626,Gregory Wallace
17813,17817,Melissa Delgado,Project Manager,melissa_delgado@brlda.gov,819-655-0418x56533,Sheila Mccormick
15459,15463,Michael Reid,Business Analyst,michael_reid@brlda.gov,001-982-363-3908x964,Charles Richards
16765,16769,Lawrence Davis,Business Analyst,lawrence_davis@brlda.gov,838-728-0122,Melinda Franklin
4937,4940,Lisa Cowan,Statistician,lisa_cowan@brlda.gov,(210)491-5493x0294,Victoria Willis


In [3]:
employees_df["EmployeeName"].value_counts().head(5)


EmployeeName
Michael Smith       12
David Smith         12
Lisa Smith          11
John Smith          11
Michael Williams    10
Name: count, dtype: int64

Some duplicate employees are present in the data...

In [4]:
employees_df[employees_df["EmployeeName"] == "Michael Smith"]


Unnamed: 0,EmployeeID,EmployeeName,JobTitle,Email,Phone,Manager
4448,4451,Michael Smith,Project Manager,michael_smith@brlda.gov,001-137-479-2502x05426,Nathan Guerra
5129,5132,Michael Smith,Scrum Master,michael_smith@brlda.gov,223-274-6121x5268,David Houston
7455,7458,Michael Smith,Quality Assurance Analyst,michael_smith@brlda.gov,(202)737-6861x908,Pamela Hale
10672,10675,Michael Smith,Data Scientist,michael_smith@brlda.gov,001-192-379-3454x580,Sarah Brown
11683,11686,Michael Smith,Quality Assurance Analyst,michael_smith@brlda.gov,001-038-692-4506x839,Patrick Cruz
13957,13960,Michael Smith,Program Manager,michael_smith@brlda.gov,(471)896-2210x7208,Heather Crawford
15668,15672,Michael Smith,Scrum Master,michael_smith@brlda.gov,668.014.9966x33380,Gregory Floyd
17442,17446,Michael Smith,Data Analyst,michael_smith@brlda.gov,\t+1-544-294-4533x69316,Richard Collins
17570,17574,Michael Smith,Program Manager,michael_smith@brlda.gov,9730334254,Michele Graham
19322,19326,Michael Smith,Machine Learning Engineer,michael_smith@brlda.gov,081-057-7739,Joshua James


Are they all the same employee? The emails match, but the IDs and phone numbers don't.
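One way to probe this is to count how many distinct IDs and phone numbers share each email address. The snippet below is only a sketch: `toy_df` and its rows are made up for illustration (they are not taken from `employees.csv`); only the column names follow the table above.

```python
import pandas as pd

# Toy rows for illustration only; they are NOT taken from employees.csv.
toy_df = pd.DataFrame(
    {
        "EmployeeID": [1, 2, 3, 4],
        "EmployeeName": ["Ann Lee", "Ann Lee", "Bob Ray", "Ann Lee"],
        "Email": ["ann_lee@example.org"] * 2
        + ["bob_ray@example.org", "ann_lee@example.org"],
        "Phone": ["111", "222", "333", "111"],
    }
)

# For every email, count distinct IDs and distinct phone numbers.
per_email = toy_df.groupby("Email").agg(
    n_ids=("EmployeeID", "nunique"),
    n_phones=("Phone", "nunique"),
)

# Emails shared by more than one EmployeeID are candidates for
# "same name, different person" (or for duplicated records).
shared_emails = per_email[per_email["n_ids"] > 1]
print(shared_emails)
```

Running the same aggregation on `employees_df` would show whether shared emails are just a side effect of how addresses are derived from names.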

Let's focus on the missing data. If someone removed records from the employees table, we should see gaps in the EmployeeID values.

In [5]:
# Check min and max EmployeeID
min_eid = employees_df["EmployeeID"].min()
max_eid = employees_df["EmployeeID"].max()
print("Min EmployeeID: ", min_eid)
print("Max EmployeeID: ", max_eid)

# Find EmployeeIDs that are missing inside the min-max range (gaps in the sequence)
missing_employees = set(range(min_eid, max_eid + 1)) - set(employees_df["EmployeeID"])
print("Employees ids not in the min-max range: ", missing_employees)


Min EmployeeID:  1
Max EmployeeID:  26849
Employees ids not in the min-max range:  {14976, 22602, 26188, 1423, 4284}


These might be the "ghost" agents. Let's quickly check that all managers are also employees, just in case...

In [6]:
managers = set(employees_df["Manager"])
employees = set(employees_df["EmployeeName"])

managers - employees


set()

Do these "ghost employees" appear in other tables?

In [7]:
divisions_df[divisions_df["EmployeeID"].isin(missing_employees)]


Unnamed: 0,EmployeeID,EmployeeName,Division,Project,known_safehouses
1422,1423,,[Division 7],[Project e-enable_holistic_models],"[14, 214, 181, 219]"
4283,4284,,[Division 7],[Project repurpose_collaborative_methodologies...,"[10, 219]"
14975,14976,,[Division 7],[Project transform_24/365_functionalities],"[25, 154, 231, 33, 219]"
22601,22602,,[Division 7],"[Project monetize_one-to-one_mindshare, Projec...","[12, 221, 19, 18, 219]"
26187,26188,,[Division 7],[Project extend_robust_action-items],"[7, 219]"


They all belong to Division 7. What actions did they take?

In [8]:
actions_df[actions_df["EmployeeID"].isin(missing_employees)].sort_values(["ActionDate"])


Unnamed: 0,EmployeeID,ActionType,ActionDate,ActionDescription,ActionLocation,ActionStatus,ActionSeverity,AssociatedProject,AssociatedDivision
41824,14976,Quantum Key Generation,1994-06-06 00:00:00,perform data mining on social media data for s...,Puerto Rico,completed,critical,Project transform_24/365_functionalities,Division 1
53036,26188,Predictive Modeling,1994-11-20 00:00:00,construct algorithms for automatic gait recogn...,Puerto Rico,failed,critical,Project extend_robust_action-items,Division 10
4283,4284,Data Clustering,1996-05-08 00:00:00,Initiate operation Networked_discrete_system_e...,Martinique,completed,high,Project repurpose_collaborative_methodologies,Division 6
81969,1423,User Profiling,1997-04-12 00:00:00,Operation Re-contextualized_attitude-oriented_...,Egypt,failed,medium,Project e-enable_holistic_models,Division 3
68673,14976,Natural Language Generation,2007-06-06 00:00:00,construct algorithms for automatic vein recogn...,Benin,completed,critical,Project transform_24/365_functionalities,Division 1
31132,4284,Quantum Resistant Cryptography,2007-09-17 00:00:00,Initiate operation Customizable_discrete_paral...,Kazakhstan,failed,critical,Project repurpose_collaborative_methodologies,Division 6
57981,4284,Machine Learning-based Intrusion Detection,2007-10-25 00:00:00,analyze communication patterns through Fully-c...,Albania,completed,high,Project repurpose_collaborative_methodologies,Division 6
28271,1423,Automated Surveillance,2009-10-18 00:00:00,Operation Down-sized_24/7_capability to develo...,Puerto Rico,failed,high,Project e-enable_holistic_models,Division 3
14975,14976,Natural Language Generation,2011-08-06 00:00:00,Initiate operation Centralized_upward-trending...,Bahrain,completed,low,Project transform_24/365_functionalities,Division 1
49450,22602,Quantum Key Generation,2012-10-22 00:00:00,Operation Digitized_methodical_structure to ap...,Ireland,failed,high,Project monetize_one-to-one_mindshare,Division 6


Some interesting things:
- They all mention the devices in their action description.
- Their actions are tagged with a different Associated Division, not Division 7.

How common is it for employees to take actions in other divisions? Time for OBT (one big table).

In [9]:
detailed_actions_df = actions_df.merge(
    employees_df, on="EmployeeID", how="left"
).merge(divisions_df, on="EmployeeID", how="left")
detailed_actions_df.sample(5)


Unnamed: 0,EmployeeID,ActionType,ActionDate,ActionDescription,ActionLocation,ActionStatus,ActionSeverity,AssociatedProject,AssociatedDivision,EmployeeName_x,JobTitle,Email,Phone,Manager,EmployeeName_y,Division,Project,known_safehouses
15083,15084,Quantum Key Generation,2016-11-29 00:00:00,Initiate operation Implemented_modular_attitud...,Philippines,completed,low,Project streamline_proactive_e-markets,Division 3,Jessica Jackson,Statistician,jessica_jackson@brlda.gov,\t+1-852-619-7576,Pamela Nelson,Jessica Jackson,"[Division 9, Division 3]","[Project unleash_front-end_models, Project str...","[4, 232, 110, 220]"
21036,21037,Network Covert Channel Analysis,2022-02-01 00:00:00,Operation Profit-focused_6thgeneration_install...,Western Sahara,in progress,high,Project streamline_proactive_e-markets,Division 4,Lisa Mccarty,Data Analyst,lisa_mccarty@brlda.gov,961.555.0296,Darius Davis,Lisa Mccarty,[Division 4],[Project streamline_proactive_e-markets],"[31, 33, 155, 152]"
21892,21893,Quantitative Credit Scoring,2019-02-14 00:00:00,Initiate operation Open-source_solution-orient...,Cambodia,completed,critical,Project deliver_visionary_web-readiness,Division 4,Terry Clark,Project Manager,terry_clark@brlda.gov,001-856-450-8917x357,Crystal Romero,Terry Clark,"[Division 3, Division 4, Division 8]","[Project monetize_one-to-one_mindshare, Projec...","[20, 36, 20]"
65991,12294,Digital Forensics,2019-05-15 00:00:00,develop algorithms for automatic speech recogn...,Paraguay,completed,high,Project mesh_cutting-edge_experiences,Division 4,Chris Smith,Program Manager,chris_smith@brlda.gov,001-606-260-6378x35048,Bryan Moore,Chris Smith,[Division 4],"[Project monetize_one-to-one_mindshare, Projec...",[63]
13255,13256,Covert Biometric Identification,2002-08-02 00:00:00,construct algorithms for intrusion detection i...,El Salvador,completed,critical,Project facilitate_mission-critical_ROI,Division 9,Timothy Johnson,Program Manager,timothy_johnson@brlda.gov,887-799-0591,James Rogers,Timothy Johnson,"[Division 9, Division 10]","[Project facilitate_mission-critical_ROI, Proj...","[13, 19, 1, 191]"


In [10]:
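def check_division_associated_division(row):
    # Division is read from the CSV as text, so this is a substring test:
    # "Division 1" also matches "[Division 10]".
    return row["AssociatedDivision"] in row["Division"]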


detailed_actions_df[
    ~detailed_actions_df.apply(check_division_associated_division, axis=1)
].sample(5)


Unnamed: 0,EmployeeID,ActionType,ActionDate,ActionDescription,ActionLocation,ActionStatus,ActionSeverity,AssociatedProject,AssociatedDivision,EmployeeName_x,JobTitle,Email,Phone,Manager,EmployeeName_y,Division,Project,known_safehouses
35605,8757,Pattern-based Malware Detection,2001-10-17 00:00:00,Initiate operation Adaptive_next_generation_le...,Tuvalu,failed,low,Project repurpose_collaborative_methodologies,Division 7,Kimberly Mccann,Project Manager,kimberly_mccann@brlda.gov,\t+1-751-319-4440x1839,Christopher Sanchez,Kimberly Mccann,"[Division 9, Division 2, Division 2]",[Project repurpose_collaborative_methodologies],[77]
77354,23657,Quantum Key Generation,2021-12-01 00:00:00,Operation Cross-group_system-worthy_function t...,Marshall Islands,failed,medium,Project facilitate_mission-critical_ROI,Division 7,Michael Garza,Business Analyst,michael_garza@brlda.gov,001-960-646-5296,Brandon Wiley Jr.,Michael Garza,"[Division 10, Division 4, Division 6]","[Project scale_back-end_interfaces, Project di...","[111, 46, 149]"
23463,23464,Malware Analysis,2008-07-20 00:00:00,"Initiate operation Expanded_24hour_firmware, t...",British Virgin Islands,completed,high,Project monetize_one-to-one_mindshare,Division 7,Matthew Boyd,Program Manager,matthew_boyd@brlda.gov,953-103-4209x1001,Crystal Harris,Matthew Boyd,"[Division 2, Division 2, Division 10]",[Project monetize_one-to-one_mindshare],"[19, 121, 152]"
31975,5127,Data Steganalysis,2018-03-08 00:00:00,Initiate operation Synchronized_responsive_kno...,Greece,failed,high,Project embrace_magnetic_systems,Division 7,Audrey Peters,Business Analyst,audrey_peters@brlda.gov,\t+1-512-390-9160x5974,Shane Cowan,Audrey Peters,"[Division 8, Division 9, Division 2]","[Project drive_value-added_mindshare, Project ...","[224, 193, 129]"
18674,18675,Data Access Control,2006-04-25 00:00:00,Operation Pre-emptive_object-oriented_adapter ...,Syrian Arab Republic,failed,critical,Project engineer_killer_applications,Division 7,John Willis,Scrum Master,john_willis@brlda.gov,001-930-326-5306,James Frost,John Willis,"[Division 10, Division 5, Division 9]",[Project engineer_killer_applications],"[21, 50, 110, 152]"


Some people performed actions associated with Division 7, but Division 7 does not appear in their Division list in the divisions table.

In [11]:
detailed_actions_df[
    ~detailed_actions_df.apply(check_division_associated_division, axis=1)
]["AssociatedDivision"].value_counts()


AssociatedDivision
Division 7     2284
Division 6        7
Division 3        4
Division 1        4
Division 10       3
Name: count, dtype: int64

Most of the mismatches are people who acted for Division 7 without being listed in it, so they look misclassified. A handful of mismatches, however, involve Divisions 1, 3, 6, and 10. These are our suspects.
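One caveat with the check above: assuming the `Division` column holds the list as plain text (which is what `pd.read_csv` produces without a converter), `in` performs a substring test, so an action for Division 1 by someone listed only in Division 10 would not be flagged. A stricter variant could parse the text first. The helper names below are hypothetical and this is only a sketch of the idea:

```python
def parse_division_list(raw):
    """Turn a string such as "[Division 1, Division 10]" into a set of names."""
    if not isinstance(raw, str):
        return set()  # e.g. NaN from a left join with no match
    inner = raw.strip().strip("[]")
    return {part.strip() for part in inner.split(",") if part.strip()}


def action_in_own_division(associated_division, raw_divisions):
    """Exact membership test instead of a substring test."""
    return associated_division in parse_division_list(raw_divisions)


# Substring test: "Division 1" is found inside "[Division 10]".
print("Division 1" in "[Division 10]")  # True
# Exact test on the parsed set.
print(action_in_own_division("Division 1", "[Division 10]"))  # False
```

Applying `action_in_own_division` row by row to `detailed_actions_df` could change the counts shown above, so the suspect list is worth re-checking with it.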

## Safehouses

In [12]:
safehouses_df.sample(5)


Unnamed: 0,ID,City,Address,Latitude,Longitude
41,158,Moscow,"Izmaylovo Manor, Первомайская улица, Izmaylovo...",55.794138,37.75796
169,13,Moscow,"Детский оздоровительный лагерь ""Искорка"", М-2,...",55.363,37.727959
25,192,Rome,"Maiorana, Via Bolognola, 63, 00138 Rome RM, Italy",41.987777,12.503599
120,43,Jakarta,"Jalan Zeni AD I, Pancoran, Special Capital Reg...",-6.260742,106.85698
2,231,Luang Prabang,"13, 10554 Khouathineung, Laos",19.843087,102.159075


In [13]:
# Map of safehouses with Latitude and Longitude
import folium

safehouses_map = folium.Map(
    location=[safehouses_df["Latitude"].mean(), safehouses_df["Longitude"].mean()],
    zoom_start=4,
)

for index, row in safehouses_df.iterrows():
    folium.Marker([row["Latitude"], row["Longitude"]], popup=row["ID"]).add_to(
        safehouses_map
    )

safehouses_map


Cute, but not very useful for now. Let's explore divisions.

## Divisions

In [14]:
divisions_df


Unnamed: 0,EmployeeID,EmployeeName,Division,Project,known_safehouses
0,1,Kelly Rios,[Division 1],[Project deliver_visionary_web-readiness],"[232, 1, 73, 217]"
1,2,Madison Barr,"[Division 6, Division 3]",[Project repurpose_collaborative_methodologies],"[192, 26, 118, 4]"
2,3,Sue Anderson,"[Division 5, Division 1, Division 10]",[Project repurpose_collaborative_methodologies],"[19, 8, 130, 50]"
3,4,Laura Carlson,"[Division 9, Division 9, Division 2]",[Project streamline_proactive_e-markets],"[15, 232]"
4,5,Carrie Ali,"[Division 3, Division 6]",[Project deliver_visionary_web-readiness],"[158, 118]"
...,...,...,...,...,...
26844,26845,Christopher Riley,"[Division 8, Division 10, Division 5]","[Project drive_value-added_mindshare, Project ...",[226]
26845,26846,Eric Chan,"[Division 2, Division 6, Division 1]",[Project deliver_visionary_web-readiness],"[15, 99, 58]"
26846,26847,Amy Vazquez,"[Division 8, Division 10, Division 5]","[Project embrace_transparent_networks, Project...",[22]
26847,26848,Clifford Reyes,"[Division 9, Division 2, Division 6]",[Project disintermediate_distributed_experienc...,[98]


## Managers

In [15]:
managers_df.sample(5)


Unnamed: 0,ManagerName,Employee_1,Employee_2,Employee_3,Employee_4,Employee_5,Employee_6,Employee_7,Employee_8,Employee_9,...,Employee_20,Employee_21,Employee_22,Employee_23,Employee_24,Employee_25,Employee_26,Employee_27,Employee_28,Employee_29
601,Michael Payne DDS,Susan Hanson,Heather Brown,Megan Mayo,Rachel Nguyen,Nancy West,Jennifer Rodriguez,Carol Norton,,,...,,,,,,,,,,
2542,Mark Marshall,Christopher Lang,Dawn Diaz,Teresa Woods,Anthony Garrett,Robert Hodges,Holly Kim,,,,...,,,,,,,,,,
2413,Sharon Salazar,Jacob Stanley,Brooke Perry,Lori Riddle,Kelly Lin,Jennifer Matthews,Anthony West,Jason Hawkins,Kimberly Wood,Jared Zimmerman,...,,,,,,,,,,
3246,Amanda Daniels,Abigail Acosta,Connie Davis,Christina Vang,Cynthia Morris,Lauren Walsh,Claire Chandler,Joseph Nelson,Tiffany Mckee,,...,,,,,,,,,,
3445,Renee Gonzales,Caroline Rogers,Adam Sanchez,Mr. James Melton,Tanner Villa,Candice Mann,,,,,...,,,,,,,,,,


In [16]:
# Transform to long format
clean_managers_df = (
    managers_df.melt(
        id_vars=["ManagerName"], value_name="EmployeeName", var_name="EmployeeNumber"
    )
    .dropna()
    .drop(columns=["EmployeeNumber"])
)

clean_managers_df.sample(5)


Unnamed: 0,ManagerName,EmployeeName
17713,Andrew Collins,Amber Myers
10222,Todd Ayala,Antonio Steele
26918,Brandon Wiley Jr.,Herbert Moore
6747,Todd Wilson,Kenneth Reed
24041,Lisa Bowman,Tracy Burns


In [17]:
managers_df[managers_df["Employee_1"].isna()]


Unnamed: 0,ManagerName,Employee_1,Employee_2,Employee_3,Employee_4,Employee_5,Employee_6,Employee_7,Employee_8,Employee_9,...,Employee_20,Employee_21,Employee_22,Employee_23,Employee_24,Employee_25,Employee_26,Employee_27,Employee_28,Employee_29
3686,Jessica Stone,,,,,,,,,,...,,,,,,,,,,


Jessica Stone is the only manager without any employees (unless some rows have NaN gaps, with an empty Employee_1 but names in later columns).
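To rule out the NaN-gap case, the check can look at every `Employee_*` column instead of only `Employee_1`. A minimal sketch on a made-up frame (`toy_managers_df` and its values are illustrative, not from `managers.csv`):

```python
import pandas as pd

# Toy frame for illustration; names and values are made up.
toy_managers_df = pd.DataFrame(
    {
        "ManagerName": ["A", "B", "C"],
        "Employee_1": ["x", None, None],
        "Employee_2": ["y", "z", None],
    }
)

employee_cols = [c for c in toy_managers_df.columns if c.startswith("Employee_")]

# True only when every Employee_* cell in the row is missing.
no_reports = toy_managers_df[employee_cols].isna().all(axis=1)
print(toy_managers_df.loc[no_reports, "ManagerName"].tolist())  # ['C']
```

Manager "B" has a gap in `Employee_1` but still has a report, so only "C" is listed. The same mask on `managers_df` would confirm whether Jessica Stone really has no employees.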

In [18]:
clean_managers_df[clean_managers_df["EmployeeName"] == "Jessica Stone"]


Unnamed: 0,ManagerName,EmployeeName
17210,Christopher Mckenzie,Jessica Stone


In [19]:
employees_df[employees_df["EmployeeName"] == "Jessica Stone"]


Unnamed: 0,EmployeeID,EmployeeName,JobTitle,Email,Phone,Manager
11560,11563,Jessica Stone,Data Scientist,jessica_stone@brlda.gov,001-234-563-9331,Christopher Mckenzie


In [20]:
divisions_df[divisions_df["EmployeeName"] == "Jessica Stone"]


Unnamed: 0,EmployeeID,EmployeeName,Division,Project,known_safehouses
11562,11563,Jessica Stone,[Division 4],"[Project mesh_cutting-edge_experiences, Projec...","[18, 42, 44]"


In [21]:
actions_df[actions_df["EmployeeID"] == 11563]


Unnamed: 0,EmployeeID,ActionType,ActionDate,ActionDescription,ActionLocation,ActionStatus,ActionSeverity,AssociatedProject,AssociatedDivision
11562,11563,Gesture Recognition,2016-06-21 00:00:00,build systems for automatic object detection i...,Syrian Arab Republic,failed,critical,Project unleash_front-end_models,Division 4
38411,11563,Automated Social Media Profiling,2012-08-31 00:00:00,Operation Optimized_real-time_artificial_intel...,Botswana,failed,low,Project unleash_front-end_models,Division 4
65260,11563,Object Recognition,1996-05-13 00:00:00,perform data mining on financial transactions ...,Congo,completed,medium,Project unleash_front-end_models,Division 4
92109,11563,Covert Facial Recognition,1999-05-26 00:00:00,Initiate operation Visionary_coherent_architec...,Italy,completed,critical,Project unleash_front-end_models,Division 4


## Random

In [22]:
event_log = pd.read_csv("../data/event_log.csv")
event_log


Unnamed: 0,timestamp,event_type,latitude,longitude
0,1973-01-11,device_sighting,36.120243,-115.165881
1,1973-01-11,device_sighting,36.121162,-115.166821
2,1973-01-11,device_sighting,36.122059,-115.167754
3,1973-01-11,device_sighting,36.122927,-115.168672
4,1973-01-11,device_sighting,36.123756,-115.169568
...,...,...,...,...
239493,1997-11-14,intel_analysis,12.780501,-110.231854
239494,1981-08-27,intel_analysis,12.352336,23.379725
239495,1998-05-04,intel_analysis,-88.566131,-144.912711
239496,2009-11-08,intel_analysis,-6.631462,-74.687310


In [23]:
# Keep only the device_sighting events
device_sighting = event_log[event_log["event_type"] == "device_sighting"]


In [24]:
# Plot map with lines between device_sighting
device_sighting_map = folium.Map(
    location=[device_sighting["latitude"].mean(), device_sighting["longitude"].mean()],
    zoom_start=4,
)

# Track the previous sighting so consecutive points can be joined with a line.
# (Avoids using the DataFrame index label as a position for iloc.)
previous_point = None
for _, row in device_sighting.iterrows():
    point = (row["latitude"], row["longitude"])

    folium.CircleMarker(
        point,
        radius=3,
        popup=row["timestamp"],
        color="#3186cc",
        fill=True,
        fill_color="#3186cc",
    ).add_to(device_sighting_map)

    # Add lines
    if previous_point is not None:
        folium.PolyLine(
            [point, previous_point],
            color="red",
            weight=1,
            opacity=0.5,
        ).add_to(device_sighting_map)
    previous_point = point
# Add safehouses
for index, safehouse in safehouses_df.iterrows():
    folium.Marker(
        [safehouse["Latitude"], safehouse["Longitude"]],
        popup=safehouse["ID"],
        icon=folium.Icon(color="green", icon="home"),
    ).add_to(device_sighting_map)

device_sighting_map


In [25]:
actions_df[actions_df["EmployeeID"] == 22602]


Unnamed: 0,EmployeeID,ActionType,ActionDate,ActionDescription,ActionLocation,ActionStatus,ActionSeverity,AssociatedProject,AssociatedDivision
22601,22602,Covert Behavioral Analytics,2019-01-30 00:00:00,Operation Vision-oriented_explicit_flexibility...,Ireland,failed,medium,Project monetize_one-to-one_mindshare,Division 6
49450,22602,Quantum Key Generation,2012-10-22 00:00:00,Operation Digitized_methodical_structure to ap...,Ireland,failed,high,Project monetize_one-to-one_mindshare,Division 6
76299,22602,Cryptocurrency Tracing,2014-08-08 00:00:00,analyze network topology for vulnerabilities t...,San Marino,completed,medium,Project monetize_one-to-one_mindshare,Division 6
