# Talent Intelligence System (Data Exploration)

This notebook serves as the primary analysis environment for **Step 1: Discover the Pattern of Success**.

**Objective:** Perform a deep exploratory data analysis (EDA) to uncover the key attributes, competencies, and behavioral patterns that differentiate high-performing employees (those with `rating = 5`) from their peers.

**Process:**
The analysis will follow these key steps:
1.  **Setup:** Import all necessary libraries and establish a connection to the Supabase database.
2.  **Data Loading:** Load all raw tables from the database into Pandas DataFrames.
3.  **Data Cleaning & Preprocessing:** Handle missing values, standardize inconsistent text data (like `mbti`), and create a master analysis DataFrame.
4.  **Exploratory Data Analysis (EDA):** Compare the "High Performer" group against all others across several key dimensions:
    * Psychometric & Cognitive (`iq`, `pauli`, `disc`, etc.)
    * Competencies (`competencies_yearly`)
    * Behavioral Strengths (`strengths`)
    * Contextual Factors (`grade_id`, `years_of_service_months`)
5.  **Synthesis:** Use visualizations and statistical summaries to identify the most significant factors.
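
As a rough illustration of steps 3 and 4, the sketch below standardizes an `mbti` column and compares high performers against everyone else. The column names follow this notebook, but the six rows, the `group` column, and the `iq_by_group` / `mbti_share` names are invented for the example. The real tables may need different handling.

```python
import pandas as pd

# Hypothetical toy data: column names follow the notebook, values are invented.
df = pd.DataFrame({
    "employee_id": ["E1", "E2", "E3", "E4", "E5", "E6"],
    "rating": [5, 3, 5, 4, 2, 5],
    "iq": [120, 100, 115, 105, 95, 125],
    "mbti": ["intj", " ENFP", "Intj ", "ISTJ", None, "entp"],
})

# Step 3: standardize inconsistent text (trim whitespace, uppercase).
# Missing values stay missing.
df["mbti"] = df["mbti"].str.strip().str.upper()

# Step 4: label high performers (rating = 5) versus everyone else.
df["group"] = df["rating"].eq(5).map({True: "high_performer", False: "other"})

# Compare a numeric attribute across the two groups.
iq_by_group = df.groupby("group")["iq"].mean()

# Compare a categorical attribute as within-group shares (missing values are dropped).
mbti_share = pd.crosstab(df["group"], df["mbti"], normalize="index")

print(iq_by_group)
print(mbti_share)
```

The same pattern (label the group, then aggregate per group) should carry over to the other dimensions listed above, although skewed or sparse columns may call for medians or counts rather than means.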

**End Goal:** The insights gathered here will be synthesized into a **"Final Success Formula"**. This formula will provide the logical foundation for the SQL matching algorithm to be built in Step 2.

## Data Loading & Initial Inspection

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import os
from dotenv import load_dotenv
from sqlalchemy import create_engine
import matplotlib.pyplot as plt
import seaborn as sns

In [3]:
# Load the secret credentials from the .env file

load_dotenv()

DB_HOST = os.getenv('DB_HOST')
DB_PORT = os.getenv('DB_PORT')
DB_NAME = os.getenv('DB_NAME')
DB_USER = os.getenv('DB_USER')
DB_PASSWORD = os.getenv('DB_PASSWORD')

print("Environment variables loaded.")


Environment variables loaded.
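
Note that `os.getenv` returns `None` for a variable that is not set, so the message above does not by itself confirm that every credential was found. A minimal check could look like the sketch below. The helper name `missing_env_vars` and the `REQUIRED_VARS` list are illustrative and not part of the original notebook.

```python
import os

# The five variables read in the cell above.
REQUIRED_VARS = ["DB_HOST", "DB_PORT", "DB_NAME", "DB_USER", "DB_PASSWORD"]

def missing_env_vars(names, environ=os.environ):
    """Return the names that are unset or empty in the given environment."""
    return [name for name in names if not environ.get(name)]

missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```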


In [5]:
# Create the connection 'engine' to the Supabase database

connection_string = f"postgresql://{DB_USER}:{DB_PASSWORD}@{DB_HOST}:{DB_PORT}/{DB_NAME}"
engine = create_engine(connection_string)

print("Database connection engine created successfully.")

Database connection engine created successfully.
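
One caveat with building the URL by plain string interpolation: a password containing characters such as `@`, `/`, or `:` can make the URL ambiguous. A common workaround is to percent-encode the credentials first, as in the sketch below. The helper name `build_connection_string` and the sample credentials are invented for illustration.

```python
from urllib.parse import quote_plus

def build_connection_string(user, password, host, port, name):
    """Return a PostgreSQL URL with the user and password percent-encoded."""
    return (
        f"postgresql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{name}"
    )

# Invented placeholder credentials, for illustration only.
url = build_connection_string("analyst", "p@ss/w:rd", "db.example.com", "5432", "postgres")
print(url)
```

The resulting string can be passed to `create_engine` in place of the interpolated one. Creating the engine does not open a connection, so the success message above only confirms that the URL could be parsed. The first query is what actually tests the credentials.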


In [13]:
# Overview of certain tables for initial checking
df_employees = pd.read_sql(
    """
    SELECT *
    FROM employees
    """,
    engine,
)
print("\n--- Employee Data Info ---")
df_employees.info()  # info() prints its summary and returns None, so call it on its own
display(df_employees.head())


--- Employee Data Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2010 entries, 0 to 2009
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   employee_id              2010 non-null   object
 1   fullname                 2010 non-null   object
 2   nip                      2010 non-null   object
 3   company_id               2010 non-null   int64 
 4   area_id                  2010 non-null   int64 
 5   position_id              2010 non-null   int64 
 6   department_id            2010 non-null   int64 
 7   division_id              2010 non-null   int64 
 8   directorate_id           2010 non-null   int64 
 9   grade_id                 2010 non-null   int64 
 10  education_id             2010 non-null   int64 
 11  major_id                 2010 non-null   int64 
 12  years_of_service_months  2010 non-null   int64 
dtypes: int64(10), object(3)
memory usage: 204.3+ KB

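| | employee_id | fullname | nip | company_id | area_id | position_id | department_id | division_id | directorate_id | grade_id | education_id | major_id | years_of_service_months |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | EMP100000 | Rendra Pratama | 806137 | 1 | 4 | 1 | 6 | 3 | 3 | 2 | 3 | 1 | 64 |
| 1 | EMP100001 | Wulan Setiawan | 476388 | 4 | 4 | 4 | 5 | 5 | 3 | 1 | 1 | 2 | 16 |
| 2 | EMP100002 | Julia Jatmiko Situmorang | 941921 | 1 | 3 | 5 | 1 | 1 | 3 | 1 | 3 | 1 | 58 |
| 3 | EMP100003 | Oka Halim | 751615 | 1 | 3 | 4 | 2 | 1 | 1 | 2 | 2 | 5 | 15 |
| 4 | EMP100004 | Dwi Pratama | 443809 | 3 | 4 | 6 | 5 | 3 | 3 | 1 | 1 | 5 | 34 |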
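
Beyond `info()` and `head()`, a few quick integrity checks are often worth running on a freshly loaded table. The sketch below retypes three columns of the five rows shown above so that it runs without a database. The `basic_checks` helper is a hypothetical name, and in the notebook it would be applied to `df_employees` directly.

```python
import pandas as pd

# Three columns of the first five rows shown above, retyped by hand.
sample = pd.DataFrame({
    "employee_id": ["EMP100000", "EMP100001", "EMP100002", "EMP100003", "EMP100004"],
    "grade_id": [2, 1, 1, 2, 1],
    "years_of_service_months": [64, 16, 58, 15, 34],
})

def basic_checks(df, key="employee_id"):
    """Summarize row count, duplicated key values, and missing cells."""
    return {
        "rows": len(df),
        "duplicate_keys": int(df[key].duplicated().sum()),
        "null_cells": int(df.isna().sum().sum()),
    }

checks = basic_checks(sample)
print(checks)
```

A non-zero `duplicate_keys` count would be worth resolving before joining this table to the others, since duplicated keys multiply rows in a merge.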
