# Phase 1: Data Acquisition & Preparation
This foundational phase focuses on gathering and transforming raw data into a clean, usable format suitable for analysis.

* Data Loading:
    * Load the HRDataset_v14.csv file into a pandas DataFrame.
* Initial Data Inspection:
    * Examine the first 5 rows (df.head()) to quickly grasp the data structure.
    * Inspect column names, data types, and non-null counts using df.info() and df.dtypes.
    * Confirm the overall dimensions of the dataset (number of rows and columns).
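The dimension and missing-value checks listed above can be sketched as follows. This is a minimal illustration on a small made-up frame (`sample` and its values are hypothetical stand-ins, not rows from HRDataset_v14.csv); in the notebook the same calls would be made on `df`.

```python
import pandas as pd

# Hypothetical stand-in for the HR data; the notebook loads HRDataset_v14.csv instead.
sample = pd.DataFrame({
    "Employee_Name": ["Doe, Jane", "Roe, Richard", "Poe, Ann"],
    "EmpID": [10001, 10002, 10003],
    "Salary": [62000, 58000, 71000],
})

# .shape returns a (rows, columns) tuple
n_rows, n_cols = sample.shape
print(f"{n_rows} rows x {n_cols} columns")

# Data type of each column
print(sample.dtypes)

# Number of missing values per column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```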





In [None]:
import pandas as pd
df = pd.read_csv('HRDataset_v14.csv')
df.head()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 311 entries, 0 to 310
Data columns (total 36 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Employee_Name               311 non-null    object 
 1   EmpID                       311 non-null    int64  
 2   MarriedID                   311 non-null    int64  
 3   MaritalStatusID             311 non-null    int64  
 4   GenderID                    311 non-null    int64  
 5   EmpStatusID                 311 non-null    int64  
 6   DeptID                      311 non-null    int64  
 7   PerfScoreID                 311 non-null    int64  
 8   FromDiversityJobFairID      311 non-null    int64  
 9   Salary                      311 non-null    int64  
 10  Termd                       311 non-null    int64  
 11  PositionID                  311 non-null    int64  
 12  Position                    311 non-null    object 
 13  State                       311 non

In [None]:
# Convert date columns stored as strings (object dtype) to datetime for later analysis
df['DateofTermination'] = pd.to_datetime(df['DateofTermination'], errors='coerce')
df['DOB'] = pd.to_datetime(df['DOB'], errors='coerce')
df['DateofHire'] = pd.to_datetime(df['DateofHire'], errors='coerce')
df['LastPerformanceReview_Date'] = pd.to_datetime(df['LastPerformanceReview_Date'], errors='coerce')
# Check that the date columns were successfully converted
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 311 entries, 0 to 310
Data columns (total 36 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   Employee_Name               311 non-null    object        
 1   EmpID                       311 non-null    int64         
 2   MarriedID                   311 non-null    int64         
 3   MaritalStatusID             311 non-null    int64         
 4   GenderID                    311 non-null    int64         
 5   EmpStatusID                 311 non-null    int64         
 6   DeptID                      311 non-null    int64         
 7   PerfScoreID                 311 non-null    int64         
 8   FromDiversityJobFairID      311 non-null    int64         
 9   Salary                      311 non-null    int64         
 10  Termd                       311 non-null    int64         
 11  PositionID                  311 non-null    int64         
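Because `errors='coerce'` silently turns unparseable strings into `NaT`, it can be worth counting how many values were lost in the conversion. The sketch below uses made-up strings and assumes a month/day/year layout; the `format` string is an assumption and would need to match the actual CSV.

```python
import pandas as pd

# Hypothetical raw values: two valid dates, one bad string, one genuinely missing value
raw_dates = pd.Series(["07/05/2011", "not a date", None, "01/09/2012"])

# An explicit format avoids format guessing; unparseable values become NaT
parsed = pd.to_datetime(raw_dates, format="%m/%d/%Y", errors="coerce")

# Values that were present in the raw data but failed to parse
n_failed = int(parsed.isna().sum() - raw_dates.isna().sum())
print(f"{n_failed} value(s) could not be parsed")
```

A non-zero count would suggest checking the raw strings before relying on the converted column.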



In [None]:
# Select only the rows that show a missing value for the ManagerID
df[df['ManagerID'].isnull()][['Employee_Name', 'Position']]
# Replace the NaN values with 0
df['ManagerID'] = df['ManagerID'].fillna(0)
# Cast ManagerID from float64 to an integer dtype
df['ManagerID'] = df['ManagerID'].astype(int)
# Check that the missing values were imputed
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 311 entries, 0 to 310
Data columns (total 36 columns):
 #   Column                      Non-Null Count  Dtype         
---  ------                      --------------  -----         
 0   Employee_Name               311 non-null    object        
 1   EmpID                       311 non-null    int64         
 2   MarriedID                   311 non-null    int64         
 3   MaritalStatusID             311 non-null    int64         
 4   GenderID                    311 non-null    int64         
 5   EmpStatusID                 311 non-null    int64         
 6   DeptID                      311 non-null    int64         
 7   PerfScoreID                 311 non-null    int64         
 8   FromDiversityJobFairID      311 non-null    int64         
 9   Salary                      311 non-null    int64         
 10  Termd                       311 non-null    int64         
 11  PositionID                  311 non-null    int64         
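Filling missing ManagerID values with 0 makes the column an integer type, but 0 then looks like a real manager ID. One possible alternative, shown here only as a sketch on made-up values, is pandas' nullable `Int64` dtype, which stores integers while keeping missing entries marked as `<NA>`.

```python
import pandas as pd

# Hypothetical ManagerID values with one missing entry (float64 because of the NaN)
manager_ids = pd.Series([22.0, None, 4.0], name="ManagerID")

# Approach used in the notebook: fill with 0, then cast to int
filled = manager_ids.fillna(0).astype(int)

# Alternative: nullable integer dtype keeps the missing value as <NA>
nullable = manager_ids.astype("Int64")

print(filled.tolist())
print(nullable)
```

Which approach fits better depends on whether later steps can handle `<NA>` values.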

# Feature Engineering:
* **Tenure**: Calculate the length of an employee's service.
* **Age**: Compute the employee's age from DOB.
* **High-Absence Flag**: Create a new binary categorical variable (e.g., is_high_absent) for employees whose Absences or DaysLateLast30 exceed a defined threshold.
* **Performance Score Numerical Mapping**: Transform the categorical PerformanceScore into a numerical scale (e.g., 4, 3, 2, 1).

* Categorical Variable Encoding:

  * Prepare nominal categorical variables (e.g., Department, Position, RaceDesc) for machine learning models using One-Hot Encoding.
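The performance score mapping described above could be sketched as follows. The category labels and their ordering in `score_map` are assumptions and should be checked against the values actually present in the PerformanceScore column (for example with `df['PerformanceScore'].unique()`); `PerformanceScore_Num` is a hypothetical column name.

```python
import pandas as pd

# Assumed labels and ordering (4 = best); verify against the real column values
score_map = {"Exceeds": 4, "Fully Meets": 3, "Needs Improvement": 2, "PIP": 1}

# Hypothetical sample rows
scores = pd.DataFrame(
    {"PerformanceScore": ["Fully Meets", "Exceeds", "PIP", "Needs Improvement"]}
)

# Map each label to its numeric score; labels missing from score_map become NaN
scores["PerformanceScore_Num"] = scores["PerformanceScore"].map(score_map)

# Any NaN here would indicate a label that score_map does not cover
unmapped = int(scores["PerformanceScore_Num"].isna().sum())
print(scores)
```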



In [None]:
# Replace NaN termination dates (current employees) with the current date
df['DateofTermination'] = df['DateofTermination'].fillna(pd.Timestamp('now'))

# Create new variable 'Tenure' that counts the number of years each employee has been employed
df['Tenure'] = ((df['DateofTermination'] - df['DateofHire']).dt.days) / 365.25

# Create new variable 'Age' that gives each employee's age in years
df['Age'] = ((pd.Timestamp('now') - df['DOB']).dt.days) / 365.25

# Inspect the distributions (including the 75th percentile) of 'Absences' and 'DaysLateLast30' to choose thresholds for "high" absence
df[['Absences', 'DaysLateLast30']].describe()

# Create a new variable 'High_Absence_Flag': 1 if the employee has more than 15 absences or more than 3 late arrivals, 0 otherwise
df['High_Absence_Flag'] = ((df['Absences'] > 15) | (df['DaysLateLast30'] > 3)).astype(int)

# Check how many employees are flagged as having high absences
df['High_Absence_Flag'].value_counts()
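The cell above uses fixed cutoffs chosen after looking at `describe()`. As a sketch of an alternative, the cutoffs could be taken directly from the 75th percentile so they track the data. The values below are made up and the `attendance` frame is hypothetical.

```python
import pandas as pd

# Hypothetical attendance data
attendance = pd.DataFrame({
    "Absences": [1, 4, 7, 10, 13, 16, 19, 20],
    "DaysLateLast30": [0, 0, 0, 0, 1, 2, 4, 6],
})

# 75th-percentile cutoffs computed from the data instead of hard-coded numbers
absence_cutoff = attendance["Absences"].quantile(0.75)
late_cutoff = attendance["DaysLateLast30"].quantile(0.75)

# Flag employees who exceed either cutoff
attendance["High_Absence_Flag"] = (
    (attendance["Absences"] > absence_cutoff)
    | (attendance["DaysLateLast30"] > late_cutoff)
).astype(int)

print(absence_cutoff, late_cutoff)
print(attendance["High_Absence_Flag"].value_counts())
```

With an "or" condition across two columns, more than 25% of rows may be flagged, so the resulting counts are worth checking.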

In [None]:
# Create a new dataframe that is One-Hot Encoded so that the data can be used in machine learning later on while still preserving the original dataset.
df_encoded = pd.get_dummies(df, columns = ['Department', 'Position', 'RaceDesc'])
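To illustrate what `pd.get_dummies` produces, here is a small sketch on made-up rows (`staff` is hypothetical). Each distinct category becomes its own indicator column, and the original text column is dropped. Passing `dtype=int` is an addition not used in the cell above; it yields 0/1 columns instead of the default, which is boolean in recent pandas versions.

```python
import pandas as pd

# Hypothetical rows with one nominal column
staff = pd.DataFrame({
    "EmpID": [1, 2, 3],
    "Department": ["IT/IS", "Sales", "IT/IS"],
})

# One indicator column per distinct Department value; EmpID is left untouched
encoded = pd.get_dummies(staff, columns=["Department"], dtype=int)

print(encoded)
```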

In [None]:
# Save the human-readable dataframe with text columns
df.to_feather('hr_data_cleaned.feather')

# Save the fully encoded dataframe for machine learning
df_encoded.to_feather('hr_data_encoded.feather')