# Data Ingestion

**Objective:** Load the IBM HR Analytics dataset and perform initial exploration

**Dataset:** IBM HR Analytics Employee Attrition & Performance
- Source: Kaggle
- Records: 1,470 employees
- Features: 35 columns

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")

Libraries imported successfully!


In [2]:
# Load the dataset
print("Loading IBM HR Analytics dataset...")

df = pd.read_csv('../data/raw/hr_data.csv')

print("Data loaded successfully!")
print(f"Total Employees: {len(df)}")
print(f"Total Features: {df.shape[1]}")

Loading IBM HR Analytics dataset...
Data loaded successfully!
Total Employees: 1470
Total Features: 35
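
A load step like this can succeed silently on the wrong file or a partial download. The following is a minimal sketch of a post-load sanity check, assuming the expected shape of 1,470 rows and 35 columns stated at the top of this notebook. The helper name `check_shape` and the toy frame are hypothetical and only illustrate the idea.

```python
import pandas as pd


def check_shape(df: pd.DataFrame, expected_rows: int, expected_cols: int) -> None:
    """Raise ValueError if the frame does not have the expected dimensions."""
    rows, cols = df.shape
    if (rows, cols) != (expected_rows, expected_cols):
        raise ValueError(
            f"Expected {expected_rows} rows x {expected_cols} columns, "
            f"got {rows} x {cols}"
        )


# Toy frame standing in for the real data (hypothetical values)
toy = pd.DataFrame({"Age": [41, 49, 37], "Attrition": ["Yes", "No", "Yes"]})
check_shape(toy, expected_rows=3, expected_cols=2)  # passes silently
```

In this notebook the equivalent call would be `check_shape(df, 1470, 35)`, placed right after `pd.read_csv`.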


In [3]:
# Display first few rows
print("\nFirst 5 rows of the dataset:")
print("="*80)
df.head()


First 5 rows of the dataset:


| | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |


In [4]:
# Check all column names
print("All Columns in Dataset:")
print("="*80)
for i, col in enumerate(df.columns, 1):
    print(f"{i:2d}. {col}")

All Columns in Dataset:
 1. Age
 2. Attrition
 3. BusinessTravel
 4. DailyRate
 5. Department
 6. DistanceFromHome
 7. Education
 8. EducationField
 9. EmployeeCount
10. EmployeeNumber
11. EnvironmentSatisfaction
12. Gender
13. HourlyRate
14. JobInvolvement
15. JobLevel
16. JobRole
17. JobSatisfaction
18. MaritalStatus
19. MonthlyIncome
20. MonthlyRate
21. NumCompaniesWorked
22. Over18
23. OverTime
24. PercentSalaryHike
25. PerformanceRating
26. RelationshipSatisfaction
27. StandardHours
28. StockOptionLevel
29. TotalWorkingYears
30. TrainingTimesLastYear
31. WorkLifeBalance
32. YearsAtCompany
33. YearsInCurrentRole
34. YearsSinceLastPromotion
35. YearsWithCurrManager


In [5]:
# Dataset info
print("\nDataset Information:")
print("="*80)
df.info()


Dataset Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  Jo

In [6]:
# Check for missing values
print("\nMissing Values Check:")
print("="*80)

missing = df.isnull().sum()
if missing.sum() == 0:
    print("Great! No missing values found in the dataset.")
else:
    print("Missing values found:")
    print(missing[missing > 0])


Missing Values Check:
Great! No missing values found in the dataset.
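
`isnull()` only counts true nulls (`NaN`/`None`). Blank or whitespace-only strings in text columns would pass the check above unnoticed. The sketch below shows one way to count them. It is an illustration under assumptions, not a claim that this dataset contains blanks; the helper name `blank_string_counts` and the toy values are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def blank_string_counts(df: pd.DataFrame) -> dict:
    """Map each text column to its number of empty or whitespace-only values."""
    counts = {}
    for col in df.columns:
        series = df[col]
        # Only inspect text-like columns; numeric columns cannot hold blanks
        if is_object_dtype(series) or is_string_dtype(series):
            n_blank = int(series.astype(str).str.strip().eq("").sum())
            if n_blank:
                counts[col] = n_blank
    return counts


# Toy frame with one whitespace-only and one empty value (hypothetical data)
toy = pd.DataFrame({
    "Department": ["Sales", "  ", "Sales"],
    "Gender": ["Male", "Female", ""],
    "Age": [41, 49, 37],
})
print(blank_string_counts(toy))
```

An empty dictionary from `blank_string_counts(df)` would support the "no missing values" result more strongly than `isnull()` alone.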


In [7]:
# Check for duplicates
print("\nDuplicate Records Check:")
print("="*80)

duplicates = df.duplicated().sum()
if duplicates == 0:
    print("No duplicate records found.")
else:
    print(f"Found {duplicates} duplicate records.")


Duplicate Records Check:
No duplicate records found.
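
`df.duplicated()` flags a row only when every column matches an earlier row, so the same employee entered twice with one differing field would not be caught. A complementary check is to test the identifier column on its own. The sketch below assumes `EmployeeNumber` is intended to be a unique employee ID, which the column name suggests but the source does not state. The toy data is hypothetical.

```python
import pandas as pd

# Toy frame: EmployeeNumber 2 appears twice with different ages (hypothetical data)
toy = pd.DataFrame({
    "EmployeeNumber": [1, 2, 2, 4],
    "Age": [41, 49, 50, 33],
})

# Rows that are identical across all columns
full_row_dupes = int(toy.duplicated().sum())

# Rows that repeat an identifier already seen, regardless of the other columns
id_dupes = int(toy.duplicated(subset=["EmployeeNumber"]).sum())

print(f"Full-row duplicates: {full_row_dupes}")
print(f"Repeated EmployeeNumber values: {id_dupes}")
```

On the real data, `df['EmployeeNumber'].is_unique` gives the same answer as a single boolean.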


In [8]:
# Basic statistics for numeric columns
print("\nStatistical Summary (Numeric Features):")
print("="*80)
df.describe().round(2)


Statistical Summary (Numeric Features):


| | Age | DailyRate | DistanceFromHome | Education | EmployeeCount | EmployeeNumber | EnvironmentSatisfaction | HourlyRate | JobInvolvement | JobLevel | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | ... | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 | 1470.0 |
| mean | 36.92 | 802.49 | 9.19 | 2.91 | 1.0 | 1024.87 | 2.72 | 65.89 | 2.73 | 2.06 | ... | 2.71 | 80.0 | 0.79 | 11.28 | 2.8 | 2.76 | 7.01 | 4.23 | 2.19 | 4.12 |
| std | 9.14 | 403.51 | 8.11 | 1.02 | 0.0 | 602.02 | 1.09 | 20.33 | 0.71 | 1.11 | ... | 1.08 | 0.0 | 0.85 | 7.78 | 1.29 | 0.71 | 6.13 | 3.62 | 3.22 | 3.57 |
| min | 18.0 | 102.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 30.0 | 1.0 | 1.0 | ... | 1.0 | 80.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 30.0 | 465.0 | 2.0 | 2.0 | 1.0 | 491.25 | 2.0 | 48.0 | 2.0 | 1.0 | ... | 2.0 | 80.0 | 0.0 | 6.0 | 2.0 | 2.0 | 3.0 | 2.0 | 0.0 | 2.0 |
| 50% | 36.0 | 802.0 | 7.0 | 3.0 | 1.0 | 1020.5 | 3.0 | 66.0 | 3.0 | 2.0 | ... | 3.0 | 80.0 | 1.0 | 10.0 | 3.0 | 3.0 | 5.0 | 3.0 | 1.0 | 3.0 |
| 75% | 43.0 | 1157.0 | 14.0 | 4.0 | 1.0 | 1555.75 | 4.0 | 83.75 | 3.0 | 3.0 | ... | 4.0 | 80.0 | 1.0 | 15.0 | 3.0 | 3.0 | 9.0 | 7.0 | 3.0 | 7.0 |
| max | 60.0 | 1499.0 | 29.0 | 5.0 | 1.0 | 2068.0 | 4.0 | 100.0 | 4.0 | 5.0 | ... | 4.0 | 80.0 | 3.0 | 40.0 | 6.0 | 4.0 | 40.0 | 18.0 | 15.0 | 17.0 |
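
In the summary above, `EmployeeCount` and `StandardHours` have a standard deviation of 0.0 and identical minimum and maximum values, so each holds a single value for every employee. Such columns carry no information for later analysis. The sketch below shows one way to list them programmatically. The helper name `constant_columns` and the toy frame are hypothetical.

```python
import pandas as pd


def constant_columns(df: pd.DataFrame) -> list:
    """Return the names of columns that hold a single distinct value."""
    n_unique = df.nunique(dropna=False)
    return n_unique[n_unique <= 1].index.tolist()


# Toy frame mirroring the pattern seen in the summary (hypothetical rows)
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "EmployeeCount": [1, 1, 1],
    "StandardHours": [80, 80, 80],
})
print(constant_columns(toy))
```

Using `nunique` rather than `std() == 0` means the same check also covers text columns, which `describe()` skips by default.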


In [9]:
# Check target variable distribution
print("\nTarget Variable Distribution (Attrition):")
print("="*80)

attrition_counts = df['Attrition'].value_counts()
attrition_pct = df['Attrition'].value_counts(normalize=True) * 100

print(attrition_counts)
print(f"\nAttrition Rate: {attrition_pct['Yes']:.2f}%")
print(f"Retention Rate: {attrition_pct['No']:.2f}%")


Target Variable Distribution (Attrition):
Attrition
No     1233
Yes     237
Name: count, dtype: int64

Attrition Rate: 16.12%
Retention Rate: 83.88%
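
The counts above show an imbalanced target: far more "No" rows than "Yes" rows. The sketch below quantifies that imbalance from the reported counts and derives inverse-frequency class weights with the formula `n_samples / (n_classes * class_count)`, the same heuristic scikit-learn applies for `class_weight='balanced'`. Whether later notebooks use class weights at all is an assumption here; this is only an illustration of the arithmetic.

```python
# Class counts taken from the value_counts() output above
counts = {"No": 1233, "Yes": 237}

n_samples = sum(counts.values())
n_classes = len(counts)

# Share of each class as a percentage of all employees
pct = {label: 100 * n / n_samples for label, n in counts.items()}

# Majority-to-minority ratio: how many "No" rows exist per "Yes" row
imbalance_ratio = counts["No"] / counts["Yes"]

# Inverse-frequency weights: rarer classes receive larger weights
class_weights = {label: n_samples / (n_classes * n) for label, n in counts.items()}

print(f"Attrition rate: {pct['Yes']:.2f}%")
print(f"Imbalance ratio (No:Yes): {imbalance_ratio:.2f}:1")
print({label: round(w, 2) for label, w in class_weights.items()})
```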


In [10]:
# Check categorical columns
print("\nCategorical Features Summary:")
print("="*80)

categorical_cols = df.select_dtypes(include=['object']).columns

for col in categorical_cols:
    print(f"\n{col}:")
    print(df[col].value_counts())
    print(f"Unique values: {df[col].nunique()}")


Categorical Features Summary:

Attrition:
Attrition
No     1233
Yes     237
Name: count, dtype: int64
Unique values: 2

BusinessTravel:
BusinessTravel
Travel_Rarely        1043
Travel_Frequently     277
Non-Travel            150
Name: count, dtype: int64
Unique values: 3

Department:
Department
Research & Development    961
Sales                     446
Human Resources            63
Name: count, dtype: int64
Unique values: 3

EducationField:
EducationField
Life Sciences       606
Medical             464
Marketing           159
Technical Degree    132
Other                82
Human Resources      27
Name: count, dtype: int64
Unique values: 6

Gender:
Gender
Male      882
Female    588
Name: count, dtype: int64
Unique values: 2

JobRole:
JobRole
Sales Executive              326
Research Scientist           292
Laboratory Technician        259
Manufacturing Director       145
Healthcare Representative    131
Manager                      102
Sales Representative          83
Research Direct

In [11]:
# Save a copy for next steps
print("\nSaving data for next notebook...")

# Save to external folder (we'll enrich this later)
df.to_csv('../data/external/enriched_data.csv', index=False)

print("Data saved to: data/external/enriched_data.csv")
print("\nNotebook 01 Complete! Ready for data cleaning.")


Saving data for next notebook...
Data saved to: data/external/enriched_data.csv

Notebook 01 Complete! Ready for data cleaning.
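
`to_csv` does not create missing folders, so the save step above fails if `data/external/` does not exist yet. The sketch below creates the parent folder first and then reloads the file to confirm the round trip. It writes to a temporary directory and uses a toy frame so that it runs anywhere; both are stand-ins for the real paths and data.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the real data (hypothetical values)
toy = pd.DataFrame({"Age": [41, 49], "Attrition": ["Yes", "No"]})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "data" / "external" / "enriched_data.csv"

    # Create the folder tree if needed; to_csv will not do this itself
    out_path.parent.mkdir(parents=True, exist_ok=True)
    toy.to_csv(out_path, index=False)

    # Reload and confirm that shape and column names survived the round trip
    reloaded = pd.read_csv(out_path)
    round_trip_ok = (
        reloaded.shape == toy.shape
        and list(reloaded.columns) == list(toy.columns)
    )

print(f"Round trip OK: {round_trip_ok}")
```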


---

## Findings from Data Ingestion:

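1. **Dataset Size:** 1,470 employees with 35 features
2. **Data Quality:** No missing values, no duplicate rows
3. **Attrition Rate:** 16.12% (237 of 1,470 employees), slightly above the 10-15% range often quoted as an industry benchmark; the target classes are imbalanced at roughly 5:1
4. **Data Types:** Mix of numeric (`int64`) and categorical (`object`) columns
5. **Ready for Cleaning:** Minimal preprocessing needed; `EmployeeCount` and `StandardHours` are constant (standard deviation 0) and are candidates for removal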

---

**Next Step:** Proceed to `02_data_cleaning.ipynb`