# AReL Data Analytics Capstone – Final Project (14th–22nd July 2025)


This capstone project aims to analyze the effectiveness of youth skills training
programs in Kakuma Refugee Camp by examining their impact on employment
outcomes. Students will apply data analytics techniques to understand which skill
types are most in demand, the earning potential associated with different skills, and
the performance of various training centers. The goal is to derive actionable insights
that can inform program improvements and better support youth in securing
livelihoods. The project will involve data cleaning, transformation, and visualization,
culminating in a comprehensive Power BI dashboard and a professional
presentation of findings.

## Data Cleaning

In [1]:
# Import pandas and NumPy for data manipulation in Python

import pandas as pd

import numpy as np

print(f"\n Pandas and numpy successfully imported")


 Pandas and numpy successfully imported


In [2]:
# Load the dataset and preview the first 10 rows

df = pd.read_csv("datasets/kakuma_skills_training.csv")  # forward slash avoids the invalid "\k" escape and works on every OS

df.head(10)


Unnamed: 0,Trainee_ID,Gender,Age,Training_Center,Skill_Type,Completed_Training,Employment_Status,Monthly_Income_KES
0,TR0001,Male,37.0,Windle Trust,Computing,Yes,Unemployed,2000.0
1,TR0002,Female,32.0,,Hairdressing,,,0.0
2,TR0003,Male,24.0,LWF Skills Hub,Computing,Yes,Unemployed,2000.0
3,TR0004,Male,16.0,LWF Skills Hub,Agriculture,No,,5000.0
4,TR0005,Male,36.0,NRC Center,Tailoring,Yes,Employed,2000.0
5,TR0006,,35.0,NRC Center,Hairdressing,Yes,Self-Employed,7000.0
6,TR0007,Male,28.0,Windle Trust,Tailoring,No,Employed,2000.0
7,TR0008,Male,43.0,LWF Skills Hub,Hairdressing,Yes,Unemployed,5000.0
8,TR0009,Male,31.0,LWF Skills Hub,Agriculture,Yes,Unemployed,5000.0
9,TR0010,Female,44.0,LWF Skills Hub,Tailoring,No,Unemployed,5000.0


In [3]:
df.isnull().sum()  # count the missing values in each column

Trainee_ID              0
Gender                100
Age                   100
Training_Center       100
Skill_Type            100
Completed_Training    100
Employment_Status     100
Monthly_Income_KES    100
dtype: int64
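
The counts above are easier to judge next to the share of rows they represent (100 of the 1000 rows is 10%). A minimal sketch of such a summary follows; the `toy` frame and its values are made up for illustration and stand in for `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only)
toy = pd.DataFrame({
    "Trainee_ID": ["TR0001", "TR0002", "TR0003", "TR0004"],
    "Age": [37.0, np.nan, 24.0, 16.0],
    "Gender": ["Male", "Female", None, "Male"],
})

# Count and percentage of missing values per column
missing_summary = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "percent": (toy.isnull().mean() * 100).round(1),
})

print(missing_summary)
```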

In [4]:
# Count the missing values in the columns handled first: "Age", "Employment_Status" and "Monthly_Income_KES".

df[["Age", "Employment_Status", "Monthly_Income_KES"]].isnull().sum()


Age                   100
Employment_Status     100
Monthly_Income_KES    100
dtype: int64

In [5]:
# a) Handle missing Age by filling the gaps with the median age.

df_clean = df.copy()   # work on a copy so the raw data stays unchanged


df_clean["Age"] = pd.to_numeric(df_clean["Age"])   # make sure Age is numeric (it stays float64 because of the NaN values)


In [6]:
median_age = df_clean["Age"].median()      # calculate the median age; median() skips NaN values by default
median_age

30.0

In [7]:

df_clean["Age"]=df_clean["Age"].fillna(median_age)  # And fill the missing age with the calculated median age

In [8]:
df_clean.isnull().sum()   # Age should now show 0 missing values (filled with the median age, 30.0)

Trainee_ID              0
Gender                100
Age                     0
Training_Center       100
Skill_Type            100
Completed_Training    100
Employment_Status     100
Monthly_Income_KES    100
dtype: int64
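
Once the gaps are filled, imputed ages can no longer be told apart from recorded ones. One possible way to keep that information is a flag column created before the fill. This is only a sketch on made-up values; the `Age_Imputed` column name is an assumption and is not part of the project dataset.

```python
import numpy as np
import pandas as pd

# Toy ages with two gaps (illustrative values)
toy = pd.DataFrame({"Age": [37.0, np.nan, 24.0, 16.0, np.nan]})

# Remember which rows were missing before they are filled
toy["Age_Imputed"] = toy["Age"].isnull()

# median() ignores NaN, so only the recorded ages are used
median_age = toy["Age"].median()
toy["Age"] = toy["Age"].fillna(median_age)

print(toy)
```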

In [9]:
# b) Handle missing Employment_Status by filling the NaN values with "Unknown"

df_clean["Employment_Status"] = df_clean["Employment_Status"].fillna("Unknown")


In [10]:
df_clean["Employment_Status"].head(5)

0    Unemployed
1       Unknown
2    Unemployed
3       Unknown
4      Employed
Name: Employment_Status, dtype: object

In [11]:
# Rename the Monthly_Income_KES column to Monthly_Income

df_clean.rename(columns={"Monthly_Income_KES": "Monthly_Income"}, inplace=True)  # renaming "Monthly_Income_KES" field to  "Monthly_Income"

# Check out the changes

print(df_clean.columns) 


Index(['Trainee_ID', 'Gender', 'Age', 'Training_Center', 'Skill_Type',
       'Completed_Training', 'Employment_Status', 'Monthly_Income'],
      dtype='object')


In [12]:
# c) Handle missing monthly income

df_clean = df_clean.copy()  # create new copy 



In [13]:
df_clean["Monthly_Income"] = pd.to_numeric(df_clean["Monthly_Income"], errors="coerce")  # convert to numeric values; entries that cannot be parsed become NaN



In [14]:
median_income = df_clean["Monthly_Income"].median()  # Calculate median monthly income value 

median_income


2000.0

In [15]:
df_clean["Monthly_Income"] = df_clean["Monthly_Income"].fillna(median_income)   # Fill the missing values with the monthly median income

In [16]:
print(df_clean["Monthly_Income"].isnull().sum())   # check again for the missing values in monthly_income column

0


In [17]:

df_clean['Monthly_Income'] = df_clean["Monthly_Income"].replace(0.0, np.nan) # Replace 0.0 with NaN

  

In [18]:
# Fill the incomes that were 0.0 (now NaN) with median_income.
# Note: median_income was calculated in cell 14, while the 0.0 values were still in the column.

df_clean["Monthly_Income"] = df_clean["Monthly_Income"].fillna(median_income)


In [19]:
df_clean.isnull().sum()

Trainee_ID              0
Gender                100
Age                     0
Training_Center       100
Skill_Type            100
Completed_Training    100
Employment_Status       0
Monthly_Income          0
dtype: int64
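
In the cells above the median is calculated before the 0.0 values are replaced, so the zeros take part in the median. The sketch below shows how much the order can matter. It assumes that 0.0 marks an unrecorded income rather than a real zero income, which may not hold for unemployed trainees; the numbers are made up.

```python
import numpy as np
import pandas as pd

# Toy incomes: three zeros and one NaN (illustrative values)
incomes = pd.Series([2000.0, 0.0, 0.0, 0.0, 5000.0, np.nan, 7000.0])

# Order used in the notebook: median first, zeros still included
median_with_zeros = incomes.median()

# Alternative order: treat zeros as missing, then take the median
incomes_no_zero = incomes.replace(0.0, np.nan)
median_without_zeros = incomes_no_zero.median()

filled = incomes_no_zero.fillna(median_without_zeros)

print(median_with_zeros, median_without_zeros)
```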

In [20]:
# Ensure consistent data types and formats across all relevant columns
# (e.g., numerical values for age and income, categorical for skill type and completion status).


df_clean["Skill_Type"] = df_clean["Skill_Type"].astype("category")   # Convert Skill_Type to categorical for cleaner analysis



In [21]:

df_clean["Completed_Training"] = df_clean["Completed_Training"].astype('category')  # Convert Completed_Training to categorical for cleaner analysis


In [22]:
 
df_clean["Skill_Type"] = df_clean["Skill_Type"].str.strip().str.title()  # clean up the strings (remove spaces, fix casing); .str methods return plain strings, so the category dtype from cell 20 is not kept


In [23]:
# Fill NaN (missing) values in the "Skill_Type" column by forward filling the last non-NaN value

df_clean["Skill_Type"] = df_clean["Skill_Type"].ffill()

In [24]:
df_clean["Completed_Training"] = df_clean["Completed_Training"].str.strip()  # remove stray spaces; this also turns the column back into plain strings

In [25]:
print(df_clean.dtypes)   # check updated types

Trainee_ID             object
Gender                 object
Age                   float64
Training_Center        object
Skill_Type             object
Completed_Training     object
Employment_Status      object
Monthly_Income        float64
dtype: object
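
The dtypes above list Skill_Type and Completed_Training as object even though cells 20 and 21 cast them to category. The likely reason is that the later `.str` calls return ordinary strings. A minimal sketch on made-up labels, cleaning the text first and casting to category last:

```python
import pandas as pd

# Toy labels with stray spaces and mixed casing (illustrative values)
raw = pd.Series([" computing", "Tailoring ", "hairdressing", None])

# 1) Clean the text while the column is still a plain string column
cleaned = raw.str.strip().str.title()

# 2) Cast to category as the final step so the dtype is kept
as_cat = cleaned.astype("category")

# Applying a .str method afterwards drops the category dtype again
round_trip = as_cat.str.strip()

print(as_cat.dtype)
print(round_trip.dtype)
```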


In [26]:
# Address any inconsistencies or errors in textual data (e.g., skill type descriptions, training center names).

df_clean["Skill_Type"].unique()


array(['Computing', 'Hairdressing', 'Agriculture', 'Tailoring',
       'Carpentry'], dtype=object)

In [27]:
# Standardize the Skill_Type labels using a replacement dictionary

df_clean["Skill_Type"] = df_clean["Skill_Type"].replace({
    
    "Hairdressing": "Hair Dressing",
    
    "Computing": "Computer Skills",
    
    "Tailoring ": "Tailoring",  # trailing space; the strip in cell 22 already removed it, so this entry is only a safeguard
    
})


In [28]:
df_clean["Skill_Type"].unique()  # check for the update of unique categorical values 

array(['Computer Skills', 'Hair Dressing', 'Agriculture', 'Tailoring',
       'Carpentry'], dtype=object)
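
Reading the output of `unique()` by eye works for five labels. A small check against the expected label set can catch a stray spelling automatically. This is a sketch only: the `allowed` set is taken from the output above, and the misspelled "Tailorng" entry is a made-up example.

```python
import pandas as pd

# Labels expected after standardization (from the unique() output above)
allowed = {"Computer Skills", "Hair Dressing", "Agriculture", "Tailoring", "Carpentry"}

# Toy column containing one hypothetical typo
skills = pd.Series([
    "Computer Skills", "Tailoring", "Hair Dressing",
    "Carpentry", "Agriculture", "Tailorng",
])

# Any label outside the allowed set needs another cleaning pass
unexpected = sorted(set(skills.dropna().unique()) - allowed)

print(unexpected)
```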

In [29]:
# Forward fill the NaN values in the Training_Center column (each gap takes the value from the row above)

df_clean = df_clean.copy()

df_clean["Training_Center"] = df_clean["Training_Center"].ffill()



In [30]:
df_clean.head(10)

Unnamed: 0,Trainee_ID,Gender,Age,Training_Center,Skill_Type,Completed_Training,Employment_Status,Monthly_Income
0,TR0001,Male,37.0,Windle Trust,Computer Skills,Yes,Unemployed,2000.0
1,TR0002,Female,32.0,Windle Trust,Hair Dressing,,Unknown,2000.0
2,TR0003,Male,24.0,LWF Skills Hub,Computer Skills,Yes,Unemployed,2000.0
3,TR0004,Male,16.0,LWF Skills Hub,Agriculture,No,Unknown,5000.0
4,TR0005,Male,36.0,NRC Center,Tailoring,Yes,Employed,2000.0
5,TR0006,,35.0,NRC Center,Hair Dressing,Yes,Self-Employed,7000.0
6,TR0007,Male,28.0,Windle Trust,Tailoring,No,Employed,2000.0
7,TR0008,Male,43.0,LWF Skills Hub,Hair Dressing,Yes,Unemployed,5000.0
8,TR0009,Male,31.0,LWF Skills Hub,Agriculture,Yes,Unemployed,5000.0
9,TR0010,Female,44.0,LWF Skills Hub,Tailoring,No,Unemployed,5000.0


In [31]:
df_clean["Completed_Training"] = df_clean["Completed_Training"].ffill()

In [32]:
df_clean.isnull().sum()  # check that the NaN values were forward filled

Trainee_ID              0
Gender                100
Age                     0
Training_Center         0
Skill_Type              0
Completed_Training      0
Employment_Status       0
Monthly_Income          0
dtype: int64

In [33]:
df_clean.head(20)

Unnamed: 0,Trainee_ID,Gender,Age,Training_Center,Skill_Type,Completed_Training,Employment_Status,Monthly_Income
0,TR0001,Male,37.0,Windle Trust,Computer Skills,Yes,Unemployed,2000.0
1,TR0002,Female,32.0,Windle Trust,Hair Dressing,Yes,Unknown,2000.0
2,TR0003,Male,24.0,LWF Skills Hub,Computer Skills,Yes,Unemployed,2000.0
3,TR0004,Male,16.0,LWF Skills Hub,Agriculture,No,Unknown,5000.0
4,TR0005,Male,36.0,NRC Center,Tailoring,Yes,Employed,2000.0
5,TR0006,,35.0,NRC Center,Hair Dressing,Yes,Self-Employed,7000.0
6,TR0007,Male,28.0,Windle Trust,Tailoring,No,Employed,2000.0
7,TR0008,Male,43.0,LWF Skills Hub,Hair Dressing,Yes,Unemployed,5000.0
8,TR0009,Male,31.0,LWF Skills Hub,Agriculture,Yes,Unemployed,5000.0
9,TR0010,Female,44.0,LWF Skills Hub,Tailoring,No,Unemployed,5000.0


In [34]:
# Fill NaN (missing) values in the "Gender" column by backward filling (each gap takes the next non-NaN value)

df_clean["Gender"] = df_clean["Gender"].bfill()

In [35]:
df_clean.isnull().sum()  # check that the NaN values were backward filled

Trainee_ID            0
Gender                0
Age                   0
Training_Center       0
Skill_Type            0
Completed_Training    0
Employment_Status     0
Monthly_Income        0
dtype: int64
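
Forward and backward fill copy a neighbouring row's value, so the result depends on the row order, and a gap in the first row (ffill) or last row (bfill) stays empty. The sketch below compares both with filling by the most frequent value. The values are made up, and mode filling is shown only as one possible alternative, not as the method used in this notebook.

```python
import pandas as pd

# Toy gender column with two gaps (illustrative values)
gender = pd.Series(["Male", None, "Female", None, "Female"])

ffilled = gender.ffill()   # each gap takes the previous row's value
bfilled = gender.bfill()   # each gap takes the next row's value

# Alternative: fill every gap with the most frequent value
mode_filled = gender.fillna(gender.mode().iloc[0])

# A gap in the first row has no previous value, so ffill leaves it empty
leading_gap = pd.Series([None, "Male"]).ffill()

print(ffilled.tolist())
print(bfilled.tolist())
print(mode_filled.tolist())
```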

In [36]:
df_clean.head(30)

Unnamed: 0,Trainee_ID,Gender,Age,Training_Center,Skill_Type,Completed_Training,Employment_Status,Monthly_Income
0,TR0001,Male,37.0,Windle Trust,Computer Skills,Yes,Unemployed,2000.0
1,TR0002,Female,32.0,Windle Trust,Hair Dressing,Yes,Unknown,2000.0
2,TR0003,Male,24.0,LWF Skills Hub,Computer Skills,Yes,Unemployed,2000.0
3,TR0004,Male,16.0,LWF Skills Hub,Agriculture,No,Unknown,5000.0
4,TR0005,Male,36.0,NRC Center,Tailoring,Yes,Employed,2000.0
5,TR0006,Male,35.0,NRC Center,Hair Dressing,Yes,Self-Employed,7000.0
6,TR0007,Male,28.0,Windle Trust,Tailoring,No,Employed,2000.0
7,TR0008,Male,43.0,LWF Skills Hub,Hair Dressing,Yes,Unemployed,5000.0
8,TR0009,Male,31.0,LWF Skills Hub,Agriculture,Yes,Unemployed,5000.0
9,TR0010,Female,44.0,LWF Skills Hub,Tailoring,No,Unemployed,5000.0


In [37]:
# Check for duplicates

print(f"Total rows: {len(df_clean)}")

print(f"Duplicate rows: {df_clean.duplicated().sum()}")

Total rows: 1000
Duplicate rows: 0
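
`duplicated()` without arguments only flags rows that match in every column. Since Trainee_ID is meant to identify a trainee, a check on that column alone can reveal repeated IDs with differing details. A minimal sketch on made-up rows:

```python
import pandas as pd

# Toy frame where one ID appears twice with different ages (illustrative values)
toy = pd.DataFrame({
    "Trainee_ID": ["TR0001", "TR0002", "TR0002"],
    "Age": [37.0, 32.0, 33.0],
})

# Rows identical in every column
full_row_dupes = toy.duplicated().sum()

# Rows that repeat an already seen Trainee_ID
id_dupes = toy.duplicated(subset="Trainee_ID").sum()

print(f"Full-row duplicates: {full_row_dupes}")
print(f"Repeated Trainee_ID values: {id_dupes}")
```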


## Data Transformation

In [76]:

df_transform = df_clean.copy()  # create a copy for data transformation

df_transform.head()


Unnamed: 0,Trainee_ID,Gender,Age,Training_Center,Skill_Type,Completed_Training,Employment_Status,Monthly_Income
0,TR0001,Male,37.0,Windle Trust,Computer Skills,Yes,Unemployed,2000.0
1,TR0002,Female,32.0,Windle Trust,Hair Dressing,Yes,Unknown,2000.0
2,TR0003,Male,24.0,LWF Skills Hub,Computer Skills,Yes,Unemployed,2000.0
3,TR0004,Male,16.0,LWF Skills Hub,Agriculture,No,Unknown,5000.0
4,TR0005,Male,36.0,NRC Center,Tailoring,Yes,Employed,2000.0


In [77]:
# Group trainees by Skill_Type to analyze performance and outcomes for specific vocational areas:
# trainee counts, completions, average age and median income per skill.

skill_summary = df_transform.groupby("Skill_Type").agg(
    
    Total_Trainees= ("Completed_Training", "count"),
    
    Completed= ("Completed_Training", lambda x: (x == "Yes").sum()),
    
    Avg_Age= ("Age", "mean"),
    
    Median_Income= ("Monthly_Income", "median")
    
)


In [78]:
# Calculate completion rate by Skill_Type

skill_summary["Completion_Rate (%)"] = round((skill_summary["Completed"] / skill_summary["Total_Trainees"]) * 100, 1)


In [79]:
skill_summary

Skill_Type,Total_Trainees,Completed,Avg_Age,Median_Income,Completion_Rate (%)
Agriculture,212,192,29.768868,2000.0,90.6
Carpentry,222,194,30.707207,2000.0,87.4
Computer Skills,166,138,28.283133,2000.0,83.1
Hair Dressing,204,174,30.240196,2000.0,85.3
Tailoring,196,156,29.852041,2000.0,79.6


In [80]:
# Group by Skill_Type and Gender, then compute mean income

income_summary = df_transform.groupby(["Skill_Type", "Gender"])["Monthly_Income"].mean().reset_index()
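
The aggregated column can be given its final name at aggregation time, and the result can be pivoted so that genders sit side by side for each skill. This is a sketch on made-up values; the `income_wide` name is an assumption used only here.

```python
import pandas as pd

# Toy incomes (illustrative values)
toy = pd.DataFrame({
    "Skill_Type": ["Tailoring", "Tailoring", "Carpentry", "Carpentry"],
    "Gender": ["Male", "Female", "Male", "Male"],
    "Monthly_Income": [2000.0, 5000.0, 7000.0, 5000.0],
})

# Named aggregation sets the output column name directly
income_summary = (
    toy.groupby(["Skill_Type", "Gender"], as_index=False)
    .agg(Avg_Monthly_Income=("Monthly_Income", "mean"))
)

# One row per skill, one column per gender; missing combinations become NaN
income_wide = income_summary.pivot(
    index="Skill_Type", columns="Gender", values="Avg_Monthly_Income"
)

print(income_wide)
```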


In [81]:
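# Rename the income column in df_transform.
# Note: income_summary from cell 80 keeps the name Monthly_Income, and the values in
# df_transform are still per-trainee incomes, not averages.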

df_transform = df_transform.rename(columns= {"Monthly_Income": "Avg_Monthly_Income"})


In [82]:
df_transform

Unnamed: 0,Trainee_ID,Gender,Age,Training_Center,Skill_Type,Completed_Training,Employment_Status,Avg_Monthly_Income
0,TR0001,Male,37.0,Windle Trust,Computer Skills,Yes,Unemployed,2000.0
1,TR0002,Female,32.0,Windle Trust,Hair Dressing,Yes,Unknown,2000.0
2,TR0003,Male,24.0,LWF Skills Hub,Computer Skills,Yes,Unemployed,2000.0
3,TR0004,Male,16.0,LWF Skills Hub,Agriculture,No,Unknown,5000.0
4,TR0005,Male,36.0,NRC Center,Tailoring,Yes,Employed,2000.0
...,...,...,...,...,...,...,...,...
995,TR0996,Male,30.0,LWF Skills Hub,Hair Dressing,Yes,Employed,5000.0
996,TR0997,Male,17.0,LWF Skills Hub,Hair Dressing,No,Self-Employed,1000.0
997,TR0998,Female,38.0,Windle Trust,Hair Dressing,No,Employed,2000.0
998,TR0999,Male,18.0,Windle Trust,Computer Skills,Yes,Self-Employed,5000.0


In [83]:
# Calculate completion rates for each Training_Center and Skill_Type.

# Share of "Yes" values within each Training_Center and Skill_Type group

training_centers_summary = df_transform.groupby(["Training_Center", "Skill_Type"])["Completed_Training"].apply(
    
    lambda x: round((x == 'Yes').sum() / x.count() * 100, 1)
    
).reset_index(name='Completion_Rate (%)')



In [84]:
training_centers_summary

Unnamed: 0,Training_Center,Skill_Type,Completion_Rate (%)
0,LWF Skills Hub,Agriculture,89.5
1,LWF Skills Hub,Carpentry,89.3
2,LWF Skills Hub,Computer Skills,88.6
3,LWF Skills Hub,Hair Dressing,85.1
4,LWF Skills Hub,Tailoring,77.4
5,NRC Center,Agriculture,95.1
6,NRC Center,Carpentry,89.3
7,NRC Center,Computer Skills,85.5
8,NRC Center,Hair Dressing,80.3
9,NRC Center,Tailoring,84.4
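
A long table like this is easier to compare once it is pivoted into a center by skill grid. The sketch below uses four of the rates shown above (Agriculture and Tailoring at LWF Skills Hub and NRC Center); the `wide` and `best_center` names are assumptions for illustration.

```python
import pandas as pd

# Four rows copied from the completion-rate output above
rates = pd.DataFrame({
    "Training_Center": ["LWF Skills Hub", "LWF Skills Hub", "NRC Center", "NRC Center"],
    "Skill_Type": ["Agriculture", "Tailoring", "Agriculture", "Tailoring"],
    "Completion_Rate (%)": [89.5, 77.4, 95.1, 84.4],
})

# Centers as rows, skills as columns
wide = rates.pivot(
    index="Training_Center", columns="Skill_Type", values="Completion_Rate (%)"
)

# For each skill, the center with the highest completion rate
best_center = wide.idxmax()

print(wide)
print(best_center)
```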


In [85]:
# Implement any other necessary transformations, such as categorizing
# age groups or standardizing income status categories.

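# Create age group bins. Intervals are left-closed (right=False), so the labels match
# the edges: age 18 falls in "18–25" and age 50 in "36–50".

bins = [0, 18, 26, 36, 51, 101]

labels = ["<18", "18–25", "26–35", "36–50", "50+"]

df_transform["Age_Group"] = pd.cut(df_transform["Age"], bins=bins, labels=labels, right=False)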



In [87]:
# View sample
df_transform[["Age", "Age_Group"]].head(10)

Unnamed: 0,Age,Age_Group
0,37.0,36–50
1,32.0,26–35
2,24.0,18–25
3,16.0,<18
4,36.0,36–50
5,35.0,26–35
6,28.0,26–35
7,43.0,36–50
8,31.0,26–35
9,44.0,36–50
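
`pd.cut` builds right-closed intervals by default, so how the bin edges are written decides where an age on a boundary lands. The sketch below compares right-closed edges `[0, 18, 25, 35, 50, 100]` with left-closed edges shifted by one, using made-up boundary ages:

```python
import pandas as pd

# Ages sitting on or next to the bin edges (illustrative values)
ages = pd.Series([17.0, 18.0, 25.0, 26.0, 50.0, 51.0])
labels = ["<18", "18–25", "26–35", "36–50", "50+"]

# Right-closed (default): intervals (0, 18], (18, 25], ... so 18 gets the "<18" label
right_closed = pd.cut(ages, bins=[0, 18, 25, 35, 50, 100], labels=labels)

# Left-closed: intervals [0, 18), [18, 26), ... so 18 gets the "18–25" label
left_closed = pd.cut(ages, bins=[0, 18, 26, 36, 51, 101], labels=labels, right=False)

print(pd.DataFrame({"Age": ages, "right_closed": right_closed, "left_closed": left_closed}))
```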


In [90]:
df_transform.head(10)

Unnamed: 0,Trainee_ID,Gender,Age,Training_Center,Skill_Type,Completed_Training,Employment_Status,Avg_Monthly_Income,Age_Group
0,TR0001,Male,37.0,Windle Trust,Computer Skills,Yes,Unemployed,2000.0,36–50
1,TR0002,Female,32.0,Windle Trust,Hair Dressing,Yes,Unknown,2000.0,26–35
2,TR0003,Male,24.0,LWF Skills Hub,Computer Skills,Yes,Unemployed,2000.0,18–25
3,TR0004,Male,16.0,LWF Skills Hub,Agriculture,No,Unknown,5000.0,<18
4,TR0005,Male,36.0,NRC Center,Tailoring,Yes,Employed,2000.0,36–50
5,TR0006,Male,35.0,NRC Center,Hair Dressing,Yes,Self-Employed,7000.0,26–35
6,TR0007,Male,28.0,Windle Trust,Tailoring,No,Employed,2000.0,26–35
7,TR0008,Male,43.0,LWF Skills Hub,Hair Dressing,Yes,Unemployed,5000.0,36–50
8,TR0009,Male,31.0,LWF Skills Hub,Agriculture,Yes,Unemployed,5000.0,26–35
9,TR0010,Female,44.0,LWF Skills Hub,Tailoring,No,Unemployed,5000.0,36–50


In [93]:
df_transform.to_csv('datasets/cleaned_kakuma_skills_training.csv', index=False)  # index=False avoids writing the row index as an extra unnamed column

print('Dataset exported successfully!!')

Dataset exported successfully!!
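
A quick round trip is one way to confirm what an export contains before it is loaded into Power BI. The sketch below writes a made-up frame to a temporary folder with and without the row index; the file names are placeholders, not the project paths.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for df_transform (illustrative values)
toy = pd.DataFrame({"Trainee_ID": ["TR0001", "TR0002"], "Age": [37.0, 32.0]})

with tempfile.TemporaryDirectory() as tmp:
    # Default export: the row index is written as an extra first column
    with_index_path = os.path.join(tmp, "with_index.csv")
    toy.to_csv(with_index_path)
    reloaded_with_index = pd.read_csv(with_index_path)

    # index=False: only the data columns are written
    clean_path = os.path.join(tmp, "without_index.csv")
    toy.to_csv(clean_path, index=False)
    reloaded_clean = pd.read_csv(clean_path)

print(list(reloaded_with_index.columns))
print(list(reloaded_clean.columns))
```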
