### Data Dictionary

Before proceeding with the analysis, let's review the data dictionary to understand the variables and indicators in the datasets.

Variables in kenyan_df:

    ref_area: ISO 3166-1 alpha-3 country code identifying the geographical location/region.
    ref_area.label: Full name of the country.
    source: Source of the data.
    ... (other variables listed)

Variables in population_df:

    ccode: Country code.
    name: Name of the country.
    num_code: Numeric code.
    
Indicators:

    Inactivity rate by sex and age -- ILO modelled estimates, Nov. 2022 (%)
    Unemployment rate by sex and age -- ILO modelled estimates, Nov. 2022 (%)
    Employment by sex and economic activity -- ILO modelled estimates, Nov. 2022 (thousands)
    ... (other indicators listed)    

In [152]:
# import libraries
import pandas as pd
import numpy as np
#ignore warning
import warnings
warnings.filterwarnings('ignore')


#### Data Importing

In [153]:
# a function to read any csv file.
def read_csv_file(file_path):
    dataframe = pd.read_csv(file_path)
    return dataframe



#### Understanding Population Data

In [154]:
population_df = read_csv_file("Datasets/Population_data.csv")
population_df

Unnamed: 0,ccode,name,num_code,year,age,sex,population
0,GHA,Ghana,288,2015,15,female,184146.56
1,GHA,Ghana,288,2015,15,male,193092.96
2,GHA,Ghana,288,2015,16,female,269206.88
3,GHA,Ghana,288,2015,16,male,282046.40
4,GHA,Ghana,288,2015,17,female,323915.68
...,...,...,...,...,...,...,...
3271,RWA,Rwanda,646,2040,33,male,143553.44
3272,RWA,Rwanda,646,2040,34,female,143317.12
3273,RWA,Rwanda,646,2040,34,male,138332.32
3274,RWA,Rwanda,646,2040,35,female,138679.04


In [155]:
#check for column names, dtypes and missing values
print(population_df.shape) # prints the dimension of the df
population_df.info()

(3276, 7)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3276 entries, 0 to 3275
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   ccode       3276 non-null   object 
 1   name        3276 non-null   object 
 2   num_code    3276 non-null   int64  
 3   year        3276 non-null   int64  
 4   age         3276 non-null   int64  
 5   sex         3276 non-null   object 
 6   population  3276 non-null   float64
dtypes: float64(1), int64(3), object(3)
memory usage: 179.3+ KB


From the `info()` output above:
- The dataset contains `3,276` rows and `7` columns.
- It holds demographic data by country: country codes and names, plus population counts broken down by year, age, and sex.
- Memory usage is approximately `179.3 KB`.
- Every column has `3,276` non-null values, so there are no missing values.
- It mixes categorical columns (`ccode`, `name`, `sex`) with numerical ones (`num_code`, `year`, `age`, `population`).
- The summary alone does not reveal which countries, years, or ages are covered; that requires inspecting the values themselves.
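
To see what the table actually covers, one option is to list the distinct values in the low-cardinality columns. The sketch below is illustrative only: `toy_df` is a hypothetical stand-in built from four rows of the preview above, not the real file, and the `coverage` dictionary is a name introduced here.

```python
import pandas as pd

# Hypothetical stand-in for population_df; rows copied from the preview above.
toy_df = pd.DataFrame({
    "ccode": ["GHA", "GHA", "RWA", "RWA"],
    "name": ["Ghana", "Ghana", "Rwanda", "Rwanda"],
    "year": [2015, 2015, 2040, 2040],
    "age": [15, 16, 34, 35],
    "sex": ["female", "male", "male", "female"],
    "population": [184146.56, 282046.40, 138332.32, 138679.04],
})

# Distinct countries, plus the span of years and ages present in the table.
coverage = {
    "countries": sorted(toy_df["ccode"].unique()),
    "years": (int(toy_df["year"].min()), int(toy_df["year"].max())),
    "ages": (int(toy_df["age"].min()), int(toy_df["age"].max())),
}
print(coverage)
```

Running the same three expressions on `population_df` would show the full list of countries and the year and age ranges before any filtering.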

For now, we filter for the **Kenyan youth population for the years 2015-2024**, which will serve as the base for our analysis.

In [156]:
# a function that filters for the Kenyan population, the youth bracket, and the years 2015-2024
def process_population_data(population, country_code):
    # Step 1: Filter population data for the given country (Kenya here).
    # .copy() gives an independent frame, so adding a column below does not
    # write to a slice of the original DataFrame.
    kenya_pop = population[population['ccode'] == country_code].copy()
    
    # Step 2: Create age groups using bins and labels.
    # With right=True the bins are (14, 24] and (24, 35].
    bins = [14, 24, 35]
    labels = ['15-24', '25-35']
    kenya_pop['age_group'] = pd.cut(kenya_pop['age'], bins=bins, labels=labels, right=True)
    
    # Step 3: Group by name, age group, sex, and year to sum up population
    grouped_pop_df = kenya_pop.groupby(['name', 'age_group', 'sex', 'year'])['population'].sum().reset_index()
    
    # Step 4: Filter years to include only 2015-2024
    years_to_filter = list(range(2015, 2025))
    filtered_df = grouped_pop_df[grouped_pop_df['year'].isin(years_to_filter)]
    
    return filtered_df

# Country code for Kenya
kenya_country_code = 'KEN'

# Call the function to process population data for Kenya
filtered_population = process_population_data(population_df, kenya_country_code)

# Display the first few rows of the processed data
filtered_population.head()


Unnamed: 0,name,age_group,sex,year,population
0,Kenya,15-24,female,2015,4742700.0
1,Kenya,15-24,female,2016,4887980.0
2,Kenya,15-24,female,2017,5033260.0
3,Kenya,15-24,female,2018,5178540.0
4,Kenya,15-24,female,2019,5323820.0


In [157]:
filtered_population.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 40 entries, 0 to 87
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   name        40 non-null     object  
 1   age_group   40 non-null     category
 2   sex         40 non-null     object  
 3   year        40 non-null     int64   
 4   population  40 non-null     float64 
dtypes: category(1), float64(1), int64(1), object(2)
memory usage: 1.7+ KB


The data has been reduced from `3,276` rows to `40`: 2 age groups × 2 sexes × 10 years (2015-2024).
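
The age grouping depends on how `pd.cut` treats bin edges. As a minimal sketch on a hand-picked set of boundary ages (the `ages` series is made up for illustration), the same bins and labels behave as follows:

```python
import pandas as pd

# Boundary ages chosen for illustration only.
ages = pd.Series([15, 24, 25, 35])

# Same bins and labels as in process_population_data; right=True makes each
# interval closed on the right, i.e. (14, 24] and (24, 35].
groups = pd.cut(ages, bins=[14, 24, 35], labels=["15-24", "25-35"], right=True)
print(groups.tolist())

# Expected number of rows after grouping: age groups x sexes x years.
expected_rows = 2 * 2 * len(range(2015, 2025))
print(expected_rows)
```

Ages 24 and 35 fall into the lower and upper group respectively because each bin includes its right edge.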

#### Understanding the Kenyan National Data

In [158]:
#import the kenyan dataset
kenyan_df = read_csv_file('Datasets/Kenya_National.csv')
#check the columns, missing values and dtypes
print(kenyan_df.info())

#prints the first few rows
kenyan_df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11869 entries, 0 to 11868
Data columns (total 22 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   ref_area              11869 non-null  object 
 1   ref_area.label        11869 non-null  object 
 2   source                11869 non-null  object 
 3   source.label          11869 non-null  object 
 4   indicator             11869 non-null  object 
 5   indicator.label       11869 non-null  object 
 6   sex                   11869 non-null  object 
 7   sex.label             11869 non-null  object 
 8   classif1              11869 non-null  object 
 9   classif1.label        11869 non-null  object 
 10  classif2              9703 non-null   object 
 11  classif2.label        9703 non-null   object 
 12  time                  11869 non-null  int64  
 13  obs_value             11278 non-null  float64
 14  obs_status            5816 non-null   object 
 15  obs_status.label   

Unnamed: 0,ref_area,ref_area.label,source,source.label,indicator,indicator.label,sex,sex.label,classif1,classif1.label,...,time,obs_value,obs_status,obs_status.label,note_classif,note_classif.label,note_indicator,note_indicator.label,note_source,note_source.label
0,KEN,Kenya,XA:1909,ILO - Modelled Estimates,EMP_2EMP_SEX_ECO_NB,Employment by sex and economic activity -- ILO...,SEX_T,Sex: Total,ECO_SECTOR_TOTAL,Economic activity (Broad sector): Total,...,2021,22755.313,,,,,,,,
1,KEN,Kenya,XA:1909,ILO - Modelled Estimates,EMP_2EMP_SEX_ECO_NB,Employment by sex and economic activity -- ILO...,SEX_T,Sex: Total,ECO_SECTOR_AGR,Economic activity (Broad sector): Agriculture,...,2021,7516.948,,,,,,,,
2,KEN,Kenya,XA:1909,ILO - Modelled Estimates,EMP_2EMP_SEX_ECO_NB,Employment by sex and economic activity -- ILO...,SEX_T,Sex: Total,ECO_SECTOR_IND,Economic activity (Broad sector): Industry,...,2021,3578.747,,,,,,,,
3,KEN,Kenya,XA:1909,ILO - Modelled Estimates,EMP_2EMP_SEX_ECO_NB,Employment by sex and economic activity -- ILO...,SEX_T,Sex: Total,ECO_SECTOR_SER,Economic activity (Broad sector): Services,...,2021,11659.617,,,,,,,,
4,KEN,Kenya,XA:1909,ILO - Modelled Estimates,EMP_2EMP_SEX_ECO_NB,Employment by sex and economic activity -- ILO...,SEX_M,Sex: Male,ECO_SECTOR_TOTAL,Economic activity (Broad sector): Total,...,2021,11502.002,,,,,,,,


- The DataFrame contains `11,869` entries and `22` columns.
- It includes a mix of categorical and numerical data, with labels and notes attached to different categories.
- There are missing values in columns such as `classif2`, `obs_value`, `obs_status`, `note_classif`, `note_indicator`, and `note_source`, which might need handling depending on the analysis.
- The data is organized around observations of different indicators over time, with dimensions such as reference area, source, sex, and classification categories.
- The `time` column likely represents the time period (here, the year) of each observation, and the `obs_value` column holds the observed numerical value.
- Columns with `.label` in their names provide descriptive labels for the corresponding coded columns.
- The note-related columns carry additional textual information, possibly explaining classifications, indicators, and sources.
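
Because the `info()` printout is cut off, a per-column count of missing values is a more direct check. The following is a sketch on a hypothetical miniature frame (`toy_kenyan`, with made-up `classif2` values), not on the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of kenyan_df with a few of its columns.
toy_kenyan = pd.DataFrame({
    "indicator": ["EMP_2EMP_SEX_ECO_NB"] * 4,
    "classif2": ["A", None, "B", None],  # placeholder category codes
    "obs_value": [22755.313, 7516.948, np.nan, 11659.617],
})

# Count and percentage of missing values per column, largest first.
missing = toy_kenyan.isna().sum().sort_values(ascending=False)
missing_share = (missing / len(toy_kenyan) * 100).round(1)
print(pd.DataFrame({"missing": missing, "percent": missing_share}))
```

Applied to `kenyan_df`, the same two lines would list all 22 columns, including those missing from the truncated printout.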

#### 2. Apply Inactivity Rate by Age and Gender

We'll apply the ILO inactivity rate by age and gender to the corresponding population for 2015-2024 to get the total inactive population.

In [159]:
def process_kenya_data(data, years_to_filter):
    # Step 1: Filtering the data for specific age groups
    data = data[data['classif1.label'].isin(['Age (Youth, adults): 15-24', 'Age (Youth, adults): 25+'])]
    
    # Step 2: Further filtering for specific genders.
    # .copy() gives an independent frame, so the column updates below do not
    # write to a slice of the original DataFrame.
    data = data[data['sex.label'].isin(['Sex: Male', 'Sex: Female'])].copy()

    # Step 3: Cleaning and standardizing age group labels.
    # Note: the ILO '25+' rate is relabelled '25-35' so it can be matched to the
    # 25-35 population group, i.e. the 25+ rate is used as a proxy for ages 25-35.
    data['classif1.label'] = data['classif1.label'].str.replace('Age (Youth, adults): 25+', '25-35', regex=False).str.strip()
    data['classif1.label'] = data['classif1.label'].str.replace('Age (Youth, adults): 15-24', '15-24', regex=False).str.strip()

    # Step 4: Cleaning and standardizing gender labels
    data['sex.label'] = data['sex.label'].str.replace('Sex: Male', 'male', regex=False).str.strip()
    data['sex.label'] = data['sex.label'].str.replace('Sex: Female', 'female', regex=False).str.strip()

    # Step 5: Filtering data based on specified years
    data = data[data['time'].isin(years_to_filter)]
    
    return data

# Example usage:
years_to_filter = list(range(2015, 2025))
filtered_kenyan_df = process_kenya_data(kenyan_df, years_to_filter)
#print(filtered_kenyan_df.info())

In [160]:
def calculate_inactive_population(data, data2):
    # Step 6: Filtering data for a specific indicator
    inactivity_sex_age = data[data['indicator.label'] == 'Inactivity rate by sex and age -- ILO modelled estimates, Nov. 2022 (%)']

    # Step 7: Creating a mapping dictionary based on specific columns
    obs_value_map = inactivity_sex_age.set_index(['classif1.label', 'time', 'sex.label'])['obs_value'].to_dict()

    # Step 8: Mapping values to calculate 'ILO inactive share'
    data2['ILO_inactive_share'] = data2.set_index(['age_group', 'year', 'sex']).index.map(obs_value_map).astype(float)
    
    # Step 9: Calculating 'Total inactive population'
    data2['total_inactive_population'] = data2['population'] * data2['ILO_inactive_share'] / 100

    # Step 10: Returning the processed DataFrame
    return data2

# Example usage:
filtered_inactive_pop = calculate_inactive_population(filtered_kenyan_df, filtered_population)
filtered_inactive_pop.head()

Unnamed: 0,name,age_group,sex,year,population,ILO_inactive_share,total_inactive_population
0,Kenya,15-24,female,2015,4742700.0,55.928,2652497.0
1,Kenya,15-24,female,2016,4887980.0,56.361,2754914.0
2,Kenya,15-24,female,2017,5033260.0,56.906,2864227.0
3,Kenya,15-24,female,2018,5178540.0,57.463,2975744.0
4,Kenya,15-24,female,2019,5323820.0,58.043,3090105.0


In [161]:
# checking for null values
filtered_inactive_pop.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 40 entries, 0 to 87
Data columns (total 7 columns):
 #   Column                     Non-Null Count  Dtype   
---  ------                     --------------  -----   
 0   name                       40 non-null     object  
 1   age_group                  40 non-null     category
 2   sex                        40 non-null     object  
 3   year                       40 non-null     int64   
 4   population                 40 non-null     float64 
 5   ILO_inactive_share         40 non-null     float64 
 6   total_inactive_population  40 non-null     float64 
dtypes: category(1), float64(3), int64(1), object(2)
memory usage: 2.3+ KB


After the mapping, two new columns (`ILO_inactive_share` and `total_inactive_population`) have been added to the population data, and neither contains missing values.
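
The dictionary-based lookup can also be written as a left merge on the three key columns. This is an alternative sketch, not the notebook's own method. The `rates` and `pop` frames are hypothetical miniatures built from the first two rows of the output above, and `validate="many_to_one"` is an added safeguard that raises an error if a key appears more than once in the rates table.

```python
import pandas as pd

# Hypothetical miniature of the filtered ILO rates (values from the output above).
rates = pd.DataFrame({
    "classif1.label": ["15-24", "15-24"],
    "sex.label": ["female", "female"],
    "time": [2015, 2016],
    "obs_value": [55.928, 56.361],
})

# Hypothetical miniature of the grouped population table.
pop = pd.DataFrame({
    "age_group": ["15-24", "15-24"],
    "sex": ["female", "female"],
    "year": [2015, 2016],
    "population": [4742700.0, 4887980.0],
})

# Left merge keeps every population row; unmatched keys would show up as NaN.
merged = pop.merge(
    rates.rename(columns={"obs_value": "ILO_inactive_share"}),
    how="left",
    left_on=["age_group", "sex", "year"],
    right_on=["classif1.label", "sex.label", "time"],
    validate="many_to_one",
).drop(columns=["classif1.label", "sex.label", "time"])

# Inactive population = population x inactivity rate (%) / 100.
merged["total_inactive_population"] = (
    merged["population"] * merged["ILO_inactive_share"] / 100
)
print(merged)
```

A merge leaves the input frames untouched and makes unmatched keys visible as `NaN` rows, which can be easier to audit than an index map.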

#### 3. Apply Unemployment Rate by Age and Gender

We'll apply the ILO unemployment rate by age and gender to the corresponding labour force (population minus inactive population) for 2015-2024 to get the total unemployed population.

In [162]:
def calculate_unemployment_data(data, data1):
    # Step 1: Filter data for the specific indicator
    unemployment_sex_age = data[data['indicator.label'] == 'Unemployment rate by sex and age -- ILO modelled estimates, Nov. 2022 (%)']
    
    # Step 2: Create a dictionary to map combination of attributes to 'obs_value'
    obs_value_map = unemployment_sex_age.set_index(['classif1.label', 'time', 'sex.label'])['obs_value'].to_dict()

    # Step 3: Map values to create 'ILO unemployed rate' column
    data1['ILO_unemployed_rate'] = data1.set_index(['age_group', 'year', 'sex']).index.map(obs_value_map)

    # Step 4: Calculate 'Total unemployed population' based on 'ILO unemployed rate', population, and inactive population
    data1['total_unemployed_population'] = data1['ILO_unemployed_rate'].astype(float) / 100 * (data1['population'] - data1['total_inactive_population'])

    # Step 5: Returning the updated DataFrame
    return data1

# usage:
filtered_unemployed_pop = calculate_unemployment_data(filtered_kenyan_df, filtered_inactive_pop)
filtered_unemployed_pop.head()

Unnamed: 0,name,age_group,sex,year,population,ILO_inactive_share,total_inactive_population,ILO_unemployed_rate,total_unemployed_population
0,Kenya,15-24,female,2015,4742700.0,55.928,2652497.0,7.339,153399.979382
1,Kenya,15-24,female,2016,4887980.0,56.361,2754914.0,7.371,157228.264801
2,Kenya,15-24,female,2017,5033260.0,56.906,2864227.0,9.304,201806.836312
3,Kenya,15-24,female,2018,5178540.0,57.463,2975744.0,11.192,246536.879053
4,Kenya,15-24,female,2019,5323820.0,58.043,3090105.0,13.175,294291.971987
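
As a sanity check, the first row of the output can be reproduced by hand. The helper below is a hypothetical name introduced for illustration. It mirrors the calculation in `calculate_unemployment_data`: the unemployment rate is applied to the labour force, which is the population minus the inactive population.

```python
def unemployed_count(population, inactivity_rate_pct, unemployment_rate_pct):
    """Unemployed persons = unemployment rate x labour force."""
    inactive = population * inactivity_rate_pct / 100
    labour_force = population - inactive
    return labour_force * unemployment_rate_pct / 100

# First row of the output above: Kenya, females aged 15-24, 2015.
value = unemployed_count(4742700.0, 55.928, 7.339)
print(round(value, 2))
```

The result agrees with the `total_unemployed_population` value of about 153,400 shown for that row.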
