### 'Financial Inclusion in Africa' dataset 

We are going to work on the 'Financial Inclusion in Africa' dataset, which was provided as part of the Financial Inclusion in Africa challenge hosted on the Zindi platform.

Dataset description: The dataset contains demographic information and the financial services used by approximately 33,600 individuals across East Africa. The ML model's role is to predict which individuals are most likely to have or use a bank account.

The term financial inclusion means that individuals and businesses have access to useful and affordable financial products and services that meet their needs (transactions, payments, savings, credit and insurance), delivered in a responsible and sustainable way.

➡️ Dataset link

https://i.imgur.com/UNUZ4zR.jpg

➡️Columns explanation


Instructions

Install the necessary packages
Import your data and perform a basic data exploration phase
Display general information about the dataset
Create a pandas profiling report to gain insights into the dataset
Handle missing and corrupted values
Remove duplicates, if they exist
Handle outliers, if they exist
Encode categorical features
Based on the previous data exploration, train and test a machine learning classifier
Create a Streamlit application (locally) and add input fields for your features and a validation button at the end of the form
Import your ML model into the Streamlit application and start making predictions given the provided feature values
Deploy your application on Streamlit Share:
Create GitHub and Streamlit Share accounts
Create a new git repo
Upload your local code to the newly created git repo
Log in to your Streamlit account and deploy your application from the git repo
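
The step "Import your ML model into the Streamlit application" implies saving the fitted model and reloading it inside the app. The following is a minimal sketch of that hand-off, not the notebook's actual code: the file name `model.joblib`, the two-feature toy model, and the helper `predict_from_inputs` are all hypothetical. In a Streamlit app, the `inputs` dictionary would typically be filled from widgets such as `st.number_input` inside an `st.form`, with `st.form_submit_button` acting as the validation button.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Tiny stand-in training set; the two features are illustrative only.
train = pd.DataFrame({
    "cellphone_access": [1, 0, 1, 0, 1, 0],
    "location_type":    [1, 0, 1, 0, 0, 1],
    "bank_account":     [1, 0, 1, 0, 1, 0],
})
feature_names = ["cellphone_access", "location_type"]
model = LogisticRegression().fit(train[feature_names], train["bank_account"])

# Persist the fitted model so the app can load it without retraining.
model_path = Path(tempfile.mkdtemp()) / "model.joblib"  # hypothetical file name
joblib.dump(model, model_path)


def predict_from_inputs(path, inputs):
    """Load the saved model and predict for one set of form inputs."""
    loaded = joblib.load(path)
    # Build a one-row frame and keep the column order used during training.
    row = pd.DataFrame([inputs])[feature_names]
    return int(loaded.predict(row)[0])


prediction = predict_from_inputs(
    model_path, {"cellphone_access": 1, "location_type": 1}
)
print("Predicted bank_account:", prediction)
```

Saving the whole fitted pipeline (preprocessing plus estimator) rather than the bare estimator would let the app accept raw feature values directly.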

In [2]:
# Import necessary libraries
import numpy as np  # Import NumPy for numerical operations
import pandas as pd  # Import Pandas for data manipulation
import matplotlib.pyplot as plt  # Import Matplotlib for plotting
import seaborn as sns  # Import Seaborn for enhanced plotting capabilities
from sklearn.cluster import KMeans  # Import KMeans from scikit-learn for clustering
from sklearn.preprocessing import StandardScaler  # Import StandardScaler from scikit-learn for data scaling


In [6]:
import pandas as pd 
df = pd.read_csv(r"Financial_inclusion_dataset.csv")
df

Unnamed: 0,country,year,uniqueid,bank_account,location_type,cellphone_access,household_size,age_of_respondent,gender_of_respondent,relationship_with_head,marital_status,education_level,job_type
0,Kenya,2018,uniqueid_1,Yes,Rural,Yes,3,24,Female,Spouse,Married/Living together,Secondary education,Self employed
1,Kenya,2018,uniqueid_2,No,Rural,No,5,70,Female,Head of Household,Widowed,No formal education,Government Dependent
2,Kenya,2018,uniqueid_3,Yes,Urban,Yes,5,26,Male,Other relative,Single/Never Married,Vocational/Specialised training,Self employed
3,Kenya,2018,uniqueid_4,No,Rural,Yes,5,34,Female,Head of Household,Married/Living together,Primary education,Formally employed Private
4,Kenya,2018,uniqueid_5,No,Urban,No,8,26,Male,Child,Single/Never Married,Primary education,Informally employed
...,...,...,...,...,...,...,...,...,...,...,...,...,...
23519,Uganda,2018,uniqueid_2113,No,Rural,Yes,4,48,Female,Head of Household,Divorced/Seperated,No formal education,Other Income
23520,Uganda,2018,uniqueid_2114,No,Rural,Yes,2,27,Female,Head of Household,Single/Never Married,Secondary education,Other Income
23521,Uganda,2018,uniqueid_2115,No,Rural,Yes,5,27,Female,Parent,Widowed,Primary education,Other Income
23522,Uganda,2018,uniqueid_2116,No,Urban,Yes,7,30,Female,Parent,Divorced/Seperated,Secondary education,Self employed


In [8]:
df.describe()

Unnamed: 0,year,household_size,age_of_respondent
count,23524.0,23524.0,23524.0
mean,2016.975939,3.797483,38.80522
std,0.847371,2.227613,16.520569
min,2016.0,1.0,16.0
25%,2016.0,2.0,26.0
50%,2017.0,3.0,35.0
75%,2018.0,5.0,49.0
max,2018.0,21.0,100.0


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23524 entries, 0 to 23523
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   country                 23524 non-null  object
 1   year                    23524 non-null  int64 
 2   uniqueid                23524 non-null  object
 3   bank_account            23524 non-null  object
 4   location_type           23524 non-null  object
 5   cellphone_access        23524 non-null  object
 6   household_size          23524 non-null  int64 
 7   age_of_respondent       23524 non-null  int64 
 8   gender_of_respondent    23524 non-null  object
 9   relationship_with_head  23524 non-null  object
 10  marital_status          23524 non-null  object
 11  education_level         23524 non-null  object
 12  job_type                23524 non-null  object
dtypes: int64(3), object(10)
memory usage: 2.3+ MB


In [12]:
df.isnull().sum()

country                   0
year                      0
uniqueid                  0
bank_account              0
location_type             0
cellphone_access          0
household_size            0
age_of_respondent         0
gender_of_respondent      0
relationship_with_head    0
marital_status            0
education_level           0
job_type                  0
dtype: int64
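
The instructions also ask for duplicate and outlier handling, which the cells here do not show. A minimal sketch of one common approach follows, run on a small synthetic frame rather than the real data. The 1.5 × IQR rule and the choice of `household_size` are assumptions made for illustration, not a step the notebook performs.

```python
import pandas as pd

# Small synthetic frame standing in for the survey data.
sample = pd.DataFrame({
    "uniqueid": ["u1", "u2", "u2", "u3", "u4", "u5"],
    "household_size": [3, 5, 5, 4, 2, 21],
    "age_of_respondent": [24, 70, 70, 26, 34, 30],
})

# Count and drop rows that are duplicated across every column.
n_duplicates = int(sample.duplicated().sum())
deduped = sample.drop_duplicates().reset_index(drop=True)

# Flag household_size values outside the 1.5 * IQR fences.
q1 = deduped["household_size"].quantile(0.25)
q3 = deduped["household_size"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = ~deduped["household_size"].between(lower, upper)

print("Duplicates removed:", n_duplicates)
print(deduped[is_outlier])
```

Whether flagged rows should be removed, capped, or kept depends on whether they are data errors or genuine large households.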

In [14]:
# Collapse marital status into two groups:
# 'Married' (married or living together) and 'Unmarried' (every other value)
df['marital_status_2'] = df['marital_status'].apply(lambda x: 'Married' if x == 'Married/Living together' else 'Unmarried')
df


Unnamed: 0,country,year,uniqueid,bank_account,location_type,cellphone_access,household_size,age_of_respondent,gender_of_respondent,relationship_with_head,marital_status,education_level,job_type,marital_status_2
0,Kenya,2018,uniqueid_1,Yes,Rural,Yes,3,24,Female,Spouse,Married/Living together,Secondary education,Self employed,Married
1,Kenya,2018,uniqueid_2,No,Rural,No,5,70,Female,Head of Household,Widowed,No formal education,Government Dependent,Unmarried
2,Kenya,2018,uniqueid_3,Yes,Urban,Yes,5,26,Male,Other relative,Single/Never Married,Vocational/Specialised training,Self employed,Unmarried
3,Kenya,2018,uniqueid_4,No,Rural,Yes,5,34,Female,Head of Household,Married/Living together,Primary education,Formally employed Private,Married
4,Kenya,2018,uniqueid_5,No,Urban,No,8,26,Male,Child,Single/Never Married,Primary education,Informally employed,Unmarried
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23519,Uganda,2018,uniqueid_2113,No,Rural,Yes,4,48,Female,Head of Household,Divorced/Seperated,No formal education,Other Income,Unmarried
23520,Uganda,2018,uniqueid_2114,No,Rural,Yes,2,27,Female,Head of Household,Single/Never Married,Secondary education,Other Income,Unmarried
23521,Uganda,2018,uniqueid_2115,No,Rural,Yes,5,27,Female,Parent,Widowed,Primary education,Other Income,Unmarried
23522,Uganda,2018,uniqueid_2116,No,Urban,Yes,7,30,Female,Parent,Divorced/Seperated,Secondary education,Self employed,Unmarried


In [16]:
# Drop the original 'marital_status' column now that 'marital_status_2' replaces it
df = df.drop(columns=['marital_status'])
df

Unnamed: 0,country,year,uniqueid,bank_account,location_type,cellphone_access,household_size,age_of_respondent,gender_of_respondent,relationship_with_head,education_level,job_type,marital_status_2
0,Kenya,2018,uniqueid_1,Yes,Rural,Yes,3,24,Female,Spouse,Secondary education,Self employed,Married
1,Kenya,2018,uniqueid_2,No,Rural,No,5,70,Female,Head of Household,No formal education,Government Dependent,Unmarried
2,Kenya,2018,uniqueid_3,Yes,Urban,Yes,5,26,Male,Other relative,Vocational/Specialised training,Self employed,Unmarried
3,Kenya,2018,uniqueid_4,No,Rural,Yes,5,34,Female,Head of Household,Primary education,Formally employed Private,Married
4,Kenya,2018,uniqueid_5,No,Urban,No,8,26,Male,Child,Primary education,Informally employed,Unmarried
...,...,...,...,...,...,...,...,...,...,...,...,...,...
23519,Uganda,2018,uniqueid_2113,No,Rural,Yes,4,48,Female,Head of Household,No formal education,Other Income,Unmarried
23520,Uganda,2018,uniqueid_2114,No,Rural,Yes,2,27,Female,Head of Household,Secondary education,Other Income,Unmarried
23521,Uganda,2018,uniqueid_2115,No,Rural,Yes,5,27,Female,Parent,Primary education,Other Income,Unmarried
23522,Uganda,2018,uniqueid_2116,No,Urban,Yes,7,30,Female,Parent,Secondary education,Self employed,Unmarried


In [18]:
df['gender_of_respondent'] = df['gender_of_respondent'].map({'Female': 0, 'Male': 1}).astype(float)
df

Unnamed: 0,country,year,uniqueid,bank_account,location_type,cellphone_access,household_size,age_of_respondent,gender_of_respondent,relationship_with_head,education_level,job_type,marital_status_2
0,Kenya,2018,uniqueid_1,Yes,Rural,Yes,3,24,0.0,Spouse,Secondary education,Self employed,Married
1,Kenya,2018,uniqueid_2,No,Rural,No,5,70,0.0,Head of Household,No formal education,Government Dependent,Unmarried
2,Kenya,2018,uniqueid_3,Yes,Urban,Yes,5,26,1.0,Other relative,Vocational/Specialised training,Self employed,Unmarried
3,Kenya,2018,uniqueid_4,No,Rural,Yes,5,34,0.0,Head of Household,Primary education,Formally employed Private,Married
4,Kenya,2018,uniqueid_5,No,Urban,No,8,26,1.0,Child,Primary education,Informally employed,Unmarried
...,...,...,...,...,...,...,...,...,...,...,...,...,...
23519,Uganda,2018,uniqueid_2113,No,Rural,Yes,4,48,0.0,Head of Household,No formal education,Other Income,Unmarried
23520,Uganda,2018,uniqueid_2114,No,Rural,Yes,2,27,0.0,Head of Household,Secondary education,Other Income,Unmarried
23521,Uganda,2018,uniqueid_2115,No,Rural,Yes,5,27,0.0,Parent,Primary education,Other Income,Unmarried
23522,Uganda,2018,uniqueid_2116,No,Urban,Yes,7,30,0.0,Parent,Secondary education,Self employed,Unmarried


In [20]:
df['location_type'] = df['location_type'].astype('category')
df

Unnamed: 0,country,year,uniqueid,bank_account,location_type,cellphone_access,household_size,age_of_respondent,gender_of_respondent,relationship_with_head,education_level,job_type,marital_status_2
0,Kenya,2018,uniqueid_1,Yes,Rural,Yes,3,24,0.0,Spouse,Secondary education,Self employed,Married
1,Kenya,2018,uniqueid_2,No,Rural,No,5,70,0.0,Head of Household,No formal education,Government Dependent,Unmarried
2,Kenya,2018,uniqueid_3,Yes,Urban,Yes,5,26,1.0,Other relative,Vocational/Specialised training,Self employed,Unmarried
3,Kenya,2018,uniqueid_4,No,Rural,Yes,5,34,0.0,Head of Household,Primary education,Formally employed Private,Married
4,Kenya,2018,uniqueid_5,No,Urban,No,8,26,1.0,Child,Primary education,Informally employed,Unmarried
...,...,...,...,...,...,...,...,...,...,...,...,...,...
23519,Uganda,2018,uniqueid_2113,No,Rural,Yes,4,48,0.0,Head of Household,No formal education,Other Income,Unmarried
23520,Uganda,2018,uniqueid_2114,No,Rural,Yes,2,27,0.0,Head of Household,Secondary education,Other Income,Unmarried
23521,Uganda,2018,uniqueid_2115,No,Rural,Yes,5,27,0.0,Parent,Primary education,Other Income,Unmarried
23522,Uganda,2018,uniqueid_2116,No,Urban,Yes,7,30,0.0,Parent,Secondary education,Self employed,Unmarried


In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23524 entries, 0 to 23523
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   country                 23524 non-null  object  
 1   year                    23524 non-null  int64   
 2   uniqueid                23524 non-null  object  
 3   bank_account            23524 non-null  object  
 4   location_type           23524 non-null  category
 5   cellphone_access        23524 non-null  object  
 6   household_size          23524 non-null  int64   
 7   age_of_respondent       23524 non-null  int64   
 8   gender_of_respondent    23524 non-null  float64 
 9   relationship_with_head  23524 non-null  object  
 10  education_level         23524 non-null  object  
 11  job_type                23524 non-null  object  
 12  marital_status_2        23524 non-null  object  
dtypes: category(1), float64(1), int64(3), object(8)
memory usage: 2.2+ MB


In [24]:
df['cellphone_access'] = df['cellphone_access'].map({'Yes': 1, 'No': 0}).astype(int)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23524 entries, 0 to 23523
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype   
---  ------                  --------------  -----   
 0   country                 23524 non-null  object  
 1   year                    23524 non-null  int64   
 2   uniqueid                23524 non-null  object  
 3   bank_account            23524 non-null  object  
 4   location_type           23524 non-null  category
 5   cellphone_access        23524 non-null  int64   
 6   household_size          23524 non-null  int64   
 7   age_of_respondent       23524 non-null  int64   
 8   gender_of_respondent    23524 non-null  float64 
 9   relationship_with_head  23524 non-null  object  
 10  education_level         23524 non-null  object  
 11  job_type                23524 non-null  object  
 12  marital_status_2        23524 non-null  object  
dtypes: category(1), float64(1), int64(4), object(7)
memory usage: 2.2+ MB


In [26]:
df['location_type'] = df['location_type'].map({'Rural': 0, 'Urban': 1}).astype(int)
df['bank_account_binary'] = df['bank_account'].map({'Yes': 1, 'No': 0})
df

Unnamed: 0,country,year,uniqueid,bank_account,location_type,cellphone_access,household_size,age_of_respondent,gender_of_respondent,relationship_with_head,education_level,job_type,marital_status_2,bank_account_binary
0,Kenya,2018,uniqueid_1,Yes,0,1,3,24,0.0,Spouse,Secondary education,Self employed,Married,1
1,Kenya,2018,uniqueid_2,No,0,0,5,70,0.0,Head of Household,No formal education,Government Dependent,Unmarried,0
2,Kenya,2018,uniqueid_3,Yes,1,1,5,26,1.0,Other relative,Vocational/Specialised training,Self employed,Unmarried,1
3,Kenya,2018,uniqueid_4,No,0,1,5,34,0.0,Head of Household,Primary education,Formally employed Private,Married,0
4,Kenya,2018,uniqueid_5,No,1,0,8,26,1.0,Child,Primary education,Informally employed,Unmarried,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23519,Uganda,2018,uniqueid_2113,No,0,1,4,48,0.0,Head of Household,No formal education,Other Income,Unmarried,0
23520,Uganda,2018,uniqueid_2114,No,0,1,2,27,0.0,Head of Household,Secondary education,Other Income,Unmarried,0
23521,Uganda,2018,uniqueid_2115,No,0,1,5,27,0.0,Parent,Primary education,Other Income,Unmarried,0
23522,Uganda,2018,uniqueid_2116,No,1,1,7,30,0.0,Parent,Secondary education,Self employed,Unmarried,0


In [28]:
df['household_size_to_age_ratio'] = df['household_size'] / df['age_of_respondent']
df

Unnamed: 0,country,year,uniqueid,bank_account,location_type,cellphone_access,household_size,age_of_respondent,gender_of_respondent,relationship_with_head,education_level,job_type,marital_status_2,bank_account_binary,household_size_to_age_ratio
0,Kenya,2018,uniqueid_1,Yes,0,1,3,24,0.0,Spouse,Secondary education,Self employed,Married,1,0.125000
1,Kenya,2018,uniqueid_2,No,0,0,5,70,0.0,Head of Household,No formal education,Government Dependent,Unmarried,0,0.071429
2,Kenya,2018,uniqueid_3,Yes,1,1,5,26,1.0,Other relative,Vocational/Specialised training,Self employed,Unmarried,1,0.192308
3,Kenya,2018,uniqueid_4,No,0,1,5,34,0.0,Head of Household,Primary education,Formally employed Private,Married,0,0.147059
4,Kenya,2018,uniqueid_5,No,1,0,8,26,1.0,Child,Primary education,Informally employed,Unmarried,0,0.307692
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23519,Uganda,2018,uniqueid_2113,No,0,1,4,48,0.0,Head of Household,No formal education,Other Income,Unmarried,0,0.083333
23520,Uganda,2018,uniqueid_2114,No,0,1,2,27,0.0,Head of Household,Secondary education,Other Income,Unmarried,0,0.074074
23521,Uganda,2018,uniqueid_2115,No,0,1,5,27,0.0,Parent,Primary education,Other Income,Unmarried,0,0.185185
23522,Uganda,2018,uniqueid_2116,No,1,1,7,30,0.0,Parent,Secondary education,Self employed,Unmarried,0,0.233333


In [30]:
education_dummies = pd.get_dummies(df['education_level'], drop_first=True)
df = pd.concat([df, education_dummies], axis=1)

In [32]:
job_type_dummies = pd.get_dummies(df['job_type'], drop_first=True)
df = pd.concat([df, job_type_dummies], axis=1)
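
The two cells above create dummy columns named after the raw category values, such as `Primary education` or `Self employed`. As an alternative sketch, `pd.get_dummies` can encode several columns at once and add a prefix, so each new column shows which feature it came from. The toy frame and the prefixes `education` and `job` are illustrative assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "education_level": ["Primary education", "No formal education", "Secondary education"],
    "job_type": ["Self employed", "No Income", "Self employed"],
})

# Encode both columns in one call; drop_first removes the first category
# (in sorted order) of each column to avoid a redundant dummy.
encoded = pd.get_dummies(
    toy,
    columns=["education_level", "job_type"],
    prefix=["education", "job"],
    drop_first=True,
    dtype=int,
)
print(encoded.columns.tolist())
```

With `drop_first=True`, a row with all zeros for a feature belongs to the dropped baseline category.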

In [34]:
df['cellphone_access_location_type_interaction'] = df['cellphone_access'] * df['location_type']
df

Unnamed: 0,country,year,uniqueid,bank_account,location_type,cellphone_access,household_size,age_of_respondent,gender_of_respondent,relationship_with_head,...,Farming and Fishing,Formally employed Government,Formally employed Private,Government Dependent,Informally employed,No Income,Other Income,Remittance Dependent,Self employed,cellphone_access_location_type_interaction
0,Kenya,2018,uniqueid_1,Yes,0,1,3,24,0.0,Spouse,...,False,False,False,False,False,False,False,False,True,0
1,Kenya,2018,uniqueid_2,No,0,0,5,70,0.0,Head of Household,...,False,False,False,True,False,False,False,False,False,0
2,Kenya,2018,uniqueid_3,Yes,1,1,5,26,1.0,Other relative,...,False,False,False,False,False,False,False,False,True,1
3,Kenya,2018,uniqueid_4,No,0,1,5,34,0.0,Head of Household,...,False,False,True,False,False,False,False,False,False,0
4,Kenya,2018,uniqueid_5,No,1,0,8,26,1.0,Child,...,False,False,False,False,True,False,False,False,False,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
23519,Uganda,2018,uniqueid_2113,No,0,1,4,48,0.0,Head of Household,...,False,False,False,False,False,False,True,False,False,0
23520,Uganda,2018,uniqueid_2114,No,0,1,2,27,0.0,Head of Household,...,False,False,False,False,False,False,True,False,False,0
23521,Uganda,2018,uniqueid_2115,No,0,1,5,27,0.0,Parent,...,False,False,False,False,False,False,True,False,False,0
23522,Uganda,2018,uniqueid_2116,No,1,1,7,30,0.0,Parent,...,False,False,False,False,False,False,False,False,True,1


In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23524 entries, 0 to 23523
Data columns (total 30 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   country                                     23524 non-null  object 
 1   year                                        23524 non-null  int64  
 2   uniqueid                                    23524 non-null  object 
 3   bank_account                                23524 non-null  object 
 4   location_type                               23524 non-null  int64  
 5   cellphone_access                            23524 non-null  int64  
 6   household_size                              23524 non-null  int64  
 7   age_of_respondent                           23524 non-null  int64  
 8   gender_of_respondent                        23524 non-null  float64
 9   relationship_with_head                      23524 non-null  object 
 10  education_

In [38]:
selected_features=['bank_account_binary','year','location_type','cellphone_access','household_size','age_of_respondent','gender_of_respondent']
correlation_matrix = df[selected_features].corr()
correlation_matrix 

Unnamed: 0,bank_account_binary,year,location_type,cellphone_access,household_size,age_of_respondent,gender_of_respondent
bank_account_binary,1.0,0.112318,0.087288,0.209669,-0.028326,0.019429,0.117234
year,0.112318,1.0,0.214621,-0.066505,-0.052264,-0.01933,0.000317
location_type,0.087288,0.214621,1.0,-0.085238,-0.257284,-0.047373,0.012924
cellphone_access,0.209669,-0.066505,-0.085238,1.0,0.09136,-0.103611,0.10237
household_size,-0.028326,-0.052264,-0.257284,0.09136,1.0,-0.129729,0.014576
age_of_respondent,0.019429,-0.01933,-0.047373,-0.103611,-0.129729,1.0,0.012745
gender_of_respondent,0.117234,0.000317,0.012924,0.10237,0.014576,0.012745,1.0
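
To read the matrix above more easily, the correlations with the target can be ranked by absolute value. The sketch below copies the `bank_account_binary` row from the output into a Series. On the real DataFrame, the same Series could be obtained with `correlation_matrix['bank_account_binary'].drop('bank_account_binary')`.

```python
import pandas as pd

# Correlations with bank_account_binary, copied from the matrix above.
target_corr = pd.Series({
    "year": 0.112318,
    "location_type": 0.087288,
    "cellphone_access": 0.209669,
    "household_size": -0.028326,
    "age_of_respondent": 0.019429,
    "gender_of_respondent": 0.117234,
})

# Order features by the strength of the linear relationship, ignoring sign.
order = target_corr.abs().sort_values(ascending=False).index
ranked = target_corr.reindex(order)
print(ranked)
```

All of these values are small in magnitude, so no single feature in this subset is a strong linear predictor on its own.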


In [40]:
selected_features = [
    'bank_account_binary',
    'year',
    'location_type',
    'cellphone_access',
    'household_size',
    'age_of_respondent',
    'gender_of_respondent',
    'household_size_to_age_ratio',
    'cellphone_access_location_type_interaction',
    'Farming and Fishing',
    'Formally employed Government',
    'Formally employed Private',
    'Government Dependent',
    'Informally employed',
    'No Income',
    'Other Income',
    'Remittance Dependent',
    'Self employed',
    'Secondary education',
    'Vocational/Specialised training',
    'Primary education'
]

correlation_matrix = df[selected_features].corr()
print(correlation_matrix)

                                            bank_account_binary      year  \
bank_account_binary                                    1.000000  0.112318   
year                                                   0.112318  1.000000   
location_type                                          0.087288  0.214621   
cellphone_access                                       0.209669 -0.066505   
household_size                                        -0.028326 -0.052264   
age_of_respondent                                      0.019429 -0.019330   
gender_of_respondent                                   0.117234  0.000317   
household_size_to_age_ratio                           -0.074023 -0.021620   
cellphone_access_location_type_interaction             0.194925  0.188025   
Farming and Fishing                                   -0.037986 -0.248909   
Formally employed Government                           0.235900  0.056126   
Formally employed Private                              0.249478  0.094141   

In [42]:
import pandas as pd

pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.width', None)  # Adjust the display width

df.head(50)  # Show the first 50 rows of the dataframe

Unnamed: 0,country,year,uniqueid,bank_account,location_type,cellphone_access,household_size,age_of_respondent,gender_of_respondent,relationship_with_head,education_level,job_type,marital_status_2,bank_account_binary,household_size_to_age_ratio,Other/Dont know/RTA,Primary education,Secondary education,Tertiary education,Vocational/Specialised training,Farming and Fishing,Formally employed Government,Formally employed Private,Government Dependent,Informally employed,No Income,Other Income,Remittance Dependent,Self employed,cellphone_access_location_type_interaction
0,Kenya,2018,uniqueid_1,Yes,0,1,3,24,0.0,Spouse,Secondary education,Self employed,Married,1,0.125,False,False,True,False,False,False,False,False,False,False,False,False,False,True,0
1,Kenya,2018,uniqueid_2,No,0,0,5,70,0.0,Head of Household,No formal education,Government Dependent,Unmarried,0,0.071429,False,False,False,False,False,False,False,False,True,False,False,False,False,False,0
2,Kenya,2018,uniqueid_3,Yes,1,1,5,26,1.0,Other relative,Vocational/Specialised training,Self employed,Unmarried,1,0.192308,False,False,False,False,True,False,False,False,False,False,False,False,False,True,1
3,Kenya,2018,uniqueid_4,No,0,1,5,34,0.0,Head of Household,Primary education,Formally employed Private,Married,0,0.147059,False,True,False,False,False,False,False,True,False,False,False,False,False,False,0
4,Kenya,2018,uniqueid_5,No,1,0,8,26,1.0,Child,Primary education,Informally employed,Unmarried,0,0.307692,False,True,False,False,False,False,False,False,False,True,False,False,False,False,0
5,Kenya,2018,uniqueid_6,No,0,0,7,26,0.0,Spouse,Primary education,Informally employed,Married,0,0.269231,False,True,False,False,False,False,False,False,False,True,False,False,False,False,0
6,Kenya,2018,uniqueid_7,No,0,1,7,32,0.0,Spouse,Primary education,Self employed,Married,0,0.21875,False,True,False,False,False,False,False,False,False,False,False,False,False,True,0
7,Kenya,2018,uniqueid_8,No,0,1,1,42,0.0,Head of Household,Tertiary education,Formally employed Government,Married,0,0.02381,False,False,False,True,False,False,True,False,False,False,False,False,False,False,0
8,Kenya,2018,uniqueid_9,Yes,0,1,3,54,1.0,Head of Household,Secondary education,Farming and Fishing,Married,1,0.055556,False,False,True,False,False,True,False,False,False,False,False,False,False,False,0
9,Kenya,2018,uniqueid_10,No,1,1,3,76,0.0,Head of Household,No formal education,Remittance Dependent,Unmarried,0,0.039474,False,False,False,False,False,False,False,False,False,False,False,False,True,False,1


In [44]:
import pandas as pd

# Examine the distribution of values in the Other/Dont know/RTA column
print(df['Other/Dont know/RTA'].value_counts())

# Examine the relationship between the Other/Dont know/RTA column and other categorical columns
for column in df.columns:
    if df[column].dtype == 'object':
        print(f"Relationship between Other/Dont know/RTA and {column}:")
        print(pd.crosstab(df['Other/Dont know/RTA'], df[column]))

Other/Dont know/RTA
False    23489
True        35
Name: count, dtype: int64
Relationship between Other/Dont know/RTA and country:
country              Kenya  Rwanda  Tanzania  Uganda
Other/Dont know/RTA                                 
False                 6060    8717      6617    2095
True                     8      18         3       6
Relationship between Other/Dont know/RTA and uniqueid:
uniqueid             uniqueid_1  uniqueid_10  uniqueid_100  uniqueid_1000  \
Other/Dont know/RTA                                                         
False                         4            4             4              4   
True                          0            0             0              0   

uniqueid             uniqueid_1001  uniqueid_1002  uniqueid_1003  \
Other/Dont know/RTA                                                
False                            4              4              4   
True                             0              0              0   

uniqueid             

In [46]:
# Print out the unique countries in the dataset
unique_countries = df['country'].unique()

print("Unique countries in the dataset:")
print(unique_countries)

Unique countries in the dataset:
['Kenya' 'Rwanda' 'Tanzania' 'Uganda']


### Modeling

In [53]:
from sklearn.model_selection import train_test_split

# Define the target variable (y)
y = df['bank_account_binary']
# Select the candidate features (X). Selecting the columns directly is equivalent to
# dropping the unused ones first. Note that 'education_level' is still the raw text column.
X = df[[
    'cellphone_access', 'Formally employed Private', 'household_size_to_age_ratio',
    'Informally employed', 'location_type', 'cellphone_access_location_type_interaction',
    'education_level', 'Primary education', 'Secondary education', 'Tertiary education',
    'Vocational/Specialised training', 'Farming and Fishing', 'Formally employed Government',
    'Government Dependent', 'No Income', 'Other Income', 'Remittance Dependent', 'Self employed',
]]
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
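
If account holders turn out to be a minority class, a plain random split can leave the train and test sets with slightly different class ratios. A hedged sketch of a stratified split follows, using synthetic labels with roughly 14% positives. The data and the names `X_demo` and `y_demo` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))          # synthetic features
y_demo = np.array([1] * 140 + [0] * 860)     # 14% positive labels

# stratify=y_demo keeps the class ratio the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```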

In [57]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

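# Define the categorical columns (these are the boolean dummy columns created earlier).
# 'Formally employed Private' and 'Informally employed' are in X but not listed here.
categorical_columns = ['Primary education', 'Secondary education', 'Tertiary education', 'Vocational/Specialised training', 'Farming and Fishing', 'Formally employed Government', 'Government Dependent', 'No Income', 'Other Income', 'Remittance Dependent', 'Self employed']

# Define the numerical columns. NOTE: this list is never passed to the
# ColumnTransformer below, and 'education_level' is a text column, not a number.
numerical_columns = ['cellphone_access', 'household_size_to_age_ratio', 'location_type', 'cellphone_access_location_type_interaction', 'education_level']

# Create a column transformer to perform one-hot encoding on the categorical columns.
# With the default remainder='drop', every column not listed here is discarded,
# so the model is fit on the encoded dummy columns only.
column_transformer = ColumnTransformer(transformers=[('one_hot_encoder', OneHotEncoder(), categorical_columns)])

# Create a pipeline to perform the column transformation and linear regression.
# Linear regression treats the 0/1 target as a continuous value.
pipeline = Pipeline(steps=[('column_transformer', column_transformer), ('linear_regression', LinearRegression())])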

# Train the pipeline on the training data
pipeline.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = pipeline.predict(X_test)

# Evaluate the model using mean squared error and R-squared
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print ("LINEAR REGRESSION ")
print("Mean Squared Error:", mse)
print("R-squared:", r2)

LINEAR REGRESSION 
Mean Squared Error: 0.0944510702418298
R-squared: 0.19842476968273737
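
One detail of the pipeline above is worth checking: the `ColumnTransformer` lists only `categorical_columns`, and scikit-learn's default `remainder='drop'` discards every column that is not listed. The sketch below shows the difference on a toy frame with hypothetical column names. It illustrates the default behaviour and is not a change to the notebook's results.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

demo = pd.DataFrame({
    "job": ["a", "b", "a", "c"],
    "cellphone_access": [1, 0, 1, 1],
    "ratio": [0.12, 0.07, 0.19, 0.15],
})

# Default: columns not named in a transformer are dropped.
drop_rest = ColumnTransformer([("one_hot", OneHotEncoder(), ["job"])])

# remainder='passthrough' keeps the unlisted columns unchanged.
keep_rest = ColumnTransformer(
    [("one_hot", OneHotEncoder(), ["job"])],
    remainder="passthrough",
)

n_cols_drop = drop_rest.fit_transform(demo).shape[1]
n_cols_keep = keep_rest.fit_transform(demo).shape[1]
print("columns with remainder='drop':", n_cols_drop)
print("columns with remainder='passthrough':", n_cols_keep)
```

Passing through numeric columns would only work if they are all numeric, so a raw text column such as `education_level` would need to be encoded or excluded first.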


In [61]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Define the categorical columns
categorical_columns = ['Primary education', 'Secondary education', 'Tertiary education', 'Vocational/Specialised training', 'Farming and Fishing', 'Formally employed Government', 'Government Dependent', 'No Income', 'Other Income', 'Remittance Dependent', 'Self employed']

# Define the numerical columns
numerical_columns = ['cellphone_access', 'household_size_to_age_ratio', 'location_type', 'cellphone_access_location_type_interaction', 'education_level']

# Create a column transformer to perform one-hot encoding on the categorical columns
column_transformer = ColumnTransformer(transformers=[('one_hot_encoder', OneHotEncoder(), categorical_columns)])

# Create a pipeline to perform the column transformation and logistic regression
pipeline = Pipeline(steps=[('column_transformer', column_transformer), ('logistic_regression', LogisticRegression())])

# Train the pipeline on the training data
pipeline.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = pipeline.predict(X_test)

# Evaluate the model using accuracy score, classification report, and confusion matrix
print ("LOGISTIC REGRESSION ")
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

LOGISTIC REGRESSION 
Accuracy: 0.8786397449521786
Classification Report:
              precision    recall  f1-score   support

           0       0.90      0.97      0.93      4063
           1       0.62      0.29      0.39       642

    accuracy                           0.88      4705
   macro avg       0.76      0.63      0.66      4705
weighted avg       0.86      0.88      0.86      4705

Confusion Matrix:
[[3951  112]
 [ 459  183]]
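
The class 1 figures in the report can be reproduced by hand from the confusion matrix, which scikit-learn lays out as `[[TN, FP], [FN, TP]]`. This short worked calculation uses only the numbers printed above.

```python
# Values read from the confusion matrix above: rows are true labels,
# columns are predicted labels.
tn, fp = 3951, 112
fn, tp = 459, 183

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of predicted account holders, how many are correct
recall = tp / (tp + fn)      # of actual account holders, how many are found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The high accuracy is driven mostly by the majority class: 459 of the 642 actual account holders in the test set are missed.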


In [63]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Define the categorical columns
categorical_columns = ['Primary education', 'Secondary education', 'Tertiary education', 'Vocational/Specialised training', 'Farming and Fishing', 'Formally employed Government', 'Government Dependent', 'No Income', 'Other Income', 'Remittance Dependent', 'Self employed']

# Define the numerical columns
numerical_columns = ['cellphone_access', 'household_size_to_age_ratio', 'location_type', 'cellphone_access_location_type_interaction', 'education_level']

# Create a column transformer to perform one-hot encoding on the categorical columns
column_transformer = ColumnTransformer(transformers=[('one_hot_encoder', OneHotEncoder(), categorical_columns)])

# Create a pipeline to perform the column transformation and random forest classification
pipeline = Pipeline(steps=[('column_transformer', column_transformer), ('random_forest', RandomForestClassifier(n_estimators=100, random_state=42))])

# Train the pipeline on the training data
pipeline.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = pipeline.predict(X_test)

# Evaluate the model using accuracy score, classification report, and confusion matrix
print ("RANDOM FOREST CLASSIFIER")
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

RANDOM FOREST CLASSIFIER
Accuracy: 0.8805526036131774
Classification Report:
              precision    recall  f1-score   support

           0       0.90      0.97      0.93      4063
           1       0.64      0.29      0.40       642

    accuracy                           0.88      4705
   macro avg       0.77      0.63      0.67      4705
weighted avg       0.86      0.88      0.86      4705

Confusion Matrix:
[[3957  106]
 [ 456  186]]
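
The cell above defines `numerical_columns` but only hands `categorical_columns` to the `ColumnTransformer`. The sketch below shows, on a toy frame with hypothetical column names (`job`, `household_ratio`), how the `remainder` parameter decides whether unlisted columns reach the model. It is an illustration only; the reported results were produced with the default behaviour.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame with hypothetical column names (not the notebook's data)
toy = pd.DataFrame({
    "job": ["farm", "gov", "farm", "self"],
    "household_ratio": [0.10, 0.25, 0.40, 0.05],
})

# Default remainder='drop': only the one-hot encoded column survives
drop_rest = ColumnTransformer(
    transformers=[("one_hot_encoder", OneHotEncoder(), ["job"])]
)

# remainder='passthrough': unlisted columns are appended unchanged
keep_rest = ColumnTransformer(
    transformers=[("one_hot_encoder", OneHotEncoder(), ["job"])],
    remainder="passthrough",
)

shape_drop = drop_rest.fit_transform(toy).shape
shape_keep = keep_rest.fit_transform(toy).shape

print("remainder='drop'       ->", shape_drop)   # 3 one-hot columns
print("remainder='passthrough'->", shape_keep)   # 3 one-hot columns + 1 numeric
```

If the numerical features are meant to be used, passing `remainder='passthrough'` (or adding a second transformer for them) would be one way to do it; the metrics would then need to be recomputed.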


In [65]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Define the categorical columns
categorical_columns = ['Primary education', 'Secondary education', 'Tertiary education', 'Vocational/Specialised training', 'Farming and Fishing', 'Formally employed Government', 'Government Dependent', 'No Income', 'Other Income', 'Remittance Dependent', 'Self employed']

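# Define the numerical columns
# NOTE: this list is never passed to the ColumnTransformer below. With the default
# remainder='drop', any X_train column outside categorical_columns is dropped.
numerical_columns = ['cellphone_access', 'household_size_to_age_ratio', 'location_type', 'cellphone_access_location_type_interaction', 'education_level']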

# Create a column transformer to perform one-hot encoding on the categorical columns
column_transformer = ColumnTransformer(transformers=[('one_hot_encoder', OneHotEncoder(), categorical_columns)])

# Create a pipeline to perform the column transformation and gradient boosting classification
pipeline = Pipeline(steps=[('column_transformer', column_transformer), ('gradient_boosting', GradientBoostingClassifier(n_estimators=100, random_state=42))])

# Train the pipeline on the training data
pipeline.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = pipeline.predict(X_test)

# Evaluate the model using accuracy score, classification report, and confusion matrix
print("GRADIENT BOOSTING")
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

GRADIENT BOOSTING
Accuracy: 0.8805526036131774
Classification Report:
              precision    recall  f1-score   support

           0       0.90      0.97      0.93      4063
           1       0.64      0.29      0.40       642

    accuracy                           0.88      4705
   macro avg       0.77      0.63      0.67      4705
weighted avg       0.86      0.88      0.86      4705

Confusion Matrix:
[[3957  106]
 [ 456  186]]
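
The model cells in this section differ only in the final estimator, so the comparison can be written as one loop. The sketch below is a minimal version under stated assumptions: it uses synthetic imbalanced data from `make_classification` instead of the notebook's `X_train`/`X_test`, and a `StandardScaler` as a stand-in preprocessing step, so its numbers say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data (NOT the Zindi dataset)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.86, 0.14], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Same estimators and hyperparameters as the cells in this section
models = {
    "RANDOM FOREST CLASSIFIER": RandomForestClassifier(n_estimators=100, random_state=42),
    "GRADIENT BOOSTING": GradientBoostingClassifier(n_estimators=100, random_state=42),
    "SVM": SVC(kernel="rbf", C=1, random_state=42),
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

results = {}
for name, model in models.items():
    # One pipeline per estimator; the preprocessing step is shared in spirit
    pipe = Pipeline(steps=[("scaler", StandardScaler()), ("model", model)])
    pipe.fit(X_tr, y_tr)
    pred = pipe.predict(X_te)
    results[name] = {
        "accuracy": accuracy_score(y_te, pred),
        "recall_1": recall_score(y_te, pred, pos_label=1),
    }
    print(f"{name}: accuracy={results[name]['accuracy']:.3f}, "
          f"recall(1)={results[name]['recall_1']:.3f}")
```

Keeping the estimators in a dictionary means the column lists and the evaluation code are written once, which removes the risk of the cells drifting apart.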


In [67]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Define the categorical columns
categorical_columns = ['Primary education', 'Secondary education', 'Tertiary education', 'Vocational/Specialised training', 'Farming and Fishing', 'Formally employed Government', 'Government Dependent', 'No Income', 'Other Income', 'Remittance Dependent', 'Self employed']

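# Define the numerical columns
# NOTE: this list is never passed to the ColumnTransformer below. With the default
# remainder='drop', any X_train column outside categorical_columns is dropped.
numerical_columns = ['cellphone_access', 'household_size_to_age_ratio', 'location_type', 'cellphone_access_location_type_interaction', 'education_level']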

# Create a column transformer to perform one-hot encoding on the categorical columns
column_transformer = ColumnTransformer(transformers=[('one_hot_encoder', OneHotEncoder(), categorical_columns)])

# Create a pipeline to perform the column transformation and SVM classification
pipeline = Pipeline(steps=[('column_transformer', column_transformer), ('svm', SVC(kernel='rbf', C=1, random_state=42))])

# Train the pipeline on the training data
pipeline.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = pipeline.predict(X_test)

# Evaluate the model using accuracy score, classification report, and confusion matrix
print("SVM")
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

SVM
Accuracy: 0.8803400637619554
Classification Report:
              precision    recall  f1-score   support

           0       0.90      0.97      0.93      4063
           1       0.64      0.29      0.40       642

    accuracy                           0.88      4705
   macro avg       0.77      0.63      0.67      4705
weighted avg       0.86      0.88      0.86      4705

Confusion Matrix:
[[3957  106]
 [ 457  185]]
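
All reports so far show a class-1 recall near 0.29, with class 1 making up only 642 of 4705 test rows. One common option for imbalanced targets is the `class_weight='balanced'` argument of `SVC`. The sketch below compares a plain and a balanced SVM on synthetic data; it is an assumption that this would help on the notebook's data, and it has not been run there.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in with roughly the same class ratio (NOT the Zindi dataset)
X_imb, y_imb = make_classification(
    n_samples=2000, n_features=8, n_informative=4,
    weights=[0.86, 0.14], random_state=0,
)
X_imb_tr, X_imb_te, y_imb_tr, y_imb_te = train_test_split(
    X_imb, y_imb, test_size=0.2, stratify=y_imb, random_state=0
)

# Same kernel and C as the cell above; only class_weight differs
plain_svm = SVC(kernel="rbf", C=1, random_state=42).fit(X_imb_tr, y_imb_tr)
balanced_svm = SVC(kernel="rbf", C=1, class_weight="balanced",
                   random_state=42).fit(X_imb_tr, y_imb_tr)

pred_plain = plain_svm.predict(X_imb_te)
pred_balanced = balanced_svm.predict(X_imb_te)

recall_plain = recall_score(y_imb_te, pred_plain)
recall_balanced = recall_score(y_imb_te, pred_balanced)

print(f"plain    : recall(1)={recall_plain:.3f}, "
      f"precision(1)={precision_score(y_imb_te, pred_plain, zero_division=0):.3f}")
print(f"balanced : recall(1)={recall_balanced:.3f}, "
      f"precision(1)={precision_score(y_imb_te, pred_balanced, zero_division=0):.3f}")
```

Reweighting usually trades precision for recall, so both figures are printed side by side.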


In [69]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Define the categorical columns
categorical_columns = ['Primary education', 'Secondary education', 'Tertiary education', 'Vocational/Specialised training', 'Farming and Fishing', 'Formally employed Government', 'Government Dependent', 'No Income', 'Other Income', 'Remittance Dependent', 'Self employed']

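# Define the numerical columns
# NOTE: this list is never passed to the ColumnTransformer below. With the default
# remainder='drop', any X_train column outside categorical_columns is dropped.
numerical_columns = ['cellphone_access', 'household_size_to_age_ratio', 'location_type', 'cellphone_access_location_type_interaction', 'education_level']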

# Create a column transformer to perform one-hot encoding on the categorical columns
column_transformer = ColumnTransformer(transformers=[('one_hot_encoder', OneHotEncoder(), categorical_columns)])

# Create a pipeline to perform the column transformation and KNN classification
# KNeighborsClassifier has no random_state parameter, so none is passed
pipeline = Pipeline(steps=[('column_transformer', column_transformer), ('knn', KNeighborsClassifier(n_neighbors=5))])

# Train the pipeline on the training data
pipeline.fit(X_train, y_train)

# Make predictions on the testing data
y_pred = pipeline.predict(X_test)

# Evaluate the model using accuracy score, classification report, and confusion matrix
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("Classification Report:")
print(classification_report(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Accuracy: 0.8801275239107332
Classification Report:
              precision    recall  f1-score   support

           0       0.90      0.97      0.93      4063
           1       0.61      0.33      0.43       642

    accuracy                           0.88      4705
   macro avg       0.76      0.65      0.68      4705
weighted avg       0.86      0.88      0.86      4705

Confusion Matrix:
[[3931  132]
 [ 432  210]]
