# **Financial Inclusion in Africa**

This project follows the CRISP-DM methodology, which has six phases:
- Business Understanding
- Data Understanding
- Data Preparation
- Modeling
- Evaluation
- Deployment

## **Business Understanding**

**Brief Background**
- Access to a bank account is a good indicator of financial inclusion, since it lets individuals and businesses save and build their creditworthiness. Financial inclusion, in turn, supports long-term economic growth.

**Business Objective**
- The main objective of this project is to predict which individuals are most likely to have a bank account. This will show the state of financial inclusion in Kenya, Rwanda, Uganda, and Tanzania.

**Project Goal**
- To create a machine learning model that predicts whether an individual is likely to have or use a bank account.

**Project Timeline**

(estimated working time: ~3 hours a day)
- Data Understanding, Preparation - 2 days
- Modeling - 4 days
- Evaluation - 4 days
- Deployment - 1 week

Total Time - 2 1/2 weeks

## **Data Understanding**

Load Packages

In [1]:
# load packages

# for data handling and plotting
import pandas as pd
import numpy as np
# the plotting API lives in matplotlib.pyplot, not the top-level matplotlib package
import matplotlib.pyplot as plt
import seaborn as sns


Load the Data

In [3]:
# load data
test_data = pd.read_csv("Data/Test.csv")
train_data = pd.read_csv("Data/Train.csv")
samplesubmission = pd.read_csv("Data/SampleSubmission.csv")
vdefinitions = pd.read_csv("Data/VariableDefinitions.csv")

Examine Data

In [15]:
# check data shape
print("Train Data:", train_data.shape)
print("Test Data:", test_data.shape)

Train Data: (23524, 13)
Test Data: (10086, 12)


The train dataset has 23,524 rows and 13 columns: 12 input variables (including the `uniqueid` identifier) and 1 target variable, `bank_account`.

The test dataset has 10,086 rows and 12 columns; it contains the same variables except the `bank_account` target.
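The column counts alone do not show which column differs between the two sets. A minimal sketch of that check follows; `train_demo`, `test_demo`, and `column_difference` are hypothetical names, and the toy frames only stand in for `train_data` and `test_data`.

```python
import pandas as pd

# Toy frames standing in for train_data / test_data (illustrative values only)
train_demo = pd.DataFrame({
    "country": ["Kenya", "Rwanda"],
    "uniqueid": ["uniqueid_1", "uniqueid_2"],
    "bank_account": ["Yes", "No"],
    "age_of_respondent": [24, 70],
})
test_demo = train_demo.drop(columns="bank_account")


def column_difference(train, test):
    """Return (columns only in train, columns only in test), in original order."""
    only_train = [c for c in train.columns if c not in test.columns]
    only_test = [c for c in test.columns if c not in train.columns]
    return only_train, only_test


only_train, only_test = column_difference(train_demo, test_demo)
print("Only in train:", only_train)  # ['bank_account']
print("Only in test:", only_test)    # []
```

Calling `column_difference(train_data, test_data)` on the real frames would be expected to report only the target column, given the `info()` listings in this notebook.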

In [18]:
# inspect train_data
# look at first 5 rows
train_data.head()


| | country | year | uniqueid | bank_account | location_type | cellphone_access | household_size | age_of_respondent | gender_of_respondent | relationship_with_head | marital_status | education_level | job_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Kenya | 2018 | uniqueid_1 | Yes | Rural | Yes | 3 | 24 | Female | Spouse | Married/Living together | Secondary education | Self employed |
| 1 | Kenya | 2018 | uniqueid_2 | No | Rural | No | 5 | 70 | Female | Head of Household | Widowed | No formal education | Government Dependent |
| 2 | Kenya | 2018 | uniqueid_3 | Yes | Urban | Yes | 5 | 26 | Male | Other relative | Single/Never Married | Vocational/Specialised training | Self employed |
| 3 | Kenya | 2018 | uniqueid_4 | No | Rural | Yes | 5 | 34 | Female | Head of Household | Married/Living together | Primary education | Formally employed Private |
| 4 | Kenya | 2018 | uniqueid_5 | No | Urban | No | 8 | 26 | Male | Child | Single/Never Married | Primary education | Informally employed |


In [19]:
# inspect test_data
# look at first 5 rows
test_data.head()

| | country | year | uniqueid | location_type | cellphone_access | household_size | age_of_respondent | gender_of_respondent | relationship_with_head | marital_status | education_level | job_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Kenya | 2018 | uniqueid_6056 | Urban | Yes | 3 | 30 | Male | Head of Household | Married/Living together | Secondary education | Formally employed Government |
| 1 | Kenya | 2018 | uniqueid_6060 | Urban | Yes | 7 | 51 | Male | Head of Household | Married/Living together | Vocational/Specialised training | Formally employed Private |
| 2 | Kenya | 2018 | uniqueid_6065 | Rural | No | 3 | 77 | Female | Parent | Married/Living together | No formal education | Remittance Dependent |
| 3 | Kenya | 2018 | uniqueid_6072 | Rural | No | 6 | 39 | Female | Head of Household | Married/Living together | Primary education | Remittance Dependent |
| 4 | Kenya | 2018 | uniqueid_6073 | Urban | No | 3 | 16 | Male | Child | Single/Never Married | Secondary education | Remittance Dependent |


In [21]:
# check general data information
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23524 entries, 0 to 23523
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   country                 23524 non-null  object
 1   year                    23524 non-null  int64 
 2   uniqueid                23524 non-null  object
 3   bank_account            23524 non-null  object
 4   location_type           23524 non-null  object
 5   cellphone_access        23524 non-null  object
 6   household_size          23524 non-null  int64 
 7   age_of_respondent       23524 non-null  int64 
 8   gender_of_respondent    23524 non-null  object
 9   relationship_with_head  23524 non-null  object
 10  marital_status          23524 non-null  object
 11  education_level         23524 non-null  object
 12  job_type                23524 non-null  object
dtypes: int64(3), object(10)
memory usage: 2.3+ MB


In [22]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10086 entries, 0 to 10085
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   country                 10086 non-null  object
 1   year                    10086 non-null  int64 
 2   uniqueid                10086 non-null  object
 3   location_type           10086 non-null  object
 4   cellphone_access        10086 non-null  object
 5   household_size          10086 non-null  int64 
 6   age_of_respondent       10086 non-null  int64 
 7   gender_of_respondent    10086 non-null  object
 8   relationship_with_head  10086 non-null  object
 9   marital_status          10086 non-null  object
 10  education_level         10086 non-null  object
 11  job_type                10086 non-null  object
dtypes: int64(3), object(9)
memory usage: 945.7+ KB


The data types of the shared columns match between the train and test data and suit their contents. The `info()` output also shows no nulls in either set.
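Comparing two `info()` listings by eye is easy to get wrong, so the same comparison can be done programmatically. The sketch below runs on toy frames; `train_demo`, `test_demo`, `dtype_table`, and `mismatched` are hypothetical names, not part of the original notebook.

```python
import pandas as pd

# Toy frames standing in for train_data / test_data (illustrative values only)
train_demo = pd.DataFrame({
    "year": [2018, 2016],
    "country": ["Kenya", "Rwanda"],
    "bank_account": ["Yes", "No"],
})
test_demo = pd.DataFrame({
    "year": [2018, 2017],
    "country": ["Kenya", "Uganda"],
})

# Only columns present in both frames can be compared
shared = train_demo.columns.intersection(test_demo.columns)

# Side-by-side dtype names for each shared column
dtype_table = pd.DataFrame({
    "train": train_demo[shared].dtypes.astype(str),
    "test": test_demo[shared].dtypes.astype(str),
})

# Rows where the two dtypes disagree; an empty frame means they all match
mismatched = dtype_table[dtype_table["train"] != dtype_table["test"]]
print(dtype_table)
print("Mismatched columns:", list(mismatched.index))
```

An empty `mismatched` frame on the real data would back up the observation above without relying on visual inspection.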

In [25]:
# check for duplicates in train
train_data.duplicated().sum()

np.int64(0)

In [26]:
# check for duplicates in test
test_data.duplicated().sum()

np.int64(0)

There are no duplicate rows in either dataset.
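`duplicated()` with no arguments compares entire rows, so it does not reveal whether an identifier repeats. The data shown here does not say whether `uniqueid` is unique overall or only within a country, so the sketch below treats that as an open question and checks both on a toy frame. The frame `demo` and its values are made up for illustration.

```python
import pandas as pd

# Toy frame: the same uniqueid appears in two different countries (made-up values)
demo = pd.DataFrame({
    "country": ["Kenya", "Kenya", "Rwanda"],
    "uniqueid": ["uniqueid_1", "uniqueid_2", "uniqueid_1"],
})

# Full-row duplicates: every column must match
full_row_dupes = int(demo.duplicated().sum())

# Duplicates on the identifier alone
id_dupes = int(demo.duplicated(subset=["uniqueid"]).sum())

# Duplicates on the (country, uniqueid) pair
key_dupes = int(demo.duplicated(subset=["country", "uniqueid"]).sum())

print(full_row_dupes, id_dupes, key_dupes)  # 0 1 0
```

If the identifier turns out to repeat across countries in the real data, the pair `(country, uniqueid)` would be the safer key for joins and for building submission IDs.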

In [27]:
# check for missing values in train
train_data.isna().sum()

country                   0
year                      0
uniqueid                  0
bank_account              0
location_type             0
cellphone_access          0
household_size            0
age_of_respondent         0
gender_of_respondent      0
relationship_with_head    0
marital_status            0
education_level           0
job_type                  0
dtype: int64

In [28]:
# check for missing values in test
test_data.isna().sum()

country                   0
year                      0
uniqueid                  0
location_type             0
cellphone_access          0
household_size            0
age_of_respondent         0
gender_of_respondent      0
relationship_with_head    0
marital_status            0
education_level           0
job_type                  0
dtype: int64

Neither dataset has any nulls.
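`isna()` only counts true missing values (`NaN`/`None`); it does not catch blanks or text placeholders stored as ordinary strings. A minimal sketch of a complementary check follows. The token list `PLACEHOLDERS`, the helper `count_placeholders`, and the toy frame are all assumptions for illustration, not values observed in this dataset.

```python
import pandas as pd

# Hypothetical placeholder tokens; adjust after inspecting the real categories
PLACEHOLDERS = {"", "na", "n/a", "none", "unknown", "?"}

# Toy frame with one blank and one placeholder in a text column (made-up values)
demo = pd.DataFrame({
    "job_type": ["Self employed", " ", "unknown"],
    "household_size": [3, 5, 8],
})


def count_placeholders(df, tokens=PLACEHOLDERS):
    """Count placeholder-like entries in each non-numeric column."""
    counts = {}
    for col in df.select_dtypes(exclude="number").columns:
        # Normalise whitespace and case before matching against the tokens
        normalised = df[col].astype(str).str.strip().str.lower()
        counts[col] = int(normalised.isin(tokens).sum())
    return counts


print(count_placeholders(demo))  # {'job_type': 2}
```

Running `count_placeholders(train_data)` alongside `value_counts()` on each categorical column would show whether any category is effectively a missing value in disguise.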

In [29]:
# look at all the variables and their definitions
vdefinitions

| | Variable Definitions | Unnamed: 1 |
|---|---|---|
| 0 | country | Country interviewee is in. |
| 1 | year | Year survey was done in. |
| 2 | uniqueid | Unique identifier for each interviewee |
| 3 | location_type | Type of location: Rural, Urban |
| 4 | cellphone_access | If interviewee has access to a cellphone: Yes, No |
| 5 | household_size | Number of people living in one house |
| 6 | age_of_respondent | The age of the interviewee |
| 7 | gender_of_respondent | Gender of interviewee: Male, Female |
| 8 | relationship_with_head | The interviewee’s relationship with the head o... |
| 9 | marital_status | The martial status of the interviewee: Married... |


#### **Key Insights**
- Both datasets are clean
- There are no duplicate rows or nulls
- All variable names are lowercase and match across the two datasets
- Data types match and are correct

### **Brief EDA**

## **Data Preparation**