
<br>
<br>
<div>
<img  style="float: left; padding-right: 100px; width: 350px" src="./logo.png">
    </div>
    <br>
    <h3 align="center">AI4D LAB TRAINING</h3>
<hr>
<br>
<h4 align="center"><a href="https://nsoma.me">Zephania Reuben</a></h4>
<br>
<h4 align="center">July 18, 2023</h4>
<br>
<hr>
<h3 align="center">DATA SCIENCE | TASK</h3>
<hr>
<br>

<h3 align="center"> FINANCIAL INCLUSION IN AFRICA - ZINDI COMPETITION </h3>

### Understand The Problem Statement

Limited financial inclusion remains one of the main obstacles to economic and human development in Africa. For example, across Kenya, Rwanda, Tanzania, and Uganda only 9.1 million adults (or 13.9% of the adult population) have access to or use a commercial bank account.

Traditionally, access to bank accounts has been regarded as an indicator of financial inclusion. Despite the proliferation of mobile money in Africa and the growth of innovative fintech solutions, banks still play a pivotal role in facilitating access to financial services. Access to bank accounts enables households to save and make payments, while also helping businesses build up their creditworthiness and improve their access to other financial services. Therefore, access to bank accounts is an essential contributor to long-term economic growth.

The objective of this task is to build a machine learning model that predicts which individuals are most likely to have or use a bank account. The models and solutions developed can provide an indication of the state of financial inclusion in Kenya, Rwanda, Tanzania and Uganda, while providing insights into some of the key demographic factors that might drive individuals’ financial outcomes.

The data is available on the Zindi platform: [Zindi Africa](https://zindi.africa/competitions/financial-inclusion-in-africa)

### Hypothesis Generation
This is a very important stage in any data science/machine learning pipeline. It involves understanding the problem in detail by brainstorming as many factors as possible that could affect the outcome. It is done by studying the problem statement thoroughly, before looking at the data.

Below are some of the factors that I think can affect a person's chance of having a bank account:

- People who have a mobile phone are less likely to use a bank account because of mobile money services.
- People who are employed are more likely to have a bank account than people who are unemployed.
- People with a low level of education are less likely to have a bank account.
- People in rural areas are less likely to have a bank account.
- People under the age of 18 are less likely to have a bank account.
- Women are less likely to have a bank account.
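Once the data is loaded, each hypothesis can be checked by comparing the share of account holders across groups. The following is a minimal sketch, assuming the column names shown in the dataset; the rows are made up for illustration and the helper `account_rate_by` is hypothetical, not part of the notebook.

```python
import pandas as pd

# Illustrative rows only; in the notebook this would be the `train` frame.
sample = pd.DataFrame({
    "location_type": ["Rural", "Rural", "Urban", "Urban", "Rural", "Urban"],
    "gender_of_respondent": ["Female", "Male", "Female", "Male", "Female", "Male"],
    "bank_account": ["No", "No", "Yes", "Yes", "No", "No"],
})


def account_rate_by(df, column):
    """Return the share of respondents with a bank account in each group."""
    has_account = df["bank_account"].eq("Yes")
    return has_account.groupby(df[column]).mean()


rural_urban = account_rate_by(sample, "location_type")
by_gender = account_rate_by(sample, "gender_of_respondent")
print(rural_urban)
print(by_gender)
```

A large gap between groups supports the corresponding hypothesis; a small gap suggests the factor matters less than expected.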

In [2]:
#Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score

In [3]:
# Load files into a pandas dataframe
train = pd.read_csv("/home/jok/4Legends/Train.csv")

Credit: [Davis David](https://twitter.com/Davis_McDavid)

In [4]:
train.head()

Unnamed: 0,country,year,uniqueid,bank_account,location_type,cellphone_access,household_size,age_of_respondent,gender_of_respondent,relationship_with_head,marital_status,education_level,job_type
0,Kenya,2018,uniqueid_1,Yes,Rural,Yes,3,24,Female,Spouse,Married/Living together,Secondary education,Self employed
1,Kenya,2018,uniqueid_2,No,Rural,No,5,70,Female,Head of Household,Widowed,No formal education,Government Dependent
2,Kenya,2018,uniqueid_3,Yes,Urban,Yes,5,26,Male,Other relative,Single/Never Married,Vocational/Specialised training,Self employed
3,Kenya,2018,uniqueid_4,No,Rural,Yes,5,34,Female,Head of Household,Married/Living together,Primary education,Formally employed Private
4,Kenya,2018,uniqueid_5,No,Urban,No,8,26,Male,Child,Single/Never Married,Primary education,Informally employed


In [5]:
# Show the shape of the dataset (rows, columns)
print(train.shape)

(23524, 13)


In [6]:
train.head(6)

Unnamed: 0,country,year,uniqueid,bank_account,location_type,cellphone_access,household_size,age_of_respondent,gender_of_respondent,relationship_with_head,marital_status,education_level,job_type
0,Kenya,2018,uniqueid_1,Yes,Rural,Yes,3,24,Female,Spouse,Married/Living together,Secondary education,Self employed
1,Kenya,2018,uniqueid_2,No,Rural,No,5,70,Female,Head of Household,Widowed,No formal education,Government Dependent
2,Kenya,2018,uniqueid_3,Yes,Urban,Yes,5,26,Male,Other relative,Single/Never Married,Vocational/Specialised training,Self employed
3,Kenya,2018,uniqueid_4,No,Rural,Yes,5,34,Female,Head of Household,Married/Living together,Primary education,Formally employed Private
4,Kenya,2018,uniqueid_5,No,Urban,No,8,26,Male,Child,Single/Never Married,Primary education,Informally employed
5,Kenya,2018,uniqueid_6,No,Rural,No,7,26,Female,Spouse,Married/Living together,Primary education,Informally employed


In [7]:
#Information about the columns in the data
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23524 entries, 0 to 23523
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   country                 23524 non-null  object
 1   year                    23524 non-null  int64 
 2   uniqueid                23524 non-null  object
 3   bank_account            23524 non-null  object
 4   location_type           23524 non-null  object
 5   cellphone_access        23524 non-null  object
 6   household_size          23524 non-null  int64 
 7   age_of_respondent       23524 non-null  int64 
 8   gender_of_respondent    23524 non-null  object
 9   relationship_with_head  23524 non-null  object
 10  marital_status          23524 non-null  object
 11  education_level         23524 non-null  object
 12  job_type                23524 non-null  object
dtypes: int64(3), object(10)
memory usage: 2.3+ MB


In [8]:
# Convert categorical columns to integer codes.
# The result is assigned back instead of calling replace(inplace=True) on a column selection.
train['country'] = train['country'].replace(['Kenya','Rwanda','Tanzania','Uganda'],[0,1,2,3])
train['bank_account'] = train['bank_account'].replace(['No','Yes'],[0,1])
train['location_type'] = train['location_type'].replace(['Rural','Urban'],[0,1])
train['cellphone_access'] = train['cellphone_access'].replace(['No','Yes'],[0,1])
train['gender_of_respondent'] = train['gender_of_respondent'].replace(['Female','Male'],[0,1])
train['relationship_with_head'] = train['relationship_with_head'].replace(['Spouse','Head of Household','Other relative','Child','Parent','Other non-relatives'],[0,1,2,3,4,5])
train['marital_status'] = train['marital_status'].replace(['Married/Living together','Widowed','Single/Never Married','Divorced/Seperated','Dont know'],[0,1,2,3,4])
train['education_level'] = train['education_level'].replace(['Secondary education','No formal education','Vocational/Specialised training','Primary education','Tertiary education','Other/Dont know/RTA'],[0,1,2,3,4,5])
train['job_type'] = train['job_type'].replace(['Self employed','Government Dependent','Formally employed Private','Informally employed','Formally employed Government','Farming and Fishing','Remittance Dependent','Other Income','Dont Know/Refuse to answer','No Income'],[0,1,2,3,4,5,6,7,8,9])
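The integer codes above impose an arbitrary order on nominal columns such as `job_type` (code 9 is not "more" than code 0), and models like logistic regression and KNN will treat that order as meaningful. One common alternative is one-hot encoding. The sketch below uses `pd.get_dummies` on made-up rows; it illustrates the idea and is not what the notebook does.

```python
import pandas as pd

# Illustrative rows only; column names follow the dataset.
sample = pd.DataFrame({
    "job_type": ["Self employed", "No Income", "Farming and Fishing"],
    "age_of_respondent": [24, 70, 26],
})

# Each job_type category becomes its own 0/1 column.
encoded = pd.get_dummies(sample, columns=["job_type"], dtype=int)
print(encoded.columns.tolist())
```

If the train and test sets are encoded separately, their columns may need to be aligned afterwards, for example with `DataFrame.reindex(columns=..., fill_value=0)`.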

In [9]:
#Dropping uniqueid columns
trainee = train.drop(['uniqueid'],axis = 1)

In [10]:
trainee.head(10)

Unnamed: 0,country,year,bank_account,location_type,cellphone_access,household_size,age_of_respondent,gender_of_respondent,relationship_with_head,marital_status,education_level,job_type
0,0,2018,1,0,1,3,24,0,0,0,0,0
1,0,2018,0,0,0,5,70,0,1,1,1,1
2,0,2018,1,1,1,5,26,1,2,2,2,0
3,0,2018,0,0,1,5,34,0,1,0,3,2
4,0,2018,0,1,0,8,26,1,3,2,3,3
5,0,2018,0,0,0,7,26,0,0,0,3,3
6,0,2018,0,0,1,7,32,0,0,0,3,0
7,0,2018,0,0,1,1,42,0,1,0,4,4
8,0,2018,1,0,1,3,54,1,1,0,0,5
9,0,2018,0,1,1,3,76,0,1,3,1,6


In [11]:
X = trainee.drop('bank_account', axis=1)
y = trainee['bank_account']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
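The problem statement puts bank account access at 13.9% of adults, so the target classes may be imbalanced. If they are, passing `stratify=y` keeps the class proportions the same in both splits. A minimal sketch on synthetic labels (the arrays `X_demo` and `y_demo` are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 20% positives (illustrative only).
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 20 + [0] * 80)

# stratify=y_demo preserves the 20/80 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(y_tr.mean(), y_te.mean())
```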

In [12]:
# Train the models on the training split
# Logistic Regression
model = LogisticRegression()
model.fit(X_train, y_train)

# K-Nearest Neighbors (KNN)
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)

# Random Forest
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Decision Tree
dt_model = DecisionTreeClassifier(random_state=42)
dt_model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [13]:
# Predict using each model
log_pred=model.predict(X_test)
knn_pred = knn_model.predict(X_test)
rf_pred = rf_model.predict(X_test)
dt_pred = dt_model.predict(X_test)

In [14]:
#Showing the result of predictions
print('Logistic Prediction',log_pred)
print('KNN Prediction',knn_pred)
print('Random Forest Prediction',rf_pred)
print('Decision Tree Prediction',dt_pred)

Logistic Prediction [0 0 0 ... 0 0 0]
KNN Prediction [0 0 0 ... 0 0 0]
Random Forest Prediction [0 0 1 ... 0 0 0]
Decision Tree Prediction [0 0 1 ... 0 0 0]


In [15]:
# Evaluate each model on the held-out split.
# For 0/1 labels, mean squared error equals the misclassification rate, so lower is better.
meansquared_log=mean_squared_error(y_test,log_pred)
print('Performance of Logistic Regression',meansquared_log)
meansquared_knn=mean_squared_error(y_test,knn_pred)
print('Performance of KNN',meansquared_knn)
meansquared_rf=mean_squared_error(y_test,rf_pred)
print('Performance of Random Forest',meansquared_rf)
meansquared_dt=mean_squared_error(y_test,dt_pred)
print('Performance of Decision Tree',meansquared_dt)

Performance of Logistic Regression 0.13262486716259297
Performance of KNN 0.14027630180658873
Performance of Random Forest 0.13751328374070138
Performance of Decision Tree 0.15855472901168968
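Since both the labels and the predictions are 0 or 1, the mean squared error printed above is the share of wrong predictions, that is, 1 minus accuracy. The logistic regression value of about 0.133 therefore corresponds to an accuracy of about 0.867. The sketch below checks this identity on made-up labels and adds a confusion matrix, which shows how the errors split between the two classes.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, mean_squared_error

# Made-up labels and predictions for illustration.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([0, 0, 1, 0, 0, 0, 1, 0])

mse = mean_squared_error(y_true, y_pred)
acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class

print(mse, 1 - acc)
print(cm)
```

On an imbalanced target, accuracy alone can look good even when most account holders are missed, so the second row of the confusion matrix is worth inspecting.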


In [16]:
#Importing test data set
test = pd.read_csv("/home/jok/4Legends/Test.csv")

In [17]:
test.head()

Unnamed: 0,country,year,uniqueid,location_type,cellphone_access,household_size,age_of_respondent,gender_of_respondent,relationship_with_head,marital_status,education_level,job_type
0,Kenya,2018,uniqueid_6056,Urban,Yes,3,30,Male,Head of Household,Married/Living together,Secondary education,Formally employed Government
1,Kenya,2018,uniqueid_6060,Urban,Yes,7,51,Male,Head of Household,Married/Living together,Vocational/Specialised training,Formally employed Private
2,Kenya,2018,uniqueid_6065,Rural,No,3,77,Female,Parent,Married/Living together,No formal education,Remittance Dependent
3,Kenya,2018,uniqueid_6072,Rural,No,6,39,Female,Head of Household,Married/Living together,Primary education,Remittance Dependent
4,Kenya,2018,uniqueid_6073,Urban,No,3,16,Male,Child,Single/Never Married,Secondary education,Remittance Dependent


In [18]:
# Convert categorical columns to integer codes, using the same mapping as for the training data.
# The result is assigned back instead of calling replace(inplace=True) on a column selection.
test['country'] = test['country'].replace(['Kenya','Rwanda','Tanzania','Uganda'],[0,1,2,3])
test['location_type'] = test['location_type'].replace(['Rural','Urban'],[0,1])
test['cellphone_access'] = test['cellphone_access'].replace(['No','Yes'],[0,1])
test['gender_of_respondent'] = test['gender_of_respondent'].replace(['Female','Male'],[0,1])
test['relationship_with_head'] = test['relationship_with_head'].replace(['Spouse','Head of Household','Other relative','Child','Parent','Other non-relatives'],[0,1,2,3,4,5])
test['marital_status'] = test['marital_status'].replace(['Married/Living together','Widowed','Single/Never Married','Divorced/Seperated','Dont know'],[0,1,2,3,4])
test['education_level'] = test['education_level'].replace(['Secondary education','No formal education','Vocational/Specialised training','Primary education','Tertiary education','Other/Dont know/RTA'],[0,1,2,3,4,5])
test['job_type'] = test['job_type'].replace(['Self employed','Government Dependent','Formally employed Private','Informally employed','Formally employed Government','Farming and Fishing','Remittance Dependent','Other Income','Dont Know/Refuse to answer','No Income'],[0,1,2,3,4,5,6,7,8,9])
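The test set must be encoded with exactly the same codes as the training set, and repeating the lists by hand makes it easy for the two to drift apart. One option is to keep the mappings in a single dictionary and apply them through a helper. The sketch below covers only three of the columns; `MAPPINGS` and `encode` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# A subset of the notebook's mappings, written once as dictionaries.
MAPPINGS = {
    "location_type": {"Rural": 0, "Urban": 1},
    "cellphone_access": {"No": 0, "Yes": 1},
    "gender_of_respondent": {"Female": 0, "Male": 1},
}


def encode(df, mappings):
    """Return a copy of df with each mapped column converted to integer codes."""
    out = df.copy()
    for column, mapping in mappings.items():
        # Fail loudly on categories the mapping does not cover.
        unknown = set(out[column].unique()) - set(mapping)
        if unknown:
            raise ValueError(f"Unmapped values in {column}: {sorted(unknown)}")
        out[column] = out[column].map(mapping)
    return out


demo = pd.DataFrame({
    "location_type": ["Urban", "Rural"],
    "cellphone_access": ["Yes", "No"],
    "gender_of_respondent": ["Male", "Female"],
})
encoded_demo = encode(demo, MAPPINGS)
print(encoded_demo)
```

The same call can then be used for both frames, for example `encode(train, MAPPINGS)` and `encode(test, MAPPINGS)`, and the original frames stay unchanged.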

In [19]:
#Dropping uniqueid columns
tester = test.drop(['uniqueid'],axis = 1)

In [20]:
tester

Unnamed: 0,country,year,location_type,cellphone_access,household_size,age_of_respondent,gender_of_respondent,relationship_with_head,marital_status,education_level,job_type
0,0,2018,1,1,3,30,1,1,0,0,4
1,0,2018,1,1,7,51,1,1,0,2,2
2,0,2018,0,0,3,77,0,4,0,1,6
3,0,2018,0,0,6,39,0,1,0,3,6
4,0,2018,1,0,3,16,1,3,2,0,6
...,...,...,...,...,...,...,...,...,...,...,...
10081,3,2018,0,0,2,62,0,0,0,3,0
10082,3,2018,1,1,8,42,1,1,0,3,0
10083,3,2018,1,1,1,39,1,1,2,0,7
10084,3,2018,0,1,6,28,0,0,0,3,0


In [21]:
# Predict the values for the test data set
log_pred=model.predict(tester)
knn_pred = knn_model.predict(tester)
rf_pred = rf_model.predict(tester)
dt_pred = dt_model.predict(tester)


In [22]:
#Showing the result of predictions
print('Logistic Prediction',log_pred)
print('KNN Prediction',knn_pred)
print('Random Forest Prediction',rf_pred)
print('Decision Tree Prediction',dt_pred)

Logistic Prediction [0 0 0 ... 0 0 0]
KNN Prediction [1 1 0 ... 0 0 0]
Random Forest Prediction [1 1 0 ... 0 0 0]
Decision Tree Prediction [1 1 0 ... 0 0 0]


In [25]:
#Assigning prediction to variable mypredict
mypredict = dt_pred
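The validation errors printed earlier can guide which model's predictions to submit. A minimal sketch that picks the model with the lowest error, using the four values reported above; the dictionary keys are hypothetical labels chosen for this example.

```python
# Validation errors as printed by the evaluation cell (lower is better).
validation_error = {
    "logistic_regression": 0.13262486716259297,
    "knn": 0.14027630180658873,
    "random_forest": 0.13751328374070138,
    "decision_tree": 0.15855472901168968,
}

# Select the model name whose error is smallest.
best_model_name = min(validation_error, key=validation_error.get)
print(best_model_name)
```

By this measure the decision tree, whose predictions are assigned to `mypredict`, has the highest error of the four models.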

In [26]:
test["bank_account"]= mypredict

In [35]:
# Convert the country codes back to country names
test['country'] = test['country'].replace([0,1,2,3],['Kenya','Rwanda','Tanzania','Uganda'])

In [36]:
test.head()

Unnamed: 0,country,year,uniqueid,location_type,cellphone_access,household_size,age_of_respondent,gender_of_respondent,relationship_with_head,marital_status,education_level,job_type,bank_account
0,Kenya,2018,uniqueid_6056,1,1,3,30,1,1,0,0,4,1
1,Kenya,2018,uniqueid_6060,1,1,7,51,1,1,0,2,2,1
2,Kenya,2018,uniqueid_6065,0,0,3,77,0,4,0,1,6,0
3,Kenya,2018,uniqueid_6072,0,0,6,39,0,1,0,3,6,0
4,Kenya,2018,uniqueid_6073,1,0,3,16,1,3,2,0,6,0


In [37]:
# Select the columns needed for the Zindi submission
ready_for_zindi = test[['uniqueid','country','bank_account']]

In [38]:
# Display the columns selected for the Zindi submission
ready_for_zindi

Unnamed: 0,uniqueid,country,bank_account
0,uniqueid_6056,Kenya,1
1,uniqueid_6060,Kenya,1
2,uniqueid_6065,Kenya,0
3,uniqueid_6072,Kenya,0
4,uniqueid_6073,Kenya,0
...,...,...,...
10081,uniqueid_2998,Uganda,0
10082,uniqueid_2999,Uganda,0
10083,uniqueid_3000,Uganda,0
10084,uniqueid_3001,Uganda,0


In [39]:
#Exporting data to csv
ready_for_zindi.to_csv("/home/jok/4Legends/zindii.csv",sep=',',index=False,encoding = 'utf-8')