# Financial Inclusion in Africa Starter Notebook


This is a simple starter notebook to get started with the Financial Inclusion Competition on Zindi.

This notebook covers:
- Loading the data
- Simple EDA and an example of feature engineering
- Data preprocessing and data wrangling
- Creating a simple model
- Making a submission
- Some tips for improving your score

### Importing libraries

In [None]:
# dataframe and plotting
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# machine learning
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

import lightgbm as lgb

### 1. Load the dataset

In [None]:
# Load files into a pandas dataframe
train = pd.read_csv('Train.csv')
test = pd.read_csv('Test.csv')
ss = pd.read_csv('SampleSubmission.csv')
variables = pd.read_csv('VariableDefinitions.csv')

In [None]:
# Let’s observe the shape of our datasets.
print('train data shape :', train.shape)
print('test data shape :', test.shape)

The output above shows the number of rows and columns in the train and test datasets. The train dataset has 13 variables: 12 independent variables and 1 dependent variable (the target). The test dataset has only the 12 independent variables.

We can observe the first five rows from our data set by using the head() method from the pandas library.

In [None]:
# inspect train data
train.head()

In [None]:
# Check for missing values
print('missing values:', train.isnull().sum())

We don't have missing data in our dataset.



In [None]:
# Explore Target distribution 
sns.catplot(x="bank_account", kind="count", data=train)
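
The count plot shows the class balance visually; the same information can be read as numbers with `value_counts`. A minimal sketch on a toy column (the values below are made up for illustration and are not taken from Train.csv):

```python
import pandas as pd

# Toy stand-in for train["bank_account"]; the real column comes from Train.csv
toy = pd.DataFrame({"bank_account": ["No"] * 6 + ["Yes"] * 2})

# Absolute counts per class
counts = toy["bank_account"].value_counts()

# Share of each class, useful for spotting class imbalance
shares = toy["bank_account"].value_counts(normalize=True)

print(counts)
print(shares.round(2))
```

On the real data, replacing `toy` with `train` gives the class shares behind the plot.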

It is important to understand the meaning of each feature so you can really understand the dataset. You can read the VariableDefinitions.csv file to understand the meaning of each variable presented in the dataset.

The SampleSubmission.csv gives us an example of how our submission file should look. This file will contain the uniqueid column combined with the country name from the Test.csv file and the target we predict with our model. Once we have created this file, we will submit it to the competition page and obtain a position on the leaderboard.


In [None]:
# view the submission file
ss.head()

### 2. Understand the dataset
We can get more information about the features presented by using the info() method from pandas.


In [None]:
# show some information about the dataset
train.info()

The output lists each variable/feature with its non-null count and data type. The dataset has no missing values; 3 features are of integer data type and 10 features are of the object data type.

If you want to learn how to handle missing data in your dataset, we recommend reading [How to Handle Missing Data with Python](https://machinelearningmastery.com/handle-missing-data-python/) by Jason Brownlee.

We won’t go further in exploring the dataset because Davis has already published an article about exploratory data analysis (EDA) with the Financial Inclusion in Africa dataset. You can read the article and download the EDA notebook at the link below.

[Why you need to explore your data and how you can start](https://medium.com/analytics-vidhya/why-you-need-to-explore-your-data-how-you-can-start-13de6f29c8c1)

In [None]:
# Let's view the variables
variables

In [48]:
#import preprocessing module
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler

# Convert target label to numerical Data
le = LabelEncoder()
train['bank_account'] = le.fit_transform(train['bank_account'])

#Separate training features from target
X_train = train.drop(['bank_account'], axis=1)
y_train = train['bank_account']




# save the label encoder
import pickle
filename = '../savedModel/label_encoder.sav'
with open(filename, 'wb') as f:
    pickle.dump(le, f)

# NOTE: a MinMaxScaler is only worth saving after it has been fitted.
# Pickling the MinMaxScaler class itself (rather than a fitted instance)
# stores no learned min/max values, so no scaler is saved in this cell.



The target values have been transformed into numerical datatypes, **1** represents **‘Yes’** and **0** represents **‘No’**.
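
The mapping of **‘Yes’** to 1 and **‘No’** to 0 follows from how `LabelEncoder` works: it sorts the class labels and numbers them in that order. A small sketch on toy labels (the name `le_demo` is hypothetical and used only here):

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the bank_account column
labels = ["Yes", "No", "No", "Yes"]

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

# classes_ holds the sorted labels; the position of a label is its integer code
print(le_demo.classes_)
print(encoded)

# inverse_transform maps the integer codes back to the original labels
decoded = le_demo.inverse_transform(encoded)
print(decoded)
```

`inverse_transform` is handy later if you want to report predictions as ‘Yes’/‘No’ instead of 1/0.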

We have created a simple preprocessing function to:

*   Handle conversion of data types
*   Convert categorical features to numerical features by using [One-hot Encoder and Label Encoder](https://towardsdatascience.com/categorical-encoding-using-label-encoding-and-one-hot-encoder-911ef77fb5bd)
*   Drop uniqueid variable
*   Perform [feature scaling](https://towardsdatascience.com/preprocessing-with-sklearn-a-complete-and-comprehensive-guide-670cb98fcfb9).

The preprocessing function is applied to the independent variables of both the train and test datasets.

In [53]:
# function to preprocess our data before training models
def preprocessing_data(data, scaler=None):

    # categorical features to be converted to One Hot Encoding
    categ = [
        "relationship_with_head",
        "marital_status",
        "education_level",
        "job_type",
        "country"
        ]

    # One Hot Encoding conversion (returns a new DataFrame, the input is not modified)
    data = pd.get_dummies(data, prefix_sep="_", columns=categ)

    # Convert the following numerical features from integer to float
    num_cols = ["household_size", "age_of_respondent", "year"]
    data[num_cols] = data[num_cols].astype(float)

    # Label Encoder conversion; a fresh encoder per column keeps the
    # target encoder `le` untouched
    for col in ["location_type", "cellphone_access", "gender_of_respondent"]:
        data[col] = LabelEncoder().fit_transform(data[col])

    # drop uniqueid column
    data = data.drop(["uniqueid"], axis=1)

    # scale our data into the range 0 to 1;
    # fit a new scaler only when none is passed in
    if scaler is None:
        scaler = MinMaxScaler(feature_range=(0, 1))
        scaler.fit(data)
    data = scaler.transform(data)

    return data, scaler

Preprocess both the train and test datasets. The scaler is fitted on the train data only and reused for the test data, so both are scaled with the same minimum and maximum values.

In [None]:
# preprocess the train data and keep the fitted scaler
processed_train, scaler = preprocessing_data(X_train)
# preprocess the test data with the scaler fitted on the train data
processed_test, _ = preprocessing_data(test, scaler=scaler)

# save the fitted minmax scaler
with open('../savedModel/minmax_scaler.sav', 'wb') as f:
    pickle.dump(scaler, f)
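
One caveat when train and test are one-hot encoded in separate calls: `pd.get_dummies` only creates columns for the categories present in the frame it sees. If a category is missing from one of the two frames, the column sets no longer match. A minimal sketch of one way to guard against this, using toy frames (the names `train_toy` and `test_toy` and their values are illustrative assumptions):

```python
import pandas as pd

# Toy frames: the test frame lacks two categories that appear in train
train_toy = pd.DataFrame({"job_type": ["Farming", "Self employed", "Other"]})
test_toy = pd.DataFrame({"job_type": ["Farming", "Farming"]})

train_enc = pd.get_dummies(train_toy, columns=["job_type"], dtype=int)
test_enc = pd.get_dummies(test_toy, columns=["job_type"], dtype=int)

# Align test to the train columns; categories absent from test
# become all-zero columns, and column order matches train
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(test_enc)
```

Comparing the shapes of the processed train and test arrays is a quick check that the two column sets agree.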

Observe the first row in the train data.

In [57]:
# the first train row
print(processed_train[:1])

[[1.        0.        1.        0.1       0.0952381 0.        0.
  0.        0.        0.        0.        1.        0.        0.
  1.        0.        0.        0.        0.        0.        1.
  0.        0.        0.        0.        0.        0.        0.
  0.        0.        0.        0.        1.        1.        0.
  0.        0.       ]]


Observe the shape of the train data.

In [58]:
# shape of the processed train set
print(processed_train.shape)

(23524, 37)


Now we have more independent variables than before (37 variables). This doesn’t mean all of them are useful for training our model: ideally you select only the features that improve the model's performance. We will not apply any feature selection technique in this notebook; if you want to learn more about feature selection techniques, we recommend the following articles:


*   [Introduction to Feature Selection methods with an example (or how to select the right variables?)](https://www.analyticsvidhya.com/blog/2016/12/introduction-to-feature-selection-methods-with-an-example-or-how-to-select-the-right-variables/)
*   [The 5 Feature Selection Algorithms every Data Scientist should know](https://towardsdatascience.com/the-5-feature-selection-algorithms-every-data-scientist-need-to-know-3a6b566efd2)
*   [How to Choose a Feature Selection Method For Machine Learning](https://machinelearningmastery.com/feature-selection-with-real-and-categorical-data/)
*   [Feature Selection Techniques in Machine Learning with Python](https://towardsdatascience.com/feature-selection-techniques-in-machine-learning-with-python-f24e7da3f36e)
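
As a taste of what feature selection looks like in code, here is a hedged sketch using scikit-learn's `SelectKBest` with the chi-squared score on synthetic data. It is not part of this notebook's pipeline; the data, the choice of `chi2`, and `k=5` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the processed train features and target
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=3, random_state=0
)

# chi2 requires non-negative features, so scale to [0, 1] first
X_scaled = MinMaxScaler().fit_transform(X_demo)

# Keep the 5 features with the highest chi-squared score
selector = SelectKBest(score_func=chi2, k=5)
X_selected = selector.fit_transform(X_scaled, y_demo)

# Indices of the columns that were kept
kept = selector.get_support(indices=True)
print(X_selected.shape, kept)
```

Whether a reduced feature set actually helps should be judged on the validation error, not assumed.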

### 4. Model Building and Experiments
A portion of the train dataset will be held out to evaluate our models and pick the best-performing one before using it on the test dataset.


In [None]:
import sklearn.model_selection

In [None]:
# Split train_data
from sklearn.model_selection import train_test_split, GridSearchCV

X_Train, X_Val, y_Train, y_val = train_test_split(processed_train, y_train, stratify = y_train, test_size = 0.1, random_state=42)

Only 10% of the train dataset is held out for evaluating the models. The parameter stratify = y_train preserves the class proportions of the target, so the train and validation sets contain the same share of ‘yes’ and ‘no’ values as the full train dataset.
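
To see what stratification does, the sketch below splits a toy target with 20% positives and checks the positive share on each side of the split. The toy arrays are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 20% of them in the positive class
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.1, random_state=42
)

# With stratify, both parts keep the 20% positive share
print("train positives:", y_tr.mean())
print("validation positives:", y_va.mean())
```

Without `stratify`, a small validation set can end up with a noticeably different class share, which makes the validation error less representative.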

There are many models to choose from such as

*   [K Nearest Neighbors](https://www.datacamp.com/community/tutorials/k-nearest-neighbor-classification-scikit-learn)
*   [Logistic Regression](https://machinelearningmastery.com/logistic-regression-tutorial-for-machine-learning/)
*   [Random Forest](https://www.datacamp.com/community/tutorials/random-forests-classifier-python)


In this notebook we train a LightGBM classifier on the train split and evaluate it on the validation split; the models listed above can be trained and compared in the same way.
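
The candidate models listed above can be compared on equal footing with cross-validation. The following is a sketch on synthetic data, not a result for this competition; the hyperparameters and the dictionary names are assumptions chosen for brevity.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the processed train features and target
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    "knn": KNeighborsClassifier(),
    "logreg": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# 5-fold cross-validated error rate (1 - mean accuracy) per model
error_rates = {}
for name, model in candidates.items():
    acc = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    error_rates[name] = 1 - acc.mean()

for name, err in sorted(error_rates.items(), key=lambda kv: kv[1]):
    print(f"{name}: {err:.3f}")
```

Swapping in `processed_train` and `y_train` would give comparable numbers for the real data.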

In [None]:
X_Train[0]

In [None]:
# Create a LightGBM model
lgb_model = LGBMClassifier()

# Fit the model on the train split
lgb_model.fit(X_Train, y_Train)

# Make predictions on the validation set
y_pred = lgb_model.predict(X_Val)


The evaluation metric for this challenge will be the percentage of survey respondents for whom you predict the binary 'bank account' classification incorrectly.

This means the **lower** the incorrect percentage we get, the better the model performance.
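
Put differently, the error rate is the share of rows where the prediction differs from the true label, which equals one minus the accuracy. A tiny worked example with made-up labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels: one of the five predictions is wrong
y_true = np.array([1, 0, 0, 1, 0])
y_hat = np.array([1, 0, 1, 1, 0])

# Share of mismatches
error_rate = np.mean(y_true != y_hat)

# Same quantity expressed through accuracy
error_from_accuracy = 1 - accuracy_score(y_true, y_hat)

print(error_rate, error_from_accuracy)
```

Both expressions agree up to floating-point rounding.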

In [None]:
# import evaluation metrics
from sklearn.metrics import confusion_matrix, accuracy_score

# evaluate the model

# Get error rate
print("Error rate of LightGBM classifier: ", 1 - accuracy_score(y_val, y_pred))

In [None]:
# Get the predicted result for the test data
# (bracket assignment creates a real column; attribute assignment does not)
test["bank_account"] = lgb_model.predict(processed_test)

Then we create a submission file in the format shown in SampleSubmission.csv.


In [None]:
# Create submission DataFrame
submission = pd.DataFrame(
    {
        "uniqueid": test["uniqueid"] + " x " + test["country"],
        "bank_account": test.bank_account
    }
)

Let’s observe the sample results from our submission DataFrame.


In [None]:
# show five random rows
submission.sample(5)

Save results in the CSV file.


In [None]:
# Create the submission csv file
submission.to_csv('light_gbm_submission.csv', index = False)

In [None]:
# save the model for future inference
import joblib
joblib.dump(lgb_model, '../savedModel/light_gbm_model.pkl')
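
A saved model is only useful if it reloads to the same predictor, so it is worth checking a round trip once. The sketch below does this with `joblib` in a temporary directory; to stay lightweight it uses a `LogisticRegression` on synthetic data as a stand-in for the LightGBM model, and the file name is hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in model trained on synthetic data (not the notebook's LightGBM model)
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Save and reload inside a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should give identical predictions
same_predictions = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same_predictions)
```

The same pattern applies to `lgb_model`: load it with `joblib.load` and compare its predictions on the validation set.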


In [None]:
# required data for inference
# ['country', 'year', 'uniqueid', 'location_type', 'cellphone_access', 'household_size', 'age_of_respondent', 'gender_of_respondent', 'relationship_with_head', 'marital_status', 'education_level', 'job_type']


In [61]:
def preprocess_user_data(data):
    """
    Preprocess user-provided data for model inference.

    Args:
        data (dict): User-provided data as a dictionary containing feature values.

    Returns:
        np.array: Preprocessed data as a NumPy array suitable for model input.
    """
    # Build a one-row DataFrame from the dictionary of scalar values
    data = pd.DataFrame([data])

    # Convert numerical features to float if necessary
    numerical_features = ["household_size", "age_of_respondent", "year"]
    for feature in numerical_features:
        if feature in data:
            data[feature] = data[feature].astype(float)

    # Categorical features for One-Hot Encoding
    categorical_features = [
        "relationship_with_head",
        "marital_status",
        "education_level",
        "job_type",
        "country"
    ]

    # One-Hot Encoding conversion
    data = pd.get_dummies(data, prefix_sep="_", columns=categorical_features)

    # Encode binary features the way LabelEncoder does (alphabetical order).
    # The saved label_encoder.sav was fitted on the target, so it cannot be
    # reused for these columns.
    # ASSUMPTION: these are the category values present in Train.csv.
    binary_maps = {
        "location_type": {"Rural": 0, "Urban": 1},
        "cellphone_access": {"No": 0, "Yes": 1},
        "gender_of_respondent": {"Female": 0, "Male": 1},
    }
    for feature, mapping in binary_maps.items():
        if feature in data:
            data[feature] = data[feature].map(mapping)

    # Drop unnecessary columns (e.g., "uniqueid")
    data = data.drop(columns=["uniqueid"], errors="ignore")

    # Load the saved MinMaxScaler; this must be a fitted instance
    with open("../savedModel/minmax_scaler.sav", "rb") as f:
        scaler = pickle.load(f)

    # A single row only yields dummy columns for its own categories, so align
    # to the columns seen at fit time.
    # ASSUMPTION: the scaler was fitted on a DataFrame, which makes
    # feature_names_in_ available (scikit-learn >= 1.0).
    data = data.reindex(columns=scaler.feature_names_in_, fill_value=0)

    # scaler.transform already returns a NumPy array
    return scaler.transform(data)

