# Digital Service Usage Rwanda Dataset – Data Cleaning and Preprocessing

In this notebook, we perform data preparation, cleaning, and preprocessing for the Digital Service Usage Rwanda Dataset, which contains information about citizens’ interactions with various digital services across Rwanda’s districts. The dataset captures key metrics such as the number of users reported, satisfaction scores, departments, and service names over time.

Effective data cleaning ensures that the dataset is accurate, consistent, and ready for analysis. This process helps uncover patterns in digital adoption, identify service performance gaps, and provide valuable insights for improving digital service delivery and citizen satisfaction.

Throughout this notebook, we address common data quality issues: missing values, duplicate records, inconsistent text formatting, and invalid entries such as negative user counts. Missing values in `Users_Reported` are filled with the column median, and missing values in text columns are labeled `Unknown`, so that the dataset is ready for visualization or modeling tasks.

By the end of this preprocessing stage, the cleaned dataset is saved to `DigitalServiceUsage_Rwanda_cleaned.csv`, ready for further analytics, reporting, or policy evaluation related to Rwanda’s digital transformation initiatives.


We start by importing essential Python libraries for data handling and manipulation.

- `pandas` for structured data operations.

- `numpy` for numerical operations.

- `os` for interacting with the operating system and directory structures.


In [14]:
# Import libraries

import pandas as pd
import numpy as np
import os

In [1]:
# Load the dataset
df = pd.read_csv("DigitalServiceUsage_Rwanda - DigitalServiceUsage_Rwanda.csv")

# Check first few rows
df.head()


,District,Service_Name,Department,Users_Reported,Satisfaction_Score_(%),Year,Month
0,Musanze,eTax Portal,Ministry of Health,910.0,,2022,Unknown
1,Gasabo,Land Registration Portal,Immigration Department,3822.0,93.6,2023,Dec
2,Rusizi,eTax Portal,Rwanda Revenue Authority,516.0,82.9,2024,Mar
3,Nyagatare,E-Visa Service,Ministry of ICT,4476.0,,2024,Jun
4,Nyagatare,Digital Health Records,Rwanda Revenue Authority,,,2023,May


In [2]:
# Check data types and non-null values
df.info()

# Get number of rows and columns
df.shape


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1025 entries, 0 to 1024
Data columns (total 7 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   District                1025 non-null   object 
 1   Service_Name            1025 non-null   object 
 2   Department              1025 non-null   object 
 3   Users_Reported          501 non-null    float64
 4   Satisfaction_Score_(%)  510 non-null    float64
 5   Year                    1025 non-null   int64  
 6   Month                   937 non-null    object 
dtypes: float64(2), int64(1), object(4)
memory usage: 56.2+ KB


(1025, 7)

In [3]:
# Check missing values
df.isnull().sum()

# Percentage of missing values per column
df.isnull().mean() * 100


District                   0.000000
Service_Name               0.000000
Department                 0.000000
Users_Reported            51.121951
Satisfaction_Score_(%)    50.243902
Year                       0.000000
Month                      8.585366
dtype: float64
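Only the last expression in a cell is displayed, so the missing-value counts from `df.isnull().sum()` do not appear in the output above. A minimal sketch of one way to show counts and percentages side by side follows. It uses a small illustrative frame rather than the full dataset, and the names `sample` and `missing_summary` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Small illustrative frame, loosely modelled on the preview rows shown earlier.
sample = pd.DataFrame({
    "District": ["Musanze", "Gasabo", "Rusizi", "Nyagatare"],
    "Users_Reported": [910.0, 3822.0, 516.0, np.nan],
    "Satisfaction_Score_(%)": [np.nan, 93.6, 82.9, np.nan],
})

# Combine absolute counts and percentages into a single summary table.
missing_summary = pd.DataFrame({
    "missing_count": sample.isnull().sum(),
    "missing_pct": (sample.isnull().mean() * 100).round(2),
})

print(missing_summary)
```

Applied to `df`, the same pattern would report both figures in one displayed table.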

In [5]:
# Count duplicates
df.duplicated().sum()

# Remove duplicates if any
df = df.drop_duplicates()
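Before dropping duplicates, it can help to look at the affected rows. The sketch below shows one possible way to do that on a small made-up frame; the variable names are illustrative and not part of the original notebook.

```python
import pandas as pd

# Made-up rows for illustration: the first two are exact duplicates.
sample = pd.DataFrame({
    "District": ["Gasabo", "Gasabo", "Rusizi"],
    "Service_Name": ["eTax Portal", "eTax Portal", "eTax Portal"],
    "Year": [2023, 2023, 2024],
})

# Number of rows that repeat an earlier row.
n_duplicates = int(sample.duplicated().sum())

# keep=False marks every member of a duplicate group, which makes inspection easier.
duplicate_rows = sample[sample.duplicated(keep=False)]

# Drop the repeats and rebuild a contiguous index.
deduplicated = sample.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(duplicate_rows)
```

Resetting the index is optional; without it, the surviving rows keep their original labels.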


In [6]:
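# Trim surrounding whitespace and title-case every text column
# (note: title-casing also rewrites names such as "eTax Portal" as "Etax Portal")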
for col in df.select_dtypes(include='object'):
    df[col] = df[col].str.strip().str.title()
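Title-casing makes capitalization uniform, but it also changes brand-style names and acronyms. The sketch below illustrates that effect and one possible alternative: mapping a case-insensitive key to a canonical spelling. The `CANONICAL` mapping and the `normalize_label` helper are hypothetical and would need to be filled in from the labels actually present in the dataset.

```python
import pandas as pd

names = pd.Series(["  eTax Portal ", "ministry of ict", "E-Visa Service"])

# Plain title-casing, as in the cell above: "eTax" and "ICT" lose their capitals.
title_cased = names.str.strip().str.title()

# Hypothetical canonical spellings, keyed by a case-insensitive form of the label.
CANONICAL = {
    "etax portal": "eTax Portal",
    "ministry of ict": "Ministry of ICT",
}

def normalize_label(value):
    """Return the canonical spelling if known, else a title-cased fallback."""
    if not isinstance(value, str):
        return value  # leave missing values (NaN) untouched
    cleaned = " ".join(value.split())  # trim and collapse internal whitespace
    return CANONICAL.get(cleaned.casefold(), cleaned.title())

normalized = names.map(normalize_label)

print(title_cased.tolist())
print(normalized.tolist())
```

Labels without a canonical entry still fall back to title case, so the behavior matches the original cell wherever no exception is listed.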


In [7]:
df.describe()  # Summary of numeric data


,Users_Reported,Satisfaction_Score_(%),Year
count,491.0,497.0,1000.0
mean,2520.600815,75.187123,2022.441
std,1421.076054,14.515027,1.120614
min,51.0,50.1,2021.0
25%,1292.5,62.3,2021.0
50%,2447.0,75.4,2022.0
75%,3780.5,87.7,2023.0
max,4983.0,100.0,2024.0


In [8]:
# Fill missing Users_Reported values with the column median
# (Satisfaction_Score_(%) is not imputed here)
df['Users_Reported'] = df['Users_Reported'].fillna(df['Users_Reported'].median())
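Roughly half of the `Users_Reported` values are missing, so after imputation a large share of the column holds the same median value. One option, sketched below on a few illustrative values, is to record which rows were imputed before filling them. The flag column name `Users_Reported_Imputed` is an assumption made for this example.

```python
import numpy as np
import pandas as pd

# Illustrative values only; the last entry is missing.
sample = pd.DataFrame({"Users_Reported": [910.0, 3822.0, 516.0, 4476.0, np.nan]})

# Remember which rows were missing before they are filled.
sample["Users_Reported_Imputed"] = sample["Users_Reported"].isna()

# The median is computed from the observed values (NaN is skipped by default).
median_users = sample["Users_Reported"].median()
sample["Users_Reported"] = sample["Users_Reported"].fillna(median_users)

print(sample)
```

Later analyses can then filter on the flag to compare results with and without the imputed rows.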


In [9]:
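# Parse month abbreviations such as "Dec" with an explicit format.
# Values that do not match (for example "Unknown") become NaT,
# and the parsed dates carry the default year 1900.
df['Month'] = pd.to_datetime(df['Month'], format='%b', errors='coerce')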
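Because the dataset stores `Year` and `Month` in separate columns, a month on its own does not identify a point in time. The sketch below shows one possible way to combine the two into a single date. It assumes months are three-letter English abbreviations such as `Dec` and that the code runs under an English locale; the column name `Period_Start` is hypothetical.

```python
import numpy as np
import pandas as pd

# Illustrative rows: one literal "Unknown" and one missing month.
sample = pd.DataFrame({
    "Year": [2022, 2023, 2024, 2023],
    "Month": ["Unknown", "Dec", "Mar", np.nan],
})

# Build strings such as "2023-Dec" and parse them with an explicit format.
# Rows whose month is missing or unparseable end up as NaT.
sample["Period_Start"] = pd.to_datetime(
    sample["Year"].astype(str) + "-" + sample["Month"],
    format="%Y-%b",
    errors="coerce",
)

print(sample)
```

Each parsed value is the first day of the given month, which is convenient for time-series grouping and plotting.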


In [10]:
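# Fill missing values in the remaining text columns with 'Unknown'
# (Month is already a datetime column at this point, so it is not affected)
for col in df.select_dtypes(include='object'):
    df[col] = df[col].fillna('Unknown')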
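The preview of the raw data shows a literal `Unknown` in `Month`, while other rows are simply empty, so missing months appear in two forms. A small sketch of how both forms could be unified and counted is shown below; it works on a standalone illustrative series, and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Illustrative text column with both a literal "Unknown" and a true missing value.
month = pd.Series(["Dec", "Unknown", np.nan, "Mar"], name="Month")

# Represent both kinds of missing entry with the same label.
unified = month.fillna("Unknown")

# Share of rows without a usable month, as a percentage.
unknown_share = (unified == "Unknown").mean() * 100

print(unified.tolist())
print(unknown_share)
```

Counting both forms together gives a more complete picture of missingness than `isnull()` alone.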


In [11]:
# List rows with negative user counts, which would be invalid
df[df['Users_Reported'] < 0]


,District,Service_Name,Department,Users_Reported,Satisfaction_Score_(%),Year,Month


In [12]:
# Mark any negative user counts as missing (no rows matched in the check above)
df.loc[df['Users_Reported'] < 0, 'Users_Reported'] = np.nan
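The same kind of range check can be applied to the satisfaction score, which is a percentage and should fall between 0 and 100. The summary statistics earlier show a minimum of 50.1 and a maximum of 100.0, so nothing in this dataset appears out of range. The sketch below therefore uses made-up invalid values (`-5.0` and `120.0`) purely to illustrate the technique.

```python
import numpy as np
import pandas as pd

# Made-up values: row 1 is invalid in both columns, row 3 has a missing score.
sample = pd.DataFrame({
    "Users_Reported": [910.0, -5.0, 516.0, 4476.0],
    "Satisfaction_Score_(%)": [93.6, 120.0, 82.9, np.nan],
})

users = sample["Users_Reported"]
score = sample["Satisfaction_Score_(%)"]

# Flag only values that are present and outside the valid range;
# values that are already missing are not counted as invalid.
invalid_users = users.notna() & (users < 0)
invalid_score = score.notna() & ~score.between(0, 100)

# Replace invalid entries with NaN so they are treated as missing.
sample.loc[invalid_users, "Users_Reported"] = np.nan
sample.loc[invalid_score, "Satisfaction_Score_(%)"] = np.nan

print(sample)
```

If median imputation is used, running such checks before it means that invalidated values are filled along with the other missing ones.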


In [13]:
# Save the cleaned dataset without writing the index as a column
df.to_csv("DigitalServiceUsage_Rwanda_cleaned.csv", index=False)
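CSV files do not store column types, so a datetime column comes back as plain text unless it is parsed again on load. The sketch below shows a round trip on a small illustrative frame written to a temporary directory; the file name `cleaned_sample.csv` and the sample values are assumptions, not the notebook's actual output.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Illustrative cleaned frame with a datetime column.
cleaned = pd.DataFrame({
    "District": ["Gasabo", "Rusizi"],
    "Users_Reported": [3822.0, 516.0],
    "Month": pd.to_datetime(["2023-12-01", "2024-03-01"]),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "cleaned_sample.csv"

    # index=False avoids an extra unnamed index column on reload.
    cleaned.to_csv(out_path, index=False)

    # parse_dates restores the datetime type for the listed columns.
    reloaded = pd.read_csv(out_path, parse_dates=["Month"])

print(reloaded.dtypes)
```

A quick comparison of shape and column names after reloading is a cheap way to confirm the file was written as intended.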
