# Data Cleaning and Preparation

This notebook loads, inspects, and prepares banking sector and macroeconomic data for analysis.


## Data Coverage and Frequency

Monthly macro-financial indicators span January 2010 onward, sourced primarily from the Bank of Ghana.
Quarterly real GDP data from the Ghana Statistical Service is converted to monthly frequency using linear interpolation.
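Interpolation keeps the quarter-to-quarter trend in GDP and lets the series be merged with the monthly banking indicators. The interpolated months are not independent observations, however: they assume GDP moves evenly between consecutive quarterly values.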
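As a minimal sketch of the quarterly-to-monthly conversion described above, the snippet below upsamples a toy quarterly series to month-start frequency and fills the gaps linearly. The values are made up for illustration and are not actual GDP figures; the series name `GDP_Real` simply mirrors the column used later in the notebook.

```python
import pandas as pd

# Hypothetical quarterly values (illustrative only, not real GDP data)
quarterly = pd.Series(
    [100.0, 130.0, 160.0],
    index=pd.PeriodIndex(["2010Q1", "2010Q2", "2010Q3"], freq="Q").to_timestamp(),
    name="GDP_Real",
)

# Upsample to month-start ('MS') frequency; the newly created months are
# empty until interpolate() fills them on a straight line between quarters
monthly = quarterly.resample("MS").interpolate(method="linear")
print(monthly)
```

Each quarterly value is anchored at the first month of its quarter, and the two months in between are filled in equal steps (here roughly 110 and 120 between 100 and 130). The resampled series stops at the first month of the last quarter, so months after that point have no GDP value and drop out of an inner merge.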


In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# 1. Loading the data
df_monthly = pd.read_csv('../data/raw/data_first.csv')
df_gdp = pd.read_csv('../data/raw/data_first2.csv')

# 2. Removing the footer rows (row counts are hard-coded for these files: 192 monthly rows, 79 quarterly rows)
df_monthly_clean = df_monthly.iloc[:192].copy()
df_gdp_clean = df_gdp.iloc[:79].copy()

# 3. Standardizing Dates
# Converting '01/01/2010' (day/month/year) to datetime
df_monthly_clean['Date'] = pd.to_datetime(df_monthly_clean['Name of Series'], format='%d/%m/%Y')

# Converting '2006Q1' to datetime
df_gdp_clean['Date'] = pd.PeriodIndex(df_gdp_clean['Name of Series'], freq='Q').to_timestamp()

# 4. Converting all text columns to Numbers (Floats); values that cannot be parsed become NaN
# Monthly columns
cols_to_fix = [col for col in df_monthly_clean.columns if col not in ['Name of Series', 'Date']]
for col in cols_to_fix:
    df_monthly_clean[col] = pd.to_numeric(df_monthly_clean[col], errors='coerce')

# GDP column
df_gdp_clean['GDP_Real'] = pd.to_numeric(df_gdp_clean['Gross Domestic Product (GDP), production, real'], errors='coerce')

# 5. Handling the Frequency Mismatch
# Set Date as the index, then upsample quarterly GDP to month-start ('MS') frequency
# and fill the new months by linear interpolation
df_gdp_clean = df_gdp_clean.set_index('Date')
df_gdp_monthly = df_gdp_clean[['GDP_Real']].resample('MS').interpolate(method='linear')
df_gdp_monthly = df_gdp_monthly.reset_index()

# 6. Merging into a Master Dataset
master_df = pd.merge(
    df_monthly_clean.drop(columns=['Name of Series']),
    df_gdp_monthly,
    on='Date',
    how='inner',
)

# 7. Saving the final version
master_df.to_csv('../data/processed/ghana_banking_master.csv', index=False)

print("Master Dataset Created Successfully!")
print(f"Total Months Analyzed: {len(master_df)}")
master_df.head()

Master Dataset Created Successfully!
Total Months Analyzed: 187


| | Monetary Policy Rate (%) | Consumer Price Index, All Items | USD Exchange Rate, monthly averages | Total Liquidity (M2+) | Gold Price (Realised Gold Price) | Return on Assets | Non Performing Loan Ratio | Capital Adequacy Ratio | Date | GDP_Real |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 18.0 | 30.02508 | 1.4295 | 10222.3 | 1113.2 | 5.209983 | 19.710867 | 17.408464 | 2010-01-01 | 21005.42351 |
| 1 | 16.0 | 30.47826 | 1.4298 | 10094.1 | 1094.0 | 4.043841 | 20.024241 | 19.67046 | 2010-02-01 | 20601.295093 |
| 2 | 16.0 | 30.82513 | 1.4271 | 10538.0 | 1111.4 | 3.661842 | 18.487448 | 20.544521 | 2010-03-01 | 20197.166677 |
| 3 | 15.0 | 31.25998 | 1.4222 | 10408.2 | 1124.4 | 4.064806 | 18.901357 | 20.15175 | 2010-04-01 | 19793.03826 |
| 4 | 15.0 | 31.84412 | 1.4206 | 10467.1 | 1202.3 | 3.819093 | 18.732988 | 19.149789 | 2010-05-01 | 21957.90124 |
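Because the cleaning steps use `errors='coerce'` and an inner merge, bad values and dropped months can pass silently. The sketch below shows one possible way to check a merged monthly frame for gaps in the date sequence and for NaN values. The helper name `summarize_monthly_frame` is hypothetical and not part of the notebook, and the toy frame only stands in for `master_df`.

```python
import pandas as pd

def summarize_monthly_frame(df, date_col="Date"):
    """Return basic integrity facts about a DataFrame dated at month starts."""
    dates = pd.DatetimeIndex(df[date_col])
    # Every month start we would expect between the first and last date
    expected = pd.date_range(dates.min(), dates.max(), freq="MS")
    return {
        "rows": len(df),
        "missing_months": list(expected.difference(dates)),
        "nan_counts": df.drop(columns=[date_col]).isna().sum().to_dict(),
    }

# Toy frame standing in for master_df (values are made up)
toy = pd.DataFrame({
    "Date": pd.to_datetime(["2010-01-01", "2010-02-01", "2010-04-01"]),
    "GDP_Real": [1.0, None, 3.0],
})
report = summarize_monthly_frame(toy)
print(report)
```

On the real data, the same call would be `summarize_monthly_frame(master_df)`. A non-empty `missing_months` list or a non-zero NaN count would point to rows worth inspecting in the raw files.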
