## <span style="color: blue;">Berka Bank Dataset Preparation - Part 1 [using Pandas and Python]</span>
by Dr. John Akinyemi

### About the Dataset and the Project

The dataset is called the Berka dataset. It contains operational information from a Czech bank and is intended for data exploration and knowledge discovery in the financial industry, specifically banking, in the areas of clients, accounts, loans, credit cards, districts (bank branches), and other financial transactions.

The dataset was originally provided for the 1999 PKDD Discovery Challenge.
Source: https://webpages.charlotte.edu/mirsad/itcs6265/group1/index.html

### Statistics:
* The dataset contains more than 5,000 clients, and approximately 1 million transactions. 
* The dataset shows that the bank advanced almost 700 loans to clients, and issued nearly 900 credit cards.

### Bank managers need help to improve their services. 
* They want a better understanding of their customers and are looking for specific actions to improve their services.

### Problems and Opportunities Facing Bank Managers
* They only have vague ideas about their clients.
* They want to know the best clients to be offered additional services.
* They also want to identify bad clients, so they can minimize potential losses for the bank.

### Desired Outcomes for Bank Managers
* Managers want a general understanding of their clients, if possible, per district.
* They want to gain useful and actionable insights from the data.
* They want to know the characteristics of customers who are good candidates for loans, credit cards, and other services.
* They also want to know which clients are bad candidates for loans, credit cards, and other services.

### <span style="color: blue;">John's Notes</span>

The goal of this project is to tell a story with the data in this dataset and to show mastery of data analysis skills from data acquisition to visualization. 

The project is divided into four (4) parts, to show mastery of:

* **Part 1: Data preprocessing using Python and the pandas library.**
* **Part 2: Using SQL to load and preprocess data within MS SQL Server** 
* **Part 3: Exploratory Data Analysis (EDA) with SQL within MS SQL Server** 
    * Bonus: showing mastery of working with Python, pandas, SQL, and the MS SQL Server RDBMS together within the Jupyter Notebook environment.
* **Part 4: Data storytelling and visualization using Power BI.** 


<span style="color: blue;">The next step is to load the raw data from the Berka dataset into pandas and preprocess it so that it is ready for SQL Server.</span>


In [1]:
import pandas as pd

## <span style="color: blue;">Load csv files and Rename Columns/Fields</span>
* The separator (delimiter) in the csv files is a semicolon (`;`), not a comma (`,`).
* Add `sep=';'` to `read_csv()`.
* Also add `low_memory=False` to `read_csv()` for `transaction.csv`. The file is relatively large (~67MB), and this setting makes pandas infer each column's type from the whole file rather than chunk by chunk, which avoids mixed-type columns.

In [2]:
account = pd.read_csv('../berka-bank-raw-dataset/account.csv', sep=';')
card = pd.read_csv('../berka-bank-raw-dataset/card.csv', sep=';')
client = pd.read_csv('../berka-bank-raw-dataset/client.csv', sep=';')
disposition = pd.read_csv('../berka-bank-raw-dataset/disposition.csv', sep=';')
district = pd.read_csv('../berka-bank-raw-dataset/district.csv', sep=';')
loan = pd.read_csv('../berka-bank-raw-dataset/loan.csv', sep=';')
order = pd.read_csv('../berka-bank-raw-dataset/order.csv', sep=';')
transaction = pd.read_csv('../berka-bank-raw-dataset/transaction.csv', sep=';', low_memory=False)
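
As a quick illustration of why the separator matters, the sketch below parses a two-row sample from an in-memory string, once with the default separator and once with `sep=';'`. The `sample` text (modeled on the `account.csv` layout) and the variable names are assumptions for illustration only; they are not part of the original notebook.

```python
import io
import pandas as pd

# Hypothetical two-row sample in the same semicolon-delimited layout as account.csv.
sample = (
    "account_id;district_id;frequency;date\n"
    "576;55;POPLATEK MESICNE;930101\n"
    "3818;74;POPLATEK MESICNE;930101\n"
)

# With the default comma separator, every line is read as one single column.
wrong = pd.read_csv(io.StringIO(sample))

# With sep=';', the four fields are split into four columns.
right = pd.read_csv(io.StringIO(sample), sep=';')

print(wrong.shape)   # one column
print(right.shape)   # four columns
```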

### Data Preprocessing for Account (account.csv)

In [3]:
account.head()

Unnamed: 0,account_id,district_id,frequency,date
0,576,55,POPLATEK MESICNE,930101
1,3818,74,POPLATEK MESICNE,930101
2,704,55,POPLATEK MESICNE,930101
3,2378,16,POPLATEK MESICNE,930101
4,2632,24,POPLATEK MESICNE,930102


In [4]:
account.columns

Index(['account_id', 'district_id', 'frequency', 'date'], dtype='object')

In [5]:
account.rename(columns={
    'account_id':'AccountID', 
    'district_id':'DistrictID', 
    'frequency':'Frequency', 
    'date':'EntryDate'
}, inplace=True)

In [6]:
account.head()

Unnamed: 0,AccountID,DistrictID,Frequency,EntryDate
0,576,55,POPLATEK MESICNE,930101
1,3818,74,POPLATEK MESICNE,930101
2,704,55,POPLATEK MESICNE,930101
3,2378,16,POPLATEK MESICNE,930101
4,2632,24,POPLATEK MESICNE,930102
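
The rename mappings in this notebook are typed by hand, which leaves room for typos. One possible alternative, shown here only as a sketch, is to derive the PascalCase names from the snake_case originals and pass explicit overrides where the target name differs. The helper `to_pascal_case` and its `overrides` parameter are hypothetical and not part of the original notebook.

```python
import pandas as pd

def to_pascal_case(name, overrides=None):
    """Convert a snake_case column name to PascalCase, e.g. 'account_id' -> 'AccountID'."""
    overrides = overrides or {}
    if name in overrides:
        return overrides[name]
    parts = name.split('_')
    # Treat any 'id' token as the acronym 'ID'; capitalize every other token.
    return ''.join('ID' if p == 'id' else p.capitalize() for p in parts)

# Empty frame with the same column names as account.csv.
df = pd.DataFrame(columns=['account_id', 'district_id', 'frequency', 'date'])

# 'date' needs an override because the notebook renames it to 'EntryDate'.
renamed = df.rename(columns=lambda c: to_pascal_case(c, {'date': 'EntryDate'}))
print(list(renamed.columns))
```

Names such as `disp_id` would still need an override to become `DispositionID`, so the helper reduces manual typing but does not remove it.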


### <span style="color: blue;">Save preprocessed accounts dataframe data into another `account.csv` file</span>
* The data has been preprocessed and it is ready to be ingested into MS SQL Server.
* Export the preprocessed data in the pandas dataframe into `account.csv` file in another folder.

In [7]:
account.to_csv('../dataset-ready-to-upload-to-rdbms/account.csv', index=False)
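
The sketch below shows what `index=False` changes in the exported file, and creates the output folder first because `to_csv()` does not create missing folders. It writes to a temporary directory; the folder name `ready` and the two-row frame are assumptions for illustration.

```python
import tempfile
from pathlib import Path
import pandas as pd

df = pd.DataFrame({'AccountID': [576, 3818], 'DistrictID': [55, 74]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / 'ready'                 # hypothetical output folder
    out_dir.mkdir(parents=True, exist_ok=True)    # to_csv() will not create it

    df.to_csv(out_dir / 'with_index.csv')             # row index written as an extra column
    df.to_csv(out_dir / 'no_index.csv', index=False)  # data columns only

    with_index = pd.read_csv(out_dir / 'with_index.csv')
    no_index = pd.read_csv(out_dir / 'no_index.csv')

print(list(with_index.columns))
print(list(no_index.columns))
```

Without `index=False`, the file gains an unnamed leading column that would have to be dropped again when the file is imported elsewhere.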

### Data Preprocessing for Credit Card (card.csv)

In [8]:
card.head()

Unnamed: 0,card_id,disp_id,type,issued
0,1005,9285,classic,931107 00:00:00
1,104,588,classic,940119 00:00:00
2,747,4915,classic,940205 00:00:00
3,70,439,classic,940208 00:00:00
4,577,3687,classic,940215 00:00:00


In [9]:
card.columns

Index(['card_id', 'disp_id', 'type', 'issued'], dtype='object')

In [10]:
card.rename(columns={
    'card_id':'CardID',
    'disp_id':'DispositionID', 
    'type':'Type', 
    'issued':'IssuedDate', 
}, inplace=True)

In [11]:
card.head()

Unnamed: 0,CardID,DispositionID,Type,IssuedDate
0,1005,9285,classic,931107 00:00:00
1,104,588,classic,940119 00:00:00
2,747,4915,classic,940205 00:00:00
3,70,439,classic,940208 00:00:00
4,577,3687,classic,940215 00:00:00
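
The date columns are exported as they are (`YYMMDD` integers, or `YYMMDD hh:mm:ss` strings for cards), and this project defers data type corrections to SQL Server. If the conversion were done in pandas instead, it might look like the following sketch. The two small sample frames are assumptions built from the previews in this notebook.

```python
import pandas as pd

card_sample = pd.DataFrame({'IssuedDate': ['931107 00:00:00', '940119 00:00:00']})
account_sample = pd.DataFrame({'EntryDate': [930101, 930102]})

# String dates: two-digit year, month, day, followed by a time component.
card_sample['IssuedDate'] = pd.to_datetime(
    card_sample['IssuedDate'], format='%y%m%d %H:%M:%S'
)

# Integer dates: cast to string first so the YYMMDD format can be applied.
account_sample['EntryDate'] = pd.to_datetime(
    account_sample['EntryDate'].astype(str), format='%y%m%d'
)

print(card_sample.dtypes)
print(account_sample.dtypes)
```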


### <span style="color: blue;">Save preprocessed credit card dataframe data into a `credit_card.csv` file</span>
* The data has been preprocessed and it is ready to be ingested into MS SQL Server.
* Export the preprocessed data in the pandas dataframe into a `credit_card.csv` file in another folder.

In [12]:
card.to_csv('../dataset-ready-to-upload-to-rdbms/credit_card.csv', index=False)

### Data Preprocessing for Client (client.csv)

In [13]:
client.head()

Unnamed: 0,client_id,birth_number,district_id
0,1,706213,18
1,2,450204,1
2,3,406009,1
3,4,561201,5
4,5,605703,5


In [14]:
client.columns

Index(['client_id', 'birth_number', 'district_id'], dtype='object')

In [15]:
client.rename(columns={
    'client_id':'ClientID',
    'birth_number':'BirthNumber', 
    'district_id':'DistrictID', 
}, inplace=True)

In [16]:
client.head()

Unnamed: 0,ClientID,BirthNumber,DistrictID
0,1,706213,18
1,2,450204,1
2,3,406009,1
3,4,561201,5
4,5,605703,5
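
In the Berka data description, the birth number is `YYMMDD` for men and `YYMM+50DD` for women, so a month above 50 marks a female client. The notebook keeps the raw value; the sketch below shows one way the date and gender could be derived later. The function `decode_birth_number` is hypothetical, and it assumes every client was born in the 1900s.

```python
import pandas as pd

def decode_birth_number(birth_number):
    """Split a YYMMDD birth number into (birth date, gender); a month above 50 marks a woman."""
    yy = birth_number // 10000
    mm = (birth_number // 100) % 100
    dd = birth_number % 100
    gender = 'F' if mm > 50 else 'M'
    if mm > 50:
        mm -= 50
    # Assumption: all birth years fall in the 1900s.
    return pd.Timestamp(year=1900 + yy, month=mm, day=dd), gender

# First three birth numbers from the client preview.
for n in (706213, 450204, 406009):
    print(n, decode_birth_number(n))
```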


### <span style="color: blue;">Save preprocessed client dataframe data into another `client.csv` file</span>
* The data has been preprocessed and it is ready to be ingested into MS SQL Server.
* Export the preprocessed data in the pandas dataframe into `client.csv` file in another folder.

In [17]:
client.to_csv('../dataset-ready-to-upload-to-rdbms/client.csv', index=False)

### Data Preprocessing for Disposition (disposition.csv)

In [18]:
disposition.head()

Unnamed: 0,disp_id,client_id,account_id,type
0,1,1,1,OWNER
1,2,2,2,OWNER
2,3,3,2,DISPONENT
3,4,4,3,OWNER
4,5,5,3,DISPONENT


In [19]:
disposition.rename(columns={
    'disp_id':'DispositionID', 
    'client_id':'ClientID',
    'account_id':'AccountID', 
    'type':'Type',    
}, inplace=True)

In [20]:
disposition.head()

Unnamed: 0,DispositionID,ClientID,AccountID,Type
0,1,1,1,OWNER
1,2,2,2,OWNER
2,3,3,2,DISPONENT
3,4,4,3,OWNER
4,5,5,3,DISPONENT


In [21]:
disposition.columns

Index(['DispositionID', 'ClientID', 'AccountID', 'Type'], dtype='object')
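
Because the disposition table links clients to accounts, it can be worth checking that every key it references exists before loading the files into a relational database. The following is a minimal sketch on made-up sample frames; the value `99` is a deliberately broken reference added for illustration.

```python
import pandas as pd

client_sample = pd.DataFrame({'ClientID': [1, 2, 3]})
disposition_sample = pd.DataFrame({
    'DispositionID': [1, 2, 3, 4],
    'ClientID': [1, 2, 3, 99],   # 99 has no matching client (illustrative only)
})

# Rows whose ClientID has no match in the client table.
is_known = disposition_sample['ClientID'].isin(client_sample['ClientID'])
orphans = disposition_sample[~is_known]
print(orphans)
```

The same pattern could be applied to `AccountID` against the account table.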

### <span style="color: blue;">Save preprocessed disposition dataframe data into another `disposition.csv` file</span>
* The data has been preprocessed and it is ready to be ingested into MS SQL Server.
* Export the preprocessed data in the pandas dataframe into `disposition.csv` file in another folder.

In [22]:
disposition.to_csv('../dataset-ready-to-upload-to-rdbms/disposition.csv', index=False)

### Data Preprocessing for District (district.csv)

In [23]:
district.head()

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16
0,1,Hl.m. Praha,Prague,1204953,0,0,0,1,1,100.0,12541,0.29,0.43,167,85677,99107
1,2,Benesov,central Bohemia,88884,80,26,6,2,5,46.7,8507,1.67,1.85,132,2159,2674
2,3,Beroun,central Bohemia,75232,55,26,4,1,5,41.7,8980,1.95,2.21,111,2824,2813
3,4,Kladno,central Bohemia,149893,63,29,6,2,6,67.4,9753,4.64,5.05,109,5244,5892
4,5,Kolin,central Bohemia,95616,65,30,4,1,6,51.4,9307,3.85,4.43,118,2616,3040


In [24]:
district.rename(columns={
    'A1':'DistrictID',    
}, inplace=True)

In [25]:
district.head()

Unnamed: 0,DistrictID,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16
0,1,Hl.m. Praha,Prague,1204953,0,0,0,1,1,100.0,12541,0.29,0.43,167,85677,99107
1,2,Benesov,central Bohemia,88884,80,26,6,2,5,46.7,8507,1.67,1.85,132,2159,2674
2,3,Beroun,central Bohemia,75232,55,26,4,1,5,41.7,8980,1.95,2.21,111,2824,2813
3,4,Kladno,central Bohemia,149893,63,29,6,2,6,67.4,9753,4.64,5.05,109,5244,5892
4,5,Kolin,central Bohemia,95616,65,30,4,1,6,51.4,9307,3.85,4.43,118,2616,3040


In [26]:
district.columns

Index(['DistrictID', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10',
       'A11', 'A12', 'A13', 'A14', 'A15', 'A16'],
      dtype='object')
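
Only `A1` is renamed here, so `A2` to `A16` keep their coded names. As a sketch of how they could be made self-describing, the mapping below follows the column meanings given in the PKDD'99 data description as I understand them; the PascalCase names themselves are hypothetical and should be checked against that description before use.

```python
import pandas as pd

# Hypothetical descriptive names for the coded district columns.
district_names = {
    'A2': 'DistrictName',
    'A3': 'Region',
    'A4': 'Inhabitants',
    'A5': 'MunicipalitiesUnder499',
    'A6': 'Municipalities500To1999',
    'A7': 'Municipalities2000To9999',
    'A8': 'MunicipalitiesOver10000',
    'A9': 'Cities',
    'A10': 'UrbanRatio',
    'A11': 'AverageSalary',
    'A12': 'UnemploymentRate1995',
    'A13': 'UnemploymentRate1996',
    'A14': 'EntrepreneursPer1000',
    'A15': 'Crimes1995',
    'A16': 'Crimes1996',
}

# First row of the district preview, used as sample data.
sample = pd.DataFrame(
    [[1, 'Hl.m. Praha', 'Prague', 1204953, 0, 0, 0, 1, 1, 100.0,
      12541, 0.29, 0.43, 167, 85677, 99107]],
    columns=['DistrictID'] + [f'A{i}' for i in range(2, 17)],
)

renamed = sample.rename(columns=district_names)
print(list(renamed.columns))
```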

### <span style="color: blue;">Save preprocessed district dataframe data into another `district.csv` file</span>
* The data has been preprocessed and it is ready to be ingested into MS SQL Server.
* Export the preprocessed data in the pandas dataframe into `district.csv` file in another folder.

In [27]:
district.to_csv('../dataset-ready-to-upload-to-rdbms/district.csv', index=False)

### Data Preprocessing for Loan (loan.csv)

In [28]:
loan.head()

Unnamed: 0,loan_id,account_id,date,amount,duration,payments,status
0,5314,1787,930705,96396,12,8033.0,B
1,5316,1801,930711,165960,36,4610.0,A
2,6863,9188,930728,127080,60,2118.0,A
3,5325,1843,930803,105804,36,2939.0,A
4,7240,11013,930906,274740,60,4579.0,A


In [29]:
loan.rename(columns={
    'loan_id':'LoanID', 
    'account_id':'AccountID', 
    'date':'EntryDate',
    'amount':'Amount', 
    'duration':'Duration',
    'payments':'Payments', 
    'status':'Status',     
}, inplace=True)

In [30]:
loan.head()

Unnamed: 0,LoanID,AccountID,EntryDate,Amount,Duration,Payments,Status
0,5314,1787,930705,96396,12,8033.0,B
1,5316,1801,930711,165960,36,4610.0,A
2,6863,9188,930728,127080,60,2118.0,A
3,5325,1843,930803,105804,36,2939.0,A
4,7240,11013,930906,274740,60,4579.0,A


In [31]:
loan.columns

Index(['LoanID', 'AccountID', 'EntryDate', 'Amount', 'Duration', 'Payments',
       'Status'],
      dtype='object')
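
Two properties of the loan table can be illustrated on the preview rows: the single-letter status codes have documented meanings, and the payment multiplied by the duration reproduces the loan amount. The sketch below is an illustration only; the column names `StatusLabel` and `AmountMatches` are hypothetical, and the label wording paraphrases the dataset description.

```python
import pandas as pd

# First three rows of the loan preview.
loan_sample = pd.DataFrame({
    'LoanID': [5314, 5316, 6863],
    'Amount': [96396, 165960, 127080],
    'Duration': [12, 36, 60],
    'Payments': [8033.0, 4610.0, 2118.0],
    'Status': ['B', 'A', 'A'],
})

status_labels = {
    'A': 'finished, no problems',
    'B': 'finished, loan not paid',
    'C': 'running, OK so far',
    'D': 'running, client in debt',
}

loan_sample['StatusLabel'] = loan_sample['Status'].map(status_labels)

# Consistency check: payment times duration should equal the loan amount.
loan_sample['AmountMatches'] = (
    loan_sample['Payments'] * loan_sample['Duration']
).eq(loan_sample['Amount'])

print(loan_sample[['LoanID', 'StatusLabel', 'AmountMatches']])
```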

### <span style="color: blue;">Save preprocessed loan dataframe data into another `loan.csv` file</span>
* The data has been preprocessed and it is ready to be ingested into MS SQL Server.
* Export the preprocessed data in the pandas dataframe into `loan.csv` file in another folder.

In [32]:
loan.to_csv('../dataset-ready-to-upload-to-rdbms/loan.csv', index=False)

### Data Preprocessing for Order (order.csv)

In [33]:
order.head()

Unnamed: 0,order_id,account_id,bank_to,account_to,amount,k_symbol
0,29401,1,YZ,87144583,2452.0,SIPO
1,29402,2,ST,89597016,3372.7,UVER
2,29403,2,QR,13943797,7266.0,SIPO
3,29404,3,WX,83084338,1135.0,SIPO
4,29405,3,CD,24485939,327.0,


In [34]:
order.rename(columns={
    'order_id':'OrderID', 
    'account_id':'AccountID', 
    'bank_to':'BankTo',
    'account_to':'AccountTo', 
    'amount':'Amount', 
    'k_symbol':'KSymbol',     
}, inplace=True)

In [35]:
order.head()

Unnamed: 0,OrderID,AccountID,BankTo,AccountTo,Amount,KSymbol
0,29401,1,YZ,87144583,2452.0,SIPO
1,29402,2,ST,89597016,3372.7,UVER
2,29403,2,QR,13943797,7266.0,SIPO
3,29404,3,WX,83084338,1135.0,SIPO
4,29405,3,CD,24485939,327.0,


In [36]:
order.columns

Index(['OrderID', 'AccountID', 'BankTo', 'AccountTo', 'Amount', 'KSymbol'], dtype='object')
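
The last preview row has no `KSymbol` value. Handling of missing values is left to the SQL stage of this project, but a pandas version might look like the sketch below. It assumes blanks can appear as whitespace or as missing values; the column name `PaymentType` and the `'unknown'` label are hypothetical, and the English labels paraphrase the dataset description.

```python
import pandas as pd

order_sample = pd.DataFrame({
    'OrderID': [29401, 29402, 29405, 29406],     # 29406 is made up for illustration
    'KSymbol': ['SIPO', 'UVER', ' ', None],      # blank and missing values
})

k_symbol_labels = {
    'SIPO': 'household payment',
    'UVER': 'loan payment',
    'POJISTNE': 'insurance payment',
    'LEASING': 'leasing payment',
}

# Normalize missing values and surrounding whitespace to an empty string.
cleaned = order_sample['KSymbol'].fillna('').str.strip()

# Codes without a label (including the empty string) become 'unknown'.
order_sample['PaymentType'] = cleaned.map(k_symbol_labels).fillna('unknown')
print(order_sample)
```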

### <span style="color: blue;">Save preprocessed order dataframe data into another `order.csv` file</span>
* The data has been preprocessed and it is ready to be ingested into MS SQL Server.
* Export the preprocessed data in the pandas dataframe into `order.csv` file in another folder.

In [37]:
order.to_csv('../dataset-ready-to-upload-to-rdbms/order.csv', index=False)

### Data Preprocessing for Transaction Data (transaction.csv)

In [38]:
transaction.head()

Unnamed: 0,trans_id,account_id,date,type,operation,amount,balance,k_symbol,bank,account
0,695247,2378,930101,PRIJEM,VKLAD,700.0,700.0,,,
1,171812,576,930101,PRIJEM,VKLAD,900.0,900.0,,,
2,207264,704,930101,PRIJEM,VKLAD,1000.0,1000.0,,,
3,1117247,3818,930101,PRIJEM,VKLAD,600.0,600.0,,,
4,579373,1972,930102,PRIJEM,VKLAD,400.0,400.0,,,


In [39]:
transaction.rename(columns={
    'trans_id':'TransactionID', 
    'account_id':'AccountID', 
    'date':'EntryDate',
    'type':'Type', 
    'operation':'Operation', 
    'amount':'Amount', 
    'balance':'Balance',
    'k_symbol':'KSymbol',     
    'bank':'Bank',     
    'account':'Account' 
}, inplace=True)

In [40]:
transaction.head()

Unnamed: 0,TransactionID,AccountID,EntryDate,Type,Operation,Amount,Balance,KSymbol,Bank,Account
0,695247,2378,930101,PRIJEM,VKLAD,700.0,700.0,,,
1,171812,576,930101,PRIJEM,VKLAD,900.0,900.0,,,
2,207264,704,930101,PRIJEM,VKLAD,1000.0,1000.0,,,
3,1117247,3818,930101,PRIJEM,VKLAD,600.0,600.0,,,
4,579373,1972,930102,PRIJEM,VKLAD,400.0,400.0,,,


In [41]:
transaction.columns

Index(['TransactionID', 'AccountID', 'EntryDate', 'Type', 'Operation',
       'Amount', 'Balance', 'KSymbol', 'Bank', 'Account'],
      dtype='object')
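
The `Type` and `Operation` values are Czech codes. Standardizing text data is planned for the SQL stage, but as a sketch, the codes could be translated in pandas with a nested mapping. The English labels below paraphrase the dataset description, and the sample frame is an assumption for illustration.

```python
import pandas as pd

type_labels = {
    'PRIJEM': 'credit',
    'VYDAJ': 'withdrawal',
}
operation_labels = {
    'VKLAD': 'credit in cash',
    'VYBER': 'withdrawal in cash',
    'VYBER KARTOU': 'credit card withdrawal',
    'PREVOD Z UCTU': 'collection from another bank',
    'PREVOD NA UCET': 'remittance to another bank',
}

transaction_sample = pd.DataFrame({
    'Type': ['PRIJEM', 'VYDAJ'],
    'Operation': ['VKLAD', 'VYBER'],
})

# A nested dict applies each mapping to its own column only;
# codes that are not listed are left unchanged.
translated = transaction_sample.replace(
    {'Type': type_labels, 'Operation': operation_labels}
)
print(translated)
```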

### <span style="color: blue;">Save preprocessed transaction dataframe data into another `transaction.csv` file</span>
* The data has been preprocessed and it is ready to be ingested into MS SQL Server.
* Export the preprocessed data in the pandas dataframe into `transaction.csv` file in another folder.

In [42]:
transaction.to_csv('../dataset-ready-to-upload-to-rdbms/transaction.csv', index=False)

### <span style="color: blue;">Next Step: Exploratory Data Analysis (EDA) in SQL (MS SQL Server)</span>

* Although the EDA could be completed here using Python and pandas, part of the original goal is to demonstrate mastery of SQL by performing the EDA in MS SQL Server.
* So, the dataset will be imported into a relational database called `BankDB` on MS SQL Server.

### Purpose: 
* to understand the dataset, the data structure, the content, and any potential issues with the dataset.
* to analyze with descriptive statistics.
* to handle missing values.
* to correct data type mismatch, including date data types.
* to standardize text data.
* to handle duplicate and inconsistent data.
* to validate the data.
* to add new tables and columns based on existing data for analysis purposes.