## **Berka Bank Dataset Analysis Project - Part 1** 
*<span style="color: blue;">by Dr. John Akinyemi</span>*

### **Introduction**
In this project, we delve into the Berka dataset, a rich source of operational information from a Czech bank. Our aim is to leverage data exploration and knowledge discovery techniques to illuminate insights within the financial domain, particularly in understanding clients, accounts, loans, credit cards, districts (bank branches), and various financial transactions.

#### **About the Dataset**
The Berka dataset, originally provided for the 1999 PKDD Discovery Challenge, constitutes a comprehensive collection of data points crucial for understanding banking operations and client behaviors. [Link to dataset](https://webpages.charlotte.edu/mirsad/itcs6265/group1/index.html)

### **Key Statistics**
* The dataset boasts a substantial repository of information, encompassing over 5,000 clients and approximately 1 million transactions.
* Notably, the bank has extended nearly 700 loans to its clients and issued close to 900 credit cards, underscoring its significant financial activities.

### **Challenges and Opportunities**
Bank managers face a dual challenge: the imperative to enhance their services while mitigating risks. They aspire to move beyond vague notions about their clientele and aim to discern actionable insights to improve service delivery.

#### **Identified Challenges:**
1. **Limited Client Understanding:** Bank managers lack detailed insights into their clients' profiles and behaviors.
2. **Targeted Service Improvements:** They seek to identify prime candidates for additional services, as well as mitigate risks associated with less favorable clients.
  
#### **Desired Outcomes:**
1. **Granular Client Profiling:** A comprehensive understanding of clients, potentially segmented by district, to inform tailored strategies.
2. **Actionable Insights:** Extracting meaningful insights from the dataset to drive informed decision-making.
3. **Risk Mitigation Strategies:** Identification of client characteristics conducive to successful loan and credit card applications, alongside indicators of potential risk.

### **Conclusion**
The Berka dataset is a rich resource for studying how a bank can improve its operations and customer service. By applying data analysis techniques to it, we aim to give bank managers the insights they need to make informed decisions about services and risk.

### **John's Data Analysis Project**

#### **Project Overview**
The objective of this project is to demonstrate proficiency in data analysis skills ranging from data acquisition to visualization.

#### **Project Structure**
The project is structured into four distinct parts, each showcasing mastery in various aspects of data analysis:

1. **Part 1: Data Preprocessing with Python and Pandas**
2. **Part 2: SQL Integration with MS SQL Server for Data Loading and Preprocessing**
3. **Part 3: Exploratory Data Analysis (EDA) with SQL in MS SQL Server**
   * **Bonus:** Seamless Integration of Python, Pandas, SQL, and MS SQL Server within the Jupyter Notebook Environment
4. **Part 4: Data Storytelling and Visualization Utilizing Power BI**

#### **Part 1: Data Preprocessing with Python and Pandas**
In this initial phase, the focus lies on leveraging Python and the powerful Pandas library to preprocess the raw data extracted from the Berka dataset. The aim is to ensure that the data is formatted and structured optimally for seamless integration into SQL Server.

#### **Project Progression**
The project unfolds as follows:
- **Part 1:** Data preprocessing using Python and Pandas
- **Part 2:** Loading and preprocessing data using SQL within MS SQL Server
- **Part 3:** Conducting Exploratory Data Analysis (EDA) with SQL within MS SQL Server
  - *Bonus:* Demonstrating expertise in working with Python, Pandas, SQL, and MS SQL Server RDBMS cohesively within the Jupyter Notebook environment
- **Part 4:** Crafting a compelling data story and visualizations using Power BI

#### **Next Steps**
With the project plan in place, the immediate next step is to load the raw Berka dataset into Pandas and preprocess it so that the data is ready for integration into SQL Server. This phase sets the stage for the subsequent analysis and storytelling.

#### **Conclusion**
By embarking on this comprehensive data analysis journey, I aim to unearth valuable insights as well as showcase the seamless synergy between diverse tools and technologies within the realm of data analysis and visualization. Stay tuned as we unravel the intricacies of the Berka dataset and illuminate the story it holds.

In [1]:
import pandas as pd

## <span style="color: blue;">Load csv files and Rename Columns/Fields</span>
* The separator (delimiter) in the csv files is a semicolon (`;`), not a comma (`,`).
* Add `sep=';'` to `read_csv()`.
* Also, add `low_memory=False` to the `read_csv()` call for `transaction.csv`. The file is relatively large (~67MB), and this setting makes pandas read it in one pass so that column types are inferred consistently rather than chunk by chunk.

In [2]:
account = pd.read_csv('../berka-bank-raw-dataset/account.csv', sep=';')
card = pd.read_csv('../berka-bank-raw-dataset/card.csv', sep=';')
client = pd.read_csv('../berka-bank-raw-dataset/client.csv', sep=';')
disposition = pd.read_csv('../berka-bank-raw-dataset/disposition.csv', sep=';')
district = pd.read_csv('../berka-bank-raw-dataset/district.csv', sep=';')
loan = pd.read_csv('../berka-bank-raw-dataset/loan.csv', sep=';')
order = pd.read_csv('../berka-bank-raw-dataset/order.csv', sep=';')
transaction = pd.read_csv('../berka-bank-raw-dataset/transaction.csv', sep=';', low_memory=False)
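To illustrate why `sep=';'` matters, the sketch below parses a small in-memory sample with and without the separator. The two sample rows are made up to match the shape of `account.csv`; they are not read from the real file.

```python
import io
import pandas as pd

# Hypothetical two-row sample shaped like account.csv (semicolon-delimited).
sample_text = (
    'account_id;district_id;frequency;date\n'
    '576;55;POPLATEK MESICNE;930101\n'
    '3818;74;POPLATEK MESICNE;930101\n'
)

# Without sep=';' pandas assumes commas, so each line lands in a single column.
wrong = pd.read_csv(io.StringIO(sample_text))

# With sep=';' the four fields are split as intended.
right = pd.read_csv(io.StringIO(sample_text), sep=';')

print(wrong.shape)
print(right.shape)
```

Checking `.shape` or `.columns` right after loading is a cheap way to catch a wrong delimiter before any renaming is attempted.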

### Data Preprocessing for Account (account.csv)

In [3]:
account.head()

Unnamed: 0,account_id,district_id,frequency,date
0,576,55,POPLATEK MESICNE,930101
1,3818,74,POPLATEK MESICNE,930101
2,704,55,POPLATEK MESICNE,930101
3,2378,16,POPLATEK MESICNE,930101
4,2632,24,POPLATEK MESICNE,930102


In [4]:
account.columns

Index(['account_id', 'district_id', 'frequency', 'date'], dtype='object')

In [5]:
account.rename(columns={
    'account_id':'AccountID', 
    'district_id':'DistrictID', 
    'frequency':'Frequency', 
    'date':'EntryDate'
}, inplace=True)

In [6]:
account.head()

Unnamed: 0,AccountID,DistrictID,Frequency,EntryDate
0,576,55,POPLATEK MESICNE,930101
1,3818,74,POPLATEK MESICNE,930101
2,704,55,POPLATEK MESICNE,930101
3,2378,16,POPLATEK MESICNE,930101
4,2632,24,POPLATEK MESICNE,930102


### <span style="color: blue;">Save preprocessed accounts dataframe data into another `account.csv` file</span>
* The data has been preprocessed and it is ready to be ingested into MS SQL Server.
* Export the preprocessed data in the pandas dataframe into `account.csv` file in another folder.

In [7]:
account.to_csv('../dataset-ready-to-upload-to-rdbms/account.csv', index=False)
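The `EntryDate` values shown above (for example `930101`) look like dates packed as `YYMMDD`. In this project, date types are corrected later in SQL Server, so the column is exported unchanged. If the conversion were done in pandas instead, a minimal sketch could look like the following. The sample frame and the `EntryDateParsed` column name are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample mirroring the first rows of the renamed account frame.
sample = pd.DataFrame({
    'AccountID': [576, 3818, 2632],
    'EntryDate': [930101, 930101, 930102],
})

# Assumption: the integers encode YYMMDD. Zero-pad to six digits so that
# a value with a leading zero in the year would still parse.
as_text = sample['EntryDate'].astype(str).str.zfill(6)
sample['EntryDateParsed'] = pd.to_datetime(as_text, format='%y%m%d')

print(sample)
```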

### Data Preprocessing for Credit Card (card.csv)

In [8]:
card.head()

Unnamed: 0,card_id,disp_id,type,issued
0,1005,9285,classic,931107 00:00:00
1,104,588,classic,940119 00:00:00
2,747,4915,classic,940205 00:00:00
3,70,439,classic,940208 00:00:00
4,577,3687,classic,940215 00:00:00


In [9]:
card.columns

Index(['card_id', 'disp_id', 'type', 'issued'], dtype='object')

In [10]:
card.rename(columns={
    'card_id':'CardID',
    'disp_id':'DispositionID', 
    'type':'Type', 
    'issued':'IssuedDate', 
}, inplace=True)

In [11]:
card.head()

Unnamed: 0,CardID,DispositionID,Type,IssuedDate
0,1005,9285,classic,931107 00:00:00
1,104,588,classic,940119 00:00:00
2,747,4915,classic,940205 00:00:00
3,70,439,classic,940208 00:00:00
4,577,3687,classic,940215 00:00:00


Save preprocessed credit card dataframe data into a `credit_card.csv` file
* The data has been preprocessed and it is ready to be ingested into MS SQL Server.
* Export the preprocessed data in the pandas dataframe into a `credit_card.csv` file.

In [12]:
card.to_csv('../dataset-ready-to-upload-to-rdbms/credit_card.csv', index=False)

### Data Preprocessing for Client (client.csv)

In [13]:
client.head()

Unnamed: 0,client_id,birth_number,district_id
0,1,706213,18
1,2,450204,1
2,3,406009,1
3,4,561201,5
4,5,605703,5


In [14]:
client.columns

Index(['client_id', 'birth_number', 'district_id'], dtype='object')

In [15]:
client.rename(columns={
    'client_id':'ClientID',
    'birth_number':'BirthNumber', 
    'district_id':'DistrictID', 
}, inplace=True)

In [16]:
client.head()

Unnamed: 0,ClientID,BirthNumber,DistrictID
0,1,706213,18
1,2,450204,1
2,3,406009,1
3,4,561201,5
4,5,605703,5


Save preprocessed client dataframe data into another `client.csv` file

In [17]:
client.to_csv('../dataset-ready-to-upload-to-rdbms/client.csv', index=False)
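According to the PKDD'99 dataset description, `birth_number` packs both the birth date and the client's sex: `YYMMDD` for men and `YYMM+50DD` for women (the month is increased by 50). This notebook exports the column unchanged. The sketch below shows one way it could be decoded in pandas. The helper name `decode_birth_number` and the output column names are hypothetical, and the century is assumed to be 1900 for every client.

```python
import pandas as pd

def decode_birth_number(birth_numbers: pd.Series) -> pd.DataFrame:
    """Split YYMMDD / YYMM+50DD integers into a birth date and a sex flag."""
    year = 1900 + birth_numbers // 10000       # assumption: all births are 19xx
    month_raw = (birth_numbers // 100) % 100   # 1-12 for men, 51-62 for women
    day = birth_numbers % 100
    is_female = month_raw > 50
    # Keep the month for men, subtract the 50 offset for women.
    month = month_raw.where(~is_female, month_raw - 50)
    birth_date = pd.to_datetime(
        pd.DataFrame({'year': year, 'month': month, 'day': day})
    )
    return pd.DataFrame({
        'BirthDate': birth_date,
        'Sex': is_female.map({True: 'F', False: 'M'}),
    })

# First rows of client.csv as shown above.
sample = pd.Series([706213, 450204, 406009, 561201, 605703])
decoded = decode_birth_number(sample)
print(decoded)
```

The year is built arithmetically instead of with a `%y` format string because two-digit-year parsing may place years such as `45` in the 21st century.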

### Data Preprocessing for Disposition (disposition.csv)

In [18]:
disposition.head()

Unnamed: 0,disp_id,client_id,account_id,type
0,1,1,1,OWNER
1,2,2,2,OWNER
2,3,3,2,DISPONENT
3,4,4,3,OWNER
4,5,5,3,DISPONENT


In [19]:
disposition.rename(columns={
    'disp_id':'DispositionID', 
    'client_id':'ClientID',
    'account_id':'AccountID', 
    'type':'Type',    
}, inplace=True)

In [20]:
disposition.head()

Unnamed: 0,DispositionID,ClientID,AccountID,Type
0,1,1,1,OWNER
1,2,2,2,OWNER
2,3,3,2,DISPONENT
3,4,4,3,OWNER
4,5,5,3,DISPONENT


In [21]:
disposition.columns

Index(['DispositionID', 'ClientID', 'AccountID', 'Type'], dtype='object')

Save preprocessed disposition dataframe data into another `disposition.csv` file.

In [22]:
disposition.to_csv('../dataset-ready-to-upload-to-rdbms/disposition.csv', index=False)

### Data Preprocessing for District (district.csv)

In [23]:
district.head()

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16
0,1,Hl.m. Praha,Prague,1204953,0,0,0,1,1,100.0,12541,0.29,0.43,167,85677,99107
1,2,Benesov,central Bohemia,88884,80,26,6,2,5,46.7,8507,1.67,1.85,132,2159,2674
2,3,Beroun,central Bohemia,75232,55,26,4,1,5,41.7,8980,1.95,2.21,111,2824,2813
3,4,Kladno,central Bohemia,149893,63,29,6,2,6,67.4,9753,4.64,5.05,109,5244,5892
4,5,Kolin,central Bohemia,95616,65,30,4,1,6,51.4,9307,3.85,4.43,118,2616,3040


In [24]:
district.rename(columns={
    'A1':'DistrictID',    
}, inplace=True)

In [25]:
district.head()

Unnamed: 0,DistrictID,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16
0,1,Hl.m. Praha,Prague,1204953,0,0,0,1,1,100.0,12541,0.29,0.43,167,85677,99107
1,2,Benesov,central Bohemia,88884,80,26,6,2,5,46.7,8507,1.67,1.85,132,2159,2674
2,3,Beroun,central Bohemia,75232,55,26,4,1,5,41.7,8980,1.95,2.21,111,2824,2813
3,4,Kladno,central Bohemia,149893,63,29,6,2,6,67.4,9753,4.64,5.05,109,5244,5892
4,5,Kolin,central Bohemia,95616,65,30,4,1,6,51.4,9307,3.85,4.43,118,2616,3040


In [26]:
district.columns

Index(['DistrictID', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10',
       'A11', 'A12', 'A13', 'A14', 'A15', 'A16'],
      dtype='object')

Save preprocessed district dataframe data into another `district.csv` file.

In [27]:
district.to_csv('../dataset-ready-to-upload-to-rdbms/district.csv', index=False)
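Only `A1` is renamed above, so `A2` to `A16` keep their coded names. The PKDD'99 dataset description explains what each code holds, and the sketch below applies one possible set of descriptive names. The target names are illustrative suggestions, not names used elsewhere in this project, and should be checked against the dataset description before use.

```python
import pandas as pd

# Illustrative target names; meanings follow the PKDD'99 dataset description.
DISTRICT_COLUMNS = {
    'A2': 'DistrictName',
    'A3': 'Region',
    'A4': 'Inhabitants',
    'A5': 'MunicipalitiesUnder499',
    'A6': 'Municipalities500To1999',
    'A7': 'Municipalities2000To9999',
    'A8': 'MunicipalitiesOver10000',
    'A9': 'Cities',
    'A10': 'UrbanRatio',
    'A11': 'AverageSalary',
    'A12': 'UnemploymentRate1995',
    'A13': 'UnemploymentRate1996',
    'A14': 'EntrepreneursPer1000',
    'A15': 'Crimes1995',
    'A16': 'Crimes1996',
}

# One sample row (Prague) taken from the output above.
columns = ['DistrictID'] + [f'A{i}' for i in range(2, 17)]
row = [1, 'Hl.m. Praha', 'Prague', 1204953, 0, 0, 0, 1, 1,
       100.0, 12541, 0.29, 0.43, 167, 85677, 99107]
sample = pd.DataFrame([row], columns=columns)

# rename returns a new frame; the original sample is left untouched.
renamed = sample.rename(columns=DISTRICT_COLUMNS)
print(list(renamed.columns))
```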

### Data Preprocessing for Loan (loan.csv)

In [28]:
loan.head()

Unnamed: 0,loan_id,account_id,date,amount,duration,payments,status
0,5314,1787,930705,96396,12,8033.0,B
1,5316,1801,930711,165960,36,4610.0,A
2,6863,9188,930728,127080,60,2118.0,A
3,5325,1843,930803,105804,36,2939.0,A
4,7240,11013,930906,274740,60,4579.0,A


In [29]:
loan.rename(columns={
    'loan_id':'LoanID', 
    'account_id':'AccountID', 
    'date':'EntryDate',
    'amount':'Amount', 
    'duration':'Duration',
    'payments':'Payments', 
    'status':'Status',     
}, inplace=True)

In [30]:
loan.head()

Unnamed: 0,LoanID,AccountID,EntryDate,Amount,Duration,Payments,Status
0,5314,1787,930705,96396,12,8033.0,B
1,5316,1801,930711,165960,36,4610.0,A
2,6863,9188,930728,127080,60,2118.0,A
3,5325,1843,930803,105804,36,2939.0,A
4,7240,11013,930906,274740,60,4579.0,A


In [31]:
loan.columns

Index(['LoanID', 'AccountID', 'EntryDate', 'Amount', 'Duration', 'Payments',
       'Status'],
      dtype='object')

Save preprocessed loan dataframe data into another `loan.csv` file

In [32]:
loan.to_csv('../dataset-ready-to-upload-to-rdbms/loan.csv', index=False)

### Data Preprocessing for Order (order.csv)

In [33]:
order.head()

Unnamed: 0,order_id,account_id,bank_to,account_to,amount,k_symbol
0,29401,1,YZ,87144583,2452.0,SIPO
1,29402,2,ST,89597016,3372.7,UVER
2,29403,2,QR,13943797,7266.0,SIPO
3,29404,3,WX,83084338,1135.0,SIPO
4,29405,3,CD,24485939,327.0,


In [34]:
order.rename(columns={
    'order_id':'OrderID', 
    'account_id':'AccountID', 
    'bank_to':'BankTo',
    'account_to':'AccountTo', 
    'amount':'Amount', 
    'k_symbol':'KSymbol',     
}, inplace=True)

In [35]:
order.head()

Unnamed: 0,OrderID,AccountID,BankTo,AccountTo,Amount,KSymbol
0,29401,1,YZ,87144583,2452.0,SIPO
1,29402,2,ST,89597016,3372.7,UVER
2,29403,2,QR,13943797,7266.0,SIPO
3,29404,3,WX,83084338,1135.0,SIPO
4,29405,3,CD,24485939,327.0,


In [36]:
order.columns

Index(['OrderID', 'AccountID', 'BankTo', 'AccountTo', 'Amount', 'KSymbol'], dtype='object')

Save preprocessed order dataframe data into another `order.csv` file.

In [37]:
order.to_csv('../dataset-ready-to-upload-to-rdbms/order.csv', index=False)

### Data Preprocessing for Transaction Data (transaction.csv)

In [38]:
transaction.head()

Unnamed: 0,trans_id,account_id,date,type,operation,amount,balance,k_symbol,bank,account
0,695247,2378,930101,PRIJEM,VKLAD,700.0,700.0,,,
1,171812,576,930101,PRIJEM,VKLAD,900.0,900.0,,,
2,207264,704,930101,PRIJEM,VKLAD,1000.0,1000.0,,,
3,1117247,3818,930101,PRIJEM,VKLAD,600.0,600.0,,,
4,579373,1972,930102,PRIJEM,VKLAD,400.0,400.0,,,


In [39]:
transaction.rename(columns={
    'trans_id':'TransactionID', 
    'account_id':'AccountID', 
    'date':'EntryDate',
    'type':'Type', 
    'operation':'Operation', 
    'amount':'Amount', 
    'balance':'Balance',
    'k_symbol':'KSymbol',     
    'bank':'Bank',     
    'account':'Account' 
}, inplace=True)

In [40]:
transaction.head()

Unnamed: 0,TransactionID,AccountID,EntryDate,Type,Operation,Amount,Balance,KSymbol,Bank,Account
0,695247,2378,930101,PRIJEM,VKLAD,700.0,700.0,,,
1,171812,576,930101,PRIJEM,VKLAD,900.0,900.0,,,
2,207264,704,930101,PRIJEM,VKLAD,1000.0,1000.0,,,
3,1117247,3818,930101,PRIJEM,VKLAD,600.0,600.0,,,
4,579373,1972,930102,PRIJEM,VKLAD,400.0,400.0,,,


In [41]:
transaction.columns

Index(['TransactionID', 'AccountID', 'EntryDate', 'Type', 'Operation',
       'Amount', 'Balance', 'KSymbol', 'Bank', 'Account'],
      dtype='object')

Save preprocessed transaction dataframe data into another `transaction.csv` file.

In [42]:
transaction.to_csv('../dataset-ready-to-upload-to-rdbms/transaction.csv', index=False)
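The `Type` and `Operation` columns hold Czech codes. Text standardization is listed among the SQL-phase tasks, so the codes are exported unchanged here. For reference, a pandas sketch of translating them follows. The English labels follow the PKDD'99 dataset description; the dictionary names and the `TypeEN` and `OperationEN` columns are illustrative assumptions.

```python
import pandas as pd

# English labels as given in the PKDD'99 dataset description.
TYPE_LABELS = {
    'PRIJEM': 'credit',
    'VYDAJ': 'withdrawal',
}
OPERATION_LABELS = {
    'VKLAD': 'credit in cash',
    'VYBER': 'withdrawal in cash',
    'VYBER KARTOU': 'credit card withdrawal',
    'PREVOD Z UCTU': 'collection from another bank',
    'PREVOD NA UCET': 'remittance to another bank',
}

# Hypothetical sample; the last row has a missing operation.
sample = pd.DataFrame({
    'Type': ['PRIJEM', 'VYDAJ', 'PRIJEM'],
    'Operation': ['VKLAD', 'VYBER', None],
})

# Series.map leaves unmapped or missing codes as NaN, which makes gaps easy to spot.
sample['TypeEN'] = sample['Type'].map(TYPE_LABELS)
sample['OperationEN'] = sample['Operation'].map(OPERATION_LABELS)

print(sample)
```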

### **Exploratory Data Analysis (EDA) in SQL (MS SQL Server)**

#### **Introduction**
In this phase of the project, we transition from Python-based analysis to SQL-centric exploration using MS SQL Server. By importing the dataset into a relational database named `BankDB`, we aim to delve deeper into the dataset's intricacies, structure, and content. This shift not only underscores mastery in SQL but also allows for a comprehensive analysis leveraging the robust capabilities of MS SQL Server.

#### **Purpose**
The primary objectives of conducting EDA in SQL are as follows:
1. **Understanding Dataset Characteristics:** Gain insights into the dataset's structure, content, and potential issues.
2. **Descriptive Statistics Analysis:** Perform descriptive statistical analysis to glean meaningful insights.
3. **Data Preprocessing:** Handle missing values, correct data type mismatches (including date types), standardize text data, and address duplicate or inconsistent entries.
4. **Data Validation:** Validate the integrity and accuracy of the dataset.
5. **Augmentation:** Add new tables and columns derived from existing data to facilitate further analysis.

#### **Key Steps**
1. **Data Import:** Import the dataset into the `BankDB` relational database on MS SQL Server.
2. **Exploratory Analysis:** Utilize SQL queries to examine dataset characteristics, including structure, content, and summary statistics.
3. **Preprocessing:** Address data quality issues such as missing values, data type inconsistencies, text standardization, and duplicate/inconsistent entries.
4. **Validation:** Verify the data's integrity through rigorous validation checks.
5. **Augmentation:** Introduce new tables and columns as necessary to enrich the dataset for subsequent analysis.
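
The validation step above is carried out in SQL in this project. To give a sense of what such checks involve, here is a minimal pandas sketch of two common ones: duplicate primary keys and foreign keys with no matching parent row. The helper `find_orphans` and the tiny sample frames are hypothetical and are not part of the project code.

```python
import pandas as pd

def find_orphans(child: pd.DataFrame, parent: pd.DataFrame, key: str) -> pd.DataFrame:
    """Return child rows whose key value has no match in the parent table."""
    return child[~child[key].isin(parent[key])]

# Hypothetical samples; the last disposition row points at a missing client.
client = pd.DataFrame({'ClientID': [1, 2, 3]})
disposition = pd.DataFrame({
    'DispositionID': [1, 2, 3, 4],
    'ClientID': [1, 2, 3, 99],
})

# A primary key should never repeat.
duplicate_keys = int(client['ClientID'].duplicated().sum())

# Every disposition should reference an existing client.
orphans = find_orphans(disposition, client, 'ClientID')

print(duplicate_keys)
print(orphans)
```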

#### **Conclusion**
Embarking on EDA within MS SQL Server marks a pivotal phase in our data analysis journey. By harnessing the power of SQL, we aim to unearth hidden insights, rectify data anomalies, and lay the groundwork for deeper analysis. Stay tuned as we navigate through the dataset's intricacies and illuminate the path forward in our quest for actionable insights.