## Data Pipeline with Financial Data

This is intended to be an end-to-end data pipeline built on synthetic data generated with the [Faker](https://faker.readthedocs.io/) Python package. The planned architecture uses AWS for storage (Amazon S3), and the data will be persisted in a relational database through the Airbyte MotherDuck connector.
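As a rough illustration of the storage step, the sketch below serializes rows to CSV and hands them to an S3 client. It is only one possible shape for this step: `rows_to_csv_bytes` and `upload_rows` are hypothetical helpers, and the bucket and key names in the usage note are placeholders. The client is passed in as an argument, so the block needs only the standard library; `put_object` is the boto3 S3 client method it expects.

```python
import csv
import io


def rows_to_csv_bytes(rows: list[dict]) -> bytes:
    """Serialize row dicts to UTF-8 CSV with a header line (hypothetical helper)."""
    if not rows:
        raise ValueError("rows must not be empty")
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")


def upload_rows(s3_client, bucket: str, key: str, rows: list[dict]) -> int:
    """Upload rows as one CSV object and return the number of bytes sent.

    `s3_client` is assumed to behave like `boto3.client("s3")`.
    """
    body = rows_to_csv_bytes(rows)
    s3_client.put_object(Bucket=bucket, Key=key, Body=body, ContentType="text/csv")
    return len(body)
```

With boto3 installed and credentials configured, a call could look like `upload_rows(boto3.client("s3"), "my-example-bucket", "raw/financial_transactions.csv", rows)`.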

### Generating data for the data pipeline

In [1]:
import faker
from faker.providers import BaseProvider
import pandas as pd
import random
from datetime import timedelta
from utils.data_generator import generate_financial_dataset
from print_versions import print_versions

In [2]:
print_versions(globals())

faker==19.6.2
pandas==2.0.3


In [2]:
# Generate the dataset
financial_data = generate_financial_dataset(100000)

# # Optional: Save to CSV
# financial_data.to_csv('financial_transactions.csv', index=False)

# # Display basic information about the dataset
# print(financial_data.info())
# print("\nSample Data:")
# print(financial_data.head())
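`generate_financial_dataset` lives in `utils/data_generator.py` and is not shown here. To give a feel for what such a generator might do, the sketch below builds rows with the standard library only, so it swaps Faker for `random` and `uuid` and covers just the IDs, dates, and the numeric and categorical columns. Every name in it (`make_transaction`, `make_dataset`, the category lists) is hypothetical, the numeric ranges are inferred from the `describe()` output in this notebook, and the category values are only those visible in the sample rows.

```python
import random
import uuid
from datetime import date, timedelta

# Values seen in the sample rows; the real generator may use longer lists.
ACCOUNT_TYPES = ["Checking", "Savings", "Joint", "Retirement"]
TRANSACTION_TYPES = ["Deposit", "Withdrawal", "Transfer"]
RISK_PROFILES = ["Conservative", "Low Risk", "High Risk"]


def make_transaction(rng: random.Random, today: date) -> dict:
    """Build one synthetic transaction row (hypothetical helper)."""
    return {
        # Random version-4 UUIDs drawn from the seeded generator.
        "transaction_id": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
        "customer_id": str(uuid.UUID(int=rng.getrandbits(128), version=4)),
        "customer_age": rng.randint(18, 75),
        "account_type": rng.choice(ACCOUNT_TYPES),
        "transaction_type": rng.choice(TRANSACTION_TYPES),
        "risk_profile": rng.choice(RISK_PROFILES),
        "transaction_amount": round(rng.uniform(10, 10_000), 2),
        # ISO date within the last year, stored as a string.
        "transaction_date": (today - timedelta(days=rng.randint(0, 365))).isoformat(),
        "annual_income": round(rng.uniform(20_000, 500_000), 2),
        "credit_score": rng.randint(300, 850),
        "interest_rate": round(rng.uniform(0.5, 15.0), 2),
        "return_percentage": round(rng.uniform(-10.0, 20.0), 2),
    }


def make_dataset(n_rows: int, seed: int = 0) -> list[dict]:
    """Return n_rows synthetic rows; the same seed gives the same rows."""
    rng = random.Random(seed)
    today = date(2024, 6, 1)  # fixed reference date keeps runs reproducible
    return [make_transaction(rng, today) for _ in range(n_rows)]


rows = make_dataset(1000)
print(len(rows), sorted(rows[0].keys()))
```

Passing the result to `pd.DataFrame(rows)` would give a frame similar in shape to `financial_data`, minus the name, email, currency, and location columns that Faker would normally supply.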

In [3]:
financial_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 19 columns):
 #   Column              Non-Null Count   Dtype  
---  ------              --------------   -----  
 0   transaction_id      100000 non-null  object 
 1   customer_id         100000 non-null  object 
 2   customer_name       100000 non-null  object 
 3   customer_email      100000 non-null  object 
 4   customer_age        100000 non-null  int64  
 5   account_type        100000 non-null  object 
 6   transaction_type    100000 non-null  object 
 7   investment_type     100000 non-null  object 
 8   risk_profile        100000 non-null  object 
 9   transaction_amount  100000 non-null  float64
 10  currency            100000 non-null  object 
 11  transaction_date    100000 non-null  object 
 12  country             100000 non-null  object 
 13  city                100000 non-null  object 
 14  annual_income       100000 non-null  float64
 15  credit_score        100000 non-null
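
The `info()` output reports `transaction_date` and the categorical fields as `object`. Before loading into a relational database, it may help to give them explicit types. The snippet below is a minimal sketch on a three-column stand-in frame (built from two of the sample rows), not a step taken from the original pipeline.

```python
import pandas as pd

# Small stand-in for financial_data, using values from the sample rows.
sample = pd.DataFrame(
    {
        "transaction_date": ["2024-05-21", "2024-05-05"],
        "account_type": ["Joint", "Checking"],
        "transaction_amount": [2068.69, 2279.75],
    }
)

typed = sample.assign(
    # Parse ISO date strings into datetime64 values.
    transaction_date=pd.to_datetime(sample["transaction_date"], format="%Y-%m-%d"),
    # Low-cardinality text columns are cheaper to store as categoricals.
    account_type=sample["account_type"].astype("category"),
)

print(typed.dtypes)
```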

In [4]:
financial_data.head(5)

|  | transaction_id | customer_id | customer_name | customer_email | customer_age | account_type | transaction_type | investment_type | risk_profile | transaction_amount | currency | transaction_date | country | city | annual_income | credit_score | interest_rate | return_percentage | market_sector |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | d47b6700-96e1-44a7-9696-97f5f050bf65 | f4fd86b0-abb1-4905-baec-20d4551a66dd | William Ruiz | eguzman@example.org | 60 | Joint | Withdrawal | Real Estate Investment Trust | High Risk | 2068.69 | VUV | 2024-05-21 | KH | Lake Jordan | 29967.84 | 768 | 13.23 | -8.42 | Manufacturing |
| 1 | d5a6c001-cd4a-4899-b089-2a518d397de2 | 27215339-0ead-486b-9def-4e1a165fcb34 | Dominic Cross | tiffany55@example.com | 57 | Checking | Deposit | Cryptocurrency | Conservative | 2279.75 | CUP | 2024-05-05 | CA | Lawrenceside | 278172.6 | 612 | 6.15 | 1.41 | Manufacturing |
| 2 | 36b10567-92f2-4069-b579-28d30c6e3e4e | 9904a9f7-3154-4231-b5f7-f2b5f2a0baad | Steven Rodriguez | jackgreen@example.com | 70 | Savings | Deposit | Stocks | Conservative | 2300.5 | LYD | 2024-01-15 | NR | East Christophershire | 30295.41 | 449 | 4.64 | 16.93 | Energy |
| 3 | a77c65be-616b-4d9c-b128-c1a3abbd6837 | de10ee97-6902-4875-8455-43a5f99f5d21 | Joseph Robinson | zpatterson@example.net | 37 | Savings | Transfer | Mutual Funds | Low Risk | 670.04 | NAD | 2023-11-08 | MZ | Johnsonfort | 165247.49 | 717 | 7.83 | 11.59 | Manufacturing |
| 4 | 55f460a9-cc7b-44d4-bb01-480284815072 | 65f8f604-8932-4269-8f2a-54476c71774f | Juan Lee Jr. | april30@example.net | 39 | Retirement | Withdrawal | Bonds | High Risk | 747.73 | PGK | 2024-01-04 | UG | Michellechester | 131564.82 | 668 | 2.34 | 5.7 | Finance |


In [5]:
financial_data.describe()

|  | customer_age | transaction_amount | annual_income | credit_score | interest_rate | return_percentage |
| --- | --- | --- | --- | --- | --- | --- |
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 46.48957 | 4999.618991 | 259671.085485 | 575.73712 | 7.730999 | 4.995537 |
| std | 16.697012 | 2882.696784 | 138547.161302 | 159.270146 | 4.191487 | 8.674687 |
| min | 18.0 | 10.12 | 20004.98 | 300.0 | 0.5 | -10.0 |
| 25% | 32.0 | 2502.9525 | 139401.0525 | 438.0 | 4.11 | -2.52 |
| 50% | 47.0 | 4981.05 | 259365.32 | 576.0 | 7.69 | 4.98 |
| 75% | 61.0 | 7497.7425 | 380194.36 | 714.0 | 11.35 | 12.53 |
| max | 75.0 | 9999.93 | 499983.71 | 850.0 | 15.0 | 20.0 |


In [7]:
financial_data.columns.to_list()

['transaction_id',
 'customer_id',
 'customer_name',
 'customer_email',
 'customer_age',
 'account_type',
 'transaction_type',
 'investment_type',
 'risk_profile',
 'transaction_amount',
 'currency',
 'transaction_date',
 'country',
 'city',
 'annual_income',
 'credit_score',
 'interest_rate',
 'return_percentage',
 'market_sector']
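
Since the column list and the `describe()` ranges are now known, a lightweight check before loading can catch schema drift or out-of-range values. The following is a minimal sketch of such a check, not part of the original pipeline: `validate_row` is a hypothetical helper, and the numeric bounds are assumptions inferred from the summary statistics in this notebook. The sample row is the first row of the `head(5)` output.

```python
EXPECTED_COLUMNS = [
    "transaction_id", "customer_id", "customer_name", "customer_email",
    "customer_age", "account_type", "transaction_type", "investment_type",
    "risk_profile", "transaction_amount", "currency", "transaction_date",
    "country", "city", "annual_income", "credit_score", "interest_rate",
    "return_percentage", "market_sector",
]

# Inclusive bounds inferred from describe(); assumptions, not a documented contract.
NUMERIC_BOUNDS = {
    "customer_age": (18, 75),
    "transaction_amount": (10.0, 10_000.0),
    "annual_income": (20_000.0, 500_000.0),
    "credit_score": (300, 850),
    "interest_rate": (0.5, 15.0),
    "return_percentage": (-10.0, 20.0),
}


def validate_row(row: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the row passed."""
    problems = []
    missing = [c for c in EXPECTED_COLUMNS if c not in row]
    extra = [c for c in row if c not in EXPECTED_COLUMNS]
    if missing:
        problems.append(f"missing columns: {missing}")
    if extra:
        problems.append(f"unexpected columns: {extra}")
    for column, (low, high) in NUMERIC_BOUNDS.items():
        if column in row and not (low <= row[column] <= high):
            problems.append(f"{column}={row[column]} outside [{low}, {high}]")
    return problems


sample_row = {
    "transaction_id": "d47b6700-96e1-44a7-9696-97f5f050bf65",
    "customer_id": "f4fd86b0-abb1-4905-baec-20d4551a66dd",
    "customer_name": "William Ruiz",
    "customer_email": "eguzman@example.org",
    "customer_age": 60,
    "account_type": "Joint",
    "transaction_type": "Withdrawal",
    "investment_type": "Real Estate Investment Trust",
    "risk_profile": "High Risk",
    "transaction_amount": 2068.69,
    "currency": "VUV",
    "transaction_date": "2024-05-21",
    "country": "KH",
    "city": "Lake Jordan",
    "annual_income": 29967.84,
    "credit_score": 768,
    "interest_rate": 13.23,
    "return_percentage": -8.42,
    "market_sector": "Manufacturing",
}

print(validate_row(sample_row))
```

For a DataFrame, the same check could be applied to each record from `financial_data.to_dict("records")`.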