# Idea 1: Analyse Aggregated Financial Data

Once the project is defined, verify it meets these requirements.

### Learning Objectives

1. Understand How to Source Data
2. Collate and Format Data for Processing and Analysis
3. Analyse Data to Support Business Outcomes
4. Present and Communicate Data to the Appropriate Audience
5. Store, Manage and Distribute Data Securely
6. Collaborate with Others and Practise CPD

### Milestones / Stages

1. Defining outline, purpose, and scope of the project
2. Collecting and formatting relevant data
3. Carrying out numerical and statistical analysis
4. Preparing charts and diagrams to highlight trends and correlations in the data
5. Outlining how the data can be stored and made available securely
6. Describing how the project could be further developed beyond the course

### Deliverables

1. Presentation (max 10 mins) – slides or pdf
- Title slide
- Contents slide
- Introduction to the data project
- Data analysis steps taken (e.g. collection and analysis)
- Findings – with graphs, diagrams and explanations
- Conclusion

2. Technical Report – pdf
- Title Page with learner’s name, project title, date
- Learner declaration (e.g. statement to say this is your own work)
- Executive Summary (short outline)
- Contents
- Introduction – project outline, purpose
- Data collection phase – how data was sourced
- Data analysis phase – numerical and statistical analysis undertaken
- Data visualisation – charts and diagrams
- Data Storage – (e.g. CSV files, SQL DB, how sensitive data would be secured)
- Further development
- Conclusions
- References – links to data sources, URLs, published papers, etc
- Appendix

3. Sample data and program code folder - zip
- Code files – e.g. Python, HTML
- Description of any APIs used
- Sample data file(s) - CSV file or output from SQL table

---

## Description (high-level)

Nowadays people have many financial accounts spread across many providers.
Opening a financial account is quick and easy: with mobile challenger banks
you can do it entirely from your phone, and there are often incentives to
open new accounts, such as offers, bonuses, interest rates and features.

For instance:
- accounts at the same bank (e.g. current and savings)
- accounts at different banks (e.g. offers or features)
- accounts in different countries (e.g. multiple currencies or tax reasons)
- accounts for pensions (e.g. SIPP)
- accounts for ISAs (e.g. tax-free vehicles for cash or stock investments)
- accounts for P2P (e.g. peer-to-peer lending to individuals or businesses)
- accounts for general investments (e.g. taxable investments in bonds or stocks)
- accounts at any institution (e.g. ledger with a supplier or your building management)

Because there are so many accounts, because they are all isolated from each other,
and because they differ in how data is recorded or displayed, it is very difficult to
ask questions of, or analyse, the dataset as a whole. Even a basic question such as
"what is my cash net worth?" is hard to answer.

The idea is to extract financial transaction data from a number of sources,
pre-process the data (e.g. sanitise, transform, normalise, anonymise, filter),
dump it into a centralised relational database, then run analysis on the data.

It will utilise the ETL (Extract, Transform, Load) data pipeline pattern:
- Extract: Pull raw data from various sources.
- Transform: Clean, structure, and prepare data for analysis.
- Load: Store processed data in a data repository.
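
The three ETL stages can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a final design: the CSV headers (`Date`, `Description`, `Amount`), the `transactions` table and its columns, and the choice of SQLite are all assumptions made for the example. Real exports will use different headers.

```python
import csv
import io
import sqlite3
from decimal import Decimal

# Hypothetical raw export; a real bank export will have different headers.
RAW_CSV = """Date,Description,Amount
2024-01-03,Salary,2500.00
2024-01-05,Groceries,-54.20
"""

def extract(source):
    """Extract: read raw rows from a CSV text stream."""
    return list(csv.DictReader(source))

def transform(rows, account):
    """Transform: normalise rows to (account, date, direction, amount_pence)."""
    records = []
    for row in rows:
        amount = Decimal(row["Amount"])
        direction = "credit" if amount >= 0 else "debit"
        # Store whole pence as an integer to avoid floating-point rounding.
        records.append((account, row["Date"], direction, int(abs(amount) * 100)))
    return records

def load(conn, records):
    """Load: store the processed records in a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS transactions "
        "(account TEXT, date TEXT, direction TEXT, amount_pence INTEGER)"
    )
    conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(io.StringIO(RAW_CSV)), account="current"))
row_count = conn.execute("SELECT COUNT(*) FROM transactions").fetchone()[0]
```

Keeping each stage as its own function means a new data source only needs a new `extract` (and possibly `transform`), while `load` stays the same.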

Analysis can include descriptive ("what happened", sums, counts, averages),
predictive ("what might happen", classification, forecasting, estimations),
exploratory (patterns, relationships, anomalies) and data visualisations.
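
As a rough illustration of the descriptive end of that range, the sketch below runs sums, counts and averages over a centralised table with plain SQL. The `transactions` table, its columns and the sample rows are hypothetical. The net figure is only meaningful under the assumption that each account's opening balance is itself recorded as a transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions "
    "(account TEXT, date TEXT, direction TEXT, amount_pence INTEGER)"
)
# Hypothetical sample rows, amounts in pence.
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?, ?)",
    [
        ("current", "2024-01-03", "credit", 250000),
        ("current", "2024-01-05", "debit", 5420),
        ("savings", "2024-01-10", "credit", 10000),
    ],
)

# Credits count as positive and debits as negative.
SIGNED = "CASE direction WHEN 'credit' THEN amount_pence ELSE -amount_pence END"

# Sum per account, then across all accounts.
per_account = dict(
    conn.execute(f"SELECT account, SUM({SIGNED}) FROM transactions GROUP BY account")
)
net_pence = sum(per_account.values())

# Count and average of outgoing payments.
debit_count, debit_avg = conn.execute(
    "SELECT COUNT(*), AVG(amount_pence) FROM transactions WHERE direction = 'debit'"
).fetchone()
```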

---

## Feasibility

Is this project doable? Risks:

### Fetching data

The data sources are financial institutions and the data itself is sensitive,
which means access to the data is more locked down. Reasons include security
and risk aversion, legacy infrastructure, regulation and compliance,
vendor lock-in and closed ecosystems, and a lack of business incentive.
In practice that means data is not accessible directly from the source via an API,
and scraping the data is difficult because the authentication process is more
cumbersome (e.g. login requires 2FA with no way to bypass it, which makes it
difficult to automate).

Ways to derisk:

Open Banking was introduced by regulators to increase competition, customer control, and innovation in financial services. It offers a standardised API framework that allows third-party providers (TPPs) to securely access customer financial data (e.g. balances, transactions, direct debits, standing orders) from UK banks and payment institutions with customer consent, thereby reducing the dependency on web scraping.

The UK’s nine largest banks and building societies are required to make your data available through open banking. Other smaller banks and building societies can choose to take part in open banking.

However, [access is regulated](https://www.openbanking.org.uk/faqs/#b27d41f6-4b8f-4d85-854c-eeb37eb41427__button).
Only regulated TPPs can access these APIs. You need to be registered or authorised with the Financial Conduct Authority (FCA) and enrolled in the Open Banking Directory.
You also need to comply with the Payment Services Regulations 2017, which contain strict requirements around the sharing of data, secure methods of communication and customer identification.

This rules out the solution that offers the most automation for sourcing data.

Remaining solutions include semi-automated access and full-manual access.

Semi-automated would use web scraping but require manual intervention for the login steps.
Risks include:
- scraping might violate the service terms
- anti-scraping deterrents (e.g. `robots.txt` disallow rules, IP address banning, captchas)
- the frontend of the website changing, breaking the scraper
- every website is different and needs a custom scraper which can be time-consuming to build
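
One of the deterrents above, `robots.txt`, can be checked programmatically before any scraper is written. The sketch below uses Python's standard `urllib.robotparser`. The rules and the `bank.example` URLs are invented for illustration, and a permissive `robots.txt` does not by itself mean the service terms allow scraping.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one would be fetched from the site.
ROBOTS_TXT = """User-agent: *
Disallow: /account/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Check whether a generic crawler may fetch each (made-up) URL.
statements_allowed = parser.can_fetch("*", "https://bank.example/account/statements")
help_allowed = parser.can_fetch("*", "https://bank.example/help")
```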

Fully manual means a person manually logs in to the different websites, navigates the websites,
filters the data, exports the data, and names and organises the exported data files.

Both of these methods would limit the number of institutions we can collect data from, simply
because adding a new institution becomes much more time-consuming than if we had a single API.

Another general risk is accessing historical data. Some services do not allow you to access data
older than a certain cut-off, for example 12 months. This means there could be gaps or a lack of historical data, which can affect analysis and predictions.
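
Gaps of this kind can be flagged before any analysis is run. The sketch below lists calendar months between the first and last transaction that contain no transactions at all. The helper name `missing_months` is hypothetical, and an empty month may simply be a quiet month rather than missing data, so the result is a prompt for a manual check, not proof of a gap.

```python
from datetime import date

def missing_months(dates):
    """Return (year, month) pairs with no transactions between the first and last date."""
    present = {(d.year, d.month) for d in dates}
    if not present:
        return []
    (year, month), last = min(present), max(present)
    gaps = []
    while (year, month) < last:
        if (year, month) not in present:
            gaps.append((year, month))
        # Step to the next calendar month, rolling over the year after December.
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return gaps

gaps = missing_months([date(2023, 11, 2), date(2024, 2, 14)])
```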

Having data that we can work with programmatically relies on the websites having a CSV export feature. For those that do not, there would be an intermediate step of either exporting PDF statements and converting those to data that is easier to work with, or paginating the transactions with the scraper and iteratively building up our own CSV export.
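
The second fallback, paginating through transactions and building up our own CSV, might look roughly like the sketch below. `fetch_page` is a hypothetical stand-in for whatever the scraper does to read one page of transactions. Here it is faked with in-memory data so the example runs on its own.

```python
import csv
import io

def build_csv(fetch_page, out, fieldnames):
    """Append pages of rows to a CSV stream until an empty page is returned."""
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    page, total = 1, 0
    while True:
        rows = fetch_page(page)
        if not rows:
            break
        writer.writerows(rows)
        total += len(rows)
        page += 1
    return total

# Fake pages standing in for scraped transaction pages.
FAKE_PAGES = {
    1: [{"date": "2024-01-03", "amount": "2500.00"},
        {"date": "2024-01-05", "amount": "-54.20"}],
    2: [{"date": "2024-01-10", "amount": "100.00"}],
}

def fetch_page(page):
    return FAKE_PAGES.get(page, [])

buffer = io.StringIO()
total = build_csv(fetch_page, buffer, ["date", "amount"])
```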

### Combining different types of data

Depending on the financial platform, the transactional data could vary and be difficult to normalise. All transactions should have some basic fields:
- Direction (debit or credit)
- Amount
- Date

Some might offer additional or optional fields such as description or reference. Some fields will need to be inferred from the context, such as currency and account.
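
One way to picture the normalisation is a single target record plus one small mapping function per source. Everything in the sketch below is an assumption for illustration: the `Transaction` fields, the two export layouts ("bank A" with a signed `Amount` column and `DD/MM/YYYY` dates, "bank B" with separate `Paid in` / `Paid out` columns and ISO dates), and the default currency. Account and currency are passed in by the caller because they are inferred from context rather than read from the file.

```python
from dataclasses import dataclass
from datetime import date, datetime
from decimal import Decimal
from typing import Optional

@dataclass(frozen=True)
class Transaction:
    account: str      # inferred from which export the row came from
    currency: str     # inferred from the account
    txn_date: date
    direction: str    # "debit" or "credit"
    amount: Decimal   # always positive; the sign lives in `direction`
    description: Optional[str] = None

def from_bank_a(row, account, currency="GBP"):
    """Hypothetical layout: signed Amount, DD/MM/YYYY dates."""
    amount = Decimal(row["Amount"])
    return Transaction(
        account=account,
        currency=currency,
        txn_date=datetime.strptime(row["Date"], "%d/%m/%Y").date(),
        direction="credit" if amount >= 0 else "debit",
        amount=abs(amount),
        description=row.get("Description") or None,
    )

def from_bank_b(row, account, currency="GBP"):
    """Hypothetical layout: separate 'Paid in' / 'Paid out' columns, ISO dates."""
    paid_in = row.get("Paid in") or ""
    paid_out = row.get("Paid out") or ""
    return Transaction(
        account=account,
        currency=currency,
        txn_date=date.fromisoformat(row["Date"]),
        direction="credit" if paid_in else "debit",
        amount=Decimal(paid_in or paid_out),
        description=row.get("Reference") or None,
    )

a = from_bank_a({"Date": "05/01/2024", "Amount": "-54.20", "Description": "Groceries"}, "current")
b = from_bank_b({"Date": "2024-01-05", "Paid in": "", "Paid out": "54.20"}, "joint")
```

Both rows end up with the same date, direction and amount despite the different source layouts, which is the property the later analysis relies on.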

As the transactions will come from a ledger specific to your account, as opposed to a general journal with double-entry bookkeeping, you can't see the opposite transaction. So some data may be limited or missing, such as the source or use of funds (where the money came from or went to, and why).

Other financial platforms, such as investment platforms, might have transactional data that looks
very different from a bank statement because of the nature of investing. For instance, a data export from such a platform might include fields such as:
- Trade date
- Settle date
- Unit cost (p)
- Quantity

To simplify the project, I'll focus on the easiest platforms first, which I expect will be banks.
Other types of transactions, such as from investment platforms, can be introduced later.

---

## Project Design / Outline

### 1. Source Data

### 2. Collate and Format Data for Processing and Analysis

### 3. Analyse Data to Support Business Outcomes

### 4. Present and Communicate Data to the Appropriate Audience

### 5. Store, Manage and Distribute Data Securely

### 6. Collaborate with Others and Practise CPD
