**INTRODUCTION:**

In this project, I explore how financial characteristics influence bank profitability using regression analysis. The goal is to predict a key performance metric (Net Income) for the four largest U.S. commercial banks (Bank of America, JPMorgan Chase, Citigroup, and Wells Fargo) based on their financial data from 2005 to 2024. By analyzing how variables such as Total Assets, Total Equity, and Total Revenue relate to profitability, this project aims to uncover the patterns that help explain how a bank's size, efficiency, and balance-sheet structure affect earnings.

The dataset was manually compiled from each bank’s SEC 10-K annual reports, resulting in a consistent panel of financial metrics spanning 20 years. Each record represents a specific bank and year, with features including:

**Total Assets:** overall resources owned by the bank

**Total Equity:** shareholder ownership value

**Total Revenue:** total income generated before expenses

**Net Income:** final profit after all costs (used here as the target variable)

**ROA, ROE, Profit Margin, Asset Turnover, and Equity-to-Assets ratios:** derived indicators of efficiency and leverage

**Bank Name and Year:** categorical and temporal features for context

This dataset is a good foundation for a regression problem because it captures continuous financial relationships that evolve over time. Through this analysis, I aim to build models that can not only estimate profitability but also provide insights into which financial characteristics most strongly drive performance within the banking sector.

**REGRESSION AND HOW IT WORKS:**

Regression is a statistical technique used to model and analyze the relationship between a dependent variable (the value we want to predict) and one or more independent variables (the predictors or features). The main purpose of regression is to estimate how changes in the independent variables are associated with changes in the dependent variable.

In this project, I focus on linear regression, one of the most fundamental and widely used regression methods. Linear regression assumes that the relationship between the dependent variable y and the independent variables x1, x2, ..., xn can be expressed as a weighted sum of the predictors (a straight line when there is only one predictor).

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

In this equation: 

**ŷ:** The predicted value of the dependent variable.

**β0:** The intercept, which represents the predicted value when all of the x's are equal to zero.

**β1 - βn:** The coefficients, which show how much ŷ will change for a one-unit change in each predictor (x) while holding everything else constant.

The model learns these coefficients by minimizing the SSE (Sum of Squared Errors) between the actual values and the predicted values:

$$\text{SSE} = \sum_{i} (y_i - \hat{y}_i)^2$$

This method finds the line that best fits the data by making these squared differences as small as possible.

Once fitted, the model can be used to predict y for any new x inputs and to interpret the influence of each independent variable. For example, a positive coefficient indicates that an increase in that variable is associated with an increase in the predicted outcome, whereas a negative coefficient suggests the opposite.
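The least-squares idea described in this section can be sketched with NumPy. This is a minimal illustration on synthetic data (the data, seed, and "true" coefficients are assumptions made up for the example, not the bank dataset): it solves for the coefficients that minimize the SSE and then checks that nudging one coefficient away from the solution increases the SSE.

```python
import numpy as np

# Synthetic data (an assumption for illustration, not the bank dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                  # three predictors x1..x3
true_beta = np.array([2.0, -1.0, 0.5])
y = 4.0 + X @ true_beta + rng.normal(scale=0.1, size=50)

# Prepend a column of ones so the first coefficient plays the role of the intercept.
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution: the coefficients that minimize the sum of squared errors.
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)


def sse(beta):
    """Sum of squared errors for a given coefficient vector."""
    residuals = y - X_design @ beta
    return float(residuals @ residuals)


sse_fit = sse(beta_hat)
# Moving a coefficient away from the fitted value raises the SSE.
sse_perturbed = sse(beta_hat + np.array([0.0, 0.05, 0.0, 0.0]))

print("intercept and coefficients:", np.round(beta_hat, 2))
print("SSE at fit:", round(sse_fit, 4), "| SSE after nudge:", round(sse_perturbed, 4))
```

Because the SSE is a convex function of the coefficients, any move away from the least-squares solution can only increase it, which is what the comparison at the end shows.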

**EXPERIMENT 1: DATA-UNDERSTANDING:**

Before building any regression models, I began by gaining an initial understanding of the dataset to identify trends, patterns, and relationships among the financial variables. The dataset includes annual financial data for the four largest U.S. banks—Bank of America, JPMorgan Chase, Citigroup, and Wells Fargo—from 2005 to 2024. Each record represents one bank in one year, with features such as Total Assets, Total Equity, Total Revenue, Net Income, ROA, ROE, Profit Margin, Asset Turnover, and Equity-to-Assets.

To understand the structure of the data, I first examined summary statistics (mean, median, minimum, and maximum) to check for outliers and to see how each feature is distributed. Then I created visualizations to explore the relationships among variables and identify potential multicollinearity:

**Pairplot / scatterplots:** used to visualize how Total Assets, Total Revenue, and Net Income move together across different years.

**Correlation heatmap:** revealed that Total Assets, Total Revenue, and Total Equity were all highly correlated, suggesting that banks with larger asset bases also tend to generate higher revenue and profits.

**Line plots over time:** helped visualize growth trends from 2005 to 2024 and highlighted economic downturns (such as 2008–2009 and 2020), where profitability dropped significantly across all banks.

**Boxplots grouped by bank:** provided insight into how each bank differs in size and profitability distributions over the 20-year period.

From this initial exploration, I observed that the dataset shows strong linear relationships among the major financial metrics, making it a suitable candidate for linear regression modeling. However, the high correlation between certain features indicated a need to watch for multicollinearity, which could affect coefficient interpretability in the later experiments.
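The summary statistics and correlation check described above might look like the following sketch. The DataFrame is a synthetic stand-in with hypothetical column names and made-up values; only `describe()` and `corr()` are the point.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the bank panel (hypothetical values and column names).
rng = np.random.default_rng(1)
n = 80  # e.g. 4 banks x 20 years
assets = rng.uniform(1_000_000, 4_000_000, size=n)
df = pd.DataFrame({
    "total_assets": assets,
    "total_equity": 0.10 * assets + rng.normal(0, 10_000, size=n),
    "total_revenue": 0.04 * assets + rng.normal(0, 8_000, size=n),
})
df["net_income"] = 0.2 * df["total_revenue"] + rng.normal(0, 3_000, size=n)

# Summary statistics: mean, min, max, and quartiles (the "50%" row is the median).
summary = df.describe()

# Pairwise Pearson correlations; values near 1 between predictors hint at multicollinearity.
corr = df.corr()

print(summary.loc[["mean", "50%", "min", "max"]].round(0))
print(corr.round(2))
```

A correlation matrix like `corr` is what a heatmap visualizes, so the same table can be inspected numerically before plotting.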

**EXPERIMENT 1: PRE-PROCESSING:**

After gaining an initial understanding of the data, I moved on to the preprocessing stage to prepare the dataset for regression modeling. The goal of this step was to ensure data quality, consistency, and appropriate variable formats for numerical analysis.

First, I checked for missing or null values across all columns. Because the dataset was manually compiled from SEC 10-K filings, there were a few missing entries for earlier years in certain banks. These were handled either by forward-filling, that is, carrying forward the most recent available value (if the missing value fell within a continuous financial trend), or by removing the record if it lacked several key variables.

Next, I verified data types. Most features were numeric, but two columns required adjustments:

**Bank Name:** converted from text to categorical dummy variables so that each bank could be represented numerically in the regression model.

**Year:** kept as a numeric variable to allow trend capture over time if needed.
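The missing-value handling and dummy encoding described above could be sketched as follows. The tiny panel, the column names, the "at least 2 of 3 values present" rule, and the use of `drop_first=True` are all assumptions for illustration; the project's actual thresholds are not stated.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical panel; column names and values are illustrative only.
df = pd.DataFrame({
    "bank": ["A", "A", "A", "B", "B", "B"],
    "year": [2005, 2006, 2007, 2005, 2006, 2007],
    "total_assets": [100.0, np.nan, 120.0, 200.0, 210.0, np.nan],
    "total_equity": [10.0, 11.0, 12.0, 20.0, 21.0, np.nan],
    "net_income": [1.0, 1.1, 1.2, 2.0, 2.1, np.nan],
})
value_cols = ["total_assets", "total_equity", "net_income"]

# Remove records that lack several key variables
# (assumed rule: keep rows with at least 2 of the 3 values present).
enough_data = df[value_cols].notna().sum(axis=1) >= 2
df = df[enough_data].copy()

# Forward-fill the remaining gaps within each bank, in year order,
# so a value is never carried over from a different bank.
df = df.sort_values(["bank", "year"])
df[value_cols] = df.groupby("bank")[value_cols].ffill()

# Encode the bank name as dummy variables. drop_first=True (an assumption)
# avoids a redundant column when the model also fits an intercept.
df = pd.get_dummies(df, columns=["bank"], drop_first=True, dtype=int)

print(df)
```

Filling within each bank matters in a panel like this: a plain forward-fill over the whole table could copy one bank's last value into another bank's first year.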

For the first experiment, I selected a small group of core financial predictors to keep the model simple and interpretable:

*Total Assets*

*Total Equity*

*Total Revenue*

The target variable for prediction was Net Income. These variables were chosen because they represent the fundamental components of a bank’s financial structure—size, capital strength, and earnings capacity—which logically relate to profitability.

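Before fitting the model, I also applied feature scaling using standardization so that each numeric variable had a mean of 0 and a standard deviation of 1. For ordinary least squares this does not change the model’s predictions, but it puts the coefficients on a comparable scale, so that a large-magnitude variable such as Total Assets can be compared directly with a smaller-magnitude one like Total Equity. Scaling also matters for regularized models, where the penalty depends on coefficient size.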
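Standardization itself is a short computation. The sketch below applies it by hand with NumPy to a small made-up feature matrix (the values are hypothetical); scikit-learn's `StandardScaler` performs the same z-score transformation.

```python
import numpy as np

# Hypothetical raw features on very different scales (illustrative values only).
X = np.array([
    [1_500_000.0, 150_000.0, 60_000.0],
    [2_000_000.0, 210_000.0, 85_000.0],
    [2_600_000.0, 250_000.0, 95_000.0],
    [3_200_000.0, 300_000.0, 120_000.0],
])

# z-score each column: subtract its mean, then divide by its standard deviation.
mean = X.mean(axis=0)
std = X.std(axis=0)            # population standard deviation (ddof=0)
X_scaled = (X - mean) / std

print("column means after scaling:", X_scaled.mean(axis=0).round(6))
print("column stds after scaling: ", X_scaled.std(axis=0).round(6))
```

In a train/test workflow, the mean and standard deviation would normally be computed on the training rows only and then reused to transform the test rows.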

No feature engineering was applied at this stage, as the purpose of Experiment 1 was to establish a baseline model using the most direct financial indicators. Later experiments introduce ratio-based features and transformations to test whether they improve performance.

**EXPERIMENT 1: MODELING:**

For the first experiment, I built a linear regression model using the LinearRegression class from scikit-learn. This model applies ordinary least squares (OLS) to find the best-fitting line that minimizes the residual sum of squares between the observed target values and the model’s predicted values.

The model was implemented as follows (parameters of the fitted LinearRegression estimator):

| Parameter | Value |
|---|---|
| fit_intercept | True |
| copy_X | True |
| tol | 1e-06 |
| n_jobs | |
| positive | False |

*How the model works*

The algorithm estimates a coefficient βi for each independent variable that best explains the variation in the dependent variable y.

*The breakdown*

**ŷ:** The predicted value of Net Income.

**β0:** The intercept.

**β1 - β3:** How each feature contributes to Net Income.

The model automatically learns these coefficients by minimizing the sum of squared errors between the actual and predicted values.

*Model parameters*

The key parameter was fit_intercept=True, which tells the model to estimate an intercept term so that predictions are not forced through the origin. All other parameters were left at their defaults (for example, copy_X=True and positive=False), which are appropriate for dense numeric data like this dataset. This baseline model establishes the foundation for comparison with later experiments, where I add engineered features and transformations to improve accuracy.
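A baseline fit of this kind might look like the sketch below. It uses synthetic data, and the 80/20 split and random seeds are assumptions rather than the project's actual settings; only `LinearRegression(fit_intercept=True)` mirrors what the text describes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for three standardized predictors and a target.
rng = np.random.default_rng(42)
X = rng.normal(size=(80, 3))
y = 5.0 + X @ np.array([1.5, 0.8, 2.0]) + rng.normal(scale=0.5, size=80)

# Hold out part of the data so the model can be scored on rows it has not seen.
# (test_size and random_state are illustrative choices.)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Ordinary least squares with an estimated intercept.
model = LinearRegression(fit_intercept=True)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("intercept:", round(float(model.intercept_), 2))
print("coefficients:", np.round(model.coef_, 2))
```

After fitting, `model.intercept_` corresponds to β0 and `model.coef_` to β1 through β3 in the breakdown above.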


**EXPERIMENT 1: EVALUATION:**

After fitting the baseline linear regression model, the next step was to evaluate its performance on unseen data. To do this, I used two key metrics: the Root Mean Squared Error (RMSE) and the Coefficient of Determination (R^2). 

RMSE measures the typical size of the difference between the model’s predicted values and the actual values, giving a sense of how far off the predictions are in the same units as the target variable.

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

In words, RMSE is the square root of the average of all squared differences between the actual and predicted Net Income values, where n here is the number of observations being evaluated. Lower RMSE values indicate a better fit.

R^2 measures how well the model explains the variability in the target variable. An R^2 value close to 1 means the model explains most of the variation in Net Income, while a value near 0 means it explains very little.
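Both metrics can be computed by hand, which makes their definitions concrete. The sketch below uses four made-up values; the R^2 formula shown (one minus the ratio of residual to total sum of squares) is the standard definition and is an addition here, since the text above describes R^2 only in words.

```python
import numpy as np

# Made-up actual and predicted values, for illustration only.
y_true = np.array([10.0, 12.0, 8.0, 15.0])
y_pred = np.array([11.0, 11.0, 9.0, 13.0])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual.
rmse = np.sqrt(np.mean(residuals ** 2))

# R^2: 1 - (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("RMSE:", round(float(rmse), 4))
print("R^2: ", round(float(r2), 4))
```

Note that on held-out data R^2 can be negative, which happens when the model predicts worse than simply using the mean of the actual values.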

Root Mean Squared Error (RMSE): 8614.879586272604
R² Score: 0.29762961598357507

These metrics provided a quantitative measure of how well the model performed on new data.

For the baseline model, the RMSE of about 8,615 is the typical prediction error in Net Income, expressed in the dataset’s units (millions of dollars, if the figures were recorded at that scale). The R^2 of about 0.30 means that Total Assets, Total Equity, and Total Revenue alone explained roughly 30% of the variation in Net Income.

Although the model captured the general relationship between these financial indicators and profitability, there was still room for improvement. In the next experiments, I plan to enhance performance by introducing ratio-based features, handling multicollinearity, and testing alternative regression methods.

**EXPERIMENT 2:**

For the second experiment, I focused on improving the baseline model by addressing two key areas identified in Experiment 1:

1. High correlation among the raw financial variables.
2. Limited feature variety that might have restricted predictive power.

*Changes from Experiment 1:*

Instead of using only the raw balance sheet totals, I introduced several ratio-based and transformed features that better capture a bank’s operational efficiency and profitability.
Specifically, I added:

**ROA (Return on Assets) = Net Income / Total Assets**

**ROE (Return on Equity) = Net Income / Total Equity**

**Profit Margin = Net Income / Total Revenue**

**log(Total Assets)** – a logarithmic transformation to reduce the large scale difference across banks.

The new model used the following predictors:
log(Total Assets), ROA, ROE, Profit Margin, and Equity-to-Assets.

These variables emphasize efficiency and scale rather than just raw size, which should provide a more stable relationship with Net Income.
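The engineered features listed above can be derived in a few lines. In this sketch the two rows are made-up values, the column names are hypothetical, and Equity-to-Assets is assumed to mean Total Equity divided by Total Assets, which the text does not define explicitly.

```python
import numpy as np
import pandas as pd

# Two made-up records; values and column names are illustrative only.
df = pd.DataFrame({
    "total_assets": [2_000_000.0, 3_000_000.0],
    "total_equity": [200_000.0, 270_000.0],
    "total_revenue": [80_000.0, 120_000.0],
    "net_income": [20_000.0, 30_000.0],
})

# Ratio features, following the definitions given in the text.
df["roa"] = df["net_income"] / df["total_assets"]
df["roe"] = df["net_income"] / df["total_equity"]
df["profit_margin"] = df["net_income"] / df["total_revenue"]

# Assumed definition: equity as a share of assets.
df["equity_to_assets"] = df["total_equity"] / df["total_assets"]

# Natural log to compress the scale of very large asset values.
df["log_total_assets"] = np.log(df["total_assets"])

# Sanity check: ROA times assets recovers Net Income exactly.
reconstructed = df["roa"] * np.exp(df["log_total_assets"])

print(df[["roa", "roe", "profit_margin", "equity_to_assets", "log_total_assets"]].round(4))
print(reconstructed.round(2).tolist())
```

The last check is worth keeping in mind when reading the results: because ROA, ROE, and Profit Margin are computed from Net Income, predictors built this way carry information about the target itself.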

**UPDATED MODEL**

RMSE: 1939.8401136550651
R²: 0.9643877314674346

Compared to the baseline model, the second regression produced:

1. A lower RMSE, indicating smaller average prediction errors.

2. A higher R^2 score, meaning the model explained more variation in Net Income.

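These improvements suggest that using ratios and transformations captured the relative performance of each bank more effectively than raw totals. The logarithmic transformation also helped stabilize the relationship between scale and profitability, reducing the impact of outliers from very large asset values. One caveat is that ROA, ROE, and Profit Margin are each computed from Net Income, so part of the gain likely reflects information about the target carried by the predictors rather than better modeling alone.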

This experiment showed that thoughtful feature engineering can significantly improve model accuracy.
By converting raw values into normalized ratios, the regression became more robust and interpretable—better reflecting financial realities such as efficiency and return metrics rather than absolute size alone.

**EXPERIMENT 3:**

For the third experiment, I aimed to further improve model performance and stability by introducing regularization — specifically, Ridge Regression. While ordinary linear regression minimizes the sum of squared errors, Ridge regression adds a penalty term to the cost function that discourages excessively large coefficients. This helps reduce the effects of multicollinearity and overfitting, especially when predictors are correlated, as was observed in the earlier experiments.

*Changes from Previous Experiments:*

1. Model Type: Switched from standard Linear Regression to Ridge Regression.

2. Reason: The ratio-based features in Experiment 2 were highly correlated (e.g., ROA, ROE, and Profit Margin), which can cause unstable coefficient estimates. Ridge regression helps manage that by shrinking coefficients toward zero but not completely eliminating them.

3. Cost function:

$$J(\beta) = \sum_{i} (y_i - \hat{y}_i)^2 + \lambda \sum_{j} \beta_j^2$$

Here, λ controls the amount of regularization — larger values apply a stronger penalty on large coefficients, encouraging smoother, more generalizable models.
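The effect of λ can be seen directly from the closed-form Ridge solution. The sketch below is a NumPy illustration on synthetic, deliberately collinear data (not the project's implementation or data): it minimizes the penalized cost for several λ values and prints how the coefficient vector shrinks. In scikit-learn's `Ridge`, the `alpha` parameter plays the role of λ.

```python
import numpy as np

# Synthetic data with two nearly identical predictors (strong multicollinearity).
rng = np.random.default_rng(7)
n = 60
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)

# Center the data so the intercept is handled separately and is not penalized.
Xc = X - X.mean(axis=0)
yc = y - y.mean()


def ridge_coefficients(lam):
    """Minimize sum (y_i - yhat_i)^2 + lam * sum beta_j^2 via (X'X + lam*I)^-1 X'y."""
    p = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)


lambdas = [0.0, 1.0, 10.0, 100.0]
norms = []
for lam in lambdas:
    beta = ridge_coefficients(lam)
    norms.append(float(np.linalg.norm(beta)))
    print(f"lambda={lam:>5}: beta={np.round(beta, 3)}, norm={norms[-1]:.3f}")
```

With λ = 0 the result is the ordinary least-squares fit, where collinear predictors can receive large offsetting coefficients; as λ grows, the overall size of the coefficient vector shrinks toward zero without any coefficient being set exactly to zero.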


**IMPLEMENTATION**

RMSE: 5294.273636240889
R²: 0.7347346427610923

Compared to the Experiment 1 baseline, this model achieved a lower RMSE (about 5,294 versus 8,615) and a higher R^2 (about 0.73 versus 0.30), meaning it explained about 73% of the variation in Net Income. Compared to Experiment 2, however, the Ridge model produced a lower R^2 and a higher RMSE (Experiment 2 reached an R^2 of about 0.96 and an RMSE of about 1,940).

This drop is expected because Ridge Regression applies coefficient shrinkage to control overfitting. While the unregularized linear model in Experiment 2 fit the data almost perfectly, it may have captured noise from multicollinear features. Ridge prioritizes generalization and model stability rather than maximum accuracy on a single sample, so it is less sensitive to the correlation among financial ratios.

This experiment showed that Ridge Regression trades some bias for lower variance, which can make the model more stable for financial prediction. These metrics alone do not show that Ridge generalizes better than the Experiment 2 model; the case for regularization here rests on coefficient stability when working with interconnected financial metrics.

**IMPACT:**

Building models to predict bank profitability can create value—but it also carries meaningful social, ethical, and economic risks. Below I outline potential positive impacts, risks/harms, and mitigations that reflect critical thinking about real-world use.

*Potential positive impacts:*

**Transparency & benchmarking:** Interpretable regression can highlight which balance-sheet drivers (e.g., revenue vs. leverage) most influence profits, improving stakeholder understanding and internal decision-making.

**Risk awareness:** If profitability is found to depend heavily on leverage, that can flag fragility and encourage more conservative funding structures.

**Educational value:** A reproducible, well-documented workflow (EDA → preprocessing → modeling → evaluation) promotes rigorous, ethical analytics practices.

*Risks and possible negative impacts:*

**Pro-cyclical incentives:** Profit models trained on “good times” can encourage banks to scale up activities that look profitable in booms but amplify losses in downturns (feedback loops).

**Overreliance on correlation:** A high R^2 may be mistaken for causation, leading to policies or compensation plans that chase spurious drivers of profit.

**Multicollinearity & opacity-in-practice:** Even linear models can become hard to interpret when features are highly correlated (e.g., ROA/ROE/margins). Misinterpretation can misguide strategy or supervision.

**Fairness & societal spillovers:** Decisions optimized for short-term profit can reduce credit availability to vulnerable communities, worsen financial exclusion, or shift costs to taxpayers if losses are socialized.

**Model drift:** Banking regimes change (accounting rules, interest-rate cycles, capital standards). A model that performs well on 2005–2024 may degrade quickly, yielding misleading signals.

**Data scope bias:** Using only the Big 4 may limit external validity; their scale and diversification differ from regional banks. Insights could be misapplied to smaller institutions.

*Mitigations and responsible practice:*

**Interpretability first:** Prefer transparent features; report standardized coefficients, VIFs for multicollinearity, and partial dependence / coefficient sensitivity to avoid overclaiming.

**Scenario & stress testing:** Evaluate models across crisis windows (2008–09, 2020) and rate-hike periods; report performance stability, not just average RMSE.

**Governance & documentation:** Include a concise Model Card: data source, time coverage, assumptions, known limits, monitoring plan, and retraining triggers.

**Fairness lens:** Discuss how profit-seeking recommendations might affect access to credit or branch presence; encourage adding community-impact constraints to decisions informed by the model.

**Scope disclaimers:** Clearly state that insights are for the Big 4 panel and may not generalize to smaller banks or different regulatory environments.
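The VIF check mentioned under "Interpretability first" can be computed without extra libraries. The sketch below is one common way to do it, on synthetic data (the three columns are made up): each feature is regressed on the others, and VIF is 1 / (1 - R^2) of that regression.

```python
import numpy as np


def vif(X):
    """Variance inflation factor per column: 1 / (1 - R^2 of column j on the others)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        # Regress column j on an intercept plus all remaining columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out


# Synthetic features: b is nearly a copy of a, c is unrelated to both.
rng = np.random.default_rng(3)
a = rng.normal(size=200)
b = a + rng.normal(scale=0.1, size=200)
c = rng.normal(size=200)

vifs = vif(np.column_stack([a, b, c]))
print(np.round(vifs, 2))
```

A VIF near 1 means a feature is largely independent of the others. Values above roughly 5 to 10 are often treated as a warning sign, although that threshold is a rule of thumb rather than a strict cutoff.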

Predicting profitability can improve strategic clarity and risk awareness, but it must be paired with guardrails—interpretability, stress testing, temporal validation, and explicit consideration of societal impacts—to avoid reinforcing harmful incentives or deploying brittle models in a changing financial system.
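Temporal validation and per-window reporting could be sketched as follows. Everything here is an assumption for illustration: the data are synthetic, the 2018 cutoff is arbitrary, and the model is a plain least-squares fit. The point is the shape of the check, which trains on earlier years and reports RMSE separately for later windows.

```python
import numpy as np

# Synthetic panel: 4 hypothetical banks per year, 2005 through 2024.
rng = np.random.default_rng(11)
years = np.repeat(np.arange(2005, 2025), 4)
X = rng.normal(size=(len(years), 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(scale=0.3, size=len(years))

# Chronological split: fit on the past, evaluate on the future.
train = years <= 2018          # the cutoff year is an arbitrary assumption
test = ~train

A_train = np.column_stack([np.ones(train.sum()), X[train]])
beta, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)


def rmse_for(mask):
    """RMSE of the fitted model on the rows selected by a boolean mask."""
    A = np.column_stack([np.ones(mask.sum()), X[mask]])
    return float(np.sqrt(np.mean((y[mask] - A @ beta) ** 2)))


rmse_by_window = {
    "train 2005-2018": rmse_for(train),
    "test 2019-2024": rmse_for(test),
    "2020 only": rmse_for(years == 2020),
}
for window, value in rmse_by_window.items():
    print(f"{window}: RMSE = {value:.3f}")
```

Reporting error by window rather than as a single average makes it easier to see whether a model holds up in stress periods such as 2020.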

**CONCLUSION:**

Through this regression project, I learned how data preprocessing, feature engineering, and model selection each play a critical role in improving predictive accuracy and interpretability. Starting with a simple baseline model using raw financial variables, I saw that while basic relationships existed, the model’s ability to explain profitability was limited. Once I applied preprocessing techniques—handling missing values, standardizing numeric features, and encoding categorical variables—the model became cleaner and more reliable.

In Experiment 2, introducing feature engineering had the biggest positive impact. By transforming raw totals into ratio-based measures such as ROA, ROE, and Profit Margin, and applying a logarithmic scale to Total Assets, I captured more meaningful financial relationships and achieved a dramatic performance improvement (RMSE ≈ 1939.84, R^2 ≈ 0.9644). This experiment showed how derived variables can better express efficiency and scale effects than absolute numbers.

In Experiment 3, I implemented Ridge Regression to handle multicollinearity and prevent overfitting. Although the R^2 decreased to 0.7347 and the RMSE increased to 5294.27, this model demonstrated the importance of balancing accuracy with stability. Ridge regression produced smaller, more conservative coefficients, making the model more robust and generalizable—an essential quality for financial forecasting where conditions shift over time.

Overall, I learned that the best-performing model on paper is not always the best model in practice. Thoughtful preprocessing and regularization help ensure that predictions remain interpretable, fair, and reliable across changing economic periods. This project strengthened my understanding of how regression techniques connect statistical learning with real-world financial insight, and how each modeling choice influences both performance and ethical application.

**REFERENCES:**

Scikit-learn Developers. (2024). LinearRegression — scikit-learn 1.4 documentation. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

Scikit-learn Developers. (2024). Ridge Regression — scikit-learn 1.4 documentation. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html

U.S. Securities and Exchange Commission (SEC). (2005–2024). Form 10-K Annual Reports for Bank of America, JPMorgan Chase, Citigroup, and Wells Fargo. Retrieved from https://www.sec.gov/edgar/search


**ALL CODE IS LOCATED HERE FOR THIS PROJECT:** https://github.com/cwardle-star/Data-Mining