CASE STUDY: AI in Finance: ROA / Net Income Prediction

This repo covers the data preparation, training, testing, and validation of models for the Introduction to AI in Finance project. The goal is to predict whether a firm's Return on Assets (ROA) will improve in the next fiscal year ($ROA_{t+1} > ROA_t$) using annual financial ratios and features from the current year ($t$).
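
As a minimal sketch of how such a binary target could be built from a firm-year panel: the column names `gvkey`, `fyear`, and `roa`, and the function `add_roa_target`, are illustrative assumptions and may not match what the project scripts actually use.

```python
import pandas as pd


def add_roa_target(df: pd.DataFrame) -> pd.DataFrame:
    """Add a binary target: 1.0 if next fiscal year's ROA exceeds this year's.

    Column names (gvkey, fyear, roa) are assumptions made for illustration.
    """
    out = df.sort_values(["gvkey", "fyear"]).copy()
    grouped = out.groupby("gvkey")
    next_roa = grouped["roa"].shift(-1)      # ROA_{t+1} within the same firm
    next_year = grouped["fyear"].shift(-1)   # fiscal year of the next observation
    # Label a row only if the next observation is the consecutive fiscal year
    # and both ROA values are present; otherwise the target stays missing.
    has_next = next_year.eq(out["fyear"] + 1) & next_roa.notna() & out["roa"].notna()
    out["target"] = (next_roa > out["roa"]).astype(float).where(has_next)
    return out


# Made-up toy panel: firm 2 has a gap year, so its 2019 row gets no label.
panel = pd.DataFrame({
    "gvkey": [1, 1, 1, 2, 2],
    "fyear": [2019, 2020, 2021, 2019, 2021],
    "roa": [0.05, 0.07, 0.03, 0.10, 0.12],
})
labeled = add_roa_target(panel)
```

Rows without a valid next-year observation keep a missing target, so they can be dropped before training rather than being silently labeled 0.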

Git Workflow Guide

To contribute to this repository, ensure you have Git installed and configured in VS Code.

Cloning the Repository

git clone https://github.com/axehole42/data_science_project

Updating the Repository (Pushing Changes)

Stage your changes:
```
git add .
```
Commit your changes (add a meaningful message):
```
git commit -m "Description of your changes"
```
Push to GitHub:
```
git push origin main 
```
(Note: You will need to log in to GitHub the first time you push.)

Project Overview

This repo contains a machine learning pipeline for predicting Return on Assets (ROA) improvement from Compustat financial data.

Procedural Workflow

To produce the analysis and train the models, we executed the following steps in order:

Data Cleanup
- Command: python data_cleanup.py
- Action: Loads the raw Compustat CSV, filters for industrial companies (INDL), and performs initial cleaning.
- Output: task_data/cleaned_data.parquet
Feature Engineering
- Command: python feature_engineering.py
- Action: Loads the cleaned data and calculates financial ratios, lags, and other derived features defined in task_data/feature_groups.json.
- Output: task_data/features.parquet
Model Training (Main Model)
- Command: python MAIN_MODEL_XGB_OPT_TS.py
- Action: Trains an XGBoost classifier using time-series cross-validation and Optuna for hyperparameter optimization.
- Output: task_data/models_optuna_tscv/ (contains model artifacts, metrics, and best parameters)
Model Training (Baselines)
- Logistic Regression: python MODEL_LR_TS_CLEAN.py -> Output: task_data/models_optuna_tscv_logreg_clean/
- Random Forest: python MODEL_RandomF_TS_CLEAN.py -> Output: task_data/models_optuna_tscv_clean_rf_fast/
Report Generation
- Run the relevant scripts in the latex/ folder to generate the LaTeX tables and statistics used in the final case study.
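
The steps above can be chained with a small driver. This is only a sketch: the script names come from the workflow list, but `PIPELINE`, `build_commands`, and `run_pipeline` are hypothetical helpers that do not exist in the repo.

```python
import subprocess
import sys

# Script names are taken from the workflow above; the driver itself is a
# hypothetical convenience and is not part of the repository.
PIPELINE = [
    "data_cleanup.py",
    "feature_engineering.py",
    "MAIN_MODEL_XGB_OPT_TS.py",
    "MODEL_LR_TS_CLEAN.py",
    "MODEL_RandomF_TS_CLEAN.py",
]


def build_commands(scripts=PIPELINE, python=sys.executable):
    """Return one command (as an argument list) per pipeline step, in order."""
    return [[python, script] for script in scripts]


def run_pipeline(scripts=PIPELINE):
    """Run each step in order, stopping at the first failing script."""
    for cmd in build_commands(scripts):
        print("Running:", " ".join(cmd))
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline()` from the repository root would execute the scripts one after another; because each step reads the previous step's output from task_data/, stopping at the first failure avoids training on stale files.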

Main Code Modules

Data Processing

data_cleanup.py:
- Handles the ingestion of the raw CSV file (itaiif_compustat_data_24112025.csv).
- Applies standard Compustat filters (e.g., keeping only standard industrial format 'INDL').
- Converts the data to efficient Parquet format.
feature_engineering.py:
- Central hub for feature creation.
- Computes financial ratios (liquidity, profitability, leverage, etc.).
- Handles lag generation (creating features from previous years).
- Uses robust mathematical operations (safe division, safe log) to handle financial edge cases (zeros, negatives).
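
The "safe" operations mentioned above could look roughly like the following. This is a sketch under assumptions: the names `safe_div` and `safe_log` and the choice to return NaN for invalid inputs are illustrative, and the helpers in feature_engineering.py may behave differently.

```python
import numpy as np


def safe_div(num, den):
    """Element-wise num / den; NaN where the denominator is zero or non-finite."""
    num = np.asarray(num, dtype=float)
    den = np.asarray(den, dtype=float)
    out = np.full(np.broadcast(num, den).shape, np.nan)
    valid = np.isfinite(den) & (den != 0)
    # Only positions flagged as valid are computed; the rest stay NaN.
    np.divide(num, den, out=out, where=valid)
    return out


def safe_log(x):
    """Element-wise natural log; NaN for zero, negative, or non-finite inputs."""
    x = np.asarray(x, dtype=float)
    out = np.full(x.shape, np.nan)
    valid = np.isfinite(x) & (x > 0)
    np.log(x, out=out, where=valid)
    return out


# Made-up values: a zero and a missing denominator both yield NaN.
ratio = safe_div([10.0, 5.0, 3.0], [100.0, 0.0, np.nan])
logs = safe_log([1.0, 0.0, -2.0])
```

Returning NaN instead of raising or producing infinities keeps undefined ratios (for example, a firm with zero total assets) visible as missing values for later handling.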

Model Training

All model scripts share a common "time-series cross-validation" architecture:

MAIN_MODEL_XGB_OPT_TS.py: The primary XGBoost model script. It uses Optuna to find the best hyperparameters by training on a rolling window of historical data and validating on the subsequent year. It saves feature importance and evaluation metrics.
MODEL_LR_TS_CLEAN.py: A Logistic Regression baseline. Includes preprocessing steps like winsorization and standardization within the cross-validation loop to prevent leakage.
MODEL_RandomF_TS_CLEAN.py: A Random Forest baseline optimized for speed ("fast" implementation with capped threads and warm starts).
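
The rolling-window scheme and the leakage-free preprocessing described above can be illustrated as follows. This is a sketch only: the window length, the quantile levels, and the helper names (`rolling_year_splits`, `fit_winsor_bounds`, `apply_winsor`) are assumptions, not the values or functions used in the model scripts.

```python
import numpy as np


def rolling_year_splits(years, train_window=5):
    """Yield (train_years, val_year) pairs.

    Each fold trains on `train_window` adjacent distinct years and validates
    on the distinct year that immediately follows them.
    """
    unique_years = sorted(set(years))
    for i in range(train_window, len(unique_years)):
        yield unique_years[i - train_window:i], unique_years[i]


def fit_winsor_bounds(x_train, lower=0.01, upper=0.99):
    """Estimate clipping bounds from the training fold only (ignores NaN)."""
    return np.nanquantile(x_train, [lower, upper])


def apply_winsor(x, bounds):
    """Clip values to bounds that were learned on the training fold."""
    return np.clip(x, bounds[0], bounds[1])


# Fold layout for six made-up years with a three-year training window.
splits = list(rolling_year_splits(range(2015, 2021), train_window=3))

# Bounds come from training data and are then reused on validation data.
bounds = fit_winsor_bounds(np.array([0.0, 1.0, 2.0, 3.0, 4.0]), 0.25, 0.75)
clipped = apply_winsor(np.array([-10.0, 2.0, 10.0]), bounds)
```

Fitting the bounds (and, analogously, the standardization mean and scale) inside each fold means the validation year never influences the preprocessing applied to it.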

LaTeX Generation Tools (`latex/` folder)

These scripts generate .tex files or printed tables for our case study:

Missing Data Analysis:
- analyze_missing_mechanisms.py, formal_rubin_test.py: Analyze why data is missing (Missing Completely at Random vs. Missing At Random).
- generate_mar_latex.py: Generates tables summarizing missing data patterns.
Descriptive Statistics:
- generate_feature_stats.py: Generates summary statistics for the features.
- generate_ratio_stats.py: Statistics specifically for financial ratios.
Model Results:
- generate_fi_latex.py: Creates feature importance tables from the trained models.
- generate_params_table.py: Formats the best hyperparameters found by Optuna into a LaTeX table.
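
To illustrate the kind of formatting such a script performs, here is a minimal sketch that turns a flat hyperparameter dictionary into a LaTeX tabular. The function `params_to_latex`, the table layout, and the example values are assumptions; generate_params_table.py may produce a different format.

```python
def params_to_latex(params: dict) -> str:
    """Render a flat hyperparameter dict as a two-column LaTeX tabular."""

    def esc(text):
        # Underscores must be escaped in LaTeX text mode.
        return str(text).replace("_", r"\_")

    def fmt(value):
        # Floats are shortened to four significant digits for readability.
        if isinstance(value, float):
            return f"{value:.4g}"
        return esc(value)

    rows = [f"{esc(name)} & {fmt(value)} \\\\" for name, value in params.items()]
    lines = [
        r"\begin{tabular}{lr}",
        r"\hline",
        r"Parameter & Value \\",
        r"\hline",
        *rows,
        r"\hline",
        r"\end{tabular}",
    ]
    return "\n".join(lines)


# Made-up values for illustration; these are not results from the project.
table = params_to_latex({"max_depth": 6, "learning_rate": 0.05})
```

The returned string can be written to a .tex file and pulled into the report with `\input`.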

Outputs

All intermediate and final outputs are stored in the task_data/ directory:

task_data/cleaned_data.parquet: The cleaned raw dataset.
task_data/features.parquet: The final dataset with all engineered features, ready for training.
task_data/models_optuna_tscv/ (Main XGBoost Output):
- best_params.json: Optimal hyperparameters.
- metrics.json: Accuracy, AUC, F1, etc., on validation/test sets.
- feature_importance.csv: Global feature importance scores.
- xgb_model.json: The saved model object.
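
The JSON artifacts listed above can be inspected with a few lines of Python. This is a sketch: `load_json_artifacts` is a hypothetical helper, and the keys inside each file are not documented here, so the code makes no assumption about them.

```python
import json
from pathlib import Path


def load_json_artifacts(model_dir, names=("best_params.json", "metrics.json")):
    """Load the JSON artifacts that exist in model_dir; skip any that are missing."""
    model_dir = Path(model_dir)
    loaded = {}
    for name in names:
        path = model_dir / name
        if path.is_file():
            with path.open(encoding="utf-8") as fh:
                loaded[name] = json.load(fh)
    return loaded


# Returns an empty dict if the directory or the files do not exist yet.
artifacts = load_json_artifacts("task_data/models_optuna_tscv")
```

The same helper can be pointed at the baseline output directories, provided those scripts write files with the same names.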
