This project, part of the Software Engineering for AI course, is an empirical study of the quality of tabular data generated by Large Language Models (LLMs), specifically GPT-4o. We recreated the German Credit dataset using three prompt engineering techniques (0-Shot, 1-Shot, and 2-Shot) and assessed the usability of the generated data across several dimensions:
- Structural Metrics: Evaluating Uniqueness, Readability, Consistency, and Completeness.
- Performance Metrics: Examining the F1-score and accuracy of machine learning models trained on the generated data (a minimal training-and-scoring sketch is given after the findings below).
- Fairness Metrics: Checking whether the generated data perpetuates bias, measured with Equal Opportunity Difference (EOD), Average Odds Difference (AOD), and Statistical Parity Difference (SPD); a computation sketch is also given after the findings below.
Key findings:

- Structural Metrics: 1-Shot prompts produced the highest-quality datasets on the structural metrics, but with a high rate of duplicated records.
- Performance Metrics: Models trained on 1-Shot-generated datasets performed best, reaching an F1-score of up to 0.968 and an accuracy of up to 0.956 on the synthetic data once duplicates were removed.
- Fairness Metrics: The 0-Shot and 2-Shot techniques generally preserved better fairness metrics than the 1-Shot technique, which showed increased bias, especially on demographic attributes such as sex and age.
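As an illustration of the performance evaluation described above, here is a minimal sketch of training a classifier on a generated dataset and scoring it with accuracy and F1. The file name, the `target` column, its 0/1 encoding, and the choice of a random forest are illustrative assumptions, not the exact setup used in the notebooks.

```python
# Minimal sketch: train a classifier on an LLM-generated German Credit CSV and
# score it with accuracy and F1. The file name and the binary 0/1 "target"
# column are assumptions for illustration.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

synthetic = pd.read_csv("datasets/german_credit_1shot.csv")  # hypothetical file name
X = pd.get_dummies(synthetic.drop(columns=["target"]))       # one-hot encode categoricals
y = synthetic["target"]                                      # assumed binary 0/1 label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, pred))
print("F1-score:", f1_score(y_test, pred))
```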
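Similarly, the three fairness metrics can be computed directly from model predictions and a binary protected attribute. The sketch below follows the standard definitions of SPD, EOD, and AOD; the variable names, the toy data, and the choice of privileged group are assumptions for illustration.

```python
# Minimal sketch of the SPD, EOD, and AOD fairness metrics, computed from model
# predictions and a binary protected attribute (e.g. sex).
import numpy as np

def group_rates(y_true, y_pred, mask):
    """True-positive and false-positive rates for the rows selected by mask."""
    yt, yp = y_true[mask], y_pred[mask]
    tpr = np.mean(yp[yt == 1]) if np.any(yt == 1) else np.nan
    fpr = np.mean(yp[yt == 0]) if np.any(yt == 0) else np.nan
    return tpr, fpr

def fairness_metrics(y_true, y_pred, protected, privileged_value):
    y_true, y_pred, protected = map(np.asarray, (y_true, y_pred, protected))
    priv = protected == privileged_value
    unpriv = ~priv

    # Statistical Parity Difference: gap in positive-prediction rates.
    spd = y_pred[unpriv].mean() - y_pred[priv].mean()

    tpr_u, fpr_u = group_rates(y_true, y_pred, unpriv)
    tpr_p, fpr_p = group_rates(y_true, y_pred, priv)

    # Equal Opportunity Difference: gap in true-positive rates.
    eod = tpr_u - tpr_p
    # Average Odds Difference: mean of the FPR and TPR gaps.
    aod = 0.5 * ((fpr_u - fpr_p) + (tpr_u - tpr_p))
    return {"SPD": spd, "EOD": eod, "AOD": aod}

# Toy example (1 = favourable outcome; "male" treated as the privileged group here).
print(fairness_metrics(
    y_true=[1, 0, 1, 1, 0, 0],
    y_pred=[1, 0, 1, 1, 1, 0],
    protected=["male", "male", "male", "female", "female", "female"],
    privileged_value="male",
))
```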
The repository is organized as follows:

- `datasets/`: Original and generated datasets.
- `notebooks/`: Jupyter notebooks for data generation, analysis, and visualization.
- `documents/`: Full report and presentation PDF detailing the methodology and results.
The Jupyter notebooks contain detailed instructions that guide you through generating the data, training the models, and evaluating them across the different metrics.
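As a rough illustration of the data-generation step, the sketch below shows how a 1-Shot prompt to GPT-4o might be issued with the OpenAI Python client. The prompt wording, the example record, and the output format are assumptions and do not reproduce the repository's exact prompts.

```python
# Minimal sketch of a 1-Shot prompt for generating German Credit-style rows with
# GPT-4o via the OpenAI Python client. Prompt text and row format are illustrative.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

example_row = (
    "checking_status=<0, duration=6, credit_history=critical, purpose=radio/tv, "
    "credit_amount=1169, age=67, sex=male, class=good"
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "You generate realistic tabular records for the German Credit dataset."},
        {"role": "user",
         "content": (
             "Here is one example record:\n"
             f"{example_row}\n\n"
             "Generate 10 new records in the same format, one per line."
         )},
    ],
)

print(response.choices[0].message.content)
```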