Group members: Yuxin Xiao, Grace Guo, Nancy Zha, Kailin Xu
Before running the notebooks, ensure you have the following:
Python 3.8 or later
The following Python packages installed:
- os
- openai, specifically AzureOpenAI
- numpy
- pandas
- matplotlib
- seaborn
- scikit-learn
- scipy
- statsmodels
You can install the necessary Python packages using pip:
pip install openai numpy pandas matplotlib seaborn scikit-learn scipy statsmodels jupyter
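After installing, a quick check that the packages can be imported may save a failed notebook run. The snippet below is a minimal sketch and not part of the repository; the helper name find_missing is made up for illustration. Note that scikit-learn is imported as sklearn, and os ships with Python, so it needs no installation.

```python
import importlib.util

# Import names for the third-party packages listed above.
# scikit-learn is imported as "sklearn", not under its pip name.
REQUIRED_MODULES = [
    "openai", "numpy", "pandas", "matplotlib",
    "seaborn", "sklearn", "scipy", "statsmodels",
]


def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```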
File: Query.py
Description:
A Python script that contains functions to query processed data for specific analytical needs. This script is essential for extracting subsets of data based on particular criteria, such as disease type, demographic information, or prediction metrics.
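As a rough illustration of this kind of criteria-based subsetting, here is a minimal pandas sketch. It does not reproduce the actual functions in Query.py; the function name query_rows, the column names (disease, gender, predicted_pct), and the values are all hypothetical.

```python
import pandas as pd


def query_rows(df, **criteria):
    """Return rows where every named column matches the requested value.

    A list, tuple, or set matches any of its elements; any other value
    must match exactly.
    """
    mask = pd.Series(True, index=df.index)
    for column, value in criteria.items():
        if isinstance(value, (list, tuple, set)):
            mask &= df[column].isin(list(value))
        else:
            mask &= df[column] == value
    return df[mask]


# Toy data with hypothetical column names and made-up values.
example = pd.DataFrame({
    "disease": ["asthma", "asthma", "lupus", "lupus"],
    "gender": ["female", "male", "female", "male"],
    "predicted_pct": [55.0, 45.0, 80.0, 20.0],
})

subset = query_rows(example, disease="lupus", gender="female")
print(subset)
```

Passing criteria as keyword arguments keeps each call readable, and a new filter column needs no change to the function.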
File: Parsing.ipynb
Description:
This notebook is used for parsing raw data files and preparing them for analysis. It includes steps for data cleaning, initial preprocessing, and formatting to ensure compatibility with analysis tools used in viz.ipynb.
File: viz.ipynb
Data:
- data/Chinese.csv: GPT-4 result from Chinese prompts
- data/English.csv: GPT-4 result from English prompts
- data/True.csv: processed data of true disease prevalence
- data/final_true_dist.csv: raw data of true disease prevalence in the United States from Zack et al.
Structure:
- Data Loading: load the disease prevalence data from CSV files.
- Data Preprocessing: perform any cleaning or transformation of the data.
- Data Analysis: execute the statistical analysis comparing actual and predicted disease prevalences.
- Visualization: generate plots and visualizations of the results.
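The four stages above can be sketched as follows. This is an illustrative outline only, not the notebook's actual code: the column names (disease, group, pct) and all values are assumptions, and inline CSV text stands in for the files under data/ so the example runs on its own.

```python
import io

import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-ins for data/True.csv and data/English.csv.
TRUE_CSV = """disease,group,pct
asthma,female,60
asthma,male,40
"""
PRED_CSV = """disease,group,pct
asthma,Female ,70
asthma,male,30
"""


def load(source):
    # Data Loading: source may be a file path or a file-like object.
    return pd.read_csv(source)


def preprocess(df):
    # Data Preprocessing: drop incomplete rows and normalize group labels.
    out = df.dropna().copy()
    out["group"] = out["group"].str.strip().str.lower()
    return out


def compare(true_df, pred_df):
    # Data Analysis: align on disease and group, then take predicted minus actual.
    merged = true_df.merge(pred_df, on=["disease", "group"],
                           suffixes=("_true", "_pred"))
    merged["diff"] = merged["pct_pred"] - merged["pct_true"]
    return merged


def plot_diff(merged):
    # Visualization: one bar per disease/group pair.
    fig, ax = plt.subplots()
    labels = merged["disease"] + " / " + merged["group"]
    ax.bar(labels, merged["diff"])
    ax.axhline(0, color="black", linewidth=0.8)
    ax.set_ylabel("predicted - actual (percentage points)")
    return ax


merged = compare(preprocess(load(io.StringIO(TRUE_CSV))),
                 preprocess(load(io.StringIO(PRED_CSV))))
ax = plot_diff(merged)
```

To try it on the real files, pass a path such as data/True.csv to load and adjust the column names to match the actual CSV headers.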
File: stat.ipynb
Description:
This notebook contains the statistical tests and methods used to analyze the data processed in Parsing.ipynb and visualized in viz.ipynb. It provides detailed statistical insights into the biases in disease prevalence predictions across different demographics and languages. The notebook includes hypothesis testing, p-value calculations, and other statistical methods to quantify biases.
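One way such a comparison might be tested is a chi-square goodness-of-fit test of the demographic counts in model output against the true prevalence proportions. The sketch below uses scipy.stats.chisquare with made-up numbers; the choice of test and the helper name prevalence_gof are assumptions, and the notebook's actual methods may differ.

```python
import numpy as np
from scipy.stats import chisquare


def prevalence_gof(observed_counts, true_proportions):
    """Chi-square goodness-of-fit of observed group counts against
    the counts expected under the true prevalence proportions."""
    observed = np.asarray(observed_counts, dtype=float)
    proportions = np.asarray(true_proportions, dtype=float)
    proportions = proportions / proportions.sum()  # guard against rounding
    expected = proportions * observed.sum()        # same total as observed
    return chisquare(f_obs=observed, f_exp=expected)


# Made-up example: 100 generated cases across three demographic groups,
# compared with hypothetical true proportions of 50%, 30%, and 20%.
result = prevalence_gof([70, 20, 10], [0.5, 0.3, 0.2])
print(f"chi2 = {result.statistic:.3f}, p = {result.pvalue:.5f}")
```

A small p-value here would suggest the generated demographic distribution departs from the true one. When several diseases or languages are tested, a multiple-comparison correction is worth considering.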
- Clone the repository to your local machine.
- Navigate to the repository directory in your terminal.
- Start Jupyter Notebook or JupyterLab to run the .ipynb files: jupyter notebook or jupyter lab
- Run Python scripts directly in your terminal:
python Query.py
Travis Zack, Eric Lehman, Mirac Suzgun, Jorge A Rodriguez, Leo Anthony Celi, Judy Gichoya, Dan Jurafsky, Peter Szolovits, David W Bates, Raja-Elie E Abdulnour, et al. Assessing the potential of GPT-4 to perpetuate racial and gender biases in health care: a model evaluation study. The Lancet Digital Health, 6(1):e12–e22, 2024.