ENH Add SciPy tutorial dataset to fairlearn.datasets #1086

Merged
merged 12 commits into from
Jul 12, 2022
19 changes: 19 additions & 0 deletions docs/refs.bib
Original file line number Diff line number Diff line change
@@ -100,3 +100,22 @@ @article{crenshaw1991intersectionality
URL = {https://www.jstor.org/stable/1229039},
eprint = {https://www.jstor.org/stable/1229039}
}

@article{strack2014impact,
author = {Strack, Beata and Deshazo, Jonathan and Gennings, Chris and Olmo Ortiz, Juan Luis and Ventura, Sebastian and Cios, Krzysztof and Clore, John},
year = {2014},
month = {04},
pages = {781670},
title = {Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records},
volume = {2014},
journal = {BioMed Research International},
doi = {10.1155/2014/781670}
}

@misc{strack2014diabetes,
author = {Strack, Beata and Deshazo, Jonathan and Gennings, Chris and Olmo Ortiz, Juan Luis and Ventura, Sebastian and Cios, Krzysztof and Clore, John},
year = {2014},
month = {05},
title = {Diabetes 130-US hospitals for years 1999-2008 Data Set},
URL = {https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008}
}
199 changes: 199 additions & 0 deletions docs/user_guide/datasets/diabetes_hospital_data.rst
@@ -0,0 +1,199 @@
.. _diabetes_hospital_data:

Diabetes 130-Hospitals Dataset
------------------------------

Introduction
^^^^^^^^^^^^

The Diabetes 130-Hospitals Dataset consists of 10 years' worth of clinical care data
from 130 US hospitals and integrated delivery networks :footcite:`strack2014impact`.
Each record represents the hospital admission of a patient diagnosed with diabetes
whose stay lasted between one and fourteen days and during which laboratory tests
were performed and medications were administered. The features describing each
encounter include demographics, diagnoses, diabetic medications, the number of visits
in the year preceding the encounter, and payer information, as well as whether the
patient was readmitted after release and whether the readmission occurred within
30 days of the release.

Strack et al. used the data to investigate the impact of HbA1c measurement
on hospital readmission rates. The data was collected from the Health Facts
database, a national data warehouse of clinical records from hospitals
throughout the United States. After completing their research, Strack et al.
submitted the dataset to the UCI Machine Learning Repository so that it would
be available for later use.

.. _diabetes_hospital_dataset_description:

Dataset Description
^^^^^^^^^^^^^^^^^^^

The original data can be found in the UCI Repository :footcite:`strack2014diabetes`.
This version of the dataset was derived by the Fairlearn team for the SciPy 2021
tutorial "Fairness in AI Systems: From social context to practice using Fairlearn".
In this version, the target variable "readmitted" is binarized into whether the
patient was readmitted within thirty days. The full dataset pre-processing script
can be found on `GitHub <https://github.com/fairlearn/talks/blob/main/2021_scipy_tutorial/preprocess.py>`_.
The dataset contains 101,766 rows. Each row describes one hospital encounter and contains 25
features, which we describe below:

.. list-table::
   :header-rows: 1
   :widths: 7 30
   :stub-columns: 1

   * - Column name
     - Description

   * - race
     - Race of the patient:

       - African American
       - Asian
       - Caucasian
       - Hispanic
       - Other
       - Unknown

   * - gender
     - Gender of the patient:

       - Female
       - Male
       - Unknown/Invalid

   * - age
     - Age of the patient:

       - 30 years or younger
       - 30-60 years
       - Over 60 years

   * - discharge_disposition_id
     - The place the patient was discharged to:

       - Discharged to Home
       - Other

   * - admission_source_id
     - Means of admission into the hospital:

       - Emergency
       - Other
       - Referral

   * - time_in_hospital
     - Integer number of days between admission and discharge.

   * - medical_specialty
     - Specialty of the admitting physician:

       - Cardiology
       - Emergency/Trauma
       - Family/GeneralPractice
       - InternalMedicine
       - Missing
       - Other

   * - num_lab_procedures
     - Integer number of lab tests performed during the encounter.

   * - num_procedures
     - Integer number of procedures (other than lab tests) performed during the
       encounter.

   * - num_medications
     - Integer number of distinct medications (counted by generic name)
       administered during the encounter.

   * - primary_diagnosis
     - The primary (first) diagnosis:

       - Diabetes
       - Genitourinary Issues
       - Musculoskeletal Issues
       - Respiratory Issues
       - Other

   * - number_diagnoses
     - Integer number of diagnoses.

   * - max_glu_serum
     - Range of the glucose serum test result in mg/dL, or an indication that
       the test was not taken:

       - >200
       - >300
       - Norm (indicating normal)
       - None (test not taken)

   * - A1Cresult
     - Range of the A1c test result in percent, or an indication that the test
       was not taken:

       - >7 (greater than 7%, but less than 8%)
       - >8 (greater than 8%)
       - Norm (indicating normal, which is less than 7%)
       - None (test not taken)

   * - insulin
     - Indicates whether the drug was prescribed or there was a change in the dosage:

       - Down
       - Steady
       - Up
       - No

   * - change
     - Indicates whether there was a change in diabetic medications:

       - Ch (change)
       - No (no change)

   * - diabetesMed
     - Binary attribute indicating whether any diabetic medication was
       prescribed.

   * - medicare
     - Binary attribute indicating whether the patient had Medicare as insurance.

   * - medicaid
     - Binary attribute indicating whether the patient had Medicaid as insurance.

   * - had_emergency
     - Binary attribute indicating whether the patient had an emergency in the prior
       year.

   * - had_inpatient_days
     - Binary attribute indicating whether the patient had inpatient days in the prior
       year.

   * - had_outpatient_days
     - Binary attribute indicating whether the patient had outpatient days in the
       prior year.

   * - readmitted
     - Attribute indicating whether the patient was readmitted and when. Can also be
       used as a target variable:

       - <30 (readmitted in less than 30 days)
       - >30 (readmitted in more than 30 days)
       - NO (not readmitted)

   * - readmit_binary
     - Binary attribute indicating whether the patient was readmitted. Can also be
       used as a target variable.
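The categorical levels listed in the table can be written down as a small lookup, so that records are checked before any modelling. What follows is a minimal sketch under stated assumptions: ``EXPECTED_LEVELS`` and ``find_invalid_values`` are hypothetical names that are not part of Fairlearn, only a few columns are covered, and the level strings are copied from the table, so the exact spelling in the published data may differ.

```python
# Hypothetical helper, not part of Fairlearn.
# Levels are copied from the table above; the strings in the published
# data may be spelled differently, so treat this as an illustration.
EXPECTED_LEVELS = {
    "race": {"African American", "Asian", "Caucasian", "Hispanic", "Other", "Unknown"},
    "gender": {"Female", "Male", "Unknown/Invalid"},
    "age": {"30 years or younger", "30-60 years", "Over 60 years"},
    "readmitted": {"<30", ">30", "NO"},
}


def find_invalid_values(record):
    """Return {column: value} for each documented column whose value is not a listed level."""
    return {
        column: record[column]
        for column, levels in EXPECTED_LEVELS.items()
        if column in record and record[column] not in levels
    }


record = {"race": "Asian", "gender": "Male", "age": "30-60 years", "readmitted": "maybe"}
print(find_invalid_values(record))  # {'readmitted': 'maybe'}
```

Columns that are absent from a record are skipped rather than reported, which keeps the check usable on partial rows.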


The default target label is given by ``readmit_30_days``. However, the ``readmitted`` or
``readmit_binary`` attributes can also be used as a target, depending on what you
are interested in.

.. list-table::
   :header-rows: 1
   :widths: 7 30
   :stub-columns: 1

   * - Column name
     - Description

   * - readmit_30_days
     - Binary attribute indicating whether the patient was readmitted within 30 days.
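The relationship between the three readmission columns can be sketched in a few lines. This is an illustration only, not a copy of the preprocessing script linked earlier: the helper names are hypothetical, and encoding the binary columns as 0/1 integers is an assumption.

```python
# Hypothetical helpers showing how the two binary targets relate to the
# three-level "readmitted" column. The 0/1 encoding is an assumption.

def readmit_30_days(readmitted):
    """Return 1 if the patient was readmitted within 30 days ("<30"), else 0."""
    return int(readmitted == "<30")


def readmit_binary(readmitted):
    """Return 1 if the patient was readmitted at all ("<30" or ">30"), else 0."""
    return int(readmitted in ("<30", ">30"))


for value in ("<30", ">30", "NO"):
    print(value, readmit_30_days(value), readmit_binary(value))
```

A readmission after more than 30 days counts as positive for ``readmit_binary`` but as negative for ``readmit_30_days``, which is why the choice of target changes the base rate of the positive class.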


.. _using_diabetes_hospital_dataset:

Using the dataset
^^^^^^^^^^^^^^^^^
The dataset can be loaded via the :func:`fairlearn.datasets.fetch_diabetes_hospital`
function. By default, the dataset is returned as a :class:`pandas.DataFrame`, because
it contains string attributes that cannot be represented in a numeric array.
Passing :code:`as_frame=False` therefore raises a :code:`ValueError`.
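A minimal usage sketch follows, assuming Fairlearn 0.8.0 or later is installed and that the machine can reach OpenML on the first call. The wrapper names ``load_diabetes_hospital`` and ``array_output_is_rejected`` are hypothetical; only ``fetch_diabetes_hospital`` and its ``as_frame`` and ``return_X_y`` parameters come from this pull request.

```python
def load_diabetes_hospital():
    """Return (X, y) as pandas objects; y is the default readmit_30_days target."""
    # Imported lazily so that defining this helper does not require fairlearn.
    from fairlearn.datasets import fetch_diabetes_hospital

    # The first call downloads the data; later calls reuse the local cache.
    X, y = fetch_diabetes_hospital(as_frame=True, return_X_y=True)
    return X, y


def array_output_is_rejected():
    """Return True if as_frame=False raises ValueError, as the text above states."""
    from fairlearn.datasets import fetch_diabetes_hospital

    try:
        fetch_diabetes_hospital(as_frame=False)
    except ValueError:
        return True
    return False
```

Neither helper runs at definition time, so nothing is downloaded until one of them is called explicitly.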

.. topic:: References:

   .. footbibliography::
1 change: 1 addition & 0 deletions docs/user_guide/datasets/index.rst
@@ -10,3 +10,4 @@ In this section, we dive deeper into various datasets that have fairness-related

adult_data
boston_housing_data
diabetes_hospital_data
4 changes: 3 additions & 1 deletion fairlearn/datasets/__init__.py
@@ -7,12 +7,14 @@

from ._fetch_acs_income import fetch_acs_income
from ._fetch_adult import fetch_adult
from ._fetch_boston import fetch_boston
from ._fetch_bank_marketing import fetch_bank_marketing
from ._fetch_diabetes_hospital import fetch_diabetes_hospital

__all__ = [
    "fetch_acs_income",
    "fetch_adult",
    "fetch_boston",
    "fetch_bank_marketing",
    "fetch_diabetes_hospital",
]
106 changes: 106 additions & 0 deletions fairlearn/datasets/_fetch_diabetes_hospital.py
@@ -0,0 +1,106 @@
# Copyright (c) Microsoft Corporation and Fairlearn contributors.
# Licensed under the MIT License.

import pathlib

from sklearn.datasets import fetch_openml
from ._constants import _DOWNLOAD_DIRECTORY_NAME


def fetch_diabetes_hospital(
    *, cache=True, data_home=None, as_frame=True, return_X_y=False
):
    """Load the preprocessed Diabetes 130-Hospitals dataset (binary classification).

    Download it if necessary.

    ==============   ============================
    Samples total                          101766
    Dimensionality                             25
    Features         numeric, categorical, string
    Classes                                     2
    ==============   ============================

    Source: UCI Repository :footcite:`strack2014diabetes`
    Paper: Strack et al., 2014 :footcite:`strack2014impact`

    The "Diabetes 130-Hospitals" dataset represents 10 years of clinical care at 130
    U.S. hospitals and delivery networks, collected from 1999 to 2008. Each record
    represents the hospital admission record for a patient diagnosed with diabetes
    whose stay lasted between one and fourteen days.

    The original "Diabetes 130-Hospitals" dataset was collected by Beata Strack,
    Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura,
    Krzysztof J. Cios, and John N. Clore in 2014.

    This version of the dataset was derived by the Fairlearn team for the SciPy
    2021 tutorial "Fairness in AI Systems: From social context to practice using
    Fairlearn". In this version, the target variable "readmitted" is binarized
    into whether the patient was readmitted within thirty days.

    Read more in the :ref:`User Guide <diabetes_hospital_data>`.

    .. versionadded:: 0.8.0

    Parameters
    ----------
    cache : bool, default=True
        Whether to cache downloaded datasets using joblib.

    data_home : str, default=None
        Specify another download and cache folder for the datasets.
        By default, all fairlearn data is stored in '~/.fairlearn-data'
        subfolders.

    as_frame : bool, default=True
        If True, the data is a pandas DataFrame including columns with
        appropriate dtypes (numeric, string or categorical). The target is
        a pandas DataFrame or Series depending on the number of target_columns.
        The Bunch will contain a ``frame`` attribute with the target and the
        data. If ``return_X_y`` is True, then ``(data, target)`` will be pandas
        DataFrames or Series as described above. If False, a ``ValueError`` is
        raised because string attributes are not supported for array
        representation.

    return_X_y : bool, default=False
        If True, returns ``(data.data, data.target)`` instead of a Bunch
        object.

    Returns
    -------
    dataset : :obj:`~sklearn.utils.Bunch`
        Dictionary-like object, with the following attributes.

        data : ndarray, shape (101766, 25)
            Each row corresponding to the 25 feature values in order.
            If ``as_frame`` is True, ``data`` is a pandas object.
        target : numpy array of shape (101766,)
            Each value represents whether readmission of the patient
            occurred within 30 days of the release. If ``as_frame``
            is True, ``target`` is a pandas object.
        feature_names : list of length 25
            Array of ordered feature names used in the dataset.
        DESCR : string
            Description of the Diabetes 130-Hospitals dataset.

    (data, target) : tuple of (numpy.ndarray, numpy.ndarray)
        if ``return_X_y`` is True and ``as_frame`` is False

    (data, target) : tuple of (pandas.DataFrame, pandas.Series)
        if ``return_X_y`` is True and ``as_frame`` is True

    References
    ----------

    .. footbibliography::

    """
    # Fall back to the default fairlearn cache folder in the user's home directory.
    if not data_home:
        data_home = pathlib.Path.home() / _DOWNLOAD_DIRECTORY_NAME

    # The preprocessed dataset is hosted on OpenML under this data id.
    return fetch_openml(
        data_id=43874,
        data_home=data_home,
        cache=cache,
        as_frame=as_frame,
        return_X_y=return_X_y,
    )
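The fallback for ``data_home`` can be looked at in isolation. The sketch below is an illustration under assumptions, not the library's code: ``resolve_data_home`` is a hypothetical helper, and the value ``".fairlearn-data"`` for the private ``_DOWNLOAD_DIRECTORY_NAME`` constant is inferred from the docstring's mention of ``~/.fairlearn-data``. Unlike the function above, the helper also converts an explicit path to a ``pathlib.Path``.

```python
import pathlib

# Assumed value of fairlearn's private _DOWNLOAD_DIRECTORY_NAME constant,
# inferred from the '~/.fairlearn-data' path mentioned in the docstring.
DOWNLOAD_DIRECTORY_NAME = ".fairlearn-data"


def resolve_data_home(data_home=None):
    """Return the cache folder: the given path, or the default under the home directory."""
    # Any falsy value (None or an empty string) selects the default folder.
    if not data_home:
        return pathlib.Path.home() / DOWNLOAD_DIRECTORY_NAME
    return pathlib.Path(data_home)


print(resolve_data_home())
print(resolve_data_home("/tmp/my-cache"))
```

Because the check is ``not data_home`` rather than ``data_home is None``, an empty string is treated the same as leaving the argument out.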