# Diabetes Dataset

Downloaded from [hastie.su.domains/Papers/LARS](https://hastie.su.domains/Papers/LARS/).

This is the original unscaled dataset from [Least Angle Regression](https://hastie.su.domains/Papers/LARS/LeastAngle_2002.pdf) (Efron et al., 2003) with the feature names from [Scikit-learn](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset).

**Description**: Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of $n = 442$ diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline.

**Features**:
  - `age` - age in years
  - `sex` - sex
  - `bmi` - body mass index
  - `bp` - average blood pressure
  - `tc` - total serum cholesterol
  - `ldl` - low-density lipoproteins
  - `hdl` - high-density lipoproteins
  - `tch` - total cholesterol / HDL ratio
  - `ltg` - log of serum triglycerides level
  - `glu` - blood sugar level

**Target**: `target` (column 11) - a quantitative measure of disease progression one year after baseline.

In [1]:
import numpy as np
import pandas as pd

from sklearn.utils import Bunch

In [2]:
# url = "https://hastie.su.domains/Papers/LARS/diabetes.data"
url = "https://lab.aef.me/files/data/diabetes.csv"

# sep = "\t"
sep = ","

diabetes_df = pd.read_csv(url, sep=sep)
diabetes_df.columns = [
    "age",
    "sex",
    "bmi",
    "bp",
    "tc",
    "ldl",
    "hdl",
    "tch",
    "ltg",
    "glu",
    "target",
]
diabetes_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 442 entries, 0 to 441
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     442 non-null    int64  
 1   sex     442 non-null    int64  
 2   bmi     442 non-null    float64
 3   bp      442 non-null    float64
 4   tc      442 non-null    int64  
 5   ldl     442 non-null    float64
 6   hdl     442 non-null    float64
 7   tch     442 non-null    float64
 8   ltg     442 non-null    float64
 9   glu     442 non-null    int64  
 10  target  442 non-null    int64  
dtypes: float64(6), int64(5)
memory usage: 38.1 KB


In [3]:
# write to diabetes.csv
diabetes_df.to_csv("diabetes.csv", index=False)
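The `index=False` argument matters when the file is read back later. As a minimal sketch (using an in-memory buffer and two illustrative rows rather than the real file; `demo_df` and the buffer names are hypothetical), writing with the default index adds an extra `Unnamed: 0` column on reload:

```python
import io

import pandas as pd

# Two illustrative rows with values copied from this notebook's data.
demo_df = pd.DataFrame({"age": [59, 48], "bmi": [32.1, 21.6], "target": [151, 75]})

# Write without the index, then read back from the in-memory buffer.
buffer = io.StringIO()
demo_df.to_csv(buffer, index=False)
buffer.seek(0)
restored_df = pd.read_csv(buffer)

# Write with the index (the pandas default) for comparison.
buffer_with_index = io.StringIO()
demo_df.to_csv(buffer_with_index)
buffer_with_index.seek(0)
indexed_df = pd.read_csv(buffer_with_index)

print(restored_df.columns.tolist())  # ['age', 'bmi', 'target']
print(indexed_df.columns.tolist())   # ['Unnamed: 0', 'age', 'bmi', 'target']
```

If a file was already written with the index, passing `index_col=0` to `pd.read_csv` is one way to absorb that column on load.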

In [4]:
# like `load_diabetes`
diabetes = Bunch(
    data=diabetes_df.drop("target", axis=1).values,
    target=diabetes_df["target"].values,
    feature_names=diabetes_df.columns[:-1].tolist(),
)
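Unlike this unscaled copy, the Scikit-learn documentation describes its default features as mean centered and scaled by the standard deviation times the square root of the number of samples, so that each column's sum of squares is 1. The following is a rough sketch of that transformation on a small slice of the data; `scale_like_sklearn` is a hypothetical helper name, and it has not been checked here against the values `load_diabetes` actually returns.

```python
import numpy as np

# First five rows of three features (age, bmi, bp), copied from this notebook's data.
X = np.array(
    [
        [59.0, 32.1, 101.0],
        [48.0, 21.6, 87.0],
        [72.0, 30.5, 93.0],
        [24.0, 25.3, 84.0],
        [50.0, 23.0, 101.0],
    ]
)


def scale_like_sklearn(X):
    """Mean-center each column and scale it to a unit sum of squares."""
    centered = X - X.mean(axis=0)
    # Population std (ddof=0) times sqrt(n) equals the centered column's Euclidean norm.
    scale = centered.std(axis=0) * np.sqrt(X.shape[0])
    return centered / scale


X_scaled = scale_like_sklearn(X)
print(np.allclose(X_scaled.mean(axis=0), 0.0))       # True
print(np.allclose((X_scaled**2).sum(axis=0), 1.0))   # True
```

On the full dataset the same function could be applied to `diabetes.data`; the scaling constants then depend on all 442 rows, so the numbers will differ from this five-row slice.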

In [5]:
# convert back to dataframe (ints will be floats now)
diabetes_df = pd.DataFrame(
    np.c_[diabetes.data, diabetes.target],
    columns=diabetes.feature_names + ["target"],
)
diabetes_df.head()

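|   | age | sex | bmi | bp | tc | ldl | hdl | tch | ltg | glu | target |
|---|-----|-----|-----|----|----|-----|-----|-----|-----|-----|--------|
| 0 | 59.0 | 2.0 | 32.1 | 101.0 | 157.0 | 93.2 | 38.0 | 4.0 | 4.8598 | 87.0 | 151.0 |
| 1 | 48.0 | 1.0 | 21.6 | 87.0 | 183.0 | 103.2 | 70.0 | 3.0 | 3.8918 | 69.0 | 75.0 |
| 2 | 72.0 | 2.0 | 30.5 | 93.0 | 156.0 | 93.6 | 41.0 | 4.0 | 4.6728 | 85.0 | 141.0 |
| 3 | 24.0 | 1.0 | 25.3 | 84.0 | 198.0 | 131.4 | 40.0 | 5.0 | 4.8903 | 89.0 | 206.0 |
| 4 | 50.0 | 1.0 | 23.0 | 101.0 | 192.0 | 125.4 | 52.0 | 4.0 | 4.2905 | 80.0 | 135.0 |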
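Because stacking the features and target into one NumPy array gives every column a single float dtype, the integer columns reported by `info()` earlier are lost. One possible way to restore them, sketched here on two rows copied from the output above (`float_df` and `int_columns` are illustrative names, not part of the original notebook), is `DataFrame.astype` with a per-column mapping:

```python
import pandas as pd

# Two rows as they look after the round trip through NumPy (all floats).
float_df = pd.DataFrame(
    [
        [59.0, 2.0, 32.1, 101.0, 157.0, 93.2, 38.0, 4.0, 4.8598, 87.0, 151.0],
        [48.0, 1.0, 21.6, 87.0, 183.0, 103.2, 70.0, 3.0, 3.8918, 69.0, 75.0],
    ],
    columns=["age", "sex", "bmi", "bp", "tc", "ldl", "hdl", "tch", "ltg", "glu", "target"],
)

# Columns that `diabetes_df.info()` listed as int64 before the conversion.
int_columns = ["age", "sex", "tc", "glu", "target"]

# Cast only those columns back to int64; the remaining columns stay float64.
restored_df = float_df.astype({name: "int64" for name in int_columns})
print(restored_df.dtypes)
```

This cast assumes the listed columns hold whole numbers with no missing values, which matches the `info()` output shown earlier.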
