`sklearn` dataset API

The API has three main parts:
+ Loaders (`load_*`) load small toy datasets that come bundled with `sklearn`.
+ Fetchers (`fetch_*`) download larger datasets from the internet and load them into memory. These datasets are not bundled with `sklearn`, so they have to be downloaded first.
+ Generators (`make_*`) generate controlled synthetic datasets.
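
The three naming conventions can be checked against the installed package. The following is a small sketch that groups the public names in `sklearn.datasets` by prefix. The helper `names_with_prefix` is a made-up name for illustration, and the exact lists depend on the installed scikit-learn version.

```python
import sklearn.datasets as datasets


def names_with_prefix(prefix):
    """Return the names in sklearn.datasets that start with `prefix` (illustrative helper)."""
    return sorted(name for name in dir(datasets) if name.startswith(prefix))


loaders = names_with_prefix("load_")
fetchers = names_with_prefix("fetch_")
generators = names_with_prefix("make_")

print("Loaders:", loaders)
print("Fetchers:", fetchers)
print("Generators:", generators)
```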

### Loaders

#### Loading Iris Dataset

The iris dataset is a classic dataset used for multi-class classification. It has 150 samples, each with 4 features (sepal length, sepal width, petal length and petal width). There are 3 classes, Iris-Setosa, Iris-Versicolour and Iris-Virginica, labelled 0, 1 and 2 respectively.

In [1]:
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)

# X is our feature matrix
# y is the label vector

print(X.shape, y.shape)

(150, 4) (150,)


In [2]:
# printing out the first 5 samples and the first 5 labels
X[:5], y[:5]

(array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2]]),
 array([0, 0, 0, 0, 0]))

In [3]:
# get a documentation of the iris dataset
?load_iris

Signature: load_iris(*, return_X_y=False, as_frame=False)
Docstring:
Load and return the iris dataset (classification).

The iris dataset is a classic and very easy multi-class classification
dataset.

Classes                          3
Samples per class               50
Samples total                  150
Dimensionality                   4
Features            real, positive

Read more in the :ref:`User Guide <iris_dataset>`.

Parameters
----------
return_X_y : bool, default=False
    If True, returns ``(data, target)`` instead of a Bunch object. See
    below for more information about the `data` and `target` object.

    .. versionadded:: 0.18

as_frame : bool, default=False
    If True, the data is a pandas DataFrame including columns with
    appropriate dtypes (numeric). The target is
    a pandas DataFrame or Ser

Both fetchers and loaders return a `Bunch` object (unless `return_X_y=True` is passed), which has the following attributes:
+ `data`: the feature matrix
+ `target`: the label vector
+ `feature_names`: the names of the features
+ `target_names`: the names of the classes
+ `DESCR`: the full description of the dataset

In [4]:
print(f"The features are {load_iris().feature_names}.") 
print(f"The classes are: {load_iris().target_names}")

The features are ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)'].
The classes are: ['setosa' 'versicolor' 'virginica']
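
Calling `load_iris()` once and keeping the returned `Bunch` avoids reloading the data for every attribute. Here is a minimal sketch; the variable names `iris` and `first_labels` are only illustrative. A `Bunch` behaves like a dictionary whose keys are also available as attributes.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Keys can be read as attributes or with dictionary-style indexing.
print(list(iris.keys()))
print(iris.data.shape, iris["target"].shape)

# Map the integer labels of the first few samples back to class names.
first_labels = iris.target[:5]
print(iris.target_names[first_labels])
```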


In [5]:
# see a description of the iris dataset
print(load_iris().DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :

#### Loading the diabetes dataset

The diabetes dataset is a simple dataset used for regression. It has 442 total samples with 10 features.

In [6]:
from sklearn.datasets import load_diabetes

In [7]:
# get the documentation of the dataset
?load_diabetes

Signature: load_diabetes(*, return_X_y=False, as_frame=False, scaled=True)
Docstring:
Load and return the diabetes dataset (regression).

Samples total    442
Dimensionality   10
Features         real, -.2 < x < .2
Targets          integer 25 - 346

.. note::
   The meaning of each feature (i.e. `feature_names`) might be unclear
   (especially for `ltg`) as the documentation of the original dataset is
   not explicit. We provide information that seems correct in regard with
   the scientific literature in this field of research.

Read more in the :ref:`User Guide <diabetes_dataset>`.

Parameters
----------
return_X_y : bool, default=False
    If True, returns ``(data, target)`` instead of a Bunch object.
    See below for more information about the `data` and `target` object.



In [8]:
X, y = load_diabetes(return_X_y=True)
X.shape, y.shape

((442, 10), (442,))

In [9]:
# view the description of the diabetes dataset
print(load_diabetes().DESCR)

.. _diabetes_dataset:

Diabetes dataset
----------------

Ten baseline variables, age, sex, body mass index, average blood
pressure, and six blood serum measurements were obtained for each of n =
442 diabetes patients, as well as the response of interest, a
quantitative measure of disease progression one year after baseline.

**Data Set Characteristics:**

  :Number of Instances: 442

  :Number of Attributes: First 10 columns are numeric predictive values

  :Target: Column 11 is a quantitative measure of disease progression one year after baseline

  :Attribute Information:
      - age     age in years
      - sex
      - bmi     body mass index
      - bp      average blood pressure
      - s1      tc, total serum cholesterol
      - s2      ldl, low-density lipoproteins
      - s3      hdl, high-density lipoproteins
      - s4      tch, total cholesterol / HDL
      - s5      ltg, possibly log of serum triglycerides level
      - s6      glu, blood sugar level

Note: Each of these 1

In [10]:
import pandas as pd

# loading the diabetes dataset into a dataframe
df = pd.DataFrame(X, columns=load_diabetes().feature_names)
df

Unnamed: 0,age,sex,bmi,bp,s1,s2,s3,s4,s5,s6
0,0.038076,0.050680,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204
2,0.085299,0.050680,0.044451,-0.005670,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.025930
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641
...,...,...,...,...,...,...,...,...,...,...
437,0.041708,0.050680,0.019662,0.059744,-0.005697,-0.002566,-0.028674,-0.002592,0.031193,0.007207
438,-0.005515,0.050680,-0.015906,-0.067642,0.049341,0.079165,-0.028674,0.034309,-0.018114,0.044485
439,0.041708,0.050680,-0.015906,0.017293,-0.037344,-0.013840,-0.024993,-0.011080,-0.046883,0.015491
440,-0.045472,-0.044642,0.039062,0.001215,0.016318,0.015283,-0.028674,0.026560,0.044529,-0.025930
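
As an alternative to building the DataFrame by hand, the loader's `as_frame=True` option returns pandas objects directly. This is a minimal sketch, assuming pandas is installed; the variable names are illustrative.

```python
from sklearn.datasets import load_diabetes

diabetes = load_diabetes(as_frame=True)

# `data` is a DataFrame of the 10 features; `target` is a Series.
X_df, y_series = diabetes.data, diabetes.target
print(X_df.shape, y_series.shape)

# `frame` combines the features and the target in one DataFrame.
print(diabetes.frame.columns.tolist())
```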


Similarly, there are loaders for other datasets like `load_digits`, `load_wine`, `load_breast_cancer`, `load_linnerud`.

You can use them in the same way after importing them from `sklearn.datasets`.
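
As a quick sketch of that pattern, the loop below calls three of the other loaders with `return_X_y=True` and records the shapes they return. The names `shapes`, `X_other` and `y_other` are illustrative. `load_linnerud` is left out here because it is a multi-output dataset, so its target is not a 1-D vector.

```python
from sklearn.datasets import load_breast_cancer, load_digits, load_wine

shapes = {}
for loader in (load_digits, load_wine, load_breast_cancer):
    # Every loader accepts return_X_y=True, just like load_iris.
    X_other, y_other = loader(return_X_y=True)
    shapes[loader.__name__] = (X_other.shape, y_other.shape)
    print(f"{loader.__name__}: X {X_other.shape}, y {y_other.shape}")
```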

### Fetchers

#### The California housing dataset

The California housing dataset is a regression dataset. It has 20,640 samples, each with 8 features (MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude), and the target variable is the median house value.
We can access this dataset using the `fetch_california_housing` fetcher.

In [11]:
from sklearn.datasets import fetch_california_housing

In [12]:
?fetch_california_housing

Signature:
fetch_california_housing(
    *,
    data_home=None,
    download_if_missing=True,
    return_X_y=False,
    as_frame=False,
)
Docstring:
Load the California housing dataset (regression).

Samples total             20640
Dimensionality                8
Features                   real
Target           real 0.15 - 5.

Read more in the :ref:`User Guide <california_housing_dataset>`.

Parameters
----------
data_home : str or path-like, default=None
    Specify another download and cache folder for the datasets. By default
    all scikit-learn data is stored in '~/scikit_learn_data' subfolde

In [13]:
print(fetch_california_housing().DESCR)

.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

    :Number of Instances: 20640

    :Number of Attributes: 8 numeric, predictive attributes and the target

    :Attribute Information:
        - MedInc        median income in block group
        - HouseAge      median house age in block group
        - AveRooms      average number of rooms per household
        - AveBedrms     average number of bedrooms per household
        - Population    block group population
        - AveOccup      average number of household members
        - Latitude      block group latitude
        - Longitude     block group longitude

    :Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived

In [14]:
X, y = fetch_california_housing(return_X_y=True, as_frame=True) # the as_frame option returns the data matrix as a dataframe
X.shape, y.shape

((20640, 8), (20640,))

In [15]:
X

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25
...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32


#### fetch_openml

[OpenML](https://openml.org) is a public repository for machine learning data and experiments that allows everybody to upload open datasets.

Import the function and view its documentation:

In [16]:
from sklearn.datasets import fetch_openml
?fetch_openml

Signature:
fetch_openml(
    name: Optional[str] = None,
    *,
    version: Union[str, int] = 'active',
    data_id: Optional[int] = None,
    data_home: Union[str, os.PathLike, NoneType] = None,
    target_column: Union[str, List, NoneType] = 'default-target'

Note that this is an experimental API and is likely to change in future releases.

> We can use `fetch_openml` to fetch the MNIST digit classification dataset:

In [18]:
X, y = fetch_openml('mnist_784', version=1, return_X_y=True)



Similarly, there are other fetchers such as `fetch_20newsgroups` and `fetch_kddcup99`.

### Generators

Generators produce controlled synthetic datasets.
We can make datasets for regression, classification and clustering problems.

For example, to create a dataset for a regression problem, we can use the `make_regression` generator.

In [20]:
from sklearn.datasets import make_regression
?make_regression

Signature:
make_regression(
    n_samples=100,
    n_features=100,
    *,
    n_informative=10,
    n_targets=1,
    bias=0.0,
    effective_rank=None,
    tail_strength=0.5,
    noise=0.0,
    shuffle=True,
    coef=False,
    random_state=None

Let's generate 100 samples with 5 features for a single-target regression problem.

(Don't forget to set the random state so that the results are repeatable.)

In [21]:
X, y = make_regression(n_samples=100, n_features=5, n_targets=1, shuffle=True, random_state=42)

In [23]:
X.shape, y.shape

((100, 5), (100,))

In [28]:
# check the first three elements from X and y
X[:3], y[:3]

(array([[-0.93782504,  0.51504769,  0.51503527,  3.85273149,  0.51378595],
        [ 1.0889506 , -0.71530371,  0.06428002,  0.67959775, -1.07774478],
        [-0.60170661, -1.05771093,  1.85227818,  0.82254491, -0.01349722]]),
 array([271.31612081,   6.2305406 ,  11.86102446]))
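
The effect of `random_state` and of the `coef` option from the signature above can be checked with a short sketch. The variable names are illustrative. The last check relies on the defaults `noise=0.0` and `bias=0.0`, under which the target is an exact linear function of the features.

```python
import numpy as np
from sklearn.datasets import make_regression

# The same random_state gives identical data on every call.
X_a, y_a = make_regression(n_samples=100, n_features=5, random_state=42)
X_b, y_b = make_regression(n_samples=100, n_features=5, random_state=42)
print(np.array_equal(X_a, X_b), np.array_equal(y_a, y_b))

# coef=True also returns the coefficients of the underlying linear model.
X_c, y_c, coef = make_regression(
    n_samples=100, n_features=5, coef=True, random_state=42
)
print(coef.shape)

# With no noise and no bias, y can be recovered exactly as X @ coef.
print(np.allclose(X_c @ coef, y_c))
```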

We can use the `make_classification` generator to generate a classification dataset.

Let's generate a binary classification (2 classes) problem with 10 features and 100 samples.

In [29]:
from sklearn.datasets import make_classification
?make_classification

Signature:
make_classification(
    n_samples=100,
    n_features=20,
    *,
    n_informative=2,
    n_redundant=2,
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=2,
    weights=None,
    flip_y=0.01,
    class_sep=1.0,
    hypercube=

In [32]:
X, y = make_classification(n_samples=100, n_features=10, n_classes=2, n_clusters_per_class=1, random_state=42)
X.shape, y.shape

((100, 10), (100,))
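
The `weights` and `flip_y` parameters from the signature above control the class balance. The sketch below, with illustrative variable names and an arbitrarily chosen 90/10 split, generates an imbalanced version of the same problem and counts the samples per class.

```python
import numpy as np
from sklearn.datasets import make_classification

X_imb, y_imb = make_classification(
    n_samples=100,
    n_features=10,
    n_classes=2,
    n_clusters_per_class=1,
    weights=[0.9, 0.1],  # approximate proportion of samples in each class
    flip_y=0.0,          # disable random label flipping
    random_state=42,
)

# Number of samples in class 0 and class 1.
counts = np.bincount(y_imb)
print(counts)
```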

`make_blobs` lets us generate random data for clustering.

In [33]:
from sklearn.datasets import make_blobs
?make_blobs

Signature:
make_blobs(
    n_samples=100,
    n_features=2,
    *,
    centers=None,
    cluster_std=1.0,
    center_box=(-10.0, 10.0),
    shuffle=True,
    random_state=None,
    return_centers=False,
)
Docstring:
Generate isotropic Gaussian blobs for clustering.

Read more in the :ref:`User Gu

Let's generate a random dataset of 10 samples with 2 features each, and 3 cluster centers for clustering:

In [52]:
X, y = make_blobs(n_samples=10, centers=3, n_features=2, random_state=42)
X.shape, y.shape

((10, 2), (10,))

The cluster membership (i.e. which cluster each data point belongs to) is stored in `y`:

In [53]:
y

array([2, 2, 1, 2, 0, 0, 0, 1, 1, 0])
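
To see where the clusters were placed, the `return_centers` option from the signature above can be used. This is a minimal sketch with illustrative variable names; it also counts how many samples fall into each cluster.

```python
import numpy as np
from sklearn.datasets import make_blobs

X_blobs, y_blobs, centers = make_blobs(
    n_samples=10, centers=3, n_features=2, random_state=42, return_centers=True
)

# One row per cluster center, one column per feature.
print(centers.shape)

# Number of samples assigned to each of the three clusters.
cluster_sizes = np.bincount(y_blobs)
print(cluster_sizes)
```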