# Sklearn dataset loading utilities

The scikit-learn Python library comes with a few small standard datasets that do not require downloading any file from an external website. They ship with the scikit-learn installation and are available through the `sklearn.datasets` package.

### General dataset API

There are three main kinds of dataset interfaces, and which one to use depends on the type of dataset you need.

![General Dataset API](images/general_dataset_api.png)

 * **The dataset loaders** 
 
 They can be used to load small standard datasets, also called _toy datasets_, which are covered in the next sections of this document.

 *  **The dataset fetchers**
 
 They can be used to download and load larger datasets. For further information, refer to the _Real world datasets_ section on the [scikit-learn site](https://scikit-learn.org/stable/datasets/real_world.html).


 * **The dataset generation functions** 
 
 They can be used to generate controlled synthetic datasets. Find more information in the _Generated datasets_ section on the [scikit-learn site](https://scikit-learn.org/stable/datasets/sample_generators.html).



 The datasets also contain a full description in their `DESCR` attribute and some contain `feature_names` and `target_names`. See the dataset descriptions below for details.
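
As a rough illustration of the interfaces listed above, the sketch below calls one loader and one generation function. The fetchers are left out because they download data over the network. The choice of `load_iris` and `make_classification`, and the parameter values, are assumptions made only for this example.

```python
from sklearn.datasets import load_iris, make_classification

# Loader: reads a small dataset bundled with scikit-learn.
iris = load_iris()
print(iris.data.shape)      # shape of the feature matrix
print(iris.feature_names)   # column names, where the dataset provides them
print(iris.DESCR[:200])     # beginning of the full text description

# Generator: builds a synthetic dataset; random_state makes it reproducible.
X_syn, y_syn = make_classification(n_samples=100, n_features=5, random_state=0)
print(X_syn.shape, y_syn.shape)
```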

#### Toy datasets

scikit-learn comes with a few small standard datasets that do not require downloading any file from an external website.

They can be loaded using the following functions:

Dataset | Description
--- | ---
**load_boston**(*[, return_X_y]) | `Load and return the boston house-prices dataset (regression).` Deprecated in scikit-learn 1.0 and removed in 1.2.
**load_iris**(*[, return_X_y, as_frame]) | `Load and return the iris dataset (classification).`
**load_diabetes**(*[, return_X_y, as_frame]) | `Load and return the diabetes dataset (regression).`
**load_digits**(*[, n_class, return_X_y, as_frame]) | `Load and return the digits dataset (classification).`
**load_linnerud**(*[, return_X_y, as_frame]) | `Load and return the physical exercise linnerud dataset.`
**load_wine**(*[, return_X_y, as_frame]) | `Load and return the wine dataset (classification).`
**load_breast_cancer**(*[, return_X_y, as_frame]) | `Load and return the breast cancer wisconsin dataset (classification).`

Keep in mind that these datasets are useful for quickly illustrating the behavior of the various algorithms implemented in scikit-learn, even though they are considered too small to be representative of real-world machine learning tasks.
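
The `return_X_y` and `as_frame` parameters listed in the table change what a loader returns. A minimal sketch, using `load_iris` purely as an example (the variable names are illustrative):

```python
from sklearn.datasets import load_iris

# return_X_y=True skips the Bunch and returns the (data, target) pair directly.
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)

# as_frame=True returns pandas objects; `frame` holds the features plus the target.
iris_frame = load_iris(as_frame=True).frame
print(iris_frame.shape)
print(iris_frame.columns.tolist())
```

`as_frame=True` needs pandas to be installed; the last column of `frame` is the target.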

In [1]:
import pandas as pd
from sklearn.datasets import load_wine

Now that we have imported the `load_wine()` function from the sklearn API, we call it to load the toy dataset and store the result in the variable _wine_.

In [2]:
wine = load_wine()

Next, let's use _shape_ to inspect how many rows and columns it has.

In [3]:
print(f'The wine dataset contains {wine.data.shape[0]} rows and {wine.data.shape[1]} columns')

The wine dataset contains 178 rows and 13 columns


Looking at the dataset's data type, we notice that a _sklearn.utils.Bunch_ object is returned. For more information about this Bunch object, go to this [link](https://scikit-learn.org/stable/modules/generated/sklearn.utils.Bunch.html#sklearn.utils.Bunch).

In [4]:
print(type(wine))

<class 'sklearn.utils.Bunch'>
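
A Bunch behaves like a dictionary whose keys are also exposed as attributes. The short sketch below illustrates this; the key `missing_key` is a made-up name used only to show the fallback of `get()`.

```python
from sklearn.datasets import load_wine

wine = load_wine()

# Each entry can be reached as a dictionary key or as an attribute.
same_object = wine["data"] is wine.data
print(same_object)

# Dictionary-style helpers such as `in` and .get() work as well.
print("target" in wine)
print(wine.get("missing_key", "not present"))
```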


So far we know that our wine toy dataset consists of 178 rows and 13 columns, but we have not yet taken a first look at it. For that, let's use the _DESCR_ attribute mentioned earlier. Unlike pandas' _describe()_ function, which computes summary statistics, _DESCR_ holds a written description of the dataset that covers its attributes, its classes, and a table of summary statistics.

In [5]:
print(wine.DESCR)

.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2
		
    :Summary Statistics:
    
                                   Min   Max   Mean     SD
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:                0

By calling the _keys()_ method, we get the list of attributes available on this dataset.

In [6]:
wine.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names'])

The first one is `data`, a NumPy array holding the feature values of the dataset.

In [7]:
wine.data

array([[1.423e+01, 1.710e+00, 2.430e+00, ..., 1.040e+00, 3.920e+00,
        1.065e+03],
       [1.320e+01, 1.780e+00, 2.140e+00, ..., 1.050e+00, 3.400e+00,
        1.050e+03],
       [1.316e+01, 2.360e+00, 2.670e+00, ..., 1.030e+00, 3.170e+00,
        1.185e+03],
       ...,
       [1.327e+01, 4.280e+00, 2.260e+00, ..., 5.900e-01, 1.560e+00,
        8.350e+02],
       [1.317e+01, 2.590e+00, 2.370e+00, ..., 6.000e-01, 1.620e+00,
        8.400e+02],
       [1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00,
        5.600e+02]])

The fourth one, `target_names`, gives us the names of the wine classes, that is, the labels of the target variable.

In [8]:
wine.target_names

array(['class_0', 'class_1', 'class_2'], dtype='<U7')
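
The `target` key holds integer codes that index into `target_names`. A minimal sketch of how the two can be combined (the names `labels` and `class_counts` are illustrative):

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()

# wine.target holds integer codes (0, 1, 2); target_names maps them to labels.
labels = wine.target_names[wine.target]
print(labels[:5])

# Count how many samples fall into each class.
counts = np.bincount(wine.target)
class_counts = {str(name): int(n) for name, n in zip(wine.target_names, counts)}
print(class_counts)
```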

And the last one, `feature_names`, gives the names of all the features in the dataset.

In [13]:
wine.feature_names

['alcohol',
 'malic_acid',
 'ash',
 'alcalinity_of_ash',
 'magnesium',
 'total_phenols',
 'flavanoids',
 'nonflavanoid_phenols',
 'proanthocyanins',
 'color_intensity',
 'hue',
 'od280/od315_of_diluted_wines',
 'proline']

### Converting a Sklearn dataset into a pandas DataFrame

Though the dataset loaded from the sklearn API is ready to be used with machine learning algorithms, it is also useful to convert it into a pandas DataFrame for other purposes.

In [9]:
df_wine = pd.DataFrame(data=wine.data, columns=wine.feature_names)
df_wine.head()

|  | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 |


In [14]:
df_wine.describe()

|  | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 | 178.0 |
| mean | 13.000618 | 2.336348 | 2.366517 | 19.494944 | 99.741573 | 2.295112 | 2.02927 | 0.361854 | 1.590899 | 5.05809 | 0.957449 | 2.611685 | 746.893258 |
| std | 0.811827 | 1.117146 | 0.274344 | 3.339564 | 14.282484 | 0.625851 | 0.998859 | 0.124453 | 0.572359 | 2.318286 | 0.228572 | 0.70999 | 314.907474 |
| min | 11.03 | 0.74 | 1.36 | 10.6 | 70.0 | 0.98 | 0.34 | 0.13 | 0.41 | 1.28 | 0.48 | 1.27 | 278.0 |
| 25% | 12.3625 | 1.6025 | 2.21 | 17.2 | 88.0 | 1.7425 | 1.205 | 0.27 | 1.25 | 3.22 | 0.7825 | 1.9375 | 500.5 |
| 50% | 13.05 | 1.865 | 2.36 | 19.5 | 98.0 | 2.355 | 2.135 | 0.34 | 1.555 | 4.69 | 0.965 | 2.78 | 673.5 |
| 75% | 13.6775 | 3.0825 | 2.5575 | 21.5 | 107.0 | 2.8 | 2.875 | 0.4375 | 1.95 | 6.2 | 1.12 | 3.17 | 985.0 |
| max | 14.83 | 5.8 | 3.23 | 30.0 | 162.0 | 3.88 | 5.08 | 0.66 | 3.58 | 13.0 | 1.71 | 4.0 | 1680.0 |
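
The DataFrame built above contains only the features. One possible way to bring the target along is sketched below; the column names `target` and `target_name` are assumptions chosen for the example, not something the sklearn API prescribes.

```python
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()
df_wine = pd.DataFrame(data=wine.data, columns=wine.feature_names)

# Append the integer target and a readable label column.
df_wine["target"] = wine.target
df_wine["target_name"] = wine.target_names[wine.target]

# Mean of every feature per wine class, as a quick sanity check.
class_means = df_wine.groupby("target_name")[wine.feature_names].mean()
print(class_means[["alcohol", "proline"]])
```

Keeping the label in the same DataFrame makes per-class summaries and plots straightforward during EDA.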


Either form, the Bunch or the DataFrame, can be the starting point for our EDA or data preprocessing for machine learning. In this tutorial we have learned how to use the pre-loaded datasets available in the sklearn API. If you are interested in cloning this notebook, find it in my [GitHub repo](https://github.com/fvgm-spec/ML/blob/main/notebooks/Sklearn%20toy%20datasets.ipynb), or if you need more documentation, find it on the [sklearn site](https://scikit-learn.org/stable/datasets.html).

### To be continued...