# Built-In Datasets

## Introduction

Several Python libraries ship with built-in datasets, which make it easy to load data and try out machine learning techniques. These datasets are small and simple, so they are generally best used for examples or for personal education.

## Scikit Learn

Scikit Learn is one of the most important Python libraries for machine learning. It is an open source library that provides tools for classification, regression, clustering, dimensionality reduction, model selection, and data preprocessing.

To demonstrate these features, the library also provides a series of toy datasets that illustrate how different functions work.

https://scikit-learn.org/stable/index.html

In [None]:
import sklearn
# Print the version of scikit-learn
print("scikit-learn version:", sklearn.__version__)

scikit-learn version: 1.6.1


The datasets in Scikit Learn live in the `datasets` module.

https://scikit-learn.org/stable/api/sklearn.datasets.html

In [3]:
from sklearn import datasets
# Get a list of all functions and classes in the datasets module
module_list = dir(datasets)
# Filter out private members (those starting with an underscore)
public_members = [member for member in module_list if not member.startswith('_')]
# Create a list of the sample datasets (names starting with 'load_' or 'fetch_')
sample_datasets = [member for member in public_members if member.startswith(('load_', 'fetch_'))]
# Create a list of the generator functions (names starting with 'make_')
generator_functions = [member for member in public_members if member.startswith('make_')]

### Sklearn Datasets

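Dataset functions begin with `load_` or `fetch_`. The `load_` functions read small datasets that ship with the library, while the `fetch_` functions download larger datasets on first use and cache them locally. A few names in the list, such as `load_files` and `load_svmlight_file`, are general-purpose loaders for your own files rather than bundled datasets.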

In [4]:
print("Sample datasets:")
for dataset in sample_datasets:
    print(f" - {dataset}")

Sample datasets:
 - fetch_20newsgroups
 - fetch_20newsgroups_vectorized
 - fetch_california_housing
 - fetch_covtype
 - fetch_file
 - fetch_kddcup99
 - fetch_lfw_pairs
 - fetch_lfw_people
 - fetch_olivetti_faces
 - fetch_openml
 - fetch_rcv1
 - fetch_species_distributions
 - load_breast_cancer
 - load_diabetes
 - load_digits
 - load_files
 - load_iris
 - load_linnerud
 - load_sample_image
 - load_sample_images
 - load_svmlight_file
 - load_svmlight_files
 - load_wine
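As a rough illustration of the `load_` functions in the list above, the sketch below calls five of them and prints the shape of each feature matrix. The selection of datasets and the names `toy_loaders` and `shapes` are choices made for this example only.

```python
from sklearn import datasets

# Five toy datasets that are bundled with scikit-learn (no download needed).
toy_loaders = {
    "iris": datasets.load_iris,
    "wine": datasets.load_wine,
    "digits": datasets.load_digits,
    "breast_cancer": datasets.load_breast_cancer,
    "diabetes": datasets.load_diabetes,
}

shapes = {}
for name, loader in toy_loaders.items():
    bunch = loader()
    # The feature matrix has one row per sample and one column per feature.
    shapes[name] = bunch.data.shape
    n_samples, n_features = shapes[name]
    print(f"{name:>14}: {n_samples} samples, {n_features} features")
```

All of these have fewer than 2,000 samples, which is why they are convenient for quick experiments.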


Loading a dataset is as simple as calling the function.

In [5]:
iris_dataset = datasets.load_iris()

The loaded dataset is more than just the raw data. It is a dictionary-like object whose entries hold the data, the targets, and descriptive metadata.

In [6]:
iris_dataset

{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
  

In [8]:
print(iris_dataset['DESCR'])

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica

:Summary Statistics:

                Min  Max   Mean    SD   Class Correlation
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fis
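Printing the whole object is verbose, so it can help to look at the entries one at a time. The following is a minimal sketch of that idea; it reloads the iris data so that it runs on its own.

```python
from sklearn import datasets

iris_dataset = datasets.load_iris()

# The returned object behaves like a dictionary, so keys() lists its entries.
print(sorted(iris_dataset.keys()))

# Feature matrix: one row per flower, one column per measurement.
print(iris_dataset["data"].shape)    # (150, 4)

# Integer class label for each row.
print(iris_dataset["target"].shape)  # (150,)

# Human-readable names for the columns and for the class labels.
print(iris_dataset["feature_names"])
print(list(iris_dataset["target_names"]))
```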

The entries can also be accessed as attributes of the returned object.

In [9]:
print(datasets.load_iris().DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
    - sepal length in cm
    - sepal width in cm
    - petal length in cm
    - petal width in cm
    - class:
            - Iris-Setosa
            - Iris-Versicolour
            - Iris-Virginica

:Summary Statistics:

                Min  Max   Mean    SD   Class Correlation
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fis
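The two access styles shown above, by key and by attribute, should point to the same entries. The sketch below checks this, and also shows the `return_X_y=True` option of `load_iris`, which returns only the feature matrix and the target vector when the metadata is not needed. The variable names `same_data` and `same_descr` are illustrative.

```python
import numpy as np
from sklearn import datasets

iris_dataset = datasets.load_iris()

# Key access and attribute access return the same entry.
same_data = np.array_equal(iris_dataset["data"], iris_dataset.data)
same_descr = iris_dataset["DESCR"] == iris_dataset.DESCR
print(same_data, same_descr)

# return_X_y=True skips the dictionary-like container and returns
# the feature matrix and the target vector directly.
X, y = datasets.load_iris(return_X_y=True)
print(X.shape, y.shape)
```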

### Sklearn Generators

Generators create random synthetic data that follows a specific pattern, such as clusters of points or interleaving shapes.

In [10]:
print("Sample generators:")
for generator in generator_functions:
    print(f" - {generator}")

Sample generators:
 - make_biclusters
 - make_blobs
 - make_checkerboard
 - make_circles
 - make_classification
 - make_friedman1
 - make_friedman2
 - make_friedman3
 - make_gaussian_quantiles
 - make_hastie_10_2
 - make_low_rank_matrix
 - make_moons
 - make_multilabel_classification
 - make_regression
 - make_s_curve
 - make_sparse_coded_signal
 - make_sparse_spd_matrix
 - make_sparse_uncorrelated
 - make_spd_matrix
 - make_swiss_roll
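To make the idea of a generator concrete, here is a minimal sketch using two functions from the list above, `make_blobs` and `make_moons`. The sample counts, the noise level, and the `random_state` value are arbitrary choices for illustration.

```python
import numpy as np
from sklearn import datasets

# make_blobs draws points from Gaussian clusters; here 3 clusters in 2 dimensions.
X_blobs, y_blobs = datasets.make_blobs(
    n_samples=300, centers=3, n_features=2, random_state=42
)

# make_moons produces two interleaving half circles, with optional Gaussian noise.
X_moons, y_moons = datasets.make_moons(n_samples=200, noise=0.1, random_state=42)

print(X_blobs.shape, np.unique(y_blobs))
print(X_moons.shape, np.unique(y_moons))

# Fixing random_state makes the "random" data reproducible across calls.
X_again, _ = datasets.make_blobs(
    n_samples=300, centers=3, n_features=2, random_state=42
)
print(np.array_equal(X_blobs, X_again))
```

Each generator returns the data together with labels or targets, so the result can be passed straight to an estimator.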


## Statsmodels

## Tensorflow

## Seaborn