# Introduction
- Three APIs are used for loading datasets:
    - Loaders (`load_*`) load the small datasets bundled with sklearn
    - Fetchers (`fetch_*`) download larger datasets from an external source
    - Generators (`make_*`) generate synthetic datasets
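
The three families can be told apart by their name prefix. As a rough sketch, the snippet below groups the public names in `sklearn.datasets` by prefix; the variable names (`loaders`, `fetchers`, `generators`) are only illustrative, and the exact lists depend on the installed sklearn version.

```python
import sklearn.datasets as ds

# Collect the public names exposed by sklearn.datasets.
names = [n for n in dir(ds) if not n.startswith("_")]

# Split them by the prefix that identifies each API family.
loaders = [n for n in names if n.startswith("load_")]
fetchers = [n for n in names if n.startswith("fetch_")]
generators = [n for n in names if n.startswith("make_")]

print(len(loaders), "loaders, e.g.", loaders[:3])
print(len(fetchers), "fetchers, e.g.", fetchers[:3])
print(len(generators), "generators, e.g.", generators[:3])
```

Nothing is downloaded here: the fetchers are only listed, not called.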

### Loading Iris Dataset

In [5]:
from sklearn.datasets import load_iris
data = load_iris()

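The returned object exposes the following attributes:

- `data` holds the feature matrix
- `target` holds the label vector
- `feature_names` contains the names of the features
- `target_names` contains the names of the target classes
- `DESCR` holds the full description of the dataset
- `filename` holds the path to the data file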
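
A `Bunch` is a dictionary whose keys can also be read as attributes, so each field listed above is reachable either way. The sketch below shows this, along with the `return_X_y=True` shortcut from the `load_iris` signature. The variable name `iris` is used here only to avoid confusion with the `data` attribute.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# A Bunch behaves like a dict whose keys are also attributes.
print(sorted(iris.keys()))
same_object = iris["data"] is iris.data
print(same_object)

# return_X_y=True skips the Bunch and returns (data, target) directly.
X, y = load_iris(return_X_y=True)
print(X.shape, y.shape)
```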

In [6]:
type(data)

sklearn.utils._bunch.Bunch

In [7]:
data.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [8]:
data.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

In [19]:
data.data[:5]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])

In [20]:
data.target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [21]:
?load_iris

Signature: load_iris(*, return_X_y=False, as_frame=False)
Docstring:
Load and return the iris dataset (classification).

The iris dataset is a classic and very easy multi-class classification
dataset.

Classes                          3
Samples per class               50
Samples total                  150
Dimensionality                   4
Features            real, positive

Read more in the :ref:`User Guide <iris_dataset>`.

.. versionchanged:: 0.20
    Fixed two wrong data points according to Fisher's paper.
    The new version is the same as in R, but not as in the UCI
    Machine Learning Repository.

Parameters
----------
return_X_y : bool, default=False
    If True, returns ``(data, target)`` instead of a Bunch object. See
    below for more information about the `data` and `target` object.

    .. versionadded:

### Loading Diabetes Dataset

In [38]:
from sklearn.datasets import load_diabetes
data = load_diabetes()

In [39]:
?load_diabetes

Signature: load_diabetes(*, return_X_y=False, as_frame=False, scaled=True)
Docstring:
Load and return the diabetes dataset (regression).

Samples total    442
Dimensionality   10
Features         real, -.2 < x < .2
Targets          integer 25 - 346

.. note::
   The meaning of each feature (i.e. `feature_names`) might be unclear
   (especially for `ltg`) as the documentation of the original dataset is
   not explicit. We provide information that seems correct in regard with
   the scientific literature in this field of research.

Read more in the :ref:`User Guide <diabetes_dataset>`.

Parameters
----------
return_X_y : bool, default=False
    If True, returns ``(data, target)`` instead of a Bunch object.
    See below for more information about the `data` and `target` object.



In [40]:
data.data[:5]

array([[ 0.03807591,  0.05068012,  0.06169621,  0.02187239, -0.0442235 ,
        -0.03482076, -0.04340085, -0.00259226,  0.01990749, -0.01764613],
       [-0.00188202, -0.04464164, -0.05147406, -0.02632753, -0.00844872,
        -0.01916334,  0.07441156, -0.03949338, -0.06833155, -0.09220405],
       [ 0.08529891,  0.05068012,  0.04445121, -0.00567042, -0.04559945,
        -0.03419447, -0.03235593, -0.00259226,  0.00286131, -0.02593034],
       [-0.08906294, -0.04464164, -0.01159501, -0.03665608,  0.01219057,
         0.02499059, -0.03603757,  0.03430886,  0.02268774, -0.00936191],
       [ 0.00538306, -0.04464164, -0.03638469,  0.02187239,  0.00393485,
         0.01559614,  0.00814208, -0.00259226, -0.03198764, -0.04664087]])

In [43]:
data.feature_names

['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
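
The small feature values shown above (within -.2 < x < .2) come from the default `scaled=True`. According to the sklearn documentation, each feature column is mean-centered and scaled so that its sum of squares is 1. A minimal check of that property, with illustrative variable names and loose tolerances chosen as an assumption:

```python
import numpy as np
from sklearn.datasets import load_diabetes

# Default scaled=True: features are centered and rescaled.
X, y = load_diabetes(return_X_y=True)
print(X.shape, y.shape)

# Per-column mean and sum of squares.
col_means = X.mean(axis=0)
col_sq_sums = (X ** 2).sum(axis=0)

print(np.allclose(col_means, 0.0, atol=1e-3))
print(np.allclose(col_sq_sums, 1.0, atol=1e-3))
```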

## Generators

In [46]:
from sklearn.datasets import make_regression
?make_regression

Signature:
make_regression(
    n_samples=100,
    n_features=100,
    *,
    n_informative=10,
    n_targets=1,
    bias=0.0,
    effective_rank=None,
    tail_strength=0.5,
    noise=0.0,
    shuffle=True,
    coef=False,
    random_state=None

### Let's generate 100 samples with 5 features for a single-target regression problem

In [47]:
X, y = make_regression(n_samples=100, n_features=5, n_targets=1, shuffle=True, random_state=42)

### Let's generate 100 samples with 5 features and 5 targets (a multi-output regression problem)

In [50]:
X, y = make_regression(n_samples=100, n_features=5, n_targets=5, shuffle=True, random_state=42)

In [52]:
print(y)

[[ 4.65824885e+01  2.69517607e+01  1.58496876e+02 -2.11559654e+02
  -2.30334313e+01]
 [ 1.96636891e+02  1.68667371e+02  2.09939266e+02  1.22354308e+02
   6.57662915e+01]
 [-7.89589416e+01 -1.02121907e+02 -7.68217091e+01 -2.12753804e+02
  -7.61055362e+01]
 [ 3.90579017e+01  6.91487819e+01 -5.11760771e+01  1.80896189e+02
  -1.18306497e+00]
 [ 4.30403126e+01  3.20677888e+01  8.74686678e+01 -6.67831311e+01
   1.46450238e+01]
 [ 1.38523749e+02  1.13294553e+02  6.35192539e+01  7.21840459e+01
   1.56793637e+01]
 [ 8.39515501e+01  5.03399279e+01  9.67464146e+01 -5.51205864e+01
   1.62404632e+01]
 [-3.66778631e+01 -2.09264646e+01  6.15677243e+00  1.39307358e+02
   8.49635147e+01]
 [-1.58276114e+02 -1.33982982e+02 -9.80064496e+01 -1.82645525e+02
  -1.11997092e+02]
 [ 3.61753596e+02  3.20954976e+02  3.48777947e+02  7.13579399e+01
   4.19167346e+01]
 [-1.13528852e+00 -1.83276592e+00  1.69839243e+01  8.21275281e+01
   9.00539764e+01]
 [-5.30818526e+01 -4.68015485e+01 -1.07326177e+02 -1.32771394e+01
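
The two calls above differ only in `n_targets`, which changes the shape of `y`. The sketch below compares the shapes and also passes `coef=True` so that the underlying linear coefficients are returned. With the default `noise=0.0` and `bias=0.0`, the targets should be an exact linear function of the features. The names `X1`, `y1`, `X5`, `y5` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression

# Single target: y is a 1-D vector.
X1, y1 = make_regression(n_samples=100, n_features=5, n_targets=1, random_state=42)

# Five targets: y becomes a 2-D array with one column per target.
X5, y5 = make_regression(n_samples=100, n_features=5, n_targets=5, random_state=42)
print(y1.shape, y5.shape)

# coef=True additionally returns the coefficients of the linear model.
X, y, coef = make_regression(
    n_samples=100, n_features=5, n_targets=5, coef=True, random_state=42
)
print(coef.shape)

# No noise and no bias by default, so y equals X @ coef up to rounding.
print(np.allclose(X @ coef, y))
```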

In [64]:
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, n_features=5, n_classes=2, n_clusters_per_class=1)

In [65]:
print(X)

[[ 1.18907868e+00 -1.00734223e+00 -7.65050933e-01  1.11398935e-01
   9.38839891e-01]
 [-3.24828515e-01  2.25254964e+00  7.71309055e-01 -3.22513016e-01
  -1.58506333e+00]
 [ 9.85808700e-01 -6.87764123e-01 -1.72868026e+00 -1.82084010e-02
   1.30141394e+00]
 [ 1.25795472e+00  5.21100990e-01  1.83931755e+00  5.51755821e-02
  -1.27594875e+00]
 [ 1.41213034e+00  1.30641266e+00  7.48267818e-01 -1.63532646e-01
  -1.08403713e+00]
 [ 5.67857230e-01  6.40545874e-01  1.01833821e+00 -2.92753217e-02
  -8.88157256e-01]
 [ 9.07228479e-01 -1.40585552e+00  3.16019220e-01  2.63597131e-01
   5.52719718e-01]
 [ 2.05320934e-01 -1.39253668e+00  2.38497759e+00  4.23006823e-01
  -5.86821693e-01]
 [-1.64206685e-01  6.87580814e-01  1.30791211e+00 -1.46402090e-02
  -1.07096679e+00]
 [-4.23207575e-01  1.32116453e+00  8.01512490e-01 -1.61878834e-01
  -1.12080147e+00]
 [-3.70736594e-01 -7.39025302e-01 -1.47508894e+00  1.03188077e-02
   1.18904546e+00]
 [-8.94212467e-01 -8.02969656e-01 -1.17276049e+00  4.48097476e-02

In [66]:
print(y)

[1 0 1 0 0 0 1 0 0 0 1 1 0 1 0 0 1 0 1 0 0 1 1 1 0 1 0 1 1 1 1 0 0 0 1 1 1
 1 1 1 0 1 0 0 0 1 1 1 1 1 0 0 0 1 0 1 1 1 0 1 0 0 1 0 1 1 0 0 1 1 0 1 0 1
 0 0 0 0 1 1 0 0 0 1 1 0 0 0 0 0 1 1 1 0 1 1 1 0 0 1]
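
The call above does not set `random_state`, so every run produces different samples. As a small sketch, fixing `random_state` (the value 42 is an arbitrary choice) makes the output reproducible, and `np.bincount` gives the number of samples per class:

```python
import numpy as np
from sklearn.datasets import make_classification

params = dict(n_samples=100, n_features=5, n_classes=2, n_clusters_per_class=1)

# Same seed, same data: two calls return identical arrays.
X, y = make_classification(random_state=42, **params)
X_again, y_again = make_classification(random_state=42, **params)
print(np.array_equal(X, X_again), np.array_equal(y, y_again))

# Shapes and number of samples per class.
print(X.shape, y.shape)
class_counts = np.bincount(y)
print(class_counts)
```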


In [67]:
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, n_features=2, centers=3)
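
To inspect what `make_blobs` produced, the following sketch fixes `random_state` (an arbitrary seed, not part of the original call) and passes `return_centers=True`, which is available in recent sklearn versions, to get the cluster centers back as well:

```python
import numpy as np
from sklearn.datasets import make_blobs

# return_centers=True adds the generated cluster centers to the output.
X, y, centers = make_blobs(
    n_samples=100, n_features=2, centers=3, random_state=42, return_centers=True
)

print(X.shape, y.shape, centers.shape)

# y holds the index of the cluster each sample was drawn from.
cluster_sizes = np.bincount(y)
print(cluster_sizes)
```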