# Loaders, Fetchers and Generators

## Loaders

- Used to load small standard datasets bundled with sklearn
- Returns a `Bunch` object

In [12]:
import sys
print(sys.executable)
!{sys.executable} -m pip install scikit-learn

In [2]:
from sklearn.datasets import load_iris
data = load_iris()
type(data)
# print(data.DESCR)

sklearn.utils.Bunch

This returns a `Bunch` object, `data`, which is a dictionary-like object with the following attributes:

- `data`, which is the feature matrix
- `target`, which is the label vector
- `feature_names`, which contains the names of the features
- `target_names`, which contains the names of the target classes
- `DESCR`, which has the full description of the dataset
- `filename`, which has the path to the location of the data
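
The "dictionary-like" behaviour described above can be checked directly. The following is a minimal sketch, assuming scikit-learn is installed; the variable names `iris` and `same_object` are chosen here for illustration only.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# A Bunch behaves like a dict, so its keys can be listed and indexed
print(sorted(iris.keys()))

# Attribute access and key access point at the same underlying array
same_object = iris["data"] is iris.data
print(same_object)
```

Either access style can be used; attribute access is simply shorter to type in a notebook.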
    

In [14]:
data.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']

In [15]:
data.target_names

array(['setosa', 'versicolor', 'virginica'], dtype='<U10')

#### The feature matrix 

In [97]:
# The actual data 
data.data 

# The first 5
data.data[:5]

# Shape of the data
data.data.shape

(150, 4)

There are 150 examples, each with 4 features.

#### The target vector (label)

In [33]:
# The actual data 
data.target

# Shape of the data
data.target.shape

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

There are 50 examples of each of the three classes: 0, 1 and 2.
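
One way to confirm the per-class counts without reading the printed array is to count the labels with NumPy. This is a small sketch, assuming NumPy is available alongside scikit-learn; `classes` and `counts` are names introduced for this illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# Count how many examples carry each class label
classes, counts = np.unique(iris.target, return_counts=True)

# Map each integer label back to its class name
for label, count in zip(classes, counts):
    print(label, iris.target_names[label], count)
```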

**We can read additional documentation about `load_iris` in the following manner**

In [25]:
?load_iris

<br>

**Alternate way :** 
We can obtain the feature matrix and the target (label) directly from `load_iris`, and from other loaders in general, by setting the `return_X_y` argument to `True`. This returns a tuple instead of a `Bunch` object.

In [5]:
feature_matrix, label_vector = load_iris(return_X_y=True)
print("Shape of the feature matrix : ", feature_matrix.shape)
print("Shape of the label vector : ", label_vector.shape)

Shape of the feature matrix :  (150, 4)
Shape of the label vector :  (150,)
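
As a quick sanity check, the tuple form should carry the same arrays as the `Bunch` form. The sketch below compares the two; the names `bunch`, `same_features` and `same_labels` are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris

bunch = load_iris()
X, y = load_iris(return_X_y=True)

# Both call styles should expose identical feature and label arrays
same_features = np.array_equal(X, bunch.data)
same_labels = np.array_equal(y, bunch.target)
print(same_features, same_labels)
```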


<br>

## Fetchers

- Used to fetch large datasets from the internet and load them into memory
- Returns a `Bunch` object
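
Fetched datasets are cached on disk so that they are downloaded only once. The sketch below uses `get_data_home` from `sklearn.datasets` to resolve a cache folder; the temporary folder and the name `custom_cache` are assumptions made for this illustration and are not part of the original notebook.

```python
import os
import tempfile
from sklearn.datasets import get_data_home

# Hypothetical cache location, used for this illustration only
custom_cache = os.path.join(tempfile.mkdtemp(), "sk_data")

# get_data_home creates the folder if needed and returns its path
cache_dir = get_data_home(data_home=custom_cache)
print(cache_dir)
print(os.path.isdir(cache_dir))
```

Calling `get_data_home()` with no argument returns the default cache folder instead. Fetchers such as `fetch_california_housing` also accept a `data_home` argument if a different location is preferred.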

**Step 1 :** Import the library and access the documentation 

In [35]:
# california_housing is not bundled with sklearn. We are fetching it from the internet

from sklearn.datasets import fetch_california_housing
?fetch_california_housing

**Step 2 :** Load the dataset, obtain the `Bunch` object and examine it

In [46]:
housing_data = fetch_california_housing()
housing_data.DESCR

#### Feature and Target Matrix ( Label Vector )

In [48]:
housing_data.feature_names
housing_data.data
housing_data.data.shape

(20640, 8)

In [51]:
# Target Matrix ( Label Vector )

housing_data.target_names
housing_data.target
housing_data.target.shape

(20640,)

<br>

## Generators

- Used to generate controlled synthetic datasets
- Returns a tuple of the feature matrix and the label vector (or a label matrix when there are multiple targets or labels)

### For Regression Problems

We use `make_regression` to generate data for regression problems

In [7]:
from sklearn.datasets import make_regression
?make_regression

#### Example 1

Let's generate 100 samples with 5 features for a single-target regression problem.

In [9]:
x, y = make_regression(n_samples=100, n_features=5, n_targets=1, shuffle=True, random_state=2)

It is good practice to set the seed so that experiments are repeatable. That's why we set `random_state` to a fixed seed value.
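
To see what the seed buys us, the following sketch generates the data twice with the same `random_state` and once with a different one. The seed value 3 and the variable names are arbitrary choices for this illustration.

```python
import numpy as np
from sklearn.datasets import make_regression

# Two calls with the same seed
x1, y1 = make_regression(n_samples=100, n_features=5, n_targets=1, random_state=2)
x2, y2 = make_regression(n_samples=100, n_features=5, n_targets=1, random_state=2)

# One call with a different seed
x3, y3 = make_regression(n_samples=100, n_features=5, n_targets=1, random_state=3)

same_seed_equal = np.array_equal(x1, x2) and np.array_equal(y1, y2)
other_seed_equal = np.array_equal(x1, x3)
print("Same seed gives identical data:", same_seed_equal)
print("Different seed gives identical data:", other_seed_equal)
```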

In [13]:
# Check the created data : 
print(x, y)
print(x.shape, y.shape)

[[-2.17135269e-01 -2.34360319e+00  1.17353150e+00  3.80471970e-01
   1.04082395e+00]
 [ 1.14485538e+00  8.29789046e-01 -1.52117687e-01 -1.64515057e-01
   5.62669078e-01]
 [-4.05286267e-01  1.18604868e+00 -1.37775793e+00 -7.94872445e-01
   3.63433972e-01]
 [-1.35479764e-01  1.90437591e+00  3.35908395e-01  3.76545911e-01
   5.85199353e-02]
 [-8.49995503e-01 -4.79985112e-01 -8.52341797e-01  6.65334278e-01
   8.53644334e-02]
 [ 4.40689872e-01 -5.83414595e-01 -7.19253841e-01  1.83533272e+00
   2.58529487e+00]
 [ 4.15393930e-02  5.39058321e-01 -1.11792545e+00  2.29220801e+00
   5.51454045e-01]
 [ 1.17500122e+00  9.02525097e-03 -7.47870949e-01 -1.91304965e-02
  -5.96159700e-01]
 [ 2.10222927e+00  5.35558351e-01  6.61264168e-02  6.35363758e-01
   4.25606211e-01]
 [-6.57718447e-01 -4.89157001e-01  8.20564332e-01 -8.34437391e-01
  -2.07492237e+00]
 [-1.18761229e+00 -1.53495196e-01 -1.42121723e+00 -6.37655012e-01
  -2.36184031e-01]
 [-2.74242089e-01 -4.47500876e-01  1.74181219e+00  1.11788673e+00

#### Example 2

Let's generate 100 samples with 5 features for a multi-output regression problem with 5 targets.

In [63]:
x, y = make_regression(n_samples=100, n_features=5, n_targets=5, shuffle=True, random_state=2)

# Check the created data : 
print(x, y)
print(x.shape, y.shape)

(100, 5) (100, 5)
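
Because the data is synthetic, the underlying linear model can also be requested. The sketch below passes `coef=True`, which makes `make_regression` return the coefficients as a third value; with the default settings used here (no noise, zero bias) the targets should be reproduced by a plain matrix product. The name `reconstructs` is introduced for this illustration.

```python
import numpy as np
from sklearn.datasets import make_regression

# coef=True additionally returns the coefficients of the generating linear model
x, y, coef = make_regression(
    n_samples=100, n_features=5, n_targets=5, shuffle=True, coef=True, random_state=2
)
print(x.shape, y.shape, coef.shape)

# With the default noise=0.0 and bias=0.0, y is a linear function of x
reconstructs = np.allclose(x @ coef, y)
print(reconstructs)
```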


<br>

### For Classification Problems

We use `make_classification` to generate data for classification problems

In [67]:
from sklearn.datasets import make_classification
?make_classification

#### Example 1

Let's generate a binary classification problem with 10 features and 100 samples

In [94]:
x, y = make_classification(n_samples=100, n_features=10, n_classes=2, n_clusters_per_class=1, random_state=42)

# Check the created data : 
x[:5]
y[:5]
print(x.shape, y.shape)

(100, 10) (100,)


#### Example 2

Let's generate a 3-class classification problem with 10 features and 100 samples

In [82]:
x, y = make_classification(n_samples=100, n_features=10, n_classes=3, n_clusters_per_class=1, random_state=42)

# Check the created data : 
x[:5]
y[:5]
print(x.shape, y.shape)

(100, 10) (100,)
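
The examples above set `n_clusters_per_class=1` for a reason: `make_classification` requires `n_classes * n_clusters_per_class` to be at most `2**n_informative`. The sketch below shows what happens with 3 classes when the defaults (2 clusters per class, 2 informative features) are left in place, and one alternative way to satisfy the constraint. The flag `raised` and the choice `n_informative=3` are illustrative.

```python
from sklearn.datasets import make_classification

raised = False
try:
    # 3 classes * 2 clusters per class (default) = 6 clusters,
    # but only 2 informative features (default), and 6 > 2**2
    make_classification(n_samples=100, n_features=10, n_classes=3, random_state=42)
except ValueError as err:
    raised = True
    print("ValueError:", err)

# More informative features also satisfy the constraint: 6 <= 2**3
x, y = make_classification(
    n_samples=100, n_features=10, n_classes=3, n_informative=3, random_state=42
)
print(x.shape, y.shape)
```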


<br>

### For Multi-Label Classification Problems

We use `make_multilabel_classification` to generate data for multi-label classification problems

In [86]:
from sklearn.datasets import make_multilabel_classification
?make_multilabel_classification

#### Example 1

Let's generate a multi-label classification problem with 10 features, 100 samples, 5 labels and, on average, 2 labels per example.

In [96]:
x, y = make_multilabel_classification(n_samples=100, n_features=10, n_classes=5, n_labels=2, random_state=42)

# Check the created data : 
x[:5]

# This time the label vector is a label matrix, as we have more than one label
y[:5]

print(x.shape, y.shape)

(100, 10) (100, 5)
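
Each row of the label matrix is a binary indicator vector, so summing along the rows gives the number of labels per example. The sketch below computes that average; since `n_labels` is only the expected value used by the generator, the measured average will be close to 2 rather than exactly 2. The names `labels_per_example` and `average_labels` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification

x, y = make_multilabel_classification(
    n_samples=100, n_features=10, n_classes=5, n_labels=2, random_state=42
)

# Each entry of y is 0 or 1; a row sum is the number of labels on that example
labels_per_example = y.sum(axis=1)
average_labels = labels_per_example.mean()
print("Average labels per example:", average_labels)
```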


<br>

### For Clustering Problems

We use `make_blobs` to generate data for clustering problems

In [104]:
from sklearn.datasets import make_blobs
?make_blobs

#### Example 1

Let's generate a random dataset with 10 samples, 2 features and 3 cluster centers for clustering

In [110]:
x, y = make_blobs(n_samples=10, n_features=2, centers=3, random_state=42)

# Check the created data : 
x[:5]

# Cluster membership of each point in y
y[:5]

print(x.shape, y.shape)

(10, 2) (10,)
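
It can be useful to know where the generated clusters are centred and how many points each one received. The sketch below assumes a scikit-learn version in which `make_blobs` supports the `return_centers` argument; `centers` and `counts` are names chosen for this illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs

# return_centers=True additionally returns the coordinates of the cluster centers
x, y, centers = make_blobs(
    n_samples=10, n_features=2, centers=3, random_state=42, return_centers=True
)
print(centers)

# Number of points assigned to each of the 3 clusters
counts = np.bincount(y, minlength=3)
print(counts)
```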
