In [1]:
import sklearn.datasets as skdata
import numpy as np

### `make_classification` parameters
This is a very useful function, so it is good to internalize its parameters. In the code cells below, the commented value is the default value. At the core of this method is the generation of an n-dimensional hypercube. Don't get too worried about this. A 2-dimensional hypercube is just a square, i.e., we choose the 4 corner points of a square on a 2-D plane, for example:

| $x_1$ | $x_2$ |
|-------|-------|
| -2    | -2    |
| 2     | -2    |
| 2     | 2     |
| -2    | 2     |

Similarly, an n-dimensional hypercube is just a set of $2^n$ vertices in an n-dimensional space, where every coordinate of a vertex takes one of two values. Nothing too complicated.
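To make the vertex idea concrete, here is a minimal sketch that enumerates the vertices of a hypercube centered at the origin. The helper name `hypercube_vertices` and the `half_side` argument are my own, chosen for illustration; this is not how sklearn builds its hypercube internally.

```python
from itertools import product


def hypercube_vertices(n, half_side=2):
    """Return all 2**n vertices of an n-dimensional hypercube centered at the origin."""
    # Every coordinate is either -half_side or +half_side.
    return list(product((-half_side, half_side), repeat=n))


square = hypercube_vertices(2)  # the four corners from the table above
cube = hypercube_vertices(3)

print(square)
print(len(cube))
```

The number of vertices doubles with every extra dimension, which matters later when counting how many clusters fit on the hypercube.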

So this function generates an n-dimensional hypercube, where $n$ is the number of informative features (`n_informative`). Then for each class it will create Gaussian clusters centered at vertices of the hypercube, and it will sample points from each cluster. It will further linearly combine the informative features of the points within each cluster, which according to the documentation is done to add covariance between the features. Each point will be n-dimensional, thus giving us a full sample. Imagine this in 2D and it will become a lot clearer.

![hypercube](./imgs/hypercubes2.png)

In the picture above, there are 2 features and 2 classes, blue and orange. A 2D hypercube is generated, Gaussian clusters are created at its vertices, and points sampled from these clusters give us the blue and orange samples.
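The 2D picture can be imitated with plain NumPy. This is only a rough sketch of the idea, not sklearn's actual code: the choice of vertices, the 50 points per class, and the unit standard deviation are assumptions I made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# One vertex of the square per class (an illustrative choice).
centers = {
    "blue": np.array([-2.0, -2.0]),
    "orange": np.array([2.0, 2.0]),
}

# Sample 50 points from a unit-variance Gaussian centered on each vertex.
samples = {
    color: center + rng.standard_normal(size=(50, 2))
    for color, center in centers.items()
}

for color, points in samples.items():
    print(color, points.shape, points.mean(axis=0).round(1))
```

The sample mean of each cloud lands close to its vertex, which is what the blue and orange blobs in the picture show.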

<span style="color:red">IMPORTANT: It is not clear from the documentation whether or not a vertex can host clusters of different classes. It seems likely that it cannot.</span> This is because there cannot be more clusters than vertices, i.e., $ck \leqslant 2^n$, where $c$ is the number of classes, $k$ is the number of clusters per class (so $ck$ is the total number of clusters), and $n$ is the number of informative features.
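The constraint is easy to check by hand, and sklearn refuses to generate data when it is violated. A minimal sketch, assuming $n$ counts the informative features; `fits_on_hypercube` is a hypothetical helper of mine, not part of sklearn.

```python
from sklearn.datasets import make_classification


def fits_on_hypercube(n_classes, n_clusters_per_class, n_informative):
    """Check c * k <= 2**n for the given settings."""
    return n_classes * n_clusters_per_class <= 2 ** n_informative


print(fits_on_hypercube(3, 2, 3))  # 6 clusters, 8 vertices
print(fits_on_hypercube(3, 2, 2))  # 6 clusters, 4 vertices

# The second combination should be rejected by sklearn with a ValueError.
try:
    make_classification(
        n_classes=3, n_clusters_per_class=2, n_informative=2, random_state=0
    )
    raised = False
except ValueError as err:
    raised = True
    print(err)
```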

In the cells below, the commented value is the default.

#### Hypercubes
There are a number of parameters that control the hypercube. First of all, the `hypercube` parameter determines whether this hypercube scheme is used at all; if it is `False`, the clusters are placed on the vertices of a random polytope instead. I can set the size of the hypercube with `class_sep`: the sides of the hypercube have length `2 * class_sep`. Higher values mean that the classes are easier to discriminate and therefore the classification task is easier.

<span style="color:red">The following is just a hunch</span>
I can also somewhat control where in the n-dimensional space the vertices of the hypercube will lie, e.g., in the example above, I can somewhat control whether the vertices are (2, 2), (2, -2), (-2, 2), and (-2, -2) or some other points. This is done by the `shift` parameter, where sklearn shifts the features by the given value after the points have been generated. If I specify `None`, then the features are shifted by values drawn randomly from $[-l, l]$, where $l$ is `class_sep`. So far I have not found a good explanation of this on the Internet and I am too lazy to look up the source code.

Another useful parameter is `scale`, which multiplies the features by whatever I specify. If this is `None`, then the features are scaled by values drawn randomly from $[1, 100]$. According to the documentation, scaling happens after shifting.
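One way to test the hunch about `shift` and `scale` is to generate the same dataset several times with a fixed `random_state` and compare. This sketch assumes that passing plain numbers for `shift` and `scale` does not consume any random draws, so the underlying points stay identical between calls; the specific values 5.0 and 10.0 are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification

common = dict(
    n_samples=50, n_features=4, n_informative=2, n_redundant=0, random_state=0
)

X_base, _ = make_classification(shift=0.0, scale=1.0, **common)
X_shifted, _ = make_classification(shift=5.0, scale=1.0, **common)
X_scaled, _ = make_classification(shift=0.0, scale=10.0, **common)
X_both, _ = make_classification(shift=5.0, scale=10.0, **common)

# shift is added to every feature, scale multiplies every feature.
shift_is_additive = np.allclose(X_shifted, X_base + 5.0)
scale_is_multiplicative = np.allclose(X_scaled, X_base * 10.0)
# Scaling is applied after shifting.
shift_then_scale = np.allclose(X_both, (X_base + 5.0) * 10.0)

print(shift_is_additive, scale_is_multiplicative, shift_then_scale)
```

If all three checks hold, `shift` and `scale` act on the finished points rather than on the hypercube vertices alone.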

In [2]:
# hypercube = True
hypercube = True

# class_sep = 1.0
class_sep = 0.8

# shift = 0.0
shift = 0.0

# scale = 1.0
scale = None

#### Classes
When creating the clusters centered on the hypercube vertices, sklearn does not always create 1 cluster per class. I can specify how many clusters to create per class with `n_clusters_per_class`. Further, I can specify the proportion of samples for each class with `weights`. If I don't specify anything, the dataset will be more-or-less balanced. In order to generate noise, I can specify the `flip_y` parameter, which is the fraction of samples whose class will be assigned randomly. Obviously, the higher this value, the harder the classification problem.
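The effect of `weights` and `flip_y` shows up directly in the label counts. A minimal sketch; the sample size of 1000 and the `flip_y` value of 0.3 are arbitrary choices, picked only so that the effect is easy to see.

```python
import numpy as np
from sklearn.datasets import make_classification

common = dict(
    n_samples=1000,
    n_classes=3,
    n_clusters_per_class=1,
    n_informative=3,
    n_redundant=0,
    weights=[0.1, 0.2, 0.7],
    shuffle=False,
    random_state=0,
)

_, y_clean = make_classification(flip_y=0.0, **common)
_, y_noisy = make_classification(flip_y=0.3, **common)

counts_clean = np.bincount(y_clean, minlength=3)
counts_noisy = np.bincount(y_noisy, minlength=3)
# Fraction of labels that differ once label noise is switched on.
changed = float(np.mean(y_clean != y_noisy))

print(counts_clean)
print(counts_noisy)
print(changed)
```

Note that the changed fraction is lower than `flip_y`, since a randomly reassigned label can land on the class it already had.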

In [8]:
# n_classes = 2
n_classes = 3

# n_clusters_per_class = 2
n_clusters_per_class = 2

# weights = None
weights = [0.1, 0.2, 0.7]

# flip_y = 0.01
flip_y = 0.1

#### Samples
The total number of samples to generate is set with the `n_samples` parameter. Pretty straightforward. I can also choose whether to shuffle the samples and the features with `shuffle`. Without shuffling, the features are in the order informative, redundant, repeated, and useless. The `random_state` parameter sets the random seed for reproducibility.
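A quick sketch of what `random_state` and `shuffle` do in practice. The seeds 0 and 1 are arbitrary, and `flip_y=0.0` is set in the last call only so that label noise does not blur the ordering.

```python
import numpy as np
from sklearn.datasets import make_classification

# Same seed, same data; different seed, different data.
X_a, y_a = make_classification(n_samples=100, random_state=0)
X_b, y_b = make_classification(n_samples=100, random_state=0)
X_c, y_c = make_classification(n_samples=100, random_state=1)

same_seed_equal = np.array_equal(X_a, X_b) and np.array_equal(y_a, y_b)
different_seed_equal = np.array_equal(X_a, X_c)
print(same_seed_equal, different_seed_equal)

# Without shuffling, the samples come out in contiguous blocks, one per cluster.
_, y_ordered = make_classification(
    n_samples=100, shuffle=False, flip_y=0.0, random_state=0
)
label_changes = int(np.count_nonzero(np.diff(y_ordered)))
print(y_ordered)
print(label_changes)
```

With the default 2 classes and 2 clusters per class there are 4 clusters, so the unshuffled labels can change at most 3 times along the array.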

In [9]:
# n_samples = 100
n_samples = 100

# shuffle = True
shuffle = False

# random_state = None
random_state = 0

#### Number of Features
The main parameter is `n_informative`. This is the "real" number of features, and the hypercube is based on it. Additionally, I can inject a bunch of redundant features, which are just random linear combinations of the informative features, and repeated features, which are duplicated from the informative and redundant features. Finally, there can be a number of useless features that are completely random. There is no way to specify the number of useless features directly; it is calculated from the total number of features.
```
n_features = n_informative + n_redundant + n_repeated + n_useless
```
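With `shuffle=False` the column layout can be verified numerically. The sketch below assumes `shift=0.0` and `scale=1.0` so that the redundant columns stay exact linear combinations of the informative ones; the rank check and the column-copy check are my own way of testing the description above, not something the documentation prescribes.

```python
import numpy as np
from sklearn.datasets import make_classification

n_features, n_informative, n_redundant, n_repeated = 7, 3, 2, 1
n_useless = n_features - n_informative - n_redundant - n_repeated

X, _ = make_classification(
    n_samples=100,
    n_features=n_features,
    n_informative=n_informative,
    n_redundant=n_redundant,
    n_repeated=n_repeated,
    n_classes=3,
    n_clusters_per_class=2,
    shift=0.0,
    scale=1.0,
    shuffle=False,
    random_state=0,
)

n_base = n_informative + n_redundant

# Redundant columns add no new directions: the rank stays at n_informative.
rank = int(np.linalg.matrix_rank(X[:, :n_base]))

# The repeated column is an exact copy of an informative or redundant column.
repeated_col = X[:, n_base]
is_copy = any(np.array_equal(repeated_col, X[:, j]) for j in range(n_base))

print(n_useless, rank, is_copy)
```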

In [10]:
# n_features = 20
n_features = 7

# n_informative = 2
n_informative = 3

# n_redundant = 2
n_redundant = 2

# n_repeated = 0
n_repeated = 1

# n_useless is 7 - 3 - 2 - 1 = 1

In [11]:
X, y = skdata.make_classification(
    # hypercube
    hypercube=hypercube,
    class_sep=class_sep,
    shift=shift,
    scale=scale,
    
    # classes
    n_classes=n_classes,
    n_clusters_per_class=n_clusters_per_class,
    weights=weights,
    flip_y=flip_y,
    
    # samples
    n_samples=n_samples,
    shuffle=shuffle,
    random_state=random_state,

    # features
    n_features=n_features,
    n_informative=n_informative,
    n_redundant=n_redundant,
    n_repeated=n_repeated,
)

In [12]:
print(X.shape, y.shape)

(100, 7) (100,)
