# Creating a Test Dataset for Classification

## Create a Dataset for Classification
To illustrate ML classification we will first generate a dataset.

### make_classification
**Scikit-learn** provides a utility function, `make_classification`, for generating test data for classification. It returns a numpy array of *features* along with a second array of *class labels* (the target).
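As a minimal sketch of a direct call, the function returns the feature matrix and the label vector as a pair. The parameter values and `random_state=0` here are illustrative assumptions, not taken from the notebook.

```python
from sklearn.datasets import make_classification

# Illustrative parameter values; random_state is fixed only for reproducibility.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=6,
                           n_redundant=2, n_classes=3, random_state=0)

print(X.shape)  # (1000, 10): one row per sample, one column per feature
print(y.shape)  # (1000,): one integer class label per sample
```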

### Create a classification dataframe

In [61]:
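import pandas as pd
from sklearn.datasets import make_classification

def make_dataset(n_samples=1000, n_features=10, n_informative=6, n_redundant=2, n_classes=2, **kwargs):
    # Generate the feature matrix and the class labels as numpy arrays
    data, target = make_classification(n_features=n_features,
                                       n_informative=n_informative,
                                       n_redundant=n_redundant,
                                       n_samples=n_samples,
                                       n_classes=n_classes,
                                       **kwargs)
    # Index the rows by business days ending today
    # (pd.datetime is no longer available in recent pandas; use pd.Timestamp)
    index = pd.date_range(periods=n_samples, freq=pd.tseries.offsets.BDay(), end=pd.Timestamp.today()).normalize()
    # Label the columns in the order informative, redundant, noise.
    # Note: make_classification keeps this column order only with shuffle=False;
    # with the default shuffle=True the feature columns are permuted.
    columns = [f'Info{i}' for i in range(n_informative)] + \
              [f'Redun{i}' for i in range(n_redundant)] + \
              [f'Noise{i}' for i in range(n_features - (n_informative + n_redundant))]
    df = pd.DataFrame(data, columns=columns, index=index)
    target = pd.Series(target, index=index)
    return df, target

data, target = make_dataset(1000, n_classes=3)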

### The features
The generated dataframe contains 10 features of three kinds:

- **Informative features** (`Info0` to `Info5`): features that have a predictive relationship with the target.
- **Redundant features** (`Redun0`, `Redun1`): random linear combinations of the informative features.
- **Noise** (`Noise0`, `Noise1`): pure noise, which should have no predictive power.
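One way to check these definitions is sketched here, under the assumption that `shuffle=False` is passed so that `make_classification` keeps the columns in the order informative, redundant, noise (with the default `shuffle=True` the columns are permuted). The variable names, the helper `max_fit_error`, and `random_state=0` are illustrative. A least-squares fit should reproduce the redundant columns from the informative ones almost exactly, but not the noise columns.

```python
import numpy as np
from sklearn.datasets import make_classification

n_informative, n_redundant = 6, 2
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=n_informative, n_redundant=n_redundant,
                           n_classes=3, shuffle=False, random_state=0)

# With shuffle=False the column blocks appear in this fixed order.
informative = X[:, :n_informative]
redundant = X[:, n_informative:n_informative + n_redundant]
noise = X[:, n_informative + n_redundant:]

def max_fit_error(features, columns):
    # Largest absolute error when fitting `columns` as a linear
    # combination of `features` (hypothetical helper for this sketch).
    coef, *_ = np.linalg.lstsq(features, columns, rcond=None)
    return np.abs(features @ coef - columns).max()

redundant_error = max_fit_error(informative, redundant)  # expected to be near zero
noise_error = max_fit_error(informative, noise)          # expected to be clearly non-zero
print(redundant_error, noise_error)
```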

In [62]:
data

| | Info0 | Info1 | Info2 | Info3 | Info4 | Info5 | Redun0 | Redun1 | Noise0 | Noise1 |
|---|---|---|---|---|---|---|---|---|---|---|
| 2016-05-09 | -0.837671 | 1.669862 | 1.216492 | 2.102211 | -3.835775 | 6.585448 | 2.714777 | 0.467635 | 1.999206 | 0.106072 |
| 2016-05-10 | -0.169175 | -1.397993 | 0.045350 | 2.000519 | -2.109422 | 1.607660 | -1.731406 | 0.649270 | 2.300643 | 3.699819 |
| 2016-05-11 | -0.352642 | -0.660572 | -1.569780 | 3.698225 | 2.560158 | 3.156547 | 1.309530 | 0.762749 | 2.381795 | -0.005944 |
| 2016-05-12 | 0.224075 | -1.981616 | -0.287805 | -1.047755 | 1.514996 | -1.984338 | 0.163567 | 0.379838 | 1.839880 | 3.341632 |
| 2016-05-13 | 0.171056 | 2.283085 | 1.673806 | -0.889547 | -1.191857 | 1.001632 | -2.454782 | 0.295643 | 2.640628 | -0.325261 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2020-03-02 | -1.699329 | -2.764829 | -0.265345 | 0.778363 | -2.016183 | -0.236014 | -1.197534 | 0.351709 | 2.434775 | 3.124586 |
| 2020-03-03 | -0.914414 | 2.335982 | 0.856029 | -0.423509 | -0.546396 | 0.821092 | -0.874792 | 0.632176 | -0.405551 | -2.480377 |
| 2020-03-04 | 0.882539 | 1.661710 | 0.838825 | 0.551954 | -0.555205 | 2.375841 | -0.448172 | 0.743824 | 2.088615 | -0.341239 |
| 2020-03-05 | -0.396351 | -1.621114 | -0.186584 | -0.567293 | -0.538820 | -0.045592 | 1.996148 | -0.795062 | 1.105916 | 0.741156 |


### The target
The target variable contains the values 0, 1 and 2: three classes, because we passed `n_classes=3` in the **make_dataset** call. The classes are roughly evenly distributed, though we could have specified a different distribution.
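As a sketch of how a different class distribution could be requested: `make_classification` accepts a `weights` argument giving the approximate proportion of samples per class. The proportions below and `random_state=0` are illustrative assumptions, and the resulting counts are only approximate because a small fraction of labels is reassigned at random by default (`flip_y=0.01`).

```python
import numpy as np
from sklearn.datasets import make_classification

# Ask for roughly 70% / 20% / 10% of the samples in classes 0 / 1 / 2.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=6,
                           n_redundant=2, n_classes=3,
                           weights=[0.7, 0.2, 0.1], random_state=0)

# Number of samples per class, in class order 0, 1, 2
counts = np.bincount(y, minlength=3)
print(counts)
```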

In [63]:
# Count the samples per class, ordered by class label
target.value_counts().sort_index().to_frame('Target')

| | Target |
|---|---|
| 0 | 337 |
| 1 | 332 |
| 2 | 331 |
