# "Creating a Dataset for Classification"
> "In this article we learn how to create datasets for machine learning classification"
- toc: false
- branch: master
- badges: true
- comments: false
- categories: [machinelearning]

Machine learning algorithms are fairly easy to use when your data matches exactly what they are designed for. To get going with an ML project before you have real data, you can generate synthetic data in exactly the shape the project needs. For example, if you want to do classification, you need a target column holding the classes and a set of feature columns that can predict those classes.

If your machine learning project is in Python, then the best way to start is with **scikit-learn**. This easy-to-use yet powerful library also has convenience functions to generate test data, one of which is called **make_classification**.

### make_classification
**Scikit-learn** has a utility function to generate test data for classification called `make_classification`. With it you can generate a numpy array of *features* along with another array of *class labels*. This function is in the `datasets` module, so to use it you would do

`from sklearn.datasets import make_classification`

`data, target = make_classification(...)`


and you will get the data and the target, with enough of a relationship between the two to do some machine learning. Here is an example:

In [5]:
from sklearn.datasets import make_classification

data, target = make_classification(n_features=12, n_samples=100)
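The generated values are random, so each call returns different numbers. A minimal sketch of making the output repeatable, assuming you want the same dataset on every run: `make_classification` accepts a `random_state` argument, and the seed value `42` and the variable names below are only illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification

# Two calls with the same seed (42 is an arbitrary choice)
X_a, y_a = make_classification(n_samples=100, n_features=12, random_state=42)
X_b, y_b = make_classification(n_samples=100, n_features=12, random_state=42)

# Both the features and the labels should match element for element
same_features = np.array_equal(X_a, X_b)
same_labels = np.array_equal(y_a, y_b)
print(same_features, same_labels)
```

Leaving `random_state` out gives a fresh dataset on each call, which is what the examples in this article do.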

The data array is a numpy array of shape (**n_samples, n_features**)

In [11]:
data[:2], data.shape

(array([[-0.49250734, -0.56512833,  1.34644838, -0.74023749,  0.68843061,
         -0.97001932, -1.04030453,  0.01550148, -0.43045118, -0.97009488,
         -0.97950603,  0.29914428],
        [ 1.61169543,  1.31613403, -0.63130511, -0.49027106,  2.80135421,
          0.29398278,  0.0893581 ,  2.8170629 ,  0.11898778,  0.56789107,
         -1.60475764, -2.28835375]]),
 (100, 12))

The target is a numpy array of shape (**n_samples,**). The values are 0 or 1 because **n_classes** defaults to 2:

In [12]:
target, target.shape

(array([1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1,
        1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0,
        1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0,
        1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0,
        1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0]),
 (100,))
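To see that the features really do carry information about the target, one option is to compare a classifier against a trivial baseline. The sketch below is only an illustration and makes several assumptions that are not part of the original example: 500 samples instead of 100, a fixed `random_state`, a random forest as the model, and 5-fold cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Assumed settings: more samples and a fixed seed for a stable comparison
X, y = make_classification(n_samples=500, n_features=12, random_state=0)

# Baseline that ignores the features and always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent")
baseline_acc = cross_val_score(baseline, X, y, cv=5).mean()

# A model that can actually use the features
model = RandomForestClassifier(n_estimators=100, random_state=0)
model_acc = cross_val_score(model, X, y, cv=5).mean()

print(f"baseline accuracy:      {baseline_acc:.2f}")
print(f"random forest accuracy: {model_acc:.2f}")
```

If the model's accuracy is clearly above the baseline, the generated features are predictive of the generated classes.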

### Create a classification dataframe

In [1]:
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.utils import shuffle

def make_dataset(n_samples=1000, n_features=10, n_informative=6, n_redundant=2, n_classes=2, **kwargs):
    # shuffle=False keeps the columns in a known order (informative, redundant, noise),
    # so the labels built below match the features they name
    data, target = make_classification(n_samples=n_samples,
                                       n_features=n_features,
                                       n_informative=n_informative,
                                       n_redundant=n_redundant,
                                       n_classes=n_classes,
                                       shuffle=False,
                                       **kwargs)
    # Shuffle only the rows, so the samples are not grouped by class
    data, target = shuffle(data, target, random_state=kwargs.get('random_state'))
    # Business-day index ending today
    index = pd.date_range(end=pd.Timestamp.today().normalize(), periods=n_samples, freq='B')
    columns = [f'Info{i}' for i in range(n_informative)] + \
              [f'Redun{i}' for i in range(n_redundant)] + \
              [f'Noise{i}' for i in range(n_features - (n_informative + n_redundant))]
    df = pd.DataFrame(data, columns=columns, index=index)
    target = pd.Series(target, index=index)
    return df, target

data, target = make_dataset(1000, n_features=8, n_informative=4, n_classes=3)

### The features
The generated dataframe contains 8 columns of three kinds:

- **Informative features** Four columns (`Info0` to `Info3`) that have a predictive relationship with the target
- **Redundant features** Two columns (`Redun0`, `Redun1`) generated as random linear combinations of the informative features
- **Noise** Two columns (`Noise0`, `Noise1`) of pure noise, which should have no predictive power
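The difference between redundant and noise columns can be checked numerically. The sketch below assumes `shuffle=False`, so that `make_classification` returns the columns in a known order (informative first, then redundant, then noise), and uses a fixed `random_state`. It fits a least-squares map from the informative columns to the others: the redundant columns should be reproduced almost exactly, while the noise columns should not.

```python
import numpy as np
from sklearn.datasets import make_classification

n_informative, n_redundant = 4, 2
# shuffle=False is an assumption here: it keeps the column order known
X, y = make_classification(n_samples=1000, n_features=8,
                           n_informative=n_informative, n_redundant=n_redundant,
                           n_classes=3, shuffle=False, random_state=0)

X_info = X[:, :n_informative]
X_redun = X[:, n_informative:n_informative + n_redundant]
X_noise = X[:, n_informative + n_redundant:]

# Least-squares weights that rebuild the redundant columns from the informative ones
coef, *_ = np.linalg.lstsq(X_info, X_redun, rcond=None)
redundant_is_linear = np.allclose(X_info @ coef, X_redun)

# The same fit for the noise columns leaves a large residual
noise_coef, *_ = np.linalg.lstsq(X_info, X_noise, rcond=None)
noise_is_linear = np.allclose(X_info @ noise_coef, X_noise)

print(redundant_is_linear, noise_is_linear)
```

With the default `shuffle=True` the columns come back in random order, so this slicing by position would no longer pick out the right groups.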

In [2]:
data

| | Info0 | Info1 | Info2 | Info3 | Redun0 | Redun1 | Noise0 | Noise1 |
|---|---|---|---|---|---|---|---|---|
| 2016-05-10 | -0.944 | -1.358 | -1.274 | 0.465 | 1.308 | 0.414 | 0.398 | -0.483 |
| 2016-05-11 | 1.263 | 1.536 | -0.006 | -0.588 | 1.532 | 0.720 | 0.019 | 0.790 |
| 2016-05-12 | 0.755 | 0.700 | -1.725 | 0.281 | -1.483 | -1.000 | -0.296 | 0.450 |
| 2016-05-13 | 0.589 | 1.267 | 0.410 | -1.858 | -1.337 | 0.468 | 0.317 | 0.201 |
| 2016-05-16 | 0.489 | -1.645 | -0.387 | 0.396 | -1.312 | -1.753 | 0.086 | 0.678 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2020-03-03 | -1.128 | -1.246 | 0.636 | 1.509 | 0.784 | -0.160 | 0.776 | -0.665 |
| 2020-03-04 | 1.748 | -0.734 | 0.256 | 1.331 | 0.093 | -2.095 | 2.344 | 1.587 |
| 2020-03-05 | -2.043 | -1.482 | -0.079 | -2.589 | 0.356 | 2.215 | -1.444 | -1.404 |
| 2020-03-06 | -1.213 | 0.298 | -0.405 | 0.920 | -0.055 | 0.456 | -2.922 | -1.053 |


### The target
The target variable contains the values 0, 1, and 2: three classes, since we specified `n_classes=3` in the **make_dataset** function call. The classes are roughly evenly distributed, though we could have requested a different distribution through the `weights` parameter of `make_classification`.

In [4]:
target.value_counts().sort_index().to_frame('Target')

| | Target |
|---|---|
| 0 | 334 |
| 1 | 333 |
| 2 | 333 |
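For an imbalanced target, a minimal sketch is to pass `weights` to `make_classification`. The proportions `[0.7, 0.2, 0.1]` and the `random_state` below are illustrative assumptions, not values from the example above. Since `make_dataset` forwards `**kwargs`, the same argument could be passed through it.

```python
import numpy as np
from sklearn.datasets import make_classification

# Ask for roughly 70% / 20% / 10% of samples in classes 0 / 1 / 2
X, y = make_classification(n_samples=1000, n_features=8, n_informative=4,
                           n_classes=3, weights=[0.7, 0.2, 0.1],
                           random_state=0)

# Number of samples per class, indexed by class label
class_counts = np.bincount(y, minlength=3)
print(class_counts)
```

The counts will not match the requested proportions exactly, because `make_classification` also assigns a small fraction of labels at random (controlled by `flip_y`).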
