# "Creating a Dataset for Classification"
> "In this article we learn how to create datasets for machine learning classification"
- toc: false
- branch: master
- badges: true
- comments: false
- categories: [machinelearning]

Machine learning algorithms are fairly easy to use when your data already has the shape they expect. To get an ML project going before real data is available, you can generate synthetic data in exactly that shape. For classification, for example, you need a target column holding the class labels and a set of feature columns that can predict those classes.

If your machine learning project is in Python, a good place to start is **scikit-learn**. This easy-to-use yet powerful library also has convenience functions for generating test data, one of which is **make_classification**.

### make_classification
`make_classification` is the **scikit-learn** utility function for generating classification test data. It returns two NumPy arrays: one with the *features*, of shape `(n_samples, n_features)`, and one with the integer *class labels* to predict, of shape `(n_samples,)`.
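
As a minimal sketch of the raw function call, the snippet below generates a small dataset and prints the type and shape of both arrays. The sizes and the `random_state` value are arbitrary choices for illustration, not values used elsewhere in this article.

```python
from sklearn.datasets import make_classification

# Generate 100 samples with 5 features and the default two classes.
# random_state is set only to make the example reproducible.
X, y = make_classification(n_samples=100, n_features=5, n_informative=3,
                           n_redundant=1, random_state=0)

print(type(X).__name__, X.shape)  # features: one row per sample, one column per feature
print(type(y).__name__, y.shape)  # labels: one integer class label per sample
```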

### Create a classification dataframe

In [1]:
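from sklearn.datasets import make_classification
from sklearn.utils import shuffle
import pandas as pd

def make_dataset(n_samples=1000, n_features=10, n_informative=6, n_redundant=2, n_classes=2, **kwargs):
    # shuffle=False keeps the columns in the order informative, redundant, noise,
    # so the column labels assigned below match the actual feature types
    data, target = make_classification(n_features=n_features,
                                       n_informative=n_informative,
                                       n_redundant=n_redundant,
                                       n_samples=n_samples,
                                       n_classes=n_classes,
                                       shuffle=False,
                                       **kwargs)
    # Shuffle the rows only, so the samples are not ordered by class
    data, target = shuffle(data, target, random_state=kwargs.get('random_state'))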
    # Index the rows by business day, ending today (pd.datetime was removed in pandas 2.0)
    index = pd.date_range(periods=n_samples, freq=pd.tseries.offsets.BDay(), end=pd.Timestamp.today()).normalize()
    columns = [f'Info{i}' for i in range(n_informative)] + \
              [f'Redun{i}' for i in range(n_redundant)] + \
              [f'Noise{i}' for i in range(n_features - (n_informative + n_redundant))]
    df = pd.DataFrame(data, columns=columns, index=index)
    target = pd.Series(target, index=index)
    return df, target
    
data, target = make_dataset(1000, n_features=8, n_informative=4, n_classes=3)

### The features
The generated dataframe contains 8 columns, which fall into three groups:

- **Informative features** The 4 `Info` columns have a predictive relationship with the target
- **Redundant features** The 2 `Redun` columns are generated as random linear combinations of the informative features
- **Noise** The 2 remaining `Noise` columns are pure random noise and should have no predictive power
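
The difference between the three groups can be checked numerically. The sketch below is an illustration rather than part of the original notebook: it calls `make_classification` directly with `shuffle=False` so that the column order is known (informative, then redundant, then noise), and uses a hypothetical helper, `max_fit_error`, to fit the redundant and noise columns as linear combinations of the informative ones by least squares.

```python
import numpy as np
from sklearn.datasets import make_classification

n_informative, n_redundant = 4, 2

# shuffle=False keeps the column order: informative, redundant, then noise
X, y = make_classification(n_samples=500, n_features=8,
                           n_informative=n_informative, n_redundant=n_redundant,
                           n_classes=3, shuffle=False, random_state=0)

informative = X[:, :n_informative]
redundant = X[:, n_informative:n_informative + n_redundant]
noise = X[:, n_informative + n_redundant:]

def max_fit_error(features, targets):
    """Largest absolute error when fitting targets as a linear combination of features."""
    coef, *_ = np.linalg.lstsq(features, targets, rcond=None)
    return np.abs(features @ coef - targets).max()

redundant_error = max_fit_error(informative, redundant)
noise_error = max_fit_error(informative, noise)
print(f"redundant: {redundant_error:.2e}, noise: {noise_error:.2e}")
```

With the default `shift` and `scale` settings, the redundant columns should be reproduced almost exactly, while the noise columns should leave a large error.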

In [2]:
data

| | Info0 | Info1 | Info2 | Info3 | Redun0 | Redun1 | Noise0 | Noise1 |
|---|---|---|---|---|---|---|---|---|
| 2016-05-09 | 0.086170 | 0.263946 | 1.147196 | 0.339872 | -1.649331 | -1.385724 | 1.350134 | 0.895231 |
| 2016-05-10 | -0.519554 | 1.474023 | 1.574778 | -0.404883 | -1.380555 | -1.680931 | 2.586144 | 1.771411 |
| 2016-05-11 | -2.309424 | 0.404823 | -1.276600 | -0.973535 | 3.035280 | 0.491335 | 0.854342 | -0.340997 |
| 2016-05-12 | 1.463149 | 1.588531 | 1.150813 | 0.681997 | -1.027532 | 1.080704 | -0.465189 | 1.247441 |
| 2016-05-13 | 0.961113 | 2.335966 | -1.425342 | 0.219535 | 1.118655 | 0.546441 | -1.477135 | -1.250334 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2020-03-02 | -2.323392 | 0.658851 | 1.093550 | 0.446788 | -1.665774 | -2.595020 | 2.448741 | 0.270443 |
| 2020-03-03 | 0.573351 | 1.780362 | -1.571254 | 1.568883 | 1.301866 | 1.092169 | -2.063039 | -1.640765 |
| 2020-03-04 | 2.059537 | 2.106828 | -0.022051 | 1.284990 | -1.075659 | -1.453897 | 0.309927 | 0.133524 |
| 2020-03-05 | 2.237442 | 0.569487 | -1.701149 | -1.311770 | 0.942068 | -0.053449 | -1.453072 | -1.252078 |


### The target
The target variable contains the values 0, 1 and 2, because we asked for three classes in the **make_dataset** function call. The classes are roughly evenly distributed, though we could have requested a different distribution, for example through the `weights` argument of `make_classification`, which **make_dataset** forwards via `**kwargs`.

In [4]:
target.value_counts().sort_index().to_frame('Target')

| | Target |
|---|---|
| 0 | 334 |
| 1 | 333 |
| 2 | 333 |
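
The class balance does not have to be even. As a rough sketch, the snippet below calls `make_classification` directly with its `weights` argument to request an imbalanced target. The 70/20/10 split is an arbitrary example, and `flip_y=0` is an added assumption that turns off the random label noise so the proportions stay close to the requested ones.

```python
import numpy as np
from sklearn.datasets import make_classification

# weights sets the approximate share of samples per class;
# flip_y=0 turns off the random label noise that would blur the proportions
weights = [0.7, 0.2, 0.1]
X, y = make_classification(n_samples=1000, n_features=8, n_informative=4,
                           n_classes=3, weights=weights, flip_y=0,
                           random_state=0)

counts = np.bincount(y)  # number of samples in each class
print(counts / len(y))   # observed class proportions
```

The same keyword arguments could be passed to `make_dataset`, since it hands any extra arguments on to `make_classification`.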
