# "Creating a Dataset for Classification"
> "In this article we learn how to create datasets for machine learning classification"
- toc: false
- branch: master
- badges: true
- comments: false
- categories: [machinelearning]

Machine learning algorithms are fairly easy to use when your data matches exactly what they are designed for. To get going with an ML project before you have real data, you can generate data in exactly the shape the project needs. For example, for classification you need a target column holding the class labels, and a set of feature columns that can predict those labels.

If your machine learning project is in Python, a good place to start is **scikit-learn**. This easy-to-use yet powerful library also has convenience functions for generating test data, one of which is **make_classification**.

### make_classification
**Scikit-learn**'s `make_classification` utility generates test data for classification problems. It returns a NumPy array of *features* along with a second array holding the *class label* of each sample.
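As a minimal sketch of a direct call, the snippet below generates a small dataset and inspects the two returned arrays. The sizes and the fixed `random_state` are illustrative choices, not values used elsewhere in this article.

```python
import numpy as np
from sklearn.datasets import make_classification

# random_state is fixed only so that this sketch is reproducible.
X, y = make_classification(n_samples=100,
                           n_features=5,
                           n_informative=3,
                           n_redundant=1,
                           n_classes=2,
                           random_state=0)

print(X.shape)                 # one row per sample, one column per feature
print(y.shape)                 # one class label per sample
print(np.unique(y).tolist())   # the distinct class labels
```

`X` holds the features and `y` the matching class labels, in the same row order.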

### Create a classification dataframe

In [2]:
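from sklearn.datasets import make_classification
from sklearn.utils import shuffle
import pandas as pd

def make_dataset(n_samples=1000, n_features=10, n_informative=6, n_redundant=2, n_classes=2, **kwargs):
    # shuffle=False keeps the columns in make_classification's documented order
    # (informative, then redundant, then noise), so the labels built below are accurate.
    # By default make_classification shuffles the columns as well as the rows.
    data, target = make_classification(n_features=n_features,
                                       n_informative=n_informative,
                                       n_redundant=n_redundant,
                                       n_samples=n_samples,
                                       n_classes=n_classes,
                                       shuffle=False,
                                       **kwargs)
    # Shuffle the rows ourselves so that the classes are not grouped together.
    data, target = shuffle(data, target, random_state=kwargs.get('random_state'))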
    # pd.datetime is no longer available in recent pandas versions; use pd.Timestamp instead.
    index = pd.date_range(periods=n_samples, freq=pd.tseries.offsets.BDay(), end=pd.Timestamp.today()).normalize()
    columns = [f'Info{i}' for i in range(n_informative)] + \
              [f'Redun{i}' for i in range(n_redundant)] + \
              [f'Noise{i}' for i in range(n_features - (n_informative + n_redundant))]
    df = pd.DataFrame(data, columns=columns, index=index)
    target = pd.Series(target, index=index)
    return df, target
    
data, target = make_dataset(1000, n_features=8, n_classes=3)

### The features
The generated dataframe contains 8 features: the six informative and two redundant features requested above, which leaves no room for noise columns. In general, `make_classification` can produce three kinds of feature:

- **Informative features** These have a predictive relationship with the target
- **Redundant features** These are generated as random linear combinations of the informative features
- **Noise** These are just random noise, and should have no predictive power
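The difference between the three kinds of feature can be checked numerically. The sketch below is an illustration rather than part of the original workflow: it calls `make_classification` with `shuffle=False`, because only then are the columns returned in the order informative, redundant, noise, and it uses a least-squares fit to test whether a block of columns can be rebuilt from the informative ones. The variable names and the fixed `random_state` are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification

# shuffle=False keeps the columns in generation order:
# 6 informative, then 2 redundant, then 2 noise columns.
X, y = make_classification(n_samples=1000,
                           n_features=10,
                           n_informative=6,
                           n_redundant=2,
                           n_classes=3,
                           shuffle=False,
                           random_state=0)

informative = X[:, :6]
redundant = X[:, 6:8]
noise = X[:, 8:]

# Fit each block as a linear combination of the informative columns.
coef_r, *_ = np.linalg.lstsq(informative, redundant, rcond=None)
coef_n, *_ = np.linalg.lstsq(informative, noise, rcond=None)

# Largest absolute reconstruction error for each block.
redundant_error = np.abs(informative @ coef_r - redundant).max()
noise_error = np.abs(informative @ coef_n - noise).max()

print(redundant_error)  # effectively zero: redundant columns are exact combinations
print(noise_error)      # clearly non-zero: noise cannot be rebuilt this way
```

The redundant columns are reproduced up to floating-point error, while the noise columns are not, which is why redundant features add no new information for a model.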

In [3]:
data

| | Info0 | Info1 | Info2 | Info3 | Info4 | Info5 | Redun0 | Redun1 |
|---|---|---|---|---|---|---|---|---|
| 2016-05-09 | 3.728173 | -5.004847 | 1.980065 | 2.686836 | -1.551730 | 1.810620 | -2.581577 | 0.842732 |
| 2016-05-10 | 1.524033 | -3.643525 | 2.322809 | 3.942553 | 2.818465 | 2.949274 | 3.403136 | -0.521124 |
| 2016-05-11 | -0.352715 | 0.966715 | -3.196505 | -1.948132 | 1.145954 | -0.102820 | 0.153282 | -0.417418 |
| 2016-05-12 | -1.348569 | 1.987818 | -1.073890 | 0.019679 | -1.318597 | 1.555168 | 0.599463 | 0.360419 |
| 2016-05-13 | -1.188771 | 1.940679 | -1.090539 | -0.158189 | -0.907717 | 1.253296 | 0.649502 | 0.626345 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2020-03-02 | -3.008902 | 4.285379 | -0.289713 | -1.989447 | -3.866660 | 1.850734 | -3.256944 | 3.326776 |
| 2020-03-03 | -1.684865 | 2.901418 | -1.215915 | 0.628505 | -1.353135 | 2.346419 | 1.911456 | 1.164642 |
| 2020-03-04 | -0.012789 | 0.462547 | 1.147512 | 0.654661 | -0.337346 | -1.517129 | 1.040698 | -0.432503 |
| 2020-03-05 | -2.385096 | 2.855828 | 0.704913 | -1.086289 | -1.396062 | 1.254561 | -1.334015 | 2.461928 |


### The target
The target variable contains the values 0, 1 and 2, because we asked for three classes in the **make_dataset** function call. The classes are roughly evenly distributed, though we could have requested a different distribution through the `weights` parameter of `make_classification`.

In [4]:
# Count the samples per class, ordered by class label
target.value_counts().sort_index().to_frame('Target')

| | Target |
|---|---|
| 0 | 334 |
| 1 | 333 |
| 2 | 333 |
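To get an uneven class distribution instead, `make_classification` accepts a `weights` argument giving the approximate share of samples per class. The sketch below calls `make_classification` directly so that it stands on its own; the 70/20/10 split, the fixed `random_state` and `flip_y=0` (which switches off the random label flipping that is applied by default) are illustrative assumptions. Because `make_dataset` forwards extra keyword arguments, the same `weights` value could be passed to it as well.

```python
import numpy as np
from sklearn.datasets import make_classification

# weights: approximate proportion of samples assigned to each class.
# flip_y=0: do not randomly reassign any labels, so the proportions stay close to weights.
X, y = make_classification(n_samples=1000,
                           n_features=8,
                           n_informative=6,
                           n_redundant=2,
                           n_classes=3,
                           weights=[0.7, 0.2, 0.1],
                           flip_y=0,
                           random_state=0)

# Number of samples in class 0, 1 and 2.
counts = np.bincount(y)
print(counts)
```

The counts should come out at roughly 70%, 20% and 10% of the 1000 samples, which is a convenient way to test how a classifier copes with imbalanced classes.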
