# "Creating a Dataset for Classification"
> "In this article we learn how to create datasets for machine learning classification"
- toc: false
- branch: master
- badges: true
- comments: false
- categories: [machinelearning]
- metadata_key1: metadata_value1
- metadata_key2: metadata_value2

Machine learning algorithms are fairly easy to use if you have data that matches exactly what the algorithms are designed for. To get going with an ML project with generic data, you could generate the data in the exact shape for your ML project. For example, if you want to do classification, then you will need a target columns with the predicted classes, and you will need a dataset with the features that could predict these classes. 

If your machine leaning project is in Python, then the best way to start is with **scikit-learn**. This easy to use yet powerful library also has convenience functiosn to generate test data, one of which is called **make_classification**.

### make_classification
**Scikit-learn** has a utility function to generate test data for classification called `make_classification`. With it you can generate a numpy array with *features* along with another array with *predicted classes*. This function is in the datasets package so to use it you would do

`from sklearn.datasets import make_classification`

`data, target = make_classification(...)`


and you will get the data and the target with some relationship between the two sufficient to do some machine learning. Here is an example

In [16]:
random_state=2

In [17]:
from sklearn.datasets import make_classification

data, target = make_classification(n_features=12, n_samples=100, random_state=random_state)

The **data** array is a numpy array of shape (**n_samples, n_features**)

In [18]:
data[:2], data.shape

(array([[ 0.65755125, -0.73564052, -0.25712497,  2.16246241, -0.46323032,
          0.50442818, -0.1369783 , -2.42825346, -0.49282081, -0.64920516,
          0.27511225, -0.45730883],
        [ 0.54894656, -0.07663956, -0.08224538, -0.15972413,  1.70937948,
         -1.82138864, -0.30466658, -2.02559359,  1.93662278, -1.31756727,
         -1.25432739, -1.71406741]]),
 (100, 12))

The **target** is a numpy array of shape (**n_samples**). The values will be 0 or 1 because by default **n_classes** is 2

In [19]:
target, target.shape

(array([0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1,
        0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1,
        0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0,
        0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0,
        1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0]),
 (100,))

### Training a RandomForestClassifier

Now that we have the data, we can train a classifier and use it to predict a label.

In [20]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(max_depth=2, random_state=0)
clf.fit(data, target)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=2, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=0, verbose=0,
                       warm_start=False)

### Predicting an output label
After training we predict by passing in a data array. For simplicity we just choose one of the data values and we get a predicted label. This is not groundbreaking machine learning, but it shows how to quickly get a dataset that you can use to try different machine learning algorithms.

In [21]:
clf.predict([data[90]])

array([1])

## Create a classification dataframe

The downside of the `make_classification` function is that it create numpy arrays without meaningful feature names. The bigger problem is that the features have different aspects - some are informative, others are redundant, or simply plain noise, without any indicator of which is which. To improve on it you can create dataframes, which allow for meaningful names that help in analysis and explainability.

In [22]:
from sklearn.datasets import make_classification
import pandas as pd

def make_dataset(n_samples=1000, n_features=10, n_informative=6, n_redundant=2, n_classes=2, weights=None, random_state=2):
    data, target = make_classification(n_features=n_features,
                                     n_informative=n_informative, 
                                     n_redundant=n_redundant, 
                                     n_samples=n_samples, 
                                     n_classes=n_classes,
                                     weights=weights,
                                     random_state=random_state)
    index = pd.date_range(periods=n_samples, freq=pd.tseries.offsets.BDay(), end=pd.datetime.today()).normalize()
    columns = [f'Info{i}'  for i in range(n_informative)] + \
              [f'Redun{i}' for i in range(n_redundant)] + \
              [f'Noise{i}' for i in range(n_features - (n_informative + n_redundant))]
    df = pd.DataFrame(data, columns=columns, index=index)
    target = pd.Series(target, index=index)
    return df, target
    
data, target = make_dataset(1000, n_features=8, n_informative=4, n_classes=3, random_state=random_state)

### The features
The generated dataframe contains 8 columns.

- **Informative features** These are informative features, meaning features that have a predictive relationship with the target
- **Redundant features** These features are generated as linear random combinations of the informative features
- **Noise** These are just noise, and should have no predictive power

In [23]:
data

Unnamed: 0,Info0,Info1,Info2,Info3,Redun0,Redun1,Noise0,Noise1
2016-05-10,1.830016,0.113572,1.352016,-0.212425,1.106181,-1.557893,2.185757,0.497971
2016-05-11,-0.773268,2.548761,-2.969503,-1.338087,-1.386198,-1.037654,-1.107727,-0.421909
2016-05-12,1.731988,-0.010095,-0.391647,-0.483542,-1.066762,-0.628849,-0.474407,0.013515
2016-05-13,0.113050,-0.650851,-0.190711,-0.927015,-0.316462,1.270045,1.443103,-0.480245
2016-05-16,-2.019368,0.796679,-0.333953,0.723699,0.639954,-0.055370,0.020932,1.074855
...,...,...,...,...,...,...,...,...
2020-03-03,-0.713723,0.134416,-1.018359,0.638488,-1.182157,0.270404,-0.638665,-0.930648
2020-03-04,0.468723,-0.103892,0.633512,1.475857,-0.051929,-1.100790,-0.282572,-0.311341
2020-03-05,-0.109507,-0.733322,3.064653,3.040285,2.312181,-1.581513,-1.376304,1.056661
2020-03-06,0.362189,2.195827,-2.392010,-1.565179,-1.258144,-1.344796,-0.351114,0.633075


### The target
The target variable contains the values 0,1,2 - three classes since we specified three classes in the **make_dataset** function call. These are roughly evenly distributed, though wecould have specified a different distribution of values.

In [24]:
pd.DataFrame(target, columns=['Target']).Target.value_counts().to_frame().sort_index()

Unnamed: 0,Target
0,334
1,333
2,333


### Redundant Variables
If we plot the redundant variables we can see that it is a linear relationship to the informative variables. They can be safely dropped in a machine learning model, or otherwise handled in a special way. Of course, with real empirical data, you would not necessarily know that beforehand but would learn it during data exploration.

In [28]:
import altair as alt

chart = alt.Chart(data).mark_circle().encode(
    x='Info1',
    y='Redun1'
).properties(
    title='Informative vs Redundant Variables'
)
chart