# 02.A: Working with Datasets

It's important to have a clear and sensible way of representing the datasets that learning algorithms train on. A dataset consists of $n$ examples, and each example consists of $m$ features, so $m$ is the number of dimensions of the dataset. In supervised learning, the dataset is a matrix like this:

$\boldsymbol{D} =\left[\begin{array}{cccccc} 
  x_1^{(1)} & x_2^{(1)} & x_3^{(1)} & \cdots & x_m^{(1)} & y^{(1)}\\ 
  x_1^{(2)} & x_2^{(2)} & x_3^{(2)} & \cdots & x_m^{(2)} & y^{(2)}\\
  x_1^{(3)} & x_2^{(3)} & x_3^{(3)} & \cdots & x_m^{(3)} & y^{(3)}\\
  \vdots    & \vdots    & \vdots    & \ddots & \vdots & \vdots \\
  x_1^{(n)} & x_2^{(n)} & x_3^{(n)} & \cdots & x_m^{(n)} & y^{(n)}
\end{array}\right]$

Each row of this matrix is an example consisting of the $m$ features plus the target label as the last element in the row. In other words, $\boldsymbol{D}$ consists of both the input matrix $\boldsymbol{X}$ and the target vector $\boldsymbol{y}$, where:

$\boldsymbol{X} =\left[\begin{array}{ccccc} 
  x_1^{(1)} & x_2^{(1)} & x_3^{(1)} & \cdots & x_m^{(1)}\\ 
  x_1^{(2)} & x_2^{(2)} & x_3^{(2)} & \cdots & x_m^{(2)}\\
  x_1^{(3)} & x_2^{(3)} & x_3^{(3)} & \cdots & x_m^{(3)}\\
  \vdots    & \vdots    & \vdots    & \ddots & \vdots \\
  x_1^{(n)} & x_2^{(n)} & x_3^{(n)} & \cdots & x_m^{(n)}
\end{array}\right]$

and

$\boldsymbol{y} =\left[\begin{array}{c} 
  y^{(1)}\\ 
  y^{(2)}\\
  y^{(3)}\\
  \vdots \\
  y^{(n)}
\end{array}\right]$
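
As a small illustration of this decomposition, the sketch below builds a toy matrix in NumPy and slices it into inputs and target. The array `D` and its values are made up for the example; only the slicing pattern matters.

```python
import numpy as np

# Toy supervised dataset: n = 4 examples, m = 2 features, target in the last column.
# The numbers are made up for illustration.
D = np.array([
    [5.0, 7.0, 1.0],
    [3.0, 7.0, 0.0],
    [5.0, 8.0, 1.0],
    [3.0, 5.0, 1.0],
])

X = D[:, :-1]   # every column except the last: the n by m input matrix
y = D[:, -1:]   # the last column, kept two-dimensional: an n by 1 target vector

n, m = X.shape
print(n, m)      # 4 2
print(y.shape)   # (4, 1)
```

Slicing with `-1:` rather than `-1` keeps the target as a column vector, which matches the $n \times 1$ shape of $\boldsymbol{y}$ above.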

For unsupervised learning, $\boldsymbol{D}$ is the same as $\boldsymbol{X}$.

Another name for $\boldsymbol{X}$ is `inputs`, and another name for $\boldsymbol{y}$ is `target`. In addition, features have names. Let's put all of this together in a class named `DataSet`, built on a pandas DataFrame, that we will be using in subsequent weeks.

In [1]:
import numpy as np
import pandas as pd

class DataSet:
    """
    A dataset for a machine learning problem. A dataset d has the following properties:
    d.examples   A pandas DataFrame of examples. Each row contains both the features and the target.
    d.features   An array of the feature names.
    d.target     An n by 1 array containing the values of y
    d.y          Same as d.target
    d.inputs     An n by m array containing the values of X
    d.X          Same as d.inputs
    d.N          Number of examples
    d.M          Number of dimensions
    d.name       The name of the data set (for output display only)
    
    """
    def __init__(self, data, features=None, y=None, name=None):
        """
        If y is True, the data contains the target as the last column
        If y is None or False, No target is available
        Else y is an array to be added as the last column of the examples  dataframe
        """
        self.__name = name
        if isinstance(data, pd.DataFrame):
            self.__examples = data
        else:
            self.__examples = pd.DataFrame(data, columns=features)
            
        if y is True:
            self.__examples.columns = [*self.__examples.columns[:-1], 'y']
        elif y is not False and y is not None:
            self.__examples['y'] = y
    
    @property
    def examples(self):
        return self.__examples
    
    @property
    def features(self):
        return self.__examples.columns[:-1].values
    
    @property
    def target(self):
        if 'y' in self.__examples.columns:
            return self.__examples['y'].values.reshape(self.N, 1)
        return None
    
    @property
    def y(self):
        return self.target
    
    @property
    def inputs(self):
        return self.__examples.iloc[:, :-1].values
    
    @property
    def X(self):
        return self.inputs
    
    @property
    def name(self):
        return self.__name
    
    @property
    def N(self):
        return self.__examples.shape[0]
    
    @property
    def M(self):
        return self.inputs.shape[1]
    
    def shuffled(self, random_state=None):
        rgen = np.random.RandomState(random_state)
        indexes = np.arange(self.N)
        rgen.shuffle(indexes)
        return DataSet(self.__examples.iloc[indexes])
    
    def train_test_split(self, start=0, end=None, test_portion=None, shuffle=False, random_state=None):
        """
        Splits the dataset into a training set and a test set. 
        If test_portion is specified, return that portion of the dataset as test 
        and the rest as training. 
        Otherwise, return the examples between start and end as test and the 
        rest as training.
        """
        indexes = np.arange(self.N)
        if shuffle is True:
            rgen = np.random.RandomState(random_state)
            rgen.shuffle(indexes)

        if test_portion is None:
            end = end or self.N
        else:
            if not isinstance(test_portion, float) or test_portion < 0 or test_portion > 1:
                raise TypeError("Only fractions between ]0,1[ are allowed")

            start = self.N - int(self.N * test_portion)
            end = self.N

        test = DataSet(self.examples.iloc[indexes[range(start, end)]])
        train = DataSet(pd.concat([self.examples.iloc[indexes[range(start)]], 
                                      self.examples.iloc[indexes[range(end, self.N)]]], axis=0))    
        return train, test
    
    def __repr__(self):
        return repr(self.examples)

This class has several properties: `name` (informational), `features` (the names of the features), `inputs`, `target`, `X`, `y`, `N` (number of examples), and `M` (number of dimensions).

A DataSet object is created from a NumPy array or a pandas DataFrame. If it is given a NumPy array, the class wraps it in a DataFrame. The DataFrame storing the data can be retrieved using the `examples` property.
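
The construction path for a NumPy array whose last column is the target can be sketched with plain pandas calls, outside the class. This only approximates what the constructor does when `y=True`; the column name `label` and the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: two features followed by the target in the last column.
raw = np.array([
    [1.0, 2.0, 0.0],
    [3.0, 4.0, 1.0],
])

# Wrap the array in a DataFrame, as the class does for non-DataFrame input.
frame = pd.DataFrame(raw, columns=["x1", "x2", "label"])

# With y=True the last column is treated as the target and renamed to 'y'.
frame.columns = [*frame.columns[:-1], "y"]

inputs = frame.iloc[:, :-1].values                  # n by m array of features
target = frame["y"].values.reshape(len(frame), 1)   # n by 1 array of targets

print(list(frame.columns))   # ['x1', 'x2', 'y']
print(inputs.shape)          # (2, 2)
print(target.shape)          # (2, 1)
```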


Let's test this class by creating a $27 \times 3$ input matrix and a separate $y$ column.

In [2]:
ds = DataSet(np.array([
    np.random.randint(2,9, 27),
    np.random.randint(1,9, 27),
    np.random.normal(loc=10, scale=2, size=27)
]).T, features=['x1', 'x2', 'x1'], y=np.random.randint(0,2, 27), name="Sample Data")

ds

     x1   x2         x1  y
0   5.0  7.0  12.521934  1
1   3.0  7.0   6.863270  0
2   5.0  8.0   7.969996  1
3   3.0  5.0   7.061669  1
4   8.0  6.0   9.910670  0
5   8.0  7.0  12.187182  0
6   2.0  1.0   7.563446  0
7   2.0  7.0   8.178925  1
8   7.0  5.0   6.765816  0
9   8.0  1.0   9.737150  1
10  2.0  2.0  12.123582  0
11  3.0  7.0  10.587701  0
12  8.0  8.0   8.891404  1
13  5.0  8.0   7.942603  1
14  3.0  5.0   8.848434  1
15  8.0  2.0  10.876815  1
16  7.0  5.0   9.542804  1
17  5.0  8.0   8.371472  1
18  6.0  6.0  10.956047  1
19  6.0  7.0   9.428373  1
20  6.0  7.0  11.381269  1
21  2.0  2.0  12.965070  0
22  8.0  8.0   7.093377  1
23  5.0  4.0  10.817222  0
24  5.0  5.0  11.674168  1
25  5.0  7.0   9.463396  0
26  7.0  5.0   9.252102  0

In [3]:
ds.examples

     x1   x2         x1  y
0   5.0  7.0  12.521934  1
1   3.0  7.0   6.863270  0
2   5.0  8.0   7.969996  1
3   3.0  5.0   7.061669  1
4   8.0  6.0   9.910670  0
5   8.0  7.0  12.187182  0
6   2.0  1.0   7.563446  0
7   2.0  7.0   8.178925  1
8   7.0  5.0   6.765816  0
9   8.0  1.0   9.737150  1


In [4]:
ds.features

array(['x1', 'x2', 'x1'], dtype=object)

In [5]:
ds.target 

array([[1],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0]])

In [6]:
ds.y 

array([[1],
       [0],
       [1],
       [1],
       [0],
       [0],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [1],
       [0],
       [1],
       [0],
       [1],
       [0],
       [0]])

In [7]:
ds.inputs

array([[ 5.        ,  7.        , 12.52193406],
       [ 3.        ,  7.        ,  6.86327032],
       [ 5.        ,  8.        ,  7.9699956 ],
       [ 3.        ,  5.        ,  7.06166915],
       [ 8.        ,  6.        ,  9.91066996],
       [ 8.        ,  7.        , 12.18718221],
       [ 2.        ,  1.        ,  7.56344594],
       [ 2.        ,  7.        ,  8.17892499],
       [ 7.        ,  5.        ,  6.76581606],
       [ 8.        ,  1.        ,  9.73715042],
       [ 2.        ,  2.        , 12.12358222],
       [ 3.        ,  7.        , 10.58770146],
       [ 8.        ,  8.        ,  8.89140353],
       [ 5.        ,  8.        ,  7.94260289],
       [ 3.        ,  5.        ,  8.84843407],
       [ 8.        ,  2.        , 10.87681468],
       [ 7.        ,  5.        ,  9.54280434],
       [ 5.        ,  8.        ,  8.37147199],
       [ 6.        ,  6.        , 10.95604715],
       [ 6.        ,  7.        ,  9.42837285],
       [ 6.        ,  7.        , 11.381

In [8]:
ds.X

array([[ 5.        ,  7.        , 12.52193406],
       [ 3.        ,  7.        ,  6.86327032],
       [ 5.        ,  8.        ,  7.9699956 ],
       [ 3.        ,  5.        ,  7.06166915],
       [ 8.        ,  6.        ,  9.91066996],
       [ 8.        ,  7.        , 12.18718221],
       [ 2.        ,  1.        ,  7.56344594],
       [ 2.        ,  7.        ,  8.17892499],
       [ 7.        ,  5.        ,  6.76581606],
       [ 8.        ,  1.        ,  9.73715042],
       [ 2.        ,  2.        , 12.12358222],
       [ 3.        ,  7.        , 10.58770146],
       [ 8.        ,  8.        ,  8.89140353],
       [ 5.        ,  8.        ,  7.94260289],
       [ 3.        ,  5.        ,  8.84843407],
       [ 8.        ,  2.        , 10.87681468],
       [ 7.        ,  5.        ,  9.54280434],
       [ 5.        ,  8.        ,  8.37147199],
       [ 6.        ,  6.        , 10.95604715],
       [ 6.        ,  7.        ,  9.42837285],
       [ 6.        ,  7.        , 11.381

In [9]:
ds.name

'Sample Data'

In [10]:
ds.N

27

In [11]:
ds.M

3

## Shuffling

The class above also provides a few useful methods. One of them shuffles the data, which we often do before training. It returns a new DataSet instance with the shuffled data. Here is how it is implemented:

```python
    ...
    def shuffled(self, random_state=None):
        rgen = np.random.RandomState(random_state)
        indexes = np.arange(self.N)
        rgen.shuffle(indexes)
        return DataSet(self.__examples.iloc[indexes])
   ...
```

Here is an example using this method.

In [12]:
ds = DataSet(np.array([
    np.random.randint(2,9, 27),
    np.random.randint(1,9, 27),
    np.random.normal(loc=10, scale=2, size=27)
]).T, features=['x1', 'x2', 'x1'], y=np.random.randint(0,2, 27), name="Sample Data")


ds.shuffled()

     x1   x2         x1  y
3   6.0  5.0   8.042373  0
20  7.0  8.0  11.942688  1
8   5.0  7.0  10.111466  1
15  4.0  5.0  11.623825  1
19  7.0  2.0   9.559170  0
18  8.0  8.0   7.555689  0
12  2.0  5.0  10.147800  1
17  2.0  1.0  10.380362  1
11  5.0  8.0  10.900830  1
5   7.0  2.0  11.708305  1
21  3.0  7.0  11.977027  0
6   5.0  3.0   9.844696  0
10  5.0  2.0  10.734684  1
2   2.0  8.0  11.263257  0
4   6.0  1.0   8.992177  0
23  6.0  2.0   9.728258  1
16  2.0  5.0  10.513307  0
9   4.0  8.0  12.364913  1
7   4.0  8.0   5.113000  1
14  8.0  6.0   9.667326  0
25  2.0  1.0  10.885947  1
1   5.0  5.0  13.883742  0
22  7.0  7.0  10.804425  1
0   5.0  8.0   6.996179  1
24  3.0  3.0  12.227242  0
13  4.0  1.0  12.579431  1
26  4.0  5.0   7.685479  1
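
The role of `random_state` can be checked in isolation. The sketch below repeats the index shuffle from `shuffled` in a standalone function (`shuffled_indexes` is a hypothetical helper, not part of the class) and shows that the same seed reproduces the same order.

```python
import numpy as np

def shuffled_indexes(n, random_state=None):
    """Hypothetical helper: shuffled copy of 0..n-1, mirroring the index shuffle in `shuffled`."""
    rgen = np.random.RandomState(random_state)
    indexes = np.arange(n)
    rgen.shuffle(indexes)   # shuffles in place
    return indexes

first = shuffled_indexes(27, random_state=17)
second = shuffled_indexes(27, random_state=17)

# Same seed, same order.
print(np.array_equal(first, second))                   # True
# The result is a permutation of 0..26, so no row is lost or duplicated.
print(np.array_equal(np.sort(first), np.arange(27)))   # True
```

Calling `ds.shuffled()` without a `random_state`, as in the cell above, gives a different order on each run.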

## Splitting a dataset into training and test datasets

Another useful method provided by the class above is `train_test_split`, which splits the dataset into a training set and a test set. Here is how it is implemented:

```python
    ...
    def train_test_split(self,start=0, end=None, test_portion=None, shuffle=False, random_state=None):
        """
        Splits the dataset into a training set and a test set. 
        If test_portion is specified, return that portion of the dataset as test 
        and the rest as training. 
        Otherwise, return the examples between start and end as test and the 
        rest as training.
        """
        indexes = np.arange(self.N)
        if shuffle is True:
            rgen = np.random.RandomState(random_state)
            rgen.shuffle(indexes)

        if test_portion is None:
            end = end or self.N
        else:
            if not isinstance(test_portion, float) or test_portion < 0 or test_portion > 1:
                raise TypeError("Only fractions between ]0,1[ are allowed")

            start = self.N - int(self.N * test_portion)
            end = self.N

        test = DataSet(self.examples.iloc[indexes[range(start, end)]])
        train = DataSet(pd.concat([self.examples.iloc[indexes[range(start)]], 
                                      self.examples.iloc[indexes[range(end, self.N)]]], axis=0))    
        return train, test
   ...
```

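If `test_portion` is not given, the method returns the examples from `start` up to (but not including) `end` as the test set and the rest of the data as the training set; `end` defaults to the number of examples. If `test_portion` is provided, the last `int(N * test_portion)` examples (in shuffled order, if shuffling is on) are returned as the test set and the rest as the training set. The `shuffle` parameter instructs the method to shuffle the data before splitting it, and `random_state` seeds that shuffle; it has no effect when `shuffle` is `False`. The method returns two `DataSet` instances: the training set and the test set.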

Here is an example using this method.

In [13]:
ds = DataSet(np.array([
    np.random.randint(2,9, 27),
    np.random.randint(1,9, 27),
    np.random.normal(loc=10, scale=2, size=27)
]).T, features=['x1', 'x2', 'x1'], y=np.random.randint(0,2, 27), name="Sample Data")


ta, te = ds.train_test_split(test_portion=.25, shuffle=False, random_state=17)
print('Training set = \n', ta)
print('Test set = \n', te)

Training set = 
      x1   x2         x1  y
0   4.0  6.0  13.414042  1
1   4.0  1.0  13.297801  1
2   8.0  7.0  11.358705  1
3   6.0  2.0   8.130017  0
4   2.0  3.0  10.681290  0
5   6.0  3.0   9.401849  0
6   8.0  6.0  10.146325  1
7   3.0  3.0  12.599809  1
8   2.0  6.0   9.520197  1
9   2.0  2.0  10.483874  0
10  8.0  2.0  10.654690  1
11  2.0  4.0  11.923819  0
12  6.0  3.0  10.520298  0
13  2.0  3.0   8.953693  1
14  3.0  3.0   7.586914  1
15  6.0  2.0  11.116069  1
16  2.0  8.0   8.228879  1
17  2.0  6.0   8.831084  0
18  7.0  4.0  12.866930  0
19  7.0  3.0   8.730410  0
20  7.0  3.0   8.669619  0
Test set = 
      x1   x2         x1  y
21  3.0  3.0  11.594605  1
22  6.0  5.0  11.566627  0
23  3.0  6.0  10.984172  0
24  7.0  1.0   9.279887  0
25  4.0  4.0   9.555594  0
26  6.0  2.0  10.432902  1
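
To make the index arithmetic behind the two modes concrete, here is a minimal sketch that works on row positions only, without shuffling. The function `split_indexes` is a hypothetical stand-in for the slicing done inside `train_test_split`, not part of the class.

```python
import numpy as np

def split_indexes(n, start=0, end=None, test_portion=None):
    """Hypothetical helper: return (train, test) row positions, without shuffling."""
    indexes = np.arange(n)
    if test_portion is None:
        end = end or n                      # default: test runs to the last row
    else:
        start = n - int(n * test_portion)   # test is the tail of the data
        end = n
    test = indexes[start:end]
    train = np.concatenate([indexes[:start], indexes[end:]])
    return train, test

# Explicit range: rows 5..9 become the test set, everything else is training.
train, test = split_indexes(27, start=5, end=10)
print(test)         # [5 6 7 8 9]
print(len(train))   # 22

# Portion: the last int(27 * 0.25) = 6 rows become the test set.
train, test = split_indexes(27, test_portion=0.25)
print(test)         # [21 22 23 24 25 26]
```

The second call reproduces the row labels 21 to 26 seen in the test set printed above.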


## Using this dataset class inside other notebooks

This class is part of the `mylib` library for this course, which is provided to you. Here is how to import this library:

In [14]:
import mylib as my

Once imported, one can use it like this:

In [15]:
ds = my.DataSet(np.array([
    np.random.randint(2,9, 27),
    np.random.randint(1,9, 27),
    np.random.normal(loc=10, scale=2, size=27)
]).T, features=['x1', 'x2', 'x1'], y=np.random.randint(0,2, 27), name="Sample Data")


ta, te = ds.train_test_split(test_portion=.25, shuffle=False, random_state=17)
print('Training set = \n', ta)
print('Test set = \n', te)

Training set = 
      x1   x2         x1  y
0   5.0  8.0  10.225063  0
1   5.0  1.0   9.839005  0
2   8.0  2.0   7.212959  1
3   6.0  2.0   8.383122  1
4   4.0  8.0  10.501762  1
5   8.0  4.0   5.195451  1
6   6.0  6.0  10.446948  0
7   2.0  4.0   8.683767  1
8   6.0  2.0   8.179429  1
9   8.0  6.0   7.097996  0
10  5.0  8.0  12.145517  0
11  8.0  2.0  10.794532  0
12  3.0  5.0   7.951338  1
13  6.0  6.0   8.465409  1
14  8.0  7.0  10.968606  1
15  4.0  8.0   7.879757  0
16  6.0  7.0   8.781603  0
17  4.0  2.0  10.635561  1
18  3.0  7.0   7.855425  0
19  2.0  2.0  14.112246  0
20  3.0  1.0  12.805732  1
Test set = 
      x1   x2         x1  y
21  4.0  5.0   8.836505  0
22  3.0  6.0  11.772730  0
23  7.0  1.0  11.797788  0
24  4.0  4.0   7.237477  0
25  3.0  6.0  10.698070  0
26  4.0  3.0  12.689613  1


## EXERCISE

Refactor the above DataSet class by adding a method named `train_validation_test_split` to it. This method should split the data into three sets: training, validation, and test. This method should receive a dictionary parameter named `portions` specifying how much of the data is in each set. For a 75%/15%/10% split, one can use the following portions parameter:

```python
portions={"training": .75, 'validation': .15, 'test': .10 }
```

The method should support the `shuffle` parameter as well. You may call the `train_test_split` method internally. Make sure to include a comment describing how your implementation of the method works. Test your method on the `ds` dataset above and show that it works.

In [16]:
import numpy as np
import pandas as pd

class DataSet:
    """
    A dataset for a machine learning problem. A dataset d has the following properties:
    d.examples   A pandas DataFrame of examples. Each row contains both the features and the target.
    d.features   An array of the feature names.
    d.target     An n by 1 array containing the values of y
    d.y          Same as d.target
    d.inputs     An n by m array containing the values of X
    d.X          Same as d.inputs
    d.N          Number of examples
    d.M          Number of dimensions
    d.name       The name of the data set (for output display only)
    
    """
    def __init__(self, data, features=None, y=None, name=None):
        """
        If y is True, the data contains the target as the last column
        If y is None or False, No target is available
        Else y is an array to be added as the last column of the examples  dataframe
        """
        self.__name = name
        if isinstance(data, pd.DataFrame):
            self.__examples = data
        else:
            self.__examples = pd.DataFrame(data, columns=features)
            
        if y is True:
            self.__examples.columns = [*self.__examples.columns[:-1], 'y']
        elif y is not False and y is not None:
            self.__examples['y'] = y
    
    @property
    def examples(self):
        return self.__examples
    
    @property
    def features(self):
        return self.__examples.columns[:-1].values
    
    @property
    def target(self):
        if 'y' in self.__examples.columns:
            return self.__examples['y'].values.reshape(self.N, 1)
        return None
    
    @property
    def y(self):
        return self.target
    
    @property
    def inputs(self):
        return self.__examples.iloc[:, :-1].values
    
    @property
    def X(self):
        return self.inputs
    
    @property
    def name(self):
        return self.__name
    
    @property
    def N(self):
        return self.__examples.shape[0]
    
    @property
    def M(self):
        return self.inputs.shape[1]
    
    def shuffled(self, random_state=None):
        rgen = np.random.RandomState(random_state)
        indexes = np.arange(self.N)
        rgen.shuffle(indexes)
        return DataSet(self.__examples.iloc[indexes])
    
    def train_test_split(self, start=0, end=None, test_portion=None, shuffle=False, random_state=None):
        """
        Splits the dataset into a training set and a test set. 
        If test_portion is specified, return that portion of the dataset as test 
        and the rest as training. 
        Otherwise, return the examples between start and end as test and the 
        rest as training.
        """
        indexes = np.arange(self.N)
        if shuffle is True:
            rgen = np.random.RandomState(random_state)
            rgen.shuffle(indexes)

        if test_portion is None:
            end = end or self.N
        else:
            if not isinstance(test_portion, float) or test_portion < 0 or test_portion > 1:
                raise TypeError("Only fractions between ]0,1[ are allowed")

            start = self.N - int(self.N * test_portion)
            end = self.N

        test = DataSet(self.examples.iloc[indexes[range(start, end)]])
        train = DataSet(pd.concat([self.examples.iloc[indexes[range(start)]], 
                                      self.examples.iloc[indexes[range(end, self.N)]]], axis=0))    
        return train, test
    
    def train_validation_test_split(self, portions=None, shuffle=False, random_state=None):
        """
        Approach: split the data in two, twice. The first split separates the training 
        portion from the non-training portion. The second split is performed on the 
        non-training portion to obtain the validation and test sets. 
        
        1. Set default portions to 60/20/20
            1a. The first split will be 60/40, the second will be 50/50
        2. Check if the user has passed in their own portions and set values accordingly
        3. Perform the first split on self, then split the non-training set (valTestSet)
        """
        
        valTestPortion = .4
        testPortion = .5
        
        if portions is not None:
            if not isinstance(portions, dict):
                raise TypeError("Portions must be a dictionary")
            elif not all(isinstance(k, str) for k in portions):
                raise TypeError("Keys must be strings")
            elif not all(isinstance(v, float) for v in portions.values()):
                raise TypeError("Values must be floats")
            
            valTestPortion = portions["validation"] + portions["test"]
            testPortion = portions["test"] / valTestPortion

        trainSet, valTestSet = self.train_test_split(test_portion=valTestPortion, shuffle=shuffle, random_state=random_state)
        valSet, testSet = valTestSet.train_test_split(test_portion=testPortion, shuffle=shuffle, random_state=random_state)
        
        return trainSet, valSet, testSet
    
    def __repr__(self):
        return repr(self.examples)

ds = DataSet(np.array([
    np.random.randint(2,9, 27),
    np.random.randint(1,9, 27),
    np.random.normal(loc=10, scale=2, size=27)
]).T, features=['x1', 'x2', 'x1'], y=np.random.randint(0,2, 27), name="Sample Data")


trainSet, valSet, testSet = ds.train_validation_test_split(portions={"training": .75, "validation": .15, "test": .10 }, shuffle=False, random_state=17)
print('Training set = \n', trainSet)
print('Validation set = \n', valSet)
print('Test set = \n', testSet)

Training set = 
      x1   x2         x1  y
0   7.0  6.0   9.717150  1
1   7.0  2.0  12.157076  1
2   2.0  4.0  13.716769  0
3   4.0  1.0  14.251145  0
4   3.0  3.0  10.138991  0
5   4.0  1.0  11.050259  0
6   3.0  1.0   9.324833  0
7   2.0  6.0  11.344301  1
8   4.0  5.0  11.402471  1
9   2.0  8.0   8.262311  0
10  7.0  7.0  10.694043  1
11  6.0  2.0  11.663684  0
12  7.0  4.0   9.380074  0
13  7.0  6.0   9.188347  1
14  4.0  6.0  10.087031  1
15  5.0  2.0  10.341870  0
16  2.0  3.0  10.111974  0
17  4.0  6.0   9.445857  1
18  2.0  5.0  10.697085  1
19  6.0  3.0  11.032611  0
20  2.0  7.0   8.670037  1
Validation set = 
      x1   x2         x1  y
21  2.0  6.0   8.574670  0
22  4.0  5.0  13.863405  0
23  5.0  4.0  10.242535  0
24  6.0  5.0   8.624695  0
Test set = 
      x1   x2         x1  y
25  4.0  3.0   8.299258  0
26  2.0  5.0  11.163730  0
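
The set sizes in this output follow from the two successive `int(...)` truncations in `train_test_split`. The sketch below reproduces that arithmetic on its own; `nested_split_sizes` is a hypothetical helper written for this check, not part of the class.

```python
def nested_split_sizes(n, portions):
    """Hypothetical helper: sizes produced by two successive portion-based splits."""
    val_test_portion = portions["validation"] + portions["test"]
    test_portion = portions["test"] / val_test_portion

    n_val_test = int(n * val_test_portion)    # rows leaving the training set
    n_test = int(n_val_test * test_portion)   # rows of those that go to test
    return n - n_val_test, n_val_test - n_test, n_test

sizes = nested_split_sizes(27, {"training": .75, "validation": .15, "test": .10})
print(sizes)   # (21, 4, 2)
```

With only 27 examples the truncation is visible: the realized split is roughly 78%/15%/7% rather than the requested 75%/15%/10%.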
