# Creating Test Datasets From Source

### Append (Stack)

> Pass (no variance in in/out params) - no lengthening

> Pass (no variance in in/out params) - with lengthening

> Pass with warning (new columns)

> Pass with warning (cast child dataset to parent datatype)

> Fail (cast of child fails to meet parent specification)
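The append cases above can be sketched with plain pandas. This is only an illustration under stated assumptions: `append_check` is a hypothetical helper, not part of the project, and it treats "parent specification" as nothing more than the parent's column dtypes. The lengthening cases are not covered.

```python
import pandas as pd


def append_check(parent: pd.DataFrame, child: pd.DataFrame):
    """Stack child under parent and classify the outcome (hypothetical helper)."""
    warnings = []

    # Columns present only in the child count as "new columns".
    new_cols = [c for c in child.columns if c not in parent.columns]
    if new_cols:
        warnings.append(f'new columns: {new_cols}')

    # Cast the child's shared columns to the parent's dtypes.
    child = child.copy()
    for col in parent.columns.intersection(child.columns):
        if child[col].dtype != parent[col].dtype:
            try:
                child[col] = child[col].astype(parent[col].dtype)
            except (ValueError, TypeError) as err:
                # The child cannot meet the parent's dtype: fail the append.
                return 'fail', [f'cannot cast {col!r}: {err}'], None
            warnings.append(f'cast {col!r} to {parent[col].dtype}')

    stacked = pd.concat([parent, child], ignore_index=True)
    status = 'pass with warning' if warnings else 'pass'
    return status, warnings, stacked
```

Each call returns a status string, a list of notes, and the stacked frame (or `None` on failure), which maps onto the pass, pass with warning, and fail outcomes listed above.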


### Merge (History Preserving)

> Pass (no variance in in/out params) - no lengthening

> Pass (no variance in in/out params) - with lengthening

> Pass with warning (new columns)

> Pass with warning (cast child dataset to parent datatype)

> Fail (cast of child fails to meet parent specification)


### Replace

> Pass (Output == Child)

In [1]:
import pandas as pd

In [2]:
source = pd.read_csv('source_complete.csv', encoding='latin-1')
source.head()

Unnamed: 0,C_CRS_SBJ,C_TITLE,C_CRS_UNIQUE
0,ANTH,PERSPECTIVES ON RACE (DSS)(CI),ANTH3200TO1
1,CHEM,GENERAL CHEMISTRY I (BPS),CHEM1110KB1
2,COMD,ETHICS/CRTCL THNK INTERPRETERS,COMD5920LO1
3,ECN,APPLIED ECONOMETRICS (QI),ECN4330AO1
4,ENVS,HUMAN DIM WILDLIFE MGMT,ENVS4110NO1


### Case Class

A `Case` is a container that bundles the transforms needed to build each pass or fail case.

In [3]:
from case import Case

In [4]:
c = Case(source=source)
c.report

{'message': 'empty'}

### Transformers

Transformers do the heavy lifting: each one takes in data, applies a change, and returns the outputs a case needs.

In [5]:
from transformers import Transformer

In [6]:
t = Transformer(name='test')
t.transform(source)
t.report

{'message': 'empty'}
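The `transformers` module itself is not shown here. As a rough guess at its shape, the sketch below is a base class that would produce the reports seen in this notebook. The class name `SketchTransformer` and everything inside it are assumptions inferred from those outputs, not the project's actual code.

```python
class SketchTransformer:
    """Hypothetical stand-in for the base class in the `transformers` module."""

    def __init__(self, name):
        self.name = name
        self.data = None
        self.message = None

    def transform(self, data, message=None):
        # Subclasses do their work first, then pass the result and a message here.
        self.data = data
        self.message = message

    @property
    def report(self):
        # Without a message from a transform there is nothing to report yet.
        if self.message is None:
            return {'message': 'empty'}
        report = {'Transformer': self.name}
        # Include shape details only when a subclass has recorded them.
        for key in ('input_shape', 'output_shape'):
            if hasattr(self, key):
                report[key] = getattr(self, key)
        report['message'] = self.message
        return report
```

With this shape, a subclass only has to set `input_shape` and `output_shape` and call `super().transform(...)` with a message to get a full report.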

In [7]:
## GOAL UI
## How should the user interact with the transformers?

# case.split(n=, size=[float])  # Split dataset into parts
# case.add_col(n=, column=, specs={DATAGEN})  # Add column of given specification to dataset
# case.remove_col(n=, column=, index=)  # Remove column from dataset
# case.change_type(n=, column=, index=)  # Change datatype of column in dataset
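As a rough sketch of what the column operations in this wish list could reduce to, the functions below use only standard pandas calls. The function signatures are assumptions: `values` stands in for whatever a `DATAGEN` spec would generate, and the `n` and `index` arguments (presumably selecting a piece of the split data) are left out.

```python
import pandas as pd


def add_col(df, column, values):
    """Return a copy of df with a new column; values stands in for a generated spec."""
    return df.assign(**{column: values})


def remove_col(df, column):
    """Return a copy of df without the given column."""
    return df.drop(columns=[column])


def change_type(df, column, dtype):
    """Return a copy of df with one column cast to dtype."""
    return df.astype({column: dtype})


frame = pd.DataFrame({'a': [1, 2]})
wider = add_col(frame, 'b', [0.5, 1.5])
narrower = remove_col(wider, 'a')
recast = change_type(frame, 'a', 'float64')
```

Returning new frames rather than mutating in place keeps the source data intact, which matters when several cases are derived from the same source.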

#### Split

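Split the data into `n` consecutive pieces. `size` is an optional list of fractions, one per piece, that must sum to 1; when it is omitted, the pieces are equal in size.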

In [72]:
class SplitTransform(Transformer):
    def __init__(self, n=2, size=None, data: pd.DataFrame = None):
        super().__init__(name='split')
        self.transform(data, n, size)

    def transform(self, data, n, size):
        assert isinstance(data, pd.DataFrame), 'data must be a pandas DataFrame'
        if size is not None:
            # Compare with a tolerance: float fractions may not sum to exactly 1
            assert abs(sum(size) - 1) < 1e-9, 'Split fractions must sum to 1.0'
            assert len(size) == n, 'Must provide fractions for all splits'
        else:
            # Default to n equal pieces
            size = [1 / n for _ in range(n)]

        dlen = len(data)

        # Breakpoints are the start row of each piece; the last piece runs to
        # the end of the data, so the final fraction needs no breakpoint.
        breakpoints = [0]
        for fraction in size[:-1]:
            breakpoints.append(int(fraction * dlen + breakpoints[-1]))

        print('Breakpoints: ', breakpoints)

        # Slice the dataframe at the breakpoints
        new_data = {}
        for index, start in enumerate(breakpoints):
            if index < len(breakpoints) - 1:
                end = breakpoints[index + 1]
                print('{}:{}'.format(start, end))
            else:
                end = dlen
            new_data[index] = data.iloc[start:end]

        # Create report components
        self.input_shape = data.shape
        self.output_shape = {key: new_data[key].shape for key in new_data}
        message = f'Data successfully split into {len(new_data)} pieces'

        print(len(new_data))

        super().transform(data=new_data, message=message)

In [73]:
st = SplitTransform(n=4, size=[0.25, 0.5, 0.15, 0.1], data=source.copy())
st.report

Breakpoints:  [0, 26908, 80724, 96868]
0:26908
26908:80724
80724:96868
4


{'Transformer': 'split',
 'input_shape': (107633, 3),
 'output_shape': {0: (26908, 3), 1: (53816, 3), 2: (16144, 3), 3: (10765, 3)},
 'message': 'Data successfully split into 4 pieces'}

107633
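A quick sanity check for a split like this is that the pieces are disjoint, ordered, and reassemble into the original frame. The sketch below reuses the same breakpoint logic in a standalone helper so it runs without the `Transformer` base class. `split_frame` is a hypothetical name and the 20-row frame is made-up illustration data.

```python
import pandas as pd


def split_frame(data, size):
    """Slice data into consecutive pieces sized by the fractions in size."""
    starts = [0]
    for fraction in size[:-1]:
        starts.append(int(fraction * len(data) + starts[-1]))
    # Each piece ends where the next begins; the last ends with the data.
    ends = starts[1:] + [len(data)]
    return {i: data.iloc[s:e] for i, (s, e) in enumerate(zip(starts, ends))}


frame = pd.DataFrame({'x': range(20)})
pieces = split_frame(frame, [0.25, 0.5, 0.15, 0.1])

# Every row lands in exactly one piece, in the original order.
total_rows = sum(len(piece) for piece in pieces.values())
rebuilt = pd.concat(list(pieces.values()))
```

Because the breakpoints are truncated with `int`, any rounding remainder ends up in the last piece, which is why the row counts still add up to the input length.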