# Copernicus Tabular
### ```autotabular``` 
This notebook follows ```001``` and adds ```AutoTabularBunch```, which creates a ```TabularBunch``` without declaring the continuous (```cont```) and categorical (```cat```) variables: only the dependent variable is required.

We will be experimenting with a dataset that contains both ```cont``` and ```cat``` variables.

## VERSION Alpha: 0.1.0

In [13]:
#export
# dependencies & imports
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import pandas as pd
import numpy as np
import torch
from path import Path
from torch.utils.data import Dataset, DataLoader
from torch.autograd import Variable

import torch.nn as nn
import torch.nn.functional as F
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

In [14]:
import os

In [15]:
#export
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
if device.type == "cuda":
    torch.cuda.set_device(device)  # set_device only accepts CUDA devices

In [16]:
# root directory to all data
root = Path('./data')
(root/'ibm_attrition').listdir()
f = os.listdir(root/'ibm_attrition')[0]
data = pd.read_csv(f'{root}/ibm_attrition/{f}', low_memory=False)

In [8]:
data.head()

Unnamed: 0,Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,...,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
0,41,Yes,Travel_Rarely,1102,Sales,1,2,Life Sciences,1,1,...,1,80,0,8,0,1,6,4,0,5
1,49,No,Travel_Frequently,279,Research & Development,8,1,Life Sciences,1,2,...,4,80,1,10,3,3,10,7,1,7
2,37,Yes,Travel_Rarely,1373,Research & Development,2,2,Other,1,4,...,2,80,0,7,3,3,0,0,0,0
3,33,No,Travel_Frequently,1392,Research & Development,3,4,Life Sciences,1,5,...,3,80,0,8,3,3,8,7,3,0
4,27,No,Travel_Rarely,591,Research & Development,2,1,Medical,1,7,...,4,80,1,6,3,3,2,2,2,2


## ```TabularDataset``` 0.0.1
In this attempt we create a simple dataset class that takes in raw data, splits it into ```X``` and ```Y```, and creates separate ```x_train, x_val, y_train, y_val``` sets, which will individually contain ```X1, X2, Y```.

This class also performs all pre-processing: it removes and replaces NAs and applies label encoding to both the categorical variables and the dependent variable (a class map).
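
As a minimal sketch of the class map idea, the snippet below encodes a hypothetical target column and keeps the index-to-label mapping for inference-time interpretation. It uses ```pd.Categorical``` as a stand-in for ```LabelEncoder``` (both assign codes to the sorted unique labels); the values and variable names are illustrative assumptions, not taken from the notebook's data.

```python
import pandas as pd

# Hypothetical target column (illustrative values only)
y_raw = pd.Series(["Yes", "No", "No", "Yes", "No"])

# Sorted unique labels receive consecutive integer codes
cat = pd.Categorical(y_raw)
y_encoded = cat.codes                          # one integer code per row
class_map = dict(enumerate(cat.categories))    # index -> original label

# Decode integer predictions back to their labels
decoded = [class_map[int(i)] for i in y_encoded]
```

Storing ```class_map``` next to the encoded target would cover the "class index to class string mapping" item that the classes below still list as a to-do.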

In [17]:
#export
# 1.
class TabularBunch():
    """
    
    NOTE: Deprecated: Need to fix multi-layered categorical filtering
    
    df: <pandas df>
    categorical: <df[col]>: columns which represent all categorical (discrete) variables
    
    This class will turn all columns but categorical & dependent into continuous variables. IF there aren't any categorical (categorical==None) then we will treat all but dependent variables as continuous.
    
    NB: Perform some data analysis and exploration before calling this method. For later versions we will automate this full process.
    
    """
    def __init__(self, df, categorical=None, dependent=None, null_thresh=0., verbose=True):
        super().__init__()
        assert dependent # needs dependent variable to work
        self.df = df.copy()
        self.categorical = categorical
        self.verbose = verbose
        
        # cleaning df <OPTIONAL>: removing null values from certain threshold. 
        if null_thresh > 0.:
            # this will clear our categorical list: as we have to clean some columns
            self.categorical = self.purge_nulls_(null_thresh)
            
        # splitting X, y: Initial split | no cat/cont split
        self.Y = self.df[dependent] # all rows should have a target | dependent variable
        X = self.df.drop(columns=dependent, axis=1)
            
        # creating classes
        self.classes = set(self.Y)
        ### TO DO:
        ### Create class index to class string mapping
        ### This will help with any confusion and inference level interpretation

        
        # filling the rest of the null values: performed just in case
        X = self.fill_na_(X)
            
        # Splitting X into: x_cont, x_cat
        # x_cont: continuous variables
        # x_cat: categorical (discrete variables which will go into embeddings)
        if categorical is not None: 
            self.X_cat = X[self.categorical].copy()
            self.X_cont = X.drop(columns=self.categorical, axis=1).copy()
            
        else:
            self.X_cat = None
            self.X_cont = X.copy() # do nothing: we will use this placeholder for now
            
        # Turning our categorical dataset: X_cat into LabelEncodings
        # This will turn cat1 -> 0, cat2 -> 1, for each cat within
        # The category columns
        self.labelencoding_()
                
        # Creating our embedding dictionaries
        # this will help with forming nn.Embeddings
        # and will also help with creating our datasets
        if categorical:
            self.emb_c = {n: len(col.cat.categories) for n,col in self.X_cat.items()}
            self.emb_sz = [(c, min(50, (c+1)//2)) for _,c  in self.emb_c.items()]
            self.emb_cols = self.categorical
        
        # our X, Y 
        # this will be used to feed into our dataset class
        if self.X_cat is not None:
            if len(self.X_cont.columns) == 0:
                self.X = self.X_cat
            else:
                self.X = pd.concat([self.X_cat, self.X_cont], axis=1)
        else:
            self.X = self.X_cont
            
        if self.X_cat is not None and self.X_cont is not None:
            self.X_state = 'both'
        elif self.X_cat is not None and self.X_cont is None:
            self.X_state = 'catonly'
        elif self.X_cat is None and self.X_cont is not None:
            self.X_state = 'contonly'
        
        # Cleaning up our attributes: Don't need to store everything
        del self.X_cat
        del self.X_cont
        del self.categorical
        del self.verbose
        del self.df
        
        if verbose: print('Finished!')
            
    def purge_nulls_(self, null_thresh):
        """Will remove all columns with null values exceeding our threshold"""
        if self.verbose: print(f'Performing null purge. Null threshold: {null_thresh}...')
        
        nt = int(len(self.df) * null_thresh)
        
        for col in self.df.columns:
            if self.df[col].isnull().sum() > nt:
                self.df.drop(col, axis=1, inplace=True)
                
        cols = list(self.df.columns)
        categorical = self.categories_left_(cols) # removing categories that are missing
        return categorical
                
    def fill_na_(self, X):
        """Will perform a quick processing of NA values."""
        if self.verbose: print(f'Performing NaN Replacement...')
        for col in X.columns:
            if X.dtypes[col] == "object":
                X[col] = X[col].fillna("NA")
            else:
                X[col] = X[col].fillna(0)
        return X
        
    
    def labelencoding_(self):
        if self.verbose: print(f'Performing label encoding operation. This may take a few seconds...')
        if self.X_cat is not None:
            for col in self.X_cat.columns:
                le = LabelEncoder()
                self.X_cat[col] = le.fit_transform(self.X_cat[col])
                self.X_cat[col] = self.X_cat[col].astype('category')
                
        # encoding our y
        le = LabelEncoder()
        self.Y = le.fit_transform(self.Y)
                
    def categories_left_(self, cols):
        """
        Returns the list of categorical columns that are still present after purging the high-null columns.
        """
        if self.categorical is None:
            return None
        # keep only the categorical columns that survived the purge
        return [col for col in self.categorical if col in cols]
    
    def get_train_test_(self, test_size=0.1):
        """
        This get_train_test_ method appends x_train, x_val, y_train, y_val to the bunch object using train_test_split from the scikit-learn library.

        This will allow for everything to be kept in a single bunch object which will feed into the Learner class and Dataset class
        """
        self.x_train, self.x_val, self.y_train, self.y_val = train_test_split(self.X,self.Y, test_size=test_size, random_state=42)
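
The null handling in the class above (drop columns over a null threshold, then fill what remains) can be sketched on its own as follows. This is a hedged, standalone illustration: ```purge_and_fill``` and the toy frame are hypothetical names and data, and the numeric check uses ```pd.api.types.is_numeric_dtype``` instead of comparing against the ```object``` dtype.

```python
import numpy as np
import pandas as pd

def purge_and_fill(df, null_thresh=0.5):
    """Drop columns whose null count exceeds null_thresh * len(df), then fill the rest."""
    limit = int(len(df) * null_thresh)
    keep = [c for c in df.columns if df[c].isnull().sum() <= limit]
    out = df[keep].copy()
    for col in out.columns:
        # numeric columns are filled with 0, everything else with the label "NA"
        fill = 0 if pd.api.types.is_numeric_dtype(out[col]) else "NA"
        out[col] = out[col].fillna(fill)
    return out

# Toy data (illustrative only)
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "city": ["a", None, "b", "a"],
    "age": [30.0, np.nan, 41.0, 25.0],
})
clean = purge_and_fill(toy, null_thresh=0.5)
```

Filling numeric gaps with 0 is the simplest option; it can distort features where 0 is a meaningful value, which is presumably what the planned "smart" fill is meant to address.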

In [30]:
#export
# 1b.
class AutoTabularBunch():
    """
    
    df: <pandas df>
    dependent: <str>: name of the dependent (target) column; required
    problem_type: <str>: 'classification' (default) or 'regression'
    purge_: <string | condition>: determines how to handle null values. 'null_thresh' (default) drops columns whose null count exceeds null_thresh, 'aggressive' drops rows where the dependent variable is null, and 'smart' will fill NAs with appropriate values using a smart fill algorithm (coming soon)
        We cannot feed NaNs into our Learner object (tabularlearner | NN)
    
    Categorical columns are inferred by set_categorical; every other column except the dependent is treated as continuous. If no categorical columns are found, all but the dependent variable are treated as continuous.
    
    NB: Perform some data analysis and exploration before calling this method. For later versions we will automate this full process.

    
    """
    def __init__(self, df, dependent=None, problem_type=None, null_thresh=.1, cat_thresh=0.9, purge_=None, verbose=True):
        super().__init__()
        # init declarations
        self.problem_type = 'classification' if problem_type is None else problem_type
        purge = 'null_thresh' if purge_ is None else purge_
        assert dependent # needs dependent variable to work
        assert self.problem_type == 'classification' or self.problem_type == 'regression'
        assert purge == 'aggressive' or purge == 'null_thresh' or purge == "smart"
        self.df = df.copy()
        self.verbose = verbose
        self.dependent = dependent
        
        # cleaning df <OPTIONAL>: removing null values from certain threshold.
        if purge == 'null_thresh':
            if null_thresh > 0.:
                # drop columns whose null count exceeds the threshold
                self.purge_nulls_(null_thresh)
                
                if self.df[self.dependent].isnull().any():
                    raise ValueError(f'{self.dependent} column seems to have some missing values or NaN values. This will cause an error when training the model. Suggestion: Use aggressive purge_ type or pre-process the data to have no-nan values on this specific column')
        elif purge == 'aggressive':                        
            if verbose:
                print('Purging all rows where the dependent variable is null: this is necessary for training, otherwise use smart purge')
            mask = self.df[dependent].isnull()
            self.df = self.df[~mask]
        elif purge == "smart":
            print('coming soon: This will fillna with appropriate values')
            

        # splitting X, y: Initial split | no cat/cont split
        self.Y = self.df[dependent] # all rows should have a target | dependent variable
        X = self.df.drop(columns=dependent, axis=1)
        
        # Grabbing categorical columns (these are later cast to the 'category' dtype)
        self.categorical = self.set_categorical(cat_thresh, X)
            
        # creating classes
        self.classes = set(self.Y)
        self.nc = len(self.classes)
        ### TO DO:
        ### Create class index to class string mapping
        ### This will help with any confusion and inference level interpretation

        
        # filling the rest of the null values: performed just in case
        X = self.fill_na_(X)
            
        # Splitting X into: x_cont, x_cat
        # x_cont: continuous variables
        # x_cat: categorical (discrete variables which will go into embeddings)
        if self.categorical is not None: 
            self.X_cat = X[self.categorical].copy()
            self.X_cont = X.drop(columns=self.categorical, axis=1).copy()
            self.n_cont = len(self.X_cont.columns)
            
        else:
            self.X_cat = pd.DataFrame([])
            self.X_cont = X.copy() # do nothing: we will use this placeholder for now
            self.n_cont = None
            
        # Turning our categorical dataset: X_cat into LabelEncodings
        # This will turn cat1 -> 0, cat2 -> 1, for each cat within
        # The category columns
        self.labelencoding_()
                
        # Creating our embedding dictionaries
        # this will help with forming nn.Embeddings
        # and will also help with creating our datasets
        if self.categorical is not None:
            self.emb_c = {n: len(col.cat.categories) for n,col in self.X_cat.items()}
            self.emb_sz = [(c, min(50, (c+1)//2)) for _,c  in self.emb_c.items()]
            self.emb_cols = self.categorical
        
        # our X, Y 
        # this will be used to feed into our dataset class
        if self.X_cat is not None:
            if len(self.X_cont.columns) == 0:
                self.X = self.X_cat
            else:
                self.X = pd.concat([self.X_cat, self.X_cont], axis=1)
        else:
            self.X = self.X_cont
        
#         if self.X_cat is not None and self.X_cont is not None:
#             self.X_state = 'both'
#         elif self.X_cat is not None and self.X_cont is None:
#             self.X_state = 'catonly'
#         elif self.X_cat is None and self.X_cont is not None:
#             self.X_state = 'contonly'

        if self.X_cat.size and self.X_cont.size:
            self.X_state = 'both'
        elif self.X_cat.size and not self.X_cont.size:
            self.X_state = 'catonly'
        elif not self.X_cat.size and self.X_cont.size:
            self.X_state = 'contonly'
        
        # Cleaning up our attributes: Don't need to store everything
        del self.X_cat
        del self.X_cont
        del self.categorical
        del self.verbose
        del self.df
        
        if verbose: print('Finished!')
            
    def purge_nulls_(self, null_thresh):
        """Will remove all columns with null values exceeding our threshold"""
        if self.verbose: print(f'Performing null purge. Null threshold: {null_thresh}...')
        
        nt = int(len(self.df) * null_thresh)
        
        for col in self.df.columns:
            if self.df[col].isnull().sum() > nt and col != self.dependent:
                self.df.drop(col, axis=1, inplace=True)
                
    def fill_na_(self, X):
        """Will perform a quick processing of NA values."""
        if self.verbose: print(f'Performing NaN Replacement...')
        for col in X.columns:
            if X.dtypes[col] == "object":
                X[col] = X[col].fillna("NA")
            else:
                X[col] = X[col].fillna(0)
        return X
    
    def set_categorical(self, cat_thresh, X):
        """
        Will create our categorical columns based on our cat algorithm. This is the core of AutoTabularBunch. For messy datasets this may not work well and will require the use of TabularBunch
        """
        assert cat_thresh < 1. and cat_thresh > 0.
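        # NOTE: as written, (A & B) | B reduces to B, so this mask is equivalent to (X.dtypes == 'object') and cat_thresh does not change the result
        cats_mask = ((X.nunique() < int(len(X) * cat_thresh)) & (X.dtypes == "object")) | (X.dtypes == 'object')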
        categorical = [cat for cat, b in cats_mask.items() if b]
        
        if len(categorical) == 0:
            categorical = None # purely continuous variables
        return categorical
        
    
    def labelencoding_(self):
        if self.verbose: print(f'Performing label encoding operation. This may take a few seconds...')
        if self.X_cat is not None:
            for col in self.X_cat.columns:
                le = LabelEncoder()
                self.X_cat[col] = le.fit_transform(self.X_cat[col])
                self.X_cat[col] = self.X_cat[col].astype('category')
        
        # encoding our y if classification
        if self.problem_type == 'classification':
            le = LabelEncoder()
            self.Y = le.fit_transform(self.Y)
    
    def get_train_test_(self, test_size=0.1):
        """
        This get_train_test_ method appends x_train, x_val, y_train, y_val to the bunch object using train_test_split from the scikit-learn library.

        This will allow for everything to be kept in a single bunch object which will feed into the Learner class and Dataset class
        """
        self.x_train, self.x_val, self.y_train, self.y_val = train_test_split(self.X,self.Y, test_size=test_size, random_state=42)
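
To make the two core ideas of ```AutoTabularBunch``` concrete, here is a hedged sketch of (a) picking categorical columns by dtype and cardinality and (b) deriving embedding sizes with the same ```min(50, (c+1)//2)``` rule used in the class. The helper names ```infer_categorical``` and ```embedding_sizes``` and the toy frame are hypothetical, and the cardinality filter is one possible reading of what ```cat_thresh``` is meant to do, not a copy of ```set_categorical```.

```python
import pandas as pd

def infer_categorical(X, cat_thresh=0.9):
    """Return non-numeric columns whose cardinality is below cat_thresh * len(X)."""
    limit = int(len(X) * cat_thresh)
    return [
        col for col in X.columns
        if not pd.api.types.is_numeric_dtype(X[col]) and X[col].nunique() < limit
    ]

def embedding_sizes(X, cat_cols):
    """(cardinality, embedding width) per column, with width = min(50, (c + 1) // 2)."""
    return [(c, min(50, (c + 1) // 2)) for c in (X[col].nunique() for col in cat_cols)]

# Toy data (illustrative only)
toy = pd.DataFrame({
    "department": ["Sales", "R&D", "R&D", "HR", "Sales", "R&D"],
    "travel": ["Rarely", "Often", "Rarely", "Rarely", "Often", "Rarely"],
    "daily_rate": [1102, 279, 1373, 1392, 591, 1005],
})
cat_cols = infer_categorical(toy)
emb_sz = embedding_sizes(toy, cat_cols)
```

Each tuple in ```emb_sz``` would map to one ```nn.Embedding(cardinality, width)``` layer, so a column with 3 categories gets a width of 2 and the width is capped at 50 for very high-cardinality columns.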

In [19]:
#export
# helper function to return categorical variables
def get_categorical_(dependent, df):
    """
    Helper function to return categorical variables
    TO DO: Make this more dynamic: what if there are continuous variables? At the moment this function treats every column except the dependent variable as categorical.
    """
    return list(df.columns[df.columns != dependent])

In [13]:
# 1. Defining our Dependent, Categorical, and Continuous Variables.
# In this case, we don't have continuous variables, so we will just feed dependent and categorical
dependent = 'is_female'
categorical = get_categorical_(dependent, data_sample)

In [14]:
# 2. Building tabularbunch
# Testing our class & methods
# in production: We will first have to determine what is
# categorical & continuous. 
# this will be part of the typical data exploration phase
tabularbunch = TabularBunch(df=data_sample, categorical=categorical, dependent=dependent)

Performing NaN Replacement...
Performing label encoding operation. This may take a few seconds...
Finished!


In [12]:
# #export
# # 2. 
# def get_train_test_(bunch, test_size=0.1):
#     """
#     This get_train_test_ function will take in a bunch object and append x_train, x_val, y_train, y_val using train_test_split from the scikit-learn library. 
    
#     This will allow for everything to be kept in a single bunch object which will feed into the Learner class and Dataset class
#     """
#     X, Y = bunch.X, bunch.Y
#     bunch.x_train, bunch.x_val, bunch.y_train, bunch.y_val = train_test_split(X,Y, test_size=test_size, random_state=42)

In [13]:
# # splitting our data
# # this will be appended to the bunch object
# get_train_test_(tabularbunch, test_size=0.1)

In [18]:
# Splitting our data
# We will simply call our get_train_test_ method 
# on our tabular bunch object
tabularbunch.get_train_test_(test_size=0.1)

In [19]:
# checking sizes to see if they match
tabularbunch.x_train.shape, tabularbunch.x_val.shape

((450, 1233), (50, 1233))

In [20]:
tabularbunch.y_train.shape, tabularbunch.y_val.shape

((450,), (50,))

In [53]:
#export
# 2.
# tabular dataset class
class TabularDataset(Dataset):
    def __init__(self, tabularbunch, ds_type='train'):
        """
        This will be our main working dataset class. We inherit from the Dataset class provided by PyTorch, which allows us to iterate through batches appropriately when passed to DataLoaders.
        
        If there are both cat & cont variables, this will simply create a split X1, X2 respectively. 
        """
        if ds_type == 'train':
            X = tabularbunch.x_train
            Y = tabularbunch.y_train
        else:
            X = tabularbunch.x_val
            Y = tabularbunch.y_val

        self.X_state = tabularbunch.X_state
        
        # initial split
        # we will split cat & cont variables | IF cont or cat doesn't exist
        # this will just create a single X dataset
        if self.X_state == 'both':
            self.X1 = X.loc[:,tabularbunch.emb_cols].copy().values.astype(np.int64)
            self.X2 = X.drop(columns=tabularbunch.emb_cols).copy().values.astype(np.float32)
        elif self.X_state == 'catonly':
            self.X1 = X.copy().values.astype(np.int64)
        elif self.X_state == 'contonly':
            self.X2 = X.copy().values.astype(np.float32)
        
        if tabularbunch.problem_type == 'classification':
            if isinstance(Y, pd.Series):
                self.y = Y.values.astype(np.int64)
            else:
                self.y = Y # already int value
        elif tabularbunch.problem_type == 'regression':
            if isinstance(Y, pd.Series):
                self.y = Y.values.astype(np.float32)
            else:
                self.y = Y.astype(np.float32) # convert to float32
        
        # NORMALIZING CONT dataset | if it exists (uses one global mean/std over all continuous features)
        if self.X_state == 'contonly' or self.X_state == 'both':
            self.X2 = (self.X2 - self.X2.mean()) / self.X2.std()
        
    def __len__(self): return len(self.y)
    def __getitem__(self, idx):
        """
        Returns items based on X_state: (X1, y) for categorical only, (X2, y) for continuous only, otherwise (X1, X2, y).
        """
        if self.X_state == 'catonly': return self.X1[idx], self.y[idx]
        elif self.X_state == 'contonly': return self.X2[idx], self.y[idx]
        else: return self.X1[idx], self.X2[idx], self.y[idx]
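
The dataset class above normalizes ```X2``` with a single mean and standard deviation computed over the whole array. A common alternative, shown here only as a sketch and not as what the class does, is to standardize each feature column separately so that columns on very different scales end up comparable. The function name ```standardize_columns``` and the ```eps``` guard are assumptions added for illustration.

```python
import numpy as np

def standardize_columns(x, eps=1e-7):
    """Standardize each column to zero mean and unit variance."""
    mean = x.mean(axis=0, keepdims=True)
    std = x.std(axis=0, keepdims=True)
    return (x - mean) / (std + eps)  # eps guards against constant columns

# Two features on very different scales (illustrative values)
x2 = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]], dtype=np.float32)
x2_norm = standardize_columns(x2)
```

In a train/validation setup the column statistics would normally be computed on the training split and reused for the validation split, which this sketch leaves out.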

In [138]:
# Grabbing our train and test datasets
# we will feed our tabularbunch object 
train_ds = TabularDataset(tabularbunch, ds_type='train')
valid_ds = TabularDataset(tabularbunch, ds_type='valid')

In [21]:
# helper function to grab train_ds and valid_ds in one grab
def get_datasets_(tabularbunch):
    """
    Will return train and valid datasets
    """
    train_ds = TabularDataset(tabularbunch, ds_type='train')
    valid_ds = TabularDataset(tabularbunch, ds_type='valid')
    return train_ds, valid_ds

In [145]:
# grabbing our train and valid ds using our helper function
train_ds, valid_ds = get_datasets_(tabularbunch)

In [146]:
# Now that we have our train and valid datasets we will create
# dataloaders using PyTorch's DataLoader class
# We have to give it a batch size (bs)
bs = 5
train_dl = DataLoader(train_ds, batch_size=bs, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size=bs)

In [148]:
# how to quickly check the dataset
train_dl.dataset.X1

array([[ 3, 19,  0, ...,  1,  0,  1],
       [ 1,  9,  5, ...,  1,  2,  1],
       [ 0,  2,  0, ...,  1,  3,  1],
       ...,
       [ 1,  8,  0, ...,  1,  2,  1],
       [ 0,  4,  0, ...,  1,  0,  1],
       [ 0,  3,  0, ...,  1,  5,  1]], dtype=int64)

In [54]:
#export
# 3.
# A
# Creating our DataBunch class which will create data objects that will 
# feed into our learner class 
# this will store everything regarding our data
# and will make the entire process of creating a tabular dataloader 
# much easier.
# simply put: this puts everything together in one place
### V0.1.0 ### 
class TabularData():
    def __init__(self, tabularbunch, test_size:float=0.1, bs:int=64, train_shuffle:bool=True,**kwargs):
        """
        Tabular Data class will host everything associated with tabular data. This will be the main class that our Learner class will use to both train and test on.
        
        TO DO: 
        Create save methods. The user | app should be able to save the dataloader in case of any error. This will allow them to load the post-processed dataset into a TabularData class which will hold both DataLoaders (train, valid)
        
        
        """
        self.tabularbunch = tabularbunch 
        self.tabularbunch.get_train_test_(test_size=test_size)
        
        # grabbing our datasets
        train_ds, valid_ds = self.get_datasets_()
        
        # setting our dataloaders
        self.train_dl = DataLoader(train_ds, batch_size=bs, shuffle=train_shuffle, **kwargs)
        self.valid_dl = DataLoader(valid_ds, batch_size=bs, **kwargs)
        
    def get_datasets_(self):
        """
        Returns the train and valid datasets
        """
        train_ds = TabularDataset(self.tabularbunch, ds_type='train')
        valid_ds = TabularDataset(self.tabularbunch, ds_type='valid')
        return train_ds, valid_ds

In [25]:
# Getting our data with our TabularData class
data = TabularData(tabularbunch)

# Full Workflow
We will now experiment with several datasets and go through the workflow. Lastly, we will feed our ```TabularData``` object into our ```Learner``` object which will do all the training and inference processing. 

We will cover ```Learner``` in the ```02``` notebook.

In [20]:
### 1. ### 
# The first thing we will do is grab our raw data and perform some exploration
# and any extra pre-processing. This can be from feature extraction
# to clairvoyance augmentation
df = pd.read_csv('./data/wids/train.csv', low_memory=False).iloc[:100] # grab first 100

In [21]:
### 2. ###
# Now we will simply grab variable names if they exist
# these will be: Categorical variables, Continuous variables, Dependent variable
# nb: Dependent is what we are either predicting or forecasting

# this dataset does not contain any continuous variables. We will show this in the
# next example
dependent = 'is_female'
categorical = get_categorical_(dependent, df) # custom function

In [22]:
### 3. ###
# Creating our tabularbunch
# This will be our core object which will perform some initial pre-processing 
# and will store embs for categorical. This will be fed into our TabularDataset
tabularbunch = TabularBunch(df=df, categorical=categorical, dependent=dependent)

Performing NaN Replacement...
Performing label encoding operation. This may take a few seconds...
Finished!


In [25]:
### 4. ###
# Creating our Data Object
# This will be the main component to process the data in our Learner object
data = TabularData(tabularbunch, bs=32, num_workers=1)

# Full Workflow W/ AutoTabularBunch

In [23]:
### 1. ### 
# Grabbing our raw data, setting it as a dataframe using pandas DF
root = Path('./data')
(root/'ibm_attrition').listdir()
f = os.listdir(root/'ibm_attrition')[0]
data = pd.read_csv(f'{root}/ibm_attrition/{f}', low_memory=False) # this will be a dragon 'drag-on'

In [24]:
### 2. ### 
# Creating our TabularBunch
# We will be using AutoTabularBunch, which infers the categorical and continuous variables automatically
dependent = 'Attrition'
tabularbunch = AutoTabularBunch(df=data, dependent=dependent, verbose=False)

In [25]:
### 3. ###
# Creating our Data Object
data = TabularData(tabularbunch, bs=32, num_workers=1)

# Full Workflow W/ AutoTabularBunch & Cont only

In [31]:
root = Path('./data')

In [32]:
### 1. ###
# uploading our data
f = (root/'eeg').listdir()[0]
data = pd.read_csv(f)

In [33]:
### 2. ###
# dropping column: Our UI will direct this 
# data.drop(columns='predefinedlabel', inplace=True)
dependent = 'user-definedlabeln'

# setting our tabularbunch object using Auto
tabularbunch = AutoTabularBunch(df=data, dependent=dependent, verbose=False)

In [35]:
### 3. ###
# Creating our data object: The core to our Learner Object
data = TabularData(tabularbunch, bs=32, num_workers=1)

In [67]:
data.train_dl.dataset.X2

array([[-0.28040695, -0.28043163, -0.2802835 , ..., -0.25300834,
        -0.2714949 , -0.28043574],
       [-0.2804193 , -0.28043163, -0.2802588 , ..., -0.22441238,
        -0.22833765, -0.28043574],
       [-0.28043163, -0.28043163, -0.2802794 , ..., -0.27439973,
        -0.2718734 , -0.28043574],
       ...,
       [-0.2804193 , -0.28043163, -0.2803658 , ..., -0.23607706,
        -0.2234537 , -0.28043574],
       [-0.28043574, -0.28041106, -0.2803658 , ..., -0.21134055,
        -0.26606783, -0.28043163],
       [-0.28041518, -0.28041106, -0.28023824, ..., -0.23924114,
        -0.23973076, -0.28043163]], dtype=float32)

## Full Workflow W/ AutoTabularBunch & Regression Problem Type
Not all problems are classification tasks; some are ```regression``` tasks, where we predict a continuous value as opposed to a fixed class.
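
In practice the difference shows up in how the target is prepared: ```TabularDataset``` casts classification targets to ```int64``` and regression targets to ```float32```. The sketch below illustrates that split in isolation; ```prepare_target``` is a hypothetical helper, and ```pd.Categorical``` stands in for the label encoding step.

```python
import numpy as np
import pandas as pd

def prepare_target(y, problem_type="classification"):
    """Cast the target to the dtype used for each problem type."""
    if problem_type == "classification":
        codes = pd.Categorical(y).codes        # class labels -> integer codes
        return np.asarray(codes, dtype=np.int64)
    if problem_type == "regression":
        return np.asarray(y, dtype=np.float32)  # continuous values stay numeric
    raise ValueError(f"unknown problem_type: {problem_type!r}")

# Illustrative targets
y_cls = prepare_target(["Yes", "No", "Yes"], "classification")
y_reg = prepare_target([0.76, 0.0, 0.001], "regression")
```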

In [43]:
### 1. ###
# loading our regression task dataset
f = (root/'kepler_exoplanet').listdir()[0]
data = pd.read_csv(f)

# dropping unnecessary column
data.drop(columns='rowid', axis=1, inplace=True)

In [44]:
### 2. ###
# setting our tabularbunch object
# our dependent is float
# therefore, this will be a 'regression' problem_type which we will define in the class init
dependent = 'koi_score'
tabularbunch = AutoTabularBunch(df=data, dependent=dependent, problem_type='regression', verbose=False)

In [46]:
### 3. ###
# Creating our data object
data = TabularData(tabularbunch, bs=32, num_workers=1)

In [49]:
# in production we cannot use this dataset as our y contains NaNs
data.train_dl.dataset.y

array([  nan, 0.   , 0.76 , ..., 0.   ,   nan, 0.001], dtype=float32)

### Aggressive purge
Setting ```purge_='aggressive'``` forces the issue by dropping every row whose dependent variable is null, so the targets no longer contain NaNs.
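
As a minimal sketch of what the aggressive purge in ```AutoTabularBunch``` amounts to, the snippet below keeps only the rows where the dependent variable is present. The column name ```koi_score``` comes from the example that follows; the values and the ```feature``` column are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with missing targets (illustrative values only)
toy = pd.DataFrame({
    "koi_score": [np.nan, 0.0, 0.76, np.nan, 0.001],
    "feature": [1.0, 2.0, np.nan, 4.0, 5.0],
})

# Keep only rows where the dependent variable is not null
purged = toy[toy["koi_score"].notnull()]
# Equivalent: toy.dropna(subset=["koi_score"])
```

Only the target column is checked, so nulls in the feature columns survive this step and are left to the later NaN replacement.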

In [29]:
### 1. ###
# loading our regression task dataset
f = (root/'kepler_exoplanet').listdir()[0]
data = pd.read_csv(f)

# dropping unnecessary column
data.drop(columns='rowid', axis=1, inplace=True)

In [32]:
### 2. ###
# setting our tabularbunch object
# our dependent is float
# therefore, this will be a 'regression' problem_type which we will define in the class init
dependent = 'koi_score'
tabularbunch = AutoTabularBunch(df=data, dependent=dependent, problem_type='regression', purge_='aggressive', verbose=False)

In [35]:
### 3. ###
# Creating our data object
data = TabularData(tabularbunch, bs=32, num_workers=1)

In [37]:
data.train_dl.dataset.y.shape

(7248,)

In [38]:
data.train_dl.dataset.X1.shape

(7248, 5)

In [39]:
data.train_dl.dataset.X2.shape

(7248, 43)

## Mushroom Test

In [55]:
f = (root/'mushroom').listdir()[0]
data = pd.read_csv(f)

In [56]:
tabularbunch = AutoTabularBunch(df=data, dependent='class', verbose=False)

In [57]:
tabularbunch.X_state

'catonly'

In [58]:
data = TabularData(tabularbunch, bs=32, num_workers=1)

## Export: V0.1.0 
* *Updated 1 12/24/2019* - Christmas Eve! 
* *Updated 2 12/26/2019* 

In [48]:
!python notebook2script.py copernicusTabular_002.ipynb

Converted copernicusTabular_002.ipynb to exp\nb_copernicusTabular.py
