# Copernicus Tabular
In this notebook we begin developing the part of the Copernicus framework that deals with **Tabular** data. This includes: 
* DataSet Formation
* Model Formation

## VERSION Alpha: 0.1.0

In [1]:
#export
# dependencies & imports
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import pandas as pd
import numpy as np
import torch
from path import Path
from torch.utils.data import Dataset, DataLoader
from torch.autograd import Variable
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

In [2]:
#export
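device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# only pin the CUDA device when one is actually available (avoids an error on CPU-only machines)
if torch.cuda.is_available(): torch.cuda.set_device(0)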

In [7]:
# grabbing dummy data
path = Path('./data/wids')
data = pd.read_csv(path/'train.csv', low_memory=False)

In [8]:
data.head()

Unnamed: 0,train_id,AA3,AA4,AA5,AA6,AA7,AA14,AA15,DG1,is_female,...,GN1,GN1_OTHERS,GN2,GN2_OTHERS,GN3,GN3_OTHERS,GN4,GN4_OTHERS,GN5,GN5_OTHERS
0,0,3,32,3.0,,323011,3854,481,1975,1,...,99.0,,99,,99,,99,,99,
1,1,2,26,,8.0,268131,2441,344,1981,1,...,,,1,,2,,2,,2,
2,2,1,16,,7.0,167581,754,143,1995,1,...,1.0,,2,,2,,2,,2,
3,3,4,44,5.0,,445071,5705,604,1980,1,...,,,2,,2,,99,,99,
4,4,4,43,,6.0,436161,5645,592,1958,1,...,,,1,,1,,1,,1,


In [9]:
# removing train_id
data.drop('train_id', axis=1, inplace=True)
data.head(2)

Unnamed: 0,AA3,AA4,AA5,AA6,AA7,AA14,AA15,DG1,is_female,DG3,...,GN1,GN1_OTHERS,GN2,GN2_OTHERS,GN3,GN3_OTHERS,GN4,GN4_OTHERS,GN5,GN5_OTHERS
0,3,32,3.0,,323011,3854,481,1975,1,3,...,99.0,,99,,99,,99,,99,
1,2,26,,8.0,268131,2441,344,1981,1,8,...,,,1,,2,,2,,2,


In [10]:
# Creating a data sample so we can experiment much faster
data_sample = data.iloc[:500].copy()
data_sample.head()

Unnamed: 0,AA3,AA4,AA5,AA6,AA7,AA14,AA15,DG1,is_female,DG3,...,GN1,GN1_OTHERS,GN2,GN2_OTHERS,GN3,GN3_OTHERS,GN4,GN4_OTHERS,GN5,GN5_OTHERS
0,3,32,3.0,,323011,3854,481,1975,1,3,...,99.0,,99,,99,,99,,99,
1,2,26,,8.0,268131,2441,344,1981,1,8,...,,,1,,2,,2,,2,
2,1,16,,7.0,167581,754,143,1995,1,3,...,1.0,,2,,2,,2,,2,
3,4,44,5.0,,445071,5705,604,1980,1,3,...,,,2,,2,,99,,99,
4,4,43,,6.0,436161,5645,592,1958,1,3,...,,,1,,1,,1,,1,


## ```TabularDataset``` 0.0.1
In this attempt we will create a simple dataset class that takes in raw data, splits it into ```X``` and ```Y```, and creates separate ```x_train, x_val, y_train, y_val``` sets, which will individually contain ```X1, X2, Y```.

This class will also perform all pre-processing: removing and replacing NAs, and label-encoding the categorical variables. A class map for the dependent variable is still a TO DO in this version.
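
As a rough illustration of the pre-processing described above, the sketch below fills NAs and label-encodes a toy DataFrame, then builds the kind of class map that is listed as a TO DO in `TabularBunch`. The column names (`color`, `size`, `label`) and the `class_map` dictionary are made up for this example and are not part of the framework.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# toy frame: names and values are illustrative only
toy = pd.DataFrame({
    "color": ["red", None, "blue", "red"],
    "size": [1.0, None, 3.0, 2.0],
    "label": ["yes", "no", "yes", "no"],
})

# numeric columns get 0, everything else gets the string "NA"
for col in ["color", "size"]:
    if pd.api.types.is_numeric_dtype(toy[col]):
        toy[col] = toy[col].fillna(0)
    else:
        toy[col] = toy[col].fillna("NA")

# label-encode the categorical column (classes are sorted: "NA", "blue", "red")
le_x = LabelEncoder()
toy["color"] = le_x.fit_transform(toy["color"])

# label-encode the dependent variable and keep an index -> class-name map
le_y = LabelEncoder()
y = le_y.fit_transform(toy["label"])
class_map = {int(i): str(c) for i, c in enumerate(le_y.classes_)}

print(toy["color"].tolist())  # [2, 0, 1, 2]
print(class_map)              # {0: 'no', 1: 'yes'}
```

Keeping the fitted encoders around (here `le_x` and `le_y`) would be one way to map predictions back to the original class names at inference time.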

In [3]:
#export
# 1.
class TabularBunch():
    """
    df: <pandas df>
    categorical: <list of column names>: columns which represent all categorical (discrete) variables
    dependent: <column name>: the target column (required)
    null_thresh: <float>: if > 0, drop columns whose fraction of null values exceeds this threshold
    
    This class treats all columns except the categorical and dependent ones as continuous variables. If there aren't any categorical columns (categorical is None), all columns except the dependent variable are treated as continuous.
    
    NB: Perform some data analysis and exploration before calling this class. For later versions we will automate this full process.
    """
    def __init__(self, df, categorical=None, dependent=None, null_thresh=0., verbose=True):
        super().__init__()
        assert dependent # needs dependent variable to work
        self.df = df.copy()
        self.categorical = categorical
        self.verbose = verbose
        
        # cleaning df <OPTIONAL>: removing null values from certain threshold. 
        if null_thresh > 0.:
            # this will clear our categorical list: as we have to clean some columns
            self.categorical = self.purge_nulls_(null_thresh)
            
        # splitting X, y: Initial split | no cat/cont split
        y = self.df[dependent] # all rows should have a target | dependent variable
        X = self.df.drop(columns=dependent, axis=1)
            
        # creating classes
        self.classes = set(y)
        ### TO DO:
        ### Create class index to class string mapping
        ### This will help with any confusion and inference level interpretation

        
        # filling rest of null values: will perform just incase
        X = self.fill_na_(X)
            
        # Splitting X into: x_cont, x_cat
        # x_cont: continuous variables
        # x_cat: categorical (discrete variables which will go into embeddings)
        if categorical is not None: 
            self.X_cat = X[self.categorical].copy()
            self.X_cont = X.drop(columns=self.categorical, axis=1).copy()
            
        else:
            self.X_cat = None
            self.X_cont = X.copy() # do nothing: we will use this placeholder for now
            
        # Turning our categorical dataset: X_cat into LabelEncodings
        # This will turn cat1 -> 0, cat2 -> 1, for each cat within
        # The category columns
        self.labelencoding_()
                
        # Creating our embedding dictionaries
        # this will help with forming nn.Embeddings
        # and will also help with creating our datasets
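        if categorical:
            self.emb_c = {n: len(col.cat.categories) for n,col in self.X_cat.items()}
            self.emb_sz = [(c, min(50, (c+1)//2)) for _,c  in self.emb_c.items()]
            self.emb_cols = self.categorical
        else:
            # no categorical columns: keep these attributes defined so TabularDataset can rely on them
            self.emb_c, self.emb_sz, self.emb_cols = {}, [], []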
        
        # our X, Y 
        # this will be used to feed into our dataset class
        if self.X_cat is not None:
            if len(self.X_cont.columns) == 0:
                self.X = self.X_cat
            else:
                # axis=1 joins the two frames column-wise (the default axis=0 would stack rows)
                self.X = pd.concat([self.X_cat, self.X_cont], axis=1)
        else:
            self.X = self.X_cont
        self.Y = y
        
        # Cleaning up our attributes: Don't need to store everything
        del self.X_cat
        del self.X_cont
        del self.categorical
        del self.verbose
        del self.df
        
        if verbose: print('Finished!')
            
    def purge_nulls_(self, null_thresh):
        """Will remove all columns with null values exceeding our threshold"""
        if self.verbose: print(f'Performing null purge. Null threshold: {null_thresh}...')
        
        nt = int(len(self.df) * null_thresh)
        
        for col in self.df.columns:
            if self.df[col].isnull().sum() > nt:
                self.df.drop(col, axis=1, inplace=True)
                
        cols = list(self.df.columns)
        categorical = self.categorical_left_(cols) # removing categories that are missing
        return categorical
                
    def fill_na_(self, X):
        """Will perform a quick processing of NA values."""
        if self.verbose: print(f'Performing NaN Replacement...')
        for col in X.columns:
            if X.dtypes[col] == "object":
                X[col] = X[col].fillna("NA")
            else:
                X[col] = X[col].fillna(0)
        return X
        
    
    def labelencoding_(self):
        if self.verbose: print(f'Performing label encoding operation. This may take a few seconds...')
        if self.X_cat is not None:
            for col in self.X_cat.columns:
                le = LabelEncoder()
                self.X_cat[col] = le.fit_transform(self.X_cat[col])
                self.X_cat[col] = self.X_cat[col].astype('category')
                
    def categorical_left_(self, cols):
        """
        Return the categorical columns that are still present in `cols` after purging the high-null columns. Returns None when no categorical columns were given.
        """
        if self.categorical is None: return None
        remaining = set(cols)
        # keep the original order of the categorical list
        return [col for col in self.categorical if col in remaining]
    
    def get_train_test_(self, test_size=0.1):
        """
        This get_train_test_ method will append x_train, x_val, y_train, y_val to this bunch object using train_test_split from the scikit-learn library. 

        This will allow for everything to be kept in a single bunch object which will feed into the Learner class and Dataset class
        """
        self.x_train, self.x_val, self.y_train, self.y_val = train_test_split(self.X,self.Y, test_size=test_size, random_state=42)
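
The embedding sizes stored in `emb_sz` follow the rule `min(50, (c+1)//2)`, where `c` is the number of categories in a column. The small sketch below works through that rule for three made-up cardinalities; the helper `emb_size` and the column names are hypothetical and only used for illustration.

```python
def emb_size(cardinality, max_dim=50):
    """Return (cardinality, embedding width) using the min(50, (c+1)//2) rule."""
    return cardinality, min(max_dim, (cardinality + 1) // 2)

# hypothetical cardinalities for three categorical columns
emb_c = {"binary_col": 2, "weekday": 7, "zip_code": 200}
emb_sz = [emb_size(c) for c in emb_c.values()]
print(emb_sz)  # [(2, 1), (7, 4), (200, 50)]
```

Each tuple has the shape that `nn.Embedding(num_embeddings, embedding_dim)` expects, so small columns get narrow embeddings and very large ones are capped at width 50.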

In [4]:
#export
# helper function to return categorical variables
def get_categorical_(dependent, df):
    """
    Helper function to return categorical variables
    TO DO: Need to make this more dynamic: what if there are continuous variables? At the moment this function treats every column except the dependent variable as categorical.
    """
    return list(df.columns[df.columns != dependent])
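
One possible direction for the TO DO above, sketched under assumptions: treat non-numeric columns and low-cardinality integer columns as categorical, and leave everything else as continuous. The function name `guess_categorical` and the `max_card` cutoff are hypothetical, and any such heuristic would still need checking against the actual data.

```python
import pandas as pd

def guess_categorical(df, dependent, max_card=20):
    """Hypothetical heuristic: non-numeric columns and integer columns
    with at most `max_card` distinct values count as categorical."""
    cats = []
    for col in df.columns:
        if col == dependent:
            continue
        s = df[col]
        if not pd.api.types.is_numeric_dtype(s):
            cats.append(col)
        elif pd.api.types.is_integer_dtype(s) and s.nunique() <= max_card:
            cats.append(col)
    return cats

# toy frame: column names are illustrative only
toy = pd.DataFrame({
    "city": ["a", "b", "a", "c"],
    "n_children": [0, 2, 1, 0],
    "income": [10.5, 20.1, 13.3, 18.0],
    "is_female": [1, 0, 1, 1],
})
print(guess_categorical(toy, "is_female"))  # ['city', 'n_children']
```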

In [13]:
# 1. Defining our Dependent, Categorical, and Continuous Variables.
# In this case, we don't have continuous variables, so we will just feed dependent and categorical
dependent = 'is_female'
categorical = get_categorical_(dependent, data_sample)

In [14]:
# 2. Building tabularbunch
# Testing our class & methods
# in production: We will first have to determine what is
# categorical & continuous. 
# this will be part of the typical data exploration phase
tabularbunch = TabularBunch(df=data_sample, categorical=categorical, dependent=dependent)

Performing NaN Replacement...
Performing label encoding operation. This may take a few seconds...
Finished!
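
The call above leaves `null_thresh` at its default of 0, so no columns are purged. The sketch below illustrates the same thresholding idea that `purge_nulls_` uses (drop a column when its null count exceeds `int(len(df) * null_thresh)`) on a toy frame. The column names and the toy `categorical` list are made up for this example.

```python
import numpy as np
import pandas as pd

# toy frame: 4 rows, one column that is mostly null
toy = pd.DataFrame({
    "mostly_null": [np.nan, np.nan, np.nan, 1.0],
    "some_null": [1.0, np.nan, 3.0, 4.0],
    "full": [1, 2, 3, 4],
})

null_thresh = 0.5
nt = int(len(toy) * null_thresh)  # 2 rows

# keep only the columns whose null count does not exceed the threshold
purged = toy.loc[:, toy.isnull().sum() <= nt]
print(list(purged.columns))  # ['some_null', 'full']

# the categorical list has to be filtered to the surviving columns
categorical = ["mostly_null", "full"]
categorical = [c for c in categorical if c in purged.columns]
print(categorical)  # ['full']
```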


In [12]:
# #export
# # 2. 
# def get_train_test_(bunch, test_size=0.1):
#     """
#     This get_train_test_ function will take in a bunch object and append x_train, x_val, y_train, y_val using train_test_split from the scikit-learn library. 
    
#     This will allow for everything to be kept in a single bunch object which will feed into the Learner class and Dataset class
#     """
#     X, Y = bunch.X, bunch.Y
#     bunch.x_train, bunch.x_val, bunch.y_train, bunch.y_val = train_test_split(X,Y, test_size=test_size, random_state=42)

In [13]:
# # splitting our data
# # this will be appended to the bunch object
# get_train_test_(tabularbunch, test_size=0.1)

In [18]:
# Splitting our data
# We will simply call our get_train_test_ method 
# on our tabular bunch object
tabularbunch.get_train_test_(test_size=0.1)

In [19]:
# checking sizes to see if they match
tabularbunch.x_train.shape, tabularbunch.x_val.shape

((450, 1233), (50, 1233))

In [20]:
tabularbunch.y_train.shape, tabularbunch.y_val.shape

((450,), (50,))

In [5]:
#export
# 2.
# tabular dataset class
class TabularDataset(Dataset):
    def __init__(self, tabularbunch, ds_type='train'):
        """
        This will be our main working dataset class. We inherit from the Dataset class provided by PyTorch, which allows us to iterate through batches appropriately when the dataset is passed to a DataLoader.
        
        If there are both cat & cont variables, this will simply create a split X1, X2 respectively. 
        """
        if ds_type == 'train':
            X = tabularbunch.x_train
            Y = tabularbunch.y_train
        else:
            X = tabularbunch.x_val
            Y = tabularbunch.y_val

        # initial split
        # we will split cat & cont variables | IF cont or cat doesn't exist
        # this will just create a single X dataset
        self.X1 = X.loc[:,tabularbunch.emb_cols].copy().values.astype(np.int64)
        self.X2 = X.drop(columns=tabularbunch.emb_cols).copy().values.astype(np.float32)
        self.y = Y.to_numpy().astype(np.float32) # vector 
        
        # NORMALIZING CONT Dataset | if it exist
        if self.X2.size:
            self.X2 = (self.X2 - self.X2.mean()) / self.X2.std()
                
        # setting our dataset state
        if self.X1.size and not self.X2.size: self.X_state = 'cat_only'
        elif self.X2.size and not self.X1.size: self.X_state = 'cont_only'
        else: self.X_state = 'both'
        
    
    def __len__(self): return len(self.y)
    def __getitem__(self, idx):
        """
        Returns an item according to the dataset state: (X1, y) for 'cat_only', (X2, y) for 'cont_only', otherwise (X1, X2, y)
        """
        if self.X_state == 'cat_only': return self.X1[idx], self.y[idx]
        elif self.X_state == 'cont_only': return self.X2[idx], self.y[idx]
        else: return self.X1[idx], self.X2[idx], self.y[idx]
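
`TabularDataset` normalizes `X2` with a single mean and standard deviation computed over the whole array, separately for the train and validation splits. One alternative, shown here only as a sketch, is to standardize each column with statistics taken from the training split and reuse them for validation. The helper names `fit_stats` and `standardize` and the `eps` guard against constant columns are assumptions for this example.

```python
import numpy as np

def fit_stats(x_train, eps=1e-7):
    """Per-column mean and std from the training split."""
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0)
    # constant columns have std 0: use 1 instead to avoid dividing by zero
    std = np.where(std < eps, 1.0, std)
    return mean, std

def standardize(x, mean, std):
    """Apply previously fitted statistics to any split."""
    return ((x - mean) / std).astype(np.float32)

# toy continuous data: the third column is constant
x_train = np.array([[1.0, 100.0, 5.0],
                    [3.0, 300.0, 5.0]], dtype=np.float32)
x_val = np.array([[2.0, 200.0, 5.0]], dtype=np.float32)

mean, std = fit_stats(x_train)
x_train_n = standardize(x_train, mean, std)
x_val_n = standardize(x_val, mean, std)
print(x_train_n)  # rows: [-1, -1, 0] and [1, 1, 0]
print(x_val_n)    # row: [0, 0, 0]
```

Reusing the training statistics keeps the validation data on the same scale as the data the model was fitted on.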

In [138]:
# Grabbing our train and validation datasets
# we will feed our tabularbunch object 
train_ds = TabularDataset(tabularbunch, ds_type='train')
valid_ds = TabularDataset(tabularbunch, ds_type='valid')

In [12]:
# helper function to grab train_ds and valid_ds in one grab
def get_datasets_(tabularbunch):
    """
    Will return train and valid datasets
    """
    train_ds = TabularDataset(tabularbunch, ds_type='train')
    valid_ds = TabularDataset(tabularbunch, ds_type='valid')
    return train_ds, valid_ds

In [145]:
# grabbing our train and valid ds using our helper function
train_ds, valid_ds = get_datasets_(tabularbunch)

In [146]:
# Now that we have our train and valid datasets we will create
# dataloaders using PyTorch's DataLoader class
# We have to pass it a batch size (bs)
bs = 5
train_dl = DataLoader(train_ds, batch_size=bs, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size=bs)
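
To make the batch structure concrete, here is a minimal sketch of what a DataLoader yields for a dataset in the `'both'` state, where `__getitem__` returns `(X1, X2, y)`. `ToyTabular` is a hypothetical stand-in with random data and is not the `TabularDataset` class itself.

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class ToyTabular(Dataset):
    """Stand-in for a dataset in the 'both' state: returns (x_cat, x_cont, y)."""
    def __init__(self, n=12):
        rng = np.random.default_rng(0)
        self.X1 = rng.integers(0, 4, size=(n, 3)).astype(np.int64)   # 3 categorical columns
        self.X2 = rng.normal(size=(n, 2)).astype(np.float32)         # 2 continuous columns
        self.y = rng.integers(0, 2, size=n).astype(np.float32)       # binary target
    def __len__(self): return len(self.y)
    def __getitem__(self, idx): return self.X1[idx], self.X2[idx], self.y[idx]

toy_dl = DataLoader(ToyTabular(), batch_size=5, shuffle=False)

# each batch is a list of three tensors, one per element returned by __getitem__
x1, x2, y = next(iter(toy_dl))
print(x1.shape, x2.shape, y.shape)
print(x1.dtype, x2.dtype)
print(len(toy_dl))  # 12 rows at batch size 5 gives 3 batches, the last one smaller
```

The integer dtype of `x1` matters because embedding layers expect integer indices, while the continuous block stays in floating point.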

In [148]:
# how to quickly check the dataset
train_dl.dataset.X1

array([[ 3, 19,  0, ...,  1,  0,  1],
       [ 1,  9,  5, ...,  1,  2,  1],
       [ 0,  2,  0, ...,  1,  3,  1],
       ...,
       [ 1,  8,  0, ...,  1,  2,  1],
       [ 0,  4,  0, ...,  1,  0,  1],
       [ 0,  3,  0, ...,  1,  5,  1]], dtype=int64)

In [6]:
#export
# 3.
# A
# Creating our TabularData class which will create data objects that will 
# feed into our learner class 
# this will store everything regarding our data
# and will make the entire process of creating a tabular dataloader 
# much easier.
# simply put: this puts everything together in one place
### V0.1.0 ### 
class TabularData():
    def __init__(self, tabularbunch, test_size:float=0.1, bs:int=64, train_shuffle:bool=True,**kwargs):
        """
        Tabular Data class will host everything associated with tabular data. This will be the main class that our Learner class will use to both train and test on.
        """
        self.tabularbunch = tabularbunch 
        self.tabularbunch.get_train_test_(test_size=test_size)
        
        # grabbing our datasets
        train_ds, valid_ds = self.get_datasets_()
        
        # setting our dataloaders
        self.train_dl = DataLoader(train_ds, batch_size=bs, shuffle=train_shuffle, **kwargs)
        self.valid_dl = DataLoader(valid_ds, batch_size=bs, **kwargs)
        
    def get_datasets_(self):
        """
        Build and return the train and valid datasets from the bunch's train/validation split
        """
        train_ds = TabularDataset(self.tabularbunch, ds_type='train')
        valid_ds = TabularDataset(self.tabularbunch, ds_type='valid')
        return train_ds, valid_ds

In [25]:
# Getting our data with our TabularData class
data = TabularData(tabularbunch)

# Full Workflow
We will now experiment with several datasets and go through the workflow. Lastly, we will feed our ```TabularData``` object into our ```Learner``` object which will do all the training and inference processing. 

We will cover ```Learner``` in the ```02``` notebook.

In [20]:
### 1. ### 
# The first thing we will do is grab our raw data and perform some exploration
# and any extra pre-processing. This can be from feature extraction
# to clairvoyance augmentation
df = pd.read_csv('./data/wids/train.csv', low_memory=False).iloc[:100] # grab first 100

In [21]:
### 2. ###
# Now we will simply grab variable names if they exist
# these will be: Categorical variables, Continuous variables, Dependent variable
# nb: Dependent is what we are either predicting or forecasting

# this dataset does not contain any continuous variables. We will show this in the
# next example
dependent = 'is_female'
categorical = get_categorical_(dependent, df) # custom function

In [22]:
### 3. ###
# Creating our tabularbunch
# This will be our core object which will perform some initial pre-processing 
# and will store embs for categorical. This will be fed into our TabularDataset
tabularbunch = TabularBunch(df=df, categorical=categorical, dependent=dependent)

Performing NaN Replacement...
Performing label encoding operation. This may take a few seconds...
Finished!


In [25]:
### 4. ###
# Creating our Data Object
# This will be the main component to process the data in our Learner object
data = TabularData(tabularbunch, bs=32, num_workers=1)

## Export: V0.1.0 
*Updated 12/22/2019*

In [7]:
!python notebook2script.py copernicusTabular_001.ipynb

Converted copernicusTabular_001.ipynb to exp\nb_copernicusTabular.py
