# Deep Learning for tabular data using PyTorch

<b>Dataset</b> - https://www.kaggle.com/c/shelter-animal-outcomes

<b>Problem Statement</b>: Given certain features about a shelter animal (like age, sex, color, breed), predict its outcome.

There are 5 possible outcomes: Return_to_owner, Euthanasia, Adoption, Transfer, Died. We are expected to find the probability of an animal's outcome belonging to each of the 5 categories.

## Library imports

In [5]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import torch
from torch.utils.data import Dataset, DataLoader
import torch.optim as torch_optim
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models
from datetime import datetime

## Load Data

#### Training set 

In [6]:
train = pd.read_csv('Data/train_animal.csv')
print("Shape:", train.shape)
train.head()

Shape: (26729, 10)


Unnamed: 0,AnimalID,Name,DateTime,OutcomeType,OutcomeSubtype,AnimalType,SexuponOutcome,AgeuponOutcome,Breed,Color
0,A671945,Hambone,2014-02-12 18:22:00,Return_to_owner,,Dog,Neutered Male,1 year,Shetland Sheepdog Mix,Brown/White
1,A656520,Emily,2013-10-13 12:44:00,Euthanasia,Suffering,Cat,Spayed Female,1 year,Domestic Shorthair Mix,Cream Tabby
2,A686464,Pearce,2015-01-31 12:28:00,Adoption,Foster,Dog,Neutered Male,2 years,Pit Bull Mix,Blue/White
3,A683430,,2014-07-11 19:09:00,Transfer,Partner,Cat,Intact Male,3 weeks,Domestic Shorthair Mix,Blue Cream
4,A667013,,2013-11-15 12:52:00,Transfer,Partner,Dog,Neutered Male,2 years,Lhasa Apso/Miniature Poodle,Tan


#### Test set

In [7]:
test = pd.read_csv('Data/test_animal.csv')
print("Shape:", test.shape)
test.head()

Shape: (11456, 8)


Unnamed: 0,ID,Name,DateTime,AnimalType,SexuponOutcome,AgeuponOutcome,Breed,Color
0,1,Summer,2015-10-12 12:15:00,Dog,Intact Female,10 months,Labrador Retriever Mix,Red/White
1,2,Cheyenne,2014-07-26 17:59:00,Dog,Spayed Female,2 years,German Shepherd/Siberian Husky,Black/Tan
2,3,Gus,2016-01-13 12:20:00,Cat,Neutered Male,1 year,Domestic Shorthair Mix,Brown Tabby
3,4,Pongo,2013-12-28 18:12:00,Dog,Intact Male,4 months,Collie Smooth Mix,Tricolor
4,5,Skooter,2015-09-24 17:59:00,Dog,Neutered Male,2 years,Miniature Poodle Mix,White


#### Sample submission file

For each row, the probability of each outcome must be filled into the corresponding column.

In [9]:
sample = pd.read_csv('Data/sample_submission_animal.csv')
sample.head()

Unnamed: 0,ID,Adoption,Died,Euthanasia,Return_to_owner,Transfer
0,1,1,0,0,0,0
1,2,1,0,0,0,0
2,3,1,0,0,0,0
3,4,1,0,0,0,0
4,5,1,0,0,0,0


## Very basic data exploration

#### How balanced is the dataset?

In [10]:
Counter(train['OutcomeType'])

Counter({'Return_to_owner': 4786,
         'Euthanasia': 1555,
         'Adoption': 10769,
         'Transfer': 9422,
         'Died': 197})

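Adoption and Transfer occur far more often than the other outcomes: together they account for about 76% of the 26,729 training rows, while Died appears only 197 times.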
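
To put numbers on the imbalance, the counts can be turned into shares of the training set. This is a minimal sketch that reuses the counts printed above; `outcome_counts` and `outcome_share` are illustrative names, not variables from the notebook.

```python
from collections import Counter

# Outcome counts copied from the Counter output above.
outcome_counts = Counter({
    'Return_to_owner': 4786,
    'Euthanasia': 1555,
    'Adoption': 10769,
    'Transfer': 9422,
    'Died': 197,
})

total = sum(outcome_counts.values())

# Share of each outcome, most frequent first.
outcome_share = {
    outcome: count / total
    for outcome, count in outcome_counts.most_common()
}

for outcome, share in outcome_share.items():
    print(f"{outcome:<16}{share:6.1%}")
```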

#### What are the most common names and how many times do they occur? 


In [11]:
Counter(train['Name']).most_common(5)

[(nan, 7691), ('Max', 136), ('Bella', 135), ('Charlie', 107), ('Daisy', 106)]

Name is missing for 7,691 of the 26,729 training rows (about 29%), so it is unlikely to be a very important factor.

## Data preprocessing


In [12]:
train_X = train.drop(columns= ['AnimalID','OutcomeType','OutcomeSubtype'])
Y = train['OutcomeType']
test_X = test

The OutcomeSubtype column is not available in the test set, so it is of no use for prediction and we drop it. Also, since AnimalID is unique to each animal, it doesn't help in training.

#### Stacking Train and Test sets so that they undergo the same preprocessing 

In [14]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows the same way
stacked_df = pd.concat([train_X, test_X.drop(columns=['ID'])])
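
The stack-then-split pattern can be checked on a small example before relying on it. The sketch below uses hypothetical toy frames (`toy_train`, `toy_test`) in place of the real data and assumes that rows are recovered by position, since each frame keeps its own index after stacking.

```python
import pandas as pd

# Toy stand-ins for train_X and test_X (hypothetical values).
toy_train = pd.DataFrame({'AnimalType': ['Dog', 'Cat', 'Dog'],
                          'Color': ['Tan', 'White', 'Black']})
toy_test = pd.DataFrame({'ID': [1, 2],
                         'AnimalType': ['Cat', 'Dog'],
                         'Color': ['Black', 'Tan']})

# Stack the rows; each frame keeps its own index, so index labels repeat.
toy_stacked = pd.concat([toy_train, toy_test.drop(columns=['ID'])])

# Split back by position: the first len(toy_train) rows are the training rows.
n_train = len(toy_train)
toy_train_back = toy_stacked.iloc[:n_train]
toy_test_back = toy_stacked.iloc[n_train:]
```

Because the index labels repeat after stacking, positional slicing is the safer way to separate the two parts again.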

#### Splitting DateTime into month and year

In [15]:
stacked_df['DateTime'] = pd.to_datetime(stacked_df['DateTime'])
stacked_df['year'] = stacked_df['DateTime'].dt.year
stacked_df['month'] = stacked_df['DateTime'].dt.month
stacked_df = stacked_df.drop(columns=['DateTime'])
stacked_df.head()

Unnamed: 0,Name,AnimalType,SexuponOutcome,AgeuponOutcome,Breed,Color,year,month
0,Hambone,Dog,Neutered Male,1 year,Shetland Sheepdog Mix,Brown/White,2014,2
1,Emily,Cat,Spayed Female,1 year,Domestic Shorthair Mix,Cream Tabby,2013,10
2,Pearce,Dog,Neutered Male,2 years,Pit Bull Mix,Blue/White,2015,1
3,,Cat,Intact Male,3 weeks,Domestic Shorthair Mix,Blue Cream,2014,7
4,,Dog,Neutered Male,2 years,Lhasa Apso/Miniature Poodle,Tan,2013,11


#### Dropping columns with too many nulls

In [18]:
for col in stacked_df.columns:
    if stacked_df[col].isnull().sum() > 10000:
        print('dropping', col, stacked_df[col].isnull().sum())
        stacked_df = stacked_df.drop(columns = [col])

dropping Name 10916


In [19]:
stacked_df.head()

Unnamed: 0,AnimalType,SexuponOutcome,AgeuponOutcome,Breed,Color,year,month
0,Dog,Neutered Male,1 year,Shetland Sheepdog Mix,Brown/White,2014,2
1,Cat,Spayed Female,1 year,Domestic Shorthair Mix,Cream Tabby,2013,10
2,Dog,Neutered Male,2 years,Pit Bull Mix,Blue/White,2015,1
3,Cat,Intact Male,3 weeks,Domestic Shorthair Mix,Blue Cream,2014,7
4,Dog,Neutered Male,2 years,Lhasa Apso/Miniature Poodle,Tan,2013,11


#### Label encoding

In [21]:
for col in stacked_df.columns:
    if stacked_df.dtypes[col] == 'object':
        stacked_df[col] = stacked_df[col].fillna('NA')
    else:
        stacked_df[col] = stacked_df[col].fillna(0)
    stacked_df[col] = LabelEncoder().fit_transform(stacked_df[col])

In [24]:
stacked_df.tail()

Unnamed: 0,AnimalType,SexuponOutcome,AgeuponOutcome,Breed,Color,year,month
11451,0,3,19,775,6,1,6
11452,0,0,20,775,46,1,9
11453,0,0,5,775,156,1,8
11454,1,3,38,841,40,2,8
11455,1,1,31,1022,183,1,6


In [23]:
# make all variables categorical
for col in stacked_df.columns:
    stacked_df[col] = stacked_df[col].astype('category')

#### Splitting back train and test

In [27]:
X = stacked_df[0:26729]
test_processed = stacked_df[26729:]

# check if shape[0] matches original
print('train shape', X.shape, 'original: ', train.shape)
print('test shape', test_processed.shape, 'original', test.shape)

train shape (26729, 7) original:  (26729, 10)
test shape (11456, 7) original (11456, 8)


#### Encoding target

In [29]:
y = LabelEncoder().fit_transform(Y)

# check that the counts match the earlier Counter, then record the label-to-code mapping in a target dictionary
print(Counter(train['OutcomeType']))
print(Y)
target_dict = {
    'Adoption'  : 0,
    'Died' : 1,
    'Euthanasia' : 2,
    'Return_to_owner' : 3,
    'Transfer' : 4,
}

Counter({'Adoption': 10769, 'Transfer': 9422, 'Return_to_owner': 4786, 'Euthanasia': 1555, 'Died': 197})
0        Return_to_owner
1             Euthanasia
2               Adoption
3               Transfer
4               Transfer
              ...       
26724           Transfer
26725           Transfer
26726           Adoption
26727           Transfer
26728           Transfer
Name: OutcomeType, Length: 26729, dtype: object
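
The hand-written `target_dict` relies on `LabelEncoder` assigning codes in sorted label order. As a hedged alternative, the same mapping can be read from a fitted encoder's `classes_` attribute instead of being typed by hand. In this sketch, `outcomes`, `encoder` and `derived_target_dict` are illustrative names that do not appear in the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# The five outcome labels from the problem statement.
outcomes = ['Return_to_owner', 'Euthanasia', 'Adoption', 'Transfer', 'Died']

encoder = LabelEncoder().fit(outcomes)

# classes_ is sorted, and the label at position i in classes_ is encoded as i.
derived_target_dict = {str(label): int(code)
                       for code, label in enumerate(encoder.classes_)}
print(derived_target_dict)
```

Keeping a reference to the fitted encoder also makes `encoder.inverse_transform` available for turning predicted codes back into outcome names.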


#### Split of train-valid

In [30]:
# split on the integer-encoded target y rather than the raw string labels Y
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.10, random_state=0)
X_train.head()

Unnamed: 0,AnimalType,SexuponOutcome,AgeuponOutcome,Breed,Color,year,month
6917,1,3,5,1293,146,2,6
13225,0,4,33,1515,231,1,7
2697,1,4,5,1353,43,2,8
21905,1,3,31,245,40,3,0
17071,0,4,37,775,156,1,7


#### Choosing columns for embedding

In [31]:
# use categorical embeddings for columns having more than two values
embedded_cols = {n: len(col.cat.categories) for n,col in X.items() if len(col.cat.categories) > 2}
embedded_cols

{'SexuponOutcome': 6,
 'AgeuponOutcome': 46,
 'Breed': 1678,
 'Color': 411,
 'year': 4,
 'month': 12}

In [35]:
embedded_col_names = embedded_cols.keys()
# number of remaining columns, which are not embedded
len(X.columns) - len(embedded_cols)

1

#### Determining size of embedding 

In [37]:
embedding_sizes = [(n_categories, min(50, (n_categories+1)//2)) for _,n_categories in embedded_cols.items()]
embedding_sizes

[(6, 3), (46, 23), (1678, 50), (411, 50), (4, 2), (12, 6)]
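
To illustrate how these `(n_categories, width)` pairs might be consumed, the sketch below builds one `nn.Embedding` table per column and concatenates the looked-up vectors. It is an assumption about how the sizes are used, not the notebook's model; `x_cat` is a hypothetical random batch of category codes.

```python
import torch
import torch.nn as nn

# (number of categories, embedding width) pairs from the output above.
embedding_sizes = [(6, 3), (46, 23), (1678, 50), (411, 50), (4, 2), (12, 6)]

# One embedding table per categorical column.
embeddings = nn.ModuleList(
    [nn.Embedding(n_categories, width) for n_categories, width in embedding_sizes]
)

# Hypothetical batch of 4 rows: one integer code per embedded column,
# each drawn from that column's valid range [0, n_categories).
batch_size = 4
x_cat = torch.stack(
    [torch.randint(0, n_categories, (batch_size,)) for n_categories, _ in embedding_sizes],
    dim=1,
)

# Look up each column in its own table and concatenate the resulting vectors.
embedded = torch.cat(
    [emb(x_cat[:, i]) for i, emb in enumerate(embeddings)], dim=1
)

# Total width of the concatenated embedding vector per row.
n_embedded_features = sum(width for _, width in embedding_sizes)
```

With the sizes above, every row is represented by 3 + 23 + 50 + 50 + 2 + 6 = 134 embedded features, to which any non-embedded columns would be appended.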

## PyTorch Dataset

## Making device (GPU/CPU) compatible 

To make use of a GPU when one is available (more [here](https://jovian.ml/aakashns/04-feedforward-nn)), we have to move both the data and the model to it. For that we define a few helper functions.
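
A minimal sketch of such helpers follows. It assumes a common pattern: pick the device once, move tensors to it recursively, and wrap a `DataLoader` so that batches are moved as they are iterated. The names `get_default_device`, `to_device` and `DeviceDataLoader` and the toy data are illustrative and are not taken from this notebook.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset


def get_default_device():
    """Return the GPU device if one is available, otherwise the CPU."""
    return torch.device('cuda' if torch.cuda.is_available() else 'cpu')


def to_device(data, device):
    """Move a tensor, or a list/tuple of tensors, to the given device."""
    if isinstance(data, (list, tuple)):
        return [to_device(x, device) for x in data]
    return data.to(device, non_blocking=True)


class DeviceDataLoader:
    """Wrap a DataLoader so every batch is moved to the device when iterated."""

    def __init__(self, dl, device):
        self.dl = dl
        self.device = device

    def __iter__(self):
        for batch in self.dl:
            yield to_device(batch, self.device)

    def __len__(self):
        return len(self.dl)


device = get_default_device()
# Hypothetical toy data, only to exercise the wrapper.
toy_ds = TensorDataset(torch.zeros(8, 3), torch.zeros(8, dtype=torch.long))
toy_dl = DeviceDataLoader(DataLoader(toy_ds, batch_size=4), device)
xb, yb = next(iter(toy_dl))
```

A model would be moved the same way, for example with `to_device(model, device)`, since `nn.Module` also provides a `.to` method.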

## Model


#### Optimizer

#### Training function

#### Evaluation function

## Training 

## Test Output