![MLU Logo](../data/MLU_Logo.png)

# <a name="0">Machine Learning Accelerator - Tabular Data - Lecture 1</a>


## Final Project 

In this notebook, we build a ML model to predict the __Time at Center__ field of our final project dataset.

1. <a href="#1">Read the dataset</a> (Given) 
2. <a href="#2">Train a model</a> (Implement)
    * <a href="#21">Exploratory Data Analysis</a>
    * <a href="#22">Select features to build the model</a>
    * <a href="#23">Data processing</a>
    * <a href="#24">Model training</a>
3. <a href="#3">Make predictions on the test dataset</a> (Implement)
4. <a href="#4">Write the test predictions to a CSV file</a> (Given)

__Austin Animal Center Dataset__:

In this exercise, we are working with pet adoption data from __Austin Animal Center__. We have two datasets that cover intake and outcome of animals. Intake data is available from [here](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Intakes/wter-evkm) and outcome is from [here](https://data.austintexas.gov/Health-and-Community-Services/Austin-Animal-Center-Outcomes/9t4d-g238). 

In order to work with a single table, we joined the intake and outcome tables using the "Animal ID" column and created a training.csv, test_features.csv and y_test.csv files. Similar to our review dataset, we didn't consider animals with multiple entries to the facility to keep it simple. If you want to see the original datasets, they are available under data/review folder: Austin_Animal_Center_Intakes.csv, Austin_Animal_Center_Outcomes.csv.

__Dataset schema:__ 
- __Pet ID__ - Unique ID of pet
- __Outcome Type__ - State of pet at the time of recording the outcome
- __Sex upon Outcome__ - Sex of pet at outcome
- __Name__ - Name of pet 
- __Found Location__ - Found location of pet before entered the center
- __Intake Type__ - Circumstances bringing the pet to the center
- __Intake Condition__ - Health condition of pet when entered the center
- __Pet Type__ - Type of pet
- __Sex upon Intake__ - Sex of pet when entered the center
- __Breed__ - Breed of pet 
- __Color__ - Color of pet 
- __Age upon Intake Days__ - Age of pet when entered the center (days)
- __Time at Center__ - Time at center (0 = less than 30 days; 1 = more than 30 days). This is the value to predict. 


## 1. <a name="1">Read the datasets</a> (Given)
(<a href="#0">Go to top</a>)

Let's read the datasets into dataframes, using Pandas.

In [17]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings("ignore")
  
training_data = pd.read_csv('../data/final_project/training.csv')
test_data = pd.read_csv('../data/final_project/test_features.csv')

print('The shape of the training dataset is:', training_data.shape)
print('The shape of the test dataset is:', test_data.shape)


The shape of the training dataset is: (71538, 13)
The shape of the test dataset is: (23846, 12)


# Data Exploration

In [3]:
# Implement here
print(training_data.info())
print(training_data.describe())
print(training_data.isna().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 71538 entries, 0 to 71537
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype 
---  ------                --------------  ----- 
 0   Pet ID                71538 non-null  object
 1   Outcome Type          71533 non-null  object
 2   Sex upon Outcome      71537 non-null  object
 3   Name                  44360 non-null  object
 4   Found Location        71538 non-null  object
 5   Intake Type           71538 non-null  object
 6   Intake Condition      71538 non-null  object
 7   Pet Type              71538 non-null  object
 8   Sex upon Intake       71537 non-null  object
 9   Breed                 71538 non-null  object
 10  Color                 71538 non-null  object
 11  Age upon Intake Days  71538 non-null  int64 
 12  Time at Center        71538 non-null  int64 
dtypes: int64(2), object(11)
memory usage: 7.1+ MB
None
       Age upon Intake Days  Time at Center
count          71538.000000    71538.0000

# Data Preparation

In [5]:
#Drop NaNs
training_data = training_data.dropna(how='all')
training_data = training_data.drop(columns=['Name'])

In [6]:
#Drop correlated numeric features
corr_matrix = training_data.corr().abs()

upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(np.bool))

to_drop = [column for column in upper.columns if any(upper[column] > 0.95)]

In [24]:
to_drop

[]

# Training using autogluon

In [13]:
#Find the best model for training
from autogluon.tabular import TabularPredictor

predictor = TabularPredictor(label = 'Time at Center').fit(train_data=training_data)

No path specified. Models will be saved in: "AutogluonModels/ag-20210314_074735/"
Beginning AutoGluon training ...
AutoGluon will save models to "AutogluonModels/ag-20210314_074735/"
AutoGluon Version:  0.1.0
Train Data Rows:    71538
Train Data Columns: 11
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [0, 1]
	If 'binary' is not the correct problem_type, please manually specify the problem_type argument in fit() (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping:  class 1 = 1, class 0 = 0
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    2753.85 MB
	Train Data (Original)  Memory Usage: 49.57 MB (1.8% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes 

Fitting model: NeuralNetMXNet ...
	0.9412	 = Validation accuracy score
	447.89s	 = Training runtime
	0.33s	 = Validation runtime
Fitting model: NeuralNetFastAI ...


Epoch 25: early stopping
█

	0.9132	 = Validation accuracy score
	211.33s	 = Training runtime
	1.13s	 = Validation runtime
Fitting model: LightGBMLarge ...
		`import lightgbm` failed. If you are using Mac OSX, Please try 'brew install libomp'. Detailed info: dlopen(/Users/anshbordia/opt/anaconda3/lib/python3.8/site-packages/lightgbm/lib_lightgbm.so, 6): Library not loaded: /usr/local/opt/libomp/lib/libomp.dylib
  Referenced from: /Users/anshbordia/opt/anaconda3/lib/python3.8/site-packages/lightgbm/lib_lightgbm.so
  Reason: image not found


█

Fitting model: WeightedEnsemble_L2 ...
	0.9464	 = Validation accuracy score
	0.87s	 = Training runtime
	0.01s	 = Validation runtime
AutoGluon training complete, total runtime = 941.19s ...
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20210314_074735/")


# Test Set Prediction

In [None]:
#Load Test Data
test_data['Time at Center'] = pd.read_csv('../data/final_project/y_test.csv')

#Drop missing values and Name column
test_data = training_data.dropna(how='all')
test_data = training_data.drop(columns=['Name'])

In [25]:
#Evaluate selected predictor on test data
predictor.evaluate(test_data)

Predictive performance on given data: accuracy = 0.9776901786463138


0.9776901786463138

In [23]:
#Performance of other predictors
predictor.leaderboard(test_data)

                 model  score_test  score_val  pred_time_test  pred_time_val    fit_time  pred_time_test_marginal  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0     RandomForestEntr    0.995233     0.9452        2.338551       0.114932   38.717107                 2.338551                0.114932          38.717107            1       True          2
1     RandomForestGini    0.995177     0.9436        2.696404       0.118220   39.689618                 2.696404                0.118220          39.689618            1       True          1
2       ExtraTreesEntr    0.994674     0.9292        3.040918       0.107698   51.737143                 3.040918                0.107698          51.737143            1       True          4
3       ExtraTreesGini    0.994604     0.9272        3.056309       0.107866   52.460546                 3.056309                0.107866          52.460546            1       True          3
4  WeightedEnsemble_L2    0.977690     0

Unnamed: 0,model,score_test,score_val,pred_time_test,pred_time_val,fit_time,pred_time_test_marginal,pred_time_val_marginal,fit_time_marginal,stack_level,can_infer,fit_order
0,RandomForestEntr,0.995233,0.9452,2.338551,0.114932,38.717107,2.338551,0.114932,38.717107,1,True,2
1,RandomForestGini,0.995177,0.9436,2.696404,0.11822,39.689618,2.696404,0.11822,39.689618,1,True,1
2,ExtraTreesEntr,0.994674,0.9292,3.040918,0.107698,51.737143,3.040918,0.107698,51.737143,1,True,4
3,ExtraTreesGini,0.994604,0.9272,3.056309,0.107866,52.460546,3.056309,0.107866,52.460546,1,True,3
4,WeightedEnsemble_L2,0.97769,0.9464,92.257412,1.490689,253.877923,0.060285,0.005184,0.873863,2,True,10
5,NeuralNetMXNet,0.954695,0.9412,7.840255,0.331226,447.893905,7.840255,0.331226,447.893905,1,True,8
6,CatBoost,0.947133,0.94,0.27211,0.08954,2.551583,0.27211,0.08954,2.551583,1,True,7
7,NeuralNetFastAI,0.926766,0.9132,84.991728,1.129154,211.332399,84.991728,1.129154,211.332399,1,True,9
8,KNeighborsUnif,0.911404,0.9132,3.983826,0.158151,0.42067,3.983826,0.158151,0.42067,1,True,5
9,KNeighborsDist,0.911404,0.9132,4.594738,0.151879,0.402972,4.594738,0.151879,0.402972,1,True,6
