<a href="https://colab.research.google.com/github/dcruzsteven/autotuning/blob/master/autotuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Automated Hyperparameter Tuning**

## Installation of Required Dependencies

In [1]:
# Download and install kaggle, xgboost, and scikit-learn
!pip install kaggle
!pip install xgboost
!pip install scikit-learn



## Required Imports and Library Definitions

In [0]:
import os # Library for operating system manipulation
import numpy as np # Library for processing numeric vectors
import imageio # Library for dealing with images
import matplotlib.pyplot as plt # Library for plotting images
import pandas as pd # Library for pandas dataframe support
from google.colab import files # Library for colab file upload/download
import xgboost as xgb # Library for gradient-boosted trees
from sklearn.model_selection import KFold # Function for k-fold CV
from sklearn.model_selection import cross_val_score # Function for CV assessment
from sklearn.model_selection import train_test_split # Function for partitioning data

## Process Kaggle Credentials

### Specify kaggle.json (downloaded from Kaggle Profile)

In [11]:
# When prompted, select the kaggle.json file downloaded from your Kaggle profile
files.upload()

{}

### Configure Kaggle Environment

This auto-tuning exercise uses an older house-price dataset
intended for exploring regression. To download the data, you
must first register for the competition with the Kaggle
account associated with the kaggle.json file above.

In [12]:
# Construct .kaggle subdirectory (required by the kaggle python library)
!mkdir -p ~/.kaggle
# Move downloaded kaggle.json file to .kaggle directory
!mv kaggle.json ~/.kaggle/
# Restrict permissions so that only the owner can read the API key
!chmod 600 ~/.kaggle/kaggle.json
# Download the competition data
!kaggle competitions download -c house-prices-advanced-regression-techniques

mv: cannot stat 'kaggle.json': No such file or directory
sample_submission.csv: Skipping, found more recently modified local copy (use --force to force download)
test.csv: Skipping, found more recently modified local copy (use --force to force download)
train.csv: Skipping, found more recently modified local copy (use --force to force download)
data_description.txt: Skipping, found more recently modified local copy (use --force to force download)


## Necessary Data Manipulation

### Process Raw Data

In [0]:
# Import raw dataset
rawData = pd.read_csv("train.csv")
# Remove entries with missing sale prices
rawData = rawData[~rawData.SalePrice.isna()]
# Extract labels
dataLabels = rawData['SalePrice']
# Extract features (every column except the label)
dataFeatures = rawData.drop(['SalePrice'], axis=1)

### Partition Raw Data

In [0]:
# Construct training+validation and test partitions (15% held out for testing)
# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() is its replacement
dataFeaturesTrainValid, dataFeaturesTest, \
dataLabelsTrainValid, dataLabelsTest = train_test_split(
    dataFeatures.to_numpy(), dataLabels.to_numpy(), test_size=0.15)
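One caveat with the split above: the raw features include string-valued columns such as MSZoning and LotShape, and a NumPy object array that mixes strings and numbers is generally not something XGBoost will accept. The following is a minimal sketch of one possible remedy, assuming one-hot encoding with `pd.get_dummies` before splitting. The `toy` frame is a hypothetical stand-in that reuses a few columns and values from the `rawData` preview, and the choice of encoding is an assumption, not necessarily what this notebook intends.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for train.csv: a handful of columns and values
# taken from the first five preview rows of rawData
toy = pd.DataFrame({
    "MSSubClass": [60, 20, 60, 70, 60],
    "MSZoning": ["RL", "RL", "RL", "RL", "RL"],
    "LotShape": ["Reg", "Reg", "IR1", "IR1", "IR1"],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# Separate labels from features
toy_labels = toy["SalePrice"]
# One-hot encode the string columns; numeric columns pass through unchanged
toy_features = pd.get_dummies(toy.drop(["SalePrice"], axis=1), dtype=float)

# Split the now fully numeric arrays (random_state fixed for repeatability)
X_train, X_test, y_train, y_test = train_test_split(
    toy_features.to_numpy(), toy_labels.to_numpy(),
    test_size=0.4, random_state=0)

print(toy_features.columns.tolist())
print(X_train.shape, X_test.shape, X_train.dtype)
```

On the real dataset the same two steps would apply to `dataFeatures` before the call to `train_test_split`. Missing values would still be present after encoding and may need separate handling.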

## Construct Model Template

### Define Objective Function (for Auto-Tuner to Minimize)



In [0]:
def objective():
  """
  Construct the objective function for the optimizer to minimize
  """

In [6]:
rawData[:5]

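| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 12 | 2008 | WD | Normal | 250000 |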


In [0]:
!cat data_description.txt

MSSubClass: Identifies the type of dwelling involved in the sale.	

        20	1-STORY 1946 & NEWER ALL STYLES
        30	1-STORY 1945 & OLDER
        40	1-STORY W/FINISHED ATTIC ALL AGES
        45	1-1/2 STORY - UNFINISHED ALL AGES
        50	1-1/2 STORY FINISHED ALL AGES
        60	2-STORY 1946 & NEWER
        70	2-STORY 1945 & OLDER
        75	2-1/2 STORY ALL AGES
        80	SPLIT OR MULTI-LEVEL
        85	SPLIT FOYER
        90	DUPLEX - ALL STYLES AND AGES
       120	1-STORY PUD (Planned Unit Development) - 1946 & NEWER
       150	1-1/2 STORY PUD - ALL AGES
       160	2-STORY PUD - 1946 & NEWER
       180	PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
       190	2 FAMILY CONVERSION - ALL STYLES AND AGES

MSZoning: Identifies the general zoning classification of the sale.
		
       A	Agriculture
       C	Commercial
       FV	Floating Village Residential
       I	Industrial
       RH	Residential High Density
       RL	Residential Low Density
       RP	Residential Low Density Park 
       RM