# Fixed Operations
# Using Machine Learning Algorithms for Trending and Predictive Modeling

Overview:

The following code identifies trends and produces predictive scores from dealership data. The example below predicts the Total Gross Profit (GP) for the Fixed Ops department using the k-nearest neighbors (KNN) algorithm.

## Upload the Data

_To execute a code cell, click inside it and press **Shift+Enter**._

In [4]:
# Import libraries
import numpy as np
import pandas as pd

from sklearn.utils import shuffle
# sklearn.cross_validation and sklearn.grid_search were removed in scikit-learn 0.20;
# train_test_split and GridSearchCV now live in sklearn.model_selection
from sklearn.model_selection import train_test_split, GridSearchCV

# Read the data
fixed_ops_data = pd.read_csv("fixed_operations_data.csv")
print("\n")
print("Result...")
print("Fixed Operations data read successfully!")

# Identify number of dealerships (rows) and number of data-points (columns) per dealership
n_dealerships = len(fixed_ops_data)
n_columns = len(fixed_ops_data.columns)

print("Number of dealerships used for trending/predictive modeling: {}".format(n_dealerships))
print("Number of columns/data-points per dealership: {}".format(n_columns))



Result...
Fixed Operations data read successfully!
Number of dealerships used for trending/predictive modeling: 25
Number of columns/data-points per dealership: 40


## Prepare the Data
Prepare the data for modeling, training and testing.

### A) Identify feature and target columns
In this exercise, "Total GP $ Fixed Ops Department" is the target column. The algorithm predicts the Total GP from the remaining data provided.

### B) Split data into training and test sets
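Split the data (both features and target) into training and test sets. Holding out a test set that the model never sees during training makes it possible to detect "overfitting": a model that scores well on the training data but poorly on the test data has memorized the training set rather than learned a general pattern.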

In [5]:
# Extract feature (X) and target (y) columns
feature_cols = list(fixed_ops_data.columns[1:-1])  # all columns but the first and last are features
target_col = fixed_ops_data.columns[-1]  # last column is the target/label -> Total GP for Fixed Ops department

X_all = fixed_ops_data[feature_cols]  # feature values for all dealerships
y_all = fixed_ops_data[target_col]  # corresponding targets (Total GP for Fixed Ops department)

# Identify training vs test split
num_all = fixed_ops_data.shape[0]  # all dealerships
num_train = 20  # training group size
num_test = num_all - num_train  # testing group size

# Randomly shuffle (train_test_split also shuffles by default, so this step is optional)
X_all, y_all = shuffle(X_all, y_all, random_state=0)

# Divide into training and testing set
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size=num_test, random_state=0)

print("Training set: {} dealerships".format(X_train.shape[0]))
print("Test set: {} dealerships".format(X_test.shape[0]))

Training set: 20 dealerships
Test set: 5 dealerships
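
To illustrate why the split matters, the sketch below compares training and test scores on a held-out set. It uses synthetic data rather than the dealership file, so `X_demo`, `y_demo`, and every number in it are assumptions made purely for illustration. A large gap between the two scores is the usual symptom of overfitting.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the dealership data (hypothetical values, not real results)
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(25, 3))
y_demo = X_demo.sum(axis=1) + rng.normal(scale=2.0, size=25)

# Same 20/5 split sizes as in the notebook
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=5, random_state=0)

# With k=1 every training point is its own nearest neighbor,
# so the model reproduces the training targets exactly
model = KNeighborsRegressor(n_neighbors=1).fit(X_tr, y_tr)

train_r2 = model.score(X_tr, y_tr)  # R^2 on data the model has seen
test_r2 = model.score(X_te, y_te)   # R^2 on held-out data

print("Training R^2: {:.3f}".format(train_r2))
print("Test R^2: {:.3f}".format(test_r2))
```

A perfect training score paired with a lower test score suggests the model is memorizing rather than generalizing; increasing `n_neighbors` is one way to smooth the predictions.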


## Train Machine Learning Model

KNeighborsRegressor model: regression based on k-nearest neighbors.

The target is predicted by local interpolation of the targets associated with the nearest neighbors in the training set.
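
The idea can be sketched in a few lines of NumPy. This is a minimal illustration, assuming Euclidean distance and uniform weights (the defaults shown in the estimator output below); the function `knn_predict` and the toy arrays are hypothetical and are not part of scikit-learn or the dealership data.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=2):
    """Predict a target as the mean target of the k closest training rows."""
    # Euclidean distance from the query point to every training row
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Uniform weighting: a plain average of the neighbors' targets
    return y_train[nearest].mean()

# Toy data: one feature, four training points (hypothetical values)
X_toy = np.array([[1.0], [2.0], [3.0], [10.0]])
y_toy = np.array([100.0, 200.0, 300.0, 1000.0])

# The two nearest neighbors of 2.4 are x=2 and x=3
prediction = knn_predict(X_toy, y_toy, np.array([2.4]), k=2)
print(prediction)  # 250.0, the mean of 200.0 and 300.0
```

Because the prediction is an average of nearby targets, it can never fall outside the range of target values seen in training.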

In [6]:
# Train a model
def train_knn(knn, X_train, y_train):
    print("Training {}...".format(knn.__class__.__name__))
    
    # parameters for GridSearchCV: candidate values of k
    k = [1,2,3,4,5,6,7,8,9,10]
    parameters = {'n_neighbors': k}
    
    # Implement GridSearchCV (10-fold cross-validation over each candidate k)
    grid = GridSearchCV(knn, parameters, cv=10)
    grid.fit(X_train, y_train)
    
    print("best parameter: ", grid.best_params_)
    print("best score: ", grid.best_score_)
    
    # Return the fitted search object so the caller can use the tuned model
    return grid
    
# Apply model, import and instantiate object
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor()

# Fit model to training data and keep the fitted search result
grid_search = train_knn(knn, X_train, y_train)

# Note: GridSearchCV fits clones, so knn itself still holds its default parameters;
# the tuned model is available as grid_search.best_estimator_
print(knn)

Training KNeighborsRegressor...
best parameter:  {'n_neighbors': 2}
best score:  -9.56912768025
KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_neighbors=5, p=2, weights='uniform')
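
The estimator printed above still reports `n_neighbors=5` because `GridSearchCV` tunes clones of the estimator it is given and leaves the original untouched. The sketch below shows one way to retrieve the tuned model from the fitted search object. It runs on synthetic data, so `X_demo`, `y_demo`, and the resulting best `k` are assumptions for illustration only. It also uses `cv=5` rather than 10: with only 20 training rows, 10 folds leave two rows per validation fold, and the default R^2 score is very noisy on so few points, which may explain the strongly negative best score above.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the 20-row training set (hypothetical values)
rng = np.random.RandomState(0)
X_demo = rng.uniform(0, 10, size=(20, 3))
y_demo = X_demo.sum(axis=1) + rng.normal(scale=1.0, size=20)

base = KNeighborsRegressor()  # default n_neighbors=5
search = GridSearchCV(base, {"n_neighbors": list(range(1, 11))}, cv=5)
search.fit(X_demo, y_demo)

best_k = search.best_params_["n_neighbors"]
tuned = search.best_estimator_  # refit on all of X_demo with the best k

print("original estimator k:", base.n_neighbors)  # unchanged by the search
print("tuned estimator k:", tuned.n_neighbors)
```

Using `best_estimator_` directly avoids copying the best parameter by hand into a new estimator.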


In [14]:
# Update knn with best parameter
# knn = KNeighborsRegressor(n_neighbors=2, weights='distance')
knn = KNeighborsRegressor(n_neighbors=2)

# Fit data - compare between training data and testing data for "overfitting"
# knn.fit(X_test, y_test)
knn.fit(X_train, y_train)

# Predict the output of a particular sample
# Read the input sheet (kept in its own variable so the training data is not overwritten)
input_data = pd.read_csv("fixed_operations_input_sheet.csv")
feature_cols = list(input_data.columns[1:-1])  # all columns but the first and last are features
dealership_name = input_data.dealership[0]  # name of the first dealership in the input sheet
x = input_data[feature_cols]  # feature values for the dealerships in the input sheet
y = knn.predict(x)

print("dealership: ", dealership_name)
print("Prediction on Total GP for Fixed Ops Department: ${:,.2f}".format(y[0]))

dealership:  Gosselin_County_Nissan (fictitious account)
Prediction on Total GP for Fixed Ops Department: $717,739.00
