# Grading Rubric

## 2.0 Data Preparation (15 points total)

### 2.1 [10 points] Define and prepare your class variables. Use proper variable representations (int, float, one-hot, etc.). Use pre-processing methods (as needed) for dimensionality reduction, scaling, etc. Remove variables that are not needed/useful for the analysis.

### 2.1.1 Classification (Target = IsBadBuy)
Much of our data preparation for classification took place in previous labs, and we refer the reader to that work for details. In outline, we performed the necessary imputations and recategorizations to create a complete and tidy dataset. This included engineering several features: Mileage, Luxury, Axel, and Cylinder categories.

After importing the dataset, we took two approaches to preparing the data, largely because we experimented with different packages for the analysis. For models built with scikit-learn (sklearn), we transformed all categorical fields into “one-hot” form and dropped the original category fields. For models built with H2O, we did not perform “one-hot” transformations, since H2O performs these functions automatically.
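As a minimal sketch of the sklearn-side transformation, the snippet below one-hot encodes two categorical fields with pandas. The column names come from the data dictionary, but the rows are invented for illustration, and `pd.get_dummies` is only one reasonable way to do this; it is not necessarily the call used in our notebooks.

```python
import pandas as pd

# Toy rows: column names follow the data dictionary, values are invented.
df = pd.DataFrame({
    "Transmission": ["AUTO", "MANUAL", "AUTO"],
    "WheelType": ["Alloy", "Covers", "Special"],
    "VehOdo": [61000, 45500, 78200],
})

categorical_cols = ["Transmission", "WheelType"]

# get_dummies adds one indicator column per category level and
# drops the original categorical columns listed in `columns`.
encoded = pd.get_dummies(df, columns=categorical_cols)

print(sorted(encoded.columns))
```

Continuous fields such as VehOdo pass through unchanged, so the encoded frame can go straight to scaling and modeling.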


We scaled the continuous variables in our dataset. Scaling was performed after the train/test split. We chose a stratified shuffle split given the overall class imbalance in the dataset, which was roughly 1/5 in terms of positive to negative IsBadBuy labels.
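The split-then-scale order can be sketched as follows. This is an illustration on synthetic data, not our actual pipeline: the 100/500 class counts mimic the roughly 1/5 imbalance, and `StandardScaler` is an assumed choice of scaler. The scaler is fit on the training rows only, so no information from the test rows leaks into the scaling parameters.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 3 continuous features, 100 positives to 500 negatives.
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(600, 3))
y = np.array([1] * 100 + [0] * 500)

# One stratified shuffle split that holds out 20% of the rows for testing.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y))

# Fit the scaler on the training rows only, then apply it to both parts.
scaler = StandardScaler().fit(X[train_idx])
X_train = scaler.transform(X[train_idx])
X_test = scaler.transform(X[test_idx])
y_train, y_test = y[train_idx], y[test_idx]

print(y_train.mean(), y_test.mean())
```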

### 2.1.2 Regression (Target = MMRAcquisitionAuctionAveragePrice)
INSERT TEXT HERE


### 2.2 [5 points] Describe the final dataset that is used for classification/regression (include a description of any newly formed variables you created).

### 2.2.1 Classification
The following fields were dropped from the analysis because they were either not needed or duplicative:
* PurchDate
* VehYear
* VNZIP1
* WheelTypeID
* Nationality
* TopThreeAmericanName
* Trim
* Submodel
* Zip
* State

This left the following fields to model upon:

### 2.2.1.1 Classification Data Dictionary


|Field Name|Definition|
|-----------|---------------------|
|IsBadBuy|Identifies if the kicked vehicle was an avoidable purchase|
|Auction|Auction provider at which the vehicle was purchased|
|VehicleAge|The years elapsed since the manufacturer's year|
|Make|Vehicle manufacturer|
|Model|Vehicle model|
|Color|Vehicle color|
|Transmission|Vehicle's transmission type (Automatic, Manual)|
|WheelType|The vehicle wheel type description (Alloy, Covers, Special)|
|VehOdo|The vehicle's odometer reading|
|Size|The size category of the vehicle (Compact, SUV, etc.)|
|MMRAcquisitionAuctionAveragePrice|Acquisition price for this vehicle in average condition at time of purchase|
|MMRAcquisitionRetailAveragePrice|Acquisition price for this vehicle in the retail market in average condition at time of purchase|
|BYRNO|Unique number assigned to the buyer that purchased the vehicle|
|VNST|State where the car was purchased|
|VehBCost|Acquisition cost paid for the vehicle at time of purchase|
|IsOnlineSale|Identifies if the vehicle was originally purchased online|
|WarrantyCost|Warranty price (term = 36 months, mileage = 36K)|
|Mileage*|Mileage categorization (LOW, GOOD, HIGH)|
|Luxury*|Luxury car identifier|
|Axel*|2WD or 4WD|
|Cylinder*|The number of cylinders in the car engine|
|*|Indicates feature engineering|

### 2.2.2 Regression
INSERT TEXT HERE

### 2.2.2.1 Regression Data Dictionary
INSERT TABLE HERE

## 3.0 Modeling and Evaluation (70 points total)

### 3.1 [10 points] Choose and explain your evaluation metrics that you will use (i.e., accuracy, precision, recall, F-measure, or any metric we have discussed). Why are the measure(s) appropriate for analyzing the results of your modeling? Give a detailed explanation backing up any assertions.

### 3.1.1 Classification
For evaluating the classification of IsBadBuy on our imbalanced dataset, we explored several options, including ROC-AUC, F1, F2, and their variants. Given the small positive class in this binomial classification, the ability to correctly detect positive samples remains the priority; correctly identifying negative samples is of lesser concern by comparison. As such, we utilize Precision and Recall to measure the performance of our models. These are calculated directly during our hyperparameter and model search exercise.

Given a confusion matrix defined as:


| |Predicted Negative|Predicted Positive|
|-----------|---------------------|--|
|Actual Negative|True Negatives (TN)|False Positives (FP)|
|Actual Positive|False Negatives (FN)|True Positives (TP)|

Precision can be found by:

**Precision = $\frac{TP}{TP + FP}$**

And its analogue Recall as:

**Recall = $\frac{TP}{TP + FN}$**

To be explicit, precision is the primary optimization metric for model choice. We also aggregate AUC, Accuracy, RMSE, F1, and F2 for comparative analysis purposes.
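To make the definitions above concrete, here is a small worked sketch that computes precision, recall, and the F-beta family (F1 and F2) from raw confusion-matrix counts. The counts are invented for illustration and are not results from our models.

```python
def precision(tp, fp):
    """TP / (TP + FP); returns 0.0 when nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """TP / (TP + FN); returns 0.0 when there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_beta(tp, fp, fn, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta = 1 gives F1; beta = 2 gives F2, which weights recall more heavily.
    """
    p, r = precision(tp, fp), recall(tp, fn)
    denom = beta ** 2 * p + r
    return (1 + beta ** 2) * p * r / denom if denom else 0.0


# Hypothetical counts for a 200-row test set.
tp, fp, fn, tn = 30, 10, 20, 140

print(precision(tp, fp))        # 30 / 40
print(recall(tp, fn))           # 30 / 50
print(f_beta(tp, fp, fn, 1.0))  # F1
print(f_beta(tp, fp, fn, 2.0))  # F2
```

Note that TN never enters any of these formulas, which is why they suit a problem where the negative class dominates.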


### 3.1.2 Regression
INSERT TEXT HERE

### 3.2 [10 points] Choose the method you will use for dividing your data into training and testing splits (i.e., are you using Stratified 10-fold cross validation? Why?). Explain why your chosen method is appropriate or use more than one method as appropriate.

### 3.2.1 Classification
We utilize a stratified 10-fold shuffle split to segment our data into stratified folds. This method was specifically tested, and thereby chosen, because of the aforementioned 1/5 imbalance of positive to negative classes in the target (IsBadBuy). This imbalance is shown in **Plot XX: Counts of Training Labels.**

Stratification was chosen specifically because it helps ensure that each fold holds representative class proportions. In addition, compared to a regular shuffle split, the Stratified Shuffle Split yielded better baseline precision in our tests. As a result, we chose it as the predominant cross-validation and split technique for our go-forward classification modeling.
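A minimal sketch of this validation scheme, under stated assumptions: the data is synthetic (generated with `make_classification` at roughly the same class imbalance), `LogisticRegression` is only a placeholder estimator, and the 10% test size per split is an illustrative choice rather than our recorded setting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

# Synthetic imbalanced data: about one positive for every five negatives.
X, y = make_classification(
    n_samples=1200, n_features=10, weights=[5 / 6, 1 / 6], random_state=0
)

# Ten stratified shuffle splits, each holding out 10% of the rows.
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=0)

# Positive-class rate in each held-out part; stratification keeps these close.
fold_rates = [y[test_idx].mean() for _, test_idx in cv.split(X, y)]

# Precision on each split for a placeholder model.
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=cv, scoring="precision"
)

print(min(fold_rates), max(fold_rates))
print(scores.mean())
```

Swapping `StratifiedShuffleSplit` for `ShuffleSplit` in the same script is one way to reproduce the kind of comparison described above.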



### 3.2.2 Regression
INSERT TEXT HERE


### 3.3 [20 points] Create three different classification/regression models (e.g., random forest, KNN, and SVM). Two modeling techniques must be new (but the third could be SVM or logistic regression). Adjust parameters as appropriate to increase generalization performance using your chosen metric.

### 3.3.1 Classification
Our overall approach to model choice and tuning included the following:
1. Baseline evaluation of relevant sklearn classifiers. This included KNeighborsClassifier, xgb.XGBClassifier, HistGradientBoostingClassifier, BaggingClassifier, DecisionTreeClassifier, RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, GaussianNB, LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis.

2. After baseline results were gathered and charted (see **Plot XX: Classifier Precision** and **Plot XX: Classifier Logloss**), we evaluated candidate models. Of these, the HistGradientBoostingClassifier performed very well. However, due to its "experimental status," we chose to eliminate it from consideration. XGBoost and ADAGradientBoosting were chosen for their repeatedly high scores on trial baseline runs.

3. We also performed an AutoML procedure using the H2O package. This was done primarily to rank potentially well-performing models. From this analysis, we chose DistributedRandomForest (DRF) as our final classification model.
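The baseline step in item 1 can be sketched as a loop over candidate classifiers scored on the same stratified splits. This is an illustrative subset only: it uses synthetic data, covers a handful of the sklearn classifiers listed above, leaves out XGBoost to avoid an extra dependency, and uses small, assumed settings (3 splits, 50 estimators) so it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the prepared training set.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[5 / 6, 1 / 6], random_state=0
)

# Candidate models with default or small illustrative settings.
candidates = {
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=0),
    "GradientBoosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "GaussianNB": GaussianNB(),
}

# The same splits are reused for every model so the scores are comparable.
cv = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)

results = {
    name: cross_val_score(model, X, y, cv=cv, scoring="precision").mean()
    for name, model in candidates.items()
}

# Rank candidates from highest to lowest mean precision.
ranking = sorted(results.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```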

**Model 1:** XGBoost
INSERT TEXT HERE

**Model 2:** ADAGradientBoosting
INSERT TEXT HERE

**Model 3:** DistributedRandomForest (DRF)

Random Forest is an ensemble learning method used for classification and regression, in which a majority voting pattern is used to determine the class label for unlabeled instances. Ensembles tend to offer better results in terms of both speed and accuracy. H2O Distributed Random Forest (DRF) generates a forest of classification or regression trees rather than a single tree. DRF classification takes the average prediction of all trees to make a final class prediction.
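The averaging idea can be illustrated with a small sketch. Note the substitution: this uses scikit-learn's `RandomForestClassifier` on synthetic data as a stand-in, not H2O DRF itself, because it lets us reach each tree directly. It checks that the forest's class probabilities equal the mean of the per-tree probabilities.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary data; a stand-in for the real training set.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Class probabilities from every individual tree: shape (trees, rows, classes).
per_tree = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# Average across trees, then pick the most probable class for each row.
averaged = per_tree.mean(axis=0)
predicted = forest.classes_[averaged.argmax(axis=1)]

# The forest's own probabilities should match the manual average.
print(np.allclose(averaged, forest.predict_proba(X)))
```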

The ability to fetch and inspect each tree individually was an important aspect of our model consideration: no tree is lost, and the internal structure of the fetched object contains all the information available about every node in the tree.

We employed a three-tier approach to defining our DRF model, using the AutoML leaderboard technique in the H2O environment for the underlying training, tuning, and selection activities for binary (IsBadBuy) classification.

1.	We defined our base model by specifying the number of trees and a random number generator seed, keeping the default stopping metrics of the H2O environment.

2.	Building on the base model, we introduced cross-validation and the binary-classification option that builds twice as many trees, hoping to achieve higher accuracy at the cost of speed. Furthermore, we introduced a relative tolerance for metric-based stopping, so that training stops when the improvement falls below that tolerance. Since we grappled with imbalance in our dataset, we tried to address it by balancing the class distribution through oversampling of the minority class.

3.	Shifting away from manual parameterization, we employed the H2O AutoML interface, which is designed to have as few parameters as possible. We limited the total number of models trained (max_models), with the goal of finding the “best” model for a desired stopping metric (aucpr, the area under the precision-recall curve) and algorithm preference (DRF).

After reviewing the results of both approaches (2 base models and 8 leaderboard models trained in the process), we selected the best model trained by the AutoML leaderboard as the foundation for our final model.
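The three tiers described above might be configured roughly as follows. This is a sketch under assumptions, not our recorded settings: every numeric value (tree count, seed, fold count, tolerance, stopping rounds, model cap) is a placeholder, and `train_drf_tiers` is a hypothetical helper name. The parameter names themselves are taken from the H2O Python API. The `h2o` imports sit inside the function so the configuration can be read and checked without a running H2O cluster.

```python
# Tier 1: base model with only a tree count and a seed (placeholder values).
BASE_DRF_PARAMS = {"ntrees": 50, "seed": 1234}

# Tier 2: add cross-validation, doubled trees for binary targets,
# tolerance-based early stopping, and class balancing (placeholder values).
TUNED_DRF_PARAMS = {
    **BASE_DRF_PARAMS,
    "nfolds": 10,
    "binomial_double_trees": True,
    "stopping_rounds": 5,        # early stopping is off unless this is > 0
    "stopping_tolerance": 1e-3,  # relative improvement required to continue
    "balance_classes": True,     # oversample the minority class
}

# Tier 3: AutoML restricted to DRF, stopped and ranked by AUCPR.
AUTOML_PARAMS = {
    "max_models": 8,
    "stopping_metric": "AUCPR",
    "sort_metric": "AUCPR",
    "include_algos": ["DRF"],
    "seed": 1234,
}


def train_drf_tiers(train_frame, target="IsBadBuy"):
    """Train all three tiers on an H2OFrame (hypothetical helper).

    Requires the h2o package and a cluster started with h2o.init().
    """
    from h2o.automl import H2OAutoML
    from h2o.estimators import H2ORandomForestEstimator

    # Treat the target as categorical so H2O runs classification.
    train_frame[target] = train_frame[target].asfactor()
    features = [col for col in train_frame.columns if col != target]

    base = H2ORandomForestEstimator(**BASE_DRF_PARAMS)
    base.train(x=features, y=target, training_frame=train_frame)

    tuned = H2ORandomForestEstimator(**TUNED_DRF_PARAMS)
    tuned.train(x=features, y=target, training_frame=train_frame)

    aml = H2OAutoML(**AUTOML_PARAMS)
    aml.train(x=features, y=target, training_frame=train_frame)

    return base, tuned, aml.leader
```

With a cluster running, `train_drf_tiers(h2o.H2OFrame(df))` would return the two hand-built models and the AutoML leader for side-by-side comparison.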


### 3.3.2 Regression

Our overall approach to model choice and tuning included the following:

1.	We performed an AutoML procedure using the H2O package. This was done primarily to rank potentially well-performing models. From this analysis, we chose DistributedRandomForest (DRF) as our final model for the regression task.


2.	
INSERT TEXT HERE


3.	
INSERT TEXT HERE


**Model 1:** DistributedRandomForest(DRF) 

Given the speed and scale of the AutoML feature that we witnessed during the binary classification phase, we decided to pivot our model-building approach for the DRF regression phase. The aim was to get the maximum benefit from our constrained computing resources, train efficiently, and obtain optimum results in our final model.

For the AutoML regression leaderboard session, the goal was to predict the automobile auction's average acquisition price (MMRAcquisitionAuctionAveragePrice) within the same dataset we used for binary classification. We employed the H2O AutoML interface, which is designed to have as few parameters as possible, and limited the total number of models trained (max_models), with the goal of finding the “best” model for a desired stopping metric (RMSE) and algorithm preference (DRF).
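A hedged sketch of the regression session follows. The settings are placeholders rather than our recorded values, `run_regression_automl` is a hypothetical helper name, and the small `rmse` function is included only to make the stopping metric explicit. As in H2O generally, a numeric target column is what makes AutoML treat the task as regression.

```python
import math

# AutoML restricted to DRF, stopped and ranked by RMSE (placeholder values).
REGRESSION_AUTOML_PARAMS = {
    "max_models": 8,
    "stopping_metric": "RMSE",
    "sort_metric": "RMSE",
    "include_algos": ["DRF"],
    "seed": 1234,
}


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def run_regression_automl(train_frame, target="MMRAcquisitionAuctionAveragePrice"):
    """Run AutoML on an H2OFrame with a numeric target (hypothetical helper).

    Requires the h2o package and a cluster started with h2o.init().
    """
    from h2o.automl import H2OAutoML

    features = [col for col in train_frame.columns if col != target]
    aml = H2OAutoML(**REGRESSION_AUTOML_PARAMS)
    aml.train(x=features, y=target, training_frame=train_frame)
    return aml.leaderboard, aml.leader


# Tiny worked example of the metric with invented prices.
print(rmse([3.0, 5.0], [0.0, 9.0]))
```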


**Model 2:** GBM 

INSERT TEXT HERE

**Model 3:** NAME 3  

INSERT TEXT HERE

### 3.4 [10 points] Analyze the results using your chosen method of evaluation. Use visualizations of the results to bolster the analysis. Explain any visuals and analyze why they are interesting to someone that might use this model.

### 3.4.1 Classification
INSERT TEXT HERE

### 3.4.2 Regression
INSERT TEXT HERE

### 3.5 [10 points] Discuss the advantages of each model for each classification task, if any. If there are not advantages, explain why. Is any model better than another? Is the difference significant with 95% confidence? Use proper statistical comparison methods.

### 3.5.1 Classification
INSERT TEXT HERE

### 3.5.2 Regression
INSERT TEXT HERE

### 3.6 [10 points] Which attributes from your analysis are most important? Use proper methods discussed in class to evaluate the importance of different attributes. Discuss the results and hypothesize about why certain attributes are more important than others for a given classification task.

### 3.6.1 Classification
INSERT TEXT HERE

### 3.6.2 Regression
INSERT TEXT HERE

## 4.0 Deployment (5 points total)
### 4.1 [5 points] How useful is your model for interested parties (i.e., the companies or organizations that might want to use it for prediction)? How would you measure the model's value if it was used by these parties? How would you deploy your model for interested parties? What other data should be collected? How often would the model need to be updated, etc.?

### 4.1.1 Classification
INSERT TEXT HERE

### 4.1.2 Regression
INSERT TEXT HERE

### 4.2 Exceptional Work (10 points total) You have free rein to provide additional modeling. One idea: grid search parameters in a parallelized fashion and visualize the performances across attributes. Which parameters are most significant for making a good model for each classification algorithm?

### 4.2.1 Classification
INSERT TEXT HERE

### 4.2.2 Regression
INSERT TEXT HERE