# Mt. Rainier Hike Prediction Tool - Logistic Regression & SVMs

Mt. Rainier hikes are incredibly popular with outdoor enthusiasts, as Mount Rainier National Park is the gem of Washington State. I am developing a feature for a smartphone app that, based on weather and other relevant data recorded for Mount Rainier National Park on a given day, predicts whether a user should climb that day and notifies them.

Input: A set of weather-related and other attributes for a given day at Mt. Rainier.

Output: Whether the user should climb Mt. Rainier on that day (class 1) or not (class 0).

### 1. Data Loading and Exploratory Analysis

The dataset is sourced from https://www.kaggle.com/datasets/codersree/mount-rainier-weather-and-climbing-data. It matches weather reports to climbing data for 2014 and 2015. The data contains the date, daily averages of several weather parameters (temperature, battery voltage, relative humidity, wind speed, wind direction, solar radiation), the climbing statistics, and the target, the success percentage.

In [2]:
# Import relevant libraries
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import pandas as pd
import numpy

# Load the CSV file and drop duplicate rows
ranier_df = pd.read_csv("/Users/audreydahlkemper/Downloads/MtRainier_data (1).csv")
ranier_df = ranier_df.drop_duplicates()

# Apply dropna() to remove rows with missing values
ranier_df = ranier_df.dropna()
print (f"Shape of data {ranier_df.shape}")
ranier_df.head()

Shape of data (1895, 10)


Unnamed: 0.1,Unnamed: 0,Date,Route,Succeeded,Battery Voltage AVG,Temperature AVG,Relative Humidity AVG,Wind Speed Daily AVG,Wind Direction AVG,Solare Radiation AVG
0,0,11/27/2015,Disappointment Cleaver,0,13.64375,26.321667,19.715,27.839583,68.004167,88.49625
1,1,11/21/2015,Disappointment Cleaver,0,13.749583,31.3,21.690708,2.245833,117.549667,93.660417
2,2,10/15/2015,Disappointment Cleaver,0,13.46125,46.447917,27.21125,17.163625,259.121375,138.387
3,3,10/13/2015,Little Tahoma,0,13.532083,40.979583,28.335708,19.591167,279.779167,176.382667
4,4,10/9/2015,Disappointment Cleaver,0,13.21625,38.260417,74.329167,65.138333,264.6875,27.791292


In [10]:
ranier_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1895 entries, 0 to 1894
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Unnamed: 0             1895 non-null   int64  
 1   Date                   1895 non-null   object 
 2   Route                  1895 non-null   object 
 3   Succeeded              1895 non-null   int64  
 4   Battery Voltage AVG    1895 non-null   float64
 5   Temperature AVG        1895 non-null   float64
 6   Relative Humidity AVG  1895 non-null   float64
 7   Wind Speed Daily AVG   1895 non-null   float64
 8   Wind Direction AVG     1895 non-null   float64
 9   Solare Radiation AVG   1895 non-null   float64
dtypes: float64(6), int64(2), object(2)
memory usage: 162.9+ KB


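The features Route and Date are categorical: both are stored as object (string) columns, so they need to be encoded numerically before modeling.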

### 2. Feature Engineering

In [11]:
# Create the dataframes for features and labels
ranier_features_df = ranier_df[["Date", "Route", "Battery Voltage AVG", "Temperature AVG", "Relative Humidity AVG", "Wind Speed Daily AVG", "Wind Direction AVG", "Solare Radiation AVG"]]
ranier_labels_df = ranier_df[["Succeeded"]]

ranier_features_df.head()
ranier_labels_df.head()

Unnamed: 0,Date,Route,Battery Voltage AVG,Temperature AVG,Relative Humidity AVG,Wind Speed Daily AVG,Wind Direction AVG,Solare Radiation AVG
0,11/27/2015,Disappointment Cleaver,13.64375,26.321667,19.715,27.839583,68.004167,88.49625
1,11/21/2015,Disappointment Cleaver,13.749583,31.3,21.690708,2.245833,117.549667,93.660417
2,10/15/2015,Disappointment Cleaver,13.46125,46.447917,27.21125,17.163625,259.121375,138.387
3,10/13/2015,Little Tahoma,13.532083,40.979583,28.335708,19.591167,279.779167,176.382667
4,10/9/2015,Disappointment Cleaver,13.21625,38.260417,74.329167,65.138333,264.6875,27.791292


Unnamed: 0,Succeeded
0,0
1,0
2,0
3,0
4,0


Transform the first categorical variable, Route, into numerical values.

In [14]:
from sklearn.preprocessing import OneHotEncoder
# Transform categorical features into 1-hot
route_to_list = ranier_features_df["Route"].to_list()
route_to_list_of_lists = []

for name in route_to_list:
    route_to_list_of_lists.append([name])
    
route_encoder = OneHotEncoder()
route_encoder.fit(route_to_list_of_lists)

print(f"Unique vocabulary items {len(route_encoder.categories_[0])}\n")

route_transformed = route_encoder.transform(route_to_list_of_lists)
route_transformed = route_transformed.toarray()
route_transformed_df = pd.DataFrame(route_transformed)

ranier_features_df.reset_index(drop=True, inplace=True)
route_transformed_df.reset_index(drop=True, inplace=True)

ranier_features_transformed_df = pd.concat([ranier_features_df,route_transformed_df], axis=1)
ranier_features_transformed_df = ranier_features_transformed_df.drop(columns=["Route"], axis=1)
ranier_features_df = ranier_features_transformed_df
ranier_features_df.head()

OneHotEncoder()

Unique vocabulary items 22



Unnamed: 0,Date,Battery Voltage AVG,Temperature AVG,Relative Humidity AVG,Wind Speed Daily AVG,Wind Direction AVG,Solare Radiation AVG,0,1,2,...,12,13,14,15,16,17,18,19,20,21
0,11/27/2015,13.64375,26.321667,19.715,27.839583,68.004167,88.49625,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,11/21/2015,13.749583,31.3,21.690708,2.245833,117.549667,93.660417,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,10/15/2015,13.46125,46.447917,27.21125,17.163625,259.121375,138.387,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,10/13/2015,13.532083,40.979583,28.335708,19.591167,279.779167,176.382667,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,10/9/2015,13.21625,38.260417,74.329167,65.138333,264.6875,27.791292,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
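The same encoding can be written more compactly, and with self-describing column names, using `pd.get_dummies`. This is a minimal sketch on a made-up three-row frame; the toy values and the `route` prefix are assumptions for illustration, not part of the original notebook. Prefixed names also avoid integer column labels that could collide with those of another encoded variable later on.

```python
import pandas as pd

# Toy stand-in for ranier_features_df (values are illustrative only)
toy_df = pd.DataFrame({
    "Route": ["Disappointment Cleaver", "Little Tahoma", "Disappointment Cleaver"],
    "Temperature AVG": [26.3, 41.0, 38.3],
})

# One-hot encode Route; each new column is named "route_<route name>"
encoded_df = pd.get_dummies(toy_df, columns=["Route"], prefix="route", dtype=float)

print(encoded_df.columns.tolist())
```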


Transform the second categorical variable, Date, into numerical values.

In [16]:
date_to_list = ranier_features_df["Date"].to_list()
date_to_list_of_lists = []

for name in date_to_list:
    date_to_list_of_lists.append([name])
    
date_encoder = OneHotEncoder()
date_encoder.fit(date_to_list_of_lists)

print(f"Unique vocabulary items {len(date_encoder.categories_[0])}\n")

date_transformed = date_encoder.transform(date_to_list_of_lists)
date_transformed = date_transformed.toarray()
date_transformed_df = pd.DataFrame(date_transformed)

ranier_features_df.reset_index(drop=True, inplace=True)
date_transformed_df.reset_index(drop=True, inplace=True)

ranier_features_transformed_df = pd.concat([ranier_features_df,date_transformed_df], axis=1)
ranier_features_transformed_df = ranier_features_transformed_df.drop(columns=["Date"], axis=1)
ranier_features_df = ranier_features_transformed_df
ranier_features_df.head()

OneHotEncoder()

Unique vocabulary items 204



Unnamed: 0,Battery Voltage AVG,Temperature AVG,Relative Humidity AVG,Wind Speed Daily AVG,Wind Direction AVG,Solare Radiation AVG,0,1,2,3,...,194,195,196,197,198,199,200,201,202,203
0,13.64375,26.321667,19.715,27.839583,68.004167,88.49625,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,13.749583,31.3,21.690708,2.245833,117.549667,93.660417,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,13.46125,46.447917,27.21125,17.163625,259.121375,138.387,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,13.532083,40.979583,28.335708,19.591167,279.779167,176.382667,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,13.21625,38.260417,74.329167,65.138333,264.6875,27.791292,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [19]:
ranier_features_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1895 entries, 0 to 1894
Columns: 232 entries, Battery Voltage AVG to 203
dtypes: float64(232)
memory usage: 3.4 MB


### 3. Scale the Data Using Standard Scaler

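Standardize every column to zero mean and unit variance with `StandardScaler`. Note that the scaler is fit on the full dataset here, before the train/test split in the next section.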

In [20]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
all_columns = ranier_features_df.columns

ranier_features_df[all_columns] = scaler.fit_transform(ranier_features_df[all_columns])
ranier_features_df.head()



Unnamed: 0,Battery Voltage AVG,Temperature AVG,Relative Humidity AVG,Wind Speed Daily AVG,Wind Direction AVG,Solare Radiation AVG,0,1,2,3,...,194,195,196,197,198,199,200,201,202,203
0,2.003522,-1.580891,-1.269311,1.895222,-0.958813,-1.567664,-0.032504,0.681507,-0.429389,-0.051434,...,-0.022978,-0.022978,-0.051434,-0.022978,-0.056359,-0.072836,-0.03982,-0.056359,-0.022978,-0.03982
1,3.506158,-1.033951,-1.180109,-0.902775,-0.414849,-1.520897,-0.032504,0.681507,-0.429389,-0.051434,...,-0.022978,-0.022978,-0.051434,-0.022978,-0.056359,-0.072836,-0.03982,-0.056359,-0.022978,-0.03982
2,-0.587638,0.630261,-0.930861,0.72809,1.139477,-1.11585,-0.032504,0.681507,-0.429389,-0.051434,...,-0.022978,-0.022978,-0.051434,-0.022978,-0.056359,-0.072836,-0.03982,-0.056359,-0.022978,-0.03982
3,0.418063,0.029488,-0.880092,0.993477,1.36628,-0.771758,-0.032504,-1.467337,-0.429389,-0.051434,...,-0.022978,-0.022978,-0.051434,-0.022978,-0.056359,-0.072836,-0.03982,-0.056359,-0.022978,-0.03982
4,-4.066182,-0.269251,1.196481,5.972851,1.200588,-2.117412,-0.032504,0.681507,-0.429389,-0.051434,...,-0.022978,-0.022978,-0.051434,-0.022978,-0.056359,-0.072836,-0.03982,-0.056359,-0.022978,-0.03982
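One caveat: fitting the scaler on the full dataset lets statistics from the eventual test rows influence the training features. A common alternative is to split first and fit the scaler on the training rows only. The sketch below assumes synthetic data from `make_classification` in place of the Rainier features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the Rainier dataset)
X, y = make_classification(n_samples=200, n_features=6, random_state=42)

# Split first, then fit the scaler on the training rows only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learns mean/std from training data
X_test_scaled = scaler.transform(X_test)        # reuses the training mean/std

print(np.round(X_train_scaled.mean(axis=0), 6))  # approximately zero per column
```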


### 4. Splitting Data

In [21]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

# First extract test data and store it in x_test, y_test
features = ranier_features_df.to_numpy()
labels = ranier_labels_df.to_numpy()
_x, x_test, _y, y_test = train_test_split(features, labels, test_size=0.10, random_state=42)

# set k = 5
k = 5

kfold_splitter = KFold(n_splits=k)

folds_data = []

fold = 1
for train_index, validation_index in kfold_splitter.split(_x):
    x_train , x_valid = _x[train_index,:],_x[validation_index,:]
    y_train , y_valid = _y[train_index,:] , _y[validation_index,:]
    print (f"Fold {fold} training data shape = {(x_train.shape,y_train.shape)}")
    print (f"Fold {fold} validation data shape = {(x_valid.shape,y_valid.shape)}")
    fold+=1
    folds_data.append((x_train,y_train,x_valid,y_valid))

Fold 1 training data shape = ((1364, 232), (1364, 1))
Fold 1 validation data shape = ((341, 232), (341, 1))
Fold 2 training data shape = ((1364, 232), (1364, 1))
Fold 2 validation data shape = ((341, 232), (341, 1))
Fold 3 training data shape = ((1364, 232), (1364, 1))
Fold 3 validation data shape = ((341, 232), (341, 1))
Fold 4 training data shape = ((1364, 232), (1364, 1))
Fold 4 validation data shape = ((341, 232), (341, 1))
Fold 5 training data shape = ((1364, 232), (1364, 1))
Fold 5 validation data shape = ((341, 232), (341, 1))
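The fold sizes printed above follow from simple arithmetic: 10% of the 1895 rows (190, rounded up) are held out for testing, leaving 1705 rows, and five folds over 1705 rows give 341 validation rows and 1364 training rows each. A quick check with placeholder arrays (zeros standing in for the real features) can be sketched as:

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Placeholder arrays with the same number of rows as the cleaned dataset
features = np.zeros((1895, 3))
labels = np.zeros((1895, 1))

x_rest, x_test, y_rest, y_test = train_test_split(
    features, labels, test_size=0.10, random_state=42
)

# Collect (training rows, validation rows) for every fold
fold_sizes = [
    (len(train_idx), len(valid_idx))
    for train_idx, valid_idx in KFold(n_splits=5).split(x_rest)
]

print(x_test.shape[0], x_rest.shape[0])
print(fold_sizes)
```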


### 5. Define the Models: Logistic Regression, SVC, and Gradient Boosting Classifier

In [27]:
# Define the models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.ensemble import GradientBoostingClassifier

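# No regularization. Note: scikit-learn 1.2 and later expect penalty=None here instead of the string 'none'
lr_vanilla = LogisticRegression(penalty= 'none')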
svm_linear = SVC(kernel="linear")
grad_boost = GradientBoostingClassifier()

# Keep all the models in a dictionary

all_models = {"lr_vanilla":lr_vanilla, 
              #"lr_L2":lr_L2,
              "svm_linear":svm_linear,
             # "svm_poly":svm_poly,
             "grad_boost":grad_boost}

### Cross Validation 

In [28]:
best_validation_accuracy = 0
best_model_name = ""
best_model = None

# Iterate over all models
for model_name in all_models.keys():
    
    print (f"Evaluating {model_name} ...")
    model = all_models[model_name]
    
    # Store training and validation accuracies for all folds
    train_acc_for_all_folds = []
    valid_acc_for_all_folds = []
    
    #Iterate over all folds
    for i, fold in enumerate(folds_data):
        x_train, y_train, x_valid, y_valid = fold

        # Train the model
        _ = model.fit(x_train,y_train.flatten())

        # Evaluate the model on training data
        y_pred_train = model.predict(x_train)
        
        # Evaluate the model on validation data
        y_pred_valid = model.predict(x_valid)
        
        # Compute training accuracy
        train_acc = accuracy_score(y_pred_train , y_train.flatten())
        
        # Store training accuracy for each folds
        train_acc_for_all_folds.append(train_acc)
        
        # Compute validation accuracy
        valid_acc = accuracy_score(y_pred_valid , y_valid.flatten())

        # Store validation accuracy for each folds
        valid_acc_for_all_folds.append(valid_acc)
    
    #average training accuracy across k folds
    avg_training_acc = sum(train_acc_for_all_folds)/k
    
    print (f"Average training accuracy for model {model_name} = {avg_training_acc}")
    
    #average validation accuracy across k folds
    avg_validation_acc = sum(valid_acc_for_all_folds)/k
    
    print (f"Average validation accuracy for model {model_name} = {avg_validation_acc}")
    
    # Select best model based on average validation accuracy
    if avg_validation_acc > best_validation_accuracy:
        best_validation_accuracy = avg_validation_acc
        best_model_name = model_name
        best_model = model
    print (f"-----------------------------------")

print (f"Best model for the task is {best_model_name} which offers the validation accuracy of {best_validation_accuracy}")

Evaluating lr_vanilla ...
Average training accuracy for model lr_vanilla = 0.7145161290322581
Average validation accuracy for model lr_vanilla = 0.619941348973607
-----------------------------------
Evaluating svm_linear ...
Average training accuracy for model svm_linear = 0.709090909090909
Average validation accuracy for model svm_linear = 0.6205278592375366
-----------------------------------
Evaluating grad_boost ...
Average training accuracy for model grad_boost = 0.7118768328445748
Average validation accuracy for model grad_boost = 0.6416422287390029
-----------------------------------
Best model for the task is grad_boost which offers the validation accuracy of 0.6416422287390029


In [31]:
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

# First extract our test data and store it in x_test, y_test
features = ranier_features_df.to_numpy()
labels = ranier_labels_df.to_numpy()
_x, x_test, _y, y_test = train_test_split(features, labels, test_size=0.10, random_state=42)

k = 5 # 5-fold

# Use sklearn's cross validation score directly
# Speed up training using n_jobs parameter which specifies how many cpu_cores to use

best_model_name = ""
best_model_valid_accuracy = 0
best_model = None

for model_name in all_models.keys():
    model = all_models[model_name]
    cv_scores = cross_val_score(model,_x,_y.flatten(), cv=k, n_jobs=4)
    average_cv_score = cv_scores.mean()
    print (f"Mean cross validation accuracy for model {model_name} = {average_cv_score}")

    if average_cv_score > best_model_valid_accuracy :
        best_model_name = model_name
        best_model_valid_accuracy  = average_cv_score
        best_model = model

print (f"Best model is {best_model_name} with {k}-fold accuracy of {best_model_valid_accuracy}")

Mean cross validation accuracy for model lr_vanilla = 0.6263929618768328
Mean cross validation accuracy for model svm_linear = 0.6269794721407624
Mean cross validation accuracy for model grad_boost = 0.6439882697947213
Best model is grad_boost with 5-fold accuracy of 0.6439882697947213


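The gradient boosting classifier is the best model in both cross-validation runs (the manual k-fold loop and `cross_val_score`). With `cross_val_score` it reached a mean validation accuracy of 0.644, compared to 0.626 for logistic regression and 0.627 for SVC. Logistic regression had the highest average training accuracy (0.715), so this ranking applies to validation accuracy only.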
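Because the three scores differ by less than two percentage points, it can help to look at the spread across folds before declaring a winner. The following is a minimal sketch on synthetic data (`make_classification` is an assumption standing in for the Rainier features) that reports the mean and standard deviation of the per-fold accuracies.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the Rainier dataset)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = {
    "lr": LogisticRegression(max_iter=1000),
    "svm_linear": SVC(kernel="linear"),
    "grad_boost": GradientBoostingClassifier(n_estimators=20, random_state=0),
}

# Store (mean, standard deviation) of the 5-fold accuracies per model
summary = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

If the gap between two models is smaller than their fold-to-fold standard deviation, the ranking between them is not well supported by the data.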

### Three Model Candidates - Logistic Regression, SVC, and Gradient Boosting

#### A. Logistic Regression

Feature ablation of Logistic Regression model.

In [32]:
# Run ablation tests on logistic regression
# Note: this is a default LogisticRegression (L2 penalty), not the unpenalized lr_vanilla
best_model = LogisticRegression()

feature_names = ranier_features_df.columns

# Maintain an accuracy dictionary

accuracy_drop_log = {"No ablation":0}

for i in range(len(feature_names)):
    # Drop one feature at a time
    feature_name = feature_names[i]
    print (f"Removing feature {feature_name}")

    x_ablated = numpy.delete(_x,i,axis=1) # axis = 1 means columns
    
    cv_scores = cross_val_score(best_model,x_ablated,_y.flatten(), cv=k, n_jobs=4)
    average_cv_score = cv_scores.mean()
    print (f"Mean cross validation accuracy = {average_cv_score}")
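    # Note: the drop is measured against best_model_valid_accuracy (the grad_boost score),
    # not against this model's own unablated accuracy
    accuracy_drop_log[feature_name] = best_model_valid_accuracy-average_cv_score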

Removing feature Battery Voltage AVG
Mean cross validation accuracy = 0.6269794721407624
Removing feature Temperature AVG
Mean cross validation accuracy = 0.6281524926686217
Removing feature Relative Humidity AVG
Mean cross validation accuracy = 0.6269794721407624
Removing feature Wind Speed Daily AVG
Mean cross validation accuracy = 0.6281524926686217
Removing feature Wind Direction AVG
Mean cross validation accuracy = 0.6275659824046921
Removing feature Solare Radiation AVG
Mean cross validation accuracy = 0.6252199413489736
Removing feature 0
Mean cross validation accuracy = 0.6269794721407624
Removing feature 1
Mean cross validation accuracy = 0.6269794721407624
Removing feature 2
Mean cross validation accuracy = 0.6269794721407624
Removing feature 3
Mean cross validation accuracy = 0.6263929618768328
Removing feature 4
Mean cross validation accuracy = 0.6269794721407624
Removing feature 5
Mean cross validation accuracy = 0.6269794721407624
Removing feature 6
Mean cross validation 

In [33]:
def criteria(l):
    return l[1]

sorted_accs =  sorted(accuracy_drop_log.items(),key=criteria, reverse=True)

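# Note: best_model_name still holds "grad_boost" from the model comparison,
# so the header names grad_boost even though logistic regression was ablated here
print (f"Features are ranked from best to worst (based on how removing them impacts the accuracy of {best_model_name})")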
print (f"**************************************")

i=1
for entry in sorted_accs:
    feature_name = entry[0]
    acc_drop = entry[1]
    
    if feature_name != "No ablation":
        print (f"Feature {i}.{feature_name}, drop in acc {acc_drop}")
        i=i+1


Features are ranked from best to worst (based on how removing them impacts the accuracy of grad_boost)
**************************************
Feature 1.173, drop in acc 0.02580645161290318
Feature 2.191, drop in acc 0.02346041055718473
Feature 3.43, drop in acc 0.02287390029325509
Feature 4.56, drop in acc 0.02228739002932545
Feature 5.53, drop in acc 0.02170087976539581
Feature 6.91, drop in acc 0.02170087976539581
Feature 7.167, drop in acc 0.02170087976539581
Feature 8.187, drop in acc 0.02170087976539581
Feature 9.55, drop in acc 0.02111436950146628
Feature 10.57, drop in acc 0.02111436950146628
Feature 11.110, drop in acc 0.02111436950146628
Feature 12.126, drop in acc 0.02111436950146628
Feature 13.35, drop in acc 0.02052785923753664
Feature 14.46, drop in acc 0.02052785923753664
Feature 15.154, drop in acc 0.02052785923753664
Feature 16.188, drop in acc 0.02052785923753664
Feature 17.65, drop in acc 0.019941348973606998
Feature 18.70, drop in acc 0.019941348973606998
Feature 19.

#### B. SVC

Feature ablation of SVC model. 

In [34]:
# Note: a default SVC uses the RBF kernel, not the linear kernel of svm_linear
best_model = SVC()

feature_names = ranier_features_df.columns

# Maintain an accuracy dictionary

accuracy_drop_log = {"No ablation":0}

for i in range(len(feature_names)):
    # Drop one feature at a time
    feature_name = feature_names[i]
    print (f"Removing feature {feature_name}")

    x_ablated = numpy.delete(_x,i,axis=1) # axis = 1 means columns
    
    cv_scores = cross_val_score(best_model,x_ablated,_y.flatten(), cv=k, n_jobs=4)
    average_cv_score = cv_scores.mean()
    print (f"Mean cross validation accuracy = {average_cv_score}")
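    # Note: the drop is measured against best_model_valid_accuracy (the grad_boost score),
    # not against this model's own unablated accuracy
    accuracy_drop_log[feature_name] = best_model_valid_accuracy-average_cv_score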

Removing feature Battery Voltage AVG
Mean cross validation accuracy = 0.6334310850439883
Removing feature Temperature AVG
Mean cross validation accuracy = 0.6340175953079179
Removing feature Relative Humidity AVG
Mean cross validation accuracy = 0.6334310850439883
Removing feature Wind Speed Daily AVG
Mean cross validation accuracy = 0.6328445747800586
Removing feature Wind Direction AVG
Mean cross validation accuracy = 0.6334310850439883
Removing feature Solare Radiation AVG
Mean cross validation accuracy = 0.6346041055718474
Removing feature 0
Mean cross validation accuracy = 0.6316715542521993
Removing feature 1
Mean cross validation accuracy = 0.6340175953079179
Removing feature 2
Mean cross validation accuracy = 0.6340175953079179
Removing feature 3
Mean cross validation accuracy = 0.6340175953079179
Removing feature 4
Mean cross validation accuracy = 0.6334310850439883
Removing feature 5
Mean cross validation accuracy = 0.6340175953079179
Removing feature 6
Mean cross validation 

In [36]:
def criteria(l):
    return l[1]

sorted_accs =  sorted(accuracy_drop_log.items(),key=criteria, reverse=True)

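# Note: best_model_name still holds "grad_boost" from the model comparison,
# so the header names grad_boost even though SVC was ablated here
print (f"Features are ranked from best to worst (based on how removing them impacts the accuracy of {best_model_name})")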
print (f"**************************************")

i=1
for entry in sorted_accs:
    feature_name = entry[0]
    acc_drop = entry[1]
    
    if feature_name != "No ablation":
        print (f"Feature {i}.{feature_name}, drop in acc {acc_drop}")
        i=i+1

Features are ranked from best to worst (based on how removing them impacts the accuracy of grad_boost)
**************************************
Feature 1.123, drop in acc 0.015249266862170097
Feature 2.199, drop in acc 0.015249266862170097
Feature 3.71, drop in acc 0.014076246334310816
Feature 4.143, drop in acc 0.014076246334310816
Feature 5.177, drop in acc 0.014076246334310816
Feature 6.198, drop in acc 0.014076246334310816
Feature 7.124, drop in acc 0.013489736070381175
Feature 8.178, drop in acc 0.013489736070381175
Feature 9.196, drop in acc 0.013489736070381175
Feature 10.54, drop in acc 0.012903225806451535
Feature 11.76, drop in acc 0.012903225806451535
Feature 12.109, drop in acc 0.012903225806451535
Feature 13.119, drop in acc 0.012903225806451535
Feature 14.40, drop in acc 0.012316715542522005
Feature 15.51, drop in acc 0.012316715542522005
Feature 16.58, drop in acc 0.012316715542522005
Feature 17.63, drop in acc 0.012316715542522005
Feature 18.79, drop in acc 0.012316715542

#### C. Gradient Boosting Classifier

Feature ablation of Gradient Boost Classifier model. 

In [37]:
best_model = GradientBoostingClassifier()

feature_names = ranier_features_df.columns

# Maintain an accuracy dictionary

accuracy_drop_log = {"No ablation":0}

for i in range(len(feature_names)):
    # Drop one feature at a time
    feature_name = feature_names[i]
    print (f"Removing feature {feature_name}")
    
    # Remove the feature by not selecting the column from the index i

    x_ablated = numpy.delete(_x,i,axis=1) # axis = 1 means columns
    
    cv_scores = cross_val_score(best_model,x_ablated,_y.flatten(), cv=k, n_jobs=4)
    average_cv_score = cv_scores.mean()
    print (f"Mean cross validation accuracy = {average_cv_score}")
    accuracy_drop_log[feature_name] = best_model_valid_accuracy-average_cv_score

Removing feature Battery Voltage AVG
Mean cross validation accuracy = 0.6387096774193548
Removing feature Temperature AVG
Mean cross validation accuracy = 0.6439882697947213
Removing feature Relative Humidity AVG
Mean cross validation accuracy = 0.6439882697947213
Removing feature Wind Speed Daily AVG
Mean cross validation accuracy = 0.644574780058651
Removing feature Wind Direction AVG
Mean cross validation accuracy = 0.6451612903225806
Removing feature Solare Radiation AVG
Mean cross validation accuracy = 0.632258064516129
Removing feature 0
Mean cross validation accuracy = 0.6439882697947213
Removing feature 1
Mean cross validation accuracy = 0.6410557184750733
Removing feature 2
Mean cross validation accuracy = 0.6422287390029325
Removing feature 3
Mean cross validation accuracy = 0.6439882697947213
Removing feature 4
Mean cross validation accuracy = 0.6410557184750733
Removing feature 5
Mean cross validation accuracy = 0.6434017595307917
Removing feature 6
Mean cross validation ac

In [39]:
def criteria(l):
    return l[1]

sorted_accs =  sorted(accuracy_drop_log.items(),key=criteria, reverse=True)

print (f"Features are ranked from best to worst (based on how removing them impacts the accuracy of {best_model_name})")
print (f"**************************************")

i=1
for entry in sorted_accs:
    feature_name = entry[0]
    acc_drop = entry[1]
    
    if feature_name != "No ablation":
        print (f"Feature {i}.{feature_name}, drop in acc {acc_drop}")
        i=i+1

Features are ranked from best to worst (based on how removing them impacts the accuracy of grad_boost)
**************************************
Feature 1.Solare Radiation AVG, drop in acc 0.011730205278592365
Feature 2.Battery Voltage AVG, drop in acc 0.005278592375366542
Feature 3.79, drop in acc 0.005278592375366542
Feature 4.71, drop in acc 0.002346041055718451
Feature 5.110, drop in acc 0.002346041055718451
Feature 6.20, drop in acc 0.0017595307917888103
Feature 7.47, drop in acc 0.0017595307917888103
Feature 8.51, drop in acc 0.0017595307917888103
Feature 9.57, drop in acc 0.0017595307917888103
Feature 10.68, drop in acc 0.0017595307917888103
Feature 11.72, drop in acc 0.0017595307917888103
Feature 12.83, drop in acc 0.0017595307917888103
Feature 13.88, drop in acc 0.0017595307917888103
Feature 14.128, drop in acc 0.0017595307917888103
Feature 15.130, drop in acc 0.0017595307917888103
Feature 16.131, drop in acc 0.0017595307917888103
Feature 17.139, drop in acc 0.0017595307917888103
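
The three ablation cells above repeat the same loop and measure every drop against the gradient boosting score stored in `best_model_valid_accuracy`. A sketch of a reusable helper that instead compares each model against its own unablated score is shown here; the function name `ablation_drops` and the synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def ablation_drops(model, X, y, feature_names, cv=5):
    """Return {feature: baseline accuracy - accuracy without that feature}."""
    baseline = cross_val_score(model, X, y, cv=cv).mean()
    drops = {}
    for i, name in enumerate(feature_names):
        X_ablated = np.delete(X, i, axis=1)  # remove column i
        score = cross_val_score(model, X_ablated, y, cv=cv).mean()
        drops[name] = baseline - score
    return drops


# Synthetic stand-in data (assumption: not the Rainier dataset)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
names = [f"f{i}" for i in range(X.shape[1])]
drops = ablation_drops(LogisticRegression(max_iter=1000), X, y, names)

# Rank features from largest to smallest accuracy drop
for name, drop in sorted(drops.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {drop:+.4f}")
```

With a per-model baseline, a positive value means removing the feature hurt that model, and a negative value means removing it helped.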

# Comments, Insights, and Result Analysis 

Being able to predict climbing outcomes can shape a hiker's success, safety, and overall enjoyment. To predict whether the user should hike, I evaluated three candidate models: Logistic Regression, SVC, and Gradient Boosting Classifier. All three returned very similar validation scores. The best model for the task is grad_boost, with a validation accuracy of 0.642 in the manual k-fold loop and 0.644 with cross_val_score, whereas Logistic Regression and SVC both scored between 0.62 and 0.63.

The top five features for each model are listed below.

Logistic Regression
Feature 1.173, drop in acc 0.02580645161290318
Feature 2.191, drop in acc 0.02346041055718473
Feature 3.43, drop in acc 0.02287390029325509
Feature 4.56, drop in acc 0.02228739002932545
Feature 5.53, drop in acc 0.02170087976539581

SVC
Feature 1.123, drop in acc 0.015249266862170097
Feature 2.199, drop in acc 0.015249266862170097
Feature 3.71, drop in acc 0.014076246334310816
Feature 4.143, drop in acc 0.014076246334310816
Feature 5.177, drop in acc 0.014076246334310816

Gradient Boosting Classifier
Feature 1.Solare Radiation AVG, drop in acc 0.011730205278592365
Feature 2.Battery Voltage AVG, drop in acc 0.005278592375366542
Feature 3.79, drop in acc 0.005278592375366542
Feature 4.71, drop in acc 0.002346041055718451
Feature 5.110, drop in acc 0.002346041055718451

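Because the variables "Route" and "Date" were one-hot encoded, they appear as individual numbered columns. "Route" produced columns labeled 0-21 and "Date" produced columns labeled 0-203 (6 numeric + 22 + 204 = 232 columns in total), so labels 0-21 are ambiguous, while any label of 22 or higher can only be a "Date" column. Every numbered feature in the lists above is 22 or higher, so an interpretation of those lists looks like this.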

Logistic Regression
"Date", drop in acc 0.02580645161290318
"Date", drop in acc 0.02346041055718473
"Date", drop in acc 0.02287390029325509
"Date", drop in acc 0.02228739002932545
"Date", drop in acc 0.02170087976539581

SVC
"Date", drop in acc 0.015249266862170097
"Date", drop in acc 0.015249266862170097
"Date", drop in acc 0.014076246334310816
"Date", drop in acc 0.014076246334310816
"Date", drop in acc 0.014076246334310816

Gradient Boosting Classifier
Feature 1.Solare Radiation AVG, drop in acc 0.011730205278592365
Feature 2.Battery Voltage AVG, drop in acc 0.005278592375366542
"Date", drop in acc 0.005278592375366542
"Date", drop in acc 0.002346041055718451
"Date", drop in acc 0.002346041055718451

Based on the rankings for all three models, individual "Date" columns dominate the top of every list, which means the models lean heavily on specific calendar dates. Because an exact date never recurs, this signal is unlikely to carry over to future days, which suggests dropping the full "Date" feature or replacing it with something coarser. Dates are highly variable from year to year, especially when recorded as a full month/day/year value. Shortening "Date" to the month, or to month and day, would leave less room for variance within the data: while hiking success can vary day over day and year over year, weather trends are more likely to be consistent from month to month.
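
As a sketch of the coarser encoding suggested above, the date strings can be parsed with pandas and reduced to a month number. The column name `Month` and the toy rows are assumptions for illustration.

```python
import pandas as pd

# Toy rows in the same month/day/year format as the Date column
toy_df = pd.DataFrame({"Date": ["11/27/2015", "10/9/2015", "7/4/2014"]})

# Parse month/day/year strings, then keep only the month (1-12)
parsed = pd.to_datetime(toy_df["Date"], format="%m/%d/%Y")
toy_df["Month"] = parsed.dt.month

print(toy_df)
```

A single `Month` column (or twelve one-hot month columns) would replace the 204 per-date columns used in this notebook.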