This notebook shows how the results of the machine learning models change when different numbers of features are used.

Let's load the API.

In [1]:
# import the API
APILoc = r"*insert API location*"

import sys
sys.path.insert(0, APILoc)

from API import *

The following cell loads the evaluations of each model. For each model they are stored in a dictionary 'metricsDict' of the form:
*
{'metric 1': [list of 10 values], 'metric 2': [list of 10 values]... }
*

The 10 values of each metric are used to find the mean and standard deviation for each model, and will be used for t-tests to see which model was significantly better than the others. The summary values are collected in a dictionary 'modelsDict' that has the model names as its keys.

The results will then be saved to Excel and CSV files.
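
As a minimal sketch of the layout and statistics described above, the following builds a small dictionary holding one list of values per metric for each model, then computes the sample standard deviation and a Welch t statistic for one metric. The model values and the helper `welch_t` are hypothetical and for illustration only; they are not part of the API. Only the t statistic is computed here, with the standard library; a p-value would need a statistics package.

```python
import math
import statistics

# Hypothetical data of the form {'model name': {'metric': [values], ...}, ...}.
# Only 5 values per list are used here to keep the example short.
exampleModelsDict = {
    "rf": {"rSq": [0.80, 0.82, 0.78, 0.81, 0.79]},
    "knn": {"rSq": [0.70, 0.72, 0.68, 0.71, 0.69]},
}

def welch_t(a, b):
    """Return Welch's t statistic for two independent samples."""
    varA = statistics.variance(a)  # sample variance (n - 1 in the denominator)
    varB = statistics.variance(b)
    stdErr = math.sqrt(varA / len(a) + varB / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / stdErr

rfRSq = exampleModelsDict["rf"]["rSq"]
knnRSq = exampleModelsDict["knn"]["rSq"]

# sample standard deviation of one metric for one model
rfStDev = statistics.stdev(rfRSq)

# t statistic comparing the same metric across two models
tStat = welch_t(rfRSq, knnRSq)
print(round(rfStDev, 4), round(tStat, 2))
```

A large absolute t statistic suggests the two models differ on that metric by more than the run-to-run variation would explain.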

In [None]:
# location to save the results
saveLoc = allFeatModelsLoc

# initialize dictionary that has model names as its keys and, as its values, dictionaries holding the evaluation metrics for that model
modelsDict = {}

for model in modelList:
    # load the tuple
    location = allFeatModelsLoc + "\\" + model
    aTuple = joblib.load(location) # has the form: (modelsList, metricsDict)
    
    # get model name
    stopIndex = model.find("Tup.")
    modelName = model[:stopIndex]
    
    # get the dictionary holding the evaluation values
    metricsDict = aTuple[1]
    
    # get index of the model with the highest R-squared value. This will be considered to be the index of the 'best' model.
    bestRSqIndex = metricsDict['rSq'].index(max(metricsDict['rSq']))
    # get index of the model with the lowest MAE (lower error is better); not used below
    bestMAEIndex = metricsDict['mae'].index(min(metricsDict['mae']))
    
    # find the average values of each evaluation metric
    avgMAE = statistics.mean(metricsDict['mae'])
    avgMAPE = statistics.mean(metricsDict['mape'])
    avgRMSE = statistics.mean(metricsDict['rmse'])
    avgR = statistics.mean(metricsDict['r'])
    avgRSq = statistics.mean(metricsDict['rSq'])
    
    # find the evaluation values for the model with the highest RSq
    bestMAE = metricsDict['mae'][bestRSqIndex]
    bestMAPE = metricsDict['mape'][bestRSqIndex]
    bestRMSE = metricsDict['rmse'][bestRSqIndex]
    bestR = metricsDict['r'][bestRSqIndex]
    bestRSq = metricsDict['rSq'][bestRSqIndex]
    
    # find the standard deviation of each metric
    stDevMAE = statistics.stdev(metricsDict['mae']) 
    stDevMAPE = statistics.stdev(metricsDict['mape'])
    stDevRMSE = statistics.stdev(metricsDict['rmse'])
    stDevR = statistics.stdev(metricsDict['r'])
    stDevRSq = statistics.stdev(metricsDict['rSq'])
    
    # make a dictionary holding all of the data
    modelDict = {}
    
    # fill modelDict
    modelDict['avgMAE'] = round(avgMAE, 3)
    modelDict['avgMAPE'] = round(avgMAPE, 3)
    modelDict['avgRMSE'] = round(avgRMSE, 3)
    modelDict['avgR'] = round(avgR, 3)
    modelDict['avgRSq'] = round(avgRSq, 3)
    
    modelDict['bestMAE'] = round(bestMAE, 3)
    modelDict['bestMAPE'] = round(bestMAPE, 3)
    modelDict['bestRMSE'] = round(bestRMSE, 3)
    modelDict['bestR'] = round(bestR, 3)
    modelDict['bestRSq'] = round(bestRSq, 3)
    
    modelDict['stDevMAE'] = round(stDevMAE, 3)
    modelDict['stDevMAPE'] = round(stDevMAPE, 3)
    modelDict['stDevRMSE'] = round(stDevRMSE, 3)
    modelDict['stDevR'] = round(stDevR, 3)
    modelDict['stDevRSq'] = round(stDevRSq, 3)
    
    # add model dict to modelsDict
    modelsDict[modelName] = modelDict
    
    # make a dataframe with all the results of all 10 models for each model type
    rawResultsDf = pd.DataFrame.from_dict(metricsDict)
    
    # rename columns
    colRename = {"mae": "Mean absolute error (lbs./acre)",
             "mape": "Mean absolute percentage error (%)",
             "rmse": "Root mean squared error (lbs./acre)",
             "r": "R",
             "rSq": "R^2",
             "explainedVariance": "Explained variance"
            }

    # rename the columns of rawResultsDf
    rawResultsDf = rawResultsDf.rename(columns=colRename)
    
    # save the dataframe as an Excel file
    rawResultsDf.to_excel(allFeatModelsLoc + "\\" + modelName + "RawResults.xlsx")
    
# make a dataframe storing everything
df = pd.DataFrame.from_dict(modelsDict, orient='index')

# sort so the model with the highest average R-squared value comes first
df = df.sort_values(by ='avgRSq', ascending=False)

# make rename dictionary that will rename all row and column names to something more readable
rowRename = {'knn': 'K-nearest neighbors',
             'rf': 'Random Forest',
             'dt': 'Regression Tree',
             'nn': 'Neural network',
             'svr': 'Support vector Machine',
             'linReg': 'Linear Regression',
             'bayes': 'Bayesian ridge Regression'
            }

colRename = {"avgMAE": "Mean absolute error (lbs./acre)",
             "avgMAPE": "Mean absolute percentage error (%)",
             "avgRMSE": "Root mean squared error (lbs./acre)",
             "avgR": "R",
             "avgRSq": "R^2",
             "bestMAE": "Mean absolute error of best model (lbs./acre)",
             "bestMAPE": "Mean absolute percentage error of best model (%)",
             "bestRMSE": "Root mean squared error of best model (lbs./acre)",
             "bestR": "R of best model",
             "bestRSq": "R^2 of best model",
             "stDevMAE": "Standard deviation of mean absolute error (lbs./acre)",
             "stDevMAPE": "Standard deviation of mean absolute percentage error (%)",
             "stDevRMSE": "Standard deviation of root mean squared error (lbs./acre)",
             "stDevR": "Standard deviation of R",
             "stDevRSq": "Standard deviation of R^2",
            }

# rename rows and columns of df
df = df.rename(index=rowRename)
df = df.rename(columns=colRename)

## make a dataframe holding just the average values with the standard deviations
avgDf = pd.DataFrame()

# format each average as 'average +/- 2 standard deviations'
# (2 * SD is rounded so floating-point noise such as 0.30000000000000004 is not written out)
for index, row in df.iterrows():
    avgDf.loc[index, "Mean absolute error (lbs./acre)"] = str(row["Mean absolute error (lbs./acre)"]) + " +/- " + str(round(2*row["Standard deviation of mean absolute error (lbs./acre)"], 3))
    avgDf.loc[index, "Mean absolute percentage error (%)"] = str(row["Mean absolute percentage error (%)"]) + " +/- " + str(round(2*row["Standard deviation of mean absolute percentage error (%)"], 3))
    avgDf.loc[index, "Root mean squared error (lbs./acre)"] = str(row["Root mean squared error (lbs./acre)"]) + " +/- " + str(round(2*row["Standard deviation of root mean squared error (lbs./acre)"], 3))
    avgDf.loc[index, "R"] = str(row["R"]) + " +/- " + str(round(2*row["Standard deviation of R"], 3))
    avgDf.loc[index, "R^2"] = str(row["R^2"]) + " +/- " + str(round(2*row["Standard deviation of R^2"], 3))

# avgDf keeps the row order of df, which is already sorted by average R^2.
# The 'R^2' column of avgDf holds strings, so sorting on it would compare text rather than numbers.

## make a dataframe holding just the values of the best model
colsToDrop = ["Mean absolute error (lbs./acre)", 
              "Mean absolute percentage error (%)",
              "Root mean squared error (lbs./acre)",
              "R",
              "R^2",
              "Standard deviation of mean absolute error (lbs./acre)",
              "Standard deviation of mean absolute percentage error (%)",
              "Standard deviation of root mean squared error (lbs./acre)",
              "Standard deviation of R",
              "Standard deviation of R^2"]
# make bestDf
bestDf = df.drop(columns=colsToDrop)

# sort bestDf so the model with the highest R-squared value comes first
bestDf = bestDf.sort_values(by ='R^2 of best model', ascending=False)

# save the dataframes to the folder where the model info was retrieved
df.to_csv(saveLoc + "\\allMetrics.csv")
avgDf.to_csv(saveLoc + "\\avgResults2TimesSDev.csv")
bestDf.to_csv(saveLoc + "\\bestModelResults.csv")

print("FINISHED")
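
The per-metric bookkeeping in the cell above can also be written as a loop. The sketch below is an illustrative alternative, not the notebook's code: the helpers `summarizeMetrics` and `formatMeanPm2Sd` and the sample values are hypothetical. It assumes, as in the cell above, that the 'best' run is the one with the highest R-squared value.

```python
import statistics

# maps each metric key to the suffix used in the summary keys (e.g. 'avgMAE')
SUFFIXES = {"mae": "MAE", "mape": "MAPE", "rmse": "RMSE", "r": "R", "rSq": "RSq"}

def summarizeMetrics(metricsDict, digits=3):
    """Return the average, best-run value, and standard deviation of each metric."""
    # index of the run with the highest R-squared value
    bestIndex = metricsDict["rSq"].index(max(metricsDict["rSq"]))
    summary = {}
    for key, suffix in SUFFIXES.items():
        values = metricsDict[key]
        summary["avg" + suffix] = round(statistics.mean(values), digits)
        summary["best" + suffix] = round(values[bestIndex], digits)
        summary["stDev" + suffix] = round(statistics.stdev(values), digits)
    return summary

def formatMeanPm2Sd(mean, stDev, digits=3):
    """Format a value as 'mean +/- 2 standard deviations'."""
    return str(mean) + " +/- " + str(round(2 * stDev, digits))

# hypothetical metrics for one model, 3 runs instead of 10 for brevity
exampleMetrics = {
    "mae": [10.0, 12.0, 11.0],
    "mape": [5.0, 6.0, 5.5],
    "rmse": [14.0, 16.0, 15.0],
    "r": [0.90, 0.80, 0.85],
    "rSq": [0.81, 0.64, 0.7225],
}

exampleSummary = summarizeMetrics(exampleMetrics)
maeText = formatMeanPm2Sd(exampleSummary["avgMAE"], exampleSummary["stDevMAE"])
print(maeText)
```

Keeping the metric keys in one mapping means a new metric only has to be added in one place.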

The next cell produces a graph showing how the results change with the number of features. For it to work, the working directory must contain folders named 'Nfeatures', where 'N' is an integer. Each 'Nfeatures' folder must contain only the results saved by the function 'saveMLResults'.
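
The folder layout described above can be checked before plotting. The following is a small sketch under stated assumptions: the helper `findFeatureFolders` is hypothetical (it is not part of the API), and the temporary directory only mimics the expected layout. It lists the 'Nfeatures' folders in a working directory and orders them by N rather than alphabetically.

```python
import os
import re
import tempfile

def findFeatureFolders(workingDir):
    """Return the names of the 'Nfeatures' folders in workingDir, sorted by N."""
    pattern = re.compile(r"^(\d+)features$")
    found = []
    for name in os.listdir(workingDir):
        match = pattern.match(name)
        # keep only directories whose name is an integer followed by 'features'
        if match and os.path.isdir(os.path.join(workingDir, name)):
            found.append((int(match.group(1)), name))
    # sort numerically so '10features' comes after '4features'
    return [name for _, name in sorted(found)]

# build a throwaway directory that mimics the expected layout
with tempfile.TemporaryDirectory() as exampleDir:
    for n in (10, 3, 4):
        os.mkdir(os.path.join(exampleDir, str(n) + "features"))
    os.mkdir(os.path.join(exampleDir, "notes"))  # ignored: not an 'Nfeatures' folder
    featureFolders = findFeatureFolders(exampleDir)

print(featureFolders)
```

A list built this way could be passed in place of a hand-written features list, assuming the plotting function expects the folder names in increasing order of N.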

In [None]:
# all features were available for feature selection

aWorkingDir = r"C:\Users\chris\Documents\Thesis\results\APPS\afterThesisDefense\modelResults\selectKBest"

aFeaturesList = ["3features", "4features", "5features", "6features", "7features", "8features", "9features",
                "10features", "11features"]
aModelList = ["rfTup.pkl", "knnTup.pkl", "svrTup.pkl","nnTup.pkl","dtTup.pkl"]

plotFeaturesAndR(aWorkingDir, aFeaturesList, aModelList)