This notebook loads a dataset, imports the API, and uses the two to train machine learning models.

Let's first import the API.

In [2]:
# import the API
APILoc = r"*insert directory of API"

import sys
sys.path.insert(0, APILoc)

from API import *
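
Later cells use names such as `pd` without importing them, so they presumably come from the wildcard import above, and a wrong `APILoc` only shows up as a confusing `ModuleNotFoundError`. A minimal sketch of a path check that fails earlier with a clearer message follows; `add_api_to_path` is a hypothetical helper introduced here for illustration, not part of the API.

```python
import sys
from pathlib import Path


def add_api_to_path(api_dir):
    """Put api_dir at the front of sys.path, failing early if it does not exist.

    Hypothetical helper for illustration; it is not part of the API itself.
    """
    path = Path(api_dir).expanduser().resolve()
    if not path.is_dir():
        raise FileNotFoundError(f"API directory not found: {path}")
    path_str = str(path)
    # Avoid adding the same directory twice when the cell is re-run.
    if path_str not in sys.path:
        sys.path.insert(0, path_str)
    return path_str
```

Calling `add_api_to_path(APILoc)` before `from API import *` would replace the bare `sys.path.insert` call.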

Load the dataset

In [3]:
# get aggregate data
aggDataLoc = r"*insert location of csv of dataset"

aggDf = pd.read_csv(aggDataLoc)
aggDf = aggDf.drop("Unnamed: 0",axis=1)
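
An "Unnamed: 0" column usually appears when a DataFrame was saved with its index and then read back without telling pandas which column holds that index. The sketch below, which uses a tiny in-memory CSV with made-up values rather than the real dataset, shows the drop used above next to the `index_col=0` alternative.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for the real file (values are made up).
csv_text = ",Julian Day,Yield (tons/acre)\n0,340,0.76\n1,340,0.79\n"

# Reading without index_col keeps the saved index as a column named "Unnamed: 0".
raw = pd.read_csv(io.StringIO(csv_text))

# Option 1 (as in the cell above): drop that column after loading.
dropped = raw.drop(columns="Unnamed: 0")

# Option 2: tell pandas that the first column is the index.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
```

Both options leave the same data columns; the second avoids a `KeyError` if the file happens not to contain the extra column.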


List the first few rows of the dataset to check that it loaded correctly

In [4]:
aggDf.head()

,Julian Day,Time Since Sown (Days),Time Since Last Harvest (Days),Total Radiation (MJ/m^2),Total Rainfall (mm),Avg Air Temp (C),Avg Min Temp (C),Avg Max Temp (C),Avg Soil Moisture (%),Day Length (hrs),Percent Cover (%),Yield (tons/acre)
0,340,422,99,1413.37,199.136,15.61899,10.121818,21.740505,0.13408,10.0,90.915344,0.76
1,340,422,99,1413.37,199.136,15.61899,10.121818,21.740505,0.13408,10.0,90.968254,0.79
2,340,422,99,1413.37,199.136,15.61899,10.121818,21.740505,0.13408,10.0,87.925926,0.75
3,340,422,99,1413.37,199.136,15.61899,10.121818,21.740505,0.13408,10.0,88.883598,0.7
4,340,422,99,1413.37,199.136,15.61899,10.121818,21.740505,0.13408,10.0,86.883598,0.69


Now let's filter out the features that will not be made available for feature selection. Only the features in the list 'xColumnsToKeep' will be made available for feature selection. The features that could have been included from my original project are: <br>
"Julian Day" <br>
"Time Since Sown (Days)" <br>
"Time Since Last Harvest (Days)" <br>
"Total Radiation (MJ/m^2)" <br>
"Total Rainfall (mm)" <br>
"Avg Air Temp (C)" <br>
"Avg Min Temp (C)" <br>
"Avg Max Temp (C)"<br>
"Avg Soil Moisture (%)"<br>
"Day Length (hrs)"<br>
"Percent Cover (%)"<br>

In [5]:
# filter out the features that will not be used by the machine learning models

# the features to keep:
xColumnsToKeep = ["Julian Day", "Time Since Sown (Days)", "Total Radiation (MJ/m^2)",
                "Total Rainfall (mm)"]

    
# the target to keep
yColumnsToKeep = ["Yield (tons/acre)"]

# get a dataframe containing the features and the targets
xDf = aggDf[xColumnsToKeep]
yDf = aggDf[yColumnsToKeep]

# reset the index
xDf = xDf.reset_index(drop=True)
yDf = yDf.reset_index(drop=True)

pd.set_option('display.max_rows', 2500)
pd.set_option('display.max_columns', 500)

xCols = list(xDf)

Now let's look at the first few rows of the input feature data and the target data.

In [6]:
xDf.head()

,Julian Day,Time Since Sown (Days),Total Radiation (MJ/m^2),Total Rainfall (mm)
0,340,422,1413.37,199.136
1,340,422,1413.37,199.136
2,340,422,1413.37,199.136
3,340,422,1413.37,199.136
4,340,422,1413.37,199.136


In [7]:
yDf.head()

,Yield (tons/acre)
0,0.76
1,0.79
2,0.75
3,0.7
4,0.69


Let's now define the parameters that will be used to run the machine learning experiments. A parameter grid can be defined for each model so that scikit-learn can run a 5-fold grid search for that model's best hyperparameters. Each grid defined here lists the candidate values that the grid search will try. <br>
<br>
Once the parameter grids are defined, a list of tuples must also be defined. Each tuple must take the form: <br>
(scikit-learn model, appropriate parameter grid, name of the file to be saved). <br>
<br>
Then the number of iterations must be set. This is represented by the variable 'N'. Each model will be evaluated N times (via N-fold cross-validation), and the results averaged over those N iterations will be returned. <br>
<br>
'workingDir' is the directory in which all of the results will be saved. <br>
<br>
'numFeatures' is the number of features that will be selected (via feature selection).
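
The `model__` prefix used in the grids below follows scikit-learn's `<step name>__<parameter>` convention. This suggests that the API wraps each estimator in a `Pipeline` with a step named "model", but that is an inference from the key names and not something this notebook confirms. A minimal sketch under that assumption, with a scaler step and synthetic data that are also assumptions:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for xDf / yDf (not the real dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=60)

# Assumed pipeline layout: a scaler followed by a step named "model".
pipe = Pipeline([("scaler", StandardScaler()), ("model", KNeighborsRegressor())])

# Keys use "<step name>__<parameter>", so "model__n_neighbors" reaches the regressor.
param_grid = {
    "model__n_neighbors": [2, 5, 10],
    "model__weights": ["uniform", "distance"],
}

# 5-fold grid search over the 3 x 2 = 6 combinations, scored with r2.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X, y)
best_params = search.best_params_
```

After fitting, `search.best_params_` holds one value per grid key and `search.best_estimator_` is the pipeline refit with those values.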

In [None]:
# hide the warnings because training the neural network causes lots of warnings.
import warnings
warnings.filterwarnings('ignore')

# make the parameter grids for sklearn's gridsearchcv
rfParamGrid = {
        'model__n_estimators': [5, 10, 25, 50, 100], # Number of estimators
        'model__max_depth': [5, 10, 15, 20], # Maximum depth of the tree
        'model__criterion': ["mae"]  # renamed "absolute_error" in scikit-learn 1.0; "mae" was removed in 1.2
    }
knnParamGrid ={
        'model__n_neighbors':[2,5,10],
        'model__weights': ['uniform', 'distance'],
        'model__leaf_size': [5, 10, 30, 50]    
    }
svrParamGrid = {
        'model__kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
        'model__C': [0.1, 1.0, 5.0, 10.0],
        'model__gamma': ["scale", "auto"],
        'model__degree': [2,3,4,5]
    }
nnParamGrid = {
        # single-layer sizes need a trailing comma to be tuples: (3) is just the int 3
        'model__hidden_layer_sizes':[(3,), (5,), (10,), (3,3), (5,5), (10,10)],
        'model__solver': ['sgd', 'adam'],
        'model__learning_rate' : ['constant', 'invscaling', 'adaptive'],
        'model__learning_rate_init': [0.1, 0.01, 0.001]      
    }

linRegParamGrid = {}

bayesParamGrid={
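        'model__n_iter':[100,300,500], # renamed max_iter in scikit-learn 1.3
        'model__lambda_1': [1.e-6, 1.e-4, 1.e-2, 1, 10],
        # the original repeated lambda_1 here, which silently overwrote the key above;
        # lambda_2 is assumed to be the intended parameter
        'model__lambda_2': [1.e-6, 1.e-4, 1.e-2, 1, 10]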
    }

dtParamGrid = {
    'model__criterion': ['mae'],  # renamed 'absolute_error' in scikit-learn 1.0; 'mae' was removed in 1.2
    'model__max_depth': [5,10,25,50,100]
    }

# each tuple is (model, parameter grid, name of the file to be saved)
aModelList = [(RandomForestRegressor(), rfParamGrid, "rfTup.pkl"),
              (KNeighborsRegressor(), knnParamGrid, "knnTup.pkl"),
              (SVR(), svrParamGrid, "svrTup.pkl"),
              (MLPRegressor(), nnParamGrid, "nnTup.pkl"),
              (LinearRegression(), linRegParamGrid, "linRegTup.pkl"),
              (BayesianRidge(), bayesParamGrid, "bayesTup.pkl"),
              (DecisionTreeRegressor(), dtParamGrid, "dtTup.pkl")]

# the number of folds to do. This will also be the number of models that will be made for each method.
N = 10

# The location where all of the results will be saved
workingDir = r"*insert location where all of the machine learning models and their evaluation metrics should be saved*"

# the number of features that should be kept if doing feature selection
numFeatures = 4


Now let's run the tests and save the results.

In [None]:
saveMLResults(N, xDf, yDf, aModelList, workingDir, numFeatures, doSelection=False, printResults=True, metricToOptimize='r2')
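
The internals of `saveMLResults` are not shown in this notebook. As a rough illustration of what "evaluate each model N times and average the results" can look like in plain scikit-learn, here is a sketch using `KFold` and `cross_val_score` on synthetic data. It is an assumption about the general technique, not a description of what the API actually does.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for xDf / yDf (not the real dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=100)

# N-fold cross-validation: N train/test splits, one fitted model per split.
N = 10
folds = KFold(n_splits=N, shuffle=True, random_state=0)

# One r2 score per fold, then the average over the N folds.
scores = cross_val_score(LinearRegression(), X, y, cv=folds, scoring="r2")
mean_r2 = scores.mean()
```

Here `scores` holds one r2 value per fold and `mean_r2` is the averaged figure of the kind the `metricToOptimize='r2'` argument refers to.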