# RF SIE Output Modes by Month

This is the script for the main version of the random forest model. It builds and evaluates models for SIE regression with different modes of prediction. It does not compute a 20th century regression (for that, see _RF SIE 1903 Prediction_).

---

By __different modes of prediction__ I mean the following: We can build a random forest predicting sea ice for each intersection of time (i.e. month) and longitude (10 degree section), meaning that we have a separate model for each grid square on a graph representing month against longitude. However, this does not consider the relationship between the dependent variables (the relationship between sea ice in one month or longitude section and another), so we can create different output 'modes' to do this.

The modes of prediction are:
- Individual = consider each time/longitude section separately
- Month = consider data from all longitudes in a month as one target
- Longitudinal = consider data from all months in that longitude as one target
- Year = consider the entire year's data as a target variable

For example, in the 'Month' mode we create 12 different models (12 different RFs). Each one is trained on the data for all longitudes in one specific month and predicts all the longitudes for that month, taking into account the relationship between SIE at different longitudes as it does so. The prediction is then unravelled back into the values for each longitude section so that the error can be analysed.

The Individual mode uses 12 * 36 = 432 distinct models for the prediction, Month uses 12, Longitudinal uses 36, and Year uses one. This means that the ratio of predictors to dependent variables is different in each model (highest in Individual and lowest in Year), and different relationships between sections of SIE are considered in each mode (more relationships in Year and none in Individual), resulting in different accuracy of prediction across the modes.
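
To make the model counts concrete, the sketch below slices a synthetic array into the target that one model would train on in each mode. The array `sie`, its 17-year length and its (year, month, longitude section) layout are assumptions for illustration, not the real data.

```python
import numpy as np

n_years, n_months, n_lons = 17, 12, 36
# Synthetic stand-in for binned SIE, assumed layout: (year, month, lon section)
sie = np.random.default_rng(0).random((n_years, n_months, n_lons))

# For each mode: (number of models, shape of the target seen by one model)
modes = {
    'Individual':   (n_months * n_lons, sie[:, 0, 0].shape),  # one time series per model
    'Month':        (n_months,          sie[:, 0, :].shape),  # all longitudes of one month
    'Longitudinal': (n_lons,            sie[:, :, 0].shape),  # all months of one longitude
    'Year':         (1,                 sie.reshape(n_years, -1).shape),  # whole year flattened
}

for name, (n_models, target_shape) in modes.items():
    print(f"{name:<12} models={n_models:<4} target shape per model={target_shape}")
```

The number of models multiplied by the number of outputs per model is 432 in every mode; only the way the outputs are grouped changes.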

__Note:__ I believe there is room for improvement here by investigating different groupings of SIE, for example a model that only forecasts the Bellingshausen Sea, or one that considers the sea in summer and winter independently. There is a lot that can be done in searching for the combination of groupings that produces the best predictions for the whole SIE.

---

### Creating the model

In [None]:
import numpy as np
import xarray as xr
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor
from calendar import month_abbr 

Grab the list of month abbreviations and pop the empty string at index 0 so the list can be used for axis tick labels in the graphs.

In [None]:
months = list(month_abbr)
months.pop(0)

Set the number of estimators (= trees in each forest), as well as seeding a random state for reproducibility.

In [None]:
n_est = 200 #number of estimators
rs = 0 #random state
np.random.seed(rs)

Next we specify the years which will be used for:
- the start of the prediction period (predstartyear)
- the start of the training/testing data (startyear)
- the end of the training/testing data (endyear)
- the number of years we are predicting for (predyears; not set in the cell below)

In [None]:
predstartyear = 1979 #start of predicted period
startyear = 1979 #start of training
endyear = 1995 #end of training

Then we specify filepaths and open the data for our prediction and training sections.

In [None]:
yvar = 'SIE'
folder = '~/Desktop/IMAS/ncfiles/'  #filepath
pfile = 'Proxy_combined_data_v4.nc'  #file providing the target (SIE) data
xfile = 'CombinedProxies.nc'  #file providing the predictor (proxy) data
df = xr.open_dataset(folder+pfile).sel(year=slice(startyear, endyear))
X = xr.open_dataset(folder+xfile).sel(year=slice(predstartyear, endyear))

Then we drop predictors containing NaN values in this range.

__Note:__ We must do this because the RFs cannot handle NaN values. Dropping predictors is only one option; alternatives include:
- using the mean to impute values
- using a RF based on other proxies in that year (or only complete years) to impute
- using a known distribution to impute
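
As an illustration of the first alternative, here is a minimal sketch of mean imputation in plain numpy. The matrix `X_demo` is hypothetical (rows are years, columns are proxies) and this is not what the script does; the script drops the affected predictors instead.

```python
import numpy as np

# Hypothetical proxy matrix: rows are years, columns are proxies, NaN = missing
X_demo = np.array([[1.0, 10.0, np.nan],
                   [2.0, np.nan, 5.0],
                   [3.0, 30.0, 7.0]])

col_means = np.nanmean(X_demo, axis=0)   # per-proxy mean, ignoring NaNs
X_imputed = np.where(np.isnan(X_demo), col_means, X_demo)  # fill gaps with that mean
print(X_imputed)
```

If this were used with a train/test split, the means would normally be computed from the training years only, so that no information from the test years leaks into the predictors.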

In [None]:
X = X.to_dataframe().dropna(axis=1).to_xarray()

As mentioned in the preamble, there are four modes of prediction, so we set that number here and construct an array with their names in the fixed order.

In [None]:
outputmodes = 4 #number of output modes
outputname = ['Individual','Month','Longitudinal','Year']

Then record the mean SIE over all years, months and longitudes, which is used as the denominator when calculating error%.

__Note:__ you could also calculate the mean SIE in each month, or at each longitude, for different ways of considering the error. I have used this as it was the simplest to implement in a short amount of time.

In [None]:
mean_sie = df['SIE'].mean('month').mean('lon').mean('year')
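
The alternative mentioned in the note, one mean per month instead of a single overall mean, could look roughly like this. The arrays are synthetic and the (year, month, longitude section) layout is an assumption; the constant `rmse` grid is only a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for binned SIE, assumed layout: (year, month, lon section)
sie = rng.random((17, 12, 36)) + 1.0
rmse = np.full((12, 36), 0.1)   # placeholder RMSE for each month/longitude cell

overall_mean = sie.mean()               # single normaliser, as in this script
monthly_mean = sie.mean(axis=(0, 2))    # one normaliser per month, shape (12,)

rmsepom_overall = 100 * rmse / overall_mean
rmsepom_monthly = 100 * rmse / monthly_mean[:, None]   # broadcast across longitudes
```

With a per-month normaliser, the same absolute error counts for more in months where the mean SIE is small, which changes how the heatmaps should be read.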

Next generate a train test split.

I do this by randomly selecting years rather than calling sklearn's `train_test_split`, because the predictor and target arrays are split at different points in the script and so cannot all be passed to a single call, which is what would keep the splits identical. Generating the year lists once here ensures the train/test years are consistent across X and every y. The split is seeded by the numpy random seed above, so it is reproducible.

In [None]:
years = np.arange(startyear, endyear+1)
testyears = np.sort(np.random.choice(years, size = int(len(years)*0.33), replace = False))
trainyears = [x for x in years if x not in testyears]
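
The idea of one shared year split can be sketched with plain arrays as follows. `X_all` and `y_all` are hypothetical stand-ins that share a year axis, and a local `RandomState` is used here instead of the global seed purely to keep the example self-contained.

```python
import numpy as np

rng = np.random.RandomState(0)   # local generator; the script seeds the global one
years = np.arange(1979, 1996)
test_years = np.sort(rng.choice(years, size=int(len(years) * 0.33), replace=False))
train_mask = ~np.isin(years, test_years)   # True for training years

# Hypothetical predictor and target arrays sharing the same year axis
X_all = rng.rand(len(years), 5)
y_all = rng.rand(len(years), 12, 36)

# The same mask is applied to both, so rows always refer to the same years
X_tr, X_te = X_all[train_mask], X_all[~train_mask]
y_tr, y_te = y_all[train_mask], y_all[~train_mask]
```

Because the mask is built once and reused, any later slice of the target (one month, one longitude, or the whole year) stays aligned with the predictor rows.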

Select the train and test years from the predictor data, then standardise the predictors. The scaler is fitted on the training data only and then applied to the test data.

In [None]:
X_train = X.sel(year = trainyears).to_dataframe().values
X_test = X.sel(year = testyears).to_dataframe().values

#Standardise values of proxies
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
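
For reference, the standardisation step amounts to the following arithmetic, shown here on synthetic arrays (`X_tr` and `X_te` are hypothetical). `StandardScaler` uses the population standard deviation by default, which is what `np.std` computes.

```python
import numpy as np

rng = np.random.default_rng(0)
X_tr = rng.normal(5.0, 2.0, size=(12, 3))   # hypothetical training predictors
X_te = rng.normal(5.0, 2.0, size=(5, 3))    # hypothetical test predictors

mu = X_tr.mean(axis=0)     # statistics come from the training rows only
sigma = X_tr.std(axis=0)   # population std (ddof=0)
X_tr_scaled = (X_tr - mu) / sigma
X_te_scaled = (X_te - mu) / sigma   # test rows reuse the training statistics
```

The scaled training columns have mean 0 and standard deviation 1; the test columns generally do not, since they are scaled with the training statistics.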

Group the SIE into bins of 10 degree longitude sections for the model.

In [None]:
y = df[yvar].groupby_bins('lon', np.arange(0,361,10)).mean()
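
A rough numpy equivalent of this binning, for a single longitude profile, is sketched below. The 1-degree grid of cell centres is an assumption about the data, and `sie_lon` is synthetic.

```python
import numpy as np

lon = np.arange(0.5, 360, 1.0)   # assumed 1-degree cell centres: 0.5, 1.5, ..., 359.5
sie_lon = np.random.default_rng(0).random(lon.size)   # synthetic SIE along longitude
edges = np.arange(0, 361, 10)    # same bin edges as the script

# right=True gives right-closed bins (0, 10], (10, 20], ...
idx = np.digitize(lon, edges, right=True) - 1   # bin index 0..35 for each longitude
binned = np.array([sie_lon[idx == b].mean() for b in range(36)])
```

With right-closed bins, a longitude of exactly 0 would fall outside the first bin, so it may be worth checking how the real grid is laid out.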

Initialise an empty list to store, for each mode, the array of Root Mean Squared Error as Percentage of Mean (rmsepom) evaluated on the test set.

In [None]:
errorarrays = []

Loop through each of the modes, producing the models for each section and logging their error in rmsepom before appending rmsepom to errorarrays to save it.

For the multioutput modes (not Individual) the test prediction is generated as an array of values, so we need to iterate through the array produced by the model to deconstruct it into the results for each time/longitude intersection.

Note that the xarray dimension for month starts at 1.

In [None]:
for mode in range(outputmodes):
    #Array to store the error for each month/longitude section
    #RMSEPOM = Root Mean Squared Error as Percentage Of Mean
    rmsepom = np.zeros([12,36]) #rows = months, columns = longitude sections
    
    if(mode == 0): #Individual = consider each month/lon pair separately
        for m in range(12): #loop over the months
            m += 1 #month index in xarray starts at 1
            monthdata = y.sel(month=m)
            for long in range(36): 
                longend = (long+1)*10 #end of range of longitude section
                monthlondata = monthdata.sel(lon_bins=longend)
                y_train = monthlondata.sel(year = trainyears).to_numpy()
                y_test = monthlondata.sel(year = testyears).to_numpy()
                
                #Training model
                regressor = RandomForestRegressor(n_estimators = n_est, random_state = rs)
                regressor.fit(X_train, y_train)
                y_pred = regressor.predict(X_test)
                rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
                
                #Put error into heatmap
                rmsepom[m-1,long] = 100*rmse/mean_sie
        errorarrays.append(rmsepom) #Add in error array for this mode
        
    elif(mode == 1): #Month = consider data from all longitudes in a month as one target
        for m in range(12): #loop over the months
            m += 1 #month index in xarray starts at 1
            monthdata = y.sel(month=m)
            y_train = monthdata.sel(year = trainyears).to_numpy()
            y_test = monthdata.sel(year = testyears).to_numpy()
            
            #Training model
            regressor = RandomForestRegressor(n_estimators = n_est, random_state = rs)
            regressor.fit(X_train, y_train)
            y_pred = regressor.predict(X_test)
            
            #Convert nested arrays from longitudes to year time series
            y_test = y_test.transpose()
            y_pred = y_pred.transpose() 
            
            for long in range(36):
                rmse = np.sqrt(metrics.mean_squared_error(y_test[long], y_pred[long]))
                rmsepom[m-1,long] = 100*rmse/mean_sie
                
        errorarrays.append(rmsepom) #Add in error array for this mode
        
    elif(mode == 2): #Longitudinal = consider data from all months in that longitude as one target
        for long in range(36):
            longend = (long+1)*10
            longdata = y.sel(lon_bins = longend)
            y_train = longdata.sel(year = trainyears).to_numpy()
            y_test = longdata.sel(year = testyears).to_numpy()
            
            #Training model
            regressor = RandomForestRegressor(n_estimators = n_est, random_state = rs)
            regressor.fit(X_train, y_train)
            y_pred = regressor.predict(X_test)
            
            #Convert nested arrays from months to year time series
            y_test = y_test.transpose()
            y_pred = y_pred.transpose() 
            
            for m in range(12):
                rmse = np.sqrt(metrics.mean_squared_error(y_test[m], y_pred[m]))
                rmsepom[m, long] = 100*rmse/mean_sie
        
        errorarrays.append(rmsepom) #Add in error array for this mode
    
    elif(mode == 3): #Year = consider the entire year's data as a target variable
        y_train = y.sel(year = trainyears).to_numpy()
        y_train = y_train.reshape(y_train.shape[0], y_train.shape[1] * y_train.shape[2]) # 432 entries per year
        y_test = y.sel(year = testyears).to_numpy()
        
        #Training model
        regressor = RandomForestRegressor(n_estimators = n_est, random_state = rs)
        regressor.fit(X_train, y_train)
        y_pred = regressor.predict(X_test)
        
        #Transform pred and test into standard format to get time series
        y_pred = y_pred.reshape(y_pred.shape[0],12,36).transpose()
        y_test = y_test.transpose()
        
        
        for l in range(36):
            for m in range(12):
                rmse = np.sqrt(metrics.mean_squared_error(y_test[l,m], y_pred[l,m]))
                rmsepom[m,l] = 100*rmse/mean_sie   
                
        errorarrays.append(rmsepom) #Add in error array for this mode                
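
The multi-output step can be seen in isolation in the sketch below, which uses synthetic data in place of the proxies and SIE. Passing a 2-D target to `fit` makes the forest predict all outputs jointly, and averaging the squared error over the year axis gives one RMSE per output without transposing and looping. The array names and sizes are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X_tr, X_te = rng.random((12, 4)), rng.random((5, 4))     # synthetic predictors
y_tr, y_te = rng.random((12, 36)), rng.random((5, 36))   # 36 longitude sections

# A 2-D target makes the forest predict all 36 outputs jointly
model = RandomForestRegressor(n_estimators=20, random_state=0)
model.fit(X_tr, y_tr)
y_hat = model.predict(X_te)   # shape (5, 36): one row per test year

# RMSE per longitude section: average over the year axis only
rmse_per_lon = np.sqrt(((y_te - y_hat) ** 2).mean(axis=0))
```

This gives the same per-section values as calling `mean_squared_error` on each transposed row, which is how the loop above computes them.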

---
### Analysing the model

First, define a function to plot a heatmap of the error from the model stored in rmsepom. This shows the RMSEPOM at each month/longitude section for the model with a colour scale.

In [None]:
def draw(rmsepom):
    ax = plt.subplot()
    
    rmsepomxr = xr.DataArray(rmsepom)
    rmsepomxr.plot.pcolormesh(levels = clev)
    
    ax.set_title('Error in '+yvar+' predictions using RFs')
    ax.set_ylabel('Month')
    #plt.yticks(ticks=np.arange(12))
    ax.set_xlabel('Longitude')
    ax.set_xticks(ticks=np.arange(3,36,3), labels=['','60E','','120E','','180','','120W','','60W',''])
    ax.set_yticks(ticks=np.arange(12), labels=months, rotation=0)
    ax.annotate("Mode = {}".format(outputname[mode]), xy = (0,0), xytext = (-6,-2.1))
    
    plt.show()

Next, define a function to plot the error distribution. This produces a histogram of the error values with 2% bins. It is annotated with green, orange, and red lines showing where the 50th, 75th, and 95th percentiles of error fall, respectively.

In [None]:
def drawerror(rmsepom):
    ax = plt.subplot()
    
    ordered = rmsepom.copy()
    ordered.ravel().sort()
    
    x1 = np.percentile(ordered.ravel(), 50)
    plt.plot([x1, x1], [0, 100], color='g', linestyle='dashed', linewidth=2)
    ax.annotate("50%: {}".format(round(x1,1)), xy = (35,100), color = 'g')
    
    x1 = np.percentile(ordered.ravel(), 75)
    ax.annotate("75%: {}".format(round(x1,1)), xy = (35,92.5), color = 'tab:orange')
    plt.plot([x1, x1], [0, 100], color='tab:orange', linestyle='dashed', linewidth=2)
    
    x1 = np.percentile(ordered.ravel(), 95)
    plt.plot([x1, x1], [0, 100], color='r', linestyle='dashed', linewidth=2)
    ax.annotate("95%: {}".format(round(x1,1)), xy = (35,85), color='r')
    
    plt.hist(ordered.ravel(), bins = np.arange(0,41,2))
    
    ax.set_title('Error distribution for mode: {}'.format(outputname[mode]))
    ax.set_ylabel('Number of Values')
    ax.set_xlabel('Error%')
    ax.set_yticks(ticks=np.arange(0,120,10))
    ax.set_xticks(ticks=np.arange(0,41,2))
    ax.annotate("Estimators = {}".format(n_est), xy = (0,0), xytext = (-6,-20))
    
    plt.show()
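
The numbers behind this plot can be computed without any plotting, as in the sketch below on a synthetic error grid (`errors` is hypothetical). `np.percentile` flattens the array by default and does not need sorted input.

```python
import numpy as np

errors = np.random.default_rng(0).random((12, 36)) * 40   # synthetic error% grid
p50, p75, p95 = np.percentile(errors, [50, 75, 95])        # flattens the array by default
counts, edges = np.histogram(errors, bins=np.arange(0, 41, 2))   # 2% bins from 0 to 40
```

Values above 40% would fall outside these bins and be left out of the histogram, so the bin range may need widening for modes with larger errors.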

Finally, loop through each of the modes, putting the RMSEPOM array for that mode into the rmsepom variable. Then print some summary metrics and draw the 3 error charts for each mode: two heatmaps at different colour scales and the error histogram.

In [None]:
for mode in range(outputmodes):
    rmsepom = errorarrays[mode]

    print("=========================================================")
    print('Error in '+yvar+' predictions using RFs with Output Mode = {}'.format(outputname[mode]))
    print('RMSE % minimum: ', rmsepom.min())
    print('RMSE % maximum: ', rmsepom.max())
    print('RMSE % mean: ', rmsepom.mean())
    print("=========================================================")
    
    
    
    #Contour levels for 'zoomed out'
    clev = np.arange(0,30,2)  #contour levels
    draw(rmsepom)
    
    #Contour levels for 5%
    clev = np.arange(0,5,.25)  #contour levels
    draw(rmsepom)
    
    drawerror(rmsepom)