# Analyzing Backup Performance

There are several knobs we can turn to tune backup performance, but there isn't a great deal of guidance on what the best settings are for our environment, other than "try them and see!"  A big part of this is that the underlying hardware makes so much of a difference:  being I/O bound on backups means you want to configure things differently from if you are CPU-bound.  Also, if you are backing up a very busy system, you don't want to make the backup so well-tuned that it suddenly takes up 100% of your CPU.  This leads to a series of tradeoffs in configurable settings.

The most important of those settings are:
* Block Size -- The physical block size.  This really only matters for backup tapes and CD-ROMs but it is still settable.  Valid values:  { 0.5kb, 1kb, 2kb, 4kb, 8kb, 16kb, 32kb, 64kb }
* Max Transfer Size -- Maximum amount of data to be transferred per operation.  Valid values:  { 64kb, 128kb, 256kb, 512kb, 1mb, 2mb, 4mb }
* Buffer Count -- Number of buffers of size [MaxTransferSize] to be created.  Valid values:  { 1:N } but I probably wouldn't go above about 1024 without good reason, as with a MaxTransferSize of 4MB, that's up to 4GB of memory used for a single backup.
* File Count -- Tell SQL Server to stripe your backup across multiple files.  This is a nice way of getting extra throughput out of your backups.  Valid values:  { 1:N } but I probably wouldn't go above 10-12 without good reason.
* Compression -- Tell SQL Server whether or not you want to compress your backup.  This has a very minor cost of CPU but typically leads to **much** smaller backups, so my default is to say yes.  Valid values:  { TRUE, FALSE }

Armed with this knowledge, let's say you now want to go tune your systems.  Well, there are a **lot** of combinations.  Let's suppose that we go with the following options:
* Block Size:  { 0.5kb, 1kb, 2kb, 4kb, 8kb, 16kb, 32kb, 64kb }
* Max Transfer Size:  { 64kb, 128kb, 256kb, 512kb, 1mb, 2mb, 4mb }
* Buffer Count:  { 7, 15, 30, 60, 128, 256, 512, 1024 }
* File Count:  { 1, 2, 4, 6, 8, 10, 12 }
* Compression:  { TRUE }

This gives us 3136 separate options (8 * 7 * 8 * 7 * 1).  If your full backup averages 10 minutes, that's an expectation of roughly 523 hours (about 22 days) straight of backups to try each of these options.  If you have a terabyte-sized backup which takes 90 minutes to complete, you'll get your answer in approximately 196 days.
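As a quick check on that arithmetic, here is a small sketch in base R.  The variable names are my own for illustration; the candidate values are the ones listed above.

```r
# Candidate values for each setting, taken from the lists above.
block_size_kb   <- c(0.5, 1, 2, 4, 8, 16, 32, 64)
max_transfer_kb <- c(64, 128, 256, 512, 1024, 2048, 4096)
buffer_count    <- c(7, 15, 30, 60, 128, 256, 512, 1024)
file_count      <- c(1, 2, 4, 6, 8, 10, 12)
compression     <- TRUE

# Number of distinct combinations across all five settings.
n_options <- length(block_size_kb) * length(max_transfer_kb) *
  length(buffer_count) * length(file_count) * length(compression)

# Total wall-clock time if every option is tested once, back to back.
hours_at_10_min <- n_options * 10 / 60
days_at_90_min  <- n_options * 90 / 60 / 24

n_options
hours_at_10_min
days_at_90_min
```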

But there's a not-so-secret weapon we can use:  sampling.  Without getting into the statistics of the problem, we can decide to take a random sample of the full set of options and perform an analysis on it.  With a reasonable-sized sample, we can get somewhere close to the actual population values in a fraction of the time.

My sample today is from two databases at six sizes.  I have one database called BAC which includes four separate versions:  the full 136 GB, 89.24 GB, 57.89 GB, and 31.73 GB, where the difference comes from dropping the largest tables one at a time.  In addition, I have two versions of the Stack Overflow database:  one from 2010 when it was 10 GB in size, and another from 2013 when it was 50 GB in size.

I built a PowerShell script which builds a Cartesian product of my input arrays (that is, the parameters I laid out above) and runs the [dbatools](https://dbatools.io) cmdlet `Backup-DbaDatabase`.  I'm writing the results to an output file.  Then, I manually added a header with the variable names to make it easier to import into R.  I'm sampling the Cartesian product, performing only about 3% of the total number of tests.  That's still a lot of tests, but it's a much more tractable problem:  it means taking about 100 database backups rather than 3000.

The PowerShell code is available in the `SampleBackupOptions.ps1` script.
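For readers who think in R rather than PowerShell, the sampling idea can be sketched as follows.  This is not the contents of `SampleBackupOptions.ps1`; it is a minimal illustration, and the seed, the object names, and the choice to express sizes in bytes are all my own assumptions.

```r
set.seed(42)  # arbitrary seed, only so the sketch is repeatable

# Full Cartesian product of the candidate settings (sizes in bytes here).
options_grid <- expand.grid(
  BlockSize       = 512 * 2^(0:7),     # 0.5 KB through 64 KB
  MaxTransferSize = 65536 * 2^(0:6),   # 64 KB through 4 MB
  BufferCount     = c(7, 15, 30, 60, 128, 256, 512, 1024),
  FileCount       = c(1, 2, 4, 6, 8, 10, 12)
)

# Keep roughly 3% of the rows, chosen uniformly at random without replacement.
sample_size   <- ceiling(0.03 * nrow(options_grid))
sampled_tests <- options_grid[sample(nrow(options_grid), sample_size), ]

nrow(options_grid)
nrow(sampled_tests)
head(sampled_tests)
```

Each row of `sampled_tests` is one backup to run.  Sampling rows of the full grid, rather than sampling each setting independently, guarantees that no combination is tested twice.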

First, we will load the `tidyverse` package.  Then, we will load a package called `randomForest`.  This lets us use a random forest model to analyze our data.  We will load the `caret` package to help us partition training from test data.  Finally, the `evtree` package (along with `partykit`, which it builds on) will let us build evolutionary trees, that is, trees learned using evolutionary learning (genetic algorithms).

In [None]:
if(!require(tidyverse)) {
  install.packages("tidyverse", repos = "http://cran.us.r-project.org")
  library(tidyverse)
}

if(!require(randomForest)) {
  install.packages("randomForest", repos = "http://cran.us.r-project.org")
  library(randomForest)
}

if(!require(caret)) {
  install.packages("caret", repos = "http://cran.us.r-project.org")
  library(caret)
}

if(!require(partykit)) {
    install.packages("partykit", repos = "http://cran.us.r-project.org")
    library(partykit)
}

if(!require(evtree)) {
    install.packages("evtree_1.0-8.tar.gz", repos = NULL, type="source")
    library(evtree)
}

I am using data from six databases of different sizes.  Each file has the same set of variables in the same order.

**NOTE** -- If you get an error when trying to load your own files, make sure that the file is in UTF-8 or ASCII format.  PowerShell generates UCS-2 LE BOM files by default and R has trouble reading those.
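If re-saving the file is inconvenient, one alternative is to tell R the encoding while reading.  The sketch below uses base R's `read.csv()` with its `fileEncoding` argument instead of `readr`; the temporary file and its two rows are made up purely to stand in for PowerShell output, and encoding support can vary by platform.

```r
# Write a tiny UTF-16LE file to stand in for PowerShell output (made-up rows).
path <- tempfile(fileext = ".csv")
con <- file(path, open = "w", encoding = "UTF-16LE")
writeLines(c("4096,7,65536,1,120", "65536,128,1048576,4,95"), con)
close(con)

# Declare the encoding on the way in; R converts the text while reading.
perf <- read.csv(path, header = FALSE, fileEncoding = "UTF-16LE",
                 col.names = c("BlockSize", "BufferCount", "MaxTransferSize",
                               "FileCount", "Duration"))
perf
```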

In [None]:
# Every results file has the same five integer columns in the same order,
# so a single helper can read all of them.
read_perf_test <- function(path) {
  readr::read_delim(path, delim = ",",
    col_names = c("BlockSize", "BufferCount", "MaxTransferSize", "FileCount", "Duration"),
    col_types = cols(
      BlockSize = col_integer(),
      BufferCount = col_integer(),
      MaxTransferSize = col_integer(),
      FileCount = col_integer(),
      Duration = col_integer()
    ))
}

bac_31gb <- read_perf_test("../data/BAC_31.73GB_PerfTest.csv")
bac_58gb <- read_perf_test("../data/BAC_57.89GB_PerfTest.csv")
bac_89gb <- read_perf_test("../data/BAC_89.24GB_PerfTest.csv")
bac_136gb <- read_perf_test("../data/BAC_136GB_PerfTest.csv")
so_10gb <- read_perf_test("../data/SO2010_10GB_PerfTest.csv")
so_50gb <- read_perf_test("../data/SO2013_50GB_PerfTest.csv")


I want to be able to combine the sets of data together and draw conclusions across the broader scope.  In order to differentiate the sets of data, I have added in a new variable, DatabaseSize.

In [None]:
bac_31gb$DatabaseSize <- 31.73
bac_58gb$DatabaseSize <- 57.89
bac_89gb$DatabaseSize <- 89.24
bac_136gb$DatabaseSize <- 136.
so_10gb$DatabaseSize <- 10.
so_50gb$DatabaseSize <- 50.

backupstats <- rbind(bac_31gb, bac_58gb, bac_89gb, bac_136gb, so_10gb, so_50gb)

To make the results a bit easier to interpret, I'm converting block size to kilobytes.  This is a linear transformation of an independent variable, so this change does not affect the end results aside from scaling the betas.

In [None]:
backupstats$BlockSizeKB <- backupstats$BlockSize / 1024.0
backupstats$BlockSize <- NULL

We are also going to create a pair of measures, *MemoryUsageMB* and *SecPerGB*.  The *MemoryUsageMB* measure combines the max transfer size with buffer count.  This is important because the **total amount of memory used** plays a role in backup duration, regardless of whether that memory comes in the form of more buffers or a larger buffer size.  For example, 7 buffers and a 128 KB max transfer size means that we will use 7 * 128KB = 896KB of memory for the backup itself.

The *SecPerGB* measure gives us a measure of (inverse) throughput:  how many seconds does it take to transfer one GB of data to a backup?  This prevents database size from dominating our results.

In [None]:
backupstats$MemoryUsageMB <- (backupstats$MaxTransferSize / (1024.0 * 1024.0)) * backupstats$BufferCount
backupstats$BufferCount <- NULL
backupstats$MaxTransferSize <- NULL

backupstats$SecPerGB <- backupstats$Duration / backupstats$DatabaseSize
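To make the memory calculation concrete, here is a small worked example.  The helper name `memory_usage_mb` is hypothetical; it just wraps the same formula used above.

```r
# Memory used by a backup = buffer count * max transfer size, reported in MB.
memory_usage_mb <- function(buffer_count, max_transfer_bytes) {
  (max_transfer_bytes / (1024 * 1024)) * buffer_count
}

small   <- memory_usage_mb(7, 128 * 1024)          # 7 buffers of 128 KB: 0.875 MB (896 KB)
typical <- memory_usage_mb(7, 1024 * 1024)         # 7 buffers of 1 MB: 7 MB
largest <- memory_usage_mb(1024, 4 * 1024 * 1024)  # 1024 buffers of 4 MB: 4096 MB (4 GB)

# Different settings can land on the same total: both of these use 64 MB.
same_total <- memory_usage_mb(64, 1024 * 1024) == memory_usage_mb(16, 4 * 1024 * 1024)

c(small = small, typical = typical, largest = largest)
same_total
```

Collapsing two settings into one measure means the model cannot tell those last two configurations apart, which is the intent here.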

## Building Training and Test Data Sets

We are going to use the `caret` package to split out our data into separate training and test data sets.  This way, we can use the training data set to build a model for our given algorithm, and then our testing data set to give us an idea of how the model will perform on data it has not seen.

In [None]:
set.seed(20191119)
randbackupstats <- backupstats[sample(nrow(backupstats)), ]

trainIndex <- caret::createDataPartition(randbackupstats$SecPerGB, p = 0.7, list = FALSE, times = 1)
train_data <- randbackupstats[trainIndex,]
test_data <- randbackupstats[-trainIndex,]

nrow(train_data)
nrow(test_data)
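For a numeric outcome, `createDataPartition` bins the values by percentile and samples within each bin, so that training and test sets see a similar spread of the outcome.  The base R sketch below approximates that idea on made-up data; it is not caret's implementation, and the five-bin choice and the simulated response are assumptions for illustration.

```r
set.seed(1)
y <- rexp(100, rate = 0.2)  # made-up, skewed response standing in for SecPerGB

# Bin the response into quintiles, then sample 70% of the rows inside each bin.
bins <- cut(y, breaks = quantile(y, probs = seq(0, 1, 0.2)), include.lowest = TRUE)
train_idx <- unlist(lapply(split(seq_along(y), bins), function(rows) {
  rows[sample.int(length(rows), size = round(0.7 * length(rows)))]
}), use.names = FALSE)

length(train_idx)
summary(y[train_idx])   # training rows
summary(y[-train_idx])  # held-out rows
```

Comparing the two summaries is a cheap way to confirm that neither set ended up with only the fast or only the slow backups.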

Let's take a quick look at our training data to make sure that everything turned out alright.

In [None]:
head(train_data)

## Building A Random Forest -- Take 1

I'd first like to try creating a random forest with this input data.  I'm going to create 2000 trees and will include importance information.

In [None]:
model <- randomForest::randomForest(Duration ~ BlockSizeKB + MemoryUsageMB + FileCount + DatabaseSize,
               data = train_data,
               ntree=2000,
               importance=TRUE
           )

Because I included importance information, I can call the `importance` function to see which variables are most effective in describing duration.  By default, this function call returns two measures:  percent increase in mean squared error (`%IncMSE`) and increase in node purity (`IncNodePurity`).  The increase in node purity is a biased measure which we should only use if `%IncMSE` is too expensive to calculate ([source](https://stats.stackexchange.com/questions/162465/in-a-random-forest-is-larger-incmse-better-or-worse)), so we will focus on the MSE changes.

In [None]:
randomForest::importance(model, scale=TRUE)

One of our variables has a negative percent increase in Mean Squared Error value.  This might be a bit weird to think about:  a negative percentage means that shuffling the feature's values did not make predictions any worse, so the feature is not relevant to this model.  I'm a bit surprised that this model thinks file count doesn't have any effect on backup time.

The biggest contender was obviously database size:  larger databases take more time.  After that is memory usage.  Block size is not particularly important.
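The intuition behind `%IncMSE` is permutation:  shuffle one feature, see how much worse the predictions get.  Here is a simplified sketch of that idea on made-up data, using a plain linear model as a stand-in.  The `randomForest` package computes its version on out-of-bag samples per tree, so treat this as an illustration of the concept rather than a reproduction of its numbers.

```r
set.seed(7)
n <- 500
toy <- data.frame(signal = runif(n), noise = runif(n))  # made-up features
toy$y <- 10 * toy$signal + rnorm(n, sd = 0.5)           # only `signal` matters

train <- toy[1:350, ]
test  <- toy[351:500, ]
fit   <- lm(y ~ signal + noise, data = train)           # stand-in model, not a forest

mse <- function(model, data) mean((data$y - predict(model, data))^2)
baseline <- mse(fit, test)

# Percent increase in test MSE after shuffling one column at a time.
pct_inc_mse <- sapply(c("signal", "noise"), function(col) {
  shuffled <- test
  shuffled[[col]] <- sample(shuffled[[col]])
  100 * (mse(fit, shuffled) - baseline) / baseline
})
pct_inc_mse
```

Shuffling `signal` wrecks the predictions, while shuffling `noise` barely moves the error and can even nudge it slightly below the baseline, which is how a negative value arises.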

The next thing I want to look at is the percent of variance explained by the model, which I can see by just calling `model`.

In [None]:
model

Our model explains 90% of the **training** data set's variance.  That's an okay start.  Because I am using a separate test data set, I can compare my model's predictions against reality with the `predict()` function.

In [None]:
modelPred <- predict(model, test_data)

The `modelPred` result needs to be converted to a data frame; after that, we can column bind it to our `test_data` data set to show predictions along with input data.

In [None]:
outcomes <- cbind(test_data, as.data.frame(modelPred))

# Drop block size to keep the output narrow (the raw BlockSize column was already removed earlier).
outcomes$BlockSizeKB <- NULL
outcomes$PredictedSecPerGB <- outcomes$modelPred / outcomes$DatabaseSize

Now let's look at the outcomes.  We'll look at a few sample values, calculate the Root Mean Squared Error, and then plot the residuals.

In [None]:
head(outcomes)

In [None]:
RMSE = function(m, o){
  sqrt(mean((m - o)^2))
}

RMSE(outcomes$Duration, outcomes$modelPred)

In [None]:
options(repr.plot.width=8, repr.plot.height=6)
ggplot(outcomes, aes(x = Duration, y = modelPred - Duration)) +
    geom_point()

## Building a Random Forest -- Take 2

So far, the dominating factor in our model is the database size, and that makes a lot of sense:  it takes more time to back up a larger database than a smaller one.  What would be great is if we could get an idea of backup performance independent of database size.  To do that, we'll change our dependent variable from number of seconds to seconds needed to process one gigabyte of data into a backup.

In [None]:
model2 <- randomForest::randomForest(SecPerGB ~ BlockSizeKB + MemoryUsageMB + FileCount + DatabaseSize,
               data = train_data,
               ntree=2000,
               importance=TRUE
           )

We will try the same thing as before, except instead of predicting duration, we want to predict (inverse) throughput.

In [None]:
randomForest::importance(model2, scale=TRUE)

Now things get interesting:  notice that `FileCount`'s sign has flipped, so every feature is now positive.  The reason is that database size dominated everything else, so having a label which reduces the effect of database size allows the other features to step up.

Let's see how the model scores overall.

In [None]:
model2

Our explained variance has dropped considerably.  It would appear that our features do not explain how quickly we process a database nearly as well as they explain total duration.

In [None]:
modelPred2 <- predict(model2, test_data)

The `modelPred2` result needs to be converted to a data frame; after that, we can column bind it to our `test_data` data set to show predictions along with input data.

In [None]:
outcomes2 <- cbind(test_data, as.data.frame(modelPred2))
outcomes2$PredictedDuration <- outcomes2$modelPred2 * outcomes2$DatabaseSize

# Drop block size to keep the output narrow; it is the only one of these helper columns still present.
outcomes2$BlockSizeKB <- NULL

Now let's look at the outcomes.

In [None]:
head(outcomes2)

In [None]:
RMSE(outcomes2$Duration, outcomes2$PredictedDuration)

In [None]:
options(repr.plot.width=8, repr.plot.height=6)
ggplot(outcomes2, aes(x = Duration, y = PredictedDuration - Duration)) +
    geom_point()

This model, despite having a lower variance explained in training, actually performs **much** better than the first model.  It shows us just how valuable it is to get the right measure for prediction.

## Testing The Boundaries

What I'm going to do next is keep three of my four variables fixed and modify the memory usage to get a better understanding of how the model works.  Remember that our prediction is seconds per GB, so lower numbers are better.

In [None]:
buffer_test <- data.frame(16, c(7, 1, 2, 4, 16, 30, 120, 128, 256, 512, 1024, 2048, 4095, 4096, 20480, 409600, 81920000), 4, 31.73)
names(buffer_test) <- c("BlockSizeKB", "MemoryUsageMB", "FileCount", "DatabaseSize")
buffer_test$prediction <- predict(model2, buffer_test)
buffer_test$seconds <- buffer_test$DatabaseSize * buffer_test$prediction
buffer_test %>% arrange(DatabaseSize, MemoryUsageMB)

SQL Server's default for backups is often 7 buffers and 1MB max transfer size, for a total of 7MB memory usage.  At that level, a backup of the 31.73 GB database used in this test is expected to take about 65 seconds given this model.

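If we bump the buffers up, the predicted improvement maxes out somewhere between 2GB and 4GB of memory.  The largest values we have in the actual dataset are 4GB, so we should not trust a random forest regression above that level:  a random forest cannot extrapolate beyond the range of its training data, so its predictions simply flatten out past that point.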
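Tree-based models predict by dropping each input into a leaf learned from the training data, which is why inputs far outside the training range all get the same answer.  The sketch below shows this with a single `rpart` regression tree standing in for the forest; the data are simulated and the relationship between memory and throughput is invented for illustration.

```r
library(rpart)  # a single regression tree stands in for the forest here

set.seed(3)
# Made-up data: throughput improves with memory up to 4096 MB, the largest value observed.
toy <- data.frame(MemoryUsageMB = rep(2^(0:12), each = 20))
toy$SecPerGB <- 5 - 0.3 * log2(toy$MemoryUsageMB) + rnorm(nrow(toy), sd = 0.1)

tree <- rpart(SecPerGB ~ MemoryUsageMB, data = toy)

# Every input at or beyond the largest training value falls into the same final leaf.
beyond <- data.frame(MemoryUsageMB = c(4096, 20480, 409600, 81920000))
beyond$prediction <- predict(tree, beyond)
beyond
```

The prediction for 80 million MB is identical to the one for 4096 MB, which says nothing about what SQL Server would actually do with that much buffer memory.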

Let's compare these predictions to the actual observations in our full data set and see how they relate.

In [None]:
backupstats %>% 
    filter(FileCount == 4 & DatabaseSize == 31.73) %>%
    inner_join(buffer_test, by = c("MemoryUsageMB" = "MemoryUsageMB")) %>%
    select(DatabaseSize.x, MemoryUsageMB, SecPerGB, Duration, prediction, seconds) %>%
    arrange(MemoryUsageMB)

Now I'd like to see what happens if we fix the file count and block size but let database size grow.  This reinforces the idea that database size is a relevant feature.

In [None]:
data <- backupstats %>% filter(FileCount == 4 & BlockSizeKB == 16) %>% arrange(DatabaseSize)
data

In [None]:
options(repr.plot.width=8, repr.plot.height=6)
ggplot(data, aes(x = DatabaseSize, y = SecPerGB)) +
    geom_point() +
    geom_smooth()

## EV Trees and Genetic Algorithms

Another approach we can take is to use genetic algorithms.  This talk won't get into genetic algorithms directly, though if you are interested, you can review [my talk on the topic](https://csmore.info/on/genetics) and [my blog series on the topic](https://36chambers.wordpress.com/genetics-in-action/).

We will build an evolutionary tree which matches the regression tree from earlier.  Just like our prior demo, we will use the training data and compare against test data.

In [None]:
ev <- evtree(SecPerGB ~ BlockSizeKB + MemoryUsageMB + FileCount + DatabaseSize,
             data = train_data, minbucket = 10, maxdepth = 4)

We can get a visual interpretation of our model using the `plot()` function.

In [None]:
options(repr.plot.width=8, repr.plot.height=6)
plot(ev)

Because the visual interpretation can be a bit tricky, we can also get a treeview version, including some error information.

In [None]:
ev

Let's now build out our predictions and append them to test_data.

In [None]:
test_data$PredSecPerGB <- predict(ev, test_data)

test_data$PredDuration <- test_data$PredSecPerGB * test_data$DatabaseSize

# Drop block size to keep the output narrow; it is the only one of these helper columns still present.
test_data$BlockSizeKB <- NULL

In [None]:
test_data %>%
    select(FileCount, DatabaseSize, MemoryUsageMB, SecPerGB, Duration, PredSecPerGB, PredDuration) %>%
    head()

Not as many obvious hits here, though it does seem like we're missing in both directions so it doesn't appear too biased.  Of course, drawing these sorts of conclusions from the first six results is a terrible idea.

Let's look at the Root Mean Squared Error.  This gives us a measure of how far off we are in the unit of our dependent variable.

In [None]:
RMSE(test_data$Duration, test_data$PredDuration)

Our duration predictions are off by about 11 seconds (as measured by RMSE), and our seconds per GB difference is 0.3.  These numbers are close to our random forest model, so that lends some credence to the evolutionary model.
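The same RMSE calculation works in either unit:  seconds of total duration, or seconds per GB.  Here is a self-contained sketch on three made-up backups; none of these numbers come from the real tests, and the lowercase `rmse` helper is defined here only so the example runs on its own.

```r
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Made-up results for three backups; these are not measurements.
results <- data.frame(
  DatabaseSize = c(10, 50, 136),   # GB
  Duration     = c(30, 120, 400),  # seconds, observed
  PredDuration = c(33, 110, 415)   # seconds, predicted
)
results$SecPerGB     <- results$Duration / results$DatabaseSize
results$PredSecPerGB <- results$PredDuration / results$DatabaseSize

rmse_seconds    <- rmse(results$Duration, results$PredDuration)  # error in seconds
rmse_sec_per_gb <- rmse(results$SecPerGB, results$PredSecPerGB)  # error in seconds per GB

rmse_seconds
rmse_sec_per_gb
```

The duration error is driven mostly by the largest database, while the per-GB error weights each backup more evenly, so it is worth looking at both.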

In [None]:
options(repr.plot.width=8, repr.plot.height=6)
ggplot(test_data, aes(x = Duration, y = PredDuration - Duration)) +
    geom_point()

Still, we can see that there's a somewhat wider spread as well as some strange linearities in our result.  This indicates that we're doing okay, but the evolutionary model might be a little too simplistic and is missing something which could make it a little more accurate.

## Conclusion

There are a few interesting takeaways here.

* If you are looking at estimating backup duration, database size will dominate.
* Once we convert our measure to processing time (seconds per gigabyte of data), database size no longer dominates, though it is interesting that it still remains pertinent.
* Block size is not particularly helpful in any of these models.
* The linear model does a mediocre job of estimating backup performance, telling us that our problem is not linear in nature but we can kind of estimate it as linear with enough data.