## Attempted use of non-trivial imputation

We attempted to use the R package `mice` for data imputation. The package offers a variety of methods for predicting missing values in a dataset; the first one I tried was "predictive mean matching", which is intended for numerical variables. Other methods can also be specified, but we ran into many problems trying to get any of them to work. The errors we observed were mostly due to non-invertibility and to the handling of categorical variables. The IP address, for example, counts as its own categorical variable, so the imputation process generated over 700 dummy variables for it. Not only did the process fail, it also took a long time to reach that point, freezing computers and filling the memory along the way. We tried to fix this by removing the discrete data from the dataset.
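To illustrate the dummy-variable problem, the sketch below uses base R's `model.matrix` to count the columns a high-cardinality factor expands into. The toy data frame, the column names `bytes` and `ip`, and the 700 made-up addresses are assumptions for illustration only. This is not the code `mice` runs internally, just the same kind of expansion.

```r
# Hypothetical toy data: one numeric column plus a factor with 700 levels,
# standing in for an IP address column.
set.seed(1)
n   <- 2000
ips <- sprintf("10.0.%d.%d", (0:699) %/% 256, (0:699) %% 256)  # 700 distinct strings

toy <- data.frame(
  bytes = rnorm(n),
  ip    = factor(sample(ips, n, replace = TRUE), levels = ips)
)

# Expanding the factor with the default treatment contrasts gives one
# dummy column per level except the reference level.
design    <- model.matrix(~ bytes + ip, data = toy)
n_dummies <- ncol(design) - 2  # subtract the intercept and the `bytes` column

cat("Levels in ip:", nlevels(toy$ip), "\n")
cat("Dummy columns created:", n_dummies, "\n")
```

Each additional categorical column adds its own block of dummy columns, which is why dropping or recoding such columns is often the first thing to try.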

Some of these attempts are recorded in this [file](https://github.com/Galeforse/DST-Assessment-02/blob/main/Gabriel%20Grant/Extra%20testing%20on%20MICE.ipynb).

Both `mice` and the `Amelia` package, which did generate some results (found later in this document), perform multiple imputation. The number of imputations is specified within each function (I generally used `m = 5`). This generates several imputed versions of the dataset, from which you can either select one or use `complete` to extract them as completed datasets.
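As a minimal sketch of that workflow, the example below runs `mice` with predictive mean matching on `nhanes`, a small all-numeric example dataset bundled with the package. It is not our network data, and the seed is arbitrary. The number of imputations is set with the `m` argument.

```r
library(mice)

# nhanes is a 25-row example dataset shipped with mice; it stands in for
# our data here because it is small and purely numeric.
imp <- mice(nhanes, m = 5, method = "pmm", seed = 462, printFlag = FALSE)

# Extract a single completed dataset (here the first of the five) ...
first <- complete(imp, 1)

# ... or stack all five completed datasets in long format, with the
# .imp column identifying which imputation each row came from.
long <- complete(imp, action = "long")

head(first)
table(long$.imp)
```

On a dataset this small the call returns almost immediately, which makes it a useful check that the installation works before trying larger data.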

From the package description: "*The MICE algorithm can impute mixes of continuous, binary, unordered categorical and ordered categorical data. In addition, MICE can impute continuous two-level data, and maintain consistency between imputations by means of passive imputation. Many diagnostic plots are implemented to inspect the quality of the imputations.*"

This made me think it would be suitable for our data; however, after many failed attempts I decided to move on to other work in the project.

## Amelia

I also attempted to use the `Amelia` package. This package again uses multiple imputation to predict missing values. The model assumes that the complete data is multivariate normal, a common assumption in missing-data methods but not necessarily true of our data. The algorithm bootstraps the available data, estimates the parameters of that distribution with EM on each bootstrap sample, and then draws values for the missing entries conditional on the values that were observed.
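The idea of drawing a missing value from a normal model conditional on the observed values can be sketched in base R for two variables. This is a deliberate simplification and not Amelia's algorithm: the mean and covariance are estimated once from the complete cases rather than by bootstrapped EM, and the toy variables `x1` and `x2` are made up.

```r
set.seed(462)

# Hypothetical toy data: x2 depends linearly on x1, then 100 values of x2 are removed.
n   <- 500
x1  <- rnorm(n, mean = 10, sd = 2)
x2  <- 3 + 0.5 * x1 + rnorm(n, sd = 1)
toy <- data.frame(x1 = x1, x2 = x2)
toy$x2[sample(n, 100)] <- NA

# Simplification: estimate the normal parameters from complete cases only.
obs   <- complete.cases(toy)
mu    <- colMeans(toy[obs, ])
sigma <- cov(toy[obs, ])

# Conditional distribution of x2 given x1 under bivariate normality.
miss      <- which(is.na(toy$x2))
cond_mean <- mu["x2"] + sigma["x2", "x1"] / sigma["x1", "x1"] * (toy$x1[miss] - mu["x1"])
cond_var  <- sigma["x2", "x2"] - sigma["x2", "x1"]^2 / sigma["x1", "x1"]

# One imputation: draw each missing x2 from its conditional distribution.
imputed <- toy
imputed$x2[miss] <- rnorm(length(miss), mean = cond_mean, sd = sqrt(cond_var))

summary(imputed$x2)
```

Repeating the final draw several times would give several imputed datasets, which is the "multiple" part of multiple imputation.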

Again, I could only get the function to work after removing all categorical data, and the predictions did not look particularly good for any variable other than `logduration`. What follows is the general structure of what was attempted on the testing data that we are using.

In [1]:
library(lubridate)
library(dlookr)
library(dplyr)
library(mice)
library(VIM)
library(Amelia)
library(Zelig)



In [2]:
temp2 <- tempfile()
start <- proc.time()
download.file("https://github.com/Galeforse/DST-Assessment-02/raw/main/Data/test_missing1.csv.gz",temp2)
data <- read.csv(gzfile(temp2))
print("Data imported in:")
print(seconds_to_period((proc.time()-start)[3]))

[1] "Data imported in:"
[1] "3.53S"


In [3]:
dat <- data %>%
    mutate(
        ts = as.numeric(ts),
        orig_ip = as.factor(orig_ip),
        resp_ip = as.factor(resp_ip),
        orig_port = as.numeric(orig_port),
        resp_port = as.numeric(resp_port),
        proto = as.factor(proto),
        conn_state = as.factor(conn_state),
        history = as.factor(history),
        duration = as.numeric(duration),
        orig_bytes = as.numeric(orig_bytes),
        resp_bytes = as.numeric(resp_bytes),
        missed_bytes = as.numeric(missed_bytes),
        orig_pkts = as.numeric(orig_pkts),
        resp_pkts = as.numeric(resp_pkts),
        orig_ip_bytes = as.numeric(orig_ip_bytes),
        resp_ip_bytes = as.numeric(resp_ip_bytes)
    )
str(dat)

'data.frame':	194980 obs. of  17 variables:
 $ ts           : num  1.33e+09 1.33e+09 1.33e+09 1.33e+09 1.33e+09 ...
 $ orig_ip      : Factor w/ 200 levels "::","0.0.0.0",..: 15 15 21 106 21 81 106 21 21 81 ...
 $ orig_port    : num  2633 4094 16066 42997 38566 ...
 $ resp_ip      : Factor w/ 2913 levels "10.10.0.7","10.10.10.10",..: 1382 1523 2327 1269 1187 1187 533 953 2327 1186 ...
 $ resp_port    : num  80 80 12486 28745 32754 ...
 $ proto        : Factor w/ 3 levels "icmp","tcp","udp": 2 2 2 2 2 2 2 2 2 2 ...
 $ service      : chr  "http" "" "" "" ...
 $ duration     : num  NA 0.01 NA NA NA NA NA NA NA 0.01 ...
 $ orig_bytes   : num  NA 7085 NA NA NA ...
 $ resp_bytes   : num  NA 172 NA NA NA NA NA NA NA 0 ...
 $ conn_state   : Factor w/ 13 levels "OTH","REJ","RSTO",..: 3 3 2 2 2 2 2 2 2 2 ...
 $ missed_bytes : num  0 0 0 0 0 0 0 0 0 0 ...
 $ history      : Factor w/ 251 levels "-","a","A","Aa",..: 86 86 251 251 251 251 251 251 251 251 ...
 $ orig_pkts    : num  8 10 1 1 1 1 1 1 1 

In [4]:
dat2 <- subset(dat, select = -c(service,proto,conn_state,orig_ip,resp_ip,history,missed_bytes))
str(dat2)

'data.frame':	194980 obs. of  10 variables:
 $ ts           : num  1.33e+09 1.33e+09 1.33e+09 1.33e+09 1.33e+09 ...
 $ orig_port    : num  2633 4094 16066 42997 38566 ...
 $ resp_port    : num  80 80 12486 28745 32754 ...
 $ duration     : num  NA 0.01 NA NA NA NA NA NA NA 0.01 ...
 $ orig_bytes   : num  NA 7085 NA NA NA ...
 $ resp_bytes   : num  NA 172 NA NA NA NA NA NA NA 0 ...
 $ orig_pkts    : num  8 10 1 1 1 1 1 1 1 1 ...
 $ orig_ip_bytes: num  813 7497 48 60 48 ...
 $ resp_pkts    : num  9 9 1 1 1 1 1 1 1 1 ...
 $ resp_ip_bytes: num  8505 544 40 40 40 ...


In [5]:
# Add a log-transformed copy of duration (missing values stay NA)
dat2$logduration <- log(dat2$duration)
head(dat2)

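| | ts | orig_port | resp_port | duration | orig_bytes | resp_bytes | orig_pkts | orig_ip_bytes | resp_pkts | resp_ip_bytes | logduration |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 1 | 1331915797 | 2633 | 80 | NA | NA | NA | 8 | 813 | 9 | 8505 | NA |
| 2 | 1331921224 | 4094 | 80 | 0.01 | 7085.0 | 172.0 | 10 | 7497 | 9 | 544 | -4.60517 |
| 3 | 1331903910 | 16066 | 12486 | NA | NA | NA | 1 | 48 | 1 | 40 | NA |
| 4 | 1331988939 | 42997 | 28745 | NA | NA | NA | 1 | 60 | 1 | 40 | NA |
| 5 | 1331918792 | 38566 | 32754 | NA | NA | NA | 1 | 48 | 1 | 40 | NA |
| 6 | 1331901863 | 63805 | 45078 | NA | NA | NA | 1 | 44 | 1 | 40 | NA |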


In [6]:
set.seed(462)
# m = 5 imputations; p2s = 0 suppresses screen output
completed_data <- amelia(dat2, m = 5, p2s = 0)

In [7]:
write.amelia(obj=completed_data,file.stem="G:\\Users\\Gabriel\\Documents\\Education\\UoB\\GitHubDesktop\\DST-Assessment-02\\Data\\amelia_predict_")

The above function writes our 5 Amelia outputs to 5 separate CSV files, one for each imputation, which we can access using the usual methods. I tried to find a way to combine these into a single dataset, but after much exploration and many failed attempts I ran out of time to do so. The analysis will therefore look at the results of just one imputation, selected at random.
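For reference, one simple way to collapse several all-numeric imputations into a single dataset is to average them cell by cell. The sketch below does this in base R on a toy list of three small data frames. With Amelia, the list would be `completed_data$imputations`; the toy values here are made up. Averaging is a shortcut rather than the recommended practice: it discards the variation between imputations, whereas the usual approach is to run the analysis on each imputed dataset and pool the results.

```r
# Toy stand-in for a list of imputed data frames with identical shape.
# Observed cells agree across imputations; imputed cells differ.
imps <- list(
  data.frame(a = c(1, 2, 3), b = c(10, 20, 30)),
  data.frame(a = c(1, 4, 3), b = c(10, 20, 36)),
  data.frame(a = c(1, 6, 3), b = c(10, 20, 33))
)

# Sum the data frames element-wise, then divide by the number of imputations.
combined <- Reduce(`+`, imps) / length(imps)

print(combined)
```

Cells that were observed in the original data are identical in every imputation, so averaging leaves them unchanged and only smooths the imputed cells.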

A better implementation of non-trivial models follows in the next document.

### References

[About Amelia pdf](https://cran.r-project.org/web/packages/Amelia/vignettes/amelia.pdf)

[Amelia R Documentation](https://www.rdocumentation.org/packages/Amelia/versions/1.7.6)

[MICE R Documentation](https://www.rdocumentation.org/packages/mice/versions/3.12.0/topics/mice)

[Data Science article about imputing data with MICE](https://datascienceplus.com/imputing-missing-data-with-r-mice-package/)

[Medium article about usage with MICE](https://medium.com/coinmonks/dealing-with-missing-data-using-r-3ae428da2d17)

[Using Amelia](https://www.linkedin.com/pulse/amelia-packager-missing-data-imputation-ramprakash-veluchamy)