# Week 10 HW
## Question 14.1
The breast cancer data set breast-cancer-wisconsin.data.txt has missing values.
1. Use the mean/mode imputation method to impute values for the missing data.
2. Use regression to impute values for the missing data.
3. Use regression with perturbation to impute values for the missing data.
4. (Optional) Compare the results and quality of classification models (e.g., SVM, KNN) built using
(1) the data sets from questions 1,2,3;
(2) the data that remains after data points with missing values are removed; and
(3) the data set when a binary variable is introduced to indicate missing values.

### 14.1 Part 1
In the first part of question 14.1, I use the mean/mode imputation method to impute values for the missing data. In the first step below, I did some basic setup and viewed the data we're working with. According to the explanation on the website hosting this dataset, the missing values are all in the column bare_nuclei, where they are coded as '?'. I renamed all of the columns to match that explanation and then stored the rows with missing values, of which there are 16.

In [12]:
# Load necessary libraries
suppressWarnings(library(tidyr))
suppressWarnings(library(dplyr))

# Read and view the data
# as_tibble() replaces tbl_df(), which is deprecated in current dplyr
df <- as_tibble(read.table('breast-cancer-wisconsin.data.txt', sep=","))
head(df)

V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11
1000025,5,1,1,1,2,1,3,1,1,2
1002945,5,4,4,5,7,10,3,2,1,2
1015425,3,1,1,1,2,2,3,1,1,2
1016277,6,8,8,1,3,4,3,7,1,2
1017023,4,1,1,3,2,1,3,1,1,2
1017122,8,10,10,8,7,10,9,7,1,4


In [13]:
# Rename columns from data explanation
df <- df %>%
    rename(code_number = V1,
    clump_thickness = V2,
    cell_size_uniformity = V3,
    cell_shape_uniformity = V4,
    marginal_adhesion = V5,
    single_epithelial_cell_size = V6,
    bare_nuclei = V7,
    bland_chromatin = V8,
    normal_nucleoli = V9,
    mitosis = V10,
    class = V11)

# Get missing value count
missing_df <- df %>%
    filter(bare_nuclei == '?')

nrow(missing_df)
head(missing_df)

code_number,clump_thickness,cell_size_uniformity,cell_shape_uniformity,marginal_adhesion,single_epithelial_cell_size,bare_nuclei,bland_chromatin,normal_nucleoli,mitosis,class
1057013,8,4,5,1,2,?,7,3,1,4
1096800,6,6,6,9,6,?,7,8,1,2
1183246,1,1,1,1,1,?,2,1,1,2
1184840,1,1,3,1,2,?,2,1,1,2
1193683,1,1,2,1,3,?,1,1,1,2
1197510,5,1,1,1,2,?,3,1,1,2


In the next step, I implement the mean/mode imputation method. This method takes the mean (or, for categorical data, the mode) of the non-missing values in a column and substitutes it for every missing value in that column. The code blocks directly below carry out this process using the mean.

In [19]:
# Capture non-missing data
bare_nuclei_mean <- df %>%
    filter(bare_nuclei != '?') %>%
    select(bare_nuclei) 

# Calculate and store mean of non-missing data
bare_nuclei_mean <- mean(as.integer(bare_nuclei_mean$bare_nuclei))

# Print captured mean
bare_nuclei_mean
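
One thing worth checking at this step is how `as.integer()` treats the column, because that depends on how `read.table` stored `bare_nuclei`. In R versions before 4.0, a column containing text such as `?` is read as a factor by default, and `as.integer()` on a factor returns the internal level codes rather than the printed digits. The sketch below uses a small made-up vector (the `x_chr` and `x_fct` values are hypothetical, not the homework data) to show the difference and the usual `as.character()` workaround.

```r
# Toy values mimicking bare_nuclei (hypothetical, not the real data)
x_chr <- c("1", "10", "2", "?", "5")
x_fct <- factor(x_chr)

# Character input: digits are parsed, "?" becomes NA (with a warning)
from_chr <- suppressWarnings(as.integer(x_chr))

# Factor input: as.integer() returns level codes, not the digits
from_fct <- as.integer(x_fct)

# Going through as.character() first recovers the real values
fixed <- suppressWarnings(as.integer(as.character(x_fct)))

print(from_chr)
print(from_fct)
print(fixed)
```

If `class(df$bare_nuclei)` reports `"factor"`, wrapping the column in `as.character()` before `as.integer()` is the safer conversion; if it reports `"character"`, the direct conversion already parses the digits.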

In [36]:
# Copy original data set for modification
imputed_df_mean <- df

# Convert bare nuclei to string
imputed_df_mean$bare_nuclei <- as.character(imputed_df_mean$bare_nuclei)

# Replace missing values with mean calculated above
imputed_df_mean$bare_nuclei[imputed_df_mean$bare_nuclei == '?'] <- as.character(bare_nuclei_mean)
# Note: as.integer() truncates the fractional part of the imputed mean
imputed_df_mean$bare_nuclei <- as.integer(imputed_df_mean$bare_nuclei)

# Confirm that data has been replaced
# (the column is now integer, so any leftover missing value would be NA, not '?')
sum(is.na(imputed_df_mean$bare_nuclei))
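
The cells above fill the gaps with the mean. Because bare_nuclei only takes whole-number values, the mode is a natural alternative. The following is a minimal sketch of mode imputation on a made-up vector; `stat_mode` is a hypothetical helper written for this illustration (base R has no built-in statistical mode function), and the values in `x` are not from the homework data.

```r
# Toy column with missing entries (hypothetical values)
x <- c(1L, 1L, 10L, NA, 2L, 1L, NA, 10L)

# Mode = most frequent observed value; ties resolve to the smallest value
stat_mode <- function(v) {
  counts <- table(v)  # table() drops NA by default
  as.integer(names(counts)[which.max(counts)])
}

# Mode imputation
x_mode <- x
x_mode[is.na(x_mode)] <- stat_mode(x)

# Mean imputation for comparison, rounded to stay on the integer scale
x_mean <- x
x_mean[is.na(x_mean)] <- as.integer(round(mean(x, na.rm = TRUE)))

print(x_mode)
print(x_mean)
```

With skewed data like this toy vector, the mode and the rounded mean can differ noticeably, which is one reason to consider both.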

### 14.1 Part 2
In part 2 of question 14.1, I am going to impute the missing values using regression. From the lecture notes, this method will likely be more accurate than imputation through a mean, which I will analyze in part 4. To build the model, I will use bare_nuclei as the outcome variable, and the remaining columns as predictors. The training data set I will use includes data where bare_nuclei is not missing, and the test set will be the dataset with missing values.

In [29]:
# Create training data set
training <- df %>%
    filter(bare_nuclei != '?')

# Create test data set, split by predictors and outcome
test_predictors <- missing_df %>%
    select(-bare_nuclei)
test_outcome <- missing_df %>%
    select(bare_nuclei)

# Print data lengths
nrow(training)
nrow(test_predictors)
nrow(test_outcome)

With my data split into training and test sets, the next code block trains and evaluates a basic linear regression model using all of the available columns as predictors. You can see from the model output that the R-squared value is 0.2792. To reduce the risk of overfitting, I then built a second model based on the p-values from model 1. The only significant predictors are marginal_adhesion, normal_nucleoli, and class, so I built model 2 with these variables.

In [30]:
set.seed(1)
# Train model
lm_model <- lm(as.integer(bare_nuclei) ~ ., data = training)
summary(lm_model)


Call:
lm(formula = as.integer(bare_nuclei) ~ ., data = training)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.8154 -0.5699 -0.4230 -0.2522  7.4785 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  3.336e-01  3.210e-01   1.039 0.299052    
code_number                 -2.602e-08  1.142e-07  -0.228 0.819796    
clump_thickness             -2.289e-02  3.645e-02  -0.628 0.530190    
cell_size_uniformity         4.789e-02  6.195e-02   0.773 0.439790    
cell_shape_uniformity        4.326e-02  6.027e-02   0.718 0.473185    
marginal_adhesion           -1.282e-01  3.796e-02  -3.378 0.000771 ***
single_epithelial_cell_size  1.255e-02  5.081e-02   0.247 0.805011    
bland_chromatin             -2.836e-02  4.903e-02  -0.578 0.563202    
normal_nucleoli              7.875e-02  3.651e-02   2.157 0.031379 *  
mitosis                      6.559e-03  4.800e-02   0.137 0.891367    
class                        1.077e+00  1.64

In [32]:
lm_model2 <- lm(as.integer(bare_nuclei) ~ marginal_adhesion + 
                normal_nucleoli + 
                class, 
                data = training)
summary(lm_model2)


Call:
lm(formula = as.integer(bare_nuclei) ~ marginal_adhesion + normal_nucleoli + 
    class, data = training)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.6202 -0.5246 -0.4280 -0.3161  7.9230 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        0.12015    0.25120   0.478  0.63259    
marginal_adhesion -0.11190    0.03533  -3.167  0.00161 ** 
normal_nucleoli    0.09656    0.03375   2.861  0.00436 ** 
class              1.16159    0.12163   9.551  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.836 on 679 degrees of freedom
Multiple R-squared:  0.2751,	Adjusted R-squared:  0.2719 
F-statistic: 85.89 on 3 and 679 DF,  p-value: < 2.2e-16


You can see from the output of model 2 that the R-squared value is 0.2751, which is only about 0.004 lower than that of the model that used every field as a predictor. Model 2 is considerably less complex with a similar R-squared, so I am going to use model 2 to impute the missing values. In the code block below, I use the trained model 2 to predict the missing values in the test data set created earlier.

In [35]:
# Use trained model to predict missing values of bare_nuclei in test data set
preds <- predict(lm_model2, test_predictors)

# Add predictions column to the test dataset
test_predictors$bare_nuclei <- preds
head(test_predictors)

code_number,clump_thickness,cell_size_uniformity,cell_shape_uniformity,marginal_adhesion,single_epithelial_cell_size,bland_chromatin,normal_nucleoli,mitosis,class,bare_nuclei
1057013,8,4,5,1,2,7,3,1,4,4.944304
1096800,6,6,6,9,6,7,8,1,2,2.208742
1183246,1,1,1,1,1,2,1,1,2,2.427998
1184840,1,1,3,1,2,2,1,1,2,2.427998
1193683,1,1,2,1,3,1,1,1,2,2.427998
1197510,5,1,1,1,2,3,1,1,2,2.427998


In [42]:
# Add test + predictions back into the original data set
training$bare_nuclei <- as.integer(training$bare_nuclei)
imputed_df_regression <- training %>%
    bind_rows(test_predictors)

# Test to make sure no missing values in this dataset
# (bare_nuclei is numeric here, so a missing value would show up as NA, not '?')
sum(is.na(imputed_df_regression$bare_nuclei))
nrow(imputed_df_regression)

While the R-squared value is relatively low at 0.2751, we don't really know the strength of this model in context. This could actually be a strong R-squared, depending on how many other factors introduce variability into the data set. We will see in part 4 whether this method produces better values than the other imputation methods.
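
One way to put an imputation model "in context" is to hide some values that are actually known, impute them, and compare the imputations with the truth. The sketch below does this on simulated data; every name and number in it (`toy`, `hidden`, the coefficients, the seed) is an assumption made for illustration, and it is not a result from the breast cancer data.

```r
set.seed(7)  # arbitrary seed so the sketch is repeatable

# Simulated data: y depends linearly on x plus noise (hypothetical)
n <- 200
toy <- data.frame(x = runif(n, 1, 10))
toy$y <- 1 + 0.8 * toy$x + rnorm(n, sd = 1.5)

# Hide 20 known y values to act as pseudo-missing data
hidden <- sample(n, 20)
truth <- toy$y[hidden]
observed <- toy[-hidden, ]

# Candidate 1: mean imputation
mean_imp <- rep(mean(observed$y), length(hidden))

# Candidate 2: regression imputation
fit <- lm(y ~ x, data = observed)
reg_imp <- predict(fit, newdata = toy[hidden, , drop = FALSE])

# Root mean squared error against the hidden truth
rmse <- function(a, b) sqrt(mean((a - b)^2))
rmse_mean <- rmse(mean_imp, truth)
rmse_reg <- rmse(reg_imp, truth)
print(c(mean = rmse_mean, regression = rmse_reg))
```

The same idea could be applied to the rows where bare_nuclei is observed, which would give a direct error estimate for each imputation method instead of relying on R-squared alone.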

### 14.1 Part 3
In part 3 of question 14.1, we are instructed to use regression with perturbation to impute the missing data points. This method is similar to the one used in part 2 in that regression predicts the missing values; however, it restores some of the variability in the data by adding normally distributed random noise to each prediction. From the lectures, this may or may not produce more accurate predictions, but it better reflects the spread of real data, because plain regression imputation places every imputed value exactly on the fitted line. In the steps below, I use the mice package to execute regression with perturbation.
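
To make the idea concrete, here is a minimal base-R sketch of regression with perturbation on simulated data. It is only an illustration of the general technique, not the code used in this homework: the data frame `toy`, the seed, and the choice to scale the noise by the residual standard error are all assumptions.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable

# Toy data: y depends on x, and a few y values are missing (hypothetical)
toy <- data.frame(x = 1:20)
toy$y <- 2 + 0.5 * toy$x + rnorm(20, sd = 1)
toy$y[c(3, 11, 17)] <- NA

# Fit on the complete rows only (lm drops rows with NA by default)
fit <- lm(y ~ x, data = toy)
miss <- is.na(toy$y)

# Plain regression imputation: the fitted values
point_pred <- predict(fit, newdata = toy[miss, , drop = FALSE])

# Perturbation: add normal noise scaled by the residual standard error
perturbed <- point_pred + rnorm(sum(miss), mean = 0, sd = sigma(fit))

# Fill the gaps with the perturbed predictions
toy$y_imputed <- toy$y
toy$y_imputed[miss] <- perturbed

print(cbind(point_pred, perturbed))
```

Each run with a different seed gives different imputed values, which is the intended behavior: the imputations scatter around the regression line instead of sitting on it.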

In [47]:
# Import mice library
library(mice)

perturb_df <- df
# Mice requires NA values vs. ? to work
perturb_df$bare_nuclei[perturb_df$bare_nuclei == '?'] <- NA

# Convert bare_nuclei to integer
perturb_df$bare_nuclei <- as.integer(perturb_df$bare_nuclei)

# Build model with perturbation
perturb_model <- mice(perturb_df, method = "norm.nob", m = 1)

# fill in missing values with predictions
imputed_df_perturb <- complete(perturb_model)

# print number of NAs - should be 0
imputed_df_perturb %>% filter(is.na(bare_nuclei)) %>% nrow()
nrow(imputed_df_perturb)


 iter imp variable
  1   1  bare_nuclei
  2   1  bare_nuclei
  3   1  bare_nuclei
  4   1  bare_nuclei
  5   1  bare_nuclei


In the code block above, I built the imputation model with the mice function. I used the method "norm.nob", which the package describes as linear regression ignoring model error: it adds random normal noise to each prediction but does not account for uncertainty in the estimated regression coefficients. I also set m = 1, the number of multiple imputations, so only one completed data set is generated. The mice package has a built-in function, complete, which returns the data set with the imputed values filled in. The two checks at the end of the block confirm that there are no missing values after running the code and that the data set is the same length as the original.

### 14.1 Part 4
In part 4 of question 14.1, I am going to take the output from the 3 methods of missing data imputation and use each as input to a support vector machine. I am going to test each of the data sets to see which one yields the highest prediction accuracy and is thus, by this measure, the best method of imputation. In the first steps below, I loaded the kernlab library (which provides ksvm), made sure the data types were correct, and then split all of the data into training and test sets.

In [62]:
# Load library
library(kernlab)

# Fix data types
imputed_df_mean$bare_nuclei <- as.integer(imputed_df_mean$bare_nuclei)
imputed_df_regression$bare_nuclei <- as.integer(imputed_df_regression$bare_nuclei)
imputed_df_perturb$bare_nuclei <- as.integer(imputed_df_perturb$bare_nuclei)

# Create value for dividing data into train and test
sample_size = floor(0.75*nrow(df))
set.seed(123) # Set seed

# Randomly identifies the rows equal to sample size
train_ind = sample(seq_len(nrow(df)),size = sample_size)

# Creates the training datasets with row numbers stored in train_ind
train_mean = imputed_df_mean[train_ind,]
train_regression = imputed_df_regression[train_ind,]
train_perturb = imputed_df_perturb[train_ind,]

# Creates the test datasets excluding the row numbers mentioned in train_ind
test_mean = imputed_df_mean[-train_ind,]
test_regression = imputed_df_regression[-train_ind,]
test_perturb = imputed_df_perturb[-train_ind,]

After splitting all of the data sets, I ran each of them through a function that trains a support vector machine for a range of values of the cost parameter C and then reports the highest prediction accuracy reached on the test set. As can be seen in the output below, the method of missing value replacement with the highest predictive accuracy was regression imputation (the second printed line), with an accuracy of about 0.97.

In [69]:
# KSVM function to evaluate multiple values of the cost parameter C,
# and output highest prediction accuracy
set.seed(23)
ksvm_impute_accuracy <- function(train, test) {
    
    # Create df to store prediction accuracies for each C value
    calibration_df <- data.frame(c_value = double(),
                                prediction_accuracy = double())
    
    # Loop through C values
    for (i in 1:15) {
        # Create test model
        trained_model <- ksvm(as.matrix(train[,1:10]), as.matrix(train[,11]),
                            type = 'C-svc', C = 10**i, scale = TRUE)
        
        # Use trained model to output predictions
        test_preds <- predict(trained_model, as.matrix(test[,1:10]))
        
        # Evaluate predictions
        test_accuracy <- sum(test_preds == test[,11]) / nrow(test)
        
        # Store accuracy results for given c value
        calibration_df <- rbind(calibration_df, c(10**i, test_accuracy))
    }
    
    # Rename df cols
    names(calibration_df) <- c('c_value', 'prediction_accuracy')
    
    # Output top prediction accuracy
    print(max(calibration_df$prediction_accuracy))
}

ksvm_impute_accuracy(train_mean, test_mean)
ksvm_impute_accuracy(train_regression, test_regression)
ksvm_impute_accuracy(train_perturb, test_perturb)



[1] 0.9485714
[1] 0.9714286
[1] 0.9428571


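As you can see from the output of the KSVM models, regression imputation produced the most accurate model, with a final prediction accuracy of 0.971. Mean imputation came in second with 0.949, and regression with perturbation a close third with 0.943. On a test set of 175 rows, these gaps amount to only four or five observations, and each figure is the best accuracy over all C values as measured on the test set itself, so the ranking should be read with caution. In addition, imputed_df_regression was rebuilt with bind_rows, which moved the 16 imputed rows to the end, so the same train_ind indices select different observations for it than for the other two data sets. All three seem to be relatively good methods of replacing missing values, with regression performing the best in this run.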
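
The optional question also lists two further data sets that are not built above: one with the incomplete rows removed, and one with a binary variable that indicates missing values. A minimal sketch of how those two could be constructed is shown below on a made-up data frame. The column name `bare_nuclei_missing` and the placeholder value 0 are assumptions chosen for illustration, and the values in `toy` are not from the homework data.

```r
# Toy frame with a '?'-coded missing value column (hypothetical values)
toy <- data.frame(
  bare_nuclei = c("1", "?", "10", "3", "?"),
  class       = c(2L, 2L, 4L, 2L, 4L),
  stringsAsFactors = FALSE
)

# Parse to integer; '?' becomes NA
toy$bare_nuclei <- suppressWarnings(as.integer(toy$bare_nuclei))

# (2) Removal: keep only the rows where bare_nuclei is observed
toy_removed <- toy[!is.na(toy$bare_nuclei), ]

# (3) Missing indicator: flag the gap, then fill it with a placeholder of 0
toy_flagged <- toy
toy_flagged$bare_nuclei_missing <- as.integer(is.na(toy$bare_nuclei))
toy_flagged$bare_nuclei[is.na(toy_flagged$bare_nuclei)] <- 0L

print(toy_removed)
print(toy_flagged)
```

Either data frame could then be split and passed to the same accuracy function used for the three imputed data sets, as long as the class column stays in the position that function expects.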

## Question 15.1
Describe a situation or problem from your job, everyday life, current events, etc., for which optimization
would be appropriate. What data would you need? 

A situation from one of my other classes in the OMSA program for which optimization is appropriate is my group project in CSE 6242. We are looking at a sentiment analysis of every presidential speech going back to George Washington, and trying to use those metrics as predictors for approval ratings. Approval ratings weren't widely collected until 1940 with Franklin D. Roosevelt, so I would want to use imputation to back-fill the missing data for presidents before 1940. Admittedly, this is a lot of data to fill in, so it would make sense to try a few different approaches. First, I would build the model with the data I have, looking only at data from 1940 to the present. Next, I would try all three methods of imputation across the entire data set. Finally, I would try to group presidents by similar sentiment scores and then use the average rating within each group for imputation.
