A Survey on Technology Choice
======
## Hypothesis
My response variable is **Issue backlog** (PG5_2BNUI). I expect the priority respondents give this variable to be affected by the survey question *In how many software projects have you been involved?* (PG9Resp). Developers who have been on more projects have probably used a larger variety of packages, so they are likely to have more insight into which packages they prefer and which ones did not work well for their project(s). Thus, I think survey takers who have been involved in more projects will place **Issue backlog** at a higher priority than others. PG8Resp (Type of developer) might also play a role, because Software Developers may have much more exposure to version control platforms like GitHub, while Data Scientists might not interact with them as much.


In [124]:
# For nicer printing
options(digits=2);

## Data Extraction
We begin by reading the raw data, converting the timestamps to Unix seconds, and calculating the time differences between consecutive survey pages. We then look at a summary of the dataset.


In [125]:
# Read in the data
data <- read.csv("TechSurvey - Survey.csv",header=T);

# Convert the timestamp columns to Unix seconds
for (i in c("Start", "End")) 
    data[,i] = as.numeric(as.POSIXct(strptime(data[,i], "%Y-%m-%d %H:%M:%S")))
for (i in 0:12){
    vnam = paste(c("PG",i,"Submit"), collapse="")
    data[,vnam] = as.numeric(as.POSIXct(strptime(data[,vnam], "%Y-%m-%d %H:%M:%S")))
}
# Replace each submit time with the time spent on that page.
# Iterate from the last page backwards so the previous column is still a raw timestamp.
for (i in 12:0){
    pv = paste(c("PG",i-1,"Submit"), collapse="");
    if (i==0) 
        pv="Start";   # page 0 is measured from the survey start
    vnam = paste(c("PG",i,"Submit"), collapse="");
    data[,vnam] = data[,vnam] -data[,pv];
}
# Now explore the variables
summary(data);

     Device    Completed       Start               End               PG0Dis   
        :  2   0    :  2   Min.   :1.54e+09   Min.   :1.54e+09   Min.   :  0  
 Bot    :  1   FALSE:546   1st Qu.:1.54e+09   1st Qu.:1.54e+09   1st Qu.:  0  
 PC     :955   TRUE :805   Median :1.54e+09   Median :1.54e+09   Median :  1  
 Phone  :376               Mean   :1.54e+09   Mean   :1.54e+09   Mean   : 44  
 Tablet : 16               3rd Qu.:1.54e+09   3rd Qu.:1.54e+09   3rd Qu.: 24  
 Unknown:  3               Max.   :1.54e+09   Max.   :1.54e+09   Max.   :168  
                           NA's   :2          NA's   :548        NA's   :73   
    PG0Shown      PG0Submit                                     PG1PsnUse  
 Min.   :   0   Min.   :    2   For personal work and/or research use:727  
 1st Qu.:   0   1st Qu.:    6                                        :613  
 Median : 102   Median :    9   Chapter book                         :  1  
 Mean   : 249   Mean   :  299   dissertation research           
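
The timestamp handling above can be illustrated on a tiny example. This is only a sketch: the `toy` data frame and its timestamps are made up, and the time zone is fixed to UTC (an assumption not made in the cell above) so the result is reproducible. It keeps the parsed timestamps in a separate data frame, so the order of the subtractions does not matter.

```r
# Toy data: hypothetical timestamps, not taken from the survey file
toy <- data.frame(
  Start     = c("2018-10-01 12:00:00", "2018-10-01 13:00:00"),
  PG0Submit = c("2018-10-01 12:00:05", "2018-10-01 13:00:09"),
  PG1Submit = c("2018-10-01 12:00:30", NA),
  stringsAsFactors = FALSE
)

# Parse every column to Unix seconds (UTC is an assumption for reproducibility)
secs <- as.data.frame(lapply(toy, function(x)
  as.numeric(as.POSIXct(x, format = "%Y-%m-%d %H:%M:%S", tz = "UTC"))))

# Time spent on each page = submit time minus the previous timestamp
durations <- data.frame(
  PG0Submit = secs$PG0Submit - secs$Start,
  PG1Submit = secs$PG1Submit - secs$PG0Submit
)
print(durations)
```

A respondent who never submitted a page ends up with `NA` for that page's duration, which is why the summary above reports `NA's` for several columns.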

## Interpret and Clean Raw Data pt1
Since our response is the ranking of one item on page PG5, we drop all other PG5 columns because they would heavily bias our model, keeping only the response variable (PG5_2BNUI). This is the first step of cleaning our data.

In [126]:
# Find the column index of our response variable (PG5_2BNUI)
cols = names(data)
print(cols)
pg5_2bnui = which(cols == "PG5_2BNUI")

# Get the indexes of all other PG5 columns
pg5 = grep("PG5", cols)
remove_indexes = setdiff(pg5, pg5_2bnui)

# Remove all other PG5 columns
data = data[,-remove_indexes]
summary(data)

 [1] "Device"        "Completed"     "Start"         "End"          
 [5] "PG0Dis"        "PG0Shown"      "PG0Submit"     "PG1PsnUse"    
 [9] "PG1WdAuth"     "PG1Trn"        "PG1Other"      "PG1Submit"    
[13] "PG2Resp"       "PG2Submit"     "PG2Resp.1"     "PG3Submit"    
[17] "PG4Dtr0_6"     "PG4Psv7_8"     "PG4Prm9_10"    "PG4AllResp"   
[21] "PG4Submit"     "PG5_1RRPQ"     "PG5_1Order"    "PG5_1Time"    
[25] "PG5_2BNUI"     "PG5_2Order"    "PG5_2Time"     "PG5_3HDS"     
[29] "PG5_3Order"    "PG5_3Time"     "PG5_4VGP"      "PG5_4Order"   
[33] "PG5_4Time"     "PG5_5PHR"      "PG5_5Order"    "PG5_5Time"    
[37] "PG5_6SSYOP"    "PG5_6Order"    "PG5_6Time"     "PG5_7NDYP"    
[41] "PG5_7Order"    "PG5_7Time"     "PG5_8CP"       "PG5_8Order"   
[45] "PG5_8Time"     "PG5_9FRP"      "PG5_9Order"    "PG5_9Time"    
[49] "PG5_10RPA"     "PG5_10Order"   "PG5_10Time"    "PG5_11NSG"    
[53] "PG5_11Order"   "PG5_11Time"    "PG5_12NWG"     "PG5_12Order"  
[57] "PG5_12Time"    "PG5_13NFG"  

     Device    Completed       Start               End               PG0Dis   
        :  2   0    :  2   Min.   :1.54e+09   Min.   :1.54e+09   Min.   :  0  
 Bot    :  1   FALSE:546   1st Qu.:1.54e+09   1st Qu.:1.54e+09   1st Qu.:  0  
 PC     :955   TRUE :805   Median :1.54e+09   Median :1.54e+09   Median :  1  
 Phone  :376               Mean   :1.54e+09   Mean   :1.54e+09   Mean   : 44  
 Tablet : 16               3rd Qu.:1.54e+09   3rd Qu.:1.54e+09   3rd Qu.: 24  
 Unknown:  3               Max.   :1.54e+09   Max.   :1.54e+09   Max.   :168  
                           NA's   :2          NA's   :548        NA's   :73   
    PG0Shown      PG0Submit                                     PG1PsnUse  
 Min.   :   0   Min.   :    2   For personal work and/or research use:727  
 1st Qu.:   0   1st Qu.:    6                                        :613  
 Median : 102   Median :    9   Chapter book                         :  1  
 Mean   : 249   Mean   :  299   dissertation research           
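
The same column filtering can also be done by name rather than by index, which avoids depending on column positions. A minimal sketch on a made-up data frame (the `toy` columns only mimic the survey's naming scheme):

```r
# Hypothetical miniature of the survey columns
toy <- data.frame(Device = 1:3, PG5_1RRPQ = 1:3, PG5_2BNUI = 3:1,
                  PG5_2Order = 1:3, PG9Resp = 4:6)

response <- "PG5_2BNUI"

# All columns whose name starts with PG5, except the response itself
drop <- setdiff(grep("^PG5", names(toy), value = TRUE), response)

# Keep everything that is not in the drop list
kept <- toy[, !(names(toy) %in% drop), drop = FALSE]
print(names(kept))
```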

## Interpret and Clean Raw Data pt2
We begin by looking at all the numeric fields remaining in our data set and computing their pairwise Spearman correlations. However, our response variable and the predictors we want to use do not appear here because they are not numeric fields. Let's work on the data some more to get better insight.

In [127]:
#get numeric fields only for correlation
sel = c()
for (i in 1:dim(data)[2]) if (is.numeric(data[,i])) sel = c(sel, i);


cor(data[,sel],method="spearman",use="pairwise.complete.obs"); #OK for any: uses ranks

Unnamed: 0,Start,End,PG0Dis,PG0Shown,PG0Submit,PG1Submit,PG2Submit,PG3Submit,PG4Dtr0_6,PG4Psv7_8,PG4Prm9_10,PG4AllResp,PG4Submit,PG6Submit,PG7Submit,PG8Submit,PG9Submit,PG10Submit,PG11Submit,PG12Submit
Start,1.0,0.9952,-0.0417,-0.11507,0.135,0.1156,0.0791,0.0384,0.0121,0.00371,-0.027,0.0063,0.019,0.0054,0.0776,0.044,0.04101,0.047,0.079,0.075
End,0.9952,1.0,-0.0415,-0.09879,0.114,0.155,0.0791,0.0511,-0.05185,-0.04576,-0.027,-0.0158,0.017,0.0051,0.0759,0.044,0.04071,0.052,0.079,0.077
PG0Dis,-0.0417,-0.0415,1.0,0.8722,0.015,0.0065,0.0041,0.0567,0.16368,0.02668,-0.0092,0.0018,-0.054,0.0277,0.0097,0.035,0.00995,-0.029,-0.045,0.055
PG0Shown,-0.1151,-0.0988,0.8722,1.0,0.036,0.0205,0.0023,0.0497,0.08226,0.00036,0.033,-0.0209,-0.06,0.0401,0.0121,0.026,0.00056,-0.045,-0.071,0.044
PG0Submit,0.135,0.1142,0.0153,0.03596,1.0,0.1088,0.1037,0.1273,-0.00802,-0.03763,-0.094,-0.0236,0.219,0.1518,0.1365,0.126,0.17579,0.225,0.11,0.11
PG1Submit,0.1156,0.155,0.0065,0.02047,0.109,1.0,0.1452,0.2688,-0.06852,0.05661,0.012,0.0297,0.165,0.2414,0.1133,0.107,0.10895,0.17,0.074,0.114
PG2Submit,0.0791,0.0791,0.0041,0.00235,0.104,0.1452,1.0,0.2045,0.00146,0.00897,-0.059,0.0293,0.152,0.2696,0.1245,0.157,0.20127,0.099,0.11,0.107
PG3Submit,0.0384,0.0511,0.0567,0.04968,0.127,0.2688,0.2045,1.0,0.00865,0.04424,-0.0062,-0.0193,0.196,0.2706,0.1316,0.182,0.2745,0.161,0.14,0.164
PG4Dtr0_6,0.0121,-0.0518,0.1637,0.08226,-0.008,-0.0685,0.0015,0.0087,1.0,,,1.0,-0.143,-0.1618,0.156,0.07,-0.07292,0.044,0.00084,-0.027
PG4Psv7_8,0.0037,-0.0458,0.0267,0.00036,-0.038,0.0566,0.009,0.0442,,1.0,,1.0,-0.083,-0.0146,-0.0363,0.053,0.05977,0.069,-0.049,-0.022
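
The comment "uses ranks" in the cell above refers to the fact that the Spearman correlation is the Pearson correlation computed on ranks, which makes it insensitive to extreme values such as very long page times. A small sketch with made-up vectors:

```r
x <- c(1, 2, 3, 4, 100)   # one extreme value
y <- c(2, 4, 5, 9, 10)    # increases whenever x increases

spearman         <- cor(x, y, method = "spearman")
pearson_on_ranks <- cor(rank(x), rank(y))
pearson_raw      <- cor(x, y)

# The first two agree; the raw Pearson value is pulled down by the outlier
print(c(spearman = spearman, pearson_on_ranks = pearson_on_ranks,
        pearson_raw = pearson_raw))
```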


## Interpret and Clean Raw Data pt3
Since some of our fields are non-numeric, let's transform our dataset so that every field is numeric. We use is.factor() to find the categorical columns and as.numeric() to replace each answer with its factor level code, then fill missing values with the column mean. Using this transformed data set, let's look at the highly correlated (> 0.7) fields using the function described in Dr. Mockus' lecture.

In [128]:
# Transform our non-numeric data into numeric (factor level codes)
for (i in 1:dim(data)[2]) if (is.factor(data[,i])) data[,i] = as.numeric(data[,i]);
# Replace missing values with the column mean
for(i in 1:ncol(data)){
  data[is.na(data[,i]), i] <- mean(data[,i], na.rm = TRUE)
}
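
One caveat of converting factors with as.numeric() is worth illustrating. The codes follow the sorted order of the level labels, and a blank answer is a level of its own rather than a missing value, so the mean imputation above does not touch it. The sketch below uses made-up answer labels (the real survey labels are not shown in this notebook) and contrasts the default codes with an explicitly ordered set of levels.

```r
# Hypothetical answers to a "how many projects" style question
resp <- factor(c("6-10", "1-5", "More than 10", "1-5", ""))

# Default conversion: codes follow the sorted level labels, blank included
codes_default <- as.numeric(resp)
print(levels(resp))
print(codes_default)

# Explicit levels: the order is chosen by hand and blanks become NA
lv <- c("1-5", "6-10", "More than 10")
codes_explicit <- as.integer(factor(as.character(resp), levels = lv))
print(codes_explicit)
```

Note also that since R 4.0.0, read.csv() no longer creates factors by default, so on a recent R version the is.factor() check above only finds columns if the file is read with `stringsAsFactors = TRUE`.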

In [129]:
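# Define a function - same as in the fdacStats.ipynb lecture
hiCor <- function(x, level){
  res <- cor(x,method="spearman", use='pairwise.complete.obs');
  # res1 holds absolute correlations with the diagonal zeroed out
  res1 <- res; res1[res<0] <- -res[res < 0];
  for (i in 1:dim(x)[2]){
    res1[i,i] <- 0;
  }
  # keep variables whose strongest correlation with any other variable exceeds level
  sel <- apply(res1,1,max) > level;
  res[sel,sel];
}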
hiCor(data,.7)

Unnamed: 0,Completed,PG0Dis,PG0Shown,PG2Resp.1,PG6Resp,PG6Submit,PG7R,PG8Resp,PG9Resp,PG9Submit,PG10Resp,PG10Submit,PG11Resp,PG12Resp
Completed,1.0,-0.01554,0.012,0.73,0.875,-0.78,0.8714,0.8598,0.86,-0.82428,0.8783,-0.7599,0.9081,0.85
PG0Dis,-0.016,1.0,0.85,0.025,0.011,0.02,-0.0084,-0.0149,-0.007,-0.00026,0.0028,-0.0075,-0.0196,-0.015
PG0Shown,0.012,0.85391,1.0,0.041,0.02,0.01,0.0138,0.0059,0.021,-0.02597,0.0128,-0.0319,0.0082,6e-06
PG2Resp.1,0.73,0.02535,0.041,1.0,0.65,-0.56,0.6511,0.6349,0.657,-0.60038,0.656,-0.5578,0.6794,0.64
PG6Resp,0.875,0.01114,0.02,0.65,1.0,-0.71,0.7755,0.7746,0.733,-0.71933,0.7836,-0.64,0.7888,0.74
PG6Submit,-0.777,0.02013,0.01,-0.564,-0.708,1.0,-0.6908,-0.6688,-0.67,0.76868,-0.6331,0.7137,-0.6992,-0.62
PG7R,0.871,-0.00842,0.014,0.651,0.775,-0.69,1.0,0.7457,0.763,-0.74757,0.7657,-0.6867,0.8036,0.75
PG8Resp,0.86,-0.01494,0.0059,0.635,0.775,-0.67,0.7457,1.0,0.742,-0.70798,0.7733,-0.6259,0.805,0.74
PG9Resp,0.86,-0.00695,0.021,0.657,0.733,-0.67,0.7632,0.7417,1.0,-0.71107,0.7419,-0.6789,0.8093,0.77
PG9Submit,-0.824,-0.00026,-0.026,-0.6,-0.719,0.77,-0.7476,-0.708,-0.711,1.0,-0.6864,0.7278,-0.7546,-0.71
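
To make the behaviour of this filter concrete, here is a compact variant applied to a made-up data frame. The name `hi_cor` and the toy vectors are illustrative only; the logic follows the same steps as the function above, written with abs() and diag().

```r
# Compact variant of the filter: keep variables strongly correlated with another one
hi_cor <- function(x, level) {
  res <- cor(x, method = "spearman", use = "pairwise.complete.obs")
  strength <- abs(res)
  diag(strength) <- 0                       # ignore self-correlation
  sel <- apply(strength, 1, max) > level
  res[sel, sel, drop = FALSE]
}

# Toy data: b is almost the same ordering as a, z is unrelated
toy <- data.frame(
  a = 1:10,
  b = c(2, 1, 3, 4, 5, 6, 7, 8, 10, 9),
  z = c(5, 1, 9, 3, 7, 2, 10, 4, 8, 6)
)
out <- hi_cor(toy, 0.7)
print(round(out, 2))
```

Only `a` and `b` survive the 0.7 threshold, because `z` is not strongly correlated with either of them.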


## Correlation Analysis
To get a better look at the correlation among all of the remaining predictors/survey questions, let's regress each one on the others and look at the resulting R^2 values. Note that the code reports the plain `r.squared`, not the adjusted value. I use the same method described in the fdacStats.ipynb example.

In [130]:
# Regress each predictor on the remaining predictors
# and eliminate those with the highest R^2
res <- c();
vnam <- names(data);
for (i in 2:dim(data)[2]){
  fmla <- as.formula(paste(vnam[i],paste(vnam[-c(1,i)],collapse="+"),sep="~"));
  res <- rbind(res,c(i,round(summary(lm(fmla,data=data))$r.squared,2)));
}
row.names(res) <- vnam[res[,1]];
res[order(-res[,2]),];

0,1,2
Completed,2,0.93
PG11Resp,40,0.85
PG7R,25,0.8
PG12Resp,42,0.7
PG9Resp,36,0.69
PG6Resp,23,0.63
Start,3,0.57
End,4,0.56
PG8Resp,34,0.54
PG10Resp,38,0.5
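
The idea behind this table is that a column which is well explained by the other columns carries little extra information. A small sketch on simulated data (all names and numbers below are made up) shows the pattern and also reports the adjusted R^2 next to the plain one for comparison:

```r
set.seed(7)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)   # nearly a copy of x1
x3 <- rnorm(n)                  # unrelated to the others
toy <- data.frame(x1, x2, x3)

# Regress each column on all the others and collect both R^2 variants
r2 <- sapply(names(toy), function(v) {
  fit <- lm(reformulate(setdiff(names(toy), v), response = v), data = toy)
  s <- summary(fit)
  c(r.squared = s$r.squared, adj.r.squared = s$adj.r.squared)
})
print(round(t(r2), 2))
```

Here `x1` and `x2` explain each other almost completely, while `x3` does not; the adjusted value is never larger than the plain one.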


## Correlation Analysis cont.
We drop all of the columns with an R^2 value >= 0.85, which means dropping Completed and PG11Resp. Now that we have our desired predictors (all of the remaining columns), let's fit our model. Note that this measure might not be the best because it is noisy: each regression uses every other column in the survey. However, we can use it as insight when fitting our model.

In [131]:
data = data[, -which(names(data) %in% c("Completed", "PG11Resp"))]
summary(data)

     Device        Start               End               PG0Dis   
 Min.   :1.0   Min.   :1.54e+09   Min.   :1.54e+09   Min.   :  0  
 1st Qu.:3.0   1st Qu.:1.54e+09   1st Qu.:1.54e+09   1st Qu.:  0  
 Median :3.0   Median :1.54e+09   Median :1.54e+09   Median :  1  
 Mean   :3.3   Mean   :1.54e+09   Mean   :1.54e+09   Mean   : 44  
 3rd Qu.:4.0   3rd Qu.:1.54e+09   3rd Qu.:1.54e+09   3rd Qu.: 44  
 Max.   :6.0   Max.   :1.54e+09   Max.   :1.54e+09   Max.   :168  
    PG0Shown      PG0Submit       PG1PsnUse      PG1WdAuth       PG1Trn    
 Min.   :   0   Min.   :    2   Min.   : 1.0   Min.   :1.0   Min.   :1.00  
 1st Qu.:   0   1st Qu.:    6   1st Qu.: 1.0   1st Qu.:1.0   1st Qu.:1.00  
 Median : 130   Median :   10   Median : 4.0   Median :1.0   Median :1.00  
 Mean   : 249   Mean   :  299   Mean   : 2.7   Mean   :1.3   Mean   :1.14  
 3rd Qu.: 399   3rd Qu.:   28   3rd Qu.: 4.0   3rd Qu.:1.0   3rd Qu.:1.00  
 Max.   :1190   Max.   :76226   Max.   :15.0   Max.   :5.0   Max.   :3.00  

## Fitting the Model and Coefficient Analysis
Since I have little experience with machine learning and models, I followed the fdacStats lecture when doing my modeling and used a **generalized linear model** (with glm's default Gaussian family, which is equivalent to an ordinary linear regression). For this first run, I simply used every remaining column in the dataset as a predictor. The results showed that a majority of the coefficients did not have a significant effect on my response variable. Because this fit is quite noisy, I fit a second model with fewer predictors.

In [132]:
mod <- glm(PG5_2BNUI ~ .,data=data);
summary(mod);


Call:
glm(formula = PG5_2BNUI ~ ., data = data)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-3.232  -1.508  -0.112   0.916   4.984  

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    1.74e+02   3.46e+02    0.50  0.61535    
Device        -1.04e-01   9.32e-02   -1.11  0.26601    
Start         -8.27e-08   2.45e-07   -0.34  0.73616    
End           -2.85e-08   3.28e-07   -0.09  0.93067    
PG0Dis        -3.21e-04   9.34e-04   -0.34  0.73139    
PG0Shown       1.26e-04   2.07e-04    0.61  0.54421    
PG0Submit     -1.22e-05   1.32e-05   -0.92  0.35557    
PG1PsnUse     -1.80e-02   3.26e-02   -0.55  0.58099    
PG1WdAuth      1.14e-01   6.67e-02    1.71  0.08775 .  
PG1Trn        -8.58e-02   1.51e-01   -0.57  0.57039    
PG1Other       1.69e-03   6.27e-03    0.27  0.78748    
PG1Submit     -3.10e-04   2.04e-04   -1.52  0.12929    
PG2Resp        7.15e-02   5.69e-02    1.26  0.20872    
PG2Submit      6.69e-04   7.64e-04    0.88  0.38
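
Since glm() is called without a `family` argument, it uses the Gaussian family with an identity link, so the fit matches an ordinary least squares regression. A quick check on simulated data (the `toy` data frame is made up for illustration):

```r
set.seed(3)
toy <- data.frame(x = rnorm(40))
toy$y <- 1 + 2 * toy$x + rnorm(40)

fit_glm <- glm(y ~ x, data = toy)   # default family: gaussian
fit_lm  <- lm(y ~ x, data = toy)

print(family(fit_glm)$family)
print(all.equal(coef(fit_glm), coef(fit_lm)))
```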

## Modeling and Coefficient Analysis cont.
From the first model above, which used all columns as predictors, I took the coefficients with a p-value < 0.1 and used only those as predictors for this model. The model is still a generalized linear model, just with specific predictors this time. The second model has a lower AIC, so it is the one I used for the final results.

In [133]:
mod2 <- glm(PG5_2BNUI ~ PG1WdAuth+PG4AllResp+PG6Resp+PG6Submit+PG7R+PG8Resp+PG9Resp+PG9Submit,data=data)
summary(mod2);


Call:
glm(formula = PG5_2BNUI ~ PG1WdAuth + PG4AllResp + PG6Resp + 
    PG6Submit + PG7R + PG8Resp + PG9Resp + PG9Submit, data = data)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-2.985  -1.548  -0.136   0.850   4.924  

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.94e-01   2.56e-01    1.54  0.12357    
PG1WdAuth    1.23e-01   6.43e-02    1.91  0.05615 .  
PG4AllResp   4.44e-02   2.77e-02    1.60  0.10966    
PG6Resp      1.33e-01   3.06e-02    4.35  1.5e-05 ***
PG6Submit   -3.81e-04   2.26e-04   -1.68  0.09229 .  
PG7R         4.20e-02   7.43e-03    5.65  2.0e-08 ***
PG8Resp      3.32e-03   9.92e-04    3.34  0.00085 ***
PG9Resp      7.25e-02   2.73e-02    2.65  0.00810 ** 
PG9Submit    1.53e-04   8.41e-05    1.82  0.06889 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 2.9)

    Null deviance: 5166.2  on 1352  degrees of freedom
Residual deviance: 385
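
The selection step described above (keep predictors below a p-value cutoff, refit, compare AIC) can be written out programmatically. This is a sketch on simulated data, not a rerun of the survey models; the `toy` data frame and the 0.1 cutoff are assumptions for illustration.

```r
set.seed(11)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 2 * toy$x1 + rnorm(n)      # only x1 actually drives y

# First pass: every column as a predictor
full <- glm(y ~ ., data = toy)

# Keep the predictors whose p-value is below the cutoff (intercept excluded)
p    <- summary(full)$coefficients[-1, "Pr(>|t|)"]
keep <- names(p)[p < 0.1]

# Second pass: refit with the selected predictors only and compare AIC
reduced <- glm(reformulate(keep, response = "y"), data = toy)
print(keep)
print(AIC(full, reduced))
```

Choosing predictors by p-value and then judging the refit on the same data tends to look optimistic, so the AIC comparison is best read as a rough guide.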

## Results
The results of the second model show that my hypothesis is not supported: PG9Resp was significant (p = 0.0081), but it was not the most significant predictor of my response variable. PG6Resp (development experience) was one of the most significant predictors, along with PG7R and PG8Resp. To me, these other predictors make a lot of sense and do not seem arbitrary. The more development experience someone has, the more likely they have been exposed to a variety of packages and libraries, which is similar to the premise behind my hypothesis about PG9Resp (number of projects involved with). However, I only used a linear model, which might not be the best choice for a categorical survey like this, and I could have seen different results with another model such as a random forest.