A Survey on Technology Choice
======


In [3]:
# For nicer printing
options(digits=2);

library(dplyr)
library(tidyr)
library(caret)
library(ggplot2)
library(datasets)
library(MASS)

In [4]:
# Read in the data
data <- read.csv("TechSurvey - Survey.csv",header=T);

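# Convert the Start/End timestamps to Unix seconds
for (i in c("Start", "End")) 
    data[,i] = as.numeric(as.POSIXct(strptime(data[,i], "%Y-%m-%d %H:%M:%S")))
# Do the same for each page's submit timestamp (PG0Submit .. PG12Submit)
for (i in 0:12){
    vnam = paste(c("PG",i,"Submit"), collapse="")
    data[,vnam] = as.numeric(as.POSIXct(strptime(data[,vnam], "%Y-%m-%d %H:%M:%S")))
}
# Replace each submit timestamp with the seconds spent on that page.
# Iterating from the last page backwards keeps the previous page's
# timestamp intact until it has been used.
for (i in 12:0){
    pv = paste(c("PG",i-1,"Submit"), collapse="");
    if (i==0) 
        pv="Start";
    vnam = paste(c("PG",i,"Submit"), collapse="");
    data[,vnam] = data[,vnam] -data[,pv];
}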

In [5]:
#now explore variables
summary(data);

     Device    Completed       Start               End               PG0Dis   
        :  2   0    :  2   Min.   :1.54e+09   Min.   :1.54e+09   Min.   :  0  
 Bot    :  1   FALSE:546   1st Qu.:1.54e+09   1st Qu.:1.54e+09   1st Qu.:  0  
 PC     :955   TRUE :805   Median :1.54e+09   Median :1.54e+09   Median :  1  
 Phone  :376               Mean   :1.54e+09   Mean   :1.54e+09   Mean   : 44  
 Tablet : 16               3rd Qu.:1.54e+09   3rd Qu.:1.54e+09   3rd Qu.: 24  
 Unknown:  3               Max.   :1.54e+09   Max.   :1.54e+09   Max.   :168  
                           NA's   :2          NA's   :548        NA's   :73   
    PG0Shown      PG0Submit    
 Min.   :   0   Min.   :    2  
 1st Qu.:   0   1st Qu.:    6  
 Median : 102   Median :    9  
 Mean   : 249   Mean   :  299  
 3rd Qu.: 428   3rd Qu.:   15  
 Max.   :1190   Max.   :76226  
 NA's   :73     NA's   :199    
                                       PG1PsnUse  
 For personal work and/or research use      :727  
          

### Interpret basic summaries

In [6]:
#get numeric fields only for correlation
sel = c()
for (i in 1:dim(data)[2]) if (is.numeric(data[,i])) sel = c(sel, i);


cor(data[,sel],method="spearman",use="pairwise.complete.obs"); #OK for any: uses ranks

Unnamed: 0,Start,End,PG0Dis,PG0Shown,PG0Submit,PG1Submit,PG2Submit,PG3Submit,PG4Dtr0_6,PG4Psv7_8,...,PG5_12Order,PG5_13Order,PG5Submit,PG6Submit,PG7Submit,PG8Submit,PG9Submit,PG10Submit,PG11Submit,PG12Submit
Start,1.0,0.9952,-0.0417,-0.11507,0.135,0.1156,0.0791,0.0384,0.0121,0.00371,...,-0.0369,0.0598,0.08512,0.0054,0.0776,0.0441,0.04101,0.047,0.079,0.0746
End,0.9952,1.0,-0.0415,-0.09879,0.1142,0.155,0.0791,0.0511,-0.05185,-0.04576,...,-0.0359,0.0661,0.09088,0.0051,0.0759,0.0435,0.04071,0.052,0.079,0.0772
PG0Dis,-0.0417,-0.0415,1.0,0.8722,0.0153,0.0065,0.0041,0.0567,0.16368,0.02668,...,0.0151,0.0384,0.00601,0.0277,0.0097,0.0354,0.00995,-0.029,-0.045,0.0546
PG0Shown,-0.1151,-0.0988,0.8722,1.0,0.036,0.0205,0.0023,0.0497,0.08226,0.00036,...,0.0074,0.0407,-0.00888,0.0401,0.0121,0.0264,0.00056,-0.045,-0.071,0.0436
PG0Submit,0.135,0.1142,0.0153,0.03596,1.0,0.1088,0.1037,0.1273,-0.00802,-0.03763,...,-0.0161,-0.028,0.17671,0.1518,0.1365,0.1258,0.17579,0.225,0.11,0.1096
PG1Submit,0.1156,0.155,0.0065,0.02047,0.1088,1.0,0.1452,0.2688,-0.06852,0.05661,...,0.0512,-0.0651,0.2467,0.2414,0.1133,0.1069,0.10895,0.17,0.074,0.1137
PG2Submit,0.0791,0.0791,0.0041,0.00235,0.1037,0.1452,1.0,0.2045,0.00146,0.00897,...,0.021,-0.0047,0.21851,0.2696,0.1245,0.1567,0.20127,0.099,0.11,0.1073
PG3Submit,0.0384,0.0511,0.0567,0.04968,0.1273,0.2688,0.2045,1.0,0.00865,0.04424,...,0.0464,-0.0222,0.26048,0.2706,0.1316,0.1822,0.2745,0.161,0.14,0.1642
PG4Dtr0_6,0.0121,-0.0518,0.1637,0.08226,-0.008,-0.0685,0.0015,0.0087,1.0,,...,0.1774,-0.1289,-0.05214,-0.1618,0.156,0.0695,-0.07292,0.044,0.00084,-0.0272
PG4Psv7_8,0.0037,-0.0458,0.0267,0.00036,-0.0376,0.0566,0.009,0.0442,,1.0,...,-0.0008,-0.0218,0.08974,-0.0146,-0.0363,0.0526,0.05977,0.069,-0.049,-0.0217


Interpreting the correlations: the strongest correlation is between Start and End (0.9952), which is uninformative on its own, so calculate their difference instead.
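One way to follow that suggestion is to replace the two timestamps with a single duration column before computing correlations. The snippet below is a minimal sketch on a toy data frame: the values are made up and the column name `Duration` is an assumption for illustration; only `Start`, `End` and `PG0Submit` are names taken from the survey data.

```r
# Toy stand-in for the survey data (values are made up for illustration)
toy <- data.frame(
  Start     = c(1540000000, 1540000100, 1540000200, 1540000300),
  End       = c(1540000225, 1540000400, 1540000951, 1540000420),
  PG0Submit = c(9, 6, 15, 2)
)

# Start and End are both Unix seconds, so their difference is the
# total time spent on the survey, in seconds
toy$Duration <- toy$End - toy$Start

# Drop the raw timestamps and correlate the remaining numeric columns
toy_num <- toy[, !(names(toy) %in% c("Start", "End"))]
dur_cor <- cor(toy_num, method = "spearman", use = "pairwise.complete.obs")
print(dur_cor)
```

Dropping `Start` and `End` after deriving the duration avoids a near-perfect correlation that only reflects when the survey was taken.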


### Simple questions

- Time to take entire survey?
- Question that took the longest to complete?
- Question that took the least time?
- Top-ranked criteria?
- Demographic distribution by age?


# Time to take entire survey?

In [7]:
data_complete = filter(data, Completed == TRUE)
survey_time = data_complete$End - data_complete$Start
# Every row is a respondent who opened the survey; data_complete holds those who finished it
print(paste0("Opened Survey: ", length(data$End)))
print(paste0("Completed Survey: ", length(data_complete$End)))


[1] "Opened Survey: 1353"
[1] "Completed Survey: 805"


In [8]:
print(paste0("Max Survey: ", max(survey_time), ' Seconds'))
print(paste0("Min Survey: ", min(survey_time), ' Seconds'))
print(paste0("Average Survey: ", round(mean(survey_time)), ' Seconds'))
print(paste0("Median Survey: ", median(survey_time), ' Seconds'))

[1] "Max Survey: 87551 Seconds"
[1] "Min Survey: 51 Seconds"
[1] "Average Survey: 680 Seconds"
[1] "Median Survey: 225 Seconds"


# Question that took the longest to complete?

In [39]:
# Mean time spent on each page (PG0 .. PG12), ignoring missing values
submit_cols <- paste0("PG", 0:12, "Submit")
means <- sapply(data[, submit_cols], mean, na.rm = TRUE)

# Report the page with the largest mean instead of hard-coding its name
slowest <- sub("Submit", "", names(means)[which.max(means)])
most_time = format(round(max(means), 2), nsmall = 2)
print(paste0(slowest, " took the longest: " ,most_time, " Seconds"))


[1] "PG0 took the longest: 255.32 Seconds"


# Question that took the least time?

In [38]:
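# means is ordered PG0 .. PG12, so position k corresponds to page k - 1
fastest <- paste0("PG", which.min(means) - 1)
least_time = format(round(min(means), 2), nsmall = 2)
print(paste0(fastest, " took the shortest: " ,least_time, " Seconds"))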

[1] "PG11 took the shortest: 3.79 Seconds"


# Top-ranked criteria?

In [10]:
crit_responses = dplyr::select(data_complete, starts_with('PG5'), -ends_with('Time'), -ends_with('Order'), -ends_with('Submit')) 
print(summary(crit_responses))

           PG5_1RRPQ             PG5_2BNUI              PG5_3HDS  
                :335                  :381                  :225  
 Essential      : 59   Essential      :  3   Essential      :103  
 High Priority  :102   High Priority  : 25   High Priority  :199  
 Low Priority   : 85   Low Priority   :120   Low Priority   : 68  
 Medium Priority:130   Medium Priority: 91   Medium Priority:161  
 Not a Priority : 94   Not a Priority :185   Not a Priority : 49  
            PG5_4VGP              PG5_5PHR             PG5_6SSYOP 
                :311                  :213                  :310  
 Essential      : 22   Essential      : 79   Essential      : 61  
 High Priority  :109   High Priority  :247   High Priority  :136  
 Low Priority   : 87   Low Priority   : 63   Low Priority   : 83  
 Medium Priority:162   Medium Priority:160   Medium Priority:109  
 Not a Priority :114   Not a Priority : 43   Not a Priority :106  
           PG5_7NDYP              PG5_8CP               PG5_9F

# Demographic distribution by age?

In [19]:
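# MASS is loaded after dplyr and masks select(), so the unqualified call
# failed with "unused argument (PG12Resp)"; qualify it as dplyr::select()
age <- dplyr::select(data_complete, PG12Resp) %>%
    filter(PG12Resp != '')

# factor() re-derives the levels, dropping the now-unused '' level
ageNum <- factor(age$PG12Resp)

print(summary(ageNum))
barplot(table(ageNum), main = 'Age Distribution')

pie(table(ageNum), main = 'Distribution by Age')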


## Hypothesis

My project focused on the Package's Historic Reputation criterion. I think a few factors might play into how it is rated:

- Your age (PG12Resp)
- Your primary programming language (PG6Resp)
- Number of years developing (PG6Resp)


People who took the survey were between 25 and 44 years old. 

In [40]:
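# Coerce every column to numeric. For factor columns this yields the integer
# level codes (levels sorted alphabetically), not a meaningful scale.
for(i in colnames(data))
{
    data[,i] <- as.numeric (data[,i]);
}

# Missing values become 0 here, which can inflate correlations between
# columns that are missing together (e.g. for respondents who quit early)
data <- replace(data, is.na(data), 0)
sel = c() 
for (i in 1:dim(data)[2]) if (is.numeric(data[,i])) sel = c(sel, i);
cor(data[,sel], data[,'PG6Resp'], method="spearman",use="pairwise.complete.obs"); #OK for any: uses ranks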

0,1
Device,-0.078
Completed,0.875
Start,-0.015
End,0.777
PG0Dis,0.021
PG0Shown,0.026
PG0Submit,0.337
PG1PsnUse,0.345
PG1WdAuth,0.082
PG1Trn,0.196


# Calculations

Using the data from the CSV, we compared each respondent's age, the number of years they have been a developer, and their primary language. We then checked whether these fields are positively correlated with one another, looking for a trend that would support the hypothesis.
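For a rank correlation on the priority answers to be meaningful, the labels need to be encoded in their intended order, because `as.numeric()` on a plain factor returns codes for alphabetically sorted levels. The sketch below assumes the five priority labels seen in the `PG5` summaries and assumes that `PG5_5PHR` is the Package's Historic Reputation item; the toy values and the `years` column are hypothetical.

```r
# Priority labels in their intended order, lowest to highest
priority_levels <- c("Not a Priority", "Low Priority", "Medium Priority",
                     "High Priority", "Essential")

# Toy responses (made up); "" marks an unanswered question
toy <- data.frame(
  PG5_5PHR = c("Essential", "Low Priority", "", "High Priority",
               "Not a Priority", "Medium Priority"),
  years    = c(12, 3, 5, 9, 1, 6),
  stringsAsFactors = FALSE
)

# Labels not listed in `levels` (here "") become NA instead of a score
phr_score <- as.integer(factor(toy$PG5_5PHR, levels = priority_levels,
                               ordered = TRUE))

# Spearman correlation on the ordinal scores, skipping missing answers
rho <- cor(phr_score, toy$years, method = "spearman",
           use = "pairwise.complete.obs")
print(rho)
```

Leaving unanswered questions as `NA` lets `use = "pairwise.complete.obs"` skip them rather than treating them as the lowest score.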

# Proposed Evaluation and Cleaning Data

There are a few different ways to approach this problem. One thing we need to watch for is bad data and outliers, so we clean the data first to make sure we are comparing the intended values.
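As one possible way to flag outliers, the sketch below applies the common 1.5 x IQR rule to completion times. The rule and the toy values are assumptions chosen for illustration, not the cleaning step used in this notebook.

```r
# Toy completion times in seconds (made up; one extreme value)
times <- c(51, 120, 180, 225, 240, 300, 410, 87551)

# Quartiles and interquartile range
q <- quantile(times, c(0.25, 0.75), names = FALSE)
iqr <- q[2] - q[1]

# Keep values within 1.5 * IQR of the quartiles
keep <- times >= q[1] - 1.5 * iqr & times <= q[2] + 1.5 * iqr
clean_times <- times[keep]

print(mean(times))        # pulled up by the extreme value
print(mean(clean_times))  # mean after dropping flagged values
print(median(times))      # the median is barely affected by the extreme value
```

Reporting the median alongside the mean is a cheaper alternative when dropping rows is not desirable.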

One of the first steps was to remove the responses to question 5, because we were unable to use them in our models. It was also necessary to remove the start and end times, since we had already used them to compute the average survey time.

# Correlation Statement

Because I removed only a limited amount of data, the remaining low-correlation fields still affected the outcome of the model. Something interesting to try in the future would be to strip out more of the low-correlation fields and see whether the model becomes more accurate.
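A minimal sketch of that idea is to keep only the predictors whose absolute Spearman correlation with the target reaches a cutoff. The cutoff of 0.2, the toy data, and the column names `target`, `strong` and `weak` are all hypothetical.

```r
# Toy data (made up): one predictor tracks the target, one does not
toy <- data.frame(
  target = c(1, 2, 3, 4, 5, 6),
  strong = c(2, 1, 4, 3, 6, 5),
  weak   = c(3, 5, 1, 6, 2, 4)
)

# Spearman correlation of each predictor with the target
predictors <- setdiff(names(toy), "target")
rho <- sapply(toy[predictors], function(x)
  cor(x, toy$target, method = "spearman", use = "pairwise.complete.obs"))

# Keep predictors at or above the (assumed) cutoff
cutoff <- 0.2
kept <- names(rho)[abs(rho) >= cutoff]
reduced <- toy[, c("target", kept), drop = FALSE]

print(round(rho, 3))
print(kept)
```

Selecting predictors on the same data that is later used to score the model can overstate its accuracy, so a held-out split would be a sensible companion to this filter.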

# Results

Looking through the results, we can see that the adjusted R^2 is very low, which means the model explains little of the variation in the data we were evaluating.
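For reference, adjusted R^2 penalises R^2 for the number of predictors: 1 - (1 - R^2)(n - 1)/(n - p - 1), with n observations and p predictors. The sketch below fits a linear model on R's built-in `mtcars` data (a stand-in, not the survey data) and checks that formula against the value reported by `summary()`.

```r
# Fit a small linear model on a built-in dataset (not the survey data)
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit)

n <- nrow(mtcars)           # number of observations
p <- length(coef(fit)) - 1  # number of predictors, excluding the intercept

# Adjusted R^2 computed by hand from the plain R^2
adj_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)

print(c(r2 = s$r.squared, adj_r2 = s$adj.r.squared, manual = adj_manual))
```

Because of the penalty term, adding weakly related predictors can lower adjusted R^2 even when plain R^2 rises slightly.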

# Looking to the future

There is a lot of data here. I think one of the biggest obstacles was getting through all of it and finding the correct fields. Once we knew which fields we needed and what they meant, the data became much easier to manipulate. I think this project would be very interesting to do in Python, but that might just be my bias towards the language!